Gender of White House Visitors

Last summer, as part of my internship working with these awesome people at MSR, I spent a lot of time playing with public data sources. One fascinating dataset that I chose as a benchmark (for what is currently known as Tempe at MSR) is the White House Visitor records, which (as of last July) had over 3 million records of visitors to the White House during the Obama administration.

This dataset has been in the news before, and is (in my opinion) a great example of public disclosure that we should be pushing for in government. A whole other conversation of course is how/when such records should be released, and by whom. The White House Visitor dataset is also known to be incomplete, censoring records for national or personal security reasons, and maybe other reasons too.

Here is just one question I came up with: Do more men or women visit the White House? My guess was that a majority of visitors would be men.

To make this slightly more interesting, I also posted a very simple survey last summer that asked people to guess if more men or women visited the White House. The survey itself was hastily done (read: poorly done) but nearly 300 people kindly responded with their guesses. The distribution of answers looked like this:
Survey results, from 297 participants.
Basically, the most common guess agreed with mine: around 60% men.

First challenge: How does one assign gender, when all you're given is a name?
I've described how I do this before, but this was actually the project where I first tried it! I downloaded the US Social Security Administration's full Baby Name dataset, which has a huge list of name-gender info for more than a century. I've limited myself to names only since 1920 here.

For every name in the SSA dataset I count the number of male and female instances, and assume a flat probability of gender. In other words, I assign fractional people to each gender (e.g. 0.74 of a woman and 0.26 of a man for a given name), with no fractional thresholding. This is not the best way to assign genders, but it is the most straightforward.
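Here is a minimal sketch of that fractional assignment in Python. It assumes the SSA data has already been reduced to (name, sex, count) rows. The function names and the toy counts are made up for illustration, and this is not the code used for the numbers in this post.

```python
from collections import defaultdict


def female_fraction_by_name(records):
    """Map each name to the fraction of its recorded births that were female."""
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex, count in records:
        counts[name.upper()][sex] += count
    return {
        name: c["F"] / (c["F"] + c["M"])
        for name, c in counts.items()
        if c["F"] + c["M"] > 0
    }


def fractional_gender_totals(first_names, p_female):
    """Sum fractional women and men over every visitor whose name has a match."""
    women = men = 0.0
    matched = 0
    for name in first_names:
        p = p_female.get(name.upper())
        if p is None:
            continue  # no name match: this visitor is left out of the ratio
        women += p
        men += 1.0 - p
        matched += 1
    return women, men, matched


# Made-up toy counts, NOT real SSA numbers.
toy_records = [
    ("Taylor", "F", 74), ("Taylor", "M", 26),
    ("James", "M", 100),
    ("Mary", "F", 100),
]
p_female = female_fraction_by_name(toy_records)
women, men, matched = fractional_gender_totals(
    ["Taylor", "James", "Mary", "Zzyzx"], p_female
)
print(f"{women:.2f} women, {men:.2f} men, {matched} matched")
```

With the toy counts, "Taylor" contributes 0.74 of a woman and 0.26 of a man, and the unmatched name is simply skipped.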

Aside: I would love to test the robustness of this method using a large corpus of names with known genders (e.g. some personnel records or similar).

The White House Visitor dataset included 3,246,486 entries. Of these, 3,105,695 (about 96%) had a name match to my SSA dataset. Of these matched names, only 4.7% had an SSA-based gender assignment that was less than 75% certain. In other words, over 95% of the White House Visitors had a first name that was a single gender more than 75% of the time in the SSA dataset. This means we can actually answer the initial question...
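As a quick sanity check on those numbers, here is a small Python sketch. The entry counts come from the paragraph above; the `is_ambiguous` helper and its name are an assumption about how the 75% cut could be written, not the original code.

```python
total_entries = 3_246_486    # all visitor records
matched_entries = 3_105_695  # records whose first name appears in the SSA data

match_rate = matched_entries / total_entries
print(f"match rate: {match_rate:.1%}")


def is_ambiguous(p_female, threshold=0.75):
    """True when neither gender reaches the threshold for this name."""
    return max(p_female, 1.0 - p_female) < threshold


# A name that is 74% female falls just under the 75% cut.
print(is_ambiguous(0.74), is_ambiguous(0.90))
```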

Gender Ratio of White House Visitors:

3,105,695 visitors

That's not bad! Consider: the gender ratio of the entire US (all ages), according to the 2010 Census is 49.1% male and 50.9% female. Note also there are many repeat visitors to the White House, which may induce a gender bias. Undoubtedly there are also some data entry problems, but we can assume those are gender-neutral.

But that's not all... the dataset also included a column describing who the visitor was scheduled to visit! Here are the gender ratios for a few selected descriptions:

1) Tourists

2,070,385 tourists
This is for all records with any variant of the words tour/tourist/tours/etc. I find this a little surprising, and it should be looked into further!


172,794 POTUS visitors
Here I've included only records that contain the exact term "POTUS" (stands for President of the United States). The "President's Men" are by and large just that: men. This makes sense to me, given the high fraction of men among CEOs, military leaders, and politicians. I'm even a bit surprised it's this close, honestly. I think this is actually a sign of very good progress.


27,989 FLOTUS visitors
The First Lady has about 6x fewer visitors listed than POTUS, and her list is dominated by female visitors. This implies a very different source of visitors to the First Lady. If watching all 7 seasons of West Wing taught me anything, it's that the First Lady is expected to deal with women's issues. There's a fascinating discussion to be had on the role of the First Lady, and what constituency she should be expected/allowed to deal with, particularly for one as well educated and brilliant as Michelle Obama.


This study is in no way conclusive, and each of the subsets of visitors I have selected may have large overlaps. However, it does provide a hint that the White House is not simply a "Boys Club", but instead some gender equality does appear to be reaching the highest office in the land. How will this change in a different administration, or with a different political party in office? (I'd love to see the records from the previous administrations!) What if a woman is elected POTUS? These are questions that only more data and time can answer. As I said in my survey of gender in astronomy talks, if we can bias the answer simply by studying the problem I'd be thrilled.

Better Living Through Data

One running theme on this blog has been that of data-driven self study. A favorite source for data about myself is my laptop battery logs. Last summer I shared what an entire year of laptop battery usage looks like, in remarkable detail. Today I'm excited to show the follow up data!

Here is what two years of laptop battery use looks like, sampled every minute I've used my computer(s). This includes 293,952 data points, at time of writing. Since the "batlog" script runs every minute, that translates to over 204 days of computer use in the last ~2 years! Yowza!

Update: Per several requests, I have added a more detailed install guide in the README file on GitHub.
This newer 2013 MacBook Air is holding up much better than the 2012 model, and I'm still consistently getting at least 6-8 hours of life out of the battery. The scatter on the battery capacity for the 2013 model is higher, which is mildly interesting. For reference, Time = 0 for the older model (blue) occurred at Tue Aug 14 10:41:46 PDT 2012, and for the newer model (red) at Sat Aug 24 12:16:00 PDT 2013.

Creature of Habit

The story of the battery is fascinating to me, from a technical perspective at least. I wonder if there is value in this sort of very well sampled data for engineering. However, this time I wanted to focus more on my own computer usage and behavior. For the past 2 years, here is when I am using my computer:
A fun thing to notice: my computer apparently wakes up a few times every night... I wonder what it's dreaming about? As I've pointed out before, the large gap last summer was my internship at MSR, where I didn't use my personal computer most days. You can also see some very long days, big streaks where I'm staying up most of the night. These are usually when I'm observing on a telescope. The figure nicely shows I'm in bed by ~12:30 or 1 AM, and up (and working) by around 8 or 9 AM.

Here are some round numbers:
Total computer use: 204 days
Longest day: 19 hours
Median day: 6.9 hours
Median week: 50.6 hours
Of course, these numbers average over more than 2 years of data, and only include the time I work on this computer specifically. All told, that's still a lot of use...
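For anyone who wants to compute similar round numbers from their own logs, here is a rough Python sketch. It assumes the log has already been parsed into one `datetime` per minute of use. The parsing step, the function name, and the toy timestamps are my assumptions for illustration, not the actual batlog format.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import median


def daily_hours(timestamps):
    """Hours of use per calendar day, assuming one log sample per minute of use."""
    minutes_per_day = Counter(ts.date() for ts in timestamps)
    return {day: n / 60.0 for day, n in minutes_per_day.items()}


# Total use implied by the sample count quoted in this post.
total_days = 293_952 / (60 * 24)
print(f"total use: {total_days:.1f} days")

# Toy log (made up): 2 hours on one day, 4 hours on the next.
start = datetime(2014, 1, 6, 9, 0)
samples = [start + timedelta(minutes=i) for i in range(120)]
samples += [start + timedelta(days=1, minutes=i) for i in range(240)]

hours = daily_hours(samples)
print("longest day:", max(hours.values()), "hours")
print("median day:", median(hours.values()), "hours")
```

Days with no samples at all are absent from the result, so the median here is over days with any use.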

I was also curious to know what my day-of-week trends look like. Here is that data, also breaking it down between the 2 laptops:

With the newer computer (Year 2) I'm working considerably less in the evenings. I'd consider this better living! (also less blogging as a trade off, alas)

Combining both years (grey trend above) I actually do the most work on Tuesday. This surprised me, as I try to make Monday my power day. Monday/Tuesday are remarkably close. Here are the actual fractions:
Mon: 18.47 %
Tue: 18.68 %
Wed: 16.70 %
Thu: 17.43 %
Fri: 13.67 %
Sat:  7.01 %
Sun:  8.01 %
I wonder how this compares with other measures of productivity... emails? commits to repos? sentiment analysis of my social network posts?
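The day-of-week fractions can be sketched in a few lines of Python. As before, this assumes one parsed `datetime` per logged minute; the function name and toy data are hypothetical, not the code behind the numbers above.

```python
from collections import Counter
from datetime import datetime, timedelta

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_fractions(timestamps):
    """Fraction of all logged minutes that fall on each day of the week."""
    counts = Counter(ts.weekday() for ts in timestamps)  # Monday is 0
    total = sum(counts.values())
    return {DAY_NAMES[d]: counts.get(d, 0) / total for d in range(7)}


# Toy log (made up): 3 minutes on a Monday, 1 minute on the following Tuesday.
monday = datetime(2014, 1, 6, 9, 0)
samples = [monday + timedelta(minutes=i) for i in range(3)]
samples.append(monday + timedelta(days=1))

fractions = weekday_fractions(samples)
for day in DAY_NAMES:
    print(f"{day}: {100 * fractions[day]:5.2f} %")
```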

Follow the Power

As the "batlog" script gathers time data, it also saves capacity and battery charge. If you compare subsequent data points you can tell if the computer was charging, plugged in, or discharging. This simple added bit of information tells volumes about what I'm doing every day. Here are the charging patterns of my laptop throughout the day:
I don't think I've ever seen a figure quite like this before. There's TONS of detail here, I love it! If the original time of day versus day figure is a silhouette of my computer usage (no details), then this version is getting to be a "fingerprint" for my life!
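Comparing subsequent data points can be sketched like this in Python. The rule that an unchanged charge means "plugged in" is my simplifying assumption, and the charge values are made up; the real batlog output may need more careful handling (for example, gaps between sessions).

```python
def classify_power_states(charges):
    """Label each step between consecutive charge readings."""
    states = []
    for prev, cur in zip(charges, charges[1:]):
        if cur > prev:
            states.append("charging")
        elif cur < prev:
            states.append("discharging")
        else:
            # Assumption: a flat reading means the laptop sits on the charger.
            states.append("plugged in")
    return states


# Made-up charge readings, one per minute.
readings = [5000, 4990, 4980, 5100, 5200, 5200]
print(classify_power_states(readings))
```

The result has one label fewer than there are readings, since each label describes the change between two neighbors.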

For clarification, here's the color scheme I've used (Spectral, the slightly less obnoxious rainbow!), based on the great implementation of the Brewer tables for IDL by Michael Galloy.

The first feature that popped out to me: you can see I spend most mornings at a cafe. This shows up as these color streaks every morning, as I sip coffee and drain my battery slowly. The length of this rainbow stripe each AM is a really good measure of how long I'm in the cafe. 

I wonder if my total productivity, or pages or lines of code written per day, is positively correlated with the time I spend in that first morning stint at the cafe. I don't have the data to answer that (yet) but my gut tells me I'm happier when I spend that extra hour working in the solitude of a coffee house.

Then the laptop charges all day while I'm at work, staying plugged in at my desk. You electronics experts out there, is this very bad for my computer battery?

Too Much Computer?

All this data, and all this time "plugged in", drives me to pose a question: can we be modern scientists and not spend 7-10 hours a day staring at a computer? Too much media/computer use isn't good for your brain. The posture you adopt is bad for your health.

I worry computers are making people less creative in some ways, and too much time online is certainly bad for your soul. You need sunlight, air, dirt. You also need to talk with people to synthesize things and generate new ideas. You need to do this a lot as a scientist. A lot more than we do, I think.

Can we use data like this to learn about our habits, and then positively inform our actions? Could such monitoring aid interventions from the computer itself? Software that says "today's been a long slog, plugged in and running for 10 hours, try and go outside" or "lots of time off the charger today, maybe go visit the office?" Could this help make us happier and more productive? What other data could we passively collect that might help inform positive change?

Needing More Data

I think my dataset is unmatched for its personal detail and duration. One of the coolest things about the blog post from a year ago was that people started sending me their battery readings. A bunch of people also got excited about this project and checked out my github repo. If you did, I would *love* for you to send me your data!!! Also, please send me some metadata, such as:

  • What model/year is your computer?
  • Is this a work or home computer primarily? Or both?
  • Briefly describe a typical weekday, in relation to your computer (when you use it, where, etc)
  • Your age/gender

More examples of this kind of passively collected quantified-self data would make for an awesome study about modern computer usage, and is something I'd like to pursue in the next year! Reach out if you have thoughts!

Guest Post: High Stakes Dice

Today I'm featuring another guest post from my good friend, Meredith. This short writeup (originally from her blog) demonstrates some basic statistics, and how they might apply to a very real world example. Given the misuse and misunderstanding of these basic stats in the media and current political discussions, and rampant junk science in my Facebook feed, I think this is a timely reminder.... take it away Meredith!
Unlikely things happen all the time.
Here’s an example. Let’s say you are rolling a 20-sided dice. You probably won’t roll a 20. I mean, you might, but you have a 1-in-20 chance, which is only 5%. This argument works for any number on the dice. Yet, you will roll some number between 1 and 20. No matter what you get, it was unlikely… but at the same time, you were bound to get an unlikely result. Weird, huh?
Now let’s say you have a very funny-looking dice with 100 sides on it. Each number only has a 1% chance of coming up. So, let’s raise the stakes a little. Each time you roll, getting 1–99 is just fine. Nothing happens. But, if you roll a 100, you have to pay $10,000.
So, don’t worry! 99% of the time you will be just fine. Just don’t roll the dice any more than you have to—it’s a pretty boring game without any apparent reward, anyway—and try not to worry too hard, because statistics is on your side. Right?

You’re curious, though. You wonder… how many times would you need to roll the dice for it to be more likely to get that 100, just once, than to avoid it completely? If you do the math [1], you’ll find that 69 rolls puts you above the 50% mark. In other words, you are more likely than not to get a 100 if you roll 69 times.
Feeling lucky? Want to keep rolling? By the time you’ve rolled that strange 100-sided dice 700 times, you are more than 99.9% likely to get the dreaded 100.

Contraception fails much more often than 1% of the time.
Every time a woman has sex with a man, she rolls a dice. Depending on her contraceptive method of choice, or lack thereof, her dice has a different number of sides on it. But each roll always holds the possibility of pregnancy. Depending on her work, health, and insurance situations, she could be out a lot more than $10,000 in the coming year, not to mention having a child to raise.
Is your dice a condom? If you use them perfectly, that’s a 2% failure rate over one year. You only need to roll 35 times to be more likely than not to get pregnant [2].
Is your dice a birth control pill? If you use them perfectly, that’s a 0.3% failure rate over one year. You need to roll 231 times to be more likely than not to get pregnant [2].

This is the absolute best case scenario for these common contraceptive methods. It is why methods like implants and IUDs with extremely low failure rates of 0.05–0.2% are gaining popularity. It is also why emergency contraception exists—think of this as a second “bonus dice” you can roll if you get unlucky with the first one.
We can play this game all day. Women play this game their whole reproductive lives. You can’t take our dice away. You can’t tell us not to roll (well, you can try, but it does absolutely no good). But apparently some employers can deny us access to certain dice and virtually all bonus dice based on a “sincerely-held belief” in junk science.
And yes, women could ignore our employers’ preferences, save our hard-earned money, and go buy whichever dice we like. But this game has a different set of rules. Suddenly we have to be able to afford the dice we want. Suddenly it is not the same game other women can play for free.
Someday, I hope all women (and men!) can have free access to all manner of highly effective, side-effect-free, reversible birth control. I know that doesn’t seem very likely to happen any time soon. But then again, unlikely things happen all the time.

[1] The math is actually pretty easy. I’ll use the notation P(something) to indicate the probability that something will happen.
P(not rolling 100) = 99/100 = 0.99
P(not rolling 100, with n rolls) = 0.99^n
P(rolling 100, with n rolls) = 1 – P(not rolling 100, with n rolls) = 1 – 0.99^n
For this last probability to be more likely than not, it needs to be greater than 50%. So we solve this equation for n, the number of rolls:
1 – 0.99^n = 0.5
This gives n = ln(0.5)/ln(0.99) ≈ 68.97, so n must be 69. In other words, if we roll 69 times, we’re more likely than not to get a 100.
If instead we want to be 99.9% sure of getting a 100, we write it like this:
1 – 0.99^n = 0.999
This gives n = ln(0.001)/ln(0.99) ≈ 687.3, so n must be 688 (nearly 700). If we roll 688+ times, we are at least 99.9% likely to roll at least one 100.
[2] Statistics from this site. Note that per-year failure rates are not necessarily the same as per-roll failure rates. Contraception failure rates are typically calculated as “the difference between the number of pregnancies expected to occur if no method is used and the number expected to take place with that method,” so while this analysis may not be completely sound, the take-home message is unchanged: highly effective birth control is incredibly important.
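The roll counts quoted in this post can be reproduced with a short Python sketch. The function name is made up for illustration. As in the post, it treats each quoted failure rate as a per-roll probability, which the note above flags as a simplification.

```python
import math


def rolls_needed(p_fail, target=0.5):
    """Smallest whole n for which 1 - (1 - p_fail)**n reaches the target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_fail))


print(rolls_needed(0.01))                # 100-sided dice, more likely than not
print(rolls_needed(0.01, target=0.999))  # 100-sided dice, 99.9% likely
print(rolls_needed(0.02))                # 2% failure rate per roll
print(rolls_needed(0.003))               # 0.3% failure rate per roll
```

These print 69, 688, 35 and 231, matching the numbers in the text.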

Lunar Coincidence

Something fundamental has been on my mind (again) recently:
Why is the Moon almost exactly the same angular size as the Sun?????

To be clear: what I mean is the Moon and Sun appear to be the same size in the sky, which has the thrilling consequence of generating the occasional total solar eclipse, like this:

This is one of those big "whoa dude" factoids to me. How can the apparent size of the sun and moon be so close?!  Why should it be so?! Most people would say it's a coincidence. This is something I've wondered about for a long time.

Recently a very interesting paper by Steven Balbus discussed this phenomenon, and the possible consequence it has on life. Consider: if the Moon were denser but had the same mass and distance, it would have nearly the same tidal effect on the Earth, yet wouldn't cause total eclipses. If the Moon were a bit further away it wouldn't raise the same tides (and possibly do a host of other interesting things), which might be fundamental for life as we know it...

So it's a handy fact that the Moon is the right mass and distance to help create life, and a damned coincidence that it also happens to be the same angular size as the Sun in our sky! Consider the cultural implications of our Sun and Moon appearing equal in size. The result is their frequent appearance in myth and legend as opposing gods.
- - -
So I started wondering... there are lots of moons in our solar system (we don't know of any moons in other planetary systems yet). Do any other moons exhibit this kind of coincidence, where the apparent diameter of the moon is the same as the Sun, as seen from the surface of the planet?

If we assume all moons are spheres, this is an easy enough calculation to do. You just need to gather the separations and sizes (radii) of the Sun, the planets, and all their moons throughout the solar system! A little geometry (see kids, not just useful for mini-golf) and you can figure out how large the Sun and each moon appears in the sky as seen from the "surface" of each planet...
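Here is a small Python sketch of that geometry for the Earth, Moon, and Sun. The radii and distances are rounded mean values that I am supplying for illustration (they are not the table used for the graph), and "from the surface" is approximated by subtracting the planet's radius from the moon's orbital distance.

```python
import math


def angular_diameter_deg(radius_km, distance_km):
    """Angular diameter of a sphere seen from distance_km away from its center."""
    return math.degrees(2.0 * math.asin(radius_km / distance_km))


# Rounded mean values in km (assumed for illustration).
SUN_RADIUS = 695_700
EARTH_SUN_DISTANCE = 149_600_000
EARTH_RADIUS = 6_371
MOON_RADIUS = 1_737
EARTH_MOON_DISTANCE = 384_400

sun_size = angular_diameter_deg(SUN_RADIUS, EARTH_SUN_DISTANCE)
# Seen from the surface directly beneath the Moon, not from Earth's center.
moon_size = angular_diameter_deg(MOON_RADIUS, EARTH_MOON_DISTANCE - EARTH_RADIUS)

print(f"Sun:  {sun_size:.3f} deg")
print(f"Moon: {moon_size:.3f} deg")
print(f"Moon/Sun ratio: {moon_size / sun_size:.3f}")
```

Both come out at roughly half a degree, with a ratio close to 1. Repeating this for every planet and moon is just a loop over a table of radii and distances.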

Here's a graph to that effect:

To get total solar eclipses you need moons that land on the line of equality in this graph (dotted line).  Indeed, 3 other satellites (besides our bff, Luna) exhibit this coincidence! Of course, you can't stand on the "surface" of Saturn or Uranus, so this is all kind of silly... Let's take a look at the "winners":

Prometheus (orbiting Saturn) 

Pandora (orbiting Saturn)

Perdita (orbiting Uranus... maybe)

The first two are potato-shaped rocks (each about 40 miles across), not the grand sphere we're used to seeing in our night sky. Pandora isn't quite like James Cameron's imagined moon. The third may not even be a "moon"... It's the little fleck the yellow arrow points at. The discovery of Perdita was disputed for a while, and has only recently been reconfirmed using HST.

That these moons (which could exactly cause total solar eclipses) are so small is really a statement about how far the giant planets truly are from the Sun. Out there the Sun is just a bright star in the sky!

There are lots of other moons that would appear very large in the sky as well. The famous Jovian moons are huge and close. Note how crazy big Charon appears compared to Pluto - this is really a "binary planet" configuration (yeah yeah yeah, I know Pluto's not a planet).
Aside: binary planets are something I've been muttering about for a couple years now... I've got $10 that says we find one in the next 5-10 years.

I'm tickled to imagine: what if beings lived in the clouds of Saturn, floating in the thin cold air, soaking up the faint sunlight? Very occasionally that somewhat brighter star would wink out completely, only to be re-lit by Prometheus, bringer of fire...