This dataset has been in the news before, and is (in my opinion) a great example of public disclosure that we should be pushing for in government. A whole other conversation of course is how/when such records should be released, and by whom. The White House Visitor dataset is also known to be incomplete, censoring records for national or personal security reasons, and maybe other reasons too.
Here is just one question I came up with: Do more men or women visit the White House? My guess was that a majority of visitors would be men.
To make this slightly more interesting, I also posted a very simple survey last summer that asked people to guess if more men or women visited the White House. The survey itself was hastily done (read: poorly done) but nearly 300 people kindly responded with their guesses. The distribution of answers looked like this:
[Figure: Survey results, from 297 participants.]
First challenge: How does one assign gender, when all you're given is a name?
I've described how I do this before, but this was actually the project where I first tried it! I downloaded the US Social Security Administration's full Baby Name dataset, which has a huge list of name-gender info covering more than a century. I've limited myself to names from 1920 onward here.
For every name in the SSA dataset I count the number of male and female instances, and use the resulting ratio as a flat probability of gender for that name. In other words, I assign fractional people to each gender (e.g. 0.74 of a woman and 0.26 of a man for a given name), with no threshold that rounds a name to a single gender. This is not the best way to assign genders, but it is the most straightforward.
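Here is a minimal sketch of that counting step in Python. It assumes the SSA records have already been read into `(name, sex, count)` rows; the function name `build_female_fraction` and the sample counts are made up for illustration and are not real SSA numbers or my original code.

```python
from collections import defaultdict


def build_female_fraction(rows):
    """Map each lowercased first name to the fraction of its records marked female.

    rows: iterable of (name, sex, count) tuples, with sex given as "F" or "M".
    """
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex, count in rows:
        # Sum counts across all years so each name gets one overall ratio.
        counts[name.lower()][sex] += int(count)
    return {
        name: c["F"] / (c["F"] + c["M"])
        for name, c in counts.items()
        if c["F"] + c["M"] > 0
    }


# Hypothetical counts, chosen only to reproduce the 0.74 / 0.26 example.
sample_rows = [
    ("Taylor", "F", 74),
    ("Taylor", "M", 26),
    ("Mary", "F", 100),
    ("John", "M", 50),
]

female_fraction = build_female_fraction(sample_rows)
print(female_fraction["taylor"])  # 0.74 of a woman, so 0.26 of a man
```

Keeping the raw fraction, rather than rounding each name to a single gender, is what lets ambiguous names contribute partially to both totals.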
Aside: I would love to test the robustness of this method using a large corpus of names with known genders (e.g. some personnel records or similar).
The White House Visitor dataset included 3,246,486 entries. Of these, 3,105,695 (about 96%) had a name match to my SSA dataset. Of the matched entries, only 4.7% had an SSA-based gender probability below 75% for their most likely gender. In other words, over 95% of the matched White House visitors had a first name that was a single gender more than 75% of the time in the SSA dataset. This means we can actually answer the initial question...
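For the curious, the match rate and the share of ambiguous names could be computed along these lines. This is a sketch under assumptions, not the code I actually ran: `summarize_visitors`, the tiny lookup table, and the visitor names are all hypothetical, and the lookup is assumed to map lowercased first names to a female fraction.

```python
def summarize_visitors(first_names, female_fraction, threshold=0.75):
    """Tally fractional genders for a list of visitor first names.

    female_fraction: dict mapping lowercased name -> fraction female (0 to 1).
    threshold: a name counts as ambiguous if its most likely gender
               has a probability below this value.
    """
    total = matched = ambiguous = 0
    women = men = 0.0
    for raw in first_names:
        total += 1
        p_female = female_fraction.get(raw.strip().lower())
        if p_female is None:
            continue  # no match in the name table; skip this visitor
        matched += 1
        women += p_female          # fractional woman
        men += 1.0 - p_female      # fractional man
        if max(p_female, 1.0 - p_female) < threshold:
            ambiguous += 1
    return {
        "match_rate": matched / total if total else 0.0,
        "ambiguous_rate": ambiguous / matched if matched else 0.0,
        "female_share": women / matched if matched else 0.0,
        "male_share": men / matched if matched else 0.0,
    }


# Hypothetical lookup table and visitor list, for illustration only.
lookup = {"mary": 1.0, "john": 0.0, "taylor": 0.74, "casey": 0.5}
visitors = ["Mary", "JOHN", " Taylor", "Casey", "Xyzzy"]

summary = summarize_visitors(visitors, lookup)
print(summary)
```

In this toy list, four of the five names match, and two of those four (Taylor and Casey) fall below the 75% threshold.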
Gender Ratio of White House Visitors:
That's not bad! Consider: the gender ratio of the entire US (all ages), according to the 2010 Census, is 49.1% male and 50.9% female. Note also that there are many repeat visitors to the White House, which may induce a gender bias. Undoubtedly there are also some data entry problems, but we can assume those are gender-neutral.
But that's not all... the dataset also included a column describing who the visitor was scheduled to visit! Here are the gender ratios for a few selected descriptions:
[Figure: 172,794 POTUS visitors]
[Figure: 27,989 FLOTUS visitors]
This study is in no way conclusive, and each of the subsets of visitors I have selected may have large overlaps. However, it does provide a hint that the White House is not simply a "Boys Club", but instead some gender equality does appear to be reaching the highest office in the land. How will this change in a different administration, or with a different political party in office? (I'd love to see the records from the previous administrations!) What if a woman is elected POTUS? These are questions that only more data and time can answer. As I said in my survey of gender in astronomy talks, if we can bias the answer simply by studying the problem I'd be thrilled.