World Elevations, as Traced by Airports

I was looking through some old blog posts and datasets today, and found a gem worth revisiting. One of the simplest and most pleasing datasets I've played around with on this blog is from OurAirports.com, a totally open database of 46,000 (and counting) airports, landing strips, and helipads.

I've blogged about this dataset before in Airports of the World, which featured this image:


I went back to this dataset and found another interesting/simple parameter besides latitude and longitude: most of the airstrips include runway elevation! So I naturally wondered: could we see an elevation map of the world using only airport locations?


I've used an adaptive pixel size to generate this figure, so regions with more airports are rendered at finer resolution. (Code available on github.) The US shows amazing detail, and as the number density of airports drops off, the pixels gradually get bigger!

I think the dataset is really lacking detail in Asia. Check out this view of eastern Asia and part of the South Pacific. Fascinatingly (to me), there are some VERY high elevation airports/landing pads in the Chinese Himalayas.
I really like the adaptive pixelization, especially in the USA map. I played around with different grid/pixel schemes, including Voronoi regions, but I preferred the aesthetic of this simple brute-force pixel approach. (Code available on github.)
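The actual code is on github; here is just a minimal quadtree-style sketch of the adaptive-pixel idea, with all function names, thresholds, and defaults invented for illustration: keep splitting a cell into quadrants while it holds too many airports, then report each leaf cell's mean elevation.

```python
# Sketch of brute-force adaptive pixelization: start with one big cell,
# and recursively split any cell containing more than `max_points`
# airports into four quadrants. Leaf cells report their mean elevation.

def adaptive_pixels(points, x0, y0, x1, y1, max_points=50, min_size=0.25):
    """points: list of (lon, lat, elevation) tuples.
    Returns a list of (x0, y0, x1, y1, mean_elevation) cells."""
    inside = [p for p in points if x0 <= p[0] < x1 and y0 <= p[1] < y1]
    if not inside:
        return []
    # Stop splitting when the cell is sparse enough, or already tiny.
    if len(inside) <= max_points or (x1 - x0) <= min_size:
        mean_elev = sum(p[2] for p in inside) / len(inside)
        return [(x0, y0, x1, y1, mean_elev)]
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    cells = []
    for qx0, qy0, qx1, qy1 in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                               (x0, ym, xm, y1), (xm, ym, x1, y1)]:
        cells += adaptive_pixels(points, qx0, qy0, qx1, qy1,
                                 max_points, min_size)
    return cells
```

Dense regions (like the US) end up as many small cells, while sparse regions collapse into a few big ones, which is exactly the look in the maps above.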

One comment I made about the initial Airports of the World visualization was simply my amazement at how much of our planet is accessible by air travel. This new version adds another dimension, and shows the incredible range of elevations that people live at.

Gender of White House Visitors

Last summer, as part of my internship working with these awesome people at MSR, I spent a lot of time playing with public data sources. One fascinating dataset that I chose as a benchmark (for what is currently known as Tempe at MSR) is the White House Visitor records, which (as of last July) had over 3 million records of visitors to the White House during the Obama administration.

This dataset has been in the news before, and is (in my opinion) a great example of the public disclosure we should be pushing for in government. How and when such records should be released, and by whom, is a whole other conversation. The White House Visitor dataset is also known to be incomplete, with records withheld for national or personal security reasons, and perhaps others too.

Here is just one question I came up with: Do more men or women visit the White House? My guess was that a majority of visitors would be men.

To make this slightly more interesting, I also posted a very simple survey last summer that asked people to guess if more men or women visited the White House. The survey itself was hastily done (read: poorly done) but nearly 300 people kindly responded with their guesses. The distribution of answers looked like this:
Survey results, from 297 participants.
Basically the most common guess agreed with mine: that it would be around ~60% men.


First challenge: How does one assign gender, when all you're given is a name?
I've described how I do this before, but this was actually the project where I first tried it! I downloaded the US Social Security Administration's full baby name dataset, which has a huge list of name-gender counts spanning more than a century. Here I've limited myself to names from 1920 onward.

For every name in the SSA dataset I count the number of male and female instances, and take those as a flat gender probability. In other words, I assign fractional people to each gender (e.g. 0.74 of a woman and 0.26 of a man for a given name), with no probability thresholding. This is not the best way to assign gender, but it is the most straightforward.
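The fractional bookkeeping above can be sketched in a few lines. The counts below are toy numbers I made up for illustration, not actual SSA figures, and the function names are mine:

```python
# ssa_counts maps a first name to (male_count, female_count) totals
# from the SSA baby-name files (hypothetical numbers here).
ssa_counts = {
    "TAYLOR": (2600, 7400),   # a mostly-female name in this toy data
    "JAMES":  (9900, 100),    # a mostly-male name
}

def gender_fractions(name):
    """Return (P(male), P(female)) for a name, or None if unknown."""
    counts = ssa_counts.get(name.upper())
    if counts is None:
        return None
    m, f = counts
    total = m + f
    return m / total, f / total

def tally(names):
    """Sum fractional people across a visitor list, with no thresholding."""
    men = women = 0.0
    for name in names:
        frac = gender_fractions(name)
        if frac is None:
            continue  # name not matched in the SSA data
        men += frac[0]
        women += frac[1]
    return men, women
```

So a single "Taylor" in this toy data contributes 0.26 of a man and 0.74 of a woman to the totals, exactly the fractional-person scheme described above.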

aside: I would love to test the robustness of this method using a large corpus of names with known genders (e.g. some personnel records or similar)

The White House Visitor dataset included 3,246,486 entries. Of these, 3,105,695 (about 96%) had a first name that matched my SSA dataset. Of the matched names, only 4.7% had an SSA-based gender probability below 75%. In other words, over 95% of White House visitors had a first name that belonged to a single gender more than 75% of the time in the SSA data. This means we can actually answer the initial question...

Gender Ratio of White House Visitors:

3,105,695 visitors

That's not bad! Consider: according to the 2010 Census, the gender ratio of the entire US (all ages) is 49.1% male and 50.9% female. Note also that there are many repeat visitors to the White House, which may induce a gender bias. Undoubtedly there are also some data entry problems, but we can assume those are gender-neutral.

But that's not all... the dataset also included a column describing who the visitor was scheduled to visit! Here are the gender ratios for a few selected descriptions:

1) Tourists

2,070,385 tourists
This is for all records containing any variant of the words tour/tourist/tours/etc. I find this a little surprising, and it should be looked into further!

2) POTUS

172,794 POTUS visitors
Here I've included only records that included the exact term "POTUS" (stands for President of the United States). The "President's Men" are by and large just that: men. This makes sense to me, given the high fraction of men among CEOs, military leaders, and politicians. I'm even a bit surprised it's this close, honestly. I think this is actually a sign of very good progress.

3) FLOTUS

27,989 FLOTUS visitors
The First Lady has about 6x fewer visitors listed, and her visitors are dominated by women. This implies a very different source of visitors to the First Lady. If watching all seven seasons of The West Wing taught me anything, it's that the First Lady is expected to deal with women's issues. There's a fascinating discussion to be had on the role of the First Lady, and what constituency she should be expected/allowed to deal with, particularly for one as well educated and brilliant as Michelle Obama.

Conclusions

This study is in no way conclusive, and each of the subsets of visitors I have selected may have large overlaps. However, it does provide a hint that the White House is not simply a "Boys Club", but instead some gender equality does appear to be reaching the highest office in the land. How will this change in a different administration, or with a different political party in office? (I'd love to see the records from the previous administrations!) What if a woman is elected POTUS? These are questions that only more data and time can answer. As I said in my survey of gender in astronomy talks, if we can bias the answer simply by studying the problem I'd be thrilled.

Better Living Through Data

One running theme on this blog has been that of data-driven self study. A favorite source for data about myself is my laptop battery logs. Last summer I shared what an entire year of laptop battery usage looks like, in remarkable detail. Today I'm excited to show the follow up data!

Here is what two years of laptop battery use looks like, sampled every minute I've used my computer(s). This includes 293,952 data points, at time of writing. Since the "batlog" script runs every minute, that translates to over 204 days of computer use in the last ~2 years! Yowza

Update: Per several requests, I have added a more detailed install guide in the README file on github. 
This newer 2013 MacBook Air is holding up much better than the 2012 model, and I'm consistently still getting 6-8 hours of life out of the battery at least. The scatter on the battery capacity for the 2013 model is higher, which is mildly interesting. For reference, Time = 0 for the older model (blue) occurred at Tue Aug 14 10:41:46 PDT 2012, and for the newer model (red) at Sat Aug 24 12:16:00 PDT 2013.

Creature of Habit

The story of the battery is fascinating to me, from a technical perspective at least. I wonder if there is value in this sort of very well sampled data for engineering. However, this time I wanted to focus more on my own computer usage and behavior. For the past 2 years, here is when I am using my computer:
A fun thing to notice: my computer apparently wakes up a few times every night... I wonder what it's dreaming about? As I've pointed out before, the large gap last summer was my internship at MSR, where I didn't use my personal computer most days. You can also see some very long days, big streaks where I'm staying up most of the night; these are usually when I'm observing at a telescope. The figure nicely shows I'm in bed by ~12:30 or 1 AM, and up (and working) by around 8 or 9 AM.

Here are some round numbers:
Total computer use: 204 days
Longest day: 19 hours
Median day: 6.9 hours
Median week: 50.6 hours
Of course, these are averaging data from over 2 years, and only include the time I work on this computer specifically. All told, that's still a lot of use...

I was also curious to know what my day-of-week trends look like. Here is that data, also breaking it down between the 2 laptops:

With the newer computer (Year 2) I'm working considerably less in the evenings. I'd consider this better living! (also less blogging as a trade off, alas)

Combining both years (grey trend above) I actually do the most work on Tuesday. This surprised me, as I try to make Monday my power day. Monday/Tuesday are remarkably close. Here are the actual fractions:
Mon: 18.47 %
Tue: 18.68 %
Wed: 16.70 %
Thu: 17.43 %
Fri: 13.67 %
Sat:  7.01 %
Sun:  8.01 %
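The day-of-week fractions above are just a tally over the per-minute samples. Assuming each batlog entry carries a Unix timestamp (the function name and structure here are my own sketch, not the actual analysis code):

```python
# Sketch of the day-of-week tally: each batlog sample represents one
# minute of computer use, so the fraction of samples per weekday is
# the fraction of total use on that day.
from collections import Counter
from datetime import datetime

def weekday_fractions(timestamps):
    """Fraction of samples falling on each weekday (Mon..Sun)."""
    days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    counts = Counter(datetime.fromtimestamp(t).weekday() for t in timestamps)
    total = sum(counts.values())
    return {days[d]: counts.get(d, 0) / total for d in range(7)}
```

The same tally against email timestamps or git commit times would make the productivity comparison below straightforward.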
I wonder how this compares with other measures of productivity... emails? commits to repos? sentiment analysis of my social network posts?

Follow the Power

As the "batlog" script gathers time data, it also saves capacity and battery charge. If you compare subsequent data points you can tell if the computer was charging, plugged in, or discharging. This simple added bit of information tells volumes about what I'm doing every day. Here are the charging patterns of my laptop throughout the day:
I don't think I've ever seen a figure quite like this before. There's TONS of detail here, I love it! If the original time of day versus day figure is a silhouette of my computer usage (no details), then this version is getting to be a "fingerprint" for my life!
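The charging/plugged-in/discharging classification comes from comparing successive charge readings. Here is a sketch of how that might look; the labels, tolerance, and tuple layout are my own guesses, not the actual batlog code:

```python
# Infer charging state from successive battery charge readings.
# Each sample is (minutes_since_start, charge_mAh).

def charge_state(prev_charge, curr_charge, tol=1):
    """Classify a sample by comparing it with the previous one."""
    delta = curr_charge - prev_charge
    if delta > tol:
        return "charging"        # charge rising: plugged in, charging
    if delta < -tol:
        return "discharging"     # charge falling: running on battery
    return "plugged in"          # steady: plugged in, fully charged

def label_samples(samples):
    """Label every consecutive pair of samples."""
    return [charge_state(a[1], b[1]) for a, b in zip(samples, samples[1:])]
```

A small tolerance keeps minute-to-minute sensor noise from flickering between states.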

For clarification, here's the color scheme I've used (Spectral, the slightly less obnoxious rainbow!), based on Michael Galloy's great implementation of the Brewer color tables for IDL.

The first feature that popped out to me: you can see I spend most mornings at a cafe. This shows up as these color streaks every morning, as I sip coffee and drain my battery slowly. The length of this rainbow stripe each AM is a really good measure of how long I'm in the cafe. 

I wonder if my total productivity, or pages or lines of code written per day, is positively correlated with the time I spend in that first morning stint at the cafe. I don't have the data to answer that (yet), but my gut tells me I'm happier when I spend that extra hour working in the solitude of a coffee house.

Then the laptop charges all day while I'm at work, staying plugged in at my desk. You electronics experts out there, is this very bad for my computer battery?

Too Much Computer?

All this data, and all this time "plugged in", drives me to pose a question: can we be modern scientists and not spend 7-10 hours a day staring at a computer? Too much media/computer use isn't good for your brain. The posture you adopt is bad for your health.

I worry computers are making people less creative in some ways, and too much time online is certainly bad for your soul. You need sunlight, air, dirt. You also need to talk with people to synthesize things and generate new ideas. You need to do this a lot as a scientist. A lot more than we do, I think.

Can we use data like this to learn about our habits, and then positively inform our actions? Could such monitoring aid interventions from the computer itself? Software that says "today's been a long slog, plugged in and running for 10 hours, try and go outside" or "lots of time off the charger today, maybe go visit the office?" Could this help make us happier and more productive? What other data could we passively collect that might help inform positive change?

Needing More Data

I think my dataset is unmatched for its personal detail and duration. One of the coolest things about the blog post from a year ago was that people started sending me their battery readings. A bunch of people also got excited about this project and checked out my github repo. If you did, I would *love* for you to send me your data!!! Also, please send me some metadata, such as:

  • What model/year is your computer?
  • Is this a work or home computer primarily? Or both?
  • Briefly describe a typical weekday, in relation to your computer (when you use it, where, etc)
  • Your age/gender

More examples of this kind of passively collected quantified-self data would make for an awesome study about modern computer usage, and is something I'd like to pursue in the next year! Reach out if you have thoughts!

Guest Post: High Stakes Dice

Today I'm featuring another guest post from my good friend, Meredith. This short writeup (originally from her blog) demonstrates some basic statistics, and how they might apply to a very real-world example. Given the misuse and misunderstanding of these basic stats in the media and in current political discussions, plus the rampant junk science in my Facebook feed, I think this is a timely reminder... take it away, Meredith!
Unlikely things happen all the time.
Here’s an example. Let’s say you are rolling a 20-sided dice. You probably won’t roll a 20. I mean, you might, but you have a 1-in-20 chance, which is only 5%. This argument works for any number on the dice. Yet, you will roll some number between 1 and 20. No matter what you get, it was unlikely… but at the same time, you were bound to get an unlikely result. Weird, huh?
Now let’s say you have a very funny-looking dice with 100 sides on it. Each number only has a 1% chance of coming up. So, let’s raise the stakes a little. Each time you roll, getting 1–99 is just fine. Nothing happens. But, if you roll a 100, you have to pay $10,000.
So, don’t worry! 99% of the time you will be just fine. Just don’t roll the dice any more than you have to—it’s a pretty boring game without any apparent reward, anyway—and try not to worry too hard, because statistics is on your side. Right?

You’re curious, though. You wonder… how many times would you need to roll the dice for it to be more likely to get that 100, just once, than to avoid it completely? If you do the math [1], you’ll find that 69 rolls puts you above the 50% mark. In other words, you are more likely than not to get a 100 if you roll 69 times.
Feeling lucky? Want to keep rolling? By the time you’ve rolled that strange 100-sided dice 700 times, you are more than 99.9% likely to get the dreaded 100.

Contraception fails much more often than 1% of the time.
Every time a woman has sex with a man, she rolls a dice. Depending on her contraceptive method of choice, or lack thereof, her dice has a different number of sides on it. But each roll always holds the possibility of pregnancy. Depending on her work, health, and insurance situations, she could be out a lot more than $10,000 in the coming year, not to mention having a child to raise.
Is your dice a condom? If you use them perfectly, that’s a 2% failure rate over one year. You only need to roll 35 times to be more likely than not to get pregnant [2].
Is your dice a birth control pill? If you use them perfectly, that’s a 0.3% failure rate over one year. You need to roll 231 times to be more likely than not to get pregnant [2].

This is the absolute best case scenario for these common contraceptive methods. It is why methods like implants and IUDs with extremely low failure rates of 0.05–0.2% are gaining popularity. It is also why emergency contraception exists—think of this as a second “bonus dice” you can roll if you get unlucky with the first one.
We can play this game all day. Women play this game their whole reproductive lives. You can’t take our dice away. You can’t tell us not to roll (well, you can try, but it does absolutely no good). But apparently some employers can deny us access to certain dice and virtually all bonus dice based on a “sincerely-held belief” in junk science.
And yes, women could ignore our employers’ preferences, save our hard-earned money, and go buy whichever dice we like. But this game has a different set of rules. Suddenly we have to be able to afford the dice we want. Suddenly it is not the same game other women can play for free.
Someday, I hope all women (and men!) can have free access to all manner of highly effective, side-effect-free, reversible birth control. I know that doesn’t seem very likely to happen any time soon. But then again, unlikely things happen all the time.

[1] The math is actually pretty easy. I’ll use the notation P(something) to indicate the probability that something will happen.
P(not rolling 100) = 99/100 = 0.99
P(not rolling 100, with n rolls) = 0.99^n
P(rolling 100, with n rolls) = 1 − P(not rolling 100, with n rolls) = 1 − 0.99^n
For this last probability to be more likely than not, it needs to be greater than 50%. So we solve this equation for the number of rolls n:
1 − 0.99^n = 0.5
We get n = 69. In other words, if we roll 69 times, we’re more likely than not to get a 100.
If instead we want to be 99.9% sure of getting a 100, we write it like this:
1 − 0.99^n = 0.999
which tells us n must be 688 (nearly 700). If we roll 688+ times, we are 99.9% likely to roll at least one 100.
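The footnote's arithmetic generalizes to any per-roll failure rate, and is easy to check numerically. A small helper (the function name is mine) finds the smallest number of rolls n for which P(at least one "100") = 1 − (1 − p)^n crosses a given threshold:

```python
import math

def rolls_needed(p_fail, target):
    """Smallest n with 1 - (1 - p_fail)**n >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p_fail))

# With a 1% chance per roll, 69 rolls pass the 50% mark and 688 pass
# 99.9%. The same formula reproduces the contraception numbers above:
# a 2% failure rate needs 35 rolls, a 0.3% rate needs 231.
```

This works by solving (1 − p)^n ≤ 1 − target for n and rounding up to a whole number of rolls.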
[2] Statistics from this site. Note that per-year failure rates are not necessarily the same as per-roll failure rates. Contraception failure rates are typically calculated as “the difference between the number of pregnancies expected to occur if no method is used and the number expected to take place with that method,” so while this analysis may not be completely sound, the take-home message is unchanged: highly effective birth control is incredibly important.