Random Forest for Time Series Forecasting

I recently spent a week at the 2014 Astro Hack Week, a week-long summer school + hack event full of astronomers (and some brave others). The week was full of high level chats about statistics, data analysis, coffee, and astrophysics. There was a great crowd of people, many of whom you can (and should) follow on Twitter. Below is a quick post I wrote up detailing one of my afternoon "hack projects", which was originally posted on the HackWeek's blog here.

After Josh Bloom's wonderful lecture on Random Forest regression I was excited to try out his example code on my Kepler data. Josh explained regression with machine learning as taking many data points with a variety of features/attributes, and using relationships between these features to predict some other parameter. He explained that the Random Forest algorithm works by constructing many decision trees, which are combined to construct the final prediction.

I wondered: could I use Random Forest (RF) regression to do time series forecasting? Of course, as Jake noted, RF only predicts single properties. As a result, RF isn't a good choice for trend forecasting over long time periods (well, maybe). Instead, I would use RF to predict just the next data point.

I only had a couple hours to play with the code, so the implementation was simple. I took a Quarter-long chunk of long cadence light curve for the exoplanet host star, Kepler 17. I smoothed the light curve and down-sampled it to 0.2 day bins. Here is what the light curve looked like. You can see significant starspot modulations, which are nearly sinusoidal with a period of ~12 days and evolving in amplitude.
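For illustration, the down-sampling step might look something like this minimal numpy sketch. The synthetic sinusoid standing in for the Kepler 17 light curve, the sampling cadence, and the function name are all assumptions for the example, not my original code:

```python
import numpy as np

def bin_lightcurve(time, flux, bin_width=0.2):
    """Down-sample a light curve by averaging the flux in fixed-width time bins."""
    edges = np.arange(time.min(), time.max() + bin_width, bin_width)
    idx = np.digitize(time, edges) - 1
    # keep only bins that actually contain data
    keep = [i for i in range(len(edges) - 1) if np.any(idx == i)]
    binned_time = np.array([time[idx == i].mean() for i in keep])
    binned_flux = np.array([flux[idx == i].mean() for i in keep])
    return binned_time, binned_flux

# toy stand-in: ~90 days of long cadence (roughly 30 min) sampling
time = np.linspace(0.0, 90.0, 4320)
flux = np.sin(2 * np.pi * time / 12.0)  # ~12 day starspot-like modulation
binned_time, binned_flux = bin_lightcurve(time, flux, bin_width=0.2)
```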

I binned the light curve up into 20 day windows, moving each subsequent window forward by one datum. Each time window was considered one "data point", and all the flux values within the window were the "features". The flux value of the next data point after the window was the value to predict. Since the light curve was smoothly varying, the RF method did a very good job of predicting the next flux values! The experiment would be more interesting with more stochastic (read: profitable) data.
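A minimal sketch of this windowing scheme using scikit-learn's RandomForestRegressor. The window size (100 bins = 20 days of 0.2-day bins), the forest settings, and the synthetic sinusoid standing in for the real light curve are all assumptions for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_windows(flux, window=100):
    """Each run of `window` consecutive fluxes is one sample (its features);
    the flux immediately after the window is the target to predict."""
    X = np.array([flux[i:i + window] for i in range(len(flux) - window)])
    y = flux[window:]
    return X, y

# toy stand-in: a slowly varying 12-day sinusoid in 0.2-day bins
flux = np.sin(2 * np.pi * np.arange(450) * 0.2 / 12.0)
X, y = make_windows(flux, window=100)  # 20 days / 0.2 days = 100 features

n_train = 300
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[:n_train], y[:n_train])
pred = rf.predict(X[n_train:])  # one-step-ahead forecasts
```

Forecasting further ahead would mean either feeding predictions back in as features, or training a separate forest per horizon.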

One neat output of the RF regression is an analysis of which input features were important in the final decision trees used for prediction! In this next figure I show this "Importance" metric as a function of feature. Since the features were evenly spaced flux values, we can show them sorted by their lag time behind the predicted value!
What you can see is that the last two data points before the prediction (the first two features here) carry most of the weight. This is because the curve in Figure 1 is slowly varying with time. However, there is also "importance power" at lags centered around 12 days: the rotation period!

Indeed, this looks somewhat like a periodogram! It shows the lag times that correlate most strongly with a given data point. This isn't exactly right... and I haven't wrapped my head around it fully. You might need to re-run the entire RF prediction for the n+1 data point, and the n+2, and the n+3... to build an actual periodogram. But nevertheless we are recovering time-spectral power without assuming any functional form.

This kind of analysis has a couple huge parameters that need tuning: How big should the window be? How down-sampled should the light curve be?

Also a problem at present: This implementation requires uniform (continuous) sampling, and each "data point" (window) was required to have the same number of points. I think RF can handle missing data or features, but I haven't played with it enough to know the mechanics of doing so. Could it deal with irregularly sampled data, making the times and fluxes be the features, instead of just the fluxes? I don't know... yet. Maybe I'll find out at the next Hack Week!

My sloppy, wandering code on the subject can be found on GitHub here.

Map of FM Radio Station Towers

Here's a curious map I made.

I was recently driving in the southwest, cruising along long stretches of highway that get no FM radio reception. Usually we need to bring CDs or hook up the iPhone to the car, but we were lucky enough to have a rental with SiriusXM, and it was pretty awesome... but I digress.

As a child my dad told me that FM basically only worked along line-of-sight, and not over very long distances, and that's why we had to listen to The Cars on cassette while driving to the Grand Canyon instead of the radio (I kid, Dad. And also I love The Cars still).

So while I was driving along HWY-380 in New Mexico I started to think about the distribution of radio coverage. To cover most of the country there must be thousands of radio towers! Indeed, there are...  around 27,000 of them in the US alone! Here's a map of their coverage across the country...

Super neat! I got the geometries from this handy FCC site. There's a ton of other information available from the FCC, such as radio call signs. A follow-up project idea: Google or Bing search (whichever API is easier to use) every radio station in this database (using call sign and city name) and categorize each station's genre based on the results. You could then use Google Maps to chart a road trip, and these coverage maps indexed by genre to determine optimal radio station choices along the way!

On the one hand, this is largely a population density map, with some geography mixed in: you can see some mountains in the contours. However, there's a more gridded layout along the East coast and in the Midwest, providing almost uniform coverage of FM service.

Also, see if you can spot the radio-quiet region around the Green Bank radio telescope!

And just because it's kind of neat, here's Hawaii's radio coverage. Some of the stations only cover one island (it would have been better if I drew the islands... hmm), and you can see on the Big Island (bottom right) coverage holes near those famous big mountains they have!
I also made the figure for Alaska, but it was very sparse. It turns out (not surprisingly) the place in the country with the worst radio coverage per square mile is absolutely Alaska... and that's OK. We should build radio telescopes there.

World Elevations, as Traced by Airports

I was looking through some old blog posts and datasets today, and found a gem worth revisiting. One of the simplest and most pleasing datasets I've played around with on this blog was from OurAirports.com, a totally open-source database of 46,000 (and counting) airports, landing strips, and helipads.

I've blogged about this dataset before in Airports of the World, which featured this image:

I went back to this dataset and found another interesting/simple parameter besides latitude and longitude. Most of the airstrips included runway elevation! So I naturally wondered: could we see an elevation map of the world using only airport locations?

I've used an adaptive pixel size here to generate this figure, so where there are more airports you see finer resolution. (Code available on github) The US has amazing detail, and as the number density of airports drops off the pixels gradually get bigger!

I think the dataset is really lacking detail in Asia. Check out this area of Eastern Asia and some of the South Pacific. Fascinatingly (to me), there are some VERY high elevation airports/landing pads in China in the Himalayas.
I really like the use of the adaptive pixelization, especially in the USA map. I played around with different kinds of grid/pixel schemes, including Voronoi regions, but I liked the aesthetic of this simple brute-force pixel approach.
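My actual gridding code is on GitHub; as a rough illustration of the brute-force idea, a quadtree-style scheme that keeps splitting a cell while it holds too many points could look like this (the function name, the 50-point threshold, the depth limit, and the clustered toy data are all made up for the sketch):

```python
import numpy as np

def adaptive_pixels(x, y, values, max_points=50, depth=0, max_depth=8,
                    bounds=None, out=None):
    """Recursively split a rectangle into quadrants until each cell holds
    at most max_points points; return a list of (bounds, mean value) leaves."""
    if out is None:
        out = []
    if bounds is None:
        # pad the right/top edges so the half-open tests below keep every point
        bounds = (x.min(), x.max() + 1e-9, y.min(), y.max() + 1e-9)
    x0, x1, y0, y1 = bounds
    mask = (x >= x0) & (x < x1) & (y >= y0) & (y < y1)
    n = mask.sum()
    if n == 0:
        return out
    if n <= max_points or depth >= max_depth:
        out.append((bounds, values[mask].mean()))
        return out
    xm, ym = 0.5 * (x0 + x1), 0.5 * (y0 + y1)
    for bx in ((x0, xm), (xm, x1)):
        for by in ((y0, ym), (ym, y1)):
            adaptive_pixels(x, y, values, max_points, depth + 1, max_depth,
                            (bx[0], bx[1], by[0], by[1]), out)
    return out

# toy data: clustered "airports" with an elevation value at each point
rng = np.random.default_rng(42)
x = np.concatenate([rng.uniform(0, 1, 800), rng.uniform(0, 0.1, 200)])
y = np.concatenate([rng.uniform(0, 1, 800), rng.uniform(0, 0.1, 200)])
elev = x + y
cells = adaptive_pixels(x, y, elev, max_points=50)
```

Dense regions end up with many small cells and sparse regions with a few big ones, which is exactly the fine-detail-where-the-airports-are effect in the maps.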

One comment I made about the initial Airports of the World visualization was simply my amazement in how much of our planet is accessible by air travel. This new version adds another dimension, and shows the incredible range of elevations that people live at.

Gender of White House Visitors

Last summer, as part of my internship working with these awesome people at MSR, I spent a lot of time playing with public data sources. One fascinating dataset that I chose as a benchmark (for what is currently known as Tempe at MSR) is the White House Visitor records, which (as of last July) had over 3 million records of visitors to the White House during the Obama administration.

This dataset has been in the news before, and is (in my opinion) a great example of public disclosure that we should be pushing for in government. A whole other conversation of course is how/when such records should be released, and by whom. The White House Visitor dataset is also known to be incomplete, censoring records for national or personal security reasons, and maybe other reasons too.

Here is just one question I came up with: Do more men or women visit the White House? My guess was that a majority of visitors would be men.
