Hack the textbook figures

Every single figure in the textbook Statistics, Data Mining, and Machine Learning in Astronomy is downloadable and fully reproducible online. Jake VanderPlas accomplished this heroic feat as a graduate student at the University of Washington. Jake recalled the origin story to some of us at the hack week: he would usually have a figure done the same week it was conceived, and he was happy with the whole experience of helping to make the textbook and ultimately becoming a coauthor. His figures are now indispensable, and because of Jake's investment, generations of astronomers to come can benefit from reproducing the book's explanatory material.

The figures are complementary to the textbook prose. The prose explains the theoretical framework underlying the concepts and derives the equations. But by digging into the Python code behind a figure, the reader can see how a method is actually implemented, and try it out by tweaking the input: "What happens if I double the noise? Or decimate the number of data points? Or change this-or-that parameter? How long does it take to run?"

These and other questions motivated my hack idea, which was to dig into the source code of the textbook figures and start hacking.
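
The book's figure code itself isn't reproduced here, but a toy stand-in gives the flavor of the kind of tweak I mean. This is a minimal sketch, with all numbers invented for illustration: fit a straight line to noisy data, then "hack" the figure by doubling the noise and watching the fit respond.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy stand-in for a textbook figure: fit a line to noisy data,
# then "hack" it by doubling the noise level.
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)

for noise, marker in [(1.0, 'o'), (2.0, 's')]:
    y = 2.0 * x + 1.0 + noise * rng.randn(x.size)
    slope, intercept = np.polyfit(x, y, 1)
    plt.plot(x, y, marker, alpha=0.5,
             label="noise=%.0f: fitted slope=%.2f" % (noise, slope))
    plt.plot(x, slope * x + intercept, '-')

plt.legend()
plt.show()
```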

So on Wednesday of the Hack Week, a table of about eight of us hacked on the book figures. The figure above is one of those figures.

Read more…

Bayesian Evidence Calculation

In a Bayesian framework, object classification or model comparison can be done naturally by comparing the Bayesian evidence of two or more models, given the data. The evidence is the integral of the likelihood of the data, weighted by the prior, over the entire prior volume of the model parameters. (The ratio of the evidences of two different models is known as the Bayes factor.) This multi-dimensional integral becomes increasingly computationally expensive as the number of parameters grows. As a result, several clever algorithms have been developed to approximate the answer efficiently.
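
In symbols (standard notation, independent of any particular implementation): for data $D$ and a model $M$ with parameters $\theta$,

$$Z \equiv p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, \mathrm{d}\theta, \qquad B_{12} = \frac{Z_1}{Z_2},$$

where $B_{12}$ is the Bayes factor comparing models $M_1$ and $M_2$.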

In this hack, I looked at a couple of specific Python implementations of such algorithms.
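
Those implementations aren't shown here, but purely as an illustration of what the integral means numerically, here is a brute-force Monte Carlo estimate for a toy one-dimensional model (everything below is invented for illustration). Its inefficiency in higher dimensions is exactly why the clever algorithms exist.

```python
import numpy as np

# Toy 1-D model: data from a unit-variance Gaussian with unknown mean mu,
# and a uniform prior on mu over [-5, 5].
rng = np.random.RandomState(0)
data = rng.normal(loc=1.0, scale=1.0, size=20)

# Brute-force Monte Carlo estimate of the evidence:
# Z ~= (1/N) * sum_i L(theta_i), with theta_i drawn from the prior.
mu = rng.uniform(-5, 5, size=100000)
log_L = (-0.5 * ((data[None, :] - mu[:, None]) ** 2).sum(axis=1)
         - 0.5 * data.size * np.log(2 * np.pi))
log_Z = np.logaddexp.reduce(log_L) - np.log(mu.size)
print("log-evidence estimate:", log_Z)
```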

Read more…

Hacked Ethnographic Fieldnotes

Brittany Fiore-Silfvast is a postdoctoral fellow with UW's Moore/Sloan-funded Data Science Environment and works as a data science ethnographer. Her research focuses on the social and organizational dimensions of data-intensive transformations in arenas such as scientific research, healthcare, global development, design and construction, and warfare.

What is data science ethnography anyway?

As an ethnographer of data science, I immerse myself in particular communities to understand how they make sense of the world, how they communicate, what motivates them, and how they work together. I spent a week at Astro Hack Week, which might as well have been a foreign culture to me. I participated as an active listener, trying to sensitize myself to the culture and discern patterns that may not be self-evident to people within the community. Ethnography can have the effect of making the ordinary strange, such that the norms, objects, and practices that the community takes for granted become fascinating, informative sites for learning and discovery. Many of the astro hackers were probably thinking, "Why is this woman hanging around watching me code on my laptop? There is nothing interesting here." But I assured them it was interesting to me, because I was seeing their everyday practice in the context of a complex social and technical world that is in flux.

Ethnography can be thought of as a form of big data. Typically, hundreds of pages of fieldnotes, interview transcripts, and artifacts from the field are collected over a long period of time, until the ethnographer determines they have reached a point of saturation. Analysis co-occurs with data collection, iteratively shaping the focus of the research and the observation strategy. The ethnographer then has to make sense of this massive dataset, with its abundance of unwieldy dimensions.

Read more…

Time Series Forecasting with Random Forest

After Josh Bloom's wonderful lecture on Random Forest regression, I was excited to try out his example code on my Kepler data. Josh explained regression with machine learning as taking many data points with a variety of features/attributes, and using the relationships between these features to predict some other parameter. He explained that the Random Forest algorithm works by constructing many decision trees, whose individual predictions are combined into the final prediction.

I wondered: could I use Random Forest (RF) to do time series forecasting? Of course, as Jake noted, RF only predicts single properties, so it isn't a good choice for trend forecasting over long time periods (well, maybe). Instead, the idea was to use RF just to predict the next data point.
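
My hack's own code isn't shown here, but a minimal sketch of the "predict the next point" idea in scikit-learn looks something like this (the sinusoid and all parameters are toy stand-ins for real Kepler data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy "light curve": a noisy sinusoid standing in for Kepler photometry.
t = np.linspace(0, 10, 500)
flux = np.sin(2 * np.pi * t) + 0.1 * np.random.RandomState(0).randn(t.size)

# Lagged features: use the previous n_lag points to predict the next one.
n_lag = 10
X = np.array([flux[i:i + n_lag] for i in range(flux.size - n_lag)])
y = flux[n_lag:]

# Train on the first 80% of the series, then predict the remainder
# one step at a time.
split = int(0.8 * y.size)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[:split], y[:split])
y_pred = rf.predict(X[split:])
print("RMS error:", np.sqrt(np.mean((y_pred - y[split:]) ** 2)))
```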

Read more…

K2 Photometry

For my AstroHackWeek project, I decided to hack on the new images coming from NASA's K2 mission, the second generation of the Kepler satellite. The original Kepler mission obtained exquisite photometric precision because the satellite's pointing was stable to better than a hundredth of a pixel. For K2, this is no longer the case, so we'll need to work a little harder to extract useful photometric measurements from these data. That being said, these pointing variations also break some of the degeneracies between the flat field of the detector and the PSF, so we might be able to learn some things about Kepler that we couldn't have with the previous data releases.

At the hack week, I got a proof-of-concept implemented but there's definitely a lot to do if we want to develop a general method. The basic idea is to build a flexible probabilistic model inspired by what we know about the physical properties of Kepler and then optimize the parameters of this model to produce a light curve.
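
The real model lives in the kpsf code linked below; as a much-simplified sketch of the idea (the Gaussian "PSF", the grid size, and all numbers are invented for illustration), treat each frame as a single star with a per-frame pointing offset, and optimize the flux and position frame by frame:

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup: one star on a small pixel grid, observed over several frames
# with small pointing offsets (roughly the situation K2 is in).
ny, nx, nframes = 10, 10, 20
yg, xg = np.mgrid[:ny, :nx]

def render(flux, cx, cy, sigma=1.0):
    """Evaluate a circular Gaussian 'PSF' model on the pixel grid."""
    r2 = (xg - cx) ** 2 + (yg - cy) ** 2
    return flux * np.exp(-0.5 * r2 / sigma ** 2)

# Simulate frames with a jittering pointing and Gaussian pixel noise.
rng = np.random.RandomState(42)
offsets = 0.3 * rng.randn(nframes, 2)
frames = [render(1000.0, 4.5 + dx, 4.5 + dy) + rng.randn(ny, nx)
          for dx, dy in offsets]

# Fit each frame: parameters are (flux, cx, cy); minimize chi-squared.
# The fitted fluxes form the light curve.
def chi2(p, data):
    return np.sum((data - render(*p)) ** 2)

light_curve = [minimize(chi2, x0=[data.sum(), 4.5, 4.5], args=(data,)).x[0]
               for data in frames]
```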

The figure at the top of this page shows, on the left, a single frame observed in the engineering phase of K2 and, on the right, the optimized model for the same frame. The code lives (and is being actively developed) on GitHub at dfm/kpsf, and the K2 data can be downloaded from MAST using Python and the git version of kplr.

Read more…

Hack Week Responses: Blogs and Twitter

We're still working on getting some hack summaries here on the website. In the meantime, some of the hack week participants have been blogging and writing about their own thoughts from the week! I wanted to compile a few of these responses here:

  • Ruth Angus' post on AstroBites: an excellent high-level summary of what went on through the week! Ruth does a great job of capturing the spirit of the event, as well as some of the details of what she learned and took home.

  • David Hogg's daily research posts: If you've never come across Hogg's Research Blog, it's worth digging around in it for a while. Each day, Hogg writes a brief summary of what he worked on or thought about for the day, and the result is a nearly decade-long log of his scientific ideas and interactions. He did this through the hack week as well, and you can read his thoughts here: Day 1, Day 2, Day 3, Day 4, Day 5.

  • Adrian Price-Whelan's IPython Cell Macros hack: so often when starting up an IPython notebook, we find ourselves importing all the same tools in the first cell. Now there's a button which adds this cell automatically!

  • UW eScience has a quick writeup: drawn mostly from our earlier post here. Great to have recognition from one of our sponsoring organizations!

And of course, there's always Twitter. Many of the week's participants were tweeting with the #AstroHackWeek tag. Below are a few randomly chosen highlights.

I know that I, for one, am really hoping this happens again.

Read more…

Multi-Output Random Forests

Classic machine learning algorithms map multiple inputs to a single output. For example, you might have five photometric observations of a galaxy and predict a single attribute or label (like the redshift, metallicity, etc.). When multiple outputs are desired, standard practice is to essentially run independent classifications: first predict one variable, then the next. The problem with this approach is that it completely ignores correlations between the outputs.

This was my Thursday hack: to explore ideas for improving on this within Random Forests.
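
As a minimal sketch of the contrast (the toy data below is invented; scikit-learn's RandomForestRegressor happens to accept multi-output targets natively, which is one way to let the trees see the outputs jointly):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: five "photometric" features and two correlated outputs
# (standing in for, say, redshift and metallicity).
rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
z = X.dot(rng.rand(5))
Y = np.column_stack([z, z ** 2]) + 0.05 * rng.randn(1000, 2)
X_train, X_test, Y_train = X[:800], X[800:], Y[:800]

# Standard practice: one independent forest per output.
pred_indep = np.column_stack([
    RandomForestRegressor(n_estimators=50, random_state=0)
    .fit(X_train, Y_train[:, j]).predict(X_test)
    for j in range(Y.shape[1])
])

# Joint alternative: a single forest fit on both outputs at once,
# so each split is chosen with the outputs considered together.
joint = RandomForestRegressor(n_estimators=50, random_state=0)
joint.fit(X_train, Y_train)
pred_joint = joint.predict(X_test)
```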

Read more…

Astro Hack Week Wrapup

The first Astro Hack Week took place September 15-19, 2014 at the University of Washington. We had about 45 attendees through the week. We spent the mornings together learning new coding, statistics, and data analysis skills, and the afternoons working in pairs and groups on a wide variety of projects. These projects spanned a range of topics, comprising everything from short exercises to the development of teaching materials to full-blown research projects that will likely lead to publications!

Along with these hacks, the afternoons were also punctuated by informal breakout sessions on everything from using Git to constructing Probabilistic Graphical Models. Thanks to all the participants who stepped up to lead these breakouts and share their expertise with others!

We've set up this blog to report and record some of the results of the workshop. Over the next few weeks, we hope that everyone who attended will write a short post (or two!) and let us know what they worked on and learned during the week!

Finally, a huge thanks to our sponsors, the Moore Foundation, the Sloan Foundation, and the UW eScience Institute.

Stay tuned, and look for more posts in the coming days and weeks!