Sarah Loebman, a graduate student at the University of Washington (UW), Seattle, studies the Milky Way galaxy to find out how it arrived at its present structure. She works with two groups in the astronomy department there, one that surveys the night skies and another that runs high-resolution simulations. Both contend with huge volumes of data. “Earlier in my career, I spent a good part of the day just loading data into my computer,” she says.
When a physics colleague won a NASA grant to explore how database technology could be used in astronomy, she collaborated with him and with faculty from the computer science department to see what could be done with her unwieldy data sets. The first thing she did was sign up for a graduate-level class on database management systems. It changed the way she saw her work: “Using a database shifted my focus to look beyond just one moment in time in a simulation,” she says. Soon she was helping other colleagues deal with their data and streamline their work.
Loebman, who in 2009 presented a paper called “Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help?” will now be heading to the University of Michigan, Ann Arbor, for postdoctoral work. She believes she won the fellowship, from the Michigan Society of Fellows, on the strength of her interdisciplinary work.
The coming deluge
Ed Lazowska, who holds the Bill & Melinda Gates Chair in Computer Science & Engineering at UW, believes that data-driven discovery will become the norm, as he told Careers in a recent interview. This new environment, he says, will create and reward researchers (like Loebman) who are well versed in both the methodologies of their specific fields and the applications of data science. He calls such people “pi-shaped” because they have two full legs, one in each camp.
“All science is fast becoming what is called data science,” says Bill Howe of UW’s eScience Institute. Today, there are sensors in gene sequencers, telescopes, forest canopies, roads, bridges, buildings, and point-of-sale terminals. Every ant in a colony can be tagged. The challenge is to extract knowledge from this vast quantity of data and transform it into something of value. Lately, Lazowska says, he has been hearing this refrain from researchers in engineering, the sciences, the social sciences, law, medicine, and even the humanities: “I am drowning in data and need help analyzing and managing it.”
Learning to code, and becoming comfortable with large datasets, may soon be a necessity in many traditional scientific fields. Many scientists already write scripts for the “plumbing” that automates routine data-related tasks and moves data among various analysis tools. Those basic skills, and that basic infrastructure, set the stage for more rapid, automated data management. But to make optimal use of that rapidly accumulating data, scientists need additional computer expertise in databases, visualization, machine learning, and parallel systems.
Motivated researchers can pick up the skills needed to get a handle on big data in a reasonable amount of time, Howe says, although it’s likely to be easier for those who have some expertise in statistics and certain branches of mathematics. “It is doable,” agrees Greg Wilson, founder of Software Carpentry, an organization funded by the Mozilla and Alfred P. Sloan foundations, which has been helping scientists build better software for 15 years.
Wilson designed courses on program construction, debugging, and version control, “purely as an exercise in self-defense,” he says. He graduated in 1992 and, 6 years later, got his chance to teach these principles of efficient coding to scientists and engineers at the Los Alamos National Laboratory. Wilson did stints in industry and academia, and he is now a full-time employee of the Mozilla Foundation, where he trains volunteers to teach programming boot camps at campuses worldwide.
To young researchers looking to move into data-driven science, Wilson offers this advice: Choose data-intensive projects, remain unfazed, and learn strategies for managing the volume of data. “Learning how to clean up data, how to detect and deal with its inconsistencies, is a hands-on craft, just like learning to use equipment in a physical lab,” he says. You get better with practice, and there is no better time and place to practice than graduate school, he adds. And most of the difficult problems aren’t really programming problems: “They’re about knowing what kinds of analyses make sense, and whether the results of those analyses also make sense.” Open source communities, he says, are good places to seek programming mentors.
Finding classes to take
To help traditional researchers broaden into data science, some universities have put together certificate programs in data science or data mining. There are also introductory online classes, such as one offered on Coursera.
Julie Messier, a fourth-year graduate student in ecology at the University of Arizona, measures 35 plant traits over 400 trees and across 25 species in a temperate forest reserve in Canada. She must employ programming and statistical techniques to analyze the matrix of data she has amassed for her thesis. She found a semester-long course at the University of Utah, “Programming for Biologists,” that seemed ideal, but Utah was far away and the class is not offered online. Ethan White, the course’s instructor, directed her to Software Carpentry, for which he serves as a volunteer. Recognizing that others in her department were in the same boat she was in, Messier organized a 2-day crash course on programming at the Tucson campus.
Software Carpentry volunteers expect compensation for travel costs (including housing), but they require no other payment. With structured courses, students learn things they need plus things that they may never find any use for, whereas with learning “on the fly,” students learn only what they need, but oftentimes the process is inefficient and frustrating, Messier says. She found the crash course to be a good supplement to the learning-as-needed process. (Read Messier’s blog about her experience.)
The “on-the-fly” approach may not be perfect, but it seems to have worked for Jevin West, a postdoctoral researcher in physics at Umeå University in Sweden who has a biology Ph.D. West co-founded the Eigenfactor Project, which aims to rank and map scientific knowledge on the principle that scholarly literature forms a vast network, where the nodes are papers and citations are the links. “We can use the networks to evaluate scholarly influence and (most importantly) use them to better navigate the ever-expanding literature,” he explains. When he first became interested in computation, he had never taken a formal programming course, but he was surrounded by people who knew the field well. He says, “I was fortunate. I could always ask them questions.”
An early start helps. Andrew White, a graduate student in UW’s chemical engineering department, played with his parents’ discarded Apple computer as a child. In high school, he briefly aspired to become a computer hacker, and now he uses computational modeling to discover new biomaterials. He has created Web-based applications that others can use to analyze data, and he has designed an online application for sharing datasets with fellow researchers.
To learn the craft, self-taught coders like him, he says, typically read online tutorials and books, view each other’s code, and ask questions in public forums. Then, in graduate school, he took a few electives from the computer science department to ensure he had thoroughly learned the fundamentals of programming.
There are many paths to pi-shapedness.