When the MIT Big Data Challenge asked, “What can you learn from data about 2.3 million taxi rides?” graduate students in professor Marta González’s research lab had some answers.
Based on their experience writing machine-learning algorithms that find meaningful patterns in very large data sets, and on their skill applying those patterns to understand how people use transportation in urban areas, the students were able to predict the number of taxi pickups that had occurred in 700 time intervals at 36 locations in the Boston area.
Their predictions were the best in the competition, earning them the number one spot and $4,000 in prize money. The scientific visualization of the data prepared by one team member garnered a second-place prize and an additional $1,000. The awards were announced mid-March.
Graduate student Yingxiang Yang of the Department of Civil and Environmental Engineering (CEE) led the prediction team, which included CEE graduate students Lauren Alexander, Serdar Colak, and Suma Desu, along with Engineering Systems Division graduate student Jameson Toole, whose visualization won second prize.
The students work with González, an assistant professor in CEE, whose Human Mobility and Networks (HuMNet) Lab combs through massive repositories of passive data generated by cellphones and other networked systems. To do this, lab members employ methods from statistical physics and network theory to identify relevant patterns and make inferences about human mobility and other aspects of city science.
Their familiarity with human mobility in urban areas gave the team an edge in the MIT Big Data Challenge, Yang and Toole say.
“The key, and the hard part really, is to figure out what features in all these data sets are going to be useful and which can be ignored,” Yang says. For instance, they knew already that precipitation is the most relevant weather predictor in transportation decisions, so they could ignore other weather data.
The MIT Big Data Challenge: Transportation in the City of Boston was sponsored by MIT’s Computer Science and Artificial Intelligence Laboratory, the City of Boston, and Transportation@MIT. The goal was for competitors to develop algorithms that could take several large data sets and anticipate taxi need in prescribed locations around Boston. Some also created compelling spatial visualizations to convey the data in insightful ways.
Competition officials provided roughly six months of data including hourly weather conditions; numbers and locations of taxi dropoffs and pickups divided into two-hour intervals; transaction data from the MBTA; information about events; and geolocations of tweets. The taxi pickup data omitted information from 700 of the two-hour intervals and required teams to predict that missing information.
While many of the other top finalists treated machine-learning tools or artificial intelligence as an end in themselves, for the HuMNet Lab that’s only the beginning of the process.
“I always tell my students: Use human intelligence to inform artificial intelligence,” says González. “We want to apply our results to real-world problems.”
This is why Toole, in preparing his visualization, determined the likely routes taken by taxis on airport runs. “Our brain doesn’t think in terms of census tracts; it thinks about streets,” Toole says. “So I wanted to map things to roads because that’s the way we know the city.”
Instead of showing only the basics (the number of taxi pickups and dropoffs by date and time of day), he added the numbers per census tract, which are displayed when the cursor rolls over a tract. The visualization also goes beyond the traditional heatmap to show taxi routes and the magnitude of road use when a user clicks on a census tract.
He included the census tract boundaries because the demographics available for census tracts are valuable to the HuMNet Lab in later research, where they support inferences about population groups and activities at the locations visited.
Examples of González’s and the HuMNet Lab’s research include mining cellphone and census data to pinpoint the feeder roads and source communities that generate most of the traffic congestion in the Boston and San Francisco metropolitan areas, and discovering common underlying motifs in the daily travel behavior of entire populations of cities on different continents.
González hopes her work can one day be fed back online to the urbanites whose data she uses, helping them make better travel decisions and allowing them to interact with the information to help other users.
She teaches a graduate subject on big data, 1.204 Transportation Networks. And beginning next spring, she will offer a new undergraduate subject, 1.022 Urban Networks, that will draw on engineering, applied mathematics, computer science, and statistical physics to analyze real-world data sets. She’s also excited about a new doctoral degree program, the Program in Computational Science and Engineering in CEE, which incorporates classes on Internet databases and machine learning methods — both of which were important to her students in meeting the MIT Big Data Challenge.
González sees worlds of possibility for using big data in city sciences. “Now that all this passively generated data is available, we can apply the same scientific methods traditionally used to describe systems in the natural world to make models of cities,” she says. “This can help bring urban policy decisions into the domain of science.”