Jacob Gao: At the Intersection of Databases and Machine Learning

Sixth-year CS PhD student Zekai “Jacob” Gao chose Rice University for his graduate work for two reasons: CS Professor Chris Jermaine and the Houston location. “I knew about Chris because he’d received a best paper award in SIGMOD,” Gao said. “And I’m a city lover, so I chose the offer from Rice in Houston over another from a university in a smaller town.”

SIGMOD is the Association for Computing Machinery’s (ACM) Special Interest Group on Management of Data, which focuses on databases and large-scale data management problems. “Whenever you do research in databases, you just know SIGMOD,” said Gao, “so when I saw Chris publish his prototype paper, I got in touch with him and came to Rice.”

Gao said his interest in databases began in his undergraduate years. “This area at the intersection of databases and machine learning, well it is a hot topic in industry. For example, Pandora plays you the next song based on what you like or what other listeners like, that is machine learning.”

Pandora is an appropriate example for Gao, who interned there as well as at Facebook over the last two summers. “It all ties back to the close relationship of databases to machine learning and how we can make machine learning more efficient and scalable,” he said.

One way to create greater efficiencies is to group data and model parameterizations. Gao said, “Very often in a large-scale machine learning computation, you set up the model parameters and run a single data record set through the model. Then you set up another set of parameters for the next data record and run that. It’s a very time-consuming process.”

Jermaine and Gao are collaborating with Niketan Pansare of IBM’s Almaden Research Center on a SimSQL database project. “We argue that we’ve found a type of pattern in machine learning algorithms that needs more careful study,” said Gao. “If many data records share the same set of model parameters then we can find opportunities for optimization because we’ll be able to group these data together. That means we can avoid repetitive model parameterizations and speed up the computation because these data can be fitted to just one set of the model’s parameters.”
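The grouping idea Gao describes can be illustrated with a small Python sketch. This is not SimSQL or Spark code and not the team's implementation; the function names (`setup_model`, `score_per_record`, `score_grouped`) and the toy linear "model" are assumptions made purely for illustration. The sketch counts how often the model is parameterized when records are handled one at a time versus grouped by the parameter set they share.

```python
from collections import defaultdict


def setup_model(param_key):
    """Hypothetical stand-in for an expensive model parameterization."""
    setup_model.calls += 1
    # Toy model (an assumption): a linear scorer whose weight depends on the key.
    weight = float(len(param_key))
    return lambda x: weight * x


setup_model.calls = 0  # counts how many times the model is parameterized


def score_per_record(records):
    """Naive approach: parameterize the model again for every record."""
    return [setup_model(key)(x) for key, x in records]


def score_grouped(records):
    """Group records that share a parameter key, then parameterize once per group."""
    groups = defaultdict(list)
    for i, (key, _) in enumerate(records):
        groups[key].append(i)

    out = [None] * len(records)
    for key, indices in groups.items():
        model = setup_model(key)  # one setup for the whole group
        for i in indices:
            out[i] = model(records[i][1])
    return out


if __name__ == "__main__":
    records = [("a", 1.0), ("bb", 2.0), ("a", 3.0), ("bb", 4.0), ("a", 5.0)]

    naive = score_per_record(records)
    naive_calls = setup_model.calls

    setup_model.calls = 0
    grouped = score_grouped(records)
    grouped_calls = setup_model.calls

    print("same results:", naive == grouped)
    print("setups without grouping:", naive_calls)
    print("setups with grouping:", grouped_calls)
```

In this toy case both approaches return the same scores, but the grouped version performs one setup per distinct parameter set rather than one per record, which is the kind of saving the quote points to.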

Gao said, “We sought out Niketan because we wanted to make SimSQL more widely available for public use. There is a more general-use system called Spark and Niketan is very familiar with it. So we each worked on one system to see if this kind of pattern can be found, and what use we could make of it. We’ve had a lot of Skype dates as we’re collaborating.”

Communication is important to Gao, who was surprised at how little people seemed to talk among themselves when he first arrived at Rice five years ago. For example, Pansare was in his research group, but the two never actually met until Gao finished his first year at Rice. “It [communication] has become much better since the CS GSA was founded,” said Gao.

“Sometimes you need to de-stress and you want to know what your peers are thinking and working on. Just talking with them socially helps,” he said. “The CS GSA organizes lots of study breaks and other opportunities to talk about what we’re working on. We discover we have the same kind of worries, and it is important to be able to offer support and find support through each other at the same level.”

Gao advises new graduate students to find ways to talk with faculty members and peers. “I think it’s very important to promote what you do, person to person. The more you talk about it, the more you know what you are working on, and it helps you find the problems in your research.”

He also encourages students to work with their advisor as well as in groups. “It is different from team to team, but in my research it’s a good thing if you actually work with your advisor so you can learn from them. On the other hand, you may find a lot of problems or pressure when you aren’t sharing the load with other people. Working in teams helps, but you still need to work on your own with your advisor.”

Although Gao thoroughly enjoys his research and his team, he hopes to find a place in industry where his database and machine learning expertise is needed. “I’d really like to work on real-world problems,” he said.