Jia Zou, a research scientist at Rice University, enjoys the creative possibilities CS has to offer. Zou’s interest in scientific research was born the day she entered an international, 72-hour mathematical contest in modeling with two college classmates.
The competition, sponsored by the Consortium for Mathematics and Its Applications (COMAP), required each team to build a mathematical model to investigate a real-world issue.
Each team was challenged to devise an algorithm that uses data from an anemometer to adjust the water flow from a fountain as wind conditions change. Zou and her friends worked within constraints such as keeping the fountain an attractive spectacle while avoiding soaking. Winning a meritorious prize ushered Zou into a lifelong love of research.
“It’s really exciting. It’s not like I incrementally improve a system. I get to build new systems from ground up with a great team,” she said.
Zou received her Ph.D. in Computer Science from Tsinghua University, China, and worked as a researcher at IBM Research-China. Her research currently focuses on large-scale distributed systems and big data management systems.
Zou works on the PlinyCompute project with a team led by Chris Jermaine, who is principal investigator (PI) of the Pliny Project funded by DARPA.
The team’s paper on the PlinyCompute system, “PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development,” was accepted to the 2018 ACM SIGMOD Conference. They will present the paper this summer.
Zou explained that systems centered on user-defined functions (UDFs), such as Spark, are the trend in analytics, but they have drawbacks. PlinyCompute makes UDFs efficiently analyzable for query optimization.
The paper describes how PlinyCompute, a system for developing high-performance, data-intensive, distributed computing tools and libraries, can help programmers.
“More and more people need to do complex analytics of big data. This requires a large-scale distributed analytics platform to help users to develop their analytic applications,” she said.
PlinyCompute is unique in what it offers at two scales. At the macro level, it gives programmers a high-level declarative interface that helps them set up distributed computations through automatic, relational-database-style optimizations. At the micro level, it gives experienced systems programmers access to a persistent object data model and API, as well as a memory management system designed for high-performance, data-intensive operations.
“Several years ago a lot of new systems widely adopted in the industry emerged such as Hadoop and Spark. They allow users to use high level programming language like Java to develop the analytics application. Although those systems allow users to define many complex manipulations in Java, they have several significant drawbacks,” Zou said.
“To solve those problems we decided to use C++ to avoid the JVM run time overhead, to enable the efficient compilation of analyzable UDF and also the high-performance object model,” she said.
Zou was responsible for designing and implementing key components of the PlinyCompute prototype presented in the paper, including its distributed query processing engine, storage and buffer pool management, physical optimizer and query scheduler, and cluster management. She also led the system integration, performance evaluation and optimization.
Zou’s paper highlights many benefits of PlinyCompute, such as greater speed. Implementing complex object manipulation and non-trivial, library-style computations on top of PlinyCompute can result in a speedup of 2x to more than 50x compared to equivalent implementations on Spark.
“Existing works like High Performance Computing (HPC) and TensorFlow provide good performance but bad productivity. Systems like Spark provides good productivity but worse performance. We provide good productivity that allows people to more easily and freely develop the applications and good performance, which makes things faster,” she said.
The daily work of a research scientist can be divided into two parts, Zou said: she works on the existing research agenda while keeping an eye on future projects.
“Academic research requires more leadership. When working in the industry, you are serving a product line and your research agenda is well defined by the company’s strategy. In the academic world there’s more flexibility in choosing research problems, but such flexibility means you need to drive yourself harder, you need to be more creative because no one provides you with research problems, you have to find them on your own,” she said.
Zou wants to use her research to improve lives in the future.
“I want to continue to do really good work in building and improving UDF-centric systems to facilitate analytics on big and fast arriving data. There’s a lot of promising research in those areas with a lot of challenges. It can really benefit people’s lives and the world.”
–Cintia Listenbee, CS Publicist