Carlos Monroy’s Magic with Massive Data

CS Research Scientist Carlos Monroy works magic with massive data sets, so it makes sense that his first experience with a computer felt magical. “In my senior year of high school, my school acquired computers to introduce students to the new ‘computing’ field,” he said.

The machines had a black, 5-inch screen displaying green text and used a floppy disk to store information. Monroy said, “I was fascinated by the process: instructions someone had written and stored on the floppy disk could make something happen on the screen, in that magic box.”

But he planned to major in business in college. He said, “In Guatemala, you have to declare your major when you begin. I was standing in line to sign up for classes when I noticed a table about a new major, computer science.” Based on a few brochures, Monroy changed his major in a matter of minutes.

He graduated with his bachelor’s in CS at the peak of the dot-com frenzy, so he created a small software consulting firm and began automating a customer service center for a large electronics company. He also worked part-time as a researcher and lecturer for his university.

“All the time, I was really curious, always wanting to learn more.” Although he originally planned to pursue an MBA, one of his friends at Texas A&M encouraged Monroy to apply to the CS program. He intended to stay only two years, but his advisor suggested continuing as a PhD student.

He said, “We had been collaborating with people in Literature. They were comparing early printed copies of the Cervantes book ‘Don Quixote,’ a literary masterpiece in Spanish. They were trying to understand what exactly Cervantes wrote, because his original manuscript did not survive. Although there were multiple copies, none matched. Intense analytics on the text was being done with CS. Since I spoke Spanish, they asked if I would be interested and that was my first project.”

Another scholar was analyzing Picasso’s artworks, both the images and an extensive historical narrative, with the idea of combining text and images. Monroy became involved when the collection had reached 4,000 images [out of 50,000 to be catalogued]. “He needed a better infrastructure,” said Monroy, who readily accepted the challenge.

About the time Monroy needed to choose a dissertation topic, a friend who was researching sunken ships invited him to visit their lab. He said, “They had boxes and boxes of drawings, notes, and photos and my first question was, ‘What database do you use?’ They were trying to document the wrecks in order to compile ideas of how ships were built.”

Monroy’s research team wanted to link the disparate databases, even though they had no common standards. “We proposed a multi-lingual search engine for shipbuilding. The texts [describing the construction of ships] were re-indexed in French, English, Spanish and Portuguese,” he said, “but these were 17th- and 18th-century technical manuscripts. The languages have changed and even the meaning of words has changed since then. Today, Google is clever about giving you the top search results, tailored for you. Our top results would be based on words, and relationships to words.”

Monroy is also interested in improving learning practices. “The way you play a game is a footprint for how you learn,” he said. “How do game players (learners) solve their problems? Which resources are the most effective and what is the best sequence to impact outcomes? What about pacing?”

His interest led to an invitation to participate in a week-long meeting to brainstorm with the National Science Foundation about big ideas for teaching and learning. “We were asked to come up with creative ways to analyze these data,” he said. “My term – learniformatics – looks at this problem as a system, where all the parts are interconnected. If we understand those interconnections, we can better understand the whole process.”

Monroy’s current project has a similar theme. He is part of a team seeking to improve computer programming. Several Rice computer scientists, along with teams from other institutions, are working on a joint effort encompassing big data, machine learning, programming languages and databases.

“I am part of a team led by Dr. Chris Jermaine that is working on a high-performance distributed store and compute platform. It will enable us to store large amounts of data and perform analytics tasks to support programmers, improving software productivity, security and quality,” he said.

At present, PLINY’s curated source datasets total about 14 terabytes, encompassing nearly 125 million files from 400,000 projects. Recently, one of the teams reached the milestone of processing 1 billion lines of code.

But he is also excited about collaborating with different disciplines. Recently, the Rice Data Management Team invited Monroy to talk about data curation as part of Data Week. In his talk, entitled “The Art of Data Curation in the Age of Data Science,” Monroy shared some ideas about his experience working with diverse datasets in a wide range of domains.

“The Data Science initiative at Rice is really interesting to me,” he said, “because of my current projects and past work in text analytics and data mining. Through this initiative, we in CS can enable and help scholars and researchers across the community; we also enrich our own discipline through the problems and expertise that they bring to the table. Business, statistics, humanities, visualization… this is the type of mutual enrichment I find really critical and fascinating.”