Perhaps envious that it attained neither the top party nor the top engineering school designations this year (though it came really close in both categories), my alma mater has lured the makers of Budweiser, AB InBev, to campus: not to one of the lively beer-soaked campustown bars on Green Street, but to its bucolic and cutting-edge research park on the south end of campus. While that area of farms and engineering labs certainly could use additional places to party, AB InBev’s venture at U of I’s Research Park has a somewhat different purpose. The beer maker will work with an interdisciplinary team of experts, led by computer scientists in data mining, artificial intelligence, networking, and parallel computing, to develop tools and algorithms for discerning market trends and consumer preferences. The site, which will appropriately be called Bud Lab, will function as a state-of-the-art data mining and analytics research center. Bud Lab provides yet another example of the intense interest government and industry have in extracting meaning from unprecedented amounts of data.
The field of data mining is a rapidly growing discipline within Computer Science. Recent articles in the popular press, including this one from CNN Money, have characterized the growth in data-mining careers as an employment boom. A telling passage from that article reads as follows:
“In the U.S. alone, a McKinsey & Company report projects a shortfall of between 140,000 and 190,000 ‘deep analytical’ big data professionals by 2018 — that is, people with highly technical skills in machine learning, statistics, and/or computer science, the actual hands-on big data people that know how to crunch huge data sets into meaningful information.”
That’s great news for Computer Scientists. Accordingly, at Lewis, we recently added courses in data mining and machine learning to our undergraduate Computer Science curriculum, and we won a site license for the popular data analytics tool Qlikview through the Qlikview Academic Program. Our department is committed to preparing new Computer Scientists for increasingly important jobs as data engineers.
The ubiquity of data and the proliferation of software for working with it have led to a rapid escalation in the amount and variety of data stored. It also has led to a growing demand for tools to make sense of it all. After all, too much data can be a problem if you don’t know how to look at it clearly. Indeed, having developed an infrastructure to collect and store data about their projects, customers, and processes, many organizations now wrestle with turning all that accumulated data into something of value. A new crop of companies, working in a space collectively called “Big Data”, has emerged to help organizations process, organize, and interpret their data. Even companies whose core businesses aren’t in IT have pursued opportunities to enter the field. AB InBev’s initiative at U of I is an example of the investments big companies are making in finding effective ways to turn data into business gold.
Traditionally, Computer Scientists have helped organizations store, manage, and query their data using relational database management systems, or RDBMSs. RDBMSs organize data as a set of tables that refer to one another through shared columns. A language called Structured Query Language (SQL) allows database engineers to ask questions about the data in an English-like language, hiding the complexities of the relational operations that must take place behind the scenes to filter and retrieve the data. As the volume of data has increased, however, the performance limitations of this arrangement have surfaced and have become a growing concern for database designers and users. Computer Scientists have developed algorithms and platforms to address these performance limitations, giving rise to a new and growing category of technologies colloquially called “NoSQL”. NoSQL databases have been popularized by Google, Amazon, and other firms that must manage vast quantities of data to keep their organizations informed and working.
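To make the relational idea concrete, here is a tiny sketch using Python’s built-in sqlite3 module. The tables and column names (customers, orders, customer_id) are invented for illustration; the point is that two tables relate through a shared column, and a single English-like SQL query performs the join and aggregation behind the scenes.

```python
import sqlite3

# Two tables that refer to each other through a shared column
# (customer_id) -- the essence of the relational model.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER, total REAL)")

cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 25.00), (11, 1, 40.00), (12, 2, 15.50)])

# The declarative query hides the relational operations (join, group,
# aggregate) that the engine carries out to produce the answer.
cur.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""")
rows = cur.fetchall()
print(rows)  # [('Ada', 65.0), ('Grace', 15.5)]
conn.close()
```

The engineer states *what* is wanted; the RDBMS decides *how* to compute it. It is exactly this query-planning layer that strains as data volumes grow, which is part of what pushed large firms toward NoSQL alternatives.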
My previous employer, the electric utility sector, also sees the need for effective data mining and analytics. In fact, I recently was involved in a project that attempted to compress and organize data from devices called phasor measurement units, or PMUs (also known as synchrophasors), so that utilities could make sense of the data they collect at rates of 60 to 120 times per second. Likewise, scientists from biology, chemistry, and physics increasingly conduct experiments that capture data on a microsecond to nanosecond time scale, or they gather data from processes involving millions or even billions of data points. These huge data gathering efforts are made possible because sensor and storage technology have advanced to the point where all that data can be captured and kept. As a result, the problem has now shifted to how to organize and make sense of it all.
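One common tactic for taming such high-rate streams — offered here as a hypothetical sketch, not the actual method used in the PMU project — is to collapse fixed windows of raw readings into summary statistics, trading resolution for manageable volume while preserving the excursions a plain average would smooth away.

```python
from statistics import mean

def downsample(samples, window):
    """Summarize fixed-size windows of a high-rate stream.

    Each window of raw readings collapses to (min, mean, max).
    The window size and choice of statistics are illustrative
    assumptions, not a real utility's compression scheme.
    """
    out = []
    for i in range(0, len(samples) - window + 1, window):
        w = samples[i:i + window]
        out.append((min(w), mean(w), max(w)))
    return out

# One second of 60-samples-per-second readings shrinks to four
# 15-sample summaries -- a 20x reduction in stored values.
readings = [float(i % 15) for i in range(60)]
summary = downsample(readings, 15)
print(len(summary))  # 4
```

Keeping the minimum and maximum alongside the mean matters in practice: a momentary voltage spike that lasts a few samples would vanish from an averaged-only record.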
Virtually every pursuit that gathers and acts upon data can be improved with the help of data mining solutions. That is why data mining is such a critical part of the future of Computer Science.
While I am excited for the future, I must admit that I’m more than a little jealous of the researchers at the Bud Lab. I’ve often joked that I’d work for beer. Those people actually get to do it.