We now live in a world where it seems that everything about us is (or soon will be) tracked and recorded: what we eat, what we watch, how we socialize, what we like and dislike, our vital health statistics—and the list goes on.
Such unprecedented access to personal data presents potentially enormous opportunities to, for instance, help government officials make better policy decisions, allow businesses to operate more efficiently and profitably, streamline the use of public resources, support more personalized healthcare and drug design, and otherwise improve the overall quality of life in our society. The key to seizing these opportunities lies in our ability to convert the available data into significant decisions.
Data Science and Statistics: Opportunities and Challenges
This new six-week online course begins October 4.
An upcoming six-week online MIT Professional Education course, Data Science: Data to Insights, offered in partnership with the MIT Institute for Data, Systems, and Society (IDSS), will focus on analytics. It will also cover the latest trends in machine learning, how to extract meaningful insights and preferences from customer data, and how to ask the right questions to make better business decisions.
Over the past few decades, we have built infrastructure that can store and process massive amounts of data. However, we still lack the critical ability to seamlessly stitch together various pieces of data to make meaningful predictions that lead to high-impact decisions. Given the endless opportunities that can be unlocked by addressing this shortcoming, I believe this is one of the defining challenges of our times.
Educational institutions can play a leading role in addressing this important challenge. At MIT, the IDSS and its new Statistics and Data Science Center (SDSC) will help address the challenge of turning data into real-world decisions with a two-pronged approach:
- Educating our students to be able to work with large amounts of data and to use the tools to extract meaningful information from it. Put another way, we must educate students in all disciplines to be both data scientists and statisticians. This requires that institutions design a streamlined, interdisciplinary educational program that includes elements from engineering, mathematical sciences, and the social sciences.
- Developing a research program that eventually produces a statistical data-processing system that can be readily used to make all sorts of accurate predictions. Such a system needs to work with heterogeneous data sources, operate at scale, and lead to predictions that can be effectively interpreted. This ambitious program could help mobilize an interdisciplinary and exciting intellectual effort in data science and statistics for the next decade or beyond.
Thinking About Decisions
Let’s consider how decisions are made. In a typical organization, basic operational tasks depend on decisions about how to invest available resources among different competing options, with an eye on one or more objectives.
For example, the U.S. government makes such decisions while developing its budget. A trading firm invests money in different financial instruments to create portfolios with high returns and, potentially, well-understood risks. A retail organization makes decisions about which merchandise purchases will generate high revenues and profits. A household makes decisions about how to get the most out of the family income. A rational individual makes decisions about what to eat (and what not to eat) to get enough energy and stay healthy.
All such decisions, in a nutshell, boil down to making predictions, then undertaking certain optimization activities using those predictions.
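To make the "predict, then optimize" split concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than a method described in this article: the option names, the past returns, the per-option spending cap, and the choice of a simple historical average as the predictor. The prediction step estimates each option's return per dollar, and the optimization step spends a fixed budget on the options with the highest predicted return.

```python
from statistics import mean


def predict_returns(history):
    """Prediction step: estimate each option's return per dollar
    as the mean of its past observations (an assumed, very simple model)."""
    return {option: mean(values) for option, values in history.items()}


def allocate_budget(predicted, budget, cap_per_option):
    """Optimization step: fund options in order of predicted return,
    spending at most `cap_per_option` on each until the budget runs out."""
    allocation = {}
    remaining = budget
    for option in sorted(predicted, key=predicted.get, reverse=True):
        if remaining <= 0:
            break
        spend = min(cap_per_option, remaining)
        allocation[option] = spend
        remaining -= spend
    return allocation


# Hypothetical past returns per dollar invested in each option.
history = {
    "option_a": [1.10, 1.30, 1.20],
    "option_b": [0.90, 1.00, 0.95],
    "option_c": [1.40, 1.50, 1.60],
}

predicted = predict_returns(history)
allocation = allocate_budget(predicted, budget=100, cap_per_option=60)
print(allocation)
```

In practice both steps would be far richer, but the structure stays the same: the quality of the final allocation depends directly on the quality of the predictions fed into it.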
How It Works: A Retail Example
Now let’s look at a concrete example involving an apparel retailer. The retailer’s primary operational problem is figuring out which products to showcase for customers, given various operational constraints such as its budget for buying inventory, the limits on its stores’ shelf space, and its suppliers’ schedules. The question of choosing which products to showcase arises at different times for different types of decisions, such as deciding which products to purchase across the chain of stores, which to ship to various locations from distribution centers, which products to discount, which to promote via e-mail, and which to show to customers when they visit stores or e-commerce sites.
All these questions, in essence, require an understanding of what people like and dislike. Some existing systems do provide these insights, and might indicate, for instance, that blue shirts are trending while red shorts have stopped selling. But how do we convert these insights into action?
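As a rough illustration of the kind of insight such systems produce, the Python sketch below labels a product as trending or stalling by comparing its recent average weekly sales with the preceding weeks. The sales figures, the two-week window, and the thresholds are hypothetical choices made for this example only.

```python
def trend_ratio(weekly_sales, window=2):
    """Ratio of average sales in the most recent `window` weeks
    to the average in the `window` weeks before that."""
    if len(weekly_sales) < 2 * window:
        raise ValueError("need at least 2 * window weeks of sales data")
    recent = weekly_sales[-window:]
    earlier = weekly_sales[-2 * window:-window]
    recent_avg = sum(recent) / len(recent)
    earlier_avg = sum(earlier) / len(earlier)
    if earlier_avg == 0:
        return float("inf") if recent_avg > 0 else 1.0
    return recent_avg / earlier_avg


def label_trend(weekly_sales, up=1.25, down=0.5):
    """Turn the ratio into a coarse label using assumed thresholds."""
    ratio = trend_ratio(weekly_sales)
    if ratio >= up:
        return "trending"
    if ratio <= down:
        return "stalling"
    return "steady"


# Hypothetical weekly unit sales, oldest week first.
sales = {
    "blue shirt": [20, 22, 30, 34],
    "red shorts": [25, 23, 6, 2],
}

labels = {product: label_trend(history) for product, history in sales.items()}
print(labels)
```

A label like this is only an insight. It says nothing yet about how much to buy, ship, or discount, which is the gap the rest of this article addresses.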
Data-Driven Decision Making
Conceptually, data-driven decision making requires connecting decision variables and options to data, and then solving an optimization problem with varying objectives. Operationally, this requires building a data-processing system, possibly extremely large-scale and operating in real time, with three high-level components: interfaces, infrastructure, and algorithms.
Interfaces. These provide ways to deliver information to end customers, as well as sensors to collect information. For example, Web-based (browser) interfaces or mobile applications allow the collection of information about online customer activities. Similarly, such interfaces can help a decision maker in a retail organization interact with data and insights, as well as obtain decision support. The standardization of such interfaces has allowed for massive innovation in this domain over the past decade.
Infrastructure. The role of infrastructure is to provide a means for seamlessly storing and processing massive amounts of data. The need for such infrastructure arose naturally in the late 1990s as the Internet era picked up steam. It’s no surprise that Web-search companies have pioneered basic innovations. Interestingly, Web search, a seemingly simple feature, has led to the development of a generic scalable storage and computation infrastructure. That, in turn, has been the primary reason for recent exciting innovations in scalable computation and data processing.
Algorithms. Data-processing algorithms transform the raw data collected into valuable insights and decisions. Appropriate models are used to connect that data to decision variables. For example, when raw data is generated by people, it may make sense to use a behavioral model to connect that observed data to decision variables. The resulting algorithms run on the computation and storage infrastructure, operate on the data obtained through the interface, and produce end results that can be delivered to the end user through the interface.
Yet a major challenge is enabling the development of data-processing algorithms for everyone. Although standardized interfaces and a generic computation and storage architecture are now available, we are far from having a generic algorithmic architecture for data processing.
Let’s revisit the retail example above. Specifically, consider the decision task of which products to show to customers when they are visiting the e-commerce site—that is, how do we personalize each customer’s experience? Naturally, this depends on data about the specific customer, as well as the data collected about others.
That data is collected through a customer’s browsing history and clicks on the e-commerce website, past purchases, and other online activity gleaned through our Web and mobile interfaces. It is likely stored in a storage infrastructure. It is transformed into real-time, personalized decisions via potentially sophisticated data-processing algorithms that use behavioral models from the social sciences, along with methods from mathematical statistics and machine learning. The data-processing algorithms use the computation infrastructure to be able to make such decisions in real time. In this way, personalized decisions are delivered to the customer through the interface.
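The algorithmic step in this pipeline can be sketched in a few lines of Python. The example is deliberately simplistic and rests on assumptions: it replaces the behavioral models and statistical methods mentioned above with a plain co-occurrence count ("customers who bought this also bought that"), and the customer names and purchase histories are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(purchase_histories):
    """For every item, count how often each other item appears
    in the same customer's purchase history."""
    cooccurrence = defaultdict(Counter)
    for items in purchase_histories.values():
        for a, b in permutations(set(items), 2):
            cooccurrence[a][b] += 1
    return cooccurrence


def recommend(customer, purchase_histories, cooccurrence, top_n=2):
    """Score items the customer does not own by how often they co-occur
    with items the customer does own, then return the top `top_n`."""
    owned = set(purchase_histories[customer])
    scores = Counter()
    for item in owned:
        for other, count in cooccurrence[item].items():
            if other not in owned:
                scores[other] += count
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase histories gathered through the interfaces.
histories = {
    "alice": ["blue shirt", "jeans"],
    "bob": ["blue shirt", "jeans", "belt"],
    "carol": ["blue shirt", "belt", "sneakers"],
    "dave": ["blue shirt"],
}

cooccurrence = build_cooccurrence(histories)
recs_dave = recommend("dave", histories, cooccurrence)
recs_alice = recommend("alice", histories, cooccurrence)
print(recs_dave, recs_alice)
```

Note how the recommendation for one customer depends on data collected about the others, which is the point made above. A production system would also need to update these counts continuously and serve results in real time.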
Key to building this type of personalization or recommendation system is having access to a skilled team of data scientists and statisticians who can identify appropriate statistical methods and behavioral models to develop data-processing algorithms. They can then design human-friendly interfaces that can collect useful data and subsequently deliver decisions. While this is an expensive undertaking, some of the largest retailers have already taken this route.
On the other hand, the personalization/recommendation system has specific functions that take a very similar form across organizations. That similarity has allowed the development of generic recommendation systems. Therefore, many retailers end up purchasing such systems from outside vendors who simply plug in the personalization system through the interfaces.
Closing the Loop
As discussed previously, one major challenge is going from data to decisions. We already have a lot of data, and we have good infrastructure to store and process it, but we still need to figure out which algorithms should turn it into decisions. The discussion of the personalization/recommendation system illustrates the two approaches that we can use simultaneously to address this challenge.
First, we must enable organizations to build their team of skilled data scientists. Second, we should develop a generic data-processing algorithmic architecture. Specifically, this data-processing architecture needs to focus on developing a generic prediction system. That’s because a decision-making system basically has two components: predicting the unknowns and using the predictions to perform optimization. Over the past few decades, significant progress has been made to develop the theory and practice of optimization. However, we still can’t define what the generic and universal prediction problem is.
IDSS, SDSC, and ‘Data Science’
MIT launched the IDSS to address societal questions emerging over the next century. While many of these issues involve multiple disciplines, they are all connected through one common challenge: data-driven decisions. To develop an education program and enable research in data science and statistics at the IDSS, MIT created the SDSC under the IDSS umbrella.
We will help address the challenge of transforming data into decisions by enabling the two approaches that I have described through both the SDSC and the IDSS. Specifically, the SDSC will educate sophisticated data scientists and statisticians through interdisciplinary educational programs. The IDSS will provide an interdisciplinary research environment that will allow its members to undertake ambitious research programs in statistics and data science.
Meanwhile, our new six-week online course, “Data Science: Data to Insights,” which begins October 4, will share the latest information about ways to apply data science techniques to more effectively address your organization’s many challenges. To learn more about how to create your company’s data-analysis future, please visit the course registration page.
Acknowledgements: The author thanks Munther Dahleh and Philippe Rigollet for providing feedback on an earlier version of this article, and Stefanie Koperniak and Myriam Joseph for proofreading and editing it.
Devavrat Shah, co-director of the Data Science: Data to Insights course, is a professor in MIT’s Department of Electrical Engineering and Computer Science, director of the SDSC, and a core faculty member at the IDSS. He is also a member of MIT’s Laboratory for Information and Decision Systems (LIDS) and the Operations Research Center (ORC).