The Big Data Conundrum: How to Define It?

Big Data is revolutionizing 21st-century business without anybody knowing what it actually means. Now computer scientists have come up with a definition they hope everyone can agree on.

Emerging Technology from the arXiv

October 3, 2013

One of the biggest new ideas in computing is “big data.” There is unanimous agreement that big data is revolutionizing commerce in the 21st century. When it comes to business, big data offers unprecedented insight, improved decision-making, and untapped sources of profit.

And yet ask a chief technology officer to define big data and he or she will stare at the floor. Chances are, you will get as many definitions as the number of people you ask. And that’s a problem for anyone attempting to buy or sell or use big data services: what exactly is on offer?

Today, Jonathan Stuart Ward and Adam Barker at the University of St Andrews in Scotland take the issue in hand. These guys survey the various definitions offered by the world’s biggest and most influential high-tech organizations. They then attempt to distill from all this noise a definition that everyone can agree on.

Ward and Barker cast their net far and wide, but the results are mixed. Formal definitions are hard to come by, with many organizations preferring to give anecdotal examples.

In particular, the notion of “big” is tricky to pin down, not least because a data set that seems large today will almost certainly seem small in the not-too-distant future. Where one organization gives hard figures for what constitutes “big,” another gives a relative definition, implying that big data will always be more than conventional techniques can handle.

Some organizations point out that large data sets are not always complex and small data sets are not always simple. Their point is that the complexity of a data set is an important factor in deciding whether it is “big.”

Here is a summary of the kind of descriptions Ward and Barker discovered from various influential organizations:

1. Gartner. In 2001, a Meta Group (now Gartner) report noted the increasing size of data, the increasing rate at which it is produced, and the increasing range of formats and representations employed. This report predated the term “big data” but proposed a three-fold definition encompassing the “three Vs”: Volume, Velocity, and Variety. This idea has since become popular and sometimes includes a fourth V: veracity, to cover questions of trust and uncertainty.

2. Oracle. Big data is the derivation of value from traditional relational database-driven business decision making, augmented with new sources of unstructured data.

3. Intel. Big data opportunities emerge in organizations generating a median of 300 terabytes of data a week. The most common forms of data analyzed in this way are business transactions stored in relational databases, followed by documents, e-mail, sensor data, blogs, and social media.

4. Microsoft. “Big data is the term increasingly used to describe the process of applying serious computing power—the latest in machine learning and artificial intelligence—to seriously massive and often highly complex sets of information.”

5. The Method for an Integrated Knowledge Environment open-source project. The MIKE project argues that big data is not a function of the size of a data set but of its complexity. Consequently, it is the high degree of permutations and interactions within a data set that defines big data.

6. The National Institute of Standards and Technology. NIST argues that big data is data which “exceed(s) the capacity or capability of current or conventional methods and systems.” In other words, the notion of “big” is relative to the current standard of computation.

A mixed bag if ever there was one.

In addition to the search for definitions, Ward and Barker attempted to better understand the way people use the phrase big data by searching Google Trends to see what words are most commonly associated with it. They say these are: data analytics, Hadoop, NoSQL, Google, IBM, and Oracle.

These guys bravely finish their survey with a definition of their own in which they attempt to bring together these disparate ideas. Here’s their definition:

“Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”

A game attempt at a worthy goal: a definition that everyone can agree on is certainly overdue.

But will this do the trick? Answers please in the comments section below.

Ref: arxiv.org/abs/1309.5821: Undefined By Data: A Survey of Big Data Definitions
