
Defining Big Data

Published February 6, 2017
Doug Rose
Author | Agility | Artificial Intelligence | Data Ethics

Big data is a term that describes an immense volume of diverse data, typically analyzed to identify patterns, trends, and associations. However, the term “big data” didn’t start out that way. In 1997, NASA researchers Michael Cox and David Ellsworth described a “big data problem” they were struggling with. Their supercomputers were performing simulations of airflow around aircraft and generating massive volumes of data that couldn’t be processed or visualized effectively. The data were pushing the limits of their computer storage and processing, which was a problem, a big problem. In this context, the term “big data problem” described a big problem involving data rather than a problem involving big data; NASA was facing a big problem with data, not so much a big-data problem.

In 2011, nearly a decade and a half later, a McKinsey report entitled “Big data: The next frontier for innovation, competition, and productivity” reinforced the use of the term “big data” in the context of a problem that “leaders in every sector will have to grapple with.” The authors refer to big data as data that exceeds the capability of commonly used hardware and software.

Over time, the definition of big data has taken on a life of its own, expanding beyond the context of a problem to include the potential value of that data as well. Now, big data poses both big problems and big opportunities.

What Is Big Data?

Many organizations that start big-data projects don’t actually have big data. They may have a lot of data, but volume is just one criterion. These organizations may also mistakenly think that they have a big-data problem because of the challenges they face in capturing, storing, and processing their data. However, data doesn’t constitute big data unless it meets the following criteria (also known as the four V’s):

  • Volume: In the world of big data, volume is no longer measured in megabytes and gigabytes but in terabytes, petabytes, exabytes, zettabytes, and yottabytes. Imagine the volume of data generated around the world every day by billions of smartphone users. Add to that Internet data and machine-generated data from the growing number of devices that comprise the Internet of Things (IoT), along with data from numerous other sources.

  • Variety: While an organization’s internal data may be highly structured and predictable, external data is very diverse. Organizations may need a variety of data to perform relevant analytics, including web browsing history, social media posts, weather data, traffic data, data from wearable wireless health monitors, tweets, videos, geolocation data from smartphones, and much more. Figuring out how to procure, store, extract, combine, and analyze diverse data is a huge challenge.
  • Velocity: The speed at which data is generated and flows through computer systems throughout the world is unprecedented. Organizations now have the ability to capture streaming data and analyze it in near real time to achieve and maintain a competitive edge (see the sketch after this list). You have probably experienced this yourself: while you search and browse the web, organizations are capturing your data, analyzing it, and displaying ads targeted to your perceived interests.
  • Veracity: Veracity is a measure of the uncertainty of the data. Because so much of big data comes from outside an organization and may be biased or generated intentionally to promote a certain agenda, big data carries a degree of uncertainty. Organizations must constantly evaluate the veracity of the data they obtain from external sources, so they’re not making decisions based on false or misleading information.
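
To make velocity concrete, here is a minimal Python sketch of near-real-time stream processing: events arrive one at a time, and a rolling average is updated the moment each event lands, rather than in a batch job run later. The event generator and its field names are hypothetical stand-ins; a real pipeline would read from a streaming platform such as Apache Kafka rather than an in-memory generator.

    import random
    from collections import deque

    def event_stream(n_events=20):
        # Simulate a high-velocity stream of page-view events.
        # (Hypothetical stand-in for a real feed such as a Kafka topic.)
        for _ in range(n_events):
            yield {"user": random.randint(1, 5), "ms_on_page": random.randint(50, 5000)}

    window = deque(maxlen=10)  # keep only the 10 most recent readings

    for event in event_stream():
        window.append(event["ms_on_page"])
        # Near-real-time analysis: update the rolling average on every arrival
        # instead of storing the data and analyzing it after the fact.
        rolling_avg = sum(window) / len(window)
        print(f"user {event['user']}: rolling avg time-on-page = {rolling_avg:.0f} ms")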

A Real Big Data Problem

An interesting example of a big data problem is the challenge surrounding self-driving cars. To enable a self-driving car to safely navigate from point A to point B without running over pedestrians or crashing into objects, you would need to collect, process, and analyze a heavy stream of diverse data, including audio, video, traffic reports, GPS location data, and more, all flowing into the database in real time and at a high velocity. You would also need to evaluate which data is most reliable; for example, the historical data showing that the left lane is open to traffic, or the live video of a sign telling drivers to merge right. Is that person standing at the corner going to dart out in front of the car or wait for the Walk signal? Whether the driver is human or the car is navigated by big data, a split-second decision is often required to prevent a serious accident. A driverless car would have to instantly process the video, audio, and traffic coordinates, and then “decide” what to do. That’s a big data problem.
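
At the heart of that split-second decision is weighing conflicting sources by how much each one can be trusted, which is the veracity question in action. The Python sketch below is a toy illustration of that idea using invented names and numbers, not how any actual autonomous-driving stack works: each signal carries a confidence score, fresher readings are discounted less, and the weighted majority decides whether the left lane is really open.

    from dataclasses import dataclass

    @dataclass
    class Signal:
        source: str        # where the reading came from
        lane_open: bool    # does this source say the left lane is open?
        confidence: float  # 0.0 to 1.0: how much we trust the source
        age_s: float       # seconds since the reading was taken

    def left_lane_open(signals):
        # Weigh each source by confidence, discounted by staleness,
        # and let the weighted majority decide.
        score = 0.0
        for s in signals:
            weight = s.confidence / (1.0 + s.age_s)  # fresher data counts more
            score += weight if s.lane_open else -weight
        return score > 0

    signals = [
        Signal("historical map data", lane_open=True, confidence=0.6, age_s=86_400.0),
        Signal("live camera: merge-right sign", lane_open=False, confidence=0.9, age_s=0.1),
    ]
    print(left_lane_open(signals))  # False: the fresh camera reading wins

Once it is discounted for staleness, a day-old map entry contributes almost nothing, so the live camera reading dominates, which matches the intuition that the car should obey the sign it can see right now.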

Solving Big Data Problems

Technology is evolving to solve most big data problems, and the cloud is playing a key role in this process. The cloud offers virtually unlimited storage and compute, so organizations no longer need to bump up against limitations in their on-premises data warehouses. In addition, business intelligence (BI) software is becoming increasingly sophisticated, enabling organizations to extract value from data without requiring a high level of technical expertise from users.

Still, many organizations struggle with data problems, both big and small. Some still run into storage and compute limitations simply because they are reluctant to move their on-premises data warehouses to the cloud. However, most organizations that struggle with data simply don’t know what to do with the data they have and the vast amounts of diverse data that are now readily available. Their problem is that they haven’t developed the culture of curiosity and innovation required to put all that available data to good use. In many ways, this shortcoming poses the real big data problem.
