Big Data - 'small' study guide
1. A Big Data 'Story':
Big Data Analytics: Data To Big Money by Frank Ohlhorst
Big Data For Dummies
A few weeks before the H1N1 virus made headlines, engineers at the Internet giant Google published a remarkable paper in the scientific journal Nature. It created a splash among health officials and computer scientists but was otherwise overlooked. The authors explained how Google could “predict” the spread of the winter flu in the United States, not just nationally, but down to specific regions and even states. The company could achieve this by looking at what people were searching for on the Internet. Since Google receives more than three billion search queries every day and saves them all, it had plenty of data to work with.
Google took the 50 million most common search terms that Americans type and compared the list with CDC data on the spread of seasonal flu between 2003 and 2008. The idea was to identify areas infected by the flu virus by what people searched for on the Internet. Others had tried to do this with Internet search terms, but no one else had as much data, processing power, and statistical know-how as Google.
While the Googlers guessed that the searches might be aimed at getting flu information—typing phrases like “medicine for cough and fever”—that wasn’t the point: they didn’t know, and they designed a system that didn’t care. All their system did was look for correlations between the frequency of certain search queries and the spread of the flu over time and space. In total, they processed a staggering 450 million different mathematical models in order to test the search terms, comparing their predictions against actual flu cases from the CDC in 2007 and 2008. And they struck gold: their software found a combination of 45 search terms that, when used together in a mathematical model, had a strong correlation between their prediction and the official figures nationwide. Like the CDC, they could tell where the flu had spread, but unlike the CDC they could tell it in near real time, not a week or two after the fact.
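The term-screening idea described above can be sketched in a few lines: score each candidate search term by how strongly its frequency over time correlates with official flu counts, and keep the top scorers. This is a purely illustrative sketch with made-up numbers, not Google's actual system or data.

```python
# Illustrative sketch (not Google's real pipeline): rank search terms by the
# Pearson correlation between their weekly frequency and official flu counts.
# All series below are hypothetical numbers invented for demonstration.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Weekly counts of reported flu cases (hypothetical CDC-style series).
flu_cases = [120, 150, 210, 340, 500, 480, 390, 260]

# Weekly frequencies of two candidate search terms (hypothetical).
search_terms = {
    "medicine for cough and fever": [40, 55, 80, 130, 190, 180, 150, 100],
    "cheap flights":                [90, 85, 88, 92, 87, 90, 86, 91],
}

# Rank terms by correlation with the flu series; high scorers would be
# candidates for inclusion in a predictive model.
ranked = sorted(search_terms.items(),
                key=lambda kv: pearson(kv[1], flu_cases),
                reverse=True)
for term, freqs in ranked:
    print(f"{term}: r = {pearson(freqs, flu_cases):.2f}")
```

Note the point made in the text: the method never asks *why* a term correlates with the flu; it only measures that it does.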
Thus when the H1N1 crisis struck in 2009, Google’s system proved to be a more useful and timely indicator than government statistics with their natural reporting lags. Public health officials were armed with valuable information.
Strikingly, Google’s method does not involve distributing mouth swabs or contacting physicians’ offices. Instead, it is built on “big data”—the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value. With it, by the time the next pandemic comes around, the world will have a better tool at its disposal to predict and thus prevent its spread.
There is no rigorous definition of big data. Initially the idea was that the volume of information had grown so large that the quantity being examined no longer fit into the memory that computers use for processing, so engineers needed to revamp the tools they used for analyzing it all. That is the origin of new processing technologies like Google’s MapReduce and its open-source equivalent, Hadoop, which came out of Yahoo. These let one manage far larger quantities of data than before, and the data—importantly—need not be placed in tidy rows or classic database tables. Other data-crunching technologies that dispense with the rigid hierarchies and homogeneity of yore are also on the horizon. At the same time, because Internet companies could collect vast troves of data and had a burning financial incentive to make sense of them, they became the leading users of the latest processing technologies, superseding offline companies that had, in some cases, decades more experience.
2. Then what is 'Big Data'?
Big Data is often described as extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools.
Searching the Web for clues reveals an almost universal definition, shared by the majority of those promoting the ideology of Big Data, that can be condensed into something like this: Big Data defines a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or the scale and growth of the data set. In other words, the data set has grown so large that it is difficult to manage and even harder to garner value out of it.
The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data.
There is much more to be said about what Big Data actually is.
The concept has evolved to include not only the size of the data set but also the processes involved in leveraging the data.
Big Data has even become synonymous with other business concepts, such as business intelligence, analytics, and data mining.
Paradoxically, Big Data is not that new. Although most of the world's massive data sets were created in just the last few years, Big Data has its roots in the scientific and medical communities, where complex analysis of massive amounts of data has long been done for drug development, physics modeling, and other forms of research involving large data sets. Yet it is these very roots that have shaped what Big Data has come to be.
Note: Knowing what Big Data is and knowing its value are two different things.
For small and medium businesses (SMBs), Big Data analytics can deliver value for multiple business segments.
Big Data, whether handled in-house or on a hosted offering, provides value to businesses of any size.
3. Sources for Big Data:
- Structure of the data (structured, unstructured, semi-structured, table-based, proprietary)
- Source of the data (internal, external, private, public)
- Value of the data (generic, unique, specialized)
- Quality of the data (verified, static, streaming)
- Storage of the data (remotely accessed, shared, dedicated platforms, portable)
- Relationship of the data (superset, subset, correlated)
Many industries fall under the umbrella of new data creation and digitization of existing data, and most are becoming appropriate sources for Big Data resources. Those industries include the following:
- Transportation, logistics, retail, utilities, and telecommunications. Sensor data are being generated at an accelerating rate from fleet GPS transceivers, RFID (radio-frequency identification) tag readers, smart meters, and cell phones (call data records); these data are used to optimize operations and drive operational BI to realize immediate business opportunities.
- Health care. The health care industry is quickly moving to electronic medical records and images, which it wants to use for short-term public health monitoring and long-term epidemiological research programs.
- Government. Many government agencies are digitizing public records, such as census information, energy usage, budgets, Freedom of Information Act documents, electoral data, and law enforcement reporting.
- Entertainment media. The entertainment industry has moved to digital recording, production, and delivery in the past five years and is now collecting large amounts of rich content and user viewing behaviors.
- Life sciences. Low-cost gene sequencing (less than $1,000) can generate tens of terabytes of information that must be analyzed to look for genetic variations and potential treatment effectiveness.
- Video surveillance. Video surveillance is still transitioning from closed-circuit television (CCTV) to Internet protocol (IP) cameras and recording systems that organizations want to analyze for behavioral patterns (security and service enhancement).
4. Google Refine:
Google Refine is a power tool for working with messy data: cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.
5. Back to Basics
a) Structured Data: The term structured data generally refers to data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings (for example, a customer’s name, address, and so on). Most experts agree that this kind of data accounts for about 20 percent of the data that is out there. Structured data is the data that you’re probably used to dealing with. It’s usually stored in a database. You can query it using a language like structured query language (SQL), which we discuss later in the “Defining Unstructured Data” section.
Your company may already be collecting structured data from “traditional” sources. These might include your customer relationship management (CRM) data, operational enterprise resource planning (ERP) data, and financial data.
Often these data elements are integrated in a data warehouse for analysis.
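Because structured data has a defined length and format, it can be stored in relational tables and queried with SQL, as the text notes. Here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are illustrative, not taken from any real system.

```python
# Minimal sketch of structured data: fixed-format records stored in a
# relational table and retrieved with SQL. Uses Python's built-in sqlite3;
# the "customers" table and its columns are hypothetical examples.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,   -- a fixed-format string field
        city    TEXT NOT NULL,
        balance REAL NOT NULL    -- a numeric field
    )
""")
conn.executemany(
    "INSERT INTO customers (name, city, balance) VALUES (?, ?, ?)",
    [("Ada Lovelace", "London", 1200.50),
     ("Alan Turing", "Manchester", 800.00),
     ("Grace Hopper", "New York", 2300.75)],
)

# Because the data conforms to a schema, SQL can filter and sort it directly.
rows = conn.execute(
    "SELECT name, balance FROM customers WHERE balance > ? ORDER BY balance DESC",
    (1000,),
).fetchall()
for name, balance in rows:
    print(f"{name}: {balance:.2f}")
```

The key contrast with the unstructured data discussed next is that every record here fits the schema, so generic query tools work without any interpretation of the content.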
b) Unstructured Data: Unstructured data is data that does not follow a specified format. If 20 percent of the data available to enterprises is structured data, the other 80 percent is unstructured. Unstructured data is really most of the data that you will encounter. Until recently, however, the technology didn’t really support doing much with it except storing it or analyzing it manually.
Here are some examples of machine-generated unstructured data:
✓ Satellite images: This includes weather data or the data that the government captures in its satellite surveillance imagery. Just think about Google Earth, and you get the picture (pun intended).
✓ Scientific data: This includes seismic imagery, atmospheric data, and high-energy physics.
✓ Photographs and video: This includes security, surveillance, and traffic video.
✓ Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic profiles.
The following list shows a few examples of human-generated unstructured data:
✓ Text internal to your company: Think of all the text within documents, logs, survey results, and e-mails. Enterprise information actually represents a large percentage of the text information in the world today.
✓ Social media data: This data is generated from the social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.
✓ Mobile data: This includes data such as text messages and location information.
✓ Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.
Hypervisor: A hypervisor is the technology responsible for ensuring that resource sharing takes place in an orderly and repeatable way. It is the traffic cop that allows multiple operating systems to share a single host. It creates and runs virtual machines. The hypervisor sits at the lowest levels of the hardware environment and uses a thin layer of code (often called a fabric) to enable dynamic resource sharing.
✓ Hadoop Distributed File System: (http://hadoop.apache.org/) A reliable, high-bandwidth, low-cost data storage cluster that facilitates the management of related files across machines.
✓ MapReduce engine: A high-performance parallel/distributed data-processing implementation of the MapReduce algorithm.
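The MapReduce pattern itself is simple enough to sketch in plain Python: a map phase emits key-value pairs, the framework groups values by key (the "shuffle"), and a reduce phase combines each group. This toy single-process word count only illustrates the flow; it is not Hadoop's API, and a real engine runs the phases in parallel across a cluster.

```python
# Toy single-process sketch of the MapReduce pattern (word count), showing
# the map -> shuffle/group -> reduce flow that engines like Hadoop execute
# in parallel across many machines. Plain Python, for illustration only.
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key into a final result."""
    return (key, sum(values))

documents = ["big data big money", "big data tools"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)  # word -> total occurrences across all documents
```

Because each map call touches only one record and each reduce call only one key's group, the framework can scatter both phases across the cluster, which is what makes the model suit the very large data sets this guide describes.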