Analytics & Big Data

Session 8: Big data I

Prof. Dr. Gerit Wagner

(2026-05-04)

Explain the characteristics of big data and their implications for analytics workflows.
Assess key challenges of big data, including scalability, heterogeneity, and data quality.
Outline distributed data architectures and their role in large-scale analytics.

TODO: Explain how big data environments reshape model development, validation, and deployment. This creates continuity with: Generalization problem, Overfitting, Model evaluation

TODO

Limitations of the Data Warehouse Approach

Traditional DWH solutions are designed to provide a single point of truth. Important aspects are:

Merge and unify data from multiple data sources
High data quality
Proper historization of the data
Data Governance and Compliance

This results in the following problem areas in today’s world:

Lack of flexibility due to high effort for changes
Time expenditure due to transfer from the operational sources and the aggregations
Past orientation of data (snapshot of the past); there is an increasing need for ad hoc and real-time analyses
Partial knowledge, as only structured data is stored, with increasing need for social media data, etc.
Patchwork: due to gradual introduction of data warehouses, many isolated solutions exist and are operated separately both technically and methodologically

What is Big Data?

The basic idea behind the phrase ‘Big Data’ is that everything we do is increasingly leaving a digital trace (or data), which we (and others) can use and analyze. > – Bernard Marr

Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers. > – Forrester

Big Data is not about the size of the data, it’s about the value within the data. > – David Wellman

4V’s of Big Data

Volume (Scale)

The Model of Generating and Consuming Data has Changed

Collecting Data

Types of Data People are Creating (I)

Activity Data

Simple activities like listening to music or reading a book are now generating data. Digital music players and eBooks collect data on our activities. Your smartphone collects data on how you use it and your web browser collects information on what you are searching for. Your credit card company collects data on where you shop and the shops collect data on what you buy. It is hard to imagine any activity that does not generate data.

Conversation Data

Our conversations are now digitally recorded. It all started with emails, but nowadays most of our conversations leave a digital trail. Consider all the conversations we have on social media sites like Facebook or Twitter. Even many of our phone conversations are now digitally recorded.

Photo and Video Image Data

Think about all the pictures we take on our smartphones or digital cameras. We upload and share hundreds of thousands of them on social media sites every second. An increasing number of cameras record video images, and we upload hundreds of hours of video to YouTube and other platforms every minute.

Types of Data People are Creating (II)

Sensor Data

We are increasingly surrounded by sensors that collect and share data. Take your smart phone, it contains a global positioning sensor to track exactly where you are every second of the day, it includes an accelerometer to track the speed and direction at which you are travelling. We now have sensors in many devices and products.

The Internet of Things Data

We now have smart TVs that are able to collect and process data, we have smart watches, smart fridges, and smart alarms. The Internet of Things, or Internet of Everything connects these devices so that e.g. the traffic sensors on the road send data to your alarm clock which will wake you up earlier than planned because the blocked road means you have to leave earlier to make your 9am meeting.

Variety (Complexity)

A Single View to the Customer

Velocity (Speed)

Data is being generated fast and needs to be processed fast.

Late decisions result in missing opportunities

Examples

E-Promotions: Based on your current location, your purchase history, and your interests → send promotions immediately for the store next to you.
Healthcare monitoring: Sensors track your activities and health → any abnormal measurements require immediate reaction.
Production and Logistics: With real-time POS data, Langnese can adjust production and delivery instantly → saving millions of euros monthly.

Real-Time Analytics

Veracity (Uncertainty)

Organizations must now analyze both structured and unstructured data that is uncertain and imprecise.

In many cases, it is not known whether the data is correct (e.g. fake news) or representative (e.g. biased expressions of opinion in forums).

It may be prudent to assign a Data Veracity score and ranking for specific data sets to avoid making decisions based on analysis of uncertain and imprecise data.

How is Big Data actually used?

Better understand and target customers Companies expand their traditional data with social media data, browser, text analytics or sensor data to get a more complete picture of their customers.
Understand and Optimize Business Processes Retailers are able to optimize their stock based on predictive models generated from social media data, web search trends, weather forecasts…
Improving Health Use the data from smart watches, wearable devices, Google Trends, health research, electronic medical record to diagnose disease, predict epidemics, …
Improving Security and Law Enforcement Use big data analytics to predict criminal activity, foil terrorist plots, detect cyber attacks, and detect fraudulent credit card transactions.
Improving and Optimizing Cities and Countries (Smart Cities) Optimize traffic flows based on real time traffic information, social media and weather data. A bus would wait for a delayed train and traffic signals predict traffic volumes.

Turning Big Data into Value

The Four Layers of Big Data (I)

Data Source Layer

This is where the data arrives at the organization. It includes everything from sales records, customer database, feedback, social media channels, marketing list, email archives etc.

Identify and Prioritize Data Sources

The Four Layers of Big Data (II)

The Four Layers of Big Data (III)

The Four Layers of Big Data (IV)

From Data Warehouse to Data Lake

Instead of recording millions of transactions, today’s organizations are recording billions of interactions. Companies are capturing more and more data that can open business opportunities and unlock new sources of value for organizations.

Companies are not able to store this data in data warehouses because it is of high volume, mostly raw and often not structured. As consequence, data lakes have emerged as an alternative approach. The intent is to capture enterprise data and load it in its raw form into a centralized, large, and inexpensive storage system.

In shifting from data warehouses to data lakes, it became important to decouple data movement from data transformation. Data movement (the “E” and “L” of ETL) is an operational task. Data transformation (the “T” of ETL) is a content-based, analytic-facing task that requires an understanding both of the data and how it’s to be used.

A clean separation between data movement and data transformation has the benefits of less friction because the instance loading the data isn’t responsible for transforming it.

The Data Lake

A data lake is a method of storing data within a system in its natural format, that facilitates the collocation of data in various schemata and structural forms. The idea of data lake is to have a single store of all data in the enterprise ranging from raw data to transformed data which is used for various tasks including reporting, visualization, analytics and machine learning.

Modern Data Analytics Architecture

Logical Data Warehouse

A “logical data warehouse” provides analytical company data without first physically moving it to a physical data warehouse.

As in a classic data warehouse, uniform views are provided for analysis purposes.

While the data in the classic data warehouse comes from a “well-defined” physically uniform database, the “logical data warehouse” pulls data together from the data lake at the time of the query.

Aggregation is done just in time. Thus, the schema of the data warehouse is just virtual.

Survey: Session 8

https://forms.gle/hNuAD8UZvC8KYo4a8

References

Schmarzo, B. (2016). Big data MBA. Wiley. https://doi.org/10.1002/9781119238881