
Analytics & Big Data

Session 1: The rise of analytics

Prof. Dr. Gerit Wagner

(2026-03-23)

Why is modern analytics so successful?

  1. More data
  2. More computing power
  3. New algorithms
  4. New analytics processes

More data for the analysis

The explosion of data





Unit | Equivalent | Approximate meaning
--- | --- | ---
Gigabyte (GB) | 1,000 MB | An HD movie file or a few hundred photos
Terabyte (TB) | 1,000 GB | Storage of a modern laptop or external drive
Petabyte (PB) | 1,000 TB | Data of a large company or several large data centers
Exabyte (EB) | 1,000 PB | Roughly the yearly internet traffic of a small country
Zettabyte (ZB) | 1,000 EB | ≈ 1 trillion gigabytes; global data creation scale
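The 1,000× ladder in the table can be sanity-checked with a few lines of Python (the helper function is illustrative, not part of any library):

```python
# Storage-unit ladder with decimal prefixes: each step is a factor of 1,000.
UNITS = ["GB", "TB", "PB", "EB", "ZB"]

def to_gigabytes(value, unit):
    """Convert a value in the given unit to gigabytes."""
    return value * 1000 ** UNITS.index(unit)

print(to_gigabytes(1, "TB"))  # 1,000 GB
print(to_gigabytes(1, "ZB"))  # 1,000,000,000,000 GB — one trillion, as in the table
```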

Data production

Enterprise and transactional data
Enterprise systems such as ERP, CRM, and supply chain platforms generate large volumes of structured data through everyday business transactions.

eCommerce
Every search, click, purchase, and review creates behavioral and transactional data used for recommendations and personalized marketing.

Social media and user-generated content
Platforms such as TikTok, YouTube, and Instagram generate enormous data volumes through uploads, interactions, and live streaming.

IoT and smart devices
Connected devices—from wearables to industrial sensors—continuously produce real-time data across interconnected systems.

Digital transactions
Online banking, mobile payments, and blockchain systems generate detailed financial records for transactions and security monitoring.

AI-generated data
Machine learning and generative AI create large datasets during training and operation, further accelerating global data growth.

Computing power

Growth of computing power

The rapid acceleration of computing power—driven by advances in hardware, cloud infrastructure, and parallel processing—has enabled modern analytics and machine learning to scale to massive datasets.


Evolution of computing power

The growth of modern analytics is enabled by changing strategies for increasing computing power.


Era | Main strategy | Explanation
--- | --- | ---
1970s–2000s | Moore’s Law & miniaturization | Smaller transistors → more components per chip → faster processors
2005–today | Parallel computing | Performance increases by using multiple processors simultaneously
2010s–today | Specialized hardware | Chips optimized for specific workloads (GPUs, TPUs, AI accelerators)
Emerging | New computing paradigms | Alternative computing models such as quantum computing
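The parallel-computing strategy can be sketched with Python's standard library. This toy example (function names are illustrative) splits a sum of squares across several workers; it uses threads to keep the sketch simple, although CPU-bound Python code needs processes or specialized hardware for real speedups:

```python
# Toy illustration of parallelism: split a task into chunks, process them with
# several workers, and combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(n, workers=4):
    # Interleaved chunks: worker i handles i, i+workers, i+2*workers, ...
    chunks = [range(i, n, workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum_of_squares(10))  # same result as the sequential sum
```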

New algorithms

Algorithms

An algorithm is a step-by-step procedure for performing a computation and thereby solving a problem.

Algorithms determine how efficiently computers can process data and solve tasks.

Examples:

  • Linux Scheduling Algorithms (1990s–present) — Efficient process scheduling enabling operating systems to run tasks concurrently
  • RSA Encryption (1977) — Public-key cryptography algorithm enabling secure internet communication
  • PageRank (1998) — Algorithm ranking webpages based on link structure, enabling scalable web search
  • Blockchain (2008) — Distributed consensus algorithm enabling decentralized digital ledgers and cryptocurrencies
  • Deep Learning (2010s) — Improved neural network training enabling major advances in vision, speech, and language AI
  • Gradient Boosting (2014) — High-performance ensemble learning algorithm widely used in predictive analytics and data science
  • GPT / Transformer Models (2017–present) — Transformer architecture enabling large language models and generative AI
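As an illustration of how compact an influential algorithm can be, PageRank's core idea (power iteration over the link structure) fits in a few lines; the three-page link graph below is made up:

```python
# Minimal PageRank sketch via power iteration on a tiny hypothetical link graph.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}  # start with a uniform distribution
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            # Each page passes its rank on to the pages it links to.
            for target in outgoing:
                new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical web
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # the most "important" page
```

Pages with many incoming links from important pages end up with high ranks, which is exactly what made web search scalable.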

Learning note

No need to memorize everything.
Be able to give a few illustrative examples.
This also applies to the data production areas.

Algorithms enable more powerful analytical models



(Figure: increasingly flexible model families — traditional regression, decision tree, neural network)

Algorithmic breakthroughs drive AI progress

Recent breakthroughs in artificial intelligence (AI) show how new algorithms can rapidly surpass human performance. DeepMind provides good examples.

AlphaGo (2016)

  • Uses deep neural networks + reinforcement learning
  • Defeated world champion Lee Sedol in the game of Go

Go was long considered too complex for computers due to the enormous search space.

AlphaFold (2020–2022)

  • Uses deep learning to predict protein structures
  • Achieved breakthrough performance in the CASP competition

Predicting protein folding had been a major unsolved problem in biology for decades.

Jagged frontier of AI





AI progress often occurs through algorithmic breakthroughs, enabling machines to outperform humans in increasingly complex tasks.

Recent studies suggest that AI creates a “jagged frontier” (Dell’Acqua et al., 2023): some tasks are well suited to AI, while others that appear similar remain outside its capabilities.

  • Within this frontier, AI can strongly enhance knowledge work, improving productivity and the quality of outputs.
  • Outside the frontier, AI can reduce performance, especially when users rely too heavily on its outputs without verification.

Analytics processes

Maturing analytical capabilities

Descriptive analytics: What happened?

  • Summarizes historical data to understand patterns and trends.
  • Example: Sales reports, dashboards, KPIs

Predictive analytics: What will happen?

  • Uses statistical models and machine learning to forecast future outcomes.
  • Example: Demand forecasting, churn prediction

Prescriptive analytics: What should we do?

  • Recommends actions based on predictions and optimization.
  • Example: Pricing optimization, recommendation systems
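The three purposes can be illustrated on a single toy dataset (all numbers hypothetical, and the naive trend forecast stands in for a proper statistical model):

```python
# Descriptive, predictive, and prescriptive analytics on toy monthly sales data.
from statistics import mean

sales = [420, 450, 470, 500, 530]  # hypothetical units sold, last five months

# Descriptive: what happened?
print("Average monthly sales:", mean(sales))

# Predictive: what will happen? (naive trend: last value + average change)
avg_change = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + avg_change

# Prescriptive: what should we do? (stock the forecast plus a safety buffer)
safety_buffer = 0.10
order_quantity = round(forecast * (1 + safety_buffer))
print("Next month forecast:", forecast)
print("Recommended order quantity:", order_quantity)
```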



Analytical purposes



What happened? (descriptive analytics) | What will happen? (predictive analytics) | What should I do? (prescriptive analytics)
--- | --- | ---
How many widgets did I sell last month? | How many widgets will I sell next month? | Order 5,000 units of Component Z to support widget sales for next month.
What were sales by zip code for Christmas last year? | What will be sales by zip code over this Christmas season? | Hire Y new sales reps by these zip codes to handle projected Christmas sales.
How many of Product X were returned last month? | How many of Product X will be returned next month? | Set aside $125K in financial reserve to cover Product X returns.
What were company revenues and profits for the past quarter? | What are projected company revenues and profits for next quarter? | Sell the following product mix to achieve quarterly revenue and margin goals.
How many employees did I hire last year? | How many employees will I need to hire next year? | Increase hiring pipeline by 35% to achieve hiring goals.



A particular method, such as regression or machine learning, can serve multiple purposes.

Example: Descriptive analytics


Example: Predictive analytics


Example: Prescriptive analytics

Based on more than 300 million data records per week, Otto generates over one billion forecasts annually on the expected sales of individual products in the coming days and weeks. These forecasts are used to optimize inventory decisions, determining how many units of each product should be stocked or reordered across warehouses. By systematically adjusting inventory levels based on these data-driven recommendations, Otto is able to reduce its overall inventories by up to 30% on average while maintaining product availability.

Analytical models

There is a broad repertoire of models and methods from multiple disciplines, each with its own assumptions, data preparation steps, and modeling processes. While these fields often focus on methodological development, business analytics emphasizes applying these approaches to support understanding and decision-making in organizational contexts.


Discipline | Model culture | Typical models
--- | --- | ---
Statistics | Probabilistic inference | Regression, GLM, Bayesian
Econometrics | Causal modeling | IV, panel models
Computer Science | Algorithmic learning | Trees, SVM, NN
Operations Research | Optimization | Linear programming, non-linear optimization
Management Science | Decision modeling | Stochastic optimization
Complex Systems | Simulation | Agent-based / discrete-event simulation

Cross-Industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM is a widely used framework that structures data mining and analytics projects (Wirth & Hipp, 2000).

  • Developed by an industry consortium (e.g., DaimlerChrysler, SPSS, NCR) to create a common methodology for data mining projects.
  • Designed to be independent of specific industries and technologies, making it applicable across many domains.
  • Integrates best practices from real-world projects, helping teams plan, communicate, and document analytics work.
  • Became popular because it provides a reliable and repeatable standard process for managing complex data mining projects.


(Figure: the CRISP-DM cycle — Business Understanding ↔ Data Understanding, Data Preparation ↔ Modeling, Modeling → Evaluation, Evaluation → Deployment, with feedback from Evaluation to Business Understanding; the data sits at the center.)

Structure of the course

We structure the course along the CRISP-DM analytics lifecycle:

  • Introduction (Session 0–1)
    • Overview of analytics and the CRISP-DM workflow.
  • Data Foundations (Session 2–3)
    • Exploration and analytical data architecture.
  • Analytical Models (Session 4–9)
    • Regression, machine learning, and big data analytics.
  • Deployment (Session 10)
    • Analytics in organizations.
  • Synthesis (Session 11)
    • Integration of the full analytics workflow.


(Figure: the CRISP-DM cycle — Business Understanding ↔ Data Understanding, Data Preparation ↔ Modeling, Modeling → Evaluation, Evaluation → Deployment, with feedback from Evaluation to Business Understanding; the data sits at the center.)

Analytics is not a linear pipeline. We constantly move between business problems, data, modeling, and evaluation.






Learning goals

  • Illustrate how the rise of data analytics capabilities is enabled by data availability, advances in computing power, new algorithms, and maturing analytics processes.
  • Distinguish between descriptive, predictive, and prescriptive analytics.
  • Explore the Python and Jupyter analytics ecosystem (exercise).

Exercise

Setup for the practical exercises

Options

  • Microsoft Excel
  • RapidMiner
  • RStudio
  • IBM SPSS Statistics
  • Jupyter Notebooks
  • Tableau
  • Microsoft Power BI

Why this setup?

We use Jupyter Notebooks and Python because

  • it supports advanced analytics, including data analysis, visualization, machine learning, and big data analytics
  • it is a more demanding environment to learn, which helps us build skills that transfer easily to simpler analytics tools and interfaces
  • it is widely used in research and large organizations, making it a valuable and relevant skillset
  • it allows us to combine context, code, explanations, and results in a single document

A Jupyter notebook

Jupyter Notebooks combine context, code, output, and implications in one interactive document.

Cells

  • Markdown cells → text, explanations, documentation
  • Code cells → Python code for analysis

Output

Running a code cell produces results directly below it (text, tables, charts, etc.).

Execution environment

  • Kernel → executes the code (Python in this course)
  • Virtual machine (VM) → the computer where the kernel runs
  • In this course: provided through GitHub Codespaces

The analysis runs on a remote machine with Python and libraries already installed.

Starting GitHub Codespaces


☕ Short Break

Take 5–10 minutes

Stretch, grab a coffee, or chat with others.

We’ll continue shortly

Group split

Each section takes 30 minutes. Work in pairs.




(Figure: session flow — after the split, Group 1 works on the Jupyter notebook while Group 2 reads; after the switch, Group 1 reads and Group 2 works on the notebook; both groups then join the discussion.)

Group 1: Read Competing on Analytics

Read the Competing on Analytics paper by Davenport (2006). Prepare to discuss the following questions:

  1. What does it mean for a company to “compete on analytics”?
    How is this different from simply using data or reports in decision making?

  2. What organizational capabilities are required to compete on analytics?
    Consider aspects such as leadership, culture, people, and technology.

  3. Which companies or industries today seem to compete on analytics?
    Give examples and explain how analytics creates their competitive advantage.

Group 2: Developing an analytical notebook

Find a dataset on https://www.kaggle.com/datasets and a corresponding business problem you could address with the data.

Create a notebook report.ipynb and draft an analysis structure based on CRISP-DM:

  • Add Markdown sections describing what you would do in each phase
  • Add Python code sections for the analyses you intend to run
  • Indicate which parts of the analysis are descriptive, predictive, or prescriptive

Then:

  • Download a suitable dataset (CSV)
  • Import it into your notebook
  • Start exploring the data and implement the analyses as far as you get

The goal is not to complete the analysis, but to begin working with the data.
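A possible first code cell for report.ipynb might look like the sketch below; the inline sample and its column names are placeholders for whatever Kaggle dataset you choose — with a real download you would pass the CSV's file path to pd.read_csv instead:

```python
# First exploration steps for report.ipynb (descriptive analytics).
import io
import pandas as pd

sample_csv = """order_id,region,units,revenue
1,North,3,29.97
2,South,1,9.99
3,North,5,49.95
"""  # hypothetical stand-in; replace with pd.read_csv("your_dataset.csv")

df = pd.read_csv(io.StringIO(sample_csv))

print(df.shape)         # number of rows and columns
print(df.dtypes)        # column types
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column
```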

Jupyter Notebook

Modes

  • Edit mode: Enter
  • Command mode: Esc

Navigating cells

  • Move between cells: ↑ / ↓
  • Create new cell: a (above) / b (below)

Cell types

  • Markdown cell: m
  • Code cell: y

Run cells

  • Run cell: Ctrl + Enter or Shift + Enter

Steps to get started:

  • Create a new notebook: notebook.ipynb
  • Kernel selection → choose Python Environments → /opt/conda/bin/python
  • Create a Markdown cell explaining that this is a test notebook
  • Create a code cell to print “Hello world”
  • Run both cells
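The code cell from the steps above could, for example, contain the following (the Markdown cell is just plain text such as "This is a test notebook"):

```python
# Code cell of the test notebook: confirm the selected kernel runs Python.
import sys

message = "Hello world"
print(message)
print(sys.executable)  # in the course Codespace this points to /opt/conda/bin/python
```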

GitHub Codespaces: Stop, resume, and download your work

Stopping and resuming a Codespace

  • Stop a Codespace when you finish working
    • This pauses the environment but keeps your files and setup.
    • You can resume later and continue where you left off.
  • Resume a Codespace
    • Open the repository on GitHub.
    • Navigate to Code → Codespaces → Resume.

Deleting a Codespace

  • Deleting a Codespace permanently removes the environment and all files stored in the Codespace.
  • Before deleting: Download notebooks or files that you want to keep for your own reference.

Recommendation

  • Prefer Stop / Resume if you plan to continue working later.
  • If you delete a Codespace, make sure you download important notebooks or push your changes to GitHub first.

Survey: Session 1





https://forms.gle/Jna4dmyEvcw3cjRPA

References

Davenport, T. H. (2006). Competing on analytics. Harvard Business Review, 84(1), 98–107. https://cs.brown.edu/courses/cs295-11/competing.pdf
Dell’Acqua, F., McFowland, E., Mollick, E., Lifshitz-Assaf, H., Kellogg, K. C., Rajendran, S., Krayer, L., Candelon, F., Lakhani, K. R., Bervell, M., Cheng, J., Deshpande, P., Ledovskiy, M., Kalil, J., Kung, K., Lacerda, R., Awada, M., Sariego, P. M., Noriega, R., … Lakhani, M. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality (Working Paper Nos. 24-013). Harvard Business School.
Hendrycks, D., Song, D., Szegedy, C., Lee, H., Gal, Y., Brynjolfsson, E., Li, S., Zou, A., Levine, L., Han, B., et al. (2025). A definition of AGI. arXiv preprint arXiv:2510.18212.
Schmarzo, B. (2016). Big data MBA. Wiley. https://doi.org/10.1002/9781119238881
Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, 1, 29–39.