
Analytics & Big Data

Session 1: The rise of analytics

Prof. Dr. Gerit Wagner

(2026-03-23)

Why is modern analytics so successful?

  1. More data
  2. More computing power
  3. New algorithms
  4. New analytics processes

More data for the analysis

The explosion of data





Unit | Equivalent | Approximate meaning
--- | --- | ---
Gigabyte (GB) | 1,000 MB | An HD movie file or a few hundred photos
Terabyte (TB) | 1,000 GB | Storage of a modern laptop or external drive
Petabyte (PB) | 1,000 TB | Data of a large company or several large data centers
Exabyte (EB) | 1,000 PB | Roughly the yearly internet traffic of a small country
Zettabyte (ZB) | 1,000 EB | ≈ 1 trillion gigabytes; global data creation scale
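The 1,000× ladder in the table can be sanity-checked with a few lines of Python (the helper function is illustrative, not part of any library):

```python
# Storage-unit ladder with decimal prefixes: each step is a factor of 1,000.
UNITS = ["GB", "TB", "PB", "EB", "ZB"]

def to_gigabytes(value, unit):
    """Convert a value in the given unit to gigabytes."""
    return value * 1000 ** UNITS.index(unit)

print(to_gigabytes(1, "TB"))  # 1,000 GB
print(to_gigabytes(1, "ZB"))  # 1,000,000,000,000 GB — one trillion, as in the table
```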

Data production

Enterprise and transactional data
Enterprise systems such as ERP, CRM, and supply chain platforms generate large volumes of structured data through everyday business transactions.

eCommerce
Every search, click, purchase, and review creates behavioral and transactional data used for recommendations and personalized marketing.

Social media and user-generated content
Platforms such as TikTok, YouTube, and Instagram generate enormous data volumes through uploads, interactions, and live streaming.

IoT and smart devices
Connected devices—from wearables to industrial sensors—continuously produce real-time data across interconnected systems.

Digital transactions
Online banking, mobile payments, and blockchain systems generate detailed financial records for transactions and security monitoring.

AI-generated data
Machine learning and generative AI create large datasets during training and operation, further accelerating global data growth.

Computing power

Growth of computing power

The rapid acceleration of computing power—driven by advances in hardware, cloud infrastructure, and parallel processing—has enabled modern analytics and machine learning to scale to massive datasets.


Evolution of computing power

The growth of modern analytics is enabled by changing strategies for increasing computing power.


Era | Main strategy | Explanation
--- | --- | ---
1970s–2000s | Moore’s Law & miniaturization | Smaller transistors → more components per chip → faster processors
2005–today | Parallel computing | Performance increases by using multiple processors simultaneously
2010s–today | Specialized hardware | Chips optimized for specific workloads (GPUs, TPUs, AI accelerators)
Emerging | New computing paradigms | Alternative computing models such as quantum computing
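The parallel-computing strategy can be sketched with Python's standard library. This toy example (function names are illustrative) splits a sum of squares across several workers; it uses threads to keep the sketch simple, although CPU-bound Python code needs processes or specialized hardware for real speedups:

```python
# Toy illustration of parallelism: split a task into chunks, process them with
# several workers, and combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(n, workers=4):
    # Interleaved chunks: worker i handles i, i+workers, i+2*workers, ...
    chunks = [range(i, n, workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum_of_squares(10))  # same result as the sequential sum
```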

New algorithms

Algorithms

An algorithm is a step-by-step procedure for performing a computation and thereby solving a problem.

Algorithms determine how efficiently computers can process data and solve tasks.

Examples:

  • Linux Scheduling Algorithms (1990s–present) — Efficient process scheduling enabling operating systems to run tasks concurrently
  • RSA Encryption (1977) — Public-key cryptography algorithm enabling secure internet communication
  • PageRank (1998) — Algorithm ranking webpages based on link structure, enabling scalable web search
  • Blockchain (2008) — Distributed consensus algorithm enabling decentralized digital ledgers and cryptocurrencies
  • Deep Learning (2010s) — Improved neural network training enabling major advances in vision, speech, and language AI
  • Gradient Boosting (2014) — High-performance ensemble learning algorithm widely used in predictive analytics and data science
  • GPT / Transformer Models (2017–present) — Transformer architecture enabling large language models and generative AI
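As an illustration of how compact an influential algorithm can be, PageRank's core idea (power iteration over the link structure) fits in a few lines; the three-page link graph below is made up:

```python
# Minimal PageRank sketch via power iteration on a tiny hypothetical link graph.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}  # start with a uniform distribution
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            # Each page passes its rank on to the pages it links to.
            for target in outgoing:
                new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical web
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # the most "important" page
```

Pages with many incoming links from important pages end up with high ranks, which is exactly what made web search scalable.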

Learning note

No need to memorize everything.
Be able to give a few illustrative examples.
This also applies to the data production areas.

Algorithms enable more powerful analytical models



(Figure: increasingly flexible model families — traditional regression, decision tree, neural network)

Algorithmic breakthroughs drive AI progress

Recent breakthroughs in artificial intelligence (AI) show how new algorithms can rapidly surpass human performance. DeepMind provides good examples.

AlphaGo (2016)

  • Uses deep neural networks + reinforcement learning
  • Defeated world champion Lee Sedol in the game of Go

Go was long considered too complex for computers due to the enormous search space.

AlphaFold (2020–2022)

  • Uses deep learning to predict protein structures
  • Achieved breakthrough performance in the CASP competition

Predicting protein folding had been a major unsolved problem in biology for decades.

Jagged frontier of AI





AI progress often occurs through algorithmic breakthroughs, enabling machines to outperform humans in increasingly complex tasks.

Recent studies suggest that AI creates a “jagged frontier” (Dell’Acqua et al., 2023): some tasks are well suited to AI, while others that appear similar remain outside its capabilities.

  • Within this frontier, AI can strongly enhance knowledge work, improving productivity and the quality of outputs.
  • Outside the frontier, AI can reduce performance, especially when users rely too heavily on its outputs without verification.

Analytics processes

Maturing analytical capabilities

Descriptive analytics: What happened?

  • Summarizes historical data to understand patterns and trends.
  • Example: Sales reports, dashboards, KPIs

Predictive analytics: What will happen?

  • Uses statistical models and machine learning to forecast future outcomes.
  • Example: Demand forecasting, churn prediction

Prescriptive analytics: What should we do?

  • Recommends actions based on predictions and optimization.
  • Example: Pricing optimization, recommendation systems
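The three purposes can be illustrated on a single toy dataset (all numbers hypothetical, and the naive trend forecast stands in for a proper statistical model):

```python
# Descriptive, predictive, and prescriptive analytics on toy monthly sales data.
from statistics import mean

sales = [420, 450, 470, 500, 530]  # hypothetical units sold, last five months

# Descriptive: what happened?
print("Average monthly sales:", mean(sales))

# Predictive: what will happen? (naive trend: last value + average change)
avg_change = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + avg_change

# Prescriptive: what should we do? (stock the forecast plus a safety buffer)
safety_buffer = 0.10
order_quantity = round(forecast * (1 + safety_buffer))
print("Next month forecast:", forecast)
print("Recommended order quantity:", order_quantity)
```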



Analytical purposes



What happened? (descriptive analytics) | What will happen? (predictive analytics) | What should I do? (prescriptive analytics)
--- | --- | ---
How many widgets did I sell last month? | How many widgets will I sell next month? | Order 5,000 units of Component Z to support widget sales for next month.
What were sales by zip code for Christmas last year? | What will be sales by zip code over this Christmas season? | Hire Y new sales reps by these zip codes to handle projected Christmas sales.
How many of Product X were returned last month? | How many of Product X will be returned next month? | Set aside $125K in financial reserve to cover Product X returns.
What were company revenues and profits for the past quarter? | What are projected company revenues and profits for next quarter? | Sell the following product mix to achieve quarterly revenue and margin goals.
How many employees did I hire last year? | How many employees will I need to hire next year? | Increase hiring pipeline by 35% to achieve hiring goals.



A particular method, such as regression or machine learning, can serve multiple purposes.

Example: Descriptive analytics


Example: Predictive analytics


Example: Prescriptive analytics

Based on more than 300 million data records per week, Otto generates over one billion forecasts annually on the expected sales of individual products in the coming days and weeks. These forecasts are used to optimize inventory decisions, determining how many units of each product should be stocked or reordered across warehouses. By systematically adjusting inventory levels based on these data-driven recommendations, Otto is able to reduce its overall inventories by up to 30% on average while maintaining product availability.

Analytical models

There is a broad repertoire of models and methods from multiple disciplines, each with its own assumptions, data preparation steps, and modeling processes. While these fields often focus on methodological development, business analytics emphasizes applying these approaches to support understanding and decision-making in organizational contexts.


Discipline | Model culture | Typical models
--- | --- | ---
Statistics | Probabilistic inference | Regression, GLM, Bayesian
Econometrics | Causal modeling | IV, panel models
Computer Science | Algorithmic learning | Trees, SVM, NN
Operations Research | Optimization | Linear programming, non-linear optimization
Management Science | Decision modeling | Stochastic optimization
Complex Systems | Simulation | Agent-based / discrete-event simulation

Cross-Industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM is a widely used framework that structures data mining and analytics projects (Wirth & Hipp, 2000).

  • Developed by an industry consortium (e.g., DaimlerChrysler, SPSS, NCR) to create a common methodology for data mining projects.
  • Designed to be independent of specific industries and technologies, making it applicable across many domains.
  • Integrates best practices from real-world projects, helping teams plan, communicate, and document analytics work.
  • Became popular because it provides a reliable and repeatable standard process for managing complex data mining projects.


(Figure: the CRISP-DM cycle — Business Understanding ↔ Data Understanding, Data Preparation ↔ Modeling, Modeling → Evaluation, Evaluation → Deployment, with feedback from Evaluation to Business Understanding; the data sits at the center.)

Structure of the course

We structure the course along the CRISP-DM analytics lifecycle:

  • Introduction (Session 0–1)
    • Overview of analytics and the CRISP-DM workflow.
  • Data Foundations (Session 2–3)
    • Exploration and analytical data architecture.
  • Analytical Models (Session 4–9)
    • Regression, machine learning, and big data analytics.
  • Deployment (Session 10)
    • Analytics in organizations.
  • Synthesis (Session 11)
    • Integration of the full analytics workflow.


(Figure: the CRISP-DM cycle — Business Understanding ↔ Data Understanding, Data Preparation ↔ Modeling, Modeling → Evaluation, Evaluation → Deployment, with feedback from Evaluation to Business Understanding; the data sits at the center.)

Analytics is not a linear pipeline. We constantly move between business problems, data, modeling, and evaluation.






Learning goals

  • Illustrate how the rise of data analytics capabilities is enabled by data availability, advances in computing power, new algorithms, and maturing analytics processes.
  • Distinguish between descriptive, predictive, and prescriptive analytics.
  • Explore the Python and Jupyter analytics ecosystem (exercise).

Exercise

Setup for the practical exercises

Options

  • Microsoft Excel
  • RapidMiner
  • RStudio
  • IBM SPSS Statistics
  • Jupyter Notebooks
  • Tableau
  • Microsoft Power BI

Why this setup?

We use Jupyter Notebooks and Python because

  • it supports advanced analytics, including data analysis, visualization, machine learning, and big data analytics
  • it is a more demanding environment to learn, which helps us build skills that transfer easily to simpler analytics tools and interfaces
  • it is widely used in research and large organizations, making it a valuable and relevant skillset
  • it allows us to combine context, code, explanations, and results in a single document

A Jupyter notebook

Jupyter Notebooks combine context, code, output, and implications in one interactive document.

Cells

  • Markdown cells → text, explanations, documentation
  • Code cells → Python code for analysis

Output

Running a code cell produces results directly below it (text, tables, charts, etc.).

Execution environment

  • Kernel → executes the code (Python in this course)
  • Virtual machine (VM) → the computer where the kernel runs
  • In this course: provided through GitHub Codespaces

The analysis runs on a remote machine with Python and libraries already installed.

Starting GitHub Codespaces


☕ Short Break

Take 5–10 minutes

Stretch, grab a coffee, or chat with others.

We’ll continue shortly

Group split

Each section takes 30 minutes. Work in pairs.




(Figure: session flow — after the split, Group 1 works on the Jupyter notebook while Group 2 reads; after the switch, Group 1 reads and Group 2 works on the notebook; both groups then join the discussion.)

Group 1: Read Competing on Analytics

Read the Competing on Analytics paper by Davenport (2006). Prepare to discuss the following questions:

  1. What does it mean for a company to “compete on analytics”?
    How is this different from simply using data or reports in decision making?

  2. What organizational capabilities are required to compete on analytics?
    Consider aspects such as leadership, culture, people, and technology.

  3. Which companies or industries today seem to compete on analytics?
    Give examples and explain how analytics creates their competitive advantage.

Group 2: Developing an analytical notebook

Find a dataset on https://www.kaggle.com/datasets and a corresponding business problem you could address with the data.

Create a notebook report.ipynb and draft an analysis structure based on CRISP-DM:

  • Add Markdown sections describing what you would do in each phase
  • Add Python code sections for the analyses you intend to run
  • Indicate which parts of the analysis are descriptive, predictive, or prescriptive

Then:

  • Download a suitable dataset (CSV)
  • Import it into your notebook
  • Start exploring the data and implement the analyses as far as you get

The goal is not to complete the analysis, but to begin working with the data.
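A possible first code cell for report.ipynb might look like the sketch below; the inline sample and its column names are placeholders for whatever Kaggle dataset you choose — with a real download you would pass the CSV's file path to pd.read_csv instead:

```python
# First exploration steps for report.ipynb (descriptive analytics).
import io
import pandas as pd

sample_csv = """order_id,region,units,revenue
1,North,3,29.97
2,South,1,9.99
3,North,5,49.95
"""  # hypothetical stand-in; replace with pd.read_csv("your_dataset.csv")

df = pd.read_csv(io.StringIO(sample_csv))

print(df.shape)         # number of rows and columns
print(df.dtypes)        # column types
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column
```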

Jupyter Notebook

Modes

  • Edit mode: Enter
  • Command mode: Esc

Navigating cells

  • Move between cells: ↑ / ↓
  • Create new cell: a (above) / b (below)

Cell types

  • Markdown cell: m
  • Code cell: y

Run cells

  • Run cell: Ctrl + Enter or Shift + Enter

Steps to get started:

  • Create a new notebook: notebook.ipynb
  • Kernel selection → choose Python Environments → /opt/conda/bin/python
  • Create a Markdown cell explaining that this is a test notebook
  • Create a code cell to print “Hello world”
  • Run both cells
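The code cell from the steps above could, for example, contain the following (the Markdown cell is just plain text such as "This is a test notebook"):

```python
# Code cell of the test notebook: confirm the selected kernel runs Python.
import sys

message = "Hello world"
print(message)
print(sys.executable)  # in the course Codespace this points to /opt/conda/bin/python
```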

GitHub Codespaces: Stop, resume, and download your work

Stopping and resuming a Codespace

  • Stop a Codespace when you finish working
    • This pauses the environment but keeps your files and setup.
    • You can resume later and continue where you left off.
  • Resume a Codespace
    • Open the repository on GitHub.
    • Navigate to Code → Codespaces → Resume.

Deleting a Codespace

  • Deleting a Codespace permanently removes the environment and all files stored in the Codespace.
  • Before deleting: Download notebooks or files that you want to keep for your own reference.

Recommendation

  • Prefer Stop / Resume if you plan to continue working later.
  • If you delete a Codespace, make sure you download important notebooks or push your changes to GitHub first.

Survey: Session 1





https://forms.gle/Jna4dmyEvcw3cjRPA

References

Davenport, T. H. (2006). Competing on analytics. Harvard Business Review, 84(1), 98–107. https://cs.brown.edu/courses/cs295-11/competing.pdf
Dell’Acqua, F., McFowland, E., Mollick, E., Lifshitz-Assaf, H., Kellogg, K. C., Rajendran, S., Krayer, L., Candelon, F., Lakhani, K. R., Bervell, M., Cheng, J., Deshpande, P., Ledovskiy, M., Kalil, J., Kung, K., Lacerda, R., Awada, M., Sariego, P. M., Noriega, R., … Lakhani, M. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality (Working Paper Nos. 24-013). Harvard Business School.
Hendrycks, D., Song, D., Szegedy, C., Lee, H., Gal, Y., Brynjolfsson, E., Li, S., Zou, A., Levine, L., Han, B., et al. (2025). A definition of AGI. arXiv preprint arXiv:2510.18212.
Schmarzo, B. (2016). Big data MBA. Wiley. https://doi.org/10.1002/9781119238881
Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, 1, 29–39.