Notes: Session 2: Analytical data architecture

Time (min) Duration Topic Additional materials
0–23 23 Data warehouse
23–46 23 ETL pipeline
46–68 22 Dimensional modeling
68–90 22 OLAP operations

2026-03-30 lecture: 85 minutes (5 min short)

TODO

Summary/previous lectures

  • Rise of analytics: descriptive/predictive/prescriptive, CRISP-DM
  • Data preparation: variable types, data quality, structuring (what tools expect), clustering/k-means
  • Tech. setup: Jupyter notebooks, Codespaces -> Almost completed data foundations, will start with modeling next session

DWH concept

Subject-oriented (Inmon): A DWH is organized around core business subjects (e.g., customer, product, sales), which correspond to facts and dimensions in a star schema. Instead of storing data separately across operational systems (e.g., CRM, billing, support), the DWH integrates them into unified subject areas—e.g., a Customer view combining purchases, interactions, payments, and demographics.

Modern analytics

DWH codifies data preparation procedures (all transformations, etc. we covered in the previous lecture), access to databases, metadata-knowledge. This could make DWH-evolution more efficient for organizations than replacing them with new systems in a greenfield-approach.

Data quality issues could be discussed along DWH/ETL. E.g., transformations are in the transformation phase, incorrect data should be fixed in the source systems, for duplicates it depends (whether they are in or between-dataset duplicates)

Modern ML/analytics applications often operate as a layer on top of a DWH

Exercise

2026-03-30: 90 minutes (approx. 60 minutes for part 1+2 with discussion; 30 minutes for part 5/1+2; skipped remaining modelling/OLAP/Questions)

TODO:

  • Do fact tables require a primary key?
  • Can we add the same tuple multiple times?
  • Check: is this handled by “artificial primary keys” (built-in)?
  • Key question: why should time be a separate table?
  • Creating a time dimension table is helpful to trigger database indexing (for efficient OLAP operations)

Part 2: lesson should be “extract time as table” and “integrate” class into wine.

Materials

Davenport, T. H. (2006). Competing on analytics. Harvard Business Review, 84(1), 98–107. https://cs.brown.edu/courses/cs295-11/competing.pdf