Notes: Session 3: Exploratory data analysis

Time (min)   Duration (min)   Topic                                     Additional materials
0–10         10               Foundations
10–40        30               Data preparation
40–55        15               Univariate exploratory data analysis
55–65        10               Multivariate exploratory data analysis
65–90        25               Clustering in EDA

TODO: prepare better explanations of interval/ratio/…

TODO: think about how the second part of clustering should be presented (hierarchical; distance measures). It was covered (too) quickly in the last session.

Badly formatted data prevents analysis; data preparation unlocks it.

Data quality: never trust data blindly. Always evaluate the data you receive as input.

Data preparation:

Data structuring

  • Students need to develop an intuition for, and a clear understanding of, how data should be structured.
  • Go back to the examples and highlight values, variables, and observations. Define the observational unit. The wide format is an output in a report, not an input. Observational unit: quarterly sales per region (implicitly: across years).
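A possible whiteboard companion for the wide-vs-tidy point: a minimal pandas sketch that reshapes a report-style table into one row per observational unit (region, year, quarter). The column names and all sales figures are invented for illustration and are not taken from the course examples.

```python
import pandas as pd

# Hypothetical report-style (wide) table: one column per quarter.
# All values are made up for illustration.
wide = pd.DataFrame({
    "region": ["North", "South"],
    "year": [2023, 2023],
    "Q1": [100, 80],
    "Q2": [110, 85],
    "Q3": [120, 90],
    "Q4": [130, 95],
})

# Tidy (long) format: each row is one observation
# (sales of one region in one quarter of one year).
tidy = (
    wide.melt(
        id_vars=["region", "year"],
        value_vars=["Q1", "Q2", "Q3", "Q4"],
        var_name="quarter",
        value_name="sales",
    )
    .sort_values(["region", "year", "quarter"])
    .reset_index(drop=True)
)
print(tidy)

# The wide layout can be regenerated as an output for a report.
report = tidy.pivot(index=["region", "year"], columns="quarter", values="sales")
print(report)
```

The tidy table is the input for grouping, plotting, and modelling; the pivot at the end shows that the wide report can always be derived from it.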

Key learning: data structures for analytics may tolerate redundancies! Illustrate this on the whiteboard:

  • ERD/ERM follow a different rationale (ACID): preventing inconsistent states in operational databases.
  • Analytical data is often extracted (copied) and will not be changed by operational systems. It may not even be persisted (just used in-memory for the analysis).
  • In tidy data, there is no need to assign unique IDs. Duplicates are a concern, but unique IDs are not enough to conclude that there are no duplicates.
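The point about unique IDs and duplicates could be demonstrated with a small sketch like the following. The table, its column names, and the assumption that two rows with identical customer, date, and amount are the same order entered twice are all hypothetical; in real data that judgment depends on the domain.

```python
import pandas as pd

# Hypothetical extract: every row has its own ID, but rows 2 and 3
# describe the same order (assumption made for this illustration).
orders = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "customer": ["Ana", "Ben", "Ben", "Cleo"],
    "date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
    "amount": [20.0, 35.5, 35.5, 12.0],
})

# The ID column alone suggests that everything is fine.
ids_unique = orders["id"].is_unique
print("IDs unique:", ids_unique)

# Checking the content columns (everything except the ID) reveals the duplicate.
content_cols = ["customer", "date", "amount"]
dup_mask = orders.duplicated(subset=content_cols, keep=False)
suspects = orders[dup_mask]
print(suspects)

# One possible clean-up: keep the first occurrence of each content combination.
deduplicated = orders.drop_duplicates(subset=content_cols, keep="first")
print(len(orders), "->", len(deduplicated), "rows")
```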

Data integration

Remember the Davenport (2006) case: competing on analytics means pooling data that was generated in-house and data acquired from outside data sources.

Question: what does the “merge()” method correspond to in SQL?
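For discussing the question above: the pandas `merge()` method plays the role of a SQL JOIN, with the `how` parameter selecting the join type. A minimal sketch, using two invented tables (names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical in-house data and an external lookup table.
sales = pd.DataFrame({"region_id": [1, 1, 2, 3], "sales": [100, 110, 80, 60]})
regions = pd.DataFrame({"region_id": [1, 2, 4], "region": ["North", "South", "West"]})

# Roughly: SELECT * FROM sales INNER JOIN regions USING (region_id)
inner = sales.merge(regions, on="region_id", how="inner")
print(inner)

# Roughly: SELECT * FROM sales LEFT JOIN regions USING (region_id)
# indicator=True adds a "_merge" column showing where each row came from,
# which helps to spot keys without a match (here: region_id 3).
left = sales.merge(regions, on="region_id", how="left", indicator=True)
print(left)
```

The left join keeps all sales rows and makes unmatched keys visible as missing values, which ties back to the data quality point: integration problems surface at the join.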

Metaphor: statisticians carefully collect their own data (they live on their own island); business analysts have to work with existing data (which can be very messy).

But: sometimes it is better to know or predict something, even if we cannot explain it, than to do nothing! Examples:

  • Recommendation systems: which movie are you likely to watch next? Thousands of variables may be available, and we may never know exactly why a particular recommendation works, but recommendations generally improve engagement/revenue. From a business perspective, “we don’t know, so we give no recommendation” is the worst option: recommending something is better than recommending nothing.
  • Fraud detection: predictions of irregular and fraudulent transactions can be very accurate, but the attackers’ goal is to behave in a way that is hard to anticipate. Prediction may therefore focus on identifying cases that are “not behaving regularly” (rather than on identifying generalizable patterns).

EDA/Clustering

TODO: Illustrate how k-means clustering works for a given k (random centroids/assignments, recalculation of centroids), using different colors
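As a starting point for this illustration, here is a minimal NumPy sketch of the k-means loop for a given k: random initial centroids, assignment of each point to its nearest centroid, and recalculation of the centroids. The function names, the synthetic two-blob data, and the choice to initialize from randomly picked data points are assumptions for the sketch, not the course's reference implementation; the labels could be used to color points in a scatter plot.

```python
import numpy as np

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points at random as centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move: converged
        centroids = new_centroids
    return assign(points, centroids), centroids

# Synthetic example data: two well-separated blobs (invented for illustration).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(centroids)
```

For the session, printing or plotting `labels` and `centroids` after each iteration (rather than only at the end) would show the alternation between the assignment and update steps.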

Exercise

TODO: include an example for tidying data with different observational units (e.g., different levels like department sales vs. package deliveries to customers)
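One possible shape for such an example, as a rough sketch: a single table that mixes two observational units (deliveries and departments) is split into one table per unit. The table, its column names, and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical mixed table: each row is a package delivery, but the
# department-level sales figure is repeated on every delivery row.
mixed = pd.DataFrame({
    "delivery_id": [101, 102, 103],
    "department": ["Toys", "Toys", "Books"],
    "department_sales": [5000, 5000, 3000],
    "customer": ["Ana", "Ben", "Cleo"],
    "packages": [1, 2, 1],
})

# Pitfall: summing over delivery rows counts the Toys sales twice.
naive_total = mixed["department_sales"].sum()

# One table per observational unit.
departments = (
    mixed[["department", "department_sales"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
deliveries = mixed[["delivery_id", "department", "customer", "packages"]]

correct_total = departments["department_sales"].sum()
print(naive_total, "vs", correct_total)
```

The two tables can be recombined with `merge()` on `department` whenever an analysis needs both levels.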

Note: focus on the pandas DataFrame (not so much on other Python datatypes).

Possible extension: financial API (code for the API call should be included).

Implementation in Python/pandas: level of difficulty: use the library documentation vs. use ChatGPT

Possible extensions: data structuring with tables requiring transpose

Survey

  • Generally: indicate when notes on learning focus or expectations would be helpful.

References

Davenport, T. H. (2006). Competing on analytics. Harvard Business Review, 84(1), 98–107. https://cs.brown.edu/courses/cs295-11/competing.pdf