Notes: Session 3: Exploratory data analysis
| Time (min) | Duration | Topic | Additional materials |
|---|---|---|---|
| 0–10 | 10 | Foundations | |
| 10–40 | 30 | Data preparation | |
| 40–55 | 15 | Univariate exploratory data analysis | |
| 55–65 | 10 | Multivariate exploratory data analysis | |
| 65–90 | 25 | Clustering in EDA | |
TODO: prepare better explanations of interval/ratio/…
TODO: think about how the second part of clustering (hierarchical clustering; distance measures) should be presented. It was covered (too) quickly in the last session.
Badly formatted data blocks the analysis; data preparation unlocks it.
Data quality: never trust data blindly. Always evaluate the data you receive as input.
Data preparation:
- Statistics traditionally relies on a controlled mode of data collection
- Repurposing data, e.g., analyzing transactional data, is not a typical setting in statistics
- Data quality measures are relatively limited (focused on outliers)
- Data preparation (manual correction or the use of synthetic data) is sometimes even dismissed as a questionable research practice
Data structuring
- Need to develop the intuition and clear understanding of how data should be structured.
- Go back to the examples and highlight values, variables, and observations. Define the observational unit. The wide format is an output in a report, not an input. Example observational unit: quarterly sales per region (implicitly: across years)
Key learning: Data structures for analytics may tolerate redundancies! Illustrate this on the whiteboard:
- ERD/ERM follow a different rationale (ACID): preventing inconsistent states in operational databases
- Analytical data is often extracted (copied) and will not be changed by operational systems; it may not even be persisted (just used in-memory for the analysis)
- In tidy data, there is no need to assign unique IDs. Duplicates are a concern, but unique IDs are not enough to conclude that there are no duplicates.
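The last point can be shown directly in pandas. A minimal sketch with made-up order data (all names and values are hypothetical): every row has a unique ID, yet the substantive columns still contain a duplicate observation.

```python
import pandas as pd

# Hypothetical order data: rows 0 and 1 record the same observation,
# but each row carries its own unique ID.
df = pd.DataFrame({
    "order_id": [101, 102, 103],
    "region":   ["North", "North", "South"],
    "quarter":  ["Q1", "Q1", "Q1"],
    "sales":    [500, 500, 320],
})

# All IDs are unique ...
print(df["order_id"].is_unique)   # True

# ... yet ignoring the ID column reveals a duplicate observation.
dupes = df.duplicated(subset=["region", "quarter", "sales"])
print(dupes.sum())                # 1
```

Checking `duplicated()` on the substantive columns (rather than relying on the ID) is the point worth demonstrating in class.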
Data integration
Remember the Davenport (2006) case: competing on analytics means pooling data that was generated in-house and data acquired from outside data sources.
Question: what does the “merge()” method correspond to in SQL?
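A possible answer to show after the discussion: `merge()` corresponds to SQL's JOIN. A minimal sketch with two hypothetical tables (names and values invented for illustration):

```python
import pandas as pd

# Hypothetical tables used only for illustration.
sales = pd.DataFrame({"region_id": [1, 2], "sales": [500, 320]})
regions = pd.DataFrame({"region_id": [1, 2], "name": ["North", "South"]})

# Equivalent to:
#   SELECT * FROM sales INNER JOIN regions USING (region_id);
joined = sales.merge(regions, on="region_id", how="inner")
print(joined)
```

The `how` parameter maps to the JOIN variants: `"left"`, `"right"`, and `"outer"` correspond to LEFT, RIGHT, and FULL OUTER JOIN.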
Metaphor: statisticians carefully collect their data on their own (live on their own island) - business analysts have to work with existing data (which can be very messy)
But: Sometimes it is better to know/predict something even if we cannot explain it instead of doing nothing! Examples:
- Recommendation systems: which movie are you likely to watch next? Thousands of variables may be available; we may never know exactly why a particular recommendation works, but recommendations generally improve engagement/revenue. ("We don't know / don't give a recommendation" is the worst option.) -> From a business perspective, recommending anything is better than no recommendation at all
- Fraud detection: the prediction of irregular and fraudulent transactions can be very accurate, but the attacker's goal is to behave in a way that is hard to anticipate. Prediction may focus on identifying cases that are "not behaving regularly" (rather than identifying generalizable patterns)
EDA/Clustering
TODO: Illustrate how k-means clustering works for a given k (random centroids/assignments, recalculation of centroids), using different colors
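For the illustration, a minimal k-means sketch with NumPy on made-up 2-D data (two synthetic blobs; all values invented). It shows exactly the steps to draw on the whiteboard: random initial centroids, assignment of points to the nearest centroid, recalculation of centroids.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: two synthetic blobs (illustration only).
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

k = 2
# Step 1: pick k random data points as initial centroids.
centroids = X[rng.choice(len(X), k, replace=False)]

for _ in range(10):
    # Step 2: assign each point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of its assigned points
    # (keep the old centroid if a cluster happens to be empty).
    centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])

print(centroids)
```

In class, the `labels` array can drive the point colors at each iteration, which makes the reassignment step visible.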
Exercise
TODO: include an example for tidying data with different observational units (e.g., different levels like department sales vs. package deliveries to customers)
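One possible shape for this exercise, with entirely made-up data: a messy export that mixes two observational units (department-level sales and customer-level deliveries) in one table, which tidying splits into one table per observational unit.

```python
import pandas as pd

# Hypothetical messy export mixing two observational units:
# department-level sales repeat on every delivery row.
messy = pd.DataFrame({
    "department": ["Toys", "Toys", "Books"],
    "dept_sales": [900, 900, 400],
    "customer":   ["A", "B", "C"],
    "packages":   [2, 1, 3],
})

# Tidy: one table per observational unit.
dept_sales = messy[["department", "dept_sales"]].drop_duplicates()
deliveries = messy[["department", "customer", "packages"]]

print(dept_sales)
print(deliveries)
```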
Note: focus on pandas DataFrames (not so much on other Python data types). Possible extension: financial API (code for the API call should be included)
Implementation in Python/pandas: vary the level of difficulty: using the library documentation vs. using ChatGPT
Possible extensions: data structuring with tables requiring transpose
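A minimal sketch of the transpose case with a hypothetical report table (values invented): variables sit in rows and observations in columns, so the table must be transposed before it matches the tidy layout.

```python
import pandas as pd

# Hypothetical report: variables in rows, observations (years) in columns.
report = pd.DataFrame(
    {"2023": [500, 120], "2024": [550, 130]},
    index=["sales", "costs"],
)

tidy = report.T               # transpose: years become rows
tidy.index.name = "year"
print(tidy.reset_index())
```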
Survey
- Generally: indicate when notes on learning focus or expectations would be helpful.