Teaching notes

In progress: preparation.

Overall learning objectives.

Teaching notes: see overview.

Building Materials

PDFs of lecture slides

make pdfs
# PDFs are created in _site/slides/*.pdf

Exercise notebooks

make exercises
# Jupyter notebooks are created in _site/exercises/*.ipynb

Exercises: Setup for Jupyter notebooks

Install JupyterLab in a virtual environment (venv):

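# Install Python and pip (dnf-based distributions such as Fedora)
sudo dnf install python3 python3-pip python3-virtualenv
# Create and activate a dedicated virtual environment
python3 -m venv ~/venvs/jupyterlab
source ~/venvs/jupyterlab/bin/activate
# Upgrade pip, then install JupyterLab and the analytics libraries
pip install --upgrade pip
pip install jupyterlab scikit-learn pandas numpy matplotlib seaborn openpyxl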
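
After the installation, it can help to confirm that the libraries are visible in the active environment. The following is a minimal sketch, not part of the course materials: the helper name `find_missing` and the list `REQUIRED_MODULES` are made up for illustration, and the script only checks that the modules can be located, not that they work correctly.

```python
"""Check that the libraries needed for the exercise notebooks can be found."""
from importlib import util

# Import names for the packages installed with pip. Note that scikit-learn
# is imported as "sklearn"; the other packages import under their pip names.
REQUIRED_MODULES = [
    "jupyterlab", "sklearn", "pandas", "numpy",
    "matplotlib", "seaborn", "openpyxl",
]


def find_missing(modules):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in modules if util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules can be imported.")
```

Run the script with the venv activated; otherwise it inspects the system Python rather than the JupyterLab environment.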

Start JupyterLab

source ~/venvs/jupyterlab/bin/activate
cd ~/repos/analytics-and-big-data/_site/exercises
jupyter lab
# JupyterLab is then served at http://localhost:8888 (default port)

TODO: installing libraries (pandas, …)

Pedagogical rationale

  • The established CRISP-DM process model serves as a scaffold for the course. The first session gives an overview, and the last provides a synthesis across phases. The initial sessions focus on data understanding and preparation, and the penultimate ones on deployment in organizations. The sessions in between form the main part of the course and cover analytical modeling with different model types (regression, ML, Big Data analytics). In this part, the focus is on modeling, but relevant aspects of the other phases (such as over-fitting in ML) are also covered. The modeling sessions use CRISP-DM as a meta-layer to organize materials (e.g., starting from the business problem and data, which explain the need for particular models), but the major sections mirror the specifics of each model instead of repeating the CRISP-DM phases in each session.
  • The focus is on dataset examples from a business context. To ensure a good fit, we aim to avoid typical example datasets from medicine, climate, or political and societal phenomena. Instead, we rely on business and finance examples, and create synthetic datasets to illustrate realistic business analytics applications. Public datasets (as typically used in statistics or data science courses) are unsuitable for teaching a number of important skills. For instance, data preparation, integration, and cleansing are often particularly challenging in business data, compared to cleanly controlled survey data. Similarly, examples like recommender systems show why it can be reasonable to focus solely on prediction, putting less emphasis on typical concerns of statistical inference (interpretation of coefficients, or generalization). In addition, business data raises particular upstream considerations, e.g., which data sources and features should be accessed and incorporated, or which data should be recorded in the first place, as well as downstream deployment considerations, e.g., compliant use of employee analytics, or bias in algorithmic decision systems.
  • The focus is on coding analytical notebooks. Reasons: advanced analytics, transferability to graphical tools, rich context, and LLM support.

TBD: If analytical decisions should support particular actions, a typical task may be handling identifiers: removing them before model training, but adding them back once predictions have been made (e.g., logistic regression for default prediction)?

TBD: Prepare for “cognitive translation” (between Python code, formulas, different outputs, and conceptual procedures like OLS/maximum likelihood).
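
The identifier-handling question above could be illustrated along the following lines. This is a minimal sketch under stated assumptions, not a fixed part of the course: the loan data, the column names (`customer_id`, `income`, `debt_ratio`, `default`), and the choice to score the training rows are all made up for illustration. In an actual exercise, new customers would be scored instead.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical loan data; all column names and values are invented.
loans = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004", "C005", "C006"],
    "income": [25, 40, 60, 30, 80, 55],            # in thousands
    "debt_ratio": [0.9, 0.5, 0.2, 0.8, 0.1, 0.3],
    "default": [1, 0, 0, 1, 0, 0],
})

ID_COLUMN = "customer_id"
TARGET = "default"

# 1. Set the identifier aside: it carries no generalizable signal and would
#    only invite the model to memorize individual customers.
ids = loans[ID_COLUMN]
X = loans.drop(columns=[ID_COLUMN, TARGET])
y = loans[TARGET]

# 2. Train on the features only.
model = LogisticRegression().fit(X, y)

# 3. Re-attach the identifier so that each prediction can be linked to a
#    customer and acted upon. The row order is unchanged, so the rows align.
scored = pd.DataFrame({
    ID_COLUMN: ids,
    "p_default": model.predict_proba(X)[:, 1],  # probability of class 1
})
print(scored)
```

Keeping the identifier in a separate variable, rather than dropping it permanently, is what makes the predictions actionable: the output table says which customer each probability belongs to.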

Recommendations for building computational notebooks in Jupyter: https://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.1007007

Table illustrating case examples and insights (specific to business analytics, and in line with the whole CRISP-DM process):

Phase                     | Case Example       | Insights
--------------------------|--------------------|----------------------------------------------------
Data Preparation          |                    | Integration; data quality issues
Exploratory Data Analysis | Cluster example    | TODO
Linear Regression         | Employee analytics | Ethics
Logistic Regression       | Churn prediction   | Expected value criterion
Machine Learning          | TBA                | Explainable AI; MLOps
Big Data                  | TBA                | Handling extreme events (e.g., Uber crisis example)
Deployment                | TBA                | TBA
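
The expected value criterion listed for the churn prediction case can be illustrated with a few lines of arithmetic. The following is a minimal sketch assuming a simple retention-offer setting. The function name, the customer value, the offer cost, and the acceptance rate are hypothetical and not taken from the course materials.

```python
def expected_value_of_offer(p_churn, customer_value, offer_cost, p_accept=0.5):
    """Expected gain from sending a retention offer to one customer.

    Assumed (hypothetical) setting: the offer costs `offer_cost` for every
    targeted customer; a customer who would have churned stays with
    probability `p_accept`, which preserves `customer_value`.
    """
    return p_churn * p_accept * customer_value - offer_cost


# Hypothetical predicted churn probabilities, e.g., from a logistic regression.
predicted = {"C001": 0.05, "C002": 0.40, "C003": 0.80}

for customer_id, p_churn in predicted.items():
    ev = expected_value_of_offer(p_churn, customer_value=200.0, offer_cost=20.0)
    decision = "target" if ev > 0 else "skip"
    print(f"{customer_id}: expected value = {ev:6.2f} -> {decision}")
```

With these numbers, the break-even churn probability is 20 / (0.5 × 200) = 0.2, so the decision threshold follows from costs and benefits rather than from a default cutoff of 0.5.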

Books

Students: Setup for exercises

  • Students: TBD (GitHub/Jupyter notebooks)