Notes: Session 4: Regression 1

Time (min)	Duration	Topic
0–12	12	Business understanding
12–24	12	Data understanding
24–35	11	Data preparation
35–46	11	Modeling
46–57	11	Simple regression
57–68	11	Evaluation
68–79	11	Next steps
79–90	11	Deployment

2026-03-31: 80 mins for the lecture (slightly short, could be a bit more formal/challenging)

TODO

prepare explanation of regression assumptions and diagnostics
include transformations at the end (see https://ds100.org/course-notes/loss-transformations/#transformations-of-linear-models)

Explicitly discuss “why do we focus on/start with regression? (boring/want ML/AI)”

-> It is the simplest form of a model, so the core principles (like fitting/overfitting/…) are easy to see -> sometimes, we extract structured data from big data in order to run regression/ML models
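
A possible classroom illustration of the fitting/overfitting point, as a minimal sketch. The data are synthetic (the true relationship is linear by construction), and the polynomial degrees 1 and 9 are arbitrary choices for illustration. The more flexible model always fits the training data at least as well, but it typically does worse on new data.

```python
import numpy as np

rng = np.random.default_rng(5)

def make_data(n):
    # Synthetic data: the true relationship is linear plus noise
    x = rng.uniform(-1, 1, size=n)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training sample
x_test, y_test = make_data(200)     # fresh data from the same process

def mse(coefs, x, y):
    # Mean squared error of a fitted polynomial on (x, y)
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

results = {}
for degree in (1, 9):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```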

Process:

… in this session, we apply CRISP-DM, using a foundational analytical technique as the example -> modeling: selecting a model and refining it, or comparing different models

Model choice

Clustering (session 2): possible choice (use average price in each cluster as the prediction)
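
The clustering option can be sketched as follows. This is a hedged illustration, not the session 2 notebook: the data are synthetic stand-ins for living area and price, and `k = 4` is an arbitrary choice. Each house is assigned to a cluster, and the prediction is simply the average price in that cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for living area (sq ft) and sale price; not the Ames data
rng = np.random.default_rng(0)
area = rng.uniform(500, 3500, size=300)
price = 20_000 + 110 * area + rng.normal(0, 25_000, size=300)

# Group houses by living area into k clusters (k is an arbitrary choice here)
k = 4
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = kmeans.fit_predict(area.reshape(-1, 1))

# The "model" is one number per cluster: the mean price of its members
cluster_mean_price = np.array([price[labels == c].mean() for c in range(k)])

# Predict a new house: find its cluster, return that cluster's mean price
new_area = np.array([[1800.0]])
new_label = kmeans.predict(new_area)[0]
predicted_price = cluster_mean_price[new_label]
print(f"Cluster {new_label}: predicted price {predicted_price:,.0f}")
```

The prediction is a step function of area (one value per cluster), which makes a useful contrast with the straight line that regression fits.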

Using the Ames Housing dataset, students explore the relationship between living area and price, estimate a simple regression model, and then extend it to multiple predictors. The case illustrates how regression helps quantify marginal effects under ceteris paribus assumptions and highlights the difference between explanation and prediction in observational data.
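
The ceteris paribus point can be made concrete with a small sketch. The data below are synthetic (not Ames), and the data-generating coefficients (60 per unit of area, 20,000 per quality point) are assumptions chosen for illustration. Because area and quality are correlated, the simple regression attributes part of the quality effect to area; the multiple regression holds quality fixed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500

# Synthetic data: larger houses tend to have higher quality ratings
quality = rng.integers(1, 11, size=n).astype(float)   # rating 1..10
area = 600 + 150 * quality + rng.normal(0, 300, size=n)
price = 10_000 + 60 * area + 20_000 * quality + rng.normal(0, 20_000, size=n)

# Simple regression: price on area only
simple = LinearRegression().fit(area.reshape(-1, 1), price)

# Multiple regression: price on area and quality
multiple = LinearRegression().fit(np.column_stack([area, quality]), price)

print(f"area coefficient, simple model:   {simple.coef_[0]:.1f}")
print(f"area coefficient, multiple model: {multiple.coef_[0]:.1f}")
```

In the multiple model, the area coefficient is read as the expected price change per additional unit of area for houses of the same quality.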

Dataset: Mention Kaggle as a great source of datasets. For each dataset, it offers descriptive overviews (EDA) and links to existing analyses (tab: Code)

Notebook demonstration: run regression without NA-filling and correct afterward.

“Solution”


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset (download from Kaggle and save it locally as data/ames.csv)
df = pd.read_csv("data/ames.csv")

df.columns


# Select numeric predictors
features = [
    "area",
    "Overall.Qual",
]

X = df[features]
y = df["price"]


# Demo: first run the fit without the next line and read the error
# (LinearRegression does not accept NaN), then add fillna before the model
X = X.fillna(X.median())


# Fit linear regression
model = LinearRegression()
model.fit(X, y)

# Show coefficients
coef_df = pd.DataFrame({
    "Feature": features,
    "Coefficient": model.coef_
})
print(coef_df)

Overall model performance

Whether predictions are accurate enough for decisions

-> depends on the number of cases and how significant each case is (e.g., large samples: estimating age based on names)
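
A minimal sketch of how overall performance could be measured on held-out data. The data are synthetic, and the 25% test share is an arbitrary assumption. RMSE is in the same unit as the target (here, price), so it can be compared directly with the accuracy a decision requires; R² is the share of variance explained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for living area and price; not the Ames data
rng = np.random.default_rng(2)
n = 400
area = rng.uniform(500, 3500, size=n)
price = 20_000 + 110 * area + rng.normal(0, 30_000, size=n)
X = area.reshape(-1, 1)

# Hold out a test set so that performance is measured on unseen houses
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # typical error in price units
r2 = r2_score(y_test, y_pred)                       # share of variance explained
print(f"RMSE: {rmse:,.0f}   R^2: {r2:.3f}")
```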

Categorical predictors

TODO: Why should we not include all levels for a categorical variable? (include explanation here)
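
One possible answer to the question above, as a sketch: if the model has an intercept and we add one dummy column per level, the dummy columns sum to the intercept column. The design matrix is then perfectly collinear (the "dummy variable trap"), and the coefficients are not uniquely identified. Dropping one level makes it the reference category, and the remaining coefficients are read as differences from it. The `neighborhood` column below is a hypothetical example.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical variable with three levels
df = pd.DataFrame({"neighborhood": ["A", "B", "C", "A", "B", "C", "A", "C"]})
intercept = np.ones(len(df))

# All levels: the three dummy columns sum to the intercept column
all_levels = pd.get_dummies(df["neighborhood"], dtype=float)
X_all = np.column_stack([intercept, all_levels.to_numpy()])
rank_all = np.linalg.matrix_rank(X_all)

# One level dropped: level "A" becomes the reference category
dropped = pd.get_dummies(df["neighborhood"], drop_first=True, dtype=float)
X_drop = np.column_stack([intercept, dropped.to_numpy()])
rank_drop = np.linalg.matrix_rank(X_drop)

print(f"all levels:  {X_all.shape[1]} columns, rank {rank_all}")
print(f"drop first:  {X_drop.shape[1]} columns, rank {rank_drop}")
```

A rank below the number of columns means at least one column is redundant.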

Exercise

Why does it make sense to “test” different libraries?
- Check whether they yield the same results (did we specify the model correctly? Is there an error in the library?)
- Generally, validating results is an important skill
- Some libraries may offer more comprehensive output (like statsmodels)
- We may need to work in different environments (R or Python) in the future, so it makes sense to understand different options
- Understand what preprocessing steps must be done manually (before the library call), e.g., statsmodels does the encoding of categorical variables internally (specified through the formula)

For statsmodels output, we may need to click on the “scrollable” option to see it in full (the notebook truncates long output for readability)
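
A minimal sketch of such a cross-check. To keep it dependency-light, it compares scikit-learn with `numpy.linalg.lstsq` rather than statsmodels; that substitution, the synthetic data, and the coefficients used to generate it are all assumptions for illustration. It also shows a manual preprocessing step: NumPy needs the intercept column to be added by hand, while `LinearRegression` adds it internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two numeric predictors
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, size=n)

# Library 1: scikit-learn fits the intercept internally
sk = LinearRegression().fit(X, y)

# Library 2: NumPy least squares; the intercept column is added manually
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# The two estimates should agree up to floating-point precision
same_intercept = np.allclose(sk.intercept_, beta[0])
same_slopes = np.allclose(sk.coef_, beta[1:])
print("intercepts match:", same_intercept, "| slopes match:", same_slopes)
```

If the results differed, the first things to check would be the model specification (intercept, encoding, missing values) before suspecting the library.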

Insight: model coefficients change in subgroups
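
A hedged sketch of this insight. The data are synthetic, and the `segment` column with its two levels is hypothetical: the price per unit of area is set to differ between segments by construction, so a separate regression per subgroup recovers different slopes, while the pooled model reports a single slope in between.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 300

# Synthetic data with a hypothetical subgroup column
df = pd.DataFrame({
    "area": rng.uniform(500, 3500, size=n),
    "segment": rng.choice(["standard", "premium"], size=n),
})
# By construction, area is worth more per unit in the premium segment
true_slope = df["segment"].map({"standard": 80.0, "premium": 150.0})
df["price"] = 30_000 + true_slope * df["area"] + rng.normal(0, 20_000, size=n)

# Fit one regression per subgroup
slopes = {}
for segment, group in df.groupby("segment"):
    fitted = LinearRegression().fit(group[["area"]], group["price"])
    slopes[segment] = fitted.coef_[0]

# Fit one pooled regression on all rows
pooled = LinearRegression().fit(df[["area"]], df["price"]).coef_[0]

for segment, slope in slopes.items():
    print(f"{segment}: {slope:.1f}")
print(f"pooled: {pooled:.1f}")
```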

Note: always exclude unique identifiers from regression models.

Materials

Example: https://github.com/collinprather/ISLR-Python/blob/master/Chapter%203%20Linear%20Regression.ipynb
MAFS6010u_Regression see pairwise scatter and multiple linear regression.ipynb
Check: https://github.com/lmarti/machine-learning/blob/master/02.%20Linear%20regression.ipynb