Notes: Session 7: Machine learning 2

2026-04-20: 95 min (skipped regression trees and support vector regression)

Time (min)	Duration	Topic	Additional materials
0–90	90	TODO

TODO:

Better prepare the explanation of decision trees (entropy, information gain) (see https://www.cs.cmu.edu/~aarti/Class/10315_Fall20/recs/DecisionTreesBoostingExampleProblem.pdf)
Check regression trees carefully
Prepare explanation and visualization of SVMs (on the slides)

Decision trees

ID3 algorithm (entropy)

1) Intuition and Definition

Entropy is a measure of how mixed or impure the labels in a dataset are.

If all observations belong to the same class → low entropy (0) → perfectly pure
If observations are evenly distributed across classes → high entropy → maximum uncertainty

Formally, for a dataset with \(K\) classes and class probabilities \(p_1, \dots, p_K\):

\[H = - \sum_{i=1}^{K} p_i \log_2 p_i\]

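This captures the expected amount of information (uncertainty) in the labels, measured in bits. By convention, \(0 \log_2 0 = 0\), so a class that does not occur contributes nothing and a perfectly pure dataset has entropy 0.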
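The definition can be checked numerically. The following is a minimal sketch using only the standard library; the function name `entropy` and the three example label lists are made up for illustration. It uses the equivalent form \(p \log_2(1/p)\) of each term.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    # Only classes that actually occur appear in `counts`, so p > 0 here.
    # -p * log2(p) is written as p * log2(1 / p), which is the same value.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical label sets for illustration
pure = ["yes"] * 8                   # one class only
mixed = ["yes"] * 4 + ["no"] * 4     # evenly split between two classes
skewed = ["yes"] * 6 + ["no"] * 2    # mostly one class

print("pure:  ", entropy(pure))      # 0.0
print("mixed: ", entropy(mixed))     # 1.0
print("skewed:", round(entropy(skewed), 4))
```

The pure set has entropy 0, the even two-class split reaches 1 bit, and the skewed set lies in between.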

2) Key Teaching Insight: What Is the Maximum Entropy?

A common misconception is that entropy always lies between 0 and 1. With base-2 logarithms, this holds only for binary classification. In general, the upper bound depends on the number of classes \(K\).

Entropy is maximized when all classes are equally likely:

\[p_1 = p_2 = \dots = p_K = \frac{1}{K}\]

Substituting into the entropy formula:

\[H = - \sum_{i=1}^{K} \frac{1}{K} \log_2 \left(\frac{1}{K}\right)\]

Now simplify:

\[H = - K \cdot \frac{1}{K} \cdot \log_2 \left(\frac{1}{K}\right)\]

\[H = - \log_2 \left(\frac{1}{K}\right)\]

Using the log rule:

\[\log_2 \left(\frac{1}{K}\right) = -\log_2 K\]

So:

\[H = \log_2 K\]

Important takeaway:

Maximum entropy is not fixed
It grows with the number of classes
For example:
- \(K = 2 \Rightarrow H = 1\)
- \(K = 4 \Rightarrow H = 2\)
- \(K = 8 \Rightarrow H = 3\)

4) Python Illustration (Entropy vs. Number of Classes)
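A minimal sketch of such an illustration, assuming class probabilities are given directly as a list. The helper name `entropy_from_probs` and the example distributions are hypothetical. It compares the entropy of a uniform distribution with \(\log_2 K\) for several values of \(K\).

```python
import math

def entropy_from_probs(probs):
    """Entropy in bits; terms with p = 0 are skipped (0 * log2(0) := 0)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Uniform distributions: entropy should equal log2(K)
for k in (2, 4, 8, 16):
    uniform = [1 / k] * k
    print(f"K={k:2d}  H(uniform)={entropy_from_probs(uniform):.3f}  log2(K)={math.log2(k):.3f}")

# A non-uniform distribution over 4 classes stays below log2(4) = 2
skewed = [0.7, 0.1, 0.1, 0.1]
print("H(skewed, K=4) =", round(entropy_from_probs(skewed), 3))
```

The printed values for the uniform case match \(\log_2 K\), while the skewed four-class distribution stays below 2 bits.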

Entropy is best understood as:

“A measure of how uncertain we are about the class label.”

Low entropy → clear, predictable classification
High entropy → mixed, uncertain data
Maximum entropy = \(\log_2(K)\) → depends on how many classes exist

This is exactly why decision trees aim to reduce entropy through splits (information gain).
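To make the link to information gain concrete, here is a small sketch. It assumes the usual definition: parent entropy minus the size-weighted average entropy of the child nodes. The function names and the toy splits are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical parent node: 4 "yes" and 4 "no" labels (entropy = 1 bit)
parent = ["yes"] * 4 + ["no"] * 4

perfect_split = [["yes"] * 4, ["no"] * 4]                      # both children pure
partial_split = [["yes"] * 3 + ["no"], ["yes"] + ["no"] * 3]   # children mostly pure
useless_split = [["yes", "yes", "no", "no"]] * 2               # children as mixed as parent

for name, split in [("perfect", perfect_split),
                    ("partial", partial_split),
                    ("useless", useless_split)]:
    print(name, round(information_gain(parent, split), 4))
```

A tree-growing algorithm such as ID3 would prefer the split with the largest gain, here the perfect split.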

SVM

Maximal margin classifier

Small M&A teams are successful, large teams unsuccessful.

A simple baseline would place the threshold halfway between the class means. The maximal margin classifier instead places the threshold halfway between the closest opposing cases and maximizes the distance to both classes.

Figure 1: A naive midpoint between class means versus the maximal-margin threshold

Teaching point: The midpoint between means uses the class centers. The maximal margin classifier uses the edge cases that define the safest separating threshold. This matches the standard SVM intuition that only the nearest points determine the margin.
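The difference between the two thresholds can be sketched in one dimension. The team sizes below are invented for illustration and are not taken from the slides.

```python
from statistics import mean

# Hypothetical team sizes (one feature, perfectly separable)
small_success = [3, 4, 5, 9]
large_failure = [13, 20, 22, 25]
all_teams = small_success + large_failure

def min_distance(threshold, points):
    """Distance from the threshold to the closest data point."""
    return min(abs(x - threshold) for x in points)

# Naive baseline: halfway between the class means
mean_midpoint = (mean(small_success) + mean(large_failure)) / 2

# Maximal margin: halfway between the closest opposing cases
support_low = max(small_success)    # largest successful team
support_high = min(large_failure)   # smallest unsuccessful team
max_margin_threshold = (support_low + support_high) / 2
margin = (support_high - support_low) / 2

print("midpoint of means:   ", mean_midpoint,
      "-> closest point at distance", min_distance(mean_midpoint, all_teams))
print("max-margin threshold:", max_margin_threshold,
      "-> closest point at distance", min_distance(max_margin_threshold, all_teams))
```

With these numbers, the midpoint of the means ends up very close to the smallest unsuccessful team, whereas the maximal-margin threshold keeps the same distance to the nearest point on each side.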

Support vector classifier (soft margin)

Now allow slight overlap:

one relatively small team is unsuccessful
one relatively large team is successful

The support vector classifier keeps a central boundary, but also allows some points to lie inside the margin or even on the wrong side of the boundary.

Figure 2: Soft-margin classifier: margin bands, an in-margin point, and a misclassified point

Teaching point: This is the most important soft-margin visual: one point can be inside the margin but still correctly classified, while another can be across the boundary and misclassified. That is exactly the role of slack variables in the standard support vector classifier formulation.
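The role of the slack variables can be sketched as follows. The boundary and margin width are fixed by hand here (they are assumptions, not the result of fitting a classifier), and the team sizes are hypothetical. The slack of a point is computed as \(\xi = \max(0, 1 - y \cdot f(x))\), where \(f(x) = \pm 1\) marks the margin edges.

```python
# Hypothetical data: (team size, label), +1 = successful, -1 = unsuccessful
teams = [(3, +1), (5, +1), (7, +1), (14, +1),       # 14: relatively large but successful
         (13, -1), (18, -1), (22, -1), (25, -1)]    # 13: relatively small but unsuccessful

THRESHOLD = 12.0    # assumed decision boundary (not fitted)
HALF_WIDTH = 4.0    # assumed margin half-width, so the margin edges are at 8 and 16

def decision_value(size):
    """Positive on the 'successful' side; +1 and -1 at the margin edges."""
    return (THRESHOLD - size) / HALF_WIDTH

def slack(size, label):
    """Slack xi = max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - label * decision_value(size))

def describe(xi):
    if xi == 0:
        return "on or outside the margin"
    if xi <= 1:
        # xi == 1 would lie exactly on the boundary
        return "inside the margin, correctly classified"
    return "wrong side of the boundary (misclassified)"

for size, label in teams:
    xi = slack(size, label)
    print(f"size={size:2d} label={label:+d} slack={xi:.2f} -> {describe(xi)}")
```

In this setup, the unsuccessful team of size 13 sits inside the margin but on the correct side, while the successful team of size 14 is on the wrong side of the boundary.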

Support Vector Machines (Kernel trick for nonlinear patterns)

Change the M&A example. Assume:

small → unsuccessful
mid-sized → successful (“sweet spot”)
large → unsuccessful

Figure 3: Nonlinear pattern: only mid-sized M&A teams are successful

Note: linear separation is no longer possible.

Transformation (Quadratic transformation)

Figure 4: Quadratic transformation enables linear separation

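Teaching point: A nonlinear transformation can make a pattern linearly separable in the transformed space. The kernel trick lets the SVM work in that space without computing the transformed features explicitly.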
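A small sketch of the quadratic transformation, written with an explicit feature map \(x \mapsto (x, x^2)\) rather than a kernel. The team sizes and the weights of the separating line are hand-picked for illustration; nothing is learned here.

```python
# Hypothetical data: (team size, label), 1 = successful, 0 = unsuccessful
teams = [(2, 0), (3, 0), (4, 0),              # small
         (8, 1), (9, 1), (10, 1), (11, 1),    # mid-sized "sweet spot"
         (16, 0), (18, 0), (20, 0)]           # large

def separable_by_single_threshold(data):
    """In 1D, one threshold suffices only if the label changes at most once along x."""
    labels = [label for _, label in sorted(data)]
    changes = sum(a != b for a, b in zip(labels, labels[1:]))
    return changes <= 1

def feature_map(x):
    """Quadratic transformation: one feature becomes two."""
    return (x, x ** 2)

# Hand-picked linear rule in the transformed space (an assumption for this example):
# score = 19*x - x^2 - 78 = -(x - 6)(x - 13), positive only for 6 < x < 13
W = (19.0, -1.0)
B = -78.0

def linear_score(z):
    return W[0] * z[0] + W[1] * z[1] + B

predictions = [int(linear_score(feature_map(x)) > 0) for x, _ in teams]
labels = [label for _, label in teams]

print("separable by one threshold in 1D:", separable_by_single_threshold(teams))
print("linear rule in (x, x^2) matches all labels:", predictions == labels)
```

The rule is linear in the two transformed features, yet it corresponds to an interval (between 6 and 13) on the original team-size axis.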

Short wrap-up

Maximal margin classifier → requires perfectly separable classes
Support vector classifier → allows margin violations and some misclassifications
Support vector machine → handles nonlinear patterns via kernels

Exercises

2026-04-20: 80 min (Data Analytics with Python.pdf : Ridge Regression (p.57), Decision Trees and Random Forests (p.28), and Support Vector Regression (p.20))

TODO: add to exercise: A model trained on scaled data only works with inputs that are scaled in exactly the same way as the training data. It is therefore necessary to keep the fitted scaler and reuse it for new data.
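A possible starting point for this exercise, as a sketch under stated assumptions: the class `SimpleStandardScaler` is a hypothetical stand-in written with the standard library only, and the numbers are invented. In the exercises, the fitted scaler object of the library in use would play the same role.

```python
import json
from statistics import mean, pstdev

class SimpleStandardScaler:
    """Hypothetical minimal scaler: stores mean and standard deviation of the training data."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

# Fit the scaler on the training data only
train = [2.0, 4.0, 6.0, 8.0]
scaler = SimpleStandardScaler().fit(train)

# Keep the scaler: store its parameters and restore them later
stored = json.dumps({"mean": scaler.mean_, "std": scaler.std_})
params = json.loads(stored)
restored = SimpleStandardScaler()
restored.mean_, restored.std_ = params["mean"], params["std"]

new_data = [5.0, 10.0]
right = restored.transform(new_data)                             # same scaling as in training
wrong = SimpleStandardScaler().fit(new_data).transform(new_data) # refitted: different scaling

print("reusing the training scaler:", right)
print("refitting on the new data:  ", wrong)
```

The two results differ, so a model trained on `train` scaled with the first scaler would receive inconsistent inputs if the scaler were refitted on new data.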

Materials

Lectures: https://harvard-iacs.github.io/2019-CS109A/lectures/lecture15/presentation/Lecture15_Decision_Trees.pdf

https://medium.com/data-science/entropy-how-decision-trees-make-decisions-2946b9c18c8