Notes: Session 8: Big data 1

2026-05-04: 95 min (lecture)

Time (min)	Duration (min)	Topic
0–30	30	Big Data / 4V
30–50	20	Architecture
50–90	40	Text analytics

TODO:

Extend the text analytics slides to illustrate “Relevance today”: moving towards a system perspective, use as input to LLMs, RAG (explain), and information retrieval (e.g., Obsidian).
Explain how big data environments reshape model development, validation, and deployment. This creates continuity with the generalization problem, overfitting, and model evaluation.

Text analytics: Sparse vectors

Also known as first-generation or bag-of-words models.

TF-IDF:

Explain how \(idf_i = \log\frac{N}{df_i}\) changes:

When a term appears in almost every document, \(\frac{N}{df_i}\) approaches 1, the log approaches 0, and \(tf_{ij}\) receives a low weight in TF-IDF.
When a term is very rare, \(\frac{N}{df_i}\) becomes large and the term receives more weight in TF-IDF. At the same time, the log dampens this growth, so outliers (very rare terms) do not have an excessive influence on TF-IDF.
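
The two cases above can be made concrete with a minimal sketch. It assumes \(idf_i = \log_{10}\frac{N}{df_i}\) and a hypothetical corpus size of N = 1000, chosen only for illustration; the function name `idf` is likewise an assumption.

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency, assuming idf = log10(N / df)."""
    return math.log10(n_docs / doc_freq)

N = 1000  # hypothetical corpus size, for illustration only

# From a term in every document (df = N) down to a term in a single document
for df in (1000, 500, 100, 10, 1):
    print(f"df={df:>4}  N/df={N / df:>6.0f}  idf={idf(N, df):.2f}")
```

While \(\frac{N}{df_i}\) grows from 1 to 1000, the idf only grows from 0 to 3, which is the damping effect of the log.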

Example with a mini corpus after preprocessing

Document	Tokens
D1	`service helpful quick experience`
D2	`service experience delay shipping`
D3	`service rude refund`

Vocabulary:

[delay, experience, helpful, quick, refund, rude, service, shipping]

Sparse vector for D1:

delay	experience	helpful	quick	refund	rude	service	shipping
0	1	1	1	0	0	1	0
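
The sparse vectors for all three documents could be derived as in the following sketch. It assumes whitespace tokenization of the already preprocessed text and raw term counts; the helper name `to_sparse_vector` is hypothetical.

```python
# Mini corpus after preprocessing
docs = {
    "D1": "service helpful quick experience",
    "D2": "service experience delay shipping",
    "D3": "service rude refund",
}

# Alphabetically sorted vocabulary over all tokens in the corpus
vocabulary = sorted({token for text in docs.values() for token in text.split()})

def to_sparse_vector(text: str) -> list[int]:
    """Count how often each vocabulary term occurs in the text."""
    tokens = text.split()
    return [tokens.count(term) for term in vocabulary]

vectors = {name: to_sparse_vector(text) for name, text in docs.items()}

print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

With only eight vocabulary terms the zeros are still manageable; with a realistic vocabulary of tens of thousands of terms, almost every entry in a document vector is zero.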

Note: \(\log_{10}\) is used here for ease of calculation; libraries often use the natural logarithm.

Note: `service` occurs in all three documents, so \(idf = \log_{10}\frac{3}{3} = 0\) and its TF-IDF weight is 0 in every document.
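
A minimal sketch of the TF-IDF calculation for the mini corpus follows. It assumes raw term counts for \(tf_{ij}\) and \(idf_i = \log_{10}\frac{N}{df_i}\) without smoothing; the variable names are illustrative only.

```python
import math

# Mini corpus after preprocessing
docs = {
    "D1": "service helpful quick experience",
    "D2": "service experience delay shipping",
    "D3": "service rude refund",
}

tokenized = {name: text.split() for name, text in docs.items()}
vocabulary = sorted({t for tokens in tokenized.values() for t in tokens})
N = len(docs)

# Document frequency: number of documents that contain the term
df = {term: sum(term in tokens for tokens in tokenized.values()) for term in vocabulary}

# Inverse document frequency, assuming log10(N / df) without smoothing
idf = {term: math.log10(N / df[term]) for term in vocabulary}

# TF-IDF weight: raw term count times idf
tfidf = {
    name: {term: tokens.count(term) * idf[term] for term in vocabulary}
    for name, tokens in tokenized.items()
}

for name, weights in tfidf.items():
    print(name, {term: round(w, 3) for term, w in weights.items() if w > 0})
```

Terms that occur in one document get \(\log_{10} 3 \approx 0.477\), `experience` (two documents) gets \(\log_{10} 1.5 \approx 0.176\), and `service` drops out with weight 0. Library implementations may differ: scikit-learn's `TfidfVectorizer`, for example, smooths the idf and adds 1 by default, so a term that occurs in every document does not receive a weight of exactly 0 there.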

Most entries are zero because each document only uses a small part of the full vocabulary.

Highlight that dense representations allow us to determine semantic similarity based on mathematical operations.
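
One such mathematical operation is cosine similarity. The sketch below uses hypothetical 3-dimensional embeddings whose values are invented for illustration only (real dense vectors come from a trained model and have hundreds of dimensions); the word `friendly` is not part of the mini corpus.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical dense vectors, invented for illustration only
embeddings = {
    "helpful": [0.9, 0.8, 0.1],
    "friendly": [0.8, 0.9, 0.2],
    "refund": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(embeddings["helpful"], embeddings["friendly"])
sim_unrelated = cosine_similarity(embeddings["helpful"], embeddings["refund"])

print(f"helpful vs. friendly: {sim_related:.2f}")
print(f"helpful vs. refund:   {sim_unrelated:.2f}")
```

In a bag-of-words representation, two different terms occupy different dimensions and therefore have no similarity at all; dense vectors allow graded similarity between related terms.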

2026-05-04: 70 min

TODO

Reduce the IMDB dataset to 1500 or 2000? (faster computation of the dense vectors)

Check for NLP assignments: https://github.com/microsoft/ai-for-beginners and https://github.com/iam-salma/NLP-Bootcamp-with-python