Notes: Session 8: Big data 1
2026-05-04: 95 min (lecture)
| Time (min) | Duration | Topic | Additional materials |
|---|---|---|---|
| 0–30 | 30 | Big Data / 4V | |
| 30–50 | 20 | Architecture | |
| 50–90 | 40 | Text analytics | |
TODO:
- Extend the text analytics slides to illustrate “Relevance today”: moving towards a system perspective, use in LLMs (as input), RAG (explain), and information retrieval (e.g., Obsidian)
- Explain how big data environments reshape model development, validation, and deployment. This creates continuity with: Generalization problem, Overfitting, Model evaluation
Text analytics: Sparse vectors
Also known as first-generation or bag-of-words models
TF-IDF:
Explain how \(idf_i\) changes:
- when \(\frac{N}{df_i}\) gets smaller and approaches 1 (the term occurs in almost every document), the log approaches 0 and \(tf_{ij}\) receives a low weight in the TF-IDF
- if the term is very rare, \(\frac{N}{df_i}\) gets large and the term receives more weight in the TF-IDF (at the same time, the log ensures that outliers/very rare terms do not have an excessive influence on TF-IDF)
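The behaviour of \(idf_i\) described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the unsmoothed form \(idf_i = \log_{10}(\frac{N}{df_i})\); the corpus size `N = 1000` and the document frequencies are made-up values, and the helper name `idf` is hypothetical.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Unsmoothed inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / doc_freq)


N = 1000  # hypothetical corpus size, chosen only for illustration

# From a very rare term (df = 1) to a term that occurs in every document (df = N)
for df in (1, 10, 100, 500, 1000):
    print(f"df={df:>4}  N/df={N / df:>7.1f}  idf={idf(N, df):.2f}")
```

The ratio \(\frac{N}{df_i}\) shrinks by a factor of 1000 across these rows, while the idf only falls from about 3 to 0, which is the damping effect of the logarithm.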
Example with a mini corpus after preprocessing
| Document | Tokens |
|---|---|
| D1 | service helpful quick experience |
| D2 | service experience delay shipping |
| D3 | service rude refund |
Vocabulary:
[delay, experience, helpful, quick, refund, rude, service, shipping]
Sparse vector for D1:
| Representation | Vector for D1 |
|---|---|
| Binary occurrence | [0, 1, 1, 1, 0, 0, 1, 0] |
| Term occurrence | [0, 1, 1, 1, 0, 0, 1, 0] |
| Term frequency | [0, 0.25, 0.25, 0.25, 0, 0, 0.25, 0] |
| TF-IDF | [0, 0.18, 0.48, 0.48, 0, 0, 0, 0] |
Note: using \(\log_{10}\) for ease of calculation and the term occurrence (raw count) as \(tf_{ij}\); libraries often use the natural logarithm.
- \(w_{Experience;D1} = 1\times \log_{10}(\frac{3}{2}) = 0.18\)
- \(w_{Helpful;D1} = 1\times \log_{10}(\frac{3}{1}) = 0.48\)
- \(w_{Quick;D1} = 1\times \log_{10}(\frac{3}{1}) = 0.48\)
- \(w_{Service;D1} = 1\times \log_{10}(\frac{3}{3}) = 0\)
Note: Service occurs in all three documents, so \(idf=0\) and its TF-IDF weight is 0 in every document; it does not help to distinguish the documents.
Most entries are zero because each document only uses a small part of the full vocabulary.
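The four representations of D1 can be reproduced with plain Python, which may be useful for checking the table by hand. This is a sketch under the same assumptions as the worked weights: \(\log_{10}\), no smoothing, and the raw term count as the tf factor in TF-IDF. The function name `vectors` is hypothetical, and library implementations typically use different defaults (natural logarithm, smoothing, normalization), so their numbers will differ.

```python
import math

# Mini corpus after preprocessing (taken from the table above)
docs = {
    "D1": "service helpful quick experience".split(),
    "D2": "service experience delay shipping".split(),
    "D3": "service rude refund".split(),
}

# Alphabetically sorted vocabulary over all documents
vocab = sorted({term for tokens in docs.values() for term in tokens})
n_docs = len(docs)

# Document frequency and unsmoothed idf (base-10 logarithm) per term
df = {term: sum(term in tokens for tokens in docs.values()) for term in vocab}
idf = {term: math.log10(n_docs / df[term]) for term in vocab}


def vectors(tokens):
    """Return binary, count, relative-frequency and TF-IDF vectors."""
    counts = [tokens.count(term) for term in vocab]
    binary = [int(c > 0) for c in counts]
    tf = [c / len(tokens) for c in counts]
    # TF-IDF uses the raw count as tf, as in the worked weights above
    tfidf = [round(c * idf[term], 2) for c, term in zip(counts, vocab)]
    return binary, counts, tf, tfidf


binary, counts, tf, tfidf = vectors(docs["D1"])
print(vocab)
print(binary, counts, tf, tfidf, sep="\n")
```

Calling `vectors(docs["D2"])` or `vectors(docs["D3"])` gives the corresponding rows for the other two documents.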
Dense representations
Highlight that dense representations allow us to determine semantic similarity based on mathematical operations.
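One common operation for this is cosine similarity. The sketch below is illustrative only: the three-dimensional vectors are invented by hand and do not come from any trained model, and the word "friendly" is added purely for the example. Real embeddings usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for illustration (not model output)
embeddings = {
    "helpful": [0.9, 0.1, 0.2],
    "friendly": [0.8, 0.2, 0.3],
    "shipping": [0.1, 0.9, 0.4],
}

sim_close = cosine_similarity(embeddings["helpful"], embeddings["friendly"])
sim_far = cosine_similarity(embeddings["helpful"], embeddings["shipping"])
print(f"helpful vs. friendly: {sim_close:.2f}")
print(f"helpful vs. shipping: {sim_far:.2f}")
```

With these made-up values, the first pair scores higher than the second. A sparse bag-of-words representation cannot express this directly, because different terms occupy separate dimensions regardless of their meaning.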
Exercises
2026-05-04: 70 min
TODO
- Reduce the IMDB dataset to 1500 or 2000 samples? (faster computation of the dense vectors)
Materials
Check for NLP assignments:
- https://github.com/microsoft/ai-for-beginners
- https://github.com/iam-salma/NLP-Bootcamp-with-python