Notes: Session 8: Big data 1

2026-05-04: 95 min (lecture)

Time (min) Duration Topic Additional materials
0–30 30 Big Data / 4V
30–50 20 Architecture
50–90 40 Text analytics

TODO:

Text analytics: Sparse vectors

Also known as first-generation or bag-of-words models.

TF-IDF:

Explain how \(idf_i = \log(\frac{N}{df_i})\) changes:

  • when a term occurs in almost every document, \(\frac{N}{df_i}\) gets smaller and approaches 1, the log approaches 0, and the term receives a low TF-IDF weight regardless of \(tf_{ij}\)
  • when a term is very rare, \(\frac{N}{df_i}\) gets large and the term receives more weight in the TF-IDF; at the same time, the log dampens this growth so that outliers (very rare terms) do not have an excessive influence on the TF-IDF
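The behaviour of the idf can be shown with a few numbers. This is a minimal sketch, assuming the base-10 definition \(idf_i = \log_{10}(\frac{N}{df_i})\) used in the worked example in these notes; the corpus size of 1000 documents and the function name `idf` are made up for illustration.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / doc_freq)


N = 1000  # hypothetical corpus size, chosen only for illustration

# From a term that occurs in every document to a term that occurs once
for df in (1000, 500, 10, 1):
    print(f"df = {df:>4}  N/df = {N / df:>6.1f}  idf = {idf(N, df):.2f}")
```

The ratio \(\frac{N}{df_i}\) grows from 1 to 1000, while the idf only grows from 0 to 3. This is the dampening effect of the logarithm.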

Example with a mini corpus after preprocessing

Document Tokens
D1 service helpful quick experience
D2 service experience delay shipping
D3 service rude refund


Vocabulary:

[delay, experience, helpful, quick, refund, rude, service, shipping]


Sparse vector for D1:

delay  experience  helpful  quick  refund  rude  service  shipping
0      1           1        1      0       0     1        0


Representation Vector for D1
Binary occurrence [0, 1, 1, 1, 0, 0, 1, 0]
Term occurrence [0, 1, 1, 1, 0, 0, 1, 0]
Term frequency [0, 0.25, 0.25, 0.25, 0, 0, 0.25, 0]
TF-IDF [0, 0.18, 0.48, 0.48, 0, 0, 0, 0]

Note: using \(\log_{10}\) for ease of calculation; libraries often use the natural logarithm. The weights use the term occurrence (here 1) as \(tf_{ij}\), not the relative term frequency.

  • \(w_{Experience;D1} = 1 \times \log_{10}(\frac{3}{2}) \approx 0.18\)
  • \(w_{Helpful;D1} = 1 \times \log_{10}(\frac{3}{1}) \approx 0.48\)
  • \(w_{Quick;D1} = 1 \times \log_{10}(\frac{3}{1}) \approx 0.48\)
  • \(w_{Service;D1} = 1 \times \log_{10}(\frac{3}{3}) = 0\)

Note: Service occurs in all three documents, so \(idf = 0\) and its TF-IDF weight is 0 in every document; the term no longer helps to distinguish the documents.

Most entries are zero because each document only uses a small part of the full vocabulary.
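The four representations of the mini corpus can be reproduced in a few lines. This is a sketch in plain Python, assuming \(\log_{10}\) and the term occurrence as \(tf_{ij}\), as in the worked example; the helper name `vectors` is hypothetical, and library implementations typically use a different logarithm and smoothing, so their numbers will differ.

```python
import math

# Mini corpus after preprocessing
corpus = {
    "D1": ["service", "helpful", "quick", "experience"],
    "D2": ["service", "experience", "delay", "shipping"],
    "D3": ["service", "rude", "refund"],
}

# Alphabetically sorted vocabulary over all documents
vocabulary = sorted({term for tokens in corpus.values() for term in tokens})
n_docs = len(corpus)

# Document frequency: number of documents that contain the term
doc_freq = {
    term: sum(term in tokens for tokens in corpus.values())
    for term in vocabulary
}


def vectors(tokens):
    """Return binary occurrence, term occurrence, term frequency, TF-IDF."""
    counts = [tokens.count(term) for term in vocabulary]
    binary = [int(count > 0) for count in counts]
    term_freq = [count / len(tokens) for count in counts]
    tfidf = [
        round(count * math.log10(n_docs / doc_freq[term]), 2)
        for count, term in zip(counts, vocabulary)
    ]
    return binary, counts, term_freq, tfidf


print(vocabulary)
for name, vector in zip(
    ("Binary occurrence", "Term occurrence", "Term frequency", "TF-IDF"),
    vectors(corpus["D1"]),
):
    print(f"{name:<18} {vector}")
```

Calling `vectors(corpus["D2"])` or `vectors(corpus["D3"])` gives the vectors for the other two documents; service has a TF-IDF weight of 0 in all of them.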

Dense representations

Highlight that dense representations allow us to determine semantic similarity with mathematical operations on the vectors (e.g., cosine similarity).
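One common operation for comparing dense vectors is the cosine similarity. The sketch below only illustrates the idea: the three-dimensional vectors are invented for this example and do not come from a trained model, and the word "fast" is not part of the mini corpus.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical dense vectors, made up for illustration only
embeddings = {
    "quick": [0.9, 0.1, 0.2],
    "fast": [0.8, 0.2, 0.1],
    "refund": [0.1, 0.9, 0.3],
}

sim_related = cosine_similarity(embeddings["quick"], embeddings["fast"])
sim_unrelated = cosine_similarity(embeddings["quick"], embeddings["refund"])
print(f"quick vs. fast:   {sim_related:.2f}")
print(f"quick vs. refund: {sim_unrelated:.2f}")
```

With these made-up vectors, quick is closer to fast than to refund. In a sparse bag-of-words representation, quick and fast would be unrelated dimensions, so no such similarity could be read off the vectors.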

Exercises

2026-05-04: 70 min

TODO

  • Reduce the IMDB dataset to 1500 or 2000? (faster computation of the dense vectors)

Materials

Check for NLP assignments:

  • https://github.com/microsoft/ai-for-beginners
  • https://github.com/iam-salma/NLP-Bootcamp-with-python