Notes: Session 1: The rise of analytics

Preparation
  • Prepare/bring tags for groups
  • Prepare a Teams session for students to share their work

Lecture

Note: organizational intro: 20 min

Time (min)   Duration   Topic                      Additional materials
20-35        15         More data
35-50        15         More computing power
50-65        15         New algorithms
65-80        15         New analytics processes
Objective

In this session, our goal is to explain why modern data analytics is successful, with reference to and examples from the four areas.

Focus on algorithms (enable more elaborate models and analyses)

  • AlphaFold: also illustrates how science and algorithmic competitions drive progress
  • AlphaFold enables a range of commercial use cases in the pharmaceutical and biotech industries ()

Analytical processes

Key trends of improvement:

  • Maturing: from descriptive to predictive and prescriptive (reducing ambiguity and the need for human involvement in business decisions)
  • Pervasive: extending to different areas (e.g., understanding customers with A/B testing), departments (logistics, financial, …) with specific disciplines refining more specialized models (forecasting, supply chain, scheduling, queueing, …)
  • Standardized: algorithms and processes are shared and standardized across companies and industries (e.g., ML/LLMs; analytics software/environments; process models like CRISP-DM) -> intensifies competition

CRISP-DM: most widely used analytics model (https://www.forbes.com/sites/metabrown/2015/07/29/what-it-needs-to-know-about-the-data-mining-process/#2065f3a3515f)

Transition: CRISP-DM is a well-established model for data analytics, so it also serves as a structure for this course…

Exercise

Time (min)   Duration   Topic                      Additional materials
0-30         30         Introduction and setup
30-90        60         Data handling in Python
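The data-handling segment typically starts with loading and inspecting a tabular dataset. A minimal sketch of those first steps, assuming pandas is available (the data and column names are made up for illustration; in class this would start from pd.read_csv on a real file):

```python
import pandas as pd

# Small example frame built in memory (stands in for pd.read_csv("some_file.csv"))
df = pd.DataFrame({
    "customer": ["A", "B", "A", "C"],
    "revenue": [120.0, 80.5, 95.0, 210.0],
})

print(df.head())                                 # first rows
print(df.dtypes)                                 # column types
print(df["revenue"].describe())                  # summary statistics
print(df.groupby("customer")["revenue"].sum())   # simple aggregation
```

These four calls (head, dtypes, describe, groupby) cover most of what students need to get oriented in a new dataset.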

Distribute tags (1.1, 1.1, 1.2, 1.2, 2.1, 2.1, 2.2, 2.2) -> work in pairs.

I will work with the notebook group.

Benefits of Jupyter notebooks

  • one document instead of multiple files (easier to keep in sync, harder for files to get lost; this matters more as analyses grow more complex and have more “moving parts”)

  • option to collapse/hide cells (communicating with business stakeholders)

  • Notebook scaffolding also gives context to LLMs

Jupyter notebooks are the standard environment in data science courses worldwide.

Setup

Ask students to create a GitHub account

Explain the Jupyter notebook and GitHub setup. Mention that students can always use software like Spyder (see Tipps im Umgang mit Spyder.docx).

Start in groups of two (random?)

Introduce the system (badges) - similar to https://eduki.com/de/material/306565/schilder-fragen-fertig-ich-arbeite. Check the classrooms beforehand - can the badges be attached to the desks?

“Simple” Amazon question: you could say yes, but you would get no points for it. All exam questions are about selecting the appropriate concepts from the lecture and applying them to the case. Explain that this signals the need for a rationale:

  • yes: 0 points
  • no - it is about data/computation/algorithms/processes: 2 points
  • yes, but it is only one part of the equation. It is enabled by large-scale data collection about customers, the computational resources in cloud centers like AWS, new algorithms such as deep learning, and analytical processes like CRISP-DM with mature prescriptive capabilities: 5 points

TODO: explicitly address the question why we select Python/jupyter (give an overview of the landscape), argue that Python is challenging (not a low/no-code platform), very popular (supports many analytical use cases), and allows you to quickly learn other tools

-> LLMs are language models: they are good at handling language (not necessarily at handling data). So if we use a programming language to analyze data, LLMs can help us more than they could help us operate a GUI. LLMs are not directly trained on user-GUI interactions (such workflows are weakly documented and harder to analyze/version/control).

Useful Jupyter Notebook Tricks

Autocompletion

  • Tab → autocomplete variables, functions, file paths
  • Shift + Tab → show function documentation
  • Shift + Tab (twice) → expanded documentation

Inspect objects

  • variable? → quick help
  • variable?? → show source code (if available)

Example:

pd.read_csv?
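Outside IPython, the same kind of inspection is available through the standard library. A small sketch using `inspect` (the function `clean_column` is made up for illustration):

```python
import inspect

def clean_column(name: str) -> str:
    """Lowercase a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

# Roughly what `clean_column?` shows: signature and docstring
print(inspect.signature(clean_column))   # (name: str) -> str
print(inspect.getdoc(clean_column))

# Roughly what `clean_column??` shows: the source code (when Python can locate it)
try:
    print(inspect.getsource(clean_column))
except OSError:
    print("source not available")
```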

List variables in memory

  • %who → list variables
  • %whos → list variables with type and size

Similarly: the “Jupyter variables” button
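For reference, a plain-Python approximation of what `%whos` reports (names, types, and a rough size), filtering dunder names and imported modules out of `globals()`:

```python
import sys
import types

# Some example variables to list
x = 42
label = "revenue"
values = [1.5, 2.5, 3.0]

for name, obj in sorted(globals().items()):
    # Skip internals and imported modules, as %whos does
    if name.startswith("_") or isinstance(obj, types.ModuleType):
        continue
    print(f"{name:10} {type(obj).__name__:8} {sys.getsizeof(obj)} bytes")
```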

Reset notebook variables

  • %reset → clear all variables from memory (prompts for confirmation; %reset -f skips the prompt)

Two particularly useful ones for beginners

If you only show two, I recommend:

  • Tab → autocomplete
  • %whos → see all variables

These immediately help students understand what data is currently in memory.

Line-by-line execution:

right-click, create console

Notebook exercise

See how far you get: helps me understand how much students already know.

Ask students who shared their solutions whether they can send them to me so that I can make them available.

Expectation management: you bring different levels of experience with Jupyter Notebooks, so we can also learn from each other, and I believe we should all be very comfortable with Jupyter notebooks by the end of the course.

Jupyter notebooks: Zoom in a lot (close the left sidebar)

Before starting with the question sets: throw your help-card back into the bucket.

Survey

  • Use the last field to give feedback on the setup - did Codespaces work for you? Would you prefer to work locally?
  • You have seen other courses. Let me know if there is anything (practice, tool, …) that I could learn from one of my colleagues.