Notes: Session 1: The rise of analytics
- Prepare/bring tags for groups
- Prepare a Teams session for students to share their work
Lecture
Note: Organizational matters: 20 min
| Time (min) | Duration (min) | Topic | Additional materials |
|---|---|---|---|
| 20-35 | 15 | More data | |
| 35-50 | 15 | More computing power | |
| 50-65 | 15 | New algorithms | |
| 65-80 | 15 | New analytics processes | |
In this session, our goal is to explain why modern data analytics is successful, with reference to and examples from the four areas.
Focus on algorithms (enable more elaborate models and analyses)
- AlphaFold: also illustrates how science and algorithmic competitions drive progress
- AlphaFold enables a range of commercial use cases in the pharmaceutical and biotech industries ()
Analytical processes
Key trends of improvement:
- Maturing: from descriptive to predictive and prescriptive (reducing ambiguity and the need for human involvement in business decisions)
- Pervasive: extending to different areas (e.g., understanding customers with A/B testing), departments (logistics, financial, …) with specific disciplines refining more specialized models (forecasting, supply chain, scheduling, queueing, …)
- Standardized: algorithms and processes are shared and standardized across companies and industries (e.g., ML/LLMs; analytics software/environments; governance models like CRISP-DM) -> intensifies competition
CRISP-DM: most widely used analytics model (https://www.forbes.com/sites/metabrown/2015/07/29/what-it-needs-to-know-about-the-data-mining-process/#2065f3a3515f)
Transition: CRISP-DM is a well-established model for data analytics, so it also serves as a structure for this course…
Exercise
| Time (min) | Duration (min) | Topic | Additional materials |
|---|---|---|---|
| 0-30 | 30 | Introduction and setup | |
| 30-90 | 60 | Data handling in Python | |
Distribute tags (1.1, 1.1, 1.2, 1.2, 2.1, 2.1, 2.2, 2.2) -> work in pairs.
I will work with the notebook group.
Benefits of jupyter notebooks
- one document instead of multiple (easier to keep in sync, harder for files to get lost; this matters more the more complex the analysis is and the more “moving parts” it has)
- option to collapse/hide cells (useful when communicating with business stakeholders)
- notebook scaffolding also gives context to LLMs
Jupyter notebooks are the standard environment in data science courses worldwide.
Setup
Ask students to create a GitHub account
Explain the Jupyter Notebooks and GitHub setup. Mention that students can always use software like Spyder instead (see Tipps im Umgang mit Spyder.docx).
Start in groups of two (random?)
Introduce the badge system - similar to https://eduki.com/de/material/306565/schilder-fragen-fertig-ich-arbeite (check the classrooms beforehand - can the badges be attached to the desks?)
“Simple” Amazon question: you could answer yes, but you would get no points for it. All exam questions are about selecting the appropriate concepts from the lecture and applying them to the case. Explain: this signals the need for a rationale.
- "yes": 0 points
- "no - it is about data/computation/algorithms/processes": 2 points
- "yes, but it's only one part of the equation; it is enabled by large-scale data collection about customers, the computational resources in cloud centers like AWS, new algorithms such as deep learning, and analytical processes like CRISP-DM and mature prescriptive capabilities": 5 points
TODO: explicitly address why we select Python/Jupyter (give an overview of the landscape); argue that Python is challenging (not a low/no-code platform), very popular (supports many analytical use cases), and lets you quickly learn other tools
-> LLMs are language models: they are good at handling language, not necessarily at handling data. So if we use a programming language to analyze data, LLMs can help us - more than they could help us operate a GUI. LLMs are not directly trained on user-GUI interactions (such workflows are weakly documented and harder to analyze, version, and control).
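A minimal sketch of the kind of data handling the exercise targets (the order data below is made up purely for illustration; pandas is assumed to be installed):

```python
import pandas as pd

# Hypothetical order data, for illustration only
orders = pd.DataFrame({
    "customer": ["A", "B", "A", "C"],
    "amount": [120.0, 35.5, 80.0, 210.0],
})

# Typical first steps in a notebook: inspect, filter, aggregate
print(orders.head())
large = orders[orders["amount"] > 100]               # filter rows
totals = orders.groupby("customer")["amount"].sum()  # aggregate per customer
print(totals)
```

Each step can live in its own notebook cell, so intermediate results stay inspectable.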
Useful Jupyter Notebook Tricks
Autocompletion
- Tab → autocomplete variables, functions, file paths
- Shift + Tab → show function documentation
- Shift + Tab (twice) → expanded documentation
Inspect objects
- variable? → quick help
- variable?? → show source code (if available)
Example:
pd.read_csv?
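The `?` syntax is IPython-specific; under the hood it draws on Python's `inspect` module. A plain-Python sketch of roughly what `variable?` retrieves (the `clean_column` function is a made-up example):

```python
import inspect

def clean_column(name: str) -> str:
    """Normalize a column name."""
    return name.strip().lower().replace(" ", "_")

# `clean_column?` in Jupyter shows roughly the signature and docstring:
sig = str(inspect.signature(clean_column))
doc = inspect.getdoc(clean_column)
print(sig)  # (name: str) -> str
print(doc)  # Normalize a column name.
```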
List variables in memory
- %who → list variables
- %whos → list variables with type and size
Similarly: “Jupyter variables” button
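`%whos` essentially walks the interpreter's namespace; a plain-Python approximation using `globals()` (the helper name `whos_like` and the sample variables are made up for this sketch):

```python
import sys

counts = [1, 2, 3]
label = "orders"

def whos_like(namespace):
    """Rough equivalent of %whos: map user variables to (type name, size)."""
    return {
        name: (type(value).__name__, sys.getsizeof(value))
        for name, value in namespace.items()
        if not name.startswith("_")
        and not callable(value)
        and not isinstance(value, type(sys))  # skip imported modules
    }

summary = whos_like(globals())
for name, (tname, size) in summary.items():
    print(f"{name:10} {tname:8} {size} bytes")
```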
Reset notebook variables
%reset
Two particularly useful ones for beginners
If you only show two, I recommend:
- Tab → autocomplete
- %whos → see all variables
These immediately help students understand what data is currently in memory.
Line-by-line execution:
right-click, create console
Notebook exercise
See how far you get: this helps me understand how much students already know.
Ask students who shared their solutions whether they can send them to me so that I can make them available.
Expectation management: you bring different levels of experience with Jupyter Notebooks, so we can also learn from each other, and I believe we should all be very comfortable with Jupyter notebooks by the end of the course.
Jupyter notebooks: Zoom in a lot (close the left sidebar)
Before starting with the question sets: throw your help-card back into the bucket.
Survey
- Use the last field to give feedback on the setup - did Codespaces work for you? Would you prefer to work locally?
- You have seen other courses. Let me know if there is anything (practice, tool, …) that I could learn from one of my colleagues.