gantt
dateFormat YYYY-MM-DD
excludes weekends
section Setup
Topic selection :active, 2026-04-14, 7d
section Iterative development
Data, EDA, modeling (iterative) : 2026-04-21, 18d
section Refinement
Refinement :2026-05-12, 7d
section Finalization
Final report and submission :2026-05-19, 9d
Group work
In this group project, you will develop and analyze a realistic business analytics case, including data, modeling, and critical reflection. While many analytics courses rely on publicly available datasets, often centered on societal questions, this project focuses on business-relevant contexts such as management, finance, operations, or digital platforms. To achieve this, you will design and work with synthetic datasets that reflect realistic organizational processes and decision situations. This allows you not only to apply analytical methods but also to engage with the upstream challenge of framing meaningful problems and constructing data that captures them. In addition, you will address downstream challenges such as deployment considerations (e.g., implementation, ethics, and risks). Your work should be grounded in appropriate and reputable references (e.g., academic literature or industry reports) to demonstrate the relevance of the problem and to provide evidence that the analytical approach and dataset reflect realistic practices.
Objectives
- Design a realistic business analytics scenario
- Select and apply appropriate analytical models
- Communicate results using CRISP-DM as a structuring framework
- Critically reflect on assumptions, limitations, and improvements
Group formation
- The target group size is 4 participants.
- We will form groups during class sessions.
- If you are not present, you must email me so that I can assign you to a group.
- We will select two groups with 5 participants using a fair and transparent procedure. Groups with 5 participants are expected to submit a more comprehensive case (+25%).
- Once your group is formed, you should create or join your group in the Canvas assignment.
Task
Select a business topic and formulate a concrete question. Example domains include:
- Finance (e.g., credit default, fraud, customer lifetime value)
- Marketing (e.g., churn, customer segmentation, campaign effectiveness)
- Human Resources (e.g., attrition, promotion, team productivity)
- Digital business (e.g., user engagement, gig worker retention, pricing strategy)
Create a synthetic but realistic dataset representing internal organizational data. You may complement this with synthetic or real external data if appropriate.
Develop an analytical notebook following the CRISP-DM process:
- Business understanding: What is the context, the concrete question, and the decision relevance? Who are the stakeholders?
- Data understanding: What does the dataset contain? How does it reflect a realistic business setting? What insights emerge from EDA?
- Data preparation: How is the data cleaned, transformed, and enriched?
- Modeling: Which model is used and why? How is it trained?
- Evaluation: Which metrics are used? How should the results be interpreted in practice?
- Deployment: How would the model be implemented? What are the requirements, risks, and ethical considerations?
Develop a report that presents and critically reflects on the case, including:
- What can be learned from the case
- How it connects to or extends course content
- Simplifying assumptions and limitations
- Opportunities for improvement
- Support key elements of your analysis with reputable sources, particularly for problem relevance, typical data sources, and common modeling approaches.
- Simplifications are acceptable, provided they are clearly justified.
- Advanced extensions (optional) may include advanced dataset preparation, model comparison or refinement, robustness checks, interactive visualizations, or more detailed deployment considerations.
- Your code must run and reproduce your results.
- If you plan to work with large-scale (big data) scenarios, you must consult with me in advance.
- You are strongly advised not to focus on real-time or streaming data, as this is difficult to implement and evaluate within a notebook-based project.
Project timeline
- Start: 2026-04-14
- Deadline: 2026-05-30
Proposed timeline
Week 1 – Topic selection
- Define and align on your topic
- Share your topic via email by the end of Week 1
- If multiple groups choose very similar topics, you will be asked to refine your focus
Weeks 2–4 – Iterative development
Develop and refine:
- Problem framing and assumptions
- Dataset design and structure
- Notebook (EDA and modeling)
Iterate between data design, analysis, and modeling, improving coherence and realism step by step.
Week 5 – Refinement
- Strengthen analysis and interpretation
- Optionally include an advanced extension
Week 6 – Finalization
- Ensure clarity, reproducibility, and quality of communication
- Submit all deliverables via Canvas
Deliverables
Synthetic dataset generation script
A Python script that:
- Clearly documents the business context and purpose (docstring)
- Defines the data schema (variables)
- Implements data generation logic (distributions and relationships)
- Includes realism features (e.g., noise, missing values, duplicates)
- States key assumptions and provides justification (where possible)
- Produces reproducible output (fixed random seed, CSV export)
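The realism features listed above can be sketched as follows. This is a minimal illustration, not a required template: the column names (`customer_id`, `monthly_spend`), the gamma distribution, the 5% missing rate, the 2% duplicate rate, and the output file name are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducible output
n = 1000

# Clean base data (hypothetical schema, for illustration only)
df = pd.DataFrame({
    "customer_id": np.arange(1, n + 1),
    "monthly_spend": rng.gamma(shape=2.0, scale=50.0, size=n),
})

# Noise: multiplicative measurement error with a standard deviation of 5%
df["monthly_spend"] = df["monthly_spend"] * rng.normal(loc=1.0, scale=0.05, size=n)

# Missing values: blank out about 5% of the spend entries at random
missing_mask = rng.random(n) < 0.05
df.loc[missing_mask, "monthly_spend"] = np.nan

# Duplicates: re-append 2% of the rows, as if records had been loaded twice
duplicates = df.sample(frac=0.02, random_state=42)
df = pd.concat([df, duplicates], ignore_index=True)

# CSV export
df.to_csv("dataset_with_realism.csv", index=False)
```

Injecting the imperfections after the clean data has been generated keeps the underlying relationships easy to document, and gives the data preparation phase of the notebook something concrete to detect and repair.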
Dataset
- CSV file(s) containing the simulated data
Analytical notebook
A Jupyter Notebook that:
- Covers the full CRISP-DM pipeline
- Includes explanations in Markdown
- Is fully executable and reproducible
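As a rough sketch of how the modeling and evaluation phases might look in a notebook cell, the example below fits a simple linear model on a training split and compares its hold-out error against a naive mean baseline. The data-generating process mirrors the age/income example at the end of this page and is an assumption for illustration; `np.polyfit` merely stands in for whatever model your case actually calls for.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed relationship: income rises linearly with age)
rng = np.random.default_rng(0)
n = 1000
age = rng.normal(loc=40, scale=10, size=n)
income = age * 1000 + rng.normal(loc=0, scale=5000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# Data preparation: random 80/20 train/test split
shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)
split = int(0.8 * len(shuffled))
train, test = shuffled.iloc[:split], shuffled.iloc[split:]

# Modeling: least-squares line fitted on the training data only
slope, intercept = np.polyfit(train["age"].to_numpy(), train["income"].to_numpy(), deg=1)

# Evaluation: hold-out RMSE of the model vs. a naive mean baseline
def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

pred_model = slope * test["age"].to_numpy() + intercept
pred_baseline = np.full(len(test), train["income"].mean())

rmse_model = rmse(test["income"], pred_model)
rmse_baseline = rmse(test["income"], pred_baseline)
print(f"RMSE model:    {rmse_model:,.0f} EUR")
print(f"RMSE baseline: {rmse_baseline:,.0f} EUR")
```

Reporting the model next to a naive baseline makes it easier to argue what the model adds in practice, which matters more here than the absolute score.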
Report (max 15 pages; including title page and references)
Section 1: Case package
- Business context and motivation
- Problem statement and objectives
- Stakeholders and decision relevance
- Positioning in practice (with references)
Section 2: Analytical approach and key findings
- Overview and justification of the analytical approach
- Selected key results (do not duplicate the notebook)
- Key business insights
Section 3: Reflection
- Assumptions
- Limitations of dataset and approach
- Ethical and deployment considerations
- Potential improvements and extensions
Evaluation criteria
A maximum of 60 points can be earned. Each group receives a single grade. You are expected to contribute equally. Document your individual contributions clearly and transparently. In case of disputes, you should be prepared to provide evidence of your contributions.
| Category | Points | Criteria |
|---|---|---|
| A. Case and dataset design | 15 pts | Clarity and relevance of problem; realism of dataset; transparency of data generation |
| B. Analysis quality | 20 pts | Method justification; correctness; CRISP-DM use; insightfulness |
| C. Communication | 20 pts | Structure; visualization; clean code; argument quality; use of references; reflection |
| D. Advanced extension | 5 pts | Meaningful additional feature beyond core requirements |
Note: The applied methodology and reasoning are more important than achieving the highest possible model performance.
AI policy
- Allowed: Generating dataset scripts, debugging code, and documentation support.
- Not allowed: Generating complete end-to-end solutions.
- You must be able to explain and defend your work at any time.
- You must validate all AI-generated outputs.
Consultation and support
Note: Availability will be limited from 2026-05-25 to 2026-06-01. Please plan accordingly.
Template
Submission
Via Canvas assignment.
High-quality project work may contribute to future course development, for example by extending teaching materials or informing teaching cases. In such cases, we will discuss how contributions are acknowledged.
Example: Synthetic data generation script
"""
Purpose:
- Simulate [business scenario]
Data schema:
- age: numeric (years)
- income: numeric (annual income in EUR)
Generation logic:
- Age is drawn from a normal distribution (mean=40, sd=10)
- Income depends linearly on age with added noise
Assumptions and justification:
- Age distribution approximates working population demographics
- Income increases with age due to experience (human capital theory)
- Noise reflects unobserved heterogeneity in earnings
References (if applicable): [Add source, e.g., industry report, literature]
"""
import numpy as np
import pandas as pd
np.random.seed(42) # Reproducibility
# 1. Define dataset size
n = 1000
# 2. Generate base variables
age = np.random.normal(loc=40, scale=10, size=n)
# 3. Define relationships
income = age * 1000 + np.random.normal(loc=0, scale=5000, size=n)
# 4. Add realism (e.g., missing values)
missing_mask = np.random.rand(n) < 0.05
income[missing_mask] = np.nan
# 5. Create dataframe and export
df = pd.DataFrame({
    "age": age,
    "income": income,
})
df.to_csv("dataset.csv", index=False)
Note: More examples available here.
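One possible variation on the script above wraps the generation logic in a function that takes the seed as a parameter, which makes it easy to check that the same seed reproduces the same data. The function name `generate_dataset` and its parameters are hypothetical, and the sketch uses NumPy's `default_rng` generator rather than the global `np.random.seed` call shown above, so its draws will not match the original script's output.

```python
import numpy as np
import pandas as pd

def generate_dataset(n: int = 1000, seed: int = 42, missing_rate: float = 0.05) -> pd.DataFrame:
    """Generate the age/income example data with a local random generator."""
    rng = np.random.default_rng(seed)
    age = rng.normal(loc=40, scale=10, size=n)
    income = age * 1000 + rng.normal(loc=0, scale=5000, size=n)
    # Set a random share of income values to missing
    income[rng.random(n) < missing_rate] = np.nan
    return pd.DataFrame({"age": age, "income": income})

first = generate_dataset()
second = generate_dataset()
different = generate_dataset(seed=7)

# Same seed: identical data, including the positions of missing values
pd.testing.assert_frame_equal(first, second)
# Different seed: different draws
assert not first["age"].equals(different["age"])

print(first.describe().round(1))
```

A local generator avoids hidden dependence on global random state, so re-running a single notebook cell does not silently change the dataset.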