Group work

In this group project, you will develop and analyze a realistic business analytics case, including data, modeling, and critical reflection. While many analytics courses rely on publicly available datasets—often centered on societal questions—this project focuses on business-relevant contexts such as management, finance, operations, or digital platforms. To achieve this, you will design and work with synthetic datasets that reflect realistic organizational processes and decision situations. This allows you not only to apply analytical methods but also to engage with the upstream challenge of framing meaningful problems and constructing data that captures them. In addition, you will address downstream challenges such as deployment considerations (e.g., implementation, ethics, and risks). Your work should be grounded in appropriate and reputable references (e.g., academic literature or industry reports) to demonstrate the relevance of the problem and to provide evidence that the analytical approach and dataset reflect realistic practices.

Objectives

Design a realistic business analytics scenario
Select and apply appropriate analytical models
Communicate results using CRISP-DM as a structuring framework
Critically reflect on assumptions, limitations, and improvements

Group formation

The target group size is 4 participants.
We will form groups during class sessions.
If you are not present, you must email me so that I can assign you to a group.
We will select two groups with 5 participants using a fair and transparent procedure. Groups with 5 participants are expected to submit a more comprehensive case (+25%).
Once your group is formed, you should create or join your group in the Canvas assignment.

Task

Select a business topic and formulate a concrete question. Example domains include:
- Finance (e.g., credit default, fraud, customer lifetime value)
- Marketing (e.g., churn, customer segmentation, campaign effectiveness)
- Human Resources (e.g., attrition, promotion, team productivity)
- Digital business (e.g., user engagement, gig worker retention, pricing strategy)
Create a synthetic but realistic dataset representing internal organizational data. You may complement this with synthetic or real external data if appropriate.
Develop an analytical notebook following the CRISP-DM process:
1. Business understanding: What is the context, the concrete question, and the decision relevance? Who are the stakeholders?
2. Data understanding: What does the dataset contain? How does it reflect a realistic business setting? What insights emerge from EDA?
3. Data preparation: How is the data cleaned, transformed, and enriched?
4. Modeling: Which model is used and why? How is it trained?
5. Evaluation: Which metrics are used? How should the results be interpreted in practice?
6. Deployment: How would the model be implemented? What are the requirements, risks, and ethical considerations?
Develop a report that presents and critically reflects on the case, including:
- What can be learned from the case
- How it connects to or extends course content
- Simplifying assumptions and limitations
- Opportunities for improvement
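
For orientation, the modeling and evaluation steps of the CRISP-DM outline above could look roughly like the following in a notebook. This is a minimal sketch under stated assumptions, not a required approach: the `age` and `income` variables, the linear model, the 80/20 split, and the choice of mean absolute error are illustrative placeholders, and only NumPy is used to keep the block self-contained.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so data and split are reproducible

# Hypothetical synthetic data: income rises linearly with age, plus noise
n = 1000
age = rng.normal(loc=40, scale=10, size=n)
income = age * 1000 + rng.normal(loc=0, scale=5000, size=n)

# Hold out 20% of the rows for evaluation
idx = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Modeling: fit a straight line (degree-1 polynomial) on the training rows only
slope, intercept = np.polyfit(age[train_idx], income[train_idx], deg=1)

# Evaluation: compare against a naive baseline that always predicts the training mean
pred = slope * age[test_idx] + intercept
mae_model = np.mean(np.abs(income[test_idx] - pred))
mae_baseline = np.mean(np.abs(income[test_idx] - income[train_idx].mean()))

print(f"MAE model:    {mae_model:,.0f} EUR")
print(f"MAE baseline: {mae_baseline:,.0f} EUR")
```

Reporting the model next to a simple baseline makes it easier to argue whether the remaining error is acceptable for the business decision at hand.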

Notes

Support key elements of your analysis with reputable sources, particularly for problem relevance, typical data sources, and common modeling approaches.
Simplifications are acceptable, provided they are clearly justified.
Advanced extensions (optional) may include advanced dataset preparation, model comparison or refinement, robustness checks, interactive visualizations, or more detailed deployment considerations.
Your code must run and reproduce your results.
If you plan to work with large-scale (big data) scenarios, you must consult with me in advance.
You are strongly advised not to focus on real-time or streaming data, as this is difficult to implement and evaluate within a notebook-based project.
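
One way to check the reproducibility requirement is to run the generation logic twice with the same seed and confirm that the outputs match. The sketch below assumes a hypothetical `generate_dataset(seed)` function with illustrative variables; adapt the names to your own script.

```python
import numpy as np
import pandas as pd


def generate_dataset(seed: int, n: int = 200) -> pd.DataFrame:
    """Hypothetical generator: all randomness comes from one seeded Generator."""
    rng = np.random.default_rng(seed)
    age = rng.normal(loc=40, scale=10, size=n)
    income = age * 1000 + rng.normal(loc=0, scale=5000, size=n)
    return pd.DataFrame({"age": age, "income": income})


first = generate_dataset(seed=42)
second = generate_dataset(seed=42)
other = generate_dataset(seed=7)

# Same seed: the two frames must be identical (raises AssertionError otherwise)
pd.testing.assert_frame_equal(first, second)

# Different seed: the data should differ
print("Differs with another seed:", not first.equals(other))
```

Passing the seed into the function, rather than relying on global state, keeps the result independent of the order in which notebook cells are executed.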

Project timeline

Important dates

Start: 2026-04-14
Deadline: 2026-05-30

Proposed timeline

Week 1 – Topic selection

Define and align on your topic
Share your topic via email by the end of Week 1
If multiple groups choose very similar topics, you will be asked to refine your focus

Weeks 2–4 – Iterative development

Develop and refine:
- Problem framing and assumptions
- Dataset design and structure
- Notebook (EDA and modeling)
Iterate between data design, analysis, and modeling
Improve coherence and realism step by step

Week 5 – Refinement

Strengthen analysis and interpretation
Optionally include an advanced extension

Week 6 – Finalization

Ensure clarity, reproducibility, and quality of communication
Submit all deliverables via Canvas

gantt
    dateFormat  YYYY-MM-DD
    excludes    weekends

    section Setup
    Topic selection                  :active, 2026-04-14, 7d

    section Iterative development
    Data, EDA, modeling (iterative)  : 2026-04-21, 18d

    section Refinement
    Refinement                       :2026-05-12, 7d

    section Finalization
    Final report and submission        :2026-05-19, 9d

Deliverables

Synthetic dataset generation script

A Python script that:
- Clearly documents the business context and purpose (docstring)
- Defines the data schema (variables)
- Implements data generation logic (distributions and relationships)
- Includes realism features (e.g., noise, missing values, duplicates)
- States key assumptions and provides justification (where possible)
- Produces reproducible output (fixed random seed, CSV export)
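
The realism features named in the list above can be layered onto a clean dataset after the core variables are generated. The following is a minimal sketch, assuming a hypothetical helper `add_realism`, a hypothetical `customer_id` key, and illustrative rates (5% missing values, 2% duplicated rows, 1% outliers). Which imperfections are plausible depends on your business context and should be justified there.

```python
import numpy as np
import pandas as pd


def add_realism(df, rng, missing_rate=0.05, duplicate_rate=0.02, outlier_rate=0.01):
    """Return a copy of df with outliers, missing values, and duplicate rows.

    Hypothetical helper for illustration; assumes a numeric 'income' column.
    """
    out = df.copy()
    n = len(out)

    # Outliers: inflate a small share of incomes (e.g., data-entry errors)
    outlier_mask = rng.random(n) < outlier_rate
    out.loc[outlier_mask, "income"] *= 10

    # Missing values: blank out a random share of incomes
    missing_mask = rng.random(n) < missing_rate
    out.loc[missing_mask, "income"] = np.nan

    # Duplicates: re-append a random sample of rows, then shuffle the row order
    n_dup = int(round(n * duplicate_rate))
    dup_idx = rng.choice(n, size=n_dup, replace=False)
    out = pd.concat([out, out.iloc[dup_idx]], ignore_index=True)
    shuffled = rng.permutation(len(out))
    return out.iloc[shuffled].reset_index(drop=True)


rng = np.random.default_rng(42)
clean = pd.DataFrame({
    "customer_id": np.arange(1000),
    "age": rng.normal(loc=40, scale=10, size=1000),
})
clean["income"] = clean["age"] * 1000 + rng.normal(loc=0, scale=5000, size=1000)

messy = add_realism(clean, rng)
print("rows before/after:", len(clean), len(messy))
print("share of missing incomes:", messy["income"].isna().mean())
print("duplicated customer ids:", messy.duplicated(subset="customer_id").sum())
```

Keeping the clean frame and the imperfections in separate steps makes it straightforward to document each assumption and to switch individual features on or off.
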
Dataset
- CSV file(s) containing the simulated data
Analytical notebook

A Jupyter Notebook that:
- Covers the full CRISP-DM pipeline
- Includes explanations in Markdown
- Is fully executable and reproducible
Report (max 15 pages; including title page and references)

Section 1: Case package
- Business context and motivation
- Problem statement and objectives
- Stakeholders and decision relevance
- Positioning in practice (with references)
Section 2: Analytical approach and key findings
- Overview and justification of the analytical approach
- Selected key results (do not duplicate the notebook)
- Key business insights
Section 3: Reflection
- Assumptions
- Limitations of dataset and approach
- Ethical and deployment considerations
- Potential improvements and extensions

Evaluation criteria

A maximum of 60 points can be earned. Each group receives a single grade. You are expected to contribute equally. Document your individual contributions clearly and transparently. In case of disputes, you should be prepared to provide evidence of your contributions.

Category	Points	Criteria
A. Case and dataset design	15 pts	Clarity and relevance of problem; realism of dataset; transparency of data generation
B. Analysis quality	20 pts	Method justification; correctness; CRISP-DM use; insightfulness
C. Communication	20 pts	Structure; visualization; clean code; argument quality; use of references; reflection
D. Advanced extension	5 pts	Meaningful additional feature beyond core requirements

Note: The applied methodology and reasoning are more important than achieving the highest possible model performance.

AI policy

Allowed: Generating dataset scripts, debugging code, and documentation support.
Not allowed: Generating complete end-to-end solutions.
You must be able to explain and defend your work at any time.
All AI-generated outputs must be validated.

Consultation and support

Book appointments via: Bookings
Join meetings via: Teams

Note: Availability will be limited from 2026-05-25 to 2026-06-01. Please plan accordingly.

Template

Report template

Submission

Via Canvas assignment.

Note

High-quality project work may contribute to future course development, for example by extending teaching materials or informing teaching cases. In such cases, we will discuss how contributions are acknowledged.

Example: Synthetic data generation script

"""
Purpose:
- Simulate [business scenario]

Data schema:
- age: numeric (years)
- income: numeric (annual income in EUR)

Generation logic:
- Age is drawn from a normal distribution (mean=40, sd=10)
- Income depends linearly on age with added noise

Assumptions and justification:
- Age distribution approximates working population demographics
- Income increases with age due to experience (human capital theory)
- Noise reflects unobserved heterogeneity in earnings

References (if applicable): [Add source, e.g., industry report, literature]
"""
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # Seeded generator for reproducibility

# 1. Define dataset size
n = 1000

# 2. Generate base variables
age = rng.normal(loc=40, scale=10, size=n)

# 3. Define relationships
income = age * 1000 + rng.normal(loc=0, scale=5000, size=n)

# 4. Add realism (e.g., missing values in about 5% of rows)
missing_mask = rng.random(n) < 0.05
income[missing_mask] = np.nan

# 5. Create dataframe and export
df = pd.DataFrame({
    "age": age,
    "income": income,
})
df.to_csv("dataset.csv", index=False)
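
Before handing a generated dataset to the notebook, it can help to confirm that it actually exhibits the properties the docstring promises. The checks below are a sketch only: the data is regenerated in memory with logic mirroring the example so that the block runs on its own, and the tolerance bands are illustrative assumptions rather than fixed requirements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Generation logic mirroring the example script, kept in memory
age = rng.normal(loc=40, scale=10, size=n)
income = age * 1000 + rng.normal(loc=0, scale=5000, size=n)
income[rng.random(n) < 0.05] = np.nan
df = pd.DataFrame({"age": age, "income": income})

# Check 1: schema matches the documented variables
assert list(df.columns) == ["age", "income"]

# Check 2: share of missing incomes is close to the intended 5%
missing_share = df["income"].isna().mean()
assert 0.02 < missing_share < 0.08

# Check 3: the designed relationship (about 1000 EUR per year of age) is recoverable
complete = df.dropna()
slope, intercept = np.polyfit(complete["age"], complete["income"], deg=1)
assert 900 < slope < 1100

# Check 4: count implausible values that an unbounded normal draw can produce
n_minors = int((df["age"] < 18).sum())
print(f"missing share: {missing_share:.3f}, slope: {slope:.0f}, ages below 18: {n_minors}")
```

The last check illustrates a typical simplification worth documenting: a normal distribution with mean 40 and standard deviation 10 places roughly 1.4% of draws below age 18, so a dataset meant to represent employees may need clipping or truncation.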

Note: More examples are available here.