Group work

In this group project, you will develop and analyze a realistic business analytics case that covers data, modeling, and critical reflection. Many analytics courses rely on publicly available datasets, which often center on societal questions; this project instead focuses on business-relevant contexts such as management, finance, operations, or digital platforms. To that end, you will design and work with synthetic datasets that reflect realistic organizational processes and decision situations. This lets you apply analytical methods and also engage with the upstream challenge of framing meaningful problems and constructing data that captures them. You will also address downstream challenges such as deployment (e.g., implementation, ethics, and risks). Ground your work in appropriate and reputable references (e.g., academic literature or industry reports) to demonstrate the relevance of the problem and to show that the analytical approach and dataset reflect realistic practice.

Objectives

  • Design a realistic business analytics scenario
  • Select and apply appropriate analytical models
  • Communicate results using CRISP-DM as a structuring framework
  • Critically reflect on assumptions, limitations, and improvements

Group formation

  • The target group size is 4 participants.
  • We will form groups during class sessions.
  • If you are not present, you must email me so that I can assign you to a group.
  • We will select two groups with 5 participants using a fair and transparent procedure. Groups with 5 participants are expected to submit a more comprehensive case (+25%).
  • Once your group is formed, you should create or join your group in the Canvas assignment.

Task

  • Select a business topic and formulate a concrete question. Example domains include:

    • Finance (e.g., credit default, fraud, customer lifetime value)
    • Marketing (e.g., churn, customer segmentation, campaign effectiveness)
    • Human Resources (e.g., attrition, promotion, team productivity)
    • Digital business (e.g., user engagement, gig worker retention, pricing strategy)
  • Create a synthetic but realistic dataset representing internal organizational data. You may complement this with synthetic or real external data if appropriate.

  • Develop an analytical notebook following the CRISP-DM process:

    1. Business understanding: What is the context, the concrete question, and the decision relevance? Who are the stakeholders?

    2. Data understanding: What does the dataset contain? How does it reflect a realistic business setting? What insights emerge from EDA?

    3. Data preparation: How is the data cleaned, transformed, and enriched?

    4. Modeling: Which model is used and why? How is it trained?

    5. Evaluation: Which metrics are used? How should the results be interpreted in practice?

    6. Deployment: How would the model be implemented? What are the requirements, risks, and ethical considerations?

  • Develop a report that presents and critically reflects on the case, including:

    • What can be learned from the case
    • How it connects to or extends course content
    • Simplifying assumptions and limitations
    • Opportunities for improvement
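
To make steps 3 to 5 of the CRISP-DM list above more concrete, here is a minimal sketch, not a required template. It assumes a toy dataset with the hypothetical columns `age` and `income`, generated inline so the block is self-contained. The 80/20 split, the decision to drop rows with a missing target, the linear model, and the metrics are illustrative assumptions; your own case should justify its own choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Toy data (assumption): stands in for your own synthetic dataset
n = 1000
age = rng.normal(loc=40, scale=10, size=n)
income = age * 1000 + rng.normal(loc=0, scale=5000, size=n)
income[rng.random(n) < 0.05] = np.nan  # about 5% missing target values
df = pd.DataFrame({"age": age, "income": income})

# 3. Data preparation: drop rows without a target value
clean = df.dropna(subset=["income"]).reset_index(drop=True)

# Hold out 20% of the rows for evaluation (random split)
shuffled = rng.permutation(len(clean))
n_test = int(0.2 * len(clean))
test = clean.iloc[shuffled[:n_test]]
train = clean.iloc[shuffled[n_test:]]

# 4. Modeling: ordinary least squares fit of income on age
slope, intercept = np.polyfit(
    train["age"].to_numpy(), train["income"].to_numpy(), deg=1
)

# 5. Evaluation: error metrics on the held-out test set
pred = intercept + slope * test["age"].to_numpy()
errors = test["income"].to_numpy() - pred
rmse = float(np.sqrt(np.mean(errors ** 2)))
mae = float(np.mean(np.abs(errors)))

# Baseline for comparison: always predict the training mean
baseline_errors = test["income"].to_numpy() - train["income"].mean()
baseline_rmse = float(np.sqrt(np.mean(baseline_errors ** 2)))

print(f"slope={slope:.1f}, intercept={intercept:.1f}")
print(f"RMSE={rmse:.1f}, MAE={mae:.1f}, baseline RMSE={baseline_rmse:.1f}")
```

Comparing against a naive baseline is one way to address the question of how results should be interpreted in practice: an error figure on its own says little until it is set against what a trivial prediction would achieve.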
Notes
  • Support key elements of your analysis with reputable sources, particularly for problem relevance, typical data sources, and common modeling approaches.
  • Simplifications are acceptable, provided they are clearly justified.
  • Optional advanced extensions may include advanced dataset preparation, model comparison or refinement, robustness checks, interactive visualizations, or more detailed deployment considerations.
  • Your code must run and reproduce your results.
  • If you plan to work with large-scale (big data) scenarios, you must consult with me in advance.
  • You are strongly advised not to focus on real-time or streaming data, as this is difficult to implement and evaluate within a notebook-based project.
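
One way to check the requirement that your code runs and reproduces your results is to generate the data twice with the same seed and compare the outputs. The sketch below assumes a hypothetical `generate_dataset` function that wraps your generation logic; the function name, its parameters, and the columns are placeholders, not a prescribed interface.

```python
import numpy as np
import pandas as pd


def generate_dataset(n: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Hypothetical generator; replace the body with your own logic."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    age = rng.normal(loc=40, scale=10, size=n)
    income = age * 1000 + rng.normal(loc=0, scale=5000, size=n)
    return pd.DataFrame({"age": age, "income": income})


first = generate_dataset(seed=42)
second = generate_dataset(seed=42)
other = generate_dataset(seed=7)

# Same seed: the two runs must match exactly (raises if they do not)
pd.testing.assert_frame_equal(first, second)

# Different seed: the data should differ
same_as_other = first.equals(other)
print("Reproducible with fixed seed:", first.equals(second))
print("Identical under a different seed:", same_as_other)
```

Passing the seed as a parameter, rather than setting it once globally, keeps the result independent of the order in which notebook cells are executed.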

Project timeline

Important dates
  • Start: 2026-04-14
  • Deadline: 2026-05-30

Proposed timeline

Week 1 – Topic selection

  • Define and align on your topic
  • Share your topic via email by the end of Week 1
  • If multiple groups choose very similar topics, you will be asked to refine your focus

Weeks 2–4 – Iterative development

  • Develop and refine:

    • Problem framing and assumptions
    • Dataset design and structure
    • Notebook (EDA and modeling)
  • Iterate between data design, analysis, and modeling

  • Improve coherence and realism step by step

Week 5 – Refinement

  • Strengthen analysis and interpretation
  • Optionally include an advanced extension

Week 6 – Finalization

  • Ensure clarity, reproducibility, and quality of communication
  • Submit all deliverables via Canvas

gantt
    dateFormat  YYYY-MM-DD
    excludes    weekends

    section Setup
    Topic selection                  :active, 2026-04-14, 7d

    section Iterative development
    Data, EDA, modeling (iterative)  : 2026-04-21, 18d

    section Refinement
    Refinement                       :2026-05-12, 7d

    section Finalization
    Final report and submission        :2026-05-19, 9d

Deliverables

  1. Synthetic dataset generation script

    A Python script that:

    • Clearly documents the business context and purpose (docstring)
    • Defines the data schema (variables)
    • Implements data generation logic (distributions and relationships)
    • Includes realism features (e.g., noise, missing values, duplicates)
    • States key assumptions and provides justification (where possible)
    • Produces reproducible output (fixed random seed, CSV export)
  2. Dataset

    • CSV file(s) containing the simulated data
  3. Analytical notebook

    A Jupyter Notebook that:

    • Covers the full CRISP-DM pipeline
    • Includes explanations in Markdown
    • Is fully executable and reproducible
  4. Report (max 15 pages; including title page and references)

    Section 1: Case package

    • Business context and motivation
    • Problem statement and objectives
    • Stakeholders and decision relevance
    • Positioning in practice (with references)

    Section 2: Analytical approach and key findings

    • Overview and justification of the analytical approach
    • Selected key results (do not duplicate the notebook)
    • Key business insights

    Section 3: Reflection

    • Assumptions
    • Limitations of dataset and approach
    • Ethical and deployment considerations
    • Potential improvements and extensions

Evaluation criteria

A maximum of 60 points can be earned. Each group receives a single grade. You are expected to contribute equally. Document your individual contributions clearly and transparently. In case of disputes, you should be prepared to provide evidence of your contributions.

Category                   | Points | Criteria
A. Case and dataset design | 15 pts | Clarity and relevance of problem; realism of dataset; transparency of data generation
B. Analysis quality        | 20 pts | Method justification; correctness; CRISP-DM use; insightfulness
C. Communication           | 20 pts | Structure; visualization; clean code; argument quality; use of references; reflection
D. Advanced extension      | 5 pts  | Meaningful additional feature beyond core requirements

Note: The applied methodology and reasoning are more important than achieving the highest possible model performance.

AI policy

  • Allowed: Generating dataset scripts, debugging code, and documentation support.
  • Not allowed: Generating complete end-to-end solutions.
  • You must be able to explain and defend your work at any time.
  • You must validate all AI-generated outputs.

Consultation and support

Note: Availability will be limited from 2026-05-25 to 2026-06-01. Please plan accordingly.

Template

Submission

Via Canvas assignment.

Note

High-quality project work may contribute to future course development, for example by extending teaching materials or informing teaching cases. In such cases, we will discuss how contributions are acknowledged.

Example: Synthetic data generation script

"""
Purpose:
- Simulate [business scenario]

Data schema:
- age: numeric (years)
- income: numeric (annual income in EUR)

Generation logic:
- Age is drawn from a normal distribution (mean=40, sd=10)
- Income depends linearly on age with added noise

Assumptions and justification:
- Age distribution approximates working population demographics
- Income increases with age due to experience (human capital theory)
- Noise reflects unobserved heterogeneity in earnings

References (if applicable): [Add source, e.g., industry report, literature]
"""
import numpy as np
import pandas as pd

np.random.seed(42)  # Fixed seed for reproducibility

# 1. Define dataset size
n = 1000

# 2. Generate base variables
age = np.random.normal(loc=40, scale=10, size=n)

# 3. Define relationships
income = age * 1000 + np.random.normal(loc=0, scale=5000, size=n)

# 4. Add realism (e.g., missing values)
missing_mask = np.random.rand(n) < 0.05
income[missing_mask] = np.nan

# 5. Create dataframe and export
df = pd.DataFrame({
    "age": age,
    "income": income,
})
df.to_csv("dataset.csv", index=False)

Note: More examples available here.
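
The realism features named under the deliverables (noise, missing values, duplicates) can be layered onto a clean table in a similar way. The following sketch is one possible approach under stated assumptions: the `customer_id` column, the 5% missing rate, the 2% duplicate rate, and the income distribution are illustrative choices, not course requirements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducible output

# Clean base table (assumption: hypothetical customer-level schema)
n = 1000
clean = pd.DataFrame({
    "customer_id": np.arange(1, n + 1),
    "age": rng.normal(loc=40, scale=10, size=n).round().clip(18, 80),
    "income": rng.normal(loc=40000, scale=8000, size=n).round(2),
})

messy = clean.copy()

# Missing values: blank out about 5% of incomes at random
missing_mask = rng.random(n) < 0.05
messy.loc[missing_mask, "income"] = np.nan

# Duplicates: re-append a 2% random sample of rows, as a faulty export might
duplicates = messy.sample(frac=0.02, random_state=42)
messy = pd.concat([messy, duplicates], ignore_index=True)

# Shuffle so the duplicates are not all at the end of the file
messy = messy.sample(frac=1, random_state=42).reset_index(drop=True)

print("Rows:", len(messy))
print("Duplicate rows:", int(messy.duplicated().sum()))
print("Missing incomes:", int(messy["income"].isna().sum()))
```

Keeping the clean table separate from the degraded copy makes it possible to verify later that your data preparation step actually recovers what was injected.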