marketing-roi

πŸ“Š Marketing Campaign ROI Analyzer

An end-to-end data analytics project covering data profiling, SQL analysis, Python EDA, statistical testing, attribution modeling, machine learning and interactive dashboard β€” built on real customer data from a 2-year marketing campaign dataset.


Live Dashboard

View the interactive dashboard here:
Marketing ROI Dashboard

🎯 Business Problem

A company ran 5 marketing campaigns across 3 purchase channels over 2 years (2012–2014). The business needed answers to these critical questions:

This project answers all of these questions using real data, statistical proof and machine learning.


πŸ“ Project Structure

marketing-roi-analyzer/
β”‚
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/
β”‚   β”‚   └── marketing_campaign.csv        ← Original dataset (never modified)
β”‚   └── processed/
β”‚       β”œβ”€β”€ cleaned_data.csv              ← Cleaned dataset from Phase 3
β”‚       └── customer_segments.csv         ← ML-generated segments from Phase 6
β”‚
β”œβ”€β”€ sql/
β”‚   β”œβ”€β”€ 01_create_tables.sql              ← Data verification queries
β”‚   β”œβ”€β”€ 02_channel_analysis.sql           ← Channel ROI + Campaign ranking
β”‚   β”œβ”€β”€ 03_cohort_analysis.sql            ← Monthly cohort analysis
β”‚   └── views/
β”‚       β”œβ”€β”€ vw_channel_roi.sql            ← BigQuery view: channel metrics
β”‚       └── vw_customer_segments.sql      ← BigQuery view: RFM segments
β”‚
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ 01_eda.ipynb                      ← Data cleaning & exploration
β”‚   β”œβ”€β”€ 02_statistical_analysis.ipynb     ← Hypothesis testing
β”‚   β”œβ”€β”€ 03_ab_testing.ipynb               ← A/B test: Campaign 3 vs 5
β”‚   β”œβ”€β”€ 04_attribution_modeling.ipynb     ← First/Last/Linear attribution
β”‚   └── 05_clv_prediction.ipynb           ← K-Means + XGBoost + SHAP
β”‚
β”œβ”€β”€ reports/
β”‚   β”œβ”€β”€ 01_data_profiling_audit.xlsx      ← Phase 1 profiling sheet
β”‚   └── 02_channel_pivot_dashboard.xlsx   ← Phase 2 Excel pivot dashboard
β”‚
β”œβ”€β”€ dashboard/
β”‚   └── marketing_roi_dashboard      ← 3-page interactive dashboard
β”‚
└── README.md

πŸ“¦ Dataset

Property Detail
Source Kaggle β€” Customer Personality Analysis
Raw rows 2,240 customers
Cleaned rows 2,236 customers (4 outliers removed)
Columns 29 original + 4 engineered features
Time period July 2012 β€” June 2014
Campaigns 5 (AcceptedCmp1 through AcceptedCmp5)
Channels Web, Store, Catalog
Products Wines, Fruits, Meat, Fish, Sweets, Gold

πŸ—ΊοΈ Project Phases

βœ… Phase 1 β€” Data Profiling & Acquisition


βœ… Phase 2 β€” SQL Analysis on Google BigQuery

Uploaded data to BigQuery and wrote 6 production-quality SQL queries:

Query File What It Does
01_create_tables.sql Verifies row counts and null values β€” confirmed 24 Income nulls matching Excel audit
02_channel_analysis.sql CTE-based ROI analysis β€” Store leads with 2,225 customers
02_channel_analysis.sql RANK() window function β€” Campaign 4 ranks #1 at 7.46% acceptance
03_cohort_analysis.sql DATE_TRUNC monthly cohort β€” August 2012 highest acquisition month
vw_channel_roi.sql BigQuery view β€” reusable channel metrics table for Power BI
vw_customer_segments.sql BigQuery view β€” RFM scoring with NTILE(4) and CASE WHEN segmentation

Key SQL finding: September 2012 cohort had only 99 customers but the highest average spend (Β£800+) β€” fewer but higher quality customers than later months.


βœ… Phase 3 β€” Python EDA & Data Cleaning

Notebook: 01_eda.ipynb (Google Colab)

Cleaning steps performed:

Problem Fix Applied Rows Affected
24 null Income values fillna(median) β€” median more robust than mean for skewed data 0 removed
Year_Birth < 1940 (age 80+) Removed impossible ages ~3 removed
Income = 666,666 (outlier) Removed above Β£200,000 1 removed
Absurd, YOLO, Alone in Marital_Status Replaced with Other 0 removed

Feature engineering β€” 4 new columns created:

New Column Formula Business Use
Age current_year - Year_Birth More intuitive than birth year
Customer_Tenure Days since Dt_Customer Loyalty measurement
Total_Spend Sum of all 6 Mnt columns Single spend metric
Total_Campaigns_Accepted Sum of AcceptedCmp1–5 Campaign engagement score

Key EDA findings:


βœ… Phase 4 β€” Statistical Analysis & A/B Testing

Notebooks: 02_statistical_analysis.ipynb, 03_ab_testing.ipynb

Three hypothesis tests performed:

Test Question Result p-value
Independent T-test Do Campaign 1 acceptors spend more? βœ… Reject Hβ‚€ β€” Yes, 2.7Γ— more (Β£1,482 vs Β£545) 0.0000
One-way ANOVA Does education level affect purchases? βœ… Reject Hβ‚€ β€” Education significantly affects buying 0.0000
Chi-square Does marital status affect campaign acceptance? ❌ Fail to reject Hβ‚€ β€” No relationship proven 0.923

A/B Test β€” Campaign 3 vs Campaign 5:


βœ… Phase 5 β€” Attribution Modeling

Notebook: 04_attribution_modeling.ipynb

Built three attribution models to understand the customer journey:

Model Winner Credit % Business Meaning
First Touch Campaign 1 ~40% Creates awareness β€” starts most customer journeys
Last Touch Campaign 5 ~50% Closes conversions β€” the final push to buy
Linear Shared Equal split All campaigns contribute throughout the journey

Key insight: Campaign 1 and Campaign 5 serve different but equally important roles. Campaign 1 introduces customers to the brand. Campaign 5 converts them. Cutting either one would damage the full marketing funnel.

Campaign 2 consistently performed worst across all three models β€” budget reallocation recommended.


βœ… Phase 6 β€” CLV Prediction & Customer Segmentation

Notebook: 05_clv_prediction.ipynb

Part A β€” K-Means Clustering:

Segment Count Business Action
πŸ† Champions 520 (23.3%) Protect β€” highest spenders (avg Β£1,269)
πŸ’š Loyal 597 (26.7%) Reward β€” consistent buyers
⚠️ At-Risk 646 (28.9%) Re-engage β€” immediate retention opportunity
❌ Lost 473 (21.2%) Win-back β€” were high value (avg Β£1,078)

Part B β€” XGBoost CLV Prediction:

Metric Value Interpretation
RΒ² Score 0.891 Model explains 89.1% of spending variance
RMSE Β£202.76 Predictions off by Β£203 on average
Train/Test split 80/20 1,788 training / 448 testing

CLV Tier Distribution:

Tier Customers Avg Predicted CLV
πŸ’Ž High CLV 171 (7.65%) Β£1,761
πŸ”΅ Mid CLV 638 (28.5%) Β£1,158
βšͺ Low CLV 1,427 (63.8%) Β£218

Part C β€” SHAP Explainability:

Top 3 features driving CLV predictions:

  1. NumCatalogPurchases β€” Catalog buyers are typically older, higher-income premium customers
  2. Frequency β€” More purchases directly drives higher lifetime value
  3. Income β€” Confirms Phase 3 finding: Income vs Spend correlation of 0.79

SHAP independently confirmed the correlation analysis finding from Phase 3 using a completely different method β€” strengthening confidence in the recommendation.


βœ… Phase 7 β€” Power BI Dashboard

File: powerbi/marketing_roi_dashboard.pbix

Built a 3-page interactive dashboard connected to BigQuery views and processed CSV files:

Page Content
Executive Overview 4 KPI cards, spend trend line chart, segment donut chart, revenue by segment bars, interactive slicer
Channel Performance Channel metrics table, revenue bars, conversion comparison, campaign ranking, attribution model visualization
Customer Segments CLV tier donut, scatter plot (Recency vs Spend), income by education, SHAP feature importance, model performance metrics

πŸ”‘ Key Business Findings

# Finding Evidence Recommended Action
1 Campaign 4 has highest acceptance rate (7.46%) SQL RANK() query Prioritise Campaign 4 spend
2 Campaign 2 performs 5Γ— worse than Campaign 4 (1.34%) SQL + Attribution Reallocate Campaign 2 budget
3 Campaign 1 starts 40% of customer journeys Attribution Modeling Maintain Campaign 1 for awareness
4 Campaign 5 closes 50% of conversions Attribution Modeling Increase Campaign 5 investment
5 Income predicts spend with 0.79 correlation Correlation + SHAP Target high-income customers
6 PhD holders earn Β£56K avg β€” 2.76Γ— more than Basic ANOVA + EDA Target educated segments with premium campaigns
7 Catalog has lowest CAC (Β£18,194 vs Β£24,475 for Store) SQL channel analysis Invest more in Catalog channel
8 171 High CLV customers worth 8Γ— more than Low CLV XGBoost prediction Build VIP loyalty programme
9 646 At-Risk customers represent Β£75K retention opportunity K-Means clustering Launch immediate re-engagement campaign
10 Campaign 3 and Campaign 5 perform identically (p=0.954) A/B Testing Choose by cost β€” performance is identical

πŸ› οΈ Tools & Technologies

Category Tool Purpose
Cloud Database Google BigQuery SQL analysis, views, cohort analysis
Data Profiling Microsoft Excel Online Data audit, pivot dashboard, budget analysis
Python Google Colab EDA, cleaning, statistics, ML
Data Manipulation Pandas, NumPy Data processing and feature engineering
Visualisation Matplotlib, Seaborn EDA charts and statistical plots
Statistics SciPy, Statsmodels T-test, ANOVA, Chi-square, A/B testing
Machine Learning Scikit-learn, XGBoost K-Means clustering, CLV prediction
Explainability SHAP Feature importance and model interpretation
Dashboard Power BI Desktop Interactive 3-page business dashboard
Version Control GitHub Project repository and portfolio

πŸ“Š Statistical Results Summary

T-test  (Campaign 1 vs Spend):  t=19.54,  p=0.0000  βœ… Significant
ANOVA   (Education vs Spend):   F=17.27,  p=0.0000  βœ… Significant
Chi-sq  (Marital vs Campaign):  χ²=1.45,  p=0.923   ❌ Not significant
A/B Test (Campaign 3 vs 5):     t=0.057,  p=0.954   🀝 No difference

ML Model Performance:
XGBoost RΒ² Score:  0.891  (89.1% variance explained)
XGBoost RMSE:      Β£202.76 (average prediction error)

πŸ“ Data Limitations

  1. No actual advertising spend data β€” Budget tracker was omitted as the dataset contains customer purchase revenue, not company advertising costs. A real budget tracker would require actual ad spend data from marketing platforms.

  2. Z_Revenue is a constant proxy β€” Not real transaction revenue per customer. ROI calculations are directionally correct but not absolute.

  3. No timestamp per campaign β€” Attribution modeling assumes Campaign 1–5 are chronologically ordered. Real attribution would require individual campaign exposure timestamps.

  4. Single market dataset β€” Results are specific to this company’s customer base and may not generalise across industries.


βœ… Business Questions β€” Answered

At the start of this project, the business asked 5 critical questions. Here are the data-backed answers:


❓ Question 1 β€” Which campaign drives the most customer conversions?

Campaign 4 is the top performing campaign with a 7.46% acceptance rate and 167 conversions β€” ranking #1 across all 5 campaigns. Campaign 2 is the worst performer at only 1.34% and should have its budget reallocated. Additionally, attribution modeling revealed that Campaign 1 starts 40% of all customer journeys (awareness) while Campaign 5 closes 50% of all conversions (sales driver) β€” both play critical but different roles in the funnel.


❓ Question 2 β€” Which channel generates the best return?

All three channels generate similar revenue (Store: Β£1.35M, Web: Β£1.35M, Catalog: Β£1.32M). However Catalog is the most cost-efficient channel with the lowest Customer Acquisition Cost at Β£18,194 compared to Store (Β£24,475) and Web (Β£24,101). Store reaches the most customers (2,225) making it the best volume channel. The right answer depends on the business goal β€” efficiency (Catalog) or reach (Store).


❓ Question 3 β€” Which customers are most valuable and which are at risk?

K-Means clustering identified 4 distinct segments:


❓ Question 4 β€” Can we predict how much a customer will spend in the future?

Yes β€” with 89.1% accuracy. An XGBoost machine learning model was trained on customer demographics, purchase history and campaign behaviour. The model predicts Customer Lifetime Value with an RΒ² of 0.891 and an average error of just Β£203. Customers were classified into three CLV tiers: High (Β£1,761 avg), Mid (Β£1,158 avg) and Low (Β£218 avg). The 171 High CLV customers are worth 8Γ— more than Low CLV customers individually.


❓ Question 5 β€” What factors actually drive customer spending behaviour?

Three independent methods all point to the same answer:

  1. Correlation analysis β€” Income vs Total Spend = 0.79 (strong positive relationship)
  2. ANOVA test β€” Education level significantly affects purchases (F=17.27, p=0.0000)
  3. SHAP explainability β€” Top 3 predictors are Catalog purchases, Purchase frequency and Income

Conclusion: High-income, educated customers who buy frequently through the catalog channel are the highest-value customer profile. Marketing efforts should be concentrated on acquiring and retaining this profile.


πŸ‘€ Author

Fahad VB


This project was built as a portfolio project demonstrating end-to-end data analytics skills across SQL, Python, Statistics and Business Intelligence.