An end-to-end data analytics project covering data profiling, SQL analysis, Python EDA, statistical testing, attribution modeling, machine learning and interactive dashboard β built on real customer data from a 2-year marketing campaign dataset.
View the interactive dashboard here:
Marketing ROI Dashboard
A company ran 5 marketing campaigns across 3 purchase channels over 2 years (2012β2014). The business needed answers to these critical questions:
This project answers all of these questions using real data, statistical proof and machine learning.
marketing-roi-analyzer/
β
βββ data/
β βββ raw/
β β βββ marketing_campaign.csv β Original dataset (never modified)
β βββ processed/
β βββ cleaned_data.csv β Cleaned dataset from Phase 3
β βββ customer_segments.csv β ML-generated segments from Phase 6
β
βββ sql/
β βββ 01_create_tables.sql β Data verification queries
β βββ 02_channel_analysis.sql β Channel ROI + Campaign ranking
β βββ 03_cohort_analysis.sql β Monthly cohort analysis
β βββ views/
β βββ vw_channel_roi.sql β BigQuery view: channel metrics
β βββ vw_customer_segments.sql β BigQuery view: RFM segments
β
βββ notebooks/
β βββ 01_eda.ipynb β Data cleaning & exploration
β βββ 02_statistical_analysis.ipynb β Hypothesis testing
β βββ 03_ab_testing.ipynb β A/B test: Campaign 3 vs 5
β βββ 04_attribution_modeling.ipynb β First/Last/Linear attribution
β βββ 05_clv_prediction.ipynb β K-Means + XGBoost + SHAP
β
βββ reports/
β βββ 01_data_profiling_audit.xlsx β Phase 1 profiling sheet
β βββ 02_channel_pivot_dashboard.xlsx β Phase 2 Excel pivot dashboard
β
βββ dashboard/
β βββ marketing_roi_dashboard β 3-page interactive dashboard
β
βββ README.md
| Property | Detail |
|---|---|
| Source | Kaggle β Customer Personality Analysis |
| Raw rows | 2,240 customers |
| Cleaned rows | 2,236 customers (4 outliers removed) |
| Columns | 29 original + 4 engineered features |
| Time period | July 2012 β June 2014 |
| Campaigns | 5 (AcceptedCmp1 through AcceptedCmp5) |
| Channels | Web, Store, Catalog |
| Products | Wines, Fruits, Meat, Fish, Sweets, Gold |
Absurd, YOLO, Alone β invalid categoriesUploaded data to BigQuery and wrote 6 production-quality SQL queries:
| Query File | What It Does |
|---|---|
01_create_tables.sql |
Verifies row counts and null values β confirmed 24 Income nulls matching Excel audit |
02_channel_analysis.sql |
CTE-based ROI analysis β Store leads with 2,225 customers |
02_channel_analysis.sql |
RANK() window function β Campaign 4 ranks #1 at 7.46% acceptance |
03_cohort_analysis.sql |
DATE_TRUNC monthly cohort β August 2012 highest acquisition month |
vw_channel_roi.sql |
BigQuery view β reusable channel metrics table for Power BI |
vw_customer_segments.sql |
BigQuery view β RFM scoring with NTILE(4) and CASE WHEN segmentation |
Key SQL finding: September 2012 cohort had only 99 customers but the highest average spend (Β£800+) β fewer but higher quality customers than later months.
Notebook: 01_eda.ipynb (Google Colab)
Cleaning steps performed:
| Problem | Fix Applied | Rows Affected |
|---|---|---|
| 24 null Income values | fillna(median) β median more robust than mean for skewed data |
0 removed |
| Year_Birth < 1940 (age 80+) | Removed impossible ages | ~3 removed |
| Income = 666,666 (outlier) | Removed above Β£200,000 | 1 removed |
| Absurd, YOLO, Alone in Marital_Status | Replaced with Other |
0 removed |
Feature engineering β 4 new columns created:
| New Column | Formula | Business Use |
|---|---|---|
Age |
current_year - Year_Birth |
More intuitive than birth year |
Customer_Tenure |
Days since Dt_Customer |
Loyalty measurement |
Total_Spend |
Sum of all 6 Mnt columns | Single spend metric |
Total_Campaigns_Accepted |
Sum of AcceptedCmp1β5 | Campaign engagement score |
Key EDA findings:
Notebooks: 02_statistical_analysis.ipynb, 03_ab_testing.ipynb
Three hypothesis tests performed:
| Test | Question | Result | p-value |
|---|---|---|---|
| Independent T-test | Do Campaign 1 acceptors spend more? | β Reject Hβ β Yes, 2.7Γ more (Β£1,482 vs Β£545) | 0.0000 |
| One-way ANOVA | Does education level affect purchases? | β Reject Hβ β Education significantly affects buying | 0.0000 |
| Chi-square | Does marital status affect campaign acceptance? | β Fail to reject Hβ β No relationship proven | 0.923 |
A/B Test β Campaign 3 vs Campaign 5:
Notebook: 04_attribution_modeling.ipynb
Built three attribution models to understand the customer journey:
| Model | Winner | Credit % | Business Meaning |
|---|---|---|---|
| First Touch | Campaign 1 | ~40% | Creates awareness β starts most customer journeys |
| Last Touch | Campaign 5 | ~50% | Closes conversions β the final push to buy |
| Linear | Shared | Equal split | All campaigns contribute throughout the journey |
Key insight: Campaign 1 and Campaign 5 serve different but equally important roles. Campaign 1 introduces customers to the brand. Campaign 5 converts them. Cutting either one would damage the full marketing funnel.
Campaign 2 consistently performed worst across all three models β budget reallocation recommended.
Notebook: 05_clv_prediction.ipynb
Part A β K-Means Clustering:
| Segment | Count | Business Action |
|---|---|---|
| π Champions | 520 (23.3%) | Protect β highest spenders (avg Β£1,269) |
| π Loyal | 597 (26.7%) | Reward β consistent buyers |
| β οΈ At-Risk | 646 (28.9%) | Re-engage β immediate retention opportunity |
| β Lost | 473 (21.2%) | Win-back β were high value (avg Β£1,078) |
Part B β XGBoost CLV Prediction:
| Metric | Value | Interpretation |
|---|---|---|
| RΒ² Score | 0.891 | Model explains 89.1% of spending variance |
| RMSE | Β£202.76 | Predictions off by Β£203 on average |
| Train/Test split | 80/20 | 1,788 training / 448 testing |
CLV Tier Distribution:
| Tier | Customers | Avg Predicted CLV |
|---|---|---|
| π High CLV | 171 (7.65%) | Β£1,761 |
| π΅ Mid CLV | 638 (28.5%) | Β£1,158 |
| βͺ Low CLV | 1,427 (63.8%) | Β£218 |
Part C β SHAP Explainability:
Top 3 features driving CLV predictions:
SHAP independently confirmed the correlation analysis finding from Phase 3 using a completely different method β strengthening confidence in the recommendation.
File: powerbi/marketing_roi_dashboard.pbix
Built a 3-page interactive dashboard connected to BigQuery views and processed CSV files:
| Page | Content |
|---|---|
| Executive Overview | 4 KPI cards, spend trend line chart, segment donut chart, revenue by segment bars, interactive slicer |
| Channel Performance | Channel metrics table, revenue bars, conversion comparison, campaign ranking, attribution model visualization |
| Customer Segments | CLV tier donut, scatter plot (Recency vs Spend), income by education, SHAP feature importance, model performance metrics |
| # | Finding | Evidence | Recommended Action |
|---|---|---|---|
| 1 | Campaign 4 has highest acceptance rate (7.46%) | SQL RANK() query | Prioritise Campaign 4 spend |
| 2 | Campaign 2 performs 5Γ worse than Campaign 4 (1.34%) | SQL + Attribution | Reallocate Campaign 2 budget |
| 3 | Campaign 1 starts 40% of customer journeys | Attribution Modeling | Maintain Campaign 1 for awareness |
| 4 | Campaign 5 closes 50% of conversions | Attribution Modeling | Increase Campaign 5 investment |
| 5 | Income predicts spend with 0.79 correlation | Correlation + SHAP | Target high-income customers |
| 6 | PhD holders earn Β£56K avg β 2.76Γ more than Basic | ANOVA + EDA | Target educated segments with premium campaigns |
| 7 | Catalog has lowest CAC (Β£18,194 vs Β£24,475 for Store) | SQL channel analysis | Invest more in Catalog channel |
| 8 | 171 High CLV customers worth 8Γ more than Low CLV | XGBoost prediction | Build VIP loyalty programme |
| 9 | 646 At-Risk customers represent Β£75K retention opportunity | K-Means clustering | Launch immediate re-engagement campaign |
| 10 | Campaign 3 and Campaign 5 perform identically (p=0.954) | A/B Testing | Choose by cost β performance is identical |
| Category | Tool | Purpose |
|---|---|---|
| Cloud Database | Google BigQuery | SQL analysis, views, cohort analysis |
| Data Profiling | Microsoft Excel Online | Data audit, pivot dashboard, budget analysis |
| Python | Google Colab | EDA, cleaning, statistics, ML |
| Data Manipulation | Pandas, NumPy | Data processing and feature engineering |
| Visualisation | Matplotlib, Seaborn | EDA charts and statistical plots |
| Statistics | SciPy, Statsmodels | T-test, ANOVA, Chi-square, A/B testing |
| Machine Learning | Scikit-learn, XGBoost | K-Means clustering, CLV prediction |
| Explainability | SHAP | Feature importance and model interpretation |
| Dashboard | Power BI Desktop | Interactive 3-page business dashboard |
| Version Control | GitHub | Project repository and portfolio |
T-test (Campaign 1 vs Spend): t=19.54, p=0.0000 β
Significant
ANOVA (Education vs Spend): F=17.27, p=0.0000 β
Significant
Chi-sq (Marital vs Campaign): ΟΒ²=1.45, p=0.923 β Not significant
A/B Test (Campaign 3 vs 5): t=0.057, p=0.954 π€ No difference
ML Model Performance:
XGBoost RΒ² Score: 0.891 (89.1% variance explained)
XGBoost RMSE: Β£202.76 (average prediction error)
No actual advertising spend data β Budget tracker was omitted as the dataset contains customer purchase revenue, not company advertising costs. A real budget tracker would require actual ad spend data from marketing platforms.
Z_Revenue is a constant proxy β Not real transaction revenue per customer. ROI calculations are directionally correct but not absolute.
No timestamp per campaign β Attribution modeling assumes Campaign 1β5 are chronologically ordered. Real attribution would require individual campaign exposure timestamps.
Single market dataset β Results are specific to this companyβs customer base and may not generalise across industries.
At the start of this project, the business asked 5 critical questions. Here are the data-backed answers:
β Question 1 β Which campaign drives the most customer conversions?
Campaign 4 is the top performing campaign with a 7.46% acceptance rate and 167 conversions β ranking #1 across all 5 campaigns. Campaign 2 is the worst performer at only 1.34% and should have its budget reallocated. Additionally, attribution modeling revealed that Campaign 1 starts 40% of all customer journeys (awareness) while Campaign 5 closes 50% of all conversions (sales driver) β both play critical but different roles in the funnel.
β Question 2 β Which channel generates the best return?
All three channels generate similar revenue (Store: Β£1.35M, Web: Β£1.35M, Catalog: Β£1.32M). However Catalog is the most cost-efficient channel with the lowest Customer Acquisition Cost at Β£18,194 compared to Store (Β£24,475) and Web (Β£24,101). Store reaches the most customers (2,225) making it the best volume channel. The right answer depends on the business goal β efficiency (Catalog) or reach (Store).
β Question 3 β Which customers are most valuable and which are at risk?
K-Means clustering identified 4 distinct segments:
- π Champions (520 customers) β avg spend Β£1,269 β most valuable, generate 49% of total revenue
- π Loyal (597 customers) β consistent buyers, lower individual spend
- β οΈ At-Risk (646 customers) β represent a Β£75,000 retention opportunity β immediate re-engagement needed
- β Lost (473 customers) β had high historical spend (avg Β£1,078) β worth a win-back campaign
β Question 4 β Can we predict how much a customer will spend in the future?
Yes β with 89.1% accuracy. An XGBoost machine learning model was trained on customer demographics, purchase history and campaign behaviour. The model predicts Customer Lifetime Value with an RΒ² of 0.891 and an average error of just Β£203. Customers were classified into three CLV tiers: High (Β£1,761 avg), Mid (Β£1,158 avg) and Low (Β£218 avg). The 171 High CLV customers are worth 8Γ more than Low CLV customers individually.
β Question 5 β What factors actually drive customer spending behaviour?
Three independent methods all point to the same answer:
- Correlation analysis β Income vs Total Spend = 0.79 (strong positive relationship)
- ANOVA test β Education level significantly affects purchases (F=17.27, p=0.0000)
- SHAP explainability β Top 3 predictors are Catalog purchases, Purchase frequency and Income
Conclusion: High-income, educated customers who buy frequently through the catalog channel are the highest-value customer profile. Marketing efforts should be concentrated on acquiring and retaining this profile.
Fahad VB
πΌ LinkedIn: https://www.linkedin.com/in/fahadvb/
This project was built as a portfolio project demonstrating end-to-end data analytics skills across SQL, Python, Statistics and Business Intelligence.