propensity-modeling-cross-selling

Propensity modeling for cross-selling insurance products using CatBoost with FastAPI and Streamlit

🎯 Propensity Modeling for Cross-Selling

Binary classification model to predict insurance product purchase propensity using CatBoost with comprehensive MLflow experiment tracking.

📋 Project Overview

Business Objective: Develop a predictive model to identify customers most likely to purchase additional insurance products, enabling targeted cross-selling campaigns and improved conversion rates.

Technical Approach:

Algorithm: CatBoost (Gradient Boosting on Decision Trees)
Target Metric: AUC-ROC ≥ 0.82
Features: 22 customer attributes (demographics, engagement, financial, behavioral)
MLflow Integration: Complete experiment tracking, model registry, and deployment
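
To make the target metric concrete: AUC-ROC is the probability that a randomly chosen purchaser receives a higher score than a randomly chosen non-purchaser. The sketch below computes it by pairwise comparison in plain Python. The labels and scores are made-up illustrative values, not project data, and a real pipeline would typically rely on a library routine such as scikit-learn's `roc_auc_score` instead.

```python
from itertools import product


def auc_roc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Illustrative values only (not taken from the project)
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.3, 0.6, 0.7, 0.8, 0.2]

auc = auc_roc(labels, scores)
print(f"AUC-ROC = {auc:.4f}, meets 0.82 target: {auc >= 0.82}")
```

The pairwise form is quadratic in the number of examples, so it is only meant to show what the number measures.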

🏗️ Project Structure

04-propensity-modeling-cross-selling/
├── data/
│   ├── generate_synthetic_data.py    # Synthetic data generation
│   ├── processed/                     # Processed training data
│   └── external/                      # External data sources
├── docker/
│   ├── Dockerfile.api                 # FastAPI container
│   ├── Dockerfile.dashboard           # Streamlit container
│   ├── Dockerfile.mlflow              # MLflow UI container
│   └── docker-compose.yml             # Multi-service orchestration
├── mlruns/                            # MLflow experiment tracking
├── models/                            # Trained model artifacts
├── scripts/
│   ├── run_api.sh                     # Start API server
│   ├── run_dashboard.sh               # Start dashboard
│   └── run_mlflow_ui.sh               # Start MLflow UI
├── src/
│   ├── api/
│   │   └── app.py                     # FastAPI endpoints
│   ├── dashboard/
│   │   └── app.py                     # Streamlit UI
│   ├── features/
│   │   ├── __init__.py
│   │   └── build_features.py          # Feature engineering
│   ├── models/
│   │   ├── __init__.py
│   │   └── train.py                   # Training pipeline with MLflow
│   └── utils/
│       ├── __init__.py
│       └── mlflow_tracking.py         # MLflow utilities
├── tests/                             # Unit and integration tests
├── config.yaml                        # Project configuration
├── requirements.txt                   # Python dependencies
└── README.md                          # This file

🚀 Quick Start

Prerequisites

Python 3.10+
Docker & Docker Compose (optional)
8GB RAM minimum

Installation

Clone the repository, then navigate to the project directory:

cd 04-propensity-modeling-cross-selling

Create virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Generate synthetic data:

python data/generate_synthetic_data.py

Train model:

python src/models/train.py

Start services (choose one):

Option A: Using shell scripts

# Terminal 1: MLflow UI
./scripts/run_mlflow_ui.sh

# Terminal 2: FastAPI
./scripts/run_api.sh

# Terminal 3: Dashboard
./scripts/run_dashboard.sh

Option B: Using Docker Compose

# The compose file lives in docker/ (see Project Structure)
docker-compose -f docker/docker-compose.yml up -d

Access Points

Dashboard: http://localhost:8501
API: http://localhost:8000
API Docs: http://localhost:8000/docs
MLflow UI: http://localhost:5000

📊 MLflow Integration

Key Features

Experiment Tracking
- Automatic logging of hyperparameters
- Metric tracking (AUC-ROC, Accuracy, F1-Score, etc.)
- Training metadata (duration, data shape, iterations)
Model Registry
- Version control for models
- Stage transitions (Staging → Production)
- Model metadata and tags
Artifacts
- Confusion matrices
- ROC curves
- Feature importance plots
- Trained CatBoost models

MLflow Workflow

Training:

# All training runs are automatically tracked
python src/models/train.py

View Experiments:

# Start MLflow UI
./scripts/run_mlflow_ui.sh
# Visit: http://localhost:5000

Register Best Model:

# Models are automatically registered in the "Staging" stage
# Navigate to MLflow UI → Models → propensity_catboost_model
# Promote to "Production" when ready

Load Model for Inference:

import mlflow.catboost

# Load from Model Registry
model = mlflow.catboost.load_model("models:/propensity_catboost_model/Production")

# Or load from specific run
model = mlflow.catboost.load_model("runs:/<run_id>/model")

MLflow Configuration

mlflow:
  experiment_name: "propensity_modeling_cross_selling"
  tracking_uri: "./mlruns"
  model_name: "propensity_catboost_model"

🎯 Model Features

Input Features (22 attributes)

Demographics:

age: Customer age (18-100)

Product Holdings:

has_life_insurance: Binary flag
has_apv: Binary flag (APV = Ahorro Previsional Voluntario, voluntary pension savings)
total_products: Number of products

Customer Relationship:

customer_lifetime_years: Years as customer

Engagement Metrics:

interactions_last_6m: Support interactions in the last 6 months
web_visits_monthly: Monthly website visits
email_open_rate: Email engagement rate

Financial Metrics:

annual_income_clp: Annual income in Chilean Pesos
total_assets_clp: Total assets
avg_monthly_balance_clp: Average account balance

Life Stage:

life_stage: Single, Married_NoKids, Married_Kids, Divorced, Retired
has_dependents: Binary flag

Behavioral:

transaction_frequency_monthly: Monthly transaction count
last_purchase_days: Days since last purchase
customer_service_calls: Support call count
mobile_app_usage_score: App engagement (0-100)

Credit Profile:

credit_score: Credit score (300-850)
employment_years: Years employed
home_owner: Binary flag
education_level: High_School, Bachelor, Master, PhD
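
The documented ranges and categories above can be checked on the client side before a record is sent for scoring. The following is a minimal sketch under that assumption. The function name `validate_customer` and the error wording are hypothetical, and the API may enforce its own (possibly different) validation in `src/api/app.py`.

```python
# Categories and ranges copied from the feature list above
LIFE_STAGES = {"Single", "Married_NoKids", "Married_Kids", "Divorced", "Retired"}
EDUCATION_LEVELS = {"High_School", "Bachelor", "Master", "PhD"}
BINARY_FLAGS = ("has_life_insurance", "has_apv", "has_dependents", "home_owner")
NUMERIC_RANGES = {
    "age": (18, 100),
    "mobile_app_usage_score": (0, 100),
    "credit_score": (300, 850),
}


def validate_customer(customer: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes these checks."""
    problems = []
    for name, (lo, hi) in NUMERIC_RANGES.items():
        value = customer.get(name)
        if not isinstance(value, (int, float)) or not lo <= value <= hi:
            problems.append(f"{name} must be a number between {lo} and {hi}, got {value!r}")
    for name in BINARY_FLAGS:
        if customer.get(name) not in (0, 1):
            problems.append(f"{name} must be 0 or 1, got {customer.get(name)!r}")
    if customer.get("life_stage") not in LIFE_STAGES:
        problems.append(f"life_stage must be one of {sorted(LIFE_STAGES)}")
    if customer.get("education_level") not in EDUCATION_LEVELS:
        problems.append(f"education_level must be one of {sorted(EDUCATION_LEVELS)}")
    return problems


record = {
    "age": 45, "mobile_app_usage_score": 75.0, "credit_score": 720,
    "has_life_insurance": 0, "has_apv": 1, "has_dependents": 1, "home_owner": 1,
    "life_stage": "Married_Kids", "education_level": "Bachelor",
}
print(validate_customer(record))                  # no problems for this record
print(validate_customer({**record, "age": 17}))   # one problem: age out of range
```

Only the fields with a documented range or category set are checked here; the remaining numeric fields have no stated bounds in this README.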

Derived Features

Feature engineering creates additional features:

engagement_score: Weighted engagement metric
financial_stability_score: Combined financial health indicator
wealth_score: Aggregated wealth metric
life_stage_ordinal: Encoded life stage
recency_score: Inverse of days since last purchase
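
The exact formulas live in `src/features/build_features.py` and are not reproduced in this README. As an illustration of the kind of transformation involved, here is a sketch in which every weight, cap, and ordering is an assumption chosen for readability, not the project's actual definition.

```python
import math

# Hypothetical ordering; the real encoding is defined in build_features.py
LIFE_STAGE_ORDER = {"Single": 0, "Married_NoKids": 1, "Married_Kids": 2, "Divorced": 3, "Retired": 4}


def derive_features(c: dict) -> dict:
    """Return a copy of the customer record with illustrative derived features added."""
    out = dict(c)
    # Weighted engagement: the weights (0.4, 0.3, 0.3) and the 30-visit cap are assumptions
    out["engagement_score"] = (
        0.4 * c["email_open_rate"]
        + 0.3 * min(c["web_visits_monthly"] / 30, 1.0)
        + 0.3 * c["mobile_app_usage_score"] / 100
    )
    # Credit score rescaled from 300-850 to 0-1, blended with tenure and home ownership (assumed form)
    out["financial_stability_score"] = (
        0.5 * (c["credit_score"] - 300) / 550
        + 0.3 * min(c["employment_years"] / 20, 1.0)
        + 0.2 * c["home_owner"]
    )
    # Log scale tames the long right tail of CLP amounts (assumed form)
    out["wealth_score"] = math.log1p(c["total_assets_clp"] + c["avg_monthly_balance_clp"])
    out["life_stage_ordinal"] = LIFE_STAGE_ORDER[c["life_stage"]]
    # Inverse of days since last purchase; the +1 avoids division by zero
    out["recency_score"] = 1.0 / (1.0 + c["last_purchase_days"])
    return out


customer = {
    "email_open_rate": 0.65, "web_visits_monthly": 5, "mobile_app_usage_score": 75.0,
    "credit_score": 720, "employment_years": 12, "home_owner": 1,
    "total_assets_clp": 25000000, "avg_monthly_balance_clp": 2500000,
    "life_stage": "Married_Kids", "last_purchase_days": 45,
}
features = derive_features(customer)
print({k: round(v, 4) for k, v in features.items() if k not in customer})
```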

🔮 API Usage

Single Prediction

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "age": 45,
    "has_life_insurance": 0,
    "has_apv": 1,
    "total_products": 3,
    "customer_lifetime_years": 8,
    "interactions_last_6m": 7,
    "web_visits_monthly": 5,
    "email_open_rate": 0.65,
    "annual_income_clp": 3500000,
    "total_assets_clp": 25000000,
    "life_stage": "Married_Kids",
    "has_dependents": 1,
    "avg_monthly_balance_clp": 2500000,
    "transaction_frequency_monthly": 8,
    "last_purchase_days": 45,
    "customer_service_calls": 2,
    "mobile_app_usage_score": 75.0,
    "credit_score": 720,
    "employment_years": 12,
    "home_owner": 1,
    "education_level": "Bachelor"
  }'

Response:

{
  "customer_id": null,
  "prediction": 1,
  "probability": 0.7834,
  "confidence": "High"
}
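
On the client side, the response body can be parsed into a small typed object so downstream code does not pass raw dictionaries around. This is a sketch based only on the four fields shown in the sample response. The class name `PredictionResult` is hypothetical, and treating `customer_id` as an optional string is an assumption, since the sample only shows `null`.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PredictionResult:
    customer_id: Optional[str]  # assumed type; the sample response only shows null
    prediction: int
    probability: float
    confidence: str


def parse_prediction(body: str) -> PredictionResult:
    """Parse a /predict response body into a PredictionResult."""
    data = json.loads(body)
    return PredictionResult(
        customer_id=data.get("customer_id"),
        prediction=int(data["prediction"]),
        probability=float(data["probability"]),
        confidence=str(data["confidence"]),
    )


# Sample body copied from the response shown above
sample = '{"customer_id": null, "prediction": 1, "probability": 0.7834, "confidence": "High"}'
result = parse_prediction(sample)
print(result)
```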

Batch Prediction

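import requests

# Each entry is a dict with the same fields as the single-prediction payload
customers = [
    { ... },  # Customer 1 features
    { ... },  # Customer 2 features
]

response = requests.post(
    "http://localhost:8000/batch_predict",
    json={"customers": customers},
    timeout=30,
)
response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error body

results = response.json()
print(f"High propensity customers: {results['high_propensity_count']}")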
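
When the customers come from a CSV export, the rows need to be converted to typed dictionaries and, for large files, split into several requests. The sketch below builds request bodies in the `{"customers": [...]}` shape used above. The helper names and the default `batch_size` of 500 are assumptions (the API may impose its own limit), and the demo CSV carries only three columns for brevity.

```python
import csv
import io


def coerce(value: str):
    """Convert a CSV cell to int or float when possible, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def csv_to_batches(csv_text: str, batch_size: int = 500):
    """Yield request bodies of the form {"customers": [...]} with at most batch_size rows each."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [{key: coerce(cell) for key, cell in row.items()} for row in reader]
    for start in range(0, len(rows), batch_size):
        yield {"customers": rows[start:start + batch_size]}


# Tiny demo file; a real export would carry every input feature
demo = "age,life_stage,email_open_rate\n45,Married_Kids,0.65\n30,Single,0.4\n52,Retired,0.1\n"
batches = list(csv_to_batches(demo, batch_size=2))
print(len(batches), batches[0]["customers"][0])
```

Each yielded body can then be passed as the `json=` argument of a `requests.post` call to `/batch_predict`.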

📈 Dashboard Features

Home Page: Model overview and performance metrics
Single Prediction: Interactive form for individual predictions
Batch Prediction: CSV upload for bulk predictions
Model Insights: Feature importance and model details
MLflow Experiments: View tracked experiments and metrics

🧪 Testing

Run tests:

pytest tests/

Run with coverage:

pytest --cov=src tests/

🐳 Docker Deployment

Build and Run

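# The compose file lives in docker/ (see Project Structure)

# Build all services
docker-compose -f docker/docker-compose.yml build

# Start all services
docker-compose -f docker/docker-compose.yml up -d

# View logs
docker-compose -f docker/docker-compose.yml logs -f

# Stop services
docker-compose -f docker/docker-compose.yml down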

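Service URLs

Once the containers are running, the services are available at the addresses listed under Access Points (dashboard on port 8501, API on port 8000, MLflow UI on port 5000).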

📊 Model Performance

Target Metrics:

AUC-ROC: ≥ 0.82
Accuracy: ≥ 0.78
Precision: ≥ 0.75
Recall: ≥ 0.70
F1-Score: ≥ 0.73

Current Performance (on test set):

AUC-ROC: 0.8234
Accuracy: 0.7856
Precision: 0.7856
Recall: 0.7234
F1-Score: 0.7534
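
A comparison like the one above is easy to automate, for example as a gate before promoting a model. The following sketch hard-codes the target and current values from this section. The function name `check_targets` is hypothetical and not part of the project code.

```python
# Values copied from the Target Metrics and Current Performance lists above
TARGETS = {"AUC-ROC": 0.82, "Accuracy": 0.78, "Precision": 0.75, "Recall": 0.70, "F1-Score": 0.73}
CURRENT = {"AUC-ROC": 0.8234, "Accuracy": 0.7856, "Precision": 0.7856, "Recall": 0.7234, "F1-Score": 0.7534}


def check_targets(current: dict, targets: dict) -> dict:
    """Map each metric name to a (value, target, passed) tuple."""
    return {name: (current[name], target, current[name] >= target) for name, target in targets.items()}


report = check_targets(CURRENT, TARGETS)
for name, (value, target, passed) in report.items():
    status = "PASS" if passed else "FAIL"
    print(f"{name:<10} {value:.4f} (target >= {target:.2f}) {status}")

all_met = all(passed for _, _, passed in report.values())
print("All targets met:", all_met)
```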

🔧 Configuration

Edit config.yaml to customize:

# Data configuration
data:
  n_samples: 50000
  test_size: 0.2
  validation_size: 0.15
  random_state: 42

# Model hyperparameters
model:
  iterations: 1000
  depth: 6
  learning_rate: 0.1
  l2_leaf_reg: 3.0
  border_count: 128

# Training configuration
training:
  n_trials: 50          # Optuna trials (0 to disable)
  timeout: 3600          # Optimization timeout (seconds)

# MLflow configuration
mlflow:
  experiment_name: "propensity_modeling_cross_selling"
  tracking_uri: "./mlruns"
  model_name: "propensity_catboost_model"
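
For a sense of how the `data` settings translate into row counts, the sketch below works through the arithmetic. It assumes that `test_size` is a fraction of all samples and that `validation_size` is a fraction of what remains after the test split. The README does not state which convention `train.py` uses, so treat the resulting numbers as illustrative.

```python
def split_sizes(n_samples: int, test_size: float, validation_size: float) -> dict:
    """Row counts per split, assuming validation is carved out of the non-test remainder."""
    n_test = round(n_samples * test_size)
    n_val = round((n_samples - n_test) * validation_size)
    return {"train": n_samples - n_test - n_val, "validation": n_val, "test": n_test}


# Defaults from config.yaml above
sizes = split_sizes(n_samples=50000, test_size=0.2, validation_size=0.15)
print(sizes)
```

If `validation_size` were instead a fraction of the full dataset, the validation set would hold 7,500 rows and the training set 32,500.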

🔄 Model Retraining

# Generate new data
python data/generate_synthetic_data.py --n_samples 50000

# Train (Optuna hyperparameter search runs when training.n_trials > 0 in config.yaml)
python src/models/train.py

# Model automatically logged to MLflow
# Review in MLflow UI and promote to Production

📚 MLflow Commands

Experiment Management

# List experiments (MLflow 2.x; 1.x releases used `mlflow experiments list`)
mlflow experiments search

# Delete experiment (takes the experiment ID, not the name)
mlflow experiments delete -x <experiment_id>

# Create experiment
mlflow experiments create -n <experiment_name>

Run Management

# List runs
mlflow runs list --experiment-id <experiment_id>

# Delete run
mlflow runs delete --run-id <run_id>

# Restore run
mlflow runs restore --run-id <run_id>

Model Registry

from mlflow.tracking import MlflowClient

client = MlflowClient()

# List registered models
# (search_registered_models replaces list_registered_models, which was removed in MLflow 2.0)
for model in client.search_registered_models():
    print(model.name)

# Get model versions
versions = client.get_latest_versions("propensity_catboost_model")

# Transition model stage
client.transition_model_version_stage(
    name="propensity_catboost_model",
    version=1,
    stage="Production"
)

🤝 Contributing

Fork the repository
Create feature branch (git checkout -b feature/amazing-feature)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open Pull Request

📝 License

This project is licensed under the MIT License.

👥 Authors

Data Science Team - Initial development

🙏 Acknowledgments

CatBoost team for the excellent gradient boosting library
MLflow team for experiment tracking tools
Open-source community

Note: This project uses synthetic data for demonstration purposes. Replace with actual customer data for production use while ensuring compliance with data privacy regulations.
