customer-segmentation-pension

Customer segmentation for pension fund using K-Means and GMM clustering with FastAPI serving

02 – Customer Segmentation (Pension)

Unsupervised clustering of pension customers with synthetic data, auto-tuned clusters (K-Means + GMM), segment profiling, a lightweight recommendation layer, and a Streamlit explorer. MLflow logs experiments and artifacts locally.

Setup

Create a virtual environment inside the project root and install dependencies:

```
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Always activate the venv before running the commands below.
Before running training or the MLflow UI, point MLflow at the local mlruns/ directory so the UI can find the recorded runs:
```
export MLFLOW_TRACKING_URI="$(pwd)/mlruns"
```

Quickstart

Train pipeline + recommenders (logs to MLflow experiment customer-segmentation-pension)
```
PYTHONPATH=. python3 scripts/train.py --config configs/config.yaml
```
Training generates:
- Models: models/02-segmentation/cluster_pipeline.joblib, models/02-segmentation/recommenders.joblib
- Artifacts: artifacts/metrics.json, artifacts/segment_profiles.json, artifacts/pca_projection.csv
- Data: data/02-segmentation/raw/customers.csv (plus processed CSV)
- MLflow traces under mlruns/
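
A quick check that the files listed above exist can catch a misconfigured path before launching the UI or the app. The following is a minimal sketch, assuming the default paths from this README and the project root as the working directory; `missing_outputs` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

# Paths as listed in this README; adjust if configs/config.yaml overrides them.
EXPECTED_OUTPUTS = [
    "models/02-segmentation/cluster_pipeline.joblib",
    "models/02-segmentation/recommenders.joblib",
    "artifacts/metrics.json",
    "artifacts/segment_profiles.json",
    "artifacts/pca_projection.csv",
    "data/02-segmentation/raw/customers.csv",
]


def missing_outputs(root="."):
    """Return the expected training outputs that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_OUTPUTS if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected training outputs are present.")
```
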
Launch MLflow UI
```
PYTHONPATH=. MLFLOW_TRACKING_URI="$(pwd)/mlruns" mlflow ui --backend-store-uri "$(pwd)/mlruns" --port 5001
```
Open http://localhost:5001 and select customer-segmentation-pension.
If the UI shows “No traces recorded”, see info.md for the full run order and troubleshooting notes.
Run Streamlit explorer
```
PYTHONPATH=. streamlit run src/app.py -- --config configs/config.yaml
```
Default URL: http://localhost:8502; stop with Ctrl+C.

Run FastAPI service (for external tools like Power BI)

```
./scripts/run_api.sh
```

Default URL: http://localhost:8000
Endpoints: /health, /segment, /recommend

Quick example:

```
curl -X POST http://localhost:8000/recommend \
  -H "Content-Type: application/json" \
  -d '{"age":48,"tenure_years":10,"balance_total":1e6,"conservative_ratio":0.4,"variable_ratio":0.3,"contribution_consistency":0.8,"app_logins_monthly":12,"feature_usage_score":70,"support_contacts":2,"avg_transaction_amount":250000,"transaction_frequency":6,"last_transaction_days":12,"risk_tolerance":"Moderate","financial_goal":"Growth"}'
```
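
The same request can be sent from Python using only the standard library. This is a sketch, assuming the API is running at the default URL above; the helper names `build_recommend_request` and `recommend` are illustrative, and the response shape is not documented here, so the code simply returns whatever JSON the service sends back.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # default URL from this README

# Same payload as the curl example.
CUSTOMER = {
    "age": 48,
    "tenure_years": 10,
    "balance_total": 1e6,
    "conservative_ratio": 0.4,
    "variable_ratio": 0.3,
    "contribution_consistency": 0.8,
    "app_logins_monthly": 12,
    "feature_usage_score": 70,
    "support_contacts": 2,
    "avg_transaction_amount": 250000,
    "transaction_frequency": 6,
    "last_transaction_days": 12,
    "risk_tolerance": "Moderate",
    "financial_goal": "Growth",
}


def build_recommend_request(customer, base_url=API_URL):
    """Build a JSON POST request for /recommend without sending it."""
    body = json.dumps(customer).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/recommend",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recommend(customer, base_url=API_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_recommend_request(customer, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (requires the API to be running, e.g. via ./scripts/run_api.sh):
# print(recommend(CUSTOMER))
```
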

What it builds

Synthetic dataset (15k rows, pension schema) with train/val/test splits.
Auto-tuned clustering across K=3–8 for K-Means and GMM, scored with the silhouette, Davies-Bouldin, and Calinski-Harabasz (CH) indices.
Segment profiles (size, centroids, top differentiating features) stored as JSON.
Recommendation layer: per-offer propensity models (logistic regression) using cluster-aware features.
Streamlit explorer: dataset overview, segment explorer (PCA scatter), customer lookup, and top-N offers.
MLflow tracking: params/metrics/artifacts for each training run.
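
The cluster auto-tuning step can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the repo's implementation: it uses toy blob data in place of the pension dataset, and it ranks candidates by silhouette alone, since this README does not say how the three metrics are combined. `score_candidates` and `pick_best` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def score_candidates(X, k_values=range(3, 9), seed=42):
    """Fit K-Means and GMM for each K and score the resulting labels."""
    results = []
    for k in k_values:
        models = {
            "kmeans": KMeans(n_clusters=k, n_init=10, random_state=seed),
            "gmm": GaussianMixture(n_components=k, random_state=seed),
        }
        for name, model in models.items():
            labels = model.fit_predict(X)
            if len(set(labels)) < 2:
                continue  # the metrics need at least two clusters
            results.append({
                "model": name,
                "k": k,
                "silhouette": silhouette_score(X, labels),  # higher is better
                "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
                "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
            })
    return results


def pick_best(results):
    """Assumption: rank by silhouette only; the repo may combine metrics differently."""
    return max(results, key=lambda r: r["silhouette"])


# Toy data standing in for the scaled customer features.
X_raw, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X_raw)

results = score_candidates(X)
best = pick_best(results)
print(best["model"], best["k"], round(best["silhouette"], 3))
```

Davies-Bouldin is minimized while the other two are maximized, so any combined ranking has to flip its sign or rank it in reverse.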

Repo layout

configs/ – configuration (paths, tuning ranges, offers, MLflow).
data/02-segmentation/ – raw + processed CSVs.
models/02-segmentation/ – persisted cluster pipeline + recommender models.
artifacts/ – metrics, profiles, PCA projection for the app.
src/ – code: data generation, features, clustering, profiling, recommenders, app, api.
scripts/ – entrypoints to train, serve app, and run API.
tests/ – unit/smoke tests for pipeline pieces.

Key commands

Train: PYTHONPATH=. python3 scripts/train.py --config configs/config.yaml
App: PYTHONPATH=. streamlit run src/app.py -- --config configs/config.yaml
API: ./scripts/run_api.sh
Tests: pytest -q

Notes

Designed so that anonymized real data can be swapped in later, provided the schema stays compatible.
MLflow uses a local mlruns/ backend by default; set MLFLOW_TRACKING_URI to use a remote server.
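
Before swapping in real data, a header check can flag schema mismatches early. The sketch below assumes the customer CSV uses the same field names as the /recommend example payload in this README; that is an assumption, and the repo's configuration and feature code remain the source of truth. `missing_columns` and `check_csv` are hypothetical helpers.

```python
import csv

# Assumption: column names mirror the /recommend example payload.
REQUIRED_COLUMNS = {
    "age",
    "tenure_years",
    "balance_total",
    "conservative_ratio",
    "variable_ratio",
    "contribution_consistency",
    "app_logins_monthly",
    "feature_usage_score",
    "support_contacts",
    "avg_transaction_amount",
    "transaction_frequency",
    "last_transaction_days",
    "risk_tolerance",
    "financial_goal",
}


def missing_columns(header):
    """Return required columns absent from the given header, sorted by name."""
    return sorted(REQUIRED_COLUMNS - set(header))


def check_csv(path):
    """Read only the header row of a CSV file and report missing columns."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return missing_columns(header)


# Usage, with the raw data path listed in this README:
# print(check_csv("data/02-segmentation/raw/customers.csv"))
```
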
