ds-modeling (prospectml) Overview
Python machine learning library for prospect acquisition modeling, response analysis, and client reporting. The core Data Science engine powering Path2Acquisition.
Purpose
ds-modeling (package name: prospectml) is the Data Science team’s machine learning library for:
- XGBoost Propensity Models - Train and score gradient boosting models to predict household response probability
- Response Analysis - Measure campaign performance by matching fulfillment files to transaction data
- Client Reporting - Generate model grade reports, client reports, and future projections
- DMRP (Path2Performance) - RFM-based scoring and tiering for Path2Ignite campaigns
The library runs both locally for development and on AWS Batch/EMR for production workloads.
Architecture
Repository Structure
ds-modeling/ # Root - Version 335.0.0+SNAPSHOT
├── prospectml/ # Main Python package
│ ├── core/ # Core model classes
│ │ ├── p2a.py # P2AModel - main orchestrator
│ │ ├── base_class.py # P2RBaseClass
│ │ └── fulfillment.py # Fulfillment processing
│ │
│ ├── estimators/ # ML model wrappers
│ │ ├── baseclass.py # P2REstimator base
│ │ └── xgbestimator.py # XGBEstimator (XGBoost wrapper)
│ │
│ ├── evaluators/ # Model evaluation
│ │ ├── modelevaluator.py # Base evaluation logic
│ │ └── xgbevaluator.py # XGBoost-specific evaluation
│ │
│ ├── readers/ # Data I/O
│ │ ├── baseclass.py # VariableReader, ScoreReader bases
│ │ ├── textreader.py # Tab-separated variable files
│ │ ├── parquetreader.py # Parquet format support
│ │ ├── samplereader.py # Sampling utilities
│ │ ├── fileio.py # File I/O helpers
│ │ └── io_config.py # Configuration file handling
│ │
│ ├── auto_ra/ # Automated Response Analysis
│ │ ├── ra_for_promotion.py # Promotion-level RA
│ │ ├── ra_for_campaign.py # Campaign-level RA
│ │ ├── combine_all_ra.py # RA aggregation
│ │ ├── get_order_curves.py # Order curve generation
│ │ └── get_order_info.py # Order information extraction
│ │
│ ├── dmrp/ # Path2Performance (DMRP)
│ │ ├── dmrp_scoring.py # RFM scoring and tiering
│ │ └── dmrp_client_report.py # DMRP client reports
│ │
│ ├── command_line_reports/ # Report generation
│ │ ├── model_grade_report.py # Model quality assessment
│ │ ├── client_report.py # Client-facing reports
│ │ ├── new_client_report.py # Updated client report format
│ │ ├── model_comparison.py # A/B model comparison
│ │ └── future_projections_*.json # Mock data for testing
│ │
│ ├── future_projections/ # Future projection calculations
│ │ └── xgb_future_projections.md # Documentation
│ │
│ ├── mcmc/ # MCMC/Bayesian methods
│ │ ├── metropolis_hastings.py # MCMC sampling
│ │ ├── mcmc_metrics.py # Metric calculations
│ │ ├── mcmc_density_report.py # Density reporting
│ │ └── mcmc_file_handlers.py # File handling for MCMC
│ │
│ ├── reporting/ # Report utilities
│ ├── reporting_encoded/ # Encoded report templates
│ ├── parsers/ # Data parsing utilities
│ ├── utils/ # General utilities
│ │ └── aes_utils/ # Brand aesthetics (fonts, colors)
│ ├── mocking/ # Test mocks
│ ├── p2r_exceptions/ # Custom exceptions
│ ├── p2r_str_enums/ # String enumerations
│ ├── experimental/ # Experimental features
│ └── tests/ # Test suite
│ └── integration_tests/ # AWS Batch integration tests
│
├── bin/ # Command-line entry points
│ ├── xgb_training # XGBoost training
│ ├── xgb_scoring # XGBoost scoring
│ ├── xgb_reporting # Report generation
│ ├── xgb_future_projection # Future projections
│ ├── run_full_xgb_model # Full pipeline orchestrator
│ ├── run_xgb_training # AWS Batch training
│ ├── run_xgb_scoring # AWS Batch scoring
│ ├── run_xgb_reporting # AWS Batch reporting
│ ├── run_dmrp_scoring # DMRP scoring
│ ├── run_dmrp_client_report # DMRP reports
│ ├── model_grade_report # Model grade generation
│ ├── client_report # Client report generation
│ ├── interleave_fulfillments # Fulfillment interleaving
│ └── submit_batch # AWS Batch job submission
│
├── deployment/ # Deployment utilities
│ ├── deployment.sh # Deploy to dev/prod servers
│ ├── make_aws_resources.py # Create AWS Batch resources
│ ├── test_deployment.py # Deployment verification
│ └── ansible_deploy.py # Ansible integration
│
├── docker/ # Docker configurations
│ ├── prospectml_poetry/ # Main production image
│ ├── jupyter_notebook/ # Jupyter Lab image
│ └── mlflow_server/ # MLflow UI image
│
├── deprecated_code/ # Legacy code (MLflow tracking)
├── ds_service/ # Deprecated API service
├── docs/ # Sphinx documentation
│
├── pyproject.toml # Poetry dependencies
├── poetry.lock # Locked dependencies
├── setup.py # Legacy setup
├── bitbucket-pipelines.yml # CI/CD pipeline
├── magical_deployment.ts # Deployment script (TypeScript)
├── build_and_install.sh # Local build script
└── README.md # Main documentation
Technology Stack
| Component | Version | Notes |
|---|---|---|
| Python | 3.12.8 | Locked version |
| XGBoost | ~2.1 | Core ML framework |
| pandas | ~2.3 | Data manipulation |
| polars | ~1.36 | High-performance DataFrames |
| NumPy | ~2.3 | Numerical computing |
| scikit-learn | ~1.8 | ML utilities, cross-validation |
| SHAP | ~0.50 | Model explainability |
| MLflow | ~3.8 | Experiment tracking |
| Plotly | ~6.5 | Interactive visualizations |
| PyArrow | ~22.0 | Parquet file support |
| boto3 | 1.35.36 | AWS SDK |
| s3fs | 2024.10.0 | S3 filesystem access |
| PyMongo | ~4.15 | MongoDB access |
| borb | ~2.1 | PDF generation |
| XlsxWriter | ~3.2 | Excel file generation |
Build Tools:
- Poetry for dependency management
- Pre-commit hooks for code quality
- Pylint for linting
- Pytest for testing (70% minimum coverage)
- Sphinx for API documentation
Core Functionality
XGBoost Model Training (P2AModel)
The P2AModel class orchestrates the entire training pipeline:
```python
# Training workflow
model = P2AModel(
    variable_reader=collected_variables,
    estimator=XGBEstimator(),
)
model.fit(max_features=300)       # Initial training
model.fit_score(drop_train=True)  # Retrain for scoring
model.cv_fit()                    # Cross-validation
```
Key Steps:
- Variable Collection - Read tab-separated or Parquet variable files
- Feature Selection - Drop sparse/irrelevant variables, apply buyer type filtering
- Initial Training - Train with train/test split, early stopping
- Feature Importance - Select top 300 features by importance
- Score Training - Retrain on full data with selected features
- Cross-Validation - 10-fold stratified CV for metrics
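The Feature Importance step above reduces to a rank-and-truncate over the trained model's importances. A minimal sketch of that idea (the function name and dict-of-importances shape are illustrative assumptions, not the library's actual API):

```python
def select_top_features(importances: dict[str, float], max_features: int = 300) -> list[str]:
    """Rank features by importance (e.g. XGBoost gain) and keep the top N."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:max_features]]

# Toy importances for four variables
imp = {"var_a": 12.3, "var_b": 0.4, "var_c": 7.7, "var_d": 0.0}
print(select_top_features(imp, max_features=2))  # ['var_a', 'var_c']
```

The selected feature list is what gets written to `features.txt` for the score-side preselect run.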
XGBEstimator
Wraps XGBoost with P2R-specific defaults:
| Parameter | Value | Purpose |
|---|---|---|
| objective | binary:logistic | Binary classification |
| base_score | 0.01 | Low base score for imbalanced data |
| learning_rate | 0.02 | Slow learning |
| max_depth | 4 | Shallow trees |
| n_estimators | 2000 | Many boosting rounds |
| colsample_bytree | 0.7 | Column sampling |
| reg_alpha | 0.5 | L1 regularization |
| reg_lambda | 2.5 | L2 regularization |
| eval_metric | aucpr | Area under PR curve |
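The defaults in the table map directly onto `xgboost.XGBClassifier` keyword arguments. A sketch of what `XGBEstimator` likely configures (the dict name is an assumption; this is not the wrapper's actual code):

```python
# Parameters one could pass as xgboost.XGBClassifier(**P2R_XGB_DEFAULTS);
# names match XGBoost's scikit-learn API.
P2R_XGB_DEFAULTS = {
    "objective": "binary:logistic",
    "base_score": 0.01,        # low prior for heavily imbalanced response data
    "learning_rate": 0.02,
    "max_depth": 4,            # shallow trees
    "n_estimators": 2000,      # many slow boosting rounds
    "colsample_bytree": 0.7,
    "reg_alpha": 0.5,          # L1 regularization
    "reg_lambda": 2.5,         # L2 regularization
    "eval_metric": "aucpr",
}
```

The low `base_score` plus slow learning rate and heavy regularization are a common recipe when positives are around 1-2% of the population.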
Model Evaluation
The XGBEvaluator produces:
- Responder graphs (gains curves)
- Calibration plots
- Variable importance plots
- SHAP values for explainability
- Precision-recall curves
- Model metrics (AUC-PR, etc.)
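The headline metric in that list, AUC-PR, can be computed with scikit-learn (already in the stack) as average precision. A self-contained sketch on synthetic imbalanced data, not the evaluator's actual code:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Synthetic imbalanced response data (~2% responders)
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.02, size=10_000)
y_prob = np.clip(0.02 + 0.30 * y_true + rng.normal(0, 0.05, size=10_000), 0.0, 1.0)

# Average precision approximates AUC-PR, matching the aucpr eval_metric used in training
auc_pr = average_precision_score(y_true, y_prob)
precision, recall, _ = precision_recall_curve(y_true, y_prob)
print(f"AUC-PR: {auc_pr:.3f}")
```

With a ~2% base rate, a random model scores an AUC-PR near 0.02, so even modest-looking values can represent large lift.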
Scoring
Score households using trained models:
```python
model.score(score_reader)
scores_df = model.scores  # DataFrame with hhid, prob_respond
```
Supports chunked scoring for large datasets.
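The chunked-scoring idea can be sketched with pandas' `chunksize` reader. The column name (`hhid`) and the `predict_proba` interface are assumptions; the library's real reader classes handle this:

```python
import pandas as pd

def score_in_chunks(model, path: str, chunksize: int = 500_000) -> pd.DataFrame:
    """Score a large tab-separated variable file chunk by chunk.

    Sketch only: keeps memory bounded by never loading the full file.
    """
    parts = []
    for chunk in pd.read_csv(path, sep="\t", chunksize=chunksize):
        features = chunk.drop(columns=["hhid"])
        parts.append(pd.DataFrame({
            "hhid": chunk["hhid"],
            "prob_respond": model.predict_proba(features)[:, 1],
        }))
    return pd.concat(parts, ignore_index=True)
```

Each chunk is scored independently, so peak memory is proportional to `chunksize` rather than the population size.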
Response Analysis (auto_ra)
Response Analysis measures campaign performance by matching fulfillment files to transaction data.
Key Metrics
| Metric | Description |
|---|---|
| Response Rate (RR) | % of mailed names that responded |
| Average Order Volume (AOV) | Mean order amount |
| Median Order Volume (MedOV) | Median order amount |
| Dollars per Book | Total demand / names mailed |
| Demand | Total order amounts |
| Index | Value / target value |
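Given a matched-transactions frame and the mailed count, these metrics reduce to simple aggregates. A sketch with illustrative column names (the real RA code computes these per keycode and segment):

```python
import pandas as pd

def ra_metrics(matched: pd.DataFrame, names_mailed: int) -> dict:
    """Compute headline RA metrics from matched transactions.

    `matched` is assumed to hold one row per responding household,
    with an `amount` column for the order dollars.
    """
    demand = matched["amount"].sum()
    return {
        "response_rate": len(matched) / names_mailed,   # RR
        "aov": matched["amount"].mean(),                # Average Order Volume
        "medov": matched["amount"].median(),            # Median Order Volume
        "dollars_per_book": demand / names_mailed,
        "demand": demand,
    }

m = pd.DataFrame({"amount": [40.0, 60.0, 100.0]})
print(ra_metrics(m, names_mailed=1_000))
```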
RA Workflow
Promotion Level:
- Combine fulfillment plan with raw fulfillment (get hhid + keycode + productCode)
- Create keycode_df and segment_df (expected counts)
- Match transactions (title, date window, amount > 0)
- Calculate RA by keycode (group by keycode, productCode)
- Calculate RA by segment (group by modelKey, tier)
- Save model QC info from MongoDB
Campaign Level:
- Aggregate promotion-level results
- Create results by date, keycode, segment
- Generate Path2Ignite reports (if applicable)
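The transaction-matching step in the promotion-level workflow (title join, date window, positive amount) can be sketched as a pandas merge plus filters. Column names and the window length are assumptions:

```python
import pandas as pd

def match_transactions(fulfillment: pd.DataFrame, txns: pd.DataFrame,
                       mail_date: str, window_days: int = 120) -> pd.DataFrame:
    """Match fulfillment households to transactions.

    Sketch of the matching rules: join on household + title, keep orders
    inside the response window with a positive dollar amount.
    """
    start = pd.Timestamp(mail_date)
    end = start + pd.Timedelta(days=window_days)
    matched = fulfillment.merge(txns, on=["hhid", "title"], how="inner")
    in_window = matched["order_date"].between(start, end)
    return matched[in_window & (matched["amount"] > 0)]
```

Grouping the matched frame by `keycode`/`productCode` (or `modelKey`/`tier`) then yields the keycode- and segment-level RA tables.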
Output Files
| File | Description |
|---|---|
| `results_keycode.csv` | RA by keycode |
| `results_segments.csv` | RA by segment/tier |
| `matched_transactions.csv.gz` | Matched transaction data |
| `amounts.csv` | Dollar-amount distribution |
| `days.csv` | Response-curve data |
| `display_text.json` | Model QC and warnings |
DMRP (Path2Performance)
RFM-based scoring for Path2Ignite campaigns:
```bash
/bin/bash /opt/data-science/prospectml/ordersapp/dmrp.sh <select_dir> <num_tiers>
```
Outputs:
- `fulfillment_input2` with tier assignments
- `seg_report` with RFM statistics per tier
- Client report (via `dmrp_client_report.sh`)
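The tiering step can be sketched as an equal-count quantile split over the RFM score. This is only an illustration of the idea; the actual scoring lives in `dmrp_scoring.py` and may differ:

```python
import pandas as pd

def assign_tiers(scores: pd.Series, num_tiers: int) -> pd.Series:
    """Split households into equal-count tiers by RFM score (tier 1 = best)."""
    # Rank first so ties break deterministically, then qcut into equal-count
    # bins with descending labels so the highest scores land in tier 1.
    return pd.qcut(
        scores.rank(method="first"),
        q=num_tiers,
        labels=list(range(num_tiers, 0, -1)),
    ).astype(int)

s = pd.Series([0.9, 0.1, 0.5, 0.7, 0.3, 0.8])
print(assign_tiers(s, num_tiers=3).tolist())  # [1, 3, 2, 2, 3, 1]
```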
Command-Line Tools
Training Pipeline
| Script | Purpose |
|---|---|
| `run_full_xgb_model` | Full pipeline: preselect -> train -> score -> reports |
| `xgb_training` | Direct training (local) |
| `run_xgb_training` | AWS Batch training |
| `xgb_scoring` | Direct scoring (local) |
| `run_xgb_scoring` | AWS Batch scoring |
| `xgb_reporting` | Direct reporting (local) |
| `run_xgb_reporting` | AWS Batch reporting |
Reports
| Script | Purpose |
|---|---|
| `model_grade_report` | Model quality metrics |
| `client_report` | Client-facing report |
| `model_comparison` | A/B model comparison |
Future Projections
| Script | Purpose |
|---|---|
| `xgb_future_projection` | Calculate future response estimates |
| `xgb_future_projection_direct` | Direct execution |
| `run_xgb_future_projection` | AWS Batch execution |
Example: Full Model Run
```bash
run_full_xgb_model \
  --title teacollection \
  --variables-dir /path/to/select \
  --model-dir /path/to/model \
  --score-dir /path/to/select/score \
  --report-dir /path/to/report \
  --segment-size 25000 \
  --max-depth 100000 \
  --households-date 2020-08-14 \
  --run-env staging \
  --run-steps train-select train score-select score reports-select reports \
  tracking \
  --order ORDER-12345 \
  --model MODEL-7 \
  --server-type databricks
```
Run Steps:
- `train-select` - Run preselect for variables (coop-scala)
- `train` - XGBoost training (AWS Batch)
- `score-select` - Run preselect for scoring (coop-scala)
- `score` - XGBoost scoring (AWS Batch)
- `reports-select` - Prepare report data (coop-scala)
- `reports` - Generate reports (AWS Batch)
Integration with coop-scala
ds-modeling integrates with coop-scala Spark jobs for data preparation:
Data Flow
```
┌──────────────────┐
│    Order App     │  (Configuration, preselect.json)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   preselect.sc   │  (coop-scala: Generate variables)
│   (EMR Spark)    │
└────────┬─────────┘
         │  variables-*.txt.gz
         ▼
┌──────────────────┐
│   xgb_training   │  (ds-modeling: Train model)
│   (AWS Batch)    │
└────────┬─────────┘
         │  model.pkl, features.txt
         ▼
┌──────────────────┐
│   preselect.sc   │  (coop-scala: Score population)
│     --score      │
└────────┬─────────┘
         │  score-*.txt.gz
         ▼
┌──────────────────┐
│   xgb_scoring    │  (ds-modeling: Generate scores)
│   (AWS Batch)    │
└────────┬─────────┘
         │  fulfillment_input2.tsv
         ▼
┌──────────────────┐
│  xgb_reporting   │  (ds-modeling: Generate reports)
│   (AWS Batch)    │
└──────────────────┘
```
Key Files
| File | Source | Consumer |
|---|---|---|
| `preselect.json` | Order App | coop-scala, ds-modeling |
| `io_config.json` | coop-scala | ds-modeling |
| `variables-*.txt.gz` | coop-scala | ds-modeling |
| `aggregateCounts.txt.gz` | coop-scala | ds-modeling |
| `model.pkl` | ds-modeling | ds-modeling |
| `features.txt` | ds-modeling | coop-scala |
| `fulfillment_input2.tsv` | ds-modeling | Order App |
AWS Infrastructure
AWS Batch
Job Definitions:
| Name | Environment | Purpose |
|---|---|---|
| `P2A3_prod` | Production | Production model runs |
| `P2A3_dev` | Development | Staging/dev model runs |
| `P2A3_rc` | Release Candidate | RC testing |
Docker Image: `448838825215.dkr.ecr.us-east-1.amazonaws.com/prospectml_poetry:<tag>`
Resources:
- vCPUs: 94
- Memory: 370,000 MB
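The `submit_batch` entry point presumably wraps the AWS Batch SubmitJob API. A hedged sketch using boto3, where the queue/definition names are assumed to follow the `P2A3_*` convention above and the command is illustrative:

```python
def build_submit_request(job_name: str, env: str = "dev") -> dict:
    """Build submit_job keyword arguments (P2A3_* naming is an assumption)."""
    return {
        "jobName": job_name,
        "jobQueue": f"P2A3_{env}",
        "jobDefinition": f"P2A3_{env}",
        "containerOverrides": {"command": ["run_xgb_training", "--run-env", env]},
    }

def submit_training_job(job_name: str, env: str = "dev") -> str:
    """Submit a training run to AWS Batch and return the job id."""
    import boto3  # AWS SDK, pinned in pyproject.toml

    batch = boto3.client("batch", region_name="us-east-1")
    return batch.submit_job(**build_submit_request(job_name, env))["jobId"]
```

Separating the request-building from the API call keeps the name-mapping logic testable without AWS credentials.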
Environment Variables
| Variable | Purpose |
|---|---|
| `DATABRICKS_ACCOUNT` | Databricks account name |
| `DATABRICKS_AUTHORITY` | Databricks URL |
| `DATABRICKS_ENV` | Environment (dev/prod) |
| `DATABRICKS_TOKEN` | Authentication (from Secrets Manager) |
| `MLFLOW_ROOT_PATH` | MLflow tracking location |
Development
Local Setup
```bash
# Install dependencies
./build_and_install.sh

# Activate virtual environment
source ./venv/bin/activate

# Run pre-commit
pre-commit run --all-files
```
Testing
```bash
# Unit tests
pytest -rs --cov=prospectml --cov-config=.coveragerc prospectml/tests

# Integration tests (requires AWS credentials)
pytest -rs prospectml/tests/integration_tests/integration_test_batch.py --tag staging
```
Docker Development
```bash
# Run unit tests in Docker
docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
  448838825215.dkr.ecr.us-east-1.amazonaws.com/prospectml_poetry:staging \
  "pytest -rs --cov=prospectml prospectml/tests"

# Run Jupyter Lab
docker run -p 8889:8889 \
  448838825215.dkr.ecr.us-east-1.amazonaws.com/prospectml_poetry:staging \
  "jupyter lab --no-browser --ip=0.0.0.0 --port=8889 --allow-root"
```
Deployment
```bash
# Merge to staging branch triggers:
# 1. Docker image build -> ECR (tagged :staging)
# 2. AWS Batch job definitions created
# 3. Integration tests run

# Tag main for production release:
git tag 225.0.0
git push origin 225.0.0

# Deploy to servers
./magical_deployment.ts
```
Dependency Maintenance
```bash
# Update dependencies (scheduled maintenance only)
./update_requirements.sh

# Check outdated packages
poetry show -o

# Export to requirements.txt
poetry export --without-hashes -o requirements.txt
```
Related Documentation
| Document | Location | Description |
|---|---|---|
| README.md | /ds-modeling/README.md | Main project documentation |
| TRANSITION.md | /ds-modeling/TRANSITION.md | Dependency maintenance notes |
| auto_ra README | /prospectml/auto_ra/README.md | Response Analysis docs |
| Future Projections | /prospectml/future_projections/xgb_future_projections.md | FP documentation |
| AES Utils | /prospectml/utils/aes_utils/README.md | Brand aesthetics |
| coop-scala | product-management/docs/coop-scala-overview.md | Spark jobs integration |
| Path2Acquisition Flow | product-management/docs/path2acquisition-flow.md | Business process |
Team Ownership
| Role | Person |
|---|---|
| Director, Data Science | Igor Oliynyk |
| Data Scientists | Morgan Ford, Erica Yang, Paul Martin |
Source: README.md, TRANSITION.md, prospectml/, bin/, deployment/, pyproject.toml, bitbucket-pipelines.yml
Documentation created: 2026-01-24