
ds-modeling (prospectml) Overview

Python machine learning library for prospect acquisition modeling, response analysis, and client reporting. The core Data Science engine powering Path2Acquisition.

Purpose

ds-modeling (package name: prospectml) is the Data Science team’s machine learning library for:

  • XGBoost Propensity Models - Train and score gradient boosting models to predict household response probability
  • Response Analysis - Measure campaign performance by matching fulfillment files to transaction data
  • Client Reporting - Generate model grade reports, client reports, and future projections
  • DMRP (Path2Performance) - RFM-based scoring and tiering for Path2Ignite campaigns

The library runs both locally for development and on AWS Batch/EMR for production workloads.

Architecture

Repository Structure

ds-modeling/                          # Root - Version 335.0.0+SNAPSHOT
├── prospectml/                       # Main Python package
│   ├── core/                         # Core model classes
│   │   ├── p2a.py                    # P2AModel - main orchestrator
│   │   ├── base_class.py             # P2RBaseClass
│   │   └── fulfillment.py            # Fulfillment processing
│   │
│   ├── estimators/                   # ML model wrappers
│   │   ├── baseclass.py              # P2REstimator base
│   │   └── xgbestimator.py           # XGBEstimator (XGBoost wrapper)
│   │
│   ├── evaluators/                   # Model evaluation
│   │   ├── modelevaluator.py         # Base evaluation logic
│   │   └── xgbevaluator.py           # XGBoost-specific evaluation
│   │
│   ├── readers/                      # Data I/O
│   │   ├── baseclass.py              # VariableReader, ScoreReader bases
│   │   ├── textreader.py             # Tab-separated variable files
│   │   ├── parquetreader.py          # Parquet format support
│   │   ├── samplereader.py           # Sampling utilities
│   │   ├── fileio.py                 # File I/O helpers
│   │   └── io_config.py              # Configuration file handling
│   │
│   ├── auto_ra/                      # Automated Response Analysis
│   │   ├── ra_for_promotion.py       # Promotion-level RA
│   │   ├── ra_for_campaign.py        # Campaign-level RA
│   │   ├── combine_all_ra.py         # RA aggregation
│   │   ├── get_order_curves.py       # Order curve generation
│   │   └── get_order_info.py         # Order information extraction
│   │
│   ├── dmrp/                         # Path2Performance (DMRP)
│   │   ├── dmrp_scoring.py           # RFM scoring and tiering
│   │   └── dmrp_client_report.py     # DMRP client reports
│   │
│   ├── command_line_reports/         # Report generation
│   │   ├── model_grade_report.py     # Model quality assessment
│   │   ├── client_report.py          # Client-facing reports
│   │   ├── new_client_report.py      # Updated client report format
│   │   ├── model_comparison.py       # A/B model comparison
│   │   └── future_projections_*.json # Mock data for testing
│   │
│   ├── future_projections/           # Future projection calculations
│   │   └── xgb_future_projections.md # Documentation
│   │
│   ├── mcmc/                         # MCMC/Bayesian methods
│   │   ├── metropolis_hastings.py    # MCMC sampling
│   │   ├── mcmc_metrics.py           # Metric calculations
│   │   ├── mcmc_density_report.py    # Density reporting
│   │   └── mcmc_file_handlers.py     # File handling for MCMC
│   │
│   ├── reporting/                    # Report utilities
│   ├── reporting_encoded/            # Encoded report templates
│   ├── parsers/                      # Data parsing utilities
│   ├── utils/                        # General utilities
│   │   └── aes_utils/                # Brand aesthetics (fonts, colors)
│   ├── mocking/                      # Test mocks
│   ├── p2r_exceptions/               # Custom exceptions
│   ├── p2r_str_enums/                # String enumerations
│   ├── experimental/                 # Experimental features
│   └── tests/                        # Test suite
│       └── integration_tests/        # AWS Batch integration tests
│
├── bin/                              # Command-line entry points
│   ├── xgb_training                  # XGBoost training
│   ├── xgb_scoring                   # XGBoost scoring
│   ├── xgb_reporting                 # Report generation
│   ├── xgb_future_projection         # Future projections
│   ├── run_full_xgb_model            # Full pipeline orchestrator
│   ├── run_xgb_training              # AWS Batch training
│   ├── run_xgb_scoring               # AWS Batch scoring
│   ├── run_xgb_reporting             # AWS Batch reporting
│   ├── run_dmrp_scoring              # DMRP scoring
│   ├── run_dmrp_client_report        # DMRP reports
│   ├── model_grade_report            # Model grade generation
│   ├── client_report                 # Client report generation
│   ├── interleave_fulfillments       # Fulfillment interleaving
│   └── submit_batch                  # AWS Batch job submission
│
├── deployment/                       # Deployment utilities
│   ├── deployment.sh                 # Deploy to dev/prod servers
│   ├── make_aws_resources.py         # Create AWS Batch resources
│   ├── test_deployment.py            # Deployment verification
│   └── ansible_deploy.py             # Ansible integration
│
├── docker/                           # Docker configurations
│   ├── prospectml_poetry/            # Main production image
│   ├── jupyter_notebook/             # Jupyter Lab image
│   └── mlflow_server/                # MLflow UI image
│
├── deprecated_code/                  # Legacy code (MLflow tracking)
├── ds_service/                       # Deprecated API service
├── docs/                             # Sphinx documentation
│
├── pyproject.toml                    # Poetry dependencies
├── poetry.lock                       # Locked dependencies
├── setup.py                          # Legacy setup
├── bitbucket-pipelines.yml           # CI/CD pipeline
├── magical_deployment.ts             # Deployment script (TypeScript)
├── build_and_install.sh              # Local build script
└── README.md                         # Main documentation

Technology Stack

| Component    | Version   | Notes                          |
|--------------|-----------|--------------------------------|
| Python       | 3.12.8    | Locked version                 |
| XGBoost      | ~2.1      | Core ML framework              |
| pandas       | ~2.3      | Data manipulation              |
| polars       | ~1.36     | High-performance DataFrames    |
| NumPy        | ~2.3      | Numerical computing            |
| scikit-learn | ~1.8      | ML utilities, cross-validation |
| SHAP         | ~0.50     | Model explainability           |
| MLflow       | ~3.8      | Experiment tracking            |
| Plotly       | ~6.5      | Interactive visualizations     |
| PyArrow      | ~22.0     | Parquet file support           |
| boto3        | 1.35.36   | AWS SDK                        |
| s3fs         | 2024.10.0 | S3 filesystem access           |
| PyMongo      | ~4.15     | MongoDB access                 |
| borb         | ~2.1      | PDF generation                 |
| XlsxWriter   | ~3.2      | Excel file generation          |

Build Tools:

  • Poetry for dependency management
  • Pre-commit hooks for code quality
  • Pylint for linting
  • Pytest for testing (70% minimum coverage)
  • Sphinx for API documentation

Core Functionality

XGBoost Model Training (P2AModel)

The P2AModel class orchestrates the entire training pipeline:

# Training workflow
from prospectml.core.p2a import P2AModel
from prospectml.estimators.xgbestimator import XGBEstimator

model = P2AModel(
    variable_reader=collected_variables,  # a VariableReader over the select directory
    estimator=XGBEstimator()
)
model.fit(max_features=300)           # Initial training with train/test split
model.fit_score(drop_train=True)      # Retrain on full data for scoring
model.cv_fit()                        # 10-fold stratified cross-validation

Key Steps:

  1. Variable Collection - Read tab-separated or Parquet variable files
  2. Feature Selection - Drop sparse/irrelevant variables, apply buyer type filtering
  3. Initial Training - Train with train/test split, early stopping
  4. Feature Importance - Select top 300 features by importance (see the sketch after this list)
  5. Score Training - Retrain on full data with selected features
  6. Cross-Validation - 10-fold stratified CV for metrics
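
Step 4 can be pictured as the following minimal sketch, assuming a fitted sklearn-style classifier clf and a candidate-variable DataFrame X (both hypothetical names); the actual selection logic lives inside P2AModel.fit.

import numpy as np

importance = clf.feature_importances_         # one importance score per column
top_idx = np.argsort(importance)[::-1][:300]  # indices of the 300 strongest features
selected = X.columns[top_idx].tolist()        # feature names kept for score training
X_selected = X[selected]                      # matrix used to retrain on the full data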

XGBEstimator

Wraps XGBoost with P2R-specific defaults:

| Parameter        | Value           | Purpose                            |
|------------------|-----------------|------------------------------------|
| objective        | binary:logistic | Binary classification              |
| base_score       | 0.01            | Low base score for imbalanced data |
| learning_rate    | 0.02            | Slow learning                      |
| max_depth        | 4               | Shallow trees                      |
| n_estimators     | 2000            | Many boosting rounds               |
| colsample_bytree | 0.7             | Column sampling                    |
| reg_alpha        | 0.5             | L1 regularization                  |
| reg_lambda       | 2.5             | L2 regularization                  |
| eval_metric      | aucpr           | Area under PR curve                |
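
For orientation, the table corresponds roughly to the following construction of the underlying XGBoost classifier; this is a sketch of the defaults, not the XGBEstimator implementation itself.

import xgboost as xgb

# Reconstruction of the defaults listed above.
estimator = xgb.XGBClassifier(
    objective="binary:logistic",  # binary response: did the household respond?
    base_score=0.01,              # low prior probability for imbalanced data
    learning_rate=0.02,           # slow learning
    max_depth=4,                  # shallow trees
    n_estimators=2000,            # many boosting rounds
    colsample_bytree=0.7,         # column sampling per tree
    reg_alpha=0.5,                # L1 regularization
    reg_lambda=2.5,               # L2 regularization
    eval_metric="aucpr",          # area under the precision-recall curve
)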

Model Evaluation

The XGBEvaluator produces:

  • Responder graphs (gains curves)
  • Calibration plots
  • Variable importance plots
  • SHAP values for explainability (see the sketch after this list)
  • Precision-recall curves
  • Model metrics (AUC-PR, etc.)
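
The SHAP piece, for example, can be pictured as this minimal sketch, assuming a fitted XGBoost model model and a feature DataFrame X (hypothetical names); XGBEvaluator builds its plots on values like these.

import shap

explainer = shap.TreeExplainer(model)   # tree-path explainer for gradient boosting
shap_values = explainer.shap_values(X)  # one additive contribution per feature per row
shap.summary_plot(shap_values, X)       # global importance and direction overview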

Scoring

Score households using trained models:

model.score(score_reader)
scores_df = model.scores  # DataFrame with hhid, prob_respond

Supports chunked scoring for large datasets.
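
A minimal sketch of what chunked scoring amounts to, assuming tab-separated score files and an estimator with a predict_proba interface (the file name and hhid column here are illustrative; the real chunking sits behind model.score):

import pandas as pd

score_chunks = []
for chunk in pd.read_csv("score-000.txt.gz", sep="\t", chunksize=500_000):
    probs = estimator.predict_proba(chunk.drop(columns=["hhid"]))[:, 1]
    score_chunks.append(pd.DataFrame({"hhid": chunk["hhid"], "prob_respond": probs}))

scores_df = pd.concat(score_chunks, ignore_index=True)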

Response Analysis (auto_ra)

Response Analysis measures campaign performance by matching fulfillment files to transaction data.

Key Metrics

| Metric                      | Description                      |
|-----------------------------|----------------------------------|
| Response Rate (RR)          | % of mailed names that responded |
| Average Order Volume (AOV)  | Mean order amount                |
| Median Order Volume (MedOV) | Median order amount              |
| Dollars per Book            | Total demand / names mailed      |
| Demand                      | Total order amounts              |
| Index                       | Value / target value             |
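
In pandas terms, these definitions reduce to roughly the following, assuming a mailed fulfillment DataFrame mailed (one row per household) and a matched-transactions DataFrame matched with an amount column (hypothetical names):

n_mailed = len(mailed)                      # names mailed
responders = matched["hhid"].nunique()      # households with a matched order

rr = responders / n_mailed                  # Response Rate
demand = matched["amount"].sum()            # Demand
aov = matched["amount"].mean()              # Average Order Volume
medov = matched["amount"].median()          # Median Order Volume
dollars_per_book = demand / n_mailed        # Dollars per Book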

RA Workflow

Promotion Level:

  1. Combine fulfillment plan with raw fulfillment (get hhid + keycode + productCode)
  2. Create keycode_df and segment_df (expected counts)
  3. Match transactions (title, date window, amount > 0; sketched after this list)
  4. Calculate RA by keycode (group by keycode, productCode)
  5. Calculate RA by segment (group by modelKey, tier)
  6. Save model QC info from MongoDB
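
Step 3 can be pictured as the following minimal sketch, assuming a transactions DataFrame with title, orderDate, and amount columns (the column names and date window are illustrative; the real matching lives in ra_for_promotion.py):

# Keep only transactions for the right title, inside the response
# window, with a positive dollar amount, then join to mailed names.
candidates = transactions[
    (transactions["title"] == "teacollection")
    & (transactions["orderDate"].between("2020-09-01", "2021-03-01"))
    & (transactions["amount"] > 0)
]
matched = fulfillment.merge(candidates, on="hhid", how="inner")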

Campaign Level:

  1. Aggregate promotion-level results
  2. Create results by date, keycode, segment
  3. Generate Path2Ignite reports (if applicable)

Output Files

| File                        | Description                |
|-----------------------------|----------------------------|
| results_keycode.csv         | RA by keycode              |
| results_segments.csv        | RA by segment/tier         |
| matched_transactions.csv.gz | Matched transaction data   |
| amounts.csv                 | Dollar amount distribution |
| days.csv                    | Response curve data        |
| display_text.json           | Model QC and warnings      |

DMRP (Path2Performance)

RFM-based scoring for Path2Ignite campaigns:

/bin/bash /opt/data-science/prospectml/ordersapp/dmrp.sh <select_dir> <num_tiers>

Outputs:

  • fulfillment_input2 with tier assignments
  • seg_report with RFM statistics per tier
  • Client report (via dmrp_client_report.sh)
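
As a rough illustration of the RFM scoring itself, assuming per-household recency_days, frequency, and monetary columns (hypothetical names; dmrp_scoring.py implements the real logic, and its tier conventions may differ):

import pandas as pd

num_tiers = 5
# Score each dimension 1..num_tiers by quantile; recency is negated
# because fewer days since the last order is better.
rfm["r"] = pd.qcut(-rfm["recency_days"], num_tiers, labels=False, duplicates="drop") + 1
rfm["f"] = pd.qcut(rfm["frequency"].rank(method="first"), num_tiers, labels=False) + 1
rfm["m"] = pd.qcut(rfm["monetary"].rank(method="first"), num_tiers, labels=False) + 1
# Combine the three scores and cut the sum into the final tiers.
rfm["tier"] = pd.qcut(rfm[["r", "f", "m"]].sum(axis=1), num_tiers,
                      labels=False, duplicates="drop") + 1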

Command-Line Tools

Training Pipeline

| Script             | Purpose                                               |
|--------------------|-------------------------------------------------------|
| run_full_xgb_model | Full pipeline: preselect -> train -> score -> reports |
| xgb_training       | Direct training (local)                               |
| run_xgb_training   | AWS Batch training                                    |
| xgb_scoring        | Direct scoring (local)                                |
| run_xgb_scoring    | AWS Batch scoring                                     |
| xgb_reporting      | Direct reporting (local)                              |
| run_xgb_reporting  | AWS Batch reporting                                   |

Reports

| Script             | Purpose               |
|--------------------|-----------------------|
| model_grade_report | Model quality metrics |
| client_report      | Client-facing report  |
| model_comparison   | A/B model comparison  |

Future Projections

| Script                       | Purpose                             |
|------------------------------|-------------------------------------|
| xgb_future_projection        | Calculate future response estimates |
| xgb_future_projection_direct | Direct execution                    |
| run_xgb_future_projection    | AWS Batch execution                 |

Example: Full Model Run

run_full_xgb_model \
    --title teacollection \
    --variables-dir /path/to/select \
    --model-dir /path/to/model \
    --score-dir /path/to/select/score \
    --report-dir /path/to/report \
    --segment-size 25000 \
    --max-depth 100000 \
    --households-date 2020-08-14 \
    --run-env staging \
    --run-steps train-select train score-select score reports-select reports \
    tracking \
    --order ORDER-12345 \
    --model MODEL-7 \
    --server-type databricks

Run Steps:

  1. train-select - Run preselect for variables (coop-scala)
  2. train - XGBoost training (AWS Batch)
  3. score-select - Run preselect for scoring (coop-scala)
  4. score - XGBoost scoring (AWS Batch)
  5. reports-select - Prepare report data (coop-scala)
  6. reports - Generate reports (AWS Batch)

Integration with coop-scala

ds-modeling integrates with coop-scala Spark jobs for data preparation:

Data Flow

┌──────────────────┐
│  Order App       │ (Configuration, preselect.json)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  preselect.sc    │ (coop-scala: Generate variables)
│  (EMR Spark)     │
└────────┬─────────┘
         │ variables-*.txt.gz
         ▼
┌──────────────────┐
│  xgb_training    │ (ds-modeling: Train model)
│  (AWS Batch)     │
└────────┬─────────┘
         │ model.pkl, features.txt
         ▼
┌──────────────────┐
│  preselect.sc    │ (coop-scala: Score population)
│  --score         │
└────────┬─────────┘
         │ score-*.txt.gz
         ▼
┌──────────────────┐
│  xgb_scoring     │ (ds-modeling: Generate scores)
│  (AWS Batch)     │
└────────┬─────────┘
         │ fulfillment_input2.tsv
         ▼
┌──────────────────┐
│  xgb_reporting   │ (ds-modeling: Generate reports)
│  (AWS Batch)     │
└──────────────────┘

Key Files

| File                   | Source      | Consumer                |
|------------------------|-------------|-------------------------|
| preselect.json         | Order App   | coop-scala, ds-modeling |
| io_config.json         | coop-scala  | ds-modeling             |
| variables-*.txt.gz     | coop-scala  | ds-modeling             |
| aggregateCounts.txt.gz | coop-scala  | ds-modeling             |
| model.pkl              | ds-modeling | ds-modeling             |
| features.txt           | ds-modeling | coop-scala              |
| fulfillment_input2.tsv | ds-modeling | Order App               |

AWS Infrastructure

AWS Batch

Job Definitions:

| Name      | Environment       | Purpose                |
|-----------|-------------------|------------------------|
| P2A3_prod | Production        | Production model runs  |
| P2A3_dev  | Development       | Staging/dev model runs |
| P2A3_rc   | Release Candidate | RC testing             |

Docker Image: 448838825215.dkr.ecr.us-east-1.amazonaws.com/prospectml_poetry:<tag>

Resources:

  • vCPUs: 94
  • Memory: 370,000 MB
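
Submitting against these definitions with boto3 looks roughly like the sketch below; the job name, queue name, and command are assumptions (bin/submit_batch wraps the real call):

import boto3

batch = boto3.client("batch", region_name="us-east-1")
response = batch.submit_job(
    jobName="xgb-training-example",   # hypothetical job name
    jobQueue="P2A3_dev",              # assumes a queue named after the job definition
    jobDefinition="P2A3_dev",
    containerOverrides={"command": ["xgb_training", "--help"]},
)
print(response["jobId"])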

Environment Variables

| Variable             | Purpose                               |
|----------------------|---------------------------------------|
| DATABRICKS_ACCOUNT   | Databricks account name               |
| DATABRICKS_AUTHORITY | Databricks URL                        |
| DATABRICKS_ENV       | Environment (dev/prod)                |
| DATABRICKS_TOKEN     | Authentication (from Secrets Manager) |
| MLFLOW_ROOT_PATH     | MLflow tracking location              |
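
A minimal sketch of consuming these variables at startup (the defaults and failure behavior here are illustrative, not the library's actual configuration handling):

import os

databricks_env = os.environ.get("DATABRICKS_ENV", "dev")   # dev unless overridden
databricks_url = os.environ["DATABRICKS_AUTHORITY"]        # fail fast if unset
databricks_token = os.environ["DATABRICKS_TOKEN"]          # injected from Secrets Manager
mlflow_root = os.environ.get("MLFLOW_ROOT_PATH")           # optional tracking location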

Development

Local Setup

# Install dependencies
./build_and_install.sh

# Activate virtual environment
source ./venv/bin/activate

# Run pre-commit
pre-commit run --all-files

Testing

# Unit tests
pytest -rs --cov=prospectml --cov-config=.coveragerc prospectml/tests

# Integration tests (requires AWS credentials)
pytest -rs prospectml/tests/integration_tests/integration_test_batch.py --tag staging

Docker Development

# Run unit tests in Docker
docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
    448838825215.dkr.ecr.us-east-1.amazonaws.com/prospectml_poetry:staging \
    "pytest -rs --cov=prospectml prospectml/tests"

# Run Jupyter Lab
docker run -p 8889:8889 \
    448838825215.dkr.ecr.us-east-1.amazonaws.com/prospectml_poetry:staging \
    "jupyter lab --no-browser --ip=0.0.0.0 --port=8889 --allow-root"

Deployment

# Merge to staging branch triggers:
# 1. Docker image build -> ECR (tagged :staging)
# 2. AWS Batch job definitions created
# 3. Integration tests run

# Tag main for production release:
git tag 225.0.0
git push origin 225.0.0

# Deploy to servers
./magical_deployment.ts

Dependency Maintenance

# Update dependencies (scheduled maintenance only)
./update_requirements.sh

# Check outdated packages
poetry show -o

# Export to requirements.txt
poetry export --without-hashes -o requirements.txt

Related Documentation

| Document              | Location                                                 | Description                  |
|-----------------------|----------------------------------------------------------|------------------------------|
| README.md             | /ds-modeling/README.md                                   | Main project documentation   |
| TRANSITION.md         | /ds-modeling/TRANSITION.md                               | Dependency maintenance notes |
| auto_ra README        | /prospectml/auto_ra/README.md                            | Response Analysis docs       |
| Future Projections    | /prospectml/future_projections/xgb_future_projections.md | FP documentation             |
| AES Utils             | /prospectml/utils/aes_utils/README.md                    | Brand aesthetics             |
| coop-scala            | product-management/docs/coop-scala-overview.md           | Spark jobs integration       |
| Path2Acquisition Flow | product-management/docs/path2acquisition-flow.md         | Business process             |

Team Ownership

| Role                   | Person                               |
|------------------------|--------------------------------------|
| Director, Data Science | Igor Oliynyk                         |
| Data Scientists        | Morgan Ford, Erica Yang, Paul Martin |

Source: README.md, TRANSITION.md, prospectml/, bin/, deployment/, pyproject.toml, bitbucket-pipelines.yml

Documentation created: 2026-01-24