
ds-modeling (prospectml) Overview

Python machine learning library for prospect acquisition modeling, response analysis, and client reporting. The core Data Science engine powering Path2Acquisition.

Purpose

ds-modeling (package name: prospectml) is the Data Science team’s machine learning library for:

  • XGBoost Propensity Models - Train and score gradient boosting models to predict household response probability
  • Response Analysis - Measure campaign performance by matching fulfillment files to transaction data
  • Client Reporting - Generate model grade reports, client reports, and future projections
  • DMRP (Path2Performance) - RFM-based scoring and tiering for Path2Ignite campaigns

The library runs both locally for development and on AWS Batch/EMR for production workloads.

Architecture

Repository Structure

ds-modeling/                          # Root - Version 335.0.0+SNAPSHOT
├── prospectml/                       # Main Python package
│   ├── core/                         # Core model classes
│   │   ├── p2a.py                    # P2AModel - main orchestrator
│   │   ├── base_class.py             # P2RBaseClass
│   │   └── fulfillment.py            # Fulfillment processing
│   │
│   ├── estimators/                   # ML model wrappers
│   │   ├── baseclass.py              # P2REstimator base
│   │   └── xgbestimator.py           # XGBEstimator (XGBoost wrapper)
│   │
│   ├── evaluators/                   # Model evaluation
│   │   ├── modelevaluator.py         # Base evaluation logic
│   │   └── xgbevaluator.py           # XGBoost-specific evaluation
│   │
│   ├── readers/                      # Data I/O
│   │   ├── baseclass.py              # VariableReader, ScoreReader bases
│   │   ├── textreader.py             # Tab-separated variable files
│   │   ├── parquetreader.py          # Parquet format support
│   │   ├── samplereader.py           # Sampling utilities
│   │   ├── fileio.py                 # File I/O helpers
│   │   └── io_config.py              # Configuration file handling
│   │
│   ├── auto_ra/                      # Automated Response Analysis
│   │   ├── ra_for_promotion.py       # Promotion-level RA
│   │   ├── ra_for_campaign.py        # Campaign-level RA
│   │   ├── combine_all_ra.py         # RA aggregation
│   │   ├── get_order_curves.py       # Order curve generation
│   │   └── get_order_info.py         # Order information extraction
│   │
│   ├── dmrp/                         # Path2Performance (DMRP)
│   │   ├── dmrp_scoring.py           # RFM scoring and tiering
│   │   └── dmrp_client_report.py     # DMRP client reports
│   │
│   ├── command_line_reports/         # Report generation
│   │   ├── model_grade_report.py     # Model quality assessment
│   │   ├── client_report.py          # Client-facing reports
│   │   ├── new_client_report.py      # Updated client report format
│   │   ├── model_comparison.py       # A/B model comparison
│   │   └── future_projections_*.json # Mock data for testing
│   │
│   ├── future_projections/           # Future projection calculations
│   │   └── xgb_future_projections.md # Documentation
│   │
│   ├── mcmc/                         # MCMC/Bayesian methods
│   │   ├── metropolis_hastings.py    # MCMC sampling
│   │   ├── mcmc_metrics.py           # Metric calculations
│   │   ├── mcmc_density_report.py    # Density reporting
│   │   └── mcmc_file_handlers.py     # File handling for MCMC
│   │
│   ├── reporting/                    # Report utilities
│   ├── reporting_encoded/            # Encoded report templates
│   ├── parsers/                      # Data parsing utilities
│   ├── utils/                        # General utilities
│   │   └── aes_utils/                # Brand aesthetics (fonts, colors)
│   ├── mocking/                      # Test mocks
│   ├── p2r_exceptions/               # Custom exceptions
│   ├── p2r_str_enums/                # String enumerations
│   ├── experimental/                 # Experimental features
│   └── tests/                        # Test suite
│       └── integration_tests/        # AWS Batch integration tests
│
├── bin/                              # Command-line entry points
│   ├── xgb_training                  # XGBoost training
│   ├── xgb_scoring                   # XGBoost scoring
│   ├── xgb_reporting                 # Report generation
│   ├── xgb_future_projection         # Future projections
│   ├── run_full_xgb_model            # Full pipeline orchestrator
│   ├── run_xgb_training              # AWS Batch training
│   ├── run_xgb_scoring               # AWS Batch scoring
│   ├── run_xgb_reporting             # AWS Batch reporting
│   ├── run_dmrp_scoring              # DMRP scoring
│   ├── run_dmrp_client_report        # DMRP reports
│   ├── model_grade_report            # Model grade generation
│   ├── client_report                 # Client report generation
│   ├── interleave_fulfillments       # Fulfillment interleaving
│   └── submit_batch                  # AWS Batch job submission
│
├── deployment/                       # Deployment utilities
│   ├── deployment.sh                 # Deploy to dev/prod servers
│   ├── make_aws_resources.py         # Create AWS Batch resources
│   ├── test_deployment.py            # Deployment verification
│   └── ansible_deploy.py             # Ansible integration
│
├── docker/                           # Docker configurations
│   ├── prospectml_poetry/            # Main production image
│   ├── jupyter_notebook/             # Jupyter Lab image
│   └── mlflow_server/                # MLflow UI image
│
├── deprecated_code/                  # Legacy code (MLflow tracking)
├── ds_service/                       # Deprecated API service
├── docs/                             # Sphinx documentation
│
├── pyproject.toml                    # Poetry dependencies
├── poetry.lock                       # Locked dependencies
├── setup.py                          # Legacy setup
├── bitbucket-pipelines.yml           # CI/CD pipeline
├── magical_deployment.ts             # Deployment script (TypeScript)
├── build_and_install.sh              # Local build script
└── README.md                         # Main documentation

Technology Stack

| Component    | Version   | Notes                          |
|--------------|-----------|--------------------------------|
| Python       | 3.12.8    | Locked version                 |
| XGBoost      | ~2.1      | Core ML framework              |
| pandas       | ~2.3      | Data manipulation              |
| polars       | ~1.36     | High-performance DataFrames    |
| NumPy        | ~2.3      | Numerical computing            |
| scikit-learn | ~1.8      | ML utilities, cross-validation |
| SHAP         | ~0.50     | Model explainability           |
| MLflow       | ~3.8      | Experiment tracking            |
| Plotly       | ~6.5      | Interactive visualizations     |
| PyArrow      | ~22.0     | Parquet file support           |
| boto3        | 1.35.36   | AWS SDK                        |
| s3fs         | 2024.10.0 | S3 filesystem access           |
| PyMongo      | ~4.15     | MongoDB access                 |
| borb         | ~2.1      | PDF generation                 |
| XlsxWriter   | ~3.2      | Excel file generation          |

Build Tools:

  • Poetry for dependency management
  • Pre-commit hooks for code quality
  • Pylint for linting
  • Pytest for testing (70% minimum coverage)
  • Sphinx for API documentation

Core Functionality

XGBoost Model Training (P2AModel)

The P2AModel class orchestrates the entire training pipeline:

# Training workflow
from prospectml.core.p2a import P2AModel
from prospectml.estimators.xgbestimator import XGBEstimator

model = P2AModel(
    variable_reader=collected_variables,  # a VariableReader over the select directory
    estimator=XGBEstimator()
)
model.fit(max_features=300)           # Initial training with train/test split
model.fit_score(drop_train=True)      # Retrain on full data for scoring
model.cv_fit()                        # 10-fold stratified cross-validation

Key Steps:

  1. Variable Collection - Read tab-separated or Parquet variable files
  2. Feature Selection - Drop sparse/irrelevant variables, apply buyer type filtering
  3. Initial Training - Train with train/test split, early stopping
  4. Feature Importance - Select top 300 features by importance (see the sketch after this list)
  5. Score Training - Retrain on full data with selected features
  6. Cross-Validation - 10-fold stratified CV for metrics
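
Step 4 can be pictured as the following minimal sketch, assuming a fitted sklearn-style classifier clf and a candidate-variable DataFrame X (both hypothetical names); the actual selection logic lives inside P2AModel.fit.

import numpy as np

importance = clf.feature_importances_         # one importance score per column
top_idx = np.argsort(importance)[::-1][:300]  # indices of the 300 strongest features
selected = X.columns[top_idx].tolist()        # feature names kept for score training
X_selected = X[selected]                      # matrix used to retrain on the full data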

XGBEstimator

Wraps XGBoost with P2R-specific defaults:

| Parameter        | Value           | Purpose                            |
|------------------|-----------------|------------------------------------|
| objective        | binary:logistic | Binary classification              |
| base_score       | 0.01            | Low base score for imbalanced data |
| learning_rate    | 0.02            | Slow learning                      |
| max_depth        | 4               | Shallow trees                      |
| n_estimators     | 2000            | Many boosting rounds               |
| colsample_bytree | 0.7             | Column sampling                    |
| reg_alpha        | 0.5             | L1 regularization                  |
| reg_lambda       | 2.5             | L2 regularization                  |
| eval_metric      | aucpr           | Area under PR curve                |
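
For orientation, the table corresponds roughly to the following construction of the underlying XGBoost classifier; this is a sketch of the defaults, not the XGBEstimator implementation itself.

import xgboost as xgb

# Reconstruction of the defaults listed above.
estimator = xgb.XGBClassifier(
    objective="binary:logistic",  # binary response: did the household respond?
    base_score=0.01,              # low prior probability for imbalanced data
    learning_rate=0.02,           # slow learning
    max_depth=4,                  # shallow trees
    n_estimators=2000,            # many boosting rounds
    colsample_bytree=0.7,         # column sampling per tree
    reg_alpha=0.5,                # L1 regularization
    reg_lambda=2.5,               # L2 regularization
    eval_metric="aucpr",          # area under the precision-recall curve
)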

Model Evaluation

The XGBEvaluator produces:

  • Responder graphs (gains curves)
  • Calibration plots
  • Variable importance plots
  • SHAP values for explainability (see the sketch after this list)
  • Precision-recall curves
  • Model metrics (AUC-PR, etc.)
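
The SHAP piece, for example, can be pictured as this minimal sketch, assuming a fitted XGBoost model model and a feature DataFrame X (hypothetical names); XGBEvaluator builds its plots on values like these.

import shap

explainer = shap.TreeExplainer(model)   # tree-path explainer for gradient boosting
shap_values = explainer.shap_values(X)  # one additive contribution per feature per row
shap.summary_plot(shap_values, X)       # global importance and direction overview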

Scoring

Score households using trained models:

model.score(score_reader)
scores_df = model.scores  # DataFrame with hhid, prob_respond

Supports chunked scoring for large datasets.
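
A minimal sketch of what chunked scoring amounts to, assuming tab-separated score files and an estimator with a predict_proba interface (the file name and hhid column here are illustrative; the real chunking sits behind model.score):

import pandas as pd

score_chunks = []
for chunk in pd.read_csv("score-000.txt.gz", sep="\t", chunksize=500_000):
    probs = estimator.predict_proba(chunk.drop(columns=["hhid"]))[:, 1]
    score_chunks.append(pd.DataFrame({"hhid": chunk["hhid"], "prob_respond": probs}))

scores_df = pd.concat(score_chunks, ignore_index=True)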

Response Analysis (auto_ra)

Response Analysis measures campaign performance by matching fulfillment files to transaction data.

Key Metrics

| Metric                      | Description                      |
|-----------------------------|----------------------------------|
| Response Rate (RR)          | % of mailed names that responded |
| Average Order Volume (AOV)  | Mean order amount                |
| Median Order Volume (MedOV) | Median order amount              |
| Dollars per Book            | Total demand / names mailed      |
| Demand                      | Total order amounts              |
| Index                       | Value / target value             |
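
In pandas terms, these definitions reduce to roughly the following, assuming a mailed fulfillment DataFrame mailed (one row per household) and a matched-transactions DataFrame matched with an amount column (hypothetical names):

n_mailed = len(mailed)                      # names mailed
responders = matched["hhid"].nunique()      # households with a matched order

rr = responders / n_mailed                  # Response Rate
demand = matched["amount"].sum()            # Demand
aov = matched["amount"].mean()              # Average Order Volume
medov = matched["amount"].median()          # Median Order Volume
dollars_per_book = demand / n_mailed        # Dollars per Book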

RA Workflow

Promotion Level:

  1. Combine fulfillment plan with raw fulfillment (get hhid + keycode + productCode)
  2. Create keycode_df and segment_df (expected counts)
  3. Match transactions (title, date window, amount > 0; sketched after this list)
  4. Calculate RA by keycode (group by keycode, productCode)
  5. Calculate RA by segment (group by modelKey, tier)
  6. Save model QC info from MongoDB
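
Step 3 can be pictured as the following minimal sketch, assuming a transactions DataFrame with title, orderDate, and amount columns (the column names and date window are illustrative; the real matching lives in ra_for_promotion.py):

# Keep only transactions for the right title, inside the response
# window, with a positive dollar amount, then join to mailed names.
candidates = transactions[
    (transactions["title"] == "teacollection")
    & (transactions["orderDate"].between("2020-09-01", "2021-03-01"))
    & (transactions["amount"] > 0)
]
matched = fulfillment.merge(candidates, on="hhid", how="inner")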

Campaign Level:

  1. Aggregate promotion-level results
  2. Create results by date, keycode, segment
  3. Generate Path2Ignite reports (if applicable)

Output Files

| File                        | Description                |
|-----------------------------|----------------------------|
| results_keycode.csv         | RA by keycode              |
| results_segments.csv        | RA by segment/tier         |
| matched_transactions.csv.gz | Matched transaction data   |
| amounts.csv                 | Dollar amount distribution |
| days.csv                    | Response curve data        |
| display_text.json           | Model QC and warnings      |

DMRP (Path2Performance)

RFM-based scoring for Path2Ignite campaigns:

/bin/bash /opt/data-science/prospectml/ordersapp/dmrp.sh <select_dir> <num_tiers>

Outputs:

  • fulfillment_input2 with tier assignments
  • seg_report with RFM statistics per tier
  • Client report (via dmrp_client_report.sh)
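
As a rough illustration of the RFM scoring itself, assuming per-household recency_days, frequency, and monetary columns (hypothetical names; dmrp_scoring.py implements the real logic, and its tier conventions may differ):

import pandas as pd

num_tiers = 5
# Score each dimension 1..num_tiers by quantile; recency is negated
# because fewer days since the last order is better.
rfm["r"] = pd.qcut(-rfm["recency_days"], num_tiers, labels=False, duplicates="drop") + 1
rfm["f"] = pd.qcut(rfm["frequency"].rank(method="first"), num_tiers, labels=False) + 1
rfm["m"] = pd.qcut(rfm["monetary"].rank(method="first"), num_tiers, labels=False) + 1
# Combine the three scores and cut the sum into the final tiers.
rfm["tier"] = pd.qcut(rfm[["r", "f", "m"]].sum(axis=1), num_tiers,
                      labels=False, duplicates="drop") + 1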

Command-Line Tools

Training Pipeline

| Script             | Purpose                                               |
|--------------------|-------------------------------------------------------|
| run_full_xgb_model | Full pipeline: preselect -> train -> score -> reports |
| xgb_training       | Direct training (local)                               |
| run_xgb_training   | AWS Batch training                                    |
| xgb_scoring        | Direct scoring (local)                                |
| run_xgb_scoring    | AWS Batch scoring                                     |
| xgb_reporting      | Direct reporting (local)                              |
| run_xgb_reporting  | AWS Batch reporting                                   |

Reports

| Script             | Purpose               |
|--------------------|-----------------------|
| model_grade_report | Model quality metrics |
| client_report      | Client-facing report  |
| model_comparison   | A/B model comparison  |

Future Projections

| Script                       | Purpose                             |
|------------------------------|-------------------------------------|
| xgb_future_projection        | Calculate future response estimates |
| xgb_future_projection_direct | Direct execution                    |
| run_xgb_future_projection    | AWS Batch execution                 |

Example: Full Model Run

run_full_xgb_model \
    --title teacollection \
    --variables-dir /path/to/select \
    --model-dir /path/to/model \
    --score-dir /path/to/select/score \
    --report-dir /path/to/report \
    --segment-size 25000 \
    --max-depth 100000 \
    --households-date 2020-08-14 \
    --run-env staging \
    --run-steps train-select train score-select score reports-select reports \
    tracking \
    --order ORDER-12345 \
    --model MODEL-7 \
    --server-type databricks

Run Steps:

  1. train-select - Run preselect for variables (coop-scala)
  2. train - XGBoost training (AWS Batch)
  3. score-select - Run preselect for scoring (coop-scala)
  4. score - XGBoost scoring (AWS Batch)
  5. reports-select - Prepare report data (coop-scala)
  6. reports - Generate reports (AWS Batch)

Integration with coop-scala

ds-modeling integrates with coop-scala Spark jobs for data preparation:

Data Flow

┌──────────────────┐
│  Order App       │ (Configuration, preselect.json)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  preselect.sc    │ (coop-scala: Generate variables)
│  (EMR Spark)     │
└────────┬─────────┘
         │ variables-*.txt.gz
         ▼
┌──────────────────┐
│  xgb_training    │ (ds-modeling: Train model)
│  (AWS Batch)     │
└────────┬─────────┘
         │ model.pkl, features.txt
         ▼
┌──────────────────┐
│  preselect.sc    │ (coop-scala: Score population)
│  --score         │
└────────┬─────────┘
         │ score-*.txt.gz
         ▼
┌──────────────────┐
│  xgb_scoring     │ (ds-modeling: Generate scores)
│  (AWS Batch)     │
└────────┬─────────┘
         │ fulfillment_input2.tsv
         ▼
┌──────────────────┐
│  xgb_reporting   │ (ds-modeling: Generate reports)
│  (AWS Batch)     │
└──────────────────┘

Key Files

| File                   | Source      | Consumer                |
|------------------------|-------------|-------------------------|
| preselect.json         | Order App   | coop-scala, ds-modeling |
| io_config.json         | coop-scala  | ds-modeling             |
| variables-*.txt.gz     | coop-scala  | ds-modeling             |
| aggregateCounts.txt.gz | coop-scala  | ds-modeling             |
| model.pkl              | ds-modeling | ds-modeling             |
| features.txt           | ds-modeling | coop-scala              |
| fulfillment_input2.tsv | ds-modeling | Order App               |

AWS Infrastructure

AWS Batch

Job Definitions:

| Name      | Environment       | Purpose                |
|-----------|-------------------|------------------------|
| P2A3_prod | Production        | Production model runs  |
| P2A3_dev  | Development       | Staging/dev model runs |
| P2A3_rc   | Release Candidate | RC testing             |

Docker Image: 448838825215.dkr.ecr.us-east-1.amazonaws.com/prospectml_poetry:<tag>

Resources:

  • vCPUs: 94
  • Memory: 370,000 MB
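
Submitting against these definitions with boto3 looks roughly like the sketch below; the job name, queue name, and command are assumptions (bin/submit_batch wraps the real call):

import boto3

batch = boto3.client("batch", region_name="us-east-1")
response = batch.submit_job(
    jobName="xgb-training-example",   # hypothetical job name
    jobQueue="P2A3_dev",              # assumes a queue named after the job definition
    jobDefinition="P2A3_dev",
    containerOverrides={"command": ["xgb_training", "--help"]},
)
print(response["jobId"])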

Environment Variables

| Variable             | Purpose                               |
|----------------------|---------------------------------------|
| DATABRICKS_ACCOUNT   | Databricks account name               |
| DATABRICKS_AUTHORITY | Databricks URL                        |
| DATABRICKS_ENV       | Environment (dev/prod)                |
| DATABRICKS_TOKEN     | Authentication (from Secrets Manager) |
| MLFLOW_ROOT_PATH     | MLflow tracking location              |
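
A minimal sketch of consuming these variables at startup (the defaults and failure behavior here are illustrative, not the library's actual configuration handling):

import os

databricks_env = os.environ.get("DATABRICKS_ENV", "dev")   # dev unless overridden
databricks_url = os.environ["DATABRICKS_AUTHORITY"]        # fail fast if unset
databricks_token = os.environ["DATABRICKS_TOKEN"]          # injected from Secrets Manager
mlflow_root = os.environ.get("MLFLOW_ROOT_PATH")           # optional tracking location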

Development

Local Setup

# Install dependencies
./build_and_install.sh

# Activate virtual environment
source ./venv/bin/activate

# Run pre-commit
pre-commit run --all-files

Testing

# Unit tests
pytest -rs --cov=prospectml --cov-config=.coveragerc prospectml/tests

# Integration tests (requires AWS credentials)
pytest -rs prospectml/tests/integration_tests/integration_test_batch.py --tag staging

Docker Development

# Run unit tests in Docker
docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
    448838825215.dkr.ecr.us-east-1.amazonaws.com/prospectml_poetry:staging \
    "pytest -rs --cov=prospectml prospectml/tests"

# Run Jupyter Lab
docker run -p 8889:8889 \
    448838825215.dkr.ecr.us-east-1.amazonaws.com/prospectml_poetry:staging \
    "jupyter lab --no-browser --ip=0.0.0.0 --port=8889 --allow-root"

Deployment

# Merge to staging branch triggers:
# 1. Docker image build -> ECR (tagged :staging)
# 2. AWS Batch job definitions created
# 3. Integration tests run

# Tag main for production release:
git tag 225.0.0
git push origin 225.0.0

# Deploy to servers
./magical_deployment.ts

Dependency Maintenance

# Update dependencies (scheduled maintenance only)
./update_requirements.sh

# Check outdated packages
poetry show -o

# Export to requirements.txt
poetry export --without-hashes -o requirements.txt

Related Documentation

| Document              | Location                                                 | Description                  |
|-----------------------|----------------------------------------------------------|------------------------------|
| README.md             | /ds-modeling/README.md                                   | Main project documentation   |
| TRANSITION.md         | /ds-modeling/TRANSITION.md                               | Dependency maintenance notes |
| auto_ra README        | /prospectml/auto_ra/README.md                            | Response Analysis docs       |
| Future Projections    | /prospectml/future_projections/xgb_future_projections.md | FP documentation             |
| AES Utils             | /prospectml/utils/aes_utils/README.md                    | Brand aesthetics             |
| coop-scala            | product-management/docs/coop-scala-overview.md           | Spark jobs integration       |
| Path2Acquisition Flow | product-management/docs/path2acquisition-flow.md         | Business process             |

Team Ownership

| Role                   | Person                               |
|------------------------|--------------------------------------|
| Director, Data Science | Igor Oliynyk                         |
| Data Scientists        | Morgan Ford, Erica Yang, Paul Martin |

Source: README.md, TRANSITION.md, prospectml/, bin/, deployment/, pyproject.toml, bitbucket-pipelines.yml

Documentation created: 2026-01-24