CDK Backend Overview
AWS CDK infrastructure for Path2Response’s Step Functions workflows and batch processing systems.
Purpose
The cdk-backend repository defines and deploys AWS infrastructure for Path2Response’s data processing pipelines. It manages:
- Step Functions Workflows - Orchestrated multi-step data processing pipelines
- AWS Batch Compute - Scalable compute resources for heavy processing jobs
- Lambda Functions - Lightweight serverless functions for order processing
- Docker Images - Container definitions for batch and Lambda workloads
- EFS Integration - Shared file system access for order processing data
This infrastructure supports the core Path2Acquisition product by enabling automated audience creation, model training, and data file generation workflows.
Architecture
Directory Structure
```
cdk-backend/
├── cdk-stepfunctions/            # CDK stack definitions for Step Functions
│   ├── bin/main.ts               # Main CDK app entry point
│   └── lib/
│       ├── sfn-stacks/           # Step Function workflow stacks (17 workflows)
│       └── util/                 # Shared utilities for CDK constructs
├── step-scripts/                 # Deno/TypeScript step implementations
│   └── src/bin.step/             # Step Function step scripts
├── projects/                     # Standalone utility projects
│   ├── athanor/                  # Workflow runner for licensed files
│   ├── response-analysis/        # Automated response analysis (RA)
│   ├── digital-audience/         # Digital audience processing
│   ├── sumcats/                  # Category summaries processing
│   └── experiments/              # Experimental features
├── cicd/                         # Build and deployment tools
│   ├── backend                   # Deployment CLI
│   └── it                        # Integration testing CLI
├── docker/                       # Docker image definitions
│   ├── backend-batch-orders/     # EMR-compatible batch processing
│   └── backend-lambda-orders/    # Lambda function container
├── commons/                      # Shared code libraries
│   ├── deno/                     # Deno-compatible utilities
│   ├── node/                     # Node.js utilities
│   └── any/                      # Platform-agnostic code
├── infrastructure/               # VPC and network infrastructure CDK
│   ├── cdk-vpc4emr/              # VPC for EMR clusters
│   ├── cdk-vpc4general/          # General-purpose VPC
│   └── cdk-vpc4melissa/          # MelissaData integration VPC
└── book/                         # Test report output (mdBook)
```
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Infrastructure | AWS CDK (TypeScript) | Define and deploy AWS resources |
| Workflows | AWS Step Functions | Orchestrate multi-step processing |
| Compute | AWS Batch | Scalable container-based processing |
| Serverless | AWS Lambda | Order processing functions |
| Runtime | Deno 2.0+ | Step script execution |
| Storage | AWS EFS | Shared file system for data |
| Data | AWS S3 | Source data and artifacts |
| Containers | Docker | Batch job and Lambda packaging |
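For orientation, here is a minimal sketch of what a CDK app entry point like `cdk-stepfunctions/bin/main.ts` might contain. The import path and wiring are illustrative assumptions, not the repository's actual code; only `SfnTemplateStack` is a real stack name (see the workflow tables below).

```typescript
// Illustrative sketch only; bin/main.ts is not reproduced here, and the
// import path below is an assumption.
import { App } from "aws-cdk-lib";
import { SfnTemplateStack } from "../lib/sfn-stacks/sfn-template-stack";

const app = new App();

// Each workflow gets its own stack; SfnTemplateStack is the
// template/example workflow listed under Operational Workflows.
new SfnTemplateStack(app, "SfnTemplateStack");

app.synth();
```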
Compute Instance Tiers
AWS Batch job queues are provisioned in tiers, each sized for a different class of workload (a CDK sketch of one tier follows the table):
| Size | CPUs | Memory | Instance Types | Use Case |
|---|---|---|---|---|
| XS | 4 | 14 GB | m7a.xlarge | Quick initialization steps |
| S | 8 | 28 GB | m7a.2xlarge | Standard processing |
| M | 16 | 60 GB | r7a.2xlarge, m6a.4xlarge | Model training prep |
| L | 16 | 120 GB | r7a.4xlarge, r6a.4xlarge | XGBoost training |
| XXL | 192 | 1.47 TB | r7a.48xlarge, r6a.48xlarge | Large-scale scoring |
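As a rough illustration, a tier like XS could be declared with the `aws-cdk-lib/aws-batch` L2 constructs as sketched below. The construct names are real CDK APIs, but the IDs, VPC wiring, and whether the repository actually builds its queues this way are assumptions.

```typescript
// Sketch: the XS tier (4 vCPU / 14 GB on m7a.xlarge) as a managed compute
// environment plus a job queue. IDs and VPC wiring are assumed.
import * as batch from "aws-cdk-lib/aws-batch";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import { Stack } from "aws-cdk-lib";

declare const stack: Stack;
declare const vpc: ec2.IVpc; // assumed: one of the VPCs from infrastructure/

const xsEnv = new batch.ManagedEc2EcsComputeEnvironment(stack, "XsComputeEnv", {
  vpc,
  instanceTypes: [new ec2.InstanceType("m7a.xlarge")], // XS tier instance type
});

new batch.JobQueue(stack, "XsJobQueue", {
  computeEnvironments: [{ computeEnvironment: xsEnv, order: 1 }],
});
```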
Step Functions Workflows
The system deploys 17 Step Function state machines for different processing workflows:
Core P2A3 Workflows
| Stack | State Machine | Purpose |
|---|---|---|
| SfnP2a3xgb2Stack | P2A3XGB | Standard XGBoost model training and scoring (8 steps) |
| SfnP2a3xgb2FutureStack | P2A3XGB-MCMC | MCMC-optimized training data selection (10 steps) |
| SfnP2a3xgbCountsOnlyStack | P2A3-CountsOnly | Quick count generation without full scoring |
Licensed Files Workflows
| Stack | Purpose |
|---|---|
| SfnSummaryFileStack | Customer profile summaries |
| SfnSummaryByStateStack | State-filtered summary processing |
| SfnLinkageFileStack | Identity matching files |
| SfnLinkageByStateStack | State-filtered linkage processing |
| SfnSumCatsStack | Category-based audience summaries |
| SfnLicensedFilesStack | Consolidated licensed file processing |
Operational Workflows
| Stack | Purpose |
|---|---|
| SfnFulfillmentStack | Order fulfillment processing |
| SfnFulfillmentInputAnalysisStack | Pre-fulfillment validation |
| SfnHotlineSiteVisitorProspectsStack | Hotline site visitor prospect scoring |
| SfnDTCNonContributingStack | DTC non-contributing member processing |
| SfnDigitalAudienceStack | Digital audience file generation |
| SfnBrowseCountsWeeklyStack | Weekly browse count aggregation |
| SfnBrowseTransactionsStack | Browse transaction processing |
| SfnTemplateStack | Template/example workflow |
P2A3XGB Workflow Steps
The primary P2A3XGB workflow consists of 8 steps, each annotated with its compute tier (a CDK chaining sketch follows the list):

1. Initialize (S) - Set up working directories and validate inputs
2. TrainSelect (S) - Select training data based on model parameters
3. Train (L) - Execute XGBoost model training
4. ScoreSelect (M) - Prepare the scoring dataset
5. Score (XXL) - Score all households against the trained model
6. ReportsSelect (S) - Prepare reporting data
7. Reports (XXL) - Generate model performance reports
8. Finalization (S) - Clean up and stage outputs
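These steps map naturally onto Step Functions task states that submit AWS Batch jobs to the tiered queues described above. A hedged sketch in CDK: the queue ARNs, job definition, and `step` helper are hypothetical; only the step names and tiers come from the list.

```typescript
// Sketch: chaining the 8 P2A3XGB steps as Batch submissions. Queue ARNs,
// the job definition, and this helper are assumptions.
import { Construct } from "constructs";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";

type Tier = "S" | "M" | "L" | "XXL";

declare const scope: Construct;
declare const queueArns: Record<Tier, string>; // tier -> job queue ARN (assumed)
declare const jobDefinitionArn: string;        // Batch job definition (assumed)

// One task state per step, routed to the queue for its compute tier.
const step = (name: string, tier: Tier) =>
  new tasks.BatchSubmitJob(scope, name, {
    jobName: name,
    jobDefinitionArn,
    jobQueueArn: queueArns[tier],
  });

const definition = step("Initialize", "S")
  .next(step("TrainSelect", "S"))
  .next(step("Train", "L"))
  .next(step("ScoreSelect", "M"))
  .next(step("Score", "XXL"))
  .next(step("ReportsSelect", "S"))
  .next(step("Reports", "XXL"))
  .next(step("Finalization", "S"));

new sfn.StateMachine(scope, "P2A3XGB", {
  definitionBody: sfn.DefinitionBody.fromChainable(definition),
});
```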
Projects
Athanor
Location: projects/athanor/
A workflow runner for multi-step data processing with resume capability. Named after the alchemical furnace that transforms base materials into valuable outputs.
Key Features:
- Multi-step workflows with automatic resume from failure (sketched below)
- File-based locking (SHARED/EXCLUSIVE) for concurrent operations
- Dry-run mode for previewing steps
- State filtering for processing subsets of data
- EMR integration for distributed processing
- Dual execution mode (CLI and Step Functions)
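The resume behaviour can be pictured as a runner that persists the names of completed steps and skips them on restart. A minimal sketch in Deno-flavoured TypeScript follows; the types and state-file format are assumptions, not Athanor's actual implementation.

```typescript
// Hypothetical resume loop: persist finished step names so a rerun after a
// failure skips completed work. Not Athanor's real code.
interface Step {
  name: string;
  run: () => Promise<void>;
}

async function runWithResume(steps: Step[], stateFile: string): Promise<void> {
  let done = new Set<string>();
  try {
    done = new Set<string>(JSON.parse(await Deno.readTextFile(stateFile)));
  } catch {
    // No state file yet: this is a fresh run.
  }

  for (const step of steps) {
    if (done.has(step.name)) continue; // already completed in a prior run
    await step.run(); // a failure here leaves earlier progress recorded
    done.add(step.name);
    await Deno.writeTextFile(stateFile, JSON.stringify([...done]));
  }
}
```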
Workflows:
- SumCats - Category-based audience summaries for digital advertising
- Summary - Customer profiles with demographics and purchase history
- Linkage - Identity matching files for cross-system data linking
- P2A3XGB - Standard XGBoost model training
- P2A3XGB-MCMC - MCMC-optimized training variant
Usage:
```sh
# Run a summary workflow
./bin/ath create summary --hh-date 2025-11-04

# Run with state filtering
./bin/ath create summary --hh-date 2025-11-04 --states ca,tx,ny

# Preview without executing
./bin/ath create summary --hh-date 2025-11-04 --dry-run
```
Response Analysis
Location: projects/response-analysis/
Automated batch runner for response analysis (RA) jobs. Determines which promotions need analysis based on transaction data availability and runs them automatically.
Key Features:
- Processes promotions whose transaction data postdates the mail date (see the sketch below)
- Batches campaigns together for combined analysis
- Integrates with Dashboards audit data
- Runs the `auto_ra` Python script for each promotion
- Stores results to EFS at `/mnt/data/prod/<title>/ra/*`
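The selection rule in the first bullet amounts to a date comparison. A hypothetical sketch, with types and field names assumed rather than taken from the project's schema:

```typescript
// Hypothetical: a promotion qualifies for RA once transaction data newer
// than its mail date is available.
interface Promotion {
  id: string;
  mailDate: Date;
}

function promotionsNeedingAnalysis(
  promotions: Promotion[],
  latestTransactionDate: Date,
): Promotion[] {
  return promotions.filter((p) => latestTransactionDate > p.mailDate);
}
```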
Data Sources:
- Dashboards audit account collection
- Households-memo transaction dates mapping
- Title transactions from current households file (EMR)
- Production data in `/mnt/data/prod/*`
Digital Audience
Location: projects/digital-audience/
Processing scripts for digital audience file generation; the code is symlinked to the step-scripts implementation.
CICD Tools
The cicd/ directory contains two command-line tools for deployment and testing:
backend
Deployment utility for building and deploying CDK stacks.
Key Commands:
```sh
# Check current checkout status
backend info

# Deploy all stacks
backend deploy

# Deploy with a clean CDK build (recommended regularly)
backend deploy --clean-cdk

# Deploy a single stack
backend deploy --stack p2a3xgb2

# Check out staging branches
backend checkout-staging

# Check out release tags
backend checkout-tag
```
it (Integration Testing)
Test execution and reporting utility.
Key Commands:
```sh
# Batch testing (recommended)
it batch all             # Run all tests with dependency resolution (sketched below)
it batch core licensed   # Run specific test suites
it batch all --dry-run   # Preview execution order

# Individual testing
it start p2a3all         # Start a specific test
it ls                    # Check test status

# Reporting
it capture               # Capture test output
it report                # Generate test report
```
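The "dependency resolution" behind `it batch all` can be pictured as a topological ordering of test suites. A purely illustrative sketch; the tool's real ordering logic and its dependency data are not documented here.

```typescript
// Illustrative depth-first topological sort: run prerequisites before the
// tests that depend on them. The dependency data is hypothetical.
function resolveOrder(deps: Record<string, string[]>): string[] {
  const order: string[] = [];
  const seen = new Set<string>();

  const visit = (test: string): void => {
    if (seen.has(test)) return;
    seen.add(test);
    for (const dep of deps[test] ?? []) visit(dep); // prerequisites first
    order.push(test);
  };

  for (const test of Object.keys(deps)) visit(test);
  return order;
}

// Hypothetical example: fulfillment depends on p2a3all.
console.log(resolveOrder({ fulfillment: ["p2a3all"], p2a3all: [] }));
// -> ["p2a3all", "fulfillment"]
```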
Test Suites:
- `core` - Essential tests (p2a3all, p2a3counts, hotline)
- `licensed` - Licensed data processing tests
- `licensed-v2` - New V2 licensed data tests
- `fulfillment` - Order processing and validation tests
- `quick` - Fast smoke tests (p2a3counts, hotline)
- `all` - Every available test
Integrations
Related Repositories
These projects must be checked out at the same level as cdk-backend:
| Repository | Purpose |
|---|---|
| coop-scala | Core Scala/Spark data processing |
| ds-modeling | Data science model training |
| order-processing | Order fulfillment logic |
| data-science | Python data science utilities |
AWS Services
| Service | Usage |
|---|---|
| AWS Step Functions | Workflow orchestration |
| AWS Batch | Scalable compute |
| AWS Lambda | Serverless functions |
| AWS EFS | Shared file storage |
| AWS S3 | Data storage |
| AWS ECR | Docker image registry |
| AWS EMR | Distributed processing |
| AWS Secrets Manager | Credential storage |
External Systems
| System | Purpose |
|---|---|
| Dashboards | Order management, audit data source |
| MongoDB Atlas | Shiny app data storage |
| S3 (Databricks) | Cross-account data access |
| MelissaData | Address validation |
Development
Prerequisites
- Deno 2.0+ - Runtime for step scripts and Athanor
- Node.js - CDK and build tooling
- AWS CLI - Account access
- Maven - Building coop-scala
- mdBook - Documentation and test reports
- Docker - Container builds
Setup
1. Clone the required repositories to the same workspace level:

   ```sh
   git clone git@bitbucket.org:path2response/cdk-backend.git
   git clone git@bitbucket.org:path2response/coop-scala.git
   git clone git@bitbucket.org:path2response/ds-modeling.git
   git clone git@bitbucket.org:path2response/order-processing.git
   git clone git@bitbucket.org:path2response/data-science.git
   ```

2. Build coop-scala:

   ```sh
   cd coop-scala
   mvn clean install
   ```

3. Build the cicd tools:

   ```sh
   cd cdk-backend/cicd
   ./build.sh
   ```

4. Configure the CDK target in `~/.cdk.json` (how an app reads these values is sketched below):

   ```json
   {
     "context": {
       "target": "staging",
       "version": 22301
     }
   }
   ```
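For reference, CDK merges user-level context from `~/.cdk.json` into the app automatically, so the values above can be read with `tryGetContext`. Whether `bin/main.ts` consumes exactly these keys this way is an assumption.

```typescript
// Sketch: reading the deployment target and version from CDK context.
// CDK merges ~/.cdk.json into the app's context automatically.
import { App } from "aws-cdk-lib";

const app = new App();
const target: string = app.node.tryGetContext("target");   // e.g. "staging"
const version: number = app.node.tryGetContext("version"); // e.g. 22301
```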
Deployment
```sh
# From the workspace root (not inside a project)
cd ~/workspace/

# Verify checkout status
backend info

# Deploy (full clean recommended)
backend deploy --clean-cdk
```
Testing
```sh
# Run all tests with the batch command
it batch all

# View results
it capture
it report

# View the report in a browser
cd /mnt/data/it/<version>/report
mdbook serve
```
Deployment Environments
| Environment | Branch/Tag | Server |
|---|---|---|
| Development | staging (or feature) | 10.129.50.50 |
| Staging | staging | 10.130.50.50 |
| RC | rc | 10.131.50.50 |
| Production | release tag (e.g., 291.0.0) | 10.132.50.50 |
Related Documentation
- Path2Acquisition Flow - Complete data flow diagram
- Glossary - Term definitions
- Response Analysis - RA system documentation
- Athanor Documentation - Full workflow runner documentation
Source: README.md, INSTALL.md, cdk-stepfunctions/README.md, cicd/README.md, projects/athanor/README.md, projects/response-analysis/README.md, docker/README.md, commons/any/README.md, step-scripts/README.md
Documentation created: 2026-01-24