CDK Backend Overview

AWS CDK infrastructure for Path2Response’s Step Functions workflows and batch processing systems.

Purpose

The cdk-backend repository defines and deploys AWS infrastructure for Path2Response’s data processing pipelines. It manages:

  • Step Functions Workflows - Orchestrated multi-step data processing pipelines
  • AWS Batch Compute - Scalable compute resources for heavy processing jobs
  • Lambda Functions - Lightweight serverless functions for order processing
  • Docker Images - Container definitions for batch and Lambda workloads
  • EFS Integration - Shared file system access for order processing data

This infrastructure supports the core Path2Acquisition product by enabling automated audience creation, model training, and data file generation workflows.

Architecture

Directory Structure

cdk-backend/
├── cdk-stepfunctions/       # CDK stack definitions for Step Functions
│   ├── bin/main.ts          # Main CDK app entry point
│   └── lib/
│       ├── sfn-stacks/      # Step Function workflow stacks (17 workflows)
│       └── util/            # Shared utilities for CDK constructs
├── step-scripts/            # Deno/TypeScript step implementations
│   └── src/bin.step/        # Step function step scripts
├── projects/                # Standalone utility projects
│   ├── athanor/             # Workflow runner for licensed files
│   ├── response-analysis/   # Automated response analysis (RA)
│   ├── digital-audience/    # Digital audience processing
│   ├── sumcats/             # Category summaries processing
│   └── experiments/         # Experimental features
├── cicd/                    # Build and deployment tools
│   ├── backend              # Deployment CLI
│   └── it                   # Integration testing CLI
├── docker/                  # Docker image definitions
│   ├── backend-batch-orders/    # EMR-compatible batch processing
│   └── backend-lambda-orders/   # Lambda function container
├── commons/                 # Shared code libraries
│   ├── deno/                # Deno-compatible utilities
│   ├── node/                # Node.js utilities
│   └── any/                 # Platform-agnostic code
├── infrastructure/          # VPC and network infrastructure CDK
│   ├── cdk-vpc4emr/         # VPC for EMR clusters
│   ├── cdk-vpc4general/     # General purpose VPC
│   └── cdk-vpc4melissa/     # MelissaData integration VPC
└── book/                    # Test report output (mdBook)
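
The CDK app entry point (cdk-stepfunctions/bin/main.ts) instantiates one stack per workflow. As a rough sketch of how such an entry point might be wired (the placeholder stack body, prop names, and context keys below are illustrative assumptions, not the repository's actual code):

```typescript
// Hypothetical sketch of a CDK app entry point like cdk-stepfunctions/bin/main.ts.
// The placeholder stack stands in for the 17 real sfn-stacks; class body, props,
// and context keys are illustrative assumptions.
import * as cdk from "aws-cdk-lib";

interface WorkflowStackProps extends cdk.StackProps {
  target: string;   // deployment target, e.g. "staging" or a release tag
  version?: number; // release version taken from CDK context
}

class SfnTemplateStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props: WorkflowStackProps) {
    super(scope, id, props);
    // Real stacks define Step Functions state machines, Batch job queues,
    // Lambda functions, and EFS access here.
  }
}

const app = new cdk.App();

// Target and version come from CDK context (see ~/.cdk.json under Setup).
const target = app.node.tryGetContext("target") ?? "staging";
const version = app.node.tryGetContext("version");

new SfnTemplateStack(app, "SfnTemplateStack", { target, version });
// ...one instantiation per stack listed under "Step Functions Workflows"
```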

Technology Stack

| Component | Technology | Purpose |
|---|---|---|
| Infrastructure | AWS CDK (TypeScript) | Define and deploy AWS resources |
| Workflows | AWS Step Functions | Orchestrate multi-step processing |
| Compute | AWS Batch | Scalable container-based processing |
| Serverless | AWS Lambda | Order processing functions |
| Runtime | Deno 2.0+ | Step script execution |
| Storage | AWS EFS | Shared file system for data |
| Data | AWS S3 | Source data and artifacts |
| Containers | Docker | Batch job and Lambda packaging |
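
Step scripts run under Deno 2.0+ and are invoked by Batch jobs or Lambda. The real interface of the scripts in step-scripts/src/bin.step/ is repository-specific; purely as an illustration, a step script might parse its parameters from the command line and write results to the shared EFS mount:

```typescript
// Illustrative Deno step script; argument names and output paths are assumptions,
// not the actual step-scripts interface.
import { parseArgs } from "jsr:@std/cli/parse-args";

const args = parseArgs(Deno.args, { string: ["order-id", "work-dir"] });
const workDir = args["work-dir"] ?? "/mnt/data/work";

console.log(`initializing order ${args["order-id"]} in ${workDir}`);
await Deno.mkdir(workDir, { recursive: true });
await Deno.writeTextFile(
  `${workDir}/status.json`,
  JSON.stringify({ step: "initialize", ok: true }),
);
```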

Compute Instance Tiers

AWS Batch job queues are configured with different instance sizes for various workload requirements:

| Size | CPUs | Memory | Instance Types | Use Case |
|---|---|---|---|---|
| XS | 4 | 14 GB | m7a.xlarge | Quick initialization steps |
| S | 8 | 28 GB | m7a.2xlarge | Standard processing |
| M | 16 | 60 GB | r7a.2xlarge, m6a.4xlarge | Model training prep |
| L | 16 | 120 GB | r7a.4xlarge, r6a.4xlarge | XGBoost training |
| XXL | 192 | 1.47 TB | r7a.48xlarge, r6a.48xlarge | Large-scale scoring |
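
Each tier maps to an AWS Batch compute environment and job queue. A hedged CDK sketch of how the M tier might be declared with the aws-cdk-lib Batch constructs (construct IDs, the VPC lookup, and maxvCpus are assumptions, not the repository's actual definitions):

```typescript
// Hedged sketch: the "M" tier (16 CPUs / 60 GB) expressed with aws-cdk-lib
// Batch L2 constructs. Construct IDs, VPC lookup, and maxvCpus are assumptions.
import * as cdk from "aws-cdk-lib";
import * as batch from "aws-cdk-lib/aws-batch";
import * as ec2 from "aws-cdk-lib/aws-ec2";

export class BatchTierMStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Assumed: an existing VPC such as the one from infrastructure/cdk-vpc4general.
    const vpc = ec2.Vpc.fromLookup(this, "Vpc", { vpcName: "vpc4general" });

    const tierM = new batch.ManagedEc2EcsComputeEnvironment(this, "TierM", {
      vpc,
      instanceTypes: [
        new ec2.InstanceType("r7a.2xlarge"),
        new ec2.InstanceType("m6a.4xlarge"),
      ],
      maxvCpus: 256, // assumption; the real queues size this per workload
    });

    new batch.JobQueue(this, "TierMQueue", {
      computeEnvironments: [{ computeEnvironment: tierM, order: 1 }],
    });
  }
}
```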

Step Functions Workflows

The system deploys 17 Step Function state machines for different processing workflows:

Core P2A3 Workflows

| Stack | State Machine | Purpose |
|---|---|---|
| SfnP2a3xgb2Stack | P2A3XGB | Standard XGBoost model training and scoring (8 steps) |
| SfnP2a3xgb2FutureStack | P2A3XGB-MCMC | MCMC-optimized training data selection (10 steps) |
| SfnP2a3xgbCountsOnlyStack | P2A3-CountsOnly | Quick count generation without full scoring |

Licensed Files Workflows

| Stack | Purpose |
|---|---|
| SfnSummaryFileStack | Customer profile summaries |
| SfnSummaryByStateStack | State-filtered summary processing |
| SfnLinkageFileStack | Identity matching files |
| SfnLinkageByStateStack | State-filtered linkage processing |
| SfnSumCatsStack | Category-based audience summaries |
| SfnLicensedFilesStack | Consolidated licensed file processing |

Operational Workflows

| Stack | Purpose |
|---|---|
| SfnFulfillmentStack | Order fulfillment processing |
| SfnFulfillmentInputAnalysisStack | Pre-fulfillment validation |
| SfnHotlineSiteVisitorProspectsStack | Hotline site visitor prospect scoring |
| SfnDTCNonContributingStack | DTC non-contributing member processing |
| SfnDigitalAudienceStack | Digital audience file generation |
| SfnBrowseCountsWeeklyStack | Weekly browse count aggregation |
| SfnBrowseTransactionsStack | Browse transaction processing |
| SfnTemplateStack | Template/example workflow |

P2A3XGB Workflow Steps

The primary P2A3XGB workflow consists of 8 steps; the letter after each step name indicates the compute instance tier it runs on:

  1. Initialize (S) - Set up working directories and validate inputs
  2. TrainSelect (S) - Select training data based on model parameters
  3. Train (L) - Execute XGBoost model training
  4. ScoreSelect (M) - Prepare scoring dataset
  5. Score (XXL) - Score all households against trained model
  6. ReportsSelect (S) - Prepare reporting data
  7. Reports (XXL) - Generate model performance reports
  8. Finalization (S) - Clean up and stage outputs
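
In CDK, a workflow like this is typically expressed as a chain of tasks that each submit a Batch job to the queue matching the step's instance tier. A hedged sketch (job definition and queue ARNs are placeholders, and several of the eight steps are elided; this is not the repository's actual stack):

```typescript
// Hedged sketch of wiring the P2A3XGB chain in CDK. ARNs are placeholders and
// several of the eight steps are elided.
import * as cdk from "aws-cdk-lib";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";

export class SfnP2a3xgbSketchStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Submit a Batch job to the queue for the given compute tier.
    const batchStep = (name: string, jobQueueArn: string) =>
      new tasks.BatchSubmitJob(this, name, {
        jobName: name,
        jobQueueArn,
        jobDefinitionArn:
          "arn:aws:batch:us-east-1:111111111111:job-definition/example", // placeholder
      });

    // Placeholder queue ARNs, one per tier used by the workflow.
    const sQueue = "arn:aws:batch:us-east-1:111111111111:job-queue/tier-s";
    const xxlQueue = "arn:aws:batch:us-east-1:111111111111:job-queue/tier-xxl";

    const definition = batchStep("Initialize", sQueue)
      .next(batchStep("TrainSelect", sQueue))
      // ...Train (L), ScoreSelect (M), ReportsSelect (S) elided...
      .next(batchStep("Score", xxlQueue))
      .next(batchStep("Reports", xxlQueue))
      .next(batchStep("Finalization", sQueue));

    new sfn.StateMachine(this, "P2A3XGB", {
      definitionBody: sfn.DefinitionBody.fromChainable(definition),
    });
  }
}
```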

Projects

Athanor

Location: projects/athanor/

A workflow runner for multi-step data processing with resume capability. Named after the alchemical furnace that transforms base materials into valuable outputs.

Key Features:

  • Multi-step workflows with automatic resume from failure
  • File-based locking (SHARED/EXCLUSIVE) for concurrent operations
  • Dry-run mode for previewing steps
  • State filtering for processing subsets of data
  • EMR integration for distributed processing
  • Dual execution mode (CLI and Step Functions)

Workflows:

  • SumCats - Category-based audience summaries for digital advertising
  • Summary - Customer profiles with demographics and purchase history
  • Linkage - Identity matching files for cross-system data linking
  • P2A3XGB - Standard XGBoost model training
  • P2A3XGB-MCMC - MCMC-optimized training variant

Usage:

# Run a summary workflow
./bin/ath create summary --hh-date 2025-11-04

# Run with state filtering
./bin/ath create summary --hh-date 2025-11-04 --states ca,tx,ny

# Preview without executing
./bin/ath create summary --hh-date 2025-11-04 --dry-run
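
Athanor's resume capability amounts to remembering which steps have already completed and skipping them on the next run. A rough sketch of the idea, not Athanor's actual implementation, assuming one state file per workflow run:

```typescript
// Rough sketch of resume-from-failure: completed step names are recorded in a
// state file and skipped on the next run. Not Athanor's actual implementation.
type Step = { name: string; run: () => Promise<void> };

async function runWithResume(stateFile: string, steps: Step[]): Promise<void> {
  // Load the steps already completed by a previous run, if any.
  const done = new Set<string>(
    await Deno.readTextFile(stateFile).then(JSON.parse).catch(() => []),
  );

  for (const step of steps) {
    if (done.has(step.name)) {
      console.log(`skipping ${step.name} (already completed)`);
      continue;
    }
    await step.run(); // a failure here leaves earlier steps recorded for resume
    done.add(step.name);
    await Deno.writeTextFile(stateFile, JSON.stringify([...done]));
  }
}
```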

Response Analysis

Location: projects/response-analysis/

Automated batch runner for response analysis (RA) jobs. Determines which promotions need analysis based on transaction data availability and runs them automatically.

Key Features:

  • Selects promotions whose available transaction data extends past the mail date (see the sketch below)
  • Batches campaigns together for combined analysis
  • Integrates with Dashboards audit data
  • Runs auto_ra Python script for each promotion
  • Stores results to EFS at /mnt/data/prod/<title>/ra/*

Data Sources:

  • Dashboards audit account collection
  • Households-memo transaction dates mapping
  • Title transactions from current households file (EMR)
  • Production data in /mnt/data/prod/*
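
The selection rule can be summarized as: a promotion qualifies for analysis once transaction data newer than its mail date is available. A hedged sketch of that filter (types and field names are illustrative assumptions):

```typescript
// Illustrative only: keep promotions whose mail date precedes the latest
// transaction date available for their title. Field names are assumptions.
interface Promotion {
  id: string;
  title: string;
  mailDate: Date;
}

function promotionsNeedingAnalysis(
  promotions: Promotion[],
  latestTransactionDateByTitle: Map<string, Date>,
): Promotion[] {
  return promotions.filter((p) => {
    const latest = latestTransactionDateByTitle.get(p.title);
    return latest !== undefined && latest > p.mailDate;
  });
}
```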

Digital Audience

Location: projects/digital-audience/

Processing scripts for digital audience file generation. The scripts symlink to the step-scripts implementation.

CICD Tools

The cicd/ directory contains two command-line tools for deployment and testing:

backend

Deployment utility for building and deploying CDK stacks.

Key Commands:

# Check current checkout status
backend info

# Deploy all stacks
backend deploy

# Deploy with clean CDK (recommended regularly)
backend deploy --clean-cdk

# Deploy single stack
backend deploy --stack p2a3xgb2

# Checkout staging branches
backend checkout-staging

# Checkout release tags
backend checkout-tag

it (Integration Testing)

Test execution and reporting utility.

Key Commands:

# Batch testing (recommended)
it batch all              # Run all tests with dependency resolution
it batch core licensed    # Run specific test suites
it batch all --dry-run    # Preview execution order

# Individual testing
it start p2a3all          # Start specific test
it ls                     # Check test status

# Reporting
it capture                # Capture test output
it report                 # Generate test report

Test Suites:

  • core - Essential tests (p2a3all, p2a3counts, hotline)
  • licensed - Licensed data processing tests
  • licensed-v2 - New V2 licensed data tests
  • fulfillment - Order processing and validation
  • quick - Fast smoke tests (p2a3counts, hotline)
  • all - Every available test

Integrations

These projects must be checked out at the same level as cdk-backend:

| Repository | Purpose |
|---|---|
| coop-scala | Core Scala/Spark data processing |
| ds-modeling | Data science model training |
| order-processing | Order fulfillment logic |
| data-science | Python data science utilities |

AWS Services

| Service | Usage |
|---|---|
| AWS Step Functions | Workflow orchestration |
| AWS Batch | Scalable compute |
| AWS Lambda | Serverless functions |
| AWS EFS | Shared file storage |
| AWS S3 | Data storage |
| AWS ECR | Docker image registry |
| AWS EMR | Distributed processing |
| AWS Secrets Manager | Credential storage |

External Systems

| System | Purpose |
|---|---|
| Dashboards | Order management, audit data source |
| MongoDB Atlas | Shiny app data storage |
| S3 (Databricks) | Cross-account data access |
| MelissaData | Address validation |

Development

Prerequisites

  • Deno 2.0+ - Runtime for step scripts and Athanor
  • Node.js - CDK and build tooling
  • AWS CLI - Account access
  • Maven - Building coop-scala
  • mdBook - Documentation and test reports
  • Docker - Container builds

Setup

  1. Clone the required repositories to the same workspace level:

    git clone git@bitbucket.org:path2response/cdk-backend.git
    git clone git@bitbucket.org:path2response/coop-scala.git
    git clone git@bitbucket.org:path2response/ds-modeling.git
    git clone git@bitbucket.org:path2response/order-processing.git
    git clone git@bitbucket.org:path2response/data-science.git
    
  2. Build coop-scala:

    cd coop-scala
    mvn clean install
    
  3. Build cicd tools:

    cd cdk-backend/cicd
    ./build.sh
    
  4. Configure CDK target in ~/.cdk.json:

    {
      "context": {
        "target": "staging",
        "version": 22301
      }
    }
    

Deployment

# From workspace root (not inside project)
cd ~/workspace/

# Verify checkout status
backend info

# Deploy (full clean recommended)
backend deploy --clean-cdk

Testing

# Run all tests with batch command
it batch all

# View results
it capture
it report

# View report in browser
cd /mnt/data/it/<version>/report
mdbook serve

Deployment Environments

| Environment | Branch/Tag | Server |
|---|---|---|
| Development | staging (or feature) | 10.129.50.50 |
| Staging | staging | 10.130.50.50 |
| RC | rc | 10.131.50.50 |
| Production | release tag (e.g., 291.0.0) | 10.132.50.50 |

Source: README.md, INSTALL.md, cdk-stepfunctions/README.md, cicd/README.md, projects/athanor/README.md, projects/response-analysis/README.md, docker/README.md, commons/any/README.md, step-scripts/README.md

Documentation created: 2026-01-24