CDK Backend Overview

AWS CDK infrastructure for Path2Response’s Step Functions workflows and batch processing systems.

Purpose

The cdk-backend repository defines and deploys AWS infrastructure for Path2Response’s data processing pipelines. It manages:

  • Step Functions Workflows - Orchestrated multi-step data processing pipelines
  • AWS Batch Compute - Scalable compute resources for heavy processing jobs
  • Lambda Functions - Lightweight serverless functions for order processing
  • Docker Images - Container definitions for batch and Lambda workloads
  • EFS Integration - Shared file system access for order processing data

This infrastructure supports the core Path2Acquisition product by enabling automated audience creation, model training, and data file generation workflows.

Architecture

Directory Structure

cdk-backend/
├── cdk-stepfunctions/       # CDK stack definitions for Step Functions
│   ├── bin/main.ts          # Main CDK app entry point
│   └── lib/
│       ├── sfn-stacks/      # Step Function workflow stacks (17 workflows)
│       └── util/            # Shared utilities for CDK constructs
├── step-scripts/            # Deno/TypeScript step implementations
│   └── src/bin.step/        # Step function step scripts
├── projects/                # Standalone utility projects
│   ├── athanor/             # Workflow runner for licensed files
│   ├── response-analysis/   # Automated response analysis (RA)
│   ├── digital-audience/    # Digital audience processing
│   ├── sumcats/             # Category summaries processing
│   └── experiments/         # Experimental features
├── cicd/                    # Build and deployment tools
│   ├── backend              # Deployment CLI
│   └── it                   # Integration testing CLI
├── docker/                  # Docker image definitions
│   ├── backend-batch-orders/    # EMR-compatible batch processing
│   └── backend-lambda-orders/   # Lambda function container
├── commons/                 # Shared code libraries
│   ├── deno/                # Deno-compatible utilities
│   ├── node/                # Node.js utilities
│   └── any/                 # Platform-agnostic code
├── infrastructure/          # VPC and network infrastructure CDK
│   ├── cdk-vpc4emr/         # VPC for EMR clusters
│   ├── cdk-vpc4general/     # General purpose VPC
│   └── cdk-vpc4melissa/     # MelissaData integration VPC
└── book/                    # Test report output (mdBook)
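
The CDK app entry point (cdk-stepfunctions/bin/main.ts) instantiates one stack per workflow. As a rough sketch of how such an entry point might be wired (the placeholder stack body, prop names, and context keys below are illustrative assumptions, not the repository's actual code):

```typescript
// Hypothetical sketch of a CDK app entry point like cdk-stepfunctions/bin/main.ts.
// The placeholder stack stands in for the 17 real sfn-stacks; class body, props,
// and context keys are illustrative assumptions.
import * as cdk from "aws-cdk-lib";

interface WorkflowStackProps extends cdk.StackProps {
  target: string;   // deployment target, e.g. "staging" or a release tag
  version?: number; // release version taken from CDK context
}

class SfnTemplateStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props: WorkflowStackProps) {
    super(scope, id, props);
    // Real stacks define Step Functions state machines, Batch job queues,
    // Lambda functions, and EFS access here.
  }
}

const app = new cdk.App();

// Target and version come from CDK context (see ~/.cdk.json under Setup).
const target = app.node.tryGetContext("target") ?? "staging";
const version = app.node.tryGetContext("version");

new SfnTemplateStack(app, "SfnTemplateStack", { target, version });
// ...one instantiation per stack listed under "Step Functions Workflows"
```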

Technology Stack

| Component | Technology | Purpose |
|---|---|---|
| Infrastructure | AWS CDK (TypeScript) | Define and deploy AWS resources |
| Workflows | AWS Step Functions | Orchestrate multi-step processing |
| Compute | AWS Batch | Scalable container-based processing |
| Serverless | AWS Lambda | Order processing functions |
| Runtime | Deno 2.0+ | Step script execution |
| Storage | AWS EFS | Shared file system for data |
| Data | AWS S3 | Source data and artifacts |
| Containers | Docker | Batch job and Lambda packaging |
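
Step scripts run under Deno 2.0+ and are invoked by Batch jobs or Lambda. The real interface of the scripts in step-scripts/src/bin.step/ is repository-specific; purely as an illustration, a step script might parse its parameters from the command line and write results to the shared EFS mount:

```typescript
// Illustrative Deno step script; argument names and output paths are assumptions,
// not the actual step-scripts interface.
import { parseArgs } from "jsr:@std/cli/parse-args";

const args = parseArgs(Deno.args, { string: ["order-id", "work-dir"] });
const workDir = args["work-dir"] ?? "/mnt/data/work";

console.log(`initializing order ${args["order-id"]} in ${workDir}`);
await Deno.mkdir(workDir, { recursive: true });
await Deno.writeTextFile(
  `${workDir}/status.json`,
  JSON.stringify({ step: "initialize", ok: true }),
);
```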

Compute Instance Tiers

AWS Batch job queues are configured with different instance sizes for various workload requirements:

| Size | CPUs | Memory | Instance Types | Use Case |
|---|---|---|---|---|
| XS | 4 | 14 GB | m7a.xlarge | Quick initialization steps |
| S | 8 | 28 GB | m7a.2xlarge | Standard processing |
| M | 16 | 60 GB | r7a.2xlarge, m6a.4xlarge | Model training prep |
| L | 16 | 120 GB | r7a.4xlarge, r6a.4xlarge | XGBoost training |
| XXL | 192 | 1.47 TB | r7a.48xlarge, r6a.48xlarge | Large-scale scoring |
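
Each tier maps to an AWS Batch compute environment and job queue. A hedged CDK sketch of how the M tier might be declared with the aws-cdk-lib Batch constructs (construct IDs, the VPC lookup, and maxvCpus are assumptions, not the repository's actual definitions):

```typescript
// Hedged sketch: the "M" tier (16 CPUs / 60 GB) expressed with aws-cdk-lib
// Batch L2 constructs. Construct IDs, VPC lookup, and maxvCpus are assumptions.
import * as cdk from "aws-cdk-lib";
import * as batch from "aws-cdk-lib/aws-batch";
import * as ec2 from "aws-cdk-lib/aws-ec2";

export class BatchTierMStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Assumed: an existing VPC such as the one from infrastructure/cdk-vpc4general.
    const vpc = ec2.Vpc.fromLookup(this, "Vpc", { vpcName: "vpc4general" });

    const tierM = new batch.ManagedEc2EcsComputeEnvironment(this, "TierM", {
      vpc,
      instanceTypes: [
        new ec2.InstanceType("r7a.2xlarge"),
        new ec2.InstanceType("m6a.4xlarge"),
      ],
      maxvCpus: 256, // assumption; the real queues size this per workload
    });

    new batch.JobQueue(this, "TierMQueue", {
      computeEnvironments: [{ computeEnvironment: tierM, order: 1 }],
    });
  }
}
```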

Step Functions Workflows

The system deploys 17 Step Function state machines for different processing workflows:

Core P2A3 Workflows

| Stack | State Machine | Purpose |
|---|---|---|
| SfnP2a3xgb2Stack | P2A3XGB | Standard XGBoost model training and scoring (8 steps) |
| SfnP2a3xgb2FutureStack | P2A3XGB-MCMC | MCMC-optimized training data selection (10 steps) |
| SfnP2a3xgbCountsOnlyStack | P2A3-CountsOnly | Quick count generation without full scoring |

Licensed Files Workflows

| Stack | Purpose |
|---|---|
| SfnSummaryFileStack | Customer profile summaries |
| SfnSummaryByStateStack | State-filtered summary processing |
| SfnLinkageFileStack | Identity matching files |
| SfnLinkageByStateStack | State-filtered linkage processing |
| SfnSumCatsStack | Category-based audience summaries |
| SfnLicensedFilesStack | Consolidated licensed file processing |

Operational Workflows

| Stack | Purpose |
|---|---|
| SfnFulfillmentStack | Order fulfillment processing |
| SfnFulfillmentInputAnalysisStack | Pre-fulfillment validation |
| SfnHotlineSiteVisitorProspectsStack | Hotline site visitor prospect scoring |
| SfnDTCNonContributingStack | DTC non-contributing member processing |
| SfnDigitalAudienceStack | Digital audience file generation |
| SfnBrowseCountsWeeklyStack | Weekly browse count aggregation |
| SfnBrowseTransactionsStack | Browse transaction processing |
| SfnTemplateStack | Template/example workflow |

P2A3XGB Workflow Steps

The primary P2A3XGB workflow consists of 8 steps; the letter after each step name indicates the compute instance tier it runs on:

  1. Initialize (S) - Set up working directories and validate inputs
  2. TrainSelect (S) - Select training data based on model parameters
  3. Train (L) - Execute XGBoost model training
  4. ScoreSelect (M) - Prepare scoring dataset
  5. Score (XXL) - Score all households against trained model
  6. ReportsSelect (S) - Prepare reporting data
  7. Reports (XXL) - Generate model performance reports
  8. Finalization (S) - Clean up and stage outputs
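
In CDK, a workflow like this is typically expressed as a chain of tasks that each submit a Batch job to the queue matching the step's instance tier. A hedged sketch (job definition and queue ARNs are placeholders, and several of the eight steps are elided; this is not the repository's actual stack):

```typescript
// Hedged sketch of wiring the P2A3XGB chain in CDK. ARNs are placeholders and
// several of the eight steps are elided.
import * as cdk from "aws-cdk-lib";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";

export class SfnP2a3xgbSketchStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Submit a Batch job to the queue for the given compute tier.
    const batchStep = (name: string, jobQueueArn: string) =>
      new tasks.BatchSubmitJob(this, name, {
        jobName: name,
        jobQueueArn,
        jobDefinitionArn:
          "arn:aws:batch:us-east-1:111111111111:job-definition/example", // placeholder
      });

    // Placeholder queue ARNs, one per tier used by the workflow.
    const sQueue = "arn:aws:batch:us-east-1:111111111111:job-queue/tier-s";
    const xxlQueue = "arn:aws:batch:us-east-1:111111111111:job-queue/tier-xxl";

    const definition = batchStep("Initialize", sQueue)
      .next(batchStep("TrainSelect", sQueue))
      // ...Train (L), ScoreSelect (M), ReportsSelect (S) elided...
      .next(batchStep("Score", xxlQueue))
      .next(batchStep("Reports", xxlQueue))
      .next(batchStep("Finalization", sQueue));

    new sfn.StateMachine(this, "P2A3XGB", {
      definitionBody: sfn.DefinitionBody.fromChainable(definition),
    });
  }
}
```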

Projects

Athanor

Location: projects/athanor/

A workflow runner for multi-step data processing with resume capability. Named after the alchemical furnace that transforms base materials into valuable outputs.

Key Features:

  • Multi-step workflows with automatic resume from failure
  • File-based locking (SHARED/EXCLUSIVE) for concurrent operations
  • Dry-run mode for previewing steps
  • State filtering for processing subsets of data
  • EMR integration for distributed processing
  • Dual execution mode (CLI and Step Functions)

Workflows:

  • SumCats - Category-based audience summaries for digital advertising
  • Summary - Customer profiles with demographics and purchase history
  • Linkage - Identity matching files for cross-system data linking
  • P2A3XGB - Standard XGBoost model training
  • P2A3XGB-MCMC - MCMC-optimized training variant

Usage:

# Run a summary workflow
./bin/ath create summary --hh-date 2025-11-04

# Run with state filtering
./bin/ath create summary --hh-date 2025-11-04 --states ca,tx,ny

# Preview without executing
./bin/ath create summary --hh-date 2025-11-04 --dry-run
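
Athanor's resume capability amounts to remembering which steps have already completed and skipping them on the next run. A rough sketch of the idea, not Athanor's actual implementation, assuming one state file per workflow run:

```typescript
// Rough sketch of resume-from-failure: completed step names are recorded in a
// state file and skipped on the next run. Not Athanor's actual implementation.
type Step = { name: string; run: () => Promise<void> };

async function runWithResume(stateFile: string, steps: Step[]): Promise<void> {
  // Load the steps already completed by a previous run, if any.
  const done = new Set<string>(
    await Deno.readTextFile(stateFile).then(JSON.parse).catch(() => []),
  );

  for (const step of steps) {
    if (done.has(step.name)) {
      console.log(`skipping ${step.name} (already completed)`);
      continue;
    }
    await step.run(); // a failure here leaves earlier steps recorded for resume
    done.add(step.name);
    await Deno.writeTextFile(stateFile, JSON.stringify([...done]));
  }
}
```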

Response Analysis

Location: projects/response-analysis/

Automated batch runner for response analysis (RA) jobs. Determines which promotions need analysis based on transaction data availability and runs them automatically.

Key Features:

  • Selects promotions whose available transaction data extends past the mail date (see the sketch below)
  • Batches campaigns together for combined analysis
  • Integrates with Dashboards audit data
  • Runs auto_ra Python script for each promotion
  • Stores results to EFS at /mnt/data/prod/<title>/ra/*

Data Sources:

  • Dashboards audit account collection
  • Households-memo transaction dates mapping
  • Title transactions from current households file (EMR)
  • Production data in /mnt/data/prod/*
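
The selection rule can be summarized as: a promotion qualifies for analysis once transaction data newer than its mail date is available. A hedged sketch of that filter (types and field names are illustrative assumptions):

```typescript
// Illustrative only: keep promotions whose mail date precedes the latest
// transaction date available for their title. Field names are assumptions.
interface Promotion {
  id: string;
  title: string;
  mailDate: Date;
}

function promotionsNeedingAnalysis(
  promotions: Promotion[],
  latestTransactionDateByTitle: Map<string, Date>,
): Promotion[] {
  return promotions.filter((p) => {
    const latest = latestTransactionDateByTitle.get(p.title);
    return latest !== undefined && latest > p.mailDate;
  });
}
```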

Digital Audience

Location: projects/digital-audience/

Processing scripts for digital audience file generation. The scripts symlink to the step-scripts implementation.

CICD Tools

The cicd/ directory contains two command-line tools for deployment and testing:

backend

Deployment utility for building and deploying CDK stacks.

Key Commands:

# Check current checkout status
backend info

# Deploy all stacks
backend deploy

# Deploy with clean CDK (recommended regularly)
backend deploy --clean-cdk

# Deploy single stack
backend deploy --stack p2a3xgb2

# Checkout staging branches
backend checkout-staging

# Checkout release tags
backend checkout-tag

it (Integration Testing)

Test execution and reporting utility.

Key Commands:

# Batch testing (recommended)
it batch all              # Run all tests with dependency resolution
it batch core licensed    # Run specific test suites
it batch all --dry-run    # Preview execution order

# Individual testing
it start p2a3all          # Start specific test
it ls                     # Check test status

# Reporting
it capture                # Capture test output
it report                 # Generate test report

Test Suites:

  • core - Essential tests (p2a3all, p2a3counts, hotline)
  • licensed - Licensed data processing tests
  • licensed-v2 - New V2 licensed data tests
  • fulfillment - Order processing and validation
  • quick - Fast smoke tests (p2a3counts, hotline)
  • all - Every available test

Integrations

These projects must be checked out at the same level as cdk-backend:

| Repository | Purpose |
|---|---|
| coop-scala | Core Scala/Spark data processing |
| ds-modeling | Data science model training |
| order-processing | Order fulfillment logic |
| data-science | Python data science utilities |

AWS Services

| Service | Usage |
|---|---|
| AWS Step Functions | Workflow orchestration |
| AWS Batch | Scalable compute |
| AWS Lambda | Serverless functions |
| AWS EFS | Shared file storage |
| AWS S3 | Data storage |
| AWS ECR | Docker image registry |
| AWS EMR | Distributed processing |
| AWS Secrets Manager | Credential storage |

External Systems

| System | Purpose |
|---|---|
| Dashboards | Order management, audit data source |
| MongoDB Atlas | Shiny app data storage |
| S3 (Databricks) | Cross-account data access |
| MelissaData | Address validation |

Development

Prerequisites

  • Deno 2.0+ - Runtime for step scripts and Athanor
  • Node.js - CDK and build tooling
  • AWS CLI - Account access
  • Maven - Building coop-scala
  • mdBook - Documentation and test reports
  • Docker - Container builds

Setup

  1. Clone the required repositories to the same workspace level:

    git clone git@bitbucket.org:path2response/cdk-backend.git
    git clone git@bitbucket.org:path2response/coop-scala.git
    git clone git@bitbucket.org:path2response/ds-modeling.git
    git clone git@bitbucket.org:path2response/order-processing.git
    git clone git@bitbucket.org:path2response/data-science.git
    
  2. Build coop-scala:

    cd coop-scala
    mvn clean install
    
  3. Build cicd tools:

    cd cdk-backend/cicd
    ./build.sh
    
  4. Configure CDK target in ~/.cdk.json:

    {
      "context": {
        "target": "staging",
        "version": 22301
      }
    }
    

Deployment

# From workspace root (not inside project)
cd ~/workspace/

# Verify checkout status
backend info

# Deploy (full clean recommended)
backend deploy --clean-cdk

Testing

# Run all tests with batch command
it batch all

# View results
it capture
it report

# View report in browser
cd /mnt/data/it/<version>/report
mdbook serve

Deployment Environments

| Environment | Branch/Tag | Server |
|---|---|---|
| Development | staging (or feature) | 10.129.50.50 |
| Staging | staging | 10.130.50.50 |
| RC | rc | 10.131.50.50 |
| Production | release tag (e.g., 291.0.0) | 10.132.50.50 |

Source: README.md, INSTALL.md, cdk-stepfunctions/README.md, cicd/README.md, projects/athanor/README.md, projects/response-analysis/README.md, docker/README.md, commons/any/README.md, step-scripts/README.md

Documentation created: 2026-01-24