Operations Overview

Automated system maintenance, health monitoring, and data synchronization tools for Path2Response back-office infrastructure.

Purpose

The Operations repository provides day-to-day operational tooling for the Infrastructure team. It performs automated system checks, data synchronization, and maintenance across staging, release-candidate (RC), and production environments. The tools run on fixed schedules as cron jobs on cc.path2response.com, the command-and-control server.

Primary Users: Infrastructure Team (Jason Smith, Wes Hofmann)

Key Responsibilities:

  • System health monitoring at multiple intervals (10-minute, hourly, daily, weekly, monthly)
  • Data synchronization from production to staging/development environments
  • AWS resource management (EC2 instances, S3 storage, EFS volumes)
  • Automated cleanup of temporary files and old data
  • RC environment management (start/stop servers, pause/resume MongoDB Atlas)

Architecture

Directory Structure

operations/
├── bin/                    # Bash cron wrapper scripts
├── book/                   # Operations Reference Manual (mdbook)
│   └── src/               # Documentation source files
├── deno/                   # Deno utilities
│   ├── app/               # Deno applications (audit, costs, network)
│   ├── book-of-ops/       # AWS infrastructure documentation
│   ├── common/            # Shared Deno modules
│   ├── lib/               # Deno libraries (up, gateway)
│   └── utilities/         # Deno utility scripts
├── src/                    # TypeScript source code
│   ├── checks/            # Health check definitions
│   ├── rcman/             # RC management modules
│   ├── sync/              # Data synchronization modules
│   └── util/              # Shared utilities (AWS, functional, MongoDB)
├── tests/                  # Unit tests
├── crontab.txt            # Reference copy of production crontab
├── deno.jsonc             # Deno configuration
├── package.json           # Node.js dependencies
└── tsconfig.json          # TypeScript configuration

Technology Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| Runtime | Node.js + Deno | Dual runtime (Node for legacy, Deno for new utilities) |
| Language | TypeScript | Type-safe scripting |
| AWS SDK | @aws-sdk/client-s3, @aws-sdk/client-ec2 (v3) | AWS API access |
| Database | mongodb (7.0) | MongoDB operations for data sync |
| Notifications | node-slack | Slack integration for alerts |
| CLI | commander | Command-line argument parsing |
| Documentation | mdbook | Browsable operations manual |
| Scheduling | cron | Automated job execution |

Key Dependencies

{
  "@aws-sdk/client-ec2": "3.958.0",
  "@aws-sdk/client-s3": "3.958.0",
  "mongodb": "7.0.0",
  "node-slack": "0.0.7",
  "commander": "14.0.2"
}

Core Functionality

Utilities (Command-Line Tools)

| Utility | Purpose | Schedule |
| --- | --- | --- |
| system-health | Run health checks at varying intervals; alert on failures | Every 10 min, hourly, daily, weekly, monthly |
| sync-monkey | Synchronize data from production to staging/development | Multiple schedules (q6h, daily, weekly) |
| clean | Delete old households, temp folders, title-households | Daily (8 AM) |
| archive | Archive files to read-only or Glacier storage | Weekly (Monday 8 AM) |
| rc-man | Start/stop RC EC2 servers; pause/resume MongoDB Atlas | Sprint-based (Friday/Sunday) |
| restart | Restart specific services (e.g., Shiny server) | Daily (10 AM) |
| aws-s3-audit | Audit S3 bucket sizes and identify large folders | Weekly (Wednesday 2:10 PM) |
| aws-efs-audit | Audit EFS volume sizes | Weekly (disabled) |

Health Checks (system-health)

The system-health utility runs checks at different intervals:

| Interval | Checks |
| --- | --- |
| 10-minute | System load average, stale EC2 instances |
| Hourly | Databricks API version, 4Cite service status |
| Daily | Disk storage, households validity, production/development match |
| Weekly/Monthly | Reserved for future checks |
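
The interval table can be pictured as a small registry in which each check declares the interval it belongs to, and a run executes only the checks for the requested interval. The following is a minimal sketch of that idea, not the repository's actual code: the `HealthCheck` shape, the `runChecks` function, and the sample check names and failure message are all hypothetical.

```typescript
// Intervals taken from the table; everything else here is illustrative.
type Interval = "10-minute" | "hourly" | "daily" | "weekly" | "monthly";

interface HealthCheck {
  name: string;
  interval: Interval;
  // Returns a failure message, or undefined when the check passes.
  run: () => string | undefined;
}

// Run every check registered for one interval and collect the failures.
function runChecks(checks: HealthCheck[], interval: Interval): string[] {
  const failures: string[] = [];
  for (const check of checks) {
    if (check.interval !== interval) continue;
    const failure = check.run();
    if (failure !== undefined) {
      failures.push(`${check.name}: ${failure}`);
    }
  }
  return failures;
}

// Hypothetical registry with one passing and one failing check.
const sampleChecks: HealthCheck[] = [
  { name: "load-average", interval: "10-minute", run: () => undefined },
  { name: "disk-storage", interval: "daily", run: () => "volume nearly full" },
];

const dailyFailures = runChecks(sampleChecks, "daily");
```

Collecting failures into a list rather than stopping at the first one means a single alert can report everything that is wrong for that interval.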

EC2 Instance Management:

  • Instances tagged with p2r=permanent are exempt from stale checks
  • Instances can have custom timeouts (e.g., p2r=1.12:00 for 1 day 12 hours)
  • Auto-kill option: adding ,kill to the value (e.g., p2r=1:00,kill) terminates the instance once its timeout expires
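
The tag rules above can be illustrated with a small parser. This is a sketch under stated assumptions: the grammar (an optional `<days>.` prefix, then `<hours>:<minutes>`, then an optional `,kill` suffix) is inferred only from the three examples in this section, and `P2rTag` and `parseP2rTag` are hypothetical names rather than anything in the repository.

```typescript
// Parsed form of a p2r tag value (hypothetical shape).
interface P2rTag {
  permanent: boolean;      // true for "permanent": exempt from stale checks
  timeoutMinutes?: number; // custom timeout, when one is given
  kill: boolean;           // true when the ",kill" suffix is present
}

// Returns undefined for values that do not fit the assumed grammar.
function parseP2rTag(value: string): P2rTag | undefined {
  const parts = value.split(",").map((part) => part.trim());
  const spec = parts[0] ?? "";
  const kill = parts.slice(1).includes("kill");

  if (spec === "permanent") {
    return { permanent: true, kill: false };
  }

  // Assumed layout: optional "<days>." prefix, then "<hours>:<minutes>".
  const match = /^(?:(\d+)\.)?(\d+):(\d{2})$/.exec(spec);
  if (match === null) return undefined;

  const days = Number(match[1] ?? 0);
  const hours = Number(match[2]);
  const minutes = Number(match[3]);
  return {
    permanent: false,
    timeoutMinutes: (days * 24 + hours) * 60 + minutes,
    kill,
  };
}
```

Under these assumptions, `parseP2rTag("1.12:00")` yields a timeout of 2160 minutes (1 day 12 hours), and `parseP2rTag("1:00,kill")` yields 60 minutes with `kill` set.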

Data Synchronization (sync-monkey)

Synchronizes data from production to staging/development:

| Data Type | Source | Destinations | Schedule |
| --- | --- | --- | --- |
| Households | p2r.prod.data | p2r-dev-data-1, p2r.dev.use2.data1, p2r.dev2.data | Every 6 hours |
| Convert/Final | p2r.prod.data | p2r.prod.use2.data1 | Daily + S3 replication |
| MongoDB operations.datafile | Production | Staging/RC | Filtered sync (PATH-25842) |
| Households Archive | p2r.prod.data | p2r.prod.archive (Glacier) | Weekly |

Critical Note - MongoDB Filtered Sync: The operations.datafile collection sync uses a filter ({done: true, pid: {$exists: true}}) to prevent race conditions where staging processes could intercept production work queue items. See PATH-25842 for full context.
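
To make the effect of the filter concrete, the sketch below mirrors it as an in-memory predicate. Only the filter itself comes from this document; the `DatafileDoc` shape, the `isSafeToSync` name, and the sample documents are hypothetical, and the real sync code may be structured quite differently.

```typescript
// The filter quoted in the note, as a plain object. With the MongoDB driver
// it would be passed to a query such as collection.find(datafileSyncFilter).
const datafileSyncFilter = { done: true, pid: { $exists: true } };

// A datafile document reduced to the two fields the filter reads.
interface DatafileDoc {
  done?: boolean;
  pid?: unknown;
  [key: string]: unknown;
}

// In-memory equivalent of the filter: the item must be finished (done is
// true) and must carry a pid field. $exists: true matches any document that
// has the field, even when its value is null.
function isSafeToSync(doc: DatafileDoc): boolean {
  return doc.done === true && "pid" in doc;
}

// Hypothetical queue contents: only the first item would be copied.
const sampleDocs: DatafileDoc[] = [
  { name: "finished", done: true, pid: 101 },
  { name: "in-progress", done: false, pid: 102 },
  { name: "unclaimed", done: true },
];
const syncable = sampleDocs.filter(isSafeToSync);
```

Items that are still pending never reach staging under this predicate, which is how the filter keeps a staging process from picking up production work.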

RC Environment Management (rc-man)

Manages Release Candidate infrastructure for cost optimization:

| Subcommand | Function |
| --- | --- |
| servers --start | Start all RC EC2 instances |
| servers --stop | Stop all RC EC2 instances |
| atlas --pause | Pause RC MongoDB Atlas cluster |
| atlas --resume | Resume RC MongoDB Atlas cluster |

Warning: Paused MongoDB Atlas clusters auto-resume after 30 days.

Cleanup Operations (clean, archive)

| Command | Function | Target |
| --- | --- | --- |
| clean hh-temp | Remove /temp folders from old households runs | Production only |
| clean title-hhs | Keep only latest titleHouseholds version | Production only |
| clean hh | Remove old households (14+ days, keep last 3) | Production/Development |
| archive orders | Make order files read-only after 60 days | prod02:/mnt/data/prod |
| archive 4cite | Sync to Glacier, keep 25 months | S3 |
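
The "14+ days, keep last 3" rule for clean hh can be read as a retention policy with two conditions. The sketch below assumes one plausible interpretation: a run is removed only if it is at least 14 days old and is not among the 3 newest runs. The `HouseholdsRun` shape and `selectRunsToDelete` are hypothetical names, and the actual tool may apply the rule differently.

```typescript
// A households run identified by name and creation time (hypothetical shape).
interface HouseholdsRun {
  name: string;
  createdAt: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Select runs to delete: skip the newest `keepLatest` runs unconditionally,
// then keep anything younger than `minAgeDays`.
function selectRunsToDelete(
  runs: HouseholdsRun[],
  now: Date,
  minAgeDays = 14,
  keepLatest = 3,
): HouseholdsRun[] {
  const newestFirst = [...runs].sort(
    (a, b) => b.createdAt.getTime() - a.createdAt.getTime(),
  );
  const cutoff = now.getTime() - minAgeDays * MS_PER_DAY;
  return newestFirst
    .slice(keepLatest)
    .filter((run) => run.createdAt.getTime() <= cutoff);
}
```

Protecting the newest runs first means that a long gap between households builds never empties the environment, even when every remaining run is older than the age threshold.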

Integrations

AWS Services

| Service | Usage |
| --- | --- |
| EC2 | Instance lifecycle management, health monitoring, auto-termination |
| S3 | Data storage, cross-region sync, Glacier archival |
| EFS | File storage auditing |
| MongoDB Atlas | RC cluster pause/resume |

Internal Systems

| System | Integration |
| --- | --- |
| Slack | Alerts to #ops-auto channel; notifications via legacy webhook |
| MongoDB | Data sync, households metadata |
| Jira | Reminder notifications (jira-reminder.ts) |

External Services

| Service | Integration |
| --- | --- |
| Databricks | API version monitoring |
| 4Cite | Data collection service health check |

Development

Prerequisites

  • Node.js (project-local installation)
  • Deno (for newer utilities)
  • AWS CLI (apt install awscli)
  • Rust + mdbook (for documentation)
  • graphviz (for mdbook diagrams)

Build Commands

| Command | Description |
| --- | --- |
| npm run build | Prettify, test, and compile TypeScript |
| npm run watch | Watch mode for development |
| npm run test | Run unit tests (43 tests) |
| npm run pretty | Format code with Prettier |
| npm install -g | Deploy to local Node.js (go live) |
| ./update-all.sh | Update all dependencies |

Deployment

  1. Make changes in TypeScript source
  2. Run npm run build to compile
  3. Run npm install -g to install globally (live deployment)
  4. Changes take effect immediately, including for cron jobs

Testing

# Run unit tests
npm run test

# Test specific utility without Slack
node ./dist/system-health.js --short --no-slacking

# Test RC management (dry run)
./dist/rc-man.js servers --start --no-slacking

Documentation (mdbook)

# Install dependencies
sudo apt install graphviz
cargo install mdbook-graphviz

# Serve documentation
cd book && mdbook serve -p 8089

The documentation auto-refreshes on save.

Cron Schedule Reference

| Schedule | Jobs |
| --- | --- |
| Every 10 min | system-health (short), sync-monkey (q6h with lock) |
| Every 15 min | copy-from-mongo |
| Hourly | system-health (hourly) |
| 2 AM | sync-monkey (daily-convert, daily-dev, daily-misc, daily-rc) |
| 4 AM | fix-file-permissions |
| 6 AM (Tue) | sync-monkey (weekly) |
| 6 AM (Mon) | sync-monkey (backup) |
| 8 AM | clean |
| 8 AM (Mon) | archive |
| 10 AM | restart |
| 2 PM (daily) | system-health (daily) |
| 2 PM (Sun) | system-health (weekly) |
| 2 PM (1st) | system-health (monthly) |
| 2:10 PM (Wed) | aws-s3-audit |
| 3 PM | reminders |
| 11:50 AM (Sat) | generate-site-docs |

Environment Configuration

All authentication uses environment variables or IAM roles:

| Variable | Purpose |
| --- | --- |
| AWS_ACCESS_KEY_ID | AWS authentication |
| AWS_SECRET_ACCESS_KEY | AWS authentication |
| SLACKBOT_URL | Slack webhook for notifications |

Note: Command-line utilities require the secrets environment on cc.path2response.com.
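
A pre-flight check can report which of these variables are unset before a utility runs. The sketch below is illustrative only: `unsetVariables` is a hypothetical helper, not part of the repository, and an unset AWS key is not necessarily an error, since an IAM role may supply the credentials instead.

```typescript
// Variable names taken from the table in this section.
const KNOWN_VARIABLES = [
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
  "SLACKBOT_URL",
] as const;

// Return the names of known variables that are missing or empty in `env`.
// Taking the environment as a parameter keeps the function easy to test;
// in a Node.js script the caller would pass process.env.
function unsetVariables(env: Record<string, string | undefined>): string[] {
  return KNOWN_VARIABLES.filter((name) => !env[name]);
}
```

For example, calling `unsetVariables` with only the two AWS keys defined returns a list containing just `SLACKBOT_URL`.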

Related Documentation

  • Operations Reference Manual (operations repo: book/src/SUMMARY.md)
  • AWS SDK v3 Migration (operations repo: AWS_SDK_V3_MIGRATION.md) - Completed migration notes
  • Deno Book of Ops (operations repo: deno/book-of-ops/) - AWS infrastructure documentation
  • S3 Replication (operations repo: book/src/s3-replication.md) - Data sync topology
  • MongoDB Filtered Sync (operations repo: book/src/sync-monkey.md) - PATH-25842 race condition fix

Key Jira References

| Ticket | Description |
| --- | --- |
| PATH-25166 | Original production incident (MongoDB sync race condition) |
| PATH-25175 | Immediate fix (disabled datafile sync) |
| PATH-25842 | Filtered sync implementation |
| PATH-25939 | Filter design by Carroll Houk and David Fuller |

Source: README.md, book/src/SUMMARY.md, book/src/intro.md, book/src/utilities.md, book/src/system-health.md, book/src/sync-monkey.md, book/src/clean.md, book/src/archive.md, book/src/rc-man.md, book/src/s3-replication.md, book/src/aws-s3-audit.md, book/src/aws-efs-audit.md, package.json, crontab.txt, deno/README.md, AWS_SDK_V3_MIGRATION.md

Documentation created: 2026-01-24