Operations Overview
Automated system maintenance, health monitoring, and data synchronization tools for Path2Response back-office infrastructure.
Purpose
The Operations repository provides day-to-day operational tooling for the Infrastructure team. It performs automated system checks, data synchronization, and maintenance across staging, release-candidate (RC), and production environments. The tools run on a schedule via cron jobs on cc.path2response.com, the command-and-control server.
Primary Users: Infrastructure Team (Jason Smith, Wes Hofmann)
Key Responsibilities:
- System health monitoring at multiple intervals (10-minute, hourly, daily, weekly, monthly)
- Data synchronization from production to staging/development environments
- AWS resource management (EC2 instances, S3 storage, EFS volumes)
- Automated cleanup of temporary files and old data
- RC environment management (start/stop servers, pause/resume MongoDB Atlas)
Architecture
Directory Structure
operations/
├── bin/ # Bash cron wrapper scripts
├── book/ # Operations Reference Manual (mdbook)
│ └── src/ # Documentation source files
├── deno/ # Deno utilities
│ ├── app/ # Deno applications (audit, costs, network)
│ ├── book-of-ops/ # AWS infrastructure documentation
│ ├── common/ # Shared Deno modules
│ ├── lib/ # Deno libraries (up, gateway)
│ └── utilities/ # Deno utility scripts
├── src/ # TypeScript source code
│ ├── checks/ # Health check definitions
│ ├── rcman/ # RC management modules
│ ├── sync/ # Data synchronization modules
│ └── util/ # Shared utilities (AWS, functional, MongoDB)
├── tests/ # Unit tests
├── crontab.txt # Reference copy of production crontab
├── deno.jsonc # Deno configuration
├── package.json # Node.js dependencies
└── tsconfig.json # TypeScript configuration
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Runtime | Node.js + Deno | Dual runtime (Node for legacy, Deno for new utilities) |
| Language | TypeScript | Type-safe scripting |
| AWS SDK | @aws-sdk/client-s3, @aws-sdk/client-ec2 (v3) | AWS API access |
| Database | mongodb Node.js driver (7.0) | MongoDB operations for data sync |
| Notifications | node-slack | Slack integration for alerts |
| CLI | commander | Command-line argument parsing |
| Documentation | mdbook | Browsable operations manual |
| Scheduling | cron | Automated job execution |
Key Dependencies
{
  "@aws-sdk/client-ec2": "3.958.0",
  "@aws-sdk/client-s3": "3.958.0",
  "mongodb": "7.0.0",
  "node-slack": "0.0.7",
  "commander": "14.0.2"
}
Core Functionality
Utilities (Command-Line Tools)
| Utility | Purpose | Schedule |
|---|---|---|
| system-health | Run health checks at varying intervals; alert on failures | Every 10 min, hourly, daily, weekly, monthly |
| sync-monkey | Synchronize data from production to staging/development | Multiple schedules (q6h, daily, weekly) |
| clean | Delete old households, temp folders, title-households | Daily (8 AM) |
| archive | Archive files to read-only or Glacier storage | Weekly (Monday 8 AM) |
| rc-man | Start/stop RC EC2 servers; pause/resume MongoDB Atlas | Sprint-based (Friday/Sunday) |
| restart | Restart specific services (e.g., Shiny server) | Daily (10 AM) |
| aws-s3-audit | Audit S3 bucket sizes and identify large folders | Weekly (Wednesday 2:10 PM) |
| aws-efs-audit | Audit EFS volume sizes | Weekly (disabled) |
Health Checks (system-health)
The system-health utility runs checks at different intervals:
| Interval | Checks |
|---|---|
| 10-minute | System load average, stale EC2 instances |
| Hourly | Databricks API version, 4Cite service status |
| Daily | Disk storage, households validity, production/development match |
| Weekly/Monthly | Reserved for future checks |
EC2 Instance Management:
- Instances tagged with `p2r=permanent` are exempt from stale checks
- Instances can have custom timeouts (e.g., `p2r=1.12:00` for 1 day 12 hours)
- Auto-kill option: `p2r=1:00,kill` terminates the instance after the timeout
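
The tag rules above could be parsed along these lines. This is a minimal sketch, not the repository's implementation: `parseP2rTag` and `StalePolicy` are hypothetical names, and the `[days.]hours:minutes[,kill]` grammar is an assumption inferred from the three examples.

```typescript
// Hypothetical parser for the value of the `p2r` EC2 tag.
// Assumed grammar (inferred from the examples): permanent | [days.]hours:minutes[,kill]
type StalePolicy =
  | { kind: "permanent" }
  | { kind: "timeout"; minutes: number; kill: boolean };

function parseP2rTag(value: string): StalePolicy | null {
  const parts = value.trim().split(",");
  const spec = parts[0] ?? "";
  const options = parts.slice(1);

  if (spec === "permanent") return { kind: "permanent" };

  // Optional "days." prefix, then hours, then two-digit minutes.
  const match = /^(?:(\d+)\.)?(\d+):(\d{2})$/.exec(spec);
  if (!match) return null; // unrecognized value: let the caller apply its default

  const days = Number(match[1] ?? 0);
  const hours = Number(match[2]);
  const mins = Number(match[3]);
  return {
    kind: "timeout",
    minutes: (days * 24 + hours) * 60 + mins,
    kill: options.includes("kill"),
  };
}

console.log(parseP2rTag("1.12:00"));
console.log(parseP2rTag("1:00,kill"));
```

Returning `null` for an unrecognized value leaves the fallback decision to the caller, so a mistyped tag is never silently treated as permanent.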
Data Synchronization (sync-monkey)
Synchronizes data from production to staging/development:
| Data Type | Source | Destinations | Schedule |
|---|---|---|---|
| Households | p2r.prod.data | p2r-dev-data-1, p2r.dev.use2.data1, p2r.dev2.data | Every 6 hours |
| Convert/Final | p2r.prod.data | p2r.prod.use2.data1 | Daily + S3 replication |
| MongoDB operations.datafile | Production | Staging/RC | Filtered sync (PATH-25842) |
| Households Archive | p2r.prod.data | p2r.prod.archive (Glacier) | Weekly |
Critical Note - MongoDB Filtered Sync:
The operations.datafile collection sync uses a filter ({done: true, pid: {$exists: true}}) to prevent race conditions where staging processes could intercept production work queue items. See PATH-25842 for full context.
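
To make the filter concrete, the sketch below applies the same two conditions to in-memory documents. The filter object is quoted from the note above; the `DatafileDoc` type, the `isSyncable` helper, and the sample documents are hypothetical and exist only for illustration.

```typescript
// Filter quoted from the note above.
const datafileSyncFilter = { done: true, pid: { $exists: true } };

// Hypothetical document shape; the real collection schema is not shown here.
type DatafileDoc = { done?: boolean; pid?: unknown; [key: string]: unknown };

// In-memory equivalent of the filter: `done` must be true and the `pid` field
// must be present ($exists: true also matches a field whose value is null).
function isSyncable(doc: DatafileDoc): boolean {
  return doc.done === true && "pid" in doc;
}

const sample: DatafileDoc[] = [
  { name: "a", done: true, pid: 101 }, // done and has pid: matches
  { name: "b", done: false, pid: 102 }, // not done: excluded
  { name: "c", done: true }, // no pid field: excluded
  { name: "d" }, // neither condition holds: excluded
];

const syncable = sample.filter(isSyncable).map((doc) => doc.name);
console.log(JSON.stringify(datafileSyncFilter), syncable);
```

With the MongoDB Node.js driver, the same object would typically be passed as the query argument to `collection.find(...)`. How sync-monkey actually issues the query is not shown in this overview.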
RC Environment Management (rc-man)
Manages Release Candidate infrastructure for cost optimization:
| Subcommand | Function |
|---|---|
| `servers --start` | Start all RC EC2 instances |
| `servers --stop` | Stop all RC EC2 instances |
| `atlas --pause` | Pause RC MongoDB Atlas cluster |
| `atlas --resume` | Resume RC MongoDB Atlas cluster |
Warning: Paused MongoDB Atlas clusters auto-resume after 30 days.
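
The 30-day limit in the warning above is easy to turn into a date check. The helper names below (`autoResumeAt`, `daysUntilAutoResume`) are hypothetical, and the sketch assumes the pause timestamp is recorded somewhere by the operator; it does not call the Atlas API.

```typescript
// The 30-day limit comes from the warning above.
const ATLAS_MAX_PAUSE_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Moment at which a cluster paused at `pausedAt` would resume on its own.
function autoResumeAt(pausedAt: Date): Date {
  return new Date(pausedAt.getTime() + ATLAS_MAX_PAUSE_DAYS * MS_PER_DAY);
}

// Whole days left before the automatic resume, never below zero.
function daysUntilAutoResume(pausedAt: Date, now: Date): number {
  const remainingMs = autoResumeAt(pausedAt).getTime() - now.getTime();
  return Math.max(0, Math.floor(remainingMs / MS_PER_DAY));
}

const pausedAt = new Date("2026-01-02T00:00:00Z");
console.log(autoResumeAt(pausedAt).toISOString());
console.log(daysUntilAutoResume(pausedAt, new Date("2026-01-12T00:00:00Z")));
```

A check like this could feed a reminder when an RC environment is left paused for longer than one sprint.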
Cleanup Operations (clean, archive)
| Command | Function | Target |
|---|---|---|
| `clean hh-temp` | Remove /temp folders from old households runs | Production only |
| `clean title-hhs` | Keep only latest titleHouseholds version | Production only |
| `clean hh` | Remove old households (14+ days, keep last 3) | Production/Development |
| `archive orders` | Make order files read-only after 60 days | prod02:/mnt/data/prod |
| `archive 4cite` | Sync to Glacier, keep 25 months | S3 |
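
The `clean hh` rule (14+ days, keep last 3) combines an age threshold with a minimum number of retained runs. The sketch below shows one plausible reading: the three newest runs are always kept, and anything else is deleted only once it is at least 14 days old. `HouseholdsRun` and `selectRunsToDelete` are hypothetical names, and the actual tool may order or combine the two conditions differently.

```typescript
// Hypothetical description of one households run.
interface HouseholdsRun {
  name: string;
  createdAt: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumed rule: always keep the `keepLatest` newest runs; of the rest,
// delete those that are at least `maxAgeDays` old.
function selectRunsToDelete(
  runs: HouseholdsRun[],
  now: Date,
  maxAgeDays = 14,
  keepLatest = 3,
): HouseholdsRun[] {
  const cutoff = now.getTime() - maxAgeDays * MS_PER_DAY;
  const newestFirst = [...runs].sort(
    (a, b) => b.createdAt.getTime() - a.createdAt.getTime(),
  );
  return newestFirst
    .slice(keepLatest)
    .filter((run) => run.createdAt.getTime() <= cutoff);
}

const runs: HouseholdsRun[] = [
  "2026-01-30", "2026-01-28", "2026-01-20",
  "2026-01-18", "2026-01-10", "2026-01-01",
].map((day) => ({ name: `hh-${day}`, createdAt: new Date(`${day}T00:00:00Z`) }));

const doomed = selectRunsToDelete(runs, new Date("2026-01-31T00:00:00Z"));
console.log(doomed.map((run) => run.name));
```

Keeping a fixed number of recent runs regardless of age means a pause in households production never leaves the environment with nothing to fall back on.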
Integrations
AWS Services
| Service | Usage |
|---|---|
| EC2 | Instance lifecycle management, health monitoring, auto-termination |
| S3 | Data storage, cross-region sync, Glacier archival |
| EFS | File storage auditing |
| MongoDB Atlas | RC cluster pause/resume |
Internal Systems
| System | Integration |
|---|---|
| Slack | Alerts to #ops-auto channel; notifications via legacy webhook |
| MongoDB | Data sync, households metadata |
| Jira | Reminder notifications (jira-reminder.ts) |
External Services
| Service | Integration |
|---|---|
| Databricks | API version monitoring |
| 4Cite | Data collection service health check |
Development
Prerequisites
- Node.js (project-local installation)
- Deno (for newer utilities)
- AWS CLI (`apt install awscli`)
- Rust + mdbook (for documentation)
- graphviz (for mdbook diagrams)
Build Commands
| Command | Description |
|---|---|
| `npm run build` | Prettify, test, and compile TypeScript |
| `npm run watch` | Watch mode for development |
| `npm run test` | Run unit tests (43 tests) |
| `npm run pretty` | Format code with Prettier |
| `npm install -g` | Deploy to local Node.js (go live) |
| `./update-all.sh` | Update all dependencies |
Deployment
- Make changes in TypeScript source
- Run `npm run build` to compile
- Run `npm install -g` to install globally (live deployment)
- Changes take effect immediately, including for cron jobs
Testing
# Run unit tests
npm run test
# Test specific utility without Slack
node ./dist/system-health.js --short --no-slacking
# Test RC management (dry run)
./dist/rc-man.js servers --start --no-slacking
Documentation (mdbook)
# Install dependencies
sudo apt install graphviz
cargo install mdbook-graphviz
# Serve documentation
cd book && mdbook serve -p 8089
The documentation auto-refreshes on save.
Cron Schedule Reference
| Schedule | Jobs |
|---|---|
| Every 10 min | system-health (short), sync-monkey (q6h with lock) |
| Every 15 min | copy-from-mongo |
| Hourly | system-health (hourly) |
| 2 AM | sync-monkey (daily-convert, daily-dev, daily-misc, daily-rc) |
| 4 AM | fix-file-permissions |
| 6 AM (Tue) | sync-monkey (weekly) |
| 6 AM (Mon) | sync-monkey (backup) |
| 8 AM | clean |
| 8 AM (Mon) | archive |
| 10 AM | restart |
| 2 PM (daily) | system-health (daily) |
| 2 PM (Sun) | system-health (weekly) |
| 2 PM (1st) | system-health (monthly) |
| 2:10 PM (Wed) | aws-s3-audit |
| 3 PM | reminders |
| 11:50 AM (Sat) | generate-site-docs |
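
For readers less familiar with cron syntax, the schedule labels above translate to five-field expressions roughly as follows. These are assumed translations, not copies of crontab.txt: minute 0 is assumed wherever the table gives only an hour, and `cronBySchedule` and `hasFiveFields` are hypothetical names.

```typescript
// Assumed cron expressions (minute hour day-of-month month day-of-week).
// Day-of-week uses 0 = Sunday. crontab.txt remains the authoritative source.
const cronBySchedule: Record<string, string> = {
  "Every 10 min": "*/10 * * * *",
  "Every 15 min": "*/15 * * * *",
  "Hourly": "0 * * * *",
  "2 AM": "0 2 * * *",
  "4 AM": "0 4 * * *",
  "6 AM (Tue)": "0 6 * * 2",
  "6 AM (Mon)": "0 6 * * 1",
  "8 AM": "0 8 * * *",
  "8 AM (Mon)": "0 8 * * 1",
  "10 AM": "0 10 * * *",
  "2 PM (daily)": "0 14 * * *",
  "2 PM (Sun)": "0 14 * * 0",
  "2 PM (1st)": "0 14 1 * *",
  "2:10 PM (Wed)": "10 14 * * 3",
  "3 PM": "0 15 * * *",
  "11:50 AM (Sat)": "50 11 * * 6",
};

// Checks only the shape (five whitespace-separated fields), not the values.
function hasFiveFields(expr: string): boolean {
  return expr.trim().split(/\s+/).length === 5;
}

const malformed = Object.entries(cronBySchedule).filter(
  ([, expr]) => !hasFiveFields(expr),
);
console.log(malformed.length === 0 ? "all expressions have five fields" : malformed);
```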
Environment Configuration
All authentication uses environment variables or IAM roles:
| Variable | Purpose |
|---|---|
| `AWS_ACCESS_KEY_ID` | AWS authentication |
| `AWS_SECRET_ACCESS_KEY` | AWS authentication |
| `SLACKBOT_URL` | Slack webhook for notifications |
Note: Command-line utilities require the secrets environment on cc.path2response.com.
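
A utility can fail fast when the secrets environment has not been loaded. The sketch below is an illustration only: `REQUIRED_ENV` and `missingEnv` are hypothetical names, and on hosts that authenticate through an IAM role the two AWS variables may legitimately be absent, so the required list is a parameter.

```typescript
// Variable names taken from the table above.
const REQUIRED_ENV = [
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
  "SLACKBOT_URL",
] as const;

// Returns the names of required variables that are unset or empty.
function missingEnv(
  env: Record<string, string | undefined>,
  required: readonly string[] = REQUIRED_ENV,
): string[] {
  return required.filter((name) => !env[name]);
}

// Example with an explicit environment object instead of the real one.
const missing = missingEnv({ SLACKBOT_URL: "https://hooks.example.invalid/placeholder" });
console.log(missing);
```

Under Node.js the function could be called with `process.env`, and under Deno with `Deno.env.toObject()`. Passing a shorter `required` list covers the IAM-role case.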
Related Documentation
- Operations Reference Manual - `book/src/SUMMARY.md`
- AWS SDK v3 Migration (operations repo: `AWS_SDK_V3_MIGRATION.md`) - Completed migration notes
- Deno Book of Ops (operations repo: `deno/book-of-ops/`) - AWS infrastructure documentation
- S3 Replication (operations repo: `book/src/s3-replication.md`) - Data sync topology
- MongoDB Filtered Sync (operations repo: `book/src/sync-monkey.md`) - PATH-25842 race condition fix
Key Jira References
| Ticket | Description |
|---|---|
| PATH-25166 | Original production incident (MongoDB sync race condition) |
| PATH-25175 | Immediate fix (disabled datafile sync) |
| PATH-25842 | Filtered sync implementation |
| PATH-25939 | Filter design by Carroll Houk and David Fuller |
Source: README.md, book/src/SUMMARY.md, book/src/intro.md, book/src/utilities.md, book/src/system-health.md, book/src/sync-monkey.md, book/src/clean.md, book/src/archive.md, book/src/rc-man.md, book/src/s3-replication.md, book/src/aws-s3-audit.md, book/src/aws-efs-audit.md, package.json, crontab.txt, deno/README.md, AWS_SDK_V3_MIGRATION.md
Documentation created: 2026-01-24