SI-4 / A.12.1.3 / CIS-8.11 NIST SP 800-53 Rev 5 SOC 2

Are scheduled batch jobs / data pipelines monitored, with failures alerted to ops?

Description

What this control does

This control ensures that automated batch jobs, ETL processes, scheduled tasks, and data pipelines are continuously monitored for successful execution, and that failures, delays, or anomalies trigger alerts to operations or engineering teams. Implementation typically involves job schedulers (e.g., cron, Airflow, Jenkins) integrated with monitoring and alerting platforms (e.g., Datadog, PagerDuty, CloudWatch) that track job status, execution duration, and completion state. This control is critical for maintaining data integrity, ensuring timely processing of business-critical workflows, and preventing silent failures that leave stale or missing data in downstream systems and distort the decisions based on it.
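
The integration described above can be sketched as a thin wrapper around a job. This is a minimal illustration rather than any particular scheduler's API: `run_monitored`, `JobResult`, and the `send_alert` callback are hypothetical names, and in practice `send_alert` would forward to whichever alerting platform is in use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobResult:
    name: str
    succeeded: bool
    duration_seconds: float
    error: Optional[str] = None


def run_monitored(
    name: str,
    job: Callable[[], None],
    send_alert: Callable[[str], None],  # hypothetical hook into an alerting platform
    max_duration_seconds: float,
) -> JobResult:
    """Run a job, record its outcome, and alert on failure or slow completion."""
    start = time.monotonic()
    try:
        job()
    except Exception as exc:  # any uncaught exception counts as a failed run
        duration = time.monotonic() - start
        send_alert(f"[FAILED] {name}: {exc!r} after {duration:.1f}s")
        return JobResult(name, False, duration, repr(exc))
    duration = time.monotonic() - start
    if duration > max_duration_seconds:
        # The job finished, but slower than its threshold: report it as an anomaly.
        send_alert(
            f"[SLOW] {name}: took {duration:.1f}s "
            f"(limit {max_duration_seconds:.1f}s)"
        )
    return JobResult(name, True, duration)


if __name__ == "__main__":
    def failing_job() -> None:
        raise RuntimeError("upstream file missing")

    # Here alerts are simply printed; a real deployment would page someone.
    result = run_monitored("nightly_etl", failing_job, print, max_duration_seconds=60)
    print(result.succeeded, result.error)
```

A wrapper like this only sees jobs that actually start. A job that never runs produces no exception and no alert, which is why execution monitoring is usually paired with a check for missing runs.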

Control objective

What auditing this proves

Demonstrate that all scheduled batch jobs and data pipelines are monitored with automated alerting mechanisms that notify operations personnel of execution failures or anomalies in near real-time.

Associated risks

Risks this control addresses

Silent failure of data processing jobs leading to stale or incomplete data in production systems without detection
Delayed detection of failed security-critical jobs such as log aggregation, vulnerability scanning, or backup processes
Data corruption or inconsistency resulting from partially completed ETL processes that fail mid-execution without rollback
Unauthorized modification or deletion of scheduled jobs by malicious actors going unnoticed due to lack of execution monitoring
Service degradation or outages caused by cascading failures when dependent downstream jobs execute against incomplete or missing upstream data
Compliance violations due to failure of regulatory reporting jobs or audit log processing pipelines without timely remediation
Resource exhaustion from runaway or stuck jobs that are not identified and terminated due to inadequate runtime monitoring
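
Two of the risks above, silent failures and stuck jobs, are commonly caught by a heartbeat-style check that runs independently of the job itself. The sketch below assumes the scheduler can report each job's last successful completion and the start time of any run in progress; the function name, parameters, and the three status labels are illustrative, not taken from a specific tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_job_health(
    now: datetime,
    last_success: Optional[datetime],
    running_since: Optional[datetime],
    expected_interval: timedelta,
    grace: timedelta,
    max_runtime: timedelta,
) -> str:
    """Return 'stuck', 'stale', or 'healthy' for a single job."""
    # A run that has been active longer than max_runtime is treated as stuck.
    if running_since is not None and now - running_since > max_runtime:
        return "stuck"
    # No success ever, or none within the expected window plus grace,
    # means downstream data may be stale even though nothing raised an error.
    if last_success is None or now - last_success > expected_interval + grace:
        return "stale"
    return "healthy"


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
    status = classify_job_health(
        now=now,
        last_success=now - timedelta(hours=27),  # daily job, last good run 27h ago
        running_since=None,
        expected_interval=timedelta(hours=24),
        grace=timedelta(hours=2),
        max_runtime=timedelta(hours=4),
    )
    print(status)
```

Because the check keys on the absence of a recent success rather than on an error event, it also flags jobs that were disabled or deleted from the scheduler.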

Testing procedure

How an auditor verifies this control

Obtain a complete inventory of all scheduled batch jobs, data pipelines, cron jobs, and automated workflows across production and production-supporting environments, including job names, schedules, owners, and criticality classifications.
Review the configuration of job scheduling platforms (e.g., Airflow DAGs, Jenkins pipelines, AWS Step Functions, cron configurations) to identify what monitoring and logging capabilities are enabled.
Examine the integration between job schedulers and monitoring/alerting systems (e.g., PagerDuty, Splunk, CloudWatch Alarms, Prometheus) to verify that job execution status is captured and evaluated.
Select a representative sample of critical and high-priority batch jobs across different systems and review their specific alerting rules, including failure conditions, timeout thresholds, and notification recipients.
Request and review historical alert logs or incident tickets from the past 90 days showing actual job failures and corresponding alerts sent to operations teams, including timestamps and recipient confirmation.
Interview operations or DevOps personnel to validate that they receive, acknowledge, and respond to job failure alerts, and review documented runbooks or response procedures for common failure scenarios.
Work with technical teams to simulate a job failure, either by intentionally failing a non-critical test job or by using a pre-existing test harness, then verify that the monitoring system detects the failure and generates an alert within the defined SLA timeframe.
Verify that alerting configurations include appropriate escalation paths, deduplication logic to prevent alert fatigue, and coverage for both job failures and anomalous conditions such as execution duration exceeding baselines.
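
Parts of this procedure lend themselves to scripted checks once the job inventory and alert rules have been exported. The following is a rough sketch under assumed data shapes: the `Job` and `AlertRule` records, the criticality labels, and both function names are hypothetical and would need to be mapped onto the actual exports. It cross-references in-scope jobs against alert rules and checks whether a simulated failure was alerted within the SLA.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


@dataclass
class Job:
    name: str
    criticality: str  # assumed labels, e.g. "critical", "high", "medium", "low"


@dataclass
class AlertRule:
    job_name: str
    recipients: List[str] = field(default_factory=list)


def find_coverage_gaps(
    jobs: Iterable[Job],
    rules: Iterable[AlertRule],
    in_scope: Tuple[str, ...] = ("critical", "high"),
) -> Dict[str, str]:
    """Map each in-scope job lacking usable alerting to the reason it fails."""
    rules_by_job: Dict[str, List[AlertRule]] = {}
    for rule in rules:
        rules_by_job.setdefault(rule.job_name, []).append(rule)

    gaps: Dict[str, str] = {}
    for job in jobs:
        if job.criticality not in in_scope:
            continue
        job_rules = rules_by_job.get(job.name, [])
        if not job_rules:
            gaps[job.name] = "no alert rule"
        elif not any(rule.recipients for rule in job_rules):
            gaps[job.name] = "alert rule has no recipients"
    return gaps


def alert_within_sla(failed_at: datetime, alerted_at: datetime, sla: timedelta) -> bool:
    """True if the alert followed the failure and arrived within the SLA."""
    return timedelta(0) <= alerted_at - failed_at <= sla


if __name__ == "__main__":
    jobs = [Job("billing_export", "critical"), Job("log_rollup", "high"),
            Job("thumbnail_cache", "low")]
    rules = [AlertRule("billing_export", ["ops-oncall"]), AlertRule("log_rollup")]
    print(find_coverage_gaps(jobs, rules))
```

An empty result from `find_coverage_gaps` shows only that rules exist and name recipients. Whether those rules actually fire is what the simulated failure and the historical alert review establish.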

Evidence required

Job scheduler configuration exports showing enabled monitoring and alerting settings
Screenshots or exports from monitoring platforms displaying active alert rules for batch jobs with defined failure conditions and notification targets
Historical alert logs or PagerDuty/ServiceNow incidents demonstrating actual job failure detection and notification over the audit period
Runbook documentation detailing response procedures for job failures
Evidence of simulated failure testing, such as timestamped alert messages or incident tickets

Pass criteria

All critical and high-priority scheduled batch jobs and data pipelines have active monitoring with configured alerting rules that successfully detect failures and notify designated operations personnel, as evidenced by documented alert configurations, historical alert delivery records, and successful simulation testing.