Are scheduled batch jobs / data pipelines monitored, with failures alerted to ops?
Description
What this control does
This control ensures that automated batch jobs, ETL processes, scheduled tasks, and data pipelines are continuously monitored for successful execution, and that failures, delays, or anomalies trigger alerts to operations or engineering teams. Implementation typically involves job schedulers or orchestrators (e.g., cron, Airflow, Jenkins) integrated with monitoring and alerting platforms (e.g., Datadog, PagerDuty, CloudWatch) that track job status, execution duration, and completion state. The control is critical for maintaining data integrity, ensuring timely processing of business-critical workflows, and preventing silent failures that could leave downstream systems and decision-makers working from stale or missing data.
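As an illustration of the pattern described above, the following is a minimal Python sketch of a wrapper that runs a job, records its status and duration, and raises an alert on failure or overrun. The names `run_monitored`, `JobResult`, `notify`, and `max_duration_seconds` are hypothetical and not tied to any particular scheduler or monitoring product. In practice `notify` would be replaced by whatever integration the organization uses.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobResult:
    job_name: str
    succeeded: bool
    duration_seconds: float
    error: Optional[str] = None


def run_monitored(
    job_name: str,
    job: Callable[[], None],
    notify: Callable[[str], None],  # hypothetical alert hook (pager, chat, email)
    max_duration_seconds: float,    # assumed per-job runtime threshold
) -> JobResult:
    """Run a job, record its outcome, and alert on failure or overrun."""
    start = time.monotonic()
    error = None
    try:
        job()
    except Exception as exc:  # any uncaught exception counts as a failed run
        error = f"{type(exc).__name__}: {exc}"
    duration = time.monotonic() - start

    result = JobResult(job_name, error is None, duration, error)
    if not result.succeeded:
        notify(f"[FAILED] {job_name}: {error}")
    elif duration > max_duration_seconds:
        notify(f"[SLOW] {job_name}: took {duration:.1f}s, limit {max_duration_seconds:.1f}s")
    return result
```

A wrapper like this only reports jobs that actually start. Jobs that never run need a separate staleness or heartbeat check.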
Control objective
What auditing this proves
Demonstrate that all scheduled batch jobs and data pipelines are monitored with automated alerting mechanisms that notify operations personnel of execution failures or anomalies in near real-time.
Associated risks
Risks this control addresses
- Silent failure of data processing jobs leading to stale or incomplete data in production systems without detection
- Delayed detection of failed security-critical jobs such as log aggregation, vulnerability scanning, or backup processes
- Data corruption or inconsistency resulting from partially completed ETL processes that fail mid-execution without rollback
- Unauthorized modification or deletion of scheduled jobs by malicious actors going unnoticed due to lack of execution monitoring
- Service degradation or outages caused by cascading failures when dependent downstream jobs execute against incomplete or missing upstream data
- Compliance violations due to failure of regulatory reporting jobs or audit log processing pipelines without timely remediation
- Resource exhaustion from runaway or stuck jobs that are not identified and terminated due to inadequate runtime monitoring
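Several of the risks above involve jobs that fail silently or never start, which failure-only alerting cannot catch. One common mitigation is a staleness check that compares each job's last successful run against its expected schedule. The sketch below assumes the last-success timestamps and expected intervals are already available from the scheduler. The function name `find_stale_jobs` and the 15-minute default `grace` period are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def find_stale_jobs(
    last_success: Dict[str, Optional[datetime]],
    expected_interval: Dict[str, timedelta],
    now: datetime,
    grace: timedelta = timedelta(minutes=15),  # assumed tolerance for late runs
) -> List[str]:
    """Return jobs with no successful run inside their expected interval plus grace."""
    stale = []
    for job_name, interval in expected_interval.items():
        finished = last_success.get(job_name)
        # A job with no recorded success at all is treated as stale.
        if finished is None or now - finished > interval + grace:
            stale.append(job_name)
    return sorted(stale)
```

Iterating over `expected_interval` rather than `last_success` is deliberate: a job that was deleted or never reported still appears in the result as long as it remains in the expected schedule.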
Testing procedure
How an auditor verifies this control
- Obtain a complete inventory of all scheduled batch jobs, data pipelines, cron jobs, and automated workflows across production and production-supporting environments, including job names, schedules, owners, and criticality classifications.
- Review the configuration of job scheduling platforms (e.g., Airflow DAGs, Jenkins pipelines, AWS Step Functions, cron configurations) to identify what monitoring and logging capabilities are enabled.
- Examine the integration between job schedulers and monitoring/alerting systems (e.g., PagerDuty, Splunk, CloudWatch Alarms, Prometheus) to verify that job execution status is captured and evaluated.
- Select a representative sample of critical and high-priority batch jobs across different systems and review their specific alerting rules, including failure conditions, timeout thresholds, and notification recipients.
- Request and review historical alert logs or incident tickets from the past 90 days showing actual job failures and corresponding alerts sent to operations teams, including timestamps and recipient confirmation.
- Interview operations or DevOps personnel to validate that they receive, acknowledge, and respond to job failure alerts, and review documented runbooks or response procedures for common failure scenarios.
- Simulate a job failure by working with technical teams to intentionally fail a non-critical test job, or by using a pre-existing test harness, then verify that the monitoring system detects the failure and generates an alert within the defined SLA timeframe.
- Verify that alerting configurations include appropriate escalation paths, deduplication logic to prevent alert fatigue, and coverage for both job failures and anomalous conditions such as execution duration exceeding baselines.
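Two of the checks above lend themselves to simple scripting: reconciling the job inventory against configured alert rules, and confirming that a simulated failure produced an alert within the SLA. The following Python sketch assumes the inventory and the set of alerted job names have already been exported from the relevant tools. The field names `name` and `criticality` and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def jobs_missing_alerts(inventory: List[Dict[str, str]], alerted_jobs: Set[str]) -> List[str]:
    """Return inventoried job names that have no alert rule, critical jobs first."""
    missing = [job for job in inventory if job["name"] not in alerted_jobs]
    # Sort so that jobs classified "critical" appear before all others.
    missing.sort(key=lambda job: (job["criticality"] != "critical", job["name"]))
    return [job["name"] for job in missing]


def alert_within_sla(failed_at: datetime, alerted_at: datetime, sla: timedelta) -> bool:
    """True if the alert fired at or after the failure and within the SLA window."""
    latency = alerted_at - failed_at
    return timedelta(0) <= latency <= sla
```

For example, `jobs_missing_alerts` can be run against the sampled inventory to list coverage gaps, and `alert_within_sla` can be applied to the timestamps captured during the simulated failure test.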