Re: Improving Airflow SLAs
About de-coupling the SLA management process, I've had conversations in the
direction of renaming the scheduler to "supervisor" to reflect the fact
that it's not just scheduling processes, it does a lot more tasks than just
that, SLA management being one of them.
I still think the default should be to require a single supervisor that
would do all the "supervision" work though. I'm generally against requiring
more types of nodes on the cluster. But perhaps the supervisor could have
switches to be started in modes where it would only do a subset of its
tasks, so that people can run multiple specialized supervisor nodes if they
For the record, I was thinking that renaming the scheduler to supervisor
would likely happen as we re-write it to enable multiple concurrent
supervisor processes. It turns out that parallelizing the scheduler hasn't
been as critical as I thought it would be originally, especially with the
current multi-process scheduler. Sounds like the community is getting a lot
of mileage out of this current multi-process scheduler.
On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <jieningwen@xxxxxxxxxxx> wrote:
> I would love to see this proposal gets implemented in airflow.
> In our case duration based SLA makes much more sense and I ended up adding
> a decorator to the execute method in our custom operators.
> Best regards,
> -----Original Message-----
> From: James Meickle [mailto:jmeickle@xxxxxxxxxxxxxx]
> Sent: Wednesday 02 May 2018 9:00 PM
> To: dev@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> Subject: Improving Airflow SLAs [External]
> At Quantopian we use Airflow to produce artifacts based on the previous
> day's stock market data. These artifacts are required for us to trade on
> today's stock market. Therefore, I've been investing time in improving
> Airflow notifications (such as writing PagerDuty and Slack integrations).
> My attention has turned to Airflow's SLA system, which has some drawbacks
> for our use case:
> 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> skipped for this execution date will still trigger emails/callbacks. This
> is a huge problem for us because we run almost no tasks on weekends (since
> the stock market isn't open).
> 2) Defining SLAs can be awkward because they are relative to the execution
> date instead of the task start time. There's no way to alert if a task runs
> for "more than an hour", for any non-trivial DAG. Instead you can only
> express "more than an hour from execution date". The financial data we use
> varies in when it arrives, and how long it takes to process (data volume
> changes frequently); we also have tight timelines that make retries
> difficult, so we want to alert an operator while leaving the task running,
> rather than failing and then alerting.
> 3) SLA miss emails don't have a subject line containing the instance URL
> (important for us because we run the same DAGs in both staging/production)
> or the execution date they apply to. When opened, they can get hard to read
> for even a moderately sized DAG because they include a flat list of task
> instances that are unsorted (neither alpha nor topo). They are also lacking
> any links back to the Airflow instance.
> 4) SLA emails are not callbacks, and can't be turned off (other than either
> removing the SLA or removing the email attribute on the task instance). The
> way that SLA miss callbacks are defined is not intuitive, as in contrast to
> all other callbacks, they are DAG-level rather than task-level. Also, the
> call signature is poorly defined: for instance, two of the arguments are
> just strings produced from the other two arguments.
> I have some thoughts about ways to fix these issues:
> 1) I just consider this one a bug. If a task instance is skipped, that was
> intentional, and it should not trigger any alerts.
> 2) I think that the `sla=` parameter should be split into something like
> `expected_start`: Timedelta after execution date, representing when this
> task must have started by.
> `expected_finish`: Timedelta after execution date, representing when this
> task must have finished by.
> `expected_duration`: Timedelta after task start, representing how long it
> is expected to run including all retries.
> This would give better operator control over SLAs, particularly for tasks
> deeper in larger DAGs where exact ordering may be hard to predict.
> 3) The emails should be improved to be more operator-friendly, and take
> into account that someone may get a callback for a DAG they don't know very
> well, or be paged by this notification.
> 4.1) All Airflow callbacks should support a list, rather than requiring a
> single function. (I've written a wrapper that does this, but it would be
> better for Airflow to just handle this itself.)
> 4.2) SLA miss callbacks should be task callbacks that receive context, like
> all the other callbacks. Having a DAG figure out which tasks have missed
> SLAs collectively is fine, but getting SLA failures in a batched callback
> doesn't really make much sense. Per-task callbacks can be fired
> individually within a batch of failures detected at the same time.
> 4.3) SLA emails should be the default SLA miss callback function, rather
> than being hardcoded.
> Also, overall, the SLA miss logic is very complicated. It's stuffed into
> one overloaded function that is responsible for checking for SLA misses,
> creating database objects for them, filtering tasks, selecting emails,
> rendering, and sending. Refactoring it would be a good maintainability win.
> I am already implementing some of the above in a private branch, but I'd be
> curious to hear community feedback as to which of these suggestions might
> be desirable upstream. I could have this ready for Airflow 2.0 if there is
> interest beyond my own use case.