git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Improving Airflow SLAs


Hi folks,

I've created a branch off of v1-10-test; the diff can be found here:
https://github.com/apache/incubator-airflow/compare/v1-10-test...Eronarn:sla_improvements

As a recap, this work is expected to do the following:

- split the "sla" parameter into three independent SLAs: expected duration,
expected start, and expected finish
- move the SLA miss callback to be a task-level attribute rather than
DAG-level (removing a lot of the "batching" functionality)
- convert the SLA miss email to the default SLA miss callback
- add a "type" to SLA misses, which will be part of the primary key, and
can be checked against in the callback to respond appropriately to the type
of SLA that was missed.
- don't send SLA misses for skipped tasks, or for backfill jobs

Before I polish up the remaining TODO functions and write a migration and
tests, I'd appreciate feedback from the maintainers as to whether this
seems to be on the right track, design-wise. (Note that it's definitely not
going to pass tests right now; I am having significant problems getting
Airflow's test suite running locally so I'm not even attempting at the
moment.)

Thanks,

-James M.

On Wed, May 9, 2018 at 12:43 PM, James Meickle <jmeickle@xxxxxxxxxxxxxx>
wrote:

> Hi all,
>
> Since the response so far has been positive or neutral, I intend to submit
> one or more PRs targeting 2.0 (I think that some parts will be separable
> from a larger SLA refactor). I intend to address at least the following
> JIRA issues:
>
> https://issues.apache.org/jira/browse/AIRFLOW-2236
> https://issues.apache.org/jira/browse/AIRFLOW-1472
> https://issues.apache.org/jira/browse/AIRFLOW-1360
> https://issues.apache.org/jira/browse/AIRFLOW-557
> https://issues.apache.org/jira/browse/AIRFLOW-133
>
> Regards,
>
> -James M.
>
>
>
> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin <
> maximebeauchemin@xxxxxxxxx> wrote:
>
>> About de-coupling the SLA management process, I've had conversations in
>> the
>> direction of renaming the scheduler to "supervisor" to reflect the fact
>> that it's not just scheduling processes, it does a lot more tasks than
>> just
>> that, SLA management being one of them.
>>
>> I still think the default should be to require a single supervisor that
>> would do all the "supervision" work though. I'm generally against
>> requiring
>> more types of nodes on the cluster. But perhaps the supervisor could have
>> switches to be started in modes where it would only do a subset of its
>> tasks, so that people can run multiple specialized supervisor nodes if
>> they
>> want to.
>>
>> For the record, I was thinking that renaming the scheduler to supervisor
>> would likely happen as we re-write it to enable multiple concurrent
>> supervisor processes. It turns out that parallelizing the scheduler hasn't
>> been as critical as I thought it would be originally, especially with the
>> current multi-process scheduler. Sounds like the community is getting a
>> lot
>> of mileage out of this current multi-process scheduler.
>>
>> Max
>>
>> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <jieningwen@xxxxxxxxxxx>
>> wrote:
>>
>> > I would love to see this proposal gets implemented in airflow.
>> > In our case duration based SLA makes much more sense and I ended up
>> adding
>> > a decorator to the execute method in our custom operators.
>> >
>> > Best regards,
>> > Jiening
>> >
>> > -----Original Message-----
>> > From: James Meickle [mailto:jmeickle@xxxxxxxxxxxxxx]
>> > Sent: Wednesday 02 May 2018 9:00 PM
>> > To: dev@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
>> > Subject: Improving Airflow SLAs [External]
>> >
>> > At Quantopian we use Airflow to produce artifacts based on the previous
>> > day's stock market data. These artifacts are required for us to trade on
>> > today's stock market. Therefore, I've been investing time in improving
>> > Airflow notifications (such as writing PagerDuty and Slack
>> integrations).
>> > My attention has turned to Airflow's SLA system, which has some
>> drawbacks
>> > for our use case:
>> >
>> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
>> > skipped for this execution date will still trigger emails/callbacks.
>> This
>> > is a huge problem for us because we run almost no tasks on weekends
>> (since
>> > the stock market isn't open).
>> >
>> > 2) Defining SLAs can be awkward because they are relative to the
>> execution
>> > date instead of the task start time. There's no way to alert if a task
>> runs
>> > for "more than an hour", for any non-trivial DAG. Instead you can only
>> > express "more than an hour from execution date".  The financial data we
>> use
>> > varies in when it arrives, and how long it takes to process (data volume
>> > changes frequently); we also have tight timelines that make retries
>> > difficult, so we want to alert an operator while leaving the task
>> running,
>> > rather than failing and then alerting.
>> >
>> > 3) SLA miss emails don't have a subject line containing the instance URL
>> > (important for us because we run the same DAGs in both
>> staging/production)
>> > or the execution date they apply to. When opened, they can get hard to
>> read
>> > for even a moderately sized DAG because they include a flat list of task
>> > instances that are unsorted (neither alpha nor topo). They are also
>> lacking
>> > any links back to the Airflow instance.
>> >
>> > 4) SLA emails are not callbacks, and can't be turned off (other than
>> either
>> > removing the SLA or removing the email attribute on the task instance).
>> The
>> > way that SLA miss callbacks are defined is not intuitive, as in
>> contrast to
>> > all other callbacks, they are DAG-level rather than task-level. Also,
>> the
>> > call signature is poorly defined: for instance, two of the arguments are
>> > just strings produced from the other two arguments.
>> >
>> > I have some thoughts about ways to fix these issues:
>> >
>> > 1) I just consider this one a bug. If a task instance is skipped, that
>> was
>> > intentional, and it should not trigger any alerts.
>> >
>> > 2) I think that the `sla=` parameter should be split into something like
>> > this:
>> >
>> > `expected_start`: Timedelta after execution date, representing when this
>> > task must have started by.
>> > `expected_finish`: Timedelta after execution date, representing when
>> this
>> > task must have finished by.
>> > `expected_duration`: Timedelta after task start, representing how long
>> it
>> > is expected to run including all retries.
>> >
>> > This would give better operator control over SLAs, particularly for
>> tasks
>> > deeper in larger DAGs where exact ordering may be hard to predict.
>> >
>> > 3) The emails should be improved to be more operator-friendly, and take
>> > into account that someone may get a callback for a DAG they don't know
>> very
>> > well, or be paged by this notification.
>> >
>> > 4.1) All Airflow callbacks should support a list, rather than requiring
>> a
>> > single function. (I've written a wrapper that does this, but it would be
>> > better for Airflow to just handle this itself.)
>> >
>> > 4.2) SLA miss callbacks should be task callbacks that receive context,
>> like
>> > all the other callbacks. Having a DAG figure out which tasks have missed
>> > SLAs collectively is fine, but getting SLA failures in a batched
>> callback
>> > doesn't really make much sense. Per-task callbacks can be fired
>> > individually within a batch of failures detected at the same time.
>> >
>> > 4.3) SLA emails should be the default SLA miss callback function, rather
>> > than being hardcoded.
>> >
>> > Also, overall, the SLA miss logic is very complicated. It's stuffed into
>> > one overloaded function that is responsible for checking for SLA misses,
>> > creating database objects for them, filtering tasks, selecting emails,
>> > rendering, and sending. Refactoring it would be a good maintainability
>> win.
>> >
>> > I am already implementing some of the above in a private branch, but
>> I'd be
>> > curious to hear community feedback as to which of these suggestions
>> might
>> > be desirable upstream. I could have this ready for Airflow 2.0 if there
>> is
>> > interest beyond my own use case.
>> >
>>
>
>