git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Improving Airflow SLAs


Hm, not sure I understand your question. To my mind this use case already
isn't possible because while the callback function is currently a DAG
attribute, the SLA value has always been set on specific tasks within the
DAG. Perhaps this gets obscured by the current SLA miss email reporting
downstream tasks, but it's an individual task's SLA (or more than one task
if they got batched) that was triggering that email, not the DAG's.

I'm not sure it's worth adding DAG-level SLAs too since that would likely
result in confusion with task-level SLAs. I feel like for most use cases,
when the DAGRun starts/finishes is more of an implementation detail, and
the actual concern is for specific tasks (even if it's only the first or
the last tasks in your pipeline). Open to discussion on that though!

On Thu, May 24, 2018 at 3:16 PM, Ace Haidrey <acehaidrey@xxxxxxxxx> wrote:

> Hi James,
> I haven’t read everything or looked at the entire PR yet but one thing I
> wanted to ask was you state you move the SLA miss callback to the task
> level. In our org and I can imagine in others, we would like to have the
> callback stay at the DAG level so we can see if the entire pipeline is
> taking longer than X hours, not just if each task is taking more than X
> hours. Is this still possible, to do that feasibly?
>
> > On May 24, 2018, at 12:12 PM, James Meickle <jmeickle@xxxxxxxxxxxxxx>
> wrote:
> >
> > Just giving this a bump; it's a pretty major rework so I'd love to know
> > whether this effort is likely to be accepted if I bring it to a PR-able
> > state, before I invest more time.
> >
> > On Wed, May 23, 2018 at 1:59 PM, James Meickle <jmeickle@xxxxxxxxxxxxxx>
> > wrote:
> >
> >> Hi folks,
> >>
> >> I've created a branch off of v1-10-test; the diff can be found here:
> >> https://github.com/apache/incubator-airflow/compare/v1-
> >> 10-test...Eronarn:sla_improvements
> >>
> >> As a recap, this work is expected to do the following:
> >>
> >> - split the "sla" parameter into three independent SLAs: expected
> >> duration, expected start, and expected finish
> >> - move the SLA miss callback to be a task-level attribute rather than
> >> DAG-level (removing a lot of the "batching" functionality)
> >> - convert the SLA miss email to the default SLA miss callback
> >> - add a "type" to SLA misses, which will be part of the primary key, and
> >> can be checked against in the callback to respond appropriately to the
> type
> >> of SLA that was missed.
> >> - don't send SLA misses for skipped tasks, or for backfill jobs
> >>
> >> Before I polish up the remaining TODO functions and write a migration
> and
> >> tests, I'd appreciate feedback from the maintainers as to whether this
> >> seems to be on the right track, design-wise. (Note that it's definitely
> not
> >> going to pass tests right now; I am having significant problems getting
> >> Airflow's test suite running locally so I'm not even attempting at the
> >> moment.)
> >>
> >> Thanks,
> >>
> >> -James M.
> >>
> >> On Wed, May 9, 2018 at 12:43 PM, James Meickle <jmeickle@xxxxxxxxxxxxxx
> >
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Since the response so far has been positive or neutral, I intend to
> >>> submit one or more PRs targeting 2.0 (I think that some parts will be
> >>> separable from a larger SLA refactor). I intend to address at least the
> >>> following JIRA issues:
> >>>
> >>> https://issues.apache.org/jira/browse/AIRFLOW-2236
> >>> https://issues.apache.org/jira/browse/AIRFLOW-1472
> >>> https://issues.apache.org/jira/browse/AIRFLOW-1360
> >>> https://issues.apache.org/jira/browse/AIRFLOW-557
> >>> https://issues.apache.org/jira/browse/AIRFLOW-133
> >>>
> >>> Regards,
> >>>
> >>> -James M.
> >>>
> >>>
> >>>
> >>> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin <
> >>> maximebeauchemin@xxxxxxxxx> wrote:
> >>>
> >>>> About de-coupling the SLA management process, I've had conversations
> in
> >>>> the
> >>>> direction of renaming the scheduler to "supervisor" to reflect the
> fact
> >>>> that it's not just scheduling processes, it does a lot more tasks than
> >>>> just
> >>>> that, SLA management being one of them.
> >>>>
> >>>> I still think the default should be to require a single supervisor
> that
> >>>> would do all the "supervision" work though. I'm generally against
> >>>> requiring
> >>>> more types of nodes on the cluster. But perhaps the supervisor could
> have
> >>>> switches to be started in modes where it would only do a subset of its
> >>>> tasks, so that people can run multiple specialized supervisor nodes if
> >>>> they
> >>>> want to.
> >>>>
> >>>> For the record, I was thinking that renaming the scheduler to
> supervisor
> >>>> would likely happen as we re-write it to enable multiple concurrent
> >>>> supervisor processes. It turns out that parallelizing the scheduler
> >>>> hasn't
> >>>> been as critical as I thought it would be originally, especially with
> the
> >>>> current multi-process scheduler. Sounds like the community is getting
> a
> >>>> lot
> >>>> of mileage out of this current multi-process scheduler.
> >>>>
> >>>> Max
> >>>>
> >>>> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <jieningwen@xxxxxxxxxxx>
> >>>> wrote:
> >>>>
> >>>>> I would love to see this proposal gets implemented in airflow.
> >>>>> In our case duration based SLA makes much more sense and I ended up
> >>>> adding
> >>>>> a decorator to the execute method in our custom operators.
> >>>>>
> >>>>> Best regards,
> >>>>> Jiening
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: James Meickle [mailto:jmeickle@xxxxxxxxxxxxxx]
> >>>>> Sent: Wednesday 02 May 2018 9:00 PM
> >>>>> To: dev@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> >>>>> Subject: Improving Airflow SLAs [External]
> >>>>>
> >>>>> At Quantopian we use Airflow to produce artifacts based on the
> previous
> >>>>> day's stock market data. These artifacts are required for us to trade
> >>>> on
> >>>>> today's stock market. Therefore, I've been investing time in
> improving
> >>>>> Airflow notifications (such as writing PagerDuty and Slack
> >>>> integrations).
> >>>>> My attention has turned to Airflow's SLA system, which has some
> >>>> drawbacks
> >>>>> for our use case:
> >>>>>
> >>>>> 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> >>>>> skipped for this execution date will still trigger emails/callbacks.
> >>>> This
> >>>>> is a huge problem for us because we run almost no tasks on weekends
> >>>> (since
> >>>>> the stock market isn't open).
> >>>>>
> >>>>> 2) Defining SLAs can be awkward because they are relative to the
> >>>> execution
> >>>>> date instead of the task start time. There's no way to alert if a
> task
> >>>> runs
> >>>>> for "more than an hour", for any non-trivial DAG. Instead you can
> only
> >>>>> express "more than an hour from execution date".  The financial data
> >>>> we use
> >>>>> varies in when it arrives, and how long it takes to process (data
> >>>> volume
> >>>>> changes frequently); we also have tight timelines that make retries
> >>>>> difficult, so we want to alert an operator while leaving the task
> >>>> running,
> >>>>> rather than failing and then alerting.
> >>>>>
> >>>>> 3) SLA miss emails don't have a subject line containing the instance
> >>>> URL
> >>>>> (important for us because we run the same DAGs in both
> >>>> staging/production)
> >>>>> or the execution date they apply to. When opened, they can get hard
> to
> >>>> read
> >>>>> for even a moderately sized DAG because they include a flat list of
> >>>> task
> >>>>> instances that are unsorted (neither alpha nor topo). They are also
> >>>> lacking
> >>>>> any links back to the Airflow instance.
> >>>>>
> >>>>> 4) SLA emails are not callbacks, and can't be turned off (other than
> >>>> either
> >>>>> removing the SLA or removing the email attribute on the task
> >>>> instance). The
> >>>>> way that SLA miss callbacks are defined is not intuitive, as in
> >>>> contrast to
> >>>>> all other callbacks, they are DAG-level rather than task-level. Also,
> >>>> the
> >>>>> call signature is poorly defined: for instance, two of the arguments
> >>>> are
> >>>>> just strings produced from the other two arguments.
> >>>>>
> >>>>> I have some thoughts about ways to fix these issues:
> >>>>>
> >>>>> 1) I just consider this one a bug. If a task instance is skipped,
> that
> >>>> was
> >>>>> intentional, and it should not trigger any alerts.
> >>>>>
> >>>>> 2) I think that the `sla=` parameter should be split into something
> >>>> like
> >>>>> this:
> >>>>>
> >>>>> `expected_start`: Timedelta after execution date, representing when
> >>>> this
> >>>>> task must have started by.
> >>>>> `expected_finish`: Timedelta after execution date, representing when
> >>>> this
> >>>>> task must have finished by.
> >>>>> `expected_duration`: Timedelta after task start, representing how
> long
> >>>> it
> >>>>> is expected to run including all retries.
> >>>>>
> >>>>> This would give better operator control over SLAs, particularly for
> >>>> tasks
> >>>>> deeper in larger DAGs where exact ordering may be hard to predict.
> >>>>>
> >>>>> 3) The emails should be improved to be more operator-friendly, and
> take
> >>>>> into account that someone may get a callback for a DAG they don't
> know
> >>>> very
> >>>>> well, or be paged by this notification.
> >>>>>
> >>>>> 4.1) All Airflow callbacks should support a list, rather than
> >>>> requiring a
> >>>>> single function. (I've written a wrapper that does this, but it would
> >>>> be
> >>>>> better for Airflow to just handle this itself.)
> >>>>>
> >>>>> 4.2) SLA miss callbacks should be task callbacks that receive
> context,
> >>>> like
> >>>>> all the other callbacks. Having a DAG figure out which tasks have
> >>>> missed
> >>>>> SLAs collectively is fine, but getting SLA failures in a batched
> >>>> callback
> >>>>> doesn't really make much sense. Per-task callbacks can be fired
> >>>>> individually within a batch of failures detected at the same time.
> >>>>>
> >>>>> 4.3) SLA emails should be the default SLA miss callback function,
> >>>> rather
> >>>>> than being hardcoded.
> >>>>>
> >>>>> Also, overall, the SLA miss logic is very complicated. It's stuffed
> >>>> into
> >>>>> one overloaded function that is responsible for checking for SLA
> >>>> misses,
> >>>>> creating database objects for them, filtering tasks, selecting
> emails,
> >>>>> rendering, and sending. Refactoring it would be a good
> maintainability
> >>>> win.
> >>>>>
> >>>>> I am already implementing some of the above in a private branch, but
> >>>> I'd be
> >>>>> curious to hear community feedback as to which of these suggestions
> >>>> might
> >>>>> be desirable upstream. I could have this ready for Airflow 2.0 if
> >>>> there is
> >>>>> interest beyond my own use case.
> >>>>>
> >>>>
> >>>
> >>>
> >>
>
>