
Re: Improving Airflow SLAs

Since we are talking about the SLA implementation: the current SLA miss
detection is part of the scheduler code. So if the scheduler maxes out its
processes, or is not running for some reason, we will miss all the SLA
alerts. It is worth decoupling SLA alerting from the scheduler path and
running it as a separate process.
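
A minimal sketch of what such a decoupled checker could look like, to make the idea concrete. It deliberately uses no Airflow APIs: `TaskRun`, `find_sla_misses`, `check_once`, and the `load_runs`/`alert` hooks are all hypothetical names, and the miss rule (not finished by execution date plus SLA) is a simplification, not the scheduler's actual logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional


@dataclass
class TaskRun:
    """Hypothetical row read from the metadata database (not an Airflow model)."""
    dag_id: str
    task_id: str
    execution_date: datetime
    sla: timedelta
    end_date: Optional[datetime]  # None while the task has not finished


def find_sla_misses(runs: Iterable[TaskRun], now: datetime) -> List[TaskRun]:
    """Return runs that finished, or are still unfinished, after their deadline."""
    misses = []
    for run in runs:
        deadline = run.execution_date + run.sla
        finished_at = run.end_date or now  # unfinished runs are judged against "now"
        if finished_at > deadline:
            misses.append(run)
    return misses


def check_once(
    load_runs: Callable[[], Iterable[TaskRun]],
    alert: Callable[[TaskRun], None],
    now: datetime,
) -> int:
    """One pass of the checker: load runs, alert on each miss, return the count."""
    misses = find_sla_misses(load_runs(), now)
    for run in misses:
        alert(run)
    return len(misses)
```

Something like `check_once` could be invoked from cron or a small loop in its own process, so alerts would still fire when the scheduler is saturated or down. A real version would also need to record which misses were already alerted on.
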


On 2 May 2018 at 20:31, David Capwell <dcapwell@xxxxxxxxx> wrote:

> We use SLAs as well; they work great for some DAGs and are painful for
> others.
>
> We rely on sensors to validate that the data is ready before we run, and
> each DAG waits on sensors for different times (one DAG waits for 8 hours
> since it expects data at the start of day but tends to get it 8 hours
> later). We also have some nested DAGs that are about 10 tasks deep.
>
> In these two cases SLA warnings come very late, since the semantics we see
> are DAG completion time; what we really want is what you were talking
> about, expected execution times.
>
> Also, SLAs trigger on backfills and manual reruns of tasks.
>
> I see this as a critical feature for production monitoring, so I would love
> to see this get improved.
> On Wed, May 2, 2018, 12:00 PM James Meickle <jmeickle@xxxxxxxxxxxxxx>
> wrote:
> > At Quantopian we use Airflow to produce artifacts based on the previous
> > day's stock market data. These artifacts are required for us to trade on
> > today's stock market. Therefore, I've been investing time in improving
> > Airflow notifications (such as writing PagerDuty and Slack integrations).
> > My attention has turned to Airflow's SLA system, which has some drawbacks
> > for our use case:
> >
> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> > skipped for this execution date will still trigger emails/callbacks. This
> > is a huge problem for us because we run almost no tasks on weekends (since
> > the stock market isn't open).
> >
> > 2) Defining SLAs can be awkward because they are relative to the execution
> > date instead of the task start time. There's no way to alert if a task runs
> > for "more than an hour", for any non-trivial DAG. Instead you can only
> > express "more than an hour from execution date". The financial data we use
> > varies in when it arrives, and how long it takes to process (data volume
> > changes frequently); we also have tight timelines that make retries
> > difficult, so we want to alert an operator while leaving the task running,
> > rather than failing and then alerting.
> >
> > 3) SLA miss emails don't have a subject line containing the instance URL
> > (important for us because we run the same DAGs in both staging/production)
> > or the execution date they apply to. When opened, they can get hard to read
> > for even a moderately sized DAG because they include a flat list of task
> > instances that are unsorted (neither alpha nor topo). They are also lacking
> > any links back to the Airflow instance.
> >
> > 4) SLA emails are not callbacks, and can't be turned off (other than either
> > removing the SLA or removing the email attribute on the task instance). The
> > way that SLA miss callbacks are defined is not intuitive, as in contrast to
> > all other callbacks, they are DAG-level rather than task-level. Also, the
> > call signature is poorly defined: for instance, two of the arguments are
> > just strings produced from the other two arguments.
> >
> > I have some thoughts about ways to fix these issues:
> >
> > 1) I just consider this one a bug. If a task instance is skipped, that was
> > intentional, and it should not trigger any alerts.
> >
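
As an illustration of the skip-aware behaviour described in 1), here is a minimal sketch of filtering SLA candidates by state before alerting. The function name `drop_skipped` and the `(task_id, state)` tuple shape are assumptions for the example; only the state string `"skipped"` mirrors Airflow's own state name.

```python
from typing import Iterable, List, Tuple

# States that should never produce an SLA alert. "skipped" matches Airflow's
# state string; the set itself is an illustrative assumption.
NON_ALERTING_STATES = {"skipped"}


def drop_skipped(candidates: Iterable[Tuple[str, str]]) -> List[str]:
    """Given (task_id, state) pairs, return the task_ids still eligible for alerts."""
    return [
        task_id
        for task_id, state in candidates
        if state not in NON_ALERTING_STATES
    ]
```

For a weekend run where most tasks are skipped, only the tasks that actually ran (or should have) would remain in the alert list.
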
> > 2) I think that the `sla=` parameter should be split into something like
> > this:
> >
> > `expected_start`: Timedelta after execution date, representing when this
> > task must have started by.
> > `expected_finish`: Timedelta after execution date, representing when this
> > task must have finished by.
> > `expected_duration`: Timedelta after task start, representing how long it
> > is expected to run including all retries.
> >
> > This would give better operator control over SLAs, particularly for tasks
> > deeper in larger DAGs where exact ordering may be hard to predict.
> >
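
To make the three proposed parameters concrete, here is a rough sketch of how they might be evaluated for one task instance. This is not existing Airflow behaviour: `sla_violations` is a hypothetical helper, and the parameter names simply follow the proposal above.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def sla_violations(
    execution_date: datetime,
    now: datetime,
    start_date: Optional[datetime] = None,  # first attempt's start, None if not started
    end_date: Optional[datetime] = None,    # None if not finished
    expected_start: Optional[timedelta] = None,
    expected_finish: Optional[timedelta] = None,
    expected_duration: Optional[timedelta] = None,
) -> List[str]:
    """Return the names of the proposed SLA parameters that are currently violated."""
    violations = []

    # Must have started within expected_start of the execution date.
    if expected_start is not None:
        started = start_date or now
        if started > execution_date + expected_start:
            violations.append("expected_start")

    # Must have finished within expected_finish of the execution date.
    if expected_finish is not None:
        finished = end_date or now
        if finished > execution_date + expected_finish:
            violations.append("expected_finish")

    # Must not run longer than expected_duration, measured from the first
    # attempt's start so that retries count toward the total.
    if expected_duration is not None and start_date is not None:
        finished = end_date or now
        if finished - start_date > expected_duration:
            violations.append("expected_duration")

    return violations
```

Because unfinished tasks are compared against `now`, a check like this can alert an operator while the task is still running, which is the case described in problem 2).
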
> > 3) The emails should be improved to be more operator-friendly, and take
> > into account that someone may get a callback for a DAG they don't know very
> > well, or be paged by this notification.
> >
> > 4.1) All Airflow callbacks should support a list, rather than requiring a
> > single function. (I've written a wrapper that does this, but it would be
> > better for Airflow to just handle this itself.)
> >
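
The wrapper mentioned in 4.1) is not shown in the thread; one possible shape for it is sketched below. `chain_callbacks` is a hypothetical name, and the only assumption about Airflow is that a task callback receives a single context dictionary.

```python
from typing import Any, Callable, Dict, Iterable

Callback = Callable[[Dict[str, Any]], None]


def chain_callbacks(callbacks: Iterable[Callback]) -> Callback:
    """Combine several callbacks into one function with the same signature."""
    callback_list = list(callbacks)

    def run_all(context: Dict[str, Any]) -> None:
        errors = []
        for callback in callback_list:
            try:
                callback(context)
            except Exception as exc:  # keep going so one bad callback can't block the rest
                errors.append(exc)
        if errors:
            # Surface the first failure only after every callback has had a chance to run.
            raise errors[0]

    return run_all
```

The result could then be passed anywhere a single callback is expected, for example `on_failure_callback=chain_callbacks([notify_slack, notify_pagerduty])`, where both notifier functions are placeholders.
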
> > 4.2) SLA miss callbacks should be task callbacks that receive context, like
> > all the other callbacks. Having a DAG figure out which tasks have missed
> > SLAs collectively is fine, but getting SLA failures in a batched callback
> > doesn't really make much sense. Per-task callbacks can be fired
> > individually within a batch of failures detected at the same time.
> >
> > 4.3) SLA emails should be the default SLA miss callback function, rather
> > than being hardcoded.
> >
> > Also, overall, the SLA miss logic is very complicated. It's stuffed into
> > one overloaded function that is responsible for checking for SLA misses,
> > creating database objects for them, filtering tasks, selecting emails,
> > rendering, and sending. Refactoring it would be a good maintainability win.
> >
> > I am already implementing some of the above in a private branch, but I'd be
> > curious to hear community feedback as to which of these suggestions might
> > be desirable upstream. I could have this ready for Airflow 2.0 if there is
> > interest beyond my own use case.
> >