Re: Is `airflow backfill` dysfunctional?


Thanks for the input, this is helpful.

To add to the list, there's some complexity around concurrency management
and multiple executors:
I just hit a case where backfill doesn't check DAG-level concurrency and
fires up 32 tasks; `airflow run` then double-checks the DAG-level concurrency
limit and exits. Backfill immediately reschedules the tasks, and so on,
burning a bunch of CPU doing nothing. In this specific case it seems like
`airflow run` should skip that check when running in the context of a
backfill.
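
To make that loop concrete, here is a toy simulation of it. This is only a
sketch and not Airflow's actual code: `simulate_backfill`, its parameters,
and the tick model are all made up for illustration, and the assumption is
that backfill relaunches every pending task on each tick.

```python
from dataclasses import dataclass


@dataclass
class SimResult:
    ticks: int      # scheduling ticks until everything finished
    launches: int   # task processes spawned by the (hypothetical) backfill loop
    wasted: int     # launches that exited immediately on the concurrency check


def simulate_backfill(num_tasks=32, dag_concurrency=16, duration=5,
                      skip_check_in_backfill=False, max_ticks=1000):
    """Toy model: backfill ignores DAG concurrency, the task runner enforces it."""
    pending = num_tasks
    running = []  # remaining ticks for each running task
    launches = wasted = ticks = 0
    while (pending or running) and ticks < max_ticks:
        ticks += 1
        started = 0
        # Backfill launches every pending task without looking at the limit.
        for _ in range(pending):
            launches += 1
            if skip_check_in_backfill or len(running) < dag_concurrency:
                running.append(duration)
                started += 1
            else:
                # The runner re-checks the limit and exits; the task stays
                # pending and is launched again on the next tick.
                wasted += 1
        pending -= started
        # Advance time by one tick and drop tasks that just finished.
        running = [left - 1 for left in running if left > 1]
    return SimResult(ticks, launches, wasted)


if __name__ == "__main__":
    print(simulate_backfill(skip_check_in_backfill=False))
    print(simulate_backfill(skip_check_in_backfill=True))
```

In this model, skipping the check removes the wasted launches, at the cost of
letting the backfill exceed the DAG-level limit. The other option would be to
have backfill itself respect the limit before launching anything.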

Max

On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:

> Thinking out loud here, because it is a while back that I did work on
> backfills. There were some real issues with backfills:
>
> 1. Tasks were running in non-deterministic order, regularly ending up in
> deadlocks.
> 2. Backfills didn’t create DAG runs, making behavior inconsistent. Max DAG
> runs could not be enforced, the UI couldn't really display them, and there
> were lots of other minor issues because of it.
> 3. Behavior was different from the scheduler, while subdagoperators
> particularly make use of backfills at the moment.
>
> I think with 3 the behavior you are observing crept in. And given 3 I
> would argue a consistent behavior between the scheduler and the backfill
> mechanism is still paramount. Thus we should explicitly clear failed tasks
> if we want to rerun them, at least until we move the
> subdagoperator out of backfill and into the scheduler (which is actually
> not too hard). Also, we need those command line options anyway.
>
> Bolke
>
> Sent from my iPad
>
> > On 6 Jun 2018, at 01:27, Scott Halgrim <scott.halgrim@xxxxxxxxxx.INVALID>
> wrote:
> >
> > The request was for opposition, but I’d like to weigh in on the side of
> “it’s a better behavior [to have failed tasks re-run when cleared in a
> backfill].”
> >> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> maximebeauchemin@xxxxxxxxx>, wrote:
> >> @Jeremiah Lowin <jlowin@xxxxxxxxx> & @Bolke de Bruin <bdbruin@xxxxxxxxx>
> I
> >> think you may have some context on why this may have changed at some
> point.
> >> I'm assuming that when DagRun handling was added to the backfill logic,
> the
> >> behavior just happened to change to what it is now.
> >>
> >> Any opposition in moving back towards re-running failed tasks when
> starting
> >> a backfill? I think it's a better behavior, though it's a change in
> >> behavior that we should mention in UPDATE.md.
> >>
> >> One of our goals is to make sure that a failed or killed backfill can be
> >> restarted and just seamlessly pick up where it left off.
> >>
> >> Max
> >>
> >>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fengtao04@xxxxxxxxx> wrote:
> >>>
> >>> After discussing with Max, we think it would be great if `airflow backfill`
> >>> could automatically pick up and rerun failed tasks. Currently, it
> >>> will throw exceptions (
> >>>
> >>>
> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> >>> )
> >>> without rerunning the failed tasks.
> >>>
> >>> But since it breaks some of the previous assumptions for backfill, we would
> >>> like to get some feedback and see if anyone has any concerns (the PR can be
> >>> found at https://github.com/apache/incubator-airflow/pull/3464/files).
> >>>
> >>> Thanks,
> >>> -Tao
> >>>
> >>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> >>> maximebeauchemin@xxxxxxxxx> wrote:
> >>>
> >>>> So I'm running a backfill for what feels like the first time in years, using
> >>>> a simple `airflow backfill --local` command.
> >>>>
> >>>> First I start getting a ton of `logging.info` messages, at every tick, for
> >>>> each task that cannot be started just yet, flooding my terminal with the
> >>>> keyword `FAILED`. It looks like a million lines like this one:
> >>>>
> >>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not met for
> >>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00 [scheduled]>,
> >>>> dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success'
> >>>> requires all upstream tasks to have succeeded, but found 1 non-success(es).
> >>>> upstream_tasks_state={'successes': 0L, 'failed': 0L, 'upstream_failed': 0L,
> >>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']
> >>>>
> >>>> Good thing I triggered 1 month and not 2 years like I actually need,
> just
> >>>> the logs here would be "big data". Now I'm unclear whether there's
> >>> anything
> >>>> actually running or if I did something wrong, so I decide to kill the
> >>>> process so I can set a smaller date range and get a better picture of
> >>>> what's up.
> >>>>
> >>>> I check my logging level, am I in DEBUG? Nope. Just INFO. So I take a
> >>> note
> >>>> that I'll need to find that log-flooding line and demote it to DEBUG
> in a
> >>>> quick PR, no biggy.
> >>>>
> >>>> Now I restart with just a single schedule, and get an error `Dag
> >>> {some_dag}
> >>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill could just
> >>>> pick up where it left off. Maybe I need to run an `airflow clear`
> command
> >>>> and restart? Ok, ran my clear command, same error is showing up. Dead
> >>> end.
> >>>>
> >>>> Maybe there is some new `airflow clear --reset-dagruns` option?
> Doesn't
> >>>> look like it... Maybe `airflow backfill` has some new switches to
> pick up
> >>>> where it left off? Can't find it. Am I supposed to clear the DAG Runs
> >>>> manually in the UI? This is a pre-production, in-development DAG, so
> >>> it's
> >>>> not on the production web server. Am I supposed to fire up my own web
> >>>> server to go and manually handle the backfill-related DAG Runs? Connect to
> >>>> my staging MySQL and manually clear some DAG runs?
> >>>>
> >>>> So. Fire up a web server, navigate to my dag_id, delete the DAG runs,
> it
> >>>> appears I can finally start over.
> >>>>
> >>>> Next thought was: "Alright looks like I need to go Linus on the
> mailing
> >>>> list".
> >>>>
> >>>> What am I missing? I'm really hoping these issues are specific to 1.8.2!
> >>>>
> >>>> Backfilling is core to Airflow and should work very well. I want to
> >>> restate
> >>>> some reqs for Airflow backfill:
> >>>> * when failing / interrupted, it should seamlessly be able to pick up where
> >>>> it left off
> >>>> * terminal logging at the INFO level should be a clear, human-consumable
> >>>> indicator of progress
> >>>> * backfill-related operations (including restarts) should be doable
> >>> through
> >>>> CLI interactions, and not require web server interactions as the
> typical
> >>>> sandbox (dev environment) shouldn't assume the existence of a web
> server
> >>>>
> >>>> Let's fix this.
> >>>>
> >>>> Max
> >>>>
> >>>
>