Re: Convert Dag Run from Backfill to Scheduled?


Yes, clearly the DAG runs can be in inconsistent states with related task
instances and backfill processes. Here's a quick patch that helps a little:
https://github.com/apache/incubator-airflow/pull/3433

After writing the quick patch above, I think this needs more thought. The
clear command is effectively a way to issue a "scheduler-driven backfill";
maybe we can deprecate clear and add a new "airflow backfill --scheduler",
which would clear task instances and create/set DAG runs in the right state.

Max

On Tue, May 29, 2018 at 5:58 PM Ruiqin Yang <yrqls21@xxxxxxxxx> wrote:

> This line
> <https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L935>
> is where the scheduler skips the backfill DAG runs. Regardless of the state
> the DAG run is in, tasks in a DAG run whose run ID starts with 'backfill_'
> are not considered when scheduling.
>
> I agree with Dan Davydov's idea that we should at least have something like
> multiple DAG runs for one execution date, to distinguish different DAG runs
> such as scheduled and backfilled. The situation Scott is facing here is not
> the only problem that the lack of multiple DAG runs has caused (e.g. manually
> triggering a task in the UI should also create a separate DAG run, otherwise
> the implementation logic is a bit weird).
>
> Cheers,
> Kevin Y
>
> On Tue, May 29, 2018 at 5:52 PM Scott Halgrim
> <scott.halgrim@xxxxxxxxxx.invalid> wrote:
>
> > Well I’ve gone ahead and run the UPDATE query now, so the scheduler is
> > picking up tasks.
> >
> > When I cleared the tasks, every DAG run that had a cleared task in it was
> > set to running. Because I’d backfilled them all they were all `backfill_`
> > dag runs. Inspection of various tasks via `task_failed_deps` indicated the
> > tasks had all their dependencies met. After running the update query,
> > they’re all `scheduled__` dag runs.
> >
> > On May 29, 2018, 5:02 PM -0700, Maxime Beauchemin <
> > maximebeauchemin@xxxxxxxxx>, wrote:
> > > While this may work it's clearly not the prescribed way to do this.
> > > Clearing should just work.
> > >
> > > I'm trying to understand why the scheduler is not picking up the cleared
> > > task. Clearing should remove the task instance state and set the state of
> > > the related DAG Run to running so that the scheduler picks those up.
> > > Perhaps there's a conflict between the backfill and scheduler-related DAG
> > > Runs? Which DAG runs are set to running? The backfill or scheduler-related
> > > ones?
> > >
> > > Originally when I introduced DAG runs, backfill was operating without any
> > > consideration related to DAG runs (DAG runs were a scheduler-specific
> > > construct); later on Bolke added backfill-specific DAG runs and I'm not
> > > 100% sure how that works.
> > >
> > > Let's get to the bottom of this.
> > >
> > > Max
> > >
> > > On Fri, May 25, 2018 at 7:48 PM Ruiqin Yang <yrqls21@xxxxxxxxx> wrote:
> > >
> > > > If you are sure the update query targets the desired rows, the behavior
> > > > should be the same.
> > > >
> > > > On Fri, May 25, 2018 at 4:23 PM, Scott Halgrim
> > > > <scott.halgrim@xxxxxxxxxx.invalid> wrote:
> > > >
> > > > > So far no ill effects from:
> > > > >
> > > > > update dag_run
> > > > > set run_id = concat('scheduled__', substring(run_id, 10, 19))
> > > > > where dag_id = 'daily'
> > > > > and execution_date > '2017-08-31' and execution_date < '2018-01-11'
> > > > > and run_id like 'backfill_%'
> > > > > order by execution_date;
> > > > >
> > > > > On May 25, 2018, 4:03 PM -0700, Scott Halgrim
> > > > > <scott.halgrim@xxxxxxxxxx>, wrote:
> > > > > > Oh wow, that will work? Thanks! Is there any reason for me not to
> > > > > > just run a mass UPDATE on those dag runs directly in the metadata
> > > > > > database?
> > > > > >
> > > > > > On May 25, 2018, 4:01 PM -0700, Ruiqin Yang <yrqls21@xxxxxxxxx>,
> > > > > > wrote:
> > > > > > > Airflow is not going to schedule backfill DAG runs; it identifies
> > > > > > > them by the DAG run ID (which starts with 'backfill_'). If you
> > > > > > > want the scheduler to schedule those tasks, you can click the DAG
> > > > > > > run and edit its name back to 'scheduled__<something>'.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Kevin Y
> > > > > > >
> > > > > > > On Fri, May 25, 2018 at 3:53 PM, Scott Halgrim <
> > > > > > > scott.halgrim@xxxxxxxxxx.invalid> wrote:
> > > > > > >
> > > > > > > > I’ve got four months of dag runs that were scheduled dag runs,
> > > > > > > > then I backfilled them. And now when I clear a task from one of
> > > > > > > > those, the dag run goes to “running,” but none of the tasks get
> > > > > > > > scheduled (unless I manually backfill each of them).
> > > > > > > >
> > > > > > > > What I really should have done here was just clear a mid-dag
> > > > > > > > task as well as all downstream tasks for these dag runs, but,
> > > > > > > > well, now I’m here and I’m wondering what the best way to fix
> > > > > > > > this is.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > >
> > > > >
> > > >
> >
>
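
For readers following the thread, here is a minimal Python sketch of the two ideas discussed above: the prefix test that reportedly makes the scheduler skip backfill DAG runs, and the run_id rewrite performed by the UPDATE query. It is not Airflow code. The function names are hypothetical, and the example run_id assumes a 19-character ISO timestamp after the `backfill_` prefix, which is what `substring(run_id, 10, 19)` in the query implies.

```python
# Prefixes as quoted in the thread; treat them as assumptions about the
# Airflow version under discussion, not as a stable API.
BACKFILL_PREFIX = "backfill_"
SCHEDULED_PREFIX = "scheduled__"
TIMESTAMP_LENGTH = 19  # e.g. "2017-09-01T00:00:00"


def is_backfill_run(run_id: str) -> bool:
    """Return True if the run_id carries the backfill prefix.

    Per the thread, DAG runs matching this test are skipped by the scheduler.
    """
    return run_id.startswith(BACKFILL_PREFIX)


def to_scheduled_run_id(run_id: str) -> str:
    """Mirror concat('scheduled__', substring(run_id, 10, 19)) from the thread.

    Non-backfill run_ids are returned unchanged, matching the query's
    "run_id like 'backfill_%'" filter.
    """
    if not is_backfill_run(run_id):
        return run_id
    # SQL substring is 1-indexed: position 10 with length 19 corresponds to
    # the Python slice [9:28].
    start = len(BACKFILL_PREFIX)
    return SCHEDULED_PREFIX + run_id[start:start + TIMESTAMP_LENGTH]


if __name__ == "__main__":
    example = "backfill_2017-09-01T00:00:00"  # hypothetical run_id
    print(is_backfill_run(example))
    print(to_scheduled_run_id(example))
```

Running a check like this against a sample of run_ids before issuing the UPDATE is one way to confirm that the query targets only the intended rows.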

