Re: Convert Dag Run from Backfill to Scheduled?


This line
<https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L935>
is where the scheduler skips backfill DAG runs. Regardless of the state the
DAG run is in, tasks in a DAG run whose run ID starts with 'backfill_' are
not considered for scheduling.
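
To make the prefix behaviour concrete, here is a minimal sketch of the check and of the rename discussed further down the thread. This is not Airflow's actual code: the function names are hypothetical, and the prefixes and the 19-character timestamp length are assumptions taken from the run IDs quoted in this thread.

```python
# Illustrative sketch only; not copied from airflow/jobs.py.
# Prefixes are assumed from the run IDs quoted in this thread.
BACKFILL_PREFIX = "backfill_"
SCHEDULED_PREFIX = "scheduled__"
TIMESTAMP_LEN = 19  # length of a timestamp such as "2017-09-01T00:00:00"


def is_backfill_run(run_id):
    """Return True if the run ID marks a backfill DAG run."""
    return run_id.startswith(BACKFILL_PREFIX)


def schedulable_run_ids(run_ids):
    """Drop backfill runs, mimicking the skip described above."""
    return [run_id for run_id in run_ids if not is_backfill_run(run_id)]


def to_scheduled_run_id(run_id):
    """Rewrite a backfill run ID into a scheduled-style run ID.

    Mirrors concat('scheduled__', substring(run_id, 10, 19)) from the
    UPDATE query in this thread; other run IDs are returned unchanged.
    """
    if not is_backfill_run(run_id):
        return run_id
    start = len(BACKFILL_PREFIX)
    return SCHEDULED_PREFIX + run_id[start:start + TIMESTAMP_LEN]
```

With these hypothetical helpers, a run ID that begins with 'backfill_' is filtered out by `schedulable_run_ids`, and `to_scheduled_run_id` gives it the 'scheduled__' prefix that the scheduler will consider.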

I agree with Dan Davydov's idea that we should at least support multiple
DAG runs for one execution date, to distinguish different kinds of DAG run
such as scheduled and backfilled. The situation Scott is facing is not the
only problem caused by the lack of multiple DAG runs (e.g. manually
triggering a task in the UI should also create a separate DAG run; otherwise
the implementation logic is a bit weird).

Cheers,
Kevin Y

On Tue, May 29, 2018 at 5:52 PM Scott Halgrim
<scott.halgrim@xxxxxxxxxx.invalid> wrote:

> Well I’ve gone ahead and run the UPDATE query now, so the scheduler is
> picking up tasks.
>
> When I cleared the tasks, every DAG run that had a cleared task in it was
> set to running. Because I’d backfilled them all they were all `backfill_`
> dag runs.  Inspection of various tasks via `task_failed_deps` indicated the
> tasks had all their dependencies filled. After running the update query,
> they’re all `scheduled__` dag runs.
>
> On May 29, 2018, 5:02 PM -0700, Maxime Beauchemin <
> maximebeauchemin@xxxxxxxxx>, wrote:
> > While this may work it's clearly not the prescribed way to do this.
> > Clearing should just work.
> >
> > I'm trying to understand why the scheduler is not picking up the cleared
> > task. Clearing should remove the task instance state and set the state of
> > the related DAG Run to running so that the scheduler picks those up.
> > Perhaps there's a conflict between the backfill and scheduler-related DAG
> > Runs? Which DAG runs are set to running? The backfill or
> > scheduler-related ones?
> >
> > Originally when I introduced DAG runs, backfill was operating without any
> > consideration related to DAG runs (DAG runs were a scheduler-specific
> > construct), later on Bolke added backfill-specific DAG runs and I'm not
> > 100% sure how that works.
> >
> > Let's get to the bottom of this.
> >
> > Max
> >
> > On Fri, May 25, 2018 at 7:48 PM Ruiqin Yang <yrqls21@xxxxxxxxx> wrote:
> >
> > > If you are sure the update query targets the desired rows, the behavior
> > > should be the same.
> > >
> > > On Fri, May 25, 2018 at 4:23 PM Scott Halgrim
> > > <scott.halgrim@xxxxxxxxxx.invalid> wrote:
> > >
> > > > So far no ill effects from:
> > > >
> > > > update dag_run
> > > > set run_id = concat('scheduled__', substring(run_id, 10, 19))
> > > > where dag_id = 'daily'
> > > > and execution_date > '2017-08-31' and execution_date < '2018-01-11'
> > > > and run_id like 'backfill_%'
> > > > order by execution_date;
> > > >
> > > > On May 25, 2018, 4:03 PM -0700, Scott Halgrim
> > > > <scott.halgrim@xxxxxxxxxx>, wrote:
> > > > > Oh wow, that will work? Thanks! Is there any reason for me not to
> > > > > just run a mass UPDATE on those dag runs directly in the metadata
> > > > > database?
> > > > >
> > > > > On May 25, 2018, 4:01 PM -0700, Ruiqin Yang <yrqls21@xxxxxxxxx>,
> > > > > wrote:
> > > > > > Airflow is not going to schedule backfill DAG runs, which it
> > > > > > identifies by the DAG run ID (it starts with 'backfill_'). If you
> > > > > > want the scheduler to schedule those tasks, you can click the DAG
> > > > > > run and edit its name back to 'scheduled__<something>'.
> > > > > >
> > > > > > Cheers,
> > > > > > Kevin Y
> > > > > >
> > > > > > On Fri, May 25, 2018 at 3:53 PM, Scott Halgrim <
> > > > > > scott.halgrim@xxxxxxxxxx.invalid> wrote:
> > > > > >
> > > > > > > I’ve got four months of dag runs that were scheduled dag runs,
> > > > > > > then I backfilled them. And now when I clear a task from one of
> > > > > > > those, the dag run goes to “running,” but none of the tasks get
> > > > > > > scheduled (unless I manually backfill each of them).
> > > > > > >
> > > > > > > What I really should have done here was just clear a mid-dag
> > > > > > > task as well as all downstream tasks for these dag runs, but,
> > > > > > > well, now I’m here and I’m wondering what the best way to fix
> > > > > > > this is.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > >
> > > >
> > >
>

