Re: programmatically creating and airflow quirks


In my opinion, searching files for DAGs like this is not ideal.

We should explicitly specify somewhere which DAGs to load.


> On 25 Nov 2018, at 10:41 am, Kevin Yang <yrqls21@xxxxxxxxx> wrote:
> 
> I believe that is mostly because we want to skip parsing/loading .py files
> that don't contain DAG definitions to save time, as the scheduler is going to
> parse/load the .py files over and over again, and some files can take quite
> long to load.
> 
> Cheers,
> Kevin Y
> 
> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhavala@xxxxxxxxx>
> wrote:
> 
>> Happy to report that the “fix” worked. Thanks, Alex.
>> 
>> Btw, why was it there in the first place? How does it help: saving time,
>> early termination, or something else?
>> 
>> 
>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <alex.guziel@xxxxxxxxxx> wrote:
>>> 
>>> Yup.
>>> 
>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@xxxxxxxxx> wrote:
>>> 
>>> 
>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@xxxxxxxxxx> wrote:
>>>> 
>>>> It’s because of this
>>>> 
>>>> “When searching for DAGs, Airflow will only consider files where the
>>>> string “airflow” and “DAG” both appear in the contents of the .py file.”
>>>> 
>>> 
>>> I had not noticed that. In airflow/models.py, process_file (in both 1.9
>>> and 1.10) has
>>> ..
>>> if not all([s in content for s in (b'DAG', b'airflow')]):
>>> ..
>>> which looks for those strings and, if they are not found, returns
>>> without loading the DAGs.
>>> 
>>> 
>>> So placing dummy “airflow” and “DAG” strings somewhere in the file will
>>> make it work?
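
A minimal sketch of that pre-filter, assuming it behaves like the line quoted above. The name might_contain_dag and the two sample sources are hypothetical, made up for illustration; they are not part of Airflow's API.

```python
def might_contain_dag(content: bytes) -> bool:
    """Mimic the quoted check: both byte strings must appear somewhere in the file."""
    return all(s in content for s in (b"DAG", b"airflow"))


# A loader that only imports a factory from another module: neither marker
# appears ("dag" in lowercase does not match b"DAG").
loader_source = b"""
from my_module import create_dag

for file in yaml_files:
    dag = create_dag(file)
    globals()[dag.dag_id] = dag
"""

# The same loader with the markers added in a comment (the "dummy strings" idea).
patched_source = b"# airflow DAG\n" + loader_source

print(might_contain_dag(loader_source))   # False
print(might_contain_dag(patched_source))  # True
```

The check is case-sensitive and purely textual, so a comment is enough to get the file past it; whether the DAGs are then found still depends on what the file binds at module level.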
>>> 
>>> 
>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@xxxxxxxxx> wrote:
>>>> 
>>>> 
>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@xxxxxxxxxx> wrote:
>>>>> 
>>>>> I think this is what is going on. DAGs are picked up from the variables
>>>>> they are bound to, i.e. if you do
>>>>> dag = DAG(...)
>>>>> dag = DAG(...)
>>>> 
>>>> from my_module import create_dag
>>>> 
>>>> for file in yaml_files:
>>>>     dag = create_dag(file)
>>>>     globals()[dag.dag_id] = dag
>>>> 
>>>> Note that create_dag is in a different module. If it is in the same
>>>> scope (file), it works fine.
>>>> 
>>>>> 
>>>> 
>>>>> Only the second dag will be picked up.
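
To illustrate the rebinding point: a rough sketch, assuming DAGs are collected by scanning the names bound at module level. FakeDAG and collect_dags are hypothetical stand-ins so the snippet runs without Airflow; they are not Airflow's own classes or functions.

```python
class FakeDAG:
    """Hypothetical stand-in for a DAG object (not airflow.DAG)."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


def collect_dags(namespace):
    """Return the DAG-like objects still bound in a namespace, keyed by dag_id."""
    return {
        obj.dag_id: obj
        for obj in namespace.values()
        if isinstance(obj, FakeDAG)
    }


# Rebinding one name: the first object is no longer reachable from the namespace.
rebound = {}
rebound["dag"] = FakeDAG("first")
rebound["dag"] = FakeDAG("second")

# Binding each object under its own name, as globals()[dag.dag_id] = dag does.
distinct = {}
for dag_id in ("first", "second"):
    dag = FakeDAG(dag_id)
    distinct[dag.dag_id] = dag

print(sorted(collect_dags(rebound)))   # ['second']
print(sorted(collect_dags(distinct)))  # ['first', 'second']
```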
>>>>> 
>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhavala@xxxxxxxxx> wrote:
>>>>> Hey Airflow devs:
>>>>> In our organization, we build a Machine Learning WorkBench with
>>>>> Airflow as the orchestrator of the ML workflows, and have wrapped Airflow
>>>>> Python operators to customize the behaviour. These workflows are
>>>>> specified in YAML.
>>>>> 
>>>>> We drop a DAG loader (written in Python) in the default location where
>>>>> Airflow expects the DAG files. This DAG loader reads the specified YAML
>>>>> files and converts them into Airflow DAG objects. Essentially, we are
>>>>> programmatically creating the DAG objects. In order to support multiple
>>>>> parsers (YAML, JSON, etc.), we separated DAG creation from loading. But
>>>>> when a DAG is created in a separate module and made available to the DAG
>>>>> loader, Airflow does not pick it up. As an example, suppose I created a
>>>>> DAG and pickled it, and the loader simply unpickles the DAG and gives it
>>>>> to Airflow.
>>>>> 
>>>>> However, in the current avatar of Airflow, the creation of the DAG has
>>>>> to happen in the loader itself. As far as I am concerned, Airflow should
>>>>> not care where and how the DAG object is created, so long as it is a
>>>>> valid DAG object. The workaround for us is to mix the parser and loader
>>>>> in the same file and drop it in the default Airflow dags folder. During
>>>>> dag_bag creation, this file is loaded with the import_modules utility and
>>>>> shows up in the UI. While this is a solution, it is not clean.
>>>>> 
>>>>> What do the devs think about a solution to this problem? Will saving the
>>>>> DAG to the db and reading it from the db work? Or do some core changes
>>>>> need to happen in the dag_bag creation? Can dag_bag take a bunch of
>>>>> "created" DAGs?
>>>>> 
>>>>> thanks,
>>>>> -soma