Re: programmatically creating and airflow quirks


happy to report that the “fix” worked. thanks Alex.

btw, I am wondering why that check is there in the first place. How does it help: does it save time, terminate early, or something else?
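
For the archive, here is a rough standalone sketch of the check quoted below and of why adding the two strings gets the loader past it. It does not import Airflow. `might_contain_dag` is a name I made up for illustration, the loader source is the snippet from this thread held as bytes, and the marker comment is just one assumed way of placing the strings in the file.

```python
# Standalone sketch of the content check quoted from process_file in this thread.
# might_contain_dag is a hypothetical helper name, not an Airflow function.

def might_contain_dag(content: bytes) -> bool:
    """Return True if both marker strings appear in the raw file content."""
    return all(s in content for s in (b"DAG", b"airflow"))


# The loader from this thread: DAG creation lives in another module, so the
# file itself never mentions "airflow" or "DAG" (only lowercase "dag").
loader_without_markers = b"""
from my_module import create_dag

for file in yaml_files:
    dag = create_dag(file)
    globals()[dag.dag_id] = dag
"""

# Same loader with both strings placed in a comment (assumed placement;
# the check looks at raw bytes, so a comment is enough).
loader_with_markers = b"# airflow DAG\n" + loader_without_markers

print(might_contain_dag(loader_without_markers))  # False
print(might_contain_dag(loader_with_markers))     # True
```

The match is case-sensitive, so `create_dag` and `dag_id` do not count as "DAG".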


> On Nov 23, 2018, at 8:18 AM, Alex Guziel <alex.guziel@xxxxxxxxxx> wrote:
> 
> Yup. 
> 
> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@xxxxxxxxx <mailto:soma.dhavala@xxxxxxxxx>> wrote:
> 
> 
>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@xxxxxxxxxx <mailto:alex.guziel@xxxxxxxxxx>> wrote:
>> 
>> It’s because of this 
>> 
>> “When searching for DAGs, Airflow will only consider files where the string “airflow” and “DAG” both appear in the contents of the .py file.”
>> 
> 
> Have not noticed it.  From airflow/models.py, in process_file — (both in 1.9 and 1.10)
> ..
> if not all([s in content for s in (b'DAG', b'airflow')]):
> ..
> is looking for those strings and if they are not found, it is returning without loading the DAGs.
> 
> 
> So having “airflow” and “DAG”  dummy strings placed somewhere will make it work?
> 
> 
>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@xxxxxxxxx <mailto:soma.dhavala@xxxxxxxxx>> wrote:
>> 
>> 
>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@xxxxxxxxxx <mailto:alex.guziel@xxxxxxxxxx>> wrote:
>>> 
>>> I think this is what is going on. The DAGs are picked up from the file's top-level variables, i.e. if you do
>>> dag = DAG(...)
>>> dag = DAG(...)
>> 
>> from my_module import create_dag
>> 
>> for file in yaml_files:
>> 	dag = create_dag(file)
>> 	globals()[dag.dag_id] = dag
>> 
>> Note that create_dag is in a different module. If it is in the same scope (file), it works fine.
>> 
>>> 
>> 
>>> Only the second dag will be picked up.
>>> 
>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhavala@xxxxxxxxx <mailto:soma.dhavala@xxxxxxxxx>> wrote:
>>> Hey AirFlow Devs:
>>> In our organization, we are building a Machine Learning WorkBench with Airflow as
>>> the orchestrator of the ML workflows, and have wrapped Airflow Python
>>> operators to customize the behaviour. These workflows are specified in
>>> YAML.
>>> 
>>> We drop a DAG loader (written in Python) in the default location where Airflow
>>> expects the DAG files. This DAG loader reads the specified YAML files and
>>> converts them into Airflow DAG objects. Essentially, we are
>>> programmatically creating the DAG objects. In order to support multiple
>>> parsers (YAML, JSON, etc.), we separated the DAG creation from loading. But
>>> when a DAG is created (in a separate module) and made available to the DAG
>>> loaders, Airflow does not pick it up. As an example, consider that I
>>> created a DAG, pickled it, and simply unpickle the DAG and give it to
>>> Airflow.
>>> 
>>> However, in the current version of Airflow, the very creation of the DAG has to
>>> happen in the loader itself. As far as I am concerned, Airflow should not care
>>> where and how the DAG object is created, so long as it is a valid DAG
>>> object. The workaround for us is to mix parser and loader in the same file
>>> and drop it in the Airflow default dags folder. During dag_bag creation,
>>> this file is loaded with the import_modules utility and shows up in the UI.
>>> While this is a solution, it is not clean.
>>> 
>>> What do the devs think about a solution to this problem? Will saving the DAG to
>>> the db and reading it from the db work? Or do some core changes need to happen
>>> in the dag_bag creation? Can dag_bag take a bunch of "created" DAGs?
>>> 
>>> thanks,
>>> -soma
>> 
>