git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: programmatically creating and airflow quirks



> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <maximebeauchemin@xxxxxxxxx> wrote:
> 
> The historical reason is that people would check in scripts in the repo
> that had actual compute or other forms or undesired effect in module scope
> (scripts with no "if __name__ == '__main__':") and Airflow would just run
> this script while seeking for DAGs. So we added this mitigation patch that
> would confirm that there's something Airflow-related in the .py file. Not
> elegant, and confusing at times, but it also probably prevented some issues
> over the years.
> 
> The solution here is to have a more explicit way of adding DAGs to the
> DagBag (instead of the folder-crawling approach). The DagFetcher proposal
> offers solutions around that, having a central "manifest" file that
> provides explicit pointers to all DAGs in the environment.

Some rebasing needs to happen. When I looked at 1.8 code base almost an year ago, it felt like more complex than necessary.  What airflow is trying to promise from an architectural standpoint — that was not clear to me. It is trying to do too many things, scattered in too many places, is the feeling I got. As a result, I stopped peeping, and just trust that it works — which it does, btw. I tend to think that, airflow outgrew its original intents. A sort of micro-services architecture has to be brought in. I may sound critical, but no offense. I truly appreciate the contributions.    

> 
> Max
> 
> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <beauinmelbourne@xxxxxxxxx>
> wrote:
> 
>> In my opinion this searching for dags is not ideal.
>> 
>> We should be explicitly specifying the dags to load somewhere.
>> 
>> 
>>> On 25 Nov 2018, at 10:41 am, Kevin Yang <yrqls21@xxxxxxxxx> wrote:
>>> 
>>> I believe that is mostly because we want to skip parsing/loading .py
>> files
>>> that doesn't contain DAG defs to save time, as scheduler is going to
>>> parse/load the .py files over and over again and some files can take
>> quite
>>> long to load.
>>> 
>>> Cheers,
>>> Kevin Y
>>> 
>>> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhavala@xxxxxxxxx>
>>> wrote:
>>> 
>>>> happy to report that the “fix” worked. thanks Alex.
>>>> 
>>>> btw, wondering why was it there in the first place? how does it help —
>>>> saves time, early termination — what?
>>>> 
>>>> 
>>>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <alex.guziel@xxxxxxxxxx>
>> wrote:
>>>>> 
>>>>> Yup.
>>>>> 
>>>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@xxxxxxxxx
>>>> <mailto:soma.dhavala@xxxxxxxxx>> wrote:
>>>>> 
>>>>> 
>>>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@xxxxxxxxxx
>>>> <mailto:alex.guziel@xxxxxxxxxx>> wrote:
>>>>>> 
>>>>>> It’s because of this
>>>>>> 
>>>>>> “When searching for DAGs, Airflow will only consider files where the
>>>> string “airflow” and “DAG” both appear in the contents of the .py file.”
>>>>>> 
>>>>> 
>>>>> Have not noticed it.  From airflow/models.py, in process_file — (both
>> in
>>>> 1.9 and 1.10)
>>>>> ..
>>>>> if not all([s in content for s in (b'DAG', b'airflow')]):
>>>>> ..
>>>>> is looking for those strings and if they are not found, it is returning
>>>> without loading the DAGs.
>>>>> 
>>>>> 
>>>>> So having “airflow” and “DAG”  dummy strings placed somewhere will make
>>>> it work?
>>>>> 
>>>>> 
>>>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@xxxxxxxxx
>>>> <mailto:soma.dhavala@xxxxxxxxx>> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@xxxxxxxxxx
>>>> <mailto:alex.guziel@xxxxxxxxxx>> wrote:
>>>>>>> 
>>>>>>> I think this is what is going on. The dags are picked by local
>>>> variables. I.E. if you do
>>>>>>> dag = Dag(...)
>>>>>>> dag = Dag(…)
>>>>>> 
>>>>>> from my_module import create_dag
>>>>>> 
>>>>>> for file in yaml_files:
>>>>>>    dag = create_dag(file)
>>>>>>    globals()[dag.dag_id] = dag
>>>>>> 
>>>>>> You notice that create_dag is in a different module. If it is in the
>>>> same scope (file), it will be fine.
>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>>> Only the second dag will be picked up.
>>>>>>> 
>>>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <
>> soma.dhavala@xxxxxxxxx
>>>> <mailto:soma.dhavala@xxxxxxxxx>> wrote:
>>>>>>> Hey AirFlow Devs:
>>>>>>> In our organization, we build a Machine Learning WorkBench with
>>>> AirFlow as
>>>>>>> an orchestrator of the ML Work Flows, and have wrapped AirFlow python
>>>>>>> operators to customize the behaviour. These work flows are specified
>> in
>>>>>>> YAML.
>>>>>>> 
>>>>>>> We drop a DAG loader (written python) in the default location airflow
>>>>>>> expects the DAG files.  This DAG loader reads the specified YAML
>> files
>>>> and
>>>>>>> converts them into airflow DAG objects. Essentially, we are
>>>>>>> programmatically creating the DAG objects. In order to support
>> muliple
>>>>>>> parsers (yaml, json etc), we separated the DAG creation from loading.
>>>> But
>>>>>>> when a DAG is created (in a separate module) and made available to
>> the
>>>> DAG
>>>>>>> loaders, airflow does not pick it up. As an example, consider that I
>>>>>>> created a DAG picked it, and will simply unpickle the DAG and give it
>>>> to
>>>>>>> airflow.
>>>>>>> 
>>>>>>> However, in current avatar of airfow, the very creation of DAG has to
>>>>>>> happen in the loader itself. As far I am concerned, airflow should
>> not
>>>> care
>>>>>>> where and how the DAG object is created, so long as it is a valid DAG
>>>>>>> object. The workaround for us is to mix parser and loader in the same
>>>> file
>>>>>>> and drop it in the airflow default dags folder. During dag_bag
>>>> creation,
>>>>>>> this file is loaded up with import_modules utility and shows up in
>> the
>>>> UI.
>>>>>>> While this is a solution, but it is not clean.
>>>>>>> 
>>>>>>> What do DEVs think about a solution to this problem? Will saving the
>>>> DAG to
>>>>>>> the db and reading it from the db work? Or some core changes need to
>>>> happen
>>>>>>> in the dag_bag creation. Can dag_bag take a bunch of "created" DAGs.
>>>>>>> 
>>>>>>> thanks,
>>>>>>> -soma
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>