Re: programmatically creating and airflow quirks


Great inputs, James. I was premature in saying we need micro-services. Any solutioning should depend on the problem(s) being solved and promise(s) being made.

thanks,
-soma

> On Nov 28, 2018, at 11:24 PM, James Meickle <jmeickle@xxxxxxxxxxxxxx.INVALID> wrote:
> 
> I would be very interested in helping draft a rearchitecting AIP. Of
> course, that's a vague statement. I am interested in several specific areas
> of Airflow functionality that would be hard to modify without some
> refactoring taking place first:
> 
> 1) Improving Airflow's data model so it's easier to have functional data
> pipelines (such as addressing information propagation and artifacts via a
> non-xcom mechanism)
> 
> 2) Having point-in-timeness for DAGs: a concept of which revision of a DAG
> was in use at which date, represented in-Airflow.
> 
> 3) Better idioms and loading capabilities for DAG factories (either
> config-driven, or non-Python creation of DAGs, like with boundary-layer).
> 
> 4) Flexible execution dates: in finance we operate day over day, and have
> valid use cases for "t-1", "t+0", and "t+1" dates. The current execution
> date status is incredibly confusing for literally every developer we've
> brought onto Airflow (they understand it eventually but do make mistakes at
> first).
> 
> 5) Scheduler-integrated sensors
> 
> 6) Making Airflow more operator-friendly with better alerting, health
> checks, notifications, deploy-time configuration, etc.
> 
> 7) Improving testability of various components (both within the Airflow
> repo, as well as making it easier to test DAGs and plugins)
> 
> 8) Deprecating "newbie trap" or excess complexity features (like skips), by
> fixing their internal implementation or by providing alternatives that
> address their use cases in more sound ways.
> 
> To my mind, I would need Airflow to be more modular to accomplish several
> of those. Even if these aims don't happen in Airflow contrib (as some are
> quite contentious and have been discussed on this list before), it would
> currently be nearly impossible to maintain an in-house branch that
> attempted to implement them.
> 
> That being said, saying that it requires microservices is IMO incorrect.
> Airflow already scales quite well, so while it needs more modularization,
> we probably would see no benefit from immediately breaking those modules
> into independent services.
> 
> On Wed, Nov 28, 2018 at 11:38 AM Ash Berlin-Taylor <ash@xxxxxxxxxx> wrote:
> 
>> I have similar feelings around the "core" of Airflow and would _love_ to
>> somehow find time to spend a month really getting to grips with the
>> scheduler and the dagbag and see what comes to light with fresh eyes and
>> the benefits of hindsight.
>> 
>> Finding that time is going to be.... A Challenge though.
>> 
>> (Oh, except no to microservices. Airflow is hard enough to operate right
>> now without splitting things into even more daemons)
>> 
>> -ash
>>> On 26 Nov 2018, at 03:06, soma dhavala <soma.dhavala@xxxxxxxxx> wrote:
>>> 
>>> 
>>> 
>>>> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <maximebeauchemin@xxxxxxxxx> wrote:
>>>> 
>>>> The historical reason is that people would check in scripts in the repo
>>>> that had actual compute or other forms of undesired effects in module
>>>> scope (scripts with no "if __name__ == '__main__':") and Airflow would
>>>> just run this script while looking for DAGs. So we added this mitigation
>>>> patch that would confirm that there's something Airflow-related in the
>>>> .py file. Not elegant, and confusing at times, but it also probably
>>>> prevented some issues over the years.
>>>> 
>>>> The solution here is to have a more explicit way of adding DAGs to the
>>>> DagBag (instead of the folder-crawling approach). The DagFetcher
>>>> proposal offers solutions around that, having a central "manifest" file
>>>> that provides explicit pointers to all DAGs in the environment.
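
To make the "manifest" idea concrete, here is a minimal sketch of explicit loading as opposed to folder crawling. The format of the DagFetcher proposal is not described in this thread, so everything here is an assumption made for illustration: the JSON layout, the "dag_files" key and the function name load_from_manifest are hypothetical, and none of it is Airflow API.

```python
import importlib.util
import json
from pathlib import Path
from types import ModuleType
from typing import List


def load_from_manifest(manifest_path: Path) -> List[ModuleType]:
    """Import only the files that a JSON manifest lists explicitly."""
    # Hypothetical manifest layout: {"dag_files": ["relative/path.py", ...]}
    manifest = json.loads(manifest_path.read_text())
    modules = []
    for rel in manifest["dag_files"]:
        path = manifest_path.parent / rel
        spec = importlib.util.spec_from_file_location(path.stem, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the file's module-scope code
        modules.append(module)
    return modules
```

With this shape, a stray script sitting next to the manifest is never imported unless someone lists it, which is the property the content check was added to approximate.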
>>> 
>>> Some rebasing needs to happen. When I looked at the 1.8 code base almost
>>> a year ago, it felt more complex than necessary. What Airflow is trying
>>> to promise from an architectural standpoint was not clear to me. The
>>> feeling I got is that it is trying to do too many things, scattered in
>>> too many places. As a result, I stopped peeping, and just trust that it
>>> works, which it does, btw. I tend to think that Airflow outgrew its
>>> original intent. A sort of micro-services architecture has to be brought
>>> in. I may sound critical, but no offense. I truly appreciate the
>>> contributions.
>>> 
>>>> 
>>>> Max
>>>> 
>>>> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <beauinmelbourne@xxxxxxxxx>
>>>> wrote:
>>>> 
>>>>> In my opinion this searching for dags is not ideal.
>>>>> 
>>>>> We should be explicitly specifying the dags to load somewhere.
>>>>> 
>>>>> 
>>>>>> On 25 Nov 2018, at 10:41 am, Kevin Yang <yrqls21@xxxxxxxxx> wrote:
>>>>>> 
>>>>>> I believe that is mostly because we want to skip parsing/loading .py
>>>>>> files that don't contain DAG defs to save time, as the scheduler is
>>>>>> going to parse/load the .py files over and over again and some files
>>>>>> can take quite long to load.
>>>>>> 
>>>>>> Cheers,
>>>>>> Kevin Y
>>>>>> 
>>>>>> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhavala@xxxxxxxxx> wrote:
>>>>>> 
>>>>>>> happy to report that the “fix” worked. thanks Alex.
>>>>>>> 
>>>>>>> btw, wondering why it was there in the first place? how does it help:
>>>>>>> saves time, early termination, what?
>>>>>>> 
>>>>>>> 
>>>>>>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <alex.guziel@xxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> Yup.
>>>>>>>> 
>>>>>>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@xxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@xxxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> It’s because of this
>>>>>>>>> 
>>>>>>>>> “When searching for DAGs, Airflow will only consider files where the
>>>>>>>>> string “airflow” and “DAG” both appear in the contents of the .py file.”
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Have not noticed it.  From airflow/models.py, in process_file (both
>>>>>>>> in 1.9 and 1.10)
>>>>>>>> ..
>>>>>>>> if not all([s in content for s in (b'DAG', b'airflow')]):
>>>>>>>> ..
>>>>>>>> is looking for those strings and if they are not found, it is
>>>>>>>> returning without loading the DAGs.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> So having “airflow” and “DAG” dummy strings placed somewhere will
>>>>>>>> make it work?
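
For illustration, the check quoted above amounts to a cheap byte-substring test on the file contents. A minimal sketch follows; the helper name might_contain_dag and the two sample sources are hypothetical (they are not Airflow code), while the condition itself is the one quoted from process_file.

```python
def might_contain_dag(content: bytes) -> bool:
    """Return True only if both marker strings occur somewhere in the file bytes."""
    # Same condition as the line quoted from process_file, without the negation.
    return all(s in content for s in (b"DAG", b"airflow"))


# A loader that builds its DAGs through a helper in another module may never
# mention either marker, so the check would reject the file.
loader_source = b"from my_module import create_dag\n"

# Mentioning both markers anywhere, even in a comment, passes the check.
patched_source = b"# airflow DAG\n" + loader_source
```

Note that the test is case-sensitive: "create_dag" does not count as a mention of "DAG".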
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@xxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@xxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> I think this is what is going on. The dags are picked up by local
>>>>>>>>>> variables, i.e. if you do
>>>>>>>>>> dag = DAG(...)
>>>>>>>>>> dag = DAG(...)
>>>>>>>>> 
>>>>>>>>> from my_module import create_dag
>>>>>>>>> 
>>>>>>>>> for file in yaml_files:
>>>>>>>>>     dag = create_dag(file)
>>>>>>>>>     globals()[dag.dag_id] = dag
>>>>>>>>> 
>>>>>>>>> You notice that create_dag is in a different module. If it is in
>>>>>>>>> the same scope (file), it will be fine.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Only the second dag will be picked up.
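
The rebinding behaviour described here can be sketched without Airflow installed. This assumes, as the message suggests, that DAGs are discovered by scanning the top-level names of an imported module. FakeDag and collect_dags are hypothetical stand-ins written for this example, not Airflow classes or functions.

```python
from types import ModuleType


class FakeDag:
    """Stand-in for Airflow's DAG class so the sketch runs without Airflow."""

    def __init__(self, dag_id: str) -> None:
        self.dag_id = dag_id


def collect_dags(module: ModuleType) -> dict:
    """Collect DAG-like objects bound to top-level names of a module."""
    found = {}
    for value in vars(module).values():
        if isinstance(value, FakeDag):
            found[value.dag_id] = value
    return found


# Rebinding one name: only the object the name finally points to survives.
rebound = ModuleType("rebound")
rebound.dag = FakeDag("first")
rebound.dag = FakeDag("second")

# One name per DAG, as in the globals()[dag.dag_id] = dag pattern.
distinct = ModuleType("distinct")
for dag_id in ("first", "second"):
    setattr(distinct, dag_id, FakeDag(dag_id))
```

Under that assumption, collect_dags(rebound) sees a single DAG while collect_dags(distinct) sees both, which is why the loop in the reply stores each DAG under its own name.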
>>>>>>>>>> 
>>>>>>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhavala@xxxxxxxxx> wrote:
>>>>>>>>>> Hey Airflow Devs:
>>>>>>>>>> In our organization, we build a Machine Learning WorkBench with
>>>>>>>>>> Airflow as an orchestrator of the ML Work Flows, and have wrapped
>>>>>>>>>> Airflow Python operators to customize the behaviour. These work
>>>>>>>>>> flows are specified in YAML.
>>>>>>>>>> 
>>>>>>>>>> We drop a DAG loader (written in Python) in the default location
>>>>>>>>>> where Airflow expects the DAG files. This DAG loader reads the
>>>>>>>>>> specified YAML files and converts them into Airflow DAG objects.
>>>>>>>>>> Essentially, we are programmatically creating the DAG objects. In
>>>>>>>>>> order to support multiple parsers (YAML, JSON, etc.), we separated
>>>>>>>>>> the DAG creation from loading. But when a DAG is created (in a
>>>>>>>>>> separate module) and made available to the DAG loaders, Airflow
>>>>>>>>>> does not pick it up. As an example, consider that I created a DAG,
>>>>>>>>>> pickled it, and will simply unpickle the DAG and give it to
>>>>>>>>>> Airflow.
>>>>>>>>>> 
>>>>>>>>>> However, in the current avatar of Airflow, the very creation of the
>>>>>>>>>> DAG has to happen in the loader itself. As far as I am concerned,
>>>>>>>>>> Airflow should not care where and how the DAG object is created, so
>>>>>>>>>> long as it is a valid DAG object. The workaround for us is to mix
>>>>>>>>>> parser and loader in the same file and drop it in the Airflow
>>>>>>>>>> default dags folder. During dag_bag creation, this file is loaded
>>>>>>>>>> up with the import_modules utility and shows up in the UI. While
>>>>>>>>>> this is a solution, it is not clean.
>>>>>>>>>> 
>>>>>>>>>> What do DEVs think about a solution to this problem? Will saving
>>>>>>>>>> the DAG to the db and reading it from the db work? Or do some core
>>>>>>>>>> changes need to happen in the dag_bag creation? Can dag_bag take a
>>>>>>>>>> bunch of "created" DAGs?
>>>>>>>>>> thanks,
>>>>>>>>>> -soma
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>> 
>>