Re: Is Using Too Many Airflow Variables in a DAG a Good Thing?


Cache them where? When would they be invalidated? Given that DAG parsing happens in a sub-process, how would the cache live longer than that process?

I think the change might be to use a per-process/per-thread SQLAlchemy connection when parsing DAGs, so that if a DAG needs access to the metadata DB it does so with just one connection rather than N.
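
A minimal sketch of that idea, assuming 1.9-era internals (airflow.settings.Session and the Variable model; the helper name is made up, not an Airflow API):

from airflow import settings
from airflow.models import Variable

def get_variables(*keys):
    """Fetch several Variables over a single session/connection."""
    session = settings.Session()
    try:
        rows = session.query(Variable).filter(Variable.key.in_(keys)).all()
        return {row.key: row.val for row in rows}
    finally:
        session.close()

# One connection for all lookups instead of one per Variable.get():
values = get_variables('test_owner_de', 'de_infra_email')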

-ash

> On 22 Oct 2018, at 11:11, Sai Phanindhra <phani8996@xxxxxxxxx> wrote:
> 
> Why don't we cache variables? We can fairly assume that variables won't
> change very frequently (certainly not as frequently as the scheduler
> parses DAGs). We could set the default cache timeout to a few multiples
> of the scheduler interval. This would help control the number of
> connections to the database and reduce load on both the scheduler and
> the database.
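> 
> A minimal sketch of that cache, assuming plain Airflow 1.9 (no such cache
> exists in Airflow itself; the TTL and names here are illustrative):
> 
> import time
> from airflow.models import Variable
> 
> _cache = {}      # key -> (value, fetched_at), one dict per parsing process
> CACHE_TTL = 300  # seconds; a few multiples of the scheduler interval
> 
> def cached_variable(key, default_var=None):
>     """Variable.get with a per-process TTL cache: one DB hit per TTL."""
>     now = time.time()
>     hit = _cache.get(key)
>     if hit is not None and now - hit[1] < CACHE_TTL:
>         return hit[0]
>     value = Variable.get(key, default_var=default_var)
>     _cache[key] = (value, now)
>     return value
> 
> Note that if every parse runs in a fresh subprocess, the cache starts
> empty each time, so it mainly helps within a single parse.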
> 
> On Mon 22 Oct, 2018, 13:34 Marcin Szymański, <ms32035@xxxxxxxxx> wrote:
> 
>> Hi
>> 
>> You are right, it's a sure way to saturate DB connections, as a
>> connection is established every few seconds when the DAGs are parsed.
>> The same happens when you use Variables in an operator's __init__. An
>> OS environment variable would be safer for your needs.
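>> 
>> A sketch of that approach (the environment variable names here are made
>> up; export them wherever the scheduler and workers run):
>> 
>> import os
>> 
>> default_args = {
>>     'owner': os.environ.get('DE_TEAM_OWNER', 'data-engineering'),
>>     'email': os.environ.get('DE_INFRA_EMAIL'),
>> }
>> 
>> os.environ is read from process memory, so parsing the DAG file opens no
>> database connection for these values.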
>> 
>> Marcin
>> 
>> 
>> On Mon, 22 Oct 2018, 08:34 Pramiti Goel, <pramitigoel20@xxxxxxxxx> wrote:
>> 
>>> Hi,
>>> 
>>> We want to keep the owner and email ID generic, so we don't want to
>>> hard-code them in the Airflow DAGs. Using Variables will let us change
>>> the email/owner later, since there are a lot of DAGs with the same owner.
>>> 
>>> For example:
>>> 
>>> 
>>> from datetime import datetime, timedelta
>>> from airflow.models import Variable
>>> 
>>> default_args = {
>>>     'owner': Variable.get('test_owner_de'),    # one metadata-DB query per parse
>>>     'depends_on_past': False,
>>>     'start_date': datetime(2018, 10, 17),
>>>     'email': Variable.get('de_infra_email'),   # and a second one here
>>>     'email_on_failure': True,
>>>     'email_on_retry': True,
>>>     'retries': 2,
>>>     'retry_delay': timedelta(minutes=1),
>>> }
>>> 
>>> 
>>> Looking into the Airflow code, it creates a connection session every
>>> time a variable is fetched, and then closes it. (Let me know if I have
>>> understood this wrong.) If there are many DAGs with variables in
>>> default_args being parsed in parallel, all querying the variable table
>>> in MySQL, will that hit any limit on the number of SQLAlchemy sessions?
>>> Will it make DAGs slow, since there will be many queries to MySQL for
>>> each DAG? Is the above approach good?
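>>> 
>>> Roughly the pattern I see (paraphrased for illustration, not Airflow's
>>> actual source):
>>> 
>>> from airflow import settings
>>> from airflow.models import Variable
>>> 
>>> def naive_get(key):
>>>     session = settings.Session()  # new session (and pooled connection) per call
>>>     try:
>>>         obj = session.query(Variable).filter(Variable.key == key).first()
>>>         if obj is None:
>>>             raise KeyError('Variable {} does not exist'.format(key))
>>>         return obj.val
>>>     finally:
>>>         session.close()           # closed straight away; repeated for every key
>>> 
>>> So the two Variable.get() calls in default_args above mean two of these
>>> open/query/close round trips on every parse of the file.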
>>> 
>>> Using Airflow 1.9.
>>> 
>>> Thanks,
>>> Pramiti.
>>> 
>>