
Re: Single Airflow Instance Vs Multiple Airflow Instance


Not sure about 1.9, but parallelism = 0 seems to be supported on master
<https://github.com/apache/incubator-airflow/blob/272952a9dce932cb2c648f82c9f9f2cafd572ff1/airflow/executors/base_executor.py#L113>.
We are using 1.8 with some cherry-picked bug fixes. The machines are just
out-of-the-box AWS EC2 instances. We've been using I3 for the scheduler and
R3 for the workers, but I urge you to check out the newer generations, which
are more powerful and cheaper. As always, you can pick the best series by
profiling your machine usage (I/O, RAM, CPU, etc.). We haven't tuned the
default Airflow settings much, and the best settings for you will likely
differ from the ones best for us (that being said, I can provide some more
details when I'm back in the office if you are curious about particular
settings).
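
For reference, the knobs in question live in airflow.cfg; the values below
are illustrative, not the ones we run (and note the worker-concurrency key
was named `celeryd_concurrency` in 1.8/1.9 before being renamed in 1.10):

```ini
[core]
# Total task instances allowed to run concurrently across the whole
# cluster. 0 means unbounded on master; 1.9 expects a positive integer.
parallelism = 0

# Default cap on concurrently running task instances per DAG.
dag_concurrency = 16

[celery]
# Task slots offered by each celery worker process.
celeryd_concurrency = 16
```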

Cheers,
Kevin Y

On Thu, Jun 7, 2018 at 9:02 PM ramandumcs@xxxxxxxxx <ramandumcs@xxxxxxxxx>
wrote:

> We have a similar use case where we need to support multiple teams, and the
> expected load is 1000(s) of active TIs. We are exploring setting up one
> airflow cluster for each team and scaling that cluster horizontally
> through the celery executor.
> @Ruiqin, could you please share some details on your airflow setup, like the
> Airflow version, machine configuration, airflow.cfg settings, etc.?
> How can we configure infinity (0) for the cluster-wide setting? (We are using
> airflow v1.9, and it seems that
> airflow.cfg's parallelism = 0 is not supported in v1.9.)
>
> On 2018/06/07 22:27:20, Ruiqin Yang <yrqls21@xxxxxxxxx> wrote:
> > Here to provide a datapoint from Airbnb--all users share the same cluster
> > (~8k active DAGs and ~15k running tasks at peak).
> >
> > For the cluster-wide concurrency setting, we put infinity (0) there and
> > scale up the # of workers if we need more worker slots.
> >
> > For the scheduler & Airflow UI coupling, I believe the Airflow UI is not
> > coupled with the scheduler. Actually, at Airbnb we couple the airflow
> > worker and airflow webserver together on the same EC2 instance--but you
> > can always have a set of instances that only host webservers.
> >
> > If you have some critical users who don't want their DAGs affected by
> > changes from other users (ad-hoc new DAGs/tasks), you can probably set up
> > a dedicated celery queue for them (assuming you are using the celery
> > executor; the local executor is in theory not for production), or you can
> > enforce DAG-level concurrency (maybe in CI or through a policy
> > <
> https://github.com/apache/incubator-airflow/blob/master/airflow/settings.py#L109
> >--which
> > I'm not sure is a good practice, since policies are more for task-level
> > attributes).
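
A cluster policy of the sort linked above could be sketched roughly like
this (a minimal sketch; the DAG names and queue name are made up, and the
file name assumes the usual airflow_local_settings.py hook):

```python
# Hypothetical airflow_local_settings.py (DAG ids and queue name are
# illustrative). Airflow imports this module at startup and calls
# `policy(task)` on each task as DAGs are parsed, so it can mutate task
# attributes before scheduling.

# DAGs whose tasks should land on an isolated Celery queue, so ad-hoc
# DAGs from other users cannot starve them of worker slots.
CRITICAL_DAGS = {"billing_daily", "revenue_rollup"}  # made-up names


def policy(task):
    # Route tasks belonging to critical DAGs to a dedicated queue;
    # workers started with `airflow worker -q critical` will serve it.
    if task.dag_id in CRITICAL_DAGS:
        task.queue = "critical"
```

With something like this in place, only workers subscribed to the
"critical" queue pick up those tasks, giving the critical users their own
capacity without a separate cluster.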
> >
> > With the awesome RBAC change in place, I think it makes sense to share
> > the same cluster: easier maintenance, less user confusion, etc.
> >
> > Cheers,
> > Kevin Y
> >
> > On Thu, Jun 7, 2018 at 1:59 PM Ananth Durai <vananth22@xxxxxxxxx> wrote:
> >
> > > At Slack, we follow a similar pattern of deploying multiple airflow
> > > instances. Since the Airflow UI & the scheduler are coupled, this
> > > introduces friction, as users need to know the underlying deployment
> > > strategy (like which Airflow URL to visit to see their DAGs, multiple
> > > teams collaborating on the same DAG, pipeline operations, etc.).
> > >
> > > In one of the forum questions, Max mentioned renaming the scheduler to
> > > supervisor, as the scheduler does more than just scheduling.
> > > It would be super cool if we could make multiple supervisors share the
> > > same airflow metadata storage and the Airflow UI (maybe by introducing
> > > a unique config param `supervisor.id` for each instance).
> > >
> > > This approach would help us scale the Airflow scheduler horizontally
> > > while keeping the simplicity from the user's perspective.
> > >
> > >
> > > Regards,
> > > Ananth.P,
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 7 June 2018 at 04:08, Arturo Michel <Arturo.Michel@xxxxxxxxxxxxxx>
> > > wrote:
> > >
> > > > We have had up to 50 dags with multiple tasks each. Many of them run
> > > > in parallel. We've had some issues with compute, as it was meant to
> > > > be a temporary deployment but somehow it's now the permanent
> > > > production one, and resources are not great.
> > > > Organisationally it is very similar to what Gerard described: more
> > > > than one group working with different engineering practices and
> > > > standards, which is probably one of the sources of problems.
> > > >
> > > > -----Original Message-----
> > > > From: Gerard Toonstra <gtoonstra@xxxxxxxxx>
> > > > Sent: Wednesday, June 6, 2018 5:02 PM
> > > > To: dev@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > > > Subject: Re: Single Airflow Instance Vs Multiple Airflow Instance
> > > >
> > > > We are using two cluster instances. One cluster is for the
> > > > engineering teams in the "tech" wing, which rigorously follow tech
> > > > principles; the other instance is for business analysts and more
> > > > ad-hoc, experimental work by people who do not necessarily follow
> > > > those principles. We have a nomad engineer helping out with the
> > > > ad-hoc cluster: setting it up, connecting it to all systems, and
> > > > resolving programming questions. All clusters are fully puppetized,
> > > > so we reuse configs and the ways things are configured, plus we have
> > > > a common "platform code" package that is reused across both clusters.
> > > >
> > > > G>
> > > >
> > > >
> > > > On Wed, Jun 6, 2018 at 5:50 PM, James Meickle <
> jmeickle@xxxxxxxxxxxxxx>
> > > > wrote:
> > > >
> > > > > An important consideration here is that there are several settings
> > > > > that are cluster-wide. In particular, cluster-wide concurrency
> > > > > settings could result in Team B's DAG refusing to schedule based
> > > > > on an error in Team A's DAG.
> > > > >
> > > > > Do your teams follow similar practices in how eagerly they ship
> > > > > code, or have similar SLAs for resolving issues? If so, you are
> > > > > probably fine using co-tenancy. If not, you should probably talk
> > > > > about it first to make sure the teams are okay with co-tenancy.
> > > > >
> > > > > On Wed, Jun 6, 2018 at 11:24 AM, gauthiermartin86@xxxxxxxxx <
> > > > > gauthiermartin86@xxxxxxxxx> wrote:
> > > > >
> > > > > > Hi Everyone,
> > > > > >
> > > > > > We have been experimenting with airflow for about 6 months now,
> > > > > > and we are planning to have multiple departments use it. Since
> > > > > > we don't have any internal experience with Airflow, we are
> > > > > > wondering if a single instance per department is more suitable
> > > > > > than a single instance with multi-tenancy. We are aware of the
> > > > > > upcoming release of airflow 1.10 and the changes to RBAC, which
> > > > > > will be better suited to multi-tenancy.
> > > > > >
> > > > > > Any advice on this? Any tips would be helpful to us.
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
>