
Re: Best Practice of Airflow Setting-Up & Usage


Thank you Xiaodong for bringing this up, and pardon me for being late on this thread. Sharing the setup within Airbnb and some ideas/progress, which should benefit people who are interested in this topic.

*- Setting-up*: 
One-time install on 1.8 with cherry-picks; planning to move to containerization after releasing 1.10 internally.

*- Executors*: 
CeleryExecutor.

*- Scale*: 
400 worker nodes, with celery concurrency on each node varying from 2 to 200 (depending on the queue it is serving).

*- Queues*: 
We have 9 queues: 2 of them serve tasks with special env dependencies (need GPU, need special packages, etc.) and the other
7 serve tasks with different resource consumptions, which leads to worker nodes with different celery concurrency and cgroup sizes.
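For concreteness, routing a task to one of these queues happens at the operator level via the `queue` argument. The sketch below is a hypothetical configuration fragment (the DAG id, task ids, and the `gpu` queue name are made up for illustration; the import path is the Airflow 1.x one), not our actual DAG code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# Hypothetical DAG illustrating queue routing: workers started with
# `airflow worker -q gpu` will only pick up tasks assigned to that queue,
# while unrouted tasks land on the default queue.
with DAG(dag_id="queue_routing_example",
         start_date=datetime(2018, 1, 1),
         schedule_interval="@daily") as dag:
    train = BashOperator(task_id="train_model",
                         bash_command="python train.py",
                         queue="gpu")  # special-env queue (GPU workers)
    report = BashOperator(task_id="daily_report",
                          bash_command="python report.py")  # default queue
    train >> report
```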

*- SLA*:
5 mins. 3 mins is possible (our current scheduling delay stays in the 0.5-3 min range) but we wanted some more headroom. (Our SLA is evaluated by a monitoring task running
on the cluster at a 5m interval. It compares the current timestamp against the expected scheduling timestamp (execution_date + 5m)
and sends the time diff in minutes as one data point.)
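The delay computation the monitoring task performs can be sketched in a few lines of plain Python (the function name is mine, not from our codebase; the 5-minute SLA interval is the one described above):

```python
from datetime import datetime, timedelta

# Sketch of the SLA check described above: compare "now" against the
# expected scheduling timestamp (execution_date + the 5-minute SLA)
# and report the difference in minutes as one data point.
SLA = timedelta(minutes=5)

def scheduling_delay_minutes(execution_date, now):
    # Positive values mean the run is over SLA; negative values mean
    # it was scheduled with headroom to spare.
    expected = execution_date + SLA
    return (now - expected).total_seconds() / 60.0

# A run stamped 12:00 that is observed starting at 12:07 is 2 min over SLA.
delay = scheduling_delay_minutes(datetime(2018, 9, 1, 12, 0),
                                 datetime(2018, 9, 1, 12, 7))
print(delay)  # 2.0
```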

*- # of DAGs/Tasks*: 
We're maintaining ~9,500 active DAGs produced by ~1,600 DAG files (the # of DAG files is actually the biggest scheduling bottleneck now). During peak hours we have ~15,000 tasks running at the same time.


Xiaodong, you had a very good point about the scheduler being the performance bottleneck and needing HA. Looking forward to your contribution on the scheduler HA topic!

About scheduler performance/scaling, I've previously sent a proposal titled "[Proposal] Scale Airflow" to the dev mailing list. Currently I have one open PR improving performance of celery executor querying,
one WIP PR (fixing the CI, ETA this weekend) improving performance of the scheduler, and one PR in my backlog (have an internal PR, need to open-source it) improving performance of enqueuing in both the scheduler and the
celery executor. From our internal stress test results, with all three PRs together, our cluster can handle 4,000 DAG files and 30,000 peak concurrent running tasks (even when they all need to be scheduled at the same time)
within our 5 min SLA (calculated using our real DAG file parsing time, which can be as long as 90 seconds for one DAG file). I think we will have enough headroom after the changes have been merged, and thus
some longer-term improvements (separate DAG parsing component/service, scheduler sharding, distributing scheduler responsibility to workers, etc.) can wait a bit. But of course I'm open to hearing other opinions that can
better scale Airflow.
Also I do want to mention that with faster scheduling and DAG parsing, the DB might become the performance bottleneck. With our stress test setup we can handle the DB load with an AWS RDS r3.4xlarge instance (only
with the improvement PRs). And the webserver is not scaling very well, as it parses all DAG files in a single-process fashion, which is my planned next item to work on.


Cheers,
Kevin Y

On Thu, Sep 6, 2018 at 11:14 PM ramandumcs@xxxxxxxxx <ramandumcs@xxxxxxxxx> wrote:
Yeah, we are seeing the scheduler becoming a bottleneck as the number of DAG files increases, since the scheduler can only scale vertically, not horizontally.
We are trying with multiple independent airflow setup and are distributing the load between them.
But managing these many airflow clusters is becoming a challenge.

Thanks,
Raman Gupta

On 2018/09/06 14:55:26, Deng Xiaodong <xd.deng.r@xxxxxxxxx> wrote:
> Thanks for sharing, Raman.
>
> Based on what you shared, I think there are two points that may be worth
> further discussing/thinking.
>
> *Scaling up (given thousands of DAGs):*
> If you have thousands of DAGs, you may encounter longer scheduling latency
> (actual start time minus planned start time).
> For workers, we can scale horizontally by adding more worker nodes, which
> is relatively straightforward.
> But *Scheduler* may become another bottleneck. Scheduler can only be running
> on one node (please correct me if I'm wrong). Even if we can use multiple
> threads for it, it has its limit. HA is another concern. This is also what
> our team is looking into at this moment, since scheduler is the biggest
> "bottleneck" identified by us so far (anyone has experience tuning
> scheduler performance?).
>
> *Broker for Celery Executor*:
> you may want to try RabbitMQ rather than Redis/SQL as broker? Actually the
> Celery community had the proposal to deprecate Redis as broker (of course
> this proposal was rejected eventually) [
> https://github.com/celery/celery/issues/3274].
>
>
> Regards,
> XD
>
>
>
>
>
> On Thu, Sep 6, 2018 at 6:10 PM ramandumcs@xxxxxxxxx <ramandumcs@xxxxxxxxx>
> wrote:
>
> > Hi,
> > We have a requirement to scale to run 1000(s) concurrent dags. With celery
> > executor we observed that
> > Airflow worker gets stuck sometimes if connection to redis/mysql breaks
> > (https://github.com/celery/celery/issues/3932
> > https://github.com/celery/celery/issues/4457)
> > Currently we are using Airflow 1.9 with LocalExecutor but planning to
> > switch to Airflow 1.10 with K8 Executor.
> >
> > Thanks,
> > Raman Gupta
> >
> >
> > On 2018/09/05 12:56:38, Deng Xiaodong <xd.deng.r@xxxxxxxxx> wrote:
> > > Hi folks,
> > >
> > > May you kindly share how your organization is setting up Airflow and
> > using
> > > it? Especially in terms of architecture. For example,
> > >
> > > - *Setting-Up*: Do you install Airflow in a "one-time" fashion, or
> > > containerization fashion?
> > > - *Executor:* Which executor are you using (*LocalExecutor*,
> > > *CeleryExecutor*, etc)? I believe most production environments are using
> > > *CeleryExecutor*?
> > > - *Scale*: If using Celery, normally how many worker nodes do you add?
> > (for
> > > sure this is up to workloads and performance of your worker nodes).
> > > - *Queue*: if Queue feature
> > > <https://airflow.apache.org/concepts.html#queues> is used in your
> > > architecture? For what advantage? (for example, explicitly assign
> > > network-bound tasks to a worker node whose parallelism can be much higher
> > > than its # of cores)
> > > - *SLA*: do you have any SLA for your scheduling? (this is inspired by
> > > @yrqls21's PR 3830 <
> > https://github.com/apache/incubator-airflow/pull/3830>)
> > > - etc.
> > >
> > > Airflow's setting-up can be quite flexible, but I believe there is some
> > > sort of best practice, especially in the organisations where scalability
> > is
> > > essential.
> > >
> > > Thanks for sharing in advance!
> > >
> > >
> > > Best regards,
> > > XD
> > >
> >
>