Re: Airflow - High Availability and Scale Up vs Scale Out


We are using AWS ECS to deploy Airflow, and we rely on it to get some degree of high availability and to scale the workers.

We have defined 3 ECS services: scheduler / webserver / worker.

The scheduler and webserver each run in a single container.

The worker service can scale to as many containers as we want; we currently have 3 workers running within the worker service.
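
Scaling the workers is just a matter of changing the service's desired count. A minimal sketch of that call with boto3 (the region, cluster and service names here are placeholders, not our real ones):

    # scale_workers.py -- minimal sketch; region, cluster and service
    # names are placeholders.
    import boto3

    ecs = boto3.client("ecs", region_name="eu-west-1")

    # Ask ECS to keep 3 worker containers running; the service converges
    # to this count and replaces any container that dies.
    ecs.update_service(
        cluster="airflow-cluster",
        service="airflow-worker",
        desiredCount=3,
    )
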
We use the ECS service scheduler to make sure there is always one Airflow scheduler running. In fact, we start the Airflow scheduler with the run-duration param set to 10 minutes, so it gets restarted continuously by ECS.
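
The scheduler container's entrypoint boils down to roughly the following (a minimal sketch; the wrapper script is illustrative, and the 600 seconds match the 10-minute run-duration mentioned above):

    # scheduler_entrypoint.py -- minimal sketch of the restart trick
    import subprocess
    import sys

    # Airflow 1.x: --run-duration makes the scheduler exit cleanly after
    # N seconds; the ECS service scheduler then starts a fresh container,
    # which gives a continuous, supervised restart.
    sys.exit(subprocess.call(["airflow", "scheduler", "--run-duration", "600"]))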

We have also defined health check endpoints to check the health of all the Airflow processes. For instance, we check the health of the scheduler with a system-health DAG that spins up 3 dummy tasks, which write some logs to S3 and fire an event to New Relic. The scheduler health check endpoint just checks that there is a task instance log for the last DagRun, and we use the New Relic sys_health events to define alerts. ECS uses the health check endpoints to verify that each of the Airflow ECS services is healthy.
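
As a rough sketch, such a scheduler endpoint could be implemented against the Airflow 1.x metadata models as follows (the route, the sys_health dag id and the exact freshness criterion are illustrative; our real check inspects the task instance logs rather than task state):

    # healthcheck.py -- illustrative sketch, not our production code
    from flask import Flask
    from airflow import settings
    from airflow.models import DagRun, TaskInstance

    app = Flask(__name__)

    @app.route("/health/scheduler")
    def scheduler_health():
        session = settings.Session()
        try:
            # Find the most recent run of the system-health DAG.
            last_run = (
                session.query(DagRun)
                .filter(DagRun.dag_id == "sys_health")
                .order_by(DagRun.execution_date.desc())
                .first()
            )
            if last_run is None:
                return "no sys_health dagrun found", 503
            # Healthy if that run produced at least one successful task
            # instance, i.e. the scheduler is still scheduling work.
            done = (
                session.query(TaskInstance)
                .filter(
                    TaskInstance.dag_id == "sys_health",
                    TaskInstance.execution_date == last_run.execution_date,
                    TaskInstance.state == "success",
                )
                .count()
            )
            return ("ok", 200) if done else ("scheduler stale", 503)
        finally:
            session.close()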

We also deploy our DAGs inside the Docker image when it is built, so we have an immutable image. Rebuilding the image and redeploying the whole Airflow cluster for a small DAG change is not ideal, but it is simpler than having to deal with mounted volumes. We put our logs on S3, so we don't mind killing containers so often.
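
The S3 logging itself is just Airflow config, which can be injected as environment variables in the ECS task definition so the image stays immutable. A hedged sketch with boto3 (the family, image and bucket are placeholders, and the exact config keys vary slightly across Airflow 1.x versions):

    # register_worker_taskdef.py -- illustrative sketch; values are
    # placeholders and this is not a complete task definition.
    import boto3

    ecs = boto3.client("ecs", region_name="eu-west-1")

    ecs.register_task_definition(
        family="airflow-worker",
        containerDefinitions=[{
            "name": "worker",
            "image": "example.dkr.ecr.eu-west-1.amazonaws.com/airflow:latest",
            "memory": 2048,
            "command": ["airflow", "worker"],
            # Airflow reads AIRFLOW__<SECTION>__<KEY> environment variables
            # as config overrides, so no airflow.cfg edits are needed.
            "environment": [
                {"name": "AIRFLOW__CORE__REMOTE_LOGGING", "value": "True"},
                {"name": "AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER",
                 "value": "s3://example-airflow-logs/"},
            ],
        }],
    )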

It's working fine so far, but we have just started (we plan to migrate a few hundred DAGs from another workflow tool) and only have a few DAGs on Airflow, so I don't know if we will keep this setup once we have a few dozen DAG changes every day.

Regards,
Yacine

On 10/06/2018, 21:04, "Ali Uz" <aliuz1@xxxxxxxxx> wrote:

    We also run one beefy box in AWS ECS, with the scheduler and webserver
    running in the same container. However, we have run into issues with this
    approach, as the scheduler does fail at times and our DAGs get stuck until
    we manually restart the container.
    What approaches do you guys use to restart the scheduler automatically when
    it's stuck or has failed?

    - Ali

    On Sun, Jun 10, 2018 at 8:44 PM Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:

    > If you are running on one big box, you most certainly want to put the
    > scheduler in its own cgroup and run the tasks, with sudo, in their own.
    > Otherwise your availability might suffer.
    >
    > B.
    >
    > Sent from my iPad
    >
    > > On 10 Jun 2018 at 16:30, Sam Sen <sxs@xxxxxxxxxxxxxxxx> wrote:
    > >
    > > Wouldn't you want immutable containers? In that case, baking the code
    > > into the container would be more ideal.
    > >
    > >> On Sun, Jun 10, 2018, 9:53 AM Arash Soheili <tonyarash@xxxxxxxxx>
    > wrote:
    > >>
    > >> We are just starting out, but our setup is 2 EC2 instances, with one
    > >> running the web server and scheduler and the other running multiple
    > >> workers. The database is an RDS instance which both are connected to,
    > >> as well as Redis on AWS ElastiCache for the Celery connection.
    > >>
    > >> All 4 services run in containers with systemd, and we use CodeDeploy
    > >> and sync up the code by mapping volumes from the local filesystem into
    > >> the containers. We are not yet heavy users of Airflow, so I can't
    > >> speak to performance and scale-up just yet.
    > >>
    > >> In general I think an AMI with baked-in code can be brittle and hard
    > >> to maintain and update. Containers are the way to go, as you can bake
    > >> the code into the image if you want. We have chosen not to do that,
    > >> and instead rely on volume mapping to update the latest code in the
    > >> container. This makes it easier in that you don't need to keep
    > >> creating new images.
    > >>
    > >> Arash
    > >>
    > >>> On Sat, Jun 9, 2018 at 9:47 AM Naik Kaxil <k.naik@xxxxxxxxx> wrote:
    > >>>
    > >>> Let us know your findings after trying the beefy box approach.
    > >>>
    > >>> On 08/06/2018, 12:24, "Sam Sen" <sxs@xxxxxxxxxxxxxxxx> wrote:
    > >>>
    > >>>    We are facing this now. We have tried the CeleryExecutor, and it
    > >>>    adds more moving parts. While we have not thrown out this idea, we
    > >>>    are going to give one big beefy box a try.
    > >>>
    > >>>    To handle the HA side of things, we are putting the server in an
    > >>>    auto-scaling group (we use AWS) with a min and max of 1 server. We
    > >>>    deploy from an AMI that has Airflow baked in, and we point the DB
    > >>>    config to an RDS using service discovery (Consul).
    > >>>
    > >>>    As for the DAG code, we can either bake it into the AMI as well or
    > >>>    install it on boot-up. We haven't decided what to do for this, but
    > >>>    either way we realize it could take a few minutes to fully recover
    > >>>    in the event of a catastrophe.
    > >>>
    > >>>    The other option is to have a standby server if using Celery isn't
    > >>>    ideal. With that in mind, I have tried using HashiCorp Nomad to
    > >>>    handle the services. In my limited trial it did what we wanted,
    > >>>    but we need more time to test.
    > >>>
    > >>>>    On Fri, Jun 8, 2018, 4:23 AM Naik Kaxil <k.naik@xxxxxxxxx> wrote:
    > >>>>
    > >>>> Hi guys,
    > >>>>
    > >>>>
    > >>>>
    > >>>> I have 2 specific questions for those using Airflow in production:
    > >>>>
    > >>>>   1. How have you achieved high availability? What does the
    > >>>>      architecture look like? Do you replicate the master node as
    > >>>>      well?
    > >>>>   2. Scale up vs scale out?
    > >>>>      1. What is the preferred approach you take: 1 beefy Airflow VM
    > >>>>         with worker, scheduler and webserver using the Local
    > >>>>         Executor, or a cluster with multiple workers using the
    > >>>>         Celery Executor?
    > >>>>
    > >>>> I think this thread should help others with similar questions as
    > >>>> well.
    > >>>>
    > >>>>
    > >>>>
    > >>>>
    > >>>>
    > >>>> Regards,
    > >>>>
    > >>>> Kaxil
    > >>>>
    > >>>>
    > >>>>
    > >>>>
    > >>>> Kaxil Naik
    > >>>>
    > >>>> Data Reply
    > >>>> 2nd Floor, Nova South
    > >>>> 160 Victoria Street, Westminster
    > >>>> London SW1E 5LB - UK
    > >>>> phone: +44 (0)20 7730 6000
    > >>>> k.naik@xxxxxxxxx
    > >>>> www.reply.com
    > >>>>
    > >>>>
    > >>>
    > >>>
    > >>
    >

