
Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks


Also worth mentioning that when you restart the scheduler it will use etcd
and postgres to recreate state, so you won't end up re-launching or
missing tasks.

On Thu, Aug 30, 2018, 12:54 PM Eamon Keane <eamon.keane1@xxxxxxxxx> wrote:

> Great, I must give pgbouncer a try. Testing on GKE/Cloud SQL I quickly ran
> into that limit. The next possible limit might be etcd, as pod creation is
> expensive, so if there were a lot of short-lived pods you might run into
> issues (e.g. the k8s API refusing connections), or so a Google SRE tells me.
>
> On Thu, Aug 30, 2018 at 8:21 PM Greg Neiheisel <greg@xxxxxxxxxxxxx> wrote:
>
> > Yep, that should work fine. Pgbouncer is pretty configurable, so you can
> > play around with different settings for your environment. You can set
> > limits on the number of connections you want to the actual database and
> > point your AIRFLOW__CORE__SQL_ALCHEMY_CONN to the pgbouncer service. In
> > my experience, you can get away with a pretty low number of actual
> > connections to postgres. Pgbouncer has some tools to observe the count of
> > clients (airflow processes), the number of actual connections to the
> > database, as well as the number of waiting clients. You should be able to
> > tune your max_connections to the point where you have little to no
> > clients waiting, while using a dramatically lower number of actual
> > connections to postgres.
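The pooling setup described above can be sketched as follows. This is an illustration, not the chart's actual configuration: the service name `airflow-pgbouncer`, port 6543, credentials, and the sizing numbers are all assumptions to adapt to your own environment.

```python
# Sketch of pointing Airflow at pgbouncer instead of postgres directly.
# Hostname, port, credentials, and sizing below are illustrative only.
import os

# All airflow processes connect to pgbouncer, not to postgres itself.
os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = (
    "postgresql://airflow:secret@airflow-pgbouncer:6543/airflow"
)

# The point of the exercise: many clients multiplexed onto few real
# connections. Tune until few clients are waiting while the server-side
# pool stays small.
max_client_conn = 500    # airflow processes allowed to connect to pgbouncer
default_pool_size = 20   # actual connections pgbouncer holds to postgres
assert default_pool_size < max_client_conn
```

The client-side limit can be generous because most airflow processes hold a connection open while doing little with it; pgbouncer hands out a real postgres connection only when a client actually runs a query.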
> >
> > That chart also deploys a sidecar to pgbouncer that exports the metrics
> > for Prometheus to scrape. Here's an example Grafana dashboard that we use
> > to keep an eye on things -
> > https://github.com/astronomerio/astronomer/blob/master/docker/vendor/grafana/include/pgbouncer-stats.json
> >
> > On Thu, Aug 30, 2018 at 2:26 PM Eamon Keane <eamon.keane1@xxxxxxxxx> wrote:
> >
> > > Interesting, Greg. Do you know if using pgbouncer would allow you to
> > > have more than 100 running k8s executor tasks at one time if e.g. there
> > > is a 100-connection limit on the gcp instance?
> > >
> > > On Thu, Aug 30, 2018 at 6:39 PM Greg Neiheisel <greg@xxxxxxxxxxxxx> wrote:
> > >
> > > > Good point Eamon, maxing out connections is definitely something to
> > > > look out for. We recently added pgbouncer to our helm charts to pool
> > > > connections to the database for all the different airflow processes.
> > > > Here's our chart for reference -
> > > > https://github.com/astronomerio/helm.astronomer.io/tree/master/charts/airflow
> > > >
> > > > On Thu, Aug 30, 2018 at 1:17 PM Kyle Hamlin <hamlin.kn@xxxxxxxxx> wrote:
> > > >
> > > > > Thanks for your responses! Glad to hear that tasks can run
> > > > > independently if something happens.
> > > > >
> > > > > On Thu, Aug 30, 2018 at 1:13 PM Eamon Keane <eamon.keane1@xxxxxxxxx> wrote:
> > > > >
> > > > > > Adding to Greg's point, if you're using the k8s executor and for
> > > > > > some reason the k8s executor worker pod fails to launch within 120
> > > > > > seconds (e.g. pending due to scaling up a new node), this counts
> > > > > > as a task failure. Also, if the k8s executor pod has already
> > > > > > launched a pod operator but is killed (e.g. manually or due to a
> > > > > > node upgrade), the pod operator it launched is not killed and runs
> > > > > > to completion, so if you're using retries, you need to ensure
> > > > > > idempotency. As I understand it, the worker pods each update the
> > > > > > db, requiring a separate connection, so this can tax your
> > > > > > connection budget (100-300 for small postgres instances on gcp or
> > > > > > aws).
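The idempotency point above is worth making concrete: if a retry can re-run work whose first attempt actually completed, keying writes on a stable run identifier makes the retry harmless. A minimal sketch, using sqlite only as a stand-in and with hypothetical table and column names:

```python
# Minimal sketch of an idempotent task write: retries keyed on a stable
# run_id overwrite the same row instead of duplicating it. The table,
# column names, and values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (run_id TEXT PRIMARY KEY, value INTEGER)")

def run_task(run_id, value):
    # INSERT OR REPLACE makes a second attempt safe: same run_id, same row.
    conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (run_id, value))
    conn.commit()

run_task("2018-08-30", 42)
run_task("2018-08-30", 42)   # retry after a killed executor pod: no duplicate
count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]  # -> 1
```

The same pattern applies to any external side effect a pod operator performs: write to a deterministic key (run date, task id) so a second completion overwrites rather than appends.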
> > > > > >
> > > > > > On Thu, Aug 30, 2018 at 6:04 PM Greg Neiheisel <greg@xxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > > Hey Kyle, the task pods will continue to run even if you reboot
> > > > > > > the scheduler and webserver, and the status does get updated in
> > > > > > > the airflow db, which is great.
> > > > > > >
> > > > > > > I know the scheduler subscribes to the Kubernetes watch API to
> > > > > > > get an event stream of pods completing, and it keeps a
> > > > > > > checkpoint so it can resubscribe when it comes back up.
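The resubscribe-from-checkpoint idea can be illustrated without a cluster: each watch event carries a monotonically increasing resourceVersion, and the consumer records the last one it processed so a restart picks up where it left off. This is a simplification for illustration only; the real logic lives inside Airflow's kubernetes executor.

```python
# Simplified model of the watch-and-checkpoint pattern: skip events at or
# below the saved resourceVersion, advance the checkpoint as events arrive.
def process_events(events, checkpoint=None):
    completed = []
    for ev in events:
        if checkpoint is not None and ev["resourceVersion"] <= checkpoint:
            continue  # already handled before the restart
        if ev["phase"] in ("Succeeded", "Failed"):
            completed.append(ev["pod"])  # e.g. record final task state in the db
        checkpoint = ev["resourceVersion"]
    return completed, checkpoint

events = [
    {"pod": "task-a", "phase": "Succeeded", "resourceVersion": 101},
    {"pod": "task-b", "phase": "Running",   "resourceVersion": 102},
    {"pod": "task-b", "phase": "Failed",    "resourceVersion": 103},
]
done, cp = process_events(events)        # fresh subscription sees all events
done2, cp2 = process_events(events, cp)  # after "restart": nothing re-handled
```

Because the checkpoint survives the restart, a redeployed scheduler neither misses completions that happened while it was down nor processes the same completion twice.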
> > > > > > >
> > > > > > > I forget whether the worker pods update the db or the scheduler
> > > > > > > does that, but it should work out.
> > > > > > >
> > > > > > > On Thu, Aug 30, 2018, 9:54 AM Kyle Hamlin <hamlin.kn@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > gentle bump
> > > > > > > >
> > > > > > > > On Wed, Aug 22, 2018 at 5:12 PM Kyle Hamlin <hamlin.kn@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > > I'm about to make the switch to Kubernetes with Airflow,
> > > > > > > > > but am wondering what happens when my CI/CD pipeline
> > > > > > > > > redeploys the webserver and scheduler and there are still
> > > > > > > > > long-running tasks (pods). My intuition is that since the
> > > > > > > > > database holds all state, the tasks are in charge of
> > > > > > > > > updating their own state, and the UI only renders what it
> > > > > > > > > sees in the database, this is not so much of a problem. To
> > > > > > > > > be sure, however, here are my questions:
> > > > > > > > >
> > > > > > > > > Will task pods continue to run?
> > > > > > > > > Can task pods continue to poll the external system they are
> > > > > > > > > running tasks on while being "headless"?
> > > > > > > > > Can the task pods change/update state in the database while
> > > > > > > > > being "headless"?
> > > > > > > > > Will the UI/Scheduler still be aware of the tasks (pods)
> > > > > > > > > once they are live again?
> > > > > > > > >
> > > > > > > > > Is there anything else that might cause issues when
> > > > > > > > > deploying while tasks (pods) are running that I'm not
> > > > > > > > > thinking of here?
> > > > > > > > >
> > > > > > > > > Kyle Hamlin
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Kyle Hamlin
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > *Greg Neiheisel* / CTO Astronomer.io
> > > >
> > >
> >
> >
> >
>