Re: [DISCUSS] CloudStack graceful shutdown


Andrija

This is the reason for this enhancement: snapshots, migrations and others
are all async jobs, and therefore should be tracked in the async_job table
under a specific MS. It is known that they may take a while to complete, and
the last thing we want is to interrupt them.

Depending on the value you have set in Configurations, a job may time out
but continue working in the background, meaning CloudStack will stop
tracking the async job beyond a specific interval, but the CloudStack agent
will push forward.

I don't see any harm in taking the server offline if there are no jobs
being tracked.

However, we should not stop the server if we identify any jobs that are
still active. The user can decide to append the forceful shutdown after the
graceful one if they feel like it. For example:

[shell] # service cloudstack-management graceful-shutdown; service cloudstack-management shutdown

For your issue, please check the value of "job.cancel.threshold.minutes":

      "category": "Advanced",
      "description": "Time (in minutes) for async-jobs to be forcely cancelled if it has been in process for long",
      "name": "job.cancel.threshold.minutes",
      "value": "60"


I propose that the graceful shutdown command use
"job.cancel.threshold.minutes" as the maximum time to wait before giving up
on the endeavor.
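
A minimal sketch of that bounded wait, to make the idea concrete. The names
wait_for_jobs and count_active_jobs are hypothetical, the "cloud" database
name and a mysql client that can log in without prompting are assumptions,
and the filter columns are the ones from the workflow quoted further down.

```shell
# Sketch only: both function names are hypothetical, not part of CloudStack.

# count_active_jobs MSID: print how many tracked async jobs are still running.
# Assumes the DB is named "cloud" and mysql can connect without a prompt.
count_active_jobs() {
    mysql -N -B -e "SELECT COUNT(*) FROM async_job
        WHERE job_status = 0
          AND job_dispatcher NOT LIKE 'pseudoJobDispatcher'
          AND job_init_msid = $1" cloud
}

# wait_for_jobs TIMEOUT_SECONDS POLL_SECONDS COMMAND [ARGS...]
# COMMAND must print the number of active jobs.
# Returns 0 once COMMAND prints 0, 1 if the timeout expires first,
# and 2 if COMMAND itself fails.
wait_for_jobs() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while :; do
        active=$("$@") || return 2
        [ "$active" -eq 0 ] && return 0
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}
```

With the threshold at 60 minutes, a call could look like
wait_for_jobs $((60 * 60)) 30 count_active_jobs "$MSID", and the service
would only be stopped when that returns 0.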


The only issue I'm on the fence about is blocking access to 8080/8443 if
you have a single-node setup.


There is a chance you may block access to CloudStack for over an hour, and
that may not be what you intended.


Perhaps we could add a parameter in db.properties, such as
"graceful.shutdown.block.api.server = true/false"

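If such a flag were added, a shutdown script could read it roughly like
this. It is a sketch only: get_property is a hypothetical helper, the key is
just the name proposed above, and the db.properties path in the usage note
is an assumption about the install layout.

```shell
# get_property FILE KEY DEFAULT
# Print the value of KEY from a Java-style properties file, or DEFAULT when
# the file is unreadable or the key is absent. Hypothetical helper.
get_property() {
    if [ ! -r "$1" ]; then
        printf '%s\n' "$3"
        return 0
    fi
    awk -F= -v key="$2" -v def="$3" '
        {
            # Trim whitespace around the key part of "key = value".
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)
        }
        k == key {
            # Everything after the first "=" is the value; trim it too.
            v = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            found = 1
        }
        END { print (found ? v : def) }
    ' "$1"
}
```

For example,
get_property /etc/cloudstack/management/db.properties graceful.shutdown.block.api.server true
would fall back to "true" when the key is not set, so blocking stays the
default and single-node setups can opt out.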

Regards,

ilya

On Wed, Apr 4, 2018 at 2:22 PM, Andrija Panic <andrija.panic@xxxxxxxxx>
wrote:

> One comment here (I had to shut down a whole DC for a few hours recently...):
> please at least consider the snapshotting process as a special case. It
> can really take a few hours for a snapshot to complete (the copy process
> from Primary to Secondary Storage).
>
> In my recent unfortunate DC shutdown I did actually stop the MS (we also
> have a script to identify running async jobs, so we stop it once it is
> safe), but any running qemu-img processes (we use KVM) need to be killed
> manually (Ansible) after the MS is stopped, etc.
>
> I assume most jobs take a reasonably long time to complete, but snapshots
> are probably the biggest exception, as they can take an extremely long
> time to complete...
>
> Cheers
>
> On 4 April 2018 at 22:46, Tutkowski, Mike <Mike.Tutkowski@xxxxxxxxxx>
> wrote:
>
> > I may be remembering this incorrectly, but from what I recall, if a
> > resource is owned by one MS and a request related to that resource comes
> > in to another MS, the MS that received the request passes it on to the
> > other MS.
> >
> > > On Apr 4, 2018, at 2:36 PM, Rafael Weingärtner <rafaelweingartner@xxxxxxxxx> wrote:
> > >
> > > Big +1 for this feature; I only have a few doubts.
> > >
> > > * Regarding the tasks/jobs that management servers (MSs) execute: do
> > > these tasks originate from requests that come to that MS, or is it
> > > possible for requests received by one management server to be executed
> > > by another? I mean, if I execute a request against MS1, will this
> > > request always be executed/treated by MS1, or is it possible that it is
> > > executed by another MS (e.g. MS2)?
> > >
> > > * I would suggest that after we block traffic coming from
> > > 8080/8443/8250 (we will need to block this as well, right?), we log the
> > > execution of tasks. I mean, something saying: there are XXX tasks
> > > (enumerate tasks) still being executed; we will wait for them to finish
> > > before shutting down.
> > >
> > > * The timeout (60 minutes suggested) could be a global setting that we
> > > can load before executing the graceful-shutdown.
> > >
> > > On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <ilya.mailing.lists@xxxxxxxxx> wrote:
> > >
> > > >> Use case:
> > > >> In any environment, from time to time, an administrator needs to
> > > >> perform maintenance. The current stop sequence of the CloudStack
> > > >> management server ignores the fact that there may be long-running
> > > >> async jobs and terminates the process. This in turn can create a poor
> > > >> user experience and occasional inconsistency in the CloudStack DB.
> > >>
> > > >> This is especially painful in large environments where the user has
> > > >> thousands of nodes and continuous patching happens around the clock,
> > > >> requiring migration of workloads from one node to another.
> > >>
> > > >> With that said, I've created a script that monitors the async job
> > > >> queue for a given MS and waits for it to complete all jobs. More
> > > >> details are posted below.
> > >>
> > > >> I'd like to introduce "graceful-shutdown" into the systemctl/service
> > > >> of the cloudstack-management service.
> > > >>
> > > >> The details of how it will work are below:
> > >>
> > > >> Workflow for graceful shutdown:
> > > >>  Using iptables/firewalld, block any connection attempts on 8080/8443
> > > >> (we can identify the ports dynamically)
> > > >>  Identify the MSID for the node; using the proper MSID, query the
> > > >> async_job table for
> > > >> 1) any jobs that are still running (i.e. job_status="0")
> > > >> 2) job_dispatcher not like "pseudoJobDispatcher"
> > > >> 3) job_init_msid=$my_ms_id
> > > >>
> > > >> Monitor the async_job table for up to 60 minutes, until all async jobs
> > > >> for the MSID are done, then proceed with shutdown
> > > >>    If it fails for any reason or is terminated, catch the exit via the
> > > >> trap command and unblock 8080/8443
> > >>
> > >> Comments are welcome
> > >>
> > >> Regards,
> > >> ilya
> > >>
> > >
> > >
> > >
> > > --
> > > Rafael Weingärtner
> >
>
>
>
> --
>
> Andrija Panić
>