Big +1 for this feature; I only have a few doubts.

* Regarding the tasks/jobs that management servers (MSs) execute: do these
tasks originate from requests that come to the MS, or is it possible for
requests received by one management server to be executed by another? I mean,
if I execute a request against MS1, will this request always be
executed/treated by MS1, or is it possible that this request is executed
by another MS (e.g. MS2)?

* I would suggest that after we block traffic coming in on 8080/8443/8250 (we
will need to block this one as well, right?), we log the execution of tasks.
I mean, something saying: there are XXX tasks (enumerate tasks) still being
executed; we will wait for them to finish before shutting down.

* The timeout (60 minutes suggested) could be a global setting that we
load before executing the graceful shutdown.
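
To make the last two points concrete, here is a rough sketch in Python of a
wait loop that logs the remaining tasks and honours a configurable timeout.
Everything named here is an assumption for illustration only: `wait_for_jobs`,
the `list_running_jobs` callback (which would wrap whatever query the script
uses), and the polling interval are not part of CloudStack or of the script
described in this thread.

```python
import logging
import time
from typing import Callable, List

log = logging.getLogger("graceful-shutdown")


def wait_for_jobs(
    list_running_jobs: Callable[[], List[str]],  # hypothetical: returns names/ids of running jobs
    timeout_seconds: float = 60 * 60,            # the 60 minutes suggested; could come from a global setting
    poll_seconds: float = 10.0,                  # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once no jobs remain, False if the timeout expires first."""
    deadline = clock() + timeout_seconds
    while True:
        jobs = list_running_jobs()
        if not jobs:
            log.info("No tasks still running; proceeding with shutdown.")
            return True
        if clock() >= deadline:
            log.warning(
                "Timed out with %d task(s) still running: %s",
                len(jobs), ", ".join(jobs),
            )
            return False
        # Enumerate the tasks we are still waiting on, as suggested above.
        log.info(
            "There are %d task(s) still being executed (%s); "
            "waiting for them to finish before shutting down.",
            len(jobs), ", ".join(jobs),
        )
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters are only there so the loop can be exercised
without really waiting; what to do on a `False` result (abort the shutdown or
force it) is left to the caller.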

On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <ilya.mailing.lists@xxxxxxxxx>
wrote:

> Use case:
> In any environment, from time to time, an administrator needs to perform
> maintenance. The current stop sequence of the cloudstack management server will
> ignore the fact that there may be long-running async jobs and terminate
> the process. This in turn can create a poor user experience and occasional
> inconsistency in the cloudstack db.
> This is especially painful in large environments where the user has
> thousands of nodes and there is a continuous patching that happens around
> the clock - that requires migration of workload from one node to another.
> With that said, I've created a script that monitors the async job queue
> for a given MS and waits for it to complete all jobs. More details are posted
> below.
> I'd like to introduce "graceful-shutdown" into the systemctl/service of
> the cloudstack-management service.
> The details of how it will work are below:
> Workflow for graceful shutdown:
>   Using iptables/firewalld - block any connection attempts on 8080/8443 (we
> can identify the ports dynamically)
>   Identify the MSID for the node, using the proper msid - query async_job
> table for
> 1) any jobs that are still running (or job_status="0")
> 2) job_dispatcher not like "pseudoJobDispatcher"
> 3) job_init_msid=$my_ms_id
> Monitor this async_job table for 60 minutes - until all async jobs for MSID
> are done, then proceed with shutdown
>     If failed for any reason or terminated, catch the exit via trap command
> and unblock the 8080/8443
> Comments are welcome
> Regards,
> ilya
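
For reference, the three conditions quoted above could map onto a query along
these lines. This is only a sketch to check my understanding: it runs against
an in-memory SQLite table rather than the real CloudStack database, the `id`
column, the dispatcher value "someDispatcher" and the MSIDs are made up for
the demo, and only `async_job`, `job_status`, `job_dispatcher` and
`job_init_msid` come from the description.

```python
import sqlite3

# Conditions 1-3 from the quoted workflow; "?" is SQLite's placeholder style.
PENDING_JOBS_SQL = """
SELECT id FROM async_job
WHERE job_status = 0
  AND job_dispatcher NOT LIKE 'pseudoJobDispatcher'
  AND job_init_msid = ?
ORDER BY id
"""


def pending_job_ids(conn: sqlite3.Connection, msid: int) -> list:
    """Return ids of jobs still running that were initiated by the given MS."""
    return [row[0] for row in conn.execute(PENDING_JOBS_SQL, (msid,))]


def demo() -> list:
    # Toy stand-in for the async_job table, with illustrative rows only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE async_job ("
        "id INTEGER PRIMARY KEY, job_status INTEGER, "
        "job_dispatcher TEXT, job_init_msid INTEGER)"
    )
    conn.executemany(
        "INSERT INTO async_job VALUES (?, ?, ?, ?)",
        [
            (1, 0, "someDispatcher", 100),       # running, ours: counted
            (2, 1, "someDispatcher", 100),       # already finished
            (3, 0, "pseudoJobDispatcher", 100),  # excluded dispatcher
            (4, 0, "someDispatcher", 200),       # belongs to another MS
        ],
    )
    return pending_job_ids(conn, 100)
```

This also ties into my first question: filtering on job_init_msid only covers
every job an MS is running if a job is always executed by the MS that
received the request.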

