Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?


Hi Rohit,

When the Management Server and Agent are up and running and there is a network
failure, I think it is better to wait for some time for the pending tasks
to complete, instead of failing them and trying to reconnect. If the network
delay is minimal, there may still be a valid thread/context in the management
server to handle the answers.

It would be great if there are no major side-effects from this PR's changes.

Thanks,
Suresh

On Wed, May 16, 2018 at 3:40 PM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
wrote:

> All,
>
>
> Based on testing against KVM, XenServer and VMware and this discussion,
> I'll merge the PR based on code reviews and tests. I investigated both
> code-wise and against a live environment for possible side-effects of letting
> the agent connect without being blocked on pending tasks and I found no new
> fault behaviour.
>
>
> If there are any objections or bugs, please share them, in which case we'll
> revert the change and continue with the legacy/historic behaviour. Thanks.
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ________________________________
> From: Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> Sent: Tuesday, May 15, 2018 2:37:58 PM
> To: dev@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt
> server) disconnection?
>
> Hi Suresh,
>
>
> I've replied to your comment on the PR. In addition, when (i) the management
> server is restarted, any pending operation on the KVM/SSVM agent side will
> fail to be communicated back in the correct thread/context, and it depends
> on the specific feature whether it supports a sync or cleanup mechanism. In
> most cases, the async/job timeout may kick in or cause the queue/concurrent
> failure seen in logs. When (ii) the agent is reconnected, it reconnects only
> after any pending job finishes; therefore such jobs finish and fail to be
> communicated back to the mgmt server (the answer instance fails to be
> sent on the link, as the link is no longer valid, which causes an exception).
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ________________________________
> From: Suresh Kumar Anaparti <sureshkumar.anaparti@xxxxxxxxx>
> Sent: Tuesday, May 15, 2018 12:06:14 AM
> To: dev@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt
> server) disconnection?
>
> Hi,
>
> @rhtyd, I checked the PR changes. Good that the agent is not waiting for
> the pending jobs and retries the connection to the management server. This
> might have an impact on ssvm and kvm agent tasks, not much on cpvm. Is there
> any sync or cleanup mechanism for Volumes/VMs to address the failed/pending
> agent jobs after (i) a management server restart and (ii) the agent
> reconnecting?
>
> -Suresh
>
> On Mon, May 14, 2018 at 8:05 PM, Marc-Aurèle Brothier <marco@xxxxxxxxxxx>
> wrote:
>
> > Correct about the thread context, so if the answer is coming into a
> > management server that doesn't have the context and drops it, it should be
> > fine then. The PR is then already a good improvement to let the agent
> > reconnect even when it's processing a long request, so it can keep
> > on completing other jobs too.
> >
> > Regarding the restart/shutdown operation, yes I have to push now the
> > changes to be able to stop some processing tasks (fetching new async jobs
> > mainly) on a management server to ensure a cleaner shutdown. My solution,
> > as said, is based on the content of a file that is compatible with
> > HAProxy, thus not the LB mechanism added recently in CS. It could be
> > changed for an API call to put a management server into, or move it out
> > of, maintenance. The listManagementServers API call has been merged and it
> > was a requirement for that.
> >
> > About Zookeeper, it's not used for the rolling shutdown/restart for now.
> > We are using it as an efficient and true lock mechanism between multiple
> > management servers. We are slowly moving the lock code towards ZK and
> > added one lock during the allocation phase to ensure no host would be
> > over-allocated. I will take this discussion to another email thread since I
> > have a few questions regarding ZK and also wish to talk about the
> > connection between the agent & management servers.
> >
> > On Mon, May 14, 2018 at 2:39 PM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> > wrote:
> >
> > > Thanks Marc and Rafael for replying.
> > >
> > >
> > > In my experimentation, when the agent disconnects it will wait for the
> > > pending jobs/tasks to complete, and on completion it creates an Answer
> > > instance and tries to send it using a `link` which no longer exists, and
> > > fails. This is the current behaviour; on the mgmt server side the
> > > resource/task will be left hanging and may not be automatically marked
> > > failed right away (maybe only after the configured timeout). My best
> > > guess is that applying the change should not have any side-effects,
> > > other than the exceptions/faults we already observe.
> > >
> > >
> > > In my test, the failed async job did not get retried and I hit the
> > > famous 'concurrency limit 1' issue. At this point, I had to manually
> > > clean up the snapshot row and the rows from sync_queue, sync_queue_item
> > > and async_job. On the agent side, the current implementation is that the
> > > mgmt server sends a cmd and the agent returns an answer after processing
> > > it. We don't have the same on the mgmt server side, where an agent sends
> > > a cmd's answer and the mgmt server processes it irrespective of the
> > > context. Therefore, unless the mgmt server receiving the answer is in
> > > the right thread/context/state, those answers are dropped.
> > >
> > >
> > > I think we need to solve for (1) claim and ownership management of a
> > > resource (how to manage when the owner/mgmt server shuts down or dies),
> > > (2) task handover: handing executing (in-flight) tasks to another mgmt
> > > server when a mgmt server is shut down, (3) a central locking service
> > > for this and other uses. The bigger change ties in with the other things
> > > we've seen in the discussion around mgmt server restart/shutdown. Until
> > > we get to solving the bigger issue, perhaps we can provide some
> > > API/visual/UI ways to show the root admin the async jobs in flight for a
> > > management server, or alert him; perhaps an API to do a cleaner mgmt
> > > server shutdown that waits for all pending async jobs on a mgmt server
> > > to complete and does not take any new async/job API requests (say, like
> > > Jenkins does with jobs)?
> > >
> > >
> > > Marc - weren't you working on a zookeeper based rolling shutdown/restart?
> > > Did that handle some of the failure cases?
> > >
> > >
> > > - Rohit
> > >
> > > <https://cloudstack.apache.org>
> > >
> > >
> > >
> > > ________________________________
> > > From: Marc-Aurèle Brothier <marco@xxxxxxxxxxx>
> > > Sent: Monday, May 14, 2018 4:06:56 PM
> > > To: dev@xxxxxxxxxxxxxxxxxxxxx
> > > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on
> > > (mgmt server) disconnection?
> > >
> > > Hi,
> > >
> > > I'm also for a bigger change but this PR already moves forward to
> > > better agent <-> management connection handling.
> > >
> > > @rhtyd did you test your PR manually by, for example, requesting a long
> > > snapshot operation and disconnecting the agent?
> > >
> > > I have one concern here: when an async job is taken from the DB by a
> > > management server (in a cluster configuration), the mgmt ID is put in
> > > the row to tell which mgmt is managing the job. On disconnection from an
> > > agent, the event is propagated, the job is marked as failed in the
> > > database, and an error is returned in the API for that command. Here we
> > > are only letting the agent reconnect quickly, but I'm unsure of what
> > > will happen in the mgmt when the job response is received by a mgmt
> > > (which might be another one than the one registered in the job db row).
> > > I know this is where it becomes complicated because one async job might
> > > be only one part of a bigger scenario for a command (like a live
> > > migration). I just want to ensure it won't propagate further
> > > inconsistency.
> > >
> > > Marco
> > >
> > > On Sat, May 12, 2018 at 7:26 PM, Rafael Weingärtner <
> > > rafaelweingartner@xxxxxxxxx> wrote:
> > >
> > > > I would prefer “A bigger design fix would be to make management server
> > > > asynchronous of agent side answer/response handling”. However, I
> > > > understand the volume of changes that requires.
> > > >
> > > > I looked at the PR, and I think that everything is ok there. Of
> > > > course, I think we might need some more time to review and think about
> > > > the possible outcomes of such changes.
> > > >
> > > > On Fri, May 11, 2018 at 7:55 AM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> > > > wrote:
> > > >
> > > > > All,
> > > > >
> > > > >
> > > > > Historically, when the agent (kvm, ssvm, cpvm) is disconnected from
> > > > > the management server (say due to a mgmt server restart etc), the
> > > > > reconnection logic waits for any pending tasks/commands to complete
> > > > > before reconnection attempts are made. I tried to search the git
> > > > > history but could not find a reason; can anyone share why we may
> > > > > need this?
> > > > >
> > > > >
> > > > > Based on the reported issue:
> > > > >
> > > > > https://github.com/apache/cloudstack/issues/2633
> > > > >
> > > > >
> > > > > I've a working patch which removes this limitation:
> > > > >
> > > > > https://github.com/apache/cloudstack/pull/2638
> > > > >
> > > > >
> > > > > From testing with various combinations of tasks, I found that when
> > > > > that happens, even if the pending task succeeds, it fails to send an
> > > > > Answer to the mgmt server; therefore, from the control plane's
> > > > > perspective, that task is still pending/on-going.
> > > > >
> > > > >
> > > > > When the mgmt server comes back online, and the agent finally
> > > > > reconnects (depending on how long the pending task took), the
> > > > > executed operation is still pending in the mgmt server's view and
> > > > > may sometimes require manual cleanups in the database. By removing
> > > > > the limitation in the above PR, at least the agent reconnects faster
> > > > > while the failure/fault behaviours remain the same. A bigger design
> > > > > fix would be to make management server asynchronous of agent side
> > > > > answer/response handling.
> > > > >
> > > > >
> > > > > - Rohit
> > > > >
> > > > > <https://cloudstack.apache.org>
> > > > >
> > > > >
> > > > >
> > > > > rohit.yadav@xxxxxxxxxxxxx
> > > > > www.shapeblue.com<http://www.shapeblue.com>
> > > > > > 53 Chandos Place, Covent Garden, London  WC2N 4HS UK
> > > > > @shapeblue
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Rafael Weingärtner
> > > >
> > >

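For readers following the thread, the two reconnection orders discussed above can be compared with a small toy model. This is only an illustrative sketch and not CloudStack code: the `Link` and `Agent` classes, the `wait_for_pending` flag and the log strings are all hypothetical names made up for this example. It also assumes, as the thread reports, that the answer of a task that was in flight during the disconnect is lost in both cases.

```python
class Link:
    """Hypothetical stand-in for the agent's connection to a mgmt server."""

    def __init__(self, name):
        self.name = name
        self.open = True

    def send(self, answer):
        # A closed link cannot carry an answer back to the mgmt server.
        if not self.open:
            raise ConnectionError(f"link {self.name} is closed")
        return f"{answer} via {self.name}"


class Agent:
    """Toy agent; wait_for_pending=True models the legacy behaviour."""

    def __init__(self, wait_for_pending):
        self.wait_for_pending = wait_for_pending
        self.link = Link("link-1")
        self.pending = []  # (task name, link the command arrived on)
        self.log = []

    def receive(self, task):
        self.pending.append((task, self.link))

    def on_disconnect(self):
        self.link.open = False
        if self.wait_for_pending:
            # Legacy order: finish pending work before reconnecting.
            self.finish_pending()
        self.link = Link("link-2")
        self.log.append("reconnected")
        if not self.wait_for_pending:
            # Patched order: reconnect first, pending work finishes later.
            self.finish_pending()

    def finish_pending(self):
        for task, origin in self.pending:
            try:
                origin.send(f"Answer({task})")
                self.log.append(f"{task}: answer sent")
            except ConnectionError:
                self.log.append(f"{task}: answer dropped")
        self.pending.clear()


def run(wait_for_pending):
    agent = Agent(wait_for_pending)
    agent.receive("snapshot")
    agent.on_disconnect()
    return agent.log


print(run(True))   # legacy order
print(run(False))  # order with the change applied
```

In this model the answer is dropped in both runs; the only difference is whether "reconnected" is logged before or after the pending task finishes, which mirrors the claim that the agent reconnects faster while the failure behaviour stays the same.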
