
[ops] [nova] [placement] Mismatch between allocations and instances


On Fri, Jul 5, 2019 at 10:21 PM Matt Riedemann <mriedemos at gmail.com> wrote:

> On 7/5/2019 1:45 AM, Massimo Sgaravatto wrote:
> > I tried to check the allocations on each compute node of an Ocata cloud,
> > using the command:
> >
> > curl -s ${PLACEMENT_ENDPOINT}/resource_providers/${UUID}/allocations -H
> > "x-auth-token: $TOKEN"  | python -m json.tool
> >
>
> Just FYI you can use osc-placement (openstack client plugin) for command
> line:
>
> https://docs.openstack.org/osc-placement/latest/index.html
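[Editor's note: whichever client is used, the JSON shape is the same. A
minimal sketch of pulling the consumer UUIDs out of an allocations
response; resp.json stands in for the curl output above, the payload
follows GET /resource_providers/{uuid}/allocations, and the UUID is
illustrative only.]

```shell
# Hedged sketch: extract consumer UUIDs from a saved allocations response.
# resp.json stands in for the curl output above; payload shape per
# GET /resource_providers/{uuid}/allocations, sample data only.
cat > resp.json <<'EOF'
{
    "allocations": {
        "11111111-1111-1111-1111-111111111111": {
            "resources": {"MEMORY_MB": 2048, "VCPU": 1}
        }
    },
    "resource_provider_generation": 5
}
EOF
# The keys of "allocations" are the consumer UUIDs (instances, or since
# Queens possibly migration records).
python3 -c 'import json; print("\n".join(json.load(open("resp.json"))["allocations"]))'
```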
>
> > I found that, on a few compute nodes, there are some instances for which
> > there is not a corresponding allocation.
>
> The heal_allocations command [1] might be able to find and fix these up
> for you. The bad news for you is that heal_allocations wasn't added
> until Rocky and you're on Ocata. The good news is you should be able to
> take the current version of the code from master (or stein) and run that
> in a container or virtual environment against your Ocata cloud (this
> would be particularly useful if you want to use the --dry-run or
> --instance options added in Train). You could also potentially backport
> those changes to your internal branch, or we could start a discussion
> upstream about backporting that tooling to stable branches - though
> going to Ocata might be a bit much at this point given Ocata and Pike
> are in extended maintenance mode [2].
>
> As for *why* the instances on those nodes are missing allocations, it's
> hard to say without debugging things. The allocation and resource
> tracking code has changed quite a bit since Ocata (in Pike the scheduler
> started creating the allocations but the resource tracker in the compute
> service could still overwrite those allocations if you had older nodes
> during a rolling upgrade). My guess would be a migration failed or there
> was just a bug in Ocata where we didn't cleanup or allocate properly.
> Again, heal_allocations should add the missing allocation for you if you
> can setup the environment to run that command.
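[Editor's note: before (or instead of) healing, the mismatch Massimo
describes can be spotted as a set difference between the instance UUIDs on
a host and the consumer UUIDs in placement. A minimal sketch with
illustrative sample data; in a real run instances.txt would come from
something like "openstack server list --host <node> -f value -c ID" and
consumers.txt from the allocations output above.]

```shell
# Hedged sketch: instances with no allocation, as a set difference.
# Sample UUIDs are illustrative only; see the note above for where the
# real lists would come from.
cat > instances.txt <<'EOF'
11111111-1111-1111-1111-111111111111
22222222-2222-2222-2222-222222222222
EOF
cat > consumers.txt <<'EOF'
11111111-1111-1111-1111-111111111111
EOF
# comm needs sorted input
sort -o instances.txt instances.txt
sort -o consumers.txt consumers.txt
# Lines only in instances.txt = instances missing an allocation
comm -23 instances.txt consumers.txt
```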
>
> >
> > On another Rocky cloud, we had the opposite problem: there were
> > allocations also for some instances that didn't exist anymore.
> > And this caused problems since we were not able to use all the resources
> > of the relevant compute nodes: we had to manually remove the "wrong"
> > allocations to fix the problem ...
>
> Yup, this could happen for different reasons, usually all due to known
> bugs for which you don't have the fix yet, e.g. [3][4], or something is
> failing during a migration and we aren't cleaning up properly (an
> unreported/not-yet-fixed bug).
>
> >
> >
> > I wonder why/how this problem can happen ...
>
> I mentioned some possibilities above - but I'm sure there are other bugs
> that have been fixed which I've omitted here, or things that aren't
> fixed yet, especially in failure scenarios (rollback/cleanup handling is
> hard).
>
> Note that your Ocata and Rocky cases could be different. Since Queens
> (once all compute nodes are >=Queens), during a resize, cold migration or
> live migration the migration record in nova holds the source node
> allocations for the duration of the migration. That means the actual
> *consumer* of the allocations for a provider in placement might not be an
> instance (server) record but a migration. So if you were looking up an
> allocation consumer by ID in nova with something like "openstack server
> show $consumer_id", it might return NotFound because the consumer is
> actually a migration record and the allocation was leaked.
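[Editor's note: the lookup above amounts to checking the consumer UUID
against two sets. A minimal sketch with illustrative UUIDs; in a real run
the instance list would come from e.g. "openstack server list
--all-projects -f value -c ID" and the migration UUIDs from nova's
migrations table (its uuid column exists since Queens).]

```shell
# Hedged sketch: classify an allocation consumer as instance or migration.
# All UUIDs below are illustrative.
CONSUMER=33333333-3333-3333-3333-333333333333
INSTANCES="11111111-1111-1111-1111-111111111111
22222222-2222-2222-2222-222222222222"
MIGRATIONS="33333333-3333-3333-3333-333333333333"
if echo "$INSTANCES" | grep -q "$CONSUMER"; then
    echo "consumer is an instance"
elif echo "$MIGRATIONS" | grep -q "$CONSUMER"; then
    echo "consumer is a migration record"
else
    echo "consumer matches neither: likely a leaked allocation"
fi
```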
>
> >
> > And how can we fix the issue ? Should we manually add the missing
> > allocations / manually remove the wrong ones ?
>
> Coincidentally a thread related to this [5] re-surfaced a couple of
> weeks ago. I am not sure what Sylvain's progress is on that audit tool,
> but the linked bug in that email has some other operator scripts you
> could try for the case that there are leaked/orphaned allocations on
> compute nodes that no longer have instances.
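[Editor's note: for a confirmed leaked allocation, the placement API's
DELETE /allocations/{consumer_uuid} removes all allocations held by that
consumer, in the same curl style as above. Endpoint, token and UUID below
are illustrative; verify the consumer is truly gone (no instance and no
in-progress migration) before deleting. osc-placement offers the
equivalent "openstack resource provider allocation delete
<consumer_uuid>".]

```shell
# Hedged sketch: drop a leaked allocation by consumer UUID.
# Values are illustrative placeholders, not a real endpoint or token.
PLACEMENT_ENDPOINT=${PLACEMENT_ENDPOINT:-http://placement.example:8778}
TOKEN=${TOKEN:-example-token}
CONSUMER_UUID=33333333-3333-3333-3333-333333333333
CMD="curl -s -X DELETE ${PLACEMENT_ENDPOINT}/allocations/${CONSUMER_UUID} -H \"x-auth-token: ${TOKEN}\""
echo "$CMD"   # printed, not run; eval "$CMD" would issue the request
```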
>
>
Yeah, I'm still fighting with the change due to some issues, but I hope
to upload it in the next few days.
-Sylvain

>
> > Thanks, Massimo
> >
> >
>
> [1] https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement
> [2] https://docs.openstack.org/project-team-guide/stable-branches.html
> [3] https://bugs.launchpad.net/nova/+bug/1825537
> [4] https://bugs.launchpad.net/nova/+bug/1821594
> [5]
>
> http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007241.html
>
> --
>
> Thanks,
>
> Matt
>
>