[nova][ops][stable] Any interest in backporting --dry-run and/or --instance options for heal_allocations?
On 11/5/19 08:45, Matt Riedemann wrote:
> I was helping someone recover from a stuck live migration today where
> the migration record was stuck in pre-migrating status and somehow the
> request never hit the compute or was lost. The guest was stopped on the
> host and basically the live migration either never started or never
> completed properly (maybe rabbit dropped the request or the compute
> service was restarted, I don't know).
> I instructed them to update the database to set the migration record
> status to 'error' and hard reboot the instance to get it running again.
> Then they pointed out they were seeing this in the compute logs:
> "There are allocations remaining against the source host that might need
> to be removed"
> That's because the source node allocations are still tracked in
> placement by the migration record and the dest node allocations are
> tracked by the instance. Cleaning that up is non-trivial. I have a
> troubleshooting doc started for manually cleaning up that kind of stuff
> here [1] but ultimately just told them to delete the allocations in
> placement for both the migration and the instance and then run the
> heal_allocations command to recreate the allocations for the instance.
> Since this person's nova deployment was running Stein, they don't have
> the --dry-run [2] or --instance [3] options for the heal_allocations
> command. This isn't a huge problem but it does mean they could be
> healing allocations for instances they didn't expect.
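For anyone following along, the cleanup sequence described above could be sketched roughly as below. This is only an illustration, not a tested procedure: it assumes the osc-placement plugin is installed (for "openstack resource provider allocation delete"), that the migration record's UUID is the placement consumer holding the source node allocations, and the function name and UUID values are hypothetical placeholders. It only builds the command lines and does not execute anything.

```python
def cleanup_commands(instance_uuid, migration_uuid, scoped=True):
    """Return the cleanup steps as argv lists; nothing is executed here.

    scoped=True uses the --dry-run/--instance options discussed in this
    thread (not available in Stein without a backport).
    """
    cmds = [
        # Drop the source node allocations held by the migration record.
        ["openstack", "resource", "provider", "allocation", "delete",
         migration_uuid],
        # Drop the dest node allocations held by the instance.
        ["openstack", "resource", "provider", "allocation", "delete",
         instance_uuid],
    ]
    heal = ["nova-manage", "placement", "heal_allocations"]
    if scoped:
        # Preview what would be healed for this one instance, then do it.
        cmds.append(heal + ["--dry-run", "--instance", instance_uuid])
        cmds.append(heal + ["--instance", instance_uuid])
    else:
        # Stein behavior: heals every instance found to be missing
        # allocations, not only the one that was just cleaned up.
        cmds.append(heal)
    return cmds


if __name__ == "__main__":
    # Placeholder UUIDs, purely for illustration.
    for argv in cleanup_commands("INSTANCE_UUID", "MIGRATION_UUID"):
        print(" ".join(argv))
```

With scoped=False the last step is the plain heal_allocations run, which is where the "healing allocations for instances they didn't expect" concern comes from.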
> They could work around this by installing nova from train or master in a
> VM/container/virtual environment and running it against the stein setup,
> but that's maybe more work than they want to do.
> The question I'm posing is if people would like to see those options
> backported to stein and if so, would the stable team be OK with it? I'd
> say this falls into a gray area where these are things that are
> optional, not used by default, and are operational tooling so less risk
> to backport, but it's not zero risk. It's also worth noting that when I
> wrote those patches I did so with the intent that people could backport
> them at least internally.
I think tools like this that provide significant operability benefit are
worthwhile to backport and that the value is much greater than the risk.
Related but not nearly as simple, I've backported nova-manage db purge
and nova-manage db archive_deleted_rows --purge, --before, and
--all-cells downstream because of the number of bugs support/operators
have opened around database cleanup pain. These were all pretty
difficult to backport with the number of differences and conflicts, but
my point is that I understand the motivation well and support the idea.
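To make the database cleanup case concrete, here is a minimal sketch of how those options combine on the command line. It assumes a nova release (or backport) where both commands accept --before and --all-cells; the helper name and the cutoff date are hypothetical, and the commands are only assembled, not run.

```python
def db_cleanup_commands(before):
    """Return the archive and purge steps as argv lists (not executed)."""
    return [
        # Move soft-deleted rows older than `before` into the shadow
        # tables across all cells, then purge the shadow tables.
        ["nova-manage", "db", "archive_deleted_rows",
         "--before", before, "--all-cells", "--purge"],
        # Standalone purge of shadow table rows older than `before`.
        ["nova-manage", "db", "purge", "--before", before, "--all-cells"],
    ]


if __name__ == "__main__":
    # The date is an arbitrary example value.
    for argv in db_cleanup_commands("2019-08-01"):
        print(" ".join(argv))
```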
The fact that the patches in question were written with backportability
in mind is A Good Thing.
> [1] https://review.opendev.org/#/c/691427/
> [2] https://review.opendev.org/#/c/651932/
> [3] https://review.opendev.org/#/c/651945/