[nova][ops][stable] Any interest in backporting --dry-run and/or --instance options for heal_allocations?


On 11/5/19 08:45, Matt Riedemann wrote:
> I was helping someone recover from a stuck live migration today where 
> the migration record was stuck in pre-migrating status and somehow the 
> request never hit the compute or was lost. The guest was stopped and 
> basically the live migration either never started or never 
> completed properly (maybe rabbit dropped the request or the compute 
> service was restarted, I don't know).
> 
> I instructed them to update the database to set the migration record 
> status to 'error' and hard reboot the instance to get it running again.
> 
> Then they pointed out they were seeing this in the compute logs:
> 
> "There are allocations remaining against the source host that might need 
> to be removed"
> 
> That's because the source node allocations are still tracked in 
> placement by the migration record and the dest node allocations are 
> tracked by the instance. Cleaning that up is non-trivial. I have a 
> troubleshooting doc started for manually cleaning up that kind of stuff 
> here [1] but ultimately just told them to delete the allocations in 
> placement for both the migration and the instance and then run the 
> heal_allocations command to recreate the allocations for the instance. 
> Since this person's nova deployment was running Stein, they don't have 
> the --dry-run [2] or --instance [3] options for the heal_allocations 
> command. This isn't a huge problem but it does mean they could be 
> healing allocations for instances they didn't expect.
> 
> They could work around this by installing nova from Train or master in a 
> VM/container/virtual environment and running it against the Stein setup, 
> but that's maybe more work than they want to do.
> 
> The question I'm posing is whether people would like to see those options 
> backported to Stein and if so, would the stable team be OK with it? I'd 
> say this falls into a gray area where these are things that are 
> optional, not used by default, and are operational tooling so less risk 
> to backport, but it's not zero risk. It's also worth noting that when I 
> wrote those patches I did so with the intent that people could backport 
> them at least internally.

I think tools like this that provide significant operability benefit are 
worthwhile to backport and that the value is much greater than the risk.

Related but not nearly as simple, I've backported nova-manage db purge 
and nova-manage db archive_deleted_rows --purge, --before, and 
--all-cells downstream because of the number of bugs support/operators 
have opened around database cleanup pain. These were all pretty 
difficult to backport with the number of differences and conflicts, but 
my point is that I understand the motivation well and support the idea.

The fact that the patches in question were written with backportability 
in mind is A Good Thing.

-melanie

> [1] https://review.opendev.org/#/c/691427/
> [2] https://review.opendev.org/#/c/651932/
> [3] https://review.opendev.org/#/c/651945/
>
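
For reference, the manual cleanup described in the quoted message (delete the placement allocations held by both the migration record and the instance, then run heal_allocations) could be sketched roughly as below. This is only an illustration, not something from the thread: it assumes the osc-placement plugin is installed so that `openstack resource provider allocation delete` is available, and the function name, its parameters and the placeholder UUIDs are hypothetical. It only builds the command lines and does not run anything. The --instance and --dry-run options exist only from Train onward (or with the backports discussed here), so on a stock Stein install both would have to be turned off.

```python
from typing import List


def build_cleanup_commands(instance_uuid: str, migration_uuid: str,
                           targeted: bool = True,
                           dry_run: bool = True) -> List[List[str]]:
    """Return the argv lists for the manual cleanup, in order.

    Nothing is executed here; the caller decides how (and whether) to run them.
    """
    commands = [
        # Source-node allocations are tracked by the migration record.
        ["openstack", "resource", "provider", "allocation", "delete",
         migration_uuid],
        # Dest-node allocations are tracked by the instance.
        ["openstack", "resource", "provider", "allocation", "delete",
         instance_uuid],
    ]
    heal = ["nova-manage", "placement", "heal_allocations"]
    if targeted:
        # Assumes Train or later (or the backport): heal only this instance.
        heal += ["--instance", instance_uuid]
    if dry_run:
        # Assumes Train or later (or the backport): report without committing.
        heal.append("--dry-run")
    commands.append(heal)
    return commands


if __name__ == "__main__":
    # Placeholder UUIDs; substitute the real instance and migration UUIDs.
    for argv in build_cleanup_commands("<instance-uuid>", "<migration-uuid>"):
        print(" ".join(argv))
```

Calling it with targeted=False and dry_run=False yields the plain heal_allocations invocation that Stein supports, which is exactly the case where instances other than the intended one may also get healed.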