[nova][ops][stable] Any interest in backporting --dry-run and/or --instance options for heal_allocations?


I was helping someone recover from a stuck live migration today where 
the migration record was stuck in pre-migrating status and somehow the 
request never hit the compute or was lost. The guest was stopped and 
the live migration either never started or never completed properly 
(maybe rabbit dropped the request or the compute service was restarted; 
I don't know).

I instructed them to update the database to set the migration record 
status to 'error' and hard reboot the instance to get it running again.
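
A minimal sketch of that kind of database update, run here against an 
in-memory SQLite table rather than a real nova database. The table is a 
cut-down assumption modelled on nova's migrations table, and the 
instance UUID and the helper name are made up for illustration.

```python
import sqlite3

# Hypothetical instance UUID, for illustration only.
INSTANCE_UUID = "00000000-0000-0000-0000-000000000001"

# Toy stand-in for the nova migrations table (only the columns used here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE migrations ("
    "id INTEGER PRIMARY KEY, instance_uuid TEXT, "
    "migration_type TEXT, status TEXT)"
)
conn.execute(
    "INSERT INTO migrations (instance_uuid, migration_type, status) "
    "VALUES (?, 'live-migration', 'pre-migrating')",
    (INSTANCE_UUID,),
)
conn.commit()


def mark_stuck_migration_errored(db, instance_uuid):
    """Set stuck pre-migrating live migrations to 'error'; return rows changed."""
    cur = db.execute(
        "UPDATE migrations SET status = 'error' "
        "WHERE instance_uuid = ? "
        "AND migration_type = 'live-migration' "
        "AND status = 'pre-migrating'",
        (instance_uuid,),
    )
    db.commit()
    return cur.rowcount
```

Filtering on both the instance and the pre-migrating status keeps the 
update from touching older, already finished migration records for the 
same instance.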

Then they pointed out they were seeing this in the compute logs:

"There are allocations remaining against the source host that might need 
to be removed"

That's because the source node allocations are still tracked in 
placement by the migration record and the dest node allocations are 
tracked by the instance. Cleaning that up is non-trivial. I have a 
troubleshooting doc started for manually cleaning up that kind of stuff 
here [1] but ultimately just told them to delete the allocations in 
placement for both the migration and the instance and then run the 
heal_allocations command to recreate the allocations for the instance. 
Since this person's nova deployment was running Stein, they don't have 
the --dry-run [2] or --instance [3] options for the heal_allocations 
command. This isn't a huge problem but it does mean they could be 
healing allocations for instances they didn't expect.
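
The cleanup described above can be sketched as a small plan builder. It 
only assembles the placement requests (DELETE /allocations/{consumer}) 
and the nova-manage command line; it does not talk to a real cloud. The 
function name, its parameters, and the short IDs in the usage note are 
assumptions for illustration.

```python
def cleanup_plan(instance_uuid, migration_uuid, has_instance_option=False):
    """Return (placement_requests, heal_command) for the manual cleanup.

    has_instance_option should be True only when the nova-manage being run
    supports --instance (not the case on Stein without a backport).
    """
    # The source node allocations are held by the migration record and the
    # dest node allocations by the instance, so both consumers are deleted.
    placement_requests = [
        ("DELETE", f"/allocations/{migration_uuid}"),
        ("DELETE", f"/allocations/{instance_uuid}"),
    ]

    # heal_allocations then recreates the allocations for the instance.
    heal_command = ["nova-manage", "placement", "heal_allocations"]
    if has_instance_option:
        # Limit healing to the one instance that was just cleaned up.
        heal_command += ["--instance", instance_uuid]
    return placement_requests, heal_command
```

Calling cleanup_plan("inst-1", "mig-1") gives the Stein form of the 
command, which heals every instance that is missing allocations; with 
has_instance_option=True the command is scoped to inst-1 only.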

They could work around this by installing nova from Train or master in a 
VM/container/virtual environment and running it against the Stein setup, 
but that's maybe more work than they want to do.

The question I'm posing is whether people would like to see those 
options backported to Stein and, if so, whether the stable team would be 
OK with it. I'd say this falls into a gray area: these options are 
optional, not used by default, and are operational tooling, so there is 
less risk in backporting them, but it's not zero risk. It's also worth 
noting that when I wrote those patches I did so with the intent that 
people could backport them, at least internally.

[1] https://review.opendev.org/#/c/691427/
[2] https://review.opendev.org/#/c/651932/
[3] https://review.opendev.org/#/c/651945/

-- 

Thanks,

Matt