[nova][ironic] Lock-related performance issue with update_resources periodic job


Hi,
We have the same issue that Jason described: ironic instances take a long
time to start being created.
This is because the resource tracker takes >30 minutes to cycle (~2500
ironic nodes in one nova-compute). Meanwhile, operations are queued until
it finishes.
To speed up the resource tracker we use:
https://review.opendev.org/#/c/637225/
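
To illustrate why operations queue up, here is a minimal Python sketch of the
contention. It is a toy model, not Nova code: the lock name, the function
names and the per-node sleep are all hypothetical. It only assumes what the
thread describes, namely that the periodic update holds a lock for its whole
pass over the nodes and that an instance claim needs the same lock.

```python
import threading
import time

# Hypothetical stand-in for the single lock shared by the periodic
# resource update and by instance claims on one nova-compute.
resources_lock = threading.Lock()


def update_available_resources(node_count, seconds_per_node, started):
    """Simulated periodic task: holds the lock for the whole pass."""
    with resources_lock:
        started.set()  # signal that the pass has begun
        for _ in range(node_count):
            time.sleep(seconds_per_node)  # simulated per-node work


def claim_instance():
    """Simulated claim: returns how long it waited for the lock (seconds)."""
    t0 = time.monotonic()
    with resources_lock:
        return time.monotonic() - t0


def measure_claim_wait(node_count=25, seconds_per_node=0.01):
    """Start one update pass and measure the wait of a claim arriving mid-pass."""
    started = threading.Event()
    updater = threading.Thread(
        target=update_available_resources,
        args=(node_count, seconds_per_node, started),
    )
    updater.start()
    started.wait()  # make sure the update already holds the lock
    waited = claim_instance()
    updater.join()
    return waited


if __name__ == "__main__":
    print(f"claim waited {measure_claim_wait():.2f}s for the update pass")
```

In this model the claim waits roughly node_count * seconds_per_node, so the
wait grows with the number of nodes behind one lock. Running the pass less
often, or splitting the nodes over several computes, are the two ways out
discussed in this thread.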

We are working on sharding the nova-compute for ironic. I think that is the
right way to go.

Considering the experience described by Jason, we have now increased
"update_resources_interval" to 24h.
With that change, the "queue" issue disappeared.
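
For reference, a small Python sketch of what such a setting might look like
and how to sanity-check it. The nova.conf fragment is an assumption, not a
copy of anyone's production config: to my knowledge
"update_resources_interval" lives in [DEFAULT] and is given in seconds, and
the value shown for resource_provider_association_refresh is only
illustrative, since the thread says "a large value" without giving a number.

```python
import configparser

# Assumed nova.conf fragment; section placement and units should be
# checked against the configuration reference for your Nova release.
SAMPLE_NOVA_CONF = """
[DEFAULT]
# 24 hours, expressed in seconds
update_resources_interval = 86400

[compute]
# illustrative number only; the thread just says "a large value"
resource_provider_association_refresh = 86400
"""


def update_interval_hours(conf_text):
    """Return update_resources_interval from conf_text, converted to hours."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    seconds = parser.getint("DEFAULT", "update_resources_interval")
    return seconds / 3600


if __name__ == "__main__":
    print(update_interval_hours(SAMPLE_NOVA_CONF), "hours")
```

The 6 hours that Jason mentions would be 21600 in the same field.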

We will report back if we find any unexpected consequences.

Belmiro
CERN

On Tue, Jun 11, 2019 at 5:56 PM Jason Anderson <jasonanderson at uchicago.edu>
wrote:

> Hi Surya,
>
> On 5/13/19 3:15 PM, Surya Seetharaman wrote:
>
> We faced the same problem at CERN when we upgraded to rocky (we have ~2300
> nodes on a single compute) like Eric said, and we set the
> [compute]resource_provider_association_refresh to a large value (this
> definitely helps by stopping the syncing of traits/aggregates and provider
> tree cache info stuff in terms of chattiness with placement) and in spite of
> that it doesn't scale that well for us. We still find the periodic task
> taking too much time, which causes the locking to hold up the claim for
> instances in BUILD state (the exact same problem you described). While one
> way to tackle this like you said is to set the "update_resources_interval"
> to a higher value - we were not sure how much out of sync things would get
> with placement, so it will be interesting to see how this pans out for you
> - another way out would be to use multiple computes and spread the nodes
> around (though this is also a pain to maintain IMHO) which is what we are
> looking into presently.
>
> I wanted to let you know that we've been running this way in production
> for a few weeks now and it's had a noticeable improvement: instances are no
> longer sticking in the "Build" stage, pre-networking, for ages. We were
> able to track the improvement by comparing the Nova conductor logs ("Took
> {seconds} to build the instance" vs "Took {seconds} to spawn the instance
> on the hypervisor"; the delta should be as small as possible and in our
> case went from ~30 minutes to ~1 minute.) There have been a few cases where
> a resource provider claim got "stuck", but in practice it has been so
> infrequent that it potentially has other causes. As such, I can recommend
> increasing the interval time significantly. Currently we have it set to 6
> hours.
>
> I have not yet looked into bringing in the other Nova patches used at
> CERN (and available in Stein). I did take a look at updating the locking
> mechanism, but do not have work to show for this yet.
>
> Cheers,
>
> /Jason
>
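
The log comparison Jason describes in the quoted message (build time versus
spawn time per instance) could be scripted roughly as below. This is a sketch
under assumptions: the sample log lines and the "[instance: <uuid>]" prefix
are made up for illustration, and the regular expressions should be adjusted
to the exact wording your Nova release logs.

```python
import re

# Assumed log shapes; adjust to the messages your Nova release emits.
INSTANCE_RE = re.compile(r"\[instance: (?P<uuid>[0-9a-f-]{36})\]")
TOOK_RE = re.compile(
    r"Took (?P<secs>\d+(?:\.\d+)?) seconds to "
    r"(?P<phase>build|spawn) (?:the )?instance"
)


def build_spawn_deltas(lines):
    """Map instance UUID -> (build seconds - spawn seconds)."""
    times = {}
    for line in lines:
        inst = INSTANCE_RE.search(line)
        took = TOOK_RE.search(line)
        if inst and took:
            phases = times.setdefault(inst["uuid"], {})
            phases[took["phase"]] = float(took["secs"])
    # Only report instances for which both messages were seen.
    return {
        uuid: phases["build"] - phases["spawn"]
        for uuid, phases in times.items()
        if "build" in phases and "spawn" in phases
    }


# Made-up sample lines, not taken from a real deployment.
SAMPLE_LOG = [
    "INFO nova.compute.manager [instance: 11111111-2222-3333-4444-555555555555] "
    "Took 600.00 seconds to spawn the instance on the hypervisor.",
    "INFO nova.compute.manager [instance: 11111111-2222-3333-4444-555555555555] "
    "Took 2400.00 seconds to build instance.",
]

if __name__ == "__main__":
    for uuid, delta in build_spawn_deltas(SAMPLE_LOG).items():
        print(f"{uuid}: {delta / 60:.0f} min outside the hypervisor spawn")
```

A large delta means the instance spent most of its build time outside the
hypervisor spawn, which in this thread was time spent waiting on the
resource tracker.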