[watcher] Compute CDM builder issues (mostly perf related)

Hi all,

I was looking over the NovaClusterDataModelCollector code today and 
trying to learn more about how watcher builds the nova CDM (and when) 
and got digging into this change from Stein [1] where I noted what 
appear to be several issues. I'd like to enumerate a few of those issues 
here and then figure out how to proceed.

1. In general, a lot of this code for building the compute node model is 
based on at least using the 2.53 microversion (Pike) in nova, where
the hypervisor id is a UUID - this is actually necessary for a multi-cell
environment like CERN. The nova_client.api_version config option already 
defaults to 2.56, which was in Queens. I'm not sure what the
compatibility matrix looks like for Watcher, but would it be possible
for us to say that Watcher requires nova at at least the Queens-level
API (so nova_client.api_version >= 2.60), add a release note, and add
a "watcher-status upgrade check" if necessary? This might make things
a bit cleaner in the nova CDM code since we'd know we can rely on a
given minimum version.
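
For illustration, here is a rough sketch (not Watcher's actual code;
the helper name and the minimum value are my own) of enforcing a
minimum microversion when constructing the client with
python-novaclient's api_versions module:

    from novaclient import api_versions, client

    # Hypothetical minimum; adjust to whatever Watcher decides on.
    MIN_COMPUTE_API_VERSION = api_versions.APIVersion('2.56')

    def get_nova_client(session, requested='2.56'):
        """Build a nova client, rejecting versions below our minimum."""
        version = api_versions.APIVersion(requested)
        if version < MIN_COMPUTE_API_VERSION:
            raise ValueError(
                'watcher requires compute API microversion >= %s'
                % MIN_COMPUTE_API_VERSION.get_string())
        return client.Client(version.get_string(), session=session)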

2. I had a question about when the nova CDM gets built now [2]. It looks
like the nova CDM only gets built when there is an audit? But I thought
the CDM was supposed to be built when the decision-engine service
starts and then refreshed every hour (by default) by a periodic task,
or as notifications that change the model are processed. Does this
mean the nova CDM is rebuilt fresh whenever there is an audit, even if
the audit is not scoped? If so, isn't that potentially inefficient
(and an unnecessary load on the compute API every time an audit runs)?
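
To make the expectation concrete, here is a toy sketch (purely
hypothetical, not Watcher code) of the lifecycle I had in mind: build
once at decision-engine startup, then refresh on a periodic task
(notification handling would mutate self.model in place):

    import threading

    class CDMHolder(object):
        """Toy model holder: build at service start, refresh hourly."""

        def __init__(self, build_fn, interval=3600):
            self._build_fn = build_fn  # e.g. the nova CDM builder
            self._interval = interval  # refresh period in seconds
            self._lock = threading.Lock()
            self.model = None

        def start(self):
            self._refresh()  # build the model once at service start

        def _refresh(self):
            with self._lock:
                self.model = self._build_fn()
            # schedule the next periodic rebuild
            timer = threading.Timer(self._interval, self._refresh)
            timer.daemon = True
            timer.start()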

3. The host_aggregates and availability_zone compute audit scopes don't
appear to be documented in the docs or the API reference, just the spec
[3]. Should I open a docs bug about what the supported audit scopes are
and how they work? (It looks like the host_aggregates scope works with
aggregate ids or names and the availability_zone scope works with AZ
names.)
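
For reference, the kind of scope I mean would look something like this
in the audit request (key names are from my reading of the spec [3],
so treat them as illustrative, not authoritative):

    # Illustrative only; exact key names per my reading of the spec [3].
    scope = [
        {"host_aggregates": [{"id": 12}, {"name": "rack-a"}]},
        {"availability_zone": [{"name": "az-1"}]},
    ]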

4. There are a couple of issues with how the unscoped compute nodes are 
retrieved from nova [4].

a) With microversion 2.33 there is a server-side configurable limit
applied when listing hypervisors (defaults to 1000). In a large cloud
this could be a problem since the watcher client-side code is not
paging (see the paging sketch below, after 4b).

b) The code is listing hypervisors with details, but then throwing away
those details just to get the hypervisor_hostname, and then iterating
over each of those node names and getting the details per hypervisor
again. I see why this is done because of the scoped vs unscoped cases,
but I think we could still optimize this (we might need some changes to
python-novaclient for this though, which should be easy enough to add).
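
On 4a, here is a rough sketch of what client-side paging could look
like (assuming the client is constructed with microversion >= 2.33;
with >= 2.53 the marker is the hypervisor UUID). Since the list is
detailed by default, keeping the results around would also address 4b:

    def iter_hypervisors(nova, page_size=1000):
        """Page through hypervisors rather than trusting one unpaged call."""
        marker = None
        while True:
            # detailed=True is the default, so details come along for free
            page = nova.hypervisors.list(marker=marker, limit=page_size)
            if not page:
                return
            for hypervisor in page:
                yield hypervisor
            marker = page[-1].id  # hypervisor UUID at microversion >= 2.53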

5. For each server on a node, we get the details of the server in 
separate API calls to nova [5]. Why can't we just do a GET 
/servers/detail and filter on "host" or "node" so it's a single API call 
to nova per hypervisor?
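
In other words, something like this sketch (assumes an admin client,
since the "host"/"node" filters and all_tenants are admin-only):

    def servers_on_node(nova, node_name):
        """One GET /servers/detail per hypervisor, filtered server-side."""
        return nova.servers.list(
            detailed=True,
            search_opts={'node': node_name,     # or 'host' for the host name
                         'all_tenants': True})  # include all projects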

I'm happy to work on any of this but if there are any reasons things 
need to be done this way please let me know before I get started. Also, 
how would the core team like these kinds of improvements tracked? With bugs?

[2] at line 181
[4] at line 257
[5] at line 399