[watcher] Compute CDM builder issues (mostly perf related)
I was looking over the NovaClusterDataModelCollector code today and
trying to learn more about how watcher builds the nova CDM (and when)
and started digging into this change from Stein where I noted what
appear to be several issues. I'd like to enumerate a few of those issues
here and then figure out how to proceed.
1. In general, a lot of this code for building the compute node model is
based on at least using the 2.53 microversion (Pike) in nova where the
hypervisor.id is a UUID - this is actually necessary for a multi-cell
environment like CERN. The nova_client.api_version config option already
defaults to 2.56, which was in Queens. I'm not sure what the
compatibility matrix looks like for Watcher, but would it be possible
for us to say that Watcher requires nova at least at the Queens-level API
(so nova_client.api_version >= 2.60), add a release note, and add a
"watcher-status upgrade check" if necessary? This would make the nova
CDM code a bit cleaner since we would know we can rely on a given
minimum microversion.
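If Watcher did adopt a hard minimum, the "watcher-status upgrade check" would be little more than a comparison of the configured nova_client.api_version against that floor. A minimal sketch of the comparison helper (the function names and the 2.56 floor here are illustrative, not anything that exists in Watcher today):

```python
# Sketch of a minimum-microversion gate for a "watcher-status upgrade
# check". The helper names and the 2.56 floor are illustrative only.

def microversion_tuple(version):
    """Turn '2.56' into (2, 56) so microversions compare numerically."""
    major, minor = version.split('.')
    return int(major), int(minor)

def satisfies_minimum(configured, minimum='2.56'):
    """True if the configured nova_client.api_version meets the floor."""
    return microversion_tuple(configured) >= microversion_tuple(minimum)
```

Note the tuple conversion: microversions compare numerically, not lexicographically, so 2.9 is older than 2.56.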
2. I had a question about when the nova CDM gets built now. It looks
like the nova CDM only gets built when there is an audit? But I thought
the CDM was supposed to be built on start of the decision-engine
service and then refreshed every hour (by default) on a periodic task, or
as notifications are processed that change the model. Does this mean the
nova CDM is rebuilt fresh whenever there is an audit, even if the audit
is not scoped? If so, isn't that potentially inefficient (and an
unnecessary load on the compute API every time an audit runs)?
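For contrast, what I expected is roughly the pattern below: build once at service start, serve the cached model to audits, and only hit the compute API again when the periodic interval expires or a notification invalidates the model. This is just a sketch of the pattern, with made-up names, not Watcher's actual classes:

```python
import time

# Sketch of the "build once, refresh periodically" pattern I expected
# the decision-engine to use. All names here are illustrative.
class CachedClusterModel:
    def __init__(self, build_fn, refresh_interval=3600):
        self._build_fn = build_fn          # the expensive compute API walk
        self._refresh_interval = refresh_interval
        self._model = None
        self._built_at = 0.0

    def invalidate(self):
        """Called when a nova notification changes the model."""
        self._model = None

    def get(self):
        """Audits call this; the compute API is only hit when stale."""
        stale = (time.time() - self._built_at) > self._refresh_interval
        if self._model is None or stale:
            self._model = self._build_fn()
            self._built_at = time.time()
        return self._model
```

With this shape, back-to-back audits reuse the cached model instead of each triggering a full rebuild.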
3. The host_aggregates and availability_zone compute audit scopes don't
appear to be documented in the docs or the API reference, just the spec.
Should I open a docs bug about what the supported audit scopes are
and how they work? (It looks like the host_aggregates scope works with
aggregate ids or names and the availability_zone scope works with AZ names.)
4. There are a couple of issues with how the unscoped compute nodes are
retrieved from nova.
a) With microversion 2.33 there is a server-side configurable limit
applied when listing hypervisors (it defaults to 1000). In a large cloud
this could be a problem since the watcher client-side code is not paging.
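Fixing a) client-side is just the usual marker/limit paging loop against the 2.33+ hypervisors API. Here is a sketch of that loop, where fetch_page is a hypothetical stand-in for whatever novaclient (or a direct GET /os-hypervisors/detail call) gives us per page:

```python
# Sketch of client-side paging for GET /os-hypervisors/detail with
# microversion >= 2.33. fetch_page(marker, limit) is a stand-in for the
# actual client call and is assumed to return a list of hypervisor
# dicts (each with an 'id') no longer than `limit`.
def list_all_hypervisors(fetch_page, limit=1000):
    hypervisors = []
    marker = None
    while True:
        page = fetch_page(marker=marker, limit=limit)
        hypervisors.extend(page)
        if len(page) < limit:
            break                   # short page: nothing left to fetch
        marker = page[-1]['id']     # with 2.53+ this id is a UUID
    return hypervisors
```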
b) The code is listing hypervisors with details, but then throwing away
those details just to get the hypervisor_hostname, then iterating over
each of those node names and getting the details per hypervisor again. I
see why this is done because of the scoped vs unscoped cases, but I
think we could still optimize it (we might need some changes to
python-novaclient for this, though those should be easy enough to add).
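Concretely for b): since the detailed listing already has everything we need, both the scoped and unscoped cases could be served from the one call by filtering locally instead of re-fetching each node by name. A sketch, with plain dicts standing in for novaclient Hypervisor objects:

```python
# Sketch of serving both the scoped and unscoped cases from a single
# detailed hypervisor listing, instead of re-fetching each node by
# name. `hypervisors` stands in for one list-with-details result.
def filter_hypervisors(hypervisors, scope_hostnames=None):
    """Return detailed hypervisor records, optionally limited to a scope.

    scope_hostnames=None means an unscoped audit: keep everything.
    """
    if scope_hostnames is None:
        return list(hypervisors)
    wanted = set(scope_hostnames)
    return [h for h in hypervisors
            if h['hypervisor_hostname'] in wanted]
```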
5. For each server on a node, we get the details of the server in
separate API calls to nova. Why can't we just do a GET
/servers/detail and filter on "host" or "node" so it's a single API call
to nova per hypervisor?
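In novaclient terms I'd expect that to be something like servers.list(search_opts={'host': hostname, 'all_tenants': 1}) per hypervisor, or even a single unfiltered GET /servers/detail for the whole cloud grouped client-side. A sketch of the grouping half (plain dicts stand in for Server objects; 'OS-EXT-SRV-ATTR:host' is the real nova attribute naming an instance's compute host):

```python
from collections import defaultdict

# Sketch: one GET /servers/detail for the whole cloud, grouped locally
# by host, instead of per-server GETs. Dicts stand in for novaclient
# Server objects.
def group_servers_by_host(servers):
    by_host = defaultdict(list)
    for server in servers:
        by_host[server['OS-EXT-SRV-ATTR:host']].append(server)
    return dict(by_host)
```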
I'm happy to work on any of this but if there are any reasons things
need to be done this way please let me know before I get started. Also,
how would the core team like these kinds of improvements tracked? With bugs?
https://review.opendev.org/#/c/640585/10/watcher/decision_engine/model/collector/nova.py@181
https://review.opendev.org/#/c/640585/10/watcher/decision_engine/model/collector/nova.py@257
https://review.opendev.org/#/c/640585/10/watcher/decision_engine/model/collector/nova.py@399