git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

State of the Gate (placement?)


On 11/1/2019 5:34 AM, Chris Dent wrote:
> Instead it's "just being slow" which could be any of:
> 
> * something in the auth process (which if other services are also
>    being slow could be a unifying thing)
> * the database server having a sad (do we keep a slow query log?)

Not by default but it can be enabled in devstack, like [1] but the 
resulting log file is so big I can't open it in my browser. So hitting 
this kind of issue with that enabled is going to be like finding a 
needle in a haystack I think.

> * the vm/vms has/have a noisy neighbor

This is my guess as to the culprit. Like Slawek said elsewhere, he's 
seen this in other APIs that are otherwise really fast.

> 
> Do we know anything about cpu and io stats during the slow period?

We have peakmemtracker [2] which shows a jump during the slow period 
noticed:

Oct 31 16:51:43.132081 ubuntu-bionic-inap-mtl01-0012620879 
memory_tracker.sh[26186]: [iscsid (pid:18546)]=34352KB; [dmeventd 
(pid:18039)]=119732KB; [ovs-vswitchd (pid:3626)]=691960KB
Oct 31 16:51:43.133012 ubuntu-bionic-inap-mtl01-0012620879 
memory_tracker.sh[26186]: ]]]
Oct 31 16:55:23.220099 ubuntu-bionic-inap-mtl01-0012620879 
memory_tracker.sh[26186]: [[[
Oct 31 16:55:23.221255 ubuntu-bionic-inap-mtl01-0012620879 
memory_tracker.sh[26186]: Thu Oct 31 16:55:23 UTC 2019

I don't know what that means though.

We also have dstat [3] (which I find hard as hell to read - do people 
throw that into a nice graphical tool to massage that data for 
inspection?) which shows a jump as well:

Oct 31 16:53:31.515005 ubuntu-bionic-inap-mtl01-0012620879 
dstat.sh[25748]: 31-10 16:53:31| 19   7  12  61   0|5806M  509M  350M 
1194M| 612B    0 |   0     0 |   0     0 |1510  2087 |17.7 9.92 5.82|3.0 
  13 2.0|   0     0 |qemu-system-x86      24768 13%   0    0 |python2 
           25797  99k 451B  0%|mysqld       464M|  20M 8172M|  37  513 
  0    4   12
Oct 31 16:55:08.604634 ubuntu-bionic-inap-mtl01-0012620879 
dstat.sh[25748]: 31-10 16:53:32| 20   7  12  61   0|5806M  509M  350M 
1194M|1052B    0 |   0     0 |   0     0 |1495  2076 |17.7 9.92 5.82|4.0 
  13   0|   0     0 |qemu-system-x86      24768 13%   0    0 |python2 
           25797  99k 446B0.1%|mysqld       464M|  20M 8172M|  37  513 
  0    4   12

Looks like around the time of slowness mysqld is the top consuming 
process which is probably not surprising.

> 
> We've (sorry, let me rephrase, I've) known for the entire 5.5 years
> I've been involved in OpenStack that the CI resources are way
> over-subscribed and controller nodes in the wild are typically way
> under-specified.  Yes, our code should be robust in the face of
> that, but...

Looking at logstash this mostly hits on OVH and INAP nodes. Question to 
infra: do we know if those are more oversubscribed than some others like 
RAX or VEXXHOST nodes?

[1] https://review.opendev.org/#/c/691995/
[2] 
https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcdn.com/692206/3/check/tempest-integrated-compute/35ecb4a/controller/logs/screen-peakmem_tracker.txt.gz
[3] 
https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcdn.com/692206/3/check/tempest-integrated-compute/35ecb4a/controller/logs/screen-dstat.txt.gz

-- 

Thanks,

Matt