State of the Gate (placement?)
On Fri, Nov 1, 2019, at 6:30 AM, Matt Riedemann wrote:
> On 11/1/2019 5:34 AM, Chris Dent wrote:
> > Instead it's "just being slow" which could be any of:
> > * something in the auth process (which if other services are also
> > Â being slow could be a unifying thing)
> > * the database server having a sad (do we keep a slow query log?)
> Not by default but it can be enabled in devstack, like  but the
> resulting log file is so big I can't open it in my browser. So hitting
> this kind of issue with that enabled is going to be like finding a
> needle in a haystack I think.
> > * the vm/vms has/have a noisy neighbor
> This is my guess as to the culprit. Like Slawek said elsewhere, he's
> seen this in other APIs that are otherwise really fast.
> > Do we know anything about cpu and io stats during the slow period?
> We have peakmemtracker  which shows a jump during the slow period
> Oct 31 16:51:43.132081 ubuntu-bionic-inap-mtl01-0012620879
> memory_tracker.sh: [iscsid (pid:18546)]=34352KB; [dmeventd
> (pid:18039)]=119732KB; [ovs-vswitchd (pid:3626)]=691960KB
> Oct 31 16:51:43.133012 ubuntu-bionic-inap-mtl01-0012620879
> memory_tracker.sh: ]]]
> Oct 31 16:55:23.220099 ubuntu-bionic-inap-mtl01-0012620879
> memory_tracker.sh: [[[
> Oct 31 16:55:23.221255 ubuntu-bionic-inap-mtl01-0012620879
> memory_tracker.sh: Thu Oct 31 16:55:23 UTC 2019
> I don't know what that means though.
> We also have dstat  (which I find hard as hell to read - do people
> throw that into a nice graphical tool to massage that data for
> inspection?) which shows a jump as well:
I put the dstat csv files in https://lamada.eu/dstat-graph/ and that works reasonably well.
> Oct 31 16:53:31.515005 ubuntu-bionic-inap-mtl01-0012620879
> dstat.sh: 31-10 16:53:31| 19 7 12 61 0|5806M 509M 350M
> 1194M| 612B 0 | 0 0 | 0 0 |1510 2087 |17.7 9.92 5.82|3.0
> 13 2.0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2
> 25797 99k 451B 0%|mysqld 464M| 20M 8172M| 37 513
> 0 4 12
> Oct 31 16:55:08.604634 ubuntu-bionic-inap-mtl01-0012620879
> dstat.sh: 31-10 16:53:32| 20 7 12 61 0|5806M 509M 350M
> 1194M|1052B 0 | 0 0 | 0 0 |1495 2076 |17.7 9.92 5.82|4.0
> 13 0| 0 0 |qemu-system-x86 24768 13% 0 0 |python2
> 25797 99k 446B0.1%|mysqld 464M| 20M 8172M| 37 513
> 0 4 12
> Looks like around the time of slowness mysqld is the top consuming
> process which is probably not surprising.
> > We've (sorry, let me rephrase, I've) known for the entire 5.5 years
> > I've been involved in OpenStack that the CI resources are way
> > over-subscribed and controller nodes in the wild are typically way
> > under-specified.Â Yes, our code should be robust in the face of
> > that, but...
> Looking at logstash this mostly hits on OVH and INAP nodes. Question to
> infra: do we know if those are more oversubscribed than some others like
> RAX or VEXXHOST nodes?
I believe both OVH and INAP give us dedicated hypervisors. This means that we will end up being our own noisy neighbors there. I don't know what level we oversubscribe at but amorin (OVH) and benj_ (INAP) can probably provide more info.
INAP was also recently turned back on. It had been offline for redeployment and that was completed and added back to the pool. Possible that more than just the openstack version has changed?
OVH controls the disk IOPs that we get pretty aggressively as well. Possible it is an IO thing?
>  https://review.opendev.org/#/c/691995/