git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[cinder][tooz]Lock-files are remained


On 17/10, Herve Beraud wrote:
> Thanks Ben for your feedbacks.
>
> I already tried to follow the `remove_external_lock_file` few months ago,
> but unfortunately, I don't think we can goes like this with Cinder...
>
> As Gorka has explained to me few months ago:
>
> > Those are not the only type of locks we use in Cinder.  Those are the
> > ones we call "Global locks" and use TooZ so the DLM can be configured
> > for Cinder Active-Active.
> >
> > We also use Oslo's synchronized locks.
> >
> > More information is available in the Cinder HA dev ref I wrote last
> > year.  It has a section dedicated to the topic of mutual exclusion and
> > the 4 types we currently have in Cinder [1]:
> >
> > - Database locking using resource states.
> > - Process locks.
> > - Node locks.
> > - Global locks.
> >
> > As for calling the remove_external_lock_file_with_prefix directly on
> > delete, I don't think that's something we can do, as the locks may still
> > be in use. Example:
> >
> > - Start deleting volume -> get lock
> > - Try to clone volume -> wait for lock
> > - Finish deleting volume -> release and delete lock
> > - Cloning recreates the lock when acquiring it
> > - Cloning fails because the volume no longer exists but leaves the lock
>
> So the Cinder workflow and mechanisms seems to definitively forbid to
> us the possibility to use the remove features of oslo.concurrency...
>
> Also like discussed on the review (https://review.opendev.org/#/c/688413),
> this issue can't be fixed in the underlying libraries, and I think that if
> we want to fix that on stable branches then Cinder need to address it
> directly by adding some piece of code who will be triggered if needed and
> in a safely manner, in other words, only Cinder can really address it and
> remove safely these file.
>
> See the discussion extract on the review (
> https://review.opendev.org/#/c/688413):
>

Hi,

I've given it some more thought, and I am now on the side of those that
argue that "something imperfect" is better than what we currently have,
so maybe we can reach some sort of compromise doing the following:

- Cleanup locks directory on node start
- Remove locks on delete volume/snapshot operation
- Remove locks on missing source on create volume (volume/snapshot)

It may not cover 100% of the cases, and in a long running system with
few deletions will not help, but it should at least alleviate the issue
for some users.

To illustrate the Cinder part of this approach I have written a WIP
patch [1] (and an oslo.concurrency one [3] that is not needed, but would
improve the user experience).  I haven't bothered to test it yet, but
I'll do it if we agree this is a reasonable compromise we are
comfortable with.

Cheers,
Gorka.

[1]: https://review.opendev.org/689486
[2]: https://review.opendev.org/689482

>  > Thanks Gorka for your feedback, then in view of all the discussions
>  > about this topic I suppose only Cinder can really address it safely
>  > on stable branches.
>  >
>  > > It is not a safe assumption that *-delete_volume file locks can be
>  > > removed just because they have not been used in a couple of days.
>  > > A new volume clone could come in that would use it and then we
>  > > could have a race condition if the cron job was running.
>  > >
>  > > The only way to be sure that it can be removed is checking in the
>  > > Cinder DB and making sure that the volume has been deleted or it
>  > > doesn't even exist (DB has been purged).
>  > >
>  > > Same thing with detach_volume, delete_snapshot, and those that are
>  > > directly volume ids locks.
>  >
>  > I definitely think that it can't be fixed in the underlying
>  > libraries like Eric has suggested [1], indeed, as you has explained
>  > only Cinder can know if a lock file can be removed safely.
>  >
>  > > In my opinion the fix should be done in fasteners, or we should add
>  > > code in Cinder that cleans up all locks related to a volume or
>  > > snapshot when this one is deleted.
>  >
>  > I agree the most better solution is to fix the root cause and so to
>  > fix fasteners, but I don't think it's can be backported to stable
>  > branches because we will need to bump a requirement version on
>  > stable branche in this case and also because it'll introduce new
>  > features, so I guess Cinder need to add some code to remove these
>  > files and possibly backport it to stable branches.
>  >
>  > [1]
> http://lists.openstack.org/pipermail/openstack-discuss/2019-September/009563.html
>
> The Fasteners fix IMHO can only be used by future versions of openstack,
> due to the version bump and due to the new features added. I think that it
> could be available only from the ussuri or future cycle like V.
>
> The main goal of the cron approach was to definitively avoid to unearth
> this topic each 6 months, try to address it on stable branches, and try to
> take care of the file system usage even if it's a theoretical issue, but by
> getting feedbacks from the Cinder team and their warnings I don't think
> that this track is still followable.
>
> Definitely, this is not an oslo.concurrency bug. Anyway your proposed
> "Administrator Guide" is a must to have, to track things in one place,
> inform users and avoid to spend time to explain the same things again and
> again about this topic... so it's worth-it. I'll review it and propose my
> related knowledge on this topic. oslo.concurrency can't address this safely
> because we risk to introduce race conditions and worse situations than the
> leftover lock files.
>
> So, due to all these elements, only cinder can address it for the moment
> and for fix that on stable branches too.
>
> Le mer. 16 oct. 2019 à 00:15, Ben Nemec <openstack at nemebean.com> a écrit :
>
> > In the interest of not having to start this discussion from scratch
> > every time, I've done a bit of a brain dump into
> > https://review.opendev.org/#/c/688825/ that covers why things are the
> > way they are and what we recommend people do about it. Please take a
> > look and let me know if you see any issues with it.
> >
> > Thanks.
> >
> > -Ben
> >
>
>
> --
> Hervé Beraud
> Senior Software Engineer
> Red Hat - Openstack Oslo
> irc: hberaud
> -----BEGIN PGP SIGNATURE-----
>
> wsFcBAABCAAQBQJb4AwCCRAHwXRBNkGNegAALSkQAHrotwCiL3VMwDR0vcja10Q+
> Kf31yCutl5bAlS7tOKpPQ9XN4oC0ZSThyNNFVrg8ail0SczHXsC4rOrsPblgGRN+
> RQLoCm2eO1AkB0ubCYLaq0XqSaO+Uk81QxAPkyPCEGT6SRxXr2lhADK0T86kBnMP
> F8RvGolu3EFjlqCVgeOZaR51PqwUlEhZXZuuNKrWZXg/oRiY4811GmnvzmUhgK5G
> 5+f8mUg74hfjDbR2VhjTeaLKp0PhskjOIKY3vqHXofLuaqFDD+WrAy/NgDGvN22g
> glGfj472T3xyHnUzM8ILgAGSghfzZF5Skj2qEeci9cB6K3Hm3osj+PbvfsXE/7Kw
> m/xtm+FjnaywZEv54uCmVIzQsRIm1qJscu20Qw6Q0UiPpDFqD7O6tWSRKdX11UTZ
> hwVQTMh9AKQDBEh2W9nnFi9kzSSNu4OQ1dRMcYHWfd9BEkccezxHwUM4Xyov5Fe0
> qnbfzTB1tYkjU78loMWFaLa00ftSxP/DtQ//iYVyfVNfcCwfDszXLOqlkvGmY1/Y
> F1ON0ONekDZkGJsDoS6QdiUSn8RZ2mHArGEWMV00EV5DCIbCXRvywXV43ckx8Z+3
> B8qUJhBqJ8RS2F+vTs3DTaXqcktgJ4UkhYC2c1gImcPRyGrK9VY0sCT+1iA+wp/O
> v6rDpkeNksZ9fFSyoY2o
> =ECSj
> -----END PGP SIGNATURE-----