
[cinder][dev] Bug for deferred deletion in RBD


Yes, I also used eventlet, because RBDPool calls into eventlet.tpool.
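
For context, a minimal sketch (not the actual Cinder driver code) of how a
blocking call is handed to eventlet's native thread pool, whose size is what
the EVENTLET_THREADPOOL_SIZE environment variable controls:

    import time
    from eventlet import tpool

    def blocking_call():
        # stand-in for a blocking librados/librbd operation
        time.sleep(1)
        return "done"

    # tpool.execute() runs the call in a native worker thread so the
    # green-thread hub is not blocked while the call waits
    print(tpool.execute(blocking_call))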

Anyway, I finally found the cause of the problem: the process hit its file
descriptor limit.
My test environment had a ulimit of 1024, and every time I deleted a volume
the fd count increased by 30 to 40; as soon as it exceeded 1024,
cinder-volume stopped working.

I raised the ulimit to a large value, and the fd count climbed past 2300 by
the time we had deleted 200 volumes.
Once all the volumes were erased, the fd count dropped back to normal.

In the end, I think the fd growth comes from the code path that deletes the
volume, because the fd count stays stable during volume creation.
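
For reference, a minimal sketch (standard library only; the pgrep pattern is
an assumption about how the service process is named) of one way the open fd
count of the cinder-volume process can be watched while deleting volumes:

    import os
    import subprocess
    import time

    def fd_count(pid):
        # count the entries under /proc/<pid>/fd (Linux only; needs
        # permission to read the target process)
        return len(os.listdir('/proc/%d/fd' % pid))

    # take the oldest process whose command line matches "cinder-volume"
    pid = int(subprocess.check_output(['pgrep', '-o', '-f', 'cinder-volume']))

    while True:
        print(time.strftime('%H:%M:%S'), fd_count(pid))
        time.sleep(5)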

Thanks,
Jaesang.

On Wed, Feb 13, 2019 at 6:37 PM, Gorka Eguileor <geguileo at redhat.com> wrote:

> On 13/02, Jae Sang Lee wrote:
> > As Gorka mentioned, the sql connection is using pymysql.
> >
> > And I increased max_pool_size to 50 (I think Gorka confused
> > max_pool_size with max_retries),
>
> My bad, I meant "max_overflow", which was changed a while back to 50
> (though I don't remember when).
>
>
>
> > but the result was the same: cinder-volume got stuck once 40~50 volumes
> > had been deleted.
> >
> > There seems to be a problem with the cinder rbd volume driver, so I
> > tested deleting 200 volumes continuously using only RBDClient and
> > RBDProxy. There was no problem in that case.
>
> I assume you tested it using eventlets, right?
>
> Cheers,
> Gorka.
>
>
> >
> > I think there is some code in the cinder-volume that causes a hang, but
> > it's too hard to find now.
> >
> > Thanks.
> >
> > On Tue, Feb 12, 2019 at 6:24 PM, Gorka Eguileor <geguileo at redhat.com> wrote:
> >
> > > On 12/02, Arne Wiebalck wrote:
> > > > Jae,
> > > >
> > > > One other setting that caused trouble when bulk deleting cinder
> > > > volumes was the DB connection string: we did not configure a driver
> > > > and hence used the Python mysql wrapper instead … essentially changing
> > > >
> > > > connection = mysql://cinder:<pw>@<host>:<port>/cinder
> > > >
> > > > to
> > > >
> > > > connection = mysql+pymysql://cinder:<pw>@<host>:<port>/cinder
> > > >
> > > > solved the parallel deletion issue for us.
> > > >
> > > > All details in the last paragraph of [1].
> > > >
> > > > HTH!
> > > >  Arne
> > > >
> > > > [1] https://techblog.web.cern.ch/techblog/post/experiences-with-cinder-in-production/
> > > >
> > >
> > > Good point, using a C mysql connection library will induce thread
> > > starvation.  This was thoroughly discussed, and the default changed,
> > > like 2 years ago...  So I assumed we all changed that.
> > >
> > > Something else that could be problematic when receiving many concurrent
> > > requests on any Cinder service is the number of concurrent DB
> > > connections, although we also changed this a while back to 50.  This is
> > > set as sql_max_retries or max_retries (depending on the version) in the
> > > "[database]" section.
> > >
> > > Cheers,
> > > Gorka.
> > >
> > >
> > > >
> > > >
> > > > > On 12 Feb 2019, at 01:07, Jae Sang Lee <hyangii at gmail.com> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I tested today by increasing EVENTLET_THREADPOOL_SIZE to 100. I was
> > > > > hoping for good results, but this time I did not get a response
> > > > > after removing 41 volumes. This environment variable did not fix the
> > > > > cinder-volume hang.
> > > > >
> > > > > Restarting the stopped cinder-volume deletes all of the volumes
> > > > > that are in the deleting state while running the clean_up function.
> > > > > Only one volume remained in the deleting state; I forced its state
> > > > > back to available and then deleted it, so in the end all volumes
> > > > > were deleted.
> > > > >
> > > > > The result was the same 3 times in a row: after removing dozens of
> > > > > volumes, cinder-volume went down, and after restarting the service,
> > > > > 199 volumes were deleted and one volume had to be erased manually.
> > > > >
> > > > > If you have a different approach to solving this problem, please
> > > > > let me know.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Mon, Feb 11, 2019 at 9:40 PM, Arne Wiebalck <Arne.Wiebalck at cern.ch> wrote:
> > > > > Jae,
> > > > >
> > > > >> On 11 Feb 2019, at 11:39, Jae Sang Lee <hyangii at gmail.com> wrote:
> > > > >>
> > > > >> Arne,
> > > > >>
> > > > >> I saw messages like "moving volume to trash" in the cinder-volume
> > > > >> logs, and the periodic task also reports messages like
> > > > >> "Deleted <vol-uuid> from trash for backend '<backends-name>'"
> > > > >>
> > > > >> The patch worked well when clearing a small number of volumes.
> > > > >> This happens only when I am deleting a large number of volumes.
> > > > >
> > > > > Hmm, from cinder's point of view, the deletion should be more or
> > > > > less instantaneous, so it should be able to "delete" many more
> > > > > volumes before getting stuck.
> > > > >
> > > > > The periodic task, however, will go through the volumes one by one,
> > > > > so if you delete many at the same time, volumes may pile up in the
> > > > > trash (for some time) before the task gets round to deleting them.
> > > > > This should not affect c-vol, though.
> > > > >
> > > > >> I will try adjusting the thread pool size via the environment
> > > > >> variable, as you advised.
> > > > >>
> > > > >> Do you know why the cinder-volume hang does not occur when creating
> > > > >> a volume, but only when deleting a volume?
> > > > >
> > > > > Deleting a volume ties up a thread for the duration of the deletion
> > > > > (which is synchronous and can hence take very long). If you have too
> > > > > many deletions going on at the same time, you run out of threads and
> > > > > c-vol will eventually time out. FWIU, creation basically works the
> > > > > same way, but it is almost instantaneous, hence the risk of using up
> > > > > all threads is simply lower (Gorka may correct me here :-).
> > > > >
> > > > > Cheers,
> > > > >  Arne
> > > > >
> > > > >>
> > > > >>
> > > > >> Thanks.
> > > > >>
> > > > >>
> > > > >> On Mon, Feb 11, 2019 at 6:14 PM, Arne Wiebalck <Arne.Wiebalck at cern.ch> wrote:
> > > > >> Jae,
> > > > >>
> > > > >> To make sure deferred deletion is properly working: when you delete
> > > > >> individual large volumes with data in them, do you see that
> > > > >> - the volume is fully "deleted" within a few seconds, ie. not
> > > > >>   staying in "deleting" for a long time?
> > > > >> - the volume shows up in the trash (with "rbd trash ls")?
> > > > >> - the periodic task reports it is deleting volumes from the trash?
> > > > >>
> > > > >> Another option to look at is "backend_native_threads_pool_size":
> > > > >> this will increase the number of threads that work on deleting
> > > > >> volumes. It is independent from deferred deletion, but can also
> > > > >> help with situations where Cinder has more work to do than it can
> > > > >> cope with at the moment.
> > > > >>
> > > > >> Cheers,
> > > > >>  Arne
> > > > >>
> > > > >>
> > > > >>
> > > > >>> On 11 Feb 2019, at 09:47, Jae Sang Lee <hyangii at gmail.com> wrote:
> > > > >>>
> > > > >>> Yes, I added your code to pike release manually.
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On Mon, Feb 11, 2019 at 4:39 PM, Arne Wiebalck <Arne.Wiebalck at cern.ch> wrote:
> > > > >>> Hi Jae,
> > > > >>>
> > > > >>> You back ported the deferred deletion patch to Pike?
> > > > >>>
> > > > >>> Cheers,
> > > > >>>  Arne
> > > > >>>
> > > > >>> > On 11 Feb 2019, at 07:54, Jae Sang Lee <hyangii at gmail.com> wrote:
> > > > >>> >
> > > > >>> > Hello,
> > > > >>> >
> > > > >>> > I recently ran a volume deletion test with deferred deletion
> > > > >>> > enabled on the pike release.
> > > > >>> >
> > > > >>> > We experienced a cinder-volume hang when we were deleting a large
> > > > >>> > number of volumes in which data had actually been written (I made
> > > > >>> > a 15GB file in every volume), and we thought deferred deletion
> > > > >>> > would solve it.
> > > > >>> >
> > > > >>> > However, while deleting 200 volumes, cinder-volume went down
> > > > >>> > after about 50 volumes, as before. In my opinion, the trash_move
> > > > >>> > api does not seem to work properly when removing multiple
> > > > >>> > volumes, just like the remove api.
> > > > >>> >
> > > > >>> > If these test results are due to a mistake on my part, please
> > > > >>> > let me know the correct test method.
> > > > >>> >
> > > > >>>
> > > > >>> --
> > > > >>> Arne Wiebalck
> > > > >>> CERN IT
> > > > >>>
> > > > >>
> > > > >> --
> > > > >> Arne Wiebalck
> > > > >> CERN IT
> > > > >>
> > > > >
> > > > > --
> > > > > Arne Wiebalck
> > > > > CERN IT
> > > > >
> > > >
> > >
>