Re: Pyarrow Plasma client.release() fault


Update:

I'm investigating the possibility that I've reached the overcommit limit in
the kernel as a result of all the parallel processes.

This still doesn't fix the client.release() problem but it might explain
why the processing appears to halt, after some time, until I restart the
Jupyter kernel.
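
A quick way to check this is sketched below. It is only a minimal sketch,
assuming a Linux host with /proc mounted; the helper names (parse_meminfo,
commit_headroom_kb, report) are made up for illustration and are not part of
any library. It reads the overcommit mode and compares CommitLimit with
Committed_AS from /proc/meminfo.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most values are reported in kB; the unit suffix is dropped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(meminfo):
    """Return CommitLimit - Committed_AS in kB.

    A negative result means more memory is committed than the limit.
    The kernel only enforces CommitLimit in strict mode (overcommit_memory=2).
    """
    return meminfo["CommitLimit"] - meminfo["Committed_AS"]


def report(proc=Path("/proc")):
    """Print the overcommit mode and commit headroom, if procfs is available."""
    mode_file = proc / "sys" / "vm" / "overcommit_memory"
    meminfo_file = proc / "meminfo"
    if not (mode_file.exists() and meminfo_file.exists()):
        print("procfs files not found; this check only applies to Linux")
        return None
    # 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting
    mode = int(mode_file.read_text().strip())
    info = parse_meminfo(meminfo_file.read_text())
    headroom = commit_headroom_kb(info)
    print(f"overcommit_memory mode: {mode}")
    print(f"CommitLimit:  {info['CommitLimit']} kB")
    print(f"Committed_AS: {info['Committed_AS']} kB")
    print(f"headroom:     {headroom} kB")
    return mode, headroom


if __name__ == "__main__":
    report()
```

Running this while the worker pool is active should show whether Committed_AS
is approaching CommitLimit. In modes 0 and 1 the limit is informational only,
so a small headroom there does not by itself prove this is the cause.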

On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet <cjnolet@xxxxxxxxx> wrote:

> Wes,
>
> Unfortunately, my code is on a separate network. I'll try to explain what
> I'm doing and if you need further detail, I can certainly pseudocode
> specifics.
>
> I am using multiprocessing.Pool() to start a pool of worker processes that
> handle different filenames. In each worker, I'm performing a pd.read_csv(),
> sorting by the timestamp field (rounded to the day) and chunking the
> DataFrame into separate DataFrames. I create a new Plasma ObjectID for each
> of the chunked DataFrames, convert them to RecordBatch objects, stream the
> bytes to Plasma and seal the objects. Only the ObjectIDs are returned to
> the orchestrating process.
>
> In follow-on processing, I'm combining the ObjectIDs for each of the
> unique day timestamps into lists and I'm passing those into a function in
> parallel using multiprocessing.Pool(). In this function, I'm iterating
> through the lists of ObjectIDs, loading them back into DataFrames,
> appending them together until their combined size exceeds a predefined
> threshold, and then calling df.to_parquet().
>
> The steps in the two paragraphs above are performed in a loop, batching up
> 500-1k files at a time for each iteration.
>
> When I run this iteration a few times, it eventually locks up the Plasma
> client. With regard to the release() fault, it doesn't seem to matter when
> or where I run it (in the orchestrating process or in the workers); it
> always seems to crash the Jupyter kernel. I think I might be using it
> wrong, and I'm trying to figure out where and what I'm doing wrong.
>
> Thanks again!
>
> On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>
>> Hi Corey,
>>
>> Can you provide the code (or a simplified version thereof) that shows
>> how you're using Plasma?
>>
>> - Wes
>>
>> On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet <cjnolet@xxxxxxxxx> wrote:
>> > I'm on a system with 12 TB of memory and attempting to use Pyarrow's
>> > Plasma client to convert a series of CSV files (via Pandas) into a
>> > Parquet store.
>> >
>> > I've got a little over 20k CSV files to process, which are about 1-2 GB
>> > each. I'm loading 500 to 1000 files at a time.
>> >
>> > In each iteration, I'm loading a series of files, partitioning them by a
>> > time field into separate dataframes, then writing parquet files in
>> > directories for each day.
>> >
>> > The problem I'm having is that the Plasma client and server appear to
>> > lock up after about 2-3 iterations. It locks up to the point where I
>> > can't even CTRL+C the server. I am able to stop the notebook, but
>> > re-running the code just locks up again when interacting with Jupyter.
>> > There are no errors in my logs to tell me something's wrong.
>> >
>> > Just to make sure I'm not being impatient and possibly need to wait
>> > for some background services to finish, I allowed the code to run
>> > overnight, and it was still in the same state when I came in to work
>> > this morning. I'm running the Plasma server with 4 TB max.
>> >
>> > In an attempt to proactively free up some of the objects that I no
>> > longer need, I also tried the client.release() function, but I
>> > cannot figure out how to make it work properly. It crashes my
>> > Jupyter kernel each time I try.
>> >
>> > I'm using Pyarrow 0.9.0
>> >
>> > Thanks in advance.
>>
>