git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bug? pyarrow deserialize_components doesn't work in multiple processes


This seems possibly similar to the issue reported in
https://github.com/apache/arrow/issues/1946 -- we never found a
resolution. Could we open a JIRA to track the problem?

On Fri, Jul 6, 2018 at 3:57 AM, Josh Quigley
<josh.quigley@xxxxxxxxxxxxxxxxxx> wrote:
> That works. I've tried a bunch of debugging and work arounds- as far as I
> can tell this is just a problem with deserializr from components and
> multiprocess.
>
> On Fri., 6 Jul. 2018, 5:12 pm Robert Nishihara, <robertnishihara@xxxxxxxxx>
> wrote:
>
>> Can you reproduce it without all of the multiprocessing code? E.g., just
>> call *pyarrow.serialize* in one interpreter. Then copy and paste the bytes
>> into another interpreter and call *pyarrow.deserialize *or
>> *pyarrow.deserialize_components*?
>> On Thu, Jul 5, 2018 at 9:48 PM Josh Quigley <
>> josh.quigley@xxxxxxxxxxxxxxxxxx>
>> wrote:
>>
>> > Attachment inline:
>> >
>> > import pyarrow as pa
>> > import multiprocessing as mp
>> > import numpy as np
>> >
>> > def make_payload():
>> >     """Common function - make data to send"""
>> >     return ['message', 123, np.random.uniform(-100, 100, (4, 4))]
>> >
>> > def send_payload(payload, connection):
>> >     """Common function - serialize & send data through a socket"""
>> >     s = pa.serialize(payload)
>> >     c = s.to_components()
>> >
>> >     # Send
>> >     data = c.pop('data')
>> >     connection.send(c)
>> >     for d in data:
>> >         connection.send_bytes(d)
>> >     connection.send_bytes(b'')
>> >
>> >
>> > def recv_payload(connection):
>> >     """Common function - recv data through a socket & deserialize"""
>> >     c = connection.recv()
>> >     c['data'] = []
>> >     while True:
>> >         r = connection.recv_bytes()
>> >         if len(r) == 0:
>> >             break
>> >         c['data'].append(pa.py_buffer(r))
>> >
>> >     print('...deserialize')
>> >     return pa.deserialize_components(c)
>> >
>> >
>> > def run_same_process():
>> >     """Same process: Send data down a socket, then read data from the
>> > matching socket"""
>> >     print('run_same_process')
>> >     recv_conn,send_conn = mp.Pipe(duplex=False)
>> >     payload = make_payload()
>> >     print(payload)
>> >     send_payload(payload, send_conn)
>> >     payload2 = recv_payload(recv_conn)
>> >     print(payload2)
>> >
>> >
>> > def receiver(recv_conn):
>> >     """Separate process: runs in a different process, recv data &
>> > deserialize"""
>> >     print('Receiver started')
>> >     payload = recv_payload(recv_conn)
>> >     print(payload)
>> >
>> >
>> > def run_separate_process():
>> >     """Separate process: launch the child process, then send data"""
>> >
>> >
>> >     print('run_separate_process')
>> >     recv_conn,send_conn = mp.Pipe(duplex=False)
>> >     process = mp.Process(target=receiver, args=(recv_conn,))
>> >     process.start()
>> >
>> >     payload = make_payload()
>> >     print(payload)
>> >     send_payload(payload, send_conn)
>> >
>> >     process.join()
>> >
>> > if __name__ == '__main__':
>> >     run_same_process()
>> >     run_separate_process()
>> >
>> >
>> > On Fri, Jul 6, 2018 at 2:42 PM Josh Quigley <
>> > josh.quigley@xxxxxxxxxxxxxxxxxx>
>> > wrote:
>> >
>> > > A reproducible program attached - it first runs serialize/deserialize
>> > from
>> > > the same process, then it does the same work using a separate process
>> for
>> > > the deserialize.
>> > >
>> > > The behaviour see is (after the same process code executes happily) is
>> > > hanging / child-process crashing during the call to deserialize.
>> > >
>> > > Is this expected, and if not, is there a known workaround?
>> > >
>> > > Running Windows 10, conda distribution,  with package versions listed
>> > > below. I'll also see what happens if I run on *nix.
>> > >
>> > >   - arrow-cpp=0.9.0=py36_vc14_7
>> > >   - boost-cpp=1.66.0=vc14_1
>> > >   - bzip2=1.0.6=vc14_1
>> > >   - hdf5=1.10.2=vc14_0
>> > >   - lzo=2.10=vc14_0
>> > >   - parquet-cpp=1.4.0=vc14_0
>> > >   - snappy=1.1.7=vc14_1
>> > >   - zlib=1.2.11=vc14_0
>> > >   - blas=1.0=mkl
>> > >   - blosc=1.14.3=he51fdeb_0
>> > >   - cython=0.28.3=py36hfa6e2cd_0
>> > >   - icc_rt=2017.0.4=h97af966_0
>> > >   - intel-openmp=2018.0.3=0
>> > >   - numexpr=2.6.5=py36hcd2f87e_0
>> > >   - numpy=1.14.5=py36h9fa60d3_2
>> > >   - numpy-base=1.14.5=py36h5c71026_2
>> > >   - pandas=0.23.1=py36h830ac7b_0
>> > >   - pyarrow=0.9.0=py36hfe5e424_2
>> > >   - pytables=3.4.4=py36he6f6034_0
>> > >   - python=3.6.6=hea74fb7_0
>> > >   - vc=14=h0510ff6_3
>> > >   - vs2015_runtime=14.0.25123=3
>> > >
>> > >
>> >
>>