
Re: SplittableDoFn for zipWithIndex for a large file



I've also mirrored my response on StackOverflow: https://stackoverflow.com/a/53771980/33791 

On Thu, Dec 13, 2018 at 4:21 PM Chak-Pong Chung <cchung49@xxxxxxxxxx> wrote:
Hello everyone!

I asked the following question and would appreciate suggestions on whether what I want is doable.


If I can get the `PCollection` id and the number of (contiguous) lines in each `PCollection`, then I can first calculate the row order within each partition/`PCollection` and then do a prefix sum over the line counts to compute the offset for each partition. This is doable in MPI or OpenMP, since there I can get the id/rank of each process/thread.
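
To make the idea concrete, here is a minimal sketch of the two-pass scheme in plain Python. It does not use any Beam API: the partitions are simulated as an ordered list of lists, and the function names `partition_offsets` and `zip_with_index` are hypothetical, chosen only for illustration. It assumes the partitions are contiguous and that their order is known.

```python
def partition_offsets(counts):
    """Exclusive prefix sum over per-partition line counts.

    The offset of partition i is the total number of lines
    in partitions 0..i-1.
    """
    offsets = []
    running = 0
    for count in counts:
        offsets.append(running)
        running += count
    return offsets


def zip_with_index(partitions):
    """Attach a global row index to every line.

    `partitions` is an ordered list of lists of lines (an assumption
    standing in for contiguous chunks of the file).
    """
    # Pass 1: count the lines in each partition.
    counts = [len(part) for part in partitions]
    # Prefix sum: turn the counts into a starting offset per partition.
    offsets = partition_offsets(counts)
    # Pass 2: global index = partition offset + local index.
    return [
        [(offset + local, line) for local, line in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


if __name__ == "__main__":
    chunks = [["a", "b"], ["c"], ["d", "e", "f"]]
    for indexed in zip_with_index(chunks):
        print(indexed)
```

Only the per-partition counts have to be gathered in one place for the prefix sum, so the first pass stays cheap even when the file itself is large.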

Best,
Chak-Pong

