git.net


Re: Returning dataframe from parDo and printing its value - advice?


Thanks for the response.
I tried this within the current ParDo, CreateColForSampleFn, and Apache Beam returns a warning recommending against returning a string.

So, my questions are:
- Is it essential to separate this transformation into a different ParDo?
- Should I ignore that message? When is this message relevant?
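
A minimal sketch of why the warning matters, assuming only that a ParDo iterates over whatever the DoFn returns and emits each item as its own output element. The helper `emitted_elements` is hypothetical and merely simulates that iteration in plain Python; it is not part of the Beam API.

```python
def emitted_elements(dofn_result):
    """Simulate a ParDo consuming a DoFn's return value.

    Assumption: the runner iterates over the returned object and
    emits every item as a separate output element.
    """
    return list(dofn_result)


# Returning a bare string: each character becomes its own element.
as_string = emitted_elements("a,b")

# Returning a list of cell values: each value becomes its own element,
# so WriteToText would write one value per line.
as_values = emitted_elements([None, " 30", " Primary Tissue"])

# Returning a one-item list holding the formatted line: one element.
as_one_line = emitted_elements([", 30, Primary Tissue"])
```

Under that assumption the message is relevant whenever the string is meant to be a single element. Wrapping the line in a list (or yielding it) keeps it whole, which is possible inside the existing DoFn as well as in a separate step.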

Many thanks,
Eila

On Mon, Jun 18, 2018 at 2:52 PM Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:
User is the correct mailing list.

beam.io.WriteToText takes 'strings' which means that you have to format the whole line yourself. You'll want to apply another ParDo after CreateColForSampleFn which takes the 1x164 record and concatenates each value with ',' in between.
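
The formatting step described above could be sketched as follows. This is an illustration under stated assumptions, not the thread's actual code: the name `format_csv_line` is hypothetical, and it assumes each record arrives as one list of cell values. The standard-library `csv` module is used so that values containing commas are quoted.

```python
import csv
import io


def format_csv_line(values):
    """Join one record's values into a single CSV line.

    None is written as an empty field, and values containing commas
    are quoted by the csv module.
    """
    buffer = io.StringIO()
    # No line terminator here: WriteToText appends the newline itself
    # when append_trailing_newlines=True.
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()


line = format_csv_line([None, " 30", " Primary Tissue", "a,b"])
```

In the pipeline this could be applied with `beam.Map(format_csv_line)` between the "create more columns" step and WriteToText, assuming CreateColForSampleFn emits each record as a single list of values.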

On Mon, Jun 18, 2018 at 9:00 AM OrielResearch Eila Arich-Landkof <eila@xxxxxxxxxxxxxxxxx> wrote:
Hi,

Is anyone listening on the user@ mailing list, or should I use a different mailing list?

I have made some progress:
- The ParDo now returns a list.
- I added a header to WriteToText.

The pipeline looks like this:
ExploreData = (p | "Extract the rows from dataframe" >> beam.io.Read(beam.io.BigQuerySource('archs4.Debug_annotation'))
                | "create more columns" >> beam.ParDo(CreateColForSampleFn(colListSubset,outputPath)))

(ExploreData | 'writing to CSV files' >> beam.io.WriteToText('gs://dataExploration.txt',file_name_suffix='.csv',num_shards=1,append_trailing_newlines=True,header=colListStr))


The remaining issue is that the output has a newline after each value:
None
None
None
None
None
 30
 Primary Tissue
None
None
None
Please let me know how I can get rid of these newlines. I hope to be able to open the output file with Google Sheets.

Thanks,
Eila


On Fri, Jun 15, 2018 at 2:45 PM, OrielResearch Eila Arich-Landkof <eila@xxxxxxxxxxxxxxxxx> wrote:
Hi all,

I am running a pipeline in which a table from BQ is processed row by row using a ParDo.
CreateColForSampleFn generates a data frame with headers and values (shape: 1x164) that I want to pass to WriteToText.
See the following:

ExploreData = (p | "Extract the rows from dataframe" >> beam.io.Read(beam.io.BigQuerySource('archs4.Debug_annotation'))
                | "create more columns" >> beam.ParDo(CreateColForSampleFn(colListSubset,outputPath)))

(ExploreData | 'writing to CSV files' >> beam.io.WriteToText('gs://dataExploration.txt',num_shards=1))
 
My questions are related to the returned DF and WriteToText:
1. When I pass the DF from CreateColForSampleFn to WriteToText, I get only the headers:
Sample_contact_phone
Sample_extract_protocol_ch1
Sample_platform_id
Sick
Sample_title
index
Sample_last_update_date
Sample_contact_country
Sample_channel_count
Sample_library_source
Sample_taxid_ch1

2. When I return the DF in a list, [df], I get the following text for each row (including the dimensions):
 Sample_contact_phone                        Sample_extract_protocol_ch1 Sample_platform_id  Sick
0                       Library construction protocol: Four µg of tota...           GPL11154  None
[1 rows x 168 columns]


I want to generate a text file that includes:
- One header (if needed, I will add it after the pipeline completes)
- All the values from each row that was processed into a DF
- Full cell values, without "..." in the middle

What am I missing? Any advice?
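
A small pandas sketch of the two behaviours described above, using a made-up one-row frame (the `Sample_title` value is hypothetical filler, not data from the real table). Iterating over a DataFrame yields its column labels, which would explain why only the headers were written. The "..." and the "[1 rows x 168 columns]" line come from the frame's printed representation, whereas `to_csv` serialises the full cell values.

```python
import pandas as pd

# Hypothetical one-row frame standing in for the DoFn's output.
df = pd.DataFrame([{
    "Sample_platform_id": "GPL11154",
    "Sick": None,
    "Sample_title": "hypothetical long free-text value " * 5,
}])

# Iterating over a DataFrame yields column labels, not row values.
column_names = list(df)

# to_csv writes full cell values; None becomes an empty field.
# The trailing line break is stripped because WriteToText adds its own.
csv_line = df.to_csv(header=False, index=False).rstrip("\r\n")
```

Emitting `[csv_line]` from the DoFn would then give WriteToText one full-width line per processed row, with the header supplied separately.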
 
Thanks,