
Re: Help with adding Python package dependencies when executing a Python pipeline


Based on https://stackoverflow.com/questions/44423769/how-to-use-google-cloud-storage-in-dataflow-pipeline-run-from-datalab
I tried this:
options = PipelineOptions(flags=["--requirements_file", "./requirements.txt"])
The requirements file was generated by:
pip freeze > requirements.txt

But it raises the following error:
CalledProcessError: Command '['/usr/local/envs/py2env/bin/python', '-m', 'pip', 'install', '--download', '/tmp/dataflow-requirements-cache', '-r', 'requirements.txt', '--no-binary', ':all:']' returned non-zero exit status 1
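
(Two notes on that failure, for the archives. First, "pip install --download" was removed in pip 10, so on a new enough pip this command fails no matter what the file contains; a newer Beam SDK release or temporarily pinning pip below 10 avoids the flag. Second, "pip freeze" in Datalab lists the whole py2env environment, including conda-installed packages that have no source distribution on PyPI, while --no-binary :all: forces source-only downloads, so a trimmed file listing only what the workers import is far less likely to fail. Re-running the same step by hand surfaces pip's actual error message; a minimal sketch, assuming a trimmed requirements.txt in the working directory:

import subprocess, sys

# Reproduce the Dataflow staging step interactively so pip's real error
# output is visible, instead of just a CalledProcessError.
subprocess.check_call([
    sys.executable, '-m', 'pip', 'install',
    '--download', '/tmp/dataflow-requirements-cache',
    '-r', 'requirements.txt', '--no-binary', ':all:'])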

Any suggestions?
Thanks,
Eila

On Tue, Jul 3, 2018 at 5:25 PM, OrielResearch Eila Arich-Landkof <eila@xxxxxxxxxxxxxxxxx> wrote:
Thank you. Where do I add the reference to requirements.txt? Can I do it from the pipeline options code?
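
(For reference, it can go in either of two places; a minimal sketch, assuming requirements.txt sits next to the notebook:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Either pass it as a command-line style flag at construction time...
options = PipelineOptions(flags=['--requirements_file=./requirements.txt'])

# ...or set it on the SetupOptions view after construction.
options = PipelineOptions()
options.view_as(SetupOptions).requirements_file = './requirements.txt'

Both forms end up in the same place, since the flags are parsed into the SetupOptions view.)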

On Tue, Jul 3, 2018 at 5:13 PM, Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:

On Tue, Jul 3, 2018 at 2:09 PM OrielResearch Eila Arich-Landkof <eila@xxxxxxxxxxxxxxxxx> wrote:
Hello all,


I am using Python code to run my pipeline, similar to the following:

from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions)

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'my-project-id'
google_cloud_options.job_name = 'myjob'
google_cloud_options.staging_location = 'gs://your-bucket-name-here/staging'
google_cloud_options.temp_location = 'gs://your-bucket-name-here/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'


I would like to add the pandas-gbq package installation to my workers. What would be the recommended way to do so? Can I add it to the PipelineOptions()?
I remember that there are a few options; one of them was creating a requirements text file (sketched below), but I cannot remember where I saw it, or whether it is the simplest way when running the pipeline from Datalab.
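
(For reference, a minimal sketch of the mechanisms the SDK exposes for this, applied to the options object above; the file names are assumptions, not files from this thread:

from apache_beam.options.pipeline_options import SetupOptions

setup_options = options.view_as(SetupOptions)

# PyPI dependencies, one per line in a requirements file
# (e.g. a file containing just the line: pandas-gbq).
setup_options.requirements_file = './requirements.txt'

# Alternatively, stage a local package tarball with the job:
# setup_options.extra_packages = ['./my_package-0.0.1.tar.gz']

# Or point at a setup.py whose install_requires lists pandas-gbq:
# setup_options.setup_file = './setup.py'

Workers install whatever the requirements file lists when they start, so for this case a single pandas-gbq line should be enough.)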

Thank you for any reference!


