Re: Transferring files between S3 and GCS


I ran into the same issue and ended up building a separate operator that
works as you describe, though I haven't submitted it as a PR. Happy to
share my implementation with you.

I found that it's useful to have both ways of transferring data.
Initially, I migrated all of my S3ToGCS tasks to use the transfer
service, but I found that its performance can be unreliable with some
combination of 1) transferring smaller datasets and 2) invoking many
transfers in parallel. The transfer service is a bit of a black box, so
when it doesn't work as expected you're stuck. Because of this, I ended
up migrating some of my tasks back to the original implementation.

I would definitely keep both options around--I don't have a preference
between a new operator and a param on the existing operator.
Chris


On Fri, Oct 19, 2018, at 7:09 AM, Conrad Lee wrote:
> Hello Airflow community,
> 
> I'm interested in transferring data between S3 and Google Cloud
> Storage.  I want to transfer data on the scale of hundreds of gigabytes
> to a few terabytes.
> 
> Airflow already has an operator that could be used for this use-case:
> the S3ToGoogleCloudStorageOperator.  However, looking over its
> implementation it appears that all the data to be transferred actually
> passes through the machine running airflow.  That seems completely
> unnecessary to me, and will place a lot of burden on the airflow
> workers and will be bottlenecked by the bandwidth of the workers.  It
> could even lead to out of disk errors like this one
> <https://stackoverflow.com/questions/52400144/airflow-s3togooglecloudstorageoperator-no-space-left-on-device>.
> 
> I would much rather use Google Cloud's 'Transfer Service' for doing
> this--that way the airflow operator just needs to make an API call and
> (optionally) keep polling the API until the transfer is done (this last
> bit could be done in a sensor).  The heavy work of performing the
> transfer is offloaded to the Transfer Service.
> 
> Was it an intentional design decision to avoid using the Google
> Transfer Service?  If I create a PR that adds the ability to perform
> transfers with the Google Transfer Service, should it
> 
>   - replace the existing operator
>   - be an option on the existing operator (i.e., add an argument that
>     toggles between 'local worker transfer' and 'google hosted
>     transfer')
>   - make a new operator
> 
> Thanks,
> Conrad Lee
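
For reference, the Transfer Service approach described in the quoted
message (one API call to create the job, then polling until the transfer
is done) might look roughly like the sketch below. This is not the
operator mentioned in the reply, and it is not code from Airflow. The
function names build_transfer_job and wait_for_transfer are hypothetical,
as is the injected list_operations callable, which stands in for a call
to the Storage Transfer API's transferOperations.list. The request body
follows the transferJobs.create format as I understand it; check the
field names against the current API reference before relying on them.

```python
import time


def build_transfer_job(project_id, s3_bucket, gcs_bucket,
                       aws_key_id, aws_secret, run_date):
    """Build a request body for a one-off S3 -> GCS transfer job.

    Hypothetical helper: the dict is meant to be passed as the body of
    transferJobs.create. Setting the same start and end date asks the
    service to run the job once.
    """
    day = {"year": run_date.year, "month": run_date.month, "day": run_date.day}
    return {
        "description": "s3://%s -> gs://%s" % (s3_bucket, gcs_bucket),
        "status": "ENABLED",
        "projectId": project_id,
        "schedule": {"scheduleStartDate": day, "scheduleEndDate": day},
        "transferSpec": {
            "awsS3DataSource": {
                "bucketName": s3_bucket,
                "awsAccessKey": {
                    "accessKeyId": aws_key_id,
                    "secretAccessKey": aws_secret,
                },
            },
            "gcsDataSink": {"bucketName": gcs_bucket},
        },
    }


def wait_for_transfer(list_operations, poll_interval=10.0, timeout=3600.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until every operation of a job reports SUCCESS.

    list_operations is a hypothetical zero-argument callable returning a
    list of operation dicts shaped like {"metadata": {"status": ...}}.
    Raises RuntimeError on FAILED/ABORTED and TimeoutError on timeout.
    """
    deadline = clock() + timeout
    while True:
        operations = list_operations()
        statuses = [op["metadata"]["status"] for op in operations]
        if any(s in ("FAILED", "ABORTED") for s in statuses):
            raise RuntimeError("transfer did not complete: %s" % statuses)
        # An empty list means the service has not started the job yet.
        if statuses and all(s == "SUCCESS" for s in statuses):
            return operations
        if clock() >= deadline:
            raise TimeoutError("transfer still running after %ss" % timeout)
        sleep(poll_interval)
```

Keeping the polling loop separate from job creation matches the
suggestion in the quoted message that the waiting could live in a sensor,
so a worker slot is not held for the whole transfer.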