git.net

Reading multiple files from S3 source in parallel


Hello,
I'd like to create a Flink batch application that processes multiple files from an S3 source in parallel. Suppose I have the following S3 structure and my Flink application runs with a parallelism of 3:
     s3://bucket/data-1/worker-1/file-1.txt
     s3://bucket/data-1/worker-1/file-2.txt
     s3://bucket/data-1/worker-2/file-1.txt
     s3://bucket/data-1/worker-2/file-2.txt
     s3://bucket/data-1/worker-3/file-1.txt
     s3://bucket/data-1/worker-3/file-2.txt

     s3://bucket/data-2/worker-1/file-1.txt
     s3://bucket/data-2/worker-1/file-2.txt
     s3://bucket/data-2/worker-2/file-1.txt
     s3://bucket/data-2/worker-2/file-2.txt
     s3://bucket/data-2/worker-3/file-1.txt
     s3://bucket/data-2/worker-3/file-2.txt

     s3://bucket/data-3/worker-1/file-1.txt
     s3://bucket/data-3/worker-1/file-2.txt
     s3://bucket/data-3/worker-2/file-1.txt
     s3://bucket/data-3/worker-2/file-2.txt
     s3://bucket/data-3/worker-3/file-1.txt
     s3://bucket/data-3/worker-3/file-2.txt

I'd like each Flink worker to process only its own files, in parallel with the other workers. For example, Flink worker #1 should process only these files, in this order:
     s3://bucket/data-1/worker-1/file-1.txt
     s3://bucket/data-1/worker-1/file-2.txt
     s3://bucket/data-2/worker-1/file-1.txt
     s3://bucket/data-2/worker-1/file-2.txt
     s3://bucket/data-3/worker-1/file-1.txt
     s3://bucket/data-3/worker-1/file-2.txt
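
To make the intended assignment concrete, here is a minimal sketch in plain Python. It is not Flink code and says nothing about how Flink creates or distributes input splits; it only spells out the grouping and ordering I am after. The helper name `assign_files` and the regular expression are my own assumptions, based on the path layout above.

```python
import re
from collections import defaultdict

# Assumed layout: s3://<bucket>/data-<n>/worker-<n>/file-<n>.txt
PATH_RE = re.compile(r"^s3://[^/]+/data-(\d+)/worker-(\d+)/file-(\d+)\.txt$")


def assign_files(paths):
    """Hypothetical helper: group paths by worker number and order each
    group by data number first, then file number."""
    groups = defaultdict(list)
    for path in paths:
        match = PATH_RE.match(path)
        if match is None:
            raise ValueError(f"unexpected path layout: {path}")
        data_no, worker_no, file_no = (int(g) for g in match.groups())
        groups[worker_no].append((data_no, file_no, path))
    # Sorting the (data_no, file_no, path) tuples gives the desired order.
    return {
        worker: [path for _, _, path in sorted(items)]
        for worker, items in sorted(groups.items())
    }


# The 18 files from the listing above.
all_paths = [
    f"s3://bucket/data-{d}/worker-{w}/file-{f}.txt"
    for d in (1, 2, 3)
    for w in (1, 2, 3)
    for f in (1, 2)
]

assignment = assign_files(all_paths)
for path in assignment[1]:
    print(path)
```

With the listing above, `assignment[1]` holds exactly the six worker-1 files in the order shown, and workers 2 and 3 get the analogous lists.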

How can I configure the Flink application's data source to achieve this? Thank you for your help.