Thanks for responding.
I managed to resolve the problem last Friday: I had been creating a separate datasource for each file, instead of one big datasource for all the files. Reading the one or two HDFS blocks within each datasource was then distributed to only a small percentage of the slots (roughly 10%). This was some Flink-specific Beam runner behaviour I did not yet know about.
Using a single TextIO for all the files allowed Flink to use all available parallelism. It did require jumping through some hoops: each file had metadata associated with it (previously one PCollection per file), and that metadata was harder to match back to each file's records once everything sat in one big PCollection.
From: Aljoscha Krettek <aljoscha@xxxxxxxxxx>
Sent: 13 March 2018 18:29:52
Subject: Re: HDFS data locality and distribution, Flink
There should be no data-locality awareness with Beam on Flink, because Beam has no APIs that Flink could use to schedule tasks with locality in mind. It seems the readers just happen to be distributed the way they are.
Are the files roughly of equal size?