git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Hashjoin implementation


Hi vino,

thanks. 

I was running a join operation on two DataSets and writing the result to disk and the results were correct.
I just was not able to identify the moment when the Hashtable is built. (HashPartition.java is not used in this case?)

Do you have an idea why I cannot find it?


Here is a part of my code:
DataSet<RowCustomers> customers = env.fromElements( new RowCustomers(1, Mayer));
tEnv.registerDataSet("Customers", customers, "customerID, customerName");
DataSet<RowOrders> orders = env.fromElements( new RowOrders(1, 1, new String[]{Pen, Paper"}, "22.08.2018"));
tEnv.registerDataSet("Orders", orders, "orderID, customerID, items, date");
DataSet<Tuple2 <RowCustomers, RowOrders>> result = customers.join(orders, JoinOperatorBase.JoinHint.REPARTITION_HASH_FIRST)
.where("customerID").equalTo("customerID");
result.writeAsText(„/“);

Thanks a lot.
Benjamin


Am 11.09.2018 um 04:24 schrieb vino yang <yanghua1127@xxxxxxxxx>:

Hi Benjamin,

The approximate location is this package, the more accurate location is here.[1]

Specifically, Hash Join is divided into two steps:

1) build side
2) probe side

Thanks ,vino.


Benjamin Burkhardt <benjamin.burkhardt@xxxxxxxxxxxxxxxxxx> 于2018年9月10日周一 下午10:09写道:
Hi,

can anyone tell me where the default hybrid hash join function for partitioning (shuffle phase) is implemented?
Even after deeper dinning I was not able to figure out where it is located.

Might be somewhere here?
—> https://github.com/apache/flink/tree/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash

Thanks in advance.

Benjamin