
Re: ETL options from Hive/Presto/s3 to cassandra


Spark scales to as many nodes as you want and can be colocated with the data nodes. sstableloader won't be as performant for larger datasets. Although it can be run in parallel on different nodes, I don't believe it to be as fault tolerant.
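
To make the Spark route concrete, here is a minimal sketch of what such a job could look like in PySpark. It assumes the spark-cassandra-connector package is on the Spark classpath, that the session was built with Hive support, and that the target Cassandra table already exists with columns matching the Hive table. The function name and its arguments are hypothetical, chosen only for illustration.

```python
def copy_hive_table_to_cassandra(spark, hive_table, keyspace, table):
    """Append every row of a Hive table to an existing Cassandra table.

    `spark` is an already-built SparkSession (Hive support enabled and the
    spark-cassandra-connector package available). All names passed in are
    placeholders supplied by the caller.
    """
    # Read the Hive table through the session's catalog.
    df = spark.table(hive_table)

    # Write through the connector's data source. "append" adds rows to the
    # existing Cassandra table rather than failing because it already exists.
    (
        df.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .mode("append")
        .save()
    )
```

A job submitted every few hours could call this once per batch, for example `copy_hive_table_to_cassandra(spark, "analytics.events", "analytics", "events")`, where those table and keyspace names are made up. The Cassandra contact point is normally supplied through the `spark.cassandra.connection.host` setting when the session is created.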

If you have to do it continuously, I would even think about using Kafka as the transport layer together with Kafka Connect, which brings additional tooling for getting data into Cassandra from a variety of sources.

Rahul
On Aug 6, 2018, 3:16 PM -0400, srimugunthan dhandapani <srimugunthan.dhandapani@xxxxxxxxx>, wrote:
Hi all,
We have data that gets loaded into Hive/Presto every few hours.
We want that data to be transferred to Cassandra tables.
What are some of the high-performance ETL options for transferring data from Hive or Presto into Cassandra?

Also does anybody have any performance numbers comparing
- loading data from S3 to Cassandra using sstableloader
- and loading data from S3 to Cassandra using other means (like the Spark API)?

Thanks,
mugunthan

