Re: ETL options from Hive/Presto/s3 to cassandra

Spark scales out to as many nodes as you want and can be co-located with the Cassandra data nodes. sstableloader won't be as performant for larger datasets. Although it can be run in parallel on different nodes, I don't believe it is as fault tolerant.
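
To make the Spark route concrete, here is a minimal PySpark sketch. It assumes the DataStax spark-cassandra-connector package is on the Spark classpath and that the target Cassandra table already exists with columns matching the Hive table. The function names, the host, and the table and keyspace names are placeholders of mine, not anything from your setup.

```python
def copy_to_cassandra(spark, source_table, keyspace, table):
    """Append every row of a Hive table to an existing Cassandra table."""
    # Read the Hive table through the session's metastore.
    df = spark.table(source_table)
    # Write through the spark-cassandra-connector data source.
    # "append" adds rows to the existing table; Cassandra upserts on primary key.
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace=keyspace, table=table)
       .mode("append")
       .save())


def build_session(cassandra_host):
    """Create a Hive-enabled session pointed at one Cassandra contact point."""
    # Imported here so the module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName("hive-to-cassandra")
            .config("spark.cassandra.connection.host", cassandra_host)
            .enableHiveSupport()
            .getOrCreate())


# Hypothetical usage, run every few hours from a scheduler:
#   spark = build_session("10.0.0.1")
#   copy_to_cassandra(spark, "analytics.events", "analytics", "events")
```

If the data is only in S3 and not registered in Hive, spark.read.parquet("s3a://...") (or the reader for whatever file format you use) can take the place of spark.table.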

If you have to do it continuously, I would even think about using Kafka as the transport layer together with Kafka Connect. Kafka Connect also brings tooling for getting data into Cassandra from a variety of sources.
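
For the Kafka Connect route, a sink connector is registered by POSTing a JSON config to the Connect worker's REST API (POST /connectors). The sketch below builds such a config and submits it. It assumes the DataStax Apache Kafka Connector as the Cassandra sink; the connector class and the contactPoints, loadBalancing.localDc and topic mapping keys follow that connector's conventions as I understand them, so check them against the documentation for whichever sink you actually deploy. The function names and all values are placeholders.

```python
import json
import urllib.request


def cassandra_sink_config(name, topic, keyspace, table, mapping,
                          contact_points, local_dc):
    """Build the JSON body Kafka Connect expects for a new connector."""
    return {
        "name": name,
        "config": {
            # Assumption: DataStax Apache Kafka Connector is installed on the worker.
            "connector.class": "com.datastax.oss.kafka.sink.CassandraSinkConnector",
            "tasks.max": "4",
            "topics": topic,
            "contactPoints": contact_points,
            "loadBalancing.localDc": local_dc,
            # Maps Kafka record fields onto Cassandra columns for one topic/table pair.
            f"topic.{topic}.{keyspace}.{table}.mapping": mapping,
        },
    }


def register_connector(connect_url, connector):
    """POST the connector definition to a Kafka Connect worker."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical usage:
#   cfg = cassandra_sink_config(
#       "events-sink", "events", "analytics", "events",
#       "id=value.id, ts=value.ts, payload=value.payload",
#       "10.0.0.1", "dc1")
#   register_connector("http://localhost:8083", cfg)
```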

On Aug 6, 2018, 3:16 PM -0400, srimugunthan dhandapani <srimugunthan.dhandapani@xxxxxxxxx>, wrote:
Hi all,
We have data that gets loaded into Hive/Presto every few hours.
We want that data to be transferred to Cassandra tables.
What are some high-performance ETL options for transferring data from Hive or Presto into Cassandra?

Also, does anybody have any performance numbers comparing
- loading data from S3 to Cassandra using sstableloader
- and loading data from S3 to Cassandra using other means (like the Spark API)?