Code Mesh 2014

Lightning Fast Cluster Computing with Spark and Cassandra

Piotr Kołaczkowski

Apache Spark is a fast cluster computing engine based on the concept of Resilient Distributed Datasets. Apache Cassandra is a distributed, fully decentralized database system. This presentation will show how we integrated them in a single product, DataStax Enterprise 4.5, using the Scala language. The connector we've built lets you query Cassandra from Spark and store query results back in Cassandra. The talk will cover how the whole system works from the user's perspective, its high-level architecture, some implementation details, and best practices. We'll also discuss similarities and differences between this system and the integration layer connecting Cassandra and Hadoop.

Talk objectives: Fast distributed, parallel query and computation engines have recently seen a huge rise in popularity; Apache Spark/Shark and Presto are good examples. Spark in particular opens Cassandra to many new use cases around real-time analytics, iterative algorithms, machine learning, complex event processing, etc. That's why some of our customers started building their own solutions around Spark and Cassandra, using the existing Hadoop support. We integrated Apache Spark with Apache Cassandra natively, and we would like to talk about how it was done, how it works in practice, and why it is better than using an intermediate Hadoop layer.

Target audience: Software engineers and solution architects using or planning to use NoSQL products for analytics, particularly Cassandra and Spark.

About Piotr

Piotr Kołaczkowski is the Lead Software Engineer for the Analytics team at DataStax, where he develops analytic components of the DataStax Enterprise platform built on top of Apache Cassandra. Together with his team, he has recently worked on components integrating the Apache Spark platform with Cassandra, which is one of the key features of DSE 4.5. A significant part of this Scala project has been open-sourced. Piotr formerly worked as an assistant professor at Warsaw University of Technology, where he was involved in research projects related to self-tuning databases, artificial intelligence, and data mining. He lives with his family in Warsaw, Poland.

Github: pkolaczk

Twitter: @pkolaczk
