Big Data Trends - Oct 2014

Getting started with a new distributed system typically requires looking through tutorials, documentation, and even source code. This presentation aims to gather all of that information (and more) into a single training deck for Apache Storm. It covers five key areas: an introduction, Storm’s core concepts, operational considerations, Storm app examples, and Wirbelsturm for local development.

http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/

This presentation gives an introduction to Apache Optiq (incubating) and describes how the Optiq cost-based optimizer is being added to Apache Hive 0.14. There are some examples of optimizing the query plan for star schema, left-deep tree, and bushy tree queries. It also explores the importance of having statistics about the data, and there are some impressive benchmarks on TPC-DS queries at the end.

http://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-hive-014

This post walks through five different types of logs that are important for understanding and debugging a Hadoop cluster. Given that YARN is relatively new, this is a good introduction to the new types of logs introduced in recent versions of Hadoop.

https://www.altiscale.com/top-10-hadoop-yarn-part-1/

Spark’s MLlib contains a decision tree implementation which can be used in data classification problems. Even if you don’t know what a decision tree is, the article contains an introduction before it dives into the technical details. The post has an example in Python (and links to examples for Java and Scala), describes the optimizations in the implementation, and has an overview of scalability (both dataset size and number of features). It also reports some impressive speed gains in Spark 1.1 vs. Spark 1.0.

http://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html
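For readers new to decision trees, the core step is choosing the split that leaves the purest groups of labels. The following is a minimal single-machine sketch in plain Python, assuming Gini impurity and a single numeric feature. The names `gini` and `best_split` are hypothetical, and the code makes no attempt to reflect MLlib's API, its distributed implementation, or its optimizations.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all labels agree)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold t for which the split `value <= t` gives the
    lowest weighted Gini impurity. Returns (threshold, impurity);
    threshold is None when no split improves on the unsplit data."""
    n = len(values)
    best = (None, gini(labels))
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


if __name__ == "__main__":
    # Illustrative toy data: small values are class 0, large values are class 1.
    print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

A full tree applies this step recursively to each side of the chosen split until the groups are pure or a depth limit is reached.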

DataStax Enterprise 4.5 integrates Apache Cassandra with Apache Spark using the Spark Cassandra Connector. This post includes a walkthrough of using Spark’s MLlib with data stored in Cassandra.

http://www.datastax.com/dev/blog/interactive-advanced-analytic-with-dse-and-spark-mllib

The SequenceIQ blog has an example of implementing a correlation function for Spark. While the implementation duplicates some functionality found in MLlib, the example shows how to write testable Spark code (and has example tests). The code is available in its entirety on GitHub.

http://blog.sequenceiq.com/blog/2014/09/29/spark-correlation-and-testing/
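One common way to keep this kind of code testable is to put the math in a pure function that can be exercised without a cluster. The sketch below illustrates that idea in plain Python, assuming Pearson correlation (the summary above does not say which correlation the post implements). The names `pearson` and `test_pearson` are hypothetical and are not taken from the SequenceIQ code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; the 1/n factors cancel in the final ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def test_pearson():
    """Plain assertions: no Spark context is needed to check the math."""
    assert abs(pearson([1, 2, 3, 4], [2, 4, 6, 8]) - 1.0) < 1e-12
    assert abs(pearson([1, 2, 3], [3, 2, 1]) + 1.0) < 1e-12


if __name__ == "__main__":
    test_pearson()
    print("ok")
```

Because the function takes ordinary sequences, the same logic can be checked locally on small inputs before it is wired into a distributed job.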

Many folks get started with Hadoop in the cloud and end up storing data in object stores like S3 as a result. This post from the Altiscale blog discusses some of the drawbacks of storing data in an object store vs. a true file system.

https://www.altiscale.com/hdfs-object-stores-best-place-land-big-data/

Datameer has written about how they’ve reengineered the backend of Datameer 5 to be framework-agnostic. Previously, the system was tightly coupled with MapReduce, but it can now also use Tez and small job/local execution engines. The post also describes why they use Apache Tez over Spark (although they do say that Spark will eventually be integrated).

http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

While Spark has had integration with Kafka for several releases, this post goes much further than the Spark-bundled KafkaWordCount example. In fact, the post contains everything needed to get started with Kafka and Spark Streaming—including overviews of both systems that describe core concepts. The post culminates with a full example that reads Avro-encoded data from Kafka (in parallel across partitions), does some simple computing, and writes the data back to Kafka. There is also a summary of known issues, testing, and performance testing.

http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

This post shows how to build an Amazon Elastic MapReduce (EMR) cluster that integrates RStudio. After bootstrapping a cluster, it walks through changing security settings to allow access to the RStudio web interface, describes how to use the rmr2 package to run a MapReduce job from R, and shows how to pull in some real-world (global weather measurement) data for analysis.

http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR

This tutorial explains how to install Apache Spark in the MapR sandbox (a VM running in VMware or VirtualBox). After that, it has some examples with the spark-shell to run simple queries against a text-based Spark RDD.

https://www.mapr.com/blog/getting-started-spark-mapr-sandbox