Technical Updates - March 30, 2015

This tutorial describes how to build a Kerberos-enabled Hadoop cluster inside a VM (the steps are valuable outside a VM, too). The author provides a script for setting up Kerberos before running the quickstart wizard that comes with Cloudera Manager. The script, which is thoroughly commented, makes Kerberos much less intimidating.

http://blog.cloudera.com/blog/2015/03/how-to-quickly-configure-kerberos-for-your-apache-hadoop-cluster/

This post provides a brief introduction to the DockerContainerExecutor, which was introduced in YARN as part of Apache Hadoop 2.6. It describes one of the main motivations for running inside Docker containers: managing system-level dependencies.

https://www.altiscale.com/hadoop-blog/dockercontainerexecutor/

The following slides and video are from a presentation on optimizing Spark programs, given at the recent Strata San Jose conference. Topics covered include understanding shuffle in Spark (and common problems), understanding which code runs on the client vs. the workers, and tips for organizing code for reusability and testability.

http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs

https://www.youtube.com/watch?v=Wg2boMqLjCg&feature=youtu.be
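
To make the cost of a shuffle concrete, here is a small pure-Python sketch (it does not use Spark). It illustrates one commonly cited shuffle issue, which may or may not be the exact example used in the talk: sending every record across the network versus pre-aggregating per key inside each partition first, roughly the difference between Spark's groupByKey and reduceByKey. The partition contents and function names are made up for illustration.

```python
from collections import Counter

# Two hypothetical input partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]

def shuffle_without_combine(parts):
    """Every record crosses the shuffle boundary (groupByKey-style)."""
    return [record for part in parts for record in part]

def shuffle_with_combine(parts):
    """Each partition sums its values per key before the shuffle (reduceByKey-style)."""
    shuffled = []
    for part in parts:
        local = Counter()
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    return shuffled

def reduce_side(records):
    """Final per-key sum computed after the shuffle."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)

naive = shuffle_without_combine(partitions)
combined = shuffle_with_combine(partitions)

print(len(naive), "records shuffled without map-side combining")
print(len(combined), "records shuffled with map-side combining")
print(reduce_side(combined))
```

Both paths produce the same totals; the only difference is how many records have to be moved, which is what dominates shuffle cost on a real cluster.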

As noted in the Apache Spark 1.3 release, Spark SQL is no longer alpha. This post explains that graduating from alpha means a guarantee of binary compatibility across Spark 1.x. It also describes some plans for improving Spark SQL (better integration with Hive), the new data sources API, improvements to Parquet support (automatic partition discovery and schema migration), and support for JDBC sources.

https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

The Cloudera blog has a post from a software engineer at Edmunds.com on how they built a Spark Streaming-based analytics dashboard to monitor traffic related to Super Bowl ads. The system also uses Flume, HBase, Solr, Morphlines, and Banana (a port of Kibana to Solr), as well as Algebird's implementation of HyperLogLog. The post is a good end-to-end description of how the system was built and how it works (with screenshots).

http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/
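
For readers unfamiliar with HyperLogLog, the following is a minimal Python sketch of the general technique: estimating the number of distinct items in a stream using a small, fixed number of registers. It is not Algebird's code (Algebird is a Scala library) and not the code from the post; the class name, the SHA-1 based 64-bit hash, and the choice of 2^12 registers are assumptions made for illustration.

```python
import hashlib
import math

class HyperLogLog:
    """Illustrative HyperLogLog sketch with 2**p registers (assumes p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the first 8 bytes of a SHA-1 digest.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Sketch of the union of two streams: element-wise max of registers."""
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw

sketch = HyperLogLog()
for visitor_id in range(100_000):
    sketch.add(f"visitor-{visitor_id}")
    sketch.add(f"visitor-{visitor_id}")  # duplicates do not change the registers
estimate = sketch.count()

# Two overlapping streams whose union is the same 100,000 visitors.
left, right = HyperLogLog(), HyperLogLog()
for visitor_id in range(0, 60_000):
    left.add(f"visitor-{visitor_id}")
for visitor_id in range(40_000, 100_000):
    right.add(f"visitor-{visitor_id}")
merged = left.merge(right)

print(round(estimate), round(merged.count()))
```

Because sketches merge by taking the per-register maximum, partial sketches built on separate batches or machines can be combined without revisiting the raw data, which is what makes the structure a natural fit for streaming aggregation.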

For those looking to scale machine learning implementations, the Databricks blog has a post on Spark 1.3's implementation of Latent Dirichlet Allocation (LDA). The post describes LDA, common use cases, and how it's implemented on top of GraphX (the graph API for Spark).

https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html

This post describes how to enable impersonation from Hue in HBase so that users can only view or modify the data that HBase permissions allow them to. It also describes how to configure the HBase Thrift Server for Kerberos authentication. There are screenshots of the Hue HBase application and several troubleshooting steps for common configuration issues.

http://gethue.com/hbase-browsing-with-doas-impersonation-and-kerberos/

As a developer, it's easy to get used to the peculiarities of a system you're working with. It's good to take a step back and understand these issues (or even decide if they really are issues!). In this case, the ingest.tips blog has a post that gathers feedback on "what is confusing about Kafka?" In addition to collecting the feedback, the post includes responses and links for several of the issues.

http://ingest.tips/2015/03/26/what-is-confusing-about-kafka/

The Hortonworks blog has the third part in a series on anomaly detection in healthcare data. In this post, they use SociaLite, an open-source graph analysis framework, to compute a variant of PageRank. The post gives an overview of SociaLite (which integrates with Python) and describes the implementation used to find anomalies. All code is available on GitHub.

http://hortonworks.com/blog/using-pagerank-to-detect-anomalies-and-fraud-in-healthcare-part3/
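
As background, here is a minimal Python sketch of textbook PageRank computed by power iteration. It is not the variant described in the post and it does not use SociaLite; the function name, the damping factor of 0.85, the iteration count, and the tiny example graph are all assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    graph maps each node to the list of nodes it links to.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: "c" receives links from every other node.
example_graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_graph)

for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Nodes whose score is far out of line with comparable nodes are the kind of candidates an anomaly-detection pass might flag for review, though how the post defines and scores anomalies is specific to its own variant.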

Most folks working with batch systems start out with a simple workflow system that spawns one job after another via cron. From there, they often move to jobs that run based on the availability of input data. As a post on the Cask blog explains, it's difficult to implement a data-driven workflow efficiently. Most systems poll for the availability of input, which can be slow. The Cask Data Application Platform (CDAP) uses notifications to trigger jobs instead. The following post describes the architecture in greater detail.

http://blog.cask.co/2015/03/data-driven-job-scheduling-in-hadoop/
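
The idea of triggering jobs from notifications rather than polling can be sketched in a few lines of Python. This is an in-process toy, not the CDAP API or its architecture; the NotificationBus class, the dataset names, and the partition label are all hypothetical.

```python
from collections import defaultdict

class NotificationBus:
    """Toy publish/subscribe bus keyed by dataset name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, dataset, callback):
        """Register a job to run whenever new data lands in `dataset`."""
        self._subscribers[dataset].append(callback)

    def publish(self, dataset, partition):
        """Announce that `partition` of `dataset` is now available."""
        for callback in self._subscribers.get(dataset, []):
            callback(dataset, partition)

runs = []

def aggregate_job(dataset, partition):
    # A real job would launch a batch computation; this one records the trigger.
    runs.append((dataset, partition))

bus = NotificationBus()
bus.subscribe("clickstream", aggregate_job)

bus.publish("clickstream", "2015-03-30T10")  # triggers aggregate_job
bus.publish("unrelated", "2015-03-30T10")    # nobody subscribed, nothing runs

print(runs)
```

The job runs exactly when its input is announced, so there is no polling interval to tune and no repeated checks against storage while waiting for data to arrive.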