Technical Updates - 16 September 2018

New open source projects from Facebook, LinkedIn, Two Sigma, and Oath this week. Several great posts about companies’ data experiences: the Netflix Keystone platform, Hike’s experiences with BigQuery, Clio’s experience sharding a production database, a next-generation time series database at Pinterest, optimizing Redshift at Plaid, and more. And based on some of the news out of Strata, it sounds like Hadoop is getting ready to ride the Kubernetes wave.


Azure Data Factory is a tool for visually designing and running ETL jobs between various systems (it has a bunch of connectors). This tutorial demonstrates setting up a job to load data from blob storage into a SQL database.


Hike shares their experiences moving from a Hive-based ad hoc analytics system to Google BigQuery. Performance improved most after they started using clustered tables. They detail their tooling and why they enabled require_partition_filter as a guardrail, so that every query has to filter on the partition column. Overall, they’re seeing 50x speedups at half the cost.
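
To make the clustered-table and require_partition_filter settings concrete, here is a minimal sketch that builds BigQuery standard SQL DDL for a date-partitioned, clustered table. The table name, column names, and the helper function are hypothetical and are not taken from Hike's post; only the PARTITION BY, CLUSTER BY, and require_partition_filter clauses are BigQuery's own.

```python
def partitioned_table_ddl(table, columns, partition_column, cluster_columns):
    """Build BigQuery DDL for a date-partitioned, clustered table.

    require_partition_filter makes BigQuery reject queries that do not
    filter on the partition column, which guards against full scans.
    """
    column_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {column_defs}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}\n"
        "OPTIONS (require_partition_filter = TRUE)"
    )


# Hypothetical table and schema, for illustration only.
ddl = partitioned_table_ddl(
    "analytics.events",
    [("event_ts", "TIMESTAMP"), ("user_id", "STRING"), ("event_name", "STRING")],
    "event_ts",
    ["user_id", "event_name"],
)
print(ddl)
```

With a table defined this way, a query without a predicate on event_ts should fail instead of scanning (and billing for) every partition.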


Clio recently sharded their online MySQL database, and they’ve documented the details of the transition. Among other steps, they used a regex to detect which operations contained joins and transactions that might be problematic. Lots of practical advice if you’re facing something similar.
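
As a rough illustration of the regex approach, the sketch below flags SQL statements that contain joins or transaction control keywords. The pattern and the function name are assumptions made for this example; Clio's actual regex is not reproduced here.

```python
import re

# Hypothetical pattern: keywords that often need review before sharding.
RISKY_SQL = re.compile(
    r"\b(JOIN|BEGIN|START\s+TRANSACTION|COMMIT|ROLLBACK)\b",
    re.IGNORECASE,
)


def flag_risky(statement):
    """Return the sorted, normalized risky keywords found in a statement."""
    found = set()
    for match in RISKY_SQL.finditer(statement):
        # Collapse internal whitespace so "start   transaction" normalizes.
        found.add(re.sub(r"\s+", " ", match.group(1)).upper())
    return sorted(found)


# Hypothetical statements, for illustration only.
print(flag_risky("SELECT * FROM matters m join users u ON m.user_id = u.id"))
print(flag_risky("SELECT joined_at FROM users"))
```

A keyword scan like this will also match keywords inside string literals and comments, so it is best treated as a first-pass filter whose hits get reviewed by hand.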


Keystone is Netflix’s platform for real-time stream processing for analytics. It’s built on Apache Kafka and Apache Flink (in addition to a number of Netflix tools). This overview shows just how big the challenges are of building a multi-tenant tool at their scale, where all the various flavors of stream processing are needed. The post then describes how they’ve built the system to meet those requirements and to be self-service with good operational characteristics.


The AWS blog has published a sample Complex Event Processing application built on Apache Flink and EMR. It’s built to detect brush fires based on sensor data.
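
For readers new to complex event processing, the core idea is matching a pattern across a sequence of events, not reacting to a single reading. The sketch below is plain Python, not Flink's CEP library, and the rule it encodes (several consecutive hot readings from one sensor within a time window) along with the thresholds and field layout are assumptions made for illustration; the AWS sample's real detection rules are not described here.

```python
from collections import defaultdict, deque


def detect_fire_alerts(readings, threshold_c=60.0, consecutive=3, window_s=300):
    """Return (sensor_id, timestamp) alerts for sustained hot readings.

    An alert fires when `consecutive` readings at or above `threshold_c`
    arrive in a row from one sensor within `window_s` seconds.
    `readings` is an iterable of (sensor_id, timestamp_s, temp_c) tuples
    ordered by timestamp. All defaults are hypothetical.
    """
    hot = defaultdict(deque)  # sensor_id -> timestamps of current hot streak
    alerts = []
    for sensor_id, ts, temp_c in readings:
        streak = hot[sensor_id]
        if temp_c < threshold_c:
            streak.clear()  # a cool reading breaks the pattern
            continue
        streak.append(ts)
        # Drop hot readings that fell out of the time window.
        while streak and ts - streak[0] > window_s:
            streak.popleft()
        if len(streak) >= consecutive:
            alerts.append((sensor_id, ts))
            streak.clear()  # start over after alerting
    return alerts


# Hypothetical sensor data, for illustration only.
sample = [
    ("s1", 0, 25.0), ("s1", 60, 65.0), ("s1", 120, 70.0),
    ("s2", 130, 80.0), ("s1", 180, 72.0), ("s1", 240, 30.0),
]
print(detect_fire_alerts(sample))
```

A stream processor such as Flink takes on the parts this sketch ignores: keeping per-sensor state fault tolerant, handling out-of-order events, and scaling across many sensors.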


Reference: Data Eng Weekly Publication