Scalding's Architecture: A Case Study of Deep Embedding
11:30 - 12:20 Extra Spicy
Cascading is a Java framework for assembling MapReduce flows over Hadoop, using primitives known as pipes to describe data flows. The data is described dynamically, with no type information available, so a flow is checked for correctness only at runtime.
Scalding is a Scala framework over Cascading, providing combinators for flow manipulation similar to those of the Scala collections. Scalding has two APIs:
A field-based API which is in essence a shallow embedding of the combinators into the Cascading domain.
A typed API, implemented via a deep embedding of the combinators, which makes it possible to add types to the data flows.
In this talk I'll discuss the concepts of shallow and deep embeddings and their pros and cons, and then demonstrate the two techniques by diving into Scalding's architecture. Familiarity with Scalding is not a prerequisite.
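For readers new to the terminology, the difference between the two embeddings can be sketched in a few lines of Scala. This is a toy illustration only: `RawPlan`, `Shallow`, `Flow`, `compile`, and `run` are hypothetical names invented here, and `RawPlan` merely stands in for an untyped target domain such as Cascading's pipes. None of this is Scalding's or Cascading's actual API.

```scala
object EmbeddingSketch {
  // Hypothetical stand-in for an untyped target domain: a plan is a list of step names.
  final case class RawPlan(steps: List[String])

  // Shallow embedding: every combinator immediately yields a value of the target domain.
  // Nothing records what type of data flows between steps.
  object Shallow {
    def source(name: String): RawPlan = RawPlan(List(s"source($name)"))
    def map(p: RawPlan, label: String): RawPlan = RawPlan(p.steps :+ s"map($label)")
    def filter(p: RawPlan, label: String): RawPlan = RawPlan(p.steps :+ s"filter($label)")
  }

  // Deep embedding: combinators build a typed syntax tree describing the flow.
  // The tree is only given meaning later, by an interpreter.
  sealed trait Flow[A] {
    def map[B](label: String)(f: A => B): Flow[B] = Mapped(this, label, f)
    def filter(label: String)(p: A => Boolean): Flow[A] = Filtered(this, label, p)
  }
  final case class Source[A](name: String, data: List[A]) extends Flow[A]
  final case class Mapped[A, B](in: Flow[A], label: String, f: A => B) extends Flow[B]
  final case class Filtered[A](in: Flow[A], label: String, p: A => Boolean) extends Flow[A]

  // Interpreter 1: translate the typed tree into the untyped target domain.
  def compile[A](flow: Flow[A]): RawPlan = flow match {
    case s: Source[A]    => RawPlan(List(s"source(${s.name})"))
    case m: Mapped[a, A] => RawPlan(compile(m.in).steps :+ s"map(${m.label})")
    case f: Filtered[A]  => RawPlan(compile(f.in).steps :+ s"filter(${f.label})")
  }

  // Interpreter 2: evaluate the same tree directly in memory.
  def run[A](flow: Flow[A]): List[A] = flow match {
    case s: Source[A]    => s.data
    case m: Mapped[a, A] => run(m.in).map(m.f)
    case f: Filtered[A]  => run(f.in).filter(f.p)
  }
}
```

In this sketch the shallow version is shorter, but its result is fixed to one interpretation and carries no element types. The deep version costs an extra data type and an interpreter, and in return the compiler tracks the type parameter `A` through every step, and the same tree can be interpreted in more than one way.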
Ofer Ron is a senior data scientist at LivePerson. He is the tech lead of LivePerson's research group, which is responsible for designing and implementing various machine learning engines underlying LivePerson's products.
Ofer has been working with Scala for the past four years, both for computationally intensive tasks and for big data processing, using Scalding over Hadoop. He would never voluntarily go back to writing in Java.