
Astyanax Update

by: Puneet Oberai & Christos Kalantzis

Overview

Astyanax is one of Netflix's more popular OSS projects. About a year ago, whenever we spoke with Cassandra users, Astyanax was considered the de facto Java client library for Cassandra (C*) because of features such as connection pooling, intelligent load balancing across C* nodes, and its recipes.
With the advent of the new Java Driver from DataStax, which has incorporated some of the best Astyanax concepts, many in the community have been wondering whether Astyanax is still relevant. They have also wondered why there hasn't been a major new release, and whether Netflix is still using it.
This article is meant to address those questions. We will also share the findings from the past few months we have spent investigating the new DataStax Java Driver. Finally, we will explain why Astyanax is still relevant in the Apache Cassandra community and what our plans are for the project in 2014.

State of Affairs at Netflix

Netflix uses Astyanax as its primary Cassandra client. Astyanax has a structured API that makes use of a fluent design, allowing the consumer to perform complex queries with relative ease. The fluent design also makes it intuitive to explore the vast API. Complex queries supported by the API range from mixed batch mutations to sophisticated row range queries with composite columns.
Here is an example of a reverse range query on a composite column for a specified row.
keyspace.prepareQuery(CF)
       .getRow("myRowKey")
       .withColumnRange(new CompositeRangeBuilder()
                            .withPrefix("1st component of comp col")
                            .greaterThan("a")
                            .lessThan("z")
                            .reverse()
                            .limit(11)
                            .build())
       .execute();

Astyanax also provides other layers of functionality on top of its structured API.
  1. Astyanax recipes - these allow users to perform complex operations such as reading all rows from multiple token ranges, chunked object storage, optimistic distributed locking, etc.
  2. Entity Mapper - this layer allows users to easily map their business logic entities to persistent objects in Cassandra.
The following diagram shows the overall architectural components of Astyanax. The library has been heavily adopted both within Netflix and externally. At Netflix, many Cassandra users make heavy use of the API, recipes, and entity layer, and have also built their own DAOs on top of these layers, making Astyanax the common gateway to Cassandra.
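To make the DAO layering concrete, here is a minimal, hypothetical sketch of a DAO built on top of the structured API. The UserDao class, the "users" column family and its column names are invented for illustration; the sketch only re-uses the query and mutation batch calls shown in this article, so treat it as a pattern rather than a reference implementation.

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;

// Hypothetical DAO that hides Astyanax details from business logic.
public class UserDao {

    // Illustrative column family definition; "users" is an invented name.
    private static final ColumnFamily<String, String> USERS_CF =
            new ColumnFamily<String, String>("users", StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public UserDao(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    // Read one row and pull out a single column.
    public String getEmail(String userId) throws Exception {
        ColumnList<String> columns = keyspace.prepareQuery(USERS_CF)
                .getRow(userId)
                .execute()
                .getResult();
        return columns.getStringValue("email", null);
    }

    // Write a single column via a mutation batch.
    public void updateEmail(String userId, String email) throws Exception {
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(USERS_CF, userId).putColumn("email", email, null);
        m.execute();
    }
}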

DataStax Java Driver

Earlier this year, DataStax announced the release of their new Java Driver, which uses the native protocol with CQL3. The new driver has good support for all the core client functionality and also provides other benefits beyond that. To mention a few:
  1. Out-of-the-box support for async operations over netty
  2. Cursors for pagination
  3. Query tracing when debugging
  4. Support for collections
  5. Built-in load balancing, retry policies etc.
Java Driver is a CQL3-based client, so you should familiarize yourself with some key CQL3 concepts. DataStax has published some good posts that are worth reading before you start working with the driver.

1.0 and 2.0 releases

Note that the Java Driver 1.0 release was succeeded by a 2.0 release. The 2.0 release added some useful features and improvements over its predecessor, such as cursor support for pagination and better handling of large mutation batches. However, the 2.0 release also introduced a number of backwards-incompatible API changes. You need to consider the changes documented by DataStax when upgrading from the 1.0 release to the 2.0 release.

Integration

In the past few months we have been doing some R&D with the Java Driver and have a sample implementation of the structured Astyanax API on top of the native protocol driver. This integration leverages the best of both worlds (Astyanax and Java Driver), and we can in turn pass the combined benefits on to our consumers.

Note that we have implemented the Astyanax API so that it is compatible with the current thrift-based implementation. Our thrift-based driver is operational and still being supported.

Here are some of the benefits based on our findings and integration work

Structured API with async support

You can now use the Astyanax async interface with all the structured APIs.
ListenableFuture<OperationResult<Row>> future = keyspace.prepareQuery(myCF).getRow("myRowKey").execAsync();
future.addListener( new Runnable() {
     ...
}, myExecutor );
The overall benefit of using the async protocol is that more requests can be multiplexed on the same connection to the Cassandra nodes, which yields higher throughput for the same number of connections.
However, an application still has to be designed around an async model to truly realize the performance gains from this feature. With the async interface, the calling code gets a future that delivers a callback when the result is ready, thus "unblocking" the caller. If the caller simply blocks on the future anyway, then the application will probably not benefit much from this feature.
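As a small illustration of that difference, the sketch below contrasts the blocking anti-pattern with the listener-based style, mirroring the execAsync()/addListener() calls from the snippet above. The keyspace, myCF and myExecutor names are placeholders, and the wildcard generics simply stand in for whatever result type the query returns; this is a sketch of the usage pattern, not a reference implementation.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.google.common.util.concurrent.ListenableFuture;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.OperationResult;
import com.netflix.astyanax.model.ColumnFamily;

public class AsyncReadSketch {

    private final ExecutorService myExecutor = Executors.newFixedThreadPool(4);

    // Anti-pattern: the request is issued asynchronously, but the caller blocks
    // on the future right away, so little is gained from the async protocol.
    Object readBlocking(Keyspace keyspace, ColumnFamily<String, String> myCF) throws Exception {
        ListenableFuture<? extends OperationResult<?>> future =
                keyspace.prepareQuery(myCF).getRow("myRowKey").execAsync();
        OperationResult<?> result = future.get();   // blocks the calling thread
        return result.getResult();
    }

    // Preferred: register a listener and let the calling thread move on;
    // the callback runs on myExecutor once the result is ready.
    void readNonBlocking(Keyspace keyspace, ColumnFamily<String, String> myCF) {
        final ListenableFuture<? extends OperationResult<?>> future =
                keyspace.prepareQuery(myCF).getRow("myRowKey").execAsync();
        future.addListener(new Runnable() {
            @Override
            public void run() {
                // The future is already complete here; consume it without blocking the caller.
            }
        }, myExecutor);
    }
}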

Astyanax helps with prepared statement management

It is important to note that in order to use the native driver in a performant manner, you must make use of prepared statements. Prepared statements are fast because they can be re-used, so you pay the cost of the "prepare" step only once. However, managing prepared statements for different queries is left up to the application. Astyanax's structured query interface enables it to manage prepared statements intelligently on behalf of consumer apps. More on this in the findings and caveats sections below.

Astyanax provides a unified interface across multiple drivers

Since the Astyanax API is the lowest common denominator, we decided to provide an alternate implementation of the API using an adaptor layer over Java Driver. Furthermore, we have preserved the semantics of rows and columns regardless of the driver being used underneath.
For certain commonly used schemas, CQL3 will present distinct columns as distinct rows. When you read a row via Astyanax, it will always be a row regardless of the driver being used underneath. This gives consumers the freedom to choose a driver, and even to switch back and forth, while still maintaining the same application logic, data model, etc.
Overall, Astyanax provides a performant, structured, and driver-independent client that lets developers focus their efforts on useful higher-level abstractions in their applications.

Java Driver 1.0 vs 2.0

We initially started by wrapping Java Driver 1.0, but then realized it lacked some critical features, such as cursor support, which made certain complicated range queries much harder and almost impractical to implement. Further, we found compile-time incompatibilities between the two driver versions and wanted to avoid writing multiple adaptors around multiple versions. Hence we decided to adopt the 2.0 release of the driver directly. This caused significant re-work on the Astyanax side, including a re-run of the benchmarking tests for the driver. More on the benchmark numbers below.

Migration path

Note that the 2.0 driver works only against C* 2.x clusters. Hence, if you are using the Java Driver directly and are on a 1.x cluster, you will have to continue using the 1.0 release while you upgrade your cluster to 2.x. Once this is done, you can switch to the 2.0 release after making the code changes required by the backwards-incompatible differences between the two releases of the driver.
If you are currently using Astyanax and want to use the new native-protocol-based driver, you should stay with the thrift-based driver until you upgrade your cluster to 2.x, and only then switch to the new driver-based implementation. The Astyanax adaptor over Java Driver implements the same Astyanax API, so we don't anticipate major code changes when switching over.
Note that even though the API is backwards compatible with the new release, this is a beta release and there are caveats to note. More on this below.

Caveats / Findings

Note that in the interest of brevity, the following is not an exhaustive list of all our findings. We will have the complete list available on the Astyanax wiki soon.  

Performance characteristics

Basic setup

We deployed a single Cassandra 2.0 cluster with 6 nodes (m2.4xlarge), using a bare-minimum configuration with a replication factor of 3. The goal here was NOT to benchmark the 2.0 cluster's performance but to compare the different drivers using the same client setup.
We deployed a simple client that wrapped Astyanax and repeatedly issued simple queries in a multi-threaded setup. The client was operated in 3 modes.
  1. The standard Astyanax client that uses the thrift driver underneath
  2. The standard Java Driver released by DataStax.
  3. Astyanax over Java Driver - a limited implementation of the Astyanax interface that uses the Java Driver underneath instead of thrift.
The client was run concurrently in all 3 modes against the same 2.0 cluster, using 3 separate ASGs with the same instance type (m2.xlarge), number of instances, and number of client threads.
Here are the keyspace and column family settings:
CREATE KEYSPACE astyanaxperf WITH replication = {
 'class': 'NetworkTopologyStrategy',
 'us-east': '3'
};
USE astyanaxperf;
CREATE TABLE test1 (
 key bigint,
 column1 int,
 value text,
 PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
 bloom_filter_fp_chance=0.010000 AND
 caching='KEYS_ONLY' AND
 comment='' AND
 dclocal_read_repair_chance=0.000000 AND
 gc_grace_seconds=864000 AND
 index_interval=128 AND
 read_repair_chance=0.100000 AND
 replicate_on_write='true' AND
 populate_io_cache_on_flush='false' AND
 default_time_to_live=0 AND
 speculative_retry='99.0PERCENTILE' AND
 memtable_flush_period_in_ms=0 AND
 compaction={'class': 'SizeTieredCompactionStrategy'} AND
 compression={'sstable_compression': 'SnappyCompressor'};
Other settings included 8 connections to the cluster from each client (AWS instance), and a consistency level of ONE for both reads and writes.
Overall, we found that Java Driver (with prepared statements) was roughly comparable to the thrift-based driver, with the latter getting about 5-10% more throughput, especially for large batch mutations.

Batch Mutations

There was about a 5-10% difference between thrift and Java Driver, with thrift being faster.

Single Row Reads

The variation for single row reads was smaller (2-5%).

Using Prepared Statements

Note that when using Java Driver, one must use prepared statements for better performance. Here are examples of queries both with and without prepared statements.

Regular query

while (!perfTest.stopped()) {
   // The full CQL string (with the literal key) is re-parsed by the cluster on every iteration
   String query = "select * from astyanaxperf.test1 where key=" + random.nextInt();
   ResultSet resultSet = session.execute(query);
   parseResultSet(resultSet);
}

Prepared statement

String query = "select * from astyanaxperf.test1 where key=?";
// Prepare once and cache the prepared statement for later re-use
PreparedStatement pStatement = session.prepare(query);
while (!perfTest.stopped()) {
   // Build the exact query from the prepared statement by binding the key
   BoundStatement bStatement = new BoundStatement(pStatement);
   bStatement.bind(random.nextInt());
   ResultSet resultSet = session.execute(bStatement);
   parseResultSet(resultSet);
}
The performance difference between these two blocks of code is significant, hence users are encouraged to use prepared statements when using either Java Driver or the Astyanax adaptor over Java Driver.

Prepared statements require management
Note that prepared statements (by design) need to be managed by the application that uses Java Driver. In the examples above, you can see that the prepared statement needs to be supplied back to the driver for re-use (and hence better performance), which means that the caller has to manage this statement. When client apps build sophisticated DAOs, they generally make use of several queries and hence need to maintain the mapping between their query patterns and the corresponding prepared statements. This complicates DAO implementations.
This is where Astyanax can help. Astyanax uses a fluent yet structured query syntax. The structured API design makes it feasible to construct a query signature for each query. Astyanax can use the query signature to detect recurring queries with a similar structure (signature) and re-use the corresponding prepared statement, so Astyanax users get automatic prepared statement management for free.
Here are examples of how to enable this; a sketch of the underlying idea follows the examples.

Writes

MutationBatch m = keyspace.prepareMutationBatch();
m.withRow( myCF, rowKey)
 .useCaching()   // tell Astyanax to re-use prepared statements
 .addColumn( columnName, columnValue );
m.execute();
Reads

result = keyspace.prepareQuery( myCF )
       .useCaching(true)
       .getRow( myRowKey )
       .execute();
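Conceptually, the caching shown above amounts to keeping a map from a query's structural signature to its prepared statement. Here is a minimal, hypothetical sketch of the idea using the Java Driver types from the earlier examples; the StatementCache class is invented for illustration and is not Astyanax's actual internal code.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Hypothetical signature-based prepared statement cache; illustrates the idea only.
public class StatementCache {

    private final Session session;
    // Keyed by the structural signature of the query (bind markers, not bound values).
    private final ConcurrentMap<String, PreparedStatement> cache =
            new ConcurrentHashMap<String, PreparedStatement>();

    public StatementCache(Session session) {
        this.session = session;
    }

    // "signature" is CQL with bind markers, e.g. "select * from astyanaxperf.test1 where key=?".
    // Queries that differ only in bound values share one prepared statement; queries with a
    // different structure (say, an extra column slice predicate) get their own.
    public BoundStatement bind(String signature, Object... values) {
        PreparedStatement prepared = cache.get(signature);
        if (prepared == null) {
            prepared = session.prepare(signature);              // pay the "prepare" cost once
            PreparedStatement existing = cache.putIfAbsent(signature, prepared);
            if (existing != null) {
                prepared = existing;                            // another thread won the race
            }
        }
        return new BoundStatement(prepared).bind(values);
    }
}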

The catch with prepared statement management!

An important caveat to note here is that prepared statements work well only when your queries are highly cacheable, meaning they have the same signature. For writes, this means that if you are adding, updating, or deleting columns with different timestamps and TTLs, then you can't leverage prepared statements, since there is no "cacheability" in those queries. If you want mutation batches to be highly cacheable, then you must use the same TTLs and timestamps when re-using prepared statements for subsequent batches.

Similarly for reads, you can't prepare a statement by doing a row query and then reuse that statement for a subsequent row query that also has a column slice component to it, since the underlying query structure is actually different.

Here is an example to illustrate the point:

// Select a row with a column slice specification (column range query)
select * from test1 where key = ? and column1 > ?;
// is very different from a similar column slice query
select * from test1 where key = ? and column1 > ? and column1 < ?;
// These two queries therefore have different signatures and need their own prepared statements.

Hence, use the automatic statement management feature in Astyanax with caution: inspect your table schema and your query patterns, and when using queries with different patterns, turn caching OFF.

Rows vs Columns

In CQL3, columns can appear as rows, depending on your column family schema. Here is a simple example. Say that you have a simple table with the following schema:
Key Validator: INT_TYPE
Column Comparator: INT_TYPE
Default Validator : UTF8_TYPE
Here is what the data would look like at the storage level (example data).
In Astyanax, rows and columns map directly to this structure, i.e., Astyanax rows correspond to storage rows.
But in CQL3, the same column family has a clustering key that corresponds to your column comparator. Hence, distinct columns will appear as distinct rows.
If you are migrating your app from a thrift-based driver (such as Astyanax) to the new Java Driver, your application's business logic must account for this.
In order to keep things consistent in Astyanax and avoid a lot of confusion, we have preserved the original semantics for rows and columns, i.e., Astyanax's view of rows and columns is the same regardless of the driver being used underneath. This approach helps application owners keep their business logic and DAO models the same and avoids a re-write of the numerous DAOs at Netflix.
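As a concrete, hedged illustration using the astyanaxperf.test1 table defined earlier: the same storage row surfaces as several CQL3 rows through Java Driver, while Astyanax presents it as one row with many columns. The sketch below only shows the shape of the two views; the sample key and printed output are illustrative.

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class RowsVsColumnsSketch {

    // CQL3 view via Java Driver: one storage row with N clustering columns comes back
    // as N CQL rows, each carrying a (key, column1, value) tuple.
    static void cql3View(Session session, long key) {
        ResultSet rs = session.execute("select * from astyanaxperf.test1 where key=" + key);
        for (Row row : rs) {
            // Each iteration is one (key, column1, value) tuple, i.e. one storage column.
            System.out.println(row.getLong("key") + " / " + row.getInt("column1")
                    + " -> " + row.getString("value"));
        }
    }

    // Astyanax view (regardless of the driver underneath): the same key is a single row
    // whose columns are iterated as a ColumnList, mirroring the storage layout, e.g.:
    //
    //   ColumnList<Integer> columns =
    //           keyspace.prepareQuery(testCF).getRow(key).execute().getResult();
    //   for (Column<Integer> c : columns) { /* one entry per storage column */ }
}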

Summary - Astyanax in 2014

We have been quiet on the Astyanax front for a while; we hope this article explains why. Astyanax still provides value to the Apache Cassandra (C*) community. Even with the integration of the DataStax Java Driver, you still get a structured API with async support, help with prepared statement management, backward compatibility for your existing Cassandra applications, and a unified interface across multiple drivers. According to our findings above, in some cases the continued use of the current Astyanax library will provide better performance than the Java Driver alone.
Here is a quote from Patrick McFadin, Chief Evangelist for Apache Cassandra:
Once again Netflix has shown its leadership in the community by creating incredible tools for Apache Cassandra. Netflix and DataStax provide an excellent example of how, when working together for the benefit of users, amazing things can happen. I love where Astyanax is headed with its rich API and great recipes. This type of collaboration is what I feel is the best part of Open Source Software.
A Beta release of Astyanax will be released in mid-January 2014. This new release will contain our integration with Java Driver 2.0 and all the features mentioned in this article.
Netflix believes very strongly in the Astyanax Project and will continue to work with the Apache Cassandra Community to move it forward. If you would like to contribute to the project, feel free to submit code to the Astyanax project, open issues or even apply at jobs.netflix.com.


Introducing PigPen: Map-Reduce for Clojure

by: Matt Bossenbroek

It is our pleasure to release PigPen to the world today. PigPen is map-reduce for Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it.

What is PigPen?

  • A map-reduce language that looks and behaves like clojure.core
  • The ability to write map-reduce queries as programs, not scripts
  • Strong support for unit tests and iterative development

Note: If you are not familiar at all with Clojure, we strongly recommend that you try a tutorial here, here, or here to understand some of the basics.

Really, yet another map-reduce language?

If you know Clojure, you already know PigPen

The primary goal of PigPen is to take language out of the equation. PigPen operators are designed to be as close as possible to the Clojure equivalents. There are no special user defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program.

Here’s the proverbial word count:

(require '[pigpen.core :as pig])

(defn word-count [lines]
  (->> lines
       (pig/mapcat #(-> % first
                        (clojure.string/lower-case)
                        (clojure.string/replace #"[^\w\s]" "")
                        (clojure.string/split #"\s+")))
       (pig/group-by identity)
       (pig/map (fn [[word occurrences]] [word (count occurrences)]))))

This defines a function that returns a PigPen query expression. The query takes a sequence of lines and returns the frequency that each word appears. As you can see, this is just the word count logic. We don’t have to conflate external concerns, like where our data is coming from or going to.

Will it compose?

Yep - PigPen queries are written as function compositions - data in, data out. Write it once and avoid the copy & paste routine.

Here we use our word-count function (defined above), along with a load and store command, to make a PigPen query:

(defn word-count-query [input output]
  (->>
    (pig/load-tsv input)
    (word-count)
    (pig/store-tsv output)))

This function returns the PigPen representation of the query. By itself, it won’t do anything - we have to execute it locally or generate a script (more on that later).

You like unit tests? Yeah, we do that

With PigPen, you can mock input data and write a unit test for your query. No more crossing your fingers & wondering what will happen when you submit to the cluster. No more separate files for test input & output.

Mocking data is really easy. With pig/return and pig/constantly, you can inject arbitrary data as a starting point for your script.

A common pattern is to use pig/take to sample a few rows of the actual source data. Wrap the result with pig/return and you’ve got mock data.

(use 'clojure.test)

(deftest test-word-count
  (let [data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]])]
    (is (= (pig/dump (word-count data))
           [["moon" 1]
            ["jumped" 2]
            ["dog" 1]
            ["over" 2]
            ["cow" 1]
            ["fox" 1]
            ["the" 4]]))))

The pig/dump operator runs the query locally.

Closures (yes, the kind with an S)

Parameterizing your query is trivial. Any in-scope function parameters or let bindings are available to use in functions.

(defn reusable-fn [lower-bound data]
  (let [upper-bound (+ lower-bound 10)]
    (pig/filter (fn [x] (< lower-bound x upper-bound)) data)))

Note that lower-bound and upper-bound are present when we generate the script, and are made available when the function is executed within the cluster.

So how do I use it?

Just tell PigPen where to write the query as a Pig script:

(pig/write-script "word-count.pig"
  (word-count-query "input.tsv" "output.tsv"))

And then you have a Pig script which you can submit to your cluster. The script uses pigpen.jar, an uberjar with all of the dependencies, so make sure that is submitted with it. Another option is to build an uberjar for your project and submit that instead. Just rename it prior to submission. Check out the tutorial for how to build an uberjar.

As you saw before, we can also use pig/dump to run the query locally and return Clojure data:

=> (def data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]]))
#'pigpen-demo/data

=> (pig/dump (word-count data))
[["moon" 1] ["jumped" 2] ["dog" 1] ["over" 2] ["cow" 1] ["fox" 1] ["the" 4]]

If you want to get started now, check out getting started & tutorials.

Why do I need Map-Reduce?

Map-Reduce is useful for processing data that won’t fit on a single machine. With PigPen, you can process massive amounts of data in a manner that mimics working with data locally. Map-Reduce accomplishes this by distributing the data across potentially thousands of nodes in a cluster. Each of those nodes can process a small amount of the data, all in parallel, and accomplish the task much faster than a single node alone. Operations such as join and group, which require coordination across the dataset, are computed by partitioning the data with a common join key. Each value of the join key is then sent to a specific machine. Once that machine has all of the potential values, it can then compute the join or do other interesting work with them.

To see how PigPen does joins, let’s take a look at pig/cogroup. Cogroup takes an arbitrary number of data sets and groups them all by a common key. Say we have data that looks like this:

foo:

{:id 1, :a "abc"}
{:id 1, :a "def"}
{:id 2, :a "abc"}

bar:

[1 42]
[2 37]
[2 3.14]

baz:

{:my_id "1", :c [1 2 3]}

If we want to group all of these by id, it looks like this:

(pig/cogroup (foo by :id)
             (bar by first)
             (baz by #(-> % :my_id Long/valueOf))
             (fn [id foos bars bazs] ...))

The first three arguments are the datasets to join. Each one specifies a function that is applied to the source data to select the key. The last argument is a function that combines the resulting groups. In our example, it would be called twice, with these arguments:

[1 ({:id 1, :a "abc"}, {:id 1, :a "def"})
   ([1 42])
   ({:my_id "1", :c [1 2 3]})]
[2 ({:id 2, :a "abc"})
   ([2 37] [2 3.14])
   ()]

As you can see, this consolidates all of the values with an id of 1, all of the values with 2, etc. Each different key value can then be processed independently on different machines. By default, keys are not required to be present in all sources, but there are options that can make them required.

Hadoop provides a very low-level interface for doing map-reduce jobs, but it’s also very limited. It can only run a single map-reduce pair at a time as it has no concept of data flow or a complex query. Pig adds a layer of abstraction on top of Hadoop, but at the end of the day, it’s still a scripting language. You are still required to use UDFs (user defined functions) to do interesting tasks with your data. PigPen furthers that abstraction by making map-reduce available as a first class language.

If you are new to map-reduce, we encourage you to learn more here.

Motivations for creating PigPen

  • Code reuse. We want to be able to define a piece of logic once, parameterize it, and reuse it for many different jobs.
  • Consolidate our code. We don’t like switching between a script and a UDF written in different languages. We don’t want to think about mapping between differing data types in different languages.
  • Organize our code. We want our code in multiple files, organized how we want - not dictated by the job it belongs to.
  • Unit testing. We want our sample data inline with our unit tests. We want our unit tests to test the business logic of our jobs without complications of loading or storing data.
  • Fast iteration. We want to be able to trivially inject mock data at any point in our jobs. We want to be able to quickly test a query without waiting for a JVM to start.
  • Name what you want to. Most map-reduce languages force too many names and schemas on intermediate products. This can make it really difficult to mock data and test isolated portions of jobs. We want to organize and name our business logic as we see fit - not as dictated by the language.
  • We’re done writing scripts; we’re ready to start programming!

Note: PigPen is not a Clojure wrapper for writing Pig scripts you can hand edit. While it’s entirely possible, the resulting scripts are not intended for human consumption.

Design & Features

PigPen was designed to match Clojure as closely as possible. Map-reduce is functional programming, so why not use an awesome functional programming language that already exists? Not only is there a lower learning curve, but most of the concepts translate very easily to big data.

In PigPen, queries are manipulated as expression trees. Each operation is represented as a map of information about what behavior is desired. These maps can be nested together to build a tree representation of a complex query. Each command also contains references to its ancestor commands. When executed, that query tree is converted into a directed acyclic query graph. This allows for easy merging of duplicate commands, optimizing sequences of related commands, and instrumenting the query with debug information.

Optimization

De-duping

When we represent our query as a graph of operations, de-duping them is trivial. Clojure provides value-equality, meaning that if two objects have the same content, they are equal. If any two operations have the same representation, then they are in fact identical. No care has to be taken to avoid duplicating commands when writing the query - they’re all optimized before executing it.

For example, say we have the following query:

(let [even-squares (->>
(pig/load-clj "input.clj")
(pig/map (fn [x] (* x x)))
(pig/filter even?)
(pig/store-clj "even-squares.clj"))
odd-squares (->>
(pig/load-clj "input.clj")
(pig/map (fn [x] (* x x)))
(pig/filter odd?)
(pig/store-clj "odd-squares.clj"))]
(pig/script even-squares odd-squares))

In this query, we load data from a file, compute the square of each number, and then split it into even and odd numbers. Here’s what a graph of this operation would look like:

This matches our query, but it’s doing some extra work. It’s loading the same input.clj file twice and computing the squares of all of the numbers twice. This might not seem like a lot of work, but when you do it on a lot of data, simple operations really add up. To optimize this query, we look for operations that are identical. At first glance it looks like our operation to compute squares might be a good candidate, but they actually have different parents so we can’t merge them yet. We can, however, merge the load functions because they don’t have any parents and they load the same file.

Now our graph looks like this:

Now we’re loading the data once, which will save some time, but we’re still doing the squares computation twice. Since we now have a single load command, our map operations are now identical and can be merged:

Now we have an optimized query where each operation is unique. Because we always merge commands one at a time, we know that we’re not going to change the logic of the query. You can easily generate queries within loops without worrying about duplicated execution - PigPen will only execute the unique parts of the query.

Serialization

After we’re done processing data in Clojure, our data must be serialized into a binary blob so Pig can move it around between machines in a cluster. This is an expensive, but essential step for PigPen. Luckily, many consecutive operations in a script can often be packed together into a single operation. This saves a lot of time by not serializing and deserializing the data when we don’t need to. For example, any consecutive map, filter, and mapcat operations can be re-written as a single mapcat operation.

Let’s look at some examples to illustrate this:

In this example, we start with a serialized (blue) value, 4, deserialize it (orange), apply our map function, and re-serialize the value.

Now let’s try a slightly more complex (and realistic) example. In this example, we apply a map, mapcat, and filter operation.

If you haven’t used it before, mapcat is an operation where we apply a function to a value and that function returns a sequence of values. That sequence is then ‘flattened’ and each single value is fed into the next step. In Clojure, it’s the result of combining map and concat. In Scala, this is called flatMap, and in C# it’s called SelectMany.

In the diagram below, the flow on the left is our query before the optimization; the right is after the optimization. We start with the same value, 4, and calculate the square of the value; same as the first example. Then we take our value and apply a function that decrements the value, returns the value, and increments the value. Pig then takes this set of values and flattens them, making each one an input to the next step. Note that we had to serialize and deserialize the data when interacting with Pig. The third and final step is to filter the data; in this example we’re retaining only odd values. As you can see, we end up serializing and deserializing the data in between each step.

The right-hand side shows the result of the optimization. Put simply, each operation now returns a sequence of elements. Our map operation returns a sequence of one element, 16, our mapcat remains the same, and our filter returns a sequence of zero or one elements. By making these commands more uniform, we can merge them more easily. We end up flattening more sequences of values within the set of commands, but there is no serialization cost between steps. While it looks more complex, this optimization results in much faster execution of each of these steps.

Testing, Local Execution, and Debugging

Iterative development, testing, and debuggability are key tenets of PigPen. When you have jobs that can run for days at a time, the last thing you need is an unexpected bug to show up in the eleventh hour. PigPen has a local execution mode that’s powered by rx. This allows us to write unit tests for our queries. We can then know with very high confidence that something will not crash when run and will actually return the expected results. Even better, this feature allows for iterative development of queries.

Typically, we start with just a few records of the source data and use that to populate a unit test. Because PigPen returns data in the REPL, we don’t have to go elsewhere to build our test data. Then, using the REPL, we add commands to map, filter, join, and reduce the mock data as required; each step of the way verifying that the result is what we expect. This approach produces more reliable results than building a giant monolithic script and crossing your fingers. Another useful pattern is to break up large queries into smaller functional units. Map-reduce queries tend to explode and contract the source data by orders of magnitude. When you try to test the script as a whole, you often have to start with a very large amount of data to end up with just a few rows. By breaking the query into smaller parts, you can test the first part, which may take 100 rows to produce two; and then test the second part by using those two rows as a template to simulate 100 more fake ones.

Debug mode has proven to be really useful for fixing the unexpected. When enabled, it will write to disk the result of every operation in the script, in addition to the normal outputs. This is very useful in an environment such as Hadoop, where you can’t step through code and hours may pass in between operations. Debug mode can also be coupled with a graph-viz visualization of the execution graph. You can then visually associate what it plans to do with the actual output of each operation.

To enable debug mode, see the options for pig/write-script and pig/generate-script. It will write the extra debug output to the folder specified.

Example of debug mode enabled:

(pig/write-script {:debug "/debug-output/"} "my-script.pig" my-pigpen-query)

To enable visualization, take a look at pig/show and pig/dump&show

Example of visualization:

(pig/show my-pigpen-query)        ;; Shows a graph of the query
(pig/dump&show my-pigpen-query) ;; Shows a graph and runs it locally

Extending PigPen

One nice feature of PigPen is that it’s easy to build your own operators. For example, we built set and multi-set operators such as difference and intersection. These are just variants of other operations like co-group, but it’s really nice to define them once, test them thoroughly, and not have to think about the logic behind a multiset intersection of n sets ever again.

This is useful for more complex operators as well. We have a reusable statistics operator that computes the sum, avg, min, max, sd, and quantiles for a set of data. We also have a pivot operator that groups dimensional fields in the data and counts each group.

While each of these by themselves are simple operations, when you abstract them out of your query, your query starts to become a lot smaller and simpler. When your query is smaller and simpler, you can spend more time focusing on the actual logic of the problem you’re trying to solve instead of re-writing basic statistics each time. What you’re doing becomes clearer, every step of the way.

Why Pig?

We chose Pig because we didn’t want to re-implement all of the optimization work that has gone into Pig already. If you take away the language, Pig does an excellent job of moving big data around. Our strategy was to use Pig’s DataByteArray binary format to move around serialized Clojure data. In most cases, we found that Pig didn’t need to be aware of the underlying types present in the data. Byte arrays can be compared trivially and quickly, so for joins and groupings, Pig simply needs to compare the serialized blob. We get Clojure’s great value equality for free as equivalent data structures produce the same serialized output. Unfortunately, this doesn’t hold true for sorting data. The sorted order of a binary blob is far less than useful, and doesn’t match the sorted order of the native data. To sort data, we must fall back to the host language, and as such, we can only sort on simple types. This is one of very few places where Pig has imposed a limitation on PigPen.

We did evaluate other languages before deciding to build PigPen. The first requirement was that it was an actual programming language, not just a scripting language with UDFs. We briefly evaluated Scalding, which looks very promising, but our team primarily uses Clojure. It could be said that PigPen is to Clojure what Scalding is to Scala. Cascalog is usually the go-to language for map-reduce in Clojure, but from past experiences, datalog has proven less than useful for everyday tasks. There’s a complicated new syntax and concepts to learn, aligning variable names to do implicit joins is not always ideal, misplaced ordering of operations can often cause big performance problems, datalog will flatten data structures (which can be wasteful), and composition can be a mind bender.

We also evaluated a few options to use as a host language for PigPen. It would be possible to build a similar abstraction on top of Hive, but schematizing every intermediate product doesn’t fit well with the Clojure ideology. Also, Hive is similar to SQL, making translation from a functional language more difficult. There’s an impedance mismatch between relational models like SQL and Hive and functional models like Clojure or Pig. In the end, the most straightforward solution was to write an abstraction over Pig.

Future Work

Currently you can reference in-scope local variables within code that is executed remotely, as shown above. One limitation to this feature is that the value must be serializable. This has the downside of not being able to utilize compiled functions - you can’t get back the source code that created them in the first place. This means that the following won’t work:

(defn foo [x] ...)

(pig/map foo)

In this situation, the compiler will inform you that it doesn’t recognize foo. We’re playing around with different methods for requiring code remotely, but there are some nuances to this problem. Blindly loading the code that was present when the script was generated is an easy option, but it might not be ideal if that code accidentally runs something that was only intended to run locally. Another option would be for the user to explicitly specify what to load remotely, but this poses challenges as well, such as an elegant syntax to express what should be loaded. Everything we’ve tried so far is a little clunky and jar hell with Hadoop doesn’t make it any easier. That said, any code that’s available can be loaded from within any user function. If you upload your uberjar, you can then use a require statement to load other arbitrary code.

So far, performance in PigPen doesn’t seem to be an issue. Long term, if performance issues crop up, it will be relatively easy to migrate to running PigPen directly on Hadoop (or similar) without changing the abstraction. One of the key performance features we still have yet to build is incremental aggregation. Pig refers to this as algebraic operators (also referenced by Rich Hickey here as combining functions). These are operations that can compute partial intermediate products over aggregations. For example, say we want to take the average of a LOT of numbers - so many that we need map-reduce. The naive approach would be to move all of the numbers to one machine and compute the average. A better approach would be to partition the numbers, compute the sum and count of each of these smaller sets, and then use those intermediate products to compute the final average. The challenge for PigPen will be to consume many of these operations within a single function. For example, say we have a set of numbers and we want to compute the count, sum, and average. Ideally, we would want to define each of these computations independently as algebraic operations and then use them together over the same set of data, having PigPen do the work of maintaining a set of intermediate products. Effectively, we need to be able to compose and combine these operations while retaining their efficiency.

We use a number of other Pig & Hadoop tools at Netflix that will pair nicely with PigPen. We have some prototypes for integration with Genie, which adds a pig/submit operator. There’s also a loader for Aegisthus data in the works. And PigPen works with Lipstick as the resulting scripts are Pig scripts.

Conclusion

PigPen has been a lot of fun to build and we hope it’s even more fun to use. For more information on getting started with PigPen and some tutorials, check out the tutorial, or to contribute, take a look at our Github page: https://github.com/Netflix/PigPen

There are three distinct audiences for PigPen, so we wrote three different tutorials:

If you know both Clojure and Pig, you’ll probably find all of the tutorials interesting.

The full API documentation is located here

And if you love big data, check out our jobs.

S3mper: Consistency in the Cloud

by: Daniel C. Weeks

In previous posts, we discussed how the Hadoop platform at Netflix leverages AWS’s S3 offering (read more here). In short, Netflix considers S3 the “source of truth” for all data warehousing.  There are many attractive features that draw us to this service including: 99.999999999% durability, 99.99% availability, effectively infinite storage, versioning (data recovery), and ubiquitous access. In combination with AWS’s EMR, we can dynamically expand/shrink clusters, provision/decommission clusters based on need or availability of reserved capacity, perform live cluster swapping without interrupting processing, and explore new technologies all utilizing the same data warehouse.  In order to provide the capabilities listed above, S3 makes one particular concession which is the focus of this discussion: consistency.

The consistency guarantees for S3 vary by region and operation (details here), but in general, any list or read operation is susceptible to inconsistent information depending on preceding operations. For basic data archival, consistency is not a concern.  However, in a data centric computing environment where information flows through a complex workflow of computations and transformations, an eventually consistent model can cause problems ranging from insidious data loss to catastrophic job failure.

Over the past few years, sporadic inaccuracies appeared that, only after extensive investigation, pointed to consistency as the culprit.  With the looming concern of data inaccuracy and no way to identify the scope or impact, we invested some time exploring how to diagnose issues resulting from eventual consistency and methods to mitigate the impact.  The result of this endeavor is a library that continues to evolve, but is currently in production here at Netflix: s3mper (Latin: always).

Netflix is pleased to announce that s3mper is now released as open source under the Apache License v2.0.  We hope that the availability of this library will inspire constructive discussion focusing on how to better manage consistency at scale with the Hadoop stack across the many cloud offerings currently available.

How Inconsistency Impacts Processing



The Netflix ETL Process is predominantly Pig and Hive jobs scheduled through enterprise workflow software that resolves dependencies and manages task execution.  To understand how eventual consistency affects processing, we can distill the process down to a simple example of two jobs where the results of one feed into another.  If we take a look at Pig-1 from the diagram, it consists of two MapReduce jobs in a pipeline.  The initial dataset is loaded from S3 due to the source location referencing an S3 path. All intermediate data is stored in HDFS since that is the default file system.  Consistency is not a concern for these intermediate stages.  However, the results from Pig-1 are stored directly back to S3 so the information is immediately available for any other job across all clusters to consume.

Pig-2 is activated based on the completion of Pig-1 and immediately lists the output directories of the previous task.  If the S3 listing is incomplete when the second job starts, it will proceed with incomplete data.  This is particularly problematic, as we stated earlier, because there is no indication that a problem occurred.  The integrity of resulting data is entirely at the mercy of how consistent the S3 listing was when the second job started.

A variety of other scenarios may result in consistency issues, but inconsistent listing is our primary concern.  If the input data is incomplete, there is no indication anything is wrong with the result.  Obviously it is noticeable when the expected results vary significantly from long standing patterns or emit no data at all, but if only a small portion of input is missing the results will appear convincing.  Data loss occurring at the beginning of a pipeline will have a cascading effect where the end product is wildly inaccurate.  Due to the potential impact, it is essential to understand the risks and methods to mitigate loss of data integrity.

Approaches to Managing Consistency

The Impractical


When faced with eventual consistency, the most obvious (and naive) approach is to simply wait a set amount of time before a job starts with the expectation that data will show up.  The problem is knowing how long “eventual” will last.  Injecting an artificial delay is detrimental because it defers processing even if requisite data is available and still misses data if it fails to materialize in time.  The result is a net loss for both processing time and confidence in the resulting data.

Staging Data


A more common approach to processing in the cloud is to load all necessary data into HDFS, complete all processing, and store the final results to S3 before terminating the cluster.  This approach works well if processing is isolated to a single cluster and performed in batches.  As we discussed earlier, having the ability to decouple the data from the computing resources provides flexibility that cannot be achieved within a single cluster.  Persistent clusters also make this approach difficult.  Data in S3 may far exceed the capacity of the HDFS cluster and tracking what data needs to be staged and when it expires is a particularly complex problem to solve.

Consistency through Convention

Conventions can be used to eliminate some cases of inconsistency.  Read and list inconsistency resulting from overwriting the same location can corrupt data, in that a listing may mix old versions of data with new, producing an amalgam of two incomplete datasets.  Eliminating update inconsistency is achievable by imposing a convention where the same location is never overwritten.  Here at Netflix, we encourage the use of a batching pattern, where results are written into partitioned batches and the Hive metastore only references the valid batches.  This approach removes the possibility of inconsistency due to update or delete.  For all AWS regions except US Standard that provide “read-after-write” consistency, this approach may be sufficient, but it relies on strict adherence.
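A hypothetical sketch of this convention is shown below: every run writes into a brand-new batch location, and readers only ever see batches that have been registered.  The path layout and the MetastoreClient interface are invented for illustration; at Netflix the convention is enforced through Hive metastore partitions rather than this exact code.

import java.time.Instant;
import java.util.UUID;

// Illustrative sketch of the "never overwrite a location" batching convention.
public class BatchedWriter {

    // Hypothetical metastore abstraction; registers the valid batch for a dataset/partition.
    interface MetastoreClient {
        void registerBatch(String dataset, String dateHour, String location);
    }

    private final MetastoreClient metastore;

    public BatchedWriter(MetastoreClient metastore) {
        this.metastore = metastore;
    }

    public String newBatchLocation(String dataset, String dateHour) {
        // Every run writes to a brand-new location, so a listing can never mix
        // old and new versions of the same files.
        return String.format("s3://warehouse/%s/dateint=%s/batch=%d-%s/",
                dataset, dateHour, Instant.now().toEpochMilli(), UUID.randomUUID());
    }

    public void commit(String dataset, String dateHour, String location) {
        // Downstream jobs resolve locations through the metastore, so a partially
        // written batch that was never registered is never picked up.
        metastore.registerBatch(dataset, dateHour, location);
    }
}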

Secondary Index


S3 is designed with an eventually consistent index, which is understandable in the context of the scale and the guarantees it provides.  At smaller scale, it is possible to achieve consistency through the use of a consistent secondary index to catalog file metadata while backing the raw data on S3.  This approach becomes more difficult as the scale increases, but as long as the secondary index can handle the request rate and still provide guaranteed consistency, it will suffice.  There are costs to this approach: the probability of data loss and the complexity increase, while performance degrades due to relying on two separate systems.

S3mper: A Hybrid Approach

S3mper is an experimental approach to tracking file metadata through use of a secondary index that provides consistent reads and writes.  The intent is to identify when an S3 list operation returns inconsistent results and provide options to respond.  We implemented s3mper using aspects to advise methods on the Hadoop FileSystem interface and track file metadata with DynamoDB as the secondary index.  The reason we chose DynamoDB is that it provides capabilities similar to S3 (e.g. high availability, durability through replication), but also adds consistent operations and high performance.
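To give a rough feel for the mechanism, here is a heavily hedged sketch of an aspect that advises FileSystem listings and compares them against a consistent secondary index.  The pointcut, the Metastore interface, and the error handling are invented for illustration; s3mper's real advice, metastore schema, and recovery options differ.

import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class ConsistentListingSketch {

    // Hypothetical metastore abstraction, backed by DynamoDB in s3mper.
    interface Metastore {
        List<Path> expectedPaths(Path directory);
    }

    private Metastore metastore;    // wired up elsewhere in a real deployment

    // Advise FileSystem#listStatus(Path): compare what S3 returned with what the
    // secondary index says should be there.
    @Around("execution(* org.apache.hadoop.fs.FileSystem.listStatus(org.apache.hadoop.fs.Path))")
    public Object checkListing(ProceedingJoinPoint pjp) throws Throwable {
        Path dir = (Path) pjp.getArgs()[0];
        FileStatus[] listed = (FileStatus[]) pjp.proceed();

        for (Path expected : metastore.expectedPaths(dir)) {
            boolean found = false;
            for (FileStatus status : listed) {
                if (status.getPath().equals(expected)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                // In s3mper this would trigger delay/retry, notification, or job failure
                // depending on configuration; here we simply flag the inconsistency.
                throw new IllegalStateException("Inconsistent listing, missing: " + expected);
            }
        }
        return listed;
    }
}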


What makes s3mper a hybrid approach is its use of the S3 listing for comparison and only maintaining a window of consistency.  The “source of truth” is still S3, but with an additional layer of checking added.  The window of consistency allows for falling back to the S3 listing without concern that the secondary index will fail and lose important information or risk consistency issues that arise from using tools outside the hadoop stack to modify data in S3.

The key features s3mper provides include (see here for more detailed design and options):
  • Recovery: When an inconsistent listing is identified, s3mper will optionally delay the listing and retry until consistency is achieved.  This will delay a job only long enough for data to become available without unnecessarily impacting job performance.
  • Notification: If a consistent listing cannot be achieved, a notification is sent immediately and a determination can be made as to whether to kill the job or let it proceed with incomplete data.
  • Reporting: A variety of events are sent to track the number of recoveries, files missed, what jobs were affected, etc.
  • Configurability:  Options are provided to control how long a job should wait, how frequently to recheck a listing, and whether to fail a job if the listing is inconsistent.
  • Modularity: The implementations for the metastore and notifications can be overridden based on the environment and services at your disposal.
  • Administration: Utilities are provided for inspecting the metastore and resolving conflicts between the secondary index in DynamoDB and the S3 index.

S3mper is not intended to solve every possible case where inconsistency can occur.  Deleting data from S3 outside of the hadoop stack will result in divergence of the secondary index and jobs being delayed unnecessarily.  Directory support is also limited such that recursive listings are still prone to inconsistency, but since we currently derive all our data locations from a Hive metastore, this does not impact us.  While this library is still in its infancy and does not support every case, using it in combination with the conventions discussed earlier will alleviate the concern for our workflow and allow for further investigation and development of new capabilities.


Performance in production

S3mper has been running in production at Netflix for a few months and the result is an interesting dataset with respect to consistency.  For context, Netflix operates out of the US Standard region where we run tens of thousands of Pig, Hive, and Hadoop jobs across multiple clusters of varying size and process several hundreds of terabytes of data every day.  The number of listings is hard to estimate because any given job will perform several listings depending on the number of partitions processed, but s3mper is tracking every interaction Hadoop has with S3 across all clusters and datasets.  At any given time, DynamoDB contains metadata on millions of files within our configured 24 hour sliding window of consistency.  We keep track of metrics on how frequently s3mper recovers a listing (i.e. postpones a job until it receives a complete listing) and when the delay is exceeded resulting in a job executing with data acquired through an inconsistent listing.  


It is clear from these numbers that inconsistent listings make up a tiny fraction of all S3 operations.  In many cases all files are available within a few minutes and s3mper can recover the listing.  In cases where listings are not recovered, notification goes out to the job owner and they can determine if a rerun is necessary.  We can only speculate at the variation seen over time because S3 is a shared resource and we have little knowledge of the underlying implementation.

After investigating a sample of affected jobs, patterns do emerge that appear to result in increased probability of inconsistent listing.  For example, a stage within a single job that produces tens of thousands of files and reads them immediately in the next stage appears to have a higher likelihood of consistency issues.  We also make use of versioned buckets, which track history through use of delete markers.  Jobs that experience slower consistency often overwrite the same location repeatedly, which may have some correlation to how quickly an updated listing is available.  These observations are based purely on the types of queries and access patterns that have resulted in inconsistent listings as reported by s3mper.

Conclusion

With the petabytes of data we store in S3 and several million operations we perform each day, our experience with eventual consistency demonstrates that only a very small percentage of jobs are impacted, but the severity of inaccurate results warrants attention.  Being able to identify when a consistency issue occurs is beneficial not only due to confidence in resulting data, but helps to exclude consistency in diagnosing where a problem exists elsewhere in the system.  There is still more to be learned and we will continue to investigate avenues to better identify and resolve consistency issues, but s3mper is a solution we use in production and will continue to provide insight into these areas.

Improving Netflix’s Operational Visibility with Real-Time Insight Tools


by: Ranjit Mavinkurve, Justin Becker and Ben Christensen


For Netflix to be successful, we have to be vigilant in supporting the tens of millions of connected devices that are used by our 40+ million members throughout 40+ countries. These members consume more than one billion hours of content every month and account for nearly a third of the downstream Internet traffic in North America during peak hours.
From an operational perspective, our system environments at Netflix are large, complex, and highly distributed. And at our scale, humans cannot continuously monitor the status of all of our systems. To maintain high availability across such a complicated system, and to help us continuously improve the experience for our customers, it is critical for us to have exceptional tools coupled with intelligent analysis to proactively detect and communicate system faults and identify areas of improvement.

In this post, we will talk about our plans to build a new set of insight tools and systems that create greater visibility into our increasingly complicated and evolving world.

Extending Our Current Insight Capabilities

Our existing insight tools include dashboards that display the status of our systems in near real time, and alerting mechanisms that notify us of major problems. While these tools are highly valuable, we have the opportunity to do even better.

Perspective
Many of our current insight tools are systems-oriented. They are built from the perspective of the system providing the metrics. This results in a proliferation of custom tools and views that require specialized knowledge to use and interpret. Also, some of these tools tend to focus more on system health and not as much on our customers’ streaming experience. 

Instead, what we need is a cohesive set of tools that provide relevant insight, effective visualization of that insight, and smooth navigability from the perspective of the tools’ consumers. The consumers of these tools are internal staff members such as engineers who want to view the health of a particular part of our system or look at some aspect of our customers’ streaming experience. To reduce the time needed to detect, diagnose, and resolve problems, it is important for these tools to be highly effective.

Context

When attempting to deeply understand a particular part of the system, or when troubleshooting a problem, it is invaluable to have access to accurate, up-to-date information and relevant context. Rich and relevant context is a highly desirable feature to have in our insight tools.

Consider the following example. Our current insight tools provide ways to visualize and analyze a metric and trigger an alert based on the metric. We use this to track stream-starts-per-second (SPS), a metric used to gauge user engagement. If SPS rises above or drops below its "normal" range, our monitoring and alerting system triggers an alert. This helps us detect that our customers are unable to stream. However, since the system does not provide context around the alert, we have to use a variety of other means to diagnose the problem, delaying resolution of the problem. Instead, suppose we had a tool tracking all changes to our environment. Now, if an alert for SPS were triggered, we could draw correlations between the system changes and the alert to provide context to help troubleshoot the problem. 
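To make the idea concrete, here is a small, hypothetical sketch of what attaching context to an alert could look like: when an alert fires, pull recent changes from a change-tracking service and attach the ones that precede the alert within a window. The Alert, ChangeEvent, and ChangeTracker names are invented for illustration and do not reflect any actual Netflix system.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class AlertContextSketch {

    record ChangeEvent(String description, Instant timestamp) {}

    record Alert(String metric, Instant triggeredAt) {}

    // Hypothetical change-tracking service (deployments, config pushes, scaling events, ...).
    interface ChangeTracker {
        List<ChangeEvent> changesSince(Instant since);
    }

    // Return the changes that happened shortly before the alert, as troubleshooting context.
    static List<ChangeEvent> contextFor(Alert alert, ChangeTracker tracker, Duration window) {
        List<ChangeEvent> context = new ArrayList<ChangeEvent>();
        for (ChangeEvent change : tracker.changesSince(alert.triggeredAt().minus(window))) {
            if (!change.timestamp().isAfter(alert.triggeredAt())) {
                context.add(change);
            }
        }
        return context;
    }
}

For an SPS alert, a call like contextFor(alert, tracker, Duration.ofMinutes(30)) could surface a deployment or configuration change that landed just before streaming dropped, giving the responder an immediate starting point.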

Alerts are by no means the only place where context is valuable. In fact, the insight provided by virtually any view or event is significantly more effective when relevant context is provided. 

Automation
While we need to be able to analyze and identify interesting or unusual behavior and surface any anomalies, we also need for this to happen automatically and continuously. Further, not only should it be possible to add new anomaly detection or analysis algorithms easily, but also the system itself should be able to create and apply new rules and actions for these algorithms dynamically via machine learning.

Ad Hoc Queries
In the streaming world, we have multiple facets like customer, device, movie, region, ISP, client version, membership type, etc., and we often need quick insight based on a combination of these facets. For example, we may want to quickly view the current time-weighted average bitrate for content delivered to Xbox 360s in Mexico on Megacable ISP. An insight system should be able to provide quick answers to such questions in order to enable quick analysis and troubleshooting.
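For a sense of what such an ad hoc faceted query might look like, here is a minimal sketch using pandas with hypothetical column names; the real system would run against live event streams rather than a static frame:

import pandas as pd

# Hypothetical playback events: one row per streaming session sample.
events = pd.DataFrame([
    {"device": "Xbox 360", "country": "Mexico", "isp": "Megacable",
     "bitrate_kbps": 3000, "duration_s": 1200},
    {"device": "Xbox 360", "country": "Mexico", "isp": "Megacable",
     "bitrate_kbps": 1750, "duration_s": 600},
    {"device": "PS3", "country": "Mexico", "isp": "Megacable",
     "bitrate_kbps": 2350, "duration_s": 900},
])

# Filter on the selected facets, then compute the time-weighted average bitrate.
sel = events[(events.device == "Xbox 360") &
             (events.country == "Mexico") &
             (events.isp == "Megacable")]
weighted_avg = (sel.bitrate_kbps * sel.duration_s).sum() / sel.duration_s.sum()
print(f"Time-weighted average bitrate: {weighted_avg:.0f} kbps")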

Dynamic Visualizations
Our existing insight tools have dashboards with time-series graphs that are very useful and effective. With our new insight tools, we want to take our tools to a whole new level, with rich, dynamic data visualizations that visually communicate relevant, up-to-date details of the state of our environments for any operational facets of interest. For example, we want to surface interesting patterns, system events, and anomalies via visual cues within a dynamic timeline representation that is updated in near real-time.

Cohesive Insight
We have many disparate insight tools that are owned and used by separate teams. What we need is a set of tools that allow for shared communication, insights, and context in a cohesive manner. This is vital for the smooth operation of our complex, dynamic operational environments. We need a single, convenient place to communicate anomalous and informational changes as well as user-provided context.

The New Way

With our next generation of insight tools, we have the opportunity to create new and transformative ways to effectively deliver insights and extend our existing insight capabilities. We plan to build a new set of tools and systems for operational visibility that provide the insight features and capabilities that we need.

Timeliness Matters

Operational insight is much more valuable when delivered immediately. For example, when a streaming outage occurs, we want to take remedial action as soon as possible, to minimize downtime. We want anomaly detection to be automatic, but we also want it quickly, in near real-time. For ad hoc analysis as well, quick insight is the key to troubleshooting a problem and getting to a fast resolution.

Event Stream Processing

Given the importance of timeliness, a key piece of technology for the backend system for our new set of insight tools for operational visibility will be dynamic, real-time Event Stream Processing (aka Complex Event Processing) with the ability to run ad hoc dynamic queries on live event streams. Event processing enables one to identify meaningful events, such as opportunities or threats, and respond to them as quickly as possible. This fits the goals of our insight tools very well.

Scaling Challenge
Our massive scale poses an interesting challenge with regard to data scaling. How can we track many millions of metrics with fine-grained context and also support fast ad hoc queries on live data while remaining cost-effective? If we persisted every permutation of each metric along with all of its associated dimensions for all the events flowing through our system, we could end up with trillions of values, taking up several TBs of disk space every day. 

Since the value of fine-grained insight data for operational visibility is high when the data is fresh but diminishes over time, we need a system that exploits these characteristics of insight data and has the ability to make recent data available quickly, without having to persist all of the data. 

We have the opportunity to build an innovative and powerful stream processing system that meets our insight requirements, such as support for ad hoc faceted queries on live streams of big data, and has the ability to scale dynamically to handle multiple, simultaneous queries.

A Picture is Worth a Thousand Words

Good visualization helps to communicate and deliver insights effectively. As we develop our new insight tools for operational visibility, it is vital that the front-end interface to this system provide dynamic data visualizations that can communicate the insights in a very effective manner. As mentioned earlier, we want to tailor the insights and views to meet the needs of the tool’s consumers.

We envision the front-end to our new operational insight tool to be an interactive, single-page web application with rich and dynamic data visualizations and dashboards updated in real-time. The design below is a mockup (with fake data) for one of the views.


There are several components within the design: a top-level navigation bar to switch between different views in the system, a breadcrumbs component highlighting the selected facets, a main view module (a map in this instance), a key metrics component, and a timeline and an incident view on the right side of the screen. The main view communicates data based on the selected facets.


The mockup shown above (with fake data) represents another view in the system and displays routes in our edge tier with request rates, error rates and other key metrics for each route updated in near real-time.

All views in the system are dynamic and reflect the current operational state based on selected facets. A user can modify the facets and immediately see changes to the user interface.

Summary

Operational visibility with real-time insight enables us to deeply understand our operational systems, make product and service improvements, and find and fix problems quickly so that we can continue to innovate rapidly and delight our customers at every interaction. We are building a new set of tools and systems for operational visibility at Netflix with powerful insight capabilities.

Join Us!

Do you want to help design and build the next generation of innovative insight tools for operational visibility at Netflix? This is a greenfield project where you can have a large impact working in a small team. Are you interested in real-time stream processing or data visualization? We are looking for talented engineers; check out www.netflix.com/jobs.

Scryer: Netflix's Predictive Auto Scaling Engine - Part 2

by Danny Yuan, Neeraj Joshi, Daniel Jacobson, and Puneet Oberai


In part 1 of this series, we introduced Scryer, Netflix’s predictive autoscaling engine, and discussed its use cases and how it runs in Netflix. In this second installment, we will discuss the design of Scryer ranging from the technical implementation to the algorithms that drive its predictions.

Design of Scryer


Scryer has a simple data flow architecture. On a very high level, historical data flows into Scryer, and predicted actions flow out. The diagram below shows the architecture.



The API layer provides a RESTful interface for a web UI, as well as automation scripts to interact with Scryer.

The Data Collector module pulls metrics from a pluggable list of data sources, cleans the data, and transforms it into a format suitable for the Predictor. The data retrieval is currently done incrementally within a sliding time window to minimize the load on the data source. The data is also stored in a secondary persistent store for resiliency purposes.


The Predictor generates predictions based on a pluggable list of prediction algorithms. We implemented two prediction algorithms for production: one is an augmented linear-regression-based algorithm, and the other is based on the Fast Fourier Transform (FFT). The Predictor module also provides life cycle hooks for pre- and post-processing of predictions. A pluggable prediction combiner is then used to combine multiple predictions to generate a single final prediction.


The Action Plan Generator module uses the prediction and other control parameters (e.g., server throughput, server start time, etc.) to compute an auto scaling plan. The auto scaling plan is optimized to minimize the number of scale-up events while maintaining an optimal scale-up batch size for each scale-up event. Pre- and post-action hooks are available to apply additional padding to instance counts if required. For example, we may need to add extra instances for holidays.


The Scaler module carries out the action plan generated by the Action Plan Generator module. It allows different implementations of actions. Currently, we have implemented three different actions:
  • Emitting predictions and action steps to our monitoring dashboard at scheduled times. This is great for simulating the behavior of Scryer. We can easily visualize the predictions and actions, and compare the predictions with the actual workload in the same graph.
  • Scheduling each step using the AWS API for Scheduled Actions (a hedged sketch of this follows the list)
  • Scheduling actions that will scale a cluster using EC2 API
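To make the scheduling step concrete, here is a minimal sketch of registering one predicted scale-up step as an AWS Scheduled Action using boto3; the group name, capacity, and region are assumptions for illustration, not Scryer's actual implementation:

import boto3
from datetime import datetime, timezone

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

def schedule_step(group_name, step_time, desired_capacity):
    """Register one predicted scale-up step as an AWS Scheduled Action."""
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName=group_name,
        ScheduledActionName=f"scryer-{step_time:%Y%m%d-%H%M}",
        StartTime=step_time,
        DesiredCapacity=desired_capacity,
    )

# Example: scale the (hypothetical) "api-prod" group to 120 instances ahead of the peak.
schedule_step("api-prod", datetime(2014, 2, 1, 17, 30, tzinfo=timezone.utc), 120)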

Metrics for Prediction Algorithms


The first order of business for building the prediction algorithm is to determine what metrics are to be used for prediction and autoscaling actions. When using Amazon Auto Scaling, we normally settle on load average. Load average fits because it is a good indicator of the capacity of a cluster, and it is independent of traffic pattern. Our goal is simply to keep load average within a certain range by adjusting cluster size. However, load average is a misfit for prediction because it is a result of auto scaling. It is too complicated, if not impossible, to predict something that also changes as a result of the prediction. A metric has to satisfy two conditions to be easily predictable:
  • It has a clear, relatively stable, and preferably a recurring pattern. We can predict reliably only what has repeatedly happened in the past.
  • It is independent of cluster performance. We deploy our code frequently, and the performance of a deployment may vary per deployment. If the metric depends on cluster performance, prediction may deviate widely from the actual values of the metric.


Therefore, we decided to use user traffic for prediction. In particular, we use requests per second by default because most of our services are request-based. User traffic satisfies the aforementioned two conditions.


Once we determined which metric to predict on, we also needed to figure out how to calculate scaling actions. Since the goal of auto scaling is to ensure a cluster has a sufficient number of machines to serve all the user traffic, all we need to do is predict the size of the cluster, which depends on the average throughput of a server: roughly, the required cluster size is the predicted request rate divided by the requests per second a single server can handle.
We can get throughput metrics from our monitoring system, or from stress testing. Scryer also allows users to override the throughput value manually via the web UI or by calling a RESTful API.
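Here is a minimal sketch of that sizing calculation, assuming the simple relationship above plus an illustrative headroom factor; Scryer's real planner additionally optimizes the number and batch size of scale-up steps:

import math

def required_instances(predicted_rps, per_server_rps, headroom=0.2):
    """Size the cluster from a traffic prediction and measured per-server throughput."""
    return math.ceil(predicted_rps * (1 + headroom) / per_server_rps)

# Example: 45,000 predicted requests/sec, 400 requests/sec per server, 20% headroom.
print(required_instances(45_000, 400))  # -> 135 instances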

Prediction Algorithms

The key to effective prediction algorithms is making use of as many signals as possible and at the same time ignoring noise in input metrics. We observed that our input metrics had the following characteristics:
  • They have clear weekly periodicity for the same day of a week. That is, the traffic of two adjacent Tuesdays is more similar than that of adjacent Tuesday and Wednesday.
  • Their daily patterns are similar, albeit different in shape and scale.
  • They have some small spikes and drops that we can deem noise.
  • The change in traffic is relatively constant week over week. In other words, the traffic at the same time on the same day of the week moves approximately linearly from week to week.
  • There could be occasional large spikes or large drops due to system outages.
Based on our observations, we took two different approaches: FFT-based smoothing, and linear regression with clustered data points.


FFT-Based Prediction

The idea of this algorithm is to treat incoming traffic as a combination of multiple sine curves. Noise is of high frequency and low amplitude. Therefore, we can use an FFT filter that filters out noise based on given thresholds of frequency and amplitude. The filtered result is a smoothened curve. To predict a future value, we shift the curve to find the past value that is exactly one period away. Mathematically speaking, if the filtered result is a function of time \(f(t)\), and the future value is another function of time \(g(t)\), then \(g(t) = f(t - \omega)\), where \(\omega\) is the periodicity of the function \(f(t)\). The figure below illustrates the idea. The black curve is the input, and the blue curve is the smoothened result. We can see the sharp spikes are filtered out because they have much higher frequency and much smaller amplitude than the blue curve.
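The following is a minimal numpy sketch of this idea, with made-up thresholds and stand-in data; the production algorithm additionally handles outage detection and iterative refinement as described below:

import numpy as np

def fft_smooth(traffic, rel_threshold=0.01):
    """Drop frequency components whose amplitude falls below a fraction of the strongest one."""
    spectrum = np.fft.rfft(traffic)
    spectrum[np.abs(spectrum) < rel_threshold * np.abs(spectrum).max()] = 0
    return np.fft.irfft(spectrum, n=len(traffic))

def predict(traffic, period, horizon):
    """Predict the next `horizon` points as the smoothed values exactly one period earlier."""
    smoothed = fft_smooth(np.asarray(traffic, dtype=float))
    # g(t) = f(t - period): read the smoothed curve one period back from the future times.
    start = len(smoothed) - period
    return smoothed[start:start + horizon]

# Example: two weeks of minute-level data, predict the next hour using weekly periodicity.
week = 7 * 24 * 60
history = 5000 + 100 * np.random.rand(2 * week)  # stand-in for real traffic
print(predict(history, period=week, horizon=60)[:5])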



The FFT based algorithm is also capable of ignoring outages. It detects outages by applying standard statistical methods. Once an outage is detected, the algorithm will iteratively apply FFT on adjusted data until the outage is ignored. The following figure shows that a simulated big drop is reasonably ignored: 


The prediction algorithm undergoes multiple iterations to gradually remove the effect of such a drop, as shown in the figure below. The first iteration is red, and the last iteration is yellow. We can see that the prediction becomes better with each iteration.



Linear Regression on Clustered Data Points

We can’t apply linear regression directly to the input metrics, as the shape of the input is sinusoidal. However, given that each day has a similar pattern with a linear trend, we can pick data points at the same time on different days, and then apply linear regression. This approach would require a lot of days. However, if we zoom in on the data, we see that within a smaller time window, say 10 minutes, the data points have relatively identical values. Therefore, we can pick a cluster of data points from each time window, and then apply linear regression. This turns out to produce very accurate predictions. The following series of figures illustrates how linear regression works. This method also complements the FFT-based method. Some traffic patterns may contain regular but short-lived spikes. Such spikes are not noise. The FFT-based method unfortunately filters such spikes out. However, this method will predict them. The diagram below illustrates such regular patterns that would get filtered out by the FFT-based method.



The diagram below shows the workload of one of Netflix's clusters within a 30-minute window. The workload does not fluctuate by more than 4%.


Therefore, we can pick data points from around the same time on different days within a specified time window. The number of chosen data points is progressively reduced as we move back in time. This naturally gives newer data more influence than older data. We also choose a larger set of data points from days that are highly similar to each other based on a weight matrix. For example, Saturday's traffic is similar to that of other Saturdays and of its adjacent Sunday, so for a Saturday we choose more data points from weekends than from weekdays.



Once we have clusters of data points, we can then apply linear regression. The blue dots are the selected clustered points, and the red line is the result of the linear regression. Once we obtain the line, we can predict by simply extrapolating the line to a future time.
One potential problem with this approach is that a single outage may invalidate a lot of points, thereby skewing the regression results. This is why we still combine the FFT-based method with this algorithm. In addition, we also apply outlier detection algorithms to remove invalid points. We implemented both a distribution-based algorithm and a deviation-based algorithm. Both turned out to work well.
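Below is a minimal sketch of the clustering-plus-regression idea with a simple deviation-based outlier filter; the window size, per-week point counts, and thresholds are illustrative, not Scryer's actual parameters:

import numpy as np

WEEK = 7 * 24 * 60  # minutes in a week

def predict_ahead(history, lead, window=5, weeks=4, z_cutoff=3.0):
    """Predict traffic `lead` minutes ahead by regressing clustered points from past weeks.

    `history` is minute-level traffic, oldest first, spanning at least `weeks` full weeks;
    `lead` is assumed to be well under one week.
    """
    now = len(history)                        # index of "now" (one past the last sample)
    xs, ys = [], []
    for w in range(1, weeks + 1):
        center = now + lead - w * WEEK        # same minute of the week, w weeks ago
        half = max(1, window // w)            # fewer points from older weeks
        for offset in range(-half, half + 1):
            xs.append(center + offset - now)  # time relative to "now"
            ys.append(history[center + offset])
    xs, ys = np.array(xs, dtype=float), np.array(ys, dtype=float)

    # Deviation-based outlier removal: drop points far from the cluster mean.
    z = np.abs(ys - ys.mean()) / (ys.std() + 1e-9)
    slope, intercept = np.polyfit(xs[z < z_cutoff], ys[z < z_cutoff], 1)

    # Extrapolate the fitted trend line to the future time we want to predict.
    return slope * lead + intercept

# Example: with four weeks of minute-level history, predict 30 minutes ahead.
history = (5000 + 100 * np.random.rand(4 * WEEK)).tolist()
print(predict_ahead(history, lead=30))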


Future Work

While Scryer has dramatically improved our system scaling, there are still many things that we can do to make it better. We plan to improve Scryer on three areas in near future:
  • Making Scryer distributed. The current implementation of Scryer runs on a single server. It is capable of handling hundreds of clusters, and it tolerates temporary system crashes because it checkpoints important state in Cassandra. That said, making it distributed can reduce Scryer’s bootstrap time, thereby reducing its potential downtime. A distributed Scryer can also be scaled up to handle many more clusters. Distributing Scryer also helps its resilience: if the single instance fails, so does Scryer. Of course, we still have Amazon Auto Scaling as a fallback, but that is not optimal. In addition, because Scryer runs on a single instance, we have had to opt it out of Chaos Monkey. Opting it in would give us more data on how the system fares if an instance drops out, so being distributed has that benefit as well.
  • Implementing an automatic feedback loop so Scryer can auto-tune. We record and monitor the accuracy of Scryer’s predictions, the effect of scaling actions, as well as instance start times. We then use such data to tune the parameters of our algorithms. This work, however, can be largely automated. We plan to implement a trend detector. If the prediction starts to consistently deviate from the actual workload, the detector will capture the deviation and feed it to an auto-correcting module. The auto-correcting module will compensate the auto scaling accordingly, and will also tune the prediction algorithm if needed.
  • Improving our prediction algorithms. For example, we ran experiments to find out how to choose clusters of data points for linear regression. We plan to automate this process so we always get accurate parameters for choosing data points. We also plan to improve the heuristics for filtering out noise in our FFT-based algorithm.


Conclusion

Scryer adopts a simple yet flexible design that allows users to configure its behavior with ease. It has built-in fault tolerance features to cope with temporary data unavailability, occasional data irregularities such as outages, and system downtime. The algorithms employed by Scryer take advantage of Netflix’s traffic patterns and achieve accurate results. Although the approaches and algorithms described above are already yielding excellent results, we are constantly reviewing them in an effort to improve Scryer.

Finally, we work on these kinds of exciting challenges all the time at Netflix.  If you would like to join us in tackling such problems, check out our Jobs site.

Distributed Neural Networks with GPUs in the AWS Cloud

by Alex Chen, Justin Basilico, and Xavier Amatriain

As we have described previously on this blog, at Netflix we are constantly innovating by looking for better ways to find the best movies and TV shows for our members. When a new algorithmic technique such as Deep Learning shows promising results in other domains (e.g. Image Recognition, Neuro-imaging, Language Models, and Speech Recognition), it should not come as a surprise that we would try to figure out how to apply such techniques to improve our product. In this post, we will focus on what we have learned while building infrastructure for experimenting with these approaches at Netflix. We hope that this will be useful for others working on similar algorithms, especially if they are also leveraging the Amazon Web Services (AWS) infrastructure. However, we will not detail how we are using variants of Artificial Neural Networks for personalization, since it is an active area of research.

Many researchers have pointed out that most of the algorithmic techniques used in the trendy Deep Learning approaches have been known and available for some time. Much of the more recent innovation in this area has been around making these techniques feasible for real-world applications. This involves designing and implementing architectures that can execute these techniques using a reasonable amount of resources in a reasonable amount of time. The first successful instance of large-scale Deep Learning made use of 16000 CPU cores in 1000 machines in order to train an Artificial Neural Network in a matter of days. While that was a remarkable milestone, the required infrastructure, cost, and computation time are still not practical.

Andrew Ng and his team addressed this issue in follow-up work. Their implementation used GPUs as a powerful yet cheap alternative to large clusters of CPUs. Using this architecture, they were able to train a model 6.5 times larger in a few days using only 3 machines. In another study, Schwenk et al. showed that training these models on GPUs can improve performance dramatically, even when compared to high-end multicore CPUs.

Given our well-known approach and leadership in cloud computing, we set out to implement a large-scale Neural Network training system that leveraged both the advantages of GPUs and the AWS cloud. We wanted to use a reasonable number of machines to implement a powerful machine learning solution using a Neural Network approach. We also wanted to avoid needing special machines in a dedicated data center and instead leverage the full, on-demand computing power we can obtain from AWS.

In architecting our approach for leveraging computing power in the cloud, we sought to strike a balance that would make it fast and easy to train Neural Networks by looking at the entire training process. For computing resources, we have the capacity to use many GPU cores, CPU cores, and AWS instances, which we would like to use efficiently. For an application such as this, we typically need to train not one, but multiple models either from different datasets or configurations (e.g. different international regions). For each configuration we need to perform hyperparameter tuning, where each combination of parameters requires training a separate Neural Network. In our solution, we take the approach of using GPU-based parallelism for training and using distributed computation for handling hyperparameter tuning and different configurations.

Distributing Machine Learning: At what level?


Some of you might be thinking that the scenario described above is not what people think of as distributed Machine Learning in the traditional sense. For instance, in the work by Ng et al. cited above, they distribute the learning algorithm itself between different machines. While that approach might make sense in some cases, we have found that it is not always the norm, especially when a dataset can be stored on a single instance. To understand why, we first need to explain the different levels at which a model training process can be distributed.

In a standard scenario, we will have a particular model with multiple instances. Those instances might correspond to different partitions in your problem space. A typical situation is to have different models trained for different countries or regions, since the feature distribution and even the item space might be very different from one region to another. This represents the first level at which we can decide to distribute our learning process. We could have, for example, a separate machine train a model for each of the 41 countries where Netflix operates, since each region can be trained entirely independently.

However, as explained above, training a single instance actually implies training and testing several models, each corresponding to a different combination of hyperparameters. This represents the second level at which the process can be distributed. This level is particularly interesting if there are many parameters to optimize and you have a good strategy to optimize them, like Bayesian optimization with Gaussian Processes. The only communication between runs is the hyperparameter settings and test evaluation metrics.

Finally, the algorithm training itself can be distributed. While this is also interesting, it comes at a cost. For example, training an ANN is a comparatively communication-intensive process. Given that you are likely to have thousands of cores available in a single GPU instance, it is very convenient if you can squeeze the most out of that GPU and avoid getting into costly cross-machine communication scenarios. This is because communication within a machine using memory is usually much faster than communication over a network.

The following pseudocode illustrates the three levels at which an algorithm training process like ours can be distributed.

for each region -> level 1 distribution
    for each hyperparameter combination -> level 2 distribution
        train model -> level 3 distribution
    end for
end for

In this post we will explain how we addressed level 1 and 2 distribution in our use case. Note that one of the reasons we did not need to address level 3 distribution is because our model has millions of parameters (compared to the billions in the original paper by Ng).

Optimizing the CUDA Kernel


Before we addressed the distribution problem, though, we had to make sure the GPU-based parallel training was efficient. We approached this by first getting a proof of concept to work on our own development machines, and then addressing the issue of how to scale and use the cloud as a second stage. We started with a Lenovo S20 workstation with an Nvidia Quadro 600 GPU. This GPU has 98 cores and provided a useful baseline for our experiments, especially considering that we planned on using a more powerful machine and GPU in the AWS cloud. Our first attempt to train our Neural Network model took 7 hours.

We then ran the same code to train the model on an EC2 cg1.4xlarge instance, which has a more powerful Tesla M2050 with 448 cores. However, the training time jumped from 7 to over 20 hours. Profiling showed that most of the time was spent in calls to the Nvidia Performance Primitives (NPP) library, e.g. nppsMulC_32f_I and nppsExp_32f_I. Calling the npps functions repeatedly took 10x more system time on the cg1 instance than on the Lenovo S20.

While we tried to uncover the root cause, we worked around the issue by reimplementing the npps functions with customized CUDA kernels, e.g. replacing the nppsMulC_32f_I function with:

__global__
void KernelMulC(float c, float *data, int n)
{
     // One thread per array element: scale data[i] in place by the constant c.
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) {
          data[i] = c * data[i];
     }
}

Replacing all npps functions in this way in the Neural Network code reduced the total training time on the cg1 instance from over 20 hours to just 47 minutes when training on 4 million samples. Training on 1 million samples took 96 seconds of GPU time. Using the same approach on the Lenovo S20, the total training time was also reduced from 7 hours to 2 hours. This makes us believe that the implementation of these functions is suboptimal regardless of the card specifics.

PCI configuration space and virtualized environments


While we were implementing this “hack”, we also worked with the AWS team to find a principled solution that would not require a kernel patch. In doing so, we found that the performance degradation was related to the NVreg_CheckPCIConfigSpace parameter of the kernel. According to RedHat, setting this parameter to 0 disables very slow accesses to the PCI configuration space. In a virtualized environment such as the AWS cloud, these accesses cause a trap in the hypervisor that results in even slower access.

NVreg_CheckPCIConfigSpace is a parameter of the kernel module nvidia-current, which can be set using:

sudo modprobe nvidia-current NVreg_CheckPCIConfigSpace=0

We tested the effect of changing this parameter using a benchmark that calls MulC repeatedly (128x1000 times). Below are the results (runtime in sec) on our cg1.4xlarge instances:

              KernelMulC    npps_MulC
CheckPCI=1    3.37          103.04
CheckPCI=0    2.56          6.17

As you can see, disabling accesses to the PCI configuration space had a spectacular effect on the original npps functions, decreasing the runtime by 95%. The effect was significant even for our optimized kernel functions, saving almost 25% in runtime. However, it is important to note that even when the PCI access check is disabled, our customized functions performed almost 60% better than the default ones.

We should also point out that there are other options, which we have not explored so far but could be useful for others. First, we could look at optimizing our code by applying a kernel fusion trick that combines several computation steps into one kernel to reduce memory accesses. Second, we could think about using Theano, the GPU math compiler in Python, which is supposed to also improve performance in these cases.

G2 Instances


While our initial work was done using cg1.4xlarge EC2 instances, we were interested in moving to the new EC2 GPU g2.2xlarge instance type, which has a GRID K520 GPU (GK104 chip) with 1536 cores. Currently our application is also bound by GPU memory bandwidth, and the GRID K520’s memory bandwidth of 198 GB/sec is an improvement over the Tesla M2050’s 148 GB/sec. Of course, using a GPU with even faster memory would also help (e.g. TITAN’s memory bandwidth is 288 GB/sec).

We repeated the same comparison between the default npps functions and our customized ones (with and without PCI space access) on the g2.2xlarge instances.

              KernelMulC    npps_MulC
CheckPCI=1    2.01          299.23
CheckPCI=0    0.97          3.48

One initial surprise was that we measured worse performance for npps on the g2 instances than the cg1 when PCI space access was enabled. However, disabling it improved performance between 45% and 65% compared to the cg1 instances. Again, our KernelMulC customized functions are over 70% better, with benchmark times under a second. Thus, switching to G2 with the right configuration allowed us to run our experiments faster, or alternatively larger experiments in the same amount of time.

Distributed Bayesian Hyperparameter Optimization


Once we had optimized the single-node training and testing operations, we were ready to tackle the issue of hyperparameter optimization. If you are not familiar with this concept, here is a simple explanation: most machine learning algorithms have parameters to tune, which are often called hyperparameters to distinguish them from the model parameters that are produced as a result of the learning algorithm. For example, in the case of a Neural Network, we can think about optimizing the number of hidden units, the learning rate, or the regularization weight. In order to tune these, you need to train and test several different combinations of hyperparameters and pick the best one for your final model. A naive approach is to simply perform an exhaustive grid search over the different possible combinations of reasonable hyperparameters. However, when faced with a complex model where training each one is time consuming and there are many hyperparameters to tune, it can be prohibitively costly to perform such exhaustive grid searches. Luckily, you can do better than this by thinking of parameter tuning as an optimization problem in itself.

One way to do this is to use a Bayesian Optimization approach, where an algorithm’s performance with respect to a set of hyperparameters is modeled as a sample from a Gaussian Process. Gaussian Processes are a very effective way to perform regression, and while they can have trouble scaling to large problems, they work well when there is a limited amount of data, like what we encounter when performing hyperparameter optimization. We use the spearmint package to perform Bayesian Optimization and find the best hyperparameters for the Neural Network training algorithm. We hook spearmint up with our training algorithm by having it choose the set of hyperparameters and then training a Neural Network with those parameters using our GPU-optimized code. The model is then tested, and the test metric results are used to update the next hyperparameter choices made by spearmint.
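To make the loop concrete, here is a minimal sketch of the propose-train-update cycle using a Gaussian Process from scikit-learn; it illustrates the general idea rather than spearmint's actual interface, and the candidate sampling and acquisition rule are deliberately simplified:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def tune(train_and_test, bounds, n_iterations=30, n_candidates=1000):
    """Bayesian-style search: model past results with a GP, pick the next point by an
    upper-confidence-bound rule, evaluate it, and repeat.

    `train_and_test(params)` is a user-supplied function returning a score to maximize;
    `bounds` is a list of (low, high) pairs, one per hyperparameter.
    """
    bounds = np.asarray(bounds, dtype=float)
    tried, scores = [], []
    for _ in range(n_iterations):
        candidates = np.random.uniform(bounds[:, 0], bounds[:, 1],
                                       size=(n_candidates, len(bounds)))
        if len(tried) < 3:
            params = candidates[0]                       # bootstrap with random points
        else:
            gp = GaussianProcessRegressor(normalize_y=True)
            gp.fit(np.array(tried), np.array(scores))
            mu, sigma = gp.predict(candidates, return_std=True)
            params = candidates[np.argmax(mu + sigma)]   # explore/exploit trade-off
        tried.append(params)
        scores.append(train_and_test(params))
    best = int(np.argmax(scores))
    return tried[best], scores[best]

# Toy example over two hyperparameters (learning rate, hidden units).
best_params, best_score = tune(lambda p: -((p[0] - 0.01) ** 2 + (p[1] - 512) ** 2),
                               bounds=[(0.001, 0.1), (64, 1024)], n_iterations=10)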

We’ve squeezed high performance from our GPU but we only have 1-2 GPU cards per machine, so we would like to make use of the distributed computing power of the AWS cloud to perform the hyperparameter tuning for all configurations, such as different models per international region. To do this, we use the distributed task queue Celery to send work to each of the GPUs. Each worker process listens to the task queue and runs the training on one GPU. This allows us, for example, to tune, train, and update several models daily for all international regions.
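As a rough sketch of this dispatching pattern, the Celery snippet below defines a training task and fans hyperparameter evaluations out to GPU workers; the broker URL, task body, and region list are hypothetical placeholders:

from celery import Celery

# Each GPU machine runs one or two workers, each bound to a single GPU.
app = Celery("nn_training",
             broker="redis://scheduler-host:6379/0",
             backend="redis://scheduler-host:6379/1")

@app.task
def train_model(region, hyperparams):
    """Train and test one Neural Network on the worker's GPU; return the test metric."""
    # ... load the region's dataset and run the GPU-optimized training code here ...
    return {"region": region, "hyperparams": hyperparams, "metric": 0.0}

# On the coordinating node: fan out one job per region and hyperparameter setting,
# then collect the results once the workers finish.
if __name__ == "__main__":
    pending = [train_model.delay(region, {"hidden_units": 512, "learning_rate": 0.01})
               for region in ["US", "BR", "MX", "NL"]]
    results = [job.get() for job in pending]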

Although the Spearmint + Celery system is working, we are currently evaluating more complete and flexible solutions using HTCondor or StarCluster. HTCondor can be used to manage the workflow of any Directed Acyclic Graph (DAG). It handles input/output file transfer and resource management. In order to use Condor, we need each compute node to register with the manager using a given ClassAd (e.g. SLOT1_HAS_GPU=TRUE; STARTD_ATTRS=HAS_GPU). The user can then submit a job with the configuration "Requirements=HAS_GPU" so that the job only runs on AWS instances that have an available GPU. The main advantage of using Condor is that it also manages the distribution of the data needed for training the different models. Condor also allows us to run the Spearmint Bayesian optimization on the Manager instead of having to run it on each of the workers.

Another alternative is to use StarCluster, which is an open source cluster computing framework for AWS EC2 developed at MIT. StarCluster runs on the Oracle Grid Engine (formerly Sun Grid Engine) in a fault-tolerant way and is fully supported by Spearmint.

Finally, we are also looking into integrating Spearmint with Jobman in order to better manage the hyperparameter search workflow.
The figure below illustrates the generalized setup using Spearmint plus Celery, Condor, or StarCluster:




Conclusions


Implementing bleeding-edge solutions such as using GPUs to train large-scale Neural Networks can be a daunting endeavour. If you need to do it on your own custom infrastructure, the cost and the complexity might be overwhelming. Leveraging the public AWS cloud can have obvious benefits, provided care is taken in the customization and use of the instance resources. By sharing our experience we hope to make it much easier and more straightforward for others to develop similar applications.

We are always looking for talented researchers and engineers to join our team. So if you are interested in solving these types of problems, please take a look at some of our open positions on the Netflix jobs page.


Netflix Hack Day

by Daniel Jacobson, Ruslan Meshenberg, Matt McCarthy and Leslie Posada

At Netflix, we pride ourselves on creating a culture of innovation and experimentation. We are constantly running A/B tests on virtually every enhancement to the Netflix experience. There are other ways in which we instill this culture within Netflix, including internal events such as Netflix Hack Day, which was held last week. For Hack Day, our primary goal is to provide a fun, experimental, and creative outlet for our engineers. If something interesting and potentially useful comes from it, that is fine, but the real motivation is fun. With that spirit in mind, most teams started hacking on Thursday morning, hacked through the night, and wrapped up by Friday morning to present a demo to their peers.

It is not unusual for us to see a lot of really good ideas come from Hack Day, but last week we saw some really spectacular work. The hackers generated a wide range of ideas on just about anything, including ideas to improve developer productivity, ways to help troubleshooting, funky data visualizations, and of course a diversity of product feature ideas. These ideas were categorized, and then, to determine the winner for each category, the audience of Netflix employees rated each hack, in true Netflix fashion, on a 5-star scale.

The following are some examples of our favorite hacks and/or videos to give you a taste. Most of these hacks and videos were conceived of and produced in about 24 hours. We should also note that, while we think these hacks are very cool and fun, they may never become part of the Netflix product, internal infrastructure, or be used beyond Hack Day. We are surfacing them here publicly to share the spirit of the Netflix Hack Day.

Netflix Beam
by Sassda Chap


Radial
by Jia Pu, Aaron Tull, George Campbell


Custom Playlists
by Marco Vinicius Caldiera, Ian Kirk, Adam Excelsior Butterworth, Glenn Cho


Sleep Tracking with Fitbit
by Sam Horner, Rachel Nordman, Arlene Aficial, Sam Park, Bogdan Ciuca


Pin Protected Profiles
by Mike Kim, Dianne Marsh, Nick Ryabov


Here are some images from the event:


Thanks to all of the hackers and we look forward to the next one. If you are interested in being a part of our next Hack Day, let us know!

Also, we will be hosting our next Open Source meetup at Netflix HQ in Los Gatos on March 12th at 6:30pm. If you are interested, please sign up while there are still slots.

The Netflix Dynamic Scripting Platform

The Engine that powers the Netflix API
Over the past couple of years, we have optimized the Netflix API with a view towards improving performance and increasing agility. In doing so, the API has evolved from a provider of RESTful web services to a platform that distributes development and deployment of new functionality across various teams within Netflix.

At the core of the redesign is a Dynamic Scripting Platform which provides us the ability to inject code into a running Java application at any time. This means we can alter the behavior of the application without a full scale deployment. As you can imagine, this powerful capability is useful in many scenarios. The API Server is one use case, and in this post, we describe how we use this platform to support a distributed development model at Netflix.

As a reminder, devices make HTTP requests to the API to access content from a distributed network of services. Device Engineering teams use the Dynamic Scripting Platform to deploy new or updated functionality to the API Server based on their own release cycles. They do so by uploading adapter code in the form of Groovy scripts to a running server and defining custom endpoints to front those scripts. The scripts are responsible for calling into the Java API layer and constructing responses that the client application expects. Client applications can access the new functionality within minutes of the upload by requesting the new or updated endpoints. The platform can support any JVM-compatible language, although at the moment we primarily use Groovy.

Architecting for scalability is a key goal for any system we build at Netflix and the Scripting Platform is no exception. In addition, as our platform has gained adoption, we are developing supporting infrastructure that spans the entire application development lifecycle for our users. In some ways, the API is now like an internal PaaS system that needs to provide a highly performant, scalable runtime; tools to address development tasks like revision control, testing and deployment; and operational insight into system health. The sections that follow explore these areas further.

Architecture

Here is a view into the API Server under the covers.

[1] Endpoint Controller routes endpoint requests to the corresponding groovy scripts. It consults a registry to identify the mapping between an endpoint URI and its backing scripts.

[2] Script Manager handles requests for script management operations. The management API is exposed as a RESTful interface.

[3] Script source, compiled bytecode and related metadata are stored in a Cassandra cluster, replicated across all AWS regions.

[4] Script Cache is an in-memory cache that holds compiled bytecode fetched from Cassandra. This eliminates the Cassandra lookup during endpoint request processing. Scripts are compiled by the first few servers running a new API build, and the compiled bytecode is persisted in Cassandra. At startup, a server looks for persisted bytecode for a script before attempting to compile it in real time. Because deploying a set of canary instances is a standard step in our delivery workflow, the canary servers are the ones to incur the one-time penalty for script compilation. The cache is refreshed periodically to pick up new scripts.

[5] Admin Console and Deployment Tools are built on top of the script management API.

Script Development and Operations

Our experience in building a Delivery Pipeline for the API Server has influenced our thinking around the workflows for script management. Now that a part of the client code resides on the server in the form of scripts, we want to simplify the ways in which Device teams integrate script management activities into their workflows. Because technologies and release processes vary across teams, our goal is to provide a set of offerings from which they can pick and choose the tools that best suit their requirements.

The diagram below illustrates a sample script workflow and calls out the tools that can be used to support it. It is worth noting that such a workflow would represent just a part of a more comprehensive build, test and deploy process used for a client application.



Distributed Development

To get script developers started, we provide them with canned recipes to help with IDE setup and dependency management for the Java API and related libraries.  In order to facilitate testing of the script code, we have built a Script Test Framework based on JUnit and Mockito. The Test Framework looks for test methods within a script that have standard JUnit annotations, executes them and generates a standard JUnit result report. Tests can also be run against a live server to validate functionality in the scripts.

Additionally, we have built a REPL tool to facilitate ad-hoc testing of small scripts or snippets of Groovy code that can be shared as samples, used for debugging, etc.


Distributed Deployment

As mentioned earlier, the release cycles of the Device teams are decoupled from those of the API Team. Device teams have the ability to dynamically create new endpoints, update the scripts backing existing endpoints or delete endpoints as part of their releases. We provide command line tools built on top of our Endpoint Management API that can be used for all deployment related operations. Device teams use these tools to integrate script deployment activities with their automated build processes and manage the lifecycle of their endpoints. The tools also integrate with our internal auditing system to track production deployments.

Admin Operations & Insight

Just as the operation of the system is distributed across several teams, so is the responsibility of monitoring and maintaining system health. Our role as the platform provider is to equip our users with the appropriate level of insight and tooling so they can assume full ownership of their endpoints. The Admin Console provides users full visibility into real time health metrics, deployment activity, server resource consumption and traffic levels for their endpoints.








Engineers on the API team can get an aggregate view of the same data, as well as other metrics like script compilation activity that are crucial to track from a server health perspective. Relevant metrics are also integrated into our real time monitoring and alerting systems.


The screenshot to the left is from a top-level dashboard view that tracks script deployment activity.

Experiences and Lessons Learnt

Operating this platform with an increasing number of teams has taught us a lot and we have had several great wins!
  • Client application developers are able to tailor the number of network calls and the size of the payload to their applications. This results in more efficient client-side development and overall, an improved user experience for Netflix customers.
  • Distributed, decoupled API development has allowed us to increase our rate of innovation.
  • Response and recovery times for certain classes of bugs have gone from days or hours down to minutes. This is even more powerful in the case of devices that cannot easily be updated or require us to go through an update process with external partners.
  • Script deployments are inherently less risky than server deployments because the impact of the former is isolated to a certain class or subset of devices. This opens the door for increased nimbleness.
  • We are building an internal developer community around the API. This provides us an opportunity to promote collaboration and sharing of  resources, best practices and code across the various Device teams.

As expected, we have had our share of learnings as well.
  • The flexibility of our platform permits users to utilize the system in ways that might be different from those envisioned at design time. The strategy that several teams employed to manage their endpoints put undue stress on the API server in terms of increased system resources, and in one instance, caused a service outage. We had to react quickly with measures in the form of usage quotas and additional self-protect switches while we identified design changes to allow us to handle such use cases.
  • When we chose Cassandra as the repository for script code for the server, our goal was to have teams use their SCM tool of choice for managing script source code. Over time, we are finding ourselves building SCM-like features into our Management API and tools, as we work to address the needs of various teams. It has become clear to us that we need to offer a unified set of tools that cover deployment and SCM functionality.
  • The distributed development model combined with the dynamic nature of the scripts makes it challenging to understand system behavior and debug issues. RxJava introduces another layer of complexity in this regard because of its asynchronous nature. All of this highlights the need for detailed insights into scripts’ usage of the API.

We are actively working on solutions for the above and will follow up with another post when we are ready to share details.

Conclusion


The evolution of the Netflix API from a web services provider to a dynamic platform has allowed us to increase our rate of innovation while providing a high degree of flexibility and control to client application developers. We are investing in infrastructure to simplify the adoption and operation of this platform. We are also continuing to evolve the platform itself as we find new use cases for it. If this type of work excites you, reach out to us - we are always looking to add talented engineers to our team!

The Netflix Signup Flow - Our Journey to a Responsive Design


by Joel Sass

In the Spring of 2013, the User Experience team was gearing up for the impending Netherlands country launch scheduled for September. To reduce barriers to adoption, we wanted to launch with a smooth signup experience on smartphones and tablets. Additionally, we were planning to implement the iDEAL online payment method, which is commonly used in the Netherlands but new to us both technically and from a user experience perspective.

At that time, we had two very different technology stacks that served our signup experience: one for desktop browsers and one for mobile and tablet browsers. There were substantial differences in the way these two websites worked, and they shared no common platform. Each website had unique capabilities, but the desktop site provided a much larger superset of features compared to the mobile optimized site.

To move forward with enabling iDEAL and other new payment methods across multiple platforms, we quickly came to the conclusion that the best way forward was a unified approach to supporting multiple platforms using a single UI. This ultimately started us on our path towards responsive web design (RWD). Responsive design is a technique for delivering a consistent set of functionality across a wide range of screen sizes, from a single website. For our effort, we focused on the following goals:

  • Enable access to all features and capabilities, regardless of device
  • Deliver a consistent user experience that is optimized for device capabilities, including screen size and input method

Cross-functional alignment and prototyping

In order to successfully tackle a responsive design project, we first needed to answer the following question for our team: what is responsive design? In initial brainstorming meetings, we aligned around a common definition that emphasizes the use of CSS and JS to adapt a common user experience to varying screen sizes and input methods.

At Netflix, we like to move quickly and not let unnecessary process slow down our ability to innovate and move fast. Rapid design, rapid development. To kickstart this project, we assembled a core group of designers and user interface engineers, and held a weeklong workshop. The end goal was to produce a working prototype of a responsively designed signup flow.

This workshop approach allowed us to streamline the entire design and development process, with developers and designers working side by side to immediately tackle issues as they came up. In effect, we were pair programming. This allowed us to minimize the need for comps and wireframes, and to develop straight in the browser. It provided the freedom to easily experiment with different design and engineering techniques, and to identify common patterns that could be used across the entire signup flow. By the end of that week we had created a prototype of a fully functional, looks-good-on-whatever-device-you’re-using Netflix signup experience.

A/B testing and rollout

With a functional responsive flow now built, we came to the final stretch. Being highly data-driven at Netflix, we wanted to measure whether our changes were having a positive impact on how customers interact with the signup flow. So we cleaned up our prototype, turned it into a production-quality experience, and tested it globally against the current split stacks. We saw no impact on desktop signups, but we did see an increase in conversion rates on phones and tablets due to the additional payment types we enabled for those devices. The results of our A/B test gave us the confidence to roll out this new signup flow as the default experience in all markets, and retire our separate phone/tablet stack.

App integration

However, we didn’t stop there. We had also supported signup from within our Android app prior to this project. Following our global rollout for browser-based signups, we quickly integrated our newly responsive signup experience into our Android app as an embedded web view. This enabled us to get much more leverage out of our responsive design investment. More about this unique approach in a future post.

Many devices, one platform

Whether on a 30” screen or a 4” one, our customers are now provided with an experience that works well and looks great across a wide range of devices. Development of the signup experience has been streamlined, increasing developer productivity. We now have a single platform that serves as the foundation for all signup A/B testing and innovation, and our customers are afforded the same options regardless of device.

Our journey towards responsive design has not ended, however. As device platforms evolve, and as user expectations change, our designers and engineers are constantly working towards enhancing cross-platform experiences for our customers.

What’s next?

In upcoming posts, we will further explore how we built our responsive signup flow. Some of the topics we will cover are: a deeper dive into the client and server-side techniques we used to integrate our browser experience into our Android app, a more in-depth look at the decisions we made in regards to mobile-first vs desktop-first approaches, and a review of the challenges in dealing with responsive images.

Can’t wait until the next post? Interested in learning more about responsive design? Come see Chris Saint-Amant, Netflix UI Engineering Manager, discuss the next generation of responsive design at SXSW Interactive in Austin, TX on March 8th.

Are you interested in working on cross-platform design and engineering challenges? We’re hiring front-end innovators to help us realize our vision. Check out our jobs.

NetflixOSS Season 2, Episode 1


Wondering what this headline means? It means that NetflixOSS continues to grow, both in the number of projects that are now available and in the use by others.


We held another NetflixOSS Meetup in our Los Gatos, Calif., headquarters last night. Four companies came specifically to share what they’re doing with the NetflixOSS Platform:
  • Matt Bookman from Google shared how to leverage NetflixOSS Lipstick on top of Google Compute Engine. Lipstick combines a graphical depiction of a Pig workflow with information about the job as it executes, giving developers insight that previously required a lot of sifting through logs (or a Pig expert) to piece together.
  • Andrew Spyker from IBM shared how they’re leveraging many of the NetflixOSS components on top of the SoftLayer infrastructure to build real-world applications - beyond the AcmeAir app that won last year’s Cloud Prize.
  • Peter Sankauskas from Answers4AWS talked about the motivation behind his work on NetflixOSS components setup automation, and his work towards 0-click setup for many of the components.
  • Daniel Chia from Coursera shared how they utilize NetflixOSS Priam and Aegithus to work with Cassandra.


Since our previous NetflixOSS Meetup, we have open sourced several new projects in many areas: Big Data tools and solutions, scalable data pipelines, language-agnostic storage solutions, and more. At yesterday’s Meetup, Netflix engineers talked about recent projects and gave previews of projects that may be released soon.


  • Zeno - in memory data serialization and distribution platform
  • Suro - a distributed data pipeline which enables services to move, aggregate, route and store data
  • STAASH - a language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems
  • A preview of Dynomite - a thin Dynamo-based replication for cached data
  • Aegithus - a bulk data pipeline out of Cassandra
  • PigPen - Map-Reduce for Clojure
  • S3mper - a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
  • A preview of Inviso - a performance focused Big Data tool


All the slides are available on Slideshare:






In preparation for the event, we spruced up our Github OSS site - all components now feature new cool icons:






The event itself was a full house - people at the demo stations were busy all evening answering many questions about the components they wrote and open sourced.


It was great to see how many of our Open Source components are being used outside of Netflix. We hear of many more companies that are using and contributing to NetflixOSS components. If you’re one of them and would like to have your logo featured on the “Powered by NetflixOSS” page of our Github site, contact us at netflixoss@netflix.com.


If you’re interested in hearing more about upcoming NetflixOSS projects and events, follow @NetflixOSS on Twitter and join our Meetup group. The slides for this event are available on Slideshare, and videos will be uploaded shortly.





Going Reactive - Asynchronous JavaScript at Netflix


We held the first in a series of JavaScript talks at our Los Gatos, Calif. headquarters on March 19th. Our own Jafar Husain shared how we’re using the Reactive Extensions (Rx) library to build responsive UIs across our device experiences.

The talk goes into detail on how Netflix uses Rx to:
  • Declaratively build complex events out of simple events (e.g. drag and drop)
  • Coordinate and sequence multiple Ajax requests
  • Reactively update the UI in response to data changes
  • Eliminate memory leaks caused by neglecting to unsubscribe from events
  • Gracefully propagate and handle asynchronous exceptions

There was a great turnout for the event and the audience asked some really good questions. We hope to see you at our next event!

You can watch the video of the presentation at: https://www.youtube.com/watch?v=XRYN2xt11Ek

Women in Technology meetup at Netflix!

Last week, Netflix welcomed about 150 women to its campus for a set of lightning talks followed by demos and networking. Organized in coordination with CloudNOW, the event was high on the fun and high on the tech.

20140327_184741.jpg

Lightning talks included:

Devika Chawla, Director of Engineering, Netflix. Devika’s talk was titled “In Pursuit of Rapid Messaging Innovation”. The Messaging Platform has probably sent you a message via email, push, or in-app messaging channels. We learned how her team is building a platform for rapid innovation by de-coupling from senders and moving to a system driven by dynamic metadata.

Alolita Sharma from Wikipedia talked about “Rich Media Content Delivery and Open Source”. The challenges that her team faces regarding the sheer number of languages that they support are interesting -- and potentially foreshadow what many of us will face in the future.

Tracy Wright traveled up from the Netflix LA office to talk about how her team managed a technology migration from a data-center-based, high-touch workflow to a cloud-based, exception-based approach in “Cloud Migration for Large Scale Content Operations”. The success of this transition depended on a team mindset migration as well as the toolset migration to the cloud, which underscored the value of effective change management.

Seema Jethani, from Enstratius, talked about the challenges that we all face in “Approach to Tool and Technology Choices”.

Wondering how (and why) Netflix has implemented a “Flexible Billing Integration Architecture”? Nirmal Varadarajan described this, and tied it back to the earlier talk given by Devika. The presentation focused on ways to build components that can be reused for large scale incremental migration and a flexible events pipeline to facilitate an easy way to share data between closely aligned but loosely coupled systems.

Evelyn De Souza joined us from Cisco Systems. She presented from her work on “Cloud Data Protection Certification”. You can learn more about Evelyn and her work at the Cloud Security Alliance.

Continuing with the theme of “Agility at Scale”, Sangeeta Narayanan explained how the Netflix Edge Engineering team tackles the challenge of moving fast at scale. She described the importance of building agility into system architecture as well as investing in Continuous Delivery. She showed us some screen shots of their Continuous Delivery dashboard, which was presented at the Edge demo station as well.

Eva Tse wrapped up the lightning talks with her discussion of “Running a Big Data Platform in the Cloud”. She showcased how they leverage the Hadoop ecosystem and the AWS S3 service to architect Netflix’s cloud-native big data platform. Eva’s team is very active in open source, and many of the tools and services built by the team are available on netflix.github.com.

There were several demo stations set up, and attendees lingered with food and drink, enjoying the demos and networking. The CodeChix featured a Pytheas demo, based on Netflix OSS!

If you missed the event, you can watch the recording here.

It was great to see so many people at this meetup!

Improving the performance of our JavaScript inheritance model


by Dylan Oudyk

When building the initial architecture for Netflix’s common TV UI in early 2009, we knew we wanted to use inheritance to share common functionality in our codebase. Since the first engineers working on the code were coming from the heavily Java-based Website, they preferred simulating a classical inheritance model: an easy way to extend objects and override functions, but also have the ability to invoke the overridden function. (Since then we’ve moved on to greener pastures, or should we say design patterns, but the classical inheritance model still lives on and remains useful within our codebase.) For example, every view in our UI inherits from a base view with common methods and properties, like show, hide, visible, and parent. Concrete views extend these methods while still retaining the base view behavior.


After searching around and considering a number of approaches, we landed on John Resig’s Simple JavaScript Inheritance. We really liked the syntactic sugar of this._super(), which allows you to invoke the super-class function from within the overriding function.


Resig’s approach allowed us to write simple code like this:

var Human = Class.extend({
    init: function(height, weight) {
        this.height = height;
        this.weight = weight;
    }
});

var Mutant = Human.extend({
    init: function(height, weight, abilities) {
        this._super(height, weight);
        this.abilities = abilities;
    }
});

Not so super, after all


While his approach was sugary sweet and simple, we’d always been leery of a few aspects, especially considering our performance- and memory-constrained environment on TV devices:
  • When extending an object, it loops over the prototype to find functions that use _super
  • To find _super it decompiles a function into a string and tests that string against a regular expression.
  • Any function using _super is then wrapped in a closure to achieve the super-class chaining.


All that sugar was slowing us down. We found that the _super implementation was about 70% slower at executing overridden functions than a vanilla approach of calling the super-class function directly. The overhead of invoking the closure, which invokes the overriding function, which in turn invokes the inherited function, adds up quickly on the execution stack.  Our application not only runs on beefy game consoles like the PS3, PS4 and Xbox 360, but also on single-core Blu-ray players with 600 MHz ARM processors.  A great user experience demands that the code run as fast as possible for the UI to remain responsive.


Not only was _super slowing us down, it was gobbling up memory. We have an automation suite that stress tests the UI by putting it through its paces: browsing the entire grid of content, conducting multiple playbacks, searching, etc. We run it against our builds to ensure our memory footprint remains relatively stable and to catch any memory leaks that might accidentally get introduced. On a PS3, we profiled the memory usage with and without _super.  Our codebase had close to 720 uses of _super, which consumed about 12.2 MB. 12.2 MB is a drop in the bucket in the desktop browser world, but we work in a very constrained environment where 12.2 MB represents a large drop in a small bucket.


Worse still, when we went to move off the now-deprecated YUI Compressor, more aggressive JavaScript minifiers like UglifyJS and Google’s Closure Compiler would obfuscate _super, causing the regular expression test to fail and breaking our run-time execution.


We knew it was time to find a better way.


All your base are belong to us


We wanted to remove the performance and memory bottlenecks and also unblock ourselves from using a new minifier, but without having to rewrite significant portions of our large, existing codebase.


The _super implementation basically uses the wrapping closure as a complex pointer with references to the overriding function and the inherited function: 




If we could find a way to remove the middle man, we’d be able to have the best of both worlds.



We were able to leverage a lesser known feature in JavaScript, named function expressions, to cut out the expensive work that _super had been doing.


Now when we’re extending an object, we loop over the prototype and add a base property to every function. This base property points to the inherited function.


for (name in source) {
    value = source[name];
    currentValue = receiver[name];
    if (typeof value === 'function' && typeof currentValue === 'function' &&
        value !== currentValue) {
        value.base = currentValue;
    }
}

We use a named function expression within the overriding function to invoke the inherited function.


var Human = Class.extend({
    init: function(height, weight) {
        this.height = height;
        this.weight = weight;
    }
});

var Mutant = Human.extend({
    init: function init(height, weight, abilities) {
        init.base.call(this, height, weight);
        this.abilities = abilities;
    }
});

var theWolverine = new Mutant('5ft 3in', 300, [
    'adamantium skeleton',
    'heals quickly',
    'claws'
]);

(Please note that if you need to support Internet Explorer versions below 9, this may not be an option for you; arguments.callee will be available instead. More details at Named function expressions demystified.)


Arguably, we lost a teeny bit of sugar but at significant savings. The base approach is about 45% faster in executing an overridden method than _super. Add to that a significant memory savings.





As can be seen from the graph above, after running our application for one minute the memory savings are close to 12.2 MB.  We could pull the line back to the beginning and the savings would be even greater, because after one minute the application code has long since been interpreted and the classes have been created.


Conclusion
We believe we found a great way to invoke overridden methods with the named function expression approach.  By replacing _super, we saved RAM and CPU cycles.  We’d rather save the RAM for gorgeous artwork and our streaming video buffer.  The saved CPU cycles can be put to work on beautiful transitions.  Overall a change that improves the experience for our users.


References:
John Resig’s Simple JavaScript Inheritance System http://ejohn.org/blog/simple-javascript-inheritance/


Invoking Overridden Methods Performance Tests: http://jsperf.com/fun-with-method-overrides/2

Named Function Expressions: http://kangax.github.io/nfe/

Building Netflix Playback with Self-Assembling Components




Our 48 million members are accustomed to seeing a screen like this, whether on their TV or one of the 1000+ other Netflix devices they enjoy watching on. But the simple act of pressing play calls into action a deep and complex system that handles the DRM licenses, contract evaluations, CDN selection, and more. This system is known internally as the Playback Service and is responsible for making your Netflix streaming experience seem effortless.

The original Playback Service was built before Netflix was synonymous with streaming.  As our product matured, the existing architecture became more difficult to support and started showing signs of stress.  For example, we debuted HTML5 support last summer on IE11 in a major step towards standards-based playback on web platforms.  This required adoption of a new device security model that works with the emerging HTML5 Web Cryptography API.  It would have been very challenging to integrate this new model into our original architecture due to poor separation of business and security logic.  This and other shortcomings pushed us to re-imagine our design, leading to a radically different solution with the following as our key design requirements:

  • Operation at massive scale
  • High velocity innovation
  • Reusability of components


High-level Architecture

The new Playback Service uses self-assembling components to handle the enormous volume of traffic Netflix gets each day.  The following is an overview of that architecture, with special focus on how requests are delegated to these components, which are dynamically assembled and executed within an internal rule engine.

[Figure: Playback Service architecture]

We will examine the building blocks of this architecture and show the benefits of using small, reusable components that are automatically wired together to create an emergent system.  This post will focus on the smallest units of the new architecture and the ideas behind self-assembly.  It will be followed up by others that go deeper into how we implement these concepts in the new architecture and address some challenges inherent to this approach.


Bottom-up

We started from the bottom up, defining the building blocks of the new architecture in a way that promotes loose coupling and clear separation of concerns.  These building blocks are called Processors: entities that take zero or more inputs and generate no more than one output.  These are the smallest computational unit of the architecture and behave like commands à la the Gang of Four.  Below is a diagram that shows a processor that takes A, B, C and generates D.
This metaphor generalizes well given that many complex tasks can be subdivided into discrete, function-like steps.  It also matches the way most engineers already think about problem decomposition.  This definition enables processors to be as specialized as necessary, promoting low interconnectedness with other parts of the system.  These qualities make the system easier to reason about, enhance, and test.
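To make the metaphor concrete, here is a minimal Python sketch of a processor definition; the dataclass representation and the make_d example are illustrative assumptions of ours, not the actual Playback Service code.

from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class Processor:
    name: str
    inputs: FrozenSet[str]      # zero or more named inputs
    output: str                 # at most one named output
    fn: Callable[..., object]

# A processor that takes A, B, C and generates D, as in the diagram above.
make_d = Processor(
    name="make_d",
    inputs=frozenset({"A", "B", "C"}),
    output="D",
    fn=lambda A, B, C: ("D", A, B, C),
)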

The Playback Service--like other complex systems--can be modelled as a black box that takes inputs and generates outputs.  The conversion of some input A to an output E can be defined as a function f(A) = E and modelled as a single processor.
Of course, using a single processor only makes sense for very simple systems.  More complex services  would be decomposed into finer-grained processors as illustrated below.
Here you can see that the computation of E is handled as several processor invocations.  This flow resembles a series of function calls in a Java program, but there are some fundamental differences.  The difficulty with normal functions is that someone has to invoke them and decide how they are wired together.  Essentially, the decomposition of f(A) = E above is usually something the entire team needs to understand and maintain.  This places a cap on system evolution since scaling the system means scaling each engineer.  It also increases the cost of scaling the team since minimum ramp-up time is directly proportional to the system complexity.

But what if you could have functions that self-assemble?  What if processors could simply advertise their inputs/outputs and the wiring between them were an emergent property of that particular collection of processors?


Self-assembling Components

The hypothesis is that complex systems can be built efficiently if they are reduced to small, local problems that are solved in relative isolation with processors.  These small blocks are then automatically assembled to reveal a fully formed system.  Such a system would no longer require engineers to understand their entire scope before making significant contributions.  These systems would be free to scale without taxing their engineering teams proportionally.  Likewise, their teams could grow without investing in lots of onboarding time for each new member.

We can use the decomposition we did above for f(A) = E to illustrate how self-assembly would work.  Here is a simplified version of the diagram we saw earlier.
This system solves for A => E using the processors shown.  However, this could be a more sophisticated system containing other processors that do not participate in the computation of E given A.  Consider the following, where the system’s complete set of processors is included in the diagram.

The other processors are inactive for this computation, but various combinations would become active under different inputs.  Take a case where the inputs were J and W, and processors were in place to handle these inputs such that the computation J,W => Y was possible.

The inputs J and W would trigger a different set of processors than before, leaving those that computed A => E dormant.

The set of processors triggered for a given input is an emergent property of the complete set of processors within the system.  An assembler mechanism determines when each processor can participate in the computation.  It makes this decision at runtime, allowing fully dynamic wiring for each request.  As a result, processors can be organized in any way and do not need to be aware of each other.  This makes their functionality easier to add, remove, and update than conventional mechanisms like switch statements or inheritance, which are statically determined and more rigidly structured.
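As a rough, hedged sketch of how such an assembler might behave (our own illustration, not Netflix's actual rule engine), the loop below keeps firing any processor from the earlier sketch whose inputs are all available and whose output has not yet been produced; everything else stays dormant for that request.

def assemble_and_run(processors, initial_values):
    # 'processors' are objects shaped like the Processor sketch above;
    # 'initial_values' maps input names to values, e.g. {"A": ...} or {"J": ..., "W": ...}.
    values = dict(initial_values)
    fired = set()
    progress = True
    while progress:
        progress = False
        for p in processors:
            if p.name in fired or p.output in values:
                continue
            if p.inputs.issubset(values):
                values[p.output] = p.fn(**{k: values[k] for k in p.inputs})
                fired.add(p.name)
                progress = True
    # Called with {"A": ...}, the A => E chain fires; called with
    # {"J": ..., "W": ...}, a different subset of processors activates instead.
    return values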

Extending traditional systems often means ramping up on a lot of code to understand where the relevant inflection points are for a change or feature.  Self-assembly relaxes the urgency for this deeper context and shifts the focus towards getting the right interaction designs for each component.  It also enables more thorough testing since processors are naturally isolated from each other and simpler to unit test.  They can also be assembled and run as a group with mocked dependencies to facilitate thorough end-to-end validation.

Self-assembly frees engineers to focus on solving local problems and adding value without having to wrestle with the entire end-to-end context.  State validation is a good example of an enhancement that requires only local context with this architecture.  The computation of J,W => Y above can be enhanced to include additional validation of V whenever it is generated.  This could be achieved by adding a new processor that operates on V as an input, as illustrated below.

The new processor V => V would take a value and raise an error if that value is invalid for some reason.  This validation would be triggered whenever V is present in the system, whether or not J,W => Y is being computed.  This is by design: each processor is reused whenever its services are needed.

This validator pattern emerges often in the new Playback Service.  For example, we use it to detect whether data sent by clients has been tampered with mid-flight.  This is done using HMAC calculations to verify the data matches a client-provided hash value.  As with other processors, the integrity protection service provided this way is available for use during any request.
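For the HMAC case specifically, the validation reduces to something like the following sketch, written here in Python with the standard hmac and hashlib modules purely for illustration; the function and parameter names are assumptions, not the actual service code.

import hmac
import hashlib

def validate_integrity(payload: bytes, client_digest: str, shared_key: bytes) -> None:
    # Recompute the HMAC over the client-supplied payload and compare it to the
    # digest the client sent alongside it.
    expected = hmac.new(shared_key, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information during the comparison.
    if not hmac.compare_digest(expected, client_digest):
        raise ValueError("payload integrity check failed")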


Challenges of Self-assembly

The use of self-assembling components offers clear advantages over hand wiring.  It enables fluid architectures that can change dynamically at runtime and simplifies feature isolation so components can evolve rapidly with minimal impact to the overall system.  Moreover, it decouples team size from system complexity so the two can scale independently.

Despite these benefits, building a working solution that enables self-assembly is non-trivial.  Such a system has to decide which operations are executed when, and in what order.  It has to manage the computation pipeline without adding too much overhead or complexity; all while scaling up with the set of processors.  It also needs to be relatively unobtrusive so developers can remain focused on building the service.  These were some of the challenges my team had to overcome when building the new Playback architecture atop the concepts of self-assembly.


Upcoming...

Subsequent blog posts will take us deeper into the workings of the new Playback Service architecture and provide more details about how we solved the challenges above and other issues intrinsic to self-assembly.  We will also be discussing how this architecture is designed to enable fully dynamic end-points (where the set of rules/processors can change for each request) as well as dynamic services where the set of end-points can change for a running server.

The new Playback Service architecture based on self-assembling components provides a flexible programming model that is easy to develop and test.  It greatly improves our ability to innovate as we continue to enhance the viewing experience for our members.

We are always looking for talented engineers to join us.  So reach out if you are excited about this kind of engineering endeavor and would like to learn more about this and other things we are working on.



HTML5 Video in Safari on OS X Yosemite

By Anthony Park and Mark Watson.

We're excited to announce that Netflix streaming in HTML5 video is now available in Safari on OS X Yosemite! We've been working closely with Apple to implement the Premium Video Extensions in Safari, which allow playback of premium video content in the browser without the use of plugins. If you're in Apple's Mac Developer Program, or soon the OS X Beta Program, you can install the beta version of OS X Yosemite. With the OS X Yosemite Beta on a modern Mac, you can visit Netflix.com today in Safari and watch your favorite movies and TV shows using HTML5 video without the need to install any plugins.

We're especially excited that Apple implemented the Media Source Extensions (MSE) using their highly optimized video pipeline on OS X. This lets you watch Netflix in buttery smooth 1080p without hogging your CPU or draining your battery. In fact, this allows you to get up to 2 hours longer battery life on a MacBook Air streaming Netflix in 1080p - that’s enough time for one more movie!

Apple also implemented the Encrypted Media Extensions (EME) which provides the content protection needed for premium video services like Netflix.

Finally, Apple implemented the Web Cryptography API (WebCrypto) in Safari, which allows us to encrypt and decrypt communication between our JavaScript application and the Netflix servers.

The Premium Video Extensions do away with the need for proprietary plugin technologies for streaming video. In addition to Safari on OS X Yosemite, plugin-free playback is also available in IE 11 on Windows 8.1, and we look forward to a time when these APIs are available on all browsers.

Congratulations to the Apple team for advancing premium video on the web with Yosemite! We’re looking forward to the Yosemite launch this Fall.

Scale and Performance of a Large JavaScript Application


We recently held our second JavaScript Talks event at our Netflix headquarters in Los Gatos, Calif. Matt Seeley discussed the development approaches we use at Netflix to build the JavaScript applications which run on TV-connected devices, phones and tablets. These large, rich applications run across a wide range of devices and require carefully managing network resources, memory and rendering. This talk explores various approaches the team uses to build well-performing UIs, monitor application performance, write consistent code, and scale development across the team.

You can watch the video of the talk at: http://youtu.be/TiDk8f9bojc

Slides from the talk are also available at: https://speakerdeck.com/mseeley/life-on-the-grid


Optimizing the Netflix Streaming Experience with Data Science



On January 16, 2007, Netflix started rolling out a new feature: members could now stream movies directly on their browser without having to wait for the red envelope in the mail. This event marked a substantial shift for Netflix and the entertainment industry. A lot has changed since then. Today, Netflix delivers over 1 billion hours of streaming per month to 48 million members in more than 40 countries. And Netflix accounts for more than a third of peak Internet traffic in the US. This level of engagement results in a humungous amount of data.
 
At Netflix, we use big data for deep analysis and predictive algorithms to help provide the best experience for our members. A well-known example of this is the personalized movie and show recommendations that are tailored to each member's tastes. The Netflix Prize, launched in 2007, highlighted Netflix's focus on recommendations. Another area that we're focusing on is the streaming quality of experience (QoE), which refers to the user experience once the member hits play on Netflix. This is an area that benefits significantly from data science and algorithms/models built around big data.

Netflix is committed to delivering an outstanding streaming service and is investing heavily in advancing the state of the art in adaptive streaming algorithms and network technologies such as Open Connect to optimize streaming quality. Netflix won a Primetime Emmy Engineering Award in 2012 for the streaming service. To put even more focus on "streaming science," we've created a new team at Netflix that's working on innovative approaches for using our data to improve QoE. In this post, I will briefly outline the types of problems we're solving, which include:

  • Understanding the impact of QoE on user behavior
  • Creating a personalized streaming experience for each member
  • Determining what movies and shows to cache on the edge servers based on member viewing behavior
  • Improving the technical quality of the content in our catalog using viewing data and member feedback
Understanding the impact of QoE on user behavior
User behavior refers to the way users interact with the Netflix service, and we use our data to both understand and predict behavior. For example, how would a change to our product affect the number of hours that members watch? To improve the streaming experience, we look at QoE metrics that are likely to have an impact on user behavior. One metric of interest is the rebuffer rate, which is a measure of how often playback is temporarily interrupted while more data is downloaded from the server to replenish the local buffer on the client device. Another metric, bitrate, refers to the quality of the picture that is served/seen - a very low bitrate corresponds to a fuzzy picture. There is an interesting relationship between rebuffer rate and bitrate. Since network capacity is limited, picking too high of a bitrate increases the risk of hitting the capacity limit, running out of data in the local buffer, and then pausing playback to refill the buffer. What’s the right tradeoff?
There are many more metrics that can be used to characterize QoE, but the impact that each one has on user behavior, and the tradeoffs between the metrics need to be better understood. More technically, we need to determine a mapping function that can quantify and predict how changes in QoE metrics affect user behavior. Why is this important? Understanding the impact of QoE on user behavior allows us to tailor the algorithms that determine QoE and improve aspects that have significant impact on our members' viewing and enjoyment.
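As a toy illustration of such a mapping function - the functional form and every number below are invented for the example, not Netflix's actual model - one could score each candidate bitrate with a hypothetical engagement estimate that rewards picture quality and penalizes expected rebuffers:

def expected_engagement(bitrate_kbps: float, rebuffer_rate: float) -> float:
    picture_quality_term = bitrate_kbps ** 0.3     # diminishing returns on quality
    rebuffer_penalty = 50.0 * rebuffer_rate        # interruptions hurt a lot
    return picture_quality_term - rebuffer_penalty

def pick_bitrate(candidates_kbps, bandwidth_kbps):
    best = None
    for b in candidates_kbps:
        # Crude assumption: rebuffer risk grows as we approach the capacity limit.
        utilization = b / bandwidth_kbps
        rebuffer_rate = max(0.0, utilization - 0.8) * 0.5
        score = expected_engagement(b, rebuffer_rate)
        if b <= bandwidth_kbps and (best is None or score > best[1]):
            best = (b, score)
    return best[0] if best else min(candidates_kbps)

# pick_bitrate([235, 750, 1750, 3000, 5800], bandwidth_kbps=4000) -> 3000
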
Improving the streaming experience

[Figure: The Netflix Streaming Supply Chain - opportunities to optimize the streaming experience exist at multiple points]


How do we use data to provide the best user experience once a member hits play on Netflix?

Creating a personalized streaming experience

One approach is to look at the algorithms that run in real-time or near real-time once playback has started, which determine what bitrate should be served, what server to download that content from, etc.
With vast amounts of data, the mapping function discussed above can be used to further improve the experience for our members at the aggregate level, and even to personalize the streaming experience based on what the function looks like for each member's "QoE preference." Personalization can also be based on a member's network characteristics, device, location, etc. For example, a member with a high-bandwidth connection on a home network could have very different expectations and experience compared to a member with low bandwidth on a mobile device on a cellular network.

Optimizing content caching

A set of big data problems also exists on the content delivery side. Open Connect is Netflix's own content delivery network that allows ISPs to directly connect to Netflix servers at common internet exchanges, or place a Netflix-provided storage appliance (cache) with Netflix content on it at ISP locations. The key idea here is to locate the content closer (in terms of network hops) to our members to provide a great experience.

One of several interesting problems here is to optimize decisions around content caching on these appliances based on the viewing behavior of the members served. With millions of members, a large catalog, and limited storage capacity, how should the content be cached to ensure that when a member plays a particular movie or show, it is being served out of the local cache/appliance?
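One naive way to frame this (purely illustrative; the actual Open Connect caching algorithms are more sophisticated and are not shown here) is as a knapsack-style ranking that fills an appliance with the titles predicted to deliver the most viewing hours per gigabyte of storage:

def plan_cache(titles, capacity_gb):
    """titles: iterable of (title_id, predicted_hours, size_gb) tuples."""
    # Rank by predicted viewing hours per gigabyte, then greedily fill the appliance.
    ranked = sorted(titles, key=lambda t: t[1] / t[2], reverse=True)
    cached, used = [], 0.0
    for title_id, hours, size_gb in ranked:
        if used + size_gb <= capacity_gb:
            cached.append(title_id)
            used += size_gb
    return cached

popular = [("house_of_cards_s2e1", 120000, 4.0),
           ("breaking_bad_s5e14", 90000, 3.5),
           ("rarely_watched_doc", 300, 2.0)]
print(plan_cache(popular, capacity_gb=7.5))  # ['house_of_cards_s2e1', 'breaking_bad_s5e14']
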
Improving content quality

Another approach to improving user experience involves looking at the quality of content, i.e. the video, audio, subtitles, closed captions, etc. that are part of the movie or show. Netflix receives content from the studios in the form of digital assets that are then encoded and quality checked before they go live on the content servers. Given our large and varied catalog that spans several countries and languages, the challenge is to ensure that all our movies and shows are free of quality issues such as incorrect subtitles or captions, our own encoding errors, etc.
In addition to the internal quality checks, we also receive feedback from our members when they discover issues while viewing. This data can be very noisy and may contain non-issues, issues that are not content quality related (for example, network errors encountered due to a poor connection), or general feedback about member tastes and preferences. In essence, identifying issues that are truly content quality related amounts to finding the proverbial needle in a haystack.

By combining member feedback with intrinsic factors related to viewing behavior, we're building models to predict whether a particular piece of content has a quality issue. For instance, we can detect viewing patterns such as sharp drop-offs in viewing at certain times during the show and add in information from member feedback to identify problematic content. Machine learning models along with natural language processing (NLP) and text mining techniques can be used to build powerful models to both improve the quality of content that goes live and also use the information provided by our members to close the loop on quality and replace content that does not meet the expectations of Netflix members. As we expand internationally, this problem becomes even more challenging with the addition of new movies and shows to our catalog and the increase in the number of languages.
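A deliberately simplified sketch of that idea, with invented thresholds and hypothetical signal names, might flag an asset when a sharp drop-off in minute-by-minute viewing coincides with a cluster of member reports near the same timestamp:

def flag_quality_issue(view_counts_per_minute, feedback_minutes,
                       dropoff_ratio=0.5, min_reports=20, window=2):
    # view_counts_per_minute: concurrent views at each minute of the title.
    # feedback_minutes: playback positions (in minutes) of member-reported issues.
    for m in range(1, len(view_counts_per_minute)):
        prev, cur = view_counts_per_minute[m - 1], view_counts_per_minute[m]
        sharp_drop = prev > 0 and cur / prev < dropoff_ratio
        nearby_reports = sum(1 for f in feedback_minutes if abs(f - m) <= window)
        if sharp_drop and nearby_reports >= min_reports:
            return {"flagged": True, "minute": m, "reports": nearby_reports}
    return {"flagged": False}
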
These are just a few examples of how we can use data in creative ways to build models and algorithms that can deliver the perfect viewing experience for each member. There are plenty of other challenges in the streaming space that can benefit from a data science approach. If you're interested in working in this exciting space, please check out the Streaming Science & Algorithms position on the Netflix jobs site.



Delivering Breaking Bad on Netflix in Ultra HD 4K

This week Netflix is pleased to begin streaming all 62 episodes of Breaking Bad in Ultra HD 4K. Breaking Bad in 4K comes from Sony Pictures Entertainment’s beautiful remastering of Breaking Bad from the original film negatives. This 4K experience is available on select 4K Smart TVs.

As pleased as I am to announce Breaking Bad in 4K, this blog post is also intended to highlight the collaboration between Sony Pictures Entertainment and Netflix to modernize the digital supply chain that transports digital media from content studios, like Sony Pictures, to streaming retailers, like Netflix.

Netflix and Sony agreed on an early subset of IMF for the transfer of the video and audio files for Breaking Bad. IMF stands for Interoperable Master Format, an emerging SMPTE specification governing file formats and metadata for digital media archiving and B2B exchange.

IMF specifies fundamental building blocks like immutable objects, checksums, globally unique identifiers, and manifests (CPL). These building blocks hold promise for vastly improving the efficiency, accuracy, and scale of the global digital supply chain.

At Netflix, we are excited about IMF and we are committing significant R&D efforts towards adopting IMF for content ingestion. Netflix has an early subset of IMF in production today and we will support most of the current IMF App 2 draft by the end of 2014. We are also developing a roadmap for IMF App 2 Extended and Extended+. We are pleased that Sony Pictures is an early innovator in this space and we are looking forward to the same collaboration with additional content studio partners.

Breaking Bad is joining House of Cards season 2 and the Moving Art documentaries in our global 4K catalog. We are also adding a few more 4K movies for our USA members. We have added Smurfs 2, Ghostbusters, and Ghostbusters 2 in the United States. All of these movies were packaged in IMF by Sony Pictures.

Kevin McEntee
VP Digital Supply Chain

Announcing Security Monkey - AWS Security Configuration Monitoring and Analysis

We are pleased to announce the open source availability of Security Monkey, our solution for monitoring and analyzing the security of our Amazon Web Services configurations.


At Netflix, responsibility for delivering the streaming service is distributed and the environment is constantly changing. Code is deployed thousands of times a day, and cloud configuration parameters are modified just as frequently. To understand and manage the risk associated with this velocity, the security team needs to understand how things are changing and how these changes impact our security posture.
Netflix delivers its service primarily out of Amazon Web Services’ (AWS) public cloud, and while AWS provides excellent visibility of systems and configurations, it has limited capabilities in terms of change tracking and evaluation. To address these limitations, we created Security Monkey - the member of the Simian Army responsible for tracking and evaluating security-related changes and configurations in our AWS environments.

Overview of Security Monkey

We envisioned and built the first version of Security Monkey in 2011. At that time, we used a few different AWS accounts and delivered the service from a single AWS region. We now use several dozen AWS accounts and leverage multiple AWS regions to deliver the Netflix service. Over its lifetime, Security Monkey has ‘evolved’ (no pun intended) to meet our changing and growing requirements.

Viewing IAM users in Security Monkey - highlighted users have active access keys.
There are a number of security-relevant AWS components and configuration items - for example, security groups, S3 bucket policies, and IAM users. Changes or misconfigurations in any of these items could create an unnecessary and dangerous security risk. We needed a way to understand how AWS configuration changes impacted our security posture. It was also critical to have access to an authoritative configuration history service for forensic and investigative purposes so that we could know how things have changed over time. We also needed these capabilities at scale across the many accounts we manage and many AWS services we use.
Security Monkey's filter interface allows you to quickly find the configurations and items you're looking for.
These needs are at the heart of what Security Monkey is - an AWS security configuration tracker and analyzer that scales for large and globally distributed cloud environments.

Architecture

At a high-level, Security Monkey consists of the following components:
  • Watcher - The component that monitors a given AWS account and technology (e.g. S3, IAM, EC2). The Watcher detects and records changes to configurations. So, if a new IAM user is created or if an S3 bucket policy changes, the Watcher will detect this and store the change in Security Monkey’s database.
  • Notifier - The component that lets a user or group of users know when a particular item has changed. This component also provides notification based on the triggering of audit rules.
  • Auditor - Component that executes a set of business rules against an AWS configuration to determine the level of risk associated with the configuration. For example, a rule may look for a security group with a rule allowing ingress from 0.0.0.0/0 (meaning the security group is open to the Internet). Or, a rule may look for an S3 policy that allows access from an unknown AWS account (meaning you may be unintentionally sharing the data stored in your S3 bucket). Security Monkey has a number of built-in rules included, and users are free to add their own rules (a minimal example of such a custom rule follows this list).
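To give a flavor of what a custom rule can look like, here is a minimal Python sketch of a world-open security group check; the class shape and method names are illustrative assumptions rather than Security Monkey's exact plugin API:

class SecurityGroupWorldOpenAuditor:
    def audit(self, security_group_config):
        # Flag any ingress rule that allows traffic from the entire Internet.
        issues = []
        for rule in security_group_config.get("rules", []):
            if rule.get("cidr_ip") == "0.0.0.0/0":
                issues.append({
                    "severity": 10,
                    "message": "Security group open to 0.0.0.0/0 on port(s) "
                               f"{rule.get('from_port')}-{rule.get('to_port')}",
                })
        return issues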

In terms of technical components, we run Security Monkey in AWS on Ubuntu Linux, and storage is provided by a PostgreSQL RDS database. We currently run Security Monkey on a single m3.large instance - this instance type has been able to easily monitor our dozens of accounts and many hundreds of changes per day.

The application itself is written in Python using the Flask framework (including a number of Flask plugins). At Netflix, we use our standard single-sign on (SSO) provider for authentication, but for the OSS version we’ve implemented Flask-Login and Flask-Security for user management. The frontend for Security Monkey’s data presentation is written in Angular Dart, and JSON data is also available via a REST API.

General Features and Operations

Security Monkey is relatively straightforward from an operational perspective. Installation and AWS account setup is covered in the installation document, and Security Monkey does not rely on other Netflix OSS components to operate. Generally, operational use includes:
  • Initial Configuration
    • Setting up one or more Security Monkey users to use/administer the application itself.
    • Setting up one or more AWS accounts for Security Monkey to monitor.
    • Configuring user-specific notification preferences (to determine whether or not a given user should be notified for configuration changes and audit reports).
  • Typical Use Cases
    • Checking historical details for a given configuration item (e.g. the different states a security group has had over time).
    • Viewing reports to check what audit issues exist (e.g. all S3 policies that reference unknown accounts or all IAM users that have active access keys).
    • Justifying audit issues (providing background or context on why a particular issue exists and is acceptable even though it may violate an audit rule).

Note on AWS CloudTrail and AWS Trusted Advisor

CloudTrail is AWS’ service that records and logs API calls. Trusted Advisor is AWS’ premium support service that automatically evaluates your cloud deployment against a set of best practices (including security checks).

Security Monkey predates both of these services and meets a bit of each service's goals while having unique value of its own:
  • CloudTrail provides verbose data on API calls, but has no sense of state in terms of how a particular configuration item (e.g. security group) has changed over time. Security Monkey provides exactly this capability.
  • Trusted Advisor has some excellent checks, but it is a paid service and provides no means for the user to add custom security checks. For example, Netflix has a custom check to identify whether a given IAM user matches a Netflix employee user account, something that is impossible to do via Trusted Advisor. Trusted Advisor is also a per-account service, whereas Security Monkey scales to support and monitor an arbitrary number of AWS accounts from a single Security Monkey installation.

Open Items and Future Plans

Security Monkey has been in production use at Netflix since 2011 and we will continue to add additional features. The following list documents some of our planned enhancements.
  • Integration with CloudTrail for change detail (including originating IP, instance, IAM account).
  • Ability to compare different configuration items across regions or accounts.
  • CSRF protections for form POSTs.
  • Content Security Policy headers (currently awaiting a Dart issue to be addressed).
  • Additional AWS technology and configuration tracking.
  • Test integration with moto.
  • SSL certificate expiry monitoring.
  • Simpler installation script and documentation.
  • Roles/authorization capabilities for admin vs. user roles.
  • More refined AWS permissions for Security Monkey operations (the current policy in the install docs is a broader read-only role).
  • Integration with edda, our general purpose AWS change tracker. On a related note, our friends at Prezi have open sourced reddalert, a security change detector that is itself integrated with edda.

Conclusion


Security Monkey has helped the security teams @ Netflix gain better awareness of changes and security risks in our AWS environment. Its approach fits well with the general Simian Army approach of continuously monitoring and detecting potential anomalies and risky configurations, and we look forward to seeing how other AWS users choose to extend and adapt its capabilities. Security Monkey is now available on our GitHub site.

If you’re in the San Francisco Bay Area and would like to hear more about Security Monkey (and see a demo), our August Netflix OSS meetup will be focused specifically on security. It’s scheduled for August 20th and will be held at Netflix HQ in Los Gatos.

-Patrick Kelley, Kevin Glisson, and Jason Chan (Netflix Cloud Security Team)

Billing & Payments Engineering Meetup

On June 18th, we hosted our first Billing & Payments Engineering Meetup at Netflix.
We wanted to create a space for exchanging information and learning among professionals. That space would serve as a forum, or an agora, for a community of people sharing the same interests in the engineering aspects of billing & payment systems.
The billing and payments space is a dynamic and innovative environment that requires increased attention as it evolves. Many of the Bay Area's tech companies may have different core products, yet we all monetize in a fairly similar way. Most of us created billing systems internally and had to overcome similar technical or business challenges as our companies grew. Moreover, as our companies expand internationally, the need to process foreign payment methods is becoming a critical and potentially defining factor in maximizing the chances of success.

Several trend-setting companies responded to our invite to speak to the large audience that came looking for tips and best-of-industry practices. 
Below is a recap of the agenda:
  • Mathieu Chauvin - Engineering Manager for Payments @ Netflix
  • Taylor Wicksell - Sr. Software Engineer for Billing @ Netflix
  • Jean-Denis Greze - Engineer @ Dropbox
  • Alec Holmes - Software Engineer @ Square
  • Emmanuel Cron - Software Engineer III, Google Wallet @ Google
  • Paul Huang - Engineering Manager @ Survey Monkey
  • Anthony Zacharakis - Lead Engineer @ Lumos Labs
  • Shengyong Li / Feifeng Yang - Dir. Engineering Commerce / Tech Lead Payment @ Electronic Arts
Below you can find the aggregate presentations. Thanks again to the presenters for sharing this material.


After the presentations, we held a networking session and engaged in very interesting conversations. It was a great event and another one will come up soon. Stay tuned on the meetup page to be notified!
http://www.meetup.com/Netflix-Billing-Payments-Engineering/

Netflix is always looking for talented people. If you share our passion for billing & payments innovation, check out our Careers page!
http://jobs.netflix.com/jobs.php?id=NFX00084

