
Automated Failure Testing

AKA Training Smarter Monkeys

At Netflix, we have found that proactive failure testing is a great way to ensure that we have a reliable product for our members by helping us prepare our systems, and our teams, for the problems that arise in our production environment. Our various efforts in this space, some of which are manual, have helped us make it through the holiday season without incident (which is great if you’re on-call for New Year’s Eve!). But who likes manual processes? Additionally, we are only testing for the failures we anticipate, and often only for an individual service or component per exercise. We can do better!

Imagine a monkey that crawls through your code and infrastructure, injecting small failures and discovering whether they result in member pain.

While looking for a way to build such a monkey, we discovered a failure testing approach developed by Peter Alvaro called Molly. Given that we already had a failure injection service called FIT, we believed we could build a prototype implementation in short order. And we thought it would be great to see how well the concepts outlined in the Molly paper translated into a large-scale production environment. So, we got in touch with Peter to see if he was interested in working together to build a prototype. He was, and the results of our collaboration are detailed below.

Algorithm


“A lineage-driven fault injector reasons backwards from correct system outcomes to determine whether failures in the execution could have prevented the outcome.” [1]

Molly begins by looking at everything that went into a successful request and asking “What could have prevented this outcome?” Take this simplified request as an example:
[Figure: simplified request graph]

(A or R or P or B)

At the start, everything is necessary - as far as we know. Symbolically we say that member pain could result from failing (A or R or P or B) where A stands for API, etc. We start by choosing randomly from the potential failure points and rerunning the request, injecting failure at the chosen point.

There are three potential outcomes:
  1. The request fails - we’ve found a member-facing failure
     - From this we can prune future experiments containing this failure
  2. The request succeeds - the service/failure point is not critical
  3. The request succeeds, and there is an alternative interaction that takes the place of the failure (i.e. a failover or a fallback).

In this example, we fail Ratings and the request succeeds, producing this graph:

[Figure: request graph with Ratings failed]
(A or P or B) and (A or P or B or R)

We know more about this request’s behavior and update our failure equation. As Playlist is a potential failure point in this equation, we’ll fail it next, producing this graph:

[Figure: request graph with Playlist failed, showing its fallback]

(A or PF or B) and (A or P or B) and (A or P or B or R)

This illustrates #3 above. The request was still successful, but due to an alternate execution. Now we have a new failure point to explore. We update our equation to include this new information. Now we rinse, lather, and repeat until there are no more failures to explore.

Molly isn’t prescriptive on how to explore this search space. For our implementation we decided to compute all solutions which satisfy the failure equation, and then choose randomly from the smallest solution sets. For example, the solutions to our last representation would be: [{A}, {PF}, {B}, {P,PF}, {R,A}, {R,B} …]. We would begin by exploring all the single points of failure: A, PF, B; then proceed to all sets of size 2, and so forth.
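
To make that search strategy concrete, here is a minimal, illustrative sketch (not the Molly or FIT implementation) that enumerates failure sets for a CNF-style failure equation, under the simplifying assumption that a set “satisfies” the equation when it intersects every clause, and then orders candidates smallest-first. All class and service names are made up for the example.

import java.util.*;

/** Illustrative sketch (not the Molly or FIT code): enumerate failure sets that
 *  "satisfy" a CNF-style failure equation and explore the smallest sets first. */
public class FailureEquationExplorer {

    /** Simplified reading: a candidate failure set satisfies the equation when it
     *  contains at least one failure point from every clause. */
    static boolean satisfies(Set<String> candidate, List<Set<String>> clauses) {
        return clauses.stream().allMatch(clause -> clause.stream().anyMatch(candidate::contains));
    }

    /** All satisfying subsets of the known failure points, smallest first
     *  (brute force, fine for a toy universe of a few points). */
    static List<Set<String>> solutions(List<String> points, List<Set<String>> clauses) {
        List<Set<String>> result = new ArrayList<>();
        for (int mask = 1; mask < (1 << points.size()); mask++) {
            Set<String> candidate = new HashSet<>();
            for (int i = 0; i < points.size(); i++) {
                if ((mask & (1 << i)) != 0) candidate.add(points.get(i));
            }
            if (satisfies(candidate, clauses)) result.add(candidate);
        }
        result.sort((a, b) -> Integer.compare(a.size(), b.size()));   // single points of failure first
        return result;
    }

    public static void main(String[] args) {
        // Equation after failing Ratings: (A or P or B) and (A or P or B or R)
        List<Set<String>> clauses = List.of(
                Set.of("API", "Playlist", "Bookmarks"),
                Set.of("API", "Playlist", "Bookmarks", "Ratings"));
        List<String> points = List.of("API", "Ratings", "Playlist", "Bookmarks");
        solutions(points, clauses).forEach(System.out::println);      // singletons first, then pairs, ...
    }
}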

Implementation


Lineage


What is the lineage of a Netflix request? We are able to leverage our tracing system to build a tree of the request execution across our microservices. Thanks to FIT, we have additional information in the form of “Injection Points”. These are key inflection points in our system where failures may occur. Injection Points include things like Hystrix command executions, cache lookups, DB queries, HTTP calls, etc. The data  provided by FIT allows us to build a more complete request tree, which is what we feed into the algorithm for analysis.

In the examples above, we see simple service request trees. Here is the same request tree extended with FIT data:

[Figure: request tree extended with FIT injection points]

Success

What do we mean by ‘success’? What is most important is our members’ experience, so we want a measurement that reflects this. To accomplish this, we tap into our device reported metrics stream. By analyzing these metrics we can determine if the request resulted in a member-facing error.  

An alternate, more simplistic approach could be to rely on the HTTP status codes for determining successful outcomes. But status codes can be misleading, as some frameworks return a ‘200’ on partial success, with a member-impacting error embedded within the payload.

Currently only a subset of Netflix requests have corresponding device reported metrics. Adding device reported metrics for more request types presents us with the opportunity to expand our automated failure testing to cover a broader set of device traffic.

Idempotence


Being able to replay requests made things nice and clean for Molly. We don’t have that luxury. We don’t know at the time we receive a request whether or not it is idempotent and safe to replay. To offset this, we have grouped requests into equivalence classes, such that requests within each class ‘behave’ the same - i.e. they execute the same dependent calls and fail in the same way.

To define request classes, we focused on the information we had available when we received the request: the path (netflix.com/foo/bar), the parameters (?baz=boo), and the device making the request. Our first pass was to see if a direct mapping existed between these request features and the set of dependencies executed. This didn’t pan out. Next we explored using machine learning to find and create these mappings. This seemed promising, but would require a fair amount of work to get right.

Instead, we narrowed our scope to only examine requests generated by the Falcor framework. These requests provide, through the query parameters, a set of JSON paths to load for the request, e.g. ‘videos’, ‘profiles’, ‘images’. We found that these Falcor path elements matched consistently with the internal services required to load those elements.
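
As a rough illustration of the request-class idea (not the actual classification code), the sketch below derives an equivalence-class key from the request path, the device, and the top-level Falcor path elements in the query; the key format and all names are hypothetical.

import java.util.List;
import java.util.stream.Collectors;

/** Sketch: bucket requests into equivalence classes keyed on the request path,
 *  the calling device, and the top-level Falcor path elements. */
public class RequestClassifier {

    /** e.g. ("/pathEvaluator", "ps4", ["videos.summary", "profiles", "images"])
     *  -> "pathEvaluator|ps4|images,profiles,videos" */
    static String requestClassKey(String path, String device, List<String> falcorPaths) {
        String normalized = falcorPaths.stream()
                .map(p -> p.split("\\.")[0])        // keep only the top-level path element
                .distinct()
                .sorted()                           // order-independent key
                .collect(Collectors.joining(","));
        return path.replaceAll("^/", "") + "|" + device + "|" + normalized;
    }

    public static void main(String[] args) {
        System.out.println(requestClassKey("/pathEvaluator", "ps4",
                List.of("videos.summary", "profiles", "images")));
    }
}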

Future work involves finding a more generic way to create these request class mappings so that we can expand our testing beyond Falcor requests.

These request classes change as code is written and deployed by Netflix engineers. To offset this drift, we run an analysis of potential request classes daily through a sampling of the device reported metrics stream. We expire old classes that no longer receive traffic, and we create new classes for new code paths.

Member Pain


Remember that the goal of this exploration is to find and fix errors before they impact a large number of members. It’s not acceptable to cause a lot of member pain while running our tests. In order to mitigate this risk, we structure our exploration so that we are only running a small number of experiments over any given period.

Each experiment is scoped to a request class and runs for a short period (twenty to thirty seconds) for a minuscule percentage of members. We want at least ten good example requests from each experiment. In order to filter out false positives, we look at the overall success rate for an experiment, only marking a failure as found if greater than 75% of requests failed. Since our request class mapping isn’t perfect, we also filter out requests which, for any reason, didn’t execute the failure we intended to test.

Let’s say we are able to run 500 experiments in a day. If we are potentially impacting 10 members each run, then the worst case impact is 5,000 members each day. But not every experiment results in a failure - in fact the majority of them result in success. If we only find a failure in one in ten experiments (a high estimate), then we’re actually impacting 500 member requests in a day, some of which are further mitigated by retries. When you’re serving billions of requests each day, the impact of these experiments is very small.

Results


We were lucky that one of the most important Netflix requests met our criteria for exploration - the ‘App Boot’ request. This request loads the metadata needed to run the Netflix application and load the initial list of videos for a member. This is a moment of truth that, as a company, we want to win by providing a reliable experience from the very start.

This is also a very complex request, touching dozens of internal services and hundreds of potential failure points. Brute force exploration of this space would take 2^100 iterations (roughly 1 with 30 zeros following), whereas our approach was able to explore it in ~200 experiments. We found five potential failures, one of which was a combination of failure points.

What do we do once we’ve found a failure?  Well, that part is still admittedly manual. We aren’t to the point of automatically fixing the failure yet. In this case, we have a list of known failure points, along with a ‘scenario’ which allows someone to use FIT to reproduce the failure. From this we can verify the failure and decide on a fix.

We’re very excited that we were able to build this proof of concept implementation and find real failures using it. We hope to be able to extend it to search a larger portion of the Netflix request space and find more member facing failures before they result in outages, all in an automated way.

And if you’re interested in failure testing and building resilient systems, get in touch with us - we’re hiring!

Kolton Andrus (@KoltonAndrus), Ben Schmaus (@schmaus)

Astyanax - Retiring an old friend

In the summer of 2011, Astyanax, an Apache Cassandra (C*) Java client library, was created to make it easy to consume Cassandra, which at the time was in its infancy. Astyanax became so popular that for a good while it was the de facto Java client library for the Apache Cassandra community. Astyanax provides the following features:
  • High level, simple, object oriented interface to Cassandra.
  • Resilient behavior on the client side.
  • Connection pool abstraction. Implementation of a round robin and token-aware connection pool.
  • Monitoring abstraction to get event notification from the connection pool.
  • Complete encapsulation of the underlying Thrift API and structs.
  • Automatic retry of downed hosts.
  • Automatic discovery of additional hosts in the cluster.
  • Suspension of hosts for a short period of time after several timeouts.
  • Annotations to simplify use of composite columns.
DataStax, the enterprise company behind Apache Cassandra, took many of the lessons contained within Astyanax and included them in its official Java Cassandra driver.
When Astyanax was written, the protocol to communicate to Cassandra was Thrift and the API was very low level. Today, Cassandra is mostly consumed via a query language very similar to SQL. This new language is called CQL (Cassandra Query Language). The Cassandra community has also moved beyond the Thrift protocol to the CQL BINARY PROTOCOL.
Thrift will be deprecated in Apache Cassandra in version 4.0. Aside from that deprecation there are also the following reasons to move away from Thrift:
  • CQL Binary protocol performs better
  • Community development efforts have completely moved to the CQL Binary protocol. The thrift implementation is only in maintenance mode.
  • CQL is easier to consume since the API resembles SQL.
Today we are moving Astyanax from an active project in the NetflixOSS ecosystem into an archived state. This means the project will still be available for public consumption; however, we will not be making any feature enhancements or performance improvements. There are still tens of thousands (if not more) of lines of code within Netflix that use Astyanax. Moving forward, we will only be fixing Netflix-critical bugs as we begin our efforts to refactor our internal systems to use the CQL binary protocol.
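
For reference, consuming Cassandra over the CQL binary protocol takes only a few lines with the DataStax Java driver (3.x-era API); the contact point, keyspace, table, and values below are hypothetical and would need to exist in your cluster.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CqlExample {
    public static void main(String[] args) {
        // Connect over the CQL binary protocol (no Thrift involved).
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo_keyspace")) {   // hypothetical keyspace

            // Hypothetical table: videos(id bigint PRIMARY KEY, title text)
            session.execute("INSERT INTO videos (id, title) VALUES (?, ?)", 42L, "Oldboy");

            ResultSet rs = session.execute("SELECT id, title FROM videos WHERE id = ?", 42L);
            for (Row row : rs) {
                System.out.println(row.getLong("id") + " -> " + row.getString("title"));
            }
        }
    }
}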
If there are members of the community that would like to have a more hands-on role and maintain the project by becoming a committer, please reach out to me directly.

Distributed Time Travel for Feature Generation


We want to make it easy for Netflix members to find great content to fulfill their unique tastes. To do this, we follow a data-driven algorithmic approach based on machine learning, which we have described in past posts and other publications. We aspire to a day when anyone can sit down, turn on Netflix, and the absolute best content for them will automatically start playing. While we are still way off from that goal, it sets a vision for us to improve the algorithms that span our service: from how we rank videos to how we construct the homepage to how we provide search results. To make our algorithms better, we follow a two-step approach. First, we try an idea offline using historical data to see if it would have made better recommendations. If it does, we then deploy a live A/B test to see if it performs well in reality, which we measure through statistically significant improvements in core metrics such as member engagement, satisfaction, and retention.

While there are many ways to improve machine learning approaches, arguably the most critical is to provide better input data. A model can only be as good as the data we give it. Thus, we spend a lot of time experimenting with new kinds of input signals for our models. Most machine learning models expect input to be represented as a vector of numbers, known as a feature vector. Somehow we need to take an arbitrary input entity (e.g. a tuple of member profile, video, country, time, device, etc.), with its associated, richly structured data, and provide a feature vector representing that entity for a machine learning algorithm to use. We call this transformation feature generation and it is central to providing the data needed for learning. Example features include how many minutes a member has watched a video, the popularity of the video, its predicted rating, what genre a video belongs to, or how many videos are in a row. We use the term feature broadly, since a feature could be a simple indicator or have a full model behind it, such as a Matrix Factorization.

We will describe how we built a time machine for feature generation using Apache Spark that enables our researchers to quickly try ideas for new features on historical data such that running offline experiments and transitioning to online A/B tests is seamless.

Why build a time machine?

There are many ways to approach feature generation, several of which we’ve used in the past. One way is to use logged event data that we store on S3 and access via Hive by running queries on these tables to define features. While this is flexible for exploratory analysis, it has several problems. First, to run an A/B test we need the feature calculation to run within our online microservice architecture. We run the models online because we know that freshness and responsiveness of our recommendations is important to the member experience. This means we would need to re-implement feature generation to retrieve data from online services instead of Hive tables. It is difficult to match two such implementations exactly, especially since any discrepancies between offline and online data sources can create unexpected differences in the model output. In addition, not all of our data is available offline, particularly output of recommendation models, because these involve a sparse-to-dense conversion that creates a large volume of data.

On the other extreme, we could log our features online where a model would be used. While this removes the offline/online discrepancy and makes transitioning to an A/B test easy, it means we need to deploy each idea for a new feature into production and wait for the data to collect before we can determine if a feature is useful. This slows down the iteration cycle for new ideas. It also requires that all the data for a feature be available online, which could mean building new systems to serve that data, again before we have determined if it is valuable. We also need to compute features for many more members or requests than we may actually need for training based on how we choose label data.

We’ve also tried a middle ground where we use feature code that calls online services, such as the one that provides viewing history, and filters out all the data with timestamps past a certain point in time. However, this only works for situations where a service records a log of all historical events; services that just provide the current state cannot be used. It also places additional load on the online services each time we generate features.

Throughout these approaches, management of time is extremely important. We want an approach that balances the benefits of all the above approaches without the drawbacks. In particular, we want a system that:
  • Enables quick iteration from idea to modeling to running an A/B test
  • Uses the data provided by our online microservices, without overloading them
  • Accurately represents input data for a model at a point in time to simulate online use
  • Handles our scale of data with many researchers running experiments concurrently, without using more than 1.21 gigawatts of power
  • Works well in an interactive environment, such as using a notebook for experimentation, and also reliably in a batch environment, such as for doing periodic retraining
  • Should only need to write feature code once so that we don’t need to spend time verifying that two implementations are exactly equivalent
  • Most importantly, no paradoxes are allowed (e.g. the label can’t be in the features)
    When faced with tough problems one often wishes for a time machine to solve them. So that is what we decided to build. Our time machine snapshots online services and uses the snapshot data offline to reconstruct the inputs that a model would have seen online to generate features. Thus, when experimenters design new feature encoders — functions that take raw data as input and compute features — they can immediately use them to compute new features for any time in the past, since the time machine can retrieve the appropriate snapshots and pass them to the feature encoders.

    How to build a Time Machine

    Here are the various components needed in a time machine that snapshots online services:
    • Select contexts to snapshot
    • Snapshot data of various micro services for the selected context
    • Build APIs to serve this data for a given time coordinate in the past

      Context Selection

      Snapshotting data for all contexts (e.g all member profiles, devices, times of day) would be very expensive. Instead, we select samples of contexts to snapshot periodically (typically daily), though different algorithms may need to train on different distributions. For example, some use  stratified samples based on properties such as viewing patterns, devices, time spent on the service, region, etc. To handle this, we use Spark SQL to select an appropriate sample of contexts for each experiment from Hive. We merge the context set across experiments and persist it into S3 along with the corresponding experiment identifiers.
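
A minimal sketch of what such a context-selection job could look like with Spark SQL; the table, columns, sampling rate, experiment identifier, and S3 location are hypothetical, not the actual Netflix schema.

import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ContextSelectionJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("context-selection")
                .enableHiveSupport()
                .getOrCreate();

        // Hypothetical Hive table of member activity used to pick contexts for an experiment.
        Dataset<Row> contexts = spark.sql(
                "SELECT profile_id, country, device_type "
              + "FROM member_activity WHERE ds = '2016-02-01'");

        // Sample a small fraction of contexts; a real job might stratify by
        // viewing patterns, device, region, etc. before sampling.
        Dataset<Row> sampled = contexts.sample(false, 0.01, 42L);

        // Tag each context with the experiment it belongs to and persist to S3
        // so the snapshotting jobs can pick it up.
        sampled.withColumn("experiment_id", lit("exp_123"))
               .write().mode("overwrite")
               .parquet("s3://some-bucket/selected-contexts/2016-02-01/");

        spark.stop();
    }
}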

      Data Snapshots

      The next component in the time machine fetches data from various online services and saves a snapshot of the returned data for the selected contexts. Netflix embraces a fine-grained Service Oriented Architecture for our cloud-based deployment model. There are hundreds of such micro services that are collectively responsible for handling the member experience. Data from various such services like Viewing History, My List, and Predicted Ratings are used as input for the features in our models.

      We use Netflix-specific components such as Eureka, Hystrix, and Archaius to fetch data from online services through their client libraries. However, some of these client libraries bulk-load data, so they have a high memory footprint and a large startup time. Spark is not well suited for loading such components inside its JVM. Moreover, the requirement of creating an uber jar to run Spark jobs can cause runtime jar incompatibility issues with other Netflix libraries. To alleviate this problem, we used Prana, which runs outside the Spark JVM, as a data proxy to the Netflix ecosystem.

      Spark parallelizes the calls to Prana, which internally fetches data from various micro services for each of these contexts. We chose Thrift as the binary communication protocol between Spark and Prana. We store the snapshotted data in S3 using Parquet, a compressed column-oriented binary format, for both time and space efficiency, and persist the location of the S3 data in Cassandra.
      Ensuring pristine data quality of these snapshots is critical for us to correctly evaluate our models. Hence, we store the confidence level for each snapshot service, which is the percentage of successful data fetches from the micro services excluding any fallbacks due to timeouts or service failures. We expose it to our clients, who can choose to use this information for their experimentation.

      For both snapshotting and context selection, we needed to schedule several Spark jobs to run on a periodic basis, with dependencies between them. To that end, we built a general purpose workflow orchestration and scheduling framework called Meson, which is optimized for machine learning pipelines, and used it to run the Spark jobs for the components of the time machine. We intend to open source Meson in the future and will provide more detail about it in an upcoming blog post.

      APIs for Time Travel

      We built APIs that enable time travel and fetch the snapshot data from S3 for a given time in the past. Here is a sample API to get the snapshot data for the Viewing History service.
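
A hypothetical sketch of such an API (illustrative class, column, and path names only; not the actual Netflix interface):

import java.time.Instant;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

/** Hypothetical sketch of a time-travel API for snapshot data. */
public class ViewingHistorySnapshotApi {

    private final SparkSession spark;

    public ViewingHistorySnapshotApi(SparkSession spark) {
        this.spark = spark;
    }

    /** Load the Viewing History snapshot closest to the destination time,
     *  optionally restricted to contexts selected for a given A/B test. */
    public Dataset<Row> viewingHistoryAt(Instant destinationTime, String abTestId) {
        // 1. Look up the S3 location of the snapshot for this time in Cassandra
        //    (elided here; we pretend the lookup resolved to the path below).
        String s3Path = "s3://some-bucket/snapshots/viewing_history/" + destinationTime;

        // 2. Load the Parquet snapshot into Spark.
        Dataset<Row> snapshot = spark.read().parquet(s3Path);

        // 3. Keep only contexts that were selected for the requested A/B test.
        return abTestId == null ? snapshot
                                : snapshot.filter(snapshot.col("experiment_id").equalTo(abTestId));
    }
}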






      Given a destination time in the past, the API fetches the associated S3 location of the snapshot data from Cassandra and loads the snapshot data in Spark. In addition, when given an A/B test identifier, the API filters the snapshot data to return only those contexts selected for that A/B test. The system transforms the snapshot data back into the respective services’ Java objects (POJOs) so that the feature encoders operate on the exact same POJOs for both offline experimentation and online feature generation in production.

      The following diagram shows the overall architecture of the time machine and where Spark is used in building it: from selecting members for experimentation, snapshotting data of various services for the selected members, to finally serving the data for a time in the past.


      DeLorean: Generating Features via Time Travel

      DeLorean is our internal project to build the system that takes an experiment plan, travels back in time to collect all the necessary data from the snapshots, and generates a dataset of features and labels for that time in the past to train machine learning models. Of course, the first step is to select the destination time, to bring it up to 88 miles per hour, then DeLorean takes care of the rest.

      Running an Experiment

      DeLorean allows a researcher to run a wide range of experiments by automatically determining how to launch the time machine, what time coordinates are needed, what data to retrieve, and how to structure the output. Thus, to run a new experiment, an experimenter only needs to provide the following:
      • Label data: A blueprint for obtaining a set of contexts with associated time coordinates, items, and labels for each. This is typically created by a Hive, Pig, or Spark SQL query
      • A feature model containing the required feature encoder configurations
      • Implementations of any new feature encoders that do not already exist in our library
      DeLorean provides a capability for writing and modifying a new feature encoder during an experiment, for example, in a Zeppelin Notebook or in Spark Shell, so that it can be used immediately for feature generation. If we find that new feature encoder useful, we can later productionize it by adding it to our library of feature encoders.

      The high-level process to generate features is depicted in the following diagram, where the blocks highlighted in light green are typically customized for new experiments. In this scenario, experimenters can also implement new feature encoders that are used in conjunction with existing ones.
      [DeLorean image by JMortonPhoto.com and OtoGodfrey.com]

      Label Data and Feature Encoders

      One of the primary inputs to DeLorean is the label data, which contains information about the contexts, items, and associated labels for which to generate features. The contexts, as the name suggests, describe the setting in which a model is to be used (e.g. tuples of member profiles, country, time, device, etc.). Items are the elements which are to be trained on, scored, and/or ranked (e.g. videos, rows, search entities). Labels are typically the targets used in supervised learning for each context-item combination. For unsupervised learning approaches, the label is not required. As an example, for personalized ranking the context could be defined as the member profile ID, country code, and time, whereas the items are the videos, and the labels are plays or non-plays. In this example, the label data is created by joining the set of snapshotted contexts to the logged play actions.

      Once we have this label dataset, we need to compute features for each context-item combination in the dataset by using the desired set of feature encoders. Each feature encoder takes a context and each of the target items associated with the context, together with some raw data elements in the form of POJOs, to compute one or more features.

      Each type of item, context variable, or data element has a data key associated with it. Every feature encoder has a method that returns the set of keys for the data it consumes. DeLorean uses these keys to identify the required data types, retrieves the data, and passes it to the feature encoder as a data map, which is a map from data keys to data objects.
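
To make the data-key contract concrete, here is a minimal sketch of what a feature encoder interface could look like; the interface, keys, and example encoder are hypothetical, not the actual DeLorean code.

import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of a feature encoder contract: it declares the data keys it needs and
 *  computes named features from a context, an item, and the assembled data map. */
interface FeatureEncoder<C, I> {

    /** Data keys this encoder consumes, e.g. "VIEWING_HISTORY", "VIDEO_METADATA". */
    Set<String> requiredDataKeys();

    /** Compute one or more named features for a (context, item) pair. */
    Map<String, Double> encode(C context, I item, Map<String, Object> dataMap);
}

/** Example encoder: fraction of the profile's recent plays that share the item's genre. */
class GenreAffinityEncoder implements FeatureEncoder<String, String> {

    @Override
    public Set<String> requiredDataKeys() {
        return Set.of("VIEWING_HISTORY", "VIDEO_METADATA");
    }

    @Override
    @SuppressWarnings("unchecked")
    public Map<String, Double> encode(String profileId, String videoId, Map<String, Object> dataMap) {
        List<String> recentPlays = (List<String>) dataMap.get("VIEWING_HISTORY");
        Map<String, String> genreByVideo = (Map<String, String>) dataMap.get("VIDEO_METADATA");

        String genre = genreByVideo.get(videoId);
        long matching = recentPlays.stream()
                .filter(v -> genre != null && genre.equals(genreByVideo.get(v)))
                .count();
        double affinity = recentPlays.isEmpty() ? 0.0 : (double) matching / recentPlays.size();
        return Map.of("genre_affinity", affinity);
    }
}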

      We made DeLorean flexible enough to allow the experiments to use different types of contexts and items without needing to customize the feature generation system. DeLorean can be used not only for recommendations, but also for a row ordering experiment which has profile-device tuple as context and rows of videos as items. Another use case may be a search experiment which has the query-profile-country tuple as context and individual videos as items. To achieve this, DeLorean automatically infers the type of contexts and items from the label data and the data keys required by the feature encoders.

      Data Elements

      Data elements are the ingredients that get transformed into features by a feature encoder. Some of these are context-dependent, such as viewing history for a profile, and others are shared by all contexts, such as metadata of the videos. We handle these two types of data elements differently.

      For context-dependent data elements, we use the snapshots described above, and associate each one with a data key. We bring all the required snapshot data sources together with the values, items, and labels for each context, so that the data for a single context is sent to a single Spark executor. Different contexts are broken up to enable distributed feature generation. The snapshots are loaded as an RDD of (context, Map(data key -> data element)) in a lazy fashion and a series of joins between the label data and all the necessary context-dependent data elements are performed using Spark.

      For context-independent data elements, DeLorean broadcasts these bulk data elements to each executor. Since these data elements have manageable sizes and often have a slow rate of change over time, we keep a record of each update that we use to rewind back to the appropriate previous version. These are kept in memory as singleton objects and made available to the feature generators for each context processed by an executor. Thus, a complete data map is created for each context containing the context data, context-dependent snapshot data elements, and shared data singletons.
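
A small sketch of the broadcast pattern for such shared data elements, using Spark's broadcast variables; the metadata map and its values are hypothetical stand-ins for the versioned singletons described above.

import java.util.Map;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.SparkSession;

public class SharedDataBroadcast {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("delorean-sketch").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Context-independent data element, e.g. video metadata keyed by video id
        // (hypothetical values; in practice this would be the version in effect
        // at the destination time).
        Map<String, String> genreByVideo = Map.of("oldboy", "Thriller", "equilibrium", "Sci-Fi");

        // Ship one read-only copy to each executor instead of joining it per context.
        Broadcast<Map<String, String>> metadata = jsc.broadcast(genreByVideo);

        // Inside feature generation, executors read metadata.value() from local memory.
        System.out.println(metadata.value().get("oldboy"));
        spark.stop();
    }
}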

      Once the features are generated in Spark, the data is represented as a Spark DataFrame with an embedded schema. For many personalization applications, we need to rank a number of items for each context. To avoid shuffling in the ranking process, item features are grouped by context in the output. The final features are stored in Hive in Parquet format.

      Model Training, Validation, and Testing

      We use features generated using our time machine to train the models that we use in various parts of our recommendation systems. We use a standardized schema for passing the DataFrames of training features to machine learning algorithms, as well as computing predictions and metrics for trained models on the validation and test feature DataFrames. We also standardized a format to serialize the models that we use for publishing the models to be later consumed by online applications or in other future experiments.

      The following diagram shows how we run a typical machine learning experiment. Once the experiment is designed, we collect the dataset of contexts, items, and labels. Next the features for the label dataset are generated. We then train models using either single machine, multi-core, or distributed algorithms and perform parameter tuning by computing metrics on a validation set. Then we pick the best models and compare them on a testing set. When we see a significant improvement in the offline metrics over the production model and that the outputs are different enough, we design an A/B test using variations of the model and run it online. If the A/B test shows a statistically significant increase in core metrics, we roll it out broadly. Otherwise, we learn from the results to iterate on the next idea. 

      Going Online

      One of the primary motivations for building DeLorean is to share the same feature encoders between offline experiments and online scoring systems to ensure that there are no discrepancies between the features generated for training and those computed online in production. When an idea is ready to be tested online, the model is packaged with the same feature configuration that was used by DeLorean to generate the features.

      To compute features in the production system, we directly call our online microservices to collect the data elements required by all the feature encoders used in a model, instead of obtaining them from snapshots as we do offline. We then assemble them into data maps and pass them to the feature encoders. The feature vector is then passed to the offline-trained model for computing predictions, which are used to create our recommendations. The following diagram shows the high-level process of transitioning from an offline experiment to an online production system where the blocks highlighted in yellow are online systems, and the ones highlighted in blue are offline systems. Note that the feature encoders are shared between online and offline to guarantee the consistency of feature generation.

      Conclusion and Future work

      By collecting the state of the online world at a point in time for a select set of contexts, we were able to build a mechanism for turning back time. Spark’s distributed, resilient computation power enabled us to snapshot millions of contexts per day and to implement feature generation, model training and validation at scale. DeLorean is now being used in production for feature generation in some of the latest A/B tests for our recommender system.

      However, this is just a start and there are many ways in which we can improve this approach. Instead of batch snapshotting on a periodic cadence, we can drive the snapshots based on events, for example at a time when a particular member visits our service. To avoid duplicate data collection, we can also capture data changes instead of taking full snapshots each time. We also plan on using the time machine capability for other needs in evaluating new algorithms and testing our systems. Of course, we leave the ability to travel forward in time as future work.

      Fast experimentation is the hallmark of a culture of innovation. Reducing the time to production for an idea is a key metric we use to measure the success of our infrastructure projects. We will continue to build on this foundation to bring better personalization to Netflix in our effort to delight members and win moments of truth. If you are interested in these types of time-bending engineering challenges, join us.

      Evolution of the Netflix Data Pipeline

      Our new Keystone data pipeline went live in December of 2015. In this article, we talk about the evolution of Netflix’s data pipeline over the years. This is the first of a series of articles about the new Keystone data pipeline.

      Netflix is a data-driven company. Many business and product decisions are based on insights derived from data analysis. The charter of the data pipeline is to collect, aggregate, process and move data at cloud scale. Almost every application at Netflix uses the data pipeline.

      Here are some statistics about our data pipeline:
      • ~500 billion events and  ~1.3 PB per day 
      • ~8 million events and ~24 GB per second during peak hours

      There are several hundred event streams flowing through the pipeline. For example:
      • Video viewing activities
      • UI activities
      • Error logs
      • Performance events
      • Troubleshooting & diagnostic events

      Note that operational metrics don’t flow through this data pipeline. We have a separate telemetry system, Atlas, which we open-sourced just like many other Netflix technologies.

      Over the last few years, our data pipeline has experienced major transformations due to evolving requirements and technological developments.

      V1.0 Chukwa pipeline

      The sole purpose of the original data pipeline was to aggregate and upload events to Hadoop/Hive for batch processing. As you can see, the architecture is rather simple. Chukwa collects events and writes them to S3 in Hadoop sequence file format. The Big Data Platform team further processes those S3 files and writes to Hive in Parquet format. End-to-end latency is up to 10 minutes. That is sufficient for batch jobs which usually scan data at daily or hourly frequency.
      [Figure: V1.0 Chukwa pipeline]

      V1.5 Chukwa pipeline with real-time branch

      With the emergence of Kafka and Elasticsearch over the last couple of years, there has been a growing demand for real-time analytics in Netflix. By real-time, we mean sub-minute latency.
      [Figure: V1.5 Chukwa pipeline with real-time branch]
      In addition to uploading events to S3/EMR, Chukwa can also tee traffic to Kafka (the front gate of the real-time branch). In V1.5, approximately 30% of the events are branched to the real-time pipeline. The centerpiece of the real-time branch is the router. It is responsible for routing data from Kafka to the various sinks: Elasticsearch or secondary Kafka.

      We have seen explosive growth in Elasticsearch adoption within Netflix for the last two years. There are ~150 clusters totaling ~3,500 instances hosting ~1.3 PB of data. The vast majority of the data is injected via our data pipeline.  

      When Chukwa tees traffic to Kafka, it can deliver full or filtered streams. Sometimes, we need to apply further filtering on the Kafka streams written from Chukwa. That is why we have the router to consume from one Kafka topic and produce to a different Kafka topic.
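
Conceptually, the router is a consume-filter-produce loop like the sketch below (a plain Kafka consumer/producer pair rather than the Samza-based implementation discussed later; the topic names and filter condition are hypothetical).

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch of a router that consumes one Kafka topic and produces a filtered
 *  stream to another topic. */
public class FilteringRouterSketch {

    /** consumerProps/producerProps must include bootstrap servers, a group id,
     *  and String (de)serializers; supplied by the caller. */
    public static void route(Properties consumerProps, Properties producerProps) {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("chukwa_full_stream"));
            while (true) {
                for (ConsumerRecord<String, String> event : consumer.poll(Duration.ofMillis(500))) {
                    if (event.value().contains("\"type\":\"error\"")) {   // keep only error events
                        producer.send(new ProducerRecord<>("error_events", event.key(), event.value()));
                    }
                }
            }
        }
    }
}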

      Once we deliver data to Kafka, it empowers users with real-time stream processing: Mantis, Spark, or custom applications. “Freedom and Responsibility” is the DNA of Netflix culture. It’s up to users to choose the right tool for the task at hand.

      Because moving data at scale is our expertise, our team maintains the router as a managed service. But there are a few lessons we learned while operating the routing service:
      • The Kafka high-level consumer can lose partition ownership and stop consuming some partitions after running stable for a while. This requires us to bounce the processes.
      • When we push out new code, sometimes the high-level consumer can get stuck in a bad state during rebalance.
      • We group hundreds of routing jobs into a dozen clusters. The operational overhead of managing those jobs and clusters is an increasing burden. We need a better platform to manage the routing jobs.

      V2.0 Keystone pipeline (Kafka fronted)

      In addition to the issues related to routing service, there are other motivations for us to revamp our data pipeline:
      • Simplify the architecture.
      • Kafka implements replication that improves durability, while Chukwa doesn’t support replication.
      • Kafka has a vibrant community with strong momentum.

      [Figure: V2.0 Keystone pipeline]

      There are three major components:

      • Data Ingestion - There are two ways for applications to ingest data (a producer sketch follows this list).
        • Use our Java library and write to Kafka directly.
        • Send to an HTTP proxy which then writes to Kafka.
      • Data Buffering - Kafka serves as the replicated persistent message queue. It also helps absorb temporary outages from downstream sinks.
      • Data Routing - The routing service is responsible for moving data from fronting Kafka to various sinks: S3, Elasticsearch, and secondary Kafka.
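
As a sketch of the first ingestion path, a plain Java Kafka producer writing an event to the fronting Kafka might look like this; the topic, configuration, and payload are hypothetical, and the real Keystone client library wraps these details.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventIngestExample {
    public static void main(String[] args) {
        // Hypothetical producer configuration.
        Properties props = new Properties();
        props.put("bootstrap.servers", "fronting-kafka:9092");
        props.put("acks", "1");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One event on a hypothetical stream; real events are structured, not plain strings.
            producer.send(new ProducerRecord<>("ui_activity_events", "profile-123",
                    "{\"type\":\"play\",\"videoId\":42,\"ts\":1455000000000}"));
        }
    }
}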

      We have been running the Keystone pipeline in production for the past few months. We are still evolving Keystone with a focus on QoS, scalability, availability, operability, and self-service.

      In follow-up posts, we’ll cover more details regarding:
      • How do we run Kafka in the cloud at scale?
      • How do we implement routing service using Samza?
      • How do we manage and deploy Docker containers for routing service?

      If building large-scale infrastructure excites you, we are hiring!

      Real-Time Data Infrastructure Team

      Recommending for the World

      #AlgorithmsEverywhere




      The Netflix experience is driven by a number of Machine Learning algorithms: personalized ranking, page generation, search, similarity, ratings, etc. On the 6th of January, we simultaneously launched Netflix in 130 new countries around the world, which brings the total to over 190 countries. Preparing for such a rapid expansion while ensuring each algorithm was ready to work seamlessly created new challenges for our recommendation and search teams. In this post, we highlight the four most interesting challenges we’ve encountered in making our algorithms operate globally and, most importantly, how this improved our ability to connect members worldwide with stories they'll love.


      Challenge 1: Uneven Video Availability


      Before we can add a video to our streaming catalog on Netflix, we need to obtain a license for it from the content owner. Most content licenses are region-specific or country-specific and are often held to terms for years at a time. Ultimately, our goal is to let members around the world enjoy all our content through global licensing, but currently our catalog varies between countries. For example, the dystopian Sci-Fi movie “Equilibrium” might be available on Netflix in the US but not in France. And “The Matrix” might be available in France but not in the US. Our recommendation models rely heavily on learning patterns from play data, particularly involving co-occurrence or sequences of plays between videos. In particular, many algorithms assume that when something was not played it is a (weak) signal that someone may not like a video, because they chose not to play it. However, in this particular scenario we will never observe any members who played both “Equilibrium” and “The Matrix”. A basic recommendation model would then learn that these two movies do not appeal to the same kinds of people just because the audiences were constrained to be different. However, if these two movies were available to the same set of members, we would likely observe a similarity between the videos and between the members who watch them. From this example, it is clear that uneven video availability potentially interferes with the quality of our recommendations.


      Our search experience faces a similar challenge. Given a (partial) query from a member, we want to present the most relevant videos in the catalog. However, not accounting for availability differences reduces the quality of this ranking. For example, the top results for a given query from a ranking algorithm unaware of availability differences could include a niche video followed by a well-known one in a case where the latter is only available to a relatively small number of our global members and the former is available much more broadly.


      Another aspect of content licenses is that they have start and end dates, which means that a similar problem arises not only across countries, but also within a given country across time. If we compare a well-known video that has only been available on Netflix for a single day to another niche video that was available for six months, we might conclude that the latter is a lot more engaging. However, if the recently added, well-known video had instead been on the site for six months, it probably would have more total engagement.


      One can imagine the impact these issues can have on more sophisticated search or recommendation models when they already introduce a bias in something as simple as popularity. Addressing the issue of uneven availability across both geography and time lets our algorithms provide better recommendations for a video already on our service when it becomes available in a new country.


      So how can we avoid learning catalog differences and focus on our real goal of learning great recommendations for our members? We incorporate into each algorithm the information that members have access to different catalogs based on geography and time, for example by building upon concepts from the statistical community on handling missing data.


      Challenge 2: Cultural Awareness


      Another key challenge in making our algorithms work well around the world is to ensure that we can capture local variations in taste. We know that even with the same catalog worldwide we would not expect a video to have the exact same popularity across countries. For example, we expect that Bollywood movies would have a different popularity in India than in Argentina. However, should two members get similar recommendations if they have similar profiles but one lives in India and the other in Argentina? Perhaps if they are both watching a lot of Sci-Fi, their recommendations should be similar. Meanwhile, overall we would expect Argentine members to be recommended more Argentine Cinema and Indian members more Bollywood.


      An obvious approach to capture local preferences would be to build models for individual countries. However, some countries are small and we will have very little member data available there. Training a recommendation algorithm on such sparse data leads to noisy results, as the model will struggle to identify clear personalization patterns from the data. So we need a better way.


      Prior to our global expansion, our approach was to group countries into regions of a reasonable size that had a relatively consistent catalog and language. We would then build individual models for each region. This could capture the taste differences between regions because we trained separate models whose hyperparameters were tuned differently. Within a region, as long as there were enough members with certain taste preference and a reasonable amount of history, a recommendation model should be able to identify and use that pattern of taste. However, there were several problems with this approach. The first is that within a region the amount of data from a large country would dominate the model and dampen its ability to learn the local tastes for a country with a smaller number of members. It also presented a challenge of how to maintain the groupings as catalogs changed over time and memberships grew. Finally, because we’re continuously running A/B tests with model variants across many algorithms, the combinatorics involving a growing number of regions became overwhelming.


      To address these challenges we sought to combine the regional models into a single global model that also improves the recommendations we make, especially in countries where we may not yet have many members. Of course, even though we are combining the data, we still need to reflect local differences in taste. This leads to the question: is local taste or personal taste more dominant? Based on the data we’ve seen so far, both aspects are important, but it is clear that taste patterns do travel globally. Intuitively, this makes sense: if a member likes Sci-Fi movies, someone on the other side of the world who also likes Sci-Fi would be a better source for recommendations than their next-door neighbor who likes food documentaries. Being able to discover worldwide communities of interest means that we can further improve our recommendations, especially for niche interests, as they will be based on more data. Then with a global algorithm we can identify new or different taste patterns that emerge over time.


      To refine our models we can use many signals about the content and about our members. In this global context, two important taste signals could be language and location. We want to make our models aware of not just where someone is logged in from but also aspects of a video such as where it is from, what language it is in, and where it is popular. Going back to our example, this information would let us offer different recommendations to a brand new member in India as compared to Argentina, as the distribution of tastes within the two countries is different. We expand on the importance of language in the next section.


      Challenge 3: Language


      Netflix has now grown to support 21 languages and our catalog includes more local content than ever. This increase creates a number of challenges, especially for the instant search algorithm mentioned above. The key objective of this algorithm is to help every member find something to play whenever they search while minimizing the number of interactions. This is different from standard ranking metrics used to evaluate information retrieval systems, which do not take the amount of interaction into account. When looking at interactions, it is clear that different languages involve very different interaction patterns. For example, Korean is usually typed using the Hangul alphabet, where syllables are composed from individual characters. To search for “올드보이” (Oldboy), in the worst possible case, a member would have to enter nine characters: “ㅇ ㅗ ㄹ ㄷ ㅡ ㅂ ㅗ ㅇ ㅣ”. Using a basic indexing for the video title, in the best case a member would still need to type three characters: “ㅇ ㅗ ㄹ”, which would be collapsed into the first syllable of that title: “올”. In a Hangul-specific indexing, a member would need to write as little as one character: “ㅇ”. Optimizing for the best results with the minimum set of interactions and automatically adapting to newly introduced languages with significantly different writing systems is an area we’re working on improving.


      Another language-related challenge relates to recommendations. As mentioned above, while taste patterns travel globally, ultimately people are most likely to enjoy content presented in a language they understand. For example, we may have a great French Sci-Fi movie on the service, but if there are no subtitles or audio available in English we wouldn’t want to recommend it to a member who likes Sci-Fi movies but only speaks English. Alternatively, if the member speaks both English and French, then there is a good chance it would be an appropriate recommendation. People also often have preferences for watching content that was originally produced in their native language, or one they are fluent in. While we constantly try to add new language subtitles and dubs to our content, we do not yet have all languages available for all content. Furthermore, different people and cultures also have different preferences for watching with subtitles or dubs. Putting this together, it seems clear that recommendations could be better with an awareness of language preferences. However, currently which languages a member understands and to what degree is not defined explicitly, so we need to infer it from ancillary data and viewing patterns.


      Challenge 4: Tracking Quality


      The objective is to build recommendation algorithms that work equally well for all of our members, no matter where they live or what language they speak. But with so many members in so many countries speaking so many languages, a challenge we now face is how to even figure out when an algorithm is sub-optimal for some subset of our members.


      To handle this, we could use some of the approaches for the challenges above. For example, we could look at the performance of our algorithms by manually slicing along a set of dimensions (country, language, catalog, …). However, some of these slices lead to very sparse and noisy data. At the other end of the scale we could be looking at metrics observed globally, but this would dramatically limit our ability to detect issues until they impact a large number of our members. One approach to this problem is to learn how best to group observations for the purpose of automatically detecting outliers and anomalies. Just as we work on improving our recommendation algorithms, we are innovating our metrics, instrumentation and monitoring to improve their fidelity and, through them, our ability to detect new problems and highlight areas to improve our service.


      Conclusion

      To support a launch of this magnitude, we examined each and every algorithm that is part of our service and began to address these challenges. Along the way, we found not just approaches that will make Netflix better for those signing up in the 130 new countries, but in fact better for all Netflix members worldwide. For example, solving the first and second challenges lets us discover worldwide communities of interest so that we can make better recommendations. Solving the third challenge means that regardless of where our members are based, they can use Netflix in the language that suits them best, and quickly find the content they’re looking for. Solving the fourth challenge means that we’re able to detect issues at a finer grain so that our recommendation and search algorithms help all our members find content they love. Of course, our global journey is just beginning and we look forward to making our service dramatically better over time. If you are an algorithmic explorer who finds this type of adventure exciting, take a look at our current job openings.

      Caching for a Global Netflix


      #CachesEverywhere

      Netflix members have come to expect a great user experience when interacting with our service. There are many things that go into delivering a customer-focused user experience for a streaming service, including an outstanding content library, an intuitive user interface, relevant and personalized recommendations, and a fast service that quickly gets your favorite content playing at very high quality, to name a few.

      The Netflix service heavily embraces a microservice architecture that emphasizes separation of concerns. We deploy hundreds of microservices, with each focused on doing one thing well. This allows our teams and the software systems they produce to be highly aligned while being loosely coupled. Many of these services are stateless, which makes it easier to (auto)scale them. They often achieve the stateless loose coupling by maintaining state in caches or persistent stores.

      EVCache is an extensively used data-caching service that provides the low-latency, high-reliability caching solution that the Netflix microservice architecture demands.

      It is a RAM store based on memcached, optimized for cloud use. EVCache typically operates in contexts where consistency is not a strong requirement. Over the last few years, EVCache has been scaled to significant traffic while providing a robust key-value interface. At peak, our production EVCache deployments routinely handle upwards of 30 million requests/sec, storing hundreds of billions of objects across tens of thousands of memcached instances. This translates to just under 2 trillion requests per day globally across all EVCache clusters.


      Earlier this year, Netflix launched globally in 130 additional countries, making it available in nearly every country in the world. In this blog post we talk about how we built EVCache’s global replication system to meet Netflix’s growing needs. EVCache is open source, and has been in production for more than 5 years. To read more about EVCache, check out one of our early blog posts.

      Motivation


      Netflix’s global, cloud-based service is spread across three Amazon Web Services (AWS) regions: Northern Virginia, Oregon, and Ireland. Requests are mostly served from the region the member is closest to. But network traffic can shift around for various reasons, including problems with critical infrastructure or region failover exercises (“Chaos Kong”). As a result, we have adopted a stateless application server architecture which lets us serve any member request from any region.

      The hidden requirement in this design is that the data or state needed to serve a request is readily available anywhere. High-reliability databases and high-performance caches are fundamental to supporting our distributed architecture.  One use case for a cache is to front a database or other persistent store. Replicating such caches globally helps with the “thundering herd” scenario: without global replication, member traffic shifting from one region to another would encounter “cold” caches for those members in the new region. Processing the cache misses would lengthen response times and overwhelm the databases.

      Another major use case for caching is to “memoize” data which is expensive to recompute, and which doesn’t come from a persistent store. When the compute systems write this kind of data to a local cache, the data has to be replicated to all regions so it’s available to serve member requests no matter where they originate. The bottom line is that microservices rely on caches for fast, reliable access to multiple types of data like a member’s viewing history, ratings, and personalized recommendations. Changes and updates to cached data need to be replicated around the world to enable fast, reliable, and global access.

      EVCache was designed with these use cases in mind. When we embarked upon the global replication system design for EVCache, we also considered non-requirements. One non-requirement is strong global consistency. It’s okay, for example, if Ireland and Virginia occasionally have slightly different recommendations for you as long as the difference doesn’t hurt your browsing or streaming experience. For non-critical data, we rely heavily on this “eventual consistency” model for replication where local or global differences are tolerated for a short time. This simplifies the EVCache replication design tremendously: it doesn’t need to deal with global locking, quorum reads and writes, transactional updates, partial-commit rollbacks, or other complications of distributed consistency.

      We also wanted to make sure the replication system wouldn’t affect the performance and reliability of local cache operations, even if cross-region replication slowed down. All replication is asynchronous, and the replication system can become latent or fail temporarily without affecting local cache operations.

      Replication latency is another loose requirement. How fast is fast enough? How often does member traffic switch between regions, and what is the impact of inconsistency? Rather than demand the impossible from a replication system ("instantaneous and perfect"), what Netflix needs from EVCache is acceptable latency while tolerating some inconsistency - as long as both are low enough to serve the needs of our applications and members.

      Cross-Region Replication Architecture


      EVCache replicates data both within a region and globally. The intra-region redundancy comes from a simultaneous write to all server groups within the region. For cross-region replication, the key components are shown in the diagram below.

      Screen Shot 2016-02-19 at 12.42.41 PM.png

      This diagram shows the replication steps for a SET operation. An application calls set() on the EVCache client library, and from there the replication path is transparent to the caller.

      1. The EVCache client library sends the SET to the local region’s instance of the cache
      2. The client library also writes metadata (including the key, but not the data) to the replication message queue (Kafka)
      3. The “Replication Relay” service in the local region reads messages from this queue
      4. The Relay fetches the data for the key from the local cache
      5. The Relay sends a SET request to the remote region's “Replication Proxy” service
      6. In the remote region, the Replication Proxy receives the request and performs a SET to its local cache, completing the replication
      7. Local applications in the receiving region will now see the updated value in the local cache when they do a GET

      This is a simplified picture, of course. For one thing, it refers only to SET - not other operations like DELETE, TOUCH, or batch mutations. The flows for DELETE and TOUCH are very similar, with some modifications: they don’t have to read the existing value from the local cache, for example.

      It's important to note that the only part of the system that reaches across region boundaries is the message sent from the Replication Relay to the Replication Proxy (step 5). Clients of EVCache are not aware of other regions or of cross-region replication; reads and writes use only the local, in-region cache instances.
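To make the Relay’s role concrete, here is a minimal sketch of its consume-fetch-forward loop (steps 3 through 5 above) in Java. The interfaces and names below are illustrative stand-ins, not the actual EVCache client, Kafka consumer, or Proxy APIs:

import java.util.List;
import java.util.Optional;

// Hypothetical collaborators - names and signatures are illustrative only.
interface ReplicationQueue {                        // wraps the Kafka consumer
    List<ReplicationEvent> poll();                  // next batch of metadata messages
}
interface LocalCache {                              // wraps the in-region cache client
    Optional<byte[]> get(String key);
}
interface RemoteProxy {                             // HTTPS client for the remote Replication Proxy
    boolean replicateSet(String key, byte[] value, int ttlSeconds);
}
record ReplicationEvent(String key, int ttlSeconds) {}

/** Sketch of the Replication Relay loop. */
public class ReplicationRelay {
    private final ReplicationQueue queue;
    private final LocalCache localCache;
    private final RemoteProxy remoteProxy;

    public ReplicationRelay(ReplicationQueue q, LocalCache c, RemoteProxy p) {
        this.queue = q; this.localCache = c; this.remoteProxy = p;
    }

    public void run() {
        while (true) {
            for (ReplicationEvent event : queue.poll()) {
                // The queue message carries only the key and metadata, so fetch
                // the current value from the local cache before shipping it.
                Optional<byte[]> value = localCache.get(event.key());
                if (value.isEmpty()) {
                    continue; // value already expired or deleted locally; nothing to replicate
                }
                boolean ok = remoteProxy.replicateSet(event.key(), value.get(), event.ttlSeconds());
                if (!ok) {
                    retryLater(event); // timeouts and failures are retried, as described above
                }
            }
        }
    }

    private void retryLater(ReplicationEvent event) {
        // Placeholder: in practice retries would be scheduled with backoff.
    }
}

A production relay also handles batching, retry scheduling, and metrics; the sketch only shows the shape of the work: read metadata, look up the value locally, and ship it across the region boundary.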

      Component Responsibilities

      Replication Message Queue

      The message queue is the cornerstone of the replication system. We use Kafka for this. The Kafka stream for a fully-replicated cache has two consumers: one Replication Relay cluster for each destination region. By having separate clusters for each target region, we de-couple the two replication paths and isolate them from each other’s latency or other issues.

      If a target region goes wildly latent or completely blows up for an extended period, the buffer for the Kafka queue will eventually fill up and Kafka will start dropping older messages. In a disaster scenario like this, the dropped messages are never sent to the target region. Netflix services which use replicated caches are designed to tolerate such occasional disruptions.

      Replication Relay

      The Replication Relay cluster consumes messages from the Kafka cluster. Using a secure connection to the Replication Proxy cluster in the destination region, it writes the replication request (complete with data fetched from the local cache, if needed) and awaits a success response. It retries requests which encounter timeouts or failures.

      Temporary periods of high cross-region latency are handled gracefully: Kafka continues to accept replication messages and buffers the backlog when there are delays in the replication processing chain.

      Replication Proxy

      The Replication Proxy cluster for a cache runs in the target region for replication. It receives replication requests from the Replication Relay clusters in other regions and synchronously writes the data to the cache in its local region. It then returns a response to the Relay clusters, so they know the replication was successful.

      When the Replication Proxy writes to its local region’s cache, it uses the same open-source EVCache client that any other application would use. The common client library handles all the complexities of sharding and instance selection, retries, and in-region replication to multiple cache servers.

      As with many Netflix services, the Replication Relay and Replication Proxy clusters have multiple instances spread across Availability Zones (AZs) in each region to handle high traffic rates while being resilient against localized failures.

      Design Rationale and Implications


      The Replication Relay and Replication Proxy services, and the Kafka queue they use, all run separately from the applications that use caches and from the cache instances themselves. All the replication components can be scaled up or down as needed to handle the replication load, and they are largely decoupled from local cache read and write activity. Our traffic varies on a daily basis because of member watching patterns, so these clusters scale up and down all the time. If there is a surge of activity, or if some kind of network slowdown occurs in the replication path, the queue might develop a backlog until the scaling occurs, but latency of local cache GET/SET operations for applications won’t be affected.

      As noted above, the replication messages on the queue contain just the key and some metadata, not the actual data being written. We get various efficiency wins this way. The major win is a smaller, faster Kafka deployment which doesn’t have to be scaled to hold all the data that exists in the caches. Storing large data payloads in Kafka would make it a costly bottleneck, due to storage and network requirements. Instead, the Replication Relay fetches the data from the local cache, with no need for another copy in Kafka.

      Another win we get from writing just the metadata is that sometimes, we don’t need the data for replication at all. For some caches, a SET on a given key only needs to invalidate that key in the other regions - we don’t send the new data, we just send a DELETE for the key. In such cases, a subsequent GET in the other region results in a cache miss (rather than seeing the old data), and the application will handle it like any other miss. This is a win when the rate of cross-region traffic isn’t high - that is, when there are few GETs in region A for data that was written from region B. Handling these occasional misses is cheaper than constantly replicating the data.
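As a rough illustration of this choice, the sketch below shows the kind of metadata-only message that might sit on the queue, and how a per-cache replication mode could select between shipping the value and sending a DELETE. The field and class names are hypothetical, not the actual EVCache message schema:

// Illustrative only: the real message schema and cache-level settings may differ.
enum ReplicationMode { FULL_REPLICATION, INVALIDATE_ONLY }

/** Metadata-only message placed on the replication queue for a SET. */
record ReplicationMetadata(String cacheName, String key, long timestamp, int ttlSeconds) {}

/** Minimal description of the request sent to the remote Replication Proxy. */
record RemoteOp(String type, String key, byte[] value, int ttlSeconds) {
    static RemoteOp set(String key, byte[] value, int ttl) { return new RemoteOp("SET", key, value, ttl); }
    static RemoteOp delete(String key)                     { return new RemoteOp("DELETE", key, null, 0); }
}

final class ReplicationPolicy {
    /** Decide what the Relay should send to the remote region for this cache. */
    static RemoteOp forSet(ReplicationMode mode, ReplicationMetadata meta, byte[] localValue) {
        if (mode == ReplicationMode.INVALIDATE_ONLY) {
            // Cheaper for caches with little cross-region read traffic: the next
            // GET in the other region simply takes a cache miss.
            return RemoteOp.delete(meta.key());
        }
        // Full replication: ship the value fetched from the local cache.
        return RemoteOp.set(meta.key(), localValue, meta.ttlSeconds());
    }
}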

      Optimizations

      We have to balance latency and throughput based on the requirements of each cache. The 99th percentile of end-to-end replication latency for most of our caches is under one second. Some of that time comes from a delay to allow for buffering: we try to batch up messages at various points in the replication flow to improve throughput at the cost of a bit of latency. The 99th percentile of latency for our highest-volume replicated cache is only about 400ms because the buffers fill and flush quickly.

      Another significant optimization is the use of persistent connections. We found that the latency improved greatly and was more stable after we started using persistent connections between the Relay and Proxy clusters. It eliminates the need to wait for the 3-way handshake to establish a new TCP connection and also saves the extra network time needed to establish the TLS/SSL session before sending the actual replication request.

We improved throughput and lowered the overall communication latency between the Relay cluster and Proxy cluster by batching multiple messages into a single request to fill a TCP window. Ideally the batch size would vary to match the TCP window size, which can change over the life of the connection; in practice we tune the batch size empirically for good throughput. While this batching can add latency, it allows us to get more out of each TCP packet and reduces the number of connections we need to set up on each instance, letting us use fewer instances for a given replication demand profile.
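The sketch below illustrates the general size-or-time batching idea described above; the class, thresholds, and flush mechanics are illustrative, not our actual tuning values or implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Minimal size/time-based batcher: flush when the batch is full enough to use the
 * TCP window well, or when it has waited too long. A periodic timer would also
 * call maybeFlush() to enforce the time bound when traffic is quiet.
 */
final class MessageBatcher<T> {
    private final int maxBatchSize;
    private final long maxWaitMillis;
    private final Consumer<List<T>> sender;   // e.g., one request to the Proxy per batch
    private final List<T> buffer = new ArrayList<>();
    private long oldestMessageAt = -1;

    MessageBatcher(int maxBatchSize, long maxWaitMillis, Consumer<List<T>> sender) {
        this.maxBatchSize = maxBatchSize;
        this.maxWaitMillis = maxWaitMillis;
        this.sender = sender;
    }

    synchronized void add(T message) {
        if (buffer.isEmpty()) {
            oldestMessageAt = System.currentTimeMillis();
        }
        buffer.add(message);
        maybeFlush();
    }

    synchronized void maybeFlush() {
        boolean full = buffer.size() >= maxBatchSize;
        boolean stale = !buffer.isEmpty()
                && System.currentTimeMillis() - oldestMessageAt >= maxWaitMillis;
        if (full || stale) {
            sender.accept(new ArrayList<>(buffer));  // trade a little latency for throughput
            buffer.clear();
        }
    }
}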

With these optimizations we have been able to scale EVCache’s cross-region replication system to routinely handle over a million replication requests per second at daily peak.


      Challenges and Learnings

      The current version of our Kafka-based replication system has been in production for over a year and replicates more than 1.5 million messages per second at peak. We’ve had some growing pains during that time. We’ve seen periods of increased end-to-end latencies, sometimes with obvious causes like a problem with the Proxy application’s autoscaling rules, and sometimes without - due to congestion on the cross-region link on the public Internet, for example.

      Before using VPC at Amazon, one of our biggest problems was the implicit packets-per-second limits on our AWS instances. Cross that limit, and the AWS instance experiences a high rate of TCP timeouts and dropped packets, resulting in high replication latencies, TCP retries, and failed replication requests which need to be retried later. The solution is simple: scale out. Using more instances means there is more total packets-per-second capacity. Sometimes two “large” instances are a better choice than a single “extra large,” even when the costs are the same. Moving into VPC significantly raised some limits, like packets per second, while also giving us access to other enhanced networking capabilities which allow the Relay and Proxy clusters to do more work per instance.

      In order to be able to diagnose which link in the chain is causing latency, we introduced a number of metrics to track and monitor the latencies at different points in the system: from the client application to Kafka, in the Relay cluster’s reading from Kafka, from the Relay cluster to the remote Proxy cluster, and from Proxy cluster to its local cache servers. There are also end-to-end timing metrics to track how well the system is doing overall.


      At this point, we have a few main issues that we are still working through:

      • Kafka does not scale up and down conveniently. When a cache needs more replication-queue capacity, we have to manually add partitions and configure the consumers with matching thread counts and scale the Relay cluster to match. This can lead to duplicate/re-sent messages, which is inefficient and may cause more than the usual level of eventual consistency skew.
• If we lose an EVCache instance in the remote region, latency increases as the Proxy cluster tries and fails to write to the missing instance. That latency propagates back to the Relay side, which is awaiting confirmation for each (batched) replication request. We’ve worked to reduce the time spent in this state: we detect the lost instance earlier, and we are investigating reconciliation mechanisms to minimize the impact of these situations. We have also made changes in the EVCache client that allow the Proxy instances to cope more easily with cache instances disappearing.
• Kafka monitoring, particularly for missing messages, is not an exact science. Software bugs can cause messages not to appear in the Kafka partition, or not to be received by our Relay cluster. We monitor by comparing the total number of messages received by our Kafka brokers (on a per-topic basis) against the number of messages replicated by the Relay cluster, as sketched below. If the difference exceeds a small acceptable threshold for any significant time, we investigate. We also monitor maximum latencies (not just the average), because the processing of one partition may be significantly slower for some reason; that situation requires investigation even if the average is acceptable. We are still improving these and other alerts to better detect real issues with fewer false positives.
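Here is a minimal sketch of the count-comparison check mentioned in the last bullet. The metric sources and the threshold are placeholders for illustration only:

/**
 * Compare messages accepted by the Kafka brokers for a topic against messages the
 * Relay cluster actually replicated. The metric interface and the 1% threshold
 * are illustrative, not our real monitoring system or values.
 */
final class ReplicationLossMonitor {
    private static final double ACCEPTABLE_LOSS_RATIO = 0.01;

    interface MetricSource {
        long messagesIntoKafka(String topic, long windowStartMillis, long windowEndMillis);
        long messagesReplicatedByRelay(String topic, long windowStartMillis, long windowEndMillis);
    }

    private final MetricSource metrics;

    ReplicationLossMonitor(MetricSource metrics) {
        this.metrics = metrics;
    }

    /** Returns true if the difference stays within the acceptable threshold. */
    boolean check(String topic, long windowStartMillis, long windowEndMillis) {
        long produced = metrics.messagesIntoKafka(topic, windowStartMillis, windowEndMillis);
        long replicated = metrics.messagesReplicatedByRelay(topic, windowStartMillis, windowEndMillis);
        if (produced == 0) {
            return true;                            // nothing to replicate in this window
        }
        double lossRatio = (double) (produced - replicated) / produced;
        return lossRatio <= ACCEPTABLE_LOSS_RATIO;  // otherwise a human investigates
    }
}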

      Future

      We still have a lot of work to do on the replication system. Future improvements might involve pipelining replication messages on a single connection for better and more efficient connection use, optimizations to take better advantage of the network TCP window size, or transitioning to the new Kafka 0.9 API. We hope to make our Relay clusters (the Kafka consumers) autoscale cleanly without significantly increasing latencies or increasing the number of duplicate/re-sent messages.

EVCache is one of the critical components of Netflix's distributed architecture, providing globally replicated data at RAM speed so any member can be served from anywhere. In this post we covered how we took on the challenge of providing reliable and fast replication for caching systems at a global scale. We look forward to improving it further as our needs evolve and our global member base expands. As a company, we strive to win more of our members’ moments of truth, and our team helps in that mission by building highly available distributed caching systems at scale. If this is something you’d enjoy too, reach out to us - we’re hiring!


      EVCacheLogoAndText.png


      IMF: A Prescription for Versionitis


      This blog post provides an introduction to the emerging IMF (Interoperable Master Format) standard from SMPTE (The Society of Motion Picture and Television Engineers),  and delves into a short case study that highlights some of the operational benefits that Netflix receives from IMF today.

      Have you ever noticed that your favorite movie or TV show feels a little different depending on whether you’re watching it on Netflix, on DVD, on an airplane or from your local cable provider? One reason could be that you’re watching a slightly different edit. In addition to changes for specific distribution channels (like theatrical widescreen, HD home video, airline edits, etc.), content owners typically need to create new versions of their movie or television show for distribution in different territories.

Netflix licenses the majority of its content from other owners, sometimes years after the original assets were created, and often for multiple territories. This leads to a number of problems, including receiving cropped or pan-and-scanned versions of films. We also frequently run into problems when we try to sync dubbed audio and/or subtitles. For example, a film shot and premiered theatrically at 24 frames per second (fps) may be converted to 29.97fps and/or re-cut for a specific distribution channel. Alternate language assets (like audio and timed text) are then created to match the derivative version.

      In order to preserve the artist’s creative intent, Netflix always requests content in its original format (native aspect ratio, frame rate, etc.). In the case of a film, we would receive a 24fps theatrical version of the video, but the dubbed audio and subtitles won’t necessarily match, as they may have been created from the 29.97fps version, or even another version that was re-cut for international distribution. We’ve coined the term “Versionitis” to describe this asset-management malady.

      Luckily, the good folks over at SMPTE (whom you may know from ubiquitous standards like countdown leader, timecode and color bars, among others) have been hard at work, capitalizing on some of the successes of digital cinema, to design a better system of component-ized file-based workflows with a solution to versioning right in its DNA. If not a cure for versionitis, we’re hoping that IMF will at least provide some relief from this pernicious condition.

      The Interoperable Master Format

      The advance of technology within the motion picture post-production industry has effected a paradigm shift, moving the industry from tape-based to file-based workflows. The need for a standardized set of specifications for the file-based workflow has given birth to the Interoperable Master Format (IMF). IMF is a file-based framework designed to facilitate the management and processing of multiple content versions (airline edits, special editions, alternate languages, etc.) of the same high-quality finished work (feature, episode, trailer, advertisement, etc.) destined for distribution channels worldwide. The key concepts underlying IMF include:
• Facilitating an internal or business-to-business relationship. IMF is not intended to be delivered to the consumer directly.
• While IMF is intended to be a specification for the Distribution Service Master, it could be used as an archival master as well.
• Support for audio and video, as well as data essence in the form of subtitles, captions, etc.
• Support for descriptive and dynamic metadata (the latter can vary as a function of time) that is expected to be synchronized to an essence.
• Wrapping (encapsulating) of media essence, data essence, and dynamic metadata into well-understood temporal units called track files, using the MXF (Material eXchange Format) file specification.
• Each content version is embodied in a Composition, which combines metadata and essences. An example of a composition might be the US theatrical cut or an airline edit.
• A Composition Playlist (CPL) defines the playback timeline for the Composition and includes metadata applicable to the Composition as a whole via XML.
• IMF allows for the creation of many different distribution formats from the same composition. This can be accomplished by specifying the processing/transcoding instructions through an Output Profile List (OPL).

      The Composition Playlist

The IMF Composition Playlist (CPL) is an XML document that defines the playback timeline for the Composition and includes metadata applicable to the Composition as a whole. The CPL is not designed to contain essence but rather to reference external Track Files that contain the actual essence. This construct allows multiple compositions to be managed and processed without duplicating common essence files. The IMF CPL is constrained to contain exactly one video track.


      CompositionPlaylist.png


The timeline of the CPL (light blue in the example) contains multiple Segments designed to play sequentially. Each Segment (dark grey), in turn, contains multiple Sequences (e.g., an image sequence and an audio sequence, beige) that play in parallel. Each Sequence is composed of multiple Resources (green and red for image and audio essence, respectively) that refer to physical track files and, within them, the audio and video samples/frames that comprise the overall composition. In the example above, light grey portions of the track files represent essence samples that are not relevant to this composition.

      The flexible CPL mechanism decouples the playback timeline from the underlying track files, allowing for economical and incremental updates to the timeline when necessary. Each CPL is associated with a universally unique identifier (UUID) that can be used to track versioning of the playback timeline. Likewise, resources within the CPL reference essence data via each track file’s UUID.

      Composition Playlist for Supply Chain Automation

The core IMF principles help realize a better asset management system. In order to achieve a higher degree of ingest automation for Netflix’s Digital Supply Chain, additional information needs to be associated with an IMF delivery, and meaningful constraints need to be applied to the IMF CPL. Examples of additional information include metadata that associates the viewable timeline with the release title, the regions and territories where the timeline can be viewed, and content maturity ratings. The IMF Composition Playlist defines optional constructs that can carry such information, enabling an opportunity for tighter integration with the business systems of various players in the entertainment industry ecosystem.

      Anatomy of an IMP

      Asset delivery and playback timeline aspects are decoupled in IMF. The unit of delivery between two businesses is called an Interoperable Master Package (IMP). An IMP can be described as follows:
      1. An Interoperable Master Package (IMP) shall consist of one Packing List (PKL - an XML file that describes a list of files), and all the files it references
      2. An IMP (equivalently, the PKL) can contain one or more complete or incomplete Compositions
      3. A Complete IMP is an IMP containing the complete set of assets comprising one or more Compositions. Mathematically, a complete IMP is such that all of the asset references of all of the CPLs described in the PKL are also contained in the PKL
4. A Partial IMP is an IMP containing one or more incomplete Compositions. In other words, some assets needed to complete the composition are not present in the package, i.e., some of the assets referred to by a CPL are not contained in the PKL. Depending upon the order in which IMPs arrive into a content ingestion system, the dangling references associated with a partial IMP may be resolved using assets that came with IMPs previously ingested into the system, or may be resolved in the future as more IMPs are ingested.

In relation to the example above, the indicated composition could be delivered as a single, complete IMP. In this case, the IMP would contain the CPL file with UUID1, image essence track files with UUID6, UUID7 and UUID8 respectively, and audio essence track files with UUID11 and UUID12 respectively.

      The same composition could also be delivered as multiple partial IMPs. One such scenario could comprise an IMP1 containing CPL file with UUID1 and one audio essence track file with UUID11, and an IMP2 containing image essence track files with UUID6, UUID7 and UUID8 respectively and the audio essence track file with UUID12.
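The complete-versus-partial distinction boils down to a set-containment check. The sketch below illustrates it with a deliberately simplified data model; the class names and shapes are illustrative, not the actual IMF schema classes:

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Illustrative data model, not the actual IMF schema.
record Cpl(UUID id, Set<UUID> referencedTrackFileIds) {}
record Pkl(Set<UUID> assetIds, List<Cpl> cpls) {}

final class ImpInspector {
    /** A complete IMP: every asset referenced by every CPL in the PKL is also in the PKL. */
    static boolean isComplete(Pkl pkl) {
        return danglingReferences(pkl).isEmpty();
    }

    /** Track-file references not present in the package; for a partial IMP these must
     *  be resolved against assets from previously or subsequently ingested IMPs. */
    static Set<UUID> danglingReferences(Pkl pkl) {
        Set<UUID> missing = new HashSet<>();
        for (Cpl cpl : pkl.cpls()) {
            for (UUID ref : cpl.referencedTrackFileIds()) {
                if (!pkl.assetIds().contains(ref)) {
                    missing.add(ref);
                }
            }
        }
        return missing;
    }
}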

      Case Study - House of Cards Season 3

      Netflix started ingesting Interoperable Master Packages in 2014, when we started receiving Breaking Bad 4K masters (see here). Initial support was limited to complete IMPs (as defined above), with constrained CPLs that only referenced one ImageSequence and up to two AudioSequences, each contained in its own track file. CPLs referencing multiple track files, with timeline offsets, were not supported, so these early IMPs are very similar to a traditional muxed audio / video file.

      In February of 2015, shortly before the House of Cards Season 3 release date, the Netflix ident (the animated Netflix logo that precedes and follows a Netflix Original) was given the gift of sound.

      Screen Shot 2016-02-25 at 10.51.33 AM.png

Unfortunately, all episodes of House of Cards had already been mastered and ingested with the original video-only ident, as had all of the alternative language subtitles and dubbed audio tracks. House of Cards has marked a number of critical milestones for Netflix, and it was important to us to launch Season 3 with the new ident. Addressing this problem in the pre-IMF days would have been an expensive, operationally demanding, and very manual process, requiring re-QC of all of our masters and language assets (dubbed audio and subtitles) for every episode. With IMF, it was a relatively simple exercise in versioning and component-ized delivery.

      Rather than requiring an entirely new master package, the addition of ident audio to each episode required only new per-episode CPLs. These new CPLs were identical to the old, but referenced a different set of audio track files for the first ~100 frames and the last ~100 frames. Because this did not change the overall duration of the timeline, and did not adjust the timing of any other audio or video resources, there was no danger of other, already encoded, synchronized assets (like dubbed audio or subtitles) falling out-of-sync as a result of the change.

      To Be Continued …

      Next in this series, we will describe our IMF ingest implementation and how it fits into our content processing pipeline.


      By Rohit Puri, Andy Schuler and Sreeram Chakrovorthy

      How We Build Code at Netflix

      How does Netflix build code before it’s deployed to the cloud? While pieces of this story have been told in the past, we decided it was time we shared more details. In this post, we describe the tools and techniques used to go from source code to a deployed service serving movies and TV shows to more than 75 million global Netflix members.
The above diagram expands on a previous post announcing Spinnaker, our global continuous delivery platform. There are a number of steps that need to happen before a line of code makes its way into Spinnaker:
      • Code is built and tested locally using Nebula
      • Changes are committed to a central git repository
      • A Jenkins job executes Nebula, which builds, tests, and packages the application for deployment
      • Builds are “baked” into Amazon Machine Images
      • Spinnaker pipelines are used to deploy and promote the code change
      The rest of this post will explore the tools and processes used at each of these stages, as well as why we took this approach. We will close by sharing some of the challenges we are actively addressing. You can expect this to be the first of many posts detailing the tools and challenges of building and deploying code at Netflix.

      Culture, Cloud, and Microservices

      Before we dive into how we build code at Netflix, it’s important to highlight a few key elements that drive and shape the solutions we use: our culture, the cloud, and microservices.
      The Netflix culture of freedom and responsibility empowers engineers to craft solutions using whatever tools they feel are best suited to the task. In our experience, for a tool to be widely accepted, it must be compelling, add tremendous value, and reduce the overall cognitive load for the majority of Netflix engineers. Teams have the freedom to implement alternative solutions, but they also take on additional responsibility for maintaining these solutions. Tools offered by centralized teams at Netflix are considered to be part of a “paved road”. Our focus today is solely on the paved road supported by Engineering Tools.
      In addition, in 2008 Netflix began migrating our streaming service to AWS and converting our monolithic, datacenter-based Java application to cloud-based Java microservices. Our microservice architecture allows teams at Netflix to be loosely coupled, building and pushing changes at a speed they are comfortable with.

      Build

      Naturally, the first step to deploying an application or service is building. We created Nebula, an opinionated set of plugins for the Gradle build system, to help with the heavy lifting around building applications. Gradle provides first-class support for building, testing, and packaging Java applications, which covers the majority of our code. Gradle was chosen because it was easy to write testable plugins, while reducing the size of a project's build file. Nebula extends the robust build automation functionality provided by Gradle with a suite of open source plugins for dependency management, release management, packaging, and much more.
      A simple Java application build.gradle file.
      The above ‘build.gradle’ file represents the build definition for a simple Java application at Netflix. This project’s build declares a few Java dependencies as well as applying 4 Gradle plugins, 3 of which are either a part of Nebula or are internal configurations applied to Nebula plugins. The ‘nebula’ plugin is an internal-only Gradle plugin that provides convention and configuration necessary for integration with our infrastructure. The ‘nebula.dependency-lock’ plugin allows the project to generate a .lock file of the resolved dependency graph that can be versioned, enabling build repeatability. The ‘netflix.ospackage-tomcat’ plugin and the ospackage block will be touched on below.
      With Nebula, we provide reusable and consistent build functionality, with the goal of reducing boilerplate in each application’s build file. A future techblog post will dive deeper into Nebula and the various features we’ve open sourced. For now, you can check out the Nebula website.

      Integrate

      Once a line of code has been built and tested locally using Nebula, it is ready for continuous integration and deployment. The first step is to push the updated source code to a git repository. Teams are free to find a git workflow that works for them.
Once the change is committed, a Jenkins job is triggered. Our use of Jenkins for continuous integration has evolved over the years. We started with a single massive Jenkins master in our datacenter and have evolved to running 25 Jenkins masters in AWS. Jenkins is used throughout Netflix for a variety of automation tasks beyond simple continuous integration.
A Jenkins job is configured to invoke Nebula to build, test, and package the application code. If the repository being built is a library, Nebula will publish the .jar to our artifact repository. If the repository is an application, then the Nebula ospackage plugin will be executed. Using the Nebula ospackage (short for “operating system package”) plugin, an application’s build artifact will be bundled into either a Debian or RPM package, whose contents are defined via a simple Gradle-based DSL. Nebula will then publish the Debian file to a package repository where it will be available for the next stage of the process, “baking”.

      Bake

      Our deployment strategy is centered around the Immutable Server pattern. Live modification of instances is strongly discouraged in order to reduce configuration drift and ensure deployments are repeatable from source. Every deployment at Netflix begins with the creation of a new Amazon Machine Image, or AMI. To generate AMIs from source, we created “the Bakery”.
The Bakery exposes an API that facilitates the creation of AMIs globally. The Bakery API service then schedules the actual bake job on worker nodes that use Aminator to create the image. To trigger a bake, the user declares the package to be installed, as well as the foundation image onto which the package is installed. That foundation image, or Base AMI, provides a Linux environment customized with the common conventions, tools, and services required for seamless integration with the greater Netflix ecosystem.
      When a Jenkins job is successful, it typically triggers a Spinnaker pipeline. Spinnaker pipelines can be triggered by a Jenkins job or by a git commit. Spinnaker will read the operating system package generated by Nebula, and call the Bakery API to trigger a bake.

      Deploy

      Once a bake is complete, Spinnaker makes the resultant AMI available for deployment to tens, hundreds, or thousands of instances. The same AMI is usable across multiple environments as Spinnaker exposes a runtime context to the instance which allows applications to self-configure at runtime.  A successful bake will trigger the next stage of the Spinnaker pipeline, a deploy to the test environment. From here, teams will typically exercise the deployment using a battery of automated integration tests. The specifics of an application’s deployment pipeline becomes fairly custom from this point on. Teams will use Spinnaker to manage multi-region deployments, canary releases, red/black deployments and much more. Suffice to say that Spinnaker pipelines provide teams with immense flexibility to control how they deploy code.

      The Road Ahead

      Taken together, these tools enable a high degree of efficiency and automation. For example, it takes just 16 minutes to move our cloud resiliency and maintenance service, Janitor Monkey, from code check-in to a multi-region deployment.
      A Spinnaker bake and deploy pipeline triggered from Jenkins.
That said, we are always looking to improve the developer experience and are constantly challenging ourselves to do it better, faster, and easier.
      One challenge we are actively addressing is how we manage binary dependencies at Netflix. Nebula provides tools focused on making Java dependency management easier. For instance, the Nebula dependency-lock plugin allows applications to resolve their complete binary dependency graph and produce a .lock file which can be versioned. The Nebula resolution rules plugin allows us to publish organization-wide dependency rules that impact all Nebula builds. These tools help make binary dependency management easier, but still fall short of reducing the pain to an acceptable level.
Another challenge we are working to address is bake time. It wasn’t long ago that 16 minutes from commit to deployment was a dream, but as other parts of the system have gotten faster, this now feels like an impediment to rapid innovation. From the Janitor Monkey (Simian Army) example deployment above, the bake process took 7 minutes, or 44% of the total bake and deploy time. We have found the biggest drivers of bake time to be installing packages (including dependency resolution) and the AWS snapshot process itself.
      As Netflix grows and evolves, there is an increasing demand for our build and deploy toolset to provide first-class support for non-JVM languages, like JavaScript/Node.js, Python, Ruby and Go. Our current recommendation for non-JVM applications is to use the Nebula ospackage plugin to produce a Debian package for baking, leaving the build and test pieces to the engineers and the platform’s preferred tooling. While this solves the needs of teams today, we are expanding our tools to be language agnostic.
      Containers provide an interesting potential solution to the last two challenges and we are exploring how containers can help improve our current build, bake, and deploy experience. If we can provide a local container-based environment that closely mimics that of our cloud environments, we potentially reduce the amount of baking required during the development and test cycles, improving developer productivity and accelerating the overall development process. A container that can be deployed locally just as it would be in production without modification reduces cognitive load and allows our engineers to focus on solving problems and innovating rather than trying to determine if a bug is due to environmental differences.
      You can expect future posts providing updates on how we are addressing these challenges. If these challenges sound exciting to you, come join the Engineering Tools team. You can check out our open jobs and apply today!



      Stream-processing with Mantis

      Back in January of 2014 we wrote about the need for better visibility into our complex operational environments.  The core of the message in that post was about the need for fine-grained, contextual and scalable insights into the experiences of our customers and behaviors of our services.  While our execution has evolved somewhat differently from our original vision, the underlying principles behind that vision are as relevant today as they were then.  In this post we’ll share what we’ve learned building Mantis, a stream-processing service platform that’s processing event streams of up to 8 million events per second and running hundreds of stream-processing jobs around the clock.  We’ll describe the architecture of the platform and how we’re using it to solve real-world operational problems.

      Why Mantis?

      There are more than 75 million Netflix members watching 125 million hours of content every day in over 190 countries around the world.  To provide an incredible experience for our members, it’s critical for us to understand our systems at both the coarse-grained service level and fine-grained device level.  We’re good at detecting, mitigating, and resolving issues at the application service level - and we’ve got some excellent tools for service-level monitoring - but when you get down to the level of individual devices, titles, and users, identifying and diagnosing issues gets more challenging.

      We created Mantis to make it easy for teams to get access to realtime events and build applications on top of them.  We named it after the Mantis shrimp, a freakish yet awesome creature that is both incredibly powerful and fast.  The Mantis shrimp has sixteen photoreceptors in its eyes compared to humans’ three.  It has one of the most unique visual systems of any creature on the planet.  Like the shrimp, the Mantis stream-processing platform is all about speed, power, and incredible visibility.  

So Mantis is a platform for building low-latency, high-throughput stream-processing apps - but why do we need it? It’s been said that the Netflix microservices architecture is a metrics generator that occasionally streams movies. It’s a joke, of course, but there’s an element of truth to it; our systems do produce billions of events and metrics on a daily basis. Paradoxically, we often experience the problem of having both too much data and too little at the same time. Situations invariably arise in which you have thousands of metrics at your disposal but none are quite what you need to understand what’s really happening. There are some cases where you do have access to relevant metrics, but the granularity isn’t quite good enough for you to understand and diagnose the problem you’re trying to solve. And there are still other scenarios where you have all the metrics you need, but the signal-to-noise ratio is so low that the problem is virtually impossible to diagnose. Mantis enables us to build highly granular, realtime insights applications that give us deep visibility into the interactions between Netflix devices and our AWS services. It helps us better understand the long tail of problems where some users, on some devices, in some countries are having problems using Netflix.

      By making it easier to get visibility into interactions at the device level, Mantis helps us “see” details that other metrics systems can’t.  It’s the difference between 3 photoreceptors and 16.

      A Deeper Dive

With Mantis, we wanted to abstract developers away from the operational overhead associated with managing their own cluster of machines. Mantis was built from the ground up to be cloud native. It manages a cluster of EC2 servers that is used to run stream-processing jobs. Apache Mesos is used to abstract the cluster into a shared pool of computing resources. We built, and open-sourced, a custom scheduling library called Fenzo to intelligently allocate these resources among jobs.

      Architecture Overview

The Mantis platform comprises a master and an agent cluster. Users submit stream-processing applications as jobs that run as one or more workers on the agent cluster. The master consists of a Resource Manager that uses Fenzo to optimally assign resources to a job’s workers. A Job Manager embodies the operational behavior of a job, including metadata, SLAs, artifact locations, job topology, and life cycle.

      The following image illustrates the high-level architecture of the system.

      Mantis Jobs

Mantis provides a flexible model for defining a stream-processing job. A Mantis job can be defined as single-stage for basic transformation/aggregation use cases, or multi-stage for sharding and processing high-volume, high-cardinality event streams.

      There are three main parts to a Mantis job. 
      • The source is responsible for fetching data from an external source
• One or more processing stages, which are responsible for processing incoming event streams using higher-order RxJava functions
      • The sink to collect and output the processed data
      RxNetty provides non-blocking access to the event stream for a job and is used to move data between its stages.


      To give you a better idea of how a job is structured, let's take a look at a typical ‘aggregate by group’ example.



      Imagine that we are trying to process logs sent by devices to calculate error rates per device type.  The job is composed of three stages. The first stage is responsible for fetching events from a device log source job and grouping them based on device ID. The grouped events are then routed to workers in stage 2 such that all events for the same group (i.e., device ID) will get routed to the same worker.  Stage 2 is where stateful computations like windowing and reducing - e.g., calculating error rate over a 30 second rolling window - are performed.  Finally the aggregated results for each device ID are collected by Stage 3 and made available for dashboards or other applications to consume.
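As a rough sketch of that logic in RxJava (the library Mantis stages build on), the snippet below groups a stream of log events by device ID and computes an error rate per group; tumbling 30-second windows stand in for the rolling window described above, and the event type, field names, and source are illustrative rather than the actual Mantis job API:

import java.util.concurrent.TimeUnit;
import rx.Observable;

// Illustrative event type - the real device log schema is richer than this.
record LogEvent(String deviceId, boolean isError) {}

public class ErrorRateJobSketch {
    /** Group device log events by device ID and emit an error rate per group for
     *  each 30-second window. */
    static Observable<String> errorRates(Observable<LogEvent> events) {
        return events
            .groupBy(LogEvent::deviceId)                        // stage 1: shard by device ID
            .flatMap(group -> group
                .window(30, TimeUnit.SECONDS)                   // stage 2: stateful windowing
                .flatMap(window -> window
                    .reduce(new long[] {0, 0}, (acc, event) -> new long[] {
                        acc[0] + 1,                             // total events in the window
                        acc[1] + (event.isError() ? 1 : 0)      // error events in the window
                    })
                    .filter(acc -> acc[0] > 0)
                    .map(acc -> group.getKey()                  // stage 3: collect results
                            + " errorRate=" + (double) acc[1] / acc[0])));
    }
}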

      Job Chaining

      One of the unique features of Mantis is the ability to chain jobs together.  Job chaining allows for efficient data and code reuse.  The image below shows an example of an anomaly detector application composed of several jobs chained together.  The anomaly detector streams data from a job that serves Zuul request/response events (filtered using a simple SQL-like query) along with output from a “Top N” job that aggregates data from several other source jobs.


      Scaling in Action

      At Netflix the amount of data that needs to be processed varies widely based on the time of the day.  Running with peak capacity all the time is expensive and unnecessary. Mantis autoscales both the cluster size and the individual jobs as needed.

      The following chart shows how Fenzo autoscales the Mesos worker cluster by adding and removing EC2 instances in response to demand over the course of a week.


      And the chart below shows an individual job’s autoscaling in action, with additional workers being added or removed based on demand over a week.


      UI for Self-service, API for Integration

      Mantis sports a dedicated UI and API for configuring and managing jobs across AWS regions.  Having both a UI and API improves the flexibility of the platform.  The UI gives users the ability to quickly and manually interact with jobs and platform functionality while the API enables easy programmatic integration with automated workflows.

      The jobs view in the UI, shown below, lets users quickly see which jobs are running across AWS regions along with how many resources the jobs are consuming.


      Each job instance is launched as part of a job cluster, which you can think of as a class definition or template for a Mantis job.  The job cluster view shown in the image below provides access to configuration data along with a view of running jobs launched from the cluster config. From this view, users are able to update cluster configurations and submit new job instances to run.

      How Mantis Helps Us

      Now that we’ve taken a quick look at the overall architecture for Mantis, let’s turn our attention to how we’re using it to improve our production operations.  Mantis jobs currently process events from about 20 different data sources including services like Zuul, API, Personalization, Playback, and Device Logging to name a few.

      Of the growing set of applications built on these data sources, one of the most exciting use cases we’ve explored involves alerting on individual video titles across countries and devices.

      One of the challenges of running a large-scale, global Internet service is finding anomalies in high-volume, high-cardinality data in realtime.  For example, we may need access to fine-grained insights to figure out if there are playback issues with House of Cards, Season 4, Episode 1 on iPads in Brazil.  To do this we have to track millions of unique combinations of data (what we call assets) all the time, a use case right in Mantis’ wheelhouse.

      Let’s consider this use case in more detail.  The rate of events for a title asset (title * devices * country) shows a lot of variation.  So a popular title on a popular device can have orders of magnitude more events than lower usage title and device combinations.  Additionally for each asset, there is high variability in event rate based on the time of the day.  To detect anomalies, we track rolling windows of unique events per asset.  The size of the window and alert thresholds vary dynamically based on the rate of events.  When the percentage of anomalous events exceeds the threshold, we generate an alert for our playback and content platform engineering teams.  This approach has allowed us to quickly identify and correct problems that would previously go unnoticed or, best case, would be caught by manual testing or be reported via customer service.

      Below is a screen from an application for viewing playback stats and alerts on video titles. It surfaces data that helps engineers find the root cause for errors.


In addition to alerting at the individual title level, we can also do realtime alerting on our key performance indicator: SPS (stream starts per second). The advantage of Mantis alerting for SPS is that it gives us the ability to ratchet down our time to detect (TTD) from around 8 minutes to less than 1 minute. Faster TTD gives us a chance to resolve issues faster (time to recover, or TTR), which helps us win more moments of truth as members use Netflix around the world.

      Where are we going?

      We’re just scratching the surface of what’s possible with realtime applications, and we’re exploring ways to help more teams harness the power of stream-processing.  For example, we’re working on improving our outlier detection system by integrating Mantis data sources, and we’re working on usability improvements to get teams up and running more quickly using self-service tools provided in the UI.

      Mantis has opened up insights capabilities that we couldn’t easily achieve with other technologies and we’re excited to see stream-processing evolve as an important and complementary tool in our operational and insights toolset at Netflix.  

      If the work described here sounds exciting to you, head over to our jobs page; we’re looking for great engineers to join us on our quest to reinvent TV! 

      by Ben Schmaus, Chris Carey, Neeraj Joshi, Nick Mahilani, and Sharma Podila





      Extracting image metadata at scale


      We have a collection of nearly two million images that play very prominent roles in helping members pick what to watch. This blog describes how we use computer vision algorithms to address the challenges of focal point, text placement and image clustering at a large scale.


      Focal point
Every image has a region that is its most interesting part (e.g., a character’s face, the sharpest region, etc.). In order to effectively render an image on a variety of canvases, like a phone screen or TV, we often need to display only the interesting region of the image and dynamically crop the rest depending on the available real estate and the desired user experience. The goal of the focal point algorithm is to use a series of signals to identify the most interesting region of an image, then use that information to dynamically display it.
70177057_StoryArt_1536x864.jpg
80004288_StoryArt_1536x864 (2).jpg
      [Examples of face and full-body features to determine the focal point of the image]


We first try to identify all the people and their body positioning using Haar-cascade-like features. We also built Haar-based features to identify whether it is a close-up, upper-body, or full-body shot of the person(s). With this information, we were able to build an algorithm that auto-selects what is considered the “best” or “most interesting” person and then focuses in on that specific location.


However, not all images have humans in them. So, to identify interesting regions in those cases, we created a different signal - edges. We heuristically identify the focus of an image by first applying a Gaussian blur and then calculating edges for the image.


      Here is one example of applying such a transformation:


70300800_StoryArt_1536x864.jpg

#include <opencv2/opencv.hpp>
using namespace cv;

/// Blur to suppress noise, then apply the Laplacian to highlight edges.
Mat edgeMap(const Mat& input) {
    Mat src = input.clone(), src_gray, dst, abs_dst;
    const int n = 3, kernel_size = 3, scale = 1, delta = 0, ddepth = CV_16S;

    /// Remove noise by blurring with a Gaussian filter
    GaussianBlur( src, src, Size(n, n), 0, 0, BORDER_DEFAULT );
    /// Convert the image to grayscale
    cvtColor( src, src_gray, CV_BGR2GRAY );
    /// Apply Laplace function
    Laplacian( src_gray, dst, ddepth, kernel_size, scale, delta, BORDER_CONSTANT );
    convertScaleAbs( dst, abs_dst );
    return abs_dst;
}

      Below are a few examples of dynamically cropped images based on focal point for different canvases:


      face.gif


      Text Placement
      Another interesting challenge is determining what would be the best place to put text on an image. Examples of this are the ‘New Episode’ Badge and placement of subtitles in a video frame.


      0.png
      [Example of “New Episode” badge hiding the title of the show]


      In both cases, we’d like to avoid placing new text on top of existing text on these images.


Using a text detection algorithm allows us to automatically detect and correct such cases. However, text detection algorithms produce many false positives, so we apply several transformations, like watershed and thresholding, before running text detection. With these transformations, we can get a fairly accurate probability of text being present in a region of interest for images across a large corpus.
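Once candidate text regions have been detected, the placement step itself is a simple overlap test. The sketch below illustrates that step only; the detector, the candidate badge positions, and the class names are hypothetical:

import java.awt.Rectangle;
import java.util.List;
import java.util.Optional;

/**
 * Given text regions already found by a detector (not shown), choose the first
 * candidate badge position that does not overlap any detected text.
 */
final class BadgePlacer {
    static Optional<Rectangle> placeBadge(List<Rectangle> detectedTextRegions,
                                          List<Rectangle> candidatePositions) {
        for (Rectangle candidate : candidatePositions) {
            boolean overlapsText = detectedTextRegions.stream()
                                                      .anyMatch(candidate::intersects);
            if (!overlapsText) {
                return Optional.of(candidate);   // safe spot for the "New Episode" badge
            }
        }
        return Optional.empty();                 // fall back to a default placement policy
    }
}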




1e5b4580-bc6e-11e4-92d9-6306f841069d2.jpg
1.jpg
      [Results of text detection on some of the transformations of the same image]

      Image Clustering
Images play an important role in a member’s decision to watch a particular video. We constantly test various flavors of artwork for different titles to decide which one performs the best. In order to learn which image is more effective globally, we would like to see how an image performs in each region. To get an overall view of how well a particular set of visually similar images performed globally, we need to group them together based on their visual similarity.


      We have several derivatives of the same image to display for different users. Although visually similar, not all of these images come from the same source. These images have varying degrees of image cropping, resizing, color correction and title treatment to serve a global audience.


      As a global company that is constantly testing and experimenting with imagery, we have a collection of millions of images that we are continuously shifting and evolving. Manually grouping these images and maintaining those images can be expensive and time consuming, so we wanted to create a process that was smarter and more efficient.


      [An example of two images with slight color correction, cropping and localized title treatment]


These images are often transformed and color corrected, so a traditional color-histogram-based comparison does not always work for such automated grouping. Therefore, we came up with an algorithm that uses the following combination of parameters to determine a similarity index - a measurement of visual similarity among a group of images.


We calculate the similarity index based on the following 4 parameters:
      1. Histogram based distance
      2. Structural similarity between two images
      3. Feature matching between two images
      4. Earth mover’s distance algorithm to measure overall color similarity


      Using all 4 methods, we can get a numerical value of similarity between two images in a relatively fast comparison.
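A simplified sketch of how such a combination might look is shown below. The equal weights, the [0, 1] normalization, and the component interface are illustrative assumptions, not our production implementation:

/**
 * Combine the four measures into one similarity index. Each component is assumed
 * to be normalized to [0, 1], where 1 means identical.
 */
final class SimilarityIndex {
    interface Components {
        double histogramScore(byte[] imageA, byte[] imageB);        // histogram-based distance
        double structuralSimilarity(byte[] imageA, byte[] imageB);  // SSIM-style score
        double featureMatchScore(byte[] imageA, byte[] imageB);     // keypoint feature matching
        double earthMoversScore(byte[] imageA, byte[] imageB);      // overall color similarity
    }

    private final Components components;
    private final double[] weights = {0.25, 0.25, 0.25, 0.25};      // illustrative equal weights

    SimilarityIndex(Components components) {
        this.components = components;
    }

    double compute(byte[] imageA, byte[] imageB) {
        double[] scores = {
            components.histogramScore(imageA, imageB),
            components.structuralSimilarity(imageA, imageB),
            components.featureMatchScore(imageA, imageB),
            components.earthMoversScore(imageA, imageB)
        };
        double index = 0;
        for (int i = 0; i < scores.length; i++) {
            index += weights[i] * scores[i];      // weighted blend of the four signals
        }
        return index;                             // two images are "similar" above some threshold
    }
}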


Below is an example of images grouped based on a similarity index that is invariant to color correction, title treatment, cropping, and other transformations:
[Final result with similarity index values for a group of images]


Images play a crucial role in the first impression members form of our large collection of videos. We are just scratching the surface of what we can learn from media, and we have many more ambitious and interesting problems to tackle in the road ahead.


If you are excited and passionate about solving big problems, we are hiring. Contact us.

      Performance without Compromise

      Last week we hosted our latest Netflix JavaScript Talks event at our headquarters in Los Gatos, CA. We gave two talks about our unflinching stance on performance. In our first talk, Steve McGuire shared how we achieved a completely declarative, React-based architecture that’s fast on the devices in your living room. He talked about our architecture principles (no refs, no observation, no mixins or inheritance, immutable state, and top-down rendering) and the techniques we used to hit our tough performance targets. In our second talk, Ben Lesh explained what RxJS is, and why we use and love it. He shared the motivations behind a new version of RxJS and how we built it from the ground up with an eye on performance and debugging.

      React.js for TV UIs



      RxJS Version 5


      Videos from our past talks can always be found on our Netflix UI Engineering channel on YouTube. If you’re interested in being notified of future events, just sign up on our notification list.

      By Kim Trott

      Global Cloud - Active-Active and Beyond

      This is a continuing post on the Netflix architecture for Global Availability.  In the past we talked about efforts like Isthmus and Active-Active.  We continue the story from where we left off at the end of the Active-Active project in 2013.  We had achieved multi-regional resiliency for our members in the Americas, where the vast majority of Netflix members were located at the time.  Our European members, however, were still at risk from a single point of failure.
      Global Cloud - Q3 2014.png
Our expansion around the world since then has resulted in a growing percentage of international members who were exposed to this single point of failure, so we set out to make our cloud deployment even more resilient.

      Creating a Global Cloud

We decided to create a global cloud where we would be able to serve requests from any member in any AWS region where we are deployed. The diagram below shows the logical structure of our multi-region deployment and the default routing of member traffic to each AWS region.
      Global Cloud - Q1 2016.png

      Getting There

      Getting to the end state, while not disrupting our ongoing operations and the development of new features, required breaking the project down into a number of stages.  From an availability perspective, removing AWS EU-West-1 as a single point of failure was the most important goal, so we started in the Summer of 2014 by identifying the tasks that we needed to execute in order to be able to serve our European members from US-East-1.

      Data Replication

      When we initially launched service in Europe in 2012, we made an explicit decision to build regional data islands for most, but not all, of the member related data.  In particular, while a member’s subscription allowed them to stream anywhere that we offered service, information about what they watched while in Europe would not be merged with the information about what they watched while in the Americas.  Since we figured we would have relatively few members travelling across the Atlantic, we felt that the isolation that these data islands created was a win as it would mitigate the impact of a region specific outage.

      Cassandra

      In order to serve our EU members a normal experience from US-East-1, we needed to replicate the data in the EU Cassandra island data sets to the Cassandra clusters in US-East-1 and US-West-2.  We considered replicating this data into separate keyspaces in US clusters or merging the data with our Americas data.  While using separate keyspaces would have been more cost efficient, merging the datasets was more in line with our longer term goal of being able to serve any member from any region as the Americas data would be replicated to the Cassandra clusters in EU-West-1.

      Merging the EU and Americas data was more complicated than the replication work that was part of the 2013 Active-Active project as we needed to examine each component data set to understand how to merge the data.  Some data sets were appropriately keyed such that the result was the union of the two island data sets.  To simplify the migration of such data sets, the Netflix Cloud Database Engineering (CDE) team enhanced the Astyanax Cassandra client to support writing to two keyspaces in parallel.  This dual write functionality was sometimes used in combination with another tool built by the CDE that could be used to forklift data from one cluster or keyspace to another.  For other data sets, such as member viewing history, custom tools were needed to handle combining the data associated with each key.  We also discovered one or two data sets in which there were unexpected inconsistencies in the data that required deeper analysis to determine which particular values to keep.
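To illustrate the dual-write idea (without reproducing the actual Astyanax enhancement), here is a minimal sketch in which every mutation is written to both the island keyspace and the merged keyspace in parallel; the client interface and names are hypothetical:

import java.util.concurrent.CompletableFuture;

// Hypothetical client interface - not the Astyanax API - used only to illustrate
// the dual-write idea during the keyspace merge.
interface KeyspaceClient {
    CompletableFuture<Void> write(String rowKey, String column, byte[] value);
}

/** Writes every mutation to both the island keyspace and the merged keyspace. */
final class DualKeyspaceWriter {
    private final KeyspaceClient islandKeyspace;   // e.g., the original EU data island
    private final KeyspaceClient mergedKeyspace;   // the combined EU + Americas data set

    DualKeyspaceWriter(KeyspaceClient islandKeyspace, KeyspaceClient mergedKeyspace) {
        this.islandKeyspace = islandKeyspace;
        this.mergedKeyspace = mergedKeyspace;
    }

    CompletableFuture<Void> write(String rowKey, String column, byte[] value) {
        // Issue both writes in parallel; the call completes when both have been applied.
        return CompletableFuture.allOf(
            islandKeyspace.write(rowKey, column, value),
            mergedKeyspace.write(rowKey, column, value));
    }
}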

      EVCache

      As described in the blog post on the Active-Active project, we built a mechanism to allow updates to EVCache clusters in one region to invalidate the entry in the corresponding cluster in the other US region using an SQS message.  EVCache now supports both full replication and invalidation of data in other regions, which allows application teams to select the strategy that is most appropriate to their particular data set.  Additional details about the current EVCache architecture are available in a recent Tech Blog post.

      Personalization Data

      Historically the personalization data for any given member has been pre-computed in only one of our AWS regions and then replicated to whatever other regions might service requests for that member.  When a member interacted with the Netflix service in a way that was supposed to trigger an update of the recommendations, this would only happen if the interaction was serviced in the member’s “home” region, or its active-active replica, if any.

This meant that when a member was serviced from a different region during a traffic migration, their personalized information would not be updated.  Since there are regular, clock-driven updates to the precomputed data sets, this was considered acceptable for the first phase of the Global Cloud project.  In the longer term, however, the precomputation system was enhanced to allow the events that triggered recomputation to be delivered across all three regions.  This change also allowed us to redistribute the precomputation workload based on resource availability.

      Handling Misrouted Traffic

In the past, Netflix has used a variety of application level mechanisms to redirect device traffic that has landed in the “wrong” AWS region, due to DNS anomalies, back to the member’s “home” region.  While these mechanisms generally worked, they were often a source of confusion due to the differences in their implementations.  As we started moving towards the Global Cloud, we decided that, rather than redirecting the misrouted traffic, we would use the same Zuul-to-Zuul routing mechanism that we use when failing over traffic to another region to transparently proxy traffic from the “wrong” region to the “home” region.

As each region became capable of serving all members, we could then update the Zuul configuration to stop proxying the “misrouted” traffic to the member’s home region and simply serve it locally.  While this potentially added some latency versus sticky redirects, it allowed several teams to simplify their applications by removing the often crufty redirect code.  Application teams were given the guidance that they should no longer worry about whether a member was in the “correct” region and instead serve the best response that they could given the locally available information.
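
A minimal sketch of this idea as a Zuul 1 route filter is shown below, assuming a dynamic property that flips a region from proxying to serving locally. The property name, the request header carrying the home region, and the homeRegionUrlFor() helper are illustrative, not the actual Netflix configuration.

```java
import java.net.URL;

import com.netflix.config.DynamicBooleanProperty;
import com.netflix.config.DynamicPropertyFactory;
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

/**
 * Sketch of a Zuul 1 route filter that either proxies "misrouted" requests to
 * the member's home region or lets them be served locally once the local
 * region can handle any member. Property name, header, and URL lookup are
 * illustrative placeholders.
 */
public class HomeRegionRoutingFilter extends ZuulFilter {

    private final DynamicBooleanProperty serveLocally = DynamicPropertyFactory.getInstance()
            .getBooleanProperty("zuul.misrouted.serveLocally", false); // hypothetical property

    @Override
    public String filterType() {
        return "route";
    }

    @Override
    public int filterOrder() {
        return 10;
    }

    @Override
    public boolean shouldFilter() {
        RequestContext ctx = RequestContext.getCurrentContext();
        String homeRegion = ctx.getRequest().getHeader("X-Netflix-Home-Region"); // illustrative header
        return homeRegion != null && !serveLocally.get();
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        String homeRegion = ctx.getRequest().getHeader("X-Netflix-Home-Region");
        try {
            // Proxy transparently to the Zuul tier in the member's home region.
            ctx.setRouteHost(new URL(homeRegionUrlFor(homeRegion)));
        } catch (Exception e) {
            throw new RuntimeException("Unable to set route host for region " + homeRegion, e);
        }
        return null;
    }

    /** Hypothetical lookup of the home region's Zuul endpoint. */
    private String homeRegionUrlFor(String region) {
        return "https://zuul." + region + ".example.netflix.net";
    }
}
```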

      Evolving Chaos Kong

With the Active-Active deployment model, our Chaos Kong exercises involved failing over a single region into another region.  This is also the way we did our first few Global Cloud failovers.  The following graph shows our traffic steering during a production issue in US-East-1.  We steered traffic first from US-East-1 to US-West-2 and then, later in the day, to EU-West-1.  The upper graph shows that the aggregate, global stream starts tracked closely with the previous week’s pattern, despite the shifts in the amount of traffic being served by each region.  The thin light blue line shows SPS traffic for each region during the previous week and allows you to see the amount of traffic we are shifting.
[Figure: cleaned up view of traffic steering during INC-1453 in mid-October]
      By enhancing our traffic steering tools, we are now able to steer traffic from one region to both remaining regions to make use of available capacity.  The graphs below show a situation where we evacuated all traffic from US-East-1, sending most of the traffic to EU-West-1 and a smaller portion to US-West-2.

[Figure: SPS during the 2016-01-14 failover]
We have done similar evacuations for the other two regions, each of them splitting the rerouted traffic between the remaining regions based on available capacity in order to minimize member impact.  For more details on the evolution of the Kong exercises and our Chaos philosophy behind them, see our earlier post.

      Are We Done?

      Not even close.  We will continue to explore new ways in which to efficiently and reliably deliver service to our millions of global members.  We will report on those experiments in future updates here.

      -Peter Stout on behalf of all the teams that contributed to the Global Cloud Project

      How Netflix Uses John Stamos to Optimize the Cloud at Scale

      Netflix Technology relies heavily on the Cloud, thanks to its low latency and high compatibility with the internet.  

But the key to great Technology is having great Talent.  So when John Stamos expressed an interest in becoming more involved in our Engineering initiatives, needless to say, we were on Cloud Nine!


       
[Photo: John Stamos, star of Fuller House and world-renowned bongo player]

      Dreamy Results

Let’s take a look at the numbers.  Earlier this year, we were operating at a median average of Cloud 3.1.  We introduced Mr. Stamos into the system in early March, and in just under a month, he has helped us achieve a remarkable 290% gain.


      Here’s what our architecture looks like today:

[Figure: Highly Available Micro-Stamos Cloud Architecture]

      Personal Forecast

      As a Netflix user, you may already be seeing this effect on the quality of your recommendations, which has resulted in increased overall user engagement.  For example, with the release of Fuller House, users watched an additional REDACTED million hours, and experienced an average heart rate increase of 18%.  

      One might say that our personalization algorithms have a lot more personality!

[Figure: Stamos-optimized home page]

      Lifting the Fog

      How does Mr. Stamos drive these results?  

      After extensive analysis and observation (mostly observation), we are certain (p=0.98) that the biggest factors are his amazing attitude and exceptionally high rate of Hearts Broken Per Second (HBPS).  

      Based on these learnings, we are currently A/B testing ways to increase HBPS.  For example, which has a greater effect on the metrics: his impeccable hairstyle or his award-winning smile?  

      We’ll go over our findings in a follow-up blog post, but they look fabulous so far.


[Figure: Hearts Broken Per Second, Stamos (red) vs Control (blue)]

      Staying Cool

      Long-time readers will be familiar with our innovative Netflix Simian Army.  The best known example is Chaos Monkey, a process that tests the reliability of our services by intentionally causing failures at random.  

      Thanks to Mr. Stamos, we have a new addition to the army: Style Monkey, which tests how resilient our nodes are against unexpected clothing malfunctions and bad hair days.  

      As a pleasant side effect, we have noticed that the other monkeys in the Simian Army are much happier when Style Monkey is around.


[Figure: The Style Monkey API]

      A Heavenly Future

      Look for an Open Source version of our Stamos-driven Cloud architecture soon.  With contributions from the community, and by hanging out with Mr. Stamos every chance we can get, we think we can achieve Cloud 11 by late 2016.

      Gosh, isn’t he great?


      -The Netflix Engineering Team


      The Netflix IMF Workflow

Following our introductory article, this post (the second in our IMF series) describes the Netflix IMF ingest implementation and how it fits within the scope of our content processing pipeline. While conventional essence containers (e.g., QuickTime) commonly include essence data in the container file, the IMF CPL is designed to contain essence by reference. As we will see soon, this has interesting architectural implications.

      The Netflix Content Processing Pipeline

A simplified 3-step view of the Netflix content processing system is shown in the accompanying figure. Source files (audio, timed text or video) delivered to Netflix by content and fulfillment partners are first inspected for their correctness and conformance (the ingestion step). Examples of checks performed here include (a) data transfer sanity checks such as file and metadata integrity, (b) compliance of source files to the Netflix delivery specification, (c) file format validation, (d) decodability of the compressed bitstream, (e) “semantic” signal domain inspections, and more.

[Figure: simplified 3-step view of the Netflix content processing system]

In summary, the ingestion step ensures that the sources delivered to Netflix are pristine and guaranteed to be usable by the distributed, cloud-scalable Netflix trans-coding engine. Following this, inspected sources are transcoded to create output elementary streams which are subsequently encrypted and packaged into streamable containers. Since IMF is a source format, the scope of our implementation is predominantly the first (ingestion) step.

      IMF Ingestion: Main Concepts

      An IMF ingestion workflow needs to deal with the inherent decoupling of physical assets (track files) from their playback timeline. A single track file can be applicable to multiple playback timelines and a single playback timeline can comprise portions of multiple track files. Further, physical assets and CPLs (which define the playback timelines) can be delivered at different times (via different IMPs). This design necessarily assumes that assets (CPL and track files) within a particular operational domain (in this context, an operational domain can be as small as a single mastering facility or a playback system, or as large as a planetary data service) are cataloged by an asset management service.  Such a service would provide locator translation for UUID references (i.e., to locate physical assets somewhere in a file store) as well as import/export capability from/to other operational domains.

      A PKL (equivalently an IMP) defines the exchange of assets between two independent operational domains. It allows the receiving system to verify complete and error-free transmission of the intended payload without any out-of-band information. The receiving system can compare the asset references in the PKL with the local asset management service, and then initiate transfer operations on those assets not already present. In this way de-duplication is inherent to inter-domain exchange. The scope of a PKL being the specification of the inter-domain transfer, it is not expected to exist in perpetuity.  Following the transfer, asset management is the responsibility of the local asset management system at the receiver side.
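
A rough sketch of that de-duplication step follows: compare the asset UUIDs referenced by a PKL against the local asset catalog and transfer only what is missing. AssetCatalog and PklAsset are hypothetical stand-ins, not the real Netflix asset management APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/**
 * Sketch of PKL-driven de-duplication: only assets referenced by the PKL that
 * are not already known to the local asset management service are transferred.
 * AssetCatalog and PklAsset are hypothetical stand-ins.
 */
public class PklDeduplicator {

    /** Minimal view of the local asset management service. */
    interface AssetCatalog {
        boolean contains(UUID assetId);
    }

    /** One asset reference from the PKL: its UUID, size, and message digest. */
    record PklAsset(UUID id, long sizeBytes, String digest) {}

    private final AssetCatalog catalog;

    public PklDeduplicator(AssetCatalog catalog) {
        this.catalog = catalog;
    }

    /** Returns the subset of PKL assets that still needs to be transferred. */
    public List<PklAsset> assetsToTransfer(List<PklAsset> pklAssets) {
        List<PklAsset> missing = new ArrayList<>();
        for (PklAsset asset : pklAssets) {
            if (!catalog.contains(asset.id())) {
                missing.add(asset);
            }
        }
        return missing;
    }
}
```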

      The Netflix IMF Ingestion System

      We have utilized the above concepts to build the Netflix IMF implementation. The accompanying figure describes our IMF ingestion workflow as a flowchart. A brief description follows:

1. For every IMP delivered to Netflix for a specific title, we first perform transfer/delivery validations. These include, but are not limited to:
  1. checking the PKL and ASSETMAP files for correctness (while the PKL file contains a list of UUIDs corresponding to files that were part of the delivery, the ASSETMAP file specifies the mapping of these asset identifiers (UUIDs) to locations (URIs); for example, the ASSETMAP can contain HTTP URLs);
  2. ensuring that checksums (actually, message digests) corresponding to files delivered via the IMP match the values provided in the PKL (a minimal verification sketch appears after this list);
  3. cross-validating UUIDs against the delivered files. Track files in IMF follow the MXF (Material eXchange Format) file specification and are mandated in the IMF context to contain a UUID value that identifies the file. The CPL schema also mandates an embedded identifier (a UUID) that uniquely identifies the CPL document. This lets us cross-validate the list of UUIDs indicated in the PKL against the files that were actually delivered as part of the IMP.
2. We then perform validations on the contents of the IMP. Every CPL contained in the IMP is checked for syntactic correctness and specification compliance, and every essence track file contained in the IMP is checked for syntactic as well as semantic correctness. Examples of checks applicable to track files include:
  1. MXF file format validation;
  2. frame-by-frame decodability of the video essence bitstream;
  3. channel mapping validation for audio essence.
  We also collect significant descriptors, such as sample rates and sample counts, for these files.
3. Valid assets (essence track files and CPLs) are then cataloged in our asset management system.
4. Upon every IMP delivery, all the tracks of all the CPLs delivered against the title are checked for completeness, i.e., whether all necessary essence track files have been received and validated.
5. Timeline inspections are then conducted against all CPL tracks that are complete. Timeline inspections include, for example:
  1. detection of digital hits in the audio timeline;
  2. scene change detection in the video timeline;
  3. fingerprinting of audio and video tracks.
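
Referring back to the checksum check in step 1, here is a minimal sketch of verifying a delivered file against the digest listed in the PKL, assuming the common case of a base64-encoded SHA-1 hash; a production implementation would honor whatever hash algorithm the PKL declares.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/**
 * Sketch of verifying a delivered track file against the message digest
 * listed in the PKL. Assumes a base64-encoded SHA-1 hash.
 */
public final class PklDigestVerifier {

    public static boolean matchesPklHash(Path trackFile, String base64HashFromPkl)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(trackFile);
             DigestInputStream digestIn = new DigestInputStream(in, sha1)) {
            while (digestIn.read(buffer) != -1) {
                // reading the stream drives the digest computation
            }
        }
        String computed = Base64.getEncoder().encodeToString(sha1.digest());
        return computed.equals(base64HashFromPkl);
    }

    private PklDigestVerifier() {}
}
```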

At this point, the asset management system is updated appropriately. Following the completion of the timeline inspection, the completed CPL tracks are ready to be consumed by the Netflix trans-coding engine. In summary, we follow a two-pronged approach to inspections. While one set of inspections is conducted against delivered assets every time there is a delivery, another set of inspections is triggered every time a CPL track is completed.
[Figure: flowchart of the Netflix IMF ingestion workflow]

      Asset and Timeline Management

The asset management system is tasked with tracking associations between assets and playback timelines in the context of the many-to-many mapping that exists between the two. Any failures in ingestion due to problems in source files are typically resolved by redelivery by our content partners. For various reasons, multiple CPL files could be delivered against the same playback timeline over a period of time. This makes time versioning of playback timelines and garbage collection of orphaned assets important attributes of the Netflix asset management system. The asset management system also serves as a storehouse for all of the analysis data obtained as a result of conducting inspections.

      Perceived Benefits

      Incorporating IMF primitives as first class concepts in our ingestion pipeline has involved a big architectural overhaul. We believe that the following benefits of IMF have justified this undertaking:
      • reduction of several of our most frustrating content tracking issues, namely those related to “versionitis”
      • improved video quality (as we get direct access to high quality IMF masters)
      • optimizations around redeliveries, incremental changes (like new logos, content revisions, etc.), and minimized redundancy (partners deliver the “diff” between two versions of the same content)
      • metadata (e.g., channel mapping, color space information) comes in-band with physical assets, decreasing the opportunity for human error
      • granular checksums on essence (e.g., audio, video) facilitate distributed processing in the cloud

      Challenges

The following is a list of challenges we face as we roll out our IMF ingest implementation:
1. Legacy assets (in the content vaults of content providers) as well as legacy content production workflows abound at this time. While one could argue that the latter will succumb to IMF in the medium term, the former is here to stay. The conversion of legacy assets to IMF would likely be a long, drawn-out process. For all practical purposes, we need to work with a hybrid content ingestion workflow - one that handles both IMF and non-IMF assets. This introduces operational and maintenance complexities.
2. Global subtitles are core to the Netflix user experience. The current lack of standardization around timed text in IMF means that we are forced to accept timed text sources outside of IMF. In fact, IMSC1 (Internet Media Subtitles and Captions 1.0) - the current contender for the IMF timed text format - does not support some of the significant rendering features that are inherent to Japanese as well as some other Asian languages.
3. The current definition of IMF allows for version management between one IMF publisher and one IMF consumer. In the real world, multiple parties (content partners as well as fulfillment partners) could come together to produce a finished work. This necessitates a multi-party version management system (along the lines of software version control systems). While the IMF standard does not preclude this, the capability is missing from existing IMF implementations and does not yet have industry mind-share.

What’s Next

      In our next blog post, we will describe some of the community efforts we are undertaking to help move the IMF standard and its adoption forward.


      By Rohit Puri, Andy Schuler and Sreeram Chakrovorthy

      Saving 13 Million Computational Minutes per Day with Flame Graphs

We recently uncovered a significant performance win for a key microservice here at Netflix, and we want to share the problems, challenges, and new developments for the kind of analysis that led us to the optimization.

      Optimizing performance is critical for Netflix.  Improvements enhance the customer experience by lowering latency while concurrently reducing capacity costs through increased efficiency.  Performance wins and efficiency gains can be found in many places - tuning the platform and software stack configuration, selecting different AWS instance types or features that better fit unique workloads, honing autoscaling rules, and highlighting potential code optimizations.  While many developers throughout Netflix have performance expertise, we also have a small team of performance engineers dedicated to providing the specialized tooling, methodologies, and analysis to achieve our goals.

An on-going focus for the Netflix performance team is to proactively look for optimizations in our service and infrastructure tiers.  It’s possible to save hundreds of thousands of dollars with just a few percentage points of improvement.  Given that one of our largest workloads is primarily CPU-bound, we focused on collecting and analyzing CPU profiles in that tier.  Fortunately, with the work Brendan Gregg pushed forward to preserve the frame pointer in the JVM, we can easily capture CPU sampling profiles using Linux perf_events on live production instances and visualize them with flame graphs:


      For those unfamiliar with flame graphs, the default visualization places the initial frame in a call stack at the bottom of the graph and stacks subsequent frames on top of each other with the deepest frame at the top.  Stacks are ordered alphabetically to maximize the merging of common adjacent frames that call the same method.  As a quick side note, the magenta frames in the flame graphs are those frames that match a search phrase.  Their particular significance in these specific visualizations will be discussed later on.

      Broken Stacks

      Unfortunately 20% of what we see above contains broken stack traces for this specific workload.  This is due to the complex call patterns in certain frameworks that generate extremely deep stacks, far exceeding the current 127 frame limitation in Linux perf_events (PERF_MAX_STACK_DEPTH).  The long, narrow towers outlined below are the call stacks truncated as a result of this limitation:


We experimented with Java profilers to get around this stack depth limitation and were able to hack together a flame graph with just a few percent of broken stacks.  In fact, we captured so many unique stacks that we had to increase the minimum column width from the default 0.1 to 1 in order to generate a reasonably-sized flame graph that a browser can render.  The end result is an extremely tall flame graph of which this is only the bottom half:


      Although there are still a significant number of broken stacks, we begin to see a more complete picture materialize with the full call stacks intact in most of the executing code.

      Hot Spot Analysis

      Without specific domain knowledge of the application code there are generally two techniques to finding potential hot spots in a CPU profile.  Based on the default stack ordering in a flame graph, a bottom-up approach starts at the base of the visualization and advances upwards to identify interesting branch points where large chunks of the sampled cost either split into multiple subsequent call paths or simply terminate at the current frame.  To maximize potential impact from an optimization, we prioritize looking at the widest candidate frames first before assessing the narrower candidates.

      Using this bottom-up approach for the above flame graph, we derived the following observations:
      • Most of the lower frames in the stacks are calls executing various framework code
      • We generally find the interesting application-specific code in the upper portion of the stacks
• Sampled costs have been whittled down to just a few percentage points at most by the time we reach the interesting code in the thicker towers
      While it’s still worthwhile to pursue optimizing some of these costs, this visualization doesn’t point to any obvious hot spot.

      Top-Down Approach

      The second technique is a top-down approach where we visualize the sampled call stacks aggregated in reverse order starting from the leaves.  This visualization is simple to generate with Brendan’s flame graph script using the options --inverted --reverse, which merges call stacks top-down and inverts the layout.  Instead of a flame graph, the visualization becomes an icicle graph.  Here are the same stacks from the previous flame graph reprocessed with those options:


      In this approach we start at the top-most frames again prioritizing the wider frames over the more narrow ones.  An icicle graph can surface some obvious methods that are called excessively or are by themselves computationally expensive.  Similarly it can help highlight common code executed in multiple unique call paths.

      Analyzing the icicle graph above, the widest frames point to map get/put calls.  Some of the other thicker columns that stand out are related to a known inefficiency within one of the frameworks utilized.  However, this visualization still doesn’t illustrate any specific code path that may produce a major win.

      A Middle-Out Approach

      Since we didn’t find any obvious hotspots from the bottom-up and top-down approaches, we thought about how to improve our focus on just the application-specific code.  We began by reprocessing the stack traces to collapse the repetitive framework calls that constitute most of the content in the tall stacks.  As we iterated through removing the uninteresting frames, we saw some towers in the flame graph coalescing around a specific application package.

      Remember those magenta-colored frames in the images above?  Those frames match the package name we uncovered in this exercise.  While it’s somewhat arduous to visualize the package’s cost because the matching frames are scattered throughout different towers, the flame graph's embedded search functionality attributed more than 30% of the samples to the matching package.

      This cost was obscured in the flame graphs above because the matching frames are split across multiple towers and appear at different stack depths.  Similarly, the icicle graph didn’t fare any better because the matching frames diverge into different call paths that extend to varying stack heights.  These factors defeat the merging of common frames in both visualizations for the bottom-up and top-down approaches because there isn’t a common, consolidated pathway to the costly code.  We needed a new way to tackle this problem.

      Given the discovery of the potentially expensive package above, we filtered out all the frames in each call stack until we reached a method calling this package.  It’s important to clarify that we are not simply discarding non-matching stacks.  Rather, we are reshaping the existing stack traces in different and better ways around the frames of interest to facilitate stack merging in the flame graph.  Also, to keep the matching stack cost in context within the overall sampled cost, we truncated the non-matching stacks to merge into an “OTHER” frame.  The resulting flame graph below reveals that the matching stacks account for almost 44% of CPU samples, and we now have a high degree of clarity of the costs within the package:
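
A minimal sketch of this reshaping step over collapsed stacks (the "frame1;frame2;... count" format consumed by flamegraph.pl) is shown below. The package prefix is a placeholder, and the real processing handled additional cases beyond what is shown here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of reshaping collapsed stacks: frames below the first call into a
 * package of interest are dropped, and stacks that never touch the package
 * are folded into a single OTHER frame so the matching cost stays in context
 * of the overall profile. The package prefix is a placeholder.
 */
public class StackReshaper {

    private static final String PACKAGE_OF_INTEREST = "com/example/legacyfeature"; // placeholder

    public static Map<String, Long> reshape(Path collapsedStacks) throws IOException {
        Map<String, Long> merged = new LinkedHashMap<>();
        for (String line : Files.readAllLines(collapsedStacks)) {
            int lastSpace = line.lastIndexOf(' ');
            if (lastSpace < 0) {
                continue; // not a "stack count" line
            }
            String[] frames = line.substring(0, lastSpace).split(";");
            long count = Long.parseLong(line.substring(lastSpace + 1).trim());

            int firstMatch = -1;
            for (int i = 0; i < frames.length; i++) {
                if (frames[i].contains(PACKAGE_OF_INTEREST)) {
                    firstMatch = i;
                    break;
                }
            }
            // Keep the stack from the first matching frame upwards; otherwise fold into OTHER.
            String key = (firstMatch >= 0)
                    ? String.join(";", Arrays.copyOfRange(frames, firstMatch, frames.length))
                    : "OTHER";
            merged.merge(key, count, Long::sum);
        }
        return merged;
    }
}
```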


      To explain how the matching cost jumped from 30% to 44% between the original flame graphs and the current one, we recall that earlier we needed to increase the column minimum width from 0.1 to 1 in the unfiltered flame graph.  That surprisingly removed almost a third of the matching stack traces, which further highlights how the call patterns dispersed the package's cost across the profile.  Not only have we simplified and consolidated the visualization for the package through reshaping the stack traces, but we have also improved the precision when accounting for its costs.

      Hitting the Jackpot

      Once we had this consolidated visualization, we began unraveling the application logic executing the call stack in the main tower.  Reviewing this flame graph with the feature owners, it was quickly apparent that a method call into legacy functionality accounted for an unexpectedly large proportion of the profiled cost.  The flame graph above highlights the suspect method in magenta and shows it matching 25% of CPU samples.  As luck would have it, a dynamic flag already existed that gates the execution of this method.  After validating that the method no longer returns distinct results, the code path was disabled in a canary cluster resulting in a dramatic drop in CPU usage and request latencies:



      Capturing a new flame graph further illustrates the savings with the package’s cost decreasing to 18% of CPU samples and the unnecessary method call now matching just a fraction of a percent:


      Quantifying the actual CPU savings, we calculated that this optimization reduces the service's computation time by more than 13 million minutes (or almost 25 years) of CPU time per day.

      Looking Forward

      Aggregating costs based on a filter is admittedly not a new concept.  The novelty of what we’ve done here is to maximize the merging of interesting frames through the use of variable stack trace filters within a flame graph visualization.

      In the future we’d like to be able to define ad hoc inclusion or exclusion filters in a full flame graph to dynamically update a visualization into a more focused flame graph such as the one above.  It will also be useful to apply inclusion filters in reverse stack order to visualize the call stacks leading into matching frames.  In the meantime, we are exploring ways to intelligently generate filtered flame graphs to help teams visualize the cost of their code running within shared platforms.

Our goal as a performance team is to help scale not only the company but also our own capabilities. To achieve this we look to develop new approaches and tools for others at Netflix to consume in a self-service fashion. The flame graph work in this post is part of a profiling platform we have been building that can generate on-demand CPU flame graphs for Java and Node.js applications and is available 24x7 across all services. We'll be using this platform for automated regression analysis as well as to push actionable data directly to engineering teams. Lots of exciting and new work in this space is coming up.


      A Scalable System for Ingestion and Delivery of Timed Text

Offering the same great Netflix experience to diverse audiences and cultures around the world is a core aspect of the global Netflix video delivery service. With high quality subtitle localization being a key component of the experience, we have developed (and are continuously refining) an i18n-grade, Unicode-based timed text processing pipeline.  This pipeline allows us to meet the challenges of scale brought by the global Netflix platform as well as features unique to each script and language. In this article, we provide a description of this timed text processing pipeline at Netflix, including factors and insights that shaped its architecture.

      Timed Text At Netflix: Overview

As described above, Timed Text - subtitles, Closed Captions (CC), and Subtitles for the Deaf and Hard of Hearing (SDH) - is a core component of the Netflix experience.  A simplified view of the Netflix timed text processing pipeline is shown in the accompanying figure. Every timed text source delivered to Netflix by a content provider or a fulfillment partner goes through the following major steps before showing up on the Netflix service:

[Figure: simplified view of the Netflix timed text processing pipeline]

• Ingestion: The first step involves delivery of the authored timed text asset to Netflix. Once a transfer has completed, the data is checked for any transport corruption and the corresponding metadata is verified. Examples of such metadata include (but are not limited to) the associated movie title and the primary language of the timed text source.
• Inspection: The ingested asset is then subject to a rigorous set of automated checks to identify any authoring errors. These errors fall mostly into two categories, namely specification conformance and style compliance. The following sections give more detail on the types and stages of these inspections.
• Conversion: An inspected, error-free file is then used to generate the output files needed to support the device ecosystem. Netflix needs to host different output versions of the ingested asset to satisfy varying device capabilities in the field.

As the number of regions, devices and file formats grows, we must accommodate the ever growing requirements on the system. We have responded to these challenges by designing an i18n-grade, Unicode-based pipeline. Let’s look at the individual components in the next sections.

      Inspection: What’s In Timed Text?

The core information communicated in a timed text file corresponds to a text translation of what is being spoken on screen along with the associated active time intervals. In addition, timed text files might carry positional information (e.g., to indicate who might be speaking or to place rendered text in a non-active area of the screen) as well as any associated text styles such as color (e.g., to distinguish speakers), italics, etc. Readers who are familiar with HTML and CSS web technologies might understand timed text to provide a similar but lightweight way of formatting text data with layouts and style information.

      Source Formats

      Multiple formats for authoring timed text have evolved over time and across regions, each with different capabilities. Based on factors including the extent of standardization as well as the availability of authored sources, Netflix predominantly accepts the following timed text formats:

      • CEA-608 based Scenarist Closed Captions (.scc)
      • EBU Subtitling data exchange format (.stl)
      • TTML (.ttml, .dfxp, .xml)
      • Lambda Cap (.cap) (for Japanese language only)

      An approximate distribution of timed text sources delivered to Netflix is depicted below. Given our experience with the broadcast lineage (.scc and .stl) as well as regional source formats (.cap), we prefer delivery in the TTML (Timed Text Markup Language) format. While formats like .scc and .stl have limited language support (e.g., both do not include Asian character set), .cap and .scc are ambiguous from a specification point of view. As an example, .scc files use drop frame timecode syntax to indicate 23.976 fps (frames per second) - however this was defined only for the NTSC 29.97 frame rate in SMPTE (Society of Motion Picture and Television Engineers). As a result, the proportion of TTML-based subtitles in our catalog is on the rise.

[Figure: approximate distribution of timed text source formats delivered to Netflix]

      The Inspections Pipeline

Let’s now see how timed text files are inspected through the Netflix TimedText Processing Pipeline. Control-flow wise, given a source file, the appropriate parser performs source-specific inspections to ensure the file adheres to the purported specification.  The source is then converted to a common canonical format where semantic inspections are performed (an example of such an inspection is to check whether timed text events would collide spatially when rendered; more examples are shown in the adjoining figure).

[Figure: examples of source-specific and semantic timed text inspections]

      Given many possible character encodings (e.g., UTF-8, UTF-16, Shift-JIS), the first step is to detect the most probable charset. This information is used by the appropriate parser to parse the source file. Most parsing errors are fatal in nature resulting in termination of the inspection processing which is followed by a redelivery request to the content partner.
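
A minimal sketch of this detect-then-decode step, using ICU4J’s CharsetDetector, is shown below; the confidence threshold and the UTF-8 fallback are illustrative choices rather than the pipeline’s actual behavior.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

/**
 * Sketch of charset detection followed by decoding, using ICU4J.
 * The confidence threshold and UTF-8 fallback are illustrative.
 */
public class TimedTextDecoder {

    private static final int MIN_CONFIDENCE = 50; // illustrative threshold (0-100 scale)

    public static String readSource(Path source) throws IOException {
        byte[] bytes = Files.readAllBytes(source);
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        CharsetMatch match = detector.detect();

        String charsetName = (match != null && match.getConfidence() >= MIN_CONFIDENCE)
                ? match.getName()
                : "UTF-8"; // fall back to UTF-8 when detection is inconclusive
        return new String(bytes, Charset.forName(charsetName));
    }
}
```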

Semantic checks that are common to all formats are performed in the ISD (Intermediate Synchronic Document) based canonical domain. Parsed objects from various sources generate ISD sequences on which more analysis is carried out. An ISD representation can be thought of as a decomposition of the subtitle timeline into a sequence of time intervals such that within each such interval the rendered subtitle matter stays the same (see adjacent figure). These snapshots include style and positional information for that interval and are completely isolated from other events in the sequence. This makes for a great model for running concurrent inspections as well. The following diagram depicts how the ISD format can be visualized.

[Figure: ISD decomposition of a subtitle timeline into synchronic intervals]
Stylistic and language-specific checks are performed on this data. An example of a canonical check is counting the number of active lines on the screen at any point in time. Some of these checks may be fatal; others may trigger a human-review-required warning and allow the source to continue down the workflow.

Another class of inspections is built around Unicode recommendations. Unicode TR-9, which specifies the Unicode Bidirectional (BiDi) Algorithm, is used to check if a file with bi-directional text conforms to the specification and whether the final display-ordered output would make sense (bidirectional text is common in languages such as Arabic and Hebrew, where the displayed text runs from right to left while numbers and text in other languages run from left to right). Normalization rules (TR-15) may have to be applied to check glyph conformance and rendering viability before these assets can be accepted.
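
As an illustration, the Java standard library already exposes enough to sketch checks of this flavor: java.text.Bidi for UAX #9 style run analysis and java.text.Normalizer for normalization form checks. Which outcomes the pipeline treats as fatal versus warnings is not captured here.

```java
import java.text.Bidi;
import java.text.Normalizer;

/**
 * Sketch of two Unicode-oriented checks on a subtitle text event: whether the
 * text needs bidirectional processing (and how many directional runs it
 * resolves to), and whether it is NFC-normalized.
 */
public final class UnicodeChecks {

    /** True if the event text contains right-to-left material requiring BiDi handling. */
    public static boolean requiresBidi(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    /** Number of directional runs after applying the BiDi algorithm (UAX #9). */
    public static int directionalRunCount(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.getRunCount();
    }

    /** True if the text is already in Normalization Form C (UAX #15). */
    public static boolean isNfcNormalized(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    private UnicodeChecks() {}
}
```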

      Language based checks are an interesting study. Take, for example, character limits. A sentence that is too long will force wrapping or line breaks. This ignores authoring intent and compromises rendering aesthetics. The actual limit will vary between languages (think Arabic versus Japanese). If enforcing reading speed, those character limits must also account for display time. For these reasons, canonical inspections must be highly configurable and pluggable.
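
A minimal sketch of such a configurable, per-language check follows; the limits shown are placeholders, not Netflix’s editorial values.

```java
import java.util.Map;

/**
 * Sketch of a configurable, per-language check for characters per line and
 * reading speed (characters per second of display time). The limits are
 * placeholders keyed by BCP-47 language tag.
 */
public class ReadabilityCheck {

    record Limits(int maxCharsPerLine, double maxCharsPerSecond) {}

    private static final Map<String, Limits> LIMITS = Map.of(
            "en", new Limits(42, 17.0),   // placeholder values
            "ja", new Limits(16, 4.0),
            "ar", new Limits(42, 17.0));

    /** True if every line fits the per-line limit and the event respects reading speed. */
    public static boolean passes(String language, String[] lines, double displaySeconds) {
        Limits limits = LIMITS.getOrDefault(language, new Limits(42, 17.0));
        int totalChars = 0;
        for (String line : lines) {
            if (line.length() > limits.maxCharsPerLine()) {
                return false; // line would wrap or break, ignoring authoring intent
            }
            totalChars += line.length();
        }
        return displaySeconds > 0 && totalChars / displaySeconds <= limits.maxCharsPerSecond();
    }
}
```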

      Output Profiles: Why We Convert

      While source formats for timed text are designed for the purpose of archiving, delivery formats are designed to be nimble so as to facilitate streaming and playback in bandwidth, CPU and memory constrained environments. To achieve this objective, we convert all timed text sources to the following family of formats:

      • TTML Based Output profiles
      • WebVTT Based Output profiles
      • Image Subtitles

      After a source file passes inspection, the ISD-based canonical representation is saved in cloud storage. This forms the starting point for the conversion step.  First, a set of broad filters that are applicable to all output profiles are applied. Then, models for corresponding standards (TTML, WebVTT) are generated. We continue to filter down based on output profile and language. From there, it’s simply a matter of writing and storing the downloadables. The following figure describes conversion modules and output profiles in the Netflix TimedText Processing Pipeline.
[Figure: conversion modules and output profiles in the Netflix TimedText Processing Pipeline]

Multiple profiles within a family of formats may be required. Depending on device capabilities, the TTML set, for example, has been divided into the following profiles:
      • simple-sdh: Supports only text and timing information. This profile doesn’t support any positional information and is expected to be consumed by the most resource-limited devices.
      • ls-sdh: Abbreviated from less-simple sdh, it supports a richer variety of text styles and positional data on top of simple-sdh. The simple-sdh and ls-sdh serve the Pan-European and American geography and use the WGL4 (Windows Glyph List 4) character repertoire.
• ntflx-ttml-gsdh: This is the latest addition to the TTML family. It supports a broad range of Unicode code-points as well as language features like Bidi, rubies, etc. The following text snippets show vertical writing mode with ruby and “tate-chu-yoko” features.
[Figure: vertical writing mode with ruby and tate-chu-yoko features]
      When a Netflix user activates subtitles, the device requests a specific profile based on its capabilities. Devices with enough RAM (and a good connection) might download the ls-sdh file. Resource-limited devices may ask for the smaller simple-sdh file.
Additionally, certain language features (like Japanese rubies and boutens) may require advanced rendering capabilities not available on all devices. To support this, the image profile pre-renders subtitles as images in the cloud and transmits them to end devices using a progressive transfer model. The WebVTT family of output profiles is primarily used by the Apple iOS platform. The accompanying pie chart shows how consumption is split across these output profiles.

      QC: How Does it Look?

We have automated (or are working towards automating) a number of quality-related measurements: these include spelling checks, text-to-audio sync, text overlapping burned-in text, reading speed limits, characters per line, and total lines per screen. While such metrics go a long way towards improving the quality of subtitles, they are by no means enough to guarantee a flawless user experience.
      There are times when rendered subtitle text might occlude an important visual element, or subtitle translations from one language to another can result in an unidiomatic experience. Other times there could be intentional misspellings or onomatopoeia - we still need to rely on human eyes and human ears to judge subtitling quality in such cases. A lot of work remains to achieve full QC automation.

      Netflix and the Community

Given the significance of subtitles to the Netflix business, Netflix has been actively involved in timed text standardization forums such as the W3C TTWG (Timed Text Working Group). IMSC1 (Internet Media Subtitles and Captions) is a TTML-based specification that addresses some of the limitations encountered in existing source formats. Further, it has been deemed mandatory for IMF (Interoperable Master Format). Netflix is 100% committed to IMF and we expect that our ingest implementation will support the text profile of IMSC. To that end, we have been actively involved in moving the IMSC1 specification forward. Multiple Netflix-sponsored implementations of IMSC1 were announced to the TTWG in February 2016, paving the way for the specification to move to recommendation status.

IMSC1 does not have support for essential features (e.g., rubies) for rendering Japanese and other Asian subtitles. To accelerate that effort we are actively involved in the standardization of TTML2 - both from a specification as well as an implementation perspective. Our objective is to get to a TTML2-based IMSC2.

      Examples of OSS (open source software) projects sponsored by Netflix in this context include “ttt” (Timed Text Toolkit). This project offers tools for validation and rendering of W3C TTML family of formats (e.g., TTML1, IMSC1, and TTML2). “Photon” is an example of a project developed internally at Netflix. The objective of Photon is to provide the complete set of tools for validation of IMF packages.

      The role of Netflix in advancing subtitle standards in the industry has been recognized by The National Academy of Television Arts & Sciences, and Netflix was a co-recipient of the 2015 Technology and Engineering Emmy Award for “Standardization and Pioneering Development of Non-Live Broadcast Captioning”.

      Future Work

      In a closed system such as the Netflix playback system, where the generation and consumption of timed text delivery formats can be controlled, it is possible to have a firm grip on the end-to-end system. Further, the streaming player industry has moved to support leading formats. However, support on the ingestion side remains tricky. New markets can introduce new formats with new features.

Consider right-to-left languages. The bidirectional (bidi) algorithm has gone through many revisions. Many tools still in use were developed against old versions of the bidi specification. As these files are passed to newer tools with newer bidi implementations, chaos ensues.

Old but popular formats like SCC and STL were developed for broadcast frame rates. When conformed to film content, they fall outside the scope of the original intent. The following chart shows the distribution of these sources as delivered to Netflix. More than 60% of our broadcast-minded assets have been conformed to film content.

[Figure: distribution of broadcast-era sources conformed to film content]

Such challenges generate ever-increasing requirements on inspection routines. The resulting thrash requires operational support to manage communication across teams for triage and redelivery. One idea for reducing these overheads could be to offer inspections as a web service (see accompanying figure).

      In such a model, a partner uploads their timed text file(s), inspections are run, and a common format file(s) is returned. This common format will be an open standard like TTML. The service also provides a preview of how the text will be rendered. In case an error has been found, we can show the partner where the error is and suggest recommendations to fix it.

Not only will this model reduce the frequency of software maintenance and enhancement, but it will also drastically cut down the need for manual intervention. It also provides an opportunity for our content partners, who could integrate this capability into their authoring workflows and iteratively improve the quality of the authored subtitles.


[Figure: inspections offered as a web service]

      Timed text assets carry a lot of untapped potential. For example, timed text files may contain object references in dialogue. Words used in a context could provide more information about possible facial expressions or actions on the screen. Machine learning and natural language processing may help solve labor-intensive QC challenges. Data mining into the timed text could even help automate movie ratings. As media consumption becomes more global, timed text sources will explode in number and importance. Developing a system that scales and learns over time is the demand of the hour.



      By Shinjan Tiwary, Dae Kim, Harold Sutherland, Rohit Puri and David Ronca

      Kafka Inside Keystone Pipeline


      This is the second blog of our Keystone pipeline series. Please refer to the first part for overview and evolution of the Keystone pipeline. In summary, the Keystone pipeline is a unified event publishing, collection, and routing infrastructure for both batch and stream processing.


      We have two sets of Kafka clusters in Keystone pipeline: Fronting Kafka and Consumer Kafka. Fronting Kafka clusters are responsible for getting the messages from the producers which are virtually every application instance in Netflix. Their roles are data collection and buffering for downstream systems. Consumer Kafka clusters contain a subset of topics routed by Samza for real-time consumers.

      We currently operate 36 Kafka clusters consisting of 4,000+ broker instances for both Fronting Kafka and Consumer Kafka. More than 700 billion messages are ingested on an average day. We are currently transitioning from Kafka version 0.8.2.1 to 0.9.0.1.

      Design Principles

Given the current Kafka architecture and our huge data volume, achieving lossless delivery for our data pipeline is cost prohibitive in AWS EC2. Accounting for this, we’ve worked with teams that depend upon our infrastructure to arrive at an acceptable amount of data loss, while balancing cost.  We’ve achieved a daily data loss rate of less than 0.01%. Metrics are gathered for dropped messages so we can take action if needed.

The Keystone pipeline produces messages asynchronously without blocking applications. In case a message cannot be delivered after retries, it will be dropped by the producer to ensure the availability of the application and a good user experience. This is why we have chosen the following configuration for our producer and broker (a minimal producer sketch follows the list):

      • acks = 1
      • block.on.buffer.full = false
      • unclean.leader.election.enable = true
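
A minimal producer sketch along these lines is shown below, as referenced above. The bootstrap servers, topic, and dropped-message hook are placeholders, and unclean.leader.election.enable is a broker/topic setting, so it does not appear in the producer properties.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

/**
 * Minimal sketch of a fire-and-forget producer: acks=1 and no blocking when
 * the local buffer is full, so the application is never stalled by the pipeline.
 */
public class KeystoneStyleProducer {

    private final KafkaProducer<byte[], byte[]> producer;

    public KeystoneStyleProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "1");
        props.put("block.on.buffer.full", "false"); // drop rather than block the application
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    /** Send asynchronously; on failure the message is counted as dropped, never retried inline. */
    public void send(String topic, byte[] payload) {
        producer.send(new ProducerRecord<>(topic, payload), (metadata, exception) -> {
            if (exception != null) {
                // placeholder: increment a dropped-message counter for monitoring
            }
        });
    }
}
```
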
Most of the applications in Netflix use our Java client library to produce to the Keystone pipeline. On each instance of those applications, there are multiple Kafka producers, with each producing to a Fronting Kafka cluster for sink level isolation. The producers have flexible topic routing and sink configuration which are driven via dynamic configuration that can be changed at runtime without having to restart the application process. This makes it possible to do things like redirecting traffic and migrating topics across Kafka clusters. Non-Java applications can choose to send events to Keystone REST endpoints, which relay messages to the fronting Kafka clusters.

      For greater flexibility, the producers do not use keyed messages. Approximate message ordering is re-established in the batch processing layer (Hive / Elasticsearch) or routing layer for streaming consumers.

      We put the stability of our Fronting Kafka clusters at a high priority because they are the gateway for message injection. Therefore we do not allow client applications to directly consume from them to make sure they have predictable load.

      Challenges of running Kafka in the Cloud

Kafka was developed at LinkedIn with the data center as its deployment target. We have made notable efforts to make Kafka run better in the cloud.

      In the cloud, instances have an unpredictable life-cycle and can be terminated at anytime due to hardware issues. Transient networking issues are expected. These are not problems for stateless services but pose a big challenge for a stateful service requiring ZooKeeper and a single controller for coordination.

Most of our issues begin with outlier brokers. An outlier may be caused by uneven workload, hardware problems or its specific environment, for example, noisy neighbors due to multi-tenancy. An outlier broker may have slow responses to requests or frequent TCP timeouts/retransmissions. Producers that send events to such a broker have a good chance of exhausting their local buffers while waiting for responses, after which message drops become a certainty. The other contributing factor to buffer exhaustion is that the Kafka 0.8.2 producer doesn’t support a timeout for messages waiting in the buffer.

Kafka’s replication improves availability. However, replication leads to inter-dependencies among brokers where an outlier can cause a cascading effect. If an outlier slows down replication, replication lag may build up and eventually cause partition leaders to read from the disk to serve the replication requests. This slows down the affected brokers and eventually results in producers dropping messages due to exhausted buffers, as explained above.

      During our early days of operating Kafka, we experienced an incident where producers were dropping a significant amount of messages to a Kafka cluster with hundreds of instances due to a ZooKeeper issue while there was little we could do. Debugging issues like this in a small time window with hundreds of brokers is simply not realistic.

      Following the incident, efforts were made to reduce the statefulness and complexity for our Kafka clusters, detect outliers, and find a way to quickly start over with a clean state when an incident occurs.

      Kafka Deployment Strategy

The following are the key strategies we used for deploying Kafka clusters:

      • Favor multiple small Kafka clusters as opposed to one giant cluster. This reduces the operational complexity for each cluster. Our largest cluster has less than 200 brokers.
      • Limit the number of partitions in each cluster. Each cluster has less than 10,000 partitions. This improves the availability and reduces the latency for requests/responses that are bound to the number of partitions.
      • Strive for even distribution of replicas for each topic. Even workload is easier for capacity planning and detection of outliers.
      • Use dedicated ZooKeeper cluster for each Kafka cluster to reduce the impact of ZooKeeper issues.

      The following table shows our deployment configurations.


• Fronting Kafka clusters: 24 clusters, 3,000+ instances, d2.xl instance type, replication factor 2, retention period of 8 to 24 hours
• Consumer Kafka clusters: 12 clusters, 900+ instances, i2.2xl instance type, replication factor 2, retention period of 2 to 4 hours

      Kafka Failover

      We automated a process where we can failover both producer and consumer (router) traffic to a new Kafka cluster when the primary cluster is in trouble. For each fronting Kafka cluster, there is a cold standby cluster with desired launch configuration but minimal initial capacity. To guarantee a clean state to start with, the failover cluster has no topics created and does not share the ZooKeeper cluster with the primary Kafka cluster. The failover cluster is also designed to have replication factor 1 so that it will be free from any replication issues the original cluster may have.

      When failover happens, the following steps are taken to divert the producer and consumer traffic:

      • Resize the failover cluster to desired size.
      • Create topics on and launch routing jobs for the failover cluster in parallel.
      • (Optionally) Wait for leaders of partitions to be established by the controller to minimize the initial message drop when producing to it.
      • Dynamically change the producer configuration to switch producer traffic to the failover cluster.

      The failover scenario can be depicted by the following chart:
With the complete automation of the process, we can do failover in less than 5 minutes. Once failover has completed successfully, we can debug the issues with the original cluster using logs and metrics. It is also possible to completely destroy the cluster and rebuild with new images before we switch back the traffic. In fact, we often use the failover strategy to divert traffic while doing offline maintenance. This is how we are upgrading our Kafka clusters to a new Kafka version without having to do a rolling upgrade or set the inter-broker communication protocol version.

      Development for Kafka

      We developed quite a lot of useful tools for Kafka. Here are some of the highlights:

      Producer sticky partitioner

This is a special customized partitioner we have developed for our Java producer library. As the name suggests, it sticks to a certain partition for producing for a configurable amount of time before randomly choosing the next partition. We found that using the sticky partitioner together with lingering helps to improve message batching and reduce the load on the broker. Here is a table showing the effect of the sticky partitioner:


• random without lingering: 1.25 batched records per request, 75% broker CPU utilization [1]
• sticky without lingering: 2.0 batched records per request, 50% broker CPU utilization [1]
• sticky with 100ms lingering: 15 batched records per request, 33% broker CPU utilization [1]

[1] With a load of 10,000 msgs/second per broker and 1KB per message
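
For illustration, a drastically simplified sticky partitioner against the 0.9 producer’s Partitioner interface might look like the sketch below; the stickiness interval is a placeholder, and the production partitioner handles per-topic state, partition availability, thread-safety, and configuration that are omitted here.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

/**
 * Simplified sticky partitioner: keeps producing to one randomly chosen
 * partition for a fixed interval, then switches. Illustrative sketch only.
 */
public class StickyPartitioner implements Partitioner {

    private static final long STICK_MILLIS = 1000; // placeholder stickiness interval

    private volatile int currentPartition = -1;
    private volatile long lastSwitchTime = 0;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        if (partitions.isEmpty()) {
            return 0; // degenerate case; real code would handle unknown topics differently
        }
        long now = System.currentTimeMillis();
        if (currentPartition < 0 || now - lastSwitchTime > STICK_MILLIS
                || currentPartition >= partitions.size()) {
            currentPartition = ThreadLocalRandom.current().nextInt(partitions.size());
            lastSwitchTime = now;
        }
        return currentPartition;
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // no-op in this sketch; the real partitioner would read its interval from config
    }

    @Override
    public void close() {
    }
}
```

With the 0.9 producer, a partitioner like this would be plugged in via the partitioner.class producer configuration.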

      Rack aware replica assignment

All of our Kafka clusters span three AWS availability zones. An AWS availability zone is conceptually a rack. To ensure availability in case one zone goes down, we developed rack (zone) aware replica assignment so that replicas for the same topic are assigned to different zones. This not only helps to reduce the risk of a zone outage, but also improves our availability when multiple brokers co-located in the same physical host are terminated due to host problems. In this case, we have better fault tolerance than Kafka’s N - 1, where N is the replication factor.

This work has been contributed to the Kafka community in KIP-36 and Apache Kafka GitHub Pull Request #132.

      Kafka Metadata Visualizer

      Kafka’s metadata is stored in ZooKeeper. However, the tree view provided by Exhibitor is difficult to navigate and it is time consuming to find and correlate information.

      We created our own UI to visualize the metadata. It provides both chart and tabular views and uses rich color schemes to indicate ISR state. The key features are the following:

• Individual tabbed views for brokers, topics, and clusters
      • Most information is sortable and searchable
      • Searching for topics across clusters
      • Direct mapping from broker ID to AWS instance ID
      • Correlation of brokers by the leader-follower relationship

      The following are the screenshots of the UI:





      Monitoring

      We created a dedicated monitoring service for Kafka. It is responsible for tracking:

      • Broker status (specifically, if it is offline from ZooKeeper)
      • Broker’s ability to receive messages from producers and deliver messages to consumers. The monitoring service acts as both producer and consumer for continuous heartbeat messages and measures the latency of these messages.
      • For old ZooKeeper based consumers, it monitors the partition count for the consumer group to make sure each partition is consumed.
      • For Keystone Samza routers, it monitors the checkpointed offsets and compares with broker’s log offsets to make sure they are not stuck and have no significant lag.

In addition, we have extensive dashboards to monitor traffic flow down to the topic level as well as most of the brokers’ metrics.

Future Plans

We are currently in the process of migrating to Kafka 0.9, which has quite a few features we want to use, including the new consumer APIs, producer message timeouts and quotas. We will also move our Kafka clusters to AWS VPC and believe its improved networking (compared to EC2 classic) will give us an edge to improve availability and resource utilization.

We are going to introduce a tiered SLA for topics. For topics that can accept minor loss, we are considering using one replica. Without replication, we not only save significantly on bandwidth, but also minimize the state changes that have to depend on the controller. This is another step to make Kafka less stateful in an environment that favors stateless services. The downside is the potential message loss when a broker goes away. However, by leveraging the producer message timeout in the 0.9 release and possibly AWS EBS volumes, we can mitigate the loss.

      Stay tuned for future Keystone blogs on our routing infrastructure, container management, stream processing and more!
      By Real-Time Data Infrastructure Team
Allen Wang, Steven Wu, Monal Daxini, Manas Alekar, Zhenzhong Xu, Jigish Patel, Nagarjun Guraja, Jonathan Bond, Matt Zimmer, Peter Bakas, Kunal Kundaje

      It’s All A/Bout Testing: The Netflix Experimentation Platform

Ever wonder how Netflix serves a great streaming experience with high-quality video and minimal playback interruptions? Thank the team of engineers and data scientists who constantly A/B test their innovations to our adaptive streaming and content delivery network algorithms. What about more obvious changes, such as the complete redesign of our UI layout or our new personalized homepage? Yes, all thoroughly A/B tested.

      In fact, every product change Netflix considers goes through a rigorous A/B testing process before becoming the default user experience. Major redesigns like the ones above greatly improve our service by allowing members to find the content they want to watch faster. However, they are too risky to roll out without extensive A/B testing, which enables us to prove that the new experience is preferred over the old.

      And if you ever wonder whether we really set out to test everything possible, consider that even the images associated with many titles are A/B tested, sometimes resulting in 20% to 30% more viewing for that title!

      Results like these highlight why we are so obsessed with A/B testing. By following an empirical approach, we ensure that product changes are not driven by the most opinionated and vocal Netflix employees, but instead by actual data, allowing our members themselves to guide us toward the experiences they love.

      In this post we’re going to discuss the Experimentation Platform: the service which makes it possible for every Netflix engineering team to implement their A/B tests with the support of a specialized engineering team. We’ll start by setting some high level context around A/B testing before covering the architecture of our current platform and how other services interact with it to bring an A/B test to life.

      Overview

      The general concept behind A/B testing is to create an experiment with a control group and one or more experimental groups (called “cells” within Netflix) which receive alternative treatments. Each member belongs exclusively to one cell within a given experiment, with one of the cells always designated the “default cell”. This cell represents the control group, which receives the same experience as all Netflix members not in the test. As soon as the test is live, we track specific metrics of importance, typically (but not always) streaming hours and retention. Once we have enough participants to draw statistically meaningful conclusions, we can get a read on the efficacy of each test cell and hopefully find a winner.

      From the participant’s point of view, each member is usually part of several A/B tests at any given time, provided that none of those tests conflict with one another (i.e. two tests which modify the same area of a Netflix App in different ways). To help test owners track down potentially conflicting tests, we provide them with a test schedule view in ABlaze, the front end to our platform. This tool lets them filter tests across different dimensions to find other tests which may impact an area similar to their own.

      There is one more topic to address before we dive further into details, and that is how members get allocated to a given test. We support two primary forms of allocation: batch and real-time.

      Batch allocations give analysts the ultimate flexibility, allowing them to populate tests using custom queries as simple or complex as required. These queries resolve to a fixed and known set of members which are then added to the test. The main cons of this approach are that it lacks the ability to allocate brand new customers and cannot allocate based on real-time user behavior. And while the number of members allocated is known, one cannot guarantee that all allocated members will experience the test (e.g. if we’re testing a new feature on an iPhone, we cannot be certain that each allocated member will access Netflix on an iPhone while the test is active).

      Real-Time allocations provide analysts with the ability to configure rules which are evaluated as the user interacts with Netflix. Eligible users are allocated to the test in real-time if they meet the criteria specified in the rules and are not currently in a conflicting test. As a result, this approach tackles the weaknesses inherent with the batch approach. The primary downside to real-time allocation, however, is that the calling app incurs additional latencies waiting for allocation results. Fortunately we can often run this call in parallel while the app is waiting on other information. A secondary issue with real-time allocation is that it is difficult to estimate how long it will take for the desired number of members to get allocated to a test, information which analysts need in order to determine how soon they can evaluate the results of a test.
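
      To make the rule-evaluation idea concrete, here is a simplified sketch, not our production code; the rule fields, test shape, and conflict model are all assumptions for illustration.

      # Simplified real-time allocation sketch; rule fields, Test shape, and the
      # conflict model are assumptions, not the Experimentation Platform's API.
      import zlib
      from dataclasses import dataclass
      from typing import Optional


      @dataclass
      class Rule:
          country: Optional[str] = None
          device_type: Optional[str] = None

          def matches(self, context: dict) -> bool:
              # Every populated field must match the request context.
              if self.country is not None and context.get("country") != self.country:
                  return False
              if self.device_type is not None and context.get("device_type") != self.device_type:
                  return False
              return True


      @dataclass
      class Test:
          test_id: str
          rule: Rule
          cells: list            # e.g. ["default", "cell_1", "cell_2"]
          conflict_group: str    # tests in the same group cannot share a member


      def allocate_realtime(member_id: int, context: dict, tests: list, current: dict) -> dict:
          """Return new {test_id: cell} allocations for an eligible member."""
          taken_groups = {t.conflict_group for t in tests if t.test_id in current}
          new = {}
          for t in tests:
              if t.test_id in current or t.conflict_group in taken_groups:
                  continue  # already allocated, or conflicts with an existing test
              if t.rule.matches(context):
                  # Deterministic hash keeps the cell stable for this (member, test) pair.
                  bucket = zlib.crc32(f"{member_id}:{t.test_id}".encode()) % len(t.cells)
                  new[t.test_id] = t.cells[bucket]
                  taken_groups.add(t.conflict_group)
          return new


      # Example: an Australian PS4 user evaluated against a single hypothetical test.
      ps4_test = Test("new_ps4_nav", Rule(country="AU", device_type="ps4"),
                      ["default", "cell_1"], conflict_group="ui")
      print(allocate_realtime(42, {"country": "AU", "device_type": "ps4"}, [ps4_test], {}))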

      A Typical A/B Test Workflow

      With that background, we’re ready to dive deeper. The typical workflow involved in calling the Experimentation Platform (referred to as A/B in the diagrams for shorthand) is best explained using the following workflow for an Image Selection test. Note that there are nuances to the diagram below which I do not address in depth, in particular the architecture of the Netflix API layer which acts as a gateway between external Netflix apps and internal services.

      In this example, we’re running a hypothetical A/B test with the purpose of finding the image which results in the greatest number of members watching a specific title. Each cell represents a candidate image. In the diagram we’re also assuming a call flow from a Netflix App running on a PS4, although the same flow is valid for most of our Device Apps.

      1. The Netflix PS4 App calls the Netflix API. As part of this call, it delivers a JSON payload containing session level information related to the user and their device.
      2. The call is processed in a script written by the PS4 App team. This script runs in the Client Adaptor Layer of the Netflix API, where each Client App team adds scripts relevant to their app. Each of these scripts comes complete with its own distinct REST endpoints. This allows the Netflix API to own functionality common to most apps, while giving each app control over logic specific to it. The PS4 App Script now calls the A/B Client, a library our team maintains, which is packaged within the Netflix API. This library allows for communication with our backend servers as well as other internal Netflix services.
      3. The A/B Client calls a set of other services to gather additional context about the member and the device.
      4. The A/B Client then calls the A/B Server for evaluation, passing along all the context available to it.
      5. In the evaluation phase:
        1. The A/B Server retrieves all test/cell combinations to which this member is already allocated.
        2. For tests utilizing the batch allocation approach, the allocations are already known at this stage.
        3. For tests utilizing real-time allocation, the A/B Server evaluates the context to see if the member should be allocated to any additional tests. If so, they are allocated.
        4. Once all evaluations and allocations are complete, the A/B Server passes the complete set of tests and cells to the A/B Client, which in turn passes them to the PS4 App Script. Note that the PS4 App has no idea if the user has been in a given test for weeks or the last few microseconds. It doesn’t need to know or care about this.
      6. Given the test/cell combinations returned to it, the PS4 App Script now acts on any tests applicable to the current client request. In our example, it will use this information to select the appropriate piece of art associated with the title it needs to display, which is returned by the service which owns this title metadata. Note that the Experimentation Platform does not actually control this behavior: doing so is up to the service which actually implements each experience within a given test.
      7. The PS4 App Script (through the Netflix API) tells the PS4 App which image to display, along with all the other operations the PS4 App must conduct in order to correctly render the UI.
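
      As a rough illustration of step 6, a script acting on the returned test/cell combinations might look like the following; the test id and image names are hypothetical.

      # Rough illustration of step 6: the app script picks artwork based on the
      # member's cell. The test id and image names below are hypothetical.
      def select_title_image(allocations, title_images, default_image):
          """allocations  -- {test_id: cell} returned by the A/B Server
          title_images -- {cell: image_url} owned by the title-metadata service"""
          cell = allocations.get("artwork_test_narcos")   # hypothetical test id
          return title_images.get(cell, default_image)    # fall back to the default art


      allocations = {"artwork_test_narcos": "cell_2"}
      images = {"cell_1": "narcos_v1.jpg", "cell_2": "narcos_v2.jpg"}
      print(select_title_image(allocations, images, "narcos_default.jpg"))  # -> narcos_v2.jpg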

      Now that we understand the call flow, let’s take a closer look at that box labelled “A/B Server”.

      The Experimentation Platform

      The allocation and retrieval requests described in the previous section pass through REST API endpoints to our server. Metadata pertaining to each test, including allocation rules, is stored in a Cassandra data store. It is these allocation rules which are compared to the context passed from the A/B Client in order to determine a member’s eligibility to participate in a test (e.g. is this user in Australia, on a PS4, and has never previously used this version of the PS4 app).

      Member allocations are also persisted in Cassandra, fronted by a caching layer in the form of an EVCache cluster, which serves to reduce the number of direct calls to Cassandra. When an app makes a request for current allocations, the A/B Client first checks EVCache for allocation records pertaining to this member. If this information was previously requested within the last 3 hours (the TTL for our cache), a copy of the allocations will be returned from EVCache. If not, the A/B Server makes a direct call to Cassandra, passing the allocations back to the A/B Client, while simultaneously populating them in EVCache.
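
      The read path is a standard cache-aside pattern. Here is a simplified sketch, assuming generic cache and data-store clients rather than the actual EVCache and Cassandra APIs.

      # Cache-aside read of a member's allocations (sketch). The cache and store
      # clients are generic stand-ins, not the actual EVCache/Cassandra APIs.
      CACHE_TTL_SECONDS = 3 * 60 * 60  # the 3-hour TTL mentioned above


      def get_allocations(member_id, cache, allocation_store):
          key = f"ab:allocations:{member_id}"
          cached = cache.get(key)
          if cached is not None:
              return cached                                           # cache hit
          allocations = allocation_store.read_allocations(member_id)  # hypothetical accessor
          cache.set(key, allocations, ttl=CACHE_TTL_SECONDS)          # warm the cache
          return allocations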

      When allocations to an A/B test occur, we need to decide the cell in which to place each member. This step must be handled carefully, since the populations in each cell should be as homogeneous as possible in order to draw statistically meaningful conclusions from the test. Homogeneity is measured with respect to a set of key dimensions, of which country and device type (i.e. smart TV, game console, etc.) are the most prominent. Consequently, our goal is to make sure each cell contains similar proportions of members from each country, using similar proportions of each device type, etc. Purely random sampling can bias test results by, for instance, allocating more Australian game console users in one cell versus another. To mitigate this issue we employ a sampling method called stratified sampling, which aims to maintain homogeneity across the aforementioned key dimensions. There is a fair amount of complexity to our implementation of stratified sampling, which we plan to share in a future blog post.
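
      To illustrate the idea only (our production implementation is considerably more sophisticated), here is a toy version of stratified assignment: members are grouped into strata by country and device type, and each stratum is dealt across cells so that every cell ends up with similar proportions of each stratum.

      # Toy stratified assignment: deal each (country, device) stratum round-robin
      # across cells so cell compositions stay balanced. Illustration only.
      from collections import defaultdict


      def stratified_assign(members, num_cells):
          """members: iterable of (member_id, country, device_type). Returns {member_id: cell}."""
          strata = defaultdict(list)
          for member_id, country, device in members:
              strata[(country, device)].append(member_id)

          assignment = {}
          for stratum_members in strata.values():
              for i, member_id in enumerate(stratum_members):
                  assignment[member_id] = i % num_cells   # round-robin within the stratum
          return assignment


      members = [(1, "AU", "ps4"), (2, "AU", "ps4"), (3, "US", "tv"), (4, "US", "tv")]
      print(stratified_assign(members, num_cells=2))  # each stratum split evenly across cells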

      In the final step of the allocation process, we persist allocation details in Cassandra and invalidate the A/B caches associated with this member. As a result, the next time we receive a request for allocations pertaining to this member, we will experience a cache miss and execute the cache related steps described above.

      We also simultaneously publish allocation events to a Kafka data pipeline, which feeds into several data stores. The feed published to Hive tables provides a source of data for ad-hoc analysis, as well as Ignite, Netflix’s internal A/B Testing visualization and analysis tool. It is within Ignite that test owners analyze metrics of interest and evaluate the results of a test. Once again, you should expect an upcoming blog post focused on Ignite in the near future.

      The latest updates to our tech stack added Spark Streaming, which ingests and transforms data from Kafka streams before persisting them in ElasticSearch, allowing us to display near real-time updates in ABlaze. Our current use cases involve simple metrics, allowing users to view test allocations in real-time across dimensions of interest. However, these additions have laid the foundation for much more sophisticated real-time analysis in the near future.

      Upcoming Work

      The architecture we’ve described here has worked well for us thus far. We continue to support an ever-widening set of domains: UI, Recommendations, Playback, Search, Email, Registration, and many more. Through auto-scaling we easily handle our platform’s typical traffic, which ranges from 150K to 450K requests per second. From a responsiveness standpoint, latencies fetching existing allocations range from an average of 8ms when our cache is cold to < 1ms when the cache is warm. Real-time evaluations take a bit longer, with an average latency around 50ms.

      However, as our member base continues to expand globally, the speed and variety of A/B testing is growing rapidly. For some background, the general architecture we just described has been around since 2010 (with some obvious exceptions such as Kafka). Since then:

      • Netflix has grown from streaming in 2 countries to 190+
      • We’ve gone from 10+ million members to 80+ million
      • We went from dozens of device types to thousands, many with their own Netflix app

      International expansion is part of the reason we’re seeing an increase in device types. In particular, there is an increase in the number of mobile devices used to stream Netflix. In this arena, we rely on batch allocations, as our current real-time allocation approach simply doesn’t work: the bandwidth on mobile devices is not reliable enough for an app to wait on us before deciding which experience to serve… all while the user is impatiently staring at a loading screen.

      Additionally, some new areas of innovation conduct A/B testing on much shorter time horizons than before. Tests focused on UI changes, recommendation algorithms, etc. often run for weeks before clear effects on user behavior can be measured. However the adaptive streaming tests mentioned at the beginning of this post are conducted in a matter of hours, with internal users requiring immediate turn around time on results.

      As a result, there are several aspects of our architecture which we are planning to revamp significantly. For example, while the real-time allocation mechanism allows for granular control, evaluations need to be faster and must interact more effectively with mobile devices.

      We also plan to leverage the data flowing through Spark Streaming to begin forecasting per-test allocation rates given allocation rules. The goal is to address the second major drawback of the real-time allocation approach, which is an inability to foresee how much time is required to get enough members allocated to the test. Giving analysts the ability to predict allocation rates will allow for more accurate planning and coordination of tests.
      These are just a couple of our upcoming challenges. If you’re simply curious to learn more about how we tackle them, stay tuned for upcoming blog posts. However, if the idea of solving these challenges and helping us build the next generation of Netflix’s Experimentation platform excites you, we’re always looking for talented engineers to join our team!

      By Steve Urban, Rangarajan Sreenivasan, and Vineet Kannan.

      Selecting the best artwork for videos through A/B testing

      At Netflix, we are constantly looking at ways to help our 81.5M members discover great stories that they will love.  A big part of that is creating a user experience that is intuitive, fun, and meaningfully helps members find and enjoy stories on Netflix as fast as possible.  


      This blog post and the corresponding non-technical blog by my Creative Services colleague Nick Nelson take a deeper look at the key findings from our work in image selection -- focusing on how we learned, how we improved the service and how we are constantly developing new technologies to make Netflix better for our members.

      Gone in 90 seconds

      Broadly, we know that if you don’t capture a member’s attention within 90 seconds, that member will likely lose interest and move on to another activity. Such failed sessions could at times be because we did not show the right content, or because we did show the right content but did not provide sufficient evidence as to why our member should watch it. How can we make it easy for our members to quickly evaluate whether a piece of content is of interest to them?


      As the old saying goes, a picture is worth a thousand words.  Neuroscientists have discovered that the human brain can process an image in as little as 13 milliseconds, and that across the board, it takes much longer to process text compared to visual information.  Will we be able to improve the experience by improving the images we display in the Netflix experience?


      This blog post sheds light on the groundbreaking series of A/B tests Netflix did which resulted in increased member engagement.  Our goals were the following:
      1. Identify artwork that enabled members to find a story they wanted to watch faster.
      2. Ensure that our members increase engagement with each title and also watch more in aggregate.
      3. Ensure that we don’t misrepresent titles as we evaluate multiple images.  


      The series of tests we ran is much like our testing in any other area of the product -- we relentlessly test our way to a better member experience with an increasingly complex set of hypotheses, using the insights we have gained along the way.

      Background and motivation

      When a typical member comes to the Netflix homepage, the member glances at several details for each title, including the display artwork (e.g. the highlighted “Narcos” artwork in the “Popular on Netflix” row), title (“Narcos”), maturity rating (TV-MA), synopsis, star rating, etc. Through various studies, we found that our members look at the artwork first and then decide whether to look at additional details. Knowing that, we asked ourselves whether we could improve the click-through rate for that first glance. To answer this question, we sought the support of our Creative Services team, who work on creating compelling pieces of artwork that convey the emotion of the entire title in a single image while staying true to its spirit. The Creative Services team worked with our studio partners, and at times with our internal design team, to create multiple artwork variants.


           
      Examples of artwork that were used in other contexts and don’t naturally lend themselves to use on the Netflix service.


      Historically, this was a largely unexplored area at Netflix and in the industry in general.  Netflix would get title images from our studio partners that were originally created for a variety of purposes.  Some were intended for roadside billboards, where they don’t live alongside other titles.  Other images were sourced from DVD cover art, which doesn’t work well in a grid layout across multiple form factors (TV, mobile, etc.).  Knowing that, we set out to develop a data-driven framework through which we can find the best artwork for each video, both in the context of the Netflix experience and with the goal of increasing overall engagement -- not just moving engagement from one title to another.

      Testing our way into a better product

      Broadly, Netflix’s A/B testing philosophy is about building incrementally, using data to drive decisions, and failing fast.  When we have a complex area of testing such as image selection, we seek to prove out the hypothesis in incremental steps with increasing rigor and sophistication.


      Experiment 1 (single title test with multiple test cells)



      One of the earliest tests we ran was on the single title “The Short Game” - an inspiring story about several grade school students competing in the game of golf.  If you see the default artwork for this title, you might not easily realize that it is about kids and might skip right past it.  Could we create a few artwork variants that increase the audience for a title?


      Cells:
      • Cell 1 (Control): default artwork
      • Cell 2: 14% better take rate
      • Cell 3: 6% better take rate


      To answer this question, we built a very simple A/B test where members in each test cell get a different image for that title.  We measured the engagement with the title for each variant - click through rate, aggregate play duration, fraction of plays with short duration, fraction of content viewed (how far did you get through a movie or series), etc.  Sure enough, we saw that we could widen the audience and increase engagement by using different artwork.


      A skeptic might say that we may have simply moved hours to this title from other titles on the service.  However, it was an early signal that members are sensitive to artwork changes.  It was also a signal that there were better ways we could help our members find the types of stories they were looking for within the Netflix experience. Knowing this, we embarked on an incrementally larger test to see if we could build a similar positive effect on a larger set of titles.  

      Experiment 2 (multi-cell explore-exploit test)

      The next experiment ran with a significantly larger set of titles across the popularity spectrum - both blockbusters and niche titles.  The hypothesis for this test was that we can improve aggregate streaming hours for a large member allocation by selecting the best artwork across each of these titles.  


      This test was constructed as a two-part explore-exploit test.  The “explore” test measured engagement with each candidate artwork for a set of titles.  The “exploit” test served the most engaging artwork (from the explore test) to future users, to see if we could improve aggregate streaming hours.


      Explore test cells:
      • Control cell: serve the default artwork for all titles
      • Explore Cell 1: serve artwork variant 1 for all titles
      • Explore Cell 2: serve artwork variant 2 for all titles
      • Explore Cell 3: serve artwork variant 3 for all titles
      We measured the best artwork variant for each title over 35 days and fed it into the exploit test.


      Exploit test cells:
      • Control cell: serve the default artwork for the title
      • Exploit Cell 1: serve the winning artwork for the title based on metric 1
      • Exploit Cell 2: serve the winning artwork for the title based on metric 2
      • Exploit Cell 3: serve the winning artwork for the title based on metric 3
      We compared “Total streaming hours” and “Hour share of the titles we tested” by cell.


      Using the explore member population, we measured the take rate (click-through rate) of all artwork variants for each title.  We computed take rate by dividing the number of plays (excluding very short plays) by the number of impressions on the device.  We had several choices for the take rate metric across different grains:
      • Should we include members who watch a few minutes of a title, or just those who watched an entire episode, or those who watched the entire show?
      • Should we aggregate take rate at the country level, region level, or across global population?


      Using offline modeling, we narrowed our choices to 3 different take rate metrics using a combination of the above factors.
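
      The basic take rate computation described above reduces to plays divided by impressions, excluding very short plays. A minimal sketch, where the field names and the cutoff are assumptions for illustration:

      # Take rate = qualified plays / impressions (sketch). The 90-second cutoff
      # for "very short" plays is an assumption for illustration.
      MIN_PLAY_SECONDS = 90


      def take_rate(play_durations, impressions):
          """play_durations: seconds watched per play of an artwork variant;
          impressions: number of times the variant entered the viewport."""
          qualified = sum(1 for seconds in play_durations if seconds >= MIN_PLAY_SECONDS)
          return qualified / impressions if impressions else 0.0


      print(take_rate(play_durations=[30, 120, 2400], impressions=1000))  # -> 0.002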




      The results from this test were unambiguous - we significantly raised the view share of the titles for which we tested multiple artwork variants, and we were also able to raise aggregate streaming hours.  It proved that we weren’t simply shifting hours: showing members more relevant artwork drove them to watch more of something they had not discovered earlier.  We also verified that we did not negatively affect secondary metrics like short-duration plays, fraction of content viewed, etc.  We ran additional longitudinal A/B tests over many months to ensure that simply changing artwork periodically is not as good as finding a better-performing image, demonstrating that the gains don’t come merely from changing the artwork.


      There were engineering challenges as we pursued this test.  We had to invest in two major areas: collecting impression data consistently across devices at scale, and maintaining stable artwork identifiers over time.


      1. Client-side impression tracking:  One of the key components to measuring take rate is knowing how often a title image came into the viewport on the device (impressions).  This meant that every major device platform needed to track every image that came into the viewport when a member stopped to consider it, even for a fraction of a second.  Every one of these micro-events is compacted and sent periodically as part of the member session data.  Every device must measure impressions consistently, even though scrolling on an iPad is very different from navigation on a TV.  We collect billions of such impressions daily with a low loss rate across every stage of the pipeline - a low-storage device might evict events before successfully sending them, the network might drop data, etc.


      2. Stable identifiers for each artwork:  An area that was surprisingly challenging was creating stable unique ids for each artwork.  Our Creative Services team steadily makes changes to the artwork - changing title treatment, touching up to improve quality, sourcing higher resolution artwork, etc.  
      The artwork contains the background image, a localized title treatment in most of the languages we support, an optional ‘new episode’ badge, and a Netflix logo for any of our original content.


      Two images can have different aspect ratios and localized title treatments yet share the same lineage ID (described below).


      So we created a system that automatically groups artwork with different aspect ratios, crops, touch-ups, and localized title treatments, as long as they share the same background image.  Images that share the same background image are associated with the same “lineage ID”.


      Even as Creative Services changed the title treatment and the crop, we logged the data using the lineage ID of the artwork.  Our algorithms can therefore combine data from our global member base even as members’ preferred locales vary.  This improved our data, particularly in smaller countries and for less common languages.
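
      A simplified sketch of the grouping idea: variants that share the same background image map to the same lineage ID. Hashing the raw background bytes, as below, is only illustrative; matching real artwork variants is considerably more involved.

      # Group artwork variants by a lineage ID derived from the shared background
      # image (sketch). Hashing raw bytes is only illustrative.
      import hashlib


      def lineage_id(background_image_bytes):
          return hashlib.sha1(background_image_bytes).hexdigest()[:12]


      def group_by_lineage(artworks):
          """artworks: iterable of (artwork_id, background_image_bytes)."""
          groups = {}
          for artwork_id, background in artworks:
              groups.setdefault(lineage_id(background), []).append(artwork_id)
          return groups


      # Two crops of the same background image land in the same lineage group.
      print(group_by_lineage([("hoc_16x9", b"bg-bytes"), ("hoc_4x3", b"bg-bytes")]))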

      Experiment 3 (single cell title level explore test)

      While the earlier experiment was successful, there are faster and more equitable ways to learn the performance of an artwork.  We want to impose on as few randomly selected members, for as little time, as possible before we can confidently determine the best artwork for every title on the service.


      Experiment 2 pre-allocated each title into several equal-sized cells -- one per artwork variant.  We potentially wasted impressions because every image, including known under-performers, continued to get impressions for many days.  Also, with a given allocation size, say 2 million members, we could accurately detect the performance of images for popular titles but not for niche titles, due to sample size.  If we allocated many more members, say 20 million, we would accurately learn artwork performance for niche titles but would over-expose the poor-performing artwork of popular titles.


      Experiment 2 also did not handle dynamic changes to the number of images that needed evaluation; i.e., we could not evaluate 10 images for a popular title while evaluating just two for another.


      We tried to address all of these limitations in the design of a new “title level explore test”.  In this new experiment, all members of the explore population are in a single cell.  We dynamically assign an artwork variant for every (member, title) pair just before the title gets shown to the member.  In essence, we are performing an A/B test for every title, with a cell for each artwork.  Since the allocation happens at the title level, we are now able to accommodate a different number of artwork variants per title.


      This new test design allowed us to get results even faster than experiment 2, since the first N members, say 1 million, who see a title can be used to evaluate the performance of its image variants.  We stay in the explore phase as long as it takes to determine a significant winner -- typically a few days.  After that, we exploit the win and all members enjoy the benefit of seeing the winning artwork.
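
      Here is a minimal sketch of the per-(member, title) assignment described above, using a deterministic hash so a member keeps seeing the same variant during the explore phase; this is an illustration, not the production allocation logic.

      # Per-(member, title) variant assignment during the explore phase (sketch).
      import zlib


      def assign_variant(member_id, title_id, num_variants):
          key = f"{member_id}:{title_id}".encode()
          return zlib.crc32(key) % num_variants          # stable bucket per (member, title)


      def choose_artwork(member_id, title_id, variants, winner=None):
          """Explore until a significant winner is found, then serve it to everyone."""
          if winner is not None:
              return winner                              # exploit phase
          return variants[assign_variant(member_id, title_id, len(variants))]


      print(choose_artwork(42, 7, ["img_a.jpg", "img_b.jpg", "img_c.jpg"]))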


      We use an internal tool to track relative artwork performance; for Dragons: Race to the Edge, for example, two of the candidate images significantly outperformed all the others.

      Conclusion

      Over the course of this series of tests, we have found many interesting trends among the winning images as detailed in this blog post.  Images that have expressive facial emotion that conveys the tone of the title do particularly well.  Our framework needs to account for the fact that winning images might be quite different in various parts of the world.  Artwork featuring recognizable or polarizing characters from the title tend to do well.  Selecting the best artwork has improved the Netflix product experience in material ways.  We were able to help our members find and enjoy titles faster.


      We are far from done when it comes to improving artwork selection.  We have several dimensions along which we continue to experiment.  Can we move beyond artwork and optimize across all asset types (artwork, motion billboards, trailers, montages, etc.), choosing the best asset type for a title on a single canvas?


      This project brought together the many strengths of Netflix including a deep partnership between best-in-class engineering teams, our Creative Services design team, and our studio partners.  If you are interested in joining us on such exciting pursuits, then please look at our open job descriptions around product innovation and machine learning.


      (on behalf of the teams that collaborated)

      Highlights from PRS2016 workshop

      Personalized recommendations and search are the primary ways Netflix members find great content to watch. We’ve written much about how we build them and some of the open challenges. Recently we organized a full-day workshop at Netflix on Personalization, Recommendation and Search (PRS2016), bringing together researchers and practitioners in these three domains. It was a forum to exchange information, challenges and practices, as well as strengthen bridges between these communities. Seven invited speakers from the industry and academia covered a broad range of topics, highlighted below. We look forward to hearing more and continuing the fascinating conversations at PRS2017!


      Semantic Entity Search, Maarten de Rijke, University of Amsterdam

      Entities, such as people, products, organizations, are the ingredients around which most conversations are built. A very large fraction of the queries submitted to search engines revolve around entities. No wonder that the information retrieval community continues to devote a lot of attention to entity search. In this talk Maarten discussed recent advances in entity retrieval. Most of the talk was focused on unsupervised semantic matching methods for entities that are able to learn from raw textual evidence associated with entities alone. Maarten then pointed out challenges (and partial solutions) to learn such representations in a dynamic setting and to learn to improve such representations using interaction data.


      Combining matrix factorization and LDA topic modeling for rating prediction and learning user interest profiles, Deborah Donato, StumbleUpon

      Matrix Factorization through Latent Dirichlet Allocation (fLDA) is a generative model for concurrent rating prediction and topic/persona extraction. It learns the topic structure of URLs and topic affinity vectors for users, and predicts ratings as well. The fLDA model achieves several goals for StumbleUpon in a single framework: it allows for unsupervised inference of latent topics in the URLs served to users, and for users to be represented as mixtures over the same topics learned from the URLs (in the form of affinity vectors generated by the model). Deborah presented an ongoing effort inspired by the fLDA framework and devoted to extending the original approach to an industrial environment. The current implementation uses a (much faster) expectation maximization method for parameter estimation, instead of Gibbs sampling as in the original work, and implements a modified version of the model in which topic distributions are learned independently using LDA prior to training the main model. This is an ongoing effort, but Deborah presented very interesting results.


      Exploiting User Relationships to Accurately Predict Preferences in Large Scale Networks, Jennifer Neville, Purdue University

      The popularity of social networks and social media has increased the amount of information available about users' behavior online -- including current activities and interactions among friends and family. This rich relational information can be used to predict user interests and preferences even when individual data is sparse, since the characteristics of friends are often correlated. Although relational data offer several opportunities to improve predictions about users, the characteristics of online social network data also present a number of challenges to accurately incorporating the network information into machine learning systems. This talk outlined some of the algorithmic and statistical challenges that arise from partially observed, large-scale networks, and described methods for semi-supervised learning and active exploration that address these challenges.


      Why would you recommend me THAT!?, Aish Fenton, Netflix

      With so many advances in machine learning recently, it’s not unreasonable to ask: why aren’t my recommendations perfect by now? Aish provided a walkthrough of the open problems in the area of recommender systems, especially as they apply to Netflix’s personalization and recommendation algorithms. He also provided a brief overview of recommender systems, and sketched out some tentative solutions for the problems he presented.


      Diversity in Radio, David Ross, Google

      Many services offer streaming radio stations seeded by an artist or song, but what does that mean? To get specific, what fraction of the songs in “Taylor Swift Radio” should be by Taylor Swift? David provided a short introduction to the YouTube Radio project, and dived into the diversity problem, sharing some insights Google has learned from live experiments and human evaluations.


      Immersive Recommendation Using Personal Digital Traces, Deborah Estrin and Andy Hsieh, Cornell Tech

      From topics referred to in Twitter or email, to web browser histories, to videos watched and products purchased online, our digital traces (small data) reflect who we are, what we do, and what we are interested in. In this talk, Deborah and Andy presented a new user-centric recommendation model, called Immersive Recommendation, that incorporates diverse, cross-platform personal digital traces into recommendations. They discussed techniques that infer users' interests from personal digital traces while suppressing the context-specific noise replete in these traces, and proposed a hybrid collaborative filtering algorithm to fuse the user interests with content and rating information to achieve superior recommendation performance throughout a user's lifetime, including in cold-start situations. They illustrated this idea with personalized news and local event recommendations. Finally, they discussed future research directions and applications that incorporate richer multimodal user-generated data into recommendations, and the potential benefits of turning such systems into tools for awareness and aspiration.


      Response prediction for display advertising, Olivier Chapelle, Criteo (paper)

      Click-through and conversion rate estimation are two core prediction tasks in display advertising. Olivier presented a machine learning framework based on logistic regression that is specifically designed to tackle the specifics of display advertising. The resulting system has the following characteristics: it is easy to implement and deploy; it is highly scalable (they have trained it on terabytes of data); and it provides models with state-of-the-art accuracy. Olivier described how the system uses explore/exploit machinery to constantly vary and evolve its predictive model on live streaming data.

      We hope you find these presentations stimulating. We certainly did, and look forward to organizing similar events in the future! If you’d like to help us tackle challenges in these areas and help our members find great stories to enjoy, check out our job postings.