
Creating Your Own EC2 Spot Market

by: Andrew Park, Darrell Denlinger, & Coburn Watson

Netflix prioritizes innovation and reliability above efficiency, and as we continue to scale globally, finding opportunities that balance these three variables becomes increasingly difficult. However, every so often there is a process or application that can shift the curve out on all three factors; for Netflix this process was incorporating hybrid autoscaling engines for our services via Scryer & Amazon Auto Scaling.


Currently over 15% of our EC2 footprint autoscales, and the majority of this usage is covered by reserved instances as we value the pricing and capacity benefits. The combination of these two factors has created an “internal spot market” that has a daily peak of over 12,000 unused instances. We have been steadily working on building an automated system that allows us to effectively utilize these troughs.


Creating the internal spot capacity is straightforward: implement auto scaling and purchase reserved instances. In this post we’ll focus on how to leverage this trough given the complexities that stem from our large scale and decentralized microservice architecture. In the upcoming posts, we will discuss the technical details in automating Netflix’s internal spot market and highlight some of the lessons learned.


How the internal spot began


The initial foray into large scale borrowing started in the Spring of 2015. A new algorithm for one of our personalization services ballooned their video ranking precompute cluster, expanding the size by 5x overnight. Their precompute cluster had an SLA to complete their daily jobs between midnight and 11am, leaving over 1,500 r3.4xlarges unused during the afternoon and evening.


Motivated by the inefficiencies, we actively searched for another service that had relatively interruptible jobs that could run during the off-hours. The Encoding team, who is responsible for converting the raw master video files into consumable formats for our device ecosystem, was the perfect candidate. The initial approach applied was a borrowing schedule based on historical availability, with scale-downs manually communicated between the Personalization, Encoding, and Cloud Capacity teams.


Preliminary Manual Borrowing


As the Encoding team continued to reap the benefits of the extra capacity, they became interested in borrowing from the various sizable troughs in other instance types. Because of a lack of real time data exposing the unused capacity between our accounts, we embarked on a multi-team effort to create the necessary tooling and processes to allow borrowing to occur on a larger, more automated scale.


Current Automated Borrowing


Borrowing considerations


The first requirement for automated borrowing is building out telemetry that exposes unused reservation counts. Given that our autoscaling engines operate at a minute granularity, we could not leverage AWS’ billing file as our data source. Instead, the Engineering Tools team built an API inside our deployment platform that exposed real-time unused reservations at the minute level. This unused-reservation calculation combined input data from our deployment tool, monitoring system, and AWS’ reservation system.
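
To make the calculation concrete, here is a minimal sketch of the kind of arithmetic such an API performs; the class and method names are hypothetical, and the actual Netflix tooling joins deployment, monitoring, and AWS reservation data rather than taking two simple maps:

import java.util.HashMap;
import java.util.Map;

// Minimal sketch: derive unused reservation counts per instance type.
public class UnusedReservationCalculator {

    // reserved: reserved-instance counts per instance type (from AWS reservations)
    // running:  currently running instance counts per type (from deployment/monitoring data)
    // returns:  unused reservations per instance type, floored at zero
    public Map<String, Integer> unusedByType(Map<String, Integer> reserved,
                                             Map<String, Integer> running) {
        Map<String, Integer> unused = new HashMap<>();
        for (Map.Entry<String, Integer> entry : reserved.entrySet()) {
            int inUse = running.getOrDefault(entry.getKey(), 0);
            unused.put(entry.getKey(), Math.max(0, entry.getValue() - inUse));
        }
        return unused;
    }
}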


The second requirement is finding batch jobs that are short in duration or interruptible in nature. Our batch Encoding jobs had duration SLAs of between five minutes and an hour, making them a perfect fit for our initial twelve-hour borrowing window. An additional benefit is having jobs that are resource agnostic, allowing for more borrowing opportunities as our usage landscape creates various troughs by instance type.
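
As a toy illustration of what “resource agnostic” can mean in practice, a job can size itself from whatever hardware it lands on rather than assuming a fixed instance type (a sketch, not the Encoding team’s actual worker code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of a worker that adapts to whichever instance type it is borrowed onto.
public class AdaptiveEncodingWorker {
    public static void main(String[] args) {
        // Size the local thread pool from the cores actually available on this instance.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        System.out.println("Running " + cores + " encoding threads on this instance");
        // ... submit encoding tasks to 'pool' here ...
        pool.shutdown();
    }
}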


The last requirement is for teams to absorb the telemetry data and to set appropriate rules for when to borrow instances. The main concern was whether or not this borrowing would jeopardize capacity for services in the critical path. We alleviated this issue by placing all of our borrowing into a separate account from our production account and leveraging the financial advantages of consolidated billing. Theoretically, a perfectly automated borrowing system would have the same operational and financial results regardless of account structure, but leveraging consolidated billing creates a capacity safety net.


Conclusion


In the ideal state, the internal spot market can be the most efficient platform for running short duration or interruptible jobs through instance level bin-packing. A series of small steps moved us in the right direction, such as:
  • Identifying preliminary test candidates for resource sharing
  • Creating shorter run-time jobs or modifying jobs to be more interruptible
  • Communicating broader messaging about resource sharing
In our next post in this series, the Encoding team will talk through their use cases of the internal spot market, depicting the nuances of real time borrowing at such scale. Their team is actively working through this exciting efficiency problem and many others at Netflix; please check our Jobs site if you want to help us solve these challenges!


Moving from Asgard to Spinnaker

Six years ago, Netflix successfully jumped headfirst into the AWS Cloud, and along the way we ended up writing quite a lot of software to help us out. One particular project proved instrumental in allowing us to efficiently automate AWS deployments: Asgard.

Asgard created an intuitive model for cloud-based applications that has made deployment and ongoing management of AWS resources easy for hundreds of engineers at Netflix. Introducing the notion of clusters, applications, specific naming conventions, and deployment options like rolling push and red/black has ultimately yielded more productive teams who can spend more time coding business logic rather than becoming AWS experts.  What’s more, Asgard has been a successful OSS project adopted by various companies. Indeed, the utility of Asgard’s battle-hardened AWS deployment and management features is undoubtedly due to the hard work and innovation of its contributors both within Netflix and the community.

Netflix, nevertheless, has evolved since first embracing the cloud. Our footprint within AWS has expanded to meet the demand of an increasingly global audience; moreover, the number of applications required to service our customers has swelled. Our rate of innovation, which maintains our global competitive edge, has also grown. Consequently, our desire to move code rapidly, with a high degree of confidence and overall visibility, has also increased. In this regard Asgard has fallen short.

Everything required to produce a deployment artifact, in this case an AMI, was never addressed in Asgard. Consequently, many teams at Netflix constructed their own Continuous Delivery workflows. These workflows were typically a series of related Jenkins jobs that tied together code check-ins with building and testing, then AMI creation and, finally, deployment via Asgard. This final step involved automation against Asgard’s REST API, which was never intended to be leveraged as a first-class citizen.

Roughly a year ago a new project, dubbed Spinnaker, kicked off to enable end-to-end global Continuous Delivery at Netflix. The goals of this project were to create a Continuous Delivery platform that would:
  • enable repeatable automated deployments captured as flexible pipelines and configurable pipeline stages
  • provide a global view across all the environments that an application passes through in its deployment pipeline
  • offer programmatic configuration and execution via a consistent and reliable API
  • be easy to configure, maintain, and extend  
  • be operationally resilient  
  • provide the existing benefits of Asgard without a migration



    What’s more, we wanted to leverage a few lessons learned from Asgard. One particular goal of this new platform is to facilitate innovation within its umbrella. The original Asgard model was difficult to extend so the community forked Asgard to provide alternative implementations. Since these changes weren’t merged back into Asgard, those innovations were lost to the wider community. Spinnaker aims to make it easier to extend and enhance cloud deployment models in a way that doesn't require forking. Whether the community desires additional cloud providers, different deployment artifacts or new stages in a Continuous Delivery pipeline, extensions to Spinnaker will be available to everyone in the community without the need to fork. 

    We additionally wanted to create a platform that, while replacing Asgard, doesn’t exclude it. A big-bang migration process off Asgard would be out of the question for Netflix and for the community. Consequently, changes to cloud assets via Asgard are completely compatible with changes to those same assets via our new platform. And vice versa!

    Finally, we deliberately chose not to reimplement everything in Asgard. Ultimately, Asgard took on too much undifferentiated heavy lifting from the AWS console. Consequently, for those features that are not directly related to cluster management, such as SNS, SQS, and RDS Management, Netflix users and the community are encouraged to use the AWS Console.

    Our new platform only implements those Asgard-like features related to cluster management from the point of view of an application (and even a group of related applications: a project). This application context allows you to work with a particular application’s related clusters, ASGs, instances, Security Groups, and ELBs, in all the AWS accounts in which the application is deployed.

    Today, we have both systems running side by side with the vast majority of all deployments leveraging our new platform. Nevertheless, we’re not completely done with gaining the feature parity we desire with Asgard. That gap is closing rapidly and in the near future we will be sunsetting various Asgard instances running in our infrastructure. At this point, Netflix engineers aren’t committing code to Asgard’s Github repository; nevertheless, we happily encourage the OSS community’s active participation in Asgard going forward. 

    Asgard served Netflix well for quite a long time. We learned numerous lessons along our journey and are ready to focus on the future with a new platform that makes Continuous Delivery a first-class citizen at Netflix and elsewhere. We plan to share this platform, Spinnaker, with the Open Source Community in the coming months.

    -Delivery Engineering Team at Netflix 

    Flux: A New Approach to System Intuition

    First level of Flux

    On the Traffic and Chaos Teams at Netflix, our mission requires that we have a holistic understanding of our complex microservice architecture. At any given time, we may be called upon to move the request traffic of many millions of customers from one side of the planet to the other. More frequently, we want to understand in real time what effect a variable is having on a subset of request traffic during a Chaos Experiment. We require a tool that can give us this holistic understanding of traffic as it flows through our complex, distributed system.






    The two use cases have some common requirements. We need:
    • Realtime data.
    • Data on the volume, latency, and health of requests.
    • Insight into traffic at the network edge.
    • The ability to drill into IPC traffic.
    • Dependency information about the microservices as requests travel through the system.

So far, these requirements are rather standard fare for a network monitoring dashboard. Aside from the actual amount of traffic that Netflix handles, you might find a tool that accomplishes the above at any undifferentiated online service.

    Here’s where it gets interesting.

    In general, we assume that if anything is best represented numerically, then we don’t need to visualize it. If the best representation is a numerical one, then a visualization could only obscure a quantifiable piece of information that can be measured, compared, and acted upon. Anything that we can wrap in alerts or some threshold boundary should kick off some automated process. No point in ruining a perfectly good system by introducing a human into the mix.

Instead of numerical information, we want a tool that surfaces relevant information to a human, for situations where it would be too onerous to create a heuristic. These situations require an intuition that we can’t codify.

If we want to be able to intuit decisions about the holistic state of the system, then we are going to need a tool that gives us an intuitive understanding of the system. The network monitoring dashboards that we are familiar with won’t suffice. The current industry tools present data and charts, but we want something that will let us feel the traffic and the state of the system.

    In trying to explain this requirement for a visceral, gut-level understanding of the system, we came up with a metaphor that helps illustrate the point. It’s absurd, but explanatory.

    Let's call it the "Pain Suit."
    Imagine a suit that is wired with tens of thousands of electrodes. Electrode bundles correspond to microservices within Netflix. When a Site Reliability Engineer is on call, they have to wear this suit. As a microservice experiences failures, the corresponding electrodes cause a painful sensation. We call this the “Pain Suit.”

    Now imagine that you are wearing the Pain Suit for a few short days. You wake up one morning and feel a pain in your shoulder. “Of course,” you think. “Microservice X is misbehaving again.” It would not take you long to get a visceral sense of the holistic state of the system. Very quickly, you would have an intuitive understanding of the entire service, without having any numerical facts about any events or explicit alerts.

    It is our contention that this kind of understanding, this mechanical proprioception, is not only the most efficient way for us to instantly have a holistic understanding, it is also the best way to surface relevant information in a vast amount of data to a human decision maker. Furthermore, we contend that even brief exposure to this type of interaction with the system leads to insights that are not easily attained in any other way.

    Of course, we haven’t built a pain suit. [Not yet. ;-)]

Instead, we decided to take advantage of the brain’s ability to process massive amounts of visual information in multiple dimensions, in parallel. We call this tool Flux.

    In the home screen of Flux, we get a representation of all traffic coming into Netflix from the Internet, and being directed to one of our three AWS Regions. Below is a video capture of this first screen in Flux during a simulation of a Regional failover:


    The circle in the center represents the Internet. The moving dots represent requests coming in to our service from the Internet. The three Regions are represented by the three peripheral circles. Requests are normally represented in the bluish-white color, but errors and fallbacks are indicated by other colors such as red.

    In this simulation, you can see request errors building up in the region in the upper left [victim region] for the first twenty seconds or so. The cause of the errors could be anything, but the relevant effect is that we can quickly see that bad things are happening in the victim region.

    Around twenty seconds into the video, we decide to initiate a traffic failover. For the following 20 seconds, the requests going to the victim region are redirected to the upper right region [savior region] via an internal proxy layer. We take this step so that we can programmatically control how much traffic is redirected to the savior region while we scale it up. In this situation we don’t have enough extra capacity running hot to instantly fail over, so scaling up takes some time.

    The inter-region traffic from victim to savior increases while the savior region scales up. At that point, we switch DNS to point to the savior region. For about 10 seconds you see traffic to the victim region die down as DNS propagates. At this point, about 56 seconds in, nearly all of the victim region’s traffic is now pointing to the savior region. We hold the traffic there for about 10 seconds while we ‘fix’ the victim region, and then we revert the process.

    The victim region has been fixed, and we end the demo with traffic more-or-less evenly distributed. You may have noticed that in this demonstration we only performed a 1:1 mapping of victim to savior region traffic. We will speak to more sophisticated failover strategies in future posts.

    RESULTS

    Even before Flux v1.0 was up and running, when it was still in Alpha on a laptop, it found an issue in our production system. As we were testing real data, Justin noticed a stream that was discolored in one region. “Hey, what’s that?” led to a short investigation which revealed that our proxy layer had not scaled to a proper size on the most recent push in that region and was rejecting SSO requests. Flux in action!

Even a split-second glance at the Flux interface is enough to show us the health of the system. Without reading any numbers or searching for any particular signal, we instantly know by the color and motion of the elements on the screen whether the service is running properly. Of course, if something is really wrong with the service, it will be highly visible. More interesting to us, we start to get a feeling for when things aren’t right in the system even before the disturbance is quantifiable.

    STAY TUNED

    This blog post is part of a series. In the next post on Flux, we will look at two layers that are deeper than the regional view, and talk specifically about the implementation. If you have thoughts on experiential tools like this or how to advance the state of the art in this field, we’d love to hear your feedback. Feel free to reach out to traffic@netflix.com.

    -Traffic Team at Netflix
    Luke Kosewski, Jeremy Tatelman, Justin Reynolds, Casey Rosenthal


    Netflix at AWS re:Invent 2015

    Ever since AWS started the re:Invent conference, Netflix has actively participated each and every year.  This year is no exception, and we’re planning on presenting at 8 different sessions. The topics span the domains of availability, engineering velocity, security, real-time analytics, big data, operations, cost management, and efficiency all at web scale.


    In the past, our sessions have received a lot of interest, so we wanted to share the schedule in advance, and provide a summary of the topics and how they might be relevant to you and your company.  Please join us at re:Invent if you’re attending. After the conference, we will link slides and videos to this same post.


    ISM301 - Engineering Global Operations in the Cloud
    Wednesday, Oct 7, 11:00AM - Palazzo N
    Josh Evans, Director of Operations Engineering


    Abstract: Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. This means more features, more experiments, more deployments, more engineers making changes in production environments, and ever increasing complexity. Simultaneously improving service availability and accelerating rate of change seems impossible on the surface. At Netflix, Operations Engineering is both a technical and organizational construct designed to accomplish just that by integrating disciplines like continuous delivery, fault-injection, regional traffic management, crisis response, best practice automation, and real-time analytics. In this talk, designed for technical leaders seeking a path to operational excellence, we'll explore these disciplines in depth and how they integrate and create competitive advantages.


ISM309 - Efficient Innovation - High Velocity Cost Management at Netflix
Wednesday, Oct 7, 2:45PM - Palazzo C
    Andrew Park, Manager FPNA


    Abstract: At many high growth companies, staying at the bleeding edge of innovation and maintaining the highest level of availability often sideline financial efficiency goals. This problem is exacerbated in a micro-service environment where decentralized engineering teams can spin up thousands of instances at a moment’s notice, with no governing body tracking financial or operational budgets. But instead of allowing costs to spin out of control causing senior leaders to have a “knee-jerk” reaction to rein in costs, there are  proactive and reactive initiatives one can pursue to replace high velocity cost with efficient innovation. Primarily, these initiatives revolve around developing a positive cost-conscious culture and assigning the responsibility of efficiency to the appropriate business owners.

    At Netflix, our Finance and Operations Engineering teams bear that responsibility to ensure the rate of innovation is not only fast, but also efficient. In the following presentation, we’ll cover the building blocks of AWS cost management and discuss the best practices used at Netflix.

    BDT318 -  Netflix Keystone - How Netflix handles Data Streams up to 8 Million events per second
    Wednesday, Oct 7, 2:45PM - San Polo 3501B
    Peter Bakas, Director of Event and Data Pipelines


Abstract: In this talk, we will provide an overview of Keystone - Netflix's new Data Pipeline. We will cover our migration from Suro to Keystone - including the reasons behind the transition and the challenge of achieving zero loss for the over 400 billion events we process daily.  We will discuss in detail how we deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second during peak.


    DVO203 - A Day in the Life of a Netflix Engineer using 37% of the Internet
    Wednesday, Oct 7, 4:15PM - Venetian H
    Dave Hahn, Senior Systems Engineer & AWS Liaison


    Abstract: Netflix is a large and ever-changing ecosystem made up of:
    * hundreds of production changes every hour
    * thousands of micro services
    * tens of thousands of instances
    * millions of concurrent customers
    * billions of metrics every minute

    And I'm the guy with the pager.

An in-the-trenches look at what operating at Netflix scale in the cloud is really like. How Netflix views the velocity of innovation, expected failures, high availability, engineer responsibility, and obsessing over the quality of the customer experience. Why Freedom & Responsibility are key, why trust is required, and why chaos is your friend.


    SPOT302 -  Availability: The New Kind of Innovator’s Dilemma
    Wednesday, Oct 7, 4:15PM - Marcello 4501B
    Coburn Watson, Director of Reliability and Performance Engineering


    Abstract: Successful companies, while focusing on their current customers' needs, often fail to embrace disruptive technologies and business models. This phenomenon, known as the "Innovator's Dilemma," eventually leads to many companies' downfall and is especially relevant in the fast-paced world of online services. In order to protect its leading position and grow its share of the highly competitive global digital streaming market, Netflix has to continuously increase the pace of innovation by constantly refining recommendation algorithms and adding new product features, while maintaining a high level of service uptime. The Netflix streaming platform consists of hundreds of microservices that are constantly evolving, and even the smallest production change may cause a cascading failure that can bring the entire service down. We face a new kind of Innovator's Dilemma, where product changes may not only disrupt the business model but also cause production outages that deny customers service access. This talk will describe various architectural, operational and organizational changes adopted by Netflix in order to reconcile rapid innovation with service availability.


    BDT207 -  Real-Time Analytics In Service of Self-Healing Ecosystems
    Wednesday, Oct 7, 4:15PM - Lido 3001B
    Roy Rappoport, Manager of Insight Engineering
    Chris Sanden, Senior Analytics Engineer


Abstract: Netflix strives to provide an amazing experience to each member.  To accomplish this, we need to maintain very high availability across our systems.  However, at a certain scale humans can no longer monitor the status of all systems, making it critical for us to build tools and platforms that can automatically monitor our production environments and make intelligent real-time operational decisions to remedy the problems they identify.


In this talk, we'll discuss how Netflix uses data mining and machine learning techniques to automate decisions in real time with the goal of supporting operational availability, reliability, and consistency.  We'll review how we got to the current state, the lessons we learned, and the future of Real-Time Analytics at Netflix.


    While Netflix's scale is larger than most other companies, we believe the approaches and technologies we intend to discuss are highly relevant to other production environments, and an audience member will come away with actionable ideas that should be implementable in, and will benefit, most other environments.  


    BDT303 - Running Spark and Presto in Netflix Big Data Platform
    Thursday, Oct 8, 11:00AM - Palazzo F
    Eva Tse, Director of Engineering - Big Data Platform
    Daniel Weeks, Engineering Manager - Big Data Platform


    Abstract: In this talk, we will discuss how Spark & Presto complement our big data platform stack that started with Hadoop; and the use cases that they address. Also, we will discuss how we run Spark and Presto on top of the EMR infrastructure. Specifically, how we use S3 as our DW and how we leverage EMR as a generic data processing cluster management framework.


    SEC310 - Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud
    Thursday, Oct 8, 11:00AM - Marcello 4501B
    Jason Chan, Director of Cloud Security


Abstract: Oftentimes, developers and auditors can be at odds. The agile, fast-moving environments that developers enjoy will typically give auditors heartburn. The more controlled and stable environments that auditors prefer for demonstrating and maintaining compliance are traditionally not friendly to developers or innovation. We'll walk through how Netflix moved its PCI and SOX environments to the cloud and how we were able to leverage the benefits of the cloud and agile development to satisfy both auditors and developers. Topics covered will include shared responsibility, using compartmentalization and microservices for scope control, immutable infrastructure, and continuous security testing.


    We also have a booth on the show floor where the speakers and other Netflix engineers will hold office hours.  We hope you join us for these talks and stop by our booth and say hello!

    Innovating SSO with Google For Work

    The modern workforce deserves access to technology that will help them work the way they want to in this increasingly mobile world. When Netflix moved to Google Apps, employees and contractors quickly adopted the Google experience, from signing into Gmail, to saving files on Drive, to creating and sharing documents. They are now so accustomed to the Google Apps login flow, down to the two-factor authentication, that we wanted to make Google their central sign on service for all cloud apps, not just Google Apps for Work or apps in the Google Apps Marketplace.  


    A growing number of companies like us are looking to Google Apps for Work to be their central sign on service for good reason. Google gives today's highly mobile workforces access to all the cloud applications they need to do their jobs from anywhere on any device, all with a familiar and trusted user experience.


    Netflix has a complex workforce environment with more than 400 cloud applications, many of which were custom-built for specific use cases unique to our business. This was part of the challenge we came up against in making Google Apps for Work a truly universal SSO solution. Also, Google provides the key components foundational to a secure central access point for employees and contractors to access cloud apps, but we needed more granular contextual control over who could access the apps. For example, someone in the marketing department doesn’t always need to use an app that’s built specifically for the finance department.


    The second challenge was that we needed it to be as straightforward as possible to deploy apps across the organization without making developers jump through unnecessary hoops just to get them onto the single sign-on environment. For this we built libraries supporting all common programming languages used at Netflix.


    At Netflix, the security context from application to application is quite complex. Google is focused on providing business critical solutions like serving as the central secure access point for cloud apps, while also providing infrastructure for these services like the identity directory. We trust Google to play this foundational role, but wouldn’t expect it to meet unique needs that fall between the directory and the login for every one of its customers.


    We decided to bring in Ping Identity to fill these gaps. Ping’s Identity Defined Security platform serves as the glue that enables our workforce to have seamless and secure access to the additional apps and services needed while giving our IT team the control over securing application access that we need. Ping also helps us empower developers to build and deploy new apps based on standards so the workforce can use them securely, quickly and easily in this single sign-on environment.


    No cutting edge SSO solution is made up of just one component. We have packaged ours, and run the non-SaaS components in AWS architected for high availability and performance like any other Netflix service. Our employees, contractors, and application owners consume a true IDaaS solution. We have built it in such a way that as the Identity landscape continues to improve, we can add or remove pieces from the authentication chain without being disruptive to users.


    We’ve been working closely with Eric Sachs and Google’s identity team as well as Ping Identity’s CTO office to make this into a reality. I will talk about our experience with Google and Ping Identity tomorrow at Identify 2015 in New York, on October 21 in San Francisco, and on November 18 in London. My colleague, Google’s Product Management Director for Identity, Eric Sachs, will also be at these events to discuss how these same standards can be used in work and consumer-facing identity systems. If you’re interested in the Identity space, and would like to discuss in more depth what we have done, please reach out to me. Also feel free to look at our job postings in this space here.

    Falcor for Android

    Falcor Logo
    We’re happy to have open-sourced the Netflix Falcor library earlier this year. On Android, we wanted to make use of Falcor in our client app for its efficient model of data fetching as well as its inherent cache coherence.

    Falcor requires us to model data on both the client and the server in the same way (via a path query language). This provides the benefit that clients don’t need any translation to fetch data from the server (see What is Falcor). For example, the application may request path [“video”, 12345, “summary”] from Falcor and if it doesn’t exist locally then Falcor can request this same path from the server.



    Another benefit that Falcor provides is that it can easily combine multiple paths into a single http request. Standard REST APIs may be limited in the kind of data they can provide via one specific URL. However Falcor’s path language allows us to retrieve any kind of data the client needs for a given view (see the “Batching” heading in “How Does Falcor Work?”). This also provides a nice mechanism for prefetching larger chunks of data if needed, which our app does on initialization.
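
As a rough illustration of what requesting batched paths can look like from the app's point of view, here is a sketch built around a hypothetical FalcorModel interface and callback; it is not the actual Netflix client API:

import java.util.Arrays;
import java.util.List;

// Sketch: asking a Falcor-style model for several paths in one logical request.
public class PrefetchExample {

    // Hypothetical client-side model interface; the real implementation checks its local
    // cache first and batches only the missing paths into a single HTTP request.
    interface FalcorModel {
        void get(List<List<Object>> paths, Callback callback);
    }

    interface Callback {
        void onData(Object jsonGraphFragment);
    }

    static void prefetchHomeScreen(FalcorModel model) {
        List<List<Object>> paths = Arrays.asList(
                Arrays.<Object>asList("video", 12345, "summary"),
                Arrays.<Object>asList("video", 12345, "rating"),
                Arrays.<Object>asList("topVideos", 0, "summary"));
        model.get(paths, fragment -> {
            // Merge the returned JSON Graph fragment into the local cache and update views.
        });
    }
}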

    The Problem

    Being the only Java client at Netflix necessitated writing our own implementation of Falcor. The primary goal was to increase the efficiency of our caching code, or in other words, to decrease the complexity and maintenance costs associated with our previous caching layer. The secondary goal was to make these changes while maintaining or improving performance (speed & memory usage).

    The main challenge in doing this was to swap out our existing data caching layer for the new Falcor component with minimal impact on app quality. This warranted an investment in testing to validate the new caching component but how could we do this extensive testing most efficiently?

    Some history: prior to our Falcor client we had not made much of an investment in improving the structure or performance of our cache. After a light-weight first implementation, our cache had grown to be incoherent (same item represented in multiple places in memory) and the code was not written efficiently (lots of hand-parsing of individual http responses). None of this was good.

    Our Solution

Falcor provides cache coherence by making use of a JSON Graph. This works by using a custom path language to define internal references to other items within the JSON document. This path language is consistent across Falcor clients and servers, and thus a path or reference used locally on the client will be the same path or reference when sent to the server.

{
    "topVideos": {
        // List, with indices
        0: { $type: "ref", value: ["video", 123] },  // JSON Graph reference
        1: { $type: "ref", value: ["video", 789] }
    },
    "video": {
        // Videos by ID
        123: {
            "name": "Orange Is the New Black",
            "year": 2015,
            ...
        },
        789: {
            "name": "House of Cards",
            "year": 2015,
            ...
        }
    }
}

    Our original cache made use of the gson library for parsing model objects and we had not implemented any custom deserializers. This meant we were implicitly using reflection within gson to handle response parsing. We were curious how much of a cost this use of reflection introduced when compared with custom deserialization. Using a subset of model objects, we wrote a benchmark app that showed the deserialization using reflection took about 6x as much time to process when compared with custom parsing.

    We used the transition to Falcor as an opportunity to write custom deserializers that took json as input and correctly set fields within each model. There is a slightly higher cost here to write parsing code for the models. However most models are shared across a few different get requests so the cost becomes amortized and seemed worth it considering the improved parsing speed.

// Custom deserialization method for Video.Summary model
public void populate(JsonElement jsonElem) {
    JsonObject json = jsonElem.getAsJsonObject();
    for (Map.Entry<String, JsonElement> entry : json.entrySet()) {
        JsonElement value = entry.getValue();
        switch (entry.getKey()) {
            case "id":    id = value.getAsString();    break;
            case "title": title = value.getAsString(); break;
            ...
        }
    }
}

    Once the Falcor cache was implemented, we compared cache memory usage over a typical user browsing session. As provided by cache coherence (no duplicate objects), we found that the cache footprint was reduced by about 10-15% for a typical user browse session, or about 500kB.

    Performance and Threading

    When a new path of data is requested from the cache, the following steps occur:
    1. Determine which paths, if any, already exist locally in the cache
    2. Aggregate paths that don't exist locally and request them from the server
    3. Merge server response back into the local cache
    4. Notify callers that data is ready, and/or pass data back via callback methods
    We generalized these operations in a component that also managed threading. By doing this, we were able to take everything off of the main thread except for task instantiation. All other steps above are done in worker threads.
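
A simplified sketch of how such a task component might be structured is shown below; the interfaces and names are illustrative rather than the production classes:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: a fetch task that runs every step except instantiation off the main thread.
public class FetchTask {
    private static final ExecutorService WORKERS = Executors.newFixedThreadPool(4);

    interface Cache {
        List<String> missingPaths(List<String> requestedPaths); // step 1
        void merge(String jsonResponse);                        // step 3
        Object read(List<String> paths);
    }

    interface RemoteClient {
        String fetch(List<String> paths);                       // step 2: one batched request
    }

    interface Callback {
        void onReady(Object data);                              // step 4
    }

    // Called on the main thread; all real work happens on a worker thread.
    static void execute(Cache cache, RemoteClient remote,
                        List<String> paths, Callback callback) {
        WORKERS.submit(() -> {
            List<String> missing = cache.missingPaths(paths);
            if (!missing.isEmpty()) {
                cache.merge(remote.fetch(missing));
            }
            callback.onReady(cache.read(paths));
        });
    }
}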

    Further, by isolating all of the cache and remote operations into a single component we were easily able to add performance information to all requests. This data could be used for testing purposes (by outputting to a specific logcat channel) or simply as a debugging aid during development.

    // Sample logcat output
    15:29:10.956: FetchDetailsTask ++ time to build paths: 0ms
    15:29:10.956: FetchDetailsTask ++ time to check cache for missing paths: 1ms
    15:29:11.476: FetchDetailsTask ~~ http request took: 516ms
    15:29:11.486: FetchDetailsTask ++ time to parse json response: 8ms
    15:29:11.486: FetchDetailsTask ++ time to fetch results from cache: 0ms
    15:29:11.486: FetchDetailsTask == total task time from creation to finish: 531ms

    Testing

    Although reflection had been costly for the purposes of parsing json, we were able to use reflection on interfaces to our advantage when it came to testing our new cache. In our test harness, we defined tables that mapped test interfaces to each of the model classes. For example, when we made a request to fetch a ShowDetails object, the map defined that the ShowDetails and Playable interfaces should be used to compare the results.

// INTERFACE_MAP sample entries
put(netflix.model.branches.Video.Summary.class,              // Model/class
    new Class<?>[]{netflix.model._interfaces.Video.class});  // Interfaces to test
put(netflix.model.ShowDetails.class,
    new Class<?>[]{netflix.model._interfaces.ShowDetails.class,
                   netflix.model._interfaces.Playable.class});
put(netflix.model.EpisodeDetails.class,
    new Class<?>[]{netflix.model._interfaces.EpisodeDetails.class,
                   netflix.model._interfaces.Playable.class});
// etc.

We then used reflection on the interfaces to get a list of all their methods and then recursively applied each method to each item (or to each item in a list). The return values for each method/object pair were compared to find any differences between the previous cache implementation and the Falcor implementation. This provided a first pass of error detection for the new implementation and caught most problems early on.

private Result validate(Object o1, Object o2) {
    // ...snipped...
    Class<?>[] validationInterfaces = INTERFACE_MAP.get(o1.getClass());
    for (Class<?> testingInterface : validationInterfaces) {
        Log.d(TAG, "Getting methods for interface: " + testingInterface);
        Method[] methods = testingInterface.getMethods(); // Public methods only
        for (Method method : methods) {
            Object rtn1 = method.invoke(o1); // Old cache object
            Object rtn2 = method.invoke(o2); // Falcor cache object
            if (rtn1 instanceof FalcorValidator) {
                Result rtn = validate(rtn1, rtn2); // Recursively validate objects
                if (rtn.isError()) {
                    return rtn;
                }
            } else if (!rtn1.equals(rtn2)) {
                return Result.VALUE_MISMATCH.append(rtnMsg); // rtnMsg comes from code elided above
            }
        }
    }
    return Result.OK;
}

    Bonus for Debugging

    Because of the structure of the Falcor cache, writing a dump() method was trivial using recursion. This became a very useful utility for debugging since it can succinctly express the whole state of the cache at any point in time, including all internal references. This output can be redirected to the logcat output or to a file.

void doCacheDumpRecursive(StringBuilder output, BranchNode node, int offset) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < offset; i++) {
        sb.append((i == offset - 1) ? " |-" : " | "); // Indentation chars
    }
    String spacer = sb.toString();
    for (String key : node.keySet()) { // assumes BranchNode exposes its child keys
        Object value = node.get(key);
        if (value instanceof Ref) {
            output.append(spacer).append(key).append(" -> ")
                  .append(((Ref) value).getRefPath()).append(NEWLINE);
        } else {
            output.append(spacer).append(key).append(NEWLINE);
        }
        if (value instanceof BranchNode) {
            doCacheDumpRecursive(output, (BranchNode) value, offset + 1);
        }
    }
}

    Sample Cache Dump File

    Results

    The result of our work was that we created an efficient, coherent cache that reduced its memory footprint when compared with our previous cache component. In addition, the cache was structured in a way that was easier to maintain and extend due to an increase in clarity and a large reduction in redundant code.

    We achieved the above objectives while also reducing the time taken to parse json responses and thus speed performance of the cache was improved in most cases. Finally, we minimized our regressions by using a thorough test harness that we wrote efficiently using reflection.

    Future Improvements

• Multiple views may be bound to the same data path, so how can we notify all views when the underlying data changes? An observer pattern or RxJava could address this.
    • Cache invalidation: We do this manually in a few specific cases now but we could implement a more holistic approach that includes expiration times for paths that can expire. Then, if that data is later requested, it is considered invalid and a remote request is again required.
• Disk caching. It would be fairly straightforward to serialize our cache, or portions of the cache, to disk. The caching manager could then check the in-memory cache, then the on-disk cache, and finally go remote if needed (a rough sketch follows below).
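
The following is a minimal sketch of that layered lookup, using a hypothetical TieredCache and RemoteFetcher rather than our actual classes:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Rough sketch of a layered cache: in-memory map, disk snapshot, remote fallback.
public class TieredCache {
    private final Map<String, Serializable> memory = new HashMap<>();
    private final File diskFile;

    public TieredCache(File diskFile) {
        this.diskFile = diskFile;
    }

    // Persist the in-memory map; real code would likely serialize only selected paths.
    public synchronized void flushToDisk() throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(diskFile))) {
            out.writeObject(new HashMap<>(memory));
        }
    }

    @SuppressWarnings("unchecked")
    public synchronized void loadFromDisk() throws IOException, ClassNotFoundException {
        if (!diskFile.exists()) return;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(diskFile))) {
            memory.putAll((Map<String, Serializable>) in.readObject());
        }
    }

    public synchronized Serializable get(String path, RemoteFetcher remote) {
        Serializable value = memory.get(path);
        if (value == null) {
            // Memory miss: in this sketch the disk layer is loaded up front via
            // loadFromDisk(), so fall back to the network and cache the result.
            value = remote.fetch(path);
            memory.put(path, value);
        }
        return value;
    }

    interface RemoteFetcher {
        Serializable fetch(String path);
    }
}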

    Links

    Falcor project: https://netflix.github.io/falcor/
    Netflix Releases Falcor Developer Preview: http://techblog.netflix.com/2015/08/falcor-developer-preview.html

    Evolution of Open Source at Netflix

    When we started our Netflix Open Source (aka NetflixOSS) Program several years ago, we didn’t know how it would turn out.  We did not know whether our OSS contributions would be used, improved, or ignored; whether we’d have a community of companies and developers sending us feedback; and whether middle-tier vendors would integrate our solutions into theirs.  The reasons for starting the OSS Programs were shared previously here.


    Fast forward to today.  We have over fifty open source projects, ranging from infrastructural platform components to big data tools to deployment automation.  Over time, our OSS site became very busy with more and more components piling on.  Now, even more components are on the path to being open.




While many of our OSS projects are being successfully used across many companies all over the world, we got a very clear signal from the community that it was getting harder to figure out which projects were useful for a particular company or team, which were fully independent, and which were coupled together.  The external community was also unclear about which components we (Netflix) continued to invest in and support, and which were in maintenance or sunset mode.  That feedback was very useful to us, as we’re committed to making our OSS Program a success.


    We recently updated our Netflix Open Source site on Github pages.  It does not yet address all of the feedback and requests we received, but we think it’s moving us in the right direction:


    1. Clear separation of categories.  Looking for Build and Delivery tools?  You shouldn’t have to wade through many unrelated projects to find them.
    2. With the new overview section of each category we can now explain in short form how each project should be used in concert with other projects.  With the old “box art” layout, it wasn’t clear how the projects fit together (or if they did) in a way that provided more value when used together.
    3. Categories now match our internal infrastructure engineering organization.  This means that the context within each category will reflect the approach to engineering within the specific technical area.  Also, we have appointed category leaders internally that will help keep each category well maintained across projects within that area.
4. Clear highlighting of the projects we’re actively investing in and supporting.  If you see a project on the site, it’s under active development and maintenance.  If you don’t see it, it may be in either maintenance-only or sunset mode.  We’ll be providing more transparency on that shortly.
5. Support for multi-repo projects.  We have several big projects that are about to be open sourced.  Each one will consist of many Github repos.  The old site would list each of the repos, making the overall navigation even less usable.  The new site allows us to group the relevant repos together under a single project.




Other feedback we’re addressing is that it was hard to get started with many of our OSS projects.  Setup and configuration were often difficult and tricky.  We’re addressing this by packaging most (though not yet all) of our projects in the Docker format for easy setup.  Please note, this packaging is not intended for direct use in production, but purely to provide a quick ramp-up for understanding the open source projects.  We have found that it is far easier to help users set up our projects by providing pre-built, runnable Docker containers than by publishing source code with build and setup instructions in prose on a wiki.


    The next steps we’ll be taking in our Open Source Program:


1. Provide full transparency on which projects are archived - i.e., no longer actively developed or maintained.  We will not be removing any code from Github repos, but will articulate if we’re no longer actively developing or using a particular project.  Netflix’s needs change over time, and this will be reflected in our OSS projects.
    2. Provide a better roadmap for what new projects we are planning to open, and which Open projects are still in the state of heavy flux (evolution).  This will allow the community to better decide whether particular projects are interesting / useful.
    3. Expose some of the internal metrics we use to evaluate our OSS projects - number of issues, commits, etc.  This will provide better transparency of the maturity / velocity of each project.
    4. Documentation.  Documentation.  Documentation.


    While we’re continuing on our path to make NetflixOSS relevant and useful to many companies and developers, your continuous feedback is very important to us.  Please let us know what you think at netflixoss@netflix.com.  


    We’re planning our next NetflixOSS Meetup in early 2016 to coincide with some new and exciting projects that are about to be open.  Stay tuned and follow @netflixoss for announcements and updates.



    Netflix Hack Day - Autumn 2015


    Last week, we hosted our latest installment of Netflix Hack Day. As always, Hack Day is a way for our product development staff to get away from everyday work, to have fun, experiment, collaborate, and be creative.

    The following video is an inside look at what our Hack Day event looks like:


    Video credit: Sean Williams

This time, we had 75 hacks that were produced by about 200 engineers and designers (and even some from the legal team!). We’ve embedded some of our favorites below.  You can also see some of our past hacks in our posts for March 2015, Feb. 2014 & Aug. 2014.

    While we think these hacks are very cool and fun, they may never become part of the Netflix product, internal infrastructure, or otherwise be used beyond Hack Day.  We are posting them here publicly to simply share the spirit of the event.

    Thanks to all of the hackers for putting together some incredible work in just 24 hours!




    Netflix VHF
    Watch Netflix on your Philco Predicta, the TV of tomorrow! We converted a 1950's era TV into a smart TV that runs Netflix.


    Narcos: Plata O Plomo Video Game
    A fun game based on the Netflix original series, Narcos.


    Stream Possible
    Netflix TV experience over a 3G cell connection for the non-broadband rich parts of the world.


    Ok Netflix
    Ok Netflix can find the exact scene in a movie or episode from listening to the dialog that comes from that scene.  Speak the dialog into Ok Netflix and Ok Netflix will do the rest, starting the right title in the right location.

    Smart Channels
    A way to watch themed collections of content that are personalized and also autoplay like serialized content.



    And here are some pictures taken during the event.







    Global Continuous Delivery with Spinnaker

After over a year of development and production use at Netflix, we’re excited to announce that our Continuous Delivery platform, Spinnaker, is available on GitHub. Spinnaker is an open source multi-cloud Continuous Delivery platform for releasing software changes with high velocity and confidence. Spinnaker is designed with pluggability in mind; the platform aims to make it easy to extend and enhance cloud deployment models. To create a truly extensible multi-cloud platform, the Spinnaker team partnered with Google, Microsoft and Pivotal to deliver out-of-the-box cluster management and deployment. As of today, Spinnaker can deploy to and manage clusters simultaneously across both AWS and Google Cloud Platform with full feature compatibility across both cloud providers. Spinnaker also supports deployments to Cloud Foundry, and support for its newest addition, Microsoft Azure, is actively underway.

    If you’re familiar with Netflix’s Asgard, you’ll be in good hands. Spinnaker is the replacement for Asgard and builds upon many of its concepts.  There is no need for a migration from Asgard to Spinnaker as changes to AWS assets via Asgard are completely compatible with changes to those same assets via Spinnaker and vice versa. 


    Continuous Delivery with Spinnaker

Spinnaker facilitates the creation of pipelines that represent a delivery process that can begin with the creation of some deployable asset (such as a machine image, Jar file, or Docker image) and end with a deployment. We looked at the ways various Netflix teams implemented continuous delivery to the cloud and generalized the building blocks of their delivery pipelines into configurable Stages that are composable into Pipelines. Pipelines can be triggered by the completion of a Jenkins Job, manually, via a cron expression, or even by other pipelines. Spinnaker comes with a number of stages, such as baking a machine image, deploying to a cloud provider, running a Jenkins Job, or manual judgement, to name a few. Pipeline stages can be run in parallel or serially.


    Spinnaker Pipelines

    Spinnaker also provides cluster management capabilities and provides deep visibility into an application’s cloud footprint. Via Spinnaker’s application view, you can resize, delete, disable, and even manually deploy new server groups using strategies like Blue-Green (or Red-Black as we call it at Netflix). You can create, edit, and destroy load balancers as well as security groups. 


    Cluster Management in Spinnaker
    Spinnaker is a collection of JVM-based services, fronted by a customizable AngularJS single-page application. The UI leverages a rich RESTful API exposed via a gateway service. 

You can find all the code for Spinnaker on GitHub. There are also installation instructions on how to set up and deploy Spinnaker from source, as well as instructions for deploying Spinnaker from pre-existing images that Kenzan and Google have created. We’ve set up a Slack channel, and we are committed to leveraging StackOverflow as a means for answering community questions. Issues, questions, and pull requests are welcome.

    Sleepy Puppy Extension for Burp Suite

Netflix recently open sourced Sleepy Puppy - a cross-site scripting (XSS) payload management framework for security assessments. One of the most frequently requested features for Sleepy Puppy has been an extension for Burp Suite, an integrated platform for web application security testing. Today, we are pleased to open source a Burp extension that allows security engineers to simplify the process of injecting payloads from Sleepy Puppy and then tracking the XSS propagation over longer periods of time and over multiple assessments.

    Prerequisites and Configuration
    First, you need to have a copy of Burp Suite running on your system. If you do not have a copy of Burp Suite, you can download/buy Burp Suite here. You also need a Sleepy Puppy instance running on a server. You can download Sleepy Puppy here. You can try out Sleepy Puppy using Docker. Detailed instructions on setup and configuration are available on the wiki page.

    Once you have these prerequisites taken care of, please download the Burp extension here.

    If the Sleepy Puppy server is running over HTTPS (which we would encourage), you need to inform the Burp JVM to trust the CA that signed your Sleepy Puppy server certificate. This can be done by importing the cert from Sleepy Puppy server into a keystore and then specifying the keystore location and passphrase while starting Burp Suite. Specific instructions include:


    • Visit your Sleepy Puppy server and export the certificate using Firefox in pem format
• Import the cert in pem format into a keystore with the command below:
  keytool -import -file </path/to/cert.pem> -keystore sleepypuppy_truststore.jks -alias sleepypuppy
• You can specify the truststore information for the plugin either as an environment variable or as a JVM option.
• Set truststore info as environment variables and start Burp as shown below:
  export SLEEPYPUPPY_TRUSTSTORE_LOCATION=</path/to/sleepypuppy_truststore.jks>
  export SLEEPYPUPPY_TRUSTSTORE_PASSWORD=<passphrase provided while creating the truststore using the keytool command above>
  java -jar burp.jar
• Set truststore info as part of the Burp startup command as shown below:
  java -DSLEEPYPUPPY_TRUSTSTORE_LOCATION=</path/to/sleepypuppy_truststore.jks> -DSLEEPYPUPPY_TRUSTSTORE_PASSWORD=<passphrase provided while creating the truststore using the keytool command above> -jar burp.jar
    Now it is time to load the Sleepy Puppy extension and explore its functionality.


    Using the Extension
    Once you launch Burp and load up the Sleepy Puppy extension, you will be presented with the Sleepy Puppy tab.


    sleepypuppy_extension.png


    This tab will allow you to leverage the capabilities of Burp Suite along with the Sleepy Puppy XSS Management framework to better manage XSS testing.


    Some of the features provided by the extension include:


    • Create a new assessment or select an existing assessment
• Add payloads to your assessment and the Sleepy Puppy server from the extension
    • When an Active Scan is conducted against a site or URL, the XSS payloads from the selected Sleepy Puppy Assessment will be executed after Burp's built-in XSS payloads
    • In Burp Intruder, the Sleepy Puppy Extension can be chosen as the payload generator for XSS testing
    • In Burp Repeater, you can replace any value in an existing request with a Sleepy Puppy payload using the context menu
    • The Sleepy Puppy tab provides statistics about any payloads that have been triggered for the selected assessment

You can watch the Sleepy Puppy extension in action on YouTube.

    Interested in Contributing?
    Feel free to reach out or submit pull requests if there’s anything else you’re looking for. We hope you’ll find Sleepy Puppy and the Burp extension as useful as we do!

    by: Rudra Peram

    Creating Your Own EC2 Spot Market -- Part 2

In Part 1 of this series, Creating Your Own EC2 Spot Market, we explained how Netflix manages its EC2 footprint and how we take advantage of our daily peak of over 12,000 unused instances, which we named the “internal spot market.”  This sizeable trough has significantly improved our encoding throughput, and we are pursuing other benefits from this large pool of unused resources.
The Encoding team went through two iterations of internal spot market implementations.  The initial approach was a simple schedule-based borrowing mechanism that was quickly deployed in June in the us-east region to reap immediate benefits.  We applied the experience we gained to influence the next iteration of the design, which is based on real-time availability.
    The main challenge of using the spot instances effectively is handling the dynamic nature of our instance availability.  With correct timing, running spot instances is effectively free; when the timing is off, however, any EC2 usage is billed at the on-demand price.  In this post we will discuss how the real-time, availability-based internal spot market system works and efficiently uses the unused capacity
    Benefits of Extra Capacity
    The encoding system at Netflix is responsible for encoding master media source files into many different output formats and bitrates for all Netflix supported devices.  A typical workload is triggered by source delivery, and sometimes the encoding system receives an entire season of a show within moments.  By leveraging the internal spot market, we have measured the equivalent of a 210% increase in encoding capacity.  With the extra boost of computing resources, we have improved our ability to handle sudden influxes of work and to quickly reduce our backlog.
    In addition to the production environment, the encoding infrastructure maintains 40 “farms” for development and testing.  Each farm is a complete encoding system with 20+ micro-services that matches the capability and capacity of the production environment.  
    Computing resources are continuously evaluated and redistributed based on workload.  With the boost of spot market instances, the total encoding throughput increases significantly.  On the R&D side, researchers leverage these extra resources to carry out experiments in a fraction of the time it used to take.  Our QA automation is able to broaden the coverage of our comprehensive continuous integration suite and run these jobs in less time.
    Spot Market Borrowing in Action
    We started the new spot market system in October, and we are encouraged by the improved performance compared to our borrowing in the first iteration.
    For instance, in one of the research projects, we triggered 12,000 video encoding jobs over a weekend.  We had anticipated the work to finish in a few days, but we were pleasantly surprised to discover that the jobs were completed in only 18 hours.
    The following graph captures that weekend’s activity.
    The Y-axis denotes the amount of video encoder jobs queued in the messaging system, the red line represents high priority jobs, and the yellow area graph shows the amount of medium and low priority jobs.
    Important Considerations
    • By launching on-demand instances in the Encoding team AWS account, the Encoding team never impacts guaranteed capacity (reserved instances) from the main Netflix account.
    • The Encoding team competes for on-demand instances with other Netflix accounts.   
    • Spot instance availability fluctuates and can become unavailable at any moment.  The encoding service needs to react to these changes swiftly.
    • It is possible to dip into unplanned on-demand usage due to a sudden surge of instance usage in other Netflix accounts while we have internal spot instances running.  The benefits of borrowing must significantly outweigh the cost of these on-demand charges.
    • Available spot capacity comes in different types and sizes.  We can make the most out of them by making our jobs instance type agnostic.
    Design Goals
    Cost Effectiveness: Use as many spot instances as are available.  Incur as little unplanned on-demand usage as possible.
    Good Citizenship: We want to minimize contention that may cause a shortage in the on-demand pool.  We take a light-handed approach by yielding spot instances to other Netflix accounts when there is competition on resources.
    Automation: The Encoding team invests heavily in automation.  The encoding system is responsible for encoding activities for the entire Netflix catalog 24x7, hands free.  Spot market borrowing needs to function continuously and autonomously.
    Metrics: Collect Atlas metrics to measure effectiveness, pinpoint areas of inefficiency, and trend usage patterns in near real-time.
    Key Borrowing Strategies
    We spent a great deal of effort devising strategies to address the goals of Cost Effectiveness and Good Citizenship.  We started with a set of simple assumptions, then constantly iterated using our monitoring system, allowing us to validate and fine-tune the initial design into the following set of strategies (a simplified sketch of how they combine appears after the list):
    Real-time Availability Based Borrowing: Closely align utilization with the fluctuating real-time spot instance availability using a Spinnaker API.  Spinnaker is a Continuous Delivery platform that manages Netflix reservations and deployment pipelines, so it is in the optimal position to know which instances are in use across all Netflix accounts.
    Negative Surplus Monitor: Sample spot market availability, and quickly terminate (yield) borrowed instances when we detect an overdraft of internal spot instances.  This ensures that our spot borrowing is treated as the lowest-priority usage in the company and reduces on-demand contention.
    Idle Instance Detection: Detect over-allocated spot instances.  Accelerate down scaling of spot instances to improve time to release, with an additional benefit of reducing borrowing surface area.
    Fair Distribution: When spot instances are abundant, distribute assignment evenly to avoid exhausting one EC2 instance type on a specific AZ.  This helps minimize on-demand shortage and contention while reducing involuntary churn due to negative surplus.
    Smoothing Function: The resource scheduler evaluates assignments of EC2 instances based on a smoothed representation of workload, smoothing out jitters and spikes to prevent over-reaction.
    Incremental Stepping & Fast Evaluation Cycle: Acting in incremental steps avoids over-reaction and allows us to evaluate the workload frequently for rapid self correction.  Incremental stepping also helps distribute instance usage across instance types and AZ more evenly.
    Safety Margin: Reduce contention by leaving some amount of available spot instances unused.  It helps reduce involuntary termination due to minor fluctuations in usage in other Netflix accounts.
    Curfew: Preemptively reduce spot usage ahead of predictable, rapid drops in surplus (e.g. the nightly Netflix personalized recommendation computation schedule).  These curfews help minimize preventable on-demand charges.
    Evacuation Monitor: A system-wide toggle to immediately evacuate all borrowing usage in case of emergency (e.g. regional traffic failover).  Eliminate on-demand contention in case of emergency.
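    To make the interplay of these strategies concrete, here is a minimal sketch of a borrowing policy.  It assumes hypothetical inputs (an unused-reservation count, a workload signal, and curfew/evacuation flags); the constants and structure are illustrative and do not represent the actual Netflix implementation.

    # Illustrative sketch of an availability-based borrowing policy (not Netflix code).
    # Inputs come from hypothetical availability and workload APIs; constants are placeholders.
    class BorrowingPolicy:
        SAFETY_MARGIN = 50      # leave some surplus unused to reduce contention
        MAX_STEP = 100          # incremental stepping: bounded change per evaluation cycle
        SMOOTHING_ALPHA = 0.3   # smoothing function: exponential moving average weight

        def __init__(self):
            self.smoothed_demand = 0.0

        def next_borrow_target(self, unused_count, current_borrowed, pending_work,
                               curfew_active=False, evacuate=False):
            """Return how many internal spot instances to run after this cycle."""
            # Evacuation monitor and curfew: stop borrowing entirely.
            if evacuate or curfew_active:
                return 0
            # Negative surplus monitor: if we are overdrawn, yield immediately.
            if unused_count <= 0:
                return 0
            # Smoothing function: damp spikes in the workload signal.
            self.smoothed_demand = (self.SMOOTHING_ALPHA * pending_work
                                    + (1 - self.SMOOTHING_ALPHA) * self.smoothed_demand)
            # Safety margin: never use the entire surplus; never exceed actual demand.
            ceiling = max(0, unused_count - self.SAFETY_MARGIN)
            desired = min(ceiling, int(self.smoothed_demand))
            # Incremental stepping: move toward the desired count in bounded steps.
            delta = max(-self.MAX_STEP, min(self.MAX_STEP, desired - current_borrowed))
            return current_borrowed + delta

    policy = BorrowingPolicy()
    print(policy.next_borrow_target(unused_count=2000, current_borrowed=300, pending_work=900))

    In a real deployment this evaluation would run on a short cycle against live availability data; here it simply prints the next target instance count.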
    Observations
    The following graph depicts a five-day span of spot usage by instance type.
    This graph illustrates a few interesting points:
    • The variance in color represents different instance types in use, and in most cases the relatively even distribution of bands of color shows that instance type usage is reasonably balanced.
    • The sharp rise and drop of the peaks confirms that the encoding resource manager scales up and down relatively quickly in response to changes in workload.
    • The flat valleys show the frugality of instance usage. Spot instances are only used when there is work for them to do.
    • Not all color bands have the same height because the size of the reservation varies between instance types.  However, we are able to borrow from both large (orange) and small (green) pools, collectively satisfying the entire workload.
    • Finally, although this graph reports instance usage, it indirectly tracks the workload.  The overall shape of the graph shows no discernible pattern in the workload, reflecting the event-driven nature of encoding activities.
    Efficiency
    Based on the AWS billing data from October, we summed up all the borrowing hours and adjusted them relative to the r3.4xlarge instance type that makes up the Encoding reserved capacity.  With the addition of spot market instances, the effective encoding capacity increased by 210%.
    Dark blue denotes spot market borrowing, and light blue represents on-demand usage.
    On-demand pricing is multiple times more expensive than reserved instances, and it varies depending on instance type.  We took the October spot market usage and calculated what it would have cost with purely on-demand pricing and computed a 92% cost efficiency.
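    To illustrate the normalization step, here is a toy sketch that converts each instance type's borrowed hours into r3.4xlarge-equivalent hours using assumed capacity weights.  The weights and hour counts are invented for illustration and are not our actual billing data.

    # Hypothetical normalization of borrowed hours to r3.4xlarge-equivalent hours.
    R34XLARGE_WEIGHT = {       # assumed relative capacity vs. one r3.4xlarge
        "r3.4xlarge": 1.0,
        "r3.2xlarge": 0.5,
        "m4.10xlarge": 2.5,
        "c4.4xlarge": 0.5,
    }

    borrowed_hours = {"r3.4xlarge": 120_000, "r3.2xlarge": 44_000, "c4.4xlarge": 60_000}

    equivalent_hours = sum(R34XLARGE_WEIGHT[t] * hours
                           for t, hours in borrowed_hours.items())
    print(f"{equivalent_hours:,.0f} r3.4xlarge-equivalent hours borrowed")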
    Lessons Learned
    On-demand is Expensive: We already knew this fact, but the idea truly sank in once we observed on-demand charges resulting from sudden overdrafts of spot usage.  A number of the strategies listed above (e.g. Safety Margin, Curfew) were devised specifically to mitigate this occurrence.
    Versatility: Video encoding represents 70% of our computing needs.  We made some tweaks to the video encoder to run on a much wider selection of instance types.  As a result, we were able to leverage a vast number of spot market instances during different parts of the day.
    Tolerance to Interruption: The encoding system is built to withstand interruptions. This attribute works well with the internal spot market since instances can be terminated at any time.
    Next Steps
    Although the current spot market borrowing system is a notable improvement over the previous attempt, we have only scratched the surface.  In the future, we want to leverage spot market instances from different EC2 regions as they become available.  We are also investing heavily in the next generation of encoding architecture that scales more efficiently and responsively.  Here are some ideas we are exploring:
    Cross Region Utilization: By borrowing from multiple EC2 regions, we triple our access to unused reservations relative to the current usable pool.  Using multiple regions also significantly reduces the concentration of on-demand usage in a single EC2 region.
    Containerization: The current encoding system is based on ASG scaling.  We are actively investing in the next generation of our encoding infrastructure using container technology.  The container model will reduce overhead in ASG scaling, minimize overhead of churning, and increase performance and throughput as Netflix continues to grow its catalog.
    Resource Broker: The current borrowing system is monopolistic in that it assumes the Encoding service is the sole borrower.  It is relatively easy to implement for one borrower.  We need to create a resource broker to better coordinate access to the spot surplus when sharing amongst multiple borrowers.
    Conclusion
    In the first month of deployment, we observed significant benefits in terms of performance and throughput.  We were successful in making use of Netflix idle capacity for production, research, and QA.  Our encoding velocity increased dramatically.  Experimental research turn-around time was drastically reduced.  A comprehensive full regression test finishes in half the time it used to take.  With a cost efficiency of 92%, the spot market is not completely free but it is worth the cost.
    All of these benefits translate to faster research turnaround, improved playback quality, and ultimately a better member experience.



    -- Media Cloud Engineering

    Linux Performance Analysis in 60,000 Milliseconds

    You login to a Linux server with a performance issue: what do you check in the first minute?

    At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance. These include Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. While those tools help us solve most issues, we sometimes need to login to an instance and run some standard Linux performance tools.

    In this post, the Netflix Performance Engineering team will show you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available.

    First 60 Seconds: Summary

    In 60 seconds you can get a high level idea of system resource usage and running processes by running the following ten commands. Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization. Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.

    uptime
    dmesg | tail
    vmstat 1
    mpstat -P ALL 1
    pidstat 1
    iostat -xz 1
    free -m
    sar -n DEV 1
    sar -n TCP,ETCP 1
    top
    Some of these commands require the sysstat package installed. The metrics these commands expose will help you complete some of the USE Method: a methodology for locating performance bottlenecks. This involves checking utilization, saturation, and error metrics for all resources (CPUs, memory, disks, etc.). Also pay attention to when you have checked and exonerated a resource, as by process of elimination this narrows the targets to study, and directs any follow-on investigation.
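    If you run this checklist often, a small wrapper script can save typing. The sketch below is purely a convenience illustration (not Netflix tooling): it runs bounded versions of the ten commands in order, assuming sysstat is installed, and prints each command's output. The sampling counts are arbitrary.

    #!/usr/bin/env python3
    """Run a bounded version of the 60-second checklist and print each command's output.

    Illustrative convenience wrapper only. Interval commands are given a fixed count
    so the script terminates on its own.
    """
    import subprocess

    CHECKLIST = [
        "uptime",
        "dmesg | tail",
        "vmstat 1 5",
        "mpstat -P ALL 1 5",
        "pidstat 1 5",
        "iostat -xz 1 5",
        "free -m",
        "sar -n DEV 1 5",
        "sar -n TCP,ETCP 1 5",
        "top -b -n 1 | head -20",
    ]

    def main():
        for cmd in CHECKLIST:
            print(f"\n===== {cmd} =====")
            # shell=True so pipes like "dmesg | tail" work as written.
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            print(result.stdout or result.stderr)

    if __name__ == "__main__":
        main()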

    The following sections summarize these commands, with examples from a production system. For more information about these tools, see their man pages.

    1. uptime


    $ uptime
    23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02
    This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux systems, these numbers include processes wanting to run on CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O). This gives a high level idea of resource load (or demand), but can’t be properly understood without other tools. Worth a quick look only.

    The three numbers are exponentially damped moving sum averages with a 1 minute, 5 minute, and 15 minute constant. The three numbers give us some idea of how load is changing over time. For example, if you’ve been asked to check a problem server, and the 1 minute value is much lower than the 15 minute value, then you might have logged in too late and missed the issue.

    In the example above, the load averages show a recent increase, hitting 30 for the 1 minute value, compared to 19 for the 15 minute value. That the numbers are this large means a lot of something: probably CPU demand; vmstat or mpstat will confirm, which are commands 3 and 4 in this sequence.

    2. dmesg | tail


    $ dmesg | tail
    [1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
    [...]
    [1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
    [1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
    [2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.
    This views the last 10 system messages, if there are any. Look for errors that can cause performance issues. The example above includes the oom-killer, and TCP dropping a request.

    Don’t miss this step! dmesg is always worth checking.

    3. vmstat 1


    $ vmstat 1
    procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0
    32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0
    32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0
    32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0
    32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0
    ^C
    Short for virtual memory stat, vmstat(8) is a commonly available tool (first created for BSD decades ago). It prints a summary of key server statistics on each line.

    vmstat was run with an argument of 1, to print one second summaries. The first line of output (in this version of vmstat) has some columns that show the average since boot, instead of the previous second. For now, skip the first line, unless you want to learn and remember which column is which.

    Columns to check:
    • r: Number of processes running on CPU and waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an “r” value greater than the CPU count is saturation.
    • free: Free memory in kilobytes. If there are too many digits to count, you have enough free memory. The “free -m” command, included as command 7, better explains the state of free memory.
    • si, so: Swap-ins and swap-outs. If these are non-zero, you’re out of memory.
    • us, sy, id, wa, st: These are breakdowns of CPU time, on average across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or with Xen, the guest's own isolated driver domain).
    The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle.

    System time is necessary for I/O processing. A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.

    In the above example, CPU time is almost entirely in user-level, pointing to application level usage instead. The CPUs are also well over 90% utilized on average. This isn’t necessarily a problem; check for the degree of saturation using the “r” column.

    4. mpstat -P ALL 1


    $ mpstat -P ALL 1
    Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

    07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
    07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78
    07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99
    07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00
    07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
    07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03
    [...]
    This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a single-threaded application.

    5. pidstat 1


    $ pidstat 1
    Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

    07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
    07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
    07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave
    07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java
    07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java
    07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java
    07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat

    07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
    07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave
    07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java
    07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java
    07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
    07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
    ^C
    Pidstat is a little like top’s per-process summary, but prints a rolling summary instead of clearing the screen. This can be useful for watching patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation.

    The above example identifies two java processes as responsible for consuming CPU. The %CPU column is the total across all CPUs; 1591% shows that that java process is consuming almost 16 CPUs.

    6. iostat -xz 1


    $ iostat -xz 1
    Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

    avg-cpu: %user %nice %system %iowait %steal %idle
    73.96 0.00 3.73 0.03 0.06 22.21

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    xvda 0.00 0.23 0.21 0.18 4.52 2.08 34.37 0.00 9.98 13.80 5.42 2.44 0.09
    xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 0.00 0.43 1.78 0.28 0.25 0.25
    xvdc 0.01 0.00 1.02 8.86 127.79 595.94 146.50 0.00 0.45 1.82 0.30 0.27 0.26
    dm-0 0.00 0.00 0.69 2.32 10.47 31.69 28.01 0.01 3.23 0.71 3.98 0.13 0.04
    dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00
    dm-2 0.00 0.00 0.09 0.07 1.35 0.36 22.50 0.00 2.55 0.23 5.62 1.78 0.03
    [...]
    ^C
    This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance. Look for:
    • r/s, w/s, rkB/s, wkB/s: These are the delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these for workload characterization. A performance problem may simply be due to an excessive load applied.
    • await: The average time for the I/O in milliseconds. This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.
    • avgqu-sz: The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks.)
    • %util: Device utilization. This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device. Values close to 100% usually indicate saturation.
    If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated, and may be able to handle much more work.

    Bear in mind that poor performing disk I/O isn’t necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes).

    7. free -m


    $ free -m
    total used free shared buffers cached
    Mem: 245998 24545 221453 83 59 541
    -/+ buffers/cache: 23944 222053
    Swap: 0 0 0
    The right two columns show:
    • buffers: For the buffer cache, used for block device I/O.
    • cached: For the page cache, used by file systems.
    We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. The above example looks fine, with many Mbytes in each.

    The “-/+ buffers/cache” line provides less confusing values for used and free memory. Linux uses free memory for the caches, but can reclaim it quickly if applications need it. So in a way the cached memory should be counted as free memory, which is what this line does: in the output above, real used becomes 24545 − 59 − 541 ≈ 23944 MB, and real free becomes 221453 + 59 + 541 = 222053 MB. There’s even a website, linuxatemyram, about this confusion.

    It can be additionally confusing if ZFS on Linux is used, as we do for some services, as ZFS has its own file system cache that isn’t reflected properly by the free -m columns. It can appear that the system is low on free memory, when that memory is in fact available for use from the ZFS cache as needed.

    8. sar -n DEV 1


    $ sar -n DEV 1
    Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

    12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
    12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00
    12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00
    12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

    12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
    12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00
    12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00
    12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    ^C
    Use this tool to check network interface throughput: rxkB/s and txkB/s, as a measure of workload, and also to check if any limit has been reached. In the above example, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under, say, a 1 Gbit/sec limit).

    This version also has %ifutil for device utilization (max of both directions for full duplex), which is something we also use Brendan’s nicstat tool to measure. And like with nicstat, this is hard to get right, and seems to not be working in this example (0.00).

    9. sar -n TCP,ETCP 1


    $ sar -n TCP,ETCP 1
    Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)

    12:17:19 AM active/s passive/s iseg/s oseg/s
    12:17:20 AM 1.00 0.00 10233.00 18846.00

    12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
    12:17:20 AM 0.00 0.00 0.00 0.00 0.00

    12:17:20 AM active/s passive/s iseg/s oseg/s
    12:17:21 AM 1.00 0.00 8359.00 6039.00

    12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
    12:17:21 AM 0.00 0.00 0.00 0.00 0.00
    ^C
    This is a summarized view of some key TCP metrics. These include:
    • active/s: Number of locally-initiated TCP connections per second (e.g., via connect()).
    • passive/s: Number of remotely-initiated TCP connections per second (e.g., via accept()).
    • retrans/s: Number of TCP retransmits per second.
    The active and passive counts are often useful as a rough measure of server load: number of new accepted connections (passive), and number of downstream connections (active). It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true (e.g., consider a localhost to localhost connection).

    Retransmits are a sign of a network or server issue; it may be an unreliable network (e.g., the public Internet), or it may be due to a server being overloaded and dropping packets. The example above shows just one new TCP connection per second.

    10. top


    $ top
    top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
    Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
    %Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
    KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
    4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
    66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
    5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
    4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
    1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
    3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
    5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
    6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
    8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched
    The top command includes many of the metrics we checked earlier. It can be handy to run it to see if anything looks wildly different from the earlier commands, which would indicate that load is variable.

    A downside to top is that it is harder to see patterns over time, which may be more clear in tools like vmstat and pidstat, which provide rolling output. Evidence of intermittent issues can also be lost if you don’t pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue), and the screen clears.

    Follow-on Analysis


    There are many more commands and methodologies you can apply to drill deeper. See Brendan’s Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.

    Tackling system reliability and performance problems at web scale is one of our passions. If you would like to join us in tackling these kinds of challenges we are hiring!

    Caching Content for Holiday Streaming

    'Tis the season for holiday binging. How do seasonal viewing patterns affect how Netflix stores and streams content?

    Our solution for delivering streaming content is Open Connect, a custom-built system that distributes and stores the audio and video content our members download when they stream a Netflix title (a movie or episode). Netflix has a unique library of large files with relatively predictable popularity, and Open Connect's global, distributed network of caching servers was designed with these attributes in mind. This system localizes content as close to our members as possible to achieve a high-quality playback experience through low latency access to content over optimal internet paths. A subset of highly-watched titles makes up a significant share of total streaming, and caching, the process of storing content based on how often it’s streamed by our members, is critical to ensuring enough copies of a popular title are available in a particular location to support the demand of all the members who want to stream it.

    We curate rich, detailed data on what our members are watching, giving us a clear signal of which content is popular today. We enrich that signal with algorithms to produce a strong indicator of what will be popular tomorrow. As a title increases in popularity, more copies of it are added to our caching servers, replacing other, less popular content on a nightly cadence when the network is least busy. Deleting and adding files from servers comes with overhead, however, and we perform these swaps of content with the help of algorithms designed to balance cache efficiency with the network cost of replacing content.
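    As a toy illustration of that balance, the sketch below swaps a title in only when its popularity gain over the least popular cached title outweighs an assumed fill cost. The scoring, threshold, and numbers are invented and bear no relation to the actual Open Connect algorithms.

    # Toy model of a nightly cache-fill decision (illustrative only).
    # predicted_popularity: expected streams per night; fill_cost: relative network
    # cost of copying the title to the cache. Numbers are made up.
    def titles_to_swap_in(cached, candidates, benefit_threshold=1.5):
        """Return (evicted, added) pairs whose popularity gain over the least popular
        cached title exceeds the assumed cost of replacing it."""
        swaps = []
        cached = sorted(cached, key=lambda t: t["predicted_popularity"])
        for candidate in sorted(candidates,
                                key=lambda t: t["predicted_popularity"], reverse=True):
            if not cached:
                break
            evictee = cached[0]
            gain = candidate["predicted_popularity"] - evictee["predicted_popularity"]
            if gain / candidate["fill_cost"] >= benefit_threshold:
                swaps.append((evictee["title"], candidate["title"]))
                cached.pop(0)
        return swaps

    print(titles_to_swap_in(
        cached=[{"title": "Old Title", "predicted_popularity": 120}],
        candidates=[{"title": "Love Actually", "predicted_popularity": 900, "fill_cost": 100}],
    ))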

    How does this play out for titles with highly seasonal patterns? Metadata assembled by human taggers and reviewed by our internal enhanced content team tells us which titles are holiday-related, and using troves of streaming data we can track their popularity throughout the year. Holiday titles ramp in popularity starting in November, so more copies of these titles will be distributed among the network starting in November through their popularity peak at the end of December. The cycle comes full circle when the holiday content is displaced by relatively more popular titles in January.

    [Figure: Weekly streaming of holiday titles over the course of the year]

    Holiday viewing follows a predictable annual pattern, but we also have to deal with less predictable scenarios like introducing new shows without any viewing history or external events that suddenly drive up the popularity of certain titles. For new titles, we model a combination of external and internal data points to create a predicted popularity, allowing us to appropriately cache that content before the first member ever streams it. For unexpected spikes in popularity driven by events like actors popping up in the news, we are designing mechanisms to let us quickly push content to our caches outside of the nightly replacement schedule as they are actively serving members. We're also exploring ways to evaluate popularity more locally; what's popular in Portland may not be what's popular in Philadelphia.

    Whether your tastes run toward Love Actually or The Nightmare Before Christmas, your viewing this holiday season provides valuable information to help optimize how Netflix stores its content.

    Love data, measurement and Netflix? Join us!

    Debugging Node.js in Production

    By Kim Trott, Yunong Xiao

    We recently hosted our latest JavaScript Talks event on our new campus at Netflix headquarters in Los Gatos, California. Yunong Xiao, senior software engineer on our Node.js platform, presented on debugging Node.js in production. Yunong showed hands-on techniques using the scientific method to root cause and solve for runtime performance issues, crashes, errors, and memory leaks.





    We’ve shared some useful links from our talk:
    Videos from our past talks can always be found on our Netflix UI Engineering channel on YouTube. If you’re interested in being notified of future events, just sign up on our notification list.

    High Quality Video Encoding at Scale

    At Netflix we receive high quality sources for our movies and TV shows and encode them to the best video streams possible for a given member’s viewing device and bandwidth capabilities. With the continued growth of our service it has been essential to build a video encoding pipeline that is highly robust, efficient and scalable. Our production system is designed to easily scale to support the demands of the business (i.e., more titles, more video encodes, shorter time to deploy), while guaranteeing a high quality of experience for our members.
    Pipeline in the Cloud
    The video encoding pipeline runs on EC2 Linux cloud instances. The elasticity of the cloud enables us to seamlessly scale up when more titles need to be processed, and scale down to free up resources. Our video processing applications don’t require any special hardware and can run on a number of EC2 instance types. Long processing jobs are divided into smaller tasks and parallelized to reduce end-to-end delay and local storage requirements. This also allows us to exploit our internal spot market, where instances are dynamically allocated based on real-time availability of the compute resources. If a task does not complete because an instance is abruptly terminated, only a small amount of work is lost and the task is rescheduled for another instance. The ability to recover from these transient errors is essential for a robust cloud-based system.

    The figure below shows a high-level overview of our system. We ingest high quality video sources and generate video encodes of various codec profiles, at multiple quality representations per profile. The encodes are packaged and then deployed to a content delivery network for streaming. During a streaming session, the client requests the encodes it can play and adaptively switches among quality levels based on network conditions.

    [Figure: High-level overview of the Netflix video encoding pipeline]
    Video Source Inspection
    To ensure that we have high quality output streams, we need pristine video sources. Netflix ingests source videos from our originals production houses or content partners. In some undesirable cases, the delivered source video contains distortion or artifacts which would result in bad quality video encodes – garbage in means garbage out. These artifacts may have been introduced by multiple processing and transcoding steps before delivery, data corruption during transmission or storage, or human errors during content production. Rather than fixing the source video issues after ingest (for example, apply error concealment to corrupted frames or re-edit sources which contain extra content), Netflix rejects the problematic source video and requests redelivery. Rejecting problematic sources ensures that:
    • The best source video available is ingested into the system. In many cases, error mitigation techniques only partially fix the problem.
    • Complex algorithms (which could have been avoided by better processes upstream) do not unnecessarily burden the Netflix ingest pipeline.
    • Source issues are detected early where a specific and actionable error can be raised.
    • Content partners are motivated to triage their production pipeline and address the root causes of the problems. This will lead to improved video source deliveries in the future.
    Our preferred source type is Interoperable Master Format (IMF).  In addition we support ProRes, DPX, and MPEG (typically older sources).  During source inspection, we 1) verify that the source is conformed to the relevant specification(s), 2) detect content that could lead to a bad viewing experience and 3) generate metadata required by the encoding pipeline. If the inspection deems the source unacceptable, the system automatically informs our content partner about issues and requests a redelivery of the source.

    A modern 4K source file can be quite large. Larger, in fact, than a typical drive on an EC2 instance. In order to efficiently support these large source files, we must run the inspection on the file in smaller chunks. This chunked model lends itself to parallelization. As shown in the more detailed diagram below, an initial inspection step is performed to index the source file, i.e. determine the byte offsets for frame-accurate seeking, and generate basic metadata such as resolution and frame count. The file segments are then processed in parallel on different instances. For each chunk, bitstream-level and pixel-level analysis is applied to detect errors and generate metadata such as temporal and spatial fingerprints. After all the chunks are inspected, the results are assembled by the inspection aggregator to determine whether the source should be allowed into the encoding pipeline.  With our highly optimized inspection workflow, we can inspect a 4K source in less than 15 minutes.  Note that longer duration sources would have more chunks, so the total inspection time will still be less than 15 minutes.

    [Figure: Chunked video source inspection workflow]
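    The chunked, map-and-aggregate pattern described above can be sketched roughly as follows. The function names and the local ProcessPoolExecutor-based parallelism are illustrative stand-ins, not the production inspection service, which distributes chunks across EC2 instances.

    # Simplified sketch of chunked, parallel source inspection (illustrative only).
    from concurrent.futures import ProcessPoolExecutor
    from dataclasses import dataclass

    @dataclass
    class ChunkReport:
        chunk_index: int
        errors: list          # bitstream / pixel-level issues found in this chunk
        fingerprints: list    # temporal/spatial fingerprints for later validation

    def inspect_chunk(source_path: str, chunk_index: int, byte_range: tuple) -> ChunkReport:
        """Placeholder for per-chunk bitstream and pixel-level analysis."""
        # Real inspection would decode the byte range and analyze its frames here.
        return ChunkReport(chunk_index=chunk_index, errors=[], fingerprints=[])

    def inspect_source(source_path: str, chunk_ranges: list) -> bool:
        """Index the source into chunks, inspect them in parallel, then aggregate."""
        with ProcessPoolExecutor() as pool:
            reports = list(pool.map(inspect_chunk,
                                    [source_path] * len(chunk_ranges),
                                    range(len(chunk_ranges)),
                                    chunk_ranges))
        # Aggregator: the source is accepted only if no chunk reported errors.
        all_errors = [e for report in reports for e in report.errors]
        return len(all_errors) == 0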
    Parallel Video Encoding
    At Netflix we stream to a heterogeneous set of viewing devices. This requires a number of codec profiles: VC1, H.264/AVC Baseline, H.264/AVC Main and HEVC. We also support varying bandwidth scenarios for our members, all the way from sub-0.5 Mbps cellular to 100+ Mbps high-speed Internet. To deliver the best experience, we generate multiple quality representations at different bitrates (ranging from 100 kbps to 16 Mbps) and the Netflix client adaptively selects the optimal stream given the instantaneous bandwidth.

    [Figure: Parallel video encoding workflow]

    Similar to inspection, encoding is performed on chunks of the source file, which allows for efficient parallelization. Since we strive for quality control at every step of the pipeline, we verify the correctness of each encoded chunk right after it completes encoding. If a problem is detected, we can immediately triage the problem (or in the case of transient errors, resubmit the task) without waiting for the entire video to complete. When all the chunks corresponding to a stream have successfully completed, they are stitched together by a video assembler. To guard against frame accuracy issues that may have been introduced by incorrect parallel encoding (for example, chunks assembled in the wrong order, or frames dropped or duplicated at chunk boundaries), we validate the assembled stream by comparing the spatial and temporal fingerprints of the encode with that of the source video (fingerprints of the source are generated during the inspection stage).

    In addition to straightforward encoding, the system calculates multiple full-reference video quality metrics for each output video stream. By automatically generating quality scores for each encode, we can monitor video quality at scale. The metrics also help pinpoint bugs in the system and guide us in finding areas for improving our encode recipes. We will provide more detail on the quality metrics we utilize in our pipeline in a future blog post.
    Quality of Service
    Before we implemented parallel chunked encoding, a 1080p movie could take days to encode, and a failure occurring late in the process would delay the encode even further. With our current pipeline, a title can be fully inspected and encoded at the different profiles and quality representations, with automatic quality control checks, within a few hours. This enables us to stream titles within just a few hours of their original broadcast. We are currently working on further improvements to our system which will allow us to inspect and encode a 1080p source in 30 minutes or less.  Note that since the work is done in parallel, processing time is not increased for longer sources.

    Before automated quality checks were integrated into our system, encoding issues (picture corruption, inserted black frames, frame rate conversion, interlacing artifacts, frozen frames, etc) could go unnoticed until reported by Netflix members through Customer Support.  Not only was this a poor member experience, but triaging these issues was also costly and inefficient, often escalating through many teams before the root cause was found.  In addition, encoding failures (for example, due to corrupt sources) required manual intervention and caused long delays in root-causing the failure.  With our investment in automated inspection at scale, we detect issues early, whether they stem from a bad source delivery, an implementation bug, or a glitch in one of the cloud instances, and we provide specific and actionable error messages.  For a source that passes our inspections, we have an encode reliability of 99.99% or better.  When we do find a problem that was not caught by our algorithms, we design new inspections to detect those issues in the future.
    In Summary
    High quality video streams are essential for delivering a great Netflix experience to our members. We have developed, and continue to improve on, a video ingest and encode pipeline that runs on the cloud reliably and at scale. We designed for automated quality control checks throughout so that we fail fast and detect issues early in the processing chain. Video is processed in parallel segments. This decreases end-to-end processing delay, reduces the required local storage and improves the system’s error resilience. We have invested in integrating video quality metrics into the pipeline so that we can continuously monitor performance and further optimize our encoding.

    Our encoding pipeline, combined with the compute power of the Netflix internal spot market, has value outside our day-to-day production operations. We leverage this system to run large-scale video experiments (codec comparisons, encode recipe optimizations, quality metrics design, etc.) which strive to answer questions that are important to delivering the highest quality video streams, and at the same time could benefit the larger video research community.

    by Anne Aaron and David Ronca


    Optimizing Content Quality Control at Netflix with Predictive Modeling

    By Nirmal Govind and Athula Balachandran

    Over 69 million Netflix members stream billions of hours of movies and shows every month in North and South America, parts of Europe and Asia, Australia and New Zealand. Soon, Netflix will be available in every corner of the world with an even more global member base.

    As we expand globally, our goal is to ensure that every member has a high-quality experience every time they stream content on Netflix. This challenging problem is impacted by factors that include quality of the member's Internet connection, device characteristics, content delivery network, algorithms on the device, and quality of content.

    We previously looked at opportunities to improve the Netflix streaming experience using data science. In this post, we'll focus on predictive modeling to optimize the quality control (QC) process for content at Netflix.

    Content Quality

    An important aspect of the streaming experience is the quality of the video, audio, and text (subtitle, closed captions) assets that are used.

    Imagine sitting down to watch the first episode of a new season of your favorite show, only to find that the video and audio are off by 20 seconds. You decide to watch it anyway and turn on subtitles to follow along. What if the subtitles are poorly positioned and run off the screen?

    Depending on the severity of the issue, you may stop watching, or continue because you’re already invested in the content. Either way, it leaves a bad impression and can negatively impact member satisfaction and retention. Netflix sets a high bar on content quality and has a QC process in place to ensure this bar is met. Let’s take a quick look at how the Netflix digital supply chain works and the role of the QC process.

    We receive assets either from the content owners (e.g. studios, documentary filmmakers) or from a fulfillment house that obtains content from the owners and packages the assets for delivery to Netflix. Our QC process consists of automated and manual inspections to identify and replace assets that do not meet our specified quality standards.

    Automated inspections are performed before and after the encoding process that compresses the larger “source” files into a set of smaller encoded distribution files (at different bitrates, for different devices, etc.). Manual QC is then done to check for issues easily detected with the human eye: depending on the content, a QCer either spot checks selected points of the movie or show, or watches the entire duration of the content. Examples of issues caught during the QC process include video interlacing artifacts, audio-video sync issues, and text issues such as missing or poorly placed subtitles.

    It is worth noting that the fraction of assets that fail quality checks is small. However, to optimize the streaming experience, we’re focused on detecting and replacing those sub-par assets. This is even more important as Netflix expands globally and more members consume content in a variety of new languages (both dubbed audio and subtitles). Also, we may receive content from new partners who have not delivered to us before and are not familiar with our quality standards.

    Predictive Quality Control

    As the Netflix catalog, member base, and global reach grow, it is important to scale the manual QC process by identifying defective assets accurately and efficiently.

    Looking at the data

    Data and data science play a key role in how Netflix operates, so the natural question to ask was:
    Can we use data science to help identify defective assets?

    We looked at the data on manual QC failures and observed that certain factors affected the likelihood of an asset failing QC. For example, some combinations of content and fulfillment partners had a higher rate of defects for certain types of assets. Metadata related to the content also showed patterns of failure. For example, older content (by release year) had a higher defect rate, likely due to the use of older formats for the creation and storage of assets. The genre of the content also exhibited certain patterns of failure.

    These types of factors were used to build a machine learning model that predicts the probability that a delivered asset would not meet the Netflix quality standards.

    A predictive model to identify defective assets helps in two significant ways:

    • Scale the content QC process by reducing QC effort on assets that are not defective.
    • Improve member experience by re-allocating resources to the discovery of hard-to-find quality issues that may otherwise be missed due to spot checks.

    Machine Learning

    Using results from past manual QC checks, a supervised machine learning (ML) approach was used to train a predictive quality control model that predicts a “fail” (likely has content quality issue) or “pass.” If an asset is predicted to fail QC, it is sent to manual QC. The modified supply chain workflow with the predictive QC model is shown below.

    Netflix Supply Chain with Predictive Quality Control

    A key goal of the model is to identify all defective assets even if this results in extra manual checks. Hence, we tuned the model for low false-negative rate (i.e. fewer uncaught defects) at the cost of increased false-positive rate.

    Given that only a small fraction of the delivered assets are defective, one of the main challenges is class imbalance in the training data, i.e. we have a lot more data on “pass” assets than “fail” assets. We tackled this by using cost-sensitive training that heavily penalizes misclassification of the minority class (i.e. defective assets).
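    As a rough sketch of what cost-sensitive training on an imbalanced pass/fail dataset can look like, here is an example using scikit-learn's class weighting. The features, class weights, and toy data are hypothetical; this is not the production model.

    # Illustrative cost-sensitive classifier for imbalanced pass/fail QC data.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Hypothetical training data: one row per delivered asset.
    assets = pd.DataFrame({
        "partner_defect_rate": [0.01, 0.02, 0.15, 0.01, 0.20, 0.03],
        "release_year":        [2014, 2012, 1978, 2015, 1965, 2010],
        "genre_code":          [3, 1, 7, 3, 7, 2],
        "failed_qc":           [0, 0, 1, 0, 1, 0],   # 1 = defective asset
    })

    X = assets.drop(columns="failed_qc")
    y = assets["failed_qc"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    # Penalize missing a defective asset much more than flagging a good one,
    # i.e. trade a higher false-positive rate for a low false-negative rate.
    model = RandomForestClassifier(class_weight={0: 1, 1: 20}, random_state=0)
    model.fit(X_train, y_train)

    print(classification_report(y_test, model.predict(X_test), zero_division=0))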

    As with most model-building exercises, domain knowledge played an important role in this project. An observation that led to improved model performance was that defective assets are typically delivered in batches. For example, video assets from episodes within the same season of a show are mostly defective or mostly non-defective. It’s likely that assets in a batch were created or packaged around the same time and/or with the same equipment, and hence with similar defects.

    We performed offline validation of the model by passively making predictions on incoming assets and comparing with actual results from manual QC. This allowed us to fine tune the model parameters and validate the model before deploying into production. Offline validation also confirmed the scaling and quality improvement benefits outlined earlier.

    Looking Ahead

    Predictive QC is a significant step forward in ensuring that members have an amazing viewing experience every time they watch a movie or show on Netflix. As the slate of Netflix Originals grows and more aspects of content creation—for example, localization, including subtitling and dubbing—are owned by Netflix, there is opportunity to further use data to improve content quality and the member experience.

    We’re continuously innovating with data to build creative models and algorithms that improve the streaming experience for Netflix members. The scale of problems we encounter—Netflix accounts for 37.1% of North American downstream traffic at peak—provides for a set of unique modeling challenges. Also, we partner closely with the engineering teams to design and build production systems that embed such machine learning models. If you're interested in working in this exciting space, please check out the Streaming Science & Algorithms and Content Platform Engineering positions on the Netflix jobs site.

    Per-Title Encode Optimization


    We’ve spent years developing an approach, called per-title encoding, where we run analysis on an individual title to determine the optimal encoding recipe based on its complexity. Imagine having very involved action scenes that need more bits to encapsulate the information versus unchanging landscape scenes or animation that need less. This allows us to deliver the same or better experience while using less bandwidth, which will be particularly important in lower bandwidth countries and as we expand to places where video viewing often happens on mobile networks.

    Background
    In traditional terrestrial, cable or satellite TV, broadcasters have an allocated bandwidth and the program or set of programs are encoded such that the resulting video streams occupy the given fixed capacity. Statistical multiplexing is oftentimes employed by the broadcaster to efficiently distribute the bitrate among simultaneous programs. However, the total accumulated bitrate across the programs should still fit within the limited capacity. In many cases, padding is even added using null packets to guarantee strict constant bitrate for the fixed channel, thus wasting precious data rate. Furthermore, with pre-set channel allocations, less popular programs or genres may be allocated lower bitrates (and therefore, worse quality) than shows that are viewed by more people.

    With the advantages of Internet streaming, Netflix is not bound to pre-allocated channel constraints. Instead, we can deliver the best video quality stream to a member, no matter what the program or genre, tailored to the member’s available bandwidth and viewing device capability. We pre-encode streams at various bitrates applying optimized encoding recipes. On the member’s device, the Netflix client runs adaptive streaming algorithms which instantaneously select the best encode to maximize video quality while avoiding playback interruptions due to rebuffers.

    Encoding with the best recipe is not a simple problem. For example, assuming a 1 Mbps bandwidth, should we stream H.264/AVC at 480p, 720p or 1080p? At 480p, 1 Mbps will likely not exhibit encoding artifacts such as blocking or ringing, but if the member is watching on an HD device, the upsampled video will not be sharp. On the other hand, if we encode at 1080p we send a higher resolution video, but the bitrate may be too low such that most scenes will contain annoying encoding artifacts.
    The Best Recipe for All
    When we first deployed our H.264/AVC encodes in late 2010, our video engineers developed encoding recipes that worked best across our video catalogue (at that time). They tested various codec configurations and performed side-by-side visual tests to settle on codec parameters that produced the best quality trade-offs across different types of content. A set of bitrate-resolution pairs (referred to as a bitrate ladder), listed below, were selected such that the bitrates were sufficient to encode the stream at that resolution without significant encoding artifacts:

    Bitrate (kbps)    Resolution
    235               320x240
    375               384x288
    560               512x384
    750               512x384
    1050              640x480
    1750              720x480
    2350              1280x720
    3000              1280x720
    4300              1920x1080
    5800              1920x1080

    This “one-size-fits-all” fixed bitrate ladder achieves, for most content, good quality encodes given the bitrate constraint.  However, for some cases, such as scenes with high camera noise or film grain noise, the highest 5800 kbps stream would still exhibit blockiness in the noisy areas. On the other end, for simple content like cartoons, 5800 kbps is far more than needed to produce excellent 1080p encodes. In addition, a customer whose network bandwidth is constrained to 1750 kbps might be able to watch the cartoon at HD resolution, instead of the SD resolution specified by the ladder above.

    The titles in Netflix’s video collection have very high diversity in signal characteristics. In the graph below we present a depiction of the diversity of 100 randomly sampled titles.  We encoded 100 sources at 1080p resolution using x264 constant QP (Quantization Parameter) rate control. At each QP point, for every title, we calculate the resulting bitrate in kbps, shown on the x-axis, and PSNR (Peak Signal-To-Noise Ratio) in dB, shown on the y-axis, as a measure of video quality.


    The plots show that some titles reach very high PSNR (45 dB or more) at bitrates of 2500 kbps or less. On the other extreme, some titles require bitrates of 8000 kbps or more to achieve an acceptable PSNR of 38 dB.

    Given this diversity, a one-size-fits-all scheme obviously cannot provide the best video quality for a given title and member’s allowable bandwidth. It can also waste storage and transmission bits because, in some cases, the allocated bitrate goes beyond what is necessary to achieve a perceptible improvement in video quality.

    Side Note on Quality Metrics:  For the above figure, and many of the succeeding plots, we plot PSNR as the measure of quality. PSNR is the most commonly used metric in video compression. Although PSNR does not always reflect perceptual quality, it is a simple way to measure the fidelity to the source, gives a good indication of quality at the high and low ends of the range (i.e. 45 dB is very good quality, 35 dB will show encoding artifacts), and is a good indication of quality trends within a single title. The analysis can also be applied using other quality measures such as the VMAF perceptual metric. VMAF (Video Multi-Method Assessment Fusion) is a perceptual quality metric developed by Netflix in collaboration with University of Southern California researchers.  We will publish details of this quality metric in a future blog.
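    For reference, PSNR between a source frame and its reconstructed (encoded then decoded) counterpart follows the standard definition, sketched below. This is the generic formula, not Netflix-specific tooling; the example frames are synthetic.

    # Standard PSNR computation between a source frame and a reconstructed frame.
    import numpy as np

    def psnr(source: np.ndarray, reconstructed: np.ndarray, max_value: float = 255.0) -> float:
        """Peak Signal-to-Noise Ratio in dB for 8-bit frames (max_value = 255)."""
        mse = np.mean((source.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
        if mse == 0:
            return float("inf")   # identical frames
        return 10.0 * np.log10((max_value ** 2) / mse)

    # Example: a frame distorted by very light noise scores around 45 dB (high fidelity).
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 256, size=(1080, 1920), dtype=np.uint8)
    noisy = np.clip(frame.astype(np.int16) + rng.integers(-2, 3, size=frame.shape), 0, 255).astype(np.uint8)
    print(f"{psnr(frame, noisy):.1f} dB")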
    The Best Recipe for the Content
    Why Per-Title?
    Consider an animation title where the content is “simple”, that is, the video frames are composed mostly of flat regions with no camera or film grain noise and minimal motion between frames. We compare the quality curve for the fixed bitrate ladder with a bitrate ladder optimized for the specific title:


    As shown in the figure above, encoding this video clip at 1920x1080, 2350 kbps (A) produces a high quality encode, and adding bits to reach 4300 kbps (B) or even 5800 kbps (C) will not deliver a noticeable improvement in visual quality (for encodes with PSNR 45 dB or above, the distortion is perceptually unnoticeable). In the fixed bitrate ladder, for 2350 kbps, we encode at 1280x720 resolution (D). Therefore members with bandwidth constraints around that point are limited to 720p video instead of the better quality 1080p video.

    On the other hand, consider an action movie that has significantly more temporal motion and spatial texture than the animation title. It has scenes with fast-moving objects, quick scene changes, explosions and water splashes. The graph below shows the quality curve of an action movie.

    Encoding these high complexity scenes at 1920x1080, 4300 kbps (A), would result in encoding artifacts such as blocking, ringing and contouring. A better quality trade-off would be to encode at a lower resolution of 1280x720 (B), eliminating the encoding artifacts at the expense of added scaling. Encoding artifacts are typically more annoying and visible than the blurring introduced by downscaling (before the encode) and then upsampling at the member’s device. For this title, with its high complexity scenes, it might even be beneficial to encode 1920x1080 at a bitrate beyond 5800 kbps, say 7500 kbps, to eliminate the encoding artifacts completely.

    To deliver the best quality video to our members, each title should receive a unique bitrate ladder, tailored to its specific complexity characteristics. Over the last few years, the encoding team at Netflix invested significant research and engineering to investigate and answer the following questions:
    • Given a title, how many quality levels should be encoded such that each level produces a just-noticeable-difference (JND)?
    • Given a title, what is the best resolution-bitrate pair for each quality level?
    • Given a title, what is the highest bitrate required to achieve the best perceivable quality?
    • Given a video encode, what is the human perceived quality?
    • How do we design a production system that can answer the above questions in a robust and scalable way?
    The Algorithm
    To design the optimal per-title bitrate ladder, we select the total number of quality levels and the bitrate-resolution pair for each quality level according to several practical constraints. For example, we need backward compatibility (streams must be playable on all previously certified Netflix devices), so we limit the resolution selection to a finite set: 1920x1080, 1280x720, 720x480, 512x384, 384x288 and 320x240. The bitrate selection is also limited to a finite set, where adjacent bitrates differ by roughly 5%.
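
    As a rough sketch of what this constrained search space looks like, the candidate grid can be generated as follows. The bitrate range endpoints below are illustrative assumptions, not production values; only the resolution set and the ~5% spacing come from the description above.

        # Finite resolution set from the text; the bitrate range is illustrative.
        RESOLUTIONS = ["1920x1080", "1280x720", "720x480", "512x384", "384x288", "320x240"]

        def candidate_bitrates(lo_kbps=100, hi_kbps=8000, step=1.05):
            # Bitrates spaced roughly 5% apart, as described above.
            rates, r = [], float(lo_kbps)
            while r <= hi_kbps:
                rates.append(round(r))
                r *= step
            return rates

        print(len(RESOLUTIONS), "resolutions x", len(candidate_bitrates()), "candidate bitrates")
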

    We also have a number of optimality criteria that we consider.
    • The selected bitrate-resolution pair should be efficient, i.e. at a given bitrate, the produced encode should have as high quality as possible.
    • Adjacent bitrates should be perceptually spaced. Ideally, the perceptual difference between two adjacent bitrates should fall just below one JND. This ensures that quality transitions are smooth when switching between bitrates, and that the fewest quality levels are used to span the wide range of perceptual quality the bitrate ladder has to cover.

    To build some intuition, consider the following example where we encode a source at three different resolutions with various bitrates.
    Encoding at three resolutions and various bitrates. Blue markers depict encoding points and the red curve indicates the PSNR-bitrate convex hull.

    At each resolution, the quality of the encode monotonically increases with the bitrate, but the curve starts flattening out (A and B) when the bitrate goes above some threshold. This is because every resolution has an upper limit in the perceptual quality it can produce. When a video gets downsampled to a low resolution for encoding and later upsampled to full resolution for display, its high frequency components get lost in the process.

    On the other hand, a high-resolution encode may produce lower quality than an encode at the same bitrate but at a lower resolution (see C and D). This is because encoding more pixels with lower precision can produce a worse picture than encoding fewer pixels at higher precision combined with upsampling and interpolation. Furthermore, at very low bitrates the encoding overhead associated with every fixed-size coding block starts to dominate the bitrate consumption, leaving very few bits for encoding the actual signal. Encoding at high resolution with insufficient bitrate produces artifacts such as blocking, ringing and contouring.

    Based on the discussion above, we can draw a conceptual plot to depict the bitrate-quality relationship for any video source encoded at different resolutions, as shown below:

    We can see that each resolution has a bitrate region in which it outperforms the other resolutions. If we collect all these regions from all the available resolutions, they collectively form a boundary called the convex hull. In an economic sense, the convex hull is where an encoding point achieves Pareto efficiency. Ideally, we want to operate exactly on the convex hull, but due to practical constraints (for example, we can only select from a finite number of resolutions), we select bitrate-resolution pairs that are as close to the convex hull as possible.

    It is practically infeasible to construct the full bitrate-quality graphs spanning the entire quality region for each title in our catalogue. To implement a practical solution in production, we perform trial encodings at different quantization parameters (QPs), over a finite set of resolutions. The QPs are chosen such that they are one JND apart. For each trial encode, we measure the bitrate and quality. By interpolating curves based on the sample points, we produce bitrate-quality curves at each candidate resolution. The final per-title bitrate ladder is then derived by selecting points closest to the convex hull.
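
    To make the selection step concrete, here is a minimal Python sketch of the idea: interpolate a bitrate-quality curve per resolution from a handful of trial encodes, then, for each candidate bitrate, keep the resolution whose interpolated quality is highest, i.e., the point closest to the convex hull. The trial measurements and candidate bitrates below are made-up numbers for illustration only; this is not the production algorithm, which also enforces JND spacing and the other constraints described above.

        import numpy as np

        # Hypothetical trial-encode measurements per resolution: (bitrate_kbps, quality).
        trial_points = {
            "1920x1080": [(2000, 37.0), (3500, 41.5), (5800, 44.0), (8000, 45.5)],
            "1280x720":  [(1000, 36.5), (1750, 39.5), (3000, 41.5), (4500, 42.3)],
            "720x480":   [(400, 33.0),  (800, 36.0),  (1500, 38.0), (2200, 38.6)],
        }

        def quality_at(bitrate, points):
            # Interpolate quality at a given bitrate from the sparse trial points.
            rates, quals = zip(*sorted(points))
            if bitrate < rates[0] or bitrate > rates[-1]:
                return None  # outside the measured range for this resolution
            return float(np.interp(bitrate, rates, quals))

        def per_title_ladder(candidate_bitrates):
            # For each candidate bitrate, pick the resolution with the highest quality,
            # i.e., the bitrate-resolution pair closest to the convex hull.
            ladder = []
            for br in candidate_bitrates:
                best = None
                for res, pts in trial_points.items():
                    q = quality_at(br, pts)
                    if q is not None and (best is None or q > best[1]):
                        best = (res, q)
                if best:
                    ladder.append((br, best[0], round(best[1], 1)))
            return ladder

        print(per_title_ladder([800, 1750, 2350, 4300, 5800]))

    With these made-up curves, the low bitrates land on the lower resolutions, and the ladder switches to 1080p only once it beats the upscaled 720p encode, mirroring the crossover behavior in the conceptual plot above.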
    Sample Results
    BoJack Horseman is an example of an animation with simple content - flat regions and low motion from frame to frame. In the fixed bitrate ladder scheme, we use 1750 kbps for the 480p encode. For this particular episode, with the per-title recipe we start streaming 1080p video at 1540 kbps. Below we compare cropped screenshots (assuming a 1080p display) from the two versions (top: 1750 kbps, bottom: new 1540 kbps). The new encode is crisper and has better visual quality.

    Orange is the New Black has video characteristics of more average complexity. At the low bitrate range, there is no significant quality improvement with the new scheme. At the high end, per-title encoding assigns 4640 kbps to the highest quality 1080p encode, a 20% bitrate savings compared to 5800 kbps in the fixed ladder scheme. For this title we avoid wasting bits while maintaining the same excellent visual quality for our members. The images below show a screenshot at 5800 kbps (top) vs. 4640 kbps (bottom).

    The Best Recipe for Your Device
    In the description above, where we select the optimized per-title bitrate ladder, there is an inherent assumption that the viewing device can receive and play any of the encoded resolutions. However, because of hardware constraints, some devices may be limited to resolutions lower than the original resolution of the source content. If we select the convex hull covering resolutions up to 1080p, this could lead to suboptimal viewing experiences for, say, a tablet limited to 720p decoding hardware. For example, given an animation title, we may switch to 1080p at 2000 kbps because it results in better quality than a 2000 kbps 720p stream. However, the tablet cannot use the 1080p encode and would be constrained to a sub-2000 kbps stream, even if its bandwidth allows for a better quality 720p encode.
    To remedy this, we design additional per-title bitrate ladders corresponding to the maximum playable resolution on the device. More specifically, we design additional optimal per-title bitrate ladders tailored to 480p and 720p-capped devices.  While these extra encodes reduce the overall storage efficiency for the title, adding them ensures that our customers have the best experience.
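
    Continuing the earlier sketch, deriving a device-capped ladder amounts to dropping the resolutions a device cannot decode before running the same convex-hull selection. The device classes and resolution sets below are illustrative assumptions, not an actual device database.

        # Illustrative decode capabilities per device class.
        DEVICE_CAPS = {
            "1080p": {"1920x1080", "1280x720", "720x480", "512x384", "384x288", "320x240"},
            "720p":  {"1280x720", "720x480", "512x384", "384x288", "320x240"},
            "480p":  {"720x480", "512x384", "384x288", "320x240"},
        }

        def capped_trial_points(trial_points, device_class):
            # Restrict the candidate resolutions, then derive the ladder as before.
            allowed = DEVICE_CAPS[device_class]
            return {res: pts for res, pts in trial_points.items() if res in allowed}

        print(sorted(DEVICE_CAPS["720p"]))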
    What does this mean for my Netflix shows?
    Per-title encoding allows us to deliver higher quality video in two ways. Under low-bandwidth conditions, per-title encoding will often give you better video quality, because titles with “simple” content, such as BoJack Horseman, will now be streamed at a higher resolution for the same bitrate. When the available bandwidth is adequate for high bitrate encodes, per-title encoding will often give you even better video quality for complex titles, such as Marvel's Daredevil, because we will encode at a higher maximum bitrate than our current recipe. Our continuous innovation on this front recognizes the importance of providing an optimal viewing experience for our members while simultaneously using less bandwidth and being better stewards of the Internet.


    by Anne Aaron, Zhi Li, Megha Manohara, Jan De Cock and David Ronca

    An update to our Windows 10 app

    We have published a new version of our Windows 10 app to the Windows Store. This release delivers a refreshed user experience powered by an entirely new implementation on the Universal Windows Platform.

    The New User Experience

    The updated Browse experience provides vertical scrolling through categories and horizontal scrolling through the items in a category.


    The updated Details view features large, cinematic artwork for the show or movie. The Details view for a Show includes episode information while the Details view for a Movie includes suggestions for other content.


    Our members on Windows run across many different screen sizes, resolutions and scaling factors. The new version of the application uses a responsive layout to optimize the size and placement of items based on the window size and scaling factor.

    Since many Windows 10 devices support touch input on integrated displays or via gestures on their trackpad, we have included affordances for both in this update. When a member is browsing content with a mouse, paginated scrolling of the content within a row is enabled by buttons on the ends of the row. When a member is using touch on an integrated display or via gestures on their trackpad, inertial scrolling of rows is enabled via swipe gestures.

    Using the Universal Windows Platform

    Over the last few years we have launched several applications for Windows and Windows Phone. These applications were built from a few code bases spanning several technologies, including Silverlight, XAML, and C#. Bringing new features to our members on Windows platforms has required us to make changes in several code bases and ship multiple application updates.

    With the Universal Windows Platform, we’re able to build an application from a single code base and run on many Windows 10 devices. Although the initial release of this application supports desktops, laptops and tablets running Windows 10, we have run our application on other Windows 10 devices and we will be adding support for phones running Windows 10 in the near future.

    This new version of the application is a JavaScript-based implementation that uses Microsoft’s WinJS library. Like several other teams at Netflix (see the tech blog posts Netflix Likes React and Making Netflix.com Faster), we chose to use Facebook React. Building the application in JavaScript has allowed us to use the same HTML5 video playback engine that powers our browser-based applications.

    Windows Features

    The new version of the app continues to support two features that are unique to the Windows platform.

    When the app is pinned to the Start menu, the application tile is a Live Tile that shows artwork representing the items in a member’s Continue Watching list. We support several Live Tile sizes, and the large tile size is a new addition in this version of the app.

    Our integration with Cortana enables members to search with voice commands. On app start, we register our supported commands with Cortana. Once a member has signed in, they can issue any of our supported commands to Cortana. Here are the Cortana commands that we support in English:


    If a user were to tell Cortana “Netflix find Jessica Jones”, Cortana would start the app (if needed) and perform a search for Jessica Jones.

    What’s Next?

    We’re excited to share this update with our members and we’re hard at work on a new set of features and enhancements. The Universal Windows Platform will enable us to support phones running Windows 10 in the near future.

    Visit the Windows Store to get the app today!

    by Sean Sharma

    HTML5 Video is now supported in Firefox

    Today we’re excited to announce the availability of our HTML5 player in Firefox! Windows support is rolling out this week, and OS X support will roll out next year.

    Firefox ships with the very latest versions of the HTML5 Premium Video Extensions. These include the Media Source Extensions (MSE), which enable our video streaming algorithms to adapt to your available bandwidth; the Encrypted Media Extensions (EME), which allow for the viewing of protected content; and the Web Cryptography API (WebCrypto), which implements the cryptographic functions used by our open source Message Security Layer client-server protocol.

    We worked closely with Mozilla and Adobe throughout development. Adobe supplies a content decryption module (CDM) that powers the EME API and allows protected content to play. We were pleased to find through our joint field testing that Adobe Primetime's CDM, Mozilla’s <video> tag, and our player all work together seamlessly to provide a high quality viewing experience in Firefox. With the new Premium Video Extensions, Firefox users will no longer need to take an extra step of installing a plug-in to watch Netflix.

    We’re gratified that our HTML5 player support now extends to the latest versions of all major browsers, including Firefox, IE, Edge, Safari, and Chrome. Upgrade today to the latest version of your browser to get our best-in-class playback experience.

    Dynomite with Redis on AWS - Benchmarks

    About a year ago, the Cloud Database Engineering (CDE) team published Introducing Dynomite. Dynomite is a proxy layer that provides sharding and replication and can turn existing non-distributed datastores into a fully distributed system with multi-region replication. One of its core features is the ability to scale a data store linearly to meet rapidly increasing traffic demands. Dynomite also provides high availability, and was designed and built to support Active-Active Multi-Regional Resiliency.
    Dynomite with Redis is now used as a production system within Netflix. This post is the first in a series that pragmatically examines Dynomite's use cases and features. In this post, we will show performance results on Amazon Web Services (AWS) with and without the recently added consistency feature.

    Dynomite Consistency

    Dynomite extends eventual consistency to tunable consistency in the local region. The consistency level specifies how many replicas must respond to a write or read request before returning data to the client application. Read and write consistency can be configured separately, cluster-wide, to manage availability versus data accuracy. There are two configurations:
    • DC_ONE: Reads and writes are propagated synchronously only to the token owner in the local Availability Zone (AZ) and asynchronously replicated to other AZs and regions.
    • DC_QUORUM: Reads and writes are propagated synchronously to a quorum of nodes in the local region and asynchronously to the rest. The quorum is calculated as ceiling((n+1)/2), where n is the number of nodes in a region, and the operation succeeds if the read/write succeeds on that many nodes (see the sketch below).
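
    A minimal Python sketch of the quorum formula above, purely for illustration:

        import math

        def quorum(n):
            # Number of replicas that must acknowledge a read/write under DC_QUORUM.
            return math.ceil((n + 1) / 2)

        # With 3 nodes per region (one per Availability Zone), a quorum is 2 nodes.
        print([(n, quorum(n)) for n in (3, 5, 7)])  # -> [(3, 2), (5, 3), (7, 4)]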

    Test Setup

    Client (workload generator) Cluster

    For the workload generator, we used an internal Netflix tool called Pappy. Pappy is well integrated with other Netflix OSS services such as Archaius (fast properties), Servo (metrics), and Eureka (discovery). However, any other distributed load generator with a Redis client plugin can be used to replicate the results. Pappy has support for modules, and one of them is the Dyno Java client.
    The Dyno client uses topology-aware (token-aware) load balancing to connect directly to the Dynomite coordinator node that owns the specified data. Dyno also uses zone awareness to send traffic to Dynomite nodes in the local ASG. To get the full benefit of a Dynomite cluster, (a) the Dyno client cluster should be deployed across all ASGs, so all nodes can receive client traffic, and (b) the number of client application nodes per ASG must be larger than the corresponding number of Dynomite nodes in that ASG, so that the cumulative network capacity of the client cluster is at least equal to that of the Dynomite layer.
    Dyno also uses connection pooling with persistent connections to reduce connection churn on the Dynomite nodes. However, in performance benchmarks, tuning Dyno can be tricky, as the workload generator may become the bottleneck due to thread contention. In our benchmarks, we monitored the connection-pool acquisition delay metrics that Dyno exposes.
    • Client: Dyno Java client, using default configuration (token aware + zone aware)
    • Number of nodes: Equal to the number of Dynomite nodes in each experiment.
    • Region: us-west-2 (us-west-2a, us-west-2b and us-west-2c)
    • EC2 instance type: m3.2xlarge (30GB RAM, 8 CPU cores, Network throughput: high)
    • Platform: EC2-Classic
    • Data size: 1024 Bytes
    • Number of Keys: 1M random keys
    • The demo application used a simple key-value workload for reads and writes, i.e., the Redis GET and SET APIs.
    • Read/Write ratio: 80:20 (the OPS varied per test, but the ratio was kept at 80:20)
    • Number of readers/writers: 80:20 ratio of reader to writer threads, i.e., 32 readers/8 writers per Dynomite node. We performed some experiments varying the number of readers and writers and found that, in the context of our experiments, 32 readers and 8 writers per Dynomite node gave the best throughput-latency tradeoff.

    Dynomite Cluster

    • Dynomite: Dynomite 0.5.6
    • Data store: Redis 2.8.9
    • Operating system: Ubuntu 14.04.2 LTS
    • Number of nodes: 3-48 (doubling every time)
    • Region: us-west-2 (us-west-2a, us-west-2b and us-west-2c)
    • EC2 instance type: r3.2xlarge (61GB RAM, 8 CPU cores, Network throughput: high)
    • Platform: EC2-Classic
    The test was performed in a single region with 3 availability zones. Note that replicating to two other availability zones is not a requirement for Dynomite, but rather a deployment choice for high availability at Netflix. A Dynomite cluster of 3 nodes means that there was 1 node per availability zone; however, all three nodes take client traffic as well as replication traffic from peer nodes. Our results were captured using Atlas. Each experiment was run 3 times, with each run lasting 3 hours, and the reported results are averaged over these runs.
    For our benchmarks, we refer to the Dynomite node as the node that contains both the Dynomite layer and Redis. Hence we do not distinguish whether the Dynomite layer or Redis contributes to the average latency.

    Linear Scale Test with DC_ONE

    The graphs indicate that Dynomite can scale horizontally in terms of throughput, and can therefore handle more traffic by increasing the number of nodes per region. On a per-node basis with r3.2xlarge nodes, Dynomite can fully process the traffic generated by the client workload generator (i.e., 32K read OPS and 8K write OPS per node). For a 1KB payload, as the above graphs show, the main bottleneck is the network (1Gbps EC2 instances). Therefore, Dynomite could potentially provide even higher throughput if r3.4xlarge (2Gbps) or r3.8xlarge (10Gbps) EC2 instances are used. Note, however, that 10Gbps optimizations are only effective when the instances are launched in Amazon's VPC with instance types that support Enhanced Networking using single root I/O virtualization (SR-IOV).
    The average and median latency values show that Dynomite can provide sub-millisecond average latency to the client application. More specifically, Dynomite does not add extra latency as it scales to higher number of nodes, and therefore higher throughput. Overall, the Dynomite node contributes around 20% of the average latency, and the rest of it is a result of the network latency and client processing latency.
    At the 95th percentile, Dynomite's latency is 0.4ms and does not increase as we scale the cluster up or down. The network and client are the major contributors to the 95th percentile latency, as the Dynomite node's contribution is <10%.
    It is evident from the 99th percentile graph that the latency at the Dynomite level remains roughly constant while the client-side latency increases, indicating the variable nature of the network between the clusters.

    Linear Scale Test With DC_QUORUM

    The test setup was similar to what we used for DC_ONE tests above. Consistency was set to DC_QUORUM for both reads and writes on all Dynomite nodes. In DC_QUORUM, our expectations are that throughput will reduce and latency will increase because Dynomite waits for quorum number of responses.
    Looking at the above graph, it is clear that Dynomite still scales well as the number of cluster nodes increases. Moreover, Dynomite achieves 18K OPS per node in our setup when the cluster spans a single region, compared to 40K OPS per node with DC_ONE.
    The average and median latency remain <2.5ms even when DC_QUORUM consistency is enabled on the Dynomite nodes. They are slightly higher than in the corresponding DC_ONE experiments. In DC_ONE, the Dynomite coordinator only waits for the local-zone node to respond; in DC_QUORUM, the coordinator waits for a quorum of nodes to respond. Hence, the overall read and write latency must also include the network hop to the other ASGs and the latency of performing the corresponding operation on the nodes in those ASGs.
    The 95th percentile at the Dynomite level is less than 2ms even after increasing the traffic on each Dyno client node (linear scale). At the client side it remains below 3ms.
    At the 99th percentile with DC_QUORUM enabled, Dynomite produces less than 3ms of latency. When including the network from the cluster to the client, the latency remains well below 5ms, opening the door for a number of applications that require consistency with low latency. Dynomite can report the 90th and 99.9th percentiles through its statistics port; for brevity, we present only the 99th percentile.

    Pipelining

    Redis pipelining is client-side batching that is also supported by Dynomite: the client application sends requests without waiting for the response to a previous request, and later reads the responses for the whole batch at once. Pipelining makes it possible to increase overall throughput at the expense of additional latency for individual operations. In the following experiments, the Dyno client randomly selected between 3 and 10 operations per pipeline request. We believe this configuration is close to how a client application would use Redis pipelining. The experiments were performed for both DC_ONE and DC_QUORUM.
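
    The Dyno client is a Java library, but the batching pattern is the same as plain Redis pipelining. Below is a minimal illustration using the Python redis-py client against a hypothetical endpoint; the batch size and 1KB values mirror the benchmark setup and are not a Dyno-specific API.

        import redis  # redis-py client, used here only to illustrate the pipelining pattern

        r = redis.Redis(host="localhost", port=6379)  # hypothetical endpoint

        # Send several operations in one round trip instead of one round trip each.
        pipe = r.pipeline(transaction=False)
        for i in range(8):                      # e.g. 3-10 operations per batch, as in the tests
            pipe.set("key:%d" % i, "x" * 1024)  # 1KB payloads, matching the benchmark setup
        results = pipe.execute()                # all responses are returned together
        print(results)
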
    For comparison, we show both the non-pipelining and pipelining results. In our tests, pipelining increased throughput by up to 50%. For a small Dynomite cluster the improvement is larger, but as Dynomite scales horizontally the benefit of pipelining decreases.
    Latency is a function of how many requests are combined into one pipeline request, so it varies and is higher than for non-pipelined requests.

    Conclusion

    We performed these tests to gain more insight into Dynomite using Redis at the data store layer, and into how to size our clusters. We could have achieved better results with better instance types at both the client and the Dynomite server cluster. For example, adding Dynomite nodes with better network capacity (especially ones supporting Enhanced Networking on Linux instances in a VPC) could further increase the performance of our clusters.
    Another way to improve performance is to use fewer availability zones. In that case, Dynomite would replicate the data to one additional availability zone instead of two, leaving more bandwidth available for client connections. In our experiments we used 3 availability zones in us-west-2, which is a common deployment in most production clusters at Netflix.
    In summary, our benchmarks were based on instance types and deployments that are common for Dynomite at Netflix. We presented results indicating that DC_QUORUM provides better read and write guarantees to the client, but with higher latencies and lower throughput. We also showed how a client can use Redis pipelining and benefit from request batching.
    We briefly mentioned the availability of higher consistency in this article. In the next article we'll dive deeper into how we implemented higher consistency and how we handle anti-entropy.
    by: Shailesh Birari, Jason Cacciatore, Minh Do, Ioannis Papapanagiotou, Christos Kalantzis
