Modernizing the Web Playback UI

December 11, 2018, 4:36 pm

≫ Next: Performance comparison of video coding standards: an adaptive streaming perspective

≪ Previous: Our learnings from adopting GraphQL

by Corey Grunewald & Matt Jaquish

Since 2013, the user experience of playing videos on the Netflix website has changed very little. During this period, teams at Netflix have rolled out amazing video playback features, but the visual design and user controls of the playback UI have remained the same.

*The visual design and user controls of playback have been the same since 2013.*

Over the past two years, the Web UI team has had a long running goal to modernize the user experience of playback for our members. Playback consists of three primary canvases:

Pre Play: A video to be shown to members before the the main content (e.g. a season recap).
Video Playback: The member is watching their selected content, and has access to controls to pause the video, change the volume, etc.
Post Play: When the selected content ends, members are presented with the next episode in a series, or recommendations on what to watch next.

Through AB testing and subsequent learning, we have launched an updated, modern playback UI experience for our members. We want to share our journey of how we got to where we are today.

If At First You Don’t Succeed…

Starting in 2016, our main priority for the modernization effort was to start using React to build and render the playback UI components. While the rest of the website transitioned to using React for the UI in the Summer of 2015, the playback UI continued to use a custom, vanilla JavaScript framework. Only a few engineers on the team had experience with the framework, which could create bottlenecks when working on fixes or features. By moving to React, we would enable more developers to contribute to building a better experience because of the familiarity and ergonomics it provided.

Along with improving developer throughput, we needed to eliminate the intricate bridge we had created between the custom framework and the existing React components used for the website. Building the playback UI with React meant that we could get rid of that complexity.

As with most product changes at Netflix, we treated the playback UI modernization as an AB test. With our visual design and data analytics partners, we worked out an AB test design. Our control cell would be our current visual design and feature set using the custom framework to render the UI. Our experiment treatment would be a new visual design for all canvases of playback (Pre Play, Video Playback, and Post Play), with the UI components built using React.

We were excited to get started. There was a green field ahead of us! However, we soon realized that we were too excited to start building and designing, and didn’t spend enough time thinking about if we even should.

We worked for months creating new React components, porting over logic, and rewriting the CSS for the new visual design. By the Summer of 2016, we were ready to launch the test. Drumroll please… we failed.

Doing Too Much at Once

Members using the new player design built on top of React were streaming less content. We were mystified that the test wasn’t a win. We assumed users wouldn’t have issues with the new visual design since it was now aligned with the rest of the website, and other platform’s playback UIs. We had to dig into where we were harming the user experience — and we found a few places.

Isolating Changes

Our initial test design had a fatal flaw — we changed both the visual design and the underlying UI architecture. When we got back the negative test results, it was difficult to determine which change was causing impact. Was it the design, the UI architecture, or both? After a more thorough look at metrics, we found that both the new visual design and move to React were impacting members in different ways. This became a hard-earned lesson in ensuring that all your test variables are isolated.

Rendering Performance Gap

In moving to React, we fundamentally changed the UI architecture for each playback canvas. When looking at performance metrics of the AB test, we found that our specific approach to using React to build components was actually causing playback startup to take longer than our custom framework, as well as drop more frames of video.

*Histogram of playback load times using the custom framework (green), overlayed with React (red).*

We were surprised by this discovery, but after a deeper comparison between our custom framework implementation and our usage of React, we understood why there was a gap. Our custom framework was binding directly to video player events to get UI state. Each component class would create a DOM node, wait for an event to emitted from the video player, and update attributes on the DOM node based upon the event data. Meanwhile, in React, we utilized unidirectional data flow by having a root component receive all video player state, which would then pass that state down to all children components. Re-rendering for each video player state change from this root component contributed to the performance delta.

Try, Try Again

Armed with knowing what was adversely impacted in the initial test, we were tasked with fixing those issues, and cleaning up our AB test design.

Test Design

We decided that the next AB test needed to only focus on UI architecture changes. Moving to React didn’t mean that the visual design of the playback UI needed to change as well. The plan was to replicate the existing visual design on top of our React components.

Need for Speed

In order to fix the gap in startup speed, we first had to measure where time was being spent during UI rendering. Instrumentation was added to track the time it took to hit key milestones of playback. This instrumentation was in the form of using performance markers throughout our components:

// Inside Player.jsx...

componentDidMount() {
    performance.mark(
        `playbackMilestone:start:${this.state.playbackView}`
    );
}

componentDidUpdate(prevProps, prevState) {
    if (prevState.playbackView !== this.state.playbackView) { 
        performance.mark(
            `playbackMilestone:end:${this.state.playbackView}`
        );
        performance.mark(
            `playbackMilestone:start:${this.state.playbackView}`
        );
    }

    if (this.state.playbackView === PlaybackViews.PLAYBACK) {
        // serialize and log playback performance data.
    }
}

The milestone data showed that rendering the initial loading view in React was taking much longer compared to our control cell. It turned out that we were rendering the playback controls component in parallel with the loading component. By ensuring the controls were only rendered when video playback was ready, we improved our render times while the video player was loading.

The next step was to prevent dropping frames during video playback. The built-in React performance tools were used to profile component render timing. We took several steps to improve render times:

React Best Practices: We ensured that the UI components were implementing best practices when using React, i.e. using the shouldComponentUpdate lifecycle where necessary.
Less HOCs: Where possible, we migrated away from using higher order components, by transitioning to using utility functions, or moving logic into a parent component
No Prop Spreads, and Collapsing Props: Spreading props causes time to be spent iterating through objects. Collapsing multiple props into a single object where possible helps reduce comparison time in the shouldComponentUpdate lifecycle.
Observability: Taking a page out of our custom framework’s playbook, we introduced observability of video player state into components that need to be re-rendered most often. This helped reduce render cycles at our root component.

With the visual design and performance changes made, a new AB test was launched. After patiently waiting, the results were in, another drumroll please… members streamed the same amount with the React playback UI compared to the custom framework! In the Summer of 2017, we rolled out using React in playback for all members .

Under the Hood: Simplifying Playback Logic

In addition to using React to make the UI component layer more accessible and easier to develop across multiple teams, we wanted to do the same for the player-related business logic. We have multiple teams working on different kinds of playback logic at the same time, such as: interactive titles (where the user makes choices to participate in the story), movie and episode playback, video previews in the browse experience, and unique content during Post Play

We chose to use Redux in order to single-source and encapsulate the complex playback business logic. Redux is a well-known library/pattern in web UI engineering, and it facilitates separation of concerns in ways that met our goals. By combining Redux with data normalization, we enabled parallel development across teams in addition to providing standardized, predictable ways of expressing complex business logic.

Separating Video Lifecycle From UI Lifecycle

Allowing the UI component tree to control the logic concerning the lifecycle of the actual video can result in a slow user experience. UI component trees usually have their lifecycle represented in a standardized set of methods, such as React’s componentDidMount, componentDidUpdate, etc. When the logic for creating a new video playback is hidden in a UI lifecycle method that is deep inside of a component tree, the user must wait until that specific component is called before the playback can even be initiated. After being initiated, the user must wait until the playback is sufficiently loaded in order to begin viewing the video.

When the UI is rendered on the server, the initial DOM is shipped to the client. This DOM doesn’t include a loaded video or any buffered data needed to start playback. In the case of React, the client UI needs to rebuild itself on top of this initial DOM, and then go through a lifecycle sequence to begin loading the video.

However, if the logic for managing the video playback exists outside of the UI component tree, it can be executed from anywhere inside of the application, such as before the UI tree is rendered during the initial application loading sequence. By kicking off the creation of a video in parallel with rendering the UI, it gives the application more time to create, initialize, and buffer video playback so that when the UI has finished rendering, the user can start playing the video sooner.

Standardizing the Data Representation of a Video Playback

Since video playback is composed of a series of dynamic events, it can pose a problem when there are different parts of an application that care about the state of a video playback. For example, one part of an application may be responsible for creating a video, another part responsible for configuring it based on user preferences, and yet another responsible for managing the real time control of playing, pausing, and seeking.

In order to encapsulate knowledge of a video playback, we created a standardized data structure to represent it. We were then able to create a single, central location to store the data structure for each video playback so that both the business logic and the UI could access them. This enabled intelligent rules governing video playbacks, multiple UIs that operate on a single set of data, and easier testing.

The standardized playback data structure can be created from any source of video: a custom video library, or a standard HTML video element. Using the normalized data frees the UI from having to know about the specific video implementation.

Adding Support for Multiple Video Playbacks

When we have the playback data for every existing video single-sourced in the application independent of the UI, it allows the application to define business logic rules that coordinate single, or multiple video playbacks. If each video was hidden inside a particular instance of a UI component, and the components existed across completely different areas of the UI, it would be difficult to coordinate and would force the UI components to have knowledge of each other when they probably shouldn’t.

Some areas of logic that become easier with the UI-independent playback data and multiple players are:

Volume & mute control.
Play & pause control.
Playback precedence for autoplay.
Constraints on the number of players allowed to coexist.

An Implementation of Application State

In order to provide a well-structured location for the UI-independent state, we decided to leverage Redux again. However, we also knew that we would need to allow multiple teams to work in the codebase as they added and removed logic that would be independent and not required by all use cases. As a result, we created an extremely thin layer on top of core Redux that allowed us to package up files related to specific domains of logic, and then compose Redux applications out of those domains.

A domain in our system is simply a static object that contains the following things:

State data structure.
State reducer.
Actions.
Middleware.
Custom API to query state.

An application can choose to compose itself out of domains, or not use them at all. When a domain is used to create an application, the individual parts of the domain are automatically bound to its own domain state; it won’t have access to any other part of the application state outside of what it defined. The good thing is that the final external API of the application is the same whether it uses domains or not, thanks to the power of composition.

We empower two use cases: a single-level standard Redux application where each part knows about the entire state, or an application where each domain is enforced to only manage its own piece of the application’s substate. The benefit of identifying areas of logic that can be encapsulated into a logical domain is that the domain can easily be added, removed, or ported to any other application without breaking anything else.

Enabling Plug & Play Logic

By leveraging our concept of domains, we were able to continue working on core playback UI features while other teams implemented custom logical domains to plug into our player application, such as logic for interactive titles. Interactive titles have custom playback logic, as well as custom UIs to enable the user to make story choices during playback. Now that we had both well-encapsulated UI (via React) and state with associated logic (via Redux and our domains), we had a system to manage complexity on multiple fronts. Since we continuously AB test a lot of features, the consistent encapsulation of logic makes it much easier to add and remove logic based on AB test data or feature flags. Having an enforced and consistent structure by thinking in terms of logic domains also helped us identify and formalize areas of our application that were previously inconsistent. By adding structure and predictability and giving up the absolute freedom to do anything in any way, it actually freed us and other teams to add more features, perform more testing, and create higher-quality code.

A New Coat of Paint with A Better Engine

With new and improved state management and development patterns for fellow engineers to use, our final step in the modernization journey was to update the visual design of the UI.

From our previous learning about change isolation, the plan was to create an AB test that only focused on updating the UI in the video playback experience, and not modifying any other canvases of playback, or the architecture.

By utilizing our implementation of Redux and extending existing React components, it was easier than ever to turn our design prototypes into production code for the AB test. We rolled out the test in the Summer of 2018. We got back the results in the Fall, and found that members preferred the modern visual design along with new video player controls that allowed them to seek back and forth, or pause the video by simply clicking on the screen.

This final AB test in the modernization journey was easy to implement and analyze. By making mistakes and learning from them, we built intuition and best practices around ensuring you are not trying to do too many things at once.

Modernizing the Web Playback UI was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Performance comparison of video coding standards: an adaptive streaming perspective

December 13, 2018, 12:01 am

≫ Next: Implementing the Netflix Media Database

≪ Previous: Modernizing the Web Playback UI

by Joel Sole, Liwei Guo, Andrey Norkin, Mariana Afonso, Kyle Swanson, Anne Aaron

“This is my advice to people: Learn how to cook, try new recipes, learn from your mistakes, be fearless, and above all have fun” — Julia Child (American chef, author, and television personality)

At Netflix, we are continually refining the recipes we use to serve your favorite shows and movies at the best possible quality. An essential element in this dish is the video encoding technology we use to transform our video content into compressed bitstreams (suitable for whatever bandwidth you happen to be enjoying Netflix at). A fantastic amount of work has been done by the video coding community to develop video coding standards (codecs) with the goal of achieving always better compression ratios. Therefore, an essential task is the assessment of the quality of the ingredients we use and, in the Netflix encoding kitchen, we do this by regularly evaluating the performance of existing and upcoming video codecs and encoders. We select the freshest and best encoding technologies so that you can savor our content, from the satiating cinematography of Salt Fat Acid Heat to the gorgeous food shots of Chef’s Table.

Factors in codec comparisons

Many articles have been published comparing the performance of video codecs. The reader of these articles might often be confused by their seemingly contradicting conclusions. One article might claim that codec A is 15% better than codec B, while the next one might assert that codec B is 10% better than codec A.

A deeper dive into the topic reveals that these apparent contradictions should be expected. Why? Because the testing methodology and content play a crucial role in the evaluation of video codecs. A different selection of the testing conditions can lead to disparate results. We discuss below several factors that impact the assessment of video codecs:

Encoder implementation
Encoder settings
Methodology
Content
Metrics

Where applicable, we make the distinction between the traditional comparison approach and our approach for adaptive streaming.

Encoder implementation

Video coding standards are instantiated in software or hardware with goals as varied as research, broadcasting, or streaming. A ‘reference encoder’ is a software implementation used during the video standardization process and for research, and as a reference by implementers. Generally, this is the first implementation of a standard and not used for production. Afterward, production encoders developed by the open-source community or commercial entities come along. These are practical implementations deployed by most companies for their encoding needs and are subject to stricter speed and resource constraints. Therefore, the performance of reference and production encoders might be substantially different. Besides, the standard profile and specific version influence the observed performance, especially for a new standard with still immature implementations. Netflix deploys production encoders tuned to achieve the highest subjective quality for streaming.

Encoding settings

Encoding parameters such as the number of coding passes, parallelization tools, rate-control, visual tuning, and others introduce a high degree of variability in the results. The selection of these encoding settings is mainly application-driven.

Standardization bodies tend to use test conditions that let them compare one tool to another, often maximizing a particular objective metric and reducing variability over different experiments. For example, rate-control and visual tunings are generally disabled, to focus on the effectiveness of core coding tools.

Netflix encoding recipes focus on achieving the best quality, enabling the available encoder tools that boost visual appearance, and thus, giving less weight to indicators like speed or encoder footprint that are crucial in other applications.

Methodology

Testing methodology in codec standardization establishes well-defined “common test conditions” to assess new coding tools and to allow for reproducibility of the experiments. Common test conditions consist of a relatively small set of test sequences (single shots of 1 to 10 seconds) that are encoded only at the input resolution with a fixed set of quality parameters. Quality (PSNR) and bitrate are gathered for each of these quality points and used to compute the average bitrate savings, the so-called BD-rate, as illustrated below.

While the methodology in standards has been suitable for its intended purpose, other considerations come into play in the adaptive streaming world. Notably, the option to offer multiple representations of the same video at different bitrates and resolutions to match network bandwidth and client processing and display capabilities. The content, encoding, and display resolutions are not necessarily tied together. Removing this constraint implies that quality can be optimized by encoding at different resolutions.

Per-resolution rate-quality curves cross, so there is a range of rates for which each encoding resolution gives the best quality. The ‘convex hull’ is derived by selecting the optimal curve at each rate for the entire range. Then, the BD-rate difference is computed on the convex hulls, instead of using the single resolution curves.

The flexibility in delivering bitstreams on the convex hull as opposed to just the single resolution ones leads to remarkable quality improvements. The dynamic optimizer (DO) methodology generalizes this concept to sequences with multiple shots. DO operates on the convex hull of all the shots in a video to jointly optimize the overall rate-distortion by finding the optimal compression path for an encode across qualities, resolutions, and shots.

DO was introduced in this tech blog. It showed a 25% in BD-rate savings for multiple-shot videos. Three characteristics that make DO particularly well-suited for adaptive streaming and codec comparisons are:

It is codec agnostic since it can be applied in the same way to any encoder.
It can use any metric to guide its optimization process.
It eliminates the need for high-level rate control across shots in the encoder. Lower-level rate control, like adaptive quantization within a frame, is still useful, because DO does not operate below the shot level.

DO, as originally presented, is a non-real-time, computationally expensive algorithm. However, because of the exhaustive search, DO could be seen as the upper-bound performance for high-level rate control algorithms.

Content

For a fair comparison, the testing content should be balanced, covering a variety of distinct types of video (natural vs. animation, slow vs. high motion, etc.) or reflect the kind of content for the application at hand.

The testing content should not have been used during the development of the codec. Netflix has produced and made public long video sequences with multiple shots, such as ‘El Fuente’ or ‘Chimera’, to extend the available videos for R&D and mitigate the problem of conflating training and test content. Internally, we extensively evaluate algorithms using full titles from our catalog.

Metrics

Traditionally, PSNR has been the metric of choice given its simplicity, and it reasonably matches subjective opinion scores. Other metrics, such as VIF or SSIM, better correlate with the subjective scores. Metrics have commonly been computed at the encoding resolution.

Netflix highly relies on VMAF throughout its video pipeline. VMAF is a perceptual video quality metric that models the human visual system. It correlates better than PSNR with subjective opinion over a wide quality range and content. VMAF enables reliable codec comparisons across the broad range of bitrates and resolutions occurring in adaptive streaming. This tech blog is useful to learn more about VMAF and its current deployment status.

*Approximate correspondence between VMAF values and subjective opinion*

Two relevant aspects when employing metrics are the resolution at which they are computed and the temporal averaging:

Scaled metric: VMAF is not computed at the encoding resolution, but at the display resolution, which better emulates our members viewing experience. This is not unique to VMAF, as PSNR and other metrics can be applied at any desired resolution by appropriately scaling the video.
Temporal average: Metrics are calculated on a per-frame basis. Normally, the arithmetic mean has been the method of choice to obtain the temporal average across the entire sequence. We employ the harmonic mean, which gives more weight to outliers than the arithmetic mean. The rationale for using the harmonic mean is that if there are few frames that look really bad in a shot of a show you are watching, then your experience is not that great, no matter how good the quality of the rest of the shot is. The acronym for the harmonic VMAF is HVMAF.

Codec comparison results

Putting in practice the factors mentioned above, we show the results drawn from the two distinct codec comparison approaches, the traditional and the adaptive streaming one.

Three commonly used video coding standards are tested: H.264/AVC and H.265/HEVC by ITU.T and ISO/MPEG and VP9 by Google. For each standard, we use the reference encoder and a production encoder.

Results with the traditional approach

The traditional approach uses fixed QP (quality) encoding for a set of short sequences.

Encoder settings are listed in the table below.

Methodology: Five fixed quality encodings per sequence at the content resolution are generated.
Content: 14 standard sequences from the MPEG Common Test Conditions set (mostly from JVET) and 14 from the Alliance for Open Media (AOM) set. All sequences are 1080p. These are short clips: about 10 seconds for the MPEG set and 1 second for the AOM set. Mostly, they are single shot sequences.
Metrics: BD-rate savings are computed using the Classic PSNR for the luma component.

Results are summarized in the table below. BD-rates are given in percentage with respect to x264. Positive numbers indicate an average increase of bitrate, while negative numbers indicate bitrate reduction.

*BD-rate (in %) using PSNR of the 6 PSNR-tuned video encoders*

Interestingly, using the MPEG set doesn’t seem to benefit HEVC encoders, or using the AOM set help VP9 encoders.

Results from an adaptive streaming perspective

This section describes a more comprehensive experiment. It builds on top of the traditional approach, modifying some aspects in each of the factors:

Encoder settings: Settings are changed to incorporate perceptual tunings. The rest of the settings remain as previously defined.

Methodology: Encoding is done at 10 different resolutions for each shot, from 1920x1080 down to 256x144. DO performs the overall encoding optimization using HVMAF.
Content: A set of 8 Netflix full titles (like Orange is the New Black, House of Cards or Bojack) is added to the other two test sets. Netflix titles are 1080p, 30fps, 8 bits/component. They contain a wide variety of content in about 8 hours of video material.
Metrics: HVMAF is employed to assess these perceptually-tuned encodings. The metric is computed over the relevant quality range of the convex hull. HVMAF is computed after scaling the encodes to the display resolution (assumed to be 1080p), which also matches the resolution of the source content.

Additionally, we split results into two ranges to visualize performance at different qualities. The low range refers to HVMAF between 30 and 63, while the high range refers to HVMAF between 63 and 96, which correlates with high subjective quality.

The highlighted rows in the HVMAF BD-rates table are the most relevant operation points for Netflix.

*BD-rate (in %) using HVMAF of 6 video encoders tuned for perceptual quality. BD-rates percentages use x264 as the reference.*

Takeaways

Encoder, encoding settings, methodology, testing content, and metrics should be thoroughly described in any codec comparison since they greatly influence results. As illustrated above, a different selection of the testing conditions leads to different conclusions on the relative performance of the encoders. For example, EVE-VP9 is about 1% worse than x265 in terms of PSNR for the traditional approach, but about 12% better for the HVMAF high range case.

Given the vast amount of video compressed and delivered by services like Netflix and that traditional and adaptive streaming approaches do not necessarily converge to the same outcome, it would be beneficial if the video coding community considered the adaptive streaming perspective in the comparisons. For example, it is relatively easy to compute metrics on the convex hull or to add the HVMAF numbers to the reported metrics.

Like great recipes, video encoding also has essential elements; VMAF, dynamic optimization, and great codecs. With these ingredients and continuous innovation, we are striving to perfect our recipe — high-quality video encodes at the lowest possible bitrates. If these problems excite you and you would like to contribute to building our video encoding pipeline, check out the Video Algorithms job posts (here and here).

Acknowledgments

For more detailed technical information and results, you can check out the paper ‘Video codec comparison using the dynamic optimizer framework’ by Ioannis Katsavounidis and Liwei Guo.

We would like to thank Ioannis Katsavounidis for all the technical work that lead to this blog, and Jan De Cock, Chen Chao, Aditya Mavlankar, Zhi Li, and David Ronca for their contributions.

Our experimentations were run on the Archer platform built by the Netflix media infrastructure team. We always appreciate their continued efforts to make media innovation at Netflix a pleasant experience.

Performance comparison of video coding standards: an adaptive streaming perspective was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Implementing the Netflix Media Database

December 14, 2018, 8:21 am

≫ Next: Netflix OSS and Spring Boot — Coming Full Circle

≪ Previous: Performance comparison of video coding standards: an adaptive streaming perspective

In the previous blog posts in this series, we introduced the Netflix Media DataBase (NMDB) and its salient “Media Document” data model. In this post we will provide details of the NMDB system architecture beginning with the system requirements — these will serve as the necessary motivation for the architectural choices we made. A fundamental requirement for any lasting data system is that it should scale along with the growth of the business applications it wishes to serve. NMDB is built to be a highly scalable, multi-tenant, media metadata system that can serve a high volume of write/read throughput as well as support near real-time queries. At any given time there could be several applications that are trying to persist data about a media asset (e.g., image, video, audio, subtitles) and/or trying to harness that data to solve a business problem.

Some of the essential elements of such a data system are (a) reliability and availability — under varying load conditions as well as a wide variety of access patterns; (b) scalability — persisting and serving large volumes of media metadata and scaling in the face of bursty requests to serve critical backend systems like media encoding, (c) extensibility — supporting a demanding list of features with a growing list of Netflix business use cases, and (d) consistency — data access semantics that guarantee repeatable data read behavior for client applications. The following section enumerates the key traits of NMDB and how the design aims to address them.

System Requirements

Support for Structured Data

The growth of NoSQL databases has broadly been accompanied with the trend of data “schemalessness” (e.g., key value stores generally allow storing any data under a key). A schemaless system appears less imposing for application developers that are producing the data, as it (a) spares them from the burden of planning and future-proofing the structure of their data and, (b) enables them to evolve data formats with ease and to their liking. However, schemas are implicit in a schemaless system as the code that reads the data needs to account for the structure and the variations in the data (“schema-on-read”). This places a burden on applications that wish to consume that supposed treasure trove of data and can lead to strong coupling between the system that writes the data and the applications that consume it. For this reason, we have implemented NMDB as a “schema-on-write” system — data is validated against schema at the time of writing to NMDB. This provides several benefits including (a) schema is akin to an API contract, and multiple applications can benefit from a well defined contract, (b) data has a uniform appearance and is amenable to defining queries, as well as Extract, Transform and Load (ETL) jobs, (c) facilitates better data interoperability across myriad applications and, (d) optimizes storage, indexing and query performance thereby improving Quality of Service (QoS). Furthermore, this facilitates high data read throughputs as we do away with complex application logic at the time of reading data.

A critical component of a “schema-on-write” system is the module that ensures sanctity of the input data. Within the NMDB system, Media Data Validation Service (MDVS), is the component that makes sure the data being written to NMDB is in compliance with an aforementioned schema. MDVS also serves as the storehouse and the manager for the data schema itself. As was noted in the previous post, data schema could itself evolve over time, but all the data, ingested hitherto, has to remain compliant with the latest schema. MDVS ensures this by applying meticulous treatment to schema modification ensuring that any schema updates are fully compatible with the data already in the system.

Multi-tenancy and Access Control

We envision NMDB as a system that helps foster innovation in different areas of Netflix business. Media data analyses created by an application developed by one team could be used by another application developed by another team without friction. This makes multi-tenancy as well as access control of data important problems to solve. All NMDB APIs are authenticated (AuthN) so that the identity of an accessing application is known up front. Furthermore, NMDB applies authorization (AuthZ) filters that whitelists applications or users for certain actions, e.g., a user or application could be whitelisted for read/write/query or a more restrictive read-only access to a certain media metadata.

In NMDB we think of the media metadata universe in units of “DataStores”. A specific media analysis that has been performed on various media assets (e.g., loudness analysis for all audio files) would be typically stored within the same DataStore (DS). while different types of media analyses (e.g., video shot boundary and video face detection) for the same media asset typically would be persisted in different DataStores. A DS helps us achieve two very important purposes (a) serves as a logical namespace for the same media analysis for various media assets in the Netflix catalog, and (b) serves as a unit of access control — an application (or equivalently a team) that defines a DataStore also configures access permissions to the data. Additionally, as was described in the previous blog article, every DS is associated with a schema for the data it stores. As such, a DS is characterized by the three-tuple (1) a namespace, (2) a media analysis type (e.g., video shot boundary data), and (3) a version of the media analysis type (different versions of a media analysis correspond to different data schemas). This is depicted in Figure 1.

We have chosen the namespace portion of a DS definition to correspond to an LDAP group name. NMDB uses this to bootstrap the self-servicing process, wherein members of the LDAP group are granted “admin” privileges and may perform various operations (like creating a DS, deleting a DS) and managing access control policies (like adding/removing “writers” and “readers”). This allows for a seamless self-service process for creating and managing a DS. The notion of a DS is thus key to the ways we support multi-tenancy and fine grained access control.

Integration with other Netflix Systems

In the Netflix microservices environment, different business applications serve as the system of record for different media assets. For example, while playable media assets such as video, audio and subtitles for a title could be managed by a “playback service”, promotional assets such as images or video trailers could be managed by a “promotions service”. NMDB introduces the concept of a “MediaID” (MID) to facilitate integration with these disparate asset management systems. We think of MID as a foreign key that points to a Media Document instance in NMDB. Multiple applications can bring their domain specific identifiers/keys to address a Media Document instance in NMDB. We implement MID as a map from strings to strings. Just like the media data schema, an NMDB DS is also associated with a single MID schema. However unlike the media data schema, MID schema is immutable. At the time of the DS definition, a client application could define a set of (name, value) pairs against which all of the Media Document instances would be stored in that DS. A MID handle could be used to fetch documents within a DS in NMDB, offering convenient access to the most recent or all documents for a particular media asset.

SLA Guarantees

NMDB serves different logically tiered business applications some of which are deemed to be more business critical than others. The Netflix media transcoding sub-system is an example of a business critical application. Applications within this sub-system have stringent consistency, durability and availability needs as a large swarm of microservices are at work generating content for our customers. A failure to serve data with low latency would stall multiple pipelines potentially manifesting as a knock-on impact on secondary backend services. These business requirements motivated us to incorporate immutability and read-after-write consistency as fundamental precepts while persisting data in NMDB.

We have chosen the high data capacity and high performance Cassandra (C*) database as the backend implementation that serves as the source of truth for all our data. A front-end service, known as Media Data Persistence Service (MDPS), manages the C* backend and serves data at blazing speeds (latency in the order of a few tens of milliseconds) to power these business critical applications. MDPS uses local quorum for reads and writes to guarantee read-after-write consistency. Data immutability helps us sidestep any conflict issues that might arise from concurrent updates to C* while allowing us to perform IO operations at a very fast clip. We use a UUID as the primary key for C*, thus giving every write operation (a MID + a Media Document instance) a unique key and thereby avoiding write conflicts when multiple documents are persisted against the same MID. This UUID (also called as DocumentID) also serves as the primary key for the Media Document instance in the context of the overall NMDB system. We will touch upon immutability again in later sections to show how we also benefited from it in some other design aspects of NMDB.

Flexibility of Queries

The pivotal benefit of data modeling and a “schema-on-write” system is query-ability. Technical metadata residing in NMDB is invaluable to develop new business insights in the areas of content recommendations, title promotion, machine assisted content quality control (QC), as well as user experience innovations. One of the primary purposes of NMDB is that it can serve as a data warehouse. This brings the need for indexing the data and making it available for queries, without a priori knowledge of all possible query patterns.

In principle, a graph database can answer arbitrary queries and promises optimal query performance for joins. For that reason, we explored a graph-like data-model so as to address our query use cases. However, we quickly learnt that our primary use case, which is spatio-temporal queries on the media timeline, made limited use of database joins. And in those queries, where joins were used, the degree of connectedness was small. In other words the power of graph-like model was underutilized. We concluded that for the limited join query use-cases, application side joins might provide satisfactory performance and could be handled by an application we called Media Data Query Service (MDQS). Further, another pattern of queries emerged — searching unstructured textual data e.g., mining movie scripts data and subtitle search. It became clear to us that a document database with search capabilities would address most of our requirements such as allowing a plurality of metadata, fast paced algorithm development, serving unstructured queries and also structured queries even when the query patterns are not known a priori.

Elasticsearch (ES), a highly performant scalable document database implementation fitted our needs really well. ES supports a wide range of possibilities for queries and in particular shines at unstructured textual search e.g., searching for a culturally sensitive word in a subtitle asset that needs searching based on a stem of the word. At its core ES uses Lucene — a powerful and feature rich indexing and searching engine. A front-end service, known as Media Data Analysis Service (MDAS), manages the NMDB ES backend for write and query operations. MDAS implements several optimizations for answering queries and indexing data to meet the demands of storing documents that have varying characteristics and sizes. This is described more in-depth later in this article.

A Data System from Databases

As indicated above, business requirements mandated that NMDB be implemented as a system with multiple microservices that manage a polyglot of DataBases (DBs). The different constituent DBs serve complementary purposes. We are however presented with the challenge of keeping the data consistent across them in the face of the classic distributed systems shortcomings — sometimes the dependency services can fail, sometimes service nodes can go down or even more nodes added to meet a bursty demand. This motivates the need for a robust orchestration service that can (a) maintain and execute a state machine, (b) retry operations in the event of transient failures, and (c) support asynchronous (possibly long running) operations such as queries. We use the Conductor orchestration framework to coordinate and execute workflows related to the NMDB Create, Read, Update, Delete (CRUD) operations and for other asynchronous operations such as querying. Conductor helps us achieve a high degree of service availability and data consistency across different storage backends. However, given the collection of systems and services that work in unison it is not possible to provide strong guarantees on data consistency and yet remain highly available for certain use cases, implying data read skews are not entirely avoidable. This is true in particular for query APIs — these rely on successful indexing of Media Document instances which is done as an asynchronous, background operation in ES. Hence queries on NMDB are expected to be eventually consistent.

Figure 2 shows the NMDB system block diagram. A front end service that shares its name with the NMDB system serves as the gateway to all CRUD and query operations. Read APIs are performed synchronously while write and long running query APIs are managed asynchronously through Conductor workflows. Circling back to the point of data immutability that was discussed previously — another one of its benefits is that it preserves all writes that could occur e.g., when a client or the Conductor framework retries a write perhaps because of transient connection issues. While this does add to data footprint but the benefits such as (a) allowing for lockless retries, (b) eliminating the need for resolving write conflicts and © mitigating data loss, far outweigh the storage costs.

Included in Figure 2 is a component named Object Store that is a part of the NMDB data infrastructure. Object Store is a highly available, web-scale, secure storage service such as Amazon’s Simple Storage Service (S3). This component ensures that all data being persisted is chunked and encrypted for optimal performance. It is used in both write and read paths. This component serves as the primary means for exchanging Media Document instances between the various components of NMDB. Media Document instances can be large in size (several hundreds of MBs — perhaps because a media analysis could model metadata e.g., about every frame in a video file. Further, the per frame data could explode in size due to some modeling of spatial attributes such as bounding boxes). Such a mechanism optimizes bandwidth and latency performance by ensuring that Media Document instances do not have to travel over the wire between the different microservices involved in the read or the write path and can be downloaded only where necessary.

NMDB in Action

While the previous sections discussed the key architectural traits, in this section we dive deeper into the NMDB implementation.

Writing data into NMDB

Figure 3: Writing a Media Document Instance to NMDB

The animation shown in Figure 3 details the machinery that is set in action when we write into NMDB. The write process begins with a client application that communicates its intent to write a Media Document instance. NMDB accepts the write request by submitting the job to the orchestration framework (Conductor) and returns a unique handle to identify the request. This could be used by the client to query on the status of the request. Following this, the schema validation, document persistence and document indexing steps are performed in that order. Once the document is persisted in C* it becomes available for read with strong consistency guarantees and is ready to be used by read-only applications. Indexing a document into ES can be a high latency operation since it is a relatively more intensive procedure that requires multiple processes coordinating to analyze the document contents, and update several data structures that enable efficient search and queries.

Also, noteworthy is the use of an Object store to optimize IO across service components (as was discussed earlier). NMDB leverages a cloud storage service (e.g., AWS S3 service) to which a client first uploads the Media Document instance data. For each write request to NMDB, NMDB generates a Type-IV UUID that is used to compose a key. The key in turn is used to compose a unique URL to which the client uploads the data it wishes to write into NMDB. This URL is then passed around as a reference for the Media Document instance data.

Scaling Strategies

From the perspective of writing to NMDB, some of the NMDB components are compute heavy while some others are IO heavy. For example, the bottle neck for MDVS is CPU as well as memory (as it needs to work with large documents for validation). On the other hand MDAS is bound by network IO as well (Media Document instances need to be downloaded from NMDB Object Store to MDAS so that they can be indexed). Different metrics can be used to configure a continuous deployment platform, such as Spinnaker for load balancing and auto-scaling for NMDB. For example, “requests-per-second” (RPS) is commonly used to auto-scale micro services to serve increased reads or queries. While RPS or CPU usage could be useful metrics for scaling synchronous services, asynchronous APIs (like storing a document in NMDB) bring in the requirement of monitoring queue depth to anticipate work build up and scale accordingly.

Figure 4: Scaling the NMDB service plane

The strategy discussed above gives us a good way to auto-scale the NMDB micro services layer (identified as “Service Plane” in Figure 4) quasi-linearly. However as seen in Figure 4, the steady state RPS that the system can support eventually plateaus at which point scaling the Service Plane does not help improve SLA. At this point it should be amply clear that the data nodes (identified as “Data Backend”) have reached their peak performance limits and need to be scaled. However, distributed DBs do not scale as quickly as services and horizontal or vertical scaling may take a few hours to days, depending on data footprint size. Moreover, while scaling the Service Plane can be an automated process, adding more data nodes (C* or ES) to scale the Data Backend is typically done manually. However, note that once the Data Backend is scaled up (horizontal and/or vertically), the effects of scaling the Service Plane manifests as an increased steady state RPS as seen in Figure 4.

An important point related to scaling data nodes, which is worth mentioning is the key hashing strategy that each DB implements. C* employs consistent key hashing and hence adding a node distributes the data uniformly across nodes. However, ES deploys a modulus based distributed hashing. Here adding a data node improves distribution of shards across the available nodes, which does help alleviate query/write bottlenecks to an extent. However, as the size of shards grow over time, horizontal scaling might not help improve query/write performance as shown in Figure 5.

ES mandates choosing the number of shards for every index at the time of creating an index, which cannot be modified without going through a reindexing step which is expensive and time consuming for large amounts of data. A fixed pre-configured shard size strategy could be used for timed data such as logs, where new shards could be created while older shards are discarded. However, this strategy cannot be employed by NMDB since multiple business critical applications could be using the data, in other words data in NMDB needs to be durable and may not ever be discarded. However, as discussed above large shard sizes affect query performance adversely. This calls for some application level management for relocating shards into multiple indices as shown in Figure 6.

Figure 6: Creating new ES indices over time

Accordingly, once an index grows beyond a threshold, MDAS creates a different index for the same NMDB DS, thereby allowing indices to grow over time and yet keeping the shard size within a bound for optimal write/query performance. ES has a feature called index aliasing that is particularly helpful for alleviating performance degradation that is caused due to large shard sizes which is suitable for the scenario we explained. An index alias could point to multiple indices and serve queries by aggregating search results across all the indices within the alias.

Indexing Data in NMDB at Scale

A single Media Document instance could be large ranging from hundreds of MBs to several GBs. Many document databases (including ES) have a limit on the size of a document after which DB performance degrades significantly. Indexing large documents can present other challenges on a data system such as requiring high network I/O connections, increased computation and memory costs, high indexing latencies as well as other adverse effects.

In principle, we could apply the ES parent-child relationship at the various levels of the Media Document hierarchy and split up a Media Document instance into several smaller ES documents. However, the ES parent-child relationship is a two-level relationship and query performance suffers when multiple such relationships are chained together to represent a deeply nested model (the NMDB Media Document model exhibits upto five levels of nesting). Alternately, we could consider modeling it as a two-level relationship with the high cardinality entities (“Event” and “Region”) on the “child” side of the relationship. However, Media Document could contain a huge number of “Event” and “Region” entities (hundreds of thousands of Events and tens of Regions per Event are typical for an hour of content) which would result in a very large number of child documents. This could also adversely impact query performance.

To address these opposing limitations, we came up with the idea of using “data denormalization”. Adopting this needs more thought since data denormalization can potentially lead to data explosion. Through a process referred to as “chunking”, we split up large document payloads into multiple smaller documents prior to indexing them in ES. The smaller chunked documents could be indexed by using multiple threads of computation (on a single service node) or multiple service nodes — this results in better workload distribution, efficient memory usage, avoids hot spots and improves indexing latencies (because we are processing smaller chunks of data concurrently). We utilized this approach simultaneously with some careful decisions around what data we denormalize in order to provide optimal indexing and querying performance. More details of our implementation are presented as follows.

Chunking Media Document Instances

The hierarchical nature of the Media Document model (as explained in the previous blog post) requires careful consideration while chunking as it contains relationships between its entities. Figure 7 depicts the pre-processing we perform on a Media Document instance prior to indexing it in ES.

Figure 7: An efficient strategy for indexing Media Document Instances in ES

Each Media Document instance is evenly split into multiple chunks with smaller size (of the order of a few MBs).
Asset, Track and Component level information is denormalized across all the chunks and a parent document per chunk with this information is indexed in ES. This denormalization of parent document across different chunks also helps us to overcome a major limitation with ES parent-child relationship, that is the parent document and all the children documents must belong to same shard.
At the level of an event, data is denormalized across all the regions and a child document per region is indexed in ES.

This architecture allows distribution of Media Document instances across multiple nodes and speeds up indexing as well as query performance. At query time, MDAS uses a combination of different strategies depending on the query patterns for serving queries efficiently

ES parent-child join queries are used to speed up query performance where needed.
In another query pattern, the parent documents are queried followed by children documents and application side joins are performed in MDAS to create search results.

Serving Queries & Analytics

As noted earlier, NMDB has a treasure trove of indexed media metadata and lots of interesting insight could be developed by analyzing it. The MDAS backend with ES forms the backbone of analytical capabilities of NMDB. In a typical analytics usage, NMDB users are interested in two types of queries:

A DS level query to retrieve all documents that match the specified query. This is similar to filtering of records using SQL ‘WHERE’ clause. Filtering can be done on any of the entities in a Media Document instance using various condition operators ‘=’ , ‘>’ or ‘<’ etc. Conditions can also be grouped using logic operators like OR, AND or NOT etc.
A more targeted query on a Media Document instance using a Document ID handle to retrieve specific portions of the document. In this query type, users can apply conditional filtering on each of the entities of a Media Document instance and retrieve matching entities.

The two query types target different use cases. Queries of the first type span an entire NMDB DS and can provide insights into which documents in a DS match the specified query. Considering the huge payload of data corresponding to Media Document instances that match a query of the first type, NMDB only returns the coordinates (DocumentID and MID) of the matching documents. The second query type can be used to target a specific Media Document instance using DocumentID and retrieve portions of the document with conditional filtering applied. For example, only a set of events that satisfy a specified query could be retrieved, along with Track and Component level metadata. While it is typical to use the two types of queries in succession, in the event where a document handle is already known one could glean more insights into the data by directly executing the second query type on a specific Media Document instance.

As explained earlier, chunking Media Document instances at the time of indexing comes very handy in optimizing queries. Since relationships between the different entities of a Media Document instance are preserved, cross-entity queries can be handled at the ES layer. For example, a Track can be filtered out based on the number of Events it contains or if it contains Events matching the specified query. The indexing strategy as explained earlier can be contrasted with the nested document approach of ES. Indexing Event and Region level information as children documents helps us output the search results more efficiently.

What’s next

As explained in the previous blog post, the Media Document model has a hierarchical structure and offers a logical way of modeling media timeline data. However, such a hierarchical structure is not optimal for parallel processing. In particular validation (MDVS) and indexing (MDAS) services could benefit immensely by processing a large Media Document instance in parallel thereby reducing write latencies. A compositional structure for Media Document instances would be more amenable to parallel processing and therefore go a long way in alleviating the challenges posed by large Media Document instances. Briefly, such a structure implies a single media timeline is composed of multiple “smaller” media timelines, where each media timeline is represented by a corresponding “smaller” Media Document instance. Such a model would also enable targeted reads that do not require reading the entire Media Document instance.

On the query side, we anticipate a growing need for performing joins across different NMDB DataStores — this could be computationally intensive in some scenarios. This along with the high storage costs associated with ES is motivating us to look for other “big-data” storage solutions. As NMDB continues to be the media metadata platform of choice for applications across Netflix, we will continue to carefully consider new use cases that might need to be supported and evaluate technologies that we will need to onboard to address them. Some interesting areas of future work could involve exploring Map-Reduce frameworks such as Apache Hadoop, for distributed compute, query processing, relational databases for their transactional support, and other Big Data technologies. Opportunities abound in the area of media-oriented data systems at Netflix especially with the anticipated growth in business applications and associated data.

— by Shinjan Tiwary, Sreeram Chakrovorthy, Subbu Venkatrav, Arsen Kostenko, Yi Guo and Rohit Puri

Implementing the Netflix Media Database was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Netflix OSS and Spring Boot — Coming Full Circle

December 18, 2018, 10:10 am

≫ Next: Detecting Performance Anomalies in External Firmware Deployments

≪ Previous: Implementing the Netflix Media Database

Netflix OSS and Spring Boot — Coming Full Circle

Taylor Wicksell, Tom Cellucci, Howard Yuan, Asi Bross, Noel Yap, and David Liu

In 2007, Netflix started on a long road towards fully operating in the cloud. Much of Netflix’s backend and mid-tier applications are built using Java, and as part of this effort Netflix engineering built several cloud infrastructure libraries and systems — Ribbon for load balancing, Eureka for service discovery, and Hystrix for fault tolerance. To stitch all of these components together, additional libraries were created — Governator for dependency injection with lifecycle management and Archaius for configuration. All of these Netflix libraries and systems were open-sourced around 2012 and are still used by the community to this day.

In 2015, Spring Cloud Netflix reached 1.0. This was a community effort to stitch together the Netflix OSS components using Spring Boot instead of Netflix internal solutions. Over time this has become the preferred way for the community to adopt Netflix’s Open Source software. We are happy to announce that starting in 2018, Netflix is also making the transition to Spring Boot as our core Java framework, leveraging the community’s contributions via Spring Cloud Netflix.

Why is Netflix adopting Spring Boot after having invested so much in internal solutions? In the early 2010s, key requirements for Netflix cloud infrastructure were reliability, scalability, efficiency, and security. Lacking suitable alternatives, we created solutions in-house. Fast forward to 2018, the Spring product has evolved and expanded to meet all of these requirements, some through the usage and adaptation of Netflix’s very own software! In addition, community solutions have evolved beyond Netflix’s original needs. Spring provides great experiences for data access (spring-data), complex security management (spring-security), integration with cloud providers (spring-cloud-aws), and many many more.

The evolution of Spring and its features align very well with where we want to go as a company. Spring has shown that they are able to provide well-thought-out, documented, and long lasting abstractions and APIs. Together with the community they have also provided quality implementations from these abstractions and APIs. This abstract-and-implement methodology match well with a core Netflix principle of being “highly aligned, loosely coupled”. Leveraging Spring Boot will allow us to build for the enterprise while remaining agile for our business.

The Spring Boot transition is not something we are undertaking alone. We have been in collaboration with Pivotal throughout the process. Whether it be Github issues and feature requests, in-person conversations at conferences or real-time chats over Gitter/Slack, the responses from Pivotal have been excellent. This level of communication and support gives us great confidence in Pivotal’s ability to maintain and evolve the Spring ecosystem.

Moving forward, we plan to leverage the strong abstractions within Spring to further modularize and evolve the Netflix infrastructure. Where there is existing strong community direction — such as the upcoming Spring Cloud Load Balancer — we intend to leverage these to replace aging Netflix software. Where there is new innovation to bring — such as the new Netflix Adaptive Concurrency Limiters — we want to help contribute these back to the community.

The union of Netflix OSS and Spring Boot began outside of Netflix. We now embrace it inside of Netflix. This is just the beginning of a long journey, and we will be sure to provide updates and interesting findings as we go. We want to give special thanks to the many people who have contributed to this effort over the years, and hope to be part of future innovations to come.

Netflix OSS and Spring Boot — Coming Full Circle was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Detecting Performance Anomalies in External Firmware Deployments

January 31, 2019, 9:21 am

≫ Next: Improving Experimentation Efficiency at Netflix with Meta Analysis and Optimal Stopping

≪ Previous: Netflix OSS and Spring Boot — Coming Full Circle

by Richard Cool

Netflix has over 139M members streaming on more than half a billion devices spanning over 1,700 different types of devices from hundreds of brands. This diverse device ecosystem results in a high dimensionality feature space, often with sparse data, and can make identifying device performance issues challenging. Identifying ways to scale solutions in this space is vital as the ecosystem continues to grow both in volume and diversity. Streaming devices are also used on a wide range of networks which directly impact the delivered user experience. The video quality and app performance that can be delivered to a limited-memory mobile phone with a spotty cellular connection is quite different than what can be achieved on a cable set top box with high speed broadband; understanding how device characteristics and network behavior interact adds a layer of complexity in triaging potential device performance issues.

We strive to ensure that when a member opens the Netflix app and presses play, they are presented with a high-quality experience every step of the way. Encountering an error page, waiting a very long time for video to begin playing, or having the video pause during playback, etc. are poor experiences, and we strive to minimize them. Previous blog posts have detailed the efforts of the Device Reliability Team (part 1, part 2) to identify issues and troubleshoot them and have given examples of the uses of machine learning to improve streaming quality.

Device-related issues typically occur in one of two scenarios: (1) Netflix introduces a change to the app or backend servers that interacts badly with some devices or (2) a consumer electronics partner, browser developer, or operating system developer pushes a change (e.g. a firmware change or browser/OS change) that interacts poorly with our app. While we have tools for dealing with the first scenario (for example, automated canary analysis using Kayenta), the second type previously was only detected when the update had reached a sufficient volume of devices to shift core performance metrics. Being able to quickly identify firmware updates that result in poorer member experience allows us to minimize the impact of these issues and work with device partners to root-cause problems.

Figure 1 — Monthly number of firmware releases seen on consumer electronics devices streaming Netflix.

Figure 1 shows that the rate at which our consumer electronics device partners are pushing new firmware is growing rapidly. In 2018, our partners pushed over 500 firmware pushes a month; this value will likely pass 1,000 firmware upgrades per month by 2020. Often firmware rollouts begin slowly with a fraction of all devices receiving the new firmware for several days before the rest of the devices are upgraded. These rollouts are not random; often a specific subset of devices are targeted for new firmwares and sometimes rollouts target specific geographic regions. Naive analysis of metric changes between new firmwares and devices on older firmwares can be confounded by the non-random rollout, so it’s important to control for this when asking if a new firmware has negatively impacted the Netflix member experience.

Putting the Pieces Together

Consider the case of a metric which follows the grey distribution (with a mean value of ~ 4,570) shown in Figure 2. We see a new firmware deploy in the field (red distribution) which follows an approximately normal distribution with noticeably higher mean of 5,600, indicating that devices using the new firmware have a poor experience than the mean of the full device population. Should we be concerned that the new firmware has resulted in lower performance than prior versions?

Figure 2 — **Left:** Hypothetical distribution of a device performance metric between the control sample of devices (grey) and a population of the same devices which have been upgraded to a new firmware (red). **Right:** The control sample (red) has been broken into multiple sub-components (grey) based on geographic region.

If the devices running the new firmware were a random subsample of the control sample, we very likely should be concerned. Unfortunately, this is not an assumption we can make when working with firmware deployments with our consumer electronics partners. In this example, we can break down the control sample by geographic region (right panel of Figure 2) and see that the control sample is an aggregation of distinct distributions from each region. If our partners roll out a new firmware preferentially to some regions compared to others, we must correct for this effect before quantifying any changes in performance metrics on devices with the new firmware.

Figure 3 — Comparison of the treatment sample distribution (red) with one of the randomly matched control samples (grey) using the methodology described in the text. While the treatment sample has a larger mean metric value than the full control sample shown in Figure 2, when we account for the fact that the treatment sample came from a different population than the control sample, we can see that treatment sample has a lower mean (which is an improved member experience in this case).

We created a framework, Jigsaw, which allows data scientists and engineering teams at Netflix to understand changes in metrics with biased treatment populations. For each treatment sample, we create a Monte Carlo “matched” sample from our control sample. This matched sample is constructed to mirror the same property distribution as the treatment sample using a list of user-specified dimensions. In our example above, we would construct a matched control sample that follows the same geographic distribution as the devices in the treatment sample. This process is not limited to one dimension — in practice, we often match on geographic dimensions as well as key device characteristics (such as device model or device model year). Increasing the number of dimensions used in the matching, however, can lead to data sparsity issues. For our analysis we typically limit matching to one or two device properties to ensure sufficient data. Once we have compared the metric distributions for both the matched control and treatment samples, we repeat the Monte Carlo matching procedure multiple times to estimate the probability that the treatment sample could have been drawn from the control sample given the sampling uncertainties. Figure 3 shows one matched sample realization in the example described above. While the treatment sample has a higher mean metric in the overall comparison, controlling for differences in the underlying population show that the treatment cell has actually lowered the metric and improved our member experience.

The Bigger Picture

Figure 4 — Workflow diagram illustrating the daily cycle of Jigsaw alerts.

The introduction of Jigsaw into the Netflix device reliability engineering team’s workflow quickly made direct impact on our members’ experiences. During the summer of 2018, two device performance deteriorations were detected while the culpable new firmware was only present on 0.5% of the several million potentially impacted devices. With the early alerts from Jigsaw, the device reliability team was able to work with our consumer electronics partners to correct the problem and prevent millions of users from experiencing errors during playback. Work is underway to use the Jigsaw framework to understand more than firmware changes, as well. Comparing metrics between two web browser software versions or operating system versions is aiding several of the Netflix engineering teams understand the effects of in-field software changes on performance metrics.

Netflix members have many options when it comes to entertainment. We strive to provide the best possible experience each time anyone launches Netflix. By enriching our device performance monitoring with automated anomaly detection, we can scale our efforts as the device ecosystem continues to grow and evolve. Through being proactive rather than reacting to issues after they have had wide impact, we protect our members from poor experiences, empowering them to continue to find more moments of uninterrupted joy.

Check out the Netflix Research Page if you want to learn more. We are always looking for new stunning colleagues to join us!

Detecting Performance Anomalies in External Firmware Deployments was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Improving Experimentation Efficiency at Netflix with Meta Analysis and Optimal Stopping

January 31, 2019, 11:18 am

≫ Next: Engineering to Improve Marketing Effectiveness (Part 3) — Scaling Paid Media campaigns

≪ Previous: Detecting Performance Anomalies in External Firmware Deployments

By Gang Su & Ian Yohai

From living rooms in Bogota, to morning commutes in Tokyo, to beaches in Los Angeles and dorms in Berlin, Netflix strives to bring joy to over 139 million members around the globe and connect people with stories they’ll love. Every bit of the customer experience is imbued with innovation, right from the very first encounter with Netflix during the signup process — whether it be on mobile, tablet, laptop or TV. We strive to bring the best experience to our customers through experimentation by continuously learning from data and refining our product. In the customer acquisition area, we aim to make the signup process as accessible, smooth, and intuitive as possible.

There are numerous challenges in experimentation at scale. But believe it or not, even with millions of global daily visitors and state-of-the-art A/B testing infrastructure, we still wish we had larger samples to test more innovative ideas. There are many benefits to ending experiments early when possible. To name just a few:

We could run more tests in the same amount of time, improving our chances of finding better experiences for our customers.
We could rapidly test the waters to identify the best areas to invest in for future innovation.
If we could, in a principled way, end an experiment earlier when a sizable effect is detected, we could bring more delight to our customers sooner.

On the other hand, there are some risks associated with running short experiments:

Very often tests are allocated much longer than the minimal required time determined by power analysis, to mitigate potential seasonal fluctuations (e.g., time of day, day of week, week over week, etc.), identify any diminishing novelty effect, or account for any treatment effects which may take longer to manifest.
Holidays and special events, such as new title launches, may attract non-representative audiences. This may render the test results less generalizable.
Improperly calling experiments early (such as by HARKing or p-hacking) may substantially inflate false positive rates and consequently lead to wasted business effort.

So in order to develop a scientific framework for faster product innovation through experimentation, there are two key questions we would like to answer: 1) How much, if at all, does seasonality impact our experiments and 2) If seasonality is not a great concern, how can we end experiments early in a scientifically principled way?

Detecting Seasonal Effects with Meta Analysis

While seasonality is perceived to render short tests less generalizable, not all tests should be equally vulnerable. For example, if we experiment on the look and feel of a ‘continue’ button, Monday visitors should not have drastic differences in aesthetic preference compared with Friday visitors. On the other hand, a background image featuring a new original TV series may be more compelling during the time of launch when visitors may have higher awareness and intent to join. The key, then, is to identify the tests with time-invariant treatment effects and run them more efficiently. This requires a mix of technical work and experience.

The secret sauce we used here is meta-analysis, a simple yet powerful method of analyzing related analyses. We adopted this methodology to identify time-varying treatment effects. One frequent application of this method in healthcare is to combine results from independent studies to boost power and improve estimates of treatment effects, such as the efficacy of a new drug. At a high level:

If outcomes from independent studies are consistent as shown in the following chart (left side), the data can be fitted with a fixed-effect model to generate a more confident estimate. The treatment effect of five individual tests were all statistically insignificant but directionally negative. When pooled together, the model produces a more accurate estimate, as shown in the fixed-effect row.
By contrast, if outcomes from independent studies are inconsistent, as shown in the right side of the chart, with both positive and negative treatment effects, meta analysis will appropriately acknowledge the higher degree of heterogeneity. It will adjust to a random-effect model to accommodate the wider confidence intervals, as shown in the future prediction interval row.

More details can be found in this reference. The model-fitting process (i.e. fixed-effect model versus random-effect model) can be leveraged to test whether heterogeneous treatment effects are present across time dimensions (e.g., time of day, day of week, week over week, pre-/post-event). We conducted a comprehensive retrospective study in A/B tests on the signup flow. As expected, we found most tests do not demonstrate strong heterogeneous treatment effects over time. Therefore, we could have ended some tests early, innovated more, and brought an even better experience to our prospective customers sooner.

End Experiments Early with Optimal Stopping

Assuming that a treatment effect is both time-invariant (evaluated by meta-analysis) and sufficiently large, we can apply various optimal-stopping strategies to end tests early. Naively, we could constantly peek at experiment dashboards, but this will inflate false positives when we mistakenly believe a treatment effect is present. There are scientific methodologies to control for false positives (Type I errors) with peeking (or, more formally, interim analyses). Several methods have been assessed in our retrospective study, such as Wald’s Sequential Probability Ratio Tests (SPRT), Sequential Triangular Testing, and Group Sequential Testing (GST). GST demonstrated the best performance and practical value in our study; it is widely used in clinical trials in which samples are accumulated over time in batches, which is a perfect fit our use case. This is roughly how it works:

Before a test starts, we decide the minimum required running time and the number of interim analyses.
GST then allocates the total tolerable Type I error (for example, 0.05) into all interim analyses, such that the total Type I error sums up to the overall Type I error. As a result, each interim test becomes more conservative than regular peeking.
A test can be stopped immediately whenever it becomes statistically significant. This often happens when the observed treatment effect size is substantially larger than expected.

The following chart illustrates the critical values and the individual and cumulative alpha spends from a GST design with five interim analyses. By adopting this strategy, we could have saved substantial time in running some experiments and obtained very accurate point estimates of the treatment effects much sooner, albeit with slightly wider confidence intervals and a small inflation of treatment effects. It works best when we would like to do a quick test of ideas and the accuracy of the magnitude of the treatment effect is less critical, or when we need to end a test prematurely due to a severe negative impact.

The following chart illustrates a successful GST early stop and a Fixed Sample Size (FSS full stop) determined by power analysis. Since the observed effect size is sufficiently large, we could stop test earlier with similar point estimates.

Building Decision Support into Our Experimentation Platform

Now that our initial research is completed, we are actively developing meta analysis, optimal stopping, heterogeneity treatment effects detection, and much more into the larger Netflix Experimentation and Causal Inference platform. We hope such features will accelerate our current experimentation workflow, expedite product innovation, and ultimately bring the best experience and delight to our customers. This is an ongoing journey, and if you are passionate about our mission and our exciting work, join our all-star team!

Special thanks to the support from Randall Lewis, Colin McFarland and the Science and Analytics team at Netflix. Team work makes the dream work!

Improving Experimentation Efficiency at Netflix with Meta Analysis and Optimal Stopping was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Engineering to Improve Marketing Effectiveness (Part 3) — Scaling Paid Media campaigns

February 4, 2019, 9:31 am

≫ Next: Protecting a Story’s Future with History and Science

≪ Previous: Improving Experimentation Efficiency at Netflix with Meta Analysis and Optimal Stopping

Engineering to Improve Marketing Effectiveness (Part 3) — Scaling Paid Media campaigns

This is the third blog of the series on Marketing Technology at Netflix. This blog focuses on the marketing tech systems that are responsible for campaign setup and delivery of our paid media campaigns. The first blog focused on solving for creative development and localization at scale. The second blog focused on scaling advertising at Netflix through easier ad assembly and trafficking.

Netflix’s Marketing team is interested in raising awareness about our content, our brand and get new consumers excited about signing up to our service. We use a combination of paid media, owned media (e.g. via Netflix twitter handle), and earned media (publicity) to reach people all over the world. We use a mix of art and science to decide which titles to promote in which market.

The Netflix Marketing Tech team focuses on automation and experimentation to help our marketing team save time and money. These marketing campaigns help raise brand awareness on services/channels outside of the Netflix product itself (e.g. social media channels).

What are we solving for?

The marketing tech team’s goal is to build scalable systems which enable marketers at Netflix to efficiently manage, measure, experiment and learn about tactics that help unlock the effectiveness of our paid media efforts.

Improving Incremental Marketing Effectiveness: Netflix wants to use paid media to drive incremental value to the business. For instance, if we are interested in using a campaign to get people to sign up, we only want to measure the success of our campaign on users we caused to sign up, which is a subset of total signups. We measure this through ongoing experiments and control/holdout groups. This lets us understand the difference between all signups (correlational) and incremental signups (causal). For more detailed information on this topic, please refer to some of the work done in this space — incrementality bidding and attribution and measuring ad effectiveness.

Enabling Marketing At Scale: Netflix is now available in over 190 countries and advertises globally outside of our service in dozens of languages, for hundreds of pieces of content, ultimately using millions of promotional assets created solely for the purposes of marketing. One trailer for a title can quickly turn into hundreds of individual ads with permutations by languages, aspect ratios, subtitles, testing variations, etc. Our Marketing Tech platforms need to support the assembly and delivery of a wide variety of campaigns and plan type combinations, on a variety of external advertising platforms at global scale.

Enabling Easy Experimentation: Similar to how we test on the Netflix Product, the Netflix Marketing team embraces experimentation to identify the best marketing tactics for spending paid media dollars. The Marketing Tech team seeks to create technology that will enable our partners in marketing to spend more of their time on strategic and creative decisions. Our teams use experimentation to guide their instincts on the best performing campaigns. We enable experimentation through methodologies like A/B testing and geo-based quasi testing.

Near Real-time Measurement: As an advertiser running paid media across multiple global ad platforms, our systems need to provide accurate, near real-time answers about campaign performance. We then use this data to adapt and optimize our campaign spend in order to achieve our conversion goals.

How are we solving them?

The tech systems which solve for these problems can be broken into the following types:

Systems which are responsible for automating the workflows used for buying paid media and unlocking efficiencies in those flows (media planner)
Systems which help with creative development & localization and assembly of ads from the creative assets.
Systems which are responsible for marketing campaign creation and execution (campaign management system)
Systems which are responsible for collecting analytics and insights into how our campaigns are performing on ad platforms like Google (advertising insights)
Systems that enable changing budget on live campaigns in order to optimize our marketing spend (ad budget optimization)

Here is a highly simplified life cycle of a paid media marketing campaign.

This blog will focus mainly on the campaign management and ad budget optimization systems.

Campaign Management System

First some terminology for those less familiar with this space. A campaign is a set of advertisements with a single idea or theme. An advertising campaign is typically broadcast through several media channels like programmatic, digital reserve, TV, print, Billboards, etc. A campaign consists of the following:

Objective: A campaign has specific goals and objectives.
Target Audience: It refers to the set of audience and languages within regions that are targeted by the campaign.
Catalog: contains information about all ads that are part of the campaign. It consists of a list of titles, assets links and ad formats e.g videos ads, carousel, etc.
Budget: is the spend that is associated with a campaign. You can apply this spend to each day the campaign runs (daily budget) or over the lifetime of the campaign (lifetime budget).

Programmatic bidding is the automated bidding on advertising inventory, for the opportunity to show an ad to a specific internet user, in a specific context.

The programmatic marketing team at Netflix is responsible for planning, setting up and executing marketing campaigns on the ad platforms. Each campaign requires a separate structure, and combinations of different geography, inventory sources, etc. We built our campaign management system in order to automate large parts of the campaign creation process. The system automates the process of campaign creation by abstracting out complexities in ad catalog setup, budget recommendation, experimentation, audience and campaign setup.

The complexity of the campaigns is further increased because the programmatic marketing team often runs experiments as part of a campaign. For example, we may use a campaign to test the relative effectiveness of a 30 second creative vs. a 6 second creative. Other experiments might try to determine optimal campaign parameters like budget allocation. Without tooling, all of the combinatorics would lead to dramatic increase in campaign setup time and complexity.

To solve this, our campaign management system has the notion of plan type (recipe) — pre-built combinations of various factors such as campaign type, objective, etc. With the help of this feature, we enable the selection of an appropriate recipe and add setup information for multiple cross-country and cross-platform tests in a single place. This dramatically reduces campaign setup time, removes error prone manual steps, and increases our confidence in test learnings.

System architecture

The Campaign Management Service relies on a variety of technologies to achieve its goals. The majority of the service layer is written in Kotlin and Java. Cassandra is the primary store for most data. The UI is built on top of Node.js using React components which communicates with the backend service via REST. Titus provides container based jobs which can be used for heavy lifting, such as uploading videos to ad platforms.

Ad Budget Optimization System

When we run a marketing campaign with an objective of increasing incremental new sign ups, we are faced with the challenge of spending marketing budgets across the globe. For example, should the next incremental dollar get allocated to marketing in the U.S. or Thailand if incremental sign up and revenue is our ultimate objective? Stated simply, this is a budget allocation problem with several hard to measure factors that are behind it — Netflix product/market fit, cost of media, etc.

We solve this problem by building a system to dynamically distribute budgets across campaigns and countries to maximize incremental revenue. The system retrieves live campaign performance data from ad platforms and calculates the budget allocation per country and platform based on the current spend and the number of days remaining in the campaign.

System architecture

There are three main components in the budget optimization system. The front-end provides a CRUD interface for entering campaign metadata (campaign description, start and end dates, budget, etc). It also uses the Netflix workflow orchestration engine Meson to enable the ability to view, select and execute specific budget runs. The next component is a data ETL pipeline which is responsible for calculating input metrics like spend, lift, etc and persisting those in Hive tables. The third component is the backend which reads campaign metadata from S3, spend/lift metrics from Hive and applies machine learning models to calculate the budget allocation per county and ad platform. The updated budget allocations are then pushed to the external platforms through API integrations.

Future challenges

As our business is evolving, our systems need to scale to support for accelerated experimentation and expanded creative inventory. We will also continue to fine tune our systems to make our workflows as automated as possible. The less time and effort spent manually creating ad campaigns, the faster we will be able to move as a business. Our goal is to continue to build an efficient, robust and scalable marketing platform that is greater than the sum of its parts and which ultimately enables consumer delight.

Conclusion

In summary, we’ve discussed the mechanics of creation, delivery and optimization of Netflix campaigns at global scale. Some of the details themselves are worth follow-up posts and we’ll be publishing them in the future. We need back-end engineers to build robust data pipelines and facilitate communication and integration between our many services. We’re also looking for front-end engineers to build beautiful, intuitive user interfaces that are a pleasure to use and provide a cohesive look and feel across our ecosystem. If you’re interested in joining us in working on some of these opportunities within Netflix’s Marketing Tech, we’re hiring!

Authored By Jayashree Biswas, Gopal Krishnan

Engineering to Improve Marketing Effectiveness (Part 3) — Scaling Paid Media campaigns was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Protecting a Story’s Future with History and Science

February 5, 2019, 10:28 am

≫ Next: Building a Cross-platform In-app Messaging Orchestration Service

≪ Previous: Engineering to Improve Marketing Effectiveness (Part 3) — Scaling Paid Media campaigns

By Kylee Peña, Chris Clark, and Mike Whipple

Kylee’s parents after their wedding in 1978.

I — Kylee — have two photos from my parents’ wedding. Just two. This year they celebrated 40 years of marriage, so both photos were shot on film. Both capture a joy and awkwardness that come with young weddings. They’re fresh and full of life, candid captures from another era.

One of the photos is a Polaroid Instant Camera shot. It’s the only copy of the photo. The colors are beginning to fade, and the image itself is only three or four inches across.

The other was shot on a 35mm camera. My mom and dad stand next to their best man and maid of honor, but the details are lost because the exposure is far too dark. And the negative was lost long ago too.

These are two photos that are important to my history, but I’ll probably never be able to make them any better. One was shot on a lost format, the color and contrast embedded into a single original copy as it was shot. The other could be cleaned up and blown up significantly with modern technology — if only the negative had been handled with care.

Everyone has images that are precious to them: that miraculous video of your dog doing that trick he does. The shot of your grandparents on their final anniversary together. Your own wedding, in which you invested thousands of your own savings. Imagine if you couldn’t play the video anymore, or if your grandparents’ portrait got blurry, or if your skin and wedding dress had a green hue on it.

On a movie or television show, this type of thing happens all the time to the director or cinematographer. They have worked their entire lives to get to the point of capturing that picture, and capturing it correctly. And then later it looks bad and wrong — 24 times a second.

By reaching beyond film and television art and into the realms of history and science, we’ve been working on solving this issue for Netflix projects. The future of our industry always has unknowns thanks in large part to rapidly accelerating innovations in technology. However, we can use what we do know from a hundred years of filmmaking, and the study of human perception, to best preserve content as new technology emerges which allows us to make the experience of viewing it even better. Preserving creative intent while preserving these important images is the goal.

With some attention and care spent up front on building and testing a color managed workflow, a show can look as expected at every point in the process, and archival assets can be created that increase the longevity of a show far into the future and protect it for remastering. The assets we require at Netflix — the NAM, GAM, and VDM — might be digital files that make their way to the cloud via our Content Hub, but the concepts are rooted in history and science.

What’s in a NAM, GAM or VDM?

Anyone who has delivered to (or been interested in delivering to) Netflix is familiar with these terms: non-graded archival master (NAM), graded archival master (GAM), and video display master (VDM). Other studios or facilities may have similar assets or similar names, and these are ours. Generally speaking, each delivery to Netflix will include these archival assets.

A non-graded archival master (NAM) is a complete copy of the ungraded, yet fully conformed, final locked picture including VFX, rendered in the original working color space such as ACES or the original camera log space, with no output or display transform rendered in or applied.

A graded archival master (GAM) adds in the final color grading decisions to the fully conformed final locked picture, and is also rendered in the original working color space such as ACES or the original camera log space — again, with no output or display transform rendered in or applied.

Neither the NAM nor GAM are very pretty to watch because images in logarithmic a.k.a “log” or linear space are not rendered for display and potentially hold more information than displays can reproduce. For that, we have the VDM.

The video display master (VDM) is also a complete copy of the fully conformed, final locked picture with VFX, this time rendering in the output or display transform, meaning it is encoded in the mastering display’s color space.

*NAM-GAM-VDM Example: Log workflow (ARRI LogC)*

*NAM-GAM-VDM Example: Linear workflow (ACES)*

Each of these assets is delivered in an uncompressed or losslessly compressed format such as a 16-bit DPX, EXR, or TIFF sequence. These archival assets are among a variety that are required in addition to delivery of a SMPTE IMF (Interoperable Master Format) package, which is the mastering format used to derive all streaming assets.

These assets give us a huge amount of flexibility in the future because we’re preserving a copy of the show in the original color space. While retaining all the original information and dynamic range as it was originally shot, we can remaster shows (and remaster them more easily) while preserving the original creative intent, assuring they’ll continue to look the best they possibly can for years to come.

To understand where these terms and processes come from, we have to go back to film class.

Film History 101

Thinking more deeply about Netflix and the technology evolution pushing forward the creative technical work behind television and film, the last thing to come to mind might be physical film. But in fact, our NAM, GAM and VDM archival assets have their roots in over a hundred years of film history.

Today, more often than not, productions have largely moved to digital acquisition on camera cards and hard drives. A century ago, when the only image capture format was celluloid, the physical processes to handle it were developed and refined over the years that followed.

A motion picture film strip (Source: Wikimedia Commons)

In a physical film workflow, photography was completed on set and exposed film negatives were sent to a lab for a one light print. Multiple camera rolls were strung together into a lab roll, and dailies were created using a simple set of light values that made a positive human-viewable print.

An editor would cut together the film and a negative cut list (similar to an EDL, with a list of edit decisions and key codes instead of file names and time codes) was sent to a negative cutter for conforming the locked picture from the original negative.

This final cut glued together by a negative cutter is the equivalent of our modern day non-graded archival master (NAM).

After this point, a director of photography would work with a color timer to apply a one-light to the entire negative, then give creative adjustments on each scene. The color timer would program the printer lights on a shot by shot basis in an analog process, creating what would be similar to a modern color decision list (CDL). When the color was agreed upon, the negative was printed with the timing lights. This second negative — a negative of a negative — was called the interpositive (IP).

This IP or “negative negative” with all the final color decisions included is the equivalent of our graded archival master (GAM). Since this film stock is based on the original negative, it can hold the same amount of information and dynamic range as the original negative.

Internegatives were created from the IP for bulk printing, and a positive (human viewable) film print was created from that. A print film stock, unlike negative film stocks, has specific characteristics required to produce a pleasing image when shown on a film projector. The film print is our equivalent of a video display master (VDM).

35mm film print (Image courtesy of Adakin Productions) Source: Wikipedia

As film continued to evolve and transition into digital workflows, motion imaging experts have continued to innovate and improve this process and transition it into modern workflows. Decades of work on moving pictures combined with the proliferation of faster, cheaper storage and smaller, better camera sensors have led to the ability to create a robust archive ready for remastering where no scenes will ever be lost to time.

Next Up, Science Class: Color Science

To create these archival assets in today’s digital workflows (and maintain a happy creative team viewing their show exactly as they shot it at every point in the process), proper color management is key from the start. A basic understanding of color science is helpful in understanding how and why color, perception, and display technology is critical.

Most pictures today are color pictures. Color is made up of light, which at different wavelengths we would call “red,” “green,” “blue,” and many other names.

This light goes through two phases:

Light enters the eye and tiny cells on our retina (cones) react to it.
Signal travels to the back of our brain (visual cortex) to form a color perception.

The eye-retina portion (1) is a fairly well-understood science, standardized by the CIE into three measurable “tristimulus values” known as XYZ.* These values are based on our three cone types which respond to long (L), medium (M), and short (S) wavelengths of light.

XYZ is often called colorimetry, or the measurement of color. If you can get XYZ₁ to match XYZ₂, to the average observer, the colors will match. For example, we can print a picture of an apple, using printer dyes, which measures the same XYZ values as the original apple, even though the exact spectral characteristics of the printer dyes and apple differ. This is how most color imaging systems are successful.

The cognitive portion (2) is far more complex, and involves your viewing environment, adaptation state, as well as expectations and memory. This is known as color appearance, and is also well-studied and modeled — but we’ll save that for a future blog.

For this reason, XYZ makes for a fine and proven way of calibrating displays to match. Until someone figures out how to feed content straight to your brain, displays are the only way we can view content, so understanding their characteristics and making sure they’re working as intended is important.

But before we get to the display, we have to create the images to display.

Cameras in our industry typically attempt to respond to light as close to the human visual system as possible, using color filters to emulate the three cone responses of the human eye. A perfectly designed camera would be able to record all visible colors, store them in XYZ, and perfectly store all the colors of a scene! Unfortunately, achieving this with an electronic camera system is difficult, so most cameras are not perfectly colorimetric. Still, the “emulate-the-human-eye” design criteria remains, and most cameras do a fairly good job at it.

Since cameras are not perfect, in very simple terms, they do two things:

Apply a Input Transform from raw sensor RGB → XYZ colorimetry, optimizing this transform for the most important** colors found in the real world.
Apply an Output Transform from XYZ → RGB for display.

Sometimes this is all done in one step. For example, when you get a JPEG out of a camera or your smartphone, these two steps have occurred, and you are looking at what is known as a “display-referred” image. In other words, the RGB values correspond to the colors that should be coming off of the display.

It is worth noting here that broadcast cameras often operate in the same manner — they apply #1 and #2 to output “display-referred” images, which can be sent directly to a display.

Shooting RAW is different. Professional cameras allow for #1 and #2 to not be applied. This means you get the raw sensor RGB values. No color transforms are applied until you process or “develop” that image.

Now, let’s say you applied color transform #1 from above, but not #2, and you output an image in XYZ. This is known as a “scene-referred” image. In other words, the values correspond to the estimated*** colors in the scene, either in XYZ or an RGB encoding defined within XYZ.

Scene-referred images usually contain much more information than displays can show, just like a film negative. This is true in both dynamic range and color. This can be stored in various ways. Camera companies in our industry usually define their own “scene-referred” color spaces.

Just a few examples include:

ARRI: Alexa LogC Wide Gamut
Sony: S-Log3 S-Gamut3.cine
Panasonic: V-Log V-Gamut
RED: RED Wide Gamut Log3G10

These color spaces are designed specifically to encompass the range of light and color that each camera is capable of capturing, and are storable in an integer encoding (usually 10-bit or 12-bit). This takes care of the Input Transform (#1).

It might seem appropriate to simply show the scene-referred color on a display, but the Output Transform (#2) portion is required to account for differences in luminance between the scene and display, as well as the change in viewing environment. For example, a picture of a sunny day is not nearly as bright as the physical sun, so this must be accounted for in terms of contrast and color. This concept of “picture rendering” has many approaches, and goes beyond the scope of this blog post, but since it has a large impact on the overall “look” of an imaging system, it is worth introducing the concept here.

For this reason, camera companies usually provide default Output Transforms (in the form of look-up tables or LUTs) so that you can take a Log image from their camera and view it in a color space such as Rec. 709.

An Eye Toward Color Management

These concepts all come together to form a color managed workflow. Because color management assures image fidelity, predictability of viewing images, and ease with mixed image sources, it’s the best way to work to protect the present and future viewing of movies or series. A color managed workflow requires a defined working color space and an agreed upon output transform or LUT, clearly documented and shared with all parties in a workflow.

Once a working color space is defined, all color corrections are made within that space. However, since we know this space is scene-referred and can’t be viewed directly, the output transform must be used to preview what the image will look like on your display.

In this example, the working color space is Log and the display color space is Rec. 709. The Output Transform separates the two, and is only baked in for the Rec. 709 streaming master. The archival masters (non-graded and graded) remain in Log color space.

It might seem easier to just convert all images to a display color space like Rec. 709. But this removes all the dynamic range and additional information from the post production process and the resulting archive. As new display technology emerges as years pass by, your images are stuck in time like Kylee’s parents’ wedding photos.

Using an Output Transform or display LUT — such as a creative LUT designed by a colorist or DI facility, or even a default camera LUT like ARRI’s 709 LUT — can not only serve as the base “look” of the show, but it protects and preserves the working color space and the full dynamic range it has to offer for color and VFX, and the eventual NAM and GAM archival assets.

Additionally, in productions with secondary cameras, an Input Transform can be used to convert images into this larger working color space. Most professional cameras have published color space definitions, and most professional color grading software implement these in their toolsets. This unifies images into a common color space and reduces time spent matching different cameras to one another.

The Academy Color Encoding Standard (ACES) is a color management system which attempts to unify these “scene-referred” color spaces into a larger, standardized one. It covers all visible colors, and uses 16-bit half-float (32 stops of linear dynamic range) encoding stored in an OpenEXR container, well beyond the range of any camera today. Camera manufacturers also publish Input Transforms in order to convert from their native sensor RGB into ACES RGB.

Source: Academy of Motion Picture Arts and Sciences

ACES also defines a standard set of Output Transforms in order to provide a standard way to view an image on a calibrated display, regardless of the camera. This part is key in terms of providing a consistent view of working images, since these ACES Output Transforms are built-in to most popular color grading and VFX software.

It’s worth noting that in the past, color transforms had to be exported into fixed LUTs (look-up tables) for performance reasons. However, increasingly with modern GPUs, systems are able to apply the pure math of a color transform without the need for LUTs.

But what about viewing it all?

Similar to camera color spaces, display color spaces are typically defined in XYZ space. However, no current displays can properly show everything in most scene-referred images due to absolute luminance and color gamut limitations, and the evolution of display technology means what you can see will change year by year.

Displays receive a signal and output light. Display standards, and calibration to those standards, allow us to send a signal and get a predictable output of light and color.

Today, most displays are additive, in that they have three “primaries” which emit red, green, and blue (RGB) light, and when combined or added, they form white. The ‘white point’ is the color that is produced when equal amounts of red, green, and blue are sent to the monitor.

Display standards exist so that you can take an image from Display #1 and send it to Display #2 and get the same color. In other words, which red, green, blue, and white are being used?

This is especially important in broadcast TV and the internet, where images are sent to millions of displays simultaneously. Common display standards include sRGB (internet, mobile), Rec. 709 (HD broadcast), Rec. 2020 (UHD and HDR broadcast), and P3 (digital cinema and HDR).

These standards define three main components:

Primaries, usually defined in XYZ
White point, usually defined in XYZ
EOTF (Electro-Optical Transfer Function / signal-to-luminance (Y))

Example Primaries and White Point of ITU-R BT.709 aka “Rec. 709” color space (left), Example Transfer Function from ITU-R BT.1886 standard (right)

Test patterns are measured in order to tune displays to hit standard targets as closely as possible. Usually the test patterns include patches of red, green, blue, and white, as well as a greyscale to measure the EOTF.

Only after calibration do creative adjustments become meaningful, and color transforms become useful. A color managed workflow requires this step in order to truly have an impact on image fidelity and consistency.

Display technology has come a long way since we first began exhibiting motion images in public and then the home. From projection systems in theaters to over-the-air broadcast to emissive displays like OLED and even an iPhone, the display technology on which we look at images continues to evolve quickly. Color management and proper archival elements protect the high quality display of a movie for the future.

Summary

Returning to the original story of my parents and their remaining wedding photos, it’s apparent how being in the moment and dealing with all the challenges of that moment — budget, time, people, access to technology — can seem small compared to what is happening around you. But as time goes on, the remaining records of that experience only become more precious. And the opportunity to preserve it — and preserve it well — is missed altogether.

Drawing a parallel to a film or television show, preserving these captured moments and assuring the creative intent behind them is preserved and protected for years of enjoyment is incredibly important. Some shows become cultural touchstones. Others are personal favorites that provide comfort for many years. In any case, for many filmmakers they are the culmination of an individual’s life work and deserve the respect of a high quality viewing experience and high fidelity archive.

At Netflix, we are constantly refining our process and approach to these processes while continuing to rely on the vast collective knowledge of so many years of film history and scientific research. Within the Creative Technologies & Infrastructure team, we are always on the hunt for new and innovative ways to increase the usefulness of assets while becoming more and more flexible for creatives and technicians alike. While history and science may give us many resources to work from, the relationships we cultivate with the production community may give the best guidance.

Some movies have been lost to time. Some remastered films are missing entire scenes. And I’m not the only one whose family photo albums are rapidly fading, their quality locked in by the imaging technology available at the time. By combining thoughtful planning and technology, we can preserve the human experience and its stories for decades to come.

Despite improperly captured and archived wedding photos, Kylee’s parents are still happily married after 40 years. And this photo was captured in Canon RAW and is backed up in two locations.

Footnotes

*XYZ, in this blog, refers to the CIE 1931 2-degree color matching functions (CMFs). This is different than DCI X’Y’Z’ (pronounced “X-prime, Y-prime, Z-prime”) which is a color encoding, based on XYZ, defined for digital cinema.

**Examples might be skin tones, blue skies, foliage, and other common colors.

***The word ‘estimated’ is used since cameras are not perfect colorimetric devices.

Protecting a Story’s Future with History and Science was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Building a Cross-platform In-app Messaging Orchestration Service

February 11, 2019, 9:01 am

≫ Next: Extending Vector with eBPF to inspect host and container performance

≪ Previous: Protecting a Story’s Future with History and Science

George Abraham, Devika Chawla, Chris Beaumont, and Daniel Huang.

Thoughtful, relevant, and timely messaging is an integral part of a customer’s Netflix experience. The Netflix Messaging Engineering team builds the platform and the messages to communicate with Netflix customers.

Messages in the Netflix App

In-app messages at Netflix fall broadly into two channels — Notifications and Alerts. Notifications are content recommendations that show up in the Notification Center within the Netflix app. Alerts are typically account related messages that are designed to draw the customer’s attention to take some action and show up on the main screen. This blog post will focus on the Alerts channel.

Left: Notification Center on the Netflix iOS app. Right: An in-app Alert on the Netflix iOS app.

Some History

It is worth spending some time painting a picture of the landscape that we evolved from. In the early days, account-related Alerts were served only on the Netflix website app. All the logic around presenting an Alert was implemented in the website codebase. The UI handled business logic around priority, customer state validation, timing, etc. A few years ago, the Netflix website was undergoing a major re-architecture. We simplified the ecosystem and built an in-app messaging service that handled the complexity of messaging orchestration by taking ownership of dimensions like

Timing — when to show a message
Frequency — how often to show a message
Device Eligibility — which devices to show the message
Population Segment — which profiles to show the message

This freed up UI platforms to focus their attention on core UI innovation.

Initial Goals

Our primary goals at project inception were the following:

Continue to support in-app Alert messaging in the new Website architecture (i.e remain backward compatible)
Transfer business logic around messaging orchestration from the UI tier to the in-app messaging service
Support cross-platform messaging
Minimize time spent on running and productizing messaging A/B tests.

Messaging Architecture Overview

Messaging infrastructures services are implemented in Java and like most Netflix services are deployed to AWS on clusters of EC2 instances in multiple AWS regions.

A simplified view of the messaging platform

Events are consumed by the messaging platform where they are transformed into fully formed messages that are routed to either the external message delivery service or the in-app message delivery service.

For external channels, a message is created by calling other Netflix services for relevant information, assembled in the appropriate format and delivered right away to the external service (e.g. Apple Push Notification Service for iOS push notifications).

For the in-app channels, a message skeleton is stored in the service where it resides until a customer logs into the Netflix app — at which point the message is validated, assembled, and delivered to the device to be displayed.

In-app Messaging Service

At Netflix, member UI teams are organized by the platform that they work on (Android, iOS, TV, Website, etc). Each platform technology stack is different and the treatment of an Alert on each platform looks significantly different. Creating custom payload contracts for each platform would be an inefficient and error-prone solution. It would also hinder our velocity to test messaging experiences across platforms.

A mock Netflix Alert on various Netflix devices — clockwise from top: TV, iOS, Web, Android

Design Based Payload Contract

We settled on a custom JSON payload contract that has pieces that are common to all the UIs but also accounted for differences in the design. We structured it in a way that UI platforms can implement the rendering of an in-app Alert without having to know anything about the specific message type. From the UI standpoint — there is an Alert to be shown.

{
   "templateId": "standard", 
   "template": { }, 
   "attributes": { }
}

The Template Identifier

The templateId is an identifier that lets the UI platform decide how to render the Alert. In simple terms — the design that should be used to render this Alert. In this example, the payload indicates to use the standard design. Design elements such as color, font, etc. are not prescribed since they are baked into the rendering logic that UI platforms use when they recognize the standard templateId.

The Template

The template field contains the payload that describes the various elements of the Alert for the specific templateId. Each field within the template corresponds to an element of the Alert with appropriate copy and attributes.

{
   "template": {
      "title": { 
         "copy": [] 
      },
      "body": {
         "copy": []
      },
      "ctas": [
         {},{},{}
      ]
   },
   "footer": {
      "copy": [ ]
   }
}

Copy

The messaging service also took responsibility for delivering the localized strings for the Alert to the device. This allowed us to create more complex messaging experiences for customers that were not possible before. For instance, the in-app messaging service is now able to intelligently adjust the copy, cadence, etc. associated with a message based on user interaction across platforms. Because the messaging service hosts the strings on the service — we are also not tied to release cycles of each UI platform.

The payload also allows for basic markups such as boldface, newline, and links, since these are typically used within the copy itself. This approach provides reasonable customization of commonly used copy elements across platforms but keeps the messaging system out of the path of design innovations in the UI. Most importantly, this approach also accounts for differences in the technology stack in each UI platform.

"copy" : [ {
      "elementType" : "TEXT",
      "content" : "This is an example " 
  }, {
      "elementType" : "BOLD",
      "content" : "Netflix Alert Message"
   }, {
      "elementType" : "TEXT",
      "content" : "."
   }]
}

Calls to Action

Calls to action (CTAs) include both copy and action attributes

{
   "actionType": "BACKGROUND_SERVICE_CALL",
   "action": "DISMISS_ALERT",
   "ctaType" : "BUTTON",
   "copy": [], 
   "isSelected": true,
   "ctaFeedback": { } 
}

An actionType is a top-level category to denote what type of action should be taken when a customer clicks the CTA. Some clicks result in a redirect to another flow in the app while others may result in a call to backend service to update something.

The action describes the specific action to be taken when a customer clicks on the CTA. For instance, one action could redirect the user to the plan selection flow whereas another could redirect the customer to the change email flow.

For designs that support multiple CTAs — the payload dictates which CTA to highlight by default with the isSelected flag.

Each CTA also contains a feedback field that is posted back to the service in its entirety. The contents of the feedback are not inspected by the UI platform, it is simply a passthrough that is used by the in-app messaging service to implement the orchestration features of the service.

{
   "ttl": 3600,
   "feedbackType": "cta",
   "cta": "DISMISS_ALERT", 
   "trackingInfo": {
      "messageGuid": "786DECAE429EEB029EEE057191675F6764555F12",
      "eventGuid": "D6AC068F169C0E616E99AF1C72F1AD8264555F12",
      "renderGuid": "85661FCF-0857-42A7-AA0E-559A26A5723B_R",
      "messageName": "EXAMPLE_ALERT",
      "messageId": 1234,
      "locale": "en-GB",
      "abTestId": 2222,
      "abTestCell": 2,
      "templateId": "standard",
   }
}

In the feedback call to the service, the UI platform also passes additional context that the messaging service is not privy to — like UI platform, device id, application version, etc.

The data here allows the service to perform orchestration features like changing the behavior of subsequent impressions of the Alert based on the customer’s prior interactions. For instance — we can suppress the Alert for a day on all devices if a customer clicked on DISMISS on any device. Or, we can choose to show a follow-up Alert using a less interruptive design with different copy limited to certain devices.

Attributes

The attributes field contains information to help UI platforms drive the rendering and user experience of the Alert. These are typically platform-specific attributes — such as blacklisted pages for an Alert on the website. For instance, we don’t want to show an “Add Your Phone Number” Alert on the Add Phone page.

The attributes section also contains a feedback field that is posted back to the service once the Alert is rendered. This feedback allows the in-app messaging service to change the behavior based on just the impression of the Alert even if the customer did not click a CTA.

Challenges

One of the main challenges we faced when starting down this path was that messaging engineers had little to no knowledge about UI platforms and the nuances among them. We had to learn to account for differences such as message fetching patterns, testing infrastructure, bandwidth, refresh times, etc. We also evolved our payload from a one size fits all approach to a more balanced one by taking on some of the complexity and managing the nuances by implementing a UI-centric design based payload.

Another challenge was around test automation. Our early days were filled with running manual tests because we didn’t have the infrastructure in place to integrate with UI automation frameworks. In the past year, our investment in test automation has resulted in shortening the time taken for validating changes.

Finally, localization QC was a challenge because we didn’t have the tooling in place to take screenshots across the various UIs for a message. We now have a workflow that allows the localization team to generate Alert screenshots for all languages and devices.

Wins

The in-app messaging service has opened up avenues to run more A/B tests than was possible before.

We now build cross-platform experiences so that customers can get a consistent messaging experience on all devices.
We have reduced the development time for messaging experimentation because resources and context for these projects are contained within the messaging team.
We run omnichannel messaging tests that incorporate other messaging channels such as email, push notifications, etc.
We run experiments using levers such as timing, channel selection, frequency, etc. since the in-app messaging service is integrated into the Netflix customer messaging platform.

Looking Ahead

As Netflix grows around the world, it’s increasingly beneficial and convenient to communicate with our customers inside the Netflix app. From account related messages to onboarding, and more — we are continually evolving our message portfolio.

One area where we are looking to reduce overall development time is around how UI platforms handle interaction with CTAs. Currently, for CTAs that result in a background service call, the UI platform maintains the code that is hosted in the Netflix API Platform. It was the path of least resistance when we started since most platforms were already integrated with backend systems. As our systems grow and become more complex, making changes to these flows will become harder.

We are evolving the service to a completely federated architecture for in-app messaging orchestration where UI platforms will need to interface only with the in-app messaging service eliminating the need for them to create and maintain integrations with backend services for each kind of CTA. At that point, as far as the UI platform is concerned — the payload contract for the Alert will specify what information needs to be passed back to the messaging service which will handle all the downstream calls, provide fallbacks, etc.

If you would like to join our small, impactful, and collaborative team — come join us.

Building a Cross-platform In-app Messaging Orchestration Service was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Extending Vector with eBPF to inspect host and container performance

February 20, 2019, 9:51 am

≫ Next: How Data Inspires Building a Scalable, Resilient and Secure Cloud Infrastructure At Netflix

≪ Previous: Building a Cross-platform In-app Messaging Orchestration Service

by Jason Koch, with Martin Spier, Brendan Gregg, Ed Hunter

Improving the tools available to our engineers to help them diagnose, triage, and work through software performance challenges in the cloud is a key goal for the cloud performance engineering team at Netflix.

Today we are excited to announce latency heatmaps and improved container support for our on-host monitoring solution — Vector — to the broader community. Vector is open source and in use by multiple companies. These updates also bring other user experience improvements and a fresher technology stack.

Remotely view real-time process scheduler latency and tcp throughput with Vector and eBPF

What is Vector?

Vector is an open-source host-level performance monitoring framework which we have been using for some time. Having the right metrics available on demand and at a high resolution is key to understanding how a system behaves and helps to quickly troubleshoot performance issues. For more information on the background and architecture of Vector and PCP, you can see our earlier tech blog post “Introducing Vector”.

Why the changes

There are two main triggers for the refresh:

Firstly, at Netflix we are seeing significant growth in the use of our container environment (Titus). Historically the focus of our tooling and platform has been on AWS EC2, but with more users making the migration, we need to provide better support for container use cases.
For example, being able to present both host and container level metrics on the same dashboard helps us provide a way to quickly answer questions: “am I being throttled by the container runtime?” or “are there noisy neighbors affecting my container task?”.
Secondly, we have a collection of open requests and issues that were difficult to resolve with the current dashboard component. There was no way to pause graphs to take screenshots, and chart legends were sometimes laid out obscuring the actual chart data.

Introducing BCC / eBPF visualisations

The Netflix performance engineering team (especially my colleague Brendan) has been contributing to eBPF since its beginning in 2014, including developing many open source performance tools for BCC such as execsnoop, biosnoop, biotop, ext4slower, tcplife, runqlat, and more. These are command line tools, and for them to be really practical in the Netflix cloud environment of over 100k instances, they need to be runnable from our self service GUIs including Vector. For that to work, there needed to be a PCP interface for BPF.

Andreas Gerstmayr has done fantastic work with developing a PCP PMDA for BCC, allowing BCC tool output to be read from Vector, and adding visualizations to Vector to view this data. Vector can now show these from BCC/eBPF:

Block and filesystem (ext4, xfs, zfs) latency heat maps
Block IO top processes
Active TCP session data such as top users, session life, retransmits ..
Snoops: block IO, exec()
Scheduler run queue latency

In the below diagram we can see a demo for a wget job. As soon as the wget job starts, the ‘runqlat’ chart shows increased scheduler activity (more yellow areas in the diagram) and longer queue latency for some processes (blue blocks appear higher in the vertical area of the chart). The ‘tcptop’ also shows a new process appearing, (wget) with a new TCP connection and a significant data in RX_KB — 10–20 MB/sec (it is, unsurprisingly, receiving lots of data).

Real-time scheduler run queue latency and tcp throughput charts after starting a wget download

Many thanks to the great work by Andreas during his 2018 Google Summer of Code project.

New features

To address these changes, we have introduced the following new features:

We are now able to visualize any combination of host and container metrics from a single view. This allows us to create more complex single-pane dashboards. For example, charting host CPU alongside container CPU, or charting network IO for two communicating instances on the same visualization can be an easy way to get better insights.
Charts are now resizable and movable.
This makes it much easier for engineers to get the graphs they want arranged in a manner for comparisons and to focus in the required areas. In this example, CPU utilisation from two different hosts are arranged so that correlation of spikes — or not — is more clearly visible.

Charts will keep collecting data even when the tab is not visible.
Previously, switching to a different browser tab paused graph data collection and generation. Now, when the tab is not visible, data will still be collected even though rendering is paused.

User experience improvements

Legends have been moved to tool tips.
This frees up some of visual real estate and removes render issues with large legends consuming the chart area, as you can see here:

Graphs can be paused and resumed.
This is helpful for taking screenshots, when comparing data points across charts, or in discussion with other engineers. When Pause is clicked, data collection continues in the background and graph updates are held. Graphs are brought immediately up to date with live data when Play is clicked again.
Here, you can see the Pause button pressed, and the graph pauses until the user hits the Play button.

A pause button has been introduced to help compare, take screenshots, etc.

Styling and widgets
Introduction of new styling and configuration approach. We have introduced the Cividis color scheme for heat maps.

Internal technology refresh

The earlier dashboard core components were running on a tech stack that was — for JavaScript — a relatively old Angular 1.x deployment, with a number of component dependencies. A key dashboard component is no longer maintained; new features we would need to resolve issues were not available without forking the component. Instead of forking and investing in a deprecated stack, we have decided on a stack refresh:

Switch from Angular 1.x to React w/ Semantic UI.
Refresh the build pipeline away from gulp to webpack.
Introduction of Semiotic which is a powerful graph render layer over React and pieces of d3. Along the way we made a bundle of performance improvements to Semiotic for our use case too.

Current state

This code is now available for public consumption and should be making its way to distro packages in their release cycle over time.

As always, your feedback and contributions are very much appreciated. The best way is to reach out to us on Github: http://github.com/netflix/vector, or on Slack: https://vectoross.slack.com/.

Looking forward

Vector continues to serve internal Netflix users. There are a number of user-experience improvements that can be made to the new tooling. In addition to this, the continued shift to containers, with a focus on container startup time and short-lived containers are likely to drive tighter integration with container scheduling tools. We are also excited to see the future of monitoring with more heterogeneous workloads — not only Intel x86, but also ARM, AMD x86, GPUs, FPGAs, etc.

If you’re interested in working on performance challenges and the idea of building quality tools, visualizations appeals to you — please get in touch with us, we’re hiring!

https://jobs.netflix.com — Netflix
https://jobs.netflix.com/jobs/865018 — Senior Performance Engineer

Extending Vector with eBPF to inspect host and container performance was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

How Data Inspires Building a Scalable, Resilient and Secure Cloud Infrastructure At Netflix

March 5, 2019, 8:06 am

≫ Next: Trace Event, Chrome and More Profile Formats on FlameScope

≪ Previous: Extending Vector with eBPF to inspect host and container performance

Netflix’s engineering culture is predicated on Freedom & Responsibility, the idea that everyone (and every team) at Netflix is entrusted with a core responsibility and they are free to operate with freedom to satisfy their mission. This freedom allows teams and individuals to move fast to deliver on innovation and feel responsible for quality and robustness of their delivery. Central engineering teams enable this operational model by reducing the cognitive burden on innovation teams through solutions related to securing, scaling and strengthening (resilience) the infrastructure.

A majority of the Netflix product features are either partially or completely dependent on one of our many micro-services (e.g., the order of the rows on your Netflix home page, issuing content licenses when you click play, finding the Open Connect cache closest to you with the content you requested, and many more). All these micro-services are currently operated in AWS cloud infrastructure.

As a micro-service owner, a Netflix engineer is responsible for its innovation as well as its operation, which includes making sure the service is reliable, secure, efficient and performant. This operational component places some cognitive load on our engineers, requiring them to develop deep understanding of telemetry and alerting systems, capacity provisioning process, security and reliability best practices, and a vast amount of informal knowledge about the cloud infrastructure.

While our engineering teams have and continue to build solutions to lighten this cognitive load (better guardrails, improved tooling, …), data and its derived products are critical elements to understanding, optimizing and abstracting our infrastructure. This is where our data (engineering and science) teams come in: we leverage vast amounts of data produced by our platforms and micro-services to inform and automate decisions related to operating the many components of our cloud infrastructure reliably, securely and efficiently.

In the next section, we will highlight some high level areas of focus in each dimension of our infrastructure. In the last section, we will attempt to feed your curiosity by presenting a set of opportunities that will drive our next wave of impact for Netflix.

In the Security space, our data teams focus almost all our efforts on detecting suspicious or malicious activity using a collection of machine learning and statistical models. Historically, this has been focussed on potentially compromised employee accounts, but efforts are in place to build a more agnostic detection framework that would consider any agent (human or machine). Our data teams also invest in building more transparency around our security and privacy to support progress in reducing threats and hazards faced by our micro-services or internal stakeholders.

In the Reliability space, our data teams focus on two main approaches. The first is on prevention: data teams help with making changes to our environment and its many tenants as safe as possible through contained experiments (e.g., Canaries), detection and improved KPIs. The second approach is on the diagnosis side: data teams measure the impact of outages and expose patterns across their occurrence, as well as provide a connected view of micro-service-level availability.

In the Efficiency space, our data teams focus on transparency and optimization. In Netflix’s Freedom and Responsibility culture, we believe the best approach to efficiency is to give every micro-service owner the right information to help them improve or maintain their own efficiency. Additionally, because our infrastructure is a complex multi-tenant environment, there are also many data-driven efficiency opportunities at the platform level. Finally, provisioning our infrastructure itself is also becoming an increasingly complex task, so our data teams contribute to tools for diagnosis and automation of our cloud capacity management.

In the Performance space, our data teams currently focus on the quality of experience on Netflix-enabled devices. The main motivation is that while the devices themselves have a significant role in overall performance, our network and cloud infrastructure has a non-negligible impact on the responsiveness of devices. There is a continuous push to build improved telemetry and tools to understand and minimize the impact of our infrastructure in the overall performance of Netflix application across a wide range of devices.

In the People space, our data teams contribute to consolidated systems of record on employees, contractors, partners and talent data to help central teams manage headcount planning, reduce acquisition cost, improve hiring practices, and other people analytics related use-cases.

Challenges & Opportunities in the Infra Data Space

Security Events Platform for Anomaly Detection

How can we develop a complex event processing system to ingest semi-structured data predicated on schema contracts from hundreds of sources and transform it into event streams of structured data for downstream analysis?
How can we develop templated detection modules (rules- and ML-based) and data streams to increases speed of development?

See open source project such as StreamAlert and Siddhi to get some general ideas.

Asset Inventory

How can we develop a dimensional data model representing relationships between apps, clusters, regions and other metadata including AMI / software stack to help with availability, resiliency and fleet management?
Can we develop learning models to enrich metadata with application vulnerabilities and risk scores?

Reliability

How can we guarantee that a code change will be safe when rolled out to our production environment?
Can we adjust our auto-scaling policies to be more efficiency without risking our availability during traffic spikes?

Capacity and Efficiency

Which resources (clusters, tables, …) are unused or under-utilized and why?
What will be the cost of rolling out the winning cell of an AB test to all users?

People Analytics

Can we support AB experiments related to recruiting and help improve candidate experience as well as attract solid talent?
Can we measure the impact of Inclusion and Diversity initiatives?

People & Security

How can we build a secure and restricted People Data Vault o provide a consolidated system of reference and allow apps to add additional metadata?
How can we automatically provision or de-provision access privileges?

Data Lineage

Can we develop a generalized lineage system to develop relationships among various data artifacts stored across Netflix data landscape?
Can we leverage this lineage solution to help forecast SLA misses and address Data Lifecycle Management questions (job cost, table cost, and retention)?

This was just a tiny glimpse into our fantastic world of Infrastructure Data Engineering, Science & Analytics. We are on a mission to help scale a world-class data-informed infrastructure and we are just getting started. Give us a holler if you are interested in a thought exchange.

Authors: Sebastien de Larquier (S&A), Jitender Aswani (DEI)

Contributors: Sui Huang (S&A) is partnering on reimagining People initiatives.

Trace Event, Chrome and More Profile Formats on FlameScope

March 6, 2019, 10:01 am

≫ Next: Design Principles for Mathematical Engineering in Experimentation Platform at Netflix

≪ Previous: How Data Inspires Building a Scalable, Resilient and Secure Cloud Infrastructure At Netflix

FlameScope sub-second heatmap visualization.

Less than a year ago, FlameScope was released as a proof of concept for a new profile visualization. Since then, it helped us, and many other users, to easily find and fix performance issues, and allowed us to see patterns that we had never noticed before in our profiles.

As a tool, FlameScope was limited. It only supported the profile format generated by Linux perf, which at the time, was the profiler of choice internally at Netflix.

Immediately after launch, we received multiple requests to support other profile formats. Users looking to use the FlameScope visualization, with their own profilers and tools. Our goal was never to support hundreds of profile formats, especially for tools we don’t use internally, but we always knew that supporting a few “generic” formats would be useful, both for us, and the community.

After receiving multiple requests from users and investigating a few profile formats, we opted to support the Trace Event Format. It is well documented. It is flexible. Multiple tools already use it, and it is the format used by Trace-Viewer, which is the javascript frontend for Chrome’s about:tracing and Android’s systrace tools.

The complete documentation for the format can be found here, but in a nutshell, it consists of an ordered list of events. For now, FlameScope only supports Duration and Complete event types. According to the documentation:

Duration events provide a way to mark a duration of work on a given thread. The duration events are specified by the B and E phase types. The B event must come before the corresponding E event. You can nest the B and E events. This allows you to capture function calling behavior on a thread.

Each complete event logically combines a pair of duration (B and E) events. The complete events are designated by the X phase type.

There is an extra parameter dur to specify the tracing clock duration of complete events in microseconds. All other parameters are the same as in duration events.

Here’s an example:

[
  {

    "name": "Asub",

    "cat": "PERF",

    "ph": "B",

    "pid": 22630,

    "tid": 22630,

    "ts": 829

  }, {

    "name": "Asub",

    "cat": "PERF",

    "ph": "E",

    "pid": 22630,

    "tid": 22630,

    "ts": 833

}
]

As you can imagine, this format works really well for tracing profilers, where the beginning and end of work units are recorded. For sampling based profilers, like perf, the format is not ideal. We could create a Complete event for every sample, with stacks, but even being more efficient than the output generated by perf, there is still a lot of overhead, especially from repeated stacks. Another option would be to analyze the whole profile and create begin and end events every time we enter or exit a stack frame, but that adds complexity to converters.

Since we also work with sampling profilers frequently, we needed a simpler format. In the past, we worked with profiles in the v8 profiler format, which is very similar to Chrome’s old JavaScript CPU profiler format and newer ProfileType event format. We already had all the code needed to generate both heatmap and partial flame graphs, so we decided to use it as base for a new format, which for lack of a more creative name, we called nflxprofile. Different from the v8 profiler format, it uses a map instead of a list to store the nodes, includes extra information about the profile, and takes advantage Protocol Buffers to serialize the data instead of JSON. The .proto file looks like this:

syntax = “proto2”;

package nflxprofile;

message Profile {

  required double start_time = 1;

  required double end_time = 2;

  repeated uint32 samples = 3 [packed=true];

  repeated double time_deltas = 4 [packed=true];

  message Node {

    required string function_name = 1;

    required uint32 hit_count = 2;

    repeated uint32 children = 3;

    optional string libtype = 4;

  map<uint32, Node> nodes = 5;

It can be found on FlameScope’s repository too, and be used to generate code for the multiple programming languages supported by Protocol Buffers.

Netflix has been using the new format internally in its cloud profiling tool, and the improvement is noticeable. The significant reduction in file size, from raw perf output to nflxprofile, allows for faster download time from external storage. The reduction will depend on sampling duration and how homogeneous the workload is (similar stacks), but generally, the output is orders of magnitude smaller. Time spent on parsing and deserialization is also reduced significantly. No more regular expressions!

Since the new format is so similar to what Chrome generates, we decided to include it too! It has been challenging to keep up with the constant changes in DevTools, from CpuProfile, to Profile and now ProfileChunk events, but the format is supported as of now. If you want to try it out, check out the Get Started With Analyzing Runtime Performance post, record and save the profile to FlameScope’s profile directory, and open it!

We also had to make minor adjustments to the user interface, more specifically the file list, to support the new profile formats. Now, instead of a simple list, you will get a dropdown menu next to each file that allows you to select the correct profile type.

We might consider adding support for more formats in the future, or accept pull requests that add support for something new, but in general, profile converters are the simplest solution. If you created a converter for a known profile format, we are happy to link to it on FlameScope’s documentation!

FlameScope was developed by Martin Spier, Brendan Gregg and the Netflix performance engineering team. Blog post by Martin Spier.

Trace Event, Chrome and More Profile Formats on FlameScope was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Design Principles for Mathematical Engineering in Experimentation Platform at Netflix

March 7, 2019, 12:26 pm

≫ Next: MezzFS — Mounting object storage in Netflix’s media processing platform

≪ Previous: Trace Event, Chrome and More Profile Formats on FlameScope

Jeffrey Wong, Senior Modeling Architect, Experimentation Platform
Colin McFarland, Director, Experimentation Platform

At Netflix, we have data scientists coming from many backgrounds such as neuroscience, statistics and biostatistics, economics, and physics; each of these backgrounds has a meaningful contribution to how experiments should be analyzed. To unlock these innovations we are making a strategic choice that our focus should be geared towards developing the surrounding infrastructure so that scientists’ work can be easily absorbed into the wider Netflix Experimentation Platform. There are 2 major challenges to succeed in our mission:

We want to democratize the platform and create a contribution model: with a developer and production deployment experience that is designed for data scientists and friendly to the stacks they use.
We have to do it at Netflix’s scale: For hundreds of millions of users across hundreds of concurrent tests, spanning many deployment strategies from traditional A/B experiments, to evolving areas like quasi experiments.

Mathematical engineers at Netflix in particular work on the scalability and engineering of models that estimate treatment effects. They develop scientific libraries that scientists can apply to analyze experiments, and also contribute to the engineering foundations to build a scientific platform where new research can graduate to. In order to produce software that improves a scientist’s productivity we have come up with the following design principles.

1. Composition

Data Science is a curiosity driven field, and should not be unnecessarily constrained[1]. We support data scientists to have freedom to explore research in any new direction. To help, we provide software autonomy for data scientists by focusing on composition, a design principle popular in data science software like ggplot2 and dplyr[2]. Composition exposes a set of fundamental building blocks that can be assembled in various combinations to solve complex problems. For example, ggplot2 provides several lightweight functions like geom_bar, geom_point, geom_line, and theme, that allow the user to assemble custom visualizations; every graph whether simple or complex can be composed of small, lightweight ggplot2 primitives.

In the democratization of the experimentation platform we also want to allow custom analysis. Since converting every experiment analysis into its own function for the experimentation platform is not scalable, we are making the strategic bet to invest in building high quality causal inference primitives that can be composed into an arbitrarily complex analysis. The primitives include a grammar for describing the data generating process, generic counterfactual simulations, regression, bootstrapping, and more.

2. Performance

If our software is not performant it could limit adoption, subsequent innovation, and business impact. This will also make graduating new research into the experimentation platform difficult. Performance can be tackled from at least three angles:

A) Efficient computation

We should leverage the structure of the data and of the problem as much as possible to identify the optimal compute strategy. For example, if we want to fit ridge regression with various different regularization strengths we can do an SVD upfront and express the full solution path very efficiently in terms of the SVD.

B) Efficient use of memory

We should optimize for sparse linear algebra. When there are many linear algebra operations, we should understand them holistically so that we can optimize the order of operations and not materialize unnecessary intermediate matrices. When indexing into vectors and matrices, we should index contiguous blocks as much as possible to improve spatial locality[3].

C) Compression

Algorithms should be able to work on raw data as well as compressed data. For example, regression adjustment algorithms should be able to use frequency weights, analytic weights, and probability weights[4]. Compression algorithms can be lossless, or lossy with a tuning parameter to control the loss of information and impact on the standard error of the treatment effect.

3. Graduation

We need a process for graduating new research into the experimentation platform. The end to end data science cycle usually starts with a data scientist writing a script to do a new analysis. If the script is used several times it is rewritten into a function and moved into the Analysis Library. If performance is a concern, it can be refactored to build on top of high performance causal inference primitives made by mathematical engineers. This is the first phase of graduation.

The first phase will have a lot of iterations. The iterations go in both directions: data scientists can promote functions into the library, but they can also use functions from the library in their analysis scripts.

The second phase interfaces the Analysis Library with the rest of the experimentation ecosystem. This is the promotion of the library into the Statistics Backend, and negotiating engineering contracts for input into the Statistics Backend and output from the Statistics Backend. This can be done in an experimental notebook environment, where data scientists can demonstrate end to end what their new work will look like in the platform. This enables them to have conversations with stakeholders and other partners, and get feedback on how useful the new features are. Once the concepts have been proven in the experimental environment, the new research can graduate into the production experimentation platform. Now we can expose the innovation to a large audience of data scientists, engineers and product managers at Netflix.

4. Reproducibility

Reproducibility builds trustworthiness, transparency, and understanding for the platform. Developers should be able to reproduce an experiment analysis report outside of the platform using only the backend libraries. The ability to replicate, as well as rerun the analysis programmatically with different parameters is crucial for agility.

5. Introspection

In order to get data scientists involved with the production ecosystem, whether for debugging or innovation, they need to be able to step through the functions the platform is calling. This level of interaction goes beyond reproducibility. Introspectable code allows data scientists to check data, the inputs into models, the outputs, and the treatment effect. It also allows them to see where the opportunities are to insert new code. To make this easy we need to understand the steps of the analysis, and expose functions to see intermediate steps. For example we could break down the analysis of an experiment as

Compose data query
Retrieve data
Preprocess data
Fit treatment effect model
Use treatment effect model to estimate various treatment effects and variances
Post process treatment effects, for example with multiple hypothesis correction
Serialize analysis results to send back to the Experimentation Platform

It is difficult for a data scientist to step through the online analysis code. Our path to introspectability is to power the analysis engine using python and R, a stack that is easy for a data scientist to step through. By making the analysis engine a python and R library we will also gain reproducibility.

6. Scientific Code in Production and in Offline Environments

In the causal inference domain data scientists tend to write code in python and R. We intentionally are not rewriting scientific functions into a new language like Java, because that will render the library useless for data scientists since they cannot integrate optimized functions back into their work. Rewriting poses reproducibility challenges since the python/R stack would need to match the Java stack. Introspection is also more difficult because the production code requires a separate development environment.

We choose to develop high performance scientific primitives in C++, which can easily be wrapped into both python and R, and also delivers on highly performant, production quality scientific code. In order to support the diversity of the data science teams and offer first class support for hybrid stacks like python and R, we standardize data on the Apache Arrow format in order to facilitate data exchange to different statistics languages with minimal overhead.

7. Well Defined Point of Entry, Well Defined Point of Exit

Our causal inference primitives are developed in a pure, scientific library, without business logic. For example, regression can be written to accept a feature matrix and a response vector, without any specific experimentation data structures. This makes the library portable, and allows data scientists to write extensions that can reuse the highly performant statistics functions for their own adhoc analysis. It is also portable enough for other teams to share.

Since these scientific libraries are decoupled from business logic, they will always be sandwiched in any engineering platform; upstream will have a data layer, and downstream will have a visualization and interpretation layer. To facilitate a smooth data flow, we need to design simple connectors. For example, all analyses need to receive data and a description of the data generating process. By focusing on composition, an arbitrary analysis can be constructed by layering causal analysis primitives on top of that starting point. Similarly, the end of an analysis will always consolidate into one data structure. This simplifies the workflow for downstream consumers so that they know what data type to consume.

Next Steps

We are actively developing high performance software for regression, heterogeneous treatment effects, longitudinal studies and much more for the Experimentation Platform at Netflix. We aim to accelerate research in causal inference methodology, expedite product innovation, and ultimately bring the best experience and delight to our members. This is an ongoing journey, and if you are passionate about our exciting work, join our all-star team!

Design Principles for Mathematical Engineering in Experimentation Platform at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

MezzFS — Mounting object storage in Netflix’s media processing platform

March 6, 2019, 8:10 am

≫ Next: Spinnaker Sets Sail to the Continuous Delivery Foundation

≪ Previous: Design Principles for Mathematical Engineering in Experimentation Platform at Netflix

MezzFS — Mounting object storage in Netflix’s media processing platform

By Barak Alon (on behalf of Netflix’s Media Cloud Engineering team)

MezzFS (short for “Mezzanine File System”) is a tool we’ve developed at Netflix that mounts cloud objects as local files via FUSE. It’s used extensively in our media processing platform, which includes services like Archer and runs features like video encoding and title image generation on tens of thousands of Amazon EC2 instances. There are similar tools out there, but we’ve developed some unique features like “replays” and “adaptive buffering” that we think are worth sharing.

What problem are we solving?

We are constantly innovating on video encoding technology at Netflix, and we have a lot of content to encode. Video encoding is what MezzFS was originally designed for and remains one of its canonical use cases, so we’ll focus on video encoding to describe the problem that MezzFS solves.

Video encoding is the process of converting an uncompressed video into a compressed format defined by a codec, and it’s an essential part of preparing content to be streamed on Netflix. A single movie at Netflix might be encoded dozens of times for different codecs and video resolutions. Encoding is not a one-time process — large portions of the entire Netflix catalog are re-encoded whenever we’ve made significant advancements in encoding technology.

We scale out video encoding by processing segments of an uncompressed video (we segment movies by scene) in parallel. We have one file — the original, raw movie file — and many worker processes, all encoding different segments of the file. That file is stored in our object storage service, which splits and encrypts the file into separate chunks, storing the chunks in Amazon S3. This object storage service also handles content security, auditing, disaster recovery, and more.

The individual video encoders process their segments of the movie with tools like FFmpeg, which doesn’t speak our object storage service’s API and expects to deal with a file on the local filesystem. Furthermore, the movie file is very large (often several 100s of GB), and we want to avoid downloading the entire file for each individual video encoder that might be processing only a small segment of the whole movie.

This is just one of many use cases that MezzFS supports, but all the use cases share a similar theme: stream the right bits of a remote object efficiently and expose those bits as a file on the filesystem.

The solution: MezzFS

MezzFS is a Python application that implements the FUSE interface. It’s built as a Debian package and installed by applications running on our media processing platform, which use MezzFS’s command line interface to mount remote objects as local files.

MezzFS has a number of features, including:

Stream objects — MezzFS exposes multi-terabyte objects without requiring any disk space.
Assemble and decrypt parts — Our object storage service splits objects into many parts and stores them in S3. MezzFS knows how to assemble and decrypt the parts.
Mount multiple objects — Multiple cloud objects can be mounted on the local filesystem simultaneously.
Disk Caching — MezzFS can be configured to cache objects on the local disk.
Mount ranges of objects — Arbitrary ranges of a cloud object can be mounted as separate files on the local file system. This is particularly useful in media computing, where it is common to mount the frames of a movie scene as separate files.
Regional caching — Netflix operates in multiple AWS regions. If an application in region A is using MezzFS to read from an object stored in region B, MezzFS will cache the object in region A. In addition to improving download speed, this is useful for cutting down on cross-region transfer costs when many workers will be processing the same data — we only pay the transfer costs for one worker, and the rest use the cached object.
Replays — More on this below…
Adaptive buffering — More on this below…

We’ve been using MezzFS in production for 5 years, and have validated it at scale — during a typical week at Netflix, MezzFS performs ~100 million mounts for dozens of different use cases and streams about ~25 petabytes of data.

MezzFS “replays”

MezzFS has become a crucial tool for us, and we don’t just send it out into the wild with a packed lunch and hope it will be fine.

MezzFS collects metrics on data throughput, download efficiency, resource usage, etc. in Atlas, Netflix’s in-memory dimensional time series database. Its logs are collected in an ELK stack. But one of the more novel tools we’ve developed for debugging and developing is the MezzFS “replay”.

At mount time, MezzFS can be configured to record a “replay” file. This file includes:

Metadata — This includes: the remote objects that were mounted, the environment in which MezzFS is running, etc.
File operations — All “open” and “read” operations. That is, all mounted files that were opened and every single byte range read that MezzFS received.
Actions — MezzFS records everything it buffers and everything it caches
Statistics — Finally, MezzFS will record various statistics about the mount, including: total bytes downloaded, total bytes read, total time spent reading, etc.

A single replay may include million of file operations, so these files are packed in a custom binary format to minimize their footprint.

Based on these replay files, we’ve built tools that:

Visualize a replay

This has proven very useful for quickly gaining insight into data access patterns and why they might be causing performance issues.

Here’s a GIF of what these visualization look like:

The bytes of a remote object are represented by pixels on the screen, where the top left is the start of the remote object and the bottom right is the end. The different colors mean different things — green means the bytes have been scheduled for downloading, yellow means the bytes are being actively downloaded, blue means the bytes have been successfully returned, etc. What we see in the above visualization is a very simple access pattern — a remote object is mounted and then streamed through sequentially.

Here is a more interesting, “sparse” access pattern, and one that inspired “adaptive buffering” described later in this post. We can see lots of little green bars quickly sprinkle the screen — these bars represent the bytes that were downloaded:

Visualization of a sparse MezzFS “replay”

Rerun a replay

We mount the same objects and rerun all of the operations recorded in the replay file. We use this to debug errors and performance issues in specific mounts.

Rerun a batch of replays

We collect replays from actual MezzFS mounts in production, and we rerun large batches of replays for regression and performance tests. We’ve integrated these tests into our build pipeline, where a build will fail if there are any errors across the gamut of replays or if the performance of a new MezzFS commit falls below some threshold. We parallelize rerun jobs with Titus, Netflix’s container management platform, which allows us to exercise many hundreds of replay files in minutes. The results are aggregated in Elasticsearch, allowing us to quickly analyze MezzFS’s performance across the entire batch.

Adaptive Buffering

These replays have proven essential for developing optimizations like “adaptive buffering”.

One of the challenges of efficiently streaming bits in a FUSE system is that the kernel will break reads into chunks. This means that if an application reads, for example, 1 GB from a mounted file, MezzFS might receive that as 16,384 consecutive reads of 64KB. Making 16,384 separate HTTP calls to S3 for 64KB will suffer significant overhead, so it’s better to “read ahead” larger chunks of data from S3, speeding up subsequent reads by anticipating that the data will be read sequentially. We call the size of the chunks being read ahead the “buffer size”.

While large buffer sizes speed up sequential data access, they can slow down “sparse” data access — that is, the application is not reading through the file consecutively, but is reading small segments dispersed throughout the file (as shown in the visualization above). In this scenario, most of the buffered data isn’t actually going to be used, leading to a lot of unnecessary downloading and very slow reads.

One option is to expect applications to specify a buffer size when mounting with MezzFS. This is not always easy for application developers to do, since applications might be using third party tools and developers might not actually know their access pattern. It gets even messier when an application changes access patterns during a single MezzFS mount.

With “adaptive buffering,” we aimed to make MezzFS “just work” for a variety of access patterns, without requiring application developers to maintain MezzFS configuration.

How it works

MezzFS records a sliding window of the most recent reads. When it receives a read for data that has not already been buffered, it calculates an appropriate buffer size. It does this by first grouping the window of reads into “clusters”, where a cluster is a contiguous set of reads.

Here’s an illustration of how reads relate to clusters:

If the average number of bytes per read divided by the average number of bytes per cluster is close to 1, we classify the access pattern as “sparse”. In the “sparse” case, we try to match the buffer size to the average number of bytes per read. If number is closer to 0, we classify the access pattern as “dense”, and we set the buffer size to the maximum allowed buffer size divided by the number of clusters (We divide by the number of clusters to account for a common case when an application might have multiple threads all reading different parts from the same file, but each thread is reading its part “densely.” If we used the maximum allowed buffer size for each thread, our buffers would consume too much memory).

Here’s an attempt to represent this logic with some pseudo code:

There is a limit on the throughput you can get out of a single HTTP connection to S3. So when the calculated buffer size is large, we divide the buffer into separate requests and parallelize them across multiple threads. So for “sparse” access patterns we improve performance by choosing a small buffer size, and for “dense” access patterns we improve performance by buffering lots of data in parallel.

How much faster is this?

We’ve been using adaptive buffering in production across a number of different use cases. For the purpose of clarity in demonstration, we used the “rerun a batch of replays” technique described above to run a quick and dirty test comparing the old buffering technique against the new.

Two replay files that represent two canonical access patterns were used:

Dense/Sequential — Sequentially read 1GB from a remote object.
Sparse/Random — Read 32MB in chunks of 64KB, dispersed randomly throughout a remote object.

And we compared two buffering strategies:

Fixed Sized Buffering— This is the old technique, where the buffer size is fixed at 8MB (we chose 8MB as a “one-size-fits-all” buffer size after running some experiments across MezzFS use cases at the time).
Adaptive Buffering— The shiny new technique described above.

We ran each combination of replay file and buffering strategy 10 times each inside containers with 2 Gbps network and 16 CPUs, recording the total time to process all the operations in the replay files. The following table represents the minimum of all 10 runs (while mean and standard deviation often seem like good aggregations, we use minimum here because variability is often caused by other processes interrupting MezzFS, or variability in network conditions outside of MezzFS’s control).

Looking at the dense/sequential replay, fixed buffering has a throughput of ~0.5 Gbps, while adaptive buffering has a throughput of ~1.1Gbps.

While a handful of seconds might not seem worth all the trouble, these seconds become hours for many of our use cases that stream significantly more bytes. And shaving off hours is especially beneficial in latency sensitive workflows, like encoding videos that are released on Netflix the day they are shot.

Conclusion

MezzFS has become a core part of our media transformation and innovation platform. We’ve built some pretty fancy tools around it that we’re actively using to quickly and confidently develop new features and optimizations.

The next big feature on our roadmap is support for writes, which has exciting potential for our next generation media processing platform and our growing, global network of movie production studios.

Netflix’s media processing platform is maintained by the Media Cloud Engineering (MCE) team. If you’re excited about large-scale distributed computing problems in media processing, we’re hiring!

MezzFS — Mounting object storage in Netflix’s media processing platform was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Spinnaker Sets Sail to the Continuous Delivery Foundation

March 12, 2019, 9:16 am

≫ Next: Netflix Public Bug Bounty, 1 year later

≪ Previous: MezzFS — Mounting object storage in Netflix’s media processing platform

Author: Andy Glover

Since releasing Spinnaker to the open source community in 2015, the platform has flourished with the addition of new cloud providers, triggers, pipeline stages, and much more. Myriad new features, improvements, and innovations have been added by an ever growing, actively engaged community. Each new innovation has been a step towards an even better Continuous Delivery platform that facilitates rapid, reliable, safe delivery of flexible assets to pluggable deployment targets.

Over the last year, Netflix has improved overall management of Spinnaker by enhancing community engagement and transparency. At the Spinnaker Summit in 2018, we announced that we had adopted a formalized project governance plan with Google. Moreover, we also realized that we’ll need to share the responsibility of Spinnaker’s direction as well as yield a level of long-term strategic influence over the project so as to maintain a healthy, engaged community. This means enabling more parties outside of Netflix and Google to have a say in the direction and implementation of Spinnaker.

A strong, healthy, committed community benefits everyone; however, open source projects rarely reach this critical mass. It’s clear Spinnaker has reached this special stage in its evolution; accordingly, we are thrilled to announce two exciting developments.

First, Netflix and Google are jointly donating Spinnaker to the newly created Continuous Delivery Foundation (or CDF), which is part of the Linux Foundation. The CDF is a neutral organization that will grow and sustain an open continuous delivery ecosystem, much like the Cloud Native Computing Foundation (or CNCF) has done for the cloud native computing ecosystem. The initial set of projects to be donated to the CDF are Jenkins, Jenkins X, Spinnaker, and Tekton. Second, Netflix is joining as a founding member of the CDF. Continuous Delivery powers innovation at Netflix and working with other leading practitioners to promote Continuous Delivery through specifications is an exciting opportunity to join forces and bring the benefits of rapid, reliable, and safe delivery to an even larger community.

Spinnaker’s success is in large part due to the amazing community of companies and people that use it and contribute to it. Donating Spinnaker to the CDF will strengthen this community. This move will encourage contributions and investments from additional companies who are undoubtedly waiting on the sidelines. Opening the doors to new companies increases the innovations we’ll see in Spinnaker, which benefits everyone.

Donating Spinnaker to the CDF doesn’t change Netflix’s commitment to Spinnaker, and what’s more, current users of Spinnaker are unaffected by this change. Spinnaker’s previously defined governance policy remains in place. Overtime, new stakeholders will emerge and play a larger, more formal role in shaping Spinnaker’s future. The prospects of an even healthier and more engaged community focused on Spinnaker and the manifold benefits of Continuous Delivery is tremendously exciting and we’re looking forward to seeing it continue to flourish.

Spinnaker Sets Sail to the Continuous Delivery Foundation was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Netflix Public Bug Bounty, 1 year later

March 21, 2019, 9:01 am

≫ Next: Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

≪ Previous: Spinnaker Sets Sail to the Continuous Delivery Foundation

by Astha Singhal (Netflix Application Security)

As Netflix continues to create entertainment people love, the security team continues to keep our members, partners, and employees secure. The security research community has partnered with us to improve the security of the Netflix service for the past few years through our responsible disclosure and bug bounty programs. A year ago, we launched our public bug bounty program to strengthen this partnership and enable researchers across the world to more easily participate.

When we decided to go public with our bug bounty, we revamped our program terms to bring even more targets (including Netflix streaming mobile apps) in scope and set clearer guidelines for researchers that participate in our program. We have always tried to prioritize a good researcher experience in our program to keep the community engaged. For example, we maintain an average triage time of less than 48 hours for issues of all severity. Since the public launch, we have engaged with 657 researchers from around the world. We have collectively rewarded over $100,000 for over 100 valid bugs in that time.

We wanted to share a few interesting submissions that we have received over the last year:

We choose to focus our security resources on applications deployed via our infrastructure paved road to be able to scale our services. The bug bounty has been great at shining a light on the parts of our environment that may not be on the paved road. A researcher found an application that ran on an older Windows server that was deployed in a non-standard way making it difficult for our automated visibility services to detect it. The system had significant issues that we were grateful to hear about so we could retire the system.
We received a report from a researcher that found a service they didn’t believe should be available on the internet. The initial finding seemed like a low severity issue, but the researcher asked for permission to continue to explore the endpoint to look for more significant issues. We love it when researchers reach out to coordinate testing on an issue. We looked at the endpoint and determined that there was no risk of access to Netflix internal data, so we allowed the researcher to continue testing. After further testing, the researcher found a remote code execution that we then rewarded and remediated with a higher priority.
We always perform a full impact analysis to make sure we reward the researcher based on the actual impact vs demonstrated impact of any issue. In one example of this, a researcher found an endpoint that allowed access to an internal resource if the attacker knew a randomly generated identifier. The researcher had found the identifier via another endpoint, but had not explicitly linked the two findings. Given the potential impact, we asked the researcher to stop further testing. As we tested the first endpoint ourselves, we discovered this additional impact and issued a reward in accordance with the higher priority.

Over the past year, we have received various high quality submissions from researchers, and we want to continue to engage with them to improve Netflix security. Todayisnew has been the highest earning researcher in our program over the last year. We recently revisited our overall reward ranges to make sure we are competitive with the market for our risk profile. In 2019, we also started publishing quarterly program updates to highlight new product areas for testing. Our goal is to keep the Netflix program an interesting and fresh target for bug bounty researchers.

Going into next year, our goal is to maintain the quality of the researcher experience in our program. We are also thinking about how to extend our bug bounty coverage to our studio app ecosystem. Last year, we conducted a bug bash specifically for some of our studio apps with researchers across the world. We found some significant issues through that and are exploring extending our program to some of our studio production apps in 2019. We thank all the researchers that have engaged in our program and look forward to continued collaboration with them to secure Netflix.

Netflix Public Bug Bounty, 1 year later was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

March 25, 2019, 7:01 am

≫ Next: Introducing SVT-AV1: a scalable open-source AV1 framework

≪ Previous: Netflix Public Bug Bounty, 1 year later

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and Efficiency

By: Di Lin, Girish Lingappa, Jitender Aswani

Imagine yourself in the role of a data-inspired decision maker staring at a metric on a dashboard about to make a critical business decision but pausing to ask a question — “Can I run a check myself to understand what data is behind this metric?”

Now, imagine yourself in the role of a software engineer responsible for a micro-service which publishes data consumed by few critical customer facing services (e.g. billing). You are about to make structural changes to the data and want to know who and what downstream to your service will be impacted.

Finally, imagine yourself in the role of a data platform reliability engineer tasked with providing advanced lead time to data pipeline (ETL) owners by proactively identifying issues upstream to their ETL jobs. You are designing a learning system to forecast Service Level Agreement (SLA) violations and would want to factor in all upstream dependencies and corresponding historical states.

At Netflix, user stories centered on understanding data dependencies shared above and countless more in Detection & Data Cleansing, Retention & Data Efficiency, Data Integrity, Cost Attribution, and Platform Reliability subject areas inspired Data Engineering and Infrastructure (DEI) team to envision a comprehensive data lineage system and embark on a development journey a few years ago. We adopted the following mission statement to guide our investments:

“Provide a complete and accurate data lineage system enabling decision-makers to win moments of truth.”

In the rest of this blog, we will a) touch on the complexity of Netflix cloud landscape, b) discuss lineage design goals, ingestion architecture and the corresponding data model, c) share the challenges we faced and the learnings we picked up along the way, and d) close it out with “what’s next” on this journey.

Netflix Data Landscape

Freedom & Responsibility (F&R) is the lynchpin of Netflix’s culture empowering teams to move fast to deliver on innovation and operate with freedom to satisfy their mission. Central engineering teams provide paved paths (secure, vetted and supported options) and guard rails to help reduce variance in choices available for tools and technologies to support the development of scalable technical architectures. Nonetheless, Netflix data landscape (see below) is complex and many teams collaborate effectively for sharing the responsibility of our data system management. Therefore, building a complete and accurate data lineage system to map out all the data-artifacts (including in-motion and at-rest data repositories, Kafka topics, apps, reports and dashboards, interactive and ad-hoc analysis queries, ML and experimentation models) is a monumental task and requires a scalable architecture, robust design, a strong engineering team and above all, amazing cross-functional collaboration.

Design Goals

At the project inception stage, we defined a set of design goals to help guide the architecture and development work for data lineage to deliver a complete, accurate, reliable and scalable lineage system mapping Netflix’s diverse data landscape. Let’s review a few of these principles:

Ensure data integrity — Accurately capture the relationship in data from disparate data sources to establish trust with users because without absolute trust lineage data may do more harm than good.
Enable seamless integration — Design the system to integrate with a growing list of data tools and platforms including the ones that do not have the built-in meta-data instrumentation to derive data lineage from.
Design a flexible data model — Represent a wide range of data artifacts and relationships among them using a generic data model to enable a wide variety of business use cases.

Ingestion-at-scale

The data movement at Netflix does not necessarily follow a single paved path since engineers have the freedom to choose (and the responsibility to manage) the best available data tools and platforms to achieve their business goals. As a result, a single consolidated and centralized source of truth does not exist that can be leveraged to derive data lineage truth. Therefore, the ingestion approach for data lineage is designed to work with many disparate data sources.

Our data ingestion approach, in a nutshell, is classified broadly into two buckets — push or pull. Today, we are operating using a pull-heavy model. In this model, we scan system logs and metadata generated by various compute engines to collect corresponding lineage data. For example, we leverage inviso to list pig jobs and then lipstick to fetch tables and columns from these pig scripts. For spark compute engine, we leverage spark plan information and for Snowflake, admin tables capture the same information. In addition, we derive lineage information from scheduled ETL jobs by extracting workflow definitions and runtime metadata using Meson scheduler APIs.

In the push model paradigm, various platform tools such as the data transportation layer, reporting tools, and Presto will publish lineage events to a set of lineage related Kafka topics, therefore, making data ingestion relatively easy to scale improving scalability for the data lineage system.

Data Enrichment

The lineage data, when enriched with entity metadata and associated relationships, become more valuable to deliver on a rich set of business cases. We leverage Metacat data, our internal metadata store and service, to enrich lineage data with additional table metadata. We also leverage metadata from another internal tool, Genie, internal job and resource manager, to add job metadata (such as job owner, cluster, scheduler metadata) on lineage data. The ingestions (ETL) pipelines transform enriched datasets to a common data model (design based on a graph structure stored as vertices and edges) to serve lineage use cases. The lineage data along with the enriched information is accessed through many interfaces using SQL against the warehouse and Gremlin and a REST Lineage Service against a graph database populated from the lineage data discussed earlier in this paragraph.

Challenges

We faced a diverse set of challenges spread across many layers in the system. Netflix’s diverse data landscape made it challenging to capture all the right data and conforming it to a common data model. In addition, the ingestion layer designed to address several ingestions patterns added to operational complexity. Spark is the primary big-data compute engine at Netflix and with pretty much every upgrade in Spark, the spark plan changed as well springing continuous and unexpected surprises for us.

We defined a generic data model to store lineage information and now conforming the entity and associated relationships from various data sources to this data model. We are loading the lineage data to a graph database to enable seamless integration with a REST data lineage service to address business use cases. To improve data accuracy, we decided to leverage AWS S3 access logs to identify entity relationships not been captured by our traditional ingestion process.

We are continuing to address the ingestion challenges by adopting a system level instrumentation approach for spark, other compute engines, and data transport tools. We are designing a CRUD layer and exposing it as REST APIs to make it easier for anyone to publish lineage data to our pipelines.

We are taking a mature and comprehensive data lineage system and now extending its coverage far beyond traditional data warehouse realms with a goal to build universal data lineage to represent all the data artifacts and corresponding relationships. We are tackling a bunch of very interesting known unknowns with exciting initiatives in the field of data catalog and asset inventory. Mapping micro-services interactions, entities from real time infrastructure, and ML infrastructure and other non traditional data stores are few such examples.

Lineage Architecture and Data Model

As illustrated in the diagram above, various systems have their own independent data ingestion process in place leading to many different data models that store entities and relationships data at varying granularities. This data needed to be stitched together to accurately and comprehensively describe the Netflix data landscape and required a set of conformance processes before delivering the data for a wider audience.

During the conformance process, the data collected from different sources is transformed to make sure that all entities in our data flow, such as tables, jobs, reports, etc. are described in a consistent format, and stored in a generic data model for further usage.

Based on a standard data model at the entity level, we built a generic relationship model that describes the dependencies between any pair of entities. Using this approach, we are able to build a unified data model and the repository to deliver the right leverage to enable multiple use cases such as data discovery, SLA service and Data Efficiency.

Current Use Cases

Big Data Portal, a visual interface to data management at Netflix, has been the primary consumer of lineage data thus far. Many features benefit from lineage data including ranking of search results, table column usage for downstream jobs, deriving upstream dependencies in workflows, and building visibility of jobs writing to downstream tables.

Our most recent focus has been on powering (a) a data lineage service (REST based) leveraged by SLA service and (b) the data efficiency (to support data lifecycle management) use cases. SLA service relies on the job dependencies defined in ETL workflows to alert on potential SLA misses. This service also proactively alerts on any potential delays in few critical reports due to any job delays or failures anywhere upstream to it.

The data efficiency use cases leverages visibility on entities and their relationships to drive cost and attribution gains, auto cleansing of unused data in the warehouse .

What’s next?

Our journey on extending the value of lineage data to new frontiers has just begun and we have a long way to go in achieving the overarching goal of providing universal data lineage representing all entities and corresponding relationships for all data at Netflix. In the short to medium term, we are planning to onboard more data platforms and leverage graph database and a lineage REST service and GraphQL interface to enable more business use cases including improving developer productivity. We also plan to increase our investment in data detection initiatives by integrating lineage data and detection tools to efficiently scan our data to further improve data hygiene.

Please share your experience by adding your comments below and stay tuned for more on data lineage at Netflix in the follow up blog posts. .

Credits: We want to extend our sincere thanks to many esteemed colleagues from the data platform (part of Data Engineering and Infrastructure org) team at Netflix who pioneered this topic before us and who continue to extend our thinking on this topic with their valuable insights and are building many useful services on lineage data.

We will be at Strata San Francisco on March 27th in room 2001 delivering a tech session on this topic, please join us and share your experiences.

If you would like to be part of our small, impactful, and collaborative team — come join us.

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and… was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Introducing SVT-AV1: a scalable open-source AV1 framework

April 17, 2019, 8:31 am

≫ Next: Python at Netflix

≪ Previous: Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

by Andrey Norkin, Joel Sole, Kyle Swanson, Mariana Afonso, Anush Moorthy, Anne Aaron

Netflix Headquarters, Winchester Circle.

Netflix headquarters circa 2014. It’s a nice building with good architecture! This was the primary home of Netflix for a number of years during the company’s growth, but at some point Netflix had outgrown its home and needed more space. One approach to solve this problem would have been to extend the building by attaching new rooms, hallways, and rebuilding the older ones. However, a more scalable approach would be to begin with a new foundation and begin a new building. Below you can see the new Netflix headquarters in Los Gatos, California. The facilities are modern, spacious and scalable. The new campus started with two buildings, connected together, and was further extended with more buildings when more space was needed. What does this example have to do with software development and video encoding? When you are building an encoder, sometimes you need to start with a clean slate too.

What is SVT-AV1?

Intel and Netflix announced their collaboration on a software video encoder implementation called SVT-AV1 on April 8, 2019. Scalable Video Technology (SVT) is Intel’s open source framework that provides high-performance software video encoding libraries for developers of visual cloud technologies. In this tech blog, we describe the relevance of this partnership to the industry and cover some of our own experiences so far. We also describe how you can become a part of this development.

A brief look into the history of video standards

Historically, video compression standards have been developed by two international standardization organizations, ITU-T and MPEG (ISO). The first successful digital video standard was MPEG-2, which truly enabled digital transmission of video. The success was repeated by H.264/AVC, currently, the most ubiquitous video compression standard supported by modern devices, often in hardware. On the other hand, there are examples of video codecs developed by companies, such as Microsoft’s VC-1 and Google’s VPx codecs. The advantage of adopting a video compression standard is interoperability. The standard specification describes in minute detail how a video bitstream should be processed in order to produce displayable video frames. This allows device manufacturers to independently work on their decoder implementations. When content providers encode their video according to the standard, this guarantees that all compliant devices are able to decode and display the video.

Recently, the adoption of the newest video codec standardized by ITU-T and ISO has been slow in light of widespread licensing uncertainty. A group of companies formed the Alliance for Open Media (AOM) with the goal of creating a modern, royalty-free video codec that would be widely adopted and supported by a plethora of devices. The AOM board currently includes Amazon, Apple, ARM, Cisco, Facebook, Google, IBM, Intel, Microsoft, Mozilla, Netflix, Nvidia, and Samsung, and many companies joined as promoter members. In 2018, AOM has published a specification for the AV1 video codec.

Decoder specification is frozen, encoder being improved for years

As mentioned earlier, a standard specifies how the compressed bitstream is to be interpreted to produce displayable video, which means that encoders can vary in their characteristics, such as computational performance and achievable quality for a given bitrate. The encoder can typically be improved years after the standard has been frozen including varying speed and quality trade-offs. An example of such development is the x264 encoder that has been improving years after the H.264 standard was finalized.

To develop a conformant decoder, the standard specification should be sufficient. However, to guide codec implementers, the standardization committee also issues reference software, which includes a compliant decoder and encoder. Reference software serves as the basis for standard development, a framework, in which the performance of video coding tools is evaluated. The reference software typically evolves along with the development of the standard. In addition, when standardization is completed, the reference software can help to kickstart implementations of compliant decoders and encoders.

AOM has produced the reference software for AV1, which is called libaom and is available online. The libaom was built upon the codebase from VP9, VP8, and previous generations of VPx video codecs. During the AV1 development, the software was further developed by the AOM video codec group.

Netflix interest in SVT-AV1

Reference software typically focuses on the best possible compression at the expense of encoding speed. It is well known that encoding time of reference software for modern video codecs can be rather long.

One of Intel’s goals with SVT-AV1 development was to create a production-grade AV1 encoder that offers performance and scalability. SVT-AV1 uses parallelization at several stages of the encoding process, which allows it to adapt to the number of available cores including newest servers with significant core count. This makes it possible for SVT-AV1 to decrease encoding time while still maintaining compression efficiency.

In August 2018, Netflix’s Video Algorithms team and Intel’s Visual Cloud team decided to join forces on SVT-AV1 development. Since that time, Intel’s and Netflix’s teams closely collaborated on SVT-AV1 development, discussing architectural decisions, implementing new tools, and improving the compression efficiency. Netflix’s main interest in SVT-AV1 was somewhat different and complementary to Intel’s intention of building a production-grade highly scalable encoder.

At Netflix, we believe that the AV1 ecosystem would benefit from an alternative clean and efficient open-source encoder implementation. There exists at least one other alternative open-source AV1 encoder, rav1e. However, rav1e is written in Rust programming language, whereas an encoder written in C has a much broader base of potential developers. The open-source encoder should also enable easy experimentation and a platform for testing new coding tools. Consequently, our requirements to the AV1 software are as follows:

Easy to understand code with a low entry barrier and a test framework
Competitive compression efficiency on par with the reference implementation
Complete toolset and a decoder implementation sharing common code with the encoder, which simplifies experiments on new coding tools
Decreased encoder runtime that enables quicker turn-around when testing new ideas

We believe that if SVT-AV1 is aligned with these characteristics, it can be used as a platform for future video coding standards development, such as the research and development efforts towards the AV2 video codec, and improved AV1 encoding.

Thus, Netflix and Intel approach SVT-AV1 with complementary goals. The encoder speed helps innovation, as it is faster to run experiments. Cleanliness of the code helps adoption in the open-source community, which is crucial for the success of an open-source project. It can be argued that extensive parallelization may have compression efficiency trade-offs but it also allows testing more encoding options. Moreover, we expect multi-core platforms be prevalently used for video encoding in the future, which makes it important to test new tools in an architecture supporting many threads.

Our progress so far

We have accomplished the following milestones to achieve the goals of making SVT-AV1 an excellent experimentation platform and AV1 reference:

Open-sourced SVT-AV1 on GitHub https://github.com/OpenVisualCloud/SVT-AV1/ with a BSD + patent license.
Added a continuous integration (CI) framework for Linux, Windows, and MacOs.
Added a unit tests framework based on Google Test. An external contractor is adding unit tests to achieve sufficient coverage for the code already developed. Furthermore, unit tests will cover new code.
Added other types of testing in the CI framework, such as automatic encoding and Valgrind test.
Started a decoder project that shares common parts of AV1 algorithms with the encoder.
Introduced style guidelines and formatted the existing code accordingly.

SVT-AV1 is currently work in progress since it is still missing the implementation of some coding tools and therefore has an average gap of about 14% in PSNR BD-rate with the libaom encoder in a 1-pass mode. The following features are planned to be added and will decrease the BD-rate gap:

Multi-reference pictures
ALTREF pictures
Eighth-pel motion compensation (1/8-pel)
Global motion compensation
OBMC
Wedge prediction
TMVP
Palette prediction
Adaptive transform block sizes
Trellis Quantized Coefficient Optimization
Segmentation
4:2:2 support
Rate control (ABR, CBR, CBR)
2-pass encoding mode

There is still much work ahead, and we are committed to making the SVT-AV1 project satisfy the goal of being an excellent experimentation platform, as well as viable for production applications. You can track the SVT-AV1 performance progress on the beta of AWCY (AreWeCompressedYet) website. AWCY was the framework used to evaluate AV1 tools during its development. In the figure below, you can see a comparison of two versions of the SVT-AV1 codec, the blue plot representing SVT-AV1 version from March 15, 2019, and the green one from March 19, 2019.

Screenshot of the AreWeCompressedYet codec comparison page.

SVT-AV1 already stands out in its speed. SVT-AV1 does not reach the compression efficiency of libaom at the slowest speed settings, but it performs encoding significantly faster than the fastest libaom mode. Currently, SVT-AV1 in the slowest mode uses about 13.5% more bits compared to the libaom encoder in a 1-pass mode with cpu_used=1 (the second slowest mode of libaom), while being about 4 times faster*. The BD-rate gap with 2-pass libaom encoding is wider and we are planning to address this by implementing 2-pass encoding in SVT-AV1. One could also note that faster encoding settings of SVT-AV1 decrease the encoding times even more dramatically providing significant encoder speed-up.

Open-source video coding needs you!

If you are interested in helping us to build SVT-AV1, you can contribute on GitHub https://github.com/OpenVisualCloud/SVT-AV1/ with your suggestions, comments and of course your code.

*These results have been obtained for 8-bit encodes on the set of AOM test video sequences, Objective-1-fast.

Introducing SVT-AV1: a scalable open-source AV1 framework was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Python at Netflix

April 29, 2019, 9:12 am

≫ Next: Engineering a Studio Quality Experience With High-Quality Audio at Netflix

≪ Previous: Introducing SVT-AV1: a scalable open-source AV1 framework

By Pythonistas at Netflix, coordinated by Amjith Ramanujam and edited by Ellen Livengood

As many of us prepare to go to PyCon, we wanted to share a sampling of how Python is used at Netflix. We use Python through the full content lifecycle, from deciding which content to fund all the way to operating the CDN that serves the final video to 148 million members. We use and contribute to many open-source Python packages, some of which are mentioned below. If any of this interests you, check out the jobs site or find us at PyCon. We have donated a few Netflix Originals posters to the PyLadies Auction and look forward to seeing you all there.

Open Connect

Open Connect is Netflix’s content delivery network (CDN). An easy, though imprecise, way of thinking about Netflix infrastructure is that everything that happens before you press Play on your remote control (e.g., are you logged in? what plan do you have? what have you watched so we can recommend new titles to you? what do you want to watch?) takes place in Amazon Web Services (AWS), whereas everything that happens afterwards (i.e., video streaming) takes place in the Open Connect network. Content is placed on the network of servers in the Open Connect CDN as close to the end user as possible, improving the streaming experience for our customers and reducing costs for both Netflix and our Internet Service Provider (ISP) partners.

Various software systems are needed to design, build, and operate this CDN infrastructure, and a significant number of them are written in Python. The network devices that underlie a large portion of the CDN are mostly managed by Python applications. Such applications track the inventory of our network gear: what devices, of which models, with which hardware components, located in which sites. The configuration of these devices is controlled by several other systems including source of truth, application of configurations to devices, and back up. Device interaction for the collection of health and other operational data is yet another Python application. Python has long been a popular programming language in the networking space because it’s an intuitive language that allows engineers to quickly solve networking problems. Subsequently, many useful libraries get developed, making the language even more desirable to learn and use.

Demand Engineering

Demand Engineering is responsible for Regional Failovers, Traffic Distribution, Capacity Operations and Fleet Efficiency of the Netflix cloud. We are proud to say that our team’s tools are built primarily in Python. The service that orchestrates failover uses numpy and scipy to perform numerical analysis, boto3 to make changes to our AWS infrastructure, rq to run asynchronous workloads and we wrap it all up in a thin layer of Flask APIs. The ability to drop into a bpython shell and improvise has saved the day more than once.

We are heavy users of Jupyter Notebooks and nteract to analyze operational data and prototype visualization tools that help us detect capacity regressions.

CORE

The CORE team uses Python in our alerting and statistical analytical work. We lean on the many of the statistical and mathematical libraries (numpy, scipy, ruptures, pandas) to help automate the analysis of 1000s of related signals when our alerting systems indicate problems. We’ve developed a time series correlation system used both inside and outside the team as well as a distributed worker system to parallelize large amounts of analytical work to deliver results quickly.

Python is also a tool we typically use for automation tasks, data exploration and cleaning, and as a convenient source for visualization work.

Monitoring, alerting and auto-remediation

The Insight Engineering team is responsible for building and operating the tools for operational insight, alerting, diagnostics, and auto-remediation. With the increased popularity of Python, the team now supports Python clients for most of their services. One example is the Spectator Python client library, a library for instrumenting code to record dimensional time series metrics. We build Python libraries to interact with other Netflix platform level services. In addition to libraries, the Winston and Bolt products are also built using Python frameworks (Gunicorn + Flask + Flask-RESTPlus).

Information Security

The information security team uses Python to accomplish a number of high leverage goals for Netflix: security automation, risk classification, auto-remediation, and vulnerability identification to name a few. We’ve had a number of successful Python open sources, including Security Monkey (our team’s most active open source project). We leverage Python to protect our SSH resources using Bless. Our Infrastructure Security team leverages Python to help with IAM permission tuning using Repokid. We use Python to help generate TLS certificates using Lemur.

Some of our more recent projects include Prism: a batch framework to help security engineers measure paved road adoption, risk factors, and identify vulnerabilities in source code. We currently provide Python and Ruby libraries for Prism. The Diffy forensics triage tool is written entirely in Python. We also use Python to detect sensitive data using Lanius.

Personalization Algorithms

We use Python extensively within our broader Personalization Machine Learning Infrastructure to train some of the Machine Learning models for key aspects of the Netflix experience: from our recommendation algorithms to artwork personalization to marketing algorithms. For example, some algorithms use TensorFlow, Keras, and PyTorch to learn Deep Neural Networks, XGBoost and LightGBM to learn Gradient Boosted Decision Trees or the broader scientific stack in Python (numpy, scipy, sklearn, matplotlib, pandas, cvxpy, …). Since we’re constantly trying out new approaches, we use Jupyter Notebooks to drive many of our experiments. We have also developed a number of higher-level libraries to help integrate these with the rest of our ecosystem (e.g. data access, fact logging and feature extraction, model evaluation and publishing).

Machine Learning Infrastructure

Besides personalization, Netflix applies machine learning to hundreds of use cases across the company. Many of these applications are powered by Metaflow, a Python framework that makes it easy to execute ML projects from the prototype stage to production.

Metaflow pushes the limits of Python: We leverage well parallelized and optimized Python code to fetch data at 10Gbps, handle hundreds of millions of data points in memory, and orchestrate computation over tens of thousands of CPU cores.

Notebooks

We are avid users of Jupyter notebooks at Netflix, we’ve written about the reasons and nature of this investment before.

But Python plays a huge role in how we provide those services. Python is a primary language when we need to develop, debug, explore and prototype different interactions with the Jupyter ecosystem. We use Python to build custom extensions to the Jupyter server that allows us to manage tasks like logging, archiving, publishing and cloning notebooks on behalf of our users.
We provide many flavors of Python to our users via different Jupyter kernels, and manage the deployment of those kernel specifications using Python.

Orchestration

The Big Data Orchestration team is responsible for providing all of the services and tooling to schedule and execute ETL and Adhoc pipelines.

Many of the components of the orchestration service are written in Python. Starting with our scheduler, which uses Jupyter Notebooks with papermill to provide templatized job types (Spark, Presto, …). This allows our users to have a standardized and easy way to express work that needs to be executed. You can see some deeper details on the subject here. We have been using notebooks as real runbooks for situations where human intervention is required. i.e: restart everything that has failed in the last hour.

Internally, we also built an event-driven platform that is fully written in Python. We have created streams of events from a number of systems that get unified into a single tool. This allows us to define conditions to filter events, and actions to react or route them. As a result of this, we have been able to decouple microservices and get visibility to everything that happens on the data platform.

Our team also built the pygenie client which interfaces with Genie, a federated job execution service. Internally, we have additional extensions to this library that apply business conventions and integrate with the Netflix platform. These libraries are the primary way users interface programmatically with work in the Big Data platform.

Finally, it’s been our team’s commitment to contribute to papermill and scrapbook open source projects. Our work there has been both for our own and external use cases. These efforts have been gaining a lot of traction in the open source community and we’re glad to be able to contribute to these shared projects.

Experimentation Platform

The scientific computing team for experimentation is creating a platform for scientists and engineers to analyze AB tests and other experiments. Scientists and engineers can contribute new innovations on three fronts, data, statistics, and visualizations.

The Metrics Repo is a Python framework based on PyPika that allows contributors to write reusable parameterized SQL queries. It serves as an entry point into any new analysis.

The Causal Models library is a Python & R framework for scientists to contribute new models for causal inference. It leverages PyArrow and RPy2 so that statistics can be calculated seamlessly in either language.

The Visualizations library is based on Plotly. Since Plotly is a widely adopted visualization spec, there are a variety of tools that allow contributors to produce an output that is consumable by our platforms.

Partner Ecosystem

The Partner Ecosystem group is expanding its use of Python for testing Netflix applications on devices. Python is forming the core of a new CI infrastructure, including controlling our orchestration servers, controlling Spinnaker, test case querying and filtering, and scheduling test runs on devices and containers. Additional post-run analysis is being done in Python using TensorFlow to determine which tests are most likely to show problems on which devices.

Video Encoding and Media Cloud Engineering

Our team takes care of encoding (and re-encoding) the Netflix catalog, as well as leveraging machine learning for insights into that catalog.
We use Python for ~50 projects such as vmaf and mezzfs, we build computer vision solutions using a media map-reduce platform called Archer, and we use Python for many internal projects.
We have also open sourced a few tools to ease development/distribution of Python projects, like setupmeta and pickley.

Netflix Animation and NVFX

Python is the industry standard for all of the major applications we use to create Animated and VFX content, so it goes without saying that we are using it very heavily. All of our integrations with Maya and Nuke are in Python, and the bulk of our Shotgun tools are also in Python. We’re just getting started on getting our tooling in the cloud, and anticipate deploying many of our own custom Python AMIs/containers.

Content Machine Learning, Science & Analytics

The Content Machine Learning team uses Python extensively for the development of machine learning models that are the core of forecasting audience size, viewership, and other demand metrics for all content.

Python at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

↧

Engineering a Studio Quality Experience With High-Quality Audio at Netflix

May 1, 2019, 6:01 am

≫ Next: Android Rx onError Guidelines

≪ Previous: Python at Netflix

by Guillaume du Pontavice, Phill Williams and Kylee Peña (on behalf of our Streaming Algorithms, Audio Algorithms, and Creative Technologies teams)

Remember the epic opening sequence of Stranger Things 2? The thrill of that car chase through Pittsburgh not only introduced a whole new set of mysteries, but it returned us to a beloved and dangerous world alongside Dustin, Lucas, Mike, Will and Eleven. Maybe you were one of the millions of people who watched it in HDR, experiencing the brilliant imagery as it was meant to be seen by the creatives who dreamt it up.

Imagine this scene without the sound. Even taking away one part of the soundtrack — the brilliant synth-pop score or the perfectly mixed soundscape of a high speed chase — is the story nearly as thrilling and emotional?

Most conversations about streaming quality focus on video. In fact, Netflix has led the charge for most of the video technology that drives these conversations, from visual quality improvements like 4K and HDR, to behind-the-scenes technologies that make the streaming experience better for everyone, like adaptive streaming, complexity-based encoding, and AV1.

We’re really proud of the improvements we’ve brought to the video experience, but the focus on those makes it easy to overlook the importance of sound, and sound is every bit as important to entertainment as video. Variances in sound can be extremely subtle, but the impact on how the viewer perceives a scene differently is often measurable. For example, have you ever seen a TV show where the video and audio were a little out of sync?

Among those who understand the vital nature of sound are the Duffer brothers. In late 2017, we received some critical feedback from the brothers on the Stranger Things 2 audio mix: in some scenes, there was a reduced sense of where sounds are located in the 5.1-channel stream, as well as audible degradation of high frequencies.

Our engineering team and Creative Technologies sound expert joined forces to quickly solve the issue, but a larger conversation about higher quality audio continued. Series mixes were getting bolder and more cinematic with tight levels between dialog, music and effects elements. Creative choices increasingly tested the limits of our encoding quality. We needed to support these choices better.

At Netflix, we work hard to bring great audio to our members. We began streaming 5.1 surround audio in 2010, and began streaming Dolby Atmos in 2016, but wanted to bring studio quality sound to our members around the world. We want your experience to be brilliant even if you aren’t listening with a state-of-the-art home theater system. Just as we support initiatives like HDR and Netflix Calibrated Mode to maintain creative intent in streaming you picture, we wanted to do the same for the sound. That’s why we developed and launched high-quality audio.

To learn more about the people and inspiration behind this effort, check out this video. In this tech blog, we’ll dive deep into what high-quality audio is, how we deliver it to members worldwide, and why it’s so important to us.

What do we mean by “studio quality” sound?

If you’ve ever been in a professional recording studio, you’ve probably noted the difference in how things sound. One reason for that is the files used in mastering sessions are 24-bit 48 kHz with a bitrate of around 1 Mbps per channel. Studio mixes are uncompressed, which is why we consider them to be the “master” version.

Our high-quality sound feature is not lossless, but it is perceptually transparent. That means that while the audio is compressed, it is indistinguishable from the original source. Based on internal listening tests, listening test results provided by Dolby, and scientific studies, we determined that for Dolby Digital Plus at and above 640 kbps, the audio coding quality is perceptually transparent. Beyond that, we would be sending you files that have a higher bitrate (and take up more bandwidth) without bringing any additional value to the listening experience.

In addition to deciding 640 kbps — a 10:1 compression ratio when compared to a 24-bit 5.1 channel studio master — was the perceptually transparent threshold for audio, we set up a bitrate ladder for 5.1-channel audio ranging from 192 up to 640 kbps. This ranges from “good” audio to “transparent” — there aren’t any bad audio experiences when you stream!

At the same time, we revisited our Dolby Atmos bitrates and increased the highest offering to 768 kbps. We expect these bitrates to evolve over time as we get more efficient with our encoding techniques.

Our high-quality sound is a great experience for our members even if they aren’t audiophiles. Sound helps to tell the story subconsciously, shaping our experience through subtle cues like the sharpness of a phone ring or the way a very dense flock of bird chirps can increase anxiety in a scene. Although variances in sound can be nuanced, the impact on the viewing and listening experience is often measurable.

And perhaps most of all, our “studio quality” sound is faithful to what the mixers are creating on the mix stage. For many years in the film and television industry, creatives would spend days on the stage perfecting the mix only to have it significantly degraded by the time it was broadcast to viewers. Sometimes critical sound cues might even be lost to the detriment of the story. By delivering studio quality sound, we’re preserving the creative intent from the mix stage.

Adaptive Streaming for Audio

Since we began streaming, we’ve used static audio streaming at a constant bitrate. This approach selects the audio bitrate based on network conditions at the start of playback. However, we have spent years optimizing our adaptive streaming engine for video, so we know adaptive streaming has obvious benefits. Until now, we’ve only used adaptive streaming for video.

Adaptive streaming is a technology designed to deliver media to the user in the most optimal way for their network connection. Media is split into many small segments (chunks) and each chunk contains a few seconds of playback data. Media is provided in several qualities.

An adaptive streaming algorithm’s goal is to provide the best overall playback experience — even under a constrained environment. A great playback experience should provide the best overall quality, considering both audio and video, and avoid buffer starvation which leads to a rebuffering event — or playback interruption.

Constrained environments can be due to changing network conditions and device performance limitations. Adaptive streaming has to take all these into account. Delivering a great playback experience is difficult.

Let’s first look at how static audio streaming paired with adaptive video operates in a session with variable network conditions — in this case, a sudden throughput drop during the session.

The top graph shows both the audio and video bitrate, along with the available network throughput. The audio bitrate is fixed and has been selected at playback start whereas video bitrate varies and can adapt periodically.

The bottom graph shows audio and video buffer evolution: if we are able to fill the buffer faster than we play out, our buffer will grow. If not, our buffer will shrink.

In the first session above, the adaptive streaming algorithm for video has reacted to the throughput drop and was able to quickly stabilize both the audio and video buffer level by down-switching the video bitrate.

In the second scenario below, under the same network conditions we used a static high-quality audio bitrate at session start instead.

Our adaptive streaming for video logic is reacting but in this case, the available throughput is becoming less than the sum of audio and video bitrate, and our buffer starts draining. This ultimately leads to a rebuffer.

In this scenario, the video bitrate dropped below the audio bitrate, which might not provide the best playback experience.

This simple example highlights that static audio streaming can lead to suboptimal playback experiences with fluctuating network conditions. This motivated us to use adaptive streaming for audio.

By using adaptive streaming for audio, we allow audio quality to adjust during playback to bandwidth capabilities, just like we do for video.

Let’s consider a playback session with exactly the same network conditions (a sudden throughput drop) to illustrate the benefit of adaptive streaming for audio.

In this case we are able to select a higher audio bitrate when network conditions supported it and we are able to gracefully switch down the audio bitrate and avoid a rebuffer event by maintaining healthy audio and video buffer levels. Moreover, we were able to maintain a higher video bitrate when compared to the previous example.

The benefits are obvious in this simple case, but extending it to our broad streaming ecosystem was another challenge. There were many questions we had to answer in order to move forward with adaptive streaming for audio.

What about device reach? We have hundreds of millions of TV devices in the field, with different CPU, network and memory profiles, and adaptive audio has never been certified. Do these devices even support audio stream switching?

We had to assess this by testing adaptive audio switching on all Netflix supported devices.
We also added adaptive audio testing in our certification process so that every new certified device can benefit from it.

Once we knew that adaptive streaming for audio was achievable on most of our TV devices, we had to answer the following questions as we designed the algorithm:

How could we guarantee that we can improve audio subjective quality without degrading video quality and vice-versa?
How could we guarantee that we won’t introduce additional rebuffers or increase the startup delay with high-quality audio?
How could we guarantee that this algorithm will gracefully handle devices with different performance characteristics?

We answered these questions via experimentation that led to fine-tuning the adaptive streaming for audio algorithm in order to increase audio quality without degrading the video experience. After a year of work, we were able to answer these questions and implement adaptive audio streaming on a majority of TV devices.

Enjoying a Higher Quality Experience

By using our listening tests and scientific data to choose an optimal “transparent” bitrate, and designing an adaptive audio algorithm that could serve it based on network conditions, we’ve been able to enable this feature on a wide variety of devices with different CPU, network and memory profiles: the vast majority of our members using 5.1 should be able to enjoy new high-quality audio.

And it won’t have any negative impact on the streaming experience. The adaptive bitrate switching happens seamlessly during a streaming experience, with the available bitrates ranging from good to transparent, so you shouldn’t notice a difference other than better sound. If your network conditions are good, you’ll be served up the best possible audio, and it will now likely sound like it did on the mixing stage. If your network has an issue — your sister starts a huge download or your cat unplugs your router — our adaptive streaming will help you out.

After years perfecting our adaptive video switching, we’re thrilled that a similar approach can enable studio quality sound to make it to members’ households, ensuring that every detail of the mix is preserved. Uniquely combining creative technology with engineering teams at Netflix, we’ve been able to not only solve a problem, but use that problem to improve the quality of audio for millions of our members worldwide.

Preserving the original creative intent of the hard-working people who make shows like Stranger Things is a top priority, and we know it enhances your viewing — and listening — experience for many more moments of joy. Whether you’ve fallen into the Upside Down or you’re being chased by the Demogorgon, get ready for a sound experience like never before.