
Open-Sourcing Metaflow, a Human-Centric Framework for Data Science


by David Berg, Ravi Kiran Chirravuri, Romain Cledat, Savin Goyal, Ferras Hamad, Ville Tuulos

tl;dr Metaflow is now open-source! Get started at metaflow.org.

Netflix applies data science to hundreds of use cases across the company, including optimizing content delivery and video encoding. Data scientists at Netflix relish our culture that empowers them to work autonomously and use their judgment to solve problems independently. We want our data scientists to be curious and take smart risks that have the potential for high business impact.

About two years ago, our newly formed Machine Learning Infrastructure team started asking our data scientists a question: “What is the hardest thing for you as a data scientist at Netflix?” We were expecting to hear answers related to large-scale data and models, and maybe issues related to modern GPUs. Instead, we heard stories about projects where getting the first version to production took surprisingly long — mainly for mundane reasons related to software engineering. We heard many stories about difficulties related to data access and basic data processing. We sat in meetings where data scientists discussed with their stakeholders how best to manage different versions of their models without impacting production. We saw how excited data scientists were about modern off-the-shelf machine learning libraries, but we also witnessed various issues caused by these libraries when they were casually included as dependencies in production workflows.

We realized that nearly everything that data scientists wanted to do was already doable technically, but nothing was easy enough. Our job as a Machine Learning Infrastructure team would therefore not be mainly about enabling new technical feats. Instead, we should make common operations so easy that data scientists would not even realize that they were difficult before. We would focus our energy solely on improving data scientist productivity by being fanatically human-centric.

How could we improve the quality of life for data scientists? The following picture started emerging:

Our data scientists love the freedom of being able to choose the best modeling approach for their project. They know that feature engineering is critical for many models, so they want to stay in control of model inputs and feature engineering logic. In many cases, data scientists are quite eager to own their models in production, since it allows them to troubleshoot and iterate on the models faster.

On the other hand, very few data scientists feel strongly about the nature of the data warehouse, the compute platform that trains and scores their models, or the workflow scheduler. Preferably, from their point of view, these foundational components should “just work”. If and when they fail, the error messages should be clear and understandable in the context of their work.

A key observation was that most of our data scientists had nothing against writing Python code. In fact, plain-and-simple Python is quickly becoming the lingua franca of data science, so using Python is preferable to domain specific languages. Data scientists want to retain their freedom to use arbitrary, idiomatic Python code to express their business logic — like they would do in a Jupyter notebook. However, they don’t want to spend too much time thinking about object hierarchies, packaging issues, or dealing with obscure APIs unrelated to their work. The infrastructure should allow them to exercise their freedom as data scientists but it should provide enough guardrails and scaffolding, so they don’t have to worry about software architecture too much.

Introducing Metaflow

These observations motivated Metaflow, our human-centric framework for data science. Over the past two years, Metaflow has been used internally at Netflix to build and manage hundreds of data-science projects from natural language processing to operations research.

By design, Metaflow is a deceptively simple Python library:
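
Below is a minimal sketch of such a flow; the model names and scores are placeholders standing in for real training code:

from metaflow import FlowSpec, step

class TrainingFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into two parallel training branches.
        self.next(self.train_a, self.train_b)

    @step
    def train_a(self):
        # Artifacts are plain instance variables; Metaflow persists them.
        self.model = "model-a"   # placeholder for a trained model object
        self.score = 0.82        # placeholder validation score
        self.next(self.choose_model)

    @step
    def train_b(self):
        self.model = "model-b"
        self.score = 0.91
        self.next(self.choose_model)

    @step
    def choose_model(self, inputs):
        # Join step: keep the branch with the highest score.
        best = max(inputs, key=lambda branch: branch.score)
        self.model, self.score = best.model, best.score
        self.next(self.end)

    @step
    def end(self):
        print(f"selected {self.model} (score={self.score})")

if __name__ == "__main__":
    TrainingFlow()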

Data scientists can structure their workflow as a Directed Acyclic Graph of steps, as depicted above. The steps can be arbitrary Python code. In this hypothetical example, the flow trains two versions of a model in parallel and chooses the one with the highest score.

On the surface, this doesn’t seem like much. There are many existing frameworks, such as Apache Airflow or Luigi, which allow execution of DAGs consisting of arbitrary Python code. The devil is in the many carefully designed details of Metaflow: for instance, note how in the above example data and models are stored as normal Python instance variables. They work even if the code is executed on a distributed compute platform, which Metaflow supports by default, thanks to Metaflow’s built-in content-addressed artifact store. In many other frameworks, loading and storing of artifacts is left as an exercise for the user, which forces them to decide what should and should not be persisted. Metaflow removes this cognitive overhead.

Metaflow is packed with human-centric details like this, all of which aim at boosting data scientist productivity. For a comprehensive overview of all features of Metaflow, take a look at our documentation at docs.metaflow.org.

Metaflow on Amazon Web Services

Netflix’s data warehouse contains hundreds of petabytes of data. While a typical machine learning workflow running on Metaflow touches only a small shard of this warehouse, it can still process terabytes of data.

Metaflow is a cloud-native framework. It leverages elasticity of the cloud by design — both for compute and storage. Netflix has been one of the largest users of Amazon Web Services (AWS) for many years and we have accumulated plenty of operational experience and expertise in dealing with the cloud, AWS in particular. For the open-source release, we partnered with AWS to provide a seamless integration between Metaflow and various AWS services.

Metaflow comes with built-in capability to snapshot all code and data in Amazon S3 automatically, which is a key value proposition of our internal Metaflow setup. This provides us with a comprehensive solution for versioning and experiment tracking without any user intervention, which is core to any production-grade machine learning infrastructure.

In addition, Metaflow comes bundled with a high-performance S3 client, which can load data at up to 10 Gbps. This client has been massively popular amongst our users, who can now load data into their workflows an order of magnitude faster than before, enabling faster iteration cycles.
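
As a small sketch of how the bundled client is used (the bucket path and keys are placeholders):

from metaflow import S3

with S3(s3root="s3://my-bucket/training-data/") as s3:   # placeholder bucket
    for obj in s3.get_many(["part-0", "part-1"]):        # parallel downloads
        print(obj.key, obj.path)  # obj.path points at a local temp file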

For general purpose data processing, Metaflow integrates with AWS Batch, which is a managed, container-based compute platform provided by AWS. The user can benefit from infinitely scalable compute clusters by adding a single line in their code: @batch. For training machine learning models, besides writing their own functions, the user has the choice to use AWS Sagemaker, which provides high-performance implementations of various models, many of which support distributed training.

Metaflow supports all common off-the-shelf machine learning frameworks through our @conda decorator, which allows the user to specify external dependencies for their steps safely. The @conda decorator freezes the execution environment, providing good guarantees of reproducibility, both when executed locally as well as in the cloud.
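
Here is a short sketch combining the two decorators described above; the resource sizes and library versions are illustrative, not recommendations:

from metaflow import FlowSpec, step, batch, conda

class CloudFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @batch(cpu=8, memory=32000)                   # run this step on AWS Batch
    @conda(libraries={"scikit-learn": "1.1.2"})   # frozen, reproducible environment
    @step
    def train(self):
        from sklearn.linear_model import LogisticRegression
        self.model = LogisticRegression()          # placeholder training step
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    CloudFlow()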

For more details, read this page about Metaflow’s integration with AWS.

From Prototype To Production

Out of the box, Metaflow provides a first-class local development experience. It allows data scientists to develop and test code quickly on a laptop, just like any Python script. If a workflow supports parallelism, Metaflow takes advantage of all the CPU cores available on the development machine.

We encourage our users to deploy their workflows to production as soon as possible. In our case, “production” means a highly available, centralized DAG scheduler, Meson, where users can export their Metaflow runs for execution with a single command. This allows them to quickly start testing their workflow against regularly updating data, which is a highly effective way to surface bugs and issues in the model. Since Meson is not available in open-source, we are working on providing a similar integration with AWS Step Functions, which is a highly available workflow scheduler.

In a complex business environment like Netflix’s, there are many ways to consume the results of a data science workflow. Often, the final results are written to a table, to be consumed by a dashboard. Sometimes, the resulting model is deployed as a microservice to support real-time inferencing. It is also common to chain workflows so that the results of a workflow are consumed by another. Metaflow supports all these modalities, although some of these features are not yet available in the open-source version.

When it comes to inspecting the results, Metaflow comes with a notebook-friendly client API. Most of our data scientists are heavy users of Jupyter notebooks, so we decided to focus our UI efforts on a seamless integration with notebooks, instead of providing a one-size-fits-all Metaflow UI. Our data scientists can build custom model UIs in notebooks, fetching artifacts from Metaflow, which provide just the right information about each model. A similar experience is available with AWS Sagemaker notebooks with open-source Metaflow.
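
As a sketch of what this looks like in a notebook cell (assuming the TrainingFlow from the earlier sketch):

from metaflow import Flow

run = Flow("TrainingFlow").latest_successful_run
print(run.id, run.finished_at)
print(run.data.model, run.data.score)  # artifacts persisted by the steps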

Get Started With Metaflow

Metaflow has been eagerly adopted inside of Netflix, and today, we are making Metaflow available as an open-source project.

We hope that our vision of data scientist autonomy and productivity resonates outside Netflix as well. We welcome you to try Metaflow, start using it in your organization, and participate in its development.

You can find the project home page at metaflow.org and the code at github.com/Netflix/metaflow. Metaflow is comprehensively documented at docs.metaflow.org. The quickest way to get started is to follow our tutorial. If you want to learn more before getting your hands dirty, you can watch high-level presentations about Metaflow or dig deeper into its internals.

If you have any questions, thoughts, or comments about Metaflow, you can find us in the Metaflow chat room or reach us by email at help@metaflow.org. We are eager to hear from you!




Netflix Hack Day — November 2019


By Tom Richards, Carenina Garcia Motion, and Leslie Posada

Hack Day at Netflix is an opportunity to build and show off a feature, tool, or quirky app. The goal is simple: experiment with new ideas/technologies, engage with colleagues across different disciplines, and have fun!

We know even the silliest idea can spur something more.

The most important value of our Hack Days is that they support a culture of innovation. We believe in this work, even if it never ships, and enjoy sharing the creativity and thought put into these ideas.

Below, you can find videos made by the hackers of some of our favorite hacks from this event.

Nostalgiflix

Nostalgiflix is a Chrome extension that transforms your Netflix web browser into an interactive TV time machine covering three decades (the ’80s, ’90s, and ’00s). By dragging the UI slider around, you can view titles originally released within the selected year (based on their historic box office and episode air dates). More importantly, you can also adjust the video filters in real time to creatively downgrade the viewing experience, further enhancing the nostalgic effect. We think this feature could encourage our users to watch more of our older content while having fun reliving those moments of cinematic history.

By Joey Cato, Nazanin Delam, Sumana Mohan, Jeff Shi, Lily Dwyer, and Vishal Mishra

World of CS

This is a real-time visualization of all contacts around the world. Each square on the map represents one of our global contact centers, spanning from Salt Lake City to Brazil, India, and Japan. The heatmap in the background is a historical trend of calls over the last hour, showing which countries are currently most active in contacting customer service. Every line you see is a live customer contact — starting at the customer’s country and ending at the contact center it was routed to. Four types of contacts are represented in this visualization: white for regular phone calls, light blue for chats, green for calls initiated through our mobile apps on Android and iOS, and red for contacts escalated from one representative to another.

By Sushruth Puttaswamy and Adam Krasny

Bird Box — Automatic AD

Audio Descriptive tracks provide descriptive narration in addition to dialog, helping visually impaired and blind members enjoy our shows. For the Hack Day project, we explored using recent research¹ to automatically generate descriptions, then used our own internal authoring tools to refine the output. We then used synthetic audio and automated mixing techniques to deliver a final audio description track.

By Adam Wang, Andy Swan, Raja Senapati, Shilpa Jois, Anjali Chablani, Deepa Krishnan, Vidya Sundaram, and Casey Wilms

You can also check out highlights from our past events: May 2019, November 2018, March 2018, August 2017, January 2017, May 2016, November 2015, March 2015, February 2014 & August 2014.

Thanks to all the teams who put together a great round of hacks in 24 hours!

Footnotes

  1. Weakly Supervised Dense Event Captioning in Videos. Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. Advances in Neural Information Processing Systems 31, Curran Associates, Inc., pp. 3062–3072, 2018.



Netflix Now Streaming AV1 on Android


By Liwei Guo, Vivian Li, Julie Beckley, Venkatesh Selvaraj, and Jeff Watts

Today we are excited to announce that Netflix has started streaming AV1 to our Android mobile app. AV1 is a high-performance, royalty-free video codec that provides 20% improved compression efficiency over our VP9† encodes. AV1 is made possible by the wide-ranging industry commitment of expertise and intellectual property within the Alliance for Open Media (AOMedia), of which Netflix is a founding member.

Our support for AV1 represents Netflix’s continued investment in delivering the most efficient and highest quality video streams. For our mobile environment, AV1 follows on our work with VP9, which we released as part of our mobile encodes in 2016 and further optimized with shot-based encodes in 2018.

While our goal is to roll out AV1 on all of our platforms, we see a good fit for AV1’s compression efficiency in the mobile space where cellular networks can be unreliable, and our members have limited data plans. Selected titles are now available to stream in AV1 for customers who wish to reduce their cellular data usage by enabling the “Save Data” feature.

Our AV1 support on Android leverages the open-source dav1d decoder built by the VideoLAN, VLC, and FFmpeg communities and sponsored by the Alliance for Open Media. Here we have optimized dav1d so that it can play Netflix content, which is encoded in 10-bit color. In the spirit of making AV1 widely available, we are sponsoring an open-source effort to optimize 10-bit performance further and make these gains available to all.

As codec performance improves over time, we plan to expand our AV1 usage to more use cases and are now also working with device and chipset partners to extend this into hardware.

†AV1-libaom compression efficiency as measured against VP9-libvpx.



AVIF for Next-Generation Image Coding


By Aditya Mavlankar, Jan De Cock¹, Cyril Concolato, Kyle Swanson, Anush Moorthy and Anne Aaron

TL;DR

We need an alternative to JPEG that a) is widely supported, b) has better compression efficiency and c) has a wider feature set. We believe the AV1 Image File Format (AVIF) has that potential. Using the framework we have open-sourced, AVIF compression efficiency can be seen at work and compared against the whole range of image codecs that came before it.

Image compression at Netflix

Netflix is enjoyed by its members on a variety of devices — smart TVs, phones, tablets, personal computers and streaming devices connected to TV screens. The user interface (UI), intended for browsing the catalog and serving up recommendations, is rich in images and graphics across all device categories. Shown below are screenshots of the Netflix app on iOS as an example.

Screenshots showing the Netflix UI on iOS (iPhone 7) at the time of this writing.

Image assets might be based on still frames from the title, special on-set photography or a combination thereof. Assets could also stem from art generated during the production of the feature.

As seen above, image assets typically have gradients, text and graphics, for example the Netflix symbol or other title-specific symbols such as “The Witcher” insignia, composited on the image. Such special treatments lead to a variety of peculiarities which do not necessarily arise in natural images. Hard edges, including those with chroma differences on either side of the edge, are common and require good detail preservation, since they typically occur at salient locations and convey important information. Further, there is typically a character or a face in salient locations with a smooth, uncluttered background. Again, preservation of detail on the character’s face is of primary importance. In some cases, the background is textured and complex, exhibiting a wide range of frequencies.

After an image asset is ingested, the compression pipeline kicks in and prepares compressed image assets meant for delivering to devices. The goal is to have the compressed image look as close to the original as possible while reducing the number of bytes required. Given the image-heavy nature of the UI, compressing these images well is of primary importance. This involves picking, among other things, the right combination of color subsampling, codec, encoder parameters and encoding resolution.

Compressed image assets destined for various client devices and various spaces in the UI are created from corresponding “pristine” image sources.

Let us take color subsampling as an example. Choosing 420 subsampling, over the original 444 format, halves the number of samples (counting across all 3 color planes) that need to be encoded while relying on the fact that the human visual system is more sensitive to luma than chroma. However, 420 subsampling can introduce color bleeding and jaggies in locations with color transitions. Below we toggle between the original source in 444 and the source converted to 420 subsampling. The toggling shows loss introduced just by the color subsampling, even before the codec enters the picture.

Toggling between the original source image with 444 subsampling and after converting to 420 subsampling. Showing the top part of the artwork only. The reader may zoom in on the webpage to view jaggies around the Netflix logo appearing due to 420 subsampling.

Nevertheless, there are source images where the loss due to 420 subsampling is not obvious to human perception and in such cases it can be advantageous to use 420 subsampling. Ideally, a codec should be able to support both subsampling formats. However, there are a few codecs that only support 420 subsampling — webp, discussed below, is one such popular codec.
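
To make the sample-count arithmetic concrete, here is a small NumPy sketch of 420 subsampling that simply averages each 2x2 chroma block; a real pipeline would use a proper resampling filter:

import numpy as np

def subsample_420(y, cb, cr):
    # 444 holds 3*H*W samples; 420 keeps Y intact and shrinks each chroma
    # plane by 2x in both dimensions: H*W + 2*(H/2)*(W/2) = 1.5*H*W, i.e. half.
    h, w = cb.shape
    cb_420 = cb.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    cr_420 = cr.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, cb_420, cr_420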

Brief overview of image coding formats

The JPEG format was introduced in 1992 and is widely popular. It supports various color subsamplings, including 420, 422, and 444. JPEG can ingest RGB data and transform it to a luma-chroma representation before performing lossy compression. The discrete cosine transform (DCT) is employed as the decorrelating transform on 8x8 blocks of samples. This is followed by quantization and entropy coding. However, JPEG is restricted to 8-bit imagery and lacks support for an alpha channel. The more recent JPEG-XT standard extends JPEG to higher bit-depths, support for an alpha channel, lossless compression, and more, in a backwards-compatible way.

The JPEG 2000 format, based on the discrete wavelet transform (DWT), was introduced as a successor to JPEG in the year 2000. It brought a whole range of additional features such as spatial scalability, region of interest coding, range of supported bit-depths, flexible number of color planes, lossless coding, etc. With the motion extension, it was accepted as the video coding standard for digital cinema in 2004.

The webp format was introduced by Google around 2010. Google added decoding support on Android devices and in the Chrome browser, and also released libraries that developers could add to their apps on other platforms, for example iOS. Webp is based on intra-frame coding from the VP8 video coding format. Webp does not have all the flexibilities of JPEG 2000. It does, however, support lossless coding and a lossless alpha channel, making it a more efficient and faster alternative to PNG in certain situations.

High-Efficiency Video Coding (HEVC) is the successor to H.264, a.k.a. the Advanced Video Coding (AVC) format. HEVC intra-frame coding can be encapsulated in the High-Efficiency Image File Format (HEIF). This format is most notably used by Apple devices to store recorded imagery.

Similarly, AV1 Image File Format (AVIF) allows encapsulating AV1 intra-frame coded content, thus taking advantage of excellent compression gains achieved by AV1 over predecessors. We touch upon some appealing technical features of AVIF in the next section.

The JPEG committee is pursuing a coding format called JPEG XL which includes features aimed at helping the transition from legacy JPEG format. Existing JPEG files can be losslessly transcoded to JPEG XL while achieving file size reduction. Also included is a lightweight conversion process back to JPEG format in order to serve clients that only support legacy JPEG.

AVIF technical features

Although modern video codecs were developed primarily with video in mind, the intra-frame coding tools in a video codec are not significantly different from image compression tooling. Given the huge compression gains of modern video codecs, they are compelling as image coding formats. There is also a potential benefit in reusing the hardware already in place for video compression/decompression. That said, image decoding in hardware may not be a primary motivator, given the peculiarities of OS-dependent UI composition and the architectural implications of moving uncompressed image pixels around.

In the area of image coding formats, the Moving Picture Experts Group (MPEG) has standardized a codec-agnostic and generic image container format: the ISO/IEC 23008-12 standard (a.k.a. HEIF). HEIF has been used most notably to store HEVC-encoded images (in its HEIC variant) but is also capable of storing AVC-encoded or even JPEG-encoded images. The Alliance for Open Media (AOM) has recently extended this format to specify the storage of AV1-encoded images in its AVIF format. The base HEIF format offers the typical features expected from an image format, such as support for any image codec, the ability to use a lossy or lossless mode for compression, and support for varied subsamplings and bit-depths. Furthermore, the format also allows the storage of a series of animated frames (offering an efficient and long-awaited alternative to animated GIFs), and the ability to specify an alpha channel (which sees tremendous use in UIs). Finally, since the HEIF format borrows learnings from next-generation video compression, it allows for preserving metadata such as color gamut and high dynamic range (HDR) information.

Image compression comparison framework

We have open-sourced a Docker-based framework for comparing various image codecs. Salient features include:

  1. Encode orchestration (with parallelization) and insights generation using Python 3
  2. Easy reproducibility of results
  3. Easy control of target quality range(s).

Since the framework allows one to specify a target quality (using a certain metric) for the target codec(s) and stores the results in a local database, one can easily utilize the Bjontegaard-Delta (BD) rate to compare across codecs. The target points can be restricted to a useful or meaningful quality range, instead of blindly sweeping across the encoder parameter range (such as a quality factor) with fixed parameter values and landing on arbitrary quality points.
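
As a rough illustration of this target-quality search (the encode_and_measure callable is hypothetical, standing in for an encode-then-score step; the framework's actual search logic may differ):

def find_quality_param(encode_and_measure, target, tol, lo=1, hi=100):
    # Bisect an integer quality parameter until the measured metric value
    # (e.g. SSIM or VMAF) lands within tol of the target.
    q = (lo + hi) // 2
    while lo <= hi:
        q = (lo + hi) // 2
        score = encode_and_measure(q)  # encode at quality q, return metric score
        if abs(score - target) <= tol:
            return q
        if score < target:
            lo = q + 1   # quality too low: move up
        else:
            hi = q - 1   # quality too high: move down
    return q             # closest parameter found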

As an example, below are the calls that would produce compressed images for the choice of codecs at the specified SSIM and VMAF values, with the desired tolerance in target quality:

# Target four SSIM operating points, each within ±0.005 of the target quality
main(metric='ssim', target_arr=[0.92, 0.95, 0.97, 0.99], target_tol=0.005, db_file_name='encoding_results_ssim.db')
# Target five VMAF operating points, each within ±0.5 of the target quality
main(metric='vmaf', target_arr=[75, 80, 85, 90, 95], target_tol=0.5, db_file_name='encoding_results_vmaf.db')

For the various codecs and configurations involved in the ensuing comparison, the reader can view the actual command lines in the shared repository. We have attempted to get the best compression efficiency out of every codec/configuration compared here, and the reader is free to experiment with changes to the encoding commands within the framework. Note that newer versions of the respective software implementations may have been released since these results were gathered. For example, a newer version of the Kakadu demo apps is available compared to the one in the framework snapshot on GitHub used at the time.

Visual examples

This is the section where we get to admire the work of the compression community over the last 3 decades by looking at visual examples comparing JPEG and the state-of-the-art.

The encoded images shown below are illustrative and meant to compare visual quality at various target bitrates. Please note that the quality of the illustrative encodes is not representative of the high quality bar that Netflix employs for streaming image assets on the actual service, and is meant to be purely educative in nature.

Shown below is one original source image from the Kodak dataset and the corresponding results with JPEG 444 @ 20,429 bytes and with AVIF 444 @ 19,788 bytes. The JPEG encode shows very obvious blocking artifacts in the sky, in the pond, as well as on the roof. The AVIF encode is much better, with fewer blocking artifacts, although there is some blurriness and loss of texture on the roof. It is still a remarkable result, given the compression factor of around 59x (the original image has dimensions 768x512, thus requiring 768x512x3 bytes, compared to the roughly 20k bytes of the compressed image).

An original image from the Kodak dataset
JPEG 444 @ 20,429 bytes
AVIF 444 @ 19,788 bytes

For the same source, shown below is the comparison of JPEG 444 @ 40,276 bytes and AVIF 444 @ 39,819 bytes. The JPEG encode still has visible blocking artifacts in the sky, along with ringing around the roof edges and chroma bleeding in several locations. The AVIF image, however, is now comparable to the original, with a compression factor of 29x.

JPEG 444 @ 40,276 bytes
AVIF 444 @ 39,819 bytes

Shown below is another original source image from the Kodak dataset and the corresponding results with JPEG 444 @ 13,939 bytes and with AVIF 444 @ 4,176 bytes. The JPEG encode shows blocking artifacts around most edges, particularly around the slanting edge, as well as color distortions. The AVIF encode looks “cleaner” even though it is one-third the size of the JPEG encode. It is not a perfect rendition of the original, but with a compression factor of 282x, this is commendable.

Another original source image from the Kodak dataset
JPEG 444 @ 13,939 bytes
AVIF 444 @ 4,176 bytes

Shown below are results for the same image with slightly higher bit-budget; JPEG 444 @ 19,787 bytes versus AVIF 444 @ 20,120 bytes. The JPEG encode still shows blocking artifacts around the slanting edge whereas the AVIF encode looks nearly identical to the source.

JPEG 444 @ 19,787 bytes
AVIF 444 @ 20,120 bytes

Shown below is an original image from the Netflix (internal) 1142x1600 resolution “boxshots-1” dataset, followed by JPEG 444 @ 69,445 bytes and AVIF 444 @ 40,811 bytes. Severe banding and blocking artifacts, along with color distortions, are visible in the JPEG encode; less so in the AVIF encode, which is actually 29kB smaller.

An original source image from the Netflix (internal) boxshots-1 dataset
JPEG 444 @ 69,445 bytes
AVIF 444 @ 40,811 bytes

Shown below are results for the same image with a slightly increased bit-budget: JPEG 444 @ 80,101 bytes versus AVIF 444 @ 85,162 bytes. The banding and blocking are still visible in the JPEG encode, whereas the AVIF encode looks very close to the original.

JPEG 444 @ 80,101 bytes
AVIF 444 @ 85,162 bytes

Shown below is another source image from the same boxshots-1 dataset along with JPEG 444 @ 81,745 bytes versus AVIF 444 @ 76,087 bytes. Blocking artifacts overall and mosquito artifacts around text can be seen in the JPEG encode.

Another original source image from the Netflix (internal) boxshots-1 dataset
JPEG 444 @ 81,745 bytes
AVIF 444 @ 76,087 bytes

Shown below is another source image from the boxshots-1 dataset, along with JPEG 444 @ 80,562 bytes versus AVIF 444 @ 80,432 bytes. There are visible banding, blocking, and mosquito artifacts in the JPEG encode, whereas the AVIF encode looks very close to the original source.

Another original source image from the Netflix (internal) boxshots-1 dataset
JPEG 444 @ 80,562 bytes
AVIF 444 @ 80,432 bytes

Overall results

Shown below are results over public datasets as well as Netflix-internal datasets. The reference codec used is JPEG from the JPEG-XT reference software, using the standard quantization matrix defined in Annex K of the JPEG standard. Following are the codecs and/or configurations tested and reported against the baseline in the form of BD rate.

The encoding resolution in these experiments is the same as the source resolution. For 420 subsampling encodes, the quality metrics were computed in 420 subsampling domain. Likewise, for 444 subsampling encodes, the quality metrics were computed in 444 subsampling domain. Along with BD rates associated with various quality metrics, such as SSIM, MS-SSIM, VIF and PSNR, we also show rate-quality plots using SSIM as the metric.

Kodak dataset; 24 images; 768x512 resolution

We have uploaded the source images in PNG format here for easy reference. We give the necessary attribution to Kodak as the source of this dataset.

Given a quality metric, for each image, we consider two separate rate-quality curves: one curve associated with the baseline (JPEG) and one curve associated with the target codec. We compare the two and compute the BD rate, which can be interpreted as the average percentage rate reduction for the same quality over the quality region being considered. A negative value implies rate reduction and hence is better compared to the baseline. As a last step, we report the arithmetic mean of BD rates over all images in the dataset. We also highlight the best performer in the tables below.
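
For reference, here is a minimal sketch of the BD-rate computation in the classic Bjontegaard style (a cubic fit of log-rate against quality, integrated over the overlapping quality range; our framework's exact implementation may differ):

import numpy as np

def bd_rate(rates_ref, qual_ref, rates_test, qual_test):
    # Fit log(rate) as a cubic function of quality for each codec
    # (needs at least four rate-quality points per curve).
    p_ref = np.polyfit(qual_ref, np.log(rates_ref), 3)
    p_test = np.polyfit(qual_test, np.log(rates_test), 3)
    # Integrate both fits over the overlapping quality interval.
    lo = max(min(qual_ref), min(qual_test))
    hi = min(max(qual_ref), max(qual_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_rate_diff = (int_test - int_ref) / (hi - lo)
    # Negative result: average % rate savings of the test codec vs. the baseline.
    return (np.exp(avg_log_rate_diff) - 1) * 100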

CLIC dataset; 303 images; 2048x1320 resolution

We selected a subset of images from the dataset made public as part of the workshop and challenge on learned image compression (CLIC), held in conjunction with CVPR. We have uploaded our selected 303 source images in PNG format here for easy reference with appropriate attribution to CLIC.

Billboard dataset (Netflix-internal); 223 images; 2048x1152 resolution

Billboard images generally occupy a larger canvas than the thumbnail-like boxshot images and are generally horizontal. There is room to overlay text or graphics on one of the sides, either left or right, with salient characters/scenery/art being located on the other side. An example can be seen below. The billboard source images are internal to Netflix and hence do not constitute a public dataset.

A sample original source image from the billboard dataset

Boxshots-1 dataset (Netflix-internal); 100 images; 1142x1600 resolution

Unlike billboard images, boxshot images are vertical and typically boxshot images representing different titles are displayed side-by-side in the UI. Examples from this dataset are showcased in the section above on visual examples. The boxshots-1 source images are internal to Netflix and hence do not constitute a public dataset.

Boxshots-2 dataset (Netflix-internal); 100 images; 571x800 resolution

The boxshots-2 dataset also has vertical box art but of lower resolution. The boxshots-2 source images are internal to Netflix and hence do not constitute a public dataset.

At this point, it might be prudent to discuss the omission of VMAF as a quality metric here. In previous work we have shown that for JPEG-like distortions and datasets similar to “boxshots” and “billboards”, VMAF has high correlation with perceived quality. However, VMAF, as of today, is a metric trained and developed to judge encoded videos rather than static images. The range of distortions associated with the range of image codecs in our tests is broader than what was considered during VMAF’s development, and to that end it may not be an accurate measure of image quality for those codecs. Further, today’s VMAF model is not designed to capture chroma artifacts, and hence would be unable to distinguish between 420 and 444 subsampling, for instance, or to detect other chroma artifacts (this is also true of some other measures we’ve used, but given the lack of alternatives, we’ve leaned on the side of using the most well-tested and documented image quality metrics).

This is not to say that VMAF is grossly inaccurate for image quality, but that we would not use it to evaluate image compression algorithms across such a wide diversity of codecs at this time. We have some exciting upcoming work to improve the accuracy of VMAF for images across a variety of codecs and resolutions, including chroma channels in the score. Having said that, the code in the repository computes VMAF, and the reader is encouraged to try it out and see that AVIF also shines when judged by VMAF as it stands today.

PSNR does not have as high a correlation with perceptual quality over a wide quality range. However, if encodes are made with a high PSNR target, then one overspends bits but can rest assured that a high PSNR score implies closeness to the original. With perceptually driven metrics, we sometimes see failures manifest in rare cases where the score is undeservedly high but visual quality is lacking.

Interesting observation regarding subsampling

In addition to the above quality calculations, we have the following observation, which reveals an encouraging trend among modern codecs. After performing an encode with 420 subsampling, let’s assume we decode the image, up-convert it to 444 subsampling, and then compute various metrics by comparing against the original source in 444 format. We call this configuration “444u” to distinguish it from the above cases where the encode subsampling matches the quality-computation subsampling. Among the chosen metrics, PSNR_AVG is one that takes all 3 channels (1 luma and 2 chroma) into account. With an older codec like JPEG, the bit-budget is spread thin over more samples when encoding 444 subsampling compared to encoding 420 subsampling. This shows up as poorer PSNR_AVG for JPEG encoded with 444 subsampling compared to 420 subsampling, as shown below. However, given a rate target, with modern codecs like HEVC and AVIF, it is simply better to encode 444 subsampling over a wide range of bitrates.

It is simply better to encode with 444 subsampling with a modern codec such as AVIF judging by PSNR_AVG as the metric

We see that with modern codecs we yield a higher PSNR_AVG when encoding 444 subsampling than 420 subsampling over the entire region of “practical” rates, even for the other, more practical, datasets such as boxshots-1. Interestingly, with JPEG, we see a crossover; i.e., after crossing a certain rate, it starts being more efficient to encode 444 subsampling. Such crossovers are analogous to rate-quality curves crossing over when encoding over multiple spatial resolutions. Shown below are rate-quality curves for two different source images from the boxshots-1 dataset, comparing JPEG and AVIF in both 444u and 444 configurations.

Rate-quality curves for two source images from the boxshots-1 dataset, comparing JPEG and AVIF in the 444u and 444 configurations (PSNR_AVG as the metric).

AVIF support and next steps

Although AVIF provides superior compression efficiency, it is still at an early deployment stage. Various tools exist to produce and consume AVIF images. The Alliance for Open Media is notably developing an open-source library, called libavif, that can encode and decode AVIF images. The goal of this library is to ease the integration in software from the image community. Such integration has already started, for example, in various browsers, such as Google Chrome, and we expect to see broad support for AVIF images in the near future. Major efforts are also ongoing, in particular from the dav1d team, to make AVIF image decoding as fast as possible, including for 10-bit images. It is conceivable that we will soon test AVIF images on Android following on the heels of our recently announced AV1 video adoption efforts on Android.

The datasets used above have standard dynamic range (SDR) 8-bit imagery. At Netflix, we are also working on HDR images for the UI and are planning to use AVIF for encoding these HDR image assets. This is a continuation of our previous efforts where we experimented with JPEG 2000 as the compression format for HDR images and we are looking forward to the superior compression gains afforded by AVIF.

Acknowledgments

We would like to thank Marjan Parsa, Pierre Lemieux, Zhi Li, Christos Bampis, Andrey Norkin, Hunter Ford, Igor Okulist, Joe Drago, Benbuck Nason, Yuji Mano, Adam Rofer and Jeff Watts for all their contributions and collaborations.

¹ As part of his work while he was affiliated with Netflix.



Essential Suite — Artwork Producer Assistant


By: Hamid Shahid & Syed Haq

Introduction

Netflix continues to invest in content for a global audience with a diverse range of unique tastes and interests. Correspondingly, the member experience must also evolve to connect this global audience to the content that most appeals to each of them. Images that represent titles on Netflix (what we at Netflix call “artwork”) have proven to be one of the most effective ways to help our members discover the content they love to watch. We thus need to have a rich and diverse set of artwork that is tailored for different parts of the Netflix experience (what we call product canvases). We also need to source multiple images for each title representing different themes so we can present an image that is relevant to each member’s taste.

Manual curation and review of these high-quality images from scratch for a growing catalog of titles can be particularly challenging for our Product Creative Strategy Producers (referred to as producers in the rest of the article). Below, we discuss how we have built upon our previous work of harvesting static images directly from video source files and our computer vision algorithms to produce a set of artwork candidates that covers the major product canvases for the entire content catalog. The artwork generated by this pipeline is used to augment the artwork typically sourced from design agencies. We call this suite of assisted artwork “The Essential Suite”.

Supplement, not replace

Producers from our Creative Production team are the ultimate decision makers when it comes to the selection of artwork that gets published for each title. Our use of computer vision to generate artwork candidates from video sources is thus focused on alleviating the workload for our Creative Production team. The team would rather spend its time on creative and strategic tasks than sifting through thousands of frames of a show looking for the most compelling ones. With the “Essential Suite”, we are providing an additional tool in the producers’ toolkit. Through testing we have learned that, with proper checks and human curation in place, assisted artwork candidates can perform on par with agency-designed artwork.

Design Agencies

Netflix uses best-in-class design agencies to provide artwork that can be used to promote titles on and off the Netflix service. Netflix producers work closely with design agencies to request, review and approve artwork. All artwork is delivered through a web application made available to the design agencies.

The computer-generated artwork can be considered artwork provided by an “internal agency”. The idea is to generate artwork candidates using video source files and “bubble them up” to the producers on the same artwork portal where they review all other artwork, ideally without knowing whether a given image is agency-produced or internally curated, so that what goes on the product is selected purely based on the creative quality of the image.

Assisted Artwork Generation Workflow

The artwork generation process involves several steps, starting with the arrival of the video source files and culminating in generated artwork being made available to producers. We use the open-source workflow engine Netflix Conductor to run the orchestration. The whole process can be divided into two parts:

  1. Generation
  2. Review

1. Generation

This article on AVA provides a good explanation of our technology for extracting interesting images from video source files. The artwork generation workflow takes it a step further. For a given product canvas, it selects, from the hundreds of video stills, a handful of images most suitable for that particular canvas. The workflow then crops and color-corrects the selected image, picks out the best spot to place the movie’s title based on negative space, and selects, resizes, and places the title onto the image.

Here is an illustration of what this would mean if we had to do it manually (a code sketch of the same composition steps follows the list):

a. Image selection
b. Identify areas of interest
c. Cropped, color-corrected & title placed in the negative space
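
As a rough sketch of the automated composition step using Pillow (the paths, crop box, and title position are hypothetical placeholders; in production these values come from the algorithm's outputs):

from PIL import Image

def compose_artwork(still_path, title_art_path, crop_box, title_pos, canvas_size):
    still = Image.open(still_path).crop(crop_box).resize(canvas_size)
    title_art = Image.open(title_art_path).convert("RGBA")
    # Place the title treatment into the negative space suggested by the algorithm.
    still.paste(title_art, title_pos, mask=title_art)
    return still

candidate = compose_artwork("still.png", "title.png",
                            crop_box=(100, 0, 1242, 1600),
                            title_pos=(80, 1320),
                            canvas_size=(1142, 1600))
candidate.save("boxshot_candidate.png")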

Image Selection / Analyze Image

Selection of the right still image is essential to generating good quality artwork. A lot of work has already been done in AVA to extract a few hundred frames from the hundreds of thousands present in a typical video source. Broadly speaking, we use two methods to extract movie stills out of a video source.

  1. AVA — AVA is primarily a character-based algorithm. It picks out frames with a clear facial shot, taking into account actors, facial expressions, and shot detection.
  2. Cinematics — Cinematics picks out aesthetically pleasing cinematic shots.

The combination of these two approaches produces a few hundred movie stills from a typical video source. For a season, this would be a few hundred shots for each episode. Our work here is to pick out the stills that work best for the desired canvas.

Both of the above algorithms use a few heatmaps, which define what kinds of images have proven to work best in different canvases. The heatmaps are designed by internal artists who are experienced in designing promotional artwork and posters.

Heatmap for a Billboard

We make use of meta-information such as the size of the desired canvas, the “unsafe regions” and the “regions of interest” to identify which image would serve best. “Unsafe regions” are areas in the image where badges such as the Netflix logo, new episodes, etc. are placed. “Regions of interest” are areas that are always displayed in multi-purpose canvases. These details are stored as metadata for each canvas type and passed to the algorithm by the workflow. Some of our canvases are cropped dynamically for different user interfaces. For such images, the “regions of interest” will be the area that is always displayed in each crop.

Unsafe regions
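
A hedged sketch of how such per-canvas metadata might be represented (field names and values are illustrative, not our actual schema):

from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom, in pixels

@dataclass
class CanvasSpec:
    name: str
    width: int
    height: int
    unsafe_regions: List[Box] = field(default_factory=list)       # badges, logos
    regions_of_interest: List[Box] = field(default_factory=list)  # always visible

billboard = CanvasSpec(name="billboard", width=1920, height=1080,
                       unsafe_regions=[(0, 0, 300, 150)],
                       regions_of_interest=[(960, 0, 1920, 1080)])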

This data-driven approach allows for fast turnaround on additional canvases. While selecting images, the algorithms also return suggested coordinates within each image for cropping and title placement. Finally, they associate a “score” with the selected image: the algorithm’s confidence, based on previously collected stats, in how well the candidate image could perform on the service.

Image Creation

The artwork generation workflow collates image selection results from each video source and picks the top “n” images based on confidence score.
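
The collation step amounts to a top-n selection by confidence; a minimal sketch (with hypothetical input shapes):

from heapq import nlargest

def top_candidates(selections, n):
    # selections: (image_id, confidence) pairs collated across video sources
    return nlargest(n, selections, key=lambda pair: pair[1])

picks = top_candidates([("ep1_f1042", 0.91), ("ep2_f0310", 0.87), ("ep1_f2200", 0.64)], n=2)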

The selected image is then cropped and color-corrected based on coordinates passed by the algorithm. Some canvases also need the movie title to be placed on the image. The process makes use of the heatmap provided by our designers to perform cropping and title placement. As an example, the “Billboard” canvas shown on a movie’s landing page is right aligned, with the title and synopsis shown on the left.

Billboard Canvas

The workers that crop and color-correct images are made available as separate Titus jobs. The workflow invokes the jobs, stores each output in the artwork asset management system, and passes it on for review.

2. Review

For each artwork candidate generated by the workflow, we want to get as much feedback as possible from the Creative Production team because they have the most context about the title. However, getting producers to provide feedback on hundreds of generated images is not scalable. For this reason, we have split the review process into two rounds.

Technical Quality Control (QC)

This round of review filters out images that look obviously wrong to a human eye. Images with flaws such as actors with an open mouth, inappropriate facial expressions, or an awkward body position are filtered out in this round.

For the purpose of reviewing these images, we use a video/image annotation application that provides a simple interface for adding tags to a given list of videos or images. For each image, we ask the very basic question: “Should this image be used for artwork?”

The team reviewing these assets treats each image individually and only looks at the technical aspects of the image, regardless of the theme or genre of the title, or the quantity of images presented for a given title.

When an image is rejected, a few follow-up questions are asked to ascertain why the image is not suitable to be used as artwork.

All this review data is fed back into the image selection, cropping, and color-correction algorithms to train and improve them.

Editorial QC

Unlike technical QC, which is title agnostic, editorial QC is done by producers who are deeply familiar with the themes, storylines and characters in the title, to select artwork that will represent the title best on the Netflix service.

The application used to review generated artwork is the same application that producers use to place and review artwork requests fulfilled by design agencies. A screenshot of how generated artwork is presented to producers is shown below.

Similar to technical QC, the option here for each artwork is to approve or reject it. Producers are encouraged to provide reasons why they are rejecting an artwork.

Approved artwork makes its way to the artwork’s asset management system, where it resides alongside other agency-fulfilled artwork. From here, producers have the ability to publish it to the Netflix service.

Conclusion

We have learned a lot from our work on generating artwork. Artwork that looks good might not be the best depiction of the title’s story, and a very clear character image might be a content spoiler. All of these decisions are best made by humans, and we intend to keep it that way.

However, assisted artwork generation has a place in supporting our creative team by providing them with another avenue from which to source assets, and with careful supervision it will help meet the challenge of sourcing artwork at scale.





Introducing Dispatch


By Kevin Glisson, Marc Vilanova, Forest Monsen

Netflix is pleased to announce the open-source release of our crisis management orchestration framework: Dispatch!

Okay, but what is Dispatch? Put simply, Dispatch is:

All of the ad-hoc things you’re doing to manage incidents today, done for you, and a bunch of other things you should’ve been doing, but have not had the time!

Dispatch helps us effectively manage security incidents by deeply integrating with the existing tools used throughout an organization (Slack, GSuite, Jira, etc.). Dispatch leverages the existing familiarity with these tools to provide orchestration instead of introducing yet another tool.

This means you can let Dispatch focus on creating resources, assembling participants, sending out notifications, tracking tasks, and assisting with post-incident reviews, allowing you to focus on actually fixing the issue! Sound interesting? Continue reading!

The Challenge of Crisis Management

Managing incidents is a stressful job. You are dealing with many questions all at once: What’s the scope? Who can help me? Who do I need to engage? How do I manage all of this?

In general, every incident is unique and extraordinary; if the same incidents are happening over and over, you’re firefighting.

There are four main components to Crisis Management that we are attempting to address:

  1. Resource Management — The management of not only data collected about the incident itself but all of the metadata about the response.
  2. Individual Engagement — Understanding the best way to engage individuals and teams, and doing so based on incident context.
  3. Life Cycle Management — Providing the Incident Commander (IC) tools to easily manage the life cycle of the incident.
  4. Incident Learning — Building on past incidents to speed up the resolution of future incidents.

We will use the following terminology throughout the rest of the discussion:

  • Incident Commanders are individuals responsible for driving the incident to resolution.
  • Incident Participants are Subject Matter Experts (SMEs) who have been engaged to help resolve the incident.
  • Resources are documents, screenshots, logs, or any other pieces of digital information used during an incident.

The Checklist

For an average incident, there are quite a few steps to manage, and much of the work is typically handled on an ad-hoc basis by a human. Let’s enumerate them:

  1. Declare an Incident — There are many different entry points to a potential incident: automated alerts, an internal notification, or an external notification.
  2. Determine Incident Commander — Determining the sole individual responsible for driving a particular incident to resolution based on the incident source, type, and priority.
  3. Create Communication Channels — Communication during incidents is key. Establishing dedicated and standardized channels for communication prevents the creation of communication silos.
  4. Create Incident Document — The central document responsible for containing up-to-date incident information, including a description of the incident, links to resources, rough notes from in-person meetings, open questions, action items, and timeline information.
  5. Engage Individual Resources — An incident commander will not be able to resolve an incident by themselves; they must identify and engage additional resources within the organization to help them.
  6. Orient Individual Resources — Engaging additional resources is not enough; the Incident Commander needs to orient these resources to the situation at hand.
  7. Notify Key Stakeholders — For any given incident, key stakeholders not directly involved in resolving the incident need to be made aware of the incident.
  8. Drive Incident to Resolution — The actual resolution of the incident, creating tasks, asking questions, and tracking answers. Making note of key learnings to be addressed after resolution.
  9. Perform Post Incident Review (PIR) — Review how the incident process was performed, tracking actions to be performed after the incident, and driving learning through structuring informal knowledge.

Each of these steps has the incident commander and incident participants moving through various systems and interfaces. Each context switch adds to the cognitive load on the responder, distracting them from resolving the incident itself.

Toward Better Crisis Management

Crisis management is not a new challenge; tools like Jira, PagerDuty, and VictorOps all help organizations manage and respond to incidents. When setting out to automate our incident management process, we had two main goals:

  1. Re-use existing tools users were already familiar with, reducing the learning curve for contributing to incidents.
  2. Catalog, store and analyze our incident data to speed up resolution.

Meet Dispatch!


Dispatch is a crisis management orchestration framework that manages incident metadata and resources. It uses tools already in use throughout an organization, providing incident participants a comprehensive crisis management toolset, allowing them to focus on resolving the incident.

Unlike many of our tools, Dispatch is not tightly bound to AWS; Dispatch does not use any AWS APIs at all! While Dispatch doesn’t use AWS APIs, it leverages multiple APIs that are deeply embedded in the organization (e.g. Slack, GSuite, PagerDuty, etc.). In addition to all of the built-in integrations, Dispatch provides multiple integration points that allow it to fit into just about any existing environment.

Although developed as a tool to help Netflix manage security incidents, nothing about Dispatch is specific to a security use case. At its core, Dispatch aims to manage the entire lifecycle of an incident, focusing on engaging individuals and providing them the context they need to drive the incident to resolution.

Workflow

Let’s take a look at what an incident commander’s new workflow would look like using Dispatch:

Some key benefits of the new workflow are:

  • The incident commander no longer needs to manage access to resources or multiple data streams.
  • Communications are standardized (both in style and interval) across incidents.
  • Incident participants are automatically engaged based on the type, priority, and description of the incident.
  • Incident tasks are tracked and owners are reminded if they’re not completed on time.
  • All incident data is centrally tracked.
  • A common API is provided for internal users and tools.

We want to make reporting incidents as frictionless as possible, giving users a straightforward path to engage the resources they need in a time of crisis.

Jumping between different tools and ensuring data is correct and in sync is a low-value exercise for an incident commander. Instead, we centralized on two common tools to manage the entire lifecycle: Slack for managing incident metadata (e.g. status, title, description, priority, etc.) and Google Docs and Google Drive for managing the data itself.

When teams need to look across many incidents, Dispatch provides an Admin UI. This interface is also where incident knowledge is managed: common terms and their definitions, individuals, teams, and services. The Admin UI is how we manage incident knowledge for use in future incidents.

Architecture

Dispatch makes use of the following components:

  • Python 3.8 with FastAPI (including helper packages)
  • VueJS UI
  • Postgres

We’re shipping Dispatch with built-in plugins that allow you to create and manage resources with GSuite (Docs, Drive, Sheets, Calendar, Groups), Jira, PagerDuty, and Slack. But the plugin architecture allows for integrations with whatever tools your organization is already using.
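To give a feel for the plugin idea, here is a rough, hypothetical sketch in Python. The class shape and method names are ours for illustration and do not mirror Dispatch's actual plugin interface; see the project docs for the real one.

# Hypothetical notification plugin sketch; Dispatch's real interface differs.
class ChatNotificationPlugin:
    slug = "acme-chat-notification"   # how an installation might reference it
    title = "Acme Chat"

    def __init__(self, api_token: str):
        self.api_token = api_token

    def create_channel(self, incident_name: str) -> str:
        """Create a dedicated channel for a new incident, returning its id."""
        raise NotImplementedError

    def send(self, channel_id: str, message: str) -> None:
        """Post an incident update (status change, task reminder, etc.)."""
        raise NotImplementedError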

Getting Started

Dispatch is available now on the Netflix Open Source site. You can try out Dispatch using Docker. Detailed instructions on setup and configuration are available in our docs.

Interested in Contributing?

Feel free to reach out or submit pull requests if you have any suggestions. We’re looking forward to seeing what new plugins you create to make Dispatch work for you! We hope you’ll find Dispatch as useful as we do!

Oh, and we’re hiring — if you’d like to help us solve these sorts of problems, take a look at https://jobs.netflix.com/teams/security, and reach out!


Introducing Dispatch was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Ready for changes with Hexagonal Architecture

by Damir Svrtan and Sergii Makagon

As the production of Netflix Originals grows each year, so does our need to build apps that enable efficiency throughout the entire creative process. Our wider Studio Engineering Organization has built more than 30 apps that help content progress from pitch (aka screenplay) to playback: ranging from script content acquisition, deal negotiations and vendor management to scheduling, streamlining production workflows, and so on.

Highly integrated from the start

About a year ago, our Studio Workflows team started working on a new app that crosses multiple domains of the business. We had an interesting challenge on our hands: we needed to build the core of our app from scratch, but we also needed data that existed in many different systems.

Some of the data points we needed, such as data about movies, production dates, employees, and shooting locations, were distributed across many services implementing various protocols: gRPC, JSON API, GraphQL and more. Existing data was crucial to the behavior and business logic of our application. We needed to be highly integrated from the start.

Swappable data sources

One of the early applications for bringing visibility into our productions was built as a monolith. The monolith allowed for rapid development and quick changes while the knowledge of the space was non-existent. At one point, more than 30 developers were working on it, and it had well over 300 database tables.

Over time, applications evolved from broad service offerings towards being highly specialized. This resulted in a decision to decompose the monolith into specific services. The decision was not driven by performance issues, but by the goal of setting boundaries around all of these different domains and enabling dedicated teams to develop domain-specific services independently.

Large amounts of the data we needed for the new app were still provided by the monolith, but we knew that the monolith would be broken up at some point. We were not sure about the timing of the breakup, but we knew that it was inevitable, and we needed to be prepared.

Thus, we could leverage some of the data from the monolith at first as it was still the source of truth, but be prepared to swap those data sources to new microservices as soon as they came online.

Leveraging Hexagonal Architecture

We needed to support the ability to swap data sources without impacting business logic, so we knew we needed to keep them decoupled. We decided to build our app based on principles behind Hexagonal Architecture and Uncle Bob’s Clean Architecture.

The idea of Hexagonal Architecture is to put inputs and outputs at the edges of our design. Business logic should not depend on whether we expose a REST or a GraphQL API, and it should not depend on where we get data from — a database, a microservice API exposed via gRPC or REST, or just a simple CSV file.

The pattern allows us to isolate the core logic of our application from outside concerns. Having our core logic isolated means we can easily change data source details without significant impact or major rewrites to the codebase.

One of the main advantages we also saw in having an app with clear boundaries is our testing strategy — the majority of our tests can verify our business logic without relying on protocols that can easily change.

Defining the core concepts

Borrowing from Hexagonal Architecture, the three main concepts that define our business logic are Entities, Repositories, and Interactors.

  • Entities are the domain objects (e.g., a Movie or a Shooting Location) — they have no knowledge of where they’re stored (unlike Active Record in Ruby on Rails or the Java Persistence API).
  • Repositories are the interfaces to getting entities as well as creating and changing them. They keep a list of methods that are used to communicate with data sources and return a single entity or a list of entities. (e.g. UserRepository)
  • Interactors are classes that orchestrate and perform domain actions — think of Service Objects or Use Case Objects. They implement complex business rules and validation logic specific to a domain action (e.g., onboarding a production). A minimal sketch of all three concepts follows this list.
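Here is that sketch, in illustrative Python. The Movie domain, the names, and the language choice are our own assumptions for this post rather than the team's actual code:

from dataclasses import dataclass
from typing import Protocol

@dataclass
class Movie:                      # Entity: a plain domain object that
    id: str                       # knows nothing about where it is stored
    title: str

class MovieRepository(Protocol):  # Repository: the interface the core depends on
    def get(self, movie_id: str) -> Movie: ...
    def save(self, movie: Movie) -> None: ...

class RenameMovie:                # Interactor: orchestrates one domain action
    def __init__(self, movies: MovieRepository):
        self.movies = movies      # injected, so any conforming data source works

    def __call__(self, movie_id: str, new_title: str) -> Movie:
        movie = self.movies.get(movie_id)
        movie.title = new_title   # domain rules and validation would live here
        self.movies.save(movie)
        return movie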

With these three main types of objects, we are able to define business logic without any knowledge or care where the data is kept and how business logic is triggered. Outside of the business logic are the Data Sources and the Transport Layer:

  • Data Sources are adapters to different storage implementations.
    A data source might be an adapter to a SQL database (an Active Record class in Rails or JPA in Java), an Elasticsearch adapter, a REST API, or even an adapter to something simple such as a CSV file or a Hash. A data source implements methods defined on the repository and stores the implementation of fetching and pushing the data.
  • Transport Layer can trigger an interactor to perform business logic. We treat it as an input for our system. The most common transport layer for microservices is the HTTP API Layer and a set of controllers that handle requests. By having business logic extracted into interactors, we are not coupled to a particular transport layer or controller implementation. Interactors can be triggered not only by a controller, but also by an event, a cron job, or from the command line.
The dependency graph in Hexagonal Architecture goes inward.

With a traditional layered architecture, we would have all of our dependencies point in one direction, each layer above depending on the layer below. The transport layer would depend on the interactors, the interactors would depend on the persistence layer.

In Hexagonal Architecture all dependencies point inward — our core business logic does not know anything about the transport layer or the data sources. Still, the transport layer knows how to use interactors, and the data sources know how to conform to the repository interface.

With this, we are prepared for the inevitable changes to other Studio systems, and whenever that needs to happen, the task of swapping data sources is easy to accomplish.

Swapping data sources

The need to swap data sources came earlier than we expected — we suddenly hit a read constraint with the monolith and needed to switch a certain read for one entity to a newer microservice exposed over a GraphQL aggregation layer. Both the microservice and the monolith were kept in sync and had the same data, so reading from one service or the other produced the same results.

We managed to transfer reads from a JSON API to a GraphQL data source within 2 hours.

The main reason we were able to pull it off so fast was due to the Hexagonal architecture. We didn’t let any persistence specifics leak into our business logic. We created a GraphQL data source that implemented the repository interface. A simple one-line change was all we needed to start reading from a different data source.
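Sketching that change in the same illustrative Python as above (the adapter classes, URLs, and environment flag are assumptions of ours), the swap reduces to which adapter is wired to the repository:

import os

class JsonApiMovieDataSource:     # adapter for the monolith's JSON API (sketch)
    def __init__(self, base_url): self.base_url = base_url
    def get(self, movie_id): ...  # fetch and map JSON into a Movie entity
    def save(self, movie): ...

class GraphqlMovieDataSource:     # adapter for the GraphQL aggregation layer (sketch)
    def __init__(self, endpoint): self.endpoint = endpoint
    def get(self, movie_id): ...  # issue a GraphQL query, map into a Movie
    def save(self, movie): ...

# The "one-line change": the active adapter is configuration, which decouples
# deployment from activation and makes rollback trivial.
movie_repository = (
    GraphqlMovieDataSource("https://aggregator.example/graphql")
    if os.environ.get("MOVIES_SOURCE") == "graphql"
    else JsonApiMovieDataSource("https://monolith.example/api")
)
rename_movie = RenameMovie(movie_repository)  # the interactor is untouched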

With a proper abstraction it was easy to change data sources

At that point, we knew that Hexagonal Architecture worked for us.

The great part about a one-line change is that it mitigates risks to the release. It is very easy to roll back in case a downstream microservice fails on initial deployment. This also enables us to decouple deployment and activation, as we can decide which data source to use through configuration.

Hiding data source details

One of the great advantages of this architecture is that we are able to encapsulate data source implementation details. We ran into a case where we needed an API call that did not yet exist — a service had an API to fetch a single resource but did not have bulk fetch implemented. After talking with the team providing the API, we realized this endpoint would take some time to deliver. So we decided to move forward with another solution to solve the problem while this endpoint was being built.

We defined a repository method that would grab multiple resources given multiple record identifiers — and the initial implementation of that method on the data source sent multiple concurrent calls to the downstream service. We knew this was a temporary solution and that the second take at the data source implementation was to use the bulk API once implemented.
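A sketch of that temporary adapter, reusing the names from our earlier illustration (the concurrency details are assumed on our side):

from concurrent.futures import ThreadPoolExecutor
from typing import List

class SingleFetchMovieDataSource:
    """Temporary adapter: satisfies the repository's bulk-fetch method by
    fanning out concurrent single-resource calls; later replaced by an
    adapter that calls the real bulk API, with no business logic changes."""
    def __init__(self, client, max_workers: int = 8):
        self.client = client
        self.pool = ThreadPoolExecutor(max_workers=max_workers)

    def get_many(self, movie_ids: List[str]) -> List[Movie]:
        return list(self.pool.map(self.client.fetch_movie, movie_ids))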

Our business logic doesn’t need to be aware of specific data source limitations.

A design like this enabled us to move forward with meeting the business needs without accruing much technical debt or the need to change any business logic afterward.

Testing strategy

When we started experimenting with Hexagonal Architecture, we knew we needed to come up with a testing strategy. We knew that a prerequisite to great development velocity was to have a test suite that is reliable and super fast. We didn’t think of it as a nice to have, but a must-have.

We decided to test our app at three different layers:

  • We test our interactors, where the core of our business logic lives but is independent of any type of persistence or transportation. We leverage dependency injection and mock any kind of repository interaction. This is where our business logic is tested in detail, and these are the tests we strive to have most of.
  • We test our data sources to determine if they integrate correctly with other services, whether they conform to the repository interface, and check how they behave upon errors. We try to minimize the amount of these tests.
  • We have integration specs that go through the whole stack, from our Transport / API layer, through the interactors, repositories, data sources, and hit downstream services. These specs test whether we “wired” everything correctly. If a data source is an external API, we hit that endpoint and record the responses (and store them in git), allowing our test suite to run fast on every subsequent invocation. We don’t do extensive test coverage on this layer — usually just one success scenario and one failure scenario per domain action.

We don’t test our repositories as they are simple interfaces that data sources implement, and we rarely test our entities as they are plain objects with attributes defined. We test entities if they have additional methods (without touching the persistence layer).
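As an illustration of the first kind of test, an interactor spec with a mocked repository might look like this (pytest-style, again reusing the names from our earlier sketch):

from unittest.mock import Mock

def test_rename_movie_updates_title():
    movies = Mock()               # mocked repository: no database, no network
    movies.get.return_value = Movie(id="m1", title="Untitled Project")

    renamed = RenameMovie(movies)("m1", "The Crown")

    assert renamed.title == "The Crown"
    movies.save.assert_called_once_with(renamed)  # persisted via the interface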

We have room for improvement, such as not pinging any of the services we rely on but relying 100% on contract testing. With a test suite written in the above manner, we manage to run around 3000 specs in 100 seconds on a single process.

It’s lovely to work with a test suite that can easily be run on any machine, and our development team can work on their daily features without disruption.

Delaying decisions

We are in a great position when it comes to swapping data sources to different microservices. One of the key benefits is that we can delay some of the decisions about whether and how we want to store data internal to our application. Based on the feature’s use case, we even have the flexibility to determine the type of data store — whether it be Relational or Documents.

Uncle Bob put it well:

The purpose of a good architecture is to delay decisions. Why? Because when we delay a decision, we have more information when it comes time to make it.

At the beginning of a project, we have the least amount of information about the system we are building. We should not lock ourselves into an architecture with uninformed decisions leading to a project paradox.

The decisions we made make sense for our needs now and have enabled us to move fast. The best part of Hexagonal Architecture is that it keeps our application flexible for future requirements to come.


Ready for changes with Hexagonal Architecture was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

SVT-AV1: an open-source AV1 encoder and decoder

by Andrey Norkin, Joel Sole, Mariana Afonso, Kyle Swanson, Agata Opalach, Anush Moorthy, Anne Aaron

SVT-AV1 is an open-source AV1 codec implementation hosted on GitHub https://github.com/OpenVisualCloud/SVT-AV1/ under a BSD + patent license. As mentioned in our earlier blog post, Intel and Netflix have been collaborating on the SVT-AV1 encoder and decoder framework since August 2018. The teams have been working closely on SVT-AV1 development, discussing architectural decisions, implementing new tools, and improving compression efficiency. Since open-sourcing the project, other partner companies and the open-source community have contributed to SVT-AV1. In this tech blog, we will report the current status of the SVT-AV1 project, as well as the characteristics and performance of the encoder and decoder.

SVT-AV1 codebase status

The SVT-AV1 repository includes both an AV1 encoder and decoder, which share a significant amount of the code. The SVT-AV1 decoder is fully functional and compliant with the AV1 specification for all three profiles (Main, High, and Professional).

The SVT-AV1 encoder supports all AV1 tools which contribute to compression efficiency. Compared to the most recent master version of libaom (AV1 reference software), SVT-AV1 is similar in compression efficiency and at the same time achieves significantly lower encoding latency on multi-core platforms when using its inherent parallelization capabilities.

SVT-AV1 is written in C and can be compiled on major platforms, such as Windows, Linux, and macOS. In addition to the pure C function implementations, which allow for more flexible experimentation, the codec features extensive assembly and intrinsic optimizations for the x86 platform. See the next section for an outline of the main SVT-AV1 features that allow high performance at competitive compression efficiency. SVT-AV1 also includes extensive documentation on the encoder design, targeted to facilitate the onboarding process for new developers.

Architectural features

One of Intel’s goals for SVT-AV1 development was to create an AV1 encoder that could offer performance and scalability. SVT-AV1 uses parallelization at several stages of the encoding process, which allows it to adapt to the number of available cores, including the newest servers with significant core count. This makes it possible for SVT-AV1 to decrease encoding time while still maintaining compression efficiency.

The SVT-AV1 encoder uses multi-dimensional (process-, picture/tile-, and segment-based) parallelism, multi-stage partitioning decisions, block-based multi-stage and multi-class mode decisions, and RD-optimized classification to achieve attractive trade-offs between compression and performance. Another feature of the SVT architecture is open-loop hierarchical motion estimation, which makes it possible to decouple the first stage of motion estimation from the rest of the encoding process.

Compression efficiency and performance

Encoder performance

SVT-AV1 reaches similar compression efficiency as libaom at the slowest speed settings. During the codec development, we have been tracking the compression and encoding results at the https://videocodectracker.dev/ site. The plot below shows the improvements in the compression efficiency of SVT-AV1 compared to the libaom encoder over time. Note that the libaom compression has also been improving over time, and the plot below represents SVT-AV1 catching up with the moving target. In the plot, the Y-axis shows the additional bitrate in percent needed to achieve similar quality as libaom encoder according to three metrics. The plot shows the results of the 2-pass encoding mode in both codecs. SVT-AV1 uses 4-thread mode, whereas libaom operates in a single-thread mode. The SVT-AV1 results for the 1-pass fixed-QP encoding mode, commonly used in research, are even more competitive, as detailed below.

Reducing BD-rate between SVT-AV1 and libaom in 2-pass encoding mode

The comparison results of SVT-AV1 against libaom on the objective-1-fast test set are presented in the table below. For estimating encoding times, we used an Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz machine with 52 physical cores and 96 GB of RAM, with 60 jobs running in parallel. Both codecs use a bi-directional hierarchical prediction structure of 16 pictures. The results are presented for 1-pass mode with fixed frame-level QP offsets. A single-threaded compression mode is used. Below, we compute the BD-rates for the various quality metrics: PSNR on all three color planes, VMAF, and MS-SSIM. A negative BD-rate indicates that the SVT-AV1 encodes produce the same quality with the indicated relative reduction in bitrate. As seen below, SVT-AV1 demonstrates a 16.5% decrease in encoding time compared to libaom while being slightly more efficient in compression ability. Note that the encoding times ratio may vary depending on the instruction sets supported by the platform. The results have been obtained on the SVT-AV1 cs2 branch (a development branch that is currently being merged into the master, git hash 3a19f29) against the libaom master branch (git hash fe72512). The QP values used to calculate the BD-rates are: 20, 32, 43, 55, 63.

BD-rates of SVT-AV1 vs libaom in 1-pass encoding mode with fixed QP offsets. Negative numbers indicate the reduction in bitrate needed to reach the same quality level. The overall encoding time difference is the change in total CPU time for all sequences and QPs of SVT-AV1 compared to that of libaom.

*The overall encoding CPU time difference is calculated as the change in total CPU time for all sequences and QPs of the test codec compared to that of the anchor. It is not equal to the average of per-sequence values. For each sequence, the encoding CPU time difference is calculated as the change in total CPU time for all QPs for that sequence.
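For readers unfamiliar with the metric, a BD-rate summarizes as a single percentage the average bitrate difference between two codecs at equal quality. A common way to approximate it is to fit each codec's rate-quality points in the log-rate domain and integrate the gap between the fitted curves. The Python sketch below is our own illustration, not code from the SVT-AV1 repository:

import numpy as np

def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    # Fit cubic polynomials of log-rate as a function of quality,
    # following the standard Bjontegaard delta-rate construction.
    q1, q2 = np.asarray(quality_anchor), np.asarray(quality_test)
    p1 = np.polyfit(q1, np.log(rate_anchor), 3)
    p2 = np.polyfit(q2, np.log(rate_test), 3)

    # Integrate each fitted log-rate curve over the overlapping quality range.
    lo, hi = max(q1.min(), q2.min()), min(q1.max(), q2.max())
    int1 = np.polyval(np.polyint(p1), hi) - np.polyval(np.polyint(p1), lo)
    int2 = np.polyval(np.polyint(p2), hi) - np.polyval(np.polyint(p2), lo)

    # Convert the mean log-rate gap back to a percent bitrate difference;
    # negative means the test codec needs less bitrate for the same quality.
    return (np.exp((int2 - int1) / (hi - lo)) - 1) * 100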

Since all sequences in the objective-1-fast test set have 60 frames, both codecs use one key frame. The following command line parameters have been used to compare the codecs.

libaom parameters:

--passes=1 --lag-in-frames=25 --auto-alt-ref=1 --min-gf-interval=16 --max-gf-interval=16 --gf-min-pyr-height=4 --gf-max-pyr-height=4 --kf-min-dist=65 --kf-max-dist=65 --end-usage=q --use-fixed-qp-offsets=1 --deltaq-mode=0 --enable-tpl-model=0 --cpu-used=0

SVT-AV1 parameters:

--preset 1 --scm 2 --keyint 63 --lookahead 0 --lp 1

The results above demonstrate the excellent objective performance of SVT-AV1. In addition, SVT-AV1 includes implementations of some subjective quality tools, which can be used if the codec is configured for subjective quality.

Decoder performance

On the objective-1-fast test set, the SVT-AV1 decoder is slightly faster than the libaom decoder in 1-thread mode, with larger improvements in 4-thread mode. We observe even larger speed gains over the libaom decoder when decoding bitstreams with multiple tiles using the 4-thread mode. The testing has been performed on Windows, Linux, and macOS platforms. We believe the performance is satisfactory for a research decoder, where the trade-offs favor easier experimentation over the further optimizations necessary for a production decoder.

Testing framework

To help ensure codec conformance, especially for new code contributions, the code has been comprehensively covered with unit tests and end-to-end tests. The unit tests are built on the Google Test framework. The unit and end-to-end tests are triggered automatically for each pull request to the repository, which is supported by GitHub Actions. The tests support sharding, and they run in parallel to speed up the turn-around time on pull requests.

Unit and e2e tests have passed for this pull request

What’s next?

Over the last several months, SVT-AV1 has matured to become a complete encoder/decoder package providing competitive compression efficiency and performance trade-offs. The project is bolstered with extensive unit test coverage and documentation.

Our hope is that the SVT-AV1 codebase helps further adoption of AV1 and encourages more research and development on top of the current AV1 tools. We believe that the demonstrated advantages of SVT-AV1 make it a good platform for experimentation and research. We invite colleagues from industry and academia to check out the project on Github, reach out to the codebase maintainers for questions and comments or join one of the SVT-AV1 Open Dev meetings. We welcome more contributors to the project.


SVT-AV1: an open-source AV1 encoder and decoder was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Netflix brings safer and faster streaming experience to the living room on crowded networks using TLS 1.3

By Sekwon Choi

At Netflix, we are obsessed with the best streaming experiences. We want playback to start instantly and to never stop unexpectedly in any network environment. We are also committed to protecting users’ privacy and service security without sacrificing any part of the playback experience.

To achieve that, we are efficiently using ABR (adaptive bitrate streaming) for a better playback experience, DRM (Digital Rights Management) to protect our service, and TLS (Transport Layer Security) to protect customer privacy and to create a safer streaming experience.

Netflix on consumer electronics devices such as TVs, set-top boxes and streaming sticks was until recently using TLS 1.2 for streaming traffic. Now we support TLS 1.3 for safer and faster experiences.

What is TLS?

For two parties to communicate securely, a secure channel is necessary. This needs to have the following three properties.

  • Authentication: Identity of the communicating party is verified.
  • Confidentiality: Data sent over the channel is only visible to the endpoints.
  • Integrity: Data sent over the channel cannot be modified by attackers without detection.

The TLS protocol is designed to provide a secure channel between two peers by providing tools and methods to achieve the above properties.

TLS 1.3

TLS 1.3 is the latest version of the Transport Layer Security protocol. It is simpler, more secure and more efficient than its predecessor.
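Our device stacks are custom, but as a quick illustration of what requiring TLS 1.3 means in practice, a client built on Python's standard ssl module could enforce it like this (example.com is a placeholder host):

import socket
import ssl

ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # refuse anything older

with socket.create_connection(("example.com", 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
        print(tls.version())  # "TLSv1.3"
        print(tls.cipher())   # the negotiated AEAD cipher suite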

Perfect Forward Secrecy

One thing we believe is very important at Netflix is providing PFS (Perfect Forward Secrecy).

PFS is a feature of the key exchange algorithm that assures that session keys will not be compromised, even if the server’s private key is compromised. By generating new keys for each session, PFS protects past sessions against the future compromise of secret keys.

TLS 1.2 supports key exchange algorithms with PFS, but it also allows key exchange algorithms that do not support PFS. Even with the previous version of TLS 1.2, Netflix has always selected a key exchange algorithm that provides PFS such as ECDHE (Elliptic Curve Diffie Hellman Ephemeral). TLS 1.3, however, enforces this concept even more by removing all the key exchange algorithms that do not provide PFS, such as static RSA.

Authenticated Encryption

For encryption, TLS 1.3 removes all weak ciphers and uses only Authenticated Encryption with Associated Data (AEAD). This assures the confidentiality, integrity, and authenticity of the data. We use AES Galois/Counter Mode, as it also provides good performance and high throughput.
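As a small illustration of the AEAD property (a sketch using the third-party Python cryptography package, not our production stack): decryption verifies both the ciphertext and the associated data, so any tampering is detected.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)
nonce = os.urandom(12)            # a nonce must never repeat under one key
aead = AESGCM(key)

ct = aead.encrypt(nonce, b"media payload", b"record header")
assert aead.decrypt(nonce, ct, b"record header") == b"media payload"
# Flipping any bit of ct or the header makes decrypt raise InvalidTag.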

Secure Handshake

While the above changes are important, the most important change in TLS 1.3 is perhaps its redesign of the handshake protocol.

The TLS 1.2 handshake was not designed to protect the integrity of the entire handshake. It protected only the part of the handshake after the cipher suite negotiation, which opened up the possibility of downgrade attacks that may allow attackers to force the use of insecure cipher suites.

With TLS 1.3, the server signs the entire handshake including the cipher suite negotiation and thus prevents the attacker from downgrading the cipher suite.

Also in TLS 1.2, extensions were sent in the clear in the ServerHello. Now with TLS 1.3, even extensions are encrypted and all handshake messages after ServerHello are now encrypted.

Reduced Handshake

TLS 1.2 supports numerous key exchange algorithms, cipher suites and digital signatures, including weak and vulnerable ones. Therefore, it requires more messages to perform a handshake and two network round trips.

In contrast, the handshake in TLS 1.3 now requires only one round trip, with a simplified design and with all weak and vulnerable algorithms removed.

In addition, it has a new feature called 0-RTT, or TLS early data, for the resumed handshake. This allows an application to include application data with its initial handshake message, instead of having to wait until the handshake completes.

At Netflix, by the efficient resumption of the TLS session and careful use of 0-RTT for the streaming data, we can reduce the play delay.

A/B Testing Result

We were pretty confident that TLS 1.3 would bring us better security from the analysis of its protocol composition, but we did not know how it would perform in the context of streaming.

Since TLS 1.3’s performance-related feature is the 0-RTT mode with the resumed handshake, our hypothesis was that TLS 1.3 would reduce play delay, as we are no longer required to wait for the handshake to finish and can instead issue the HTTP request for media data and receive the HTTP response for media data earlier.

To see the actual performance of TLS 1.3 in the field, we performed an experiment with

  • User accounts: half-million user accounts per cell.
  • Device type: mid-performance device with Quad ARM core @ 1.7GHz.
  • Control cell: TLS 1.2
  • Treatment cell: TLS 1.3

Play Delay

Play Delay is defined by how long it takes for playback to start. Below are the results of the play delay measured in the experiment. They show that TLS 1.3 improves play delay across all network conditions, with the largest gains on slower or congested networks (represented by the 0.75 quantile and above).

Below is the time series median play delay graph for this mid-performance device in the field. It also shows that playback starts earlier with TLS 1.3.

Media Rebuffer

At Netflix, we define a media rebuffer as a non-network originated rebuffer. It typically occurs when media data is not processed quickly enough by the device due to the high load on the CPU. Comparing the control cell with TLS 1.2, the experiment cell with TLS 1.3 showed about a 7.4% improvement in media rebuffers. This result implies that using TLS 1.3 with 0-RTT is more efficient and can reduce the CPU load.

Conclusion

From the security analysis, we are confident that TLS 1.3 improves communication security over TLS 1.2. From the field test, we are confident that TLS 1.3 provides us a better streaming experience.

At the time of writing this article, the Internet is experiencing higher than usual traffic and congestion. We believe saving even small amounts of data and round trips can be meaningful and even better if it also provides a more secure and efficient streaming experience.

Therefore, we have started deploying TLS 1.3 on newer consumer electronics devices and we are expecting even more devices to be deployed with TLS 1.3 capability in the near future.


How Netflix brings safer and faster streaming experience to the living room on crowded networks… was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Hyper Scale VPC Flow Logs enrichment to provide Network Insight

By Hariharan Ananthakrishnan and Angela Ho

The Cloud Network Infrastructure that Netflix utilizes today is a large distributed ecosystem that consists of specialized functional tiers and services such as DirectConnect, VPC Peering, Transit Gateways, NAT Gateways, etc. While we strive to keep the ecosystem simple, the inherent nature of leveraging a variety of technologies will lead us to complications and challenges such as:

  • App Dependencies and Data Flow Mappings: Without understanding and having visibility into an application’s dependencies and data flows, it is difficult for both service owners and centralized teams to identify systemic issues.
  • Pathway Validation: Netflix’s velocity of change within the production streaming environment can result in the inability of services to communicate with other resources.
  • Service Segmentation: The ease of the cloud deployments has led to the organic growth of multiple AWS accounts, deployment practices, interconnection practices, etc. Without having network visibility, it’s not possible to improve our reliability, security and capacity posture.
  • Network Availability: The expected continued growth of our ecosystem makes it difficult to understand our network bottlenecks and potential limits we may be reaching.

Cloud Network Insight is a suite of solutions that provides both operational and analytical insight into the Cloud Network Infrastructure to address the identified problems. By collecting, accessing, and analyzing network data from a variety of sources like VPC Flow Logs, ELB Access Logs, and Custom Exporter Agents, we can provide Network Insight to users through multiple data visualization techniques like Lumen, Atlas, etc.

VPC Flow Logs

VPC Flow Logs is an AWS feature that captures information about the IP traffic going to and from network interfaces in a VPC. At Netflix we publish the Flow Log data to Amazon S3. Flow Logs are enabled tactically on either a VPC or subnet or network interface. A flow log record represents a network flow in the VPC. By default, each record captures a network internet protocol (IP) traffic flow (characterized by a 5-tuple on a per network interface basis) that occurs within an aggregation interval.

version vpc-id subnet-id instance-id interface-id account-id type srcaddr dstaddr srcport dstport pkt-srcaddr pkt-dstaddr protocol bytes packets start end action tcp-flags log-status
3 vpc-12345678 subnet-012345678 i-07890123456 eni-23456789 123456789010 IPv4 52.213.180.42 10.0.0.62 43416 5001 52.213.180.42 10.0.0.62 6 568 8 1566848875 1566848933 ACCEPT 2 OK

The IP addresses within the Cloud can move from one EC2 instance or Titus container to another over time. To map each IP back to application metadata, Netflix uses Sonar, an IPv4 and IPv6 address identity tracking service. VPC Flow Logs are enriched using IP metadata from Sonar as they are ingested.

With a large ecosystem at Netflix, we receive hundreds of thousands of VPC Flow Log files in S3 each hour. And in order to gain visibility into these logs, we need to somehow ingest and enrich this data.

So how do we ingest all these S3 files?

At Netflix, we have the option to use Spark as our distributed computing platform. It is easier to tune a large Spark job for a consistent volume of data. As you may know, S3 can emit messages when events (such as file creation) occur, which can be directed into an AWS SQS queue. In addition to the S3 object path, these events also conveniently include file size, which allows us to intelligently decide how many messages to grab from the SQS queue and when to stop. What we get is a group of messages representing a set of S3 files, which we humorously call “Mouthfuls”. In other words, we are able to ensure that our Spark app does not “eat” more data than it was tuned to handle.
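A rough sketch of how such a size-bounded batch might be assembled (this is our illustration, not Sqooby’s actual code; the queue URL and size limit are placeholders):

import json
import boto3

MAX_MOUTHFUL_BYTES = 8 * 1024**3  # assumed tuning knob: what one Spark app can chew
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"  # placeholder

sqs = boto3.client("sqs")

def next_mouthful():
    files, total = [], 0
    while total < MAX_MOUTHFUL_BYTES:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=5)
        messages = resp.get("Messages", [])
        if not messages:
            break                 # queue drained; ship what we have
        for msg in messages:
            for record in json.loads(msg["Body"]).get("Records", []):
                obj = record["s3"]["object"]
                files.append(obj["key"])
                total += obj["size"]  # size comes with the event: no S3 call needed
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    return files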

We named this library Sqooby. It works well for other pipelines that have thousands of files landing in S3 per day. But how does it hold up to the likes of Netflix VPC Flow Logs, whose volumes are orders of magnitude greater? It didn’t. The primary limitation was that AWS SQS queues have a limit of 120 thousand in-flight messages, and we found ourselves needing to hold more than 120 thousand messages in flight at a time in order to keep up with the volume of files.

Requirements

There are multiple ways you can solve this problem and many technologies to choose from. As with any sustainable engineering design, focusing on simplicity is very important. This means using existing infrastructure and established patterns within the Netflix ecosystem as much as possible and minimizing the introduction of new technologies.

Equally important is the resilience, recoverability, and supportability of the solution. A malformed file should not hold up or back up the pipeline (resilience). If unexpected environmental factors cause the pipeline to get backed up, it should be able to recover by itself. And excellent logging is needed for debugging purposes and supportability. These characteristics allow for an on-call response time that is relaxed and more in line with traditional big data analytical pipelines.

Hyper Scale

At Netflix, our culture gives us the freedom to decide how we solve problems as well as the responsibility of maintaining our solutions so that we may choose wisely. So how did we solve this scale problem that meets all of the above requirements? By applying existing established patterns in our ecosystem on top of Sqooby. In this case, it’s a pattern which generates events (directed into another AWS SQS queue) whenever data lands in a table in a datastore. These events represent a specific cut of data from the table.

We applied this pattern to the Sqooby log tables which contained information about s3 files for each Mouthful. What we got were events that represented Mouthfuls. Spark could look up and retrieve the data in the s3 files that the Mouthful represented. This intermediate step of persisting Mouthfuls allowed us to easily “eat” through S3 event SQS messages at great speed, converting them to far fewer Mouthful SQS Messages which would each be consumed by a single Spark app instance. Because we ensured that our ingestion pipeline could concurrently write/append to the final VPC Flow Log table, this meant that we could scale out the number of Spark app instances we spin up.

Tuning for Hyper Scale

On this journey of ingesting VPC flow logs, we found ourselves tweaking configurations in order to tune throughput of the pipeline. We modified the size of each Mouthful and tuned the number of Spark executors per Spark app while being mindful of cluster capacity. We also adjusted the frequency in which Spark app instances are spun up such that any backlog would burn off during a trough in traffic.

Summary

Providing Network Insight into the Cloud Network Infrastructure using VPC Flow Logs at hyper scale is made possible with the Sqooby architecture. After several iterations of this architecture and some tuning, Sqooby has proven to be able to scale.

We are currently ingesting and enriching hundreds of thousands of VPC Flow Logs S3 files per hour and providing visibility into our cloud ecosystem. The enriched data allows us to analyze networks across a variety of dimensions (e.g. availability, performance, and security), to ensure applications can effectively deliver their data payload across a globally dispersed cloud-based ecosystem.

Special Thanks To

Bryan Keller, Ryan Blue


Hyper Scale VPC Flow Logs enrichment to provide Network Insight was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix

By Hank Jacobs, Senior Site Reliability Engineer on CORE

We’re privileged to be in the business of bringing joy to our customers at Netflix. Whether it’s a compelling new series or an innovative product feature, we strive to provide a best-in-class service that people love and can enjoy anytime, anywhere. A key underpinning to keeping our customers happy and streaming is a strong focus on reliability.

Reliability, formally speaking, is the ability of a system to function under stated conditions for a period of time. Put simply, reliability means a system should work and continue working. From failure injection testing to regularly exercising our region evacuation abilities, Netflix engineers invest a lot in ensuring the services that comprise Netflix are robust and reliable. Many teams contribute to the reliability of Netflix and own the reliability of their service or area of expertise. The Critical Operations and Reliability Engineering team at Netflix (CORE) is responsible for the reliability of the Netflix service as a whole.

CORE is a team consisting of Site Reliability Engineers, Applied Resilience Engineers, and Performance Engineers. Our group is responsible for the reliability of business-critical operations. Unlike most SRE teams, we do not own or operate any customer-serving services nor do we routinely make production code changes, build infrastructure, or embed on service teams. Our primary focus is ensuring Netflix stays up. Practically speaking, this includes activities such as systemic risk identification, handling the lifecycle of an incident, and reliability consulting.

Teams at Netflix follow the service ownership model: they operate what they build. Most of the time, service owners catch issues before they impact customers. Things still occasionally go sideways and incidents happen that impact the customer experience. This is where the CORE team steps in: CORE configures, maintains, and responds to alerts that monitor high-level business KPIs (stream starts per second, for instance). When one of those alerts fires, the CORE on-call engineer assesses the situation to determine the scope of impact, identify involved services, and engage service owners to assist with mitigation. From there, CORE begins to manage the incident.

Incident management at Netflix doesn’t follow common management practices like the ITIL model. In an incident, the CORE on-call engineer generally operates as the Incident Manager. The Incident Manager is responsible for performing or delegating activities such as:

  • Coordination — bringing in relevant service owners to help with the investigation and focus on mitigation
  • Decision Making — making key choices to facilitate the mitigation and remediation of customer impact (e.g. deciding if we should evacuate a region)
  • Scribe — keeping track of incident details such as involved teams, mitigation efforts, graphs of the current impact, etc.
  • Technical Sleuthing — assisting the responding service owners with understanding what systems are contributing to the incident
  • Liaison — communicating information about the incident across business functions with both internal and external teams as necessary

Once the customer impact is successfully mitigated, CORE is then responsible for coordinating the post-incident analysis. Analysis comes in many shapes and sizes depending on the impact and uniqueness of the incident, but most incidents go through what we call “memorialization”. This process includes a write-up of what happened, what mitigations took place, and what follow-up work was discussed. For particularly unique, interesting, or impactful incidents, CORE may host an Incident Review or engage in a deeper, long-form investigation. Most post-incident analysis, especially for impactful incidents, is done in partnership with one of CORE’s Applied Resilience Engineers. A key point to emphasize is that all incident analysis work focuses on the sociotechnical aspects of an incident. Consequently, post-incident analysis tends to uncover many practical learnings and improvements for all involved. We frequently socialize these findings outside of those directly involved to help share learnings across the company.

So what happens when a CORE engineer is not on-call or doing incident analysis? Unsurprisingly, the response varies widely based on the skillset and interests of the individual team member. In broad strokes, examples include:

  • Preserving operational visibility and response capabilities — fixing and improving our dashboards, alerts, and automation
  • Reliability consulting — discussing various aspects including architectural decisions, systemic observability, application performance, and on-call health training
  • Systematic risk identification and mitigation — partnering with various teams to identify and fix systematic risks revealed by incidents
  • Internal tooling — building and maintaining tools that support and augment our incident response capabilities
  • Learning and re-learning the changes to a complex, ever-moving system
  • Building and maintaining relationships with other teams

Overall, we’ve found that this form of reliability work best suits the needs and goals of Netflix. Reliability being CORE’s primary focus affords us the bandwidth to both proactively explore potential business-critical risks as well as effectively respond to those risks. Additionally, having a broad view of the system allows us to spot systematic risks as they develop. By being a separate and central team, we can more efficiently share learnings across the larger engineering organization and more easily consult with teams on an ad hoc basis. Ultimately, CORE’s singular focus on reliability empowers us to reveal business-critical sociotechnical risks, facilitate effective responses to those risks and ensure Netflix continues to bring joy to our customers.


Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.


Netflix Studio Engineering Overview

By Steve Urban, Sridhar Seetharaman, Shilpa Motukuri, Tom Mack, Erik Strauss, Hema Kannan, CJ Barker

Netflix is revolutionizing the way a modern studio operates. Our mission in Studio Engineering is to build a unified, global, and digital studio that powers the effective production of amazing content.

Netflix produces some of the world’s most beloved and award-winning films and series, including The Irishman, The Crown, La Casa de Papel, Ozark, and Tiger King. In an effort to effectively and efficiently produce this content we are looking to improve and automate many areas of the production process. We combine our entertainment knowledge and our technical expertise to provide innovative technical solutions from the initial pitch of an idea to the moment our members hit play.

Why Does Studio Engineering Exist?

We enable Netflix to build a unified, global and digital studio that powers the effective production of amazing content.
Studio Engineering’s ‘Why’

The journey of a Netflix Original title from the moment it first comes to us as a pitch, to that press of the play button is incredibly complex. Producing great content requires a significant amount of coordination and collaboration from Netflix employees and external vendors across the various production phases. This process starts before the deal has been struck and continues all the way through launch on the service, involving people representing finance, scheduling, human resources, facilities, asset delivery, and many other business functions. In this overview, we will shed light on the complexity and magnitude of this journey and update this post with links to deeper technical blogs over time.

Content Lifecycle: Pitch, Development, Production, On-Service
Pitch-to-Play

Mission at a Glance

  • Creative pitch: Combine the best of machine learning and human intuition to help Netflix understand how a proposed title compares to other titles, estimate how many subscribers will enjoy it, and decide whether or not to produce it.
  • Business negotiations: Empower the Netflix Legal team with data to help with deal negotiations and acquisition of rights to produce and stream the content.
  • Pre-Production: Provide solutions to plan for resource needs, and discovery of people and vendors to continue expanding the scale of our productions. Any given production requires the collaboration of hundreds of people with varying expertise, so finding exactly the right people and vendors for each job is essential.
  • Production: Enable content creation from script to screen that optimizes the production process for efficiency and transparency. Free up creative resources to focus on what’s important: producing amazing and entertaining content.
  • Post-Production: Help our creative partners collaborate to refine content into their final vision with digital content logistics and orchestration.

What’s Next?

Studio Engineering will be publishing a series of articles providing business and technical insights as we further explore the details behind the journey from pitch to play. Stay tuned as we expand on each stage of the content lifecycle over the coming months!


Netflix Studio Engineering Overview was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Machine Learning for a Better Developer Experience

Stanislav Kirdey, William High

Imagine having to go through 2.5GB of log entries from a failed software build — 3 million lines — to search for a bug or a regression that happened on line 1M. It’s probably not even doable manually! However, one smart approach to make it tractable might be to diff the lines against a recent successful build, with the hope that the bug produces unusual lines in the logs.

Standard md5 diff would run quickly but still produce at least hundreds of thousands of candidate lines to look through because it surfaces character-level differences between lines. Fuzzy diffing using k-nearest neighbors clustering from machine learning (the kind of thing logreduce does) produces around 40,000 candidate lines but takes an hour to complete. Our solution produces 20,000 candidate lines in 20 minutes of computing — and thanks to the magic of open source, it’s only about a hundred lines of Python code.

The application is a combination of neural embeddings, which encode the semantic information in words and sentences, and locality sensitive hashing, which efficiently assigns approximately nearby items to the same buckets and faraway items to different buckets. Combining embeddings with LSH is a great idea that appears to be less than a decade old.

Note — we used TensorFlow 2.2 on CPU with eager execution for transfer learning and scikit-learn’s NearestNeighbors for k-nearest neighbors. There are sophisticated approximate nearest neighbors implementations that would be better for a model-based nearest neighbors solution.
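Putting those pieces together, the skeleton of the approach might look roughly like this. The encoder choice, file names, and distance threshold below are our assumptions for illustration:

import tensorflow_hub as hub
from sklearn.neighbors import NearestNeighbors

# A pretrained sentence encoder (model choice is illustrative).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

baseline = open("success_build.log").read().splitlines()
index = NearestNeighbors(n_neighbors=1).fit(embed(baseline).numpy())

failed = open("failed_build.log").read().splitlines()
dist, _ = index.kneighbors(embed(failed).numpy())

# Lines far from everything in the successful build are the candidates.
candidates = [line for line, d in zip(failed, dist[:, 0]) if d > 0.5]  # threshold assumed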

What embeddings are and why we needed them

Assembling a k-hot bag-of-words is a typical (useful!) starting place for deduplication, search, and similarity problems around un- or semi-structured text. This type of bag-of-words encoding looks like a dictionary with individual words and their counts. Here’s what it would look like for the sentence “log in error, check log”.

{“log”: 2, “in”: 1, “error”: 1, “check”: 1}

This encoding can also be represented using a vector where the index corresponds to a word and the value is the count. Here is “log in error, check log” as a vector, where the first entry is reserved for “log” word counts, the second for “in” word counts, and so forth.

[2, 1, 1, 1, 0, 0, 0, 0, 0, …]

Notice that the vector consists of many zeros. Zero-valued entries represent all the other words in the dictionary that were not present in that sentence. The total number of vector entries possible, or dimensionality of the vector, is the size of your language’s dictionary, which is often millions or more but down to hundreds of thousands with some clever tricks.

Now let’s look at the dictionary and vector representations of “problem authenticating”. The words corresponding to the first five vector entries do not appear at all in the new sentence.

{“problem”: 1, “authenticating”: 1}
[0, 0, 0, 0, 1, 1, 0, 0, 0, …]

These two sentences are semantically similar, which means they mean essentially the same thing, but lexically are as different as they can be, which is to say they have no words in common. In a fuzzy diff setting, we might want to say that these sentences are too similar to highlight, but md5 and k-hot document encoding with kNN do not support that.

Erica Sinclair is helping us get LSH probabilities right

Dimensionality reduction uses linear algebra or artificial neural networks to place semantically similar words, sentences, and log lines near to each other in a new vector space, using representations known as embeddings. In our example, “log in error, check log” might have a five-dimensional embedding vector

[0.1, 0.3, -0.5, -0.7, 0.2]

and “problem authenticating” might be

[0.1, 0.35, -0.5, -0.7, 0.2]

These embedding vectors are near to each other by distance measures like cosine similarity, unlike their k-hot bag-of-word vectors. Dense, low dimensional representations are really useful for short documents, like lines of a build or a system log.
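A quick check with NumPy shows just how close the two example embeddings are under cosine similarity:

import numpy as np

a = np.array([0.1, 0.3, -0.5, -0.7, 0.2])   # "log in error, check log"
b = np.array([0.1, 0.35, -0.5, -0.7, 0.2])  # "problem authenticating"

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cos), 4))  # ~0.999: nearly identical despite sharing no words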

In reality, you’d be replacing the thousands or more dictionary dimensions with just 100 information-rich embedding dimensions (not five). State-of-the-art approaches to dimensionality reduction include singular value decomposition of a word co-occurrence matrix (GloVe) and specialized neural networks (word2vec, BERT, ELMo).

Erica Sinclair and locality-preserving hashing

What about clustering? Back to the build log application

We joke internally that Netflix is a log-producing service that sometimes streams videos. We deal with hundreds of thousands of requests per second in the fields of exception monitoring, log processing, and stream processing. Being able to scale our NLP solutions is just a must-have if we want to use applied machine learning in telemetry and logging spaces. This is why we cared about scaling our text deduplication, semantic similarity search, and textual outlier detection — there is no other way if the business problems need to be solved in real-time.

Our diff solution involves embedding each line into a low dimensional vector (optionally “fine-tuning” or updating the embedding model at the same time), assigning it to a cluster, and identifying lines in different clusters as “different”. Locality sensitive hashing is a probabilistic algorithm that permits constant time cluster assignment and near-constant time nearest neighbors search.

LSH works by mapping a vector representation to a scalar number, or more precisely a collection of scalars. While standard hashing algorithms aim to avoid collisions between any two inputs that are not the same, LSH aims to avoid collisions if the inputs are far apart and promote them if they are different but near to each other in the vector space.

The embedding vector for “log in error, check log” might be mapped to the binary number 01 — and 01 then represents the cluster. The embedding vector for “problem authenticating” would with high probability be mapped to the same binary number, 01. This is how LSH enables fuzzy matching, and the inverse problem, fuzzy diffing. Early applications of LSH were over high dimensional bag-of-words vector spaces — we couldn’t think of any reason it wouldn’t work on embedding spaces just as well, and there are signs that others have had the same thought.

Using LSH to place characters in the same bucket but in the Upside Down.
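One classic construction, random hyperplane LSH (sometimes called SimHash), makes this concrete; the sketch below is our illustration. Each random hyperplane contributes one bit, and vectors separated by a small angle usually fall on the same side of every plane, so they hash to the same bucket:

import numpy as np

rng = np.random.default_rng(42)
planes = rng.normal(size=(2, 5))  # 2 hyperplanes over 5-dim embeddings -> 2-bit buckets

def lsh_bucket(vec):
    bits = planes @ vec > 0       # which side of each hyperplane the vector falls on
    return "".join("1" if b else "0" for b in bits)

a = np.array([0.1, 0.3, -0.5, -0.7, 0.2])   # "log in error, check log"
b = np.array([0.1, 0.35, -0.5, -0.7, 0.2])  # "problem authenticating"
print(lsh_bucket(a), lsh_bucket(b))         # almost certainly the same bucket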

The work we did on applying LSH and neural embeddings to text outlier detection on build logs now allows an engineer to look through a small fraction of the log’s lines to identify and fix errors in potentially business-critical software, and it also allows us to achieve semantic clustering of almost any log line in real-time.

We now bring this benefit from semantic LSH to every build at Netflix. The semantic part lets us group seemingly dissimilar items based on their meanings and surface them in outlier reports.

Conclusion

The mature state of open source transfer learning data products and SDKs has allowed us to solve semantic nearest neighbor search via LSH in remarkably few lines of code. We became especially interested in investigating the special benefits that transfer learning and fine-tuning might bring to the application. We’re excited to have an opportunity to solve such problems and to help people do what they do better and faster than before.

We hope you’ll consider joining Netflix and becoming one of the stunning colleagues whose life we make easier with machine learning. Inclusion is a core Netflix value and we are particularly interested in fostering a diversity of perspectives on our technical teams. So if you are in analytics, engineering, data science, or any other field and have a background that is atypical for the industry we’d especially like to hear from you!

If you have any questions about opportunities at Netflix, please reach out to the authors on LinkedIn.


Machine Learning for a Better Developer Experience was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
