
Revisiting 1 Million Writes per second

In an article we posted in November 2011, Benchmarking Cassandra Scalability on AWS - Over a million writes per second, we showed how Cassandra (C*) scales linearly as you add more nodes to a cluster. With the advent of new EC2 instance types, we decided to revisit this test. Unlike the initial post, we were not interested in proving C*’s scalability. Instead, we were looking to quantify the performance these newer instance types provide.
What follows is a detailed description of our new test, as well as the throughput and latency results of those tests.

Node Count, Software Versions & Configuration

The C* Cluster

The Cassandra cluster ran DataStax Enterprise 3.2.5, which incorporates C* 1.2.15.1, on 285 nodes of instance type i2.xlarge. We ran JVM 1.7.40_b43 with a 12GB heap on Ubuntu 12.04 LTS. Data and logs share the same EXT3 mount point.
You will notice that the previous test used m1.xlarge instances. Although we could have achieved similar write throughput with that less powerful instance type, in production the majority of our clusters read more than they write. The choice of i2.xlarge (an SSD-backed instance type) is more realistic and better showcases read throughput and latencies.
The full schema follows:
create keyspace Keyspace1
 with placement_strategy = 'NetworkTopologyStrategy'
 and strategy_options = {us-east : 3}
 and durable_writes = true;


use Keyspace1;


create column family Standard1
 with column_type = 'Standard'
 and comparator = 'AsciiType'
 and default_validation_class = 'BytesType'
 and key_validation_class = 'BytesType'
 and read_repair_chance = 0.1
 and dclocal_read_repair_chance = 0.0
 and populate_io_cache_on_flush = false
 and gc_grace = 864000
 and min_compaction_threshold = 999999
 and max_compaction_threshold = 999999
 and replicate_on_write = true
 and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
 and caching = 'KEYS_ONLY'
 and column_metadata = [
   {column_name : 'C4',
   validation_class : BytesType},
   {column_name : 'C3',
   validation_class : BytesType},
   {column_name : 'C2',
   validation_class : BytesType},
   {column_name : 'C0',
   validation_class : BytesType},
   {column_name : 'C1',
   validation_class : BytesType}]
 and compression_options = {'sstable_compression' : ''};
You will notice that min_compaction_threshold and max_compaction_threshold were set high. Although we don’t set these parameters to exactly those values in Production, it does reflect the fact that we prefer to control when compactions take place and initiate a full compaction on our own schedule.

The Client

The client application used was Cassandra Stress, running on 60 client nodes of instance type r3.xlarge. This instance type has half the cores of the m2.4xlarge instances we used in the previous test. However, the r3.xlarge instances were still able to push the load required to reach the same throughput (while using 40% fewer threads) at almost half the price. The clients ran JVM 1.7.40_b43 on Ubuntu 12.04 LTS.

Network Topology

Netflix deploys Cassandra clusters with a Replication Factor of 3. We also spread our Cassandra rings across 3 Availability Zones. We equate a C* rack to an Amazon Availability Zone (AZ). This way, in the event of an Availability Zone outage, the Cassandra ring still has 2 copies of the data and will continue to serve requests.
In the previous post all clients were launched from the same AZ. This differs from our actual production deployment, where stateless applications are also deployed equally across three zones. Clients in one AZ attempt to always communicate with C* nodes in the same AZ. We call this zone-aware connections. This feature is built into Astyanax, Netflix’s C* Java client library. As a further speed enhancement, Astyanax also inspects the record’s key and sends requests to nodes that actually serve the token range of the record about to be written or read. Although any C* coordinator can fulfill any request, if the node is not part of the replica set, there will be an extra network hop. We call this making token-aware requests.
Since this test uses Cassandra Stress, I do not use token-aware requests. However, through some simple grep and awk-fu, this test is zone-aware. This is more representative of our actual production network topology.


Latency & Throughput Measurements

We’ve documented our use of Priam as a sidecar to help with token assignment, backups & restores. Our internal version of Priam adds some extra functionality. We use the Priam sidecar to collect C* JMX telemetry and send it to our Insights platform, Atlas. We will be adding this functionality to the open source version of Priam in the coming weeks.
Below are the JMX properties we collect to measure latency and throughput; a short sketch of how we turn them into the reported numbers follows the lists.

Latency

  • AVG & 95%ile Coordinator Latencies
    • Read
      • StorageProxyMBean.getRecentReadLatencyHistogramMicros() provides an array which the AVG & 95%ile can then be calculated
    • Write
      • StorageProxyMBean.getRecentWriteLatencyHistogramMicros() provides an array which the AVG & 95%ile can then be calculated

Throughput

  • Coordinator Operations Count
    • Read
      • StorageProxyMBean.getReadOperations()
    • Write
      • StorageProxyMBean.getWriteOperations()
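To make the calculation concrete, here is a minimal sketch of how the histogram arrays and operation counters above can be turned into the reported numbers. The bucket boundaries and the polling mechanics are assumptions for illustration (our collection actually runs inside Priam, and the real bucket offsets come from Cassandra's EstimatedHistogram).

# Sketch only: deriving AVG / 95%ile latency from the recent-latency histogram
# array and throughput from the coordinator operation counters. The bucket
# offsets (bucket upper bounds in microseconds) are an assumption here.

def histogram_stats(bucket_offsets, counts):
    """Approximate AVG and 95th percentile (micros) from per-bucket counts."""
    total = sum(counts)
    if total == 0:
        return 0.0, 0.0
    # Approximate the mean by placing each sample at its bucket's upper bound.
    avg = sum(bound * count for bound, count in zip(bucket_offsets, counts)) / float(total)
    # Walk the buckets until 95% of the samples are covered.
    threshold, seen = 0.95 * total, 0
    for bound, count in zip(bucket_offsets, counts):
        seen += count
        if seen >= threshold:
            return avg, bound
    return avg, bucket_offsets[-1]

def throughput(prev_count, curr_count, interval_secs):
    """Coordinator ops/sec from two samples of getReadOperations()/getWriteOperations()."""
    return (curr_count - prev_count) / float(interval_secs)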

The Test

I performed the following 4 tests:
  1. A full write test at CL One
  2. A full write test at CL Quorum
  3. A mixed test of writes and reads at CL One
  4. A mixed test of writes and reads at CL Quorum

100% Write

Unlike in the original post, this test shows a sustained >1 million writes/sec. Not many applications will only write data. However, a possible use for this type of footprint is a telemetry system or the backend of an Internet of Things (IoT) application. The data can then be fed into a BI system for analysis.

CL One

As in the original post, this test runs at CL One. The average coordinator latency is a little over 5 milliseconds, with a 95th percentile of 10 milliseconds.
[Figure: WriteCLOne.png - write throughput and coordinator latency at CL One]
Every client node ran the following Cassandra Stress command:
cassandra-stress -d [list of C* IPs] -t 120 -r -p 7102 -n 1000000000  -k -f [path to log] -o INSERT

CL LOCAL_QUORUM

For the use case where a higher level of consistency in writes is desired, this test shows the throughput achieved when the million writes per second test runs at a CL of LOCAL_QUORUM.
The write throughput is hugging the 1 million writes/sec mark at an average coordinator latency of just under 6 milliseconds and a 95th percentile of 17 milliseconds.
[Figure: WriteCLQuorum.png - write throughput and coordinator latency at CL LOCAL_QUORUM]
Every client node ran the following Cassandra Stress command:
cassandra-stress -d [list of C* IPs] -t 120 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f [path to log] -o INSERT

Mixed - 10% Write 90% Read

1 Million writes/sec makes for an attractive headline. Most applications, however, have a mix of reads and writes. After investigating some of the key applications at Netflix, I noticed a mix of 10% writes and 90% reads. So this mixed test consists of reserving 10% of the total threads for writes and 90% for reads. The test is unbounded. This is still not fully representative of the actual footprint an app might experience. However, it is a good indicator of how much throughput can be handled by the cluster and what the latencies might look like when pushed hard.
To avoid reading data from memory or from the file system cache, I let the write test run for a few days until a compacted data to memory ratio of 2:1 was reached.

CL One

C* achieves the highest throughput and highest level of availability when used in a CL One configuration. This does require developers to embrace eventual consistency and to design their applications around this paradigm. More info on this subject can be found here.
The Write throughput is >200K writes/sec with an average coordinator latency of about 1.25 milliseconds and a 95th percentile of 2.5 milliseconds.
The Read throughput is around 900K reads/sec with an average coordinator latency  of 2 milliseconds and a 95th percentile of 7.5 milliseconds.
[Figure: MixedWriteCLOne.png - write throughput and latency, mixed test at CL One]
[Figure: MixedReadCLOne.png - read throughput and latency, mixed test at CL One]
Every client node ran the following 2 Cassandra Stress commands:
cassandra-stress -d $cCassList -t 30 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f /data/stressor/stressor.log -o INSERT
cassandra-stress -d $cCassList -t 270 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f /data/stressor/stressor.log -o READ


CL LOCAL_QUORUM

Most application developers starting off with C* will default to CL Quorum writes and reads. This gives them the opportunity to dip their toes into the distributed database world without also having to tackle the extra challenge of rethinking their applications for eventual consistency.
The Write throughput is just below 200K writes/sec with an average coordinator latency of 1.75 milliseconds and a 95th percentile of 20 milliseconds.
The Read throughput is around 600K reads/sec with an average coordinator latency of 3.4 milliseconds and a 95th percentile of 35 milliseconds.
[Figure: MixedWriteCLQuorum.png - write throughput and latency, mixed test at CL LOCAL_QUORUM]
[Figure: MixedReadCLQuorum.png - read throughput and latency, mixed test at CL LOCAL_QUORUM]
Every client node ran the following 2 Cassandra Stress commands:
cassandra-stress -d $cCassList -t 30 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f [path to log] -o INSERT
cassandra-stress -d $cCassList -t 270 -r -p 7102 -e LOCAL_QUORUM -n 1000000000  -k -f [path to log] -o READ

Cost

The total costs involved in running this test include the EC2 instance costs as well as the inter-zone network traffic costs. We use Boundary to monitor our C* network usage.
[Figure: boundary.png - inter-zone network traffic as measured by Boundary]
The above shows that we were transferring a total of about 30 Gbps between Availability Zones.
Here is the breakdown of the costs incurred to run the 1 million writes per second test. These are retail prices that can be referenced here.
Instance Type / Item | Cost per Minute | Count | Total Price per Minute
i2.xlarge | $0.0142 | 285 | $4.047
r3.xlarge | $0.0058 | 60 | $0.348
Inter-zone traffic | $0.01 per GB | 3.75 GB/s * 60 = 225 GB per minute | $2.25

Total Cost per Minute: $6.645
Total Cost per Half Hour: $199.35
Total Cost per Hour: $398.70
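The totals follow directly from the line items above; a quick sketch of the arithmetic:

# Reproducing the cost table: retail EC2 prices plus inter-zone traffic.
i2_xlarge = 0.0142 * 285          # $4.047 per minute for the C* cluster
r3_xlarge = 0.0058 * 60           # $0.348 per minute for the stress clients
# ~30 Gbps between zones is about 3.75 GB/s -> 225 GB per minute at $0.01/GB
traffic = 3.75 * 60 * 0.01        # $2.25 per minute

per_minute = i2_xlarge + r3_xlarge + traffic
print(round(per_minute, 3))       # 6.645
print(round(per_minute * 30, 2))  # 199.35 per half hour
print(round(per_minute * 60, 2))  # 398.7 per hour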


Final Thoughts

Most companies probably don’t need to process this much data. For those that do, this is a good indication of what types of cost, latencies and throughput one could expect while using the newer i2 and r3 AWS instance types. Every application is different, and your mileage will certainly vary.
This test was performed over the course of a week during my free time. This isn’t an exhaustive performance study, nor did I get into any deep C*, system, or JVM tuning. I know you can probably do better. If working with distributed databases at scale and squeezing out every last drop of performance is what drives you, please join the Netflix CDE team.


Scaling A/B Testing on Netflix.com with Node.js

By Chris Saint-Amant

Last Wednesday night we held our third JavaScript Talks event at our Netflix headquarters in Los Gatos, Calif. Alex Liu and Micah Ransdell discussed some of the unique UI engineering challenges that go along with running hundreds of A/B tests each year.

We are currently migrating our website UI layer to Node.js, and have taken the opportunity to step back and evaluate the most effective way to build A/B tests. The talk explored some of the patterns we’ve built in Node.js to handle the complexity of running so many multivariate UI tests at scale. These solutions ultimately enable quick feature development and rapid test iteration.

Slides from the talk are now online. No video this time around, but you can come see us talk about UI approaches to A/B testing and our adoption of Node.js at several upcoming events. We hope to see you there!
Lastly, don’t forget to check out our Netflix UI Engineering channel on YouTube to watch videos from past JavaScript Talks.








Netflix Hack Day - Summer 2014



Hack Day is a tradition at Netflix, as it is for many Silicon Valley companies. It is a great way to get away from everyday work and to provide a fun, experimental, collaborative and creative outlet for our product development and technical teams.


Similar to our Hack Day in February, we saw some really incredible ideas and implementations in our latest iteration last week.  If something interesting and useful comes from Hack Day, that is great, but the real motivation is fun and collaboration. With that spirit in mind, we had over 150 people code from Thursday morning to Friday morning, yielding more than 50 “hacks.”


The teams produced hacks covering a wide array of areas, including development productivity, internal insights tools, quirky and fun mashups, and of course a breadth of product-related feature ideas.  All hackers then presented and the audience of Netflix employees rated each hack on a 5-star scale to determine our seven category winners and a “People’s Choice Award.”


Below are some examples of some of the hacks to give you a taste of what we saw this time around.  We should note that, while we think these hacks are very cool and fun, they may never become part of the Netflix product, internal infrastructure, or otherwise be used beyond Hack Day. We are surfacing them here publicly to share the spirit of the Netflix Hack Day.

And thanks to all of the hackers for putting together some incredible work in just 24 hours!  If you are interested in being a part of our next Hack Day, let us know by checking out our jobs site!



Netflix Hue
Use Philips 'smart' lightbulbs to make your room's ambient lighting match the titles that you are browsing and watching.




Nerdflix
Text and console-based Netflix UI.





Oculix
A 3D room version of our UI for the Oculus Rift, complete with gesture support.
By Ian McKay, Steve McGuire, Rachel Nordman, Kris Range, Alex Bustin, M. Frank Emanuel




Netflix Mini
Chrome extension that allows for multi-task watching in a mini screen.
By Adam Butterworth, Paul Anastasopoulos and Art Burov





Dropflix
Bringing actions that are currently only accessible via Display Pages into the Home & Browse UI.
By Ben Johnson, David Sutton





Circle of Life
Home page shows an alternate Netflix UI experience based on a network graph of titles.




And here are some pictures taken during the event.







Announcing Scumblr and Sketchy - Search, Screenshot, and Reclaim the Internet

Netflix is pleased to announce the open source release of two security-related web applications: Scumblr and Sketchy!


Scumbling The Web

Many security teams need to stay on the lookout for Internet-based discussions, posts, and other bits that may be of impact to the organizations they are protecting. These teams then take a variety of actions based on the nature of the findings discovered. Netflix’s security team has these same requirements, and today we’re releasing some of the tools that help us in these efforts.
Scumblr is a Ruby on Rails web application that allows searching the Internet for sites and content of interest. Scumblr includes a set of built-in libraries that allow creating searches for common sites like Google, Facebook, and Twitter. For other sites, it is easy to create plugins to perform targeted searches and return results. Once you have Scumblr set up, you can run the searches manually or automatically on a recurring basis.




Scumblr leverages a gem called Workflowable (which we are also open sourcing) that allows setting up flexible workflows that can be associated with search results. These workflows can be customized so that different types of results go through different workflow processes depending on how you want to action them. Workflowable also has a plug-in architecture that allows triggering custom automated actions at each step of the process.
Scumblr also integrates with Sketchy, which allows automatic screenshot generation of identified results to provide a snapshot-in-time of what a given page and result looked like when it was identified.


Architecture

Scumblr makes use of the following components:
  • Ruby on Rails 4.0.9
  • Backend database for storing results
  • Redis + Sidekiq for background tasks
  • Workflowable for workflow creation and management
  • Sketchy for screenshot capture
We’re shipping Scumblr with built-in search libraries for seven common services including Google, Twitter, and Facebook.


Getting Started with Scumblr and Workflowable

Scumblr and Workflowable are available now on the Netflix Open Source site. Detailed instructions on setup and configuration are available in the projects’ wiki pages.

Sketchy

One of the features we wanted to see in Scumblr was the ability to collect screenshots and text content from potentially malicious sites - this allows security analysts to preview Scumblr results without the risk of visiting the site directly. We wanted this collection system to be isolated from Scumblr and also resilient to sites that may perform malicious actions. We also decided it would be nice to build an API that we could use in other applications outside of Scumblr.   Although a variety of tools and frameworks exist for taking screenshots, we discovered a number of edge cases that made taking reliable screenshots difficult - capturing screenshots from AJAX-heavy sites, cut-off images with virtual X drivers, and SSL and compression issues in the PhantomJS driver for Selenium, to name a few. In order to solve these challenges, we decided to leverage the best possible tools and create an API framework that would allow for reliable, scalable, and easy to use screenshot and text scraping capabilities.  Sketchy to the rescue!

Architecture:

At a high level, Sketchy contains the following components:
  • Python + Flask to serve Sketchy
  • PhantomJS to take lazy captures of AJAX heavy sites
  • Celery to manage jobs and Redis to schedule and store job results
  • Backend database to store capture records (by leveraging SQLAlchemy)

Sketchy Overview

Sketchy at its core provides a scalable, task-based framework to capture screenshots, scrape page text, and save HTML through a simple-to-use API.  These captures can be stored locally or in an AWS S3 bucket.  Optionally, token auth can be configured and callbacks can be used if required. Sketchy uses PhantomJS with lazy rendering to ensure AJAX-heavy sites are captured correctly. Sketchy also uses the Celery task management system, allowing users to scale Sketchy accordingly and manage time-intensive captures for large sites.
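As a rough illustration of how another application might drive Sketchy over HTTP, the sketch below submits a URL for capture. The host, endpoint path, field names, and token header here are assumptions for illustration only; consult the Sketchy wiki for the actual API routes and authentication setup.

# Hypothetical client for Sketchy's capture API -- the endpoint, fields, and
# auth header are assumptions; see the Sketchy wiki for the real interface.
import requests

SKETCHY = "http://sketchy.example.com"           # assumed Sketchy host
HEADERS = {"Token": "changeme"}                  # assumed token auth header

resp = requests.post(
    "%s/api/v1.0/capture" % SKETCHY,             # assumed capture endpoint
    json={"url": "http://site-of-interest.example.com"},
    headers=HEADERS,
)
resp.raise_for_status()
print(resp.json())   # capture record to poll, or rely on a configured callback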

Getting Started with Sketchy

Sketchy is available now on the Netflix Open Source site and setup is straightforward.  In addition, we've also created a Docker image for Sketchy for interested users. Please visit the Sketchy wiki for documentation on how to get started.

Conclusion

Scumblr and Sketchy are helping the Netflix security team keep an eye on potential threats to our environment every day. We hope that the open source community can find new and interesting uses for the newest additions to the Netflix Open Source Software initiative. Scumblr, Sketchy, and the Workflowable gem are all available on our GitHub site now!


-Andy Hoernecke and Scott Behrens (Netflix Cloud Security Team)

Distributed Neural Networks with GPUs in the AWS Cloud

by Alex Chen, Justin Basilico, and Xavier Amatriain

As we have described previously on this blog, at Netflix we are constantly innovating by looking for better ways to find the best movies and TV shows for our members. When a new algorithmic technique such as Deep Learning shows promising results in other domains (e.g. Image Recognition, Neuro-imaging, Language Models, and Speech Recognition), it should not come as a surprise that we would try to figure out how to apply such techniques to improve our product. In this post, we will focus on what we have learned while building infrastructure for experimenting with these approaches at Netflix. We hope that this will be useful for others working on similar algorithms, especially if they are also leveraging the Amazon Web Services (AWS) infrastructure. However, we will not detail how we are using variants of Artificial Neural Networks for personalization, since it is an active area of research.

Many researchers have pointed out that most of the algorithmic techniques used in the trendy Deep Learning approaches have been known and available for some time. Much of the more recent innovation in this area has been around making these techniques feasible for real-world applications. This involves designing and implementing architectures that can execute these techniques using a reasonable amount of resources in a reasonable amount of time. The first successful instance of large-scale Deep Learning made use of 16000 CPU cores in 1000 machines in order to train an Artificial Neural Network in a matter of days. While that was a remarkable milestone, the required infrastructure, cost, and computation time are still not practical.

Andrew Ng and his team addressed this issue in follow-up work. Their implementation used GPUs as a powerful yet cheap alternative to large clusters of CPUs. Using this architecture, they were able to train a model 6.5 times larger in a few days using only 3 machines. In another study, Schwenk et al. showed that training these models on GPUs can improve performance dramatically, even when comparing to high-end multicore CPUs.

Given our well-known approach and leadership in cloud computing, we set out to implement a large-scale Neural Network training system that leveraged both the advantages of GPUs and the AWS cloud. We wanted to use a reasonable number of machines to implement a powerful machine learning solution using a Neural Network approach. We also wanted to avoid needing special machines in a dedicated data center and instead leverage the full, on-demand computing power we can obtain from AWS.

In architecting our approach for leveraging computing power in the cloud, we sought to strike a balance that would make it fast and easy to train Neural Networks by looking at the entire training process. For computing resources, we have the capacity to use many GPU cores, CPU cores, and AWS instances, which we would like to use efficiently. For an application such as this, we typically need to train not one, but multiple models either from different datasets or configurations (e.g. different international regions). For each configuration we need to perform hyperparameter tuning, where each combination of parameters requires training a separate Neural Network. In our solution, we take the approach of using GPU-based parallelism for training and using distributed computation for handling hyperparameter tuning and different configurations.

Distributing Machine Learning: At what level?


Some of you might be thinking that the scenario described above is not what people think of as distributed Machine Learning in the traditional sense.  For instance, in the work by Ng et al. cited above, they distribute the learning algorithm itself between different machines. While that approach might make sense in some cases, we have found that it is not always the norm, especially when a dataset can be stored on a single instance. To understand why, we first need to explain the different levels at which a model training process can be distributed.

In a standard scenario, we will have a particular model with multiple instances. Those instances might correspond to different partitions in your problem space. A typical situation is to have different models trained for different countries or regions, since the feature distribution and even the item space might be very different from one region to the other. This represents the first level at which we can decide to distribute our learning process. We could, for example, have a separate machine train a model for each of the 41 countries where Netflix operates, since each region can be trained entirely independently.

However, as explained above, training a single instance actually implies training and testing several models, each corresponding to a different combination of hyperparameters. This represents the second level at which the process can be distributed. This level is particularly interesting if there are many parameters to optimize and you have a good strategy to optimize them, like Bayesian optimization with Gaussian Processes. The only communication between runs is the hyperparameter settings and the test evaluation metrics.

Finally, the algorithm training itself can be distributed. While this is also interesting, it comes at a cost. For example, training an ANN is a comparatively communication-intensive process. Given that you are likely to have thousands of cores available in a single GPU instance, it is very convenient if you can squeeze the most out of that GPU and avoid getting into costly cross-machine communication scenarios. This is because communication within a machine using memory is usually much faster than communication over a network.

The pseudocode below illustrates the three levels at which a training process like ours can be distributed.

for each region                             -> level 1 distribution
    for each hyperparameter combination     -> level 2 distribution
        train model                         -> level 3 distribution
    end for
end for

In this post we will explain how we addressed level 1 and 2 distribution in our use case. Note that one of the reasons we did not need to address level 3 distribution is that our model has millions of parameters (compared to the billions in the original paper by Ng).

Optimizing the CUDA Kernel


Before we addressed the distribution problem, though, we had to make sure the GPU-based parallel training was efficient. We approached this by first getting a proof-of-concept to work on our own development machines and then addressing the issue of how to scale and use the cloud as a second stage. We started by using a Lenovo S20 workstation with a Nvidia Quadro 600 GPU. This GPU has 98 cores and provides a useful baseline for our experiments, especially considering that we planned on using a more powerful machine and GPU in the AWS cloud. Our first attempt to train our Neural Network model took 7 hours.

We then ran the same code to train the model on an EC2 cg1.4xlarge instance, which has a more powerful Tesla M2050 with 448 cores. However, the training time jumped from 7 to over 20 hours.  Profiling showed that most of the time was spent in calls to the Nvidia Performance Primitives library, e.g. nppsMulC_32f_I, nppsExp_32f_I.  Calling the npps functions repeatedly took 10x more system time on the cg1 instance than on the Lenovo S20.

While we tried to uncover the root cause, we worked around the issue by reimplementing the npps functions with customized CUDA kernels, e.g. replacing the nppsMulC_32f_I function with:

__global__
void KernelMulC(float c, float *data, int n)
{
     // Each thread scales one element of the array by the constant c.
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) {
          data[i] = c * data[i];
     }
}

Replacing all npps functions in this way in the Neural Network code reduced the total training time on the cg1 instance from over 20 hours to just 47 minutes when training on 4 million samples. Training 1 million samples took 96 seconds of GPU time. Using the same approach on the Lenovo S20, the total training time also dropped from 7 hours to 2 hours.  This makes us believe that the implementation of these functions is suboptimal regardless of the card specifics.

PCI configuration space and virtualized environments


While we were implementing this “hack”, we also worked with the AWS team to find a principled solution that would not require a kernel patch. In doing so, we found that the performance degradation was related to the NVreg_CheckPCIConfigSpace parameter of the kernel. According to RedHat, setting this parameter to 0 disables very slow accesses to the PCI configuration space. In a virtualized environment such as the AWS cloud, these accesses cause a trap in the hypervisor that results in even slower access.

NVreg_CheckPCIConfigSpace is a parameter of the kernel module nvidia-current, which can be set using:

sudo modprobe nvidia-current NVreg_CheckPCIConfigSpace=0

We tested the effect of changing this parameter using a benchmark that calls MulC repeatedly (128x1000 times). Below are the results (runtime in sec) on our cg1.4xlarge instances:

(runtime in sec) | KernelMulC | npps_MulC
CheckPCI=1 | 3.37 | 103.04
CheckPCI=0 | 2.56 | 6.17

As you can see, disabling accesses to PCI space had a spectacular effect on the original npps functions, decreasing the runtime by 95%. The effect was significant even for our optimized kernel functions, saving almost 25% in runtime. However, it is important to note that even when the PCI access is disabled, our customized functions performed almost 60% better than the default ones.

We should also point out that there are other options, which we have not explored so far but could be useful for others. First, we could look at optimizing our code by applying a kernel fusion trick that combines several computation steps into one kernel to reduce memory accesses. Finally, we could think about using Theano, the GPU math expression compiler for Python, which is supposed to also improve performance in these cases.

G2 Instances


While our initial work was done using cg1.4xlarge EC2 instances, we were interested in moving to the new EC2 GPU g2.2xlarge instance type, which has a GRID K520 GPU (GK104 chip) with 1536 cores. Currently our application is also bound by GPU memory bandwidth, and the GRID K520’s memory bandwidth of 198 GB/sec is an improvement over the Tesla M2050’s 148 GB/sec. Of course, using a GPU with faster memory would also help (e.g. TITAN’s memory bandwidth is 288 GB/sec).

We repeated the same comparison between the default npps functions and our customized ones (with and without PCI space access) on the g2.2xlarge instances.

(runtime in sec) | KernelMulC | npps_MulC
CheckPCI=1 | 2.01 | 299.23
CheckPCI=0 | 0.97 | 3.48

One initial surprise was that we measured worse performance for npps on the g2 instances than the cg1 when PCI space access was enabled. However, disabling it improved performance between 45% and 65% compared to the cg1 instances. Again, our KernelMulC customized functions are over 70% better, with benchmark times under a second. Thus, switching to G2 with the right configuration allowed us to run our experiments faster, or alternatively larger experiments in the same amount of time.

Distributed Bayesian Hyperparameter Optimization


Once we had optimized the single-node training and testing operations, we were ready to tackle the issue of hyperparameter optimization. If you are not familiar with this concept, here is a simple explanation: Most machine learning algorithms have parameters to tune, which are often called hyperparameters to distinguish them from model parameters that are produced as a result of the learning algorithm. For example, in the case of a Neural Network, we can think about optimizing the number of hidden units, the learning rate, or the regularization weight. In order to tune these, you need to train and test several different combinations of hyperparameters and pick the best one for your final model. A naive approach is to simply perform an exhaustive grid search over the different possible combinations of reasonable hyperparameters. However, when faced with a complex model where training each one is time consuming and there are many hyperparameters to tune, it can be prohibitively costly to perform such exhaustive grid searches. Luckily, you can do better than this by thinking of parameter tuning as an optimization problem in itself.

One way to do this is to use a Bayesian Optimization approach where an algorithm’s performance with respect to a set of hyperparameters is modeled as a sample from a Gaussian Process. Gaussian Processes are a very effective way to perform regression and, while they can have trouble scaling to large problems, they work well when there is a limited amount of data, like what we encounter when performing hyperparameter optimization. We use the Spearmint package to perform Bayesian Optimization and find the best hyperparameters for the Neural Network training algorithm. We hook up Spearmint with our training algorithm by having it choose the set of hyperparameters and then training a Neural Network with those parameters using our GPU-optimized code. This model is then tested and the test metric results are used to update the next hyperparameter choices made by Spearmint.
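In practice, the glue between Spearmint and the trainer boils down to an objective function that receives a proposed hyperparameter combination and returns the test metric to minimize. Below is a minimal sketch of that contract; the parameter names and the train_and_evaluate() helper are placeholders, not our actual training code or Spearmint's exact configuration format.

# Sketch of the objective the optimizer drives: train a network with the
# proposed hyperparameters on the GPU, evaluate it, and return the metric
# that guides the next proposal. train_and_evaluate() is a placeholder.

def train_and_evaluate(learning_rate, num_hidden, regularization):
    # ... launch the CUDA-backed training, then score on held-out data ...
    raise NotImplementedError

def main(job_id, params):
    """Entry point called once per proposed hyperparameter combination."""
    test_error = train_and_evaluate(
        learning_rate=params["learning_rate"][0],
        num_hidden=int(params["num_hidden"][0]),
        regularization=params["regularization"][0],
    )
    return test_error   # the optimizer minimizes the returned value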

We’ve squeezed high performance from our GPU but we only have 1-2 GPU cards per machine, so we would like to make use of the distributed computing power of the AWS cloud to perform the hyperparameter tuning for all configurations, such as different models per international region. To do this, we use the distributed task queue Celery to send work to each of the GPUs. Each worker process listens to the task queue and runs the training on one GPU. This allows us, for example, to tune, train, and update several models daily for all international regions.
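A minimal sketch of how such a worker could be wired up with Celery follows; the broker URL, task name, and train_model() helper are assumptions for illustration rather than our production setup.

# Sketch of a Celery worker that runs one training job per GPU. The broker
# URL, task naming, and train_model() are placeholders.
from celery import Celery

app = Celery("nn_training", broker="redis://scheduler.example.com:6379/0")

def train_model(region, **hyperparams):
    # Placeholder for the GPU-optimized training and evaluation code.
    raise NotImplementedError

@app.task(name="train.neural_network")
def train_neural_network(region, hyperparams):
    # Each worker process is pinned to a single GPU (e.g. via CUDA_VISIBLE_DEVICES)
    # and trains one model configuration end to end.
    return train_model(region=region, **hyperparams)

# A driver can then fan out work across regions and configurations, e.g.:
#   train_neural_network.delay("us", {"learning_rate": 0.01, "num_hidden": 512})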

Although the Spearmint + Celery system is working, we are currently evaluating more complete and flexible solutions using HTCondor or StarCluster. HTCondor can be used to manage the workflow of any Directed Acyclic Graph (DAG). It handles input/output file transfer and resource management. In order to use Condor, we need each compute node to register with the manager with a given ClassAd (e.g. SLOT1_HAS_GPU=TRUE; STARTD_ATTRS=HAS_GPU). Then the user can submit a job with the configuration "Requirements=HAS_GPU" so that the job only runs on AWS instances that have an available GPU. The main advantage of using Condor is that it also manages the distribution of the data needed for the training of the different models. Condor also allows us to run the Spearmint Bayesian optimization on the manager instead of having to run it on each of the workers.

Another alternative is to use StarCluster, which is an open source cluster computing framework for AWS EC2 developed at MIT. StarCluster runs on the Oracle Grid Engine (formerly Sun Grid Engine) in a fault-tolerant way and is fully supported by Spearmint.

Finally, we are also looking into integrating Spearmint with Jobman in order to better manage the hyperparameter search workflow.
The figure below illustrates the generalized setup using Spearmint plus Celery, Condor, or StarCluster:




Conclusions


Implementing bleeding edge solutions such as using GPUs to train large-scale Neural Networks can be a daunting endeavour. If you need to do it in your own custom infrastructure, the cost and the complexity might be overwhelming. Leveraging the public AWS cloud can have obvious benefits, provided care is taken in the customization and use of the instance resources. By sharing our experience we hope to make it much easier and more straightforward for others to develop similar applications.

We are always looking for talented researchers and engineers to join our team. So if you are interested in solving these types of problems, please take a look at some of our open positions on the Netflix jobs page.


Introducing Chaos Engineering

Chaos Monkey was launched in 2010 with our move to Amazon Web Services, and thus the Netflix Simian Army was born.  Our ecosystem has evolved as we’ve introduced thousands of devices, many new countries, a Netflix-optimized CDN often referred to as Open Connect, a growing catalog of Netflix Originals, and new and exciting UI advancements.  Not only has complexity grown, but our infrastructure itself has grown to support our rapidly growing customer base.  As growth and evolution continue, we will experience and find new failure modes.




Our philosophy remains unchanged around injecting failure into production to ensure our systems are fault-tolerant. We are constantly testing our ability to survive “once in a blue moon” failures. In a sign of our commitment to this very philosophy, we want to double down on chaos aka failure-injection. We strive to mirror the failure modes that are possible in our production environment and simulate these under controlled circumstances.  Our engineers are expected to write services that can withstand failures and gracefully degrade whenever necessary.  By continuing to run these simulations, we are able to evaluate and improve such vulnerabilities in our ecosystem.


A great example of a new failure mode was the Christmas Eve 2012 regional ELB outage we experienced.  The Simian Army at the time only injected failures that we understood and had experienced up to that point.  In response we invested in a multi-region Active-Active infrastructure to be resilient to such events.  It’s not enough to simply build a system that is fault-tolerant to region outages; we must regularly exercise our ability to withstand them.


Each outage reinforces our commitment to chaos to ensure the most reliable experience possible for our users.  While much of the Simian Army is designed and built around maintaining our environments, Chaos Engineering is entirely focused on controlled failure injection.

The Plan for Chaos Engineering:


Establish Virtuous Chaos Cycles
A common industry practice around outages is blameless post-mortems, a discipline we practice along with action items to prevent recurrence.  In parallel with resilience patches and work to prevent recurrence, we also want to build new chaos tools to regularly and systematically test resilience to detect regressions or new conditions.


Regression testing is a well-understood discipline in software testing; chaos testing for regressions in distributed systems at scale presents a unique challenge.  We aspire to make chaos testing as well understood a discipline in production systems as other disciplines in software development.


Increase use of Reliability Design Patterns
In distributed environments there’s a challenge in both creating reliability design patterns and integrating them in a consistent manner to handle failure.  When an outage or new failure mode surfaces it may start in a single service, but all services may be susceptible to the same failure mode.  Post-mortems will lead to immediate action items for a particular involved service but do not always lead to improvement for other loosely coupled services.  Eventually other susceptible services become impacted by a failure condition that may have previously surfaced.  Hystrix is a fantastic example of a reliability design pattern that helps to create consistency in our micro-services ecosystem.
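Hystrix itself is a Java library, but the circuit-breaker idea it standardizes is simple to sketch. Below is a toy illustration of the pattern in Python; the thresholds and fallback behavior are illustrative only, not Hystrix's implementation.

# Toy circuit breaker illustrating the pattern Hystrix standardizes:
# trip after repeated failures, serve a fallback while open, retry later.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, command, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback()          # circuit open: degrade gracefully
            self.opened_at = None          # timeout elapsed: allow a retry (half-open)
        try:
            result = command()
            self.failures = 0              # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            return fallback()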


Anticipate Future Failure Modes
Ideally, distributed systems are designed to be so robust and fault-tolerant that they never fail. We must anticipate failure modes, determine ways to inject these conditions in a controlled manner, and evolve our reliability design patterns.  Anticipating such events requires creativity and deep understanding of distributed systems: two of the most critical characteristics of Chaos Engineers.


New forms of chaos and reliability design patterns are two areas we are researching within Chaos Engineering.  As we get deeper into our research, we will continue to post our findings.


For those interested in this challenging research, we’re hiring additional Chaos Engineers.  Check out the jobs for Chaos Engineering at our jobs site.

-Bruce Wong, Engineering Manager of Chaos Engineering at Netflix (sometimes referred to as Chaos Commander)

Inviso: Visualizing Hadoop Performance

by: Daniel C. Weeks



In a post last year we discussed our big data architecture and the advantages of working with big data in the cloud (read more here).  One of the key points from the article is that Netflix leverages Amazon’s Simple Storage Service (S3) as the “source of truth” for all data warehousing.  This differentiates us from the more traditional configuration where Hadoop’s distributed file system is the storage medium with data and compute residing in the same cluster.  Decentralizing the data warehouse frees us to explore new ways to manage big data infrastructure but also introduces a new set of challenges.


From a platform management perspective, being able to run multiple clusters isolated by concerns is both convenient and effective.  We experiment with new software and perform live upgrades by simply diverting jobs from one cluster to another or adjust the size and number of clusters based on need as opposed to capacity.  Genie, our execution service, abstracts the configuration and resource management for job submissions by providing a centralized service to query across all big data resources.  This cohesive infrastructure abstracts all of the orchestration from the execution and allows the platform team to be flexible and adapt to dynamic environments without impacting users of the system.


However, as a user of the system, understanding where and how a particular job executes can be confusing.  We have hundreds of platform users, ranging from casual query users to ETL developers and data scientists running tens to hundreds of queries every day.  Navigating the maze of tools, logs, and data to gather information about a specific run can be difficult and time consuming.  Some of the most common questions we hear are:


  • Why did my job run slower today than yesterday?
  • Can we expand the cluster to speed up my job?
  • What cluster did my job run on?
  • How do I get access to task logs?


These questions can be hard to answer in our environment because clusters are not persistent.  By the time someone notices a problem, the cluster that ran the query, along with detailed information, may already be gone or archived.


To help answer these questions and empower our platform users to explore and improve their job performance, we created a tool: Inviso (Latin: to go to see, visit, inspect, look at).  Inviso is a job search and visualization tool intended to help big data users understand execution performance. Netflix is pleased to add Inviso to our open source portfolio; it is released under the Apache License v2.0 and available on GitHub.


Inviso provides an easy interface to find jobs across all clusters, access other related tools, visualize performance, make detailed information accessible, and understand the environment in which jobs run.  


Searching for Jobs
Finding a specific job run should be easy, but with each Hive or Pig script abstracting multiple Hadoop jobs, finding and pulling together the full execution workflow can be painful. To simplify this process, Inviso indexes every job configuration across all clusters into ElasticSearch and provides a simple search interface to query.  Indexing job configurations into ElasticSearch is trivial because the structure is simple and flat.  With the ability to use the full Lucene query syntax, finding jobs is straightforward and powerful.
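Because each job configuration is a flat set of key/value pairs, indexing it and querying it back with Lucene syntax takes only a few lines with a standard ElasticSearch client. The sketch below is illustrative; the index name and document fields are assumptions rather than Inviso's actual schema.

# Sketch: index a flat Hadoop job configuration and query it back with
# Lucene syntax. The index name and fields are illustrative only.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es.example.com:9200"])

job_conf = {
    "job.id": "job_201407251330_0042",
    "mapreduce.job.name": "daily_playback_agg",
    "mapreduce.job.user.name": "etl",
    "cluster": "prodhive",
    "timestamp": "2014-07-25T13:30:00Z",
}
es.index(index="inviso-jobs", doc_type="config", id=job_conf["job.id"], body=job_conf)

# Full Lucene query syntax, e.g. every run of a given job on a given cluster:
hits = es.search(index="inviso-jobs",
                 q='mapreduce.job.name:"daily_playback_agg" AND cluster:prodhive',
                 sort="timestamp:desc")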


The search results are displayed in a concise table reverse ordered by time with continuous scrollback and links to various tools like the job history page, Genie, or Lipstick.  Clicking the links will take you directly to the specific page for that job.  Being able to look back over months of different runs of the same job allows for detailed analysis of how the job evolves over time.  




In addition to the interface provided by Inviso, the ElasticSearch index is quite effective for other use cases.  Since the index contains the full text of each Hive or Pig script, searching for table or UDF usage is possible as well.  Internally, we use the index to search for dependencies and scripts when modifying/deprecating/upgrading datasources, UDFs, etc.  For example, when we last upgraded Hive, the new version had keyword conflicts with some existing scripts and we were able to identify the scripts and owners to upgrade prior to rolling out the new version of Hive.  Others use it to identify who is using a specific table in case they want to change the structure or retire the table.


Visualizing Job Performance
Simply finding a job and the corresponding Hadoop resources doesn’t make it any easier to understand the performance.  Stages of a Hive or Pig script might execute serially or in parallel, impacting the total runtime.  Inviso correlates the various stages and lays them out in a swimlane diagram to show the parallelism.  Hovering over a job provides detailed information, including the full set of counters.  The stages taking the longest time, and where to focus effort to improve performance, are readily apparent.
Overview Diagram Showing Stages of a Pig Job
Below the workflow diagram is a detailed task diagram for each job showing the individual execution of every task attempt.  Laying these out in time order shows how tasks were allocated and executed.  This visual presentation can quickly convey obvious issues with jobs including data skew, slow attempts, inconsistent resource allocation, speculative execution, and locality.  Visualizing job performance in this compact format allows users to quickly scan the behavior of many jobs for problems.  Hovering over an individual task will bring up task specific details including counters making it trivial to compare task details and performance.


Diagram Showing Task Details
Diagram of Execution Locality
Tasks are ordered by scheduler allocation providing insight into how many resources were available at the time and how long it took for the attempt to start.  The color indicates the task type or status.  Failed or killed tasks even present the failure reason and stack trace, so delving into the logs isn’t necessary.  If you do want to look at a specific task log, simply select the task and click the provided link to go directly to the log for that task.


The detailed information used to populate this view comes directly from the job history file produced for every mapreduce job.  Inviso has a single REST endpoint to parse the detailed information for a job and represent it as a json structure.  While this capability is similar to what the MapReduce History Server REST API provides, the difference is that Inviso provides the complete structure in a single response.  Gathering this information from the History Server would require thousands of requests with the current API and could impact the performance of other tools that rely on the history server such as Pig and Hive clients.  We also use this REST API to collect metrics to aggregate job statistics and identify performance issues and failure causes.


Cluster Performance
With job performance we tend to think of how a job will run in isolation, but that’s rarely the case in any production environment.  At Netflix, we have clusters isolated by concern: multiple clusters for production jobs, ad-hoc queries, reporting, and some dedicated to smaller efforts (e.g. benchmarking, regression testing, test data pipeline).  The performance of any specific run is a function of the cluster capacity and the allocation assigned by the scheduler.  If someone is running a job at the same time as our ETL pipeline, which has higher weight, they might get squeezed out due to the priority we assign ETL.


Similar to how Inviso indexes job configurations, it polls REST endpoints on the Resource Manager to get the current metrics for all clusters and indexes the results into ElasticSearch.  With this information we can query and reconstitute the state of the cluster for any timespan going back days or months.  So even though a cluster may be gone or the job details are purged from the system, you can look back at how busy the cluster was when a job ran to determine if the performance was due to congestion.
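Polling the Resource Manager follows the same pattern. The sketch below uses the standard YARN REST endpoint /ws/v1/cluster/metrics; the index name, polling interval, and choice of fields are assumptions rather than Inviso's actual implementation.

# Sketch: periodically snapshot YARN cluster metrics and index them so the
# cluster's state can be replayed later. Index name and interval are illustrative.
import time
import requests
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es.example.com:9200"])
RM = "http://resourcemanager.example.com:8088"

while True:
    metrics = requests.get("%s/ws/v1/cluster/metrics" % RM).json()["clusterMetrics"]
    snapshot = {
        "cluster": "prodhive",
        "timestamp": int(time.time() * 1000),
        "appsRunning": metrics["appsRunning"],
        "appsPending": metrics["appsPending"],
        "containersAllocated": metrics["containersAllocated"],
        "containersReserved": metrics["containersReserved"],
        "containersPending": metrics["containersPending"],
    }
    es.index(index="inviso-cluster-metrics", doc_type="snapshot", body=snapshot)
    time.sleep(60)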

Application Stream: Running applications with ETL and Data Pipeline Activity Highlighted
In a second graph on the same page, Inviso displays the capacity and backlog on the cluster using the running, reserved, and pending task metrics available from the Resource Manager’s REST API.  This view has a range selector to adjust the timespan in the first graph and looks back over a longer period.


This view provides a way to gauge the load and backlog of the cluster.  When large jobs are submitted the pending tasks will spike and slowly drain as the cluster works them off.  If the cluster is unable to work off the backlog, the cluster might need to be expanded.  Another insight this view provides is periods when clusters have underutilized capacity.  For example, ad-hoc clusters are used less frequently at night, which is an opportune time to run a large backfill job.  Inviso makes these types of usage patterns clear so we can shift resources or adjust usage patterns to take full advantage of the cluster.


Resource Stream: Showing Active Containers and Backlog of Tasks

Putting it all Together
With the increasing size and complexity of Hadoop deployments, being able to locate and understand performance is key to running an efficient platform.  Inviso provides a convenient view of the inner workings of jobs and platform.  By simply overlaying a new view on existing infrastructure, Inviso can operate inside any Hadoop environment with a small footprint and provide easy access and insight.  

Given an existing cluster (in the datacenter or cloud), setting up Inviso should only take a few minutes, so give it a shot.  If you like it and want to make it better, send some pull requests our way.

A State of Xen - Chaos Monkey & Cassandra

On Sept 25th, 2014, AWS notified users about an EC2 maintenance window in which “a timely security and operational update” needed to be performed, requiring the reboot of a large number of instances (around 10%).  On Oct 1st, 2014, AWS sent an update about the status of the reboot and XSA-108.


While we’d love to claim that we weren’t concerned at all given our resilience strategy, the reality was that we were on high alert given the potential impact to our services.  We discussed different options, weighed the risks and monitored our services closely.  We observed that our systems handled the reboots extremely well with the resilience measures we had in place.  These types of unforeseen events reinforce that regular, controlled chaos and continued investment in chaos engineering are necessary. In fact, Chaos Monkey was mentioned as a best practice in the latest EC2 Maintenance update.


Our commitment to induced chaos testing helps drive resilience, but it definitely isn’t trivial or easy, especially in the case of stateful systems like Cassandra. The Cloud Database Engineering team at Netflix rose to the challenge, embracing chaos and running Chaos Monkey live in production starting last year.  The number of nodes rebooted served as true battle testing for the resilience measures designed for operating Cassandra.


Monkeying with the Database

Databases have long been the pampered and spoiled princes of the application world. They received the best hardware, copious amounts of personalized attention and no one would ever dream of purposely mucking around with them. In the world of democratized Public Clouds, this is no longer possible. Node failures are not just probable, they are expected. This requires database technology that can withstand failure and continue to perform.
Cassandra, Netflix’s database of choice, straddles the AP (Availability, Partition Tolerance) side of the CAP theorem. By trading away C (Consistency), we’ve made a conscious decision to design our applications with eventual consistency in mind. Our expectation is that Cassandra would live up to its side of the bargain and provide strong availability and partition tolerance. Over the years, it had demonstrated fairly good resilience to failure. However, it required lots of human intervention.
Last year we decided to invest in automating the recovery of failed Cassandra nodes. We were able to detect and identify failed nodes. With the cloud APIs afforded to us by AWS, we can determine the location of the failed node and programmatically initiate the replacement and bootstrap of a new Cassandra node. This gave us the confidence to have Cassandra participate in our Chaos Monkey exercises.
It wasn’t perfect at first, but then again, what is? In true Netflix fashion, we failed fast and fixed forward. Over the next few months, our automation got better. There were fewer false positives, and our remediation scripts required almost no human intervention.
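To give a flavor of the kind of lookup involved, here is a minimal sketch (using boto3 purely for illustration) of mapping a dead Cassandra node's IP to its EC2 instance and Availability Zone before driving a replacement; the helper names and remediation flow are placeholders, not our internal automation.

# Sketch only: locate the EC2 instance behind an unresponsive Cassandra node
# and kick off a replacement. Names and the remediation flow are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def locate_failed_node(private_ip):
    """Map a dead node's private IP to its instance id and Availability Zone."""
    reservations = ec2.describe_instances(
        Filters=[{"Name": "private-ip-address", "Values": [private_ip]}]
    )["Reservations"]
    instance = reservations[0]["Instances"][0]
    return instance["InstanceId"], instance["Placement"]["AvailabilityZone"]

def replace_node(private_ip):
    instance_id, zone = locate_failed_node(private_ip)
    ec2.terminate_instances(InstanceIds=[instance_id])
    # The auto scaling group brings up a replacement in the same zone, and the
    # new node bootstraps with cassandra.replace_address set to private_ip.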

AWS RE:BOOT

“When we got the news about the emergency EC2 reboots, our jaws dropped. When we got the list of how many Cassandra nodes would be affected, I felt ill. Then I remembered all the Chaos Monkey exercises we’ve gone through. My reaction was, ‘Bring it on!’” - Christos Kalantzis, Engineering Manager, Cloud Database Engineering
That weekend our on-call staff was exceptionally vigilant. The whole Cloud Database Engineering team was on high alert. We have confidence in our automation but a prudent team prepares for the worst and hopes for the best.
Out of our 2700+ production Cassandra nodes, 218 were rebooted. 22 Cassandra nodes were on hardware that did not reboot successfully. This led to those Cassandra nodes not coming back online. Our automation detected the failed nodes and replaced them all, with minimal human intervention. Netflix experienced 0 downtime that weekend.

Repeatedly and regularly exercising failure, even in the persistence layer, should be part of every company’s resilience planning. If it wasn’t for Cassandra’s participation in Chaos Monkey, this story would have ended much differently.
by Bruce Wong, Engineering Manager - Chaos Engineering and Christos Kalantzis, Engineering Manager - Cloud Database Engineering

NetflixOSS Season 2, Episode 1


Wondering what this headline means? It means that NetflixOSS continues to grow, both in the number of projects that are now available and in the use by others.


We held another NetflixOSS Meetup in our Los Gatos, Calif., headquarters last night. Four companies came specifically to share what they’re doing with the NetflixOSS Platform:
  • Matt Bookman from Google shared how to leverage NetflixOSS Lipstick on top of Google Compute Engine.  Lipstick combines a graphical depiction of a Pig workflow with information about the job as it executes, giving developers insight that previously required a lot of sifting through logs (or a Pig expert) to piece together.
  • Andrew Spyker from IBM shared how they’re leveraging many of the NetflixOSS components on top of the SoftLayer infrastructure to build real-world applications - beyond the AcmeAir app that won last year’s Cloud Prize.
  • Peter Sankauskas from Answers4AWS talked about the motivation behind his work on NetflixOSS components setup automation, and his work towards 0-click setup for many of the components.
  • Daniel Chia from Coursera shared how they utilize NetflixOSS Priam and Aegithus to work with Cassandra.


Since our previous NetflixOSS Meetup we have open sourced several new projects in many areas: Big Data tools and solutions, scalable data pipelines, language-agnostic storage solutions, and more.  At yesterday’s Meetup, Netflix engineers talked about recent projects and gave previews of projects that may soon be released.


  • Zeno - an in-memory data serialization and distribution platform
  • Suro - a distributed data pipeline which enables services to move, aggregate, route and store data
  • STAASH - a language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems
  • A preview of Dynomite - a thin Dynamo-based replication for cached data
  • Aegithus - a bulk data pipeline out of Cassandra
  • PigPen - Map-Reduce for Clojure
  • S3mper - a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
  • A preview of Inviso - a performance focused Big Data tool


All the slides are available on Slideshare:






In preparation for the event, we spruced up our Github OSS site - all components now feature new cool icons:






The event itself was a full house - people at the demo stations were busy all evening answering many questions about the components they wrote and opened.


It was great to see how many of our Open Source components are being used outside of Netflix.  We hear of many more companies that are using and contributing to NetflixOSS components.  If you’re one of them and would like to have your logo featured on the “Powered by NetflixOSS” page of our Github site, contact us at netflixoss@netflix.com.


If you’re interested in hearing more about upcoming NetflixOSS projects and events, follow @NetflixOSS on Twitter, and join our Meetup group.  The slides for this event are available on Slideshare, and videos will be uploaded shortly.





Using Presto in our Big Data Platform on AWS

by: Eva Tse, Big Data Platform team


At Netflix, the Big Data Platform team is responsible for building a reliable data analytics platform shared across the whole company. In general, Netflix product decisions are very data driven. So we play a big role in helping different teams to gain product and consumer insights from a multi-petabyte scale data warehouse (DW). Their use cases range from analyzing A/B tests results to analyzing user streaming experience to training data models for our recommendation algorithms.

We shared our overall architecture in a previous blog post. The underpinning of our big data platform is that we leverage AWS S3 for our DW. This architecture allows us to separate the compute and storage layers. It allows multiple clusters to share the same data on S3, and clusters can be long-running yet still treated as transient (for flexibility). Our users typically write Pig or Hive jobs for ETL and data analytics.

A small subset of the ETL output and some aggregated data is transferred to Teradata for interactive querying and reporting. On the other hand, we also need to do low-latency interactive data exploration on our broader data set on S3. These are the use cases that Presto serves exceptionally well. Seven months ago, we first deployed Presto into production, and it is now an integral part of our data ecosystem. In this blog post, we would like to share our experience with Presto and how we made it work for us!

Why Presto?

We had been in search of an interactive querying engine that could work well for us. Ideally, we wanted an open source project that could handle our scale of data & processing needs, had great momentum, was well integrated with the Hive metastore, and was easy for us to integrate with our DW on S3. We were delighted when Facebook open sourced Presto.

In terms of scale, we have a 10 petabyte data warehouse on S3. Our users from different organizations query diverse data sets across expansive date ranges. For this use case, caching a specific dataset in memory would not work because cache hit rate would be extremely low unless we have an unreasonably large cache. The streaming DAG execution architecture of Presto is well-suited for this sporadic data exploration usage pattern.

In terms of integrating with our big data platform, Presto has a connector architecture that is Hadoop friendly. It allows us to easily plug in an S3 file system. We were up and running in test mode after only a month of work on the S3 file system connector in collaboration with Facebook.

In terms of usability, Presto supports ANSI SQL, which has made it very easy for our analysts and developers to get rolling with it. As far as limitations / drawbacks, user-defined functions in Presto are more involved to develop, build, and deploy than in Hive and Pig. Also, users who want to productionize their queries need to rewrite them in HiveQL or Pig Latin, as we don't currently use Presto in our critical production pipelines. While there are some minor inconveniences, the ability to interactively analyze large amounts of data is a huge win for us.

Finally, Presto was already running in production at Facebook. We did some performance benchmarking and stress testing and we were impressed. We also looked under the hood and saw well designed and documented Java code. We were convinced!

Our production environment and use cases

Currently, we are running with ~250 m2.4xlarge EC2 worker instances and our coordinator is on r3.4xlarge. Our users run ~2500 queries/workday. Our Presto cluster is completely isolated from our Hadoop clusters, though they all access the same data on our S3 DW.

Almost all of our jobs are CPU bound. We set our task memory to a rather high value (i.e., 7GB, with a slight chance of oversubscribing memory) to run some of our memory-intensive queries, like big joins or aggregation queries.

We do not use disk (as we don’t use HDFS) in the cluster. Hence, we will be looking to upgrade to the current generation AWS instance type (e.g. r3), which has more memory, and has better isolation and performance than the previous generation of EC2 instances.

We are running the latest Presto 0.76 release with some outstanding pull requests that are not committed yet. Ideally, we would like to contribute everything back to open source and not carry custom patches in our deployment. We are actively working with Facebook and looking forward to committing all of our pull requests.

Presto addresses our ad hoc interactive use cases. Our users always go to Presto first for quick answers and for data exploration. If Presto does not support what they need (like big join / aggregation queries that exceed our memory limit or some specific user-defined functions that are not available), then they would go back to Hive or Pig.

We are working on a Presto user interface for our internal big data portal. Our algorithm team also built an interactive data clustering application by integrating R with Presto via an open source Python Presto client.

Performance benchmark

At a high level, we compare Presto and Hive query execution time using our own datasets and users' queries instead of running standard benchmarks like TPC-H or TPC-DS. This way, we can translate the results back to what we can expect for our use cases. The graph below shows the results of three queries: a group-by query, a join plus group-by query, and a needle-in-a-haystack (table scan) query. We compared the performance of Presto vs. Hive 0.11 on Hadoop 2 using Parquet input files on S3, all of which we currently use in production. Each query processed the same data set, with data sizes varying between ~140GB and ~210GB depending on the file format.





Cluster setup:
40 nodes m2.4xlarge

Settings we tuned:
task.shard.max-threads=32
task.max-memory=7GB
sink.max-buffer-size=1GB
hive.s3.max-client-retries=50
hive.s3.max-error-retries=50
hive.s3.max-connections=500
hive.s3.connect-timeout=5m
hive.s3.socket-timeout=5m




We understand performance test environments and numbers are hard to reproduce. What is worth noting is the relative performance of these tests. The key takeaway is that queries that take one or two map-reduce (MR) phases in Hadoop run 10 to 100 times faster in Presto. The speedup in Presto is linear to the number of MR jobs involved. For jobs that only do a table scan (i.e., I/O bound instead of CPU bound), it is highly dependent on the read performance of the file format used. We did some work on Presto / Parquet integration, which we will cover in the next section.

Our Presto contributions

The primary and initial piece of work that made Presto work for us was S3 FileSystem integration. In addition, we also worked on optimizing S3 multipart upload. We also made a few enhancements and bug fixes based on our use cases along the way: disabling recursive directory listing, json tuple generation, foreground metastore refresh, mbean for S3 filesystem monitoring, and handling S3 client socket timeout.

In general, we are committed to make Presto work better for our users and to cover more of their needs. Here are a few big enhancements that we are currently working on:

Parquet file format support

We recently upgraded our DW to use the Parquet file format (FF) for its performance on S3 and for its flexibility to integrate with different data processing engines. Hence, we are committed to make Presto work better with Parquet FF.  (For details on why we chose Parquet and what we contributed to make it work in our environment, stay tuned for an upcoming blog post).

Building on Facebook's initial Parquet integration, we added support for predicate pushdown, column-position-based access (instead of name-based access) to Parquet columns, and data type coercion. For context, we use the Hive metastore as our source of truth for metadata, and we do schema evolution on the Hive metastore. Hence, we need column-position-based access to work with our Hive metastore instead of using the schema information stored in Parquet files.

Here is a comparison of Presto job execution times among different FFs. We compare read performance of sequence file (a FF we have historically used), ORCFile (we benchmarked the latest integration with predicate pushdown, vectorization and lazy materialization on read) and Parquet. We also compare the performance on S3 vs. HDFS. In this test, we use the same data sets and environment as the above benchmark test. The query is a needle-in-a-haystack query that does a select and filter on a condition that returns zero rows.

[Chart: Presto query execution time by file format (sequence file, ORCFile, Parquet) on S3 vs. HDFS]

As a next step, we will look into improving Parquet performance further by doing predicate pushdown to eliminate whole row groups, vectorization, and lazy materialization on read. We believe this will bring Parquet performance on par with ORC files.

ODBC / JDBC support

This is one of the biggest asks from our users. Users like to connect to our Hive DW directly to do exploratory / ad hoc reporting because it has the full dataset. Given Presto is interactive and integrated with Hive metastore, it is a natural fit.

Presto has a native ODBC driver that was recently open sourced. We made a few bug fixes and we are working on more enhancements. Overall, it is working well now for our Tableau users in extract (non-live exploration) mode. For our users who prefer to use Microstrategy, we plan to explore different options to integrate with it next.
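To illustrate the kind of direct connectivity users are asking for, below is a minimal sketch of querying Presto from Java over JDBC. It assumes a Presto JDBC driver is available on the classpath for your Presto version; the JDBC URL format, coordinator hostname, catalog, schema, table, and columns shown are illustrative assumptions rather than our actual setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoJdbcExample {
    public static void main(String[] args) throws Exception {
        // Illustrative connection string: coordinator host, catalog and schema are assumptions.
        String url = "jdbc:presto://presto-coordinator.example.com:8080/hive/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             // A representative interactive query; the table and columns are hypothetical.
             ResultSet rs = stmt.executeQuery(
                 "SELECT country, count(*) AS plays FROM playback_events " +
                 "WHERE dateint = 20140930 GROUP BY country ORDER BY plays DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("country") + "\t" + rs.getLong("plays"));
            }
        }
    }
}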

Map data type support

All the event data generated from our Netflix services and Netflix-enabled devices comes through our Suro pipeline before landing in our DW. For flexibility, this event data is structured as key/value pairs, which get automatically stored in map columns in our DW. Users may pull out keys as top-level columns in the Hive metastore by adjusting some configurations in our data pipeline. Still, a large number of key/value pairs remain in the map because there are a large number of keys and the key space is very sparse.

It is very common for users to look up a specific key in the map. With our current Parquet integration, looking up a key in the map column means converting the column to a JSON string first and then parsing it. Facebook recently added native support for array and map data types. We plan to further enhance it to support array-element and map-key specific column pruning and predicate pushdown for the Parquet FF to improve performance.

Our wishlist

There are still a couple of items that are high on our wishlist, and we would love to contribute to these when we have the bandwidth.

Big table join. It is very common for our queries to join tables as we have a lot of normalized data in our DW. We are excited to see that distributed hash join is now supported and plan to check it out. Sort-merge join would likely be useful to solve some of the even bigger join use cases that we have.

Graceful shrink.  Given Presto is used for our ad hoc use cases, and given we run it in the cloud, it would be most efficient if we could scale up the cluster during peak hours (mostly work hours) and scale down during trough hours (night time or weekends). If Presto nodes can be blacklisted and gracefully drained before shutdown, we could combine that with available JMX metrics to do heuristic-based auto expand/shrink of the cluster.

Key takeaway

Presto makes the lives of our users a lot easier. It tremendously improves their productivity.

We have learned from our experience that getting involved and contributing back to open source technologies is the best way to make sure they work for our use cases in a fast-paced and evolving environment. We have been working closely with the Facebook team to discuss our use cases and align priorities. They have been open about their roadmap, quick to add new features, and helpful in providing feedback on our contributions. We look forward to continuing to work with them and the community to make Presto even better and more comprehensive. Let us know if you are interested in sharing your experiences using Presto.

Last but not least, the Big Data Platform team at Netflix has been heads-down innovating on our platform to meet our growing business needs. We will share more of our experiences with our Parquet FF migration and Genie 2 upgrade in upcoming blog posts.

If you are interested in solving big data problems like ours, we would like to hear from you!


Improving the performance of our JavaScript inheritance model


by Dylan Oudyk

When building the initial architecture for Netflix’s common TV UI in early 2009, we knew we wanted to use inheritance to share common functionality in our codebase. Since the first engineers working on the code were coming from the heavily Java-based Website, they preferred simulating a classical inheritance model: an easy way to extend objects and override functions, but also have the ability to invoke the overridden function. (Since then we’ve moved on to greener pastures, or should we say design patterns, but the classical inheritance model still lives on and remains useful within our codebase.) For example, every view in our UI inherits from a base view with common methods and properties, like show, hide, visible, and parent. Concrete views extend these methods while still retaining the base view behavior.


After searching around and considering a number of approaches, we landed on John Resig’s Simple JavaScript Inheritance. We really liked the syntactic sugar of this._super(), which allows you to invoke the super-class function from within the overriding function.


Resig’s approach allowed us to write simple code like this:

var Human = Class.extend({
  init: function(height, weight) {
    this.height = height;
    this.weight = weight;
  }
});

var Mutant = Human.extend({
  init: function(height, weight, abilities) {
    this._super(height, weight);
    this.abilities = abilities;
  }
});

Not so super, after all


While his approach was sugary sweet and simple, we’d always been leery of a few aspects, especially considering our performance- and memory-constrained environment on TV devices:
  • When extending an object, it loops over the prototype to find functions that use _super.
  • To find _super it decompiles a function into a string and tests that string against a regular expression.
  • Any function using _super is then wrapped in a closure to achieve the super-class chaining.


All that sugar was slowing us down. We found that the _super implementation was about 70% slower at executing overridden functions than a vanilla approach of calling the super-class function directly. The overhead of invoking the closure, which invokes the overriding function, which in turn invokes the inherited function, adds up quickly on the execution stack.  Our application not only runs on beefy game consoles like the PS3, PS4 and Xbox 360, but also on single-core Blu-ray players with 600 MHz ARM processors.  A great user experience demands that the code run as fast as possible for the UI to remain responsive.


Not only was _super slowing us down, it was gobbling up memory. We have an automation suite that stress tests the UI by putting it through its paces: browsing the entire grid of content, conducting multiple playbacks, searching, etc. We run it against our builds to ensure our memory footprint remains relatively stable and to catch any memory leaks that might accidentally get introduced. On a PS3, we profiled the memory usage with and without _super.  Our codebase had close to 720 uses of _super, which consumed about 12.2 MB. 12.2 MB is a drop in the bucket in the desktop browser world, but we work in a very constrained environment where 12.2 MB represents a large drop in a small bucket.


Worse still, when we went to move off the now-deprecated YUI Compressor, more aggressive JavaScript minifiers like UglifyJS and Google's Closure Compiler would obfuscate _super, causing the regular expression test to fail and blow up our run-time execution.


We knew it was time to find a better way.


All your base are belong to us


We wanted to remove the performance and memory bottlenecks and also unblock ourselves from using a new minifier, but without having to rewrite significant portions of our large, existing codebase.


The _super implementation basically uses the wrapping closure as a complex pointer with references to the overriding function and the inherited function: 




If we could find a way to remove the middle man, we’d be able to have the best of both worlds.



We were able to leverage a lesser known feature in JavaScript, named function expressions, to cut out the expensive work that _super had been doing.


Now when we're extending an object, we loop over the prototype and add a base property to every function. This base property points to the inherited function.


for (name in source) {
  value = source[name];
  currentValue = receiver[name];
  if (typeof value === 'function' && typeof currentValue === 'function' &&
      value !== currentValue) {
    value.base = currentValue;
  }
}

We use a named function expression within the overriding function to invoke the inherited function.


var Human = Class.extend({
  init: function(height, weight) {
    this.height = height;
    this.weight = weight;
  }
});

var Mutant = Human.extend({
  init: function init(height, weight, abilities) {
    init.base.call(this, height, weight);
    this.abilities = abilities;
  }
});

var theWolverine = new Mutant('5ft 3in', 300, [
  'adamantium skeleton',
  'heals quickly',
  'claws'
]);

(Please note that if you need to support Internet Explorer versions < 9, this may not be an option for you, but arguments.callee will be available; more details at Named function expressions demystified.)


Arguably, we lost a teeny bit of sugar, but at significant savings. The base approach is about 45% faster at executing an overridden method than _super. Add to that a significant memory savings.





As can be seen from the graph above, after running our application for one minute the memory savings is close to 12.2 MB.  We could pull the line back to the beginning and the savings would be even greater, because after one minute the application code has long since been interpreted and the classes have already been created.


Conclusion
We believe we found a great way to invoke overridden methods with the named function expression approach.  By replacing _super, we saved RAM and CPU cycles.  We’d rather save the RAM for gorgeous artwork and our streaming video buffer.  The saved CPU cycles can be put to work on beautiful transitions.  Overall a change that improves the experience for our users.


References:
John Resig’s Simple JavaScript Inheritance System http://ejohn.org/blog/simple-javascript-inheritance/


Invoking Overridden Methods Performance Tests: http://jsperf.com/fun-with-method-overrides/2

Named Function Expressions: http://kangax.github.io/nfe/

FIT : Failure Injection Testing


by Kolton Andrus, Naresh Gopalani, Ben Schmaus


It's no secret that at Netflix we enjoy deliberately breaking things to test our production systems. Doing so lets us validate our assumptions and prove that our mechanisms for handling failure will work when called upon. Netflix has a tradition of implementing a range of tools that create failure, and it is our pleasure to introduce you to the latest of these solutions, FIT or Failure Injection Testing.

FIT is a platform that simplifies creation of failure within our ecosystem with a greater degree of precision for what we fail and who we will impact. FIT also allows us to propagate our failures across the entirety of Netflix in a consistent and controlled manner.

Why We Built FIT
While breaking things is fun, we do not enjoy causing our customers pain. Some of our Monkeys, by design, can go a little too wild when let out of their cages. Latency Monkey in particular has bitten our developers, leaving them wary about unlocking the cage door.

Latency Monkey adds a delay and/or failure on the server side of a request for a given service. This provides us with good insight into how calling applications behave when their dependency slows down - threads pile up, the network becomes congested, etc. Latency Monkey also impacts all calling applications - whether they want to participate or not - and can result in customer pain if proper fallback handling, timeouts, and bulkheads don't work as expected. With the complexity of our system, it is virtually impossible for us to anticipate where failures will happen when turning Latency Monkey loose. Validating these behaviors is often risky, but critical to remaining resilient.

What we need is a way to limit the impact of failure testing while still breaking things in realistic ways. We need to control the scope of the failure until we have confidence that the system degrades gracefully, and then increase that scope to exercise the failure at scale.  This is where FIT comes in.

How FIT works
Simulating failure starts when the FIT service pushes failure simulation metadata to Zuul. Requests matching the failure scope at Zuul are decorated with failure. This may be an added delay to a service call, or a failure in reaching the persistence layer. Each injection point touched checks the request context to determine if there is a failure for that specific component. If found, the injection point simulates that failure appropriately.  Below is an outline of a simulated failure, demonstrating some of the inflection points at which failure can be injected.


[Figure: FIT architecture example]

Failure Scope
We only want to break those we intend, so limiting the potential blast radius is critical. To achieve this we use Zuul, which provides many powerful capabilities for inspecting and managing traffic. Before forwarding a request, Zuul checks a local store of FIT metadata to determine if this request should be impacted. If so, Zuul decorates the request with a failure context, which is then propagated to all dependent services.

For most failure tests, we use Zuul to isolate impacted requests to only a specific test account or a specific device. Once validated at that level, we expand the scope to a small percentage of production requests. If the failure test still looks good, we gradually dial up the chaos to 100%.

Injection Points
We have several key “building block” components that are used within Netflix. They help us to isolate failure and define fallbacks (Hystrix), communicate with dependencies (Ribbon), cache data (EVCache), or persist data (Astyanax). Each of these layers makes a perfect inflection point at which to inject failure. These layers interface with the FIT context to determine if this request should be impacted. The failure behavior is provided to that layer, which determines how to emulate that failure in a realistic fashion: sleep for a delay period, return a 500, throw an exception, etc.
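As a purely hypothetical sketch (FIT's actual classes and APIs are internal and not shown here), an injection point check along the lines described above might look like this:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an injection-point check driven by per-request failure
// metadata; the class, field, and behavior names are illustrative, not FIT's real API.
public final class FitInjectionPoint {

    // Failure metadata for the current request, keyed by injection point name.
    // "delay:<millis>" and "fail" are illustrative failure behaviors.
    private final Map<String, String> requestFailureContext = new ConcurrentHashMap<>();

    public void decorate(String injectionPointName, String behavior) {
        requestFailureContext.put(injectionPointName, behavior);
    }

    // Called by a building-block layer (e.g., a client wrapper) before doing real work.
    public void maybeFail(String injectionPointName) {
        String behavior = requestFailureContext.get(injectionPointName);
        if (behavior == null) {
            return; // request is not in the failure scope; proceed normally
        }
        if (behavior.startsWith("delay:")) {
            sleepQuietly(Long.parseLong(behavior.substring("delay:".length())));
        } else if (behavior.equals("fail")) {
            throw new RuntimeException("FIT-injected failure at " + injectionPointName);
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}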

Failure Scenarios
Whether we are recreating a past outage or proactively testing the loss of a dependency, we need to know what could fail in order to build a simulation. We use an internal system that traces requests through the entirety of the Netflix ecosystem to find all of the injection points along the path. We then use these to create failure scenarios: sets of injection points that should or should not fail. One such example is our critical services scenario, the minimum set of services required to stream. Another may be the loss of an individual service, including its persistence and caching layers.

Automated Testing
Failure testing tools are only as valuable as their usage. Our device testing teams have developed automation which enables failure, launches Netflix on a device, browses through several lists, selects a video, and begins streaming. We began by validating that this process works when only our critical services are available. Currently, we are extending this to identify every dependency touched during this process and systematically fail each one individually. Because this runs continuously, it helps us identify vulnerabilities as they are introduced.

Resiliency Strategies
FIT has proven useful for bridging the gap between isolated testing and large-scale chaos exercises, and for making such testing self service. It is one of many tools we have to help us build more resilient systems. The scope of the problem extends beyond just failure testing; we need a range of techniques and tools: designing for failure, better detection and faster diagnosis, regular automated testing, bulkheading, etc. If this sounds interesting to you, we're looking for great engineers to join our reliability, cloud architecture, and API teams!



Message Security Layer: A Modern Take on Securing Communication

Netflix serves audio and video to millions of devices and subscribers across the globe. Each device has its own unique hardware and software, and differing security properties and capabilities. The communication between these devices and our servers must be secured to protect both our subscribers and our service.
When we first launched the Netflix streaming service we used a combination of HTTPS and a homegrown security mechanism called NTBA to provide that security. However, over time this combination started exhibiting growing pains. With the advent of HTML5 and the Media Source Extensions and Encrypted Media Extensions we needed something new that would be compatible with that platform. We took this as an opportunity to address many of the shortcomings of the earlier technology. The Message Security Layer (MSL) was born from these dual concerns.

Problems with HTTPS

One of the largest problems with HTTPS is the PKI infrastructure. There were a number of short-lived incidents where a renewed server certificate caused outages. We had no good way of handling revocation: our attempts to leverage CRL and OCSP technologies resulted in a complex set of workarounds to deal with infrastructure downtimes and configuration mistakes, which ultimately led to a worse user experience and a brittle security mechanism with little insight into errors. Recent security breaches at certificate authorities and the issuance of intermediate certificate authorities mean that placing trust in one actor requires placing trust in a whole chain of actors not necessarily deserving of trust.
Another significant issue with HTTPS is the requirement for accurate time. The X.509 certificates used by HTTPS contain two timestamps and if the validating software thinks the current time is outside that time window the connection is rejected. The vast majority of devices do not know the correct time and have no way of securely learning the correct time.
Being tied to SSL and TLS, HTTPS also suffers from fundamental security issues unknown at the time of their design. Examples include padding attacks and the use of MAC-then-Encrypt, which is less secure than Encrypt-then-MAC.
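As a quick illustration of the Encrypt-then-MAC ordering (a general construction shown here with standard javax.crypto classes, not MSL's actual wire format or API; the algorithm choices and key handling are assumptions for illustration only):

import javax.crypto.Cipher;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public final class EncryptThenMacSketch {

    // Encrypt first, then compute an HMAC over the IV and ciphertext, so the
    // receiver can verify integrity before attempting any decryption.
    public static byte[][] protect(byte[] plaintext, SecretKey encKey, SecretKey macKey,
                                   byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, encKey, new IvParameterSpec(iv));
        byte[] ciphertext = cipher.doFinal(plaintext);

        Mac hmac = Mac.getInstance("HmacSHA256");
        hmac.init(macKey);
        hmac.update(iv);                      // authenticate the IV as well
        byte[] tag = hmac.doFinal(ciphertext);

        return new byte[][] { ciphertext, tag };
    }
}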
There are other less obvious issues with HTTPS. Establishing a connection requires extra network round trips and depending on the implementation may result in multiple requests to supporting infrastructure such as CRL distribution points and OCSP responders in order to validate a certificate chain. As we continually improved application responsiveness and playback startup time this overhead became significant, particularly in situations with less reliable network connectivity such as Wi-Fi or mobile networks.
Even ignoring these issues, integrating new features and behaviors into HTTPS would have been extremely difficult. The specification is fixed and mandates certain behaviors. Leveraging specific device security features would require hacking the SSL/TLS stack in unintended ways: imagine generating some form of client certificate that used a dynamically generated set of device credentials.

High-level Goals

Before starting to design MSL we had to identify its high-level goals. Other than general best practices when it comes to protocol design, the following objectives are particularly important given the scale of deployment, the fact it must run on multiple platforms, and the knowledge it will be used for future unknown use cases.
  • Cross-language. Particularly subject to JavaScript constraints such as its maximum integer value and native functions found in web browsers.
  • Automatic error recovery. With millions of devices and subscribers we need devices that enter a bad state to be able to automatically recover without compromising security.
  • Performance. We do not want our application performance and responsiveness to be limited any more than it has to be. The network is by far the most expensive performance cost.
    Figure 1. HTTP vs. HTTPS Performance
  • Flexible and extensible. Whenever possible we want to take advantage of security features provided by devices and their software. Likewise if something no longer provides the security we need then there needs to be a migration path forward.
  • Standards compatible. Although related to being flexible and extensible, we paid particular attention to being standards compatible. Specifically we want to be able to leverage the Web Crypto API now available in the major web browsers.

Security Properties

MSL is a modern cryptographic protocol that takes into account the latest cryptography technologies and knowledge. It supports the following basic security properties.
  • Integrity protection. Messages in transit are protected from tampering.
  • Encryption. Message data is protected from inspection.
  • Authentication. Messages can be trusted to come from a specific device and user.
  • Non-replayable. Messages containing non-idempotent data can be non-replayable.
MSL supports two different deployment models, which we refer to as MSL network types. A single device may participate in multiple MSL networks simultaneously.
  • Trusted services network. This deployment consists of a single client device and multiple servers. The client authenticates against the servers. The servers have shared access to the same cryptographic secrets and therefore each server must trust all other servers.
  • Peer-to-peer. This is a typical p2p arrangement where each side of the communication is mutually authenticated.
Figure 2. MSL Networks


Protocol Overview

A typical MSL message consists of a header and one or more application payload chunks. Each chunk is individually protected which allows the sender and recipient to process application data as it is transmitted. A message stream may remain open indefinitely, allowing large time gaps between chunks if desired.
MSL has pluggable authentication and may leverage any number of device and user authentication types for the initial message. The initial message will provide authentication, integrity protection, and encryption if the device authentication type supports it. Future messages will make use of session keys established as a result of the initial communication.
If the recipient encounters an error when receiving a message it will respond with an error message. Error messages consist of a header that indicates the type of error that occurred. Upon receipt of the error message the original sender can attempt to recover and retransmit the original application data. For example, if the message recipient believes one side or the other is using incorrect session keys the error will indicate that new session keys should be negotiated from scratch. Or if the message recipient believes the device or user credentials are incorrect the error will request the sender re-authenticate using new credentials.
To minimize network round-trips, MSL attempts to perform authentication, key negotiation, and renewal operations while it is also transmitting application data (Figure 3). As a result MSL does not impose any additional network round trips and only minimal data overhead.
Figure 3. MSL Communication w/Application Data
This may not always be possible, in which case an MSL handshake must first occur, after which sensitive data such as user credentials and application data may be transmitted (Figure 4).
Figure 4. MSL Handshake followed by Application Data
Once session keys have been established they may be reused for future communication. Session keys may also be persisted to allow reuse between application executions. In a trusted services network the session keys resulting from a key negotiation with one server can be used with all other servers.

Platform Integration

Whenever possible we would like to take advantage of the security features provided by a specific platform. Doing so often provides stronger security than is possible without leveraging those features.
Some devices may already contain cryptographic keys that can be used to authenticate and secure initial communication. Likewise some devices may have already authenticated the user and it is a better user experience if the user is not required to enter their email and password again.
MSL has a plug-in architecture which allows for the easy integration of different device and user authentication schemes, session key negotiation schemes, and cryptographic algorithms. This also means that the security of any MSL deployment heavily depends on the mechanisms and algorithms it is configured with.
The plug-in architecture also means new schemes and algorithms can be incorporated without requiring a protocol redesign.

Other Features

  • Time independence. MSL does not require time to be synchronized between communicating devices. It is possible certain authentication or key negotiation schemes may impose their own time requirements.
  • Service tokens. Service tokens are very similar to HTTP cookies: they allow applications to attach arbitrary data to messages. However service tokens can be cryptographically bound to a specific device and/or user, which prevents data from being migrated without authorization.

The Release

To learn more about MSL and find out how you can use it for your own applications visit the Message Security Layer repository on GitHub.
The protocol is fully documented and guides are provided to help you use MSL securely for your own applications. Java and JavaScript implementations of a MSL stack are available as well as some example applications. Both languages fully support trusted services and peer-to-peer operation as both client and server.

MSL Today and Tomorrow

With MSL we have eliminated many of the problems we faced with HTTPS and platform integration. Its flexible and extensible design means it will be able to adapt as Netflix expands and as the cryptographic landscape changes.
We are already using MSL on many different platforms including our HTML5 player, game consoles, and upcoming CE devices. MSL can be used just as effectively to secure internal communications. In the future we envision using MSL over Web Sockets to create long-lived secure communication channels between our clients and servers.
We take security seriously at Netflix and are always looking for the best to join our team. If you are also interested in attacking the challenges of the fastest-growing online streaming service in the world, check out our job listings.

Wesley Miaw & Mitch Zollinger
Security Engineering

Introducing Dynomite - Making Non-Distributed Databases, Distributed



Introduction & Overview

Netflix has long been a proponent of the microservices model. This model offers higher availability, resiliency to failure, and loose coupling. The downside to such an architecture is the potential for added latency in the user experience. Every time a customer loads up a homepage or starts to stream a movie, a number of microservices are involved in completing that request. Most of these microservices use some kind of stateful system to store and serve data. A few milliseconds here and there can add up quickly and result in a multi-second response time.
The Cloud Database Engineering team at Netflix is always looking for ways to shave off milliseconds from an application’s database response time, while maintaining our goal of local high-availability and multi-datacenter high-availability. With that goal in mind, we created Dynomite.
Inspired by the Dynamo White Paper as well as our experience with Apache Cassandra, Dynomite is a sharding and replication layer. Dynomite can turn existing non-distributed datastores, such as Redis or Memcached, into fully distributed, multi-datacenter replicating datastores.


Server Architecture

Motivation

In the open source world, there are various single-server datastore solutions, e.g. Memcached, Redis, BerkeleyDb, LevelDb, Mysql (datastore).  The availability story for these single-server datastores usually ends up being a master-slave setup. Once traffic demands overrun this setup, the next logical progression is to introduce sharding.  Most would agree that it is non-trivial to operate this kind of setup. Furthermore, managing data from different shards is also a challenge for application developers.
In the age of high scalability and big data, Dynomite’s design goal is to turn those single-server datastore solutions into peer-to-peer, linearly scalable, clustered systems while still preserving the native client/server protocols of the datastores, e.g., Redis protocol.
Now we will introduce a few high level concepts that are core to the Dynomite server architecture design.

Dynomite Topology

A Dynomite cluster consists of multiple data centers (dc). A datacenter is a group of racks, and a rack is a group of nodes. Each rack consists of the entire dataset, which is partitioned across multiple nodes in that rack. Hence, multiple racks enable higher availability for data. Each node in a rack has a unique token, which helps to identify the dataset it owns.
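To make the token idea concrete, here is a minimal sketch of Dynamo-style token ownership; it illustrates the general concept and is not Dynomite's actual hashing or token assignment:

import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Illustrative token ring: each node in a rack owns the key range ending at its token.
public final class TokenRingSketch {

    private final SortedMap<Long, String> tokenToNode = new TreeMap<>();

    public void addNode(long token, String nodeName) {
        tokenToNode.put(token, nodeName);
    }

    // Hash the key and pick the first node whose token is >= the hash,
    // wrapping around to the lowest token if necessary.
    public String ownerOf(String key) {
        CRC32 crc = new CRC32();  // simple stand-in hash for illustration
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        long hash = crc.getValue();

        SortedMap<Long, String> tail = tokenToNode.tailMap(hash);
        Long token = tail.isEmpty() ? tokenToNode.firstKey() : tail.firstKey();
        return tokenToNode.get(token);
    }
}

With one such ring per rack, a node that does not own a key can look up the owner within its own rack and forward the request there, and replicate it to the corresponding owners in other racks and datacenters, as described in the Replication section below.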


Each Dynomite node (e.g., a1 or b1 or c1)  has a Dynomite process co-located with the datastore server, which acts as a proxy, traffic router, coordinator and gossiper. In the context of the Dynamo paper, Dynomite is the Dynamo layer with additional support for pluggable datastore proxy, with an effort to preserve the native datastore protocol as much as possible.  
A datastore can be either a volatile datastore such as Memcached or Redis, or persistent datastore such as Mysql, BerkeleyDb or LevelDb.  Our current open sourced Dynomite offering supports Redis and Memcached.


Replication

A client can connect to any node on a Dynomite cluster when sending write traffic.  If the Dynomite node happens to own the data based on its token, then the data is written to the local datastore server process and asynchronously replicated to other racks in the cluster across all data centers.  If the node does not own the data, it acts as a coordinator and sends the write to the node owning the data in the same rack. It also replicates the writes to the corresponding nodes in other racks and DCs.  
In the current implementation, a coordinator returns an OK back to the client if a node in the local rack successfully stores the write; all other remote replications happen asynchronously.
The picture below shows an example of the latter case, where a client sends a write request to a non-owning node. The data belongs on nodes a2, b2, c2 and d2 as per the partitioning scheme. The request is sent to a1, which acts as the coordinator and forwards the request to the appropriate nodes.


Highly available reads

Multiple racks and multiple data centers provide high availability. A client can connect to any node to read the data. Similar to writes, a node serves the read request if it owns the data, otherwise it forwards the read request to the data owning node in the same rack.  Dynomite clients can fail over to replicas in remote racks and/or data centers in case of node, rack, or data center failures.




Pluggable Datastores

Dynomite currently supports Redis and Memcached.  For each of the data stores, based on our usage experience, a pragmatic subset of the most useful Redis/Memcached APIs are supported. Support for additional APIs will be added as needed in the near future.

Standard open source Memcached/Redis ASCII protocol support

Any client that can talk to Memcached or Redis can talk to Dynomite - no change needed. However, there will be a few things missing, including failover strategy, request throttling, connection pooling, etc., unless our Dyno client is used (more details in the Client Architecture section).

Scalable I/O event notification server

All incoming/outgoing data traffic is processed by a single threaded I/O event loop.  There are additional threads for background or administrative tasks.  All thread communications are based on lock-free circular queue message passing, and asynchronous message processing.  This style of implementation enables each Dynomite node to handle a very large number of client connections while still processing many non-client facing tasks in parallel.

Peer-to-peer,  and linearly scalable

Every Dynomite node in a cluster has the same role and responsibility. Hence, there is no single point of failure in a cluster.  With this advantage, one can simply add more nodes to a Dynomite cluster to meet traffic demands or loads.
Cold cache warm-up
Currently, this feature is available for Dynomite with the Redis datastore.  Dynomite can help to reduce the performance impact by filling up an empty node or nodes with data from its peers.  
Asymmetric multi-datacenter replications
As seen earlier, a write can be replicated over to multiple datacenters. In different datacenters, Dynomite can be configured with a different number of racks, each with a different number of nodes.  This helps greatly when there is unbalanced traffic across datacenters.
Internode communication and Gossip
Dynomite's built-in gossip helps maintain cluster membership as well as failure detection and recovery.  This simplifies maintenance operations on Dynomite clusters.
Functional in AWS and physical datacenters
In an AWS environment, a datacenter is equivalent to an AWS region and a rack is the same as an AWS availability zone.  At Netflix, we have more tools to support running Dynomite clusters within AWS, but in general, deployments in these two environments should be similar.


Client Architecture

Dynomite server implements the underlying datastore protocol and presents that as its public interface. Hence, one can use popular java clients like Jedis, Redisson and SpyMemcached to directly speak to Dynomite.
At Netflix, we see the benefit in encapsulating client side complexity and best practices in one place instead of having every application repeat the same engineering effort, e.g., topology-aware routing, effective failover, load shedding with exponential backoff, etc.
Dynomite ships with a Netflix homegrown client called Dyno. Dyno implements patterns inspired by Astyanax (the Cassandra client at Netflix), on top of popular clients like Jedis, Redisson and SpyMemcached, to ease the migration to Dyno and Dynomite.
Dyno Client Features
  • Connection pooling of persistent connections - this helps reduce connection churn on the Dynomite server with client connection reuse.
  • Topology aware load balancing (Token Aware) for avoiding any intermediate hops to a Dynomite coordinator node that is not the owner of the specified data.
  • Application specific local rack affinity based request routing to Dynomite nodes.
  • Application resilience by intelligently failing over to remote racks when local Dynomite rack nodes fail.
  • Application resilience against network glitches by constantly monitoring connection health and recycling unhealthy connections.   
  • Capability of surgically routing traffic away from any nodes that need to be taken offline for maintenance.
  • Flexible retry policies such as exponential backoff etc
  • Insight into connection pool metrics
  • Highly configurable and pluggable connection pool components for implementing your advanced features.
Here is an example of how Dyno does failover to improve app resilience against individual node problems.

Fun facts

The Dyno client strives to maintain compatibility with client interfaces like Jedis, which greatly lowers the barrier for apps that are already using Jedis to switch to Dynomite.
Also, since Dynomite implements both the Redis and Memcached protocols, one can use Dyno to directly connect to Redis/Memcached itself and bypass Dynomite (if needed). Just switch the connection port from the Dynomite server port to the Redis server port.
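For example, because Dynomite speaks the native Redis protocol, a plain Jedis client can talk to a Dynomite node directly. This is a minimal sketch; the hostname, port, and keys shown are assumptions that depend on how your cluster is configured:

import redis.clients.jedis.Jedis;

public class DynomiteJedisExample {
    public static void main(String[] args) {
        // Hypothetical Dynomite node and client-facing port; use your cluster's values.
        Jedis jedis = new Jedis("dynomite-node.example.com", 8102);
        try {
            // Standard Redis commands pass straight through the Dynomite proxy.
            jedis.set("member:123:preferences", "{\"autoplay\":true}");
            System.out.println(jedis.get("member:123:preferences"));
        } finally {
            jedis.disconnect();
        }
    }
}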
Having a layer of indirection with our own homegrown client gives Netflix the flexibility to do other cool things such as
  1. Request interception - you should be able to plug in your own interceptor to do things such as
    1. Implement query trace or slow query logging.
    2. Implement fault injection for testing application resilience when things go south server side.
  2. Micro batching - submitting a batch of requests to a distributed db gets tricky since different keys map to different servers as per the sharding/hashing strategy. Dyno has the capability to take a user-submitted batch, split it into shard-aware micro-batches under the covers, execute them individually and then stitch the results back together before returning to the user. Obviously one has to deal with partial failure here, and Dyno has the intelligence to retry just the failed micro-batch against the remote rack replica responsible for that hash partition.
  3. Load shedding - Dyno’s interceptor model for every request will give it the ability to do quota management and rate limiting in order to protect the backend Dynomite servers.

Linear scale test

We wanted to ensure that Dynomite could scale horizontally to meet traffic demands from hundreds of micro-services at Netflix as the company expands its global footprint.


We conducted a simple test with a static Dynomite cluster of size 6 and a load test harness that uses the Dyno client. The cluster was configured with a replication factor of 3, i.e., it was a single data center with 3 racks.


We ramped up requests against the cluster while ensuring that 99 percentile latencies were still in the single digit ms range.


We then scaled up both the server fleet and the client fleet proportionally and repeated the test. We went through a few cycles of scaling, i.e., 6 -> 12 -> 24, and at each stage we recorded the sustained throughput where the avg and 99 percentile latencies were within acceptable range, i.e., < 1ms for avg latency and 3-6 ms for 99 percentile latency.


We saw that Dynomite scales linearly as we add more nodes to the cluster. This is critical for a datastore at Netflix where we want surgical control on throughput and latency with a predictable cost model. Dynomite enables just that.

Long Term Vision & Roadmap

Dynomite has the potential to offer server-based sharding and replication for any datastore, as long as a proxy is created to intercept the desired API calls.
This initial version of Dynomite supports Redis and Memcached sharding and replication in clear text, as well as backups and restores. In the next few weeks, we will be implementing encrypted inter-datacenter communication. We also have plans to implement reconciliation (repair) of the cluster's data and to support different read/write consistency settings, making this an eventually consistent datastore.
On the Dyno client side we plan on adding other cool features such as load shedding, distributed pipelining and micro-batching. We are also looking at integrating with RxJava to provide a reactive API to Redis/Memcached which will enable apps to observe sequences of data and events.
by: Minh Do, Puneet Oberai, Monal Daxini & Christos Kalantzis

Prana: A Sidecar for your Netflix PaaS based Applications and Services


The Right Tool for the Right Job: it's hard to argue against this time-tested mantra. At Netflix, an overwhelming part of our applications and services have traditionally been implemented in Java. As our services and products evolved, we asked ourselves if Java was still the right choice for implementing these services and applications. While it's still true that for most of our microservices Java is the right choice, there was a growing set that begged for a closer look at alternatives.


For example, Javascript is a natural fit for web applications, and we fully embraced the Reactive Extensions pattern via RxJS for our UI layer. While Javascript has traditionally ruled the roost on the Web UI front, the advent of Node.js made Javascript and event-based models a force to be reckoned with on the server side as well.
Python, for example, is a popular choice for many machine learning and recommendation systems.


Over the years, we have evolved and rapidly augmented our stable of libraries that constitute the Netflix PaaS ecosystem. For the most part these libraries were traditionally implemented in Java. While we wanted to embrace Python, Node.js and other non-Java based tools and frameworks for the right reasons, we also wanted to be able to leverage our existing Java libraries and services and continue evolving them. Duplicating the functionality of these libraries in different languages was undesirable, as it would lead to duplication of effort and increased cost in terms of maintenance and developer resources, to say nothing of the bug fixes that would need to be ported across these languages.


How could we leverage our Java libraries and services infrastructure while aiding and supporting our increasingly Polyglot ecosystem?


Fig. above: A Polyglot Cloud Ecosystem


A Sidecar was the solution we naturally evolved into.
A sidecar, by definition, is "a one-wheeled device attached to the side of a motorcycle, scooter, or bicycle, producing a three-wheeled vehicle."

Sidecar as a solution for providing applications with “non-intrusive platform capabilities” has gained popularity outside of Netflix as well, especially in the NetflixOSS ecosystem where many users have used this concept. Andrew’s earlier blog post while he was championing NetflixOSS within IBM addresses some of the motivations and philosophies.

Prana is the Sidecar we use at Netflix. 
We are happy to add Prana to the collection of open source components under the NetflixOSS banner. Prana is available on Github at http://github.com/netflix/Prana.
Prana is conceptually “attached” to the main (aka Parent) application and complements it by providing “platform features” that are otherwise available as libraries within a JVM-based application.










Architectural Diagram of Prana




The diagram above shows a typical application/service implemented in a non JVM language (Python, RoR, Node.js etc.) accessing various Platform Services via Prana the sidecar which is hosted alongside the main application in the same host/vm.

Motivation



The motivation for creating Prana is to provide Netflix platform functionality for the following two broad application categories:


  • Non JVM based applications: Since most Netflix platform libraries are written in Java and there is close to no support for other languages, it becomes increasingly difficult for non-JVM based applications to fit into the Netflix environment, e.g., register with Eureka, log events to Suro, read Archaius-based properties, invoke other micro-services safely via Ribbon, etc. Prana makes this possible by providing these functionalities over HTTP from a locally available server (on the same virtual instance).
  • Short-lived processes: Initializing the various "platform" libraries is a heavy operation, and for processes that are short-lived, e.g., jobs, batch scripts, etc., it is an unnecessary overhead. So, even though these processes are JVM based, Prana alleviates the overhead of platform initialization.


Key Features

  • Eureka (Metadata Registry) Registration
Prana registers the parent application with discovery and hence makes the parent application “discoverable” inside Netflix infrastructure.
  • Service Discovery
For application clients with their own load balancers that don't need the Proxy, the eureka/hosts endpoint gives you a view of the current set of instances so you can update your own load balancer information (see the sketch after this list).
  • Dynamic Properties
The sidecar integrates with Archaius to allow the processes it runs alongside to get updates to dynamic properties.
  • Proxy
Efficient, high-performance and resilient IPC calls (i.e., employing circuit breakers, bulkheading and fallbacks) to other services in the ecosystem are an important aspect of a microservices/SOA based architecture. Prana proxies requests to other micro-services using Hystrix/Ribbon, thus enabling the parent (main) application to invoke other micro-services inside Netflix using the same resilient stack as used by other JVM based applications.
At Netflix, we are taking measured steps to evolve many of our high-throughput IPC based apps/services to employ the new Reactive/Asynchronous programming model. Prana allows the main applications to utilize these high-throughput IPC mechanisms via a proxy endpoint.
  • Healthcheck
Advertises the healthcheck endpoint to Eureka and provides an extensible healthcheck implementation, which involves checking the health of the parent (main) application.
  • Admin UI
Since Prana is a Karyon application, it can leverage the Karyon feature of an embedded Admin Console.  The admin console provides runtime insights and diagnostics of the server and its environment.
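To give a feel for how a process consults the sidecar, here is a minimal sketch of calling a Prana HTTP endpoint from a short-lived JVM process; the local port, endpoint path, and query parameter are assumptions for illustration and depend on how Prana is configured in your environment:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PranaSidecarExample {

    // Hypothetical local Prana port; the actual port depends on your configuration.
    private static final String PRANA_BASE_URL = "http://localhost:8078";

    public static void main(String[] args) throws Exception {
        // Ask the sidecar for the current instances of a service registered in Eureka,
        // instead of initializing the full Eureka client in this short-lived process.
        // The query parameter name is illustrative.
        System.out.println(get(PRANA_BASE_URL + "/eureka/hosts?appName=MOVIE-SERVICE"));
    }

    private static String get(String urlString) throws Exception {
        HttpURLConnection connection = (HttpURLConnection) new URL(urlString).openConnection();
        connection.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            connection.disconnect();
        }
        return body.toString();
    }
}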

Prana also comes with a pluggability feature that would allow for other plugins to be added for easy extensibility.

A note on usage

The usage of Prana inside Netflix is mainly in applications where we can't plug in our JVM based client libraries. We use Prana with non-JVM applications and also with any application that is not developed on the Netflix stack. For example, we run Prana alongside Memcached, Spark and Mesos servers to make them discoverable.


Functions such as Hystrix, Eureka registration and healthcheck work better operationally when integrated within a JVM based application via the respective Java libraries, given the internal knowledge the application has of itself, as opposed to being accessed via a sidecar. Also, functions such as dynamic properties and metrics are more efficiently queried and updated within the application process than by calling across processes.


Getting Started
You can take Prana for a quick run by following the instructions at https://github.com/Netflix/Prana/wiki/Getting-Started.


Summary

Homogeneity is a great asset while operating and managing a heavily microservices based ecosystem. An effective way to provide such homogeneity in terms of platform-infrastructure is by leveraging the concept of a Sidecar. Prana fulfils this functionality and provides us the ability to employ the best suitable language/framework for our increasingly complex and heterogeneous Cloud ecosystem.


We hope that Prana will be a welcome addition to our already growing NetflixOSS ecosystem. We welcome contributions in terms of code, documentation or ideas to Prana or any of our other NetflixOSS projects.


If building innovative infrastructural components is your passion, we are looking for great engineers to join us and further augment our PaaS infrastructure.

Contributors: Diptanu Gon Choudhury, Sudhir Tonse and Andrew Spyker


Traffic Optimizer: The Power of Negatives

Contributors: Steve Tangsombatvisit, Guillaume McIntyre, Kevin Heung


Search marketing is an area of vast possibility, with hundreds of different ways to accomplish similar things. Global sites like Netflix serve millions of different search ads to people all over the world each day. We try to optimize this traffic to make sure that the right ads are served to the right people.


There are many tools and methods that teams can use to optimize traffic. To go over all that we do at Netflix in the area of search traffic optimization would consume much more than a single blog post. Instead, we will focus on one potential and important area of traffic optimization, negative keywords.


Negative keywords allow advertisers to direct traffic to the right ad group and, assuming you have relevant ads for every ad group (this assumption applies throughout the entire post), can increase click-through rates (CTR), which will positively impact quality score and, all else being equal, lower average cost-per-click (CPC).

Let's look at a common Netflix scenario. At Netflix, we have ad groups for a variety of keywords related to streaming movies online. We have different ad groups for different keyword combinations and permutations. For example, we have an ad group for the keyword “stream movies” and another ad group for the keyword “stream movies online.”


Sample Netflix Ad Groups


The differences are subtle but important. If the search query is “stream movies online,” we do not want Google to direct traffic to the “stream movies” ad group because it would not be as relevant as the ad group “stream movies online.”

You can shape this traffic using negative keywords, which is a fairly easy operation. Unfortunately, for large search marketing teams with many accounts, thousands of ad groups and millions of keywords, managing negative keywords is a time-consuming task.



This is the problem we set out to solve: to manage negative keywords on a large global scale in an automated and relevant way.


The first thing we defined was a set of rules of when and where to apply negatives. There is no real industry standard for negative keywords.  After browsing around, we did not find anything that fit our needs. Per the Netflix culture, we defined our own. Here is the process we came up with.


We typically create "exact", "phrase" and "broad" ad groups.


'Exact (E), Phrase (P) and Broad (B)'

After all ad groups are set up in this format, we look for search traffic that could have been directed to a more relevant ad group. To accomplish this, we do a daily audit of the AdWords search query report, using an algorithm to determine whether a query triggered the most relevant ad group.
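To make the audit concrete, here is a hedged Python sketch of the kind of rule the Traffic Optimizer applies: for each row of the search query report, if an ad group exists whose keyword exactly matches the query but a different ad group served the ad, the query becomes a negative keyword on the ad group that served it. The report format, ad group names and matching rule are simplified illustrations, not our production logic.

# Illustrative sketch (not Netflix's production Traffic Optimizer): emit negative
# keywords when a search query was served from a less specific ad group even
# though an exact ad group exists for that query.

# ad group name -> the keyword it targets
AD_GROUPS = {
    "stream movies - E": "stream movies",
    "stream movies online - E": "stream movies online",
}

# rows from a (simplified) search query report: (search query, ad group served from)
SEARCH_QUERY_REPORT = [
    ("stream movies online", "stream movies - E"),
    ("stream movies", "stream movies - E"),
]

def audit(report, ad_groups):
    """Return {ad_group: set of negative keywords to add}."""
    exact_targets = {keyword: group for group, keyword in ad_groups.items()}
    negatives = {}
    for query, served_group in report:
        best_group = exact_targets.get(query)
        if best_group and best_group != served_group:
            # The query should have gone to a more relevant ad group,
            # so add it as a negative on the group that actually served the ad.
            negatives.setdefault(served_group, set()).add(query)
    return negatives

print(audit(SEARCH_QUERY_REPORT, AD_GROUPS))
# -> {'stream movies - E': {'stream movies online'}}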


Search Query Report


Let’s take a look at a couple of examples.


Example 1
Search Query: “stream movie”
Ad Served From: “stream movies - E”



Example 2:
Search Query: “stream movie online”
Ad Served From: “stream movies - P”


At Netflix, our Traffic Optimizer audits the search query report daily and applies these rules to all queries, adding new negative keywords as necessary. In the first few days of execution, the Traffic Optimizer generated hundreds of new negative keywords, then slowed to a few per day, as expected. The Traffic Optimizer seems to have helped increase CTR (see below), but we will be conducting a test to quantify the impact.


[Chart: click-through rate trend following the Traffic Optimizer rollout]


We use the Traffic Optimizer along with several other in-house tools that help us move fast and streamline operations. This allows us to manage hundreds of campaigns in an automated way without much manual involvement.


Traffic optimizer is just one of the many in-house tools that we’re developing to make sure we deliver the right advertisements about Netflix to the right audience at the right time.  If you’re interested in helping us stay on the bleeding edge of advertising technology and spreading the word about Netflix, we’re always looking for talented engineers.  Head on over to our jobs page to learn more!

Introducing Raigad - An Elasticsearch Sidecar

Netflix has very diverse data needs. Those needs fall anywhere between rock-solid durable datastores, like Apache Cassandra, and lossy in-memory stores, such as the current incarnation of Dynomite. Somewhere in that spectrum is the need to store, index and search documents. This is where Elasticsearch has found a niche at Netflix.
Elasticsearch usage at Netflix has proliferated over the past year. It began as one or two isolated deployments managed by the teams using it. That usage has quickly grown to more than 15 clusters (755 nodes) in production, centrally managed by the Cloud Database Engineering (CDE) team.
CDE, as does all of Netflix, believes in automating the operations of our production systems. This is what led us to create tools such as Priam, a sidecar to help manage Apache Cassandra clusters. That same philosophy led us to create Raigad, an Elasticsearch sidecar.

Key Features

Integration with a centralized monitoring system

Raigad collects and publishes Elasticsearch metrics to a centralized telemetry, monitoring and alerting system. This is achieved by using the Netflix Open Source project Servo. Raigad's architecture allows you to integrate with your own telemetry system.
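Raigad's metric publishing is implemented in Java on top of Servo, but the underlying pattern is easy to sketch: periodically poll the local node's stats API and hand a few values to whatever telemetry client you use. The snippet below is an assumed illustration using Elasticsearch's standard _nodes/_local/stats endpoint; publish() stands in for your own telemetry system.

# Illustrative sketch only: Raigad itself publishes metrics via the Java-based
# Servo library. This shows the general pattern of polling local node stats
# and handing selected values to a telemetry client of your choosing.
import time
import requests

NODE_STATS_URL = "http://localhost:9200/_nodes/_local/stats"  # standard ES endpoint

def publish(name, value):
    # Stand-in for your telemetry system (Servo, Atlas, Graphite, ...).
    print(f"metric {name}={value}")

def poll_once():
    stats = requests.get(NODE_STATS_URL, timeout=5).json()
    node = next(iter(stats["nodes"].values()))
    publish("docs.count", node["indices"]["docs"]["count"])
    publish("search.query_total", node["indices"]["search"]["query_total"])
    publish("jvm.heap_used_bytes", node["jvm"]["mem"]["heap_used_in_bytes"])

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)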

Node discovery and tracking

We’ve included a sample implementation that uses Cassandra for Raigad to keep track of Elasticsearch cluster metadata. Every Elasticsearch instance reads Cassandra to discover the other nodes it needs to connect to during bootstrap. In this sample implementation, Cassandra eases multi-region Elasticsearch deployments by replicating the Elasticsearch metadata across every region where Elasticsearch is deployed. This could also be implemented using Eureka.
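As an illustration of the discovery pattern (not Raigad's actual Java implementation or schema), the sketch below reads a Cassandra table of instance metadata with the DataStax Python driver to build the list of peers a bootstrapping node should connect to. The keyspace, table and column names are assumptions.

# Illustrative sketch of Cassandra-backed node discovery (the keyspace, table
# and columns are assumptions; Raigad's sample implementation is in Java).
from cassandra.cluster import Cluster

def discover_peers(cluster_name, contact_points=("127.0.0.1",)):
    """Return the private IPs of every registered node in the ES cluster."""
    cassandra = Cluster(list(contact_points))
    session = cassandra.connect("es_metadata")  # assumed keyspace
    rows = session.execute(
        "SELECT instance_id, private_ip FROM instances WHERE cluster_name = %s",
        [cluster_name],
    )
    peers = [row.private_ip for row in rows]
    cassandra.shutdown()
    return peers

# At bootstrap, these peers would be written into elasticsearch.yml as the
# discovery.zen.ping.unicast.hosts list.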

Auto configuration of the elasticsearch.yml file

Raigad provides a range of configuration parameters to tune the Elasticsearch yaml at bootstrap time, e.g. ASG-based dedicated master/data/search node deployments (the default at Netflix), multi-region deployments, tribe node setups, etc.
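As a rough illustration of what tuning the yaml at bootstrap amounts to, the sketch below derives a node's role from an assumed ASG naming convention (cluster-role) and renders the corresponding Elasticsearch 1.x settings. The naming rule, chosen settings and output path are assumptions for the example, not Raigad's configuration keys.

# Illustrative sketch: derive node role from an assumed ASG naming convention
# (cluster-role, e.g. "es_docs-master") and render elasticsearch.yml settings.
import yaml

def render_config(asg_name, unicast_hosts):
    cluster_name, _, role = asg_name.rpartition("-")
    settings = {
        "cluster.name": cluster_name,
        # Dedicated master / data / search(client) roles, ES 1.x style flags.
        "node.master": role == "master",
        "node.data": role == "data",
        "discovery.zen.ping.multicast.enabled": False,
        "discovery.zen.ping.unicast.hosts": unicast_hosts,
    }
    return yaml.safe_dump(settings, default_flow_style=False)

if __name__ == "__main__":
    print(render_config("es_docs-data", ["10.0.0.11", "10.0.0.12"]))
    # A sidecar would write this out to elasticsearch.yml (path assumed).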

Index management

Raigad takes care of deleting old indices and creating new ones based on the retention period configured for individual indices. We currently support daily, monthly and yearly retention periods.
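For a daily retention period, the cleanup essentially reduces to deleting any date-suffixed index older than the retention window. The sketch below illustrates that idea using Elasticsearch's standard _cat/indices and delete-index APIs; the index naming pattern and retention value are assumptions, not Raigad's configuration.

# Illustrative retention sketch: delete date-suffixed indices older than
# `retention_days`. The index naming pattern is an assumption for the example.
import datetime
import requests

ES = "http://localhost:9200"

def cleanup(prefix="nightly_logs-", retention_days=30):
    cutoff = datetime.date.today() - datetime.timedelta(days=retention_days)
    # _cat/indices?h=index returns one index name per line.
    names = requests.get(f"{ES}/_cat/indices?h=index", timeout=10).text.split()
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            index_date = datetime.datetime.strptime(name[len(prefix):], "%Y%m%d").date()
        except ValueError:
            continue  # not one of ours
        if index_date < cutoff:
            requests.delete(f"{ES}/{name}", timeout=30)

if __name__ == "__main__":
    cleanup()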

Improvements to run better in AWS

Raigad is used extensively at Netflix in the AWS environment. As mentioned above, for dedicated node deployments we rely on an ASG naming convention. For credentials, Raigad supports Amazon’s IAM key profile management; using IAM credentials allows you to provide access to the AWS API without storing an AccessKeyId or SecretAccessKey anywhere on the machine. If required, you can use your own implementation as well.
Raigad also supports scheduled nightly snapshot backups to S3, along with restores at startup or via a REST call (it uses the elasticsearch-aws-plugin underneath).
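Conceptually, the nightly backup is a call to Elasticsearch's snapshot API against an S3-backed repository provided by the AWS plugin, and a restore is a call to the corresponding _restore endpoint. The sketch below illustrates those calls; the repository, bucket and snapshot names are placeholders, and Raigad itself drives this from Java on its own schedule.

# Illustrative sketch of S3 snapshots via the Elasticsearch snapshot API
# (repository and bucket names are placeholders; Raigad drives this from Java).
import datetime
import requests

ES = "http://localhost:9200"
REPO = "nightly_s3_backups"

def register_repository(bucket):
    # Registers an S3-backed snapshot repository (requires the AWS plugin).
    body = {"type": "s3", "settings": {"bucket": bucket, "base_path": "es-snapshots"}}
    requests.put(f"{ES}/_snapshot/{REPO}", json=body, timeout=30).raise_for_status()

def take_snapshot():
    name = "snapshot_" + datetime.date.today().strftime("%Y%m%d")
    requests.put(f"{ES}/_snapshot/{REPO}/{name}", timeout=30).raise_for_status()

def restore(name):
    requests.post(f"{ES}/_snapshot/{REPO}/{name}/_restore", timeout=30).raise_for_status()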

More Info

You can get more info about the features described above or about how to use and install Raigad here.

Summary

Distributed systems are complex to operate and to recover from failure. Add to that the huge scale at which Netflix operates, and you quickly need to decide how to operate such systems: either scale a team out to handle the load, or build good automation that can monitor, analyze and alleviate issues automatically. Netflix’s approach has always been the latter. Raigad continues this trend by providing a tool to help manage our growing Elasticsearch deployment.
CDE is very excited to add Raigad to our ever-growing NetflixOSS library. If you run Elasticsearch on AWS at scale, we believe Raigad may be useful to you too. As with all of our projects, feedback, code or documentation submissions are always welcome.
If you are passionate about Elasticsearch or Open Source Software in general, we are always looking for great engineers.


Genie 2.0: Second Wish Granted!

By Tom Gianos and Amit Sharma @ Big Data Platform Team

A little over a year ago we announced Genie, a distributed job and resource management tool. Since then, Genie has operated in production at Netflix, servicing tens of thousands of ETL and analytics jobs daily. There were two main goals in the original design of Genie:

  • To abstract the execution environment away from Hadoop, Hive and Pig job submissions.
  • To enable horizontal scaling of client resources based on demand.

Since the development of Genie 1.0, much has changed in both the big data ecosystem and here at Netflix. Hadoop 2 was officially released, enabling clusters to use execution engines beyond traditional MapReduce. Newer tools, such as interactive query engines like Presto and Spark, are quickly gaining in popularity. Other emerging technologies like Mesos and Docker are changing how applications are managed and deployed. Some changes to our big data platform in the last year include:

  • Upgrading our Hadoop clusters to Hadoop 2.
  • Moving to Parquet as the primary storage format for our data warehouse.
  • Integrating Presto into our big data platform.
  • Developing, deploying and open sourcing Inviso, to help users and admins gain insights into job and cluster performance.

Amidst all this change, we reevaluated Genie to determine what was needed to meet our evolving needs. Genie 2.0 is the result of this work, and it provides a more flexible, extensible and feature-rich distributed configuration and job execution engine.

Reevaluating Genie 1.0

Genie 1.0 accomplished its original goals well, but the narrow scope of those goals led to limitations, including:

  • It only worked with Hadoop 1.
  • It had a fixed data model designed for a very specific use case. Code changes were required to accomplish minor changes in behavior.
    • As an example, the s3CoreSiteXml and s3HdfsSiteXml fields of the ClusterConfigElement entity stored the paths to the core-site and hdfs-site XML files of a Hadoop cluster, rather than storing them in a generic collection field.
  • The criteria for selecting an execution environment were very limited. The only way to select a cluster was by setting one of three types of schedules: SLA, ad hoc or bonus.

Genie 1.0 could not continue to meet our needs as the number of desired use cases increased and we continued to adopt new technologies. Therefore, we decided to take this opportunity to redesign Genie.

Designing and Developing Genie 2.0

The goals for Genie 2.0 were relatively straightforward:

  • Develop a generic data model, which would let jobs run on any multi-tenant distributed processing cluster.
  • Implement a flexible cluster and command selection algorithm for running a job.
  • Provide richer API support.
  • Implement a more flexible, extensible and robust codebase.

Each of these goals is explored below.

The Data Model

The new data model consists of the following entities:

Cluster: It stores all the details of an execution cluster including connection information, properties, etc. Some cluster examples are Hadoop 2, Spark, Presto, etc. Every cluster can be linked to a set of commands that it can run.

Command: It encapsulates the configuration details of an executable that is invoked by Genie to submit jobs to the clusters. This includes the path to the executable, the environment variables, configuration files, etc. Some examples are Hive, Pig, Presto and Sqoop. If the executable is already installed on the Genie node, configuring a command is all that is required. If the executable isn’t installed, a command can be linked to an application in order to install it at runtime.

Application: It provides all the components required to install a command executable on Genie instances at runtime. This includes the location of the jars and binaries, additional configuration files, an environment setup file, etc. Internally we have our Presto client binary configured as an application. A more thorough explanation is provided in the “Our Current Deployment” section below.

Job: It contains all the details of a job request and execution including any command line arguments. Based on the request parameters, a cluster and command combination is selected for execution. Job requests can also supply necessary files to Genie either as attachments or via the file dependencies field, if they already exist in an accessible file system. As a job executes, its details are recorded in the job record.

All the above entities support a set of tags that can provide additional metadata. The tags are used for cluster and command resolution as described in the next section.
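To make the relationships between these entities concrete, here is a hedged Python sketch of the model; the field names are a simplified approximation for illustration, not Genie's actual schema.

# Simplified approximation of the Genie 2.0 entities (illustrative field names,
# not the exact schema). Tags on every entity drive cluster/command resolution.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Application:          # binaries/configs needed to install a command at runtime
    name: str
    jars: List[str] = field(default_factory=list)
    setup_file: Optional[str] = None
    tags: List[str] = field(default_factory=list)

@dataclass
class Command:              # an executable Genie invokes, e.g. hive, pig, presto
    name: str
    executable: str
    application: Optional[Application] = None
    tags: List[str] = field(default_factory=list)

@dataclass
class Cluster:              # an execution cluster and the commands it can run
    name: str
    commands: List[Command] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)

@dataclass
class Job:                  # a job request plus its execution record
    name: str
    command_args: str
    command_tags: List[str] = field(default_factory=list)
    cluster_tags: List[List[str]] = field(default_factory=list)  # priority-ordered
    dependencies: List[str] = field(default_factory=list)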

Job Execution Environment Selection

Genie now supports a highly flexible method to select the cluster to run a job on and the command to execute, collectively known as the execution environment. A job request specifies two sets of tags to Genie:

  • Command Tags: A set of tags that maps to zero or more commands.
  • Cluster Tags: A priority ordered list of sets of tags that maps to zero or more clusters.

Genie iterates through the cluster tags list, and attempts to use each set of tags in combination with the command tags to find a viable execution environment. The ordered list allows clients to specify fallback options for cluster selection if a given cluster is not available.
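The resolution behavior described above can be sketched in a few lines; this is an illustration of the documented behavior, not Genie's implementation, and tag matching is modeled simply as "the entity carries every requested tag".

# Illustration of the described resolution behavior (not Genie's source code).
# A cluster/command "matches" a tag set when it carries every requested tag.
def matches(entity_tags, requested_tags):
    return set(requested_tags).issubset(entity_tags)

def resolve(job_cluster_tags, job_command_tags, clusters, commands):
    """job_cluster_tags is a priority-ordered list of tag sets; the first set
    that yields a viable (cluster, command) pair wins, later sets are fallbacks."""
    for tag_set in job_cluster_tags:
        for cluster in clusters:
            if not matches(cluster["tags"], tag_set):
                continue
            for command in commands:
                if command["name"] in cluster["commands"] and matches(command["tags"], job_command_tags):
                    return cluster["name"], command["name"]
    raise LookupError("no viable execution environment for the given tags")

clusters = [
    {"name": "bonus_etl", "tags": {"type:yarn", "sched:bonus"}, "commands": ["hive"]},
    {"name": "prod", "tags": {"type:yarn", "sched:sla"}, "commands": ["hive", "pig"]},
]
commands = [{"name": "hive", "tags": {"type:hive", "ver:0.13"}}]

# Prefer a nightly bonus cluster, fall back to the production cluster.
print(resolve([{"sched:bonus"}, {"sched:sla"}], {"type:hive"}, clusters, commands))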

At Netflix, nightly ETL jobs leverage this feature. Two sets of cluster tags are specified for these jobs. The first set matches our bonus clusters, which are spun up every night to help with our ETL load. These clusters use some of our excess, pre-reserved capacity available during lower traffic hours for Netflix. The other set of tags match the production cluster and act as the fallback option. If the bonus clusters are out of service when the ETL jobs are submitted, the jobs are routed to the main production cluster by Genie.

Richer API Support

Genie 1.0 exposed a limited set of REST APIs. Any update to the contents of a resource had to be done by sending a request containing the entire object to the Genie service. In contrast, Genie 2.0 supports fine-grained APIs, including the ability to directly manipulate the collections that are part of the entities. For a complete list of available APIs, please see the Genie API documentation.

Code Enhancements

An examination of the Genie 1.0 codebase revealed aspects that needed to be modified in order to provide the flexibility and standards compliance desired going forward.
Some of the goals to improve the Genie codebase were to:

  • Decouple the layers of the application to follow a more traditional three tiered model.
  • Remove unnecessary boilerplate code.
  • Standardize and extend REST APIs.
  • Improve deployment flexibility.
  • Improve test coverage.

Tools such as Spring, JPA 2.0, Jersey, JUnit, Mockito, Swagger, etc. were leveraged to solve most of the known issues and better position the software to handle new ones in the future.

Genie 2.0 was completely rewritten to take advantage of these frameworks and tools. Spring features such as dependency injection, JPA support, transactions, profiles and more are utilized to produce a more dynamic and robust architecture. In particular, dependency injection for various components allows Genie to be more easily modified and deployed both inside and outside Netflix. Swagger based annotations on top of the REST APIs provide not only improved documentation, but also a mechanism for generating clients in various languages. We used Swagger codegen to generate the core of our Python client, which has been uploaded to Pypi. Almost six hundred tests have also been added to the Genie code base, making the code more reliable and maintainable.

Our Current Deployment
Genie 2.0 has been deployed at Netflix for a couple of months, and all Genie 1.0 jobs have been migrated over. Genie currently provides access to all the Hadoop and Presto clusters in our production, test and ad hoc environments. In production, Genie currently autoscales between twelve and twenty i2.2xlarge AWS instances, allowing several hundred jobs to run at any given time. This provides horizontal scaling of clients for our clusters with no additional configuration or overhead.

Presto and Sqoop commands are each configured with a corresponding application that points to locations in S3, where all the jars and binaries necessary to execute these commands are located. Every time one of these commands runs, the necessary files are downloaded and installed. This allows us to continuously deploy updates to our Presto and Sqoop clients without redeploying Genie. We’re planning to move our other commands, like Pig and Hive, to this pattern as well.

At Netflix, launching a new cluster is done via a configuration-based launch script. After a cluster is up in AWS, the cluster configuration is registered with Genie. Commands are then linked to the cluster based on predefined configurations. After it is properly configured in Genie, the cluster is marked as “available”. When we need to take down a cluster, it is marked as “out of service” in Genie so the cluster can no longer accept new jobs. Once all running jobs are complete, the cluster is marked as “terminated” in Genie and the instances are shut down in AWS.

Genie 2.0 going live in our environment has allowed us to bring together all the new tools and services we’ve added to the big data platform over the last year. We have already seen many benefits from Genie 2.0. We were able to add Presto support to Genie in a few days and Sqoop in less than an hour. These changes would have required code modification and redeployment with Genie 1.0, but were merely configuration changes in Genie 2.0.
Below is our new big data platform architecture with Genie at its core.


[Diagram: Netflix big data platform architecture with Genie at its core]

Future Work


There is always more to be done. Some enhancements that can be made going forward include:

  • Improving the job execution and monitoring components for better fault tolerance, efficiency on hosts and more granular status feedback.
  • Abstracting Genie’s use of Netflix OSS components to allow adopters to implement their own functionality for certain components to ease adoption.
  • Improving the admin UI to expose more data to users, e.g. showing all clusters a given command is registered with.

We’re always looking for feedback and input from the community on how to improve and evolve Genie. If you have questions or want to share your experience with running Genie in your environment, you can join our discussion forum. If you’re interested in helping out, you can visit our Github page to fork the project or request features.

ZeroToDocker: An easy way to evaluate NetflixOSS through runtime packaging


Motivation:


The NetflixOSS platform and related ecosystem services are extensive. While we make every attempt to document each project, the sheer breadth of the platform makes it hard for most users to evaluate NetflixOSS quickly, and even harder to understand its individual parts in isolation.


Another part of the challenge relates to how NetflixOSS was designed for scale. Most services are intended to be set up as a multi-node, auto-recoverable cluster. While this is great once you are ready for production, it makes it prohibitively complex for new users to try out NetflixOSS in a smaller-scale environment.


A final part of the challenge is that, in order to keep the platform a collection of services and libraries that users can adopt wherever they make sense, the runtime artifacts are distributed in pieces that users assemble in different ways. Many of the Java libraries are in Maven Central, some of the complete services are assembled as distribution zips and wars in our CI environment on CloudBees, and others are delivered through file distribution services. None of these distribution channels gives you a single command line that is guaranteed to work across the many places people might want to run the NetflixOSS technologies.


A simple solution:


Recently it has become popular to demonstrate technology quickly through the use of Docker containers. If you search GitHub, there are hundreds of projects that include Dockerfiles, the image build description files for Docker containers. By including a Dockerfile, a developer shows you exactly how to not only build the code on GitHub, but also assemble and run the compiled artifacts as a full system.


ZeroToDocker is a project that solves the above problems.  Specifically, it allows anyone with a Docker host (on their laptop, on a VM in the cloud, etc.) to run a single node of any NetflixOSS technology with a single command.  If you have the network bandwidth to download 500-700 MB images, you can now run each part of the NetflixOSS platform with a single command.  For example, here is the command to run a single node of Zookeeper managed through NetflixOSS Exhibitor:
  • docker run -d --name exhibitor netflixoss/exhibitor:1.5.2
This command tells Docker to pull the image “exhibitor” at version “1.5.2” from the official NetflixOSS account and run it as a daemon with a container name of “exhibitor”.  It will start up just as quickly as the Java processes would on a base OS, due to the process model of Docker; it will not start a separate OS instance.  It will also “containerize” the environment, meaning that the Exhibitor and Zookeeper processes will be isolated from other containers and processes running on the same Docker host.  Finally, if you examine the Dockerfile that builds this image, you will see that it exposes the Zookeeper and Exhibitor ports 2181, 2888, 3888 and 8080 to the network.  This means you can access these ports via standard Docker networking.  In fact you can load up the following URL:
  • http://EXHIBITORIPADDRESS:8080/exhibitor/v1/ui/index.html
All of this can be done in seconds beyond the initial image download time, with very little prior knowledge of NetflixOSS.  We expect this to reduce the learning curve of getting started with NetflixOSS by at least an order of magnitude.
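If you prefer driving Docker programmatically rather than via the CLI command above, here is a hedged sketch using the Docker SDK for Python (assumed installed as the docker package); it runs the same image, names the container and publishes the Exhibitor UI and Zookeeper client ports.

# Programmatic equivalent of the `docker run` command above, sketched with the
# Docker SDK for Python (pip install docker). The port mappings are illustrative.
import docker

client = docker.from_env()
container = client.containers.run(
    "netflixoss/exhibitor:1.5.2",
    name="exhibitor",
    detach=True,
    ports={"8080/tcp": 8080, "2181/tcp": 2181},  # Exhibitor UI and Zookeeper client port
)
print(container.name, container.status)
# The Exhibitor UI should then be reachable at
# http://<docker-host>:8080/exhibitor/v1/ui/index.html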


Images so far:


We decided to focus on the platform foundation of NetflixOSS, but we are already in discussions with our chaos testing and big data teams to create Docker images for other aspects of the NetflixOSS ecosystem.  For now, we have released:


  • Asgard
  • Eureka
  • A Karyon based Hello World Example
  • A Zuul Proxy used to proxy to the Karyon service
  • Exhibitor managed Zookeeper
  • Security Monkey


Can you trust these images:


Some of our great OSS community members have already created Docker images for aspects of NetflixOSS.  While we don’t want to take anything away from these efforts, we wanted to take them a step further.  Recently, Docker announced Docker Hub.  You can think of Docker Hub as a ready-to-run image repository, similar to how you think of GitHub for your code or CloudBees for your deployment artifacts.  Docker Hub creates an open community around images.


Additionally, Docker Hub has the concept of a trusted build.  This means anyone can point their Docker Hub account at GitHub and tell Docker Hub to build their trusted images on their behalf.  After these builds are done, the images are exposed through the standard cloud registry, from which anyone can pull and run them.  Because the images are built by Docker in a trusted, isolated environment, and because any user can trace an image build back to exact Dockerfiles and source on GitHub and Maven Central, you can see exactly where all the running code originated and make stronger decisions about trust.  With the exception of Oracle Java, Apache Tomcat and Apache Zookeeper, all of the code on the images originates from trusted NetflixOSS builds.  Even Java (cloned from Feng Honglin’s Java 7 Dockerfile), Tomcat and Zookeeper are easy to trust, as you can read the Dockerfile to trace exactly where they originated.


Can you learn from these images:


If you go to the Docker Hub image you are now running, you can navigate back to the GitHub project that hosts the Dockerfiles.  Inside the Dockerfile you will find the exact commands required to assemble this running image.  Also, files are included with the exact properties needed to have a functioning single-node service.


This means you can get up and running very quickly on a simple NetflixOSS technology, learn how those images work, and then progress to your own production deployment using what you learned was under the covers of the running single instance.  While we have tried to document this for each project on the GitHub wiki in the past, it is much easier to document through a running technology than to document it fully in prose on a wiki.


A final note on accelerated learning vs. production usage:


As noted on the ZeroToDocker Wiki, we are not recommending the use of these Docker images in production.  We designed these images to be as small as possible in scope so you can get the minimum function running as quickly as possible.  That means they do not consider production issues that matter like multi-host networking, security hardening, operational visibility, storage management, and high availability with automatic recovery.


We also want to make it clear that we do not run these images in production.  We continue to run almost all of our systems on the EC2 virtual machine based IaaS.  We do this as the EC2 environment along with Netflix additions provides all of the aforementioned production requirements.  We are starting to experiment with virtual machines running Docker hosting multiple containers in EC2, but those experiments are limited to classes of workloads that get unique value out of a container based deployment model while being managed globally by EC2 IaaS.


Based on the fact that these images are not production ready, we have decided to keep ZeroToDocker on our Netflix-Skunkworks account on GitHub.  However, we believe helping people get up and running on NetflixOSS is valuable, so we wanted to make the images available.


Roadmap:


We started small.  Over time, you can expect more images that represent a larger slice of the NetflixOSS platform and ecosystem.  We may also expand the complexity, showing how to set up clusters or secure the images more tightly.  We’ve built specific versions of each service, but in the future we will need to create a continuous integration system for building our images.


If you enjoy helping with open source and want to build future technologies like the ones we’ve just demonstrated, check out some of our open jobs.  We are always looking for excellent engineers to extend our NetflixOSS platform and ecosystem.

Node.js in Flames

We’ve been busy building our next-generation Netflix.com web application using Node.js. You can learn more about our approach from the presentation we delivered at NodeConf.eu a few months ago. Today, I want to share some recent learnings from performance tuning this new application stack.

We were first clued in to a possible issue when we noticed that request latencies to our Node.js application would increase progressively with time. The app was also burning more CPU than expected, which closely correlated with the higher latency. While using rolling reboots as a temporary workaround, we raced to find the root cause using new performance analysis tools and techniques in our Linux EC2 environment.

Flames Rising

We noticed that request latencies to our Node.js application would increase progressively with time. Specifically, some of our endpoints’ latencies would start at 1ms and increase by 10ms every hour. We also saw a correlated increase in CPU usage.


This graph plots request latency in ms for each region against time. Each color corresponds to a different AWS AZ. You can see latencies steadily increase by 10 ms an hour and peak at around 60 ms before the instances are rebooted.

Dousing the Fire

Initially we hypothesized that something faulty, such as a memory leak in our own request handlers, was causing the rising latencies. We tested this assertion by load-testing the app in isolation, adding metrics that measured both the latency of only our request handlers and the total latency of a request, as well as increasing the Node.js heap size to 32 GB.

We saw that our request handlers’ latencies stayed constant across the lifetime of the process at 1 ms. We also saw that the process’s heap size stayed fairly constant at around 1.2 GB. However, overall request latencies and CPU usage continued to rise. This absolved our own handlers of blame and pointed to problems deeper in the stack.

Something was taking an additional 60 ms to service the request. What we needed was a way to profile the application’s CPU usage and visualize where we were spending most of our time on CPU. Enter CPU flame graphs and Linux Perf Events to the rescue.

For those unfamiliar with flame graphs, it’s best to read Brendan Gregg’s excellent article explaining what they are -- but here’s a quick summary (straight from the article).
  • Each box represents a function in the stack (a "stack frame").
  • The y-axis shows stack depth (number of frames on the stack). The top box shows the function that was on-CPU. Everything beneath that is ancestry. The function beneath a function is its parent, just like the stack traces shown earlier.
  • The x-axis spans the sample population. It does not show the passing of time from left to right, as most graphs do. The left to right ordering has no meaning (it's sorted alphabetically).
  • The width of the box shows the total time it was on-CPU or part of an ancestry that was on-CPU (based on sample count). Wider box functions may be slower than narrow box functions, or, they may simply be called more often. The call count is not shown (or known via sampling).
  • The sample count can exceed elapsed time if multiple threads were running and sampled concurrently.
  • The colors aren't significant, and are picked at random to be warm colors. It's called "flame graph" as it's showing what is hot on-CPU. And, it's interactive: mouse over the SVGs to reveal details.
Previously Node.js flame graphs had only been used on systems with DTrace, using Dave Pacheco’s Node.js jstack() support. However, the Google v8 team has more recently added perf_events support to v8, which allows similar stack profiling of JavaScript symbols on Linux. Brendan has written instructions for how to use this new support, which arrived in Node.js version 0.11.13, to create Node.js flame graphs on Linux.



Here’s the original SVG of the flame graph. Immediately, we see incredibly high stacks in the application (y-axis). We also see we’re spending quite a lot of time in those stacks (x-axis). On closer inspection, it seems the stack frames are full of references to Express.js’s router.handle and router.handle.next functions. The Express.js source code reveals a couple of interesting tidbits [1].
  • Route handlers for all endpoints are stored in one global array.
  • Express.js recursively iterates through and invokes all handlers until it finds the right route handler.
A global array is not the ideal data structure for this use case. It’s unclear why Express.js chose not to use a constant-time data structure, like a map, to store its handlers. Each request requires an expensive O(n) lookup in the route array in order to find its route handler. Compounding matters, the array is traversed recursively. This explains why we saw such tall stacks in the flame graphs. Interestingly, Express.js even allows you to set many identical route handlers for a route. You can unwittingly set up a request chain like so.
[a, b, c, c, c, c, d, e, f, g, h]
Requests for route c would terminate at the first occurrence of the c handler (position 2 in the array). However, requests for d would only terminate at position 6 in the array, having needlessly spent time spinning through a, b and multiple instances of c. We verified this by running the following vanilla Express app.
var express = require('express');
var app = express();

app.get('/foo', function(req, res){
  res.send('hi');
});

// add a second foo route handler
app.get('/foo', function(req, res){
  res.send('hi2');
});

console.log('stack', app._router.stack);

app.listen(3000);
Running this Express.js app returns these route handlers.
stack [ { keys: [], regexp: /^\/?(?=/|$)/i, handle: [Function: query] },
  { keys: [],
    regexp: /^\/?(?=/|$)/i,
    handle: [Function: expressInit] },
  { keys: [],
    regexp: /^\/foo\/?$/i,
    handle: [Function],
    route: { path: '/foo', stack: [Object], methods: [Object] } },
  { keys: [],
    regexp: /^\/foo\/?$/i,
    handle: [Function],
    route: { path: '/foo', stack: [Object], methods: [Object] } } ]
Notice there are two identical route handlers for /foo. It would have been nice for Express.js to throw an error whenever there’s more than one route handler chain for a route.

At this point the leading hypothesis was that the handler array was increasing in size with time, thus leading to the increase of latencies as each handler is invoked. Most likely we were leaking handlers somewhere in our code, possibly due to the duplicate handler issue. We added additional logging which periodically dumps out the route handler array, and noticed the array was growing by 10 elements every hour. These handlers happened to be identical to each other, mirroring the example from above.
[...
  { handle: [Function: serveStatic],
    name: 'serveStatic',
    params: undefined,
    path: undefined,
    keys: [],
    regexp: { /^\/?(?=\/|$)/i fast_slash: true },
    route: undefined },
  { handle: [Function: serveStatic],
    name: 'serveStatic',
    params: undefined,
    path: undefined,
    keys: [],
    regexp: { /^\/?(?=\/|$)/i fast_slash: true },
    route: undefined },
  { handle: [Function: serveStatic],
    name: 'serveStatic',
    params: undefined,
    path: undefined,
    keys: [],
    regexp: { /^\/?(?=\/|$)/i fast_slash: true },
    route: undefined },
  ...
]
Something was adding the same Express.js-provided static route handler 10 times an hour. Further benchmarking revealed that merely iterating through each of these handler instances cost about 1 ms of CPU time. This correlates with the latency problem we saw: with 10 handlers added every hour, our response latencies increased by 10 ms every hour.

This turned out to be caused by a periodic (10/hour) function in our code whose main purpose was to refresh our route handlers from an external source. It was implemented by deleting old handlers and adding new ones to the array. Unfortunately, it was also inadvertently adding a static route handler with the same path each time it ran. Since Express.js allows multiple route handlers for identical paths, these duplicate handlers were all added to the array. Making matters worse, they were added before the rest of the API handlers, which meant they all had to be invoked before we could service any request to our service.

This fully explains why our request latencies were increasing by 10ms every hour. Indeed, when we fixed our code so that it stopped adding duplicate route handlers, our latency and CPU usage increases went away.

Here we see our latencies drop down to 1 ms and remain there after we deployed our fix.

When the Smoke Cleared

What did we learn from this harrowing experience? First, we need to fully understand our dependencies before putting them into production. We made incorrect assumptions about the Express.js API without digging further into its code base. As a result, our misuse of the Express.js API was the ultimate root cause of our performance issue.

Second, given a performance problem, observability is of the utmost importance. Flame graphs gave us tremendous insight into where our app was spending most of its time on CPU. I can’t imagine how we would have solved this problem without being able to sample Node.js stacks and visualize them with flame graphs.

In our bid to improve observability even further, we are migrating to Restify, which will give us much better insights, visibility, and control of our applications [2]. This is beyond the scope of this article, so look out for future articles on how we’re leveraging Node.js at Netflix.

Interested in helping us solve problems like this? The Website UI team is hiring engineers to work on our Node.js stack.

Author: Yunong Xiao @yunongx

Footnotes
[1] Specifically, this snippet of code. Notice that next() is invoked recursively to iterate through the global route handler array named stack.
[2] Restify provides many mechanisms to get visibility into your application, from DTrace support to integration with the node-bunyan logging framework.
