HAProxy Load Balancing at Vimeo
In this presentation, Andrew Rodland describes the unique challenges of scaling Vimeo’s video hosting platform. Vimeo splits video files into chunks and caches them on separate servers, using HAProxy’s consistent-hash load balancing to distribute requests to those servers evenly, which allows it to scale horizontally. Andrew developed a new HAProxy option, hash-balance-factor, that fine-tunes consistent-hash load balancing by letting a request consider the current load on a server in addition to whether it has the needed video chunk cached. This bounded-load consistent hashing feature is essential to successfully delivering video at scale.
Thank you very much. I’m Andrew Rodland and this is a little bit about some of the different things that we’re doing with HAProxy at Vimeo, where I am a Principal Engineer on the Video Platform team, working on making things reliable, scalable, and production-ready, and on delivering video quickly to people all around the world.
It’s a pretty big amount of traffic because, well, video is big. We do 4K, we do 8K, so there’s a lot of bandwidth involved. We run on the cloud, primarily Google Cloud Platform. We got into that in 2015 and we shut down our last datacenter in 2018, so we’re pretty much entirely on the cloud at this point.
But the first one I want to talk about is Skyfire, which is our video packager, and something cool that HAProxy allowed us to do with it, using an algorithm called bounded-load consistent hashing.
But Vimeo predates that by quite a few years, so we have a big library of stuff that is stored as progressive MP4, which is basically what you would give to Flash Player back in the day, or what you would give to a regular video tag. It’s just a video file that plays straight through. If you want adaptive streaming, then what you want to do is use a standard like DASH or HLS that breaks the file up into chunks, and at any boundary between chunks the player can switch from a chunk of one quality to a chunk of another quality. So the video data is the same, but it’s in a different kind of container.
If we had really, really massive resources, we could take our existing library of video files, reprocess them into DASH and HLS, and just serve them as static files; but we don’t have those kinds of resources. So, we actually make that conversion on the fly, and it needs to be fast because it is in the loop of video playback. The player is requesting the next segment of video and it needs it now, because it only has so much video in the buffer, and nobody likes it when that little spinner goes buffering, buffering, buffering.
But in order to do that, we need to know: when the user requests chunk number 123 of a video, which bytes of the original file are we actually going to ship out, wrap in this particular packaging, and send to the user? Before we can do that, we have to look at the global headers of the video file and find out, for every little bit of audio and every little bit of video in the file, where it starts and ends, both in terms of time and in terms of bytes within the file. If we know that, then we can build a map and say, “Okay, I want chunk number 123. It’s going to be from this many seconds in until that many seconds in.” And so, we know exactly which byte ranges in the file correspond to that. We can load them up, do the container transformation, and send them to the user, and we can do that in under ten milliseconds if we have the index.
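To make the shape of that map concrete, here is a minimal sketch in Go of what such an index could look like. The type and field names are hypothetical, not Vimeo’s actual code:

```go
package main

import (
	"fmt"
	"time"
)

// ChunkEntry is a hypothetical index entry: for one chunk of a
// progressive MP4, the time span it covers and the byte range in the
// source file that holds its media data.
type ChunkEntry struct {
	Start, End time.Duration // time span the chunk covers
	ByteOff    int64         // offset of its first byte in the source file
	ByteLen    int64         // number of bytes to read
}

type VideoIndex struct {
	Chunks []ChunkEntry
}

// ByteRange answers "which bytes do I fetch for chunk n?" in O(1) once
// the index exists; building the index is the expensive part.
func (ix *VideoIndex) ByteRange(n int) (off, length int64, ok bool) {
	if n < 0 || n >= len(ix.Chunks) {
		return 0, 0, false
	}
	c := ix.Chunks[n]
	return c.ByteOff, c.ByteLen, true
}

func main() {
	ix := &VideoIndex{Chunks: []ChunkEntry{
		{Start: 0, End: 6 * time.Second, ByteOff: 48221, ByteLen: 1310720},
	}}
	if off, n, ok := ix.ByteRange(0); ok {
		fmt.Printf("chunk 0: read %d bytes at offset %d\n", n, off)
	}
}
```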
But actually computing that index is pretty expensive. We need to request a big chunk of the video file from storage, which comes with a latency penalty, and then we need to spend a lot of CPU computing, based on that, where all of these segments actually are. If we did that on every segment request, it would be too slow and too expensive. And again, we don’t want a massive batch-processing job over millions of videos to do this in advance. So, we do it on demand, but we cache it. That way, the first time it’s going to take a moment before the video starts playback, but all of the later requests will have the information already ready to go.
The cache lived in memory on the packaging servers, and we used HAProxy’s consistent-hash load balancing to send requests for the same video to the same server. With naive modulo hashing, say you have nine servers and you scale up to 10; that means that nine parts out of 10 of your content will be hashed to a different server. But with consistent hashing, if you have nine servers and you go up to 10, then only one part out of 10 moves. So you can scale up, scale down, or lose a server without massive interruption, and that means that the cache will stay mostly effective.
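Here is a bare-bones consistent-hash ring in Go, just to make that remapping argument concrete. It uses a single hash point per server for brevity; HAProxy’s real implementation is more sophisticated:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring: each server owns the arc of
// hash space that ends at its point.
type Ring struct {
	points  []uint32
	servers map[uint32]string
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(names []string) *Ring {
	r := &Ring{servers: map[uint32]string{}}
	for _, name := range names {
		p := hashOf(name)
		r.points = append(r.points, p)
		r.servers[p] = name
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup walks clockwise to the first server point at or after the key's hash.
func (r *Ring) Lookup(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.servers[r.points[i]]
}

func main() {
	nine := []string{"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9"}
	before := NewRing(nine)
	after := NewRing(append(nine, "s10"))
	moved := 0
	for i := 0; i < 10000; i++ {
		key := fmt.Sprintf("video-%d", i)
		if before.Lookup(key) != after.Lookup(key) {
			moved++
		}
	}
	// With modulo hashing, roughly 9,000 of these keys would move; here
	// only the keys that now land on s10 do. (With one point per server
	// the moved fraction varies a lot run to run, which is exactly the
	// variance that virtual nodes exist to smooth out.)
	fmt.Printf("%d of 10000 keys moved\n", moved)
}
```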
The problem is hot spotting, and hot spotting comes from two things. First, there’s randomness in the consistent hashing algorithm that naturally causes some servers to get a bigger portion of the hash space than others. There are ways to control that, but it’s always going to happen. The other is that not all content is equally popular; in fact, it’s very far from that. Some content is thousands of times as popular as other content that we’re serving. So, if you say that this piece of content belongs to this specific server, then one server might be in for a whole lot of load, and that caused overload on the servers: a lot of delays, a lot of latency shipping out the content, and buffering problems in video playback.
So, we weren’t able to stick with that route. What we did instead was, in addition to the in-memory cache on the machines, add a shared cache using Memcached, and switch the load balancing from consistent-hash-based to least-connections-based, which is another thing that HAProxy can do: send the request to the server that is doing the least work right now. All servers will be pretty evenly loaded and you don’t have to worry about hot spotting.
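In HAProxy configuration terms, that switch is essentially a one-line change; the backend name and addresses below are illustrative, not Vimeo’s:

```
backend skyfire
    # Send each request to the server with the fewest active connections.
    balance leastconn
    server pkg1 10.0.0.1:8080 check
    server pkg2 10.0.0.2:8080 check
```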
The problem is that the bandwidth to that shared cache becomes pretty massive because we’re no longer trying to send a request to a server that’s going to have that index in memory. So more likely than not, it’s going to have to get it from the shared cache. The more traffic we get, the more we scale up, the more servers we have, the less likely it is that the request is going to find its way to a server that happens to have the right data in cache. So, the bigger we get, the more traffic goes to that shared cache, which is potentially a huge bottleneck to scalability.
Already in 2016, we were pushing multiple gigabits out of this shared cache to feed the packaging process, and we could see that that was probably not going to be sustainable forever.
Then I came across a paper from Google Research, “Consistent Hashing with Bounded Loads,” that combines the two approaches. You send a request to a server based on the hash, one that you know is very likely to have the thing that you want in cache; but if that would make the server too busy compared to the average, the request goes somewhere else. And if it has to go somewhere else, that should be chosen consistently too, so that if you do have a very popular video, then your second choice will have it in cache, and maybe even your third choice, because they’re all sharing that load.
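A sketch of that selection rule in Go, with hypothetical types; HAProxy’s real implementation differs in detail:

```go
package main

import (
	"fmt"
	"math"
)

// Server is a hypothetical stand-in for a backend server and its
// currently observed load (e.g. active connections).
type Server struct {
	Name string
	Load int
}

// pick walks the servers in consistent-hash order for this request
// (first choice, second choice, and so on around the ring) and returns
// the first one whose load stays within factor x the average.
func pick(ranked []*Server, factor float64) *Server {
	total := 0
	for _, s := range ranked {
		total += s.Load
	}
	// Capacity per the bounded-load idea: ceil(factor * average load),
	// counting the request we are about to place.
	bound := math.Ceil(factor * float64(total+1) / float64(len(ranked)))
	for _, s := range ranked {
		if float64(s.Load+1) <= bound {
			return s // first choice that stays within the bound
		}
	}
	return ranked[0] // unreachable for factor >= 1; kept for safety
}

func main() {
	// The first choice is hot, so the request spills to the second.
	servers := []*Server{{"s3", 12}, {"s7", 4}, {"s1", 5}}
	fmt.Println(pick(servers, 1.25).Name) // prints "s7"
}
```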
This is good for a very specific kind of workload, which is not something that everybody has, but it’s something that we very definitely did have: a cluster of servers where any server is able to serve any request, but some of them can do it with less resource usage than others.
I read the abstract and it spoke directly to me.
I was like, “This is so simple even I can do it.” And we’re using HAProxy. So, I opened up the source, I took a look, and I figured, “Yeah, I think I can manage this.” Towards the end of 2016, I sent a patch to the HAProxy mailing list. With a little bit of back and forth, it was accepted; it got in just in time for HAProxy 1.7. So, it is available for everybody to use. If you’re using hash-type consistent in a configuration, all you have to do is add the config option hash-balance-factor, which is a number expressing how much difference in load you can tolerate between servers; if a request would create more than that much difference, it goes to a different server.
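Putting the two directives together, a configuration sketch might look like this. The backend name, hash key, and addresses are illustrative; 125 is the value we use, as mentioned later:

```
backend skyfire
    # Hash on the request URI so all requests for one video's index go
    # to the same server (the hash key here is illustrative).
    balance uri
    hash-type consistent
    # Tolerate at most 25% more than the average load on any one server;
    # beyond that, spill to the next server on the hash ring.
    hash-balance-factor 125
    server pkg1 10.0.0.1:8080 check
    server pkg2 10.0.0.2:8080 check
```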
We put that into production at Vimeo before it was even merged into master, and it did the job beautifully. It allowed most of the indexes to be served from the local cache without going to the shared cache, and there was no hot-spotting problem, no problem with the servers getting overloaded. Everything was just humming along nicely.
And then, at 10:26 and 10:27 there, we deployed just a change to the HAProxy config, and not only is the local cache percentage substantially better, it’s also very consistent. You can see that it’s no longer moving with our traffic; it’s no longer moving with our number of servers. And you can also see in the lower graph the actual bandwidth usage of the shared Memcached, which we pushed down by a factor of something like six; I can’t read it off of there right now.
Since then, our traffic has grown substantially, I believe by about a factor of three. The Memcached bandwidth has gone up by about 10%. So, that was the scalability win that we wanted. We’re no longer worried that Memcached bandwidth is going to saturate and bring down the whole thing.
Something that I forgot to put in my slides: in the next version, HAProxy got multithreading, and that was an additional win on top of this, because we had been running multi-processing, eight processes on eight-CPU servers. With multi-processing, the load tally for each server was separate in each process. So, if you think about it, a request comes in and HAProxy is trying to do this assignment based on the relative load of each server, but it’s only able to see one eighth of the picture. So, we switched from multi-processing to multithreading: eight processes became eight threads.
Now, each one is able to see the entire picture, and it actually produced slightly better assignment of requests to servers. It made requests more likely to go to the first choice, because there’s less variance that way. We were very happy with this, and we were really happy to get such a win and to be able to contribute it to a piece of open-source software that we rely on.
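In config terms, the change looks something like this; directive names are from the HAProxy 1.8-era releases:

```
global
    # Before: eight independent processes, each seeing only its own
    # eighth of the connection counts.
    # nbproc 8
    # After: one process with eight threads sharing a single load tally
    # per server.
    nbthread 8
```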
If hash-balance-factor in your HAProxy config is something reasonable, like the 125 that we use, which corresponds to epsilon equals one quarter in the paper’s notation, then the gains are not going to be nearly as large as the worst case that they want to highlight in their paper.
But even so, I estimated that we could get about a 20% decrease in load variance compared to the algorithm that’s currently in HAProxy, which, again, just means more even balancing without sacrificing cache locality, which means a potential for more performance, more requests per server. It hasn’t happened yet, because the way it selects the next server when the first choice is unavailable would require a little bit of refactoring in how HAProxy actually does hashing, and I haven’t had time to dig into that, because of all kinds of things. So, it’s not there yet, but it’s something that I’m excited to look into, to squeeze out a little bit more performance and upstream that as well.
Sometimes we have more traffic than you can handle through a single HAProxy fronting VM. So, what we do is put HAProxy behind Google’s TCP-mode or SSL-mode load balancer. We don’t let it run in the intelligent HTTP mode; we just have it either proxy TCP connections directly to an instance group, or TLS-terminate for us and then proxy the connection to an instance group.
That way we can have Google take the load of TLS termination off of the HAProxy VMs. It’s not as much of a problem as it used to be, but hey, they provide the TLS and the non-TLS service for the same price, so why not free up that CPU to be used for something else? Besides which, then we don’t have to get the secrets onto the VMs at all; we can just put them into the managed service on Google’s side. So, if we can do that, why not?
Our CDNs want to use keep-alive for very good reasons; we want to use keep-alive to keep latency down and everything else. We would really like to know when an instance is being put into that draining interval, when it is scheduled to go away, so that we could do something in HAProxy: put it into graceful mode so that it starts saying “Connection: close” to everybody who sends it a request, and then actually allow most of those connections to terminate. But we don’t get that notification. The only notification we get is the 60-second warning that says, “In one minute, I’m going to hard-terminate your VM whether you like it or not.”
At that point, we can send a signal to HAProxy, and it will start sending “Connection: close” to everybody and the connections will start to move away. But you can imagine that if you have a user on a slow connection doing a large file download, 60 seconds is not going to be enough time, and there are going to be some interruptions in the places where we’re using it. It hasn’t caused a major problem, but it is a bit of a caveat. What we did before this was basically just have a fixed-size pool of HAProxies and give the addresses of all of them to the CDN, without involving the load balancer. That is more stable in a way, but when we wanted to be scalable we came up with this solution.
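For illustration, here is one way a small watcher could act on that warning. This is a sketch, not what Vimeo runs: it assumes the GCE metadata server’s “preempted” key (the mechanism preemptible VMs use for their termination notice) and relies on SIGUSR1, HAProxy’s graceful soft-stop signal.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os/exec"
	"strings"
	"time"
)

func main() {
	// Assumed: the preemptible-VM termination notice; adapt to however
	// your environment actually delivers the 60-second warning.
	const url = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
	for {
		req, _ := http.NewRequest("GET", url, nil)
		req.Header.Set("Metadata-Flavor", "Google")
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if strings.TrimSpace(string(body)) == "TRUE" {
				// Soft stop: HAProxy stops accepting new connections and
				// begins winding down the existing ones.
				exec.Command("pkill", "-USR1", "haproxy").Run()
				log.Print("termination notice received; HAProxy draining")
				return
			}
		}
		time.Sleep(time.Second)
	}
}
```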
So, we have a Go service that is baked into our image and colocated with every instance of HAProxy, and it listens on localhost for UDP syslog lines. We tell every HTTP backend option httplog and to send to this log server, and it will parse the log line, extract all of the interesting details out of it, and generate a bunch of StatsD timer and counter metrics: latency, number of requests, requests broken down by HTTP status, and frontend and backend connection counts so that we can see if we’re getting near any limits. And we can break it down by all kinds of interesting things.
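A stripped-down sketch of that kind of sidecar in Go; the regex, ports, and metric names here are illustrative, not Vimeo’s actual service:

```go
package main

import (
	"log"
	"net"
	"regexp"
)

// HAProxy's httplog format includes the five timers Tq/Tw/Tc/Tr/Tt
// followed by the HTTP status; this pulls out total time and status.
var fields = regexp.MustCompile(`(-?\d+)/(-?\d+)/(-?\d+)/(-?\d+)/(-?\d+) (\d{3}) `)

func main() {
	// Listen on localhost for HAProxy's UDP syslog lines (port assumed).
	in, err := net.ListenPacket("udp", "127.0.0.1:5514")
	if err != nil {
		log.Fatal(err)
	}
	// Forward plain-text StatsD metrics to a local agent (port assumed).
	statsd, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	buf := make([]byte, 4096)
	for {
		n, _, err := in.ReadFrom(buf)
		if err != nil {
			continue
		}
		m := fields.FindSubmatch(buf[:n])
		if m == nil {
			continue
		}
		// e.g. "haproxy.request.status.200:1|c" and Tt as a timer.
		statsd.Write([]byte("haproxy.request.status." + string(m[6]) + ":1|c"))
		statsd.Write([]byte("haproxy.request.time:" + string(m[5]) + "|ms"))
	}
}
```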
processHealthCheck, and the next thing is an array of the arguments. We don’t need any arguments for processHealthCheck, so it’s an empty array, which takes one byte to encode. So altogether, this is something like 30 bytes.