HAProxy 2.0 introduced layer 7 retries, which provide resilience against unreachable nodes, network latency, slow servers, and HTTP errors.
HAProxy powers the uptime of organizations with even the largest traffic demands by giving them the flexibility and confidence to deliver websites and applications with high availability, performance, and security at any scale and in any environment. As the world's fastest and most widely used software load balancer, HAProxy counts ruggedness among its essential qualities.
When HAProxy receives a request but can't establish a TCP connection to the selected backend server, it automatically tries again after an interval set by timeout connect. This behavior has been baked in from the beginning, and it smooths out short-lived network flakiness and brief downtime caused by server restarts.
You can further customize this by setting a retries directive in a backend to the desired number of attempts; it defaults to three. Also, if you add option redispatch to the backend, HAProxy tries another server instead of repeatedly trying the same one.
Now with HAProxy 2.0, you aren't limited to retrying based on a failed connection only. The new retry-on directive lets you list other kinds of failures that will trigger a retry, covering both Layer 4 and Layer 7 events. For example, if messages time out after the connection has been established, whether due to network latency or because the web server is slow to respond, retry-on tells HAProxy to trigger a retry.
You can think of it like this:

retries: how many times to try
retry-on: which events trigger a retry
option redispatch: whether to try with a different server
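Putting these together, here's a minimal backend sketch showing the three directives working side by side; the backend name, server names, and addresses are placeholders:

backend be_app
  balance roundrobin
  retries 3                              # how many times to try (3 is the default)
  option redispatch                      # allow a retry to go to a different server
  retry-on conn-failure empty-response   # which events trigger a retry
  server app1 192.168.1.11:80 maxconn 10
  server app2 192.168.1.12:80 maxconn 10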
While I was learning about the new retry-on feature, it got me thinking about novel ways of adapting to failure. In particular, I began looking at Chaos Engineering and how purposefully injecting faults can guide you to make a better, stronger system. In this blog post, I'll share what I learned about testing for failure and how retry-on is a powerful tool when it comes to building system resilience.
Injecting Faults
Ultimately, you want to build systems that can adapt to unusual conditions and that keep on humming even in the face of failed components and turbulence. A resilient system can bounce back in the face of adversity. What sort of adversity? Adrian Cockcroft, VP of Cloud Architecture at AWS, gave the keynote presentation at Chaos Conf 2018, in which he lists possible faults that may happen within your infrastructure or software. Here are just a few of the potential disasters related to infrastructure alone:
Device failures (disk, power supply, cabling…)
CPU failures (cache corruption, logic bugs…)
Datacenter failures (power, connectivity, cooling, fire…)
Internet failures (DNS, ISP, routing…)
There's plenty that can go wrong, but oftentimes we avoid purposefully trying to break our systems in order to find and fix weaknesses. The result is that we don't adequately test whether the mitigations we've put in place actually work. Are the load balancer settings you've configured optimal for reducing outages?
Acting out real-world failure modes is the best way to test whether your system is resilient. What actually happens when you start killing web server nodes? What is the effect of inducing latency in the network? If a server returns HTTP errors, will downstream clients be affected and, if so, to what degree?
I found that by applying some techniques from Chaos Engineering and intentionally injecting faults into the system, I began to see exactly how I should tune HAProxy. For example, I saw how best to set various timeouts and which events should trigger a retry.
Creating Chaos is Getting Easier
Maybe it's due to the maturation of Chaos Engineering, which reflects the maturation of our collective incident management knowledge, but the tooling available for creating chaos in your infrastructure is getting better and better.
Gremlin allows you to unleash mayhem such as killing off nodes, simulating overloaded CPU and memory, and filling up disk space.
Pumba is a command-line tool that lets you simulate bad network conditions such as latency, corrupted data, and packet loss.
Muxy can be used to alter the responses from your web servers, such as to return HTTP errors.
Unreachable nodes
First, let’s take a look at killing off a node that’s serving as a backend server in HAProxy. If I use Gremlin or the Docker CLI to stop one of the web servers, then HAProxy will fail to connect to that node. This assumes that HAProxy has not already removed it from the load-balancing rotation during its regular health checks. For testing, I disabled health checking in order to allow HAProxy to attempt to connect to a down server.
Gremlin can be run as a Docker container, giving it access to other containers in the network. Then, you can use its UI to kill off nodes. For my experiment, I ran a group of web servers in Docker containers alongside Gremlin.
Suppose your backend looked like this:
backend be_servers
  balance roundrobin
  server s1 server1:80 maxconn 10
  server s2 server2:80 maxconn 10
  server s3 server3:80 maxconn 10
HAProxy adds an implicit retries directive, so it will automatically retry a failed connection three times. You can also set retries explicitly to the desired number of attempts. After killing a node and trying to connect to it, the HAProxy log entry looked like this:
fe_main be_servers/s1 0/0/-1/-1/12006 503 216 - - sC-- 1/1/0/0/3 0/0 "GET / HTTP/1.1"
This shows that there were three retries to the same offline server, as indicated by the last number in the 1/1/0/0/3 section. Ultimately, it ended in a 503 Service Unavailable response. Notice that the termination code is sC, meaning that there was a timeout while waiting to connect to the server.
HAProxy offers an abundance of information in its logs. Read our blog post: Introduction to HAProxy Logging.
To handle this scenario, you should add an option redispatch directive so that instead of retrying with the same server, HAProxy tries a different one.
backend be_servers
  balance roundrobin
  option redispatch
  server s1 server1:80 maxconn 10
  server s2 server2:80 maxconn 10
  server s3 server3:80 maxconn 10
Then, if HAProxy tries to connect to a down server and hits the timeout specified by timeout connect, it will try again with a different server. You'll see the successful attempt in the logs with a +1 as the last number in the 1/1/0/0/+1 section, indicating that there was a redispatch to the s2 server.
fe_main be_servers/s2 0/3000/0/6/3006 200 228 - - ---- 1/1/0/0/+1 0/0 "GET / HTTP/1.1"
HAProxy keeps trying servers until they've all been tried, up to the retries number. So, for this type of Layer 4 disconnection, you don't need retry-on.
There is another scenario: the connection was established fine, but then the server disconnected while HAProxy was waiting for the response. You can test this by injecting a delay using Muxy (discussed later in this article) and then killing Muxy before the response is sent. With the current configuration, which includes option redispatch, this type of failure causes the client to receive a 502 Bad Gateway response. The HAProxy logs show an SH termination code, indicating that the server aborted the connection midway through the communication:
fe_main be_servers/s1 0/0/0/-1/1757 502 229 - - SH-- 1/1/0/0/0 0/0 "GET / HTTP/1.1"
Here is where retry-on comes into play. You would add a retry policy of empty-response to guard against this:
backend be_servers
  balance roundrobin
  option redispatch
  retry-on empty-response
  server s1 server1:80 maxconn 10
  server s2 server2:80 maxconn 10
  server s3 server3:80 maxconn 10
Now, if HAProxy successfully connects to the server, but the server then aborts, the request will be retried with a different server.
Network latency
Another failure mode to test is latency in the network. Using Pumba, you can inject a delay for all responses coming from one or more of your web servers running in Docker containers. For my experiment, I added a five-second delay to the first web server and no delay for the others. The command looks like this:
$ sudo pumba netem --tc-image gaiadocker/iproute2 --duration 5m delay --time 5000 server1
First, note that the defaults section of my HAProxy configuration looked like this:
defaults
  log global
  mode http
  option httplog
  timeout connect 3s
  timeout client 5s
  timeout server 5s
  timeout queue 30s
Here, timeout connect, which is the time allowed for establishing a connection to a server, is set to three seconds. My hypothesis for this experiment was that the HTTP request would be delayed and hit the timeout server limit. What actually happened was that the connection timeout struck first, giving me an sC termination code in the HAProxy logs, which means that there was a server-side timeout while waiting for the connection to be made.
From this, I learned that generic network latency affects all aspects of a request, Layer 4 through Layer 7. In other words, the HTTP messages did not have a chance to time out because even establishing a connection was timing out first. It sounds obvious, but until I tested it, I was only focused on Layer 7.
Retrying when there is a connection timeout is covered by adding a conn-failure retry policy. You can append it, like this:
backend be_servers
  balance roundrobin
  option redispatch
  retry-on empty-response conn-failure
  server s1 server1:80 maxconn 10
  server s2 server2:80 maxconn 10
  server s3 server3:80 maxconn 10
If you don't set retry-on at all, then conn-failure is on by default. However, since we've set it in order to include empty-response, we need to include that retry policy explicitly as well. So, whether the server is completely down or just slow to connect, it's counted as a connection timeout. Also note: how quickly HAProxy will retry depends on your timeout settings.
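For example, here is a sketch of a defaults section tuned so that retries kick in sooner; the values are illustrative assumptions rather than recommendations and should be matched to your servers' real connect and response times:

defaults
  mode http
  timeout connect 1s   # give up on a slow or unreachable server sooner, so the retry starts sooner
  timeout server 5s    # maximum time to wait for the server's response
  timeout client 5s
  retries 3            # number of attempts before returning an error to the client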
Slow servers
To learn what would happen if latency affected only the HTTP messages and not the initial connection, I moved on to using a different tool named Muxy. Muxy is a proxy that can change a request or response as it passes through. You can run it in a Docker container so that it has access to muck with messages from other containers in the network, hence its name. Use it to add a delay to one of the web server’s responses so that the connection is established fine, but the application appears sluggish. The following Muxy configuration injects a five-second delay for responses coming from server1:
proxy:
  - name: http_proxy
    config:
      host: 0.0.0.0        # muxy IP
      port: 81             # muxy port
      proxy_host: server1  # proxied IP/host
      proxy_port: 80       # proxied port

middleware:
  - name: delay
    config:
      response_delay: 5000
You’ll need to point HAProxy to Muxy instead of the actual backend server:
backend be_servers
  balance roundrobin
  option redispatch
  retry-on empty-response conn-failure
  server s1 muxy:81 maxconn 10
  server s2 server2:80 maxconn 10
  server s3 server3:80 maxconn 10
This causes a different type of timeout in HAProxy, one that's triggered when timeout server strikes. The client receives a 504 Gateway Timeout response, as shown in the HAProxy logs:
fe_main be_servers/s1 0/0/1/-1/5001 504 218 - - sH-- 1/1/0/0/0 0/0 "GET / HTTP/1.1"
Add the response-timeout retry policy to cover this scenario:
retry-on empty-response conn-failure response-timeout
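For reference, the full backend from this experiment, with the new policy appended and s1 still pointing at Muxy, now reads:

backend be_servers
  balance roundrobin
  option redispatch
  retry-on empty-response conn-failure response-timeout
  server s1 muxy:81 maxconn 10
  server s2 server2:80 maxconn 10
  server s3 server3:80 maxconn 10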
HTTP errors
Suppose that there was no latency, but that the server returned an HTTP error. You can deal with this type of chaos too. It is important that you know how your application behaves with Layer 7 retries enabled; caution must be exercised when retrying requests that change state, such as POSTs. Be sure to read the next section regarding retrying POST requests!
To change the returned status to 500 Server Error before it reaches HAProxy, use a Muxy configuration like this:
proxy:
  - name: http_proxy
    config:
      host: 0.0.0.0
      port: 81
      proxy_host: server1
      proxy_port: 80

middleware:
  - name: http_tamperer
    config:
      response:
        status: 500
You’ll need to update your retry policy to look for certain HTTP response status codes. In the following example, a retry happens if there’s a timeout, a connection failure, or an HTTP 500 status returned:
retry-on empty-response conn-failure response-timeout 500
With this in place, the HAProxy log shows that the request was routed away from the failing server, s1, to the healthy server, s2:
fe_main be_servers/s2 0/16/0/16/32 200 224 - - ---- 1/1/0/0/+1 0/0 "GET / HTTP/1.1"
Recall that it is actually option redispatch that tells HAProxy to try a different server. The retry-on directive only configures when a retry should happen. If you wanted to keep trying the same server, you'd remove option redispatch.
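In that case, a sketch of the backend would simply drop that line, and every retry would go back to the originally selected server:

backend be_servers
  balance roundrobin
  retry-on empty-response conn-failure response-timeout 500   # same retry policy as above, no redispatch
  server s1 server1:80 maxconn 10
  server s2 server2:80 maxconn 10
  server s3 server3:80 maxconn 10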
You can keep appending more retry policies to mitigate different types of failure modes. Or, you can use the all-inclusive option, all-retryable-errors:
retry-on all-retryable-errors
It’s the same as if you’d specified all of the following parameters:
retry-on conn-failure empty-response junk-response response-timeout 0rtt-rejected 500 502 503 504
You will find all of the available options in the HAProxy documentation.
Beware of POSTs
Retrying requests that fetch data is often safe enough, although be sure to test! The application may have unknown side effects that make it unsafe to retry. However, it's almost never safe to retry a request that writes data to a database, since you may be inserting duplicate data. For that reason, you'll often want to add a rule that disables retries for POST requests. Use the http-request disable-l7-retry directive, like this:
backend be_servers
  balance roundrobin
  option redispatch 1
  retry-on all-retryable-errors
  retries 3
  http-request disable-l7-retry if METH_POST
  server s1 server1:80 maxconn 10
  server s2 server2:80 maxconn 10
  server s3 server3:80 maxconn 10
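If your application also writes data through other HTTP methods, the same pattern extends using HAProxy's other predefined method ACLs. Whether PUT or DELETE is actually safe to retry depends entirely on your application, so treat this as a sketch to adapt:

http-request disable-l7-retry if METH_POST || METH_PUT || METH_DELETE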
Conclusion
In this blog post, you learned about the retry-on directive that was added in HAProxy 2.0 and complements the existing retries and option redispatch features. It is a versatile feature that lets you mitigate various types of failures by specifying the events that should trigger a retry. However, you never know how things will truly work until you inject some faults into the system and see how it responds. By using Chaos Engineering techniques, I was able to verify that this directive adds resilience against unreachable nodes, network latency, slow servers, and HTTP errors.