HAProxy Layer 7 Retries & Chaos Engineering

HAProxy 2.0 introduced Layer 7 retries, which provide resilience against unreachable nodes, network latency, slow servers, and HTTP errors.

HAProxy powers the uptime of organizations with even the largest traffic demands by giving them the flexibility and confidence to deliver websites and applications with high availability, performance, and security at any scale and in any environment. As the world’s fastest and most widely used software load balancer, ruggedness is one of its essential qualities.

When HAProxy receives a request, but can’t establish a TCP connection to the selected backend server, it automatically tries again after an interval set by timeout connect. This behavior has been baked in from the beginning. This smooths out short-lived network flakiness and brief downtime caused by server restarts.

You can further customize this by setting a retries directive in a backend to the desired number of attempts. It defaults to three. Also, if you add option redispatch to the backend, HAProxy tries another server instead of repeatedly trying the same one.

Now with HAProxy 2.0, you aren’t limited to retrying based on a failed connection only. The new retry-on directive lets you list other kinds of failures that will trigger a retry and covers both Layer 4 and Layer 7 events. For example, if messages time out after the connection has been established due to network latency or because the web server is slow to respond, retry-on tells HAProxy to trigger a retry.

You can think of it like this:

  • retries says how many times to try

  • retry-on says which events trigger a retry

  • option redispatch says whether to try with a different server
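
Putting those three directives together, a backend might look something like this. It's only a sketch to show how the pieces fit; the server names and retry events are illustrative, and the rest of this post builds up a configuration like it step by step:

backend be_servers
balance roundrobin
retries 3
option redispatch
retry-on conn-failure response-timeout
server s1 server1:80 maxconn 10
server s2 server2:80 maxconn 10
server s3 server3:80 maxconn 10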

While I was learning about the new retry-on feature, I got to thinking about novel ways of adapting to failure. In particular, I began looking at Chaos Engineering and how purposefully injecting faults can guide you to make a better, stronger system. In this blog post, I’ll share what I learned about testing for failure and how retry-on is a powerful tool when it comes to building system resilience.

Injecting Faults

Ultimately, you want to build systems that can adapt to unusual conditions and that keep on humming even in the face of failed components and turbulence. A resilient system can bounce back in the face of adversity. What sort of adversity? Adrian Cockroft, VP of Cloud Architecture at AWS, gave the keynote presentation at Chaos Conf 2018. He lists possible faults that may happen within your infrastructure or software. Here are just a few of the potential disasters related to infrastructure alone:

  • Device failures (disk, power supply, cabling…)

  • CPU failures (cache corruption, logic bugs…)

  • Datacenter failures (power, connectivity, cooling, fire…)

  • Internet failures (DNS, ISP, routing…)

There’s plenty that can go wrong, but oftentimes we avoid purposefully trying to break our systems in order to find and fix weaknesses. The result is that we don’t adequately test whether the mitigations we’ve put in place actually work. Are the load balancer settings you’ve configured optimal for reducing outages?

Acting out real-world failure modes is the best way to test whether your system is resilient. What actually happens when you start killing web server nodes? What is the effect of inducing latency in the network? If a server returns HTTP errors, will downstream clients be affected and, if so, to what degree?

I found that by applying some techniques from Chaos Engineering, by intentionally injecting faults into the system, I began to see exactly how I should tune HAProxy. For example, I saw how best to set various timeouts and which events I should set to trigger a retry.

Creating Chaos is Getting Easier

Maybe it’s due to the maturation of Chaos Engineering, which in turn reflects the maturation of our collective incident management knowledge, but the tooling available for creating chaos in your infrastructure is getting better and better.

Gremlin allows you to unleash mayhem such as killing off nodes, simulating overloaded CPU and memory, and filling up disk space.

Pumba is a command-line tool that lets you simulate bad network conditions such as latency, corrupted data, and packet loss.

Muxy can be used to alter the responses from your web servers, such as to return HTTP errors.

Unreachable nodes

First, let’s take a look at killing off a node that’s serving as a backend server in HAProxy. If I use Gremlin or the Docker CLI to stop one of the web servers, then HAProxy will fail to connect to that node. This assumes that HAProxy has not already removed it from the load-balancing rotation during its regular health checks. For testing, I disabled health checking in order to allow HAProxy to attempt to connect to a down server.
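
Using the Docker CLI, stopping the container is enough to simulate this kind of failure. Assuming the web server container is named server1, matching the backend shown below, the command is simply:

$ docker stop server1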

Gremlin can be run as a Docker container, giving it access to other containers in the network. Then, you can use its UI to kill off nodes. For my experiment, I ran a group of web servers in Docker containers alongside Gremlin.
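
If you’d like to reproduce a similar lab, a minimal Docker Compose file might look something like the sketch below. The images and service names are my own choices here, and any simple HTTP server will do; Gremlin itself runs as a separate container, configured according to its own documentation.

version: "3"
services:
  haproxy:
    image: haproxy:2.0
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    ports:
      - "80:80"
  server1:
    image: nginx:alpine
  server2:
    image: nginx:alpine
  server3:
    image: nginx:alpine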

[Image: The Gremlin UI]

Suppose your backend looked like this:

backend be_servers
balance roundrobin
server s1 server1:80 maxconn 10
server s2 server2:80 maxconn 10
server s3 server3:80 maxconn 10

HAProxy applies an implicit retries directive, so it will automatically retry a failed connection three times. You can also set retries explicitly to the number of desired attempts. After I killed a node and HAProxy tried to connect to it, the log entry looked like this:

fe_main be_servers/s1 0/0/-1/-1/12006 503 216 - - sC-- 1/1/0/0/3 0/0 "GET / HTTP/1.1"

This shows that there were three retries to the same offline server, as indicated by the last number in section 1/1/0/0/3. Ultimately, it ended in a 503 Service Unavailable response. Notice that the termination code is sC, meaning that there was a timeout while waiting to connect to the server.

Did you know?

HAProxy offers an abundance of information in its logs. Read our blog post: Introduction to HAProxy Logging.

To handle this scenario, you should add an option redispatch directive so that instead of retrying with the same server, HAProxy tries a different one.

backend be_servers
balance roundrobin
option redispatch
server s1 server1:80 maxconn 10
server s2 server2:80 maxconn 10
server s3 server3:80 maxconn 10

Then, if HAProxy tries to connect to a down server and hits the timeout specified by timeout connect, it will try again with a different server. You’ll see the successful attempt in the logs with a +1 as the last number in the 1/1/0/0/+1 section, indicating that there was a redispatch to the s2 server.

fe_main be_servers/s2 0/3000/0/6/3006 200 228 - - ---- 1/1/0/0/+1 0/0 "GET / HTTP/1.1"

HAProxy keeps trying other servers, up to the number of attempts set by retries. So, for this type of Layer 4 disconnection, you don’t need retry-on.

There is another scenario: The connection was established fine, but then the server disconnected while HAProxy was waiting for the response. You can test this by injecting a delay using Muxy (discussed later in this article) and then killing Muxy before the response is sent. With the current configuration, which includes option redispatch, this type of failure causes the client to receive a 502 Bad Gateway response. The HAProxy logs show an SH termination code, indicating that the server aborted the connection midway through the communication:

fe_main be_servers/s1 0/0/0/-1/1757 502 229 - - SH-- 1/1/0/0/0 0/0 "GET / HTTP/1.1"

Here is where retry-on comes into play. You would add a retry policy of empty-response to guard against this:

backend be_servers
balance roundrobin
option redispatch
retry-on empty-response
server s1 server1:80 maxconn 10
server s2 server2:80 maxconn 10
server s3 server3:80 maxconn 10

Now, if HAProxy successfully connects to the server, but the server then aborts, the request will be retried with a different server.

Network latency

Another failure mode to test is latency in the network. Using Pumba, you can inject a delay for all responses coming from one or more of your web servers running in Docker containers. For my experiment, I added a five-second delay to the first web server and no delay for the others. The command looks like this:

$ sudo pumba netem --tc-image gaiadocker/iproute2 --duration 5m delay --time 5000 server1

First, note that the defaults section of my HAProxy configuration looked like this:

defaults
log global
mode http
option httplog
timeout connect 3s
timeout client 5s
timeout server 5s
timeout queue 30s

Here, timeout connect, which is the time allowed for establishing a connection to a server, is set to three seconds. My hypothesis for this experiment was that the HTTP request would be delayed and hit the timeout server limit. What actually happened was the connection timeout struck first, giving me an sC termination code in the HAProxy logs, which means that there was a server-side timeout while waiting for the connection to be made.

From this, I learned that generic network latency affects all aspects of a request, Layer 4 through Layer 7. In other words, the HTTP messages did not have a chance to time out because even establishing a connection was timing out first. It sounds obvious, but until I tested it, I was only focused on Layer 7.

Retrying when there is a connection timeout is covered by adding a conn-failure retry policy. You can append it, like this:

backend be_servers
balance roundrobin
option redispatch
retry-on empty-response conn-failure
server s1 server1:80 maxconn 10
server s2 server2:80 maxconn 10
server s3 server3:80 maxconn 10

If you don’t set retry-on at all, then conn-failure is on by default. However, since we’ve set retry-on in order to add empty-response, we need to list conn-failure explicitly as well. So, whether the server is completely down or just slow to connect, it’s counted as a connection timeout. Also note: how quickly HAProxy will retry depends on your timeout settings.

Slow servers

To learn what would happen if latency affected only the HTTP messages and not the initial connection, I moved on to using a different tool named Muxy. Muxy is a proxy that can change a request or response as it passes through. You can run it in a Docker container so that it has access to muck with messages from other containers in the network. Use it to add a delay to one of the web server’s responses so that the connection is established fine, but the application appears sluggish. The following Muxy configuration injects a five-second delay for responses coming from server1:

proxy:
  - name: http_proxy
    config:
      host: 0.0.0.0 # muxy IP
      port: 81 # muxy port
      proxy_host: server1 # proxied IP/host
      proxy_port: 80 # proxied port

middleware:
  - name: delay
    config:
      response_delay: 5000

You’ll need to point HAProxy to Muxy instead of the actual backend server:

backend be_servers
balance roundrobin
option redispatch
retry-on empty-response conn-failure
server s1 muxy:81 maxconn 10
server s2 server2:80 maxconn 10
server s3 server3:80 maxconn 10

This causes a different type of timeout in HAProxy, one that’s triggered when timeout server strikes. The client receives a 504 Gateway Timeout response, as shown in the HAProxy logs:

fe_main be_servers/s1 0/0/1/-1/5001 504 218 - - sH-- 1/1/0/0/0 0/0 "GET / HTTP/1.1"

Add the response-timeout retry policy to cover this scenario:

retry-on empty-response conn-failure response-timeout

HTTP errors

Suppose that there was no latency, but that the server returned an HTTP error. You can deal with this type of chaos too. It’s important to know how your application behaves with Layer 7 retries enabled, and caution must be exercised when retrying requests such as POST requests. Be sure to read the next section regarding retrying POST requests!

To change the returned status to 500 Server Error before it reaches HAProxy, use a Muxy configuration like this:

proxy:
  - name: http_proxy
    config:
      host: 0.0.0.0
      port: 81
      proxy_host: server1
      proxy_port: 80

middleware:
  - name: http_tamperer
    config:
      response:
        status: 500

You’ll need to update your retry policy to look for certain HTTP response status codes. In the following example, a retry happens if there’s a timeout, a connection failure, or an HTTP 500 status returned:

retry-on empty-response conn-failure response-timeout 500

With this in place, the HAProxy log shows that the request was routed away from the failing server, s1, to the healthy server, s2:

fe_main be_servers/s2 0/16/0/16/32 200 224 - - ---- 1/1/0/0/+1 0/0 "GET / HTTP/1.1"

Recall that it is actually option redispatch that tells HAProxy to try a different server. The retry-on directive only configures when a retry should happen. If you wanted to keep trying the same server, you’d remove option redispatch.
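
To make that concrete, here’s the same backend with option redispatch removed. Retries would then stay on the originally selected server; this is shown only as an illustration, since redispatching to a healthy server is usually what you want:

backend be_servers
balance roundrobin
retry-on empty-response conn-failure response-timeout 500
server s1 muxy:81 maxconn 10
server s2 server2:80 maxconn 10
server s3 server3:80 maxconn 10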

You can keep appending more retry policies to mitigate different types of failure modes. Or, you can use the all-inclusive option, all-retryable-errors:

retry-on all-retryable-errors

It’s the same as if you’d specified all of the following parameters:

retry-on conn-failure empty-response junk-response response-timeout 0rtt-rejected 500 502 503 504

You will find all of the available options in the HAProxy documentation.

Beware of POSTs

Retrying requests that fetch data is often safe enough, although be sure to test! The application may have unknown side effects that make it unsafe to retry. However, it’s almost never safe to retry a request that writes data to a database, since you may be inserting duplicate data. For that reason, you’ll often want to add a rule that disables retries for POST requests. Use the http-request disable-l7-retry directive, like this:

backend be_servers
balance roundrobin
option redispatch 1
retry-on all-retryable-errors
retries 3
http-request disable-l7-retry if METH_POST
server s1 server1:80 maxconn 10
server s2 server2:80 maxconn 10
server s3 server3:80 maxconn 10

Conclusion

In this blog post, you learned about the retry-on directive that was added to HAProxy 2.0 and complements the existing retries and option redispatch features. This is a versatile feature that lets you mitigate various types of failures by specifying the events that should trigger a retry. However, you never know how things will truly work until you inject some faults into the system and see how it responds. By using Chaos Engineering techniques, I was able to verify that this directive adds resilience against unreachable nodes, network latency, slow servers, and HTTP errors.

If you enjoyed this post and want to stay up to date on similar content, subscribe to this blog! You can also follow us on Twitter and join the conversation on Slack.

Contact us to learn more about HAProxy Enterprise, which combines HAProxy, the world’s fastest and most widely used open-source load balancer and application delivery controller, with enterprise-class features, services, and premium support. You can also sign up for a free trial.
