What began as an investigation into a very stubborn intermittent error turned into a write-up on how to squeeze the most out of your Kubernetes deployment.
Among the many features of Kubernetes, perhaps one of the most useful (and innovative) is its use of iptables to perform the grunt work of network plumbing.
If you haven’t already, I highly recommend watching Michael Rubin and Tim Hockin discuss the ins and outs of Kubernetes networking. It’s a great overview of the k8s networking layer with a lot of examples and step-by-step explanation.
First, a little background. My company offers a product that enhances e-commerce search by running a store’s online catalog and query stream through a natural language engine to improve recall and precision for end users.
The product is exposed as a two-part API: queries and listings. A store’s product catalog must first be indexed through our API, which analyzes existing data fields like title and description, adding structured data according to the product type: sizes and colors for fashion, freshness for grocery items, etc. These items get indexed back into a search engine on the customer side. End-user queries are then analyzed in real time with the same engine and translated into search engine-speak to return matching items from the indexed catalog.
The takeaway here is high throughput for the listings API, while the goal for the queries API is low latency.
Over the last few months, we’ve put a lot of work into bringing our infrastructure up to date by preparing a migration from compute instances to an orchestrated containerized platform.
You can probably guess which one.
I’m going to focus on the listings API in this post. Perhaps I’ll go into the queries API in the future, but for now that’ll remain out of scope.
The classic deployment looks like this: A fixed number of worker processes (let’s say 4) are distributed among a set of autoscaling instances (for example, 10). These instances sit behind an HAProxy load balancer, which is itself stationed behind a cloud load balancer to handle SSL termination.
All except for one of the worker instances are switched off when there are no indexing requests. HAProxy configuration is dynamically updated with the number of workers online through Consul template, so it knows how many connections it can accept before returning errors.
Most of the time, HAProxy is idling along, ready to accept a maximum of 4 connections (4 workers per node x 1 online node) when requests start blasting in at full speed (40 concurrent).
At this point HAProxy returns status 429 to the client until the rest of the instances have time to scale up. Scaling is handled by a custom script that reads the 429 rate from the HAProxy stats endpoint and determines how many instances need to be brought online to handle the demand. This process ideally takes 2-3 minutes, after which the 429 rate drops to zero.
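The sizing step of such a script can be sketched roughly as follows. This is a hypothetical reconstruction rather than our actual script: the function name, its parameters, and the assumption that demand scales with accepted plus rejected request rates are all made up for illustration.

```python
import math


def instances_needed(accepted_rps, rejected_rps, online_instances,
                     workers_per_instance=4, max_instances=10):
    """Estimate how many instances should be online (hypothetical heuristic).

    Assumes the online workers are saturated whenever requests are being
    rejected with 429, so demand scales with (accepted + rejected) / accepted.
    """
    if rejected_rps <= 0 or accepted_rps <= 0:
        return online_instances  # no 429s (or no signal): leave the pool alone
    current_workers = online_instances * workers_per_instance
    demand_ratio = (accepted_rps + rejected_rps) / accepted_rps
    wanted_workers = math.ceil(current_workers * demand_ratio)
    wanted_instances = math.ceil(wanted_workers / workers_per_instance)
    # Never scale down here, and never exceed the size of the instance group.
    return min(max(wanted_instances, online_instances), max_instances)
```

With one node online, 4 requests per second accepted and 36 rejected, this asks for all 10 instances, which matches the 4-versus-40 connection scenario described above.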
While re-implementing the logic in Kubernetes, we did away with the custom autoscaling stuff, relying instead on GKE’s autoscaling node pool plus a HorizontalPodAutoscaler to scale the deployment as needed.
For testing, I settled on an intermediate-sized cluster of 6 nodes, 4 vCPUs each, for a total of 24 workers, one process per CPU core. No multi-threading here.
At first, I tried round-robin load balancing using the virtual IP created by a ClusterIP service, leaving out the HAProxy bits altogether. This worked, even error-free, only it was about two-thirds slower than the classic deployment. So that wasn’t going to fly.
Switching to the use of externalTrafficPolicy set to either Local or Cluster made things even worse:
|LB Type||Response Time (ms), mean||Requests / sec, mean|
My best guess is that because iptables operates in kernel space, there’s a lot of context-switching going on, causing a huge slowdown in processing speed. Looking at top on one of the indexing nodes revealed that a good chunk of CPU time was spent idling, reinforcing my suspicion:
%Cpu(s): 49.1 us, 2.8 sy, 0.0 ni, 47.9 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
My next attempt involved deploying an ingress controller to handle load balancing at the application layer.
Layer 7 – Traefik
First, I wrote an Ingress object to be picked up by the Traefik controller. Then, per the docs, I had to add some annotations to my app’s service definition to limit the maximum number of connections:
    ...
    metadata:
      annotations:
        traefik.ingress.kubernetes.io/max-conn-amount: "24"
        traefik.ingress.kubernetes.io/max-conn-extractor-func: client.ip
    ...
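As I understand these two annotations, the extractor function picks the key that connections are counted against (the client IP here), and the amount is the cap per key. The idea can be sketched as a small keyed limiter. This is a generic illustration, not Traefik’s code; the class and method names are hypothetical.

```python
from collections import defaultdict


class KeyedConnLimiter:
    """Caps concurrent connections per key, e.g. per client IP."""

    def __init__(self, max_conn):
        self.max_conn = max_conn
        self.active = defaultdict(int)  # key -> connections currently open

    def try_acquire(self, key):
        """Count a new connection for `key`, or refuse if the key is at its cap."""
        if self.active[key] >= self.max_conn:
            return False  # a proxy would typically answer 429 at this point
        self.active[key] += 1
        return True

    def release(self, key):
        """Call once the connection for `key` has finished."""
        if self.active[key] > 0:
            self.active[key] -= 1
```

Since a load test from a single machine presents a single client IP, a cap of 24 per key behaves like a cap of 24 overall in that setting.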
Hammering it with ab with the same command-line options as before:
    $ ab -t300 -c24 -k -r -l -p listing.json -T 'application/json' -m post \
        http://haproxy-controller.kube-system/listings
A quick note about the command-line flags used here, for those of you who don’t feel like pulling up the man page:
-t300: run for 5 minutes
-c24: 24 concurrent connections
-k: add an HTTP keep-alive header. ab uses HTTP/1.0, which defaults to no keepalive.
-r: don’t exit on socket receive errors
-l: accept variable-length responses. Our app returns a request ID which is randomly generated and not always the same length.
-p ...: POST a file, and
-T 'application/json': send it as JSON
And here’s where the intermittent error begins to appear.
While load testing the listings API with Traefik, I encountered a consistent error rate of about 0.8 – 1% from the client side. The client would occasionally receive a non-2xx response, usually a 502, from the load balancer (hello, blog name) at seemingly random intervals. I tested with our internal indexing client, with ab, and with single requests fired from a shell while loop (yes, really), all showing the same results. The classic deployment didn’t do that.
When you’re aiming for an SLA of five nines, that is a no-go.
    Concurrency Level:      24
    Time taken for tests:   300.004 seconds
    Complete requests:      14217
    Failed requests:        0
    Non-2xx responses:      67
    Keep-Alive requests:    14217
    Total transferred:      16672816 bytes
    Total body sent:        31315959
    HTML transferred:       12557455 bytes
    Requests per second:    47.39 [#/sec] (mean)
    Time per request:       506.442 [ms] (mean)
    Time per request:       21.102 [ms] (mean, across all concurrent requests)
    Transfer rate:          54.27 [Kbytes/sec] received
                            101.94 kb/s sent
                            156.21 kb/s total
Okay. On one hand, we’re getting a ~30% increase in throughput by balancing with an ingress controller instead of iptables. So that’s good. But on the other hand, we’re getting errors. Not a whole lot of them, but still not satisfactory for our SLA with our customers.
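To put the numbers from this particular run next to a five-nines target, here is a small back-of-the-envelope helper. The function name and its parameters are mine, purely for illustration; the inputs are read straight off the ab report.

```python
def summarize_ab(complete, non_2xx, concurrency, requests_per_sec, sla=0.99999):
    """Derive error rate, SLA error budget and mean latency from ab's summary."""
    error_rate = non_2xx / complete
    # How many failed requests the SLA would tolerate over this many requests.
    error_budget = complete * (1 - sla)
    # With a fixed number of in-flight requests, latency = concurrency / throughput.
    mean_latency_ms = concurrency / requests_per_sec * 1000
    return error_rate, error_budget, mean_latency_ms


error_rate, error_budget, mean_latency_ms = summarize_ab(
    complete=14217, non_2xx=67, concurrency=24, requests_per_sec=47.39)
print(f"error rate: {error_rate:.2%}")
print(f"errors allowed at five nines: {error_budget:.2f}")
print(f"mean time per request: {mean_latency_ms:.0f} ms")
```

For this run the error budget comes out well below a single request, so even a handful of 502s in five minutes is far too many.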
What’s going on here?
I did some research – mostly googling around, but also examining a tcpdump of a load testing session, which confirmed that the non-2xx responses above were indeed mostly 502s – and found an open bug report in the Traefik project on GitHub that seems to match my experience, with a lot of suggested workarounds.
The one that eventually did the trick, as counter-intuitive as it may seem, was to disable keepalives in the Traefik configuration.
Since the author of the comment on the bug report that saved the day already linked to the relevant library used by Traefik, I perused some more and found this tidbit:
    // By default, Transport caches connections for future re-use.
    // This may leave many open connections when accessing many hosts.
    // This behavior can be managed using Transport's CloseIdleConnections method
    // and the MaxIdleConnsPerHost and DisableKeepAlives fields.
Apparently, connection caching does not play nice with uWSGI, which we also happen to be using as our application server.
Testing again after setting MaxIdleConnsPerHost = -1 in the Traefik config yielded similar performance, but with zero errors. Great success!
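To make concrete what disabling keepalives changes, here is a minimal sketch using only the Python standard library. It is not Traefik or uWSGI; the throwaway local server and the helper names are stand-ins I made up. It simply counts how many TCP connections a client opens with and without connection reuse.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allow persistent connections
    connections = 0                # TCP connections accepted so far
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def connections_used(requests, keepalive):
    """Send `requests` GETs to a local stand-in server and count TCP connections."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    conn = None
    try:
        for _ in range(requests):
            if conn is None:
                conn = http.client.HTTPConnection(host, port, timeout=5)
            conn.request("GET", "/")
            conn.getresponse().read()
            if not keepalive:
                conn.close()  # drop the connection instead of caching it
                conn = None
        if conn is not None:
            conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections
```

Calling `connections_used(3, keepalive=True)` reuses one connection for all three requests, while `keepalive=False` opens a fresh one each time. That costs an extra TCP handshake per request, but the proxy never tries to reuse a connection the backend may have already dropped.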
Let’s see if we can push things even further.
Layer 7 – HAProxy
Our old friend. HAProxy is one of the fastest (if not the fastest), most customizable load balancing proxy servers out there. It always showed excellent results in our classic deployments, barely moving the needle on resource consumption, even under full load, so I wanted to try it out reincarnated as a Kubernetes ingress controller.
All that needed to be done was to switch the ingress class from Traefik to HAProxy in our ingress resource and set the connection limit (docs):
    ...
    metadata:
      annotations:
    -   kubernetes.io/ingress.class: "traefik"
    +   kubernetes.io/ingress.class: "haproxy"
    +   ingress.kubernetes.io/maxconn-server: "4"
    ...
Note that Traefik gives you the ability to set the maximum connections per backend, whereas HAProxy allows you to set the maximum connections per server – an important distinction whose effect we’ll see later.
How does HAProxy compare to Traefik?
|LB Type||Response Time (ms), mean||Requests / sec, mean|
HAProxy is serving requests almost 30% faster than Traefik here, and ~50% faster than iptables-based load balancing. Nice.
In top, we see full resource utilization:
%Cpu(s): 99.4 us, 0.4 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
Taking a look at kubectl top nodes:
    NAME                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    gke-indexing-bd8adade-5fh5   3994m        101%   3650Mi          29%
    gke-indexing-bd8adade-81m6   3991m        101%   3651Mi          29%
    gke-indexing-bd8adade-cd3h   3995m        101%   3565Mi          28%
    gke-indexing-bd8adade-g0lj   3994m        101%   3612Mi          29%
    gke-indexing-bd8adade-nwkp   3992m        101%   3611Mi          29%
    gke-indexing-bd8adade-zbdw   3996m        101%   3502Mi          28%
That is efficient.
How much difference does that maxconn-server annotation really make?
Well, without it, we get numbers almost identical to Traefik:
|LB type||Response Time (ms), mean||Requests / sec, mean|
Without proper connection limits, the load balancer is blindly doing round-robin without taking into account how many connections it has already opened to the backend. This causes the request queue to fill up on one server while another server might actually be able to process that request, leading to inefficiency.
This can be more clearly demonstrated by observing kubectl top nodes with connection limiting disabled:
    NAME                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    gke-indexing-bd8adade-01dm   3224m        82%    3411Mi          27%
    gke-indexing-bd8adade-4l1d   2799m        71%    2895Mi          23%
    gke-indexing-bd8adade-93l1   3746m        95%    3471Mi          27%
    gke-indexing-bd8adade-b52p   3092m        78%    3462Mi          27%
    gke-indexing-bd8adade-g0lj   3986m        101%   4729Mi          38%
    gke-indexing-bd8adade-rf9c   3464m        88%    3412Mi          27%
Quite a difference from before.
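The effect can be reproduced with a toy model. The sketch below is a deliberate simplification and not a model of HAProxy internals: all requests arrive at once, the durations are made up, and the function names are hypothetical. It compares pinning each request to a server by blind round-robin against holding requests in the balancer until any server has a free slot, which is roughly what a per-server connection limit achieves.

```python
import heapq


def makespan_round_robin(durations, servers, workers_per_server):
    """Blind round-robin: request i is pinned to server i % servers and then
    waits for one of that server's own workers, however busy it is."""
    # One min-heap per server holding the time at which each worker frees up.
    free_at = [[0.0] * workers_per_server for _ in range(servers)]
    finish = 0.0
    for i, duration in enumerate(durations):
        heap = free_at[i % servers]
        end = heapq.heappop(heap) + duration
        heapq.heappush(heap, end)
        finish = max(finish, end)
    return finish


def makespan_with_server_limit(durations, servers, maxconn):
    """Per-server limit: requests queue at the balancer and are handed to
    whichever server slot frees up first."""
    free_at = [0.0] * (servers * maxconn)  # every slot across all servers
    finish = 0.0
    for duration in durations:
        end = heapq.heappop(free_at) + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Hypothetical workload: every sixth request is ten times slower than the rest,
# so blind round-robin keeps sending the slow ones to the same server.
durations = [10 if i % 6 == 0 else 1 for i in range(48)]
print(makespan_round_robin(durations, servers=6, workers_per_server=4))
print(makespan_with_server_limit(durations, servers=6, maxconn=4))
```

In this contrived workload the round-robin variant takes nearly twice as long to drain the same 48 requests, because one server queues all the slow ones while the other five sit idle.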
From the viewpoint of HAProxy:
Why does one server get 10 concurrent sessions while others get 2 or 3?
Much, much better.
I really like Traefik, and I think that v2.0 looks especially promising, but I didn’t have a chance to check it out for this post. I did play a little with HAProxy 2.0 though, and was able to achieve similar results.
There’s also Google’s recently GA’d container-native load balancing, which I have yet to play around with. As a general rule we try to keep our infrastructure cloud-agnostic to avoid vendor lock-in, but I may update this post if / when I get around to trying it out.
Unless Traefik 2.0 has HAProxy’s ability to limit connections per server, I don’t think we’ll be able to consider it a viable contender to dethrone HAProxy, the king of light-speed load balancing.
Maybe we’ll see such a feature added in the future.
Here’s to hoping 🙂