A while ago I did a little writeup on high-end Varnish tuning, where I noted that I made our single core 2.2GHz Opteron reach 27k requests/second. This raised the question of how well Varnish scales with hardware. So I went ahead and tried to overload our quad-core Xeon at 2.4GHz. It would obviously take some extra fire power: at the very least, four times as much as the last batch of tests.
Hardware involved
Our main set of test servers for Varnish are called varnish1, varnish2, varnish3, varnish4, varnish6 and varnish7. These have mostly different software and hardware – which is done intentionally so we can perform tests under different circumstances. We routinely run tests against Varnish2 and Varnish4, which run CentOS and FreeBSD, respectively. For my last test, I used Varnish2 as the server and the remaining servers as test nodes. By any normal math, I would need about 4 times more fire power to overload a 2.4GHz Quad core, compared to a single core Opteron at 2.2GHz.
To sum it up as far as this round of tests go:
- Varnish1 – Single core Opteron
- Varnish2 – Single core Opteron at 2.2GHz (used in the last round of tests)
- Varnish3 – Single core Xeon (if I’m not much mistaken). It’s also the nginx server used as backend, but that just means 1 request every X minutes.
- Varnish4 – Single core Opteron (FreeBSD)
- Varnish6 – Dual core Xeon of some kind
- Varnish7 – Quad-core Xeon at 2.4GHz
So I needed more power. As it happens, we do a lot of training and we have three classrooms full of computers for students, and I borrowed two of these classrooms, adding the following to the mix:
- 10 x single core Pentium Celerons at 2.9x GHz
- 10 x Core 2 Duos at 2.4ish GHz
As you might notice – a large part of the challenge when you want to test Varnish is getting your test systems to keep up.
Basic test procedures
Same as last time, more or less: 1 byte pages and httperf. I’ve tried ab, siege and curl… And they simply do not offer the raw power of httperf combined with the control – if anyone cares to enlighten me on how to get the most out of them, then I’m more than willing to listen.
Ideally I wanted to test with 10 requests for each connection, and with mixed data set size. As it turns out, I ended up using 100 requests per connection and bursting all of the requests, which is far from realistic. More on this later.
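For reference, here is a rough sketch of what a bursting httperf invocation of this kind could look like, built as a Python argument list so it is easy to replicate across nodes. Only the 100 requests per connection and the bursting come from the tests described in this post. The target host, the URI "/", the connection count, the connection rate and the helper name httperf_command are all placeholders chosen for illustration, not the values actually used.

```python
import shlex


def httperf_command(server, uri="/", port=80, num_conns=1000,
                    calls_per_conn=100, conn_rate=100):
    """Build an httperf argv that sends every request of a connection in one burst.

    All defaults except calls_per_conn=100 are illustrative assumptions.
    """
    return [
        "httperf", "--hog",
        "--server", server,
        "--port", str(port),
        "--uri", uri,
        "--num-conns", str(num_conns),            # total connections to open
        "--num-calls", str(calls_per_conn),       # requests per connection
        "--burst-length", str(calls_per_conn),    # send them all in one burst
        "--rate", str(conn_rate),                 # new connections per second
    ]


# Print a copy-and-paste friendly command line instead of running it.
print(shlex.join(httperf_command("varnish7")))
```

Printing the command rather than executing it keeps the sketch harmless; passing the list to subprocess.run would actually start the load.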
I have an intricate script system for the nightly tests, but that’s a story for another time. For these tests I simply used clusterssh to replicate my input on 37ish shells. This has allowed me to instantly test identical setups on all the nodes, and to quickly review what their status is. I probably ran a thousand or more different variants of the same test this time around.
I’ve used varnishstat to monitor the request rate and other relevant stats, and top to monitor general load.
The backend I use is hosted on varnish3, which runs nginx and a simple rewrite to ‘current.txt’, which for this occasion was linked to a 1-byte file.
Results
Varnish uses a lot of threads, and as such, when it does finally saturate the CPU, the load average will skyrocket. On the last test, Varnish2 had a load of 600-700. During this load, Varnish2 would take 10-15 seconds to start ‘top’.
During this round of tests I had roughly 87GHz worth of clients, spread over 25 physical computers. All of the test systems were running at full load. Varnish7 had a load average around 45. Logging in and starting top was close to instant. And Varnish was serving 143k requests per second.
Based on the load and general snappiness, I think it is safe to conclude that while Varnish was close to the breaking point, it hadn’t actually reached it. To put it simply: My clients were not fast enough. Before I told httperf to burst 100 requests for each connection, Varnish was serving 110-120k requests per second with a load less than 1.0, and the clients were still using all their fire power. I ended up stress testing my clients. Dammit.
However, as I came fairly close to the breaking point, I still believe there are a few interesting things to look at.
The scaling nature of Varnish
It’s very rare that you can see an application scale so well just by throwing cpu power and cpu cores at it. Varnish essentially didn’t get affected at all by the extra work needed to synchronize work on 4 cpu cores. In fact, if you look at the math, the raw performance on 4 cpu cores was actually BETTER than on one cpu core, when you look at it on a cycle-by-cycle basis.
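The cycle-by-cycle comparison can be checked with some back-of-the-envelope arithmetic. The sketch below uses only the numbers quoted in this post (27k requests/second on one 2.2GHz core, 143k on four 2.4GHz cores). Treating a GHz on the Opteron as equivalent to a GHz on the Xeon is a simplifying assumption, and the helper name requests_per_ghz is made up for illustration.

```python
def requests_per_ghz(requests_per_second, cores, ghz_per_core):
    """Requests per second divided by the total clock rate across all cores."""
    return requests_per_second / (cores * ghz_per_core)


single = requests_per_ghz(27_000, cores=1, ghz_per_core=2.2)
quad = requests_per_ghz(143_000, cores=4, ghz_per_core=2.4)

print(f"single core: {single:.0f} req/s per GHz")  # 12273
print(f"quad core:   {quad:.0f} req/s per GHz")    # 14896
print(f"ratio:       {quad / single:.2f}")         # 1.21
```

Keep in mind that the quad-core figure was measured while the clients, not Varnish, were the bottleneck, so the ratio is a lower bound rather than a precise measurement.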
I think it’s reasonably safe to say that when it comes to raw performance, we’ve nailed it with Varnish.
In fact, scaling Varnish is far more difficult when you increase your active data set beyond physical memory. Or when you introduce latency. Or when you have a low cache hit rate. Or any number of other corner cases. There will always be bottlenecks.
What you can learn from this is actually simple: Do not focus on the CPU when you want to scale your Varnish setup. I know it’s tempting to buy the biggest baddest server around for a high-traffic site, but if your active data set can fit within physical memory and you have a 64-bit CPU, Varnish will thrive. And for the record: All CPU-usage graphs I’ve seen from Varnish installations confirm this. Most of the time, those sexy CPUs are just sitting idle regardless of traffic.
Myths and further research
Since I didn’t reach the breaking point, there’s not much I can say conclusively. However, I can repeat a few points.
Adjusting the lru_interval had little impact regardless of data set and access patterns. If I repeat this often enough, perhaps I’ll stop seeing new installations with an lru_interval of 3600: DO NOT SET lru_interval TO 3600. There. I didn’t even add the usual “unless you know what you are doing” part. I might’ve explained it before, but the problem is that it leaves you with a really badly sorted LRU list that will cause Bad Things once you need to lru-nuke something. Possibly really really bad things. Like throwing out the 200 most popular objects on your site at the same time.
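A toy model may help show why a large lru_interval leaves the list badly sorted. The sketch below is not Varnish’s implementation. It assumes only the basic idea that an object is moved to the fresh end of the LRU list on a hit if it has not been moved within the last lru_interval seconds. The class ToyLRU, the function simulate and the traffic pattern (a handful of popular objects hit every second, plus cold objects fetched once) are invented for illustration.

```python
from collections import OrderedDict


class ToyLRU:
    """Toy LRU list; the first entry is the next candidate for nuking."""

    def __init__(self, lru_interval):
        self.lru_interval = lru_interval
        self.last_moved = OrderedDict()  # object name -> time of last move

    def insert(self, name, now):
        # New objects start at the fresh end of the list.
        self.last_moved[name] = now

    def hit(self, name, now):
        # Only re-sort the object if it has not been moved recently.
        if now - self.last_moved[name] >= self.lru_interval:
            self.last_moved[name] = now
            self.last_moved.move_to_end(name)

    def nuke(self, count):
        # Evict from the stale end of the list.
        return [self.last_moved.popitem(last=False)[0] for _ in range(count)]


def simulate(lru_interval, seconds=600, popular=5, cold=50):
    """Return the first `popular` objects that would be nuked after the run."""
    lru = ToyLRU(lru_interval)
    for i in range(popular):
        lru.insert(f"popular-{i}", now=0)
    for now in range(1, seconds + 1):
        for i in range(popular):           # popular objects are hit every second
            lru.hit(f"popular-{i}", now)
        if now % 10 == 0 and now // 10 <= cold:
            lru.insert(f"cold-{now // 10}", now)  # fetched once, never hit again
    return lru.nuke(popular)


print("lru_interval=2:   ", simulate(2))
print("lru_interval=3600:", simulate(3600))
```

In this model the short interval nukes only cold objects, while the 3600 second interval nukes every popular object first, because none of them was ever moved after being inserted.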
And the size of your VCL has little impact on the performance. I have not tested this extensively, but I’ve never registered a difference, and since your cpu will be idle most of the time anyway, you should NOT worry about CPU cycles in VCL.
Another important detail is that your shmlog shouldn’t trigger disk activity. On my setup, it didn’t sync to disk to begin with, but you may want to stick it on a tmpfs just to be sure. I suspect this has improved throughout the 2.0-series of Varnish, but it’s cheap insurance. Typically the shmlog is found in /usr/var/varnish, /usr/local/var/varnish or similar (“ls -l /proc/*/fd | grep _.vsl” is the lazy way to find it).
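If you would rather check than guess, a small script can locate the shmlog and report which filesystem it lives on. This is a minimal sketch for Linux only, assuming the shmlog is an open file named _.vsl and that /proc/mounts is readable. The function names are made up for illustration, and you will typically need root to read other processes’ file descriptors.

```python
import glob
import os


def find_shmlog_paths():
    """Return paths of open files named _.vsl, found via /proc/<pid>/fd symlinks."""
    found = set()
    for fd in glob.glob("/proc/[0-9]*/fd/*"):
        try:
            target = os.readlink(fd)
        except OSError:
            continue  # process went away, or we lack permission
        if os.path.basename(target) == "_.vsl":
            found.add(target)
    return sorted(found)


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path."""
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same mount point win.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def report():
    """Print each shmlog path with its filesystem type (tmpfs is what you want)."""
    if not os.path.exists("/proc/mounts"):
        return
    with open("/proc/mounts") as f:
        mounts = f.read()
    for path in find_shmlog_paths():
        print(path, fs_type_for(path, mounts))


if __name__ == "__main__":
    report()
```

If the reported type is anything other than tmpfs, mounting a tmpfs on the directory that holds the shmlog before starting Varnish is one way to keep it off the disk.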
I tried several different settings of thread pools, acceptors, listen depth, shmlog parameters, rush exponent and such, but none of it revealed much – most likely because I never pressured Varnish enough. This will be what I want to investigate further. But it should tell you something about how far you have to go before these obscure settings start to matter.
Feedback wanted
I figure this must be some sort of record, but I’m interested in what sort of numbers others have seen or are seeing. Has anyone even come close to the numbers above – synthetic or otherwise – from a single server? Regardless of software or hardware? This is not meant as a challenge or boast, but I’m genuinely curious about what sort of traffic people are able to push. I’m interested in more “normal” request rates too – I’m a sucker for numbers. What are you seeing on your site? Have you had scaling issues?

We need graphs to show requests per second per core as the core count goes up!
Comment by Nicolai Langfeldt — January 14, 2010 @ 07:32
What was the maximum number of concurrent connections the Varnish server held open at a given time? If Varnish uses a thread per connection it might be getting a degradation in performance as the number of concurrent connections increases.
Also, does the number of requests per second change if the clients access different sets of 1-byte files?
Graphing latency would also be cool. I don’t know if httperf can do that.
Comment by Henrik Nordvik — January 15, 2010 @ 07:37
@henrik: I can’t really tell off the top of my head, but at 100 requests per connection, the connection rate would be one hundredth of the request rate. Since I wasn’t able to starve Varnish, I wasn’t really in a position to test the connection rates properly. The clients had too great of an overhead in setting up connections.
As far as graphs go – I want it. Munin would obviously be both too slow/low detail and too intrusive (Starting a new process on Linux when it’s overloaded is a pain). The administration console does have the resolution I want, so I might set that up. I’ll have to see if I can’t find a reasonable way to export/store the graphs from the admin console and/or put marks on it (ie: “Now I’m trying THIS command”). Ideally I’d want a tool that can do it automatically too, so I wouldn’t have to manually mark several hundred variants of the same commands…. I may just end up making my own.
As far as number of objects go: I got a better performance when I used multiple 1-byte objects. I didn’t see a huge difference between 1000 objects and 1000 000 objects, but from 1 to 1000 was noticeable (a few thousand requests per second). I suspect this is Varnish performing better, but I can’t be sure that it’s not just httperf being more efficient until I’ve starved Varnish.
Oh, and I haven’t looked closely at the data from httperf. Since it’s running on so many different nodes, it’s hard to aggregate the data, though possible. It’s easier to just use Varnish for this, but ideally I’d like to look at both the data Varnish gives me and httperf. But I suspect that will take a good bit of time to set up.
Comment by kristian — January 15, 2010 @ 12:30
Maybe you could share your sysctl setup for varnish. This would be useful information.
Comment by nm — March 11, 2010 @ 04:41
Also, I do not see where/how to set shmlog on tmpfs. Are you saying to create a tmpfs and soft link the _.vsl to that?
Comment by nm — March 11, 2010 @ 05:03