Weird Network Problems in XCP-NG

I have no idea what is going on. I’m getting consistently inconsistent throughput and latency issues. This is a post to collect my findings so far… Not very interesting to anyone else but me.

Ping via PIFs:

  • Between physical machines: very good (<200µs)
  • Between a virtual and a physical machine: OK (<350µs)
  • Between virtual machines on different hosts: also OK (<500µs)

No problems there.

Pings between VMs, using VIFs:

  • Using the Broadcom NIC, pings start at >1.5ms and settle at 600µs, with lots of fluctuation
  • Using the Solarflare 10G NIC, pings start at >1.5ms and settle at 500µs, with some fluctuation

Not terrible, but not amazing.
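
For anyone who wants to reproduce the latency numbers, a plain ping run along these lines is enough (the addresses and interface names below are placeholders, not my actual layout):

```
# 100 pings at 0.2s intervals; the min/avg/max/mdev summary at the end
# gives comparable figures.
ping -c 100 -i 0.2 10.0.0.20

# Same run pinned to a specific source interface, to be sure the traffic
# really goes via the PIF/VIF you think it does.
ping -c 100 -i 0.2 -I eth0 10.0.0.20
```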

Throughput tested with iperf3:

  • More than I need between all hosts/VMs/combinations of the two – usually around 3-6Gbit/s depending on the config. A bit random though, and it fluctuates wildly between VMs. It’s not the ultimate throughput that concerns me, but the inconsistency.
  • LOTS OF RETRANSMISSIONS BETWEEN VMs! This is very concerning (see the iperf3 sketch below).
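
For reference, a vanilla iperf3 run like this is enough to see the retransmission counts (the address is a placeholder):

```
# On the receiving VM: start an iperf3 server.
iperf3 -s

# On the sending VM: a 30-second TCP test. The "Retr" column in the
# client output is the per-interval retransmission count.
iperf3 -c 10.0.0.30 -t 30

# Same pair, reverse direction, without swapping client and server.
iperf3 -c 10.0.0.30 -t 30 -R
```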

Throughput tested with OpenSpeedTest (with SSL):

  • Server running in docker on a different physical host (non-virtualised), tested with both built-in SSL and SSL offloaded to a separate HAProxy instance on a separate server: both 1Gbit/s plus.
  • Server running in docker on the same host (virtualised), with both built-in SSL and SSL offloaded to a separate HAProxy instance on the SAME host: 550Mbit/s max.

Throughput tested with OpenSpeedTest (without SSL):

  • Server running in docker on different physical host (non virtualised): 1Gbit/s plus.
  • Server running in docker on same physical host (virtualised): 1Gbit/s plus.

The casual observer would say that my CPUs aren’t fast enough to handle SSL encryption at those speeds. But not only are my CPUs not under high load during testing, when the speed test was running on the separate server with SSL handled by HAProxy on that separate host, the speeds were fine. CPU is not the bottleneck here.
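
(If you want to check the same thing yourself, watching the usual tools during a test run is enough – something along these lines, assuming a stock XCP-NG dom0:)

```
# On the XCP-NG host (dom0): per-domain CPU usage while a test runs.
xentop

# Inside the VM and on the bare-metal docker host: ordinary CPU monitoring.
top
```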

THIS DON’T MAKE NO DAMN SENSE BOI.

Consider that with HAProxy running as a container on the same docker host as the HTTP applications, throughput is much closer to gigabit – up around 700Mbit/s+. Further consider that when I was running docker swarm, accessing services through the overlay network (so traffic had to be routed through exactly ONE other host, the same as running HAProxy as a separate VM) also slowed HTTP traffic to around 550Mbit/s – the same as the OpenSpeedTest on the same host!

Now we’re getting somewhere. Observe that the outlier in the OpenSpeedTests (550Mbit/s) was, incidentally, the ONLY test where traffic was routed through an additional VM INSIDE the same XCP-NG host.

But iperf3 says there’s more than enough bandwidth! So what gives? Now consider that between VMs, iperf3 records high retransmission counts (anywhere between 50 and 300), indicating packet loss. But ping doesn’t record any packet loss? BUT iperf3 uses TCP and ping uses ICMP. This could point to checksum problems which only affect TCP. Aha!
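
Before chasing that too far, the kernel’s own counters are worth a look – they should show retransmits climbing during an iperf3 run, and possibly bad/checksum-failed segments too. A rough sketch of where to look (the interface name is a placeholder):

```
# Kernel-wide TCP counters: retransmitted and bad/checksum-failed segments.
netstat -s | grep -iE "retrans|bad segment|csum"

# Per-connection retransmit counters while an iperf3 test is running.
ss -ti

# NIC-level stats on the PIF; many drivers expose drop/error/checksum
# counters here (the exact counter names vary by driver).
ethtool -S eth0
```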

My new hypothesis is as follows: each hop over the internal XCP-NG network loses some packets – the more VMs the traffic has to pass through, the more packet loss there is. In the case of TCP, this wouldn’t manifest as errors because lost packets are automatically retransmitted, so the server and client have to keep retransmitting, increasing overhead on both ends and reducing overall speed. This makes further sense when you add to the mix that I’m also using a virtualised firewall on the same internal network – yet another network hop for packet loss. To test this I propose:

  • Adding more VMs for the traffic to pass through. If the problem scales with the number of VMs, I’ll know this is the cause. This doesn’t help with a solution though.
  • Throwing an Intel PRO/1000 VT NIC in the server and using it as the PIF for my internal networks. This may highlight whether checksum offloading on both the Solarflare and Broadcom NICs is broken. However, this NIC is very old and has known problems in some cases. If it works though, I would have found the problem and solved it at the same time. Big win.

I shall try the latter and report back.
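
While I’m at it, it’s easy to compare what each NIC claims to offload once the Intel card is in – a quick sketch from dom0 (the interface names are placeholders for the Broadcom, Solarflare and Intel ports):

```
# List the offload features each physical NIC advertises, side by side.
for nic in eth0 eth1 eth2; do
    echo "== $nic =="
    ethtool -k "$nic" | grep -E "checksum|segmentation|scatter"
done
```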

Further notes:

  • I’ve tried with TX checksumming disabled in Xen Orchestra. It doesn’t make any difference. This makes sense though if the following point is true.
  • This Reddit post states that the Broadcom NICs in Dell servers have issues offloading virtual network processing to the physical adapter. Although this is in reference to VMware, I expect XCP-NG does the same. It could be bad luck that both my Solarflare 10G NIC and the onboard Broadcom NICs have this offloading problem.
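
Related to that first point: as far as I know, XCP-NG (like XenServer before it) also lets you pin ethtool offload settings per VIF or PIF via other-config keys, which is a more thorough way to rule offloading in or out than the single Xen Orchestra toggle. A sketch with placeholder UUIDs – worth double-checking the exact keys against the XCP-NG docs:

```
# Find the VIF (or PIF) UUIDs first.
xe vif-list

# Disable TX/RX checksum offload plus TSO/GSO on a VIF, then replug it
# so the settings take effect. The same keys work on PIFs.
xe vif-param-set uuid=<vif-uuid> other-config:ethtool-tx="off"
xe vif-param-set uuid=<vif-uuid> other-config:ethtool-rx="off"
xe vif-param-set uuid=<vif-uuid> other-config:ethtool-tso="off"
xe vif-param-set uuid=<vif-uuid> other-config:ethtool-gso="off"
xe vif-unplug uuid=<vif-uuid>
xe vif-plug uuid=<vif-uuid>
```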
