Load testing knocks out network

Discussion:

Andy Young

2012-09-01 20:15:07 UTC

Last night one of our servers went offline while I was load testing it. When I
got to the datacenter to check on it, the server seemed perfectly fine.
Everything was running on it, and there were no panics or any other sign of a
hard crash. The only problem was that the network was unreachable. I couldn't
connect to the box even from a laptop directly attached to the ethernet
port. I couldn't connect to anything from the box either. It was as if the
network controller had seized up. I restarted netif and it didn't make a
difference. Rebooting the machine, however, solved the issue and everything
went back to working great. I restarted the load testing and reproduced the
problem twice more this morning, so at least it's repeatable. It feels like a
network controller / driver issue to me for a couple of reasons. First, the
problem affects the entire system. We're running FreeBSD 9 with about a
half dozen jails. Most of the jails are running Apache but the one I was
load testing was running Jetty. If it were my application code
crashing, I would expect the problem to at least be isolated to the jail
that hosts it. Instead, the entire machine and all jails in it lose access
to the network.

Apart from not being able to access the network, I don't see any other
signs of problems. This is the first major problem I've had to debug in
FreeBSD so I'm not a debugging expert by any means. There are no error
messages in /var/log/messages or dmesg apart from syslogd not being able to
reach the network. If anyone has ideas on where I can look for more
evidence of what is going wrong, I would really appreciate it.

We're running FreeBSD 9.0-RELEASE-p3. The network controller is an Intel(R)
PRO/1000 Network Connection version - 2.2.5 configured with 6 IPs using
aliases, five of which are used for jails.

Thank you for the help!!

Andy

Andy Young

2012-09-02 02:44:42 UTC


--
Andrew Young
Mosaic Storage Systems, Inc
http://www.mosaicarchive.com/

Follow us on:
Twitter <https://twitter.com/#!/MosaicArchive>,
Facebook<http://www.facebook.com/MosaicArchive>
, Google Plus<https://plus.google.com/b/102077382489657821832/https://plus.google.com/b/104681960235222388167/104681960235222388167/posts>
, Pinterest <http://pinterest.com/mosaicarchive/>

Andy Young

2012-09-03 04:14:52 UTC

Hi Ragnar,

Thank you for the reply. That makes a lot of sense. I think the resources
at risk had to do with the low level details of the network card. I
experimented tonight with bumping the hw.igb.rxd and hw.igb.txd tunable
parameters of the NIC driver to their max value of 4096 (the default was
256). This seems to have resolved the issue. Before bumping their values,
my load test was crashing the network at about 350 simultaneous
connections. The behavior I witnessed was the application server (jetty)
would seize up first and I could see it was no longer responding through my
ssh connections. If I killed it off right away, all of the connections got
closed and everything went back to being fine. If I left it in that state
for 30 seconds or so, the system became unrecoverable. The connections
remained open according to netstat even after I had closed the server and
client processes. Short of rebooting, nothing I did would close the
connections down. After bumping the rxd and txd parameters as well as
kern.ipc.nmbclusters, the problem seems to have gone away. I can now
successfully simulate over
800 simultaneous connections and it hasn't crashed since. To be honest, I
don't know what the rxd and txd parameters do but it seems to have helped.
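
For reference, the igb man page excerpt quoted in this thread says these tunables are set at the loader prompt or in loader.conf(5), with a minimum of 80 and a maximum of 4096. The post does not show how the values were actually applied, so the following is only a minimal sketch: `igb_loader_conf` is a hypothetical helper that checks the two descriptor counts against those limits and renders the corresponding loader.conf lines.

```python
# Limits taken from the igb man page excerpt quoted in this thread.
IGB_DESC_MIN = 80
IGB_DESC_MAX = 4096


def igb_loader_conf(rxd: int, txd: int) -> str:
    """Render loader.conf lines for the igb descriptor tunables.

    Hypothetical helper for illustration; it only formats text and does
    not touch the running system.
    """
    for name, value in (("hw.igb.rxd", rxd), ("hw.igb.txd", txd)):
        if not IGB_DESC_MIN <= value <= IGB_DESC_MAX:
            raise ValueError(
                f"{name}={value} is outside {IGB_DESC_MIN}..{IGB_DESC_MAX}"
            )
    return f'hw.igb.rxd="{rxd}"\nhw.igb.txd="{txd}"\n'


if __name__ == "__main__":
    # The values reported above: both tunables raised from 256 to 4096.
    print(igb_loader_conf(4096, 4096), end="")
```

Because these are loader tunables, lines like the ones printed here would only take effect after a reboot.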

Andy

Post by Ragnar Lonn
Hi Andy,
I work for an online load testing service (loadimpact.com) and what we
see is that the most common cause when a server crashes during a load test,
is that it runs out of some vital system resource. Usually system memory,
but network connections (sockets/file descriptors) is also a likely cause.
You should have gotten some kind of error messages in the system log, but
if the problem is easily repeatable I would set up monitoring of at least
memory and file descriptors, and see if you are near the limits when the
machine freezes.
Regards,
/Ragnar

Post by Andy Young
I read through the driver man page, which is a great source of
information. I see I'm using the Intel igb driver and it supports three
tunables. Could I have exceeded the number of receive descriptors? What
would the effect of this number being too low be? What about the Adaptive
Interrupt Moderation?
To clarify, I was simulating about 800 users simultaneously uploading
files when the crash occurred.
Thanks for any help or insights!!
Andy

NAME
     igb -- Intel(R) PRO/1000 PCI Express Gigabit Ethernet adapter driver

LOADER TUNABLES
     Tunables can be set at the loader(8) prompt before booting the kernel or
     stored in loader.conf(5).

     hw.igb.rxd
             Number of receive descriptors allocated by the driver. The
             default value is 256. The minimum is 80, and the maximum is
             4096.

     hw.igb.txd
             Number of transmit descriptors allocated by the driver. The
             default value is 256. The minimum is 80, and the maximum is
             4096.

     hw.igb.enable_aim
             If set to 1, enable Adaptive Interrupt Moderation. The default
             is to enable Adaptive Interrupt Moderation.


Ragnar Lonn

2012-09-03 08:14:01 UTC

Hi Andy,

It sounds as if your problem is more related to the NIC driver then, I
guess.

I just realized something else that I forgot to mention regarding
crash/freeze causes: network buffers. It's an out-of-memory problem, but
a bit more specific. When you have a ton of open TCP connections to a
host, that host will allocate a lot of (kernel) memory for TCP
transmit/receive buffers. In newer *Linux* kernels, this memory is being
allocated in an adaptive manner - i.e. the kernel only allocates a small
amount of memory to each TCP buffer, and then increases it as necessary
(per connection, depending on transfer speed and network delay to the
other peer). Older kernels, however, will allocate a fixed amount per
socket, which can quickly eat up all available kernel memory.

I think I actually discussed this with FreeBSD developers a while ago
(on this list even?), and they told me the FreeBSD kernel can only
allocate max 2GB of kernel memory. I don't know if it allocates network
buffers dynamically (i.e. as much memory as is necessary for each
socket/connection) but 2GB is not a lot if each connection uses up e.g.
100K buffer memory. If you have e.g. 1GB available to network buffers,
it means a max limit of 10k simultaneous connections on a server,
regardless of how much memory it has.
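
The estimate above is simple division, and a short sketch makes the assumptions explicit. The figures are the illustrative ones from this post (1GB of buffer memory, 100K per connection), read here as decimal units; `max_connections` is a hypothetical helper name, not anything from FreeBSD.

```python
def max_connections(buffer_pool_bytes: int, per_connection_bytes: int) -> int:
    """Upper bound on connections if each one pins a fixed amount of buffer memory."""
    if per_connection_bytes <= 0:
        raise ValueError("per_connection_bytes must be positive")
    return buffer_pool_bytes // per_connection_bytes


if __name__ == "__main__":
    GB = 10**9  # decimal units, an assumption made to match the round numbers above
    KB = 10**3
    print(max_connections(1 * GB, 100 * KB))  # prints 10000
```

The bound only holds if every connection really holds its full buffer allocation at the same time, which is the fixed-allocation case described above.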

Regards,

/Ragnar


Peter Jeremy

2012-09-03 21:05:59 UTC

Post by Ragnar Lonn
transmit/receive buffers. In newer *Linux* kernels, this memory is being
allocated in an adaptive manner - i.e. the kernel only allocates a small
amount of memory to each TCP buffer, and then increases it as necessary
(per connection, depending on transfer speed and network delay to the
other peer).

FreeBSD does this as well, though I don't recall when this was added.

Post by Ragnar Lonn
I think I actually discussed this with FreeBSD developers a while ago
(on this list even?), and they told me the FreeBSD kernel can only
allocate max 2GB of kernel memory.

This is only true on 32-bit kernels. FreeBSD uses a single address
space so both kernel and userland need to fit into 4GB on 32-bit
systems. On 64-bit systems, KVM is less constrained (it's ~550GB on
my amd64). You can check sysctl's vm.kvm_free and vm.kvm_size for
exact figures.
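
A minimal sketch of reading those two sysctls from a script, assuming a FreeBSD host where `sysctl -n` prints the bare value in bytes. The helper names `read_sysctl` and `kvm_summary` are made up for this example.

```python
import subprocess


def read_sysctl(name: str) -> int:
    """Return an integer sysctl value by running `sysctl -n <name>`."""
    result = subprocess.run(
        ["sysctl", "-n", name], check=True, capture_output=True, text=True
    )
    return int(result.stdout.strip())


def kvm_summary(kvm_size: int, kvm_free: int) -> str:
    """Format kernel virtual memory size, free and used amounts in GiB."""
    gib = 1024**3
    used = kvm_size - kvm_free
    return (
        f"KVM size {kvm_size / gib:.1f} GiB, "
        f"free {kvm_free / gib:.1f} GiB, "
        f"used {used / gib:.1f} GiB"
    )


if __name__ == "__main__":
    try:
        size = read_sysctl("vm.kvm_size")
        free = read_sysctl("vm.kvm_free")
    except (OSError, subprocess.CalledProcessError, ValueError) as exc:
        # Expected on systems that do not provide these sysctls.
        print(f"could not read KVM sysctls: {exc}")
    else:
        print(kvm_summary(size, free))
```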

Post by Ragnar Lonn
100K buffer memory. If you have e.g. 1GB available to network buffers,
it means a max limit of 10k simultaneous connections on a server,
regardless of how much memory it has.

If you want a system to usefully cope with 10K network connections,
you will probably want to be running amd64 anyway. That said, Rod
Grimes was achieving between 100K and 1M TCP connections to FreeBSD
i386 systems in the 1990's.

--
Peter Jeremy

Peter Jeremy

2012-09-03 21:57:33 UTC

Post by Peter Jeremy
you will probably want to be running amd64 anyway. That said, Rod
Grimes was achieving between 100K and 1M TCP connections to FreeBSD
i386 systems in the 1990's.

Oops, I misremembered. It was Terry Lambert achieving 1.6M
connections, not Rod Grimes and only about 10 years ago:
http://www.mavetju.org/mail/view_message.php?list=freebsd-hackers&id=1502550
(though I can't find when he started the work).

--
Peter Jeremy

Ragnar Lonn

2012-09-04 09:38:38 UTC

Post by Peter Jeremy

FreeBSD does this as well, though I don't recall when this was added.

Post by Ragnar Lonn
I think I actually discussed this with FreeBSD developers a while ago
(on this list even?), and they told me the FreeBSD kernel can only
allocate max 2GB of kernel memory.

Maybe I misremembered slightly. I found the old discussion I had with
people about this on the FreeBSD virtualization mailing list:

http://osdir.com/ml/freebsd-virtualization/2009-02/msg00006.html

Anyway, 1.6M connections sounds really good (although he only had 4GB of
memory, so I guess the exercise was mostly academic - i.e. those
connections would not be very useful in a real setting because each
would have so little buffer memory).

/Ragnar


Ragnar Lonn

2012-09-02 08:58:32 UTC

Hi Andy,

I work for an online load testing service (loadimpact.com) and what we
see is that the most common cause when a server crashes during a load
test is that it runs out of some vital system resource. Usually system
memory, but network connections (sockets/file descriptors) are also a
likely cause.

You should have gotten some kind of error messages in the system log,
but if the problem is easily repeatable I would set up monitoring of at
least memory and file descriptors, and see if you are near the limits
when the machine freezes.
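
As one way to do the file descriptor half of that monitoring on FreeBSD, the system-wide count and limit are exposed as the `kern.openfiles` and `kern.maxfiles` sysctls. The sketch below is only an illustration: the 90% threshold and the helper names `near_limit` and `read_sysctl` are assumptions, not a recommendation from this thread.

```python
import subprocess


def near_limit(current: int, limit: int, threshold: float = 0.9) -> bool:
    """True when current usage is at or above `threshold` of the limit."""
    return limit > 0 and current / limit >= threshold


def read_sysctl(name: str) -> int:
    """Return an integer sysctl value by running `sysctl -n <name>`."""
    result = subprocess.run(
        ["sysctl", "-n", name], check=True, capture_output=True, text=True
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    try:
        open_files = read_sysctl("kern.openfiles")
        max_files = read_sysctl("kern.maxfiles")
    except (OSError, subprocess.CalledProcessError, ValueError) as exc:
        # Expected on systems that do not provide these sysctls.
        print(f"could not read file descriptor sysctls: {exc}")
    else:
        flag = "NEAR LIMIT" if near_limit(open_files, max_files) else "ok"
        print(f"open files: {open_files}/{max_files} ({flag})")
```

Running this in a loop during the load test (for example once per second, logging to local disk rather than over the network) would show whether the count climbs toward the limit before the freeze.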

Regards,

/Ragnar

_______________________________________________
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware

Pepe (Jose) Amengual

2012-09-03 03:14:11 UTC

Maybe you should check vmstat -z while running the load testing to see if
you get any errors.
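
A rough sketch of picking the failures out of that output automatically. It assumes the usual FreeBSD layout of `ITEM: SIZE, LIMIT, USED, FREE, REQ, FAIL, ...` with comma-separated columns after the zone name; the sample text and its numbers are made up for illustration, and `zones_with_failures` is a hypothetical helper.

```python
def zones_with_failures(vmstat_z_output: str) -> dict:
    """Map zone name to FAIL count for every zone whose FAIL count is non-zero."""
    failures = {}
    for line in vmstat_z_output.splitlines():
        # Zone names can contain spaces, so split on the last colon.
        name, sep, rest = line.rpartition(":")
        if not sep:
            continue  # header or blank line
        fields = [field.strip() for field in rest.split(",")]
        if len(fields) < 6:
            continue
        try:
            fail_count = int(fields[5])  # assumed column order: FAIL is sixth
        except ValueError:
            continue
        if fail_count > 0:
            failures[name.strip()] = fail_count
    return failures


# Made-up sample in the assumed format, not output from the server in this thread.
SAMPLE = """\
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP

UMA Kegs:               208,      0,     198,       6,     198,   0,   0
mbuf_cluster:          2048,  25600,   25600,       0, 9000000, 512,   0
"""

if __name__ == "__main__":
    for zone, count in zones_with_failures(SAMPLE).items():
        print(f"{zone}: {count} failed allocations")
```

A zone whose USED value sits at its LIMIT with a growing FAIL count is the kind of exhausted resource to look for.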


Andy Young

2012-09-03 04:04:51 UTC

Hi Pepe,

Thank you for the tip. I don't know how to interpret any of the output but
I will dig into the documentation.

Andy
