Andy Young
2012-09-01 20:15:07 UTC
Last night one of our servers went offline while I was load testing it. When I
got to the datacenter to check on it, the server seemed perfectly fine.
Everything was running on it, and there were no panics or any other sign of a
hard crash. The only problem was that the network was unreachable. I couldn't
connect to the box even from a laptop directly attached to the ethernet
port. I couldn't connect to anything from the box either. It was as if the
network controller had seized up. I restarted netif and it didn't make a
difference. Rebooting the machine, however, solved the issue and everything
went back to working great. I restarted the load testing and reproduced the
problem twice more this morning, so at least it's repeatable. It feels like a
network controller / driver issue to me for a couple of reasons. First, the
problem affects the entire system. We're running FreeBSD 9 with about a
half dozen jails. Most of the jails are running Apache but the one I was
load testing was running Jetty. If it were my application code
crashing, I would expect the problem to at least be isolated to the jail
that hosts it. Instead, the entire machine and all jails in it lose access
to the network.
Apart from not being able to access the network, I don't see any other
signs of problems. This is the first major problem I've had to debug in
FreeBSD so I'm not a debugging expert by any means. There are no error
messages in /var/log/messages or dmesg apart from syslogd not being able to
reach the network. If anyone has ideas on where I can look for more
evidence of what is going wrong, I would really appreciate it.
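In case it helps anyone point me in the right direction, here is a minimal sketch of what I could capture from the console the next time the network hangs. The interface name (em0) and the output path are assumptions on my part and would need to be adjusted; the commands are the stock FreeBSD tools for interface state, mbuf usage, per-interface error counters, interrupt counts and the driver's sysctl node.

```shell
#!/bin/sh
# Snapshot network-related state to a file for later comparison.
# ASSUMPTION: IFACE is a placeholder; use the name ifconfig reports (e.g. em0 or igb0).
IFACE="${IFACE:-em0}"
OUT="${OUT:-/tmp/netstate.$(date +%Y%m%d-%H%M%S).txt}"

# Run a command under a header line; note it (instead of aborting) if it is missing.
run() {
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(command not found: $1)"
    fi
    echo
}

# Split e.g. "em0" into driver "em" and unit "0" to build the sysctl node dev.em.0.
DRIVER="${IFACE%%[0-9]*}"
UNIT="${IFACE##*[!0-9]}"

{
    run ifconfig "$IFACE"               # link status, flags, configured aliases
    run netstat -m                      # mbuf and cluster usage, denied requests
    run netstat -i                      # per-interface packet and error counters
    run vmstat -i                       # interrupt counts per device
    run sysctl "dev.${DRIVER}.${UNIT}"  # driver-level statistics
} > "$OUT"

echo "wrote $OUT"
```

Running it once right after a clean boot and again while the network is hung would give two files to diff, which might show whether mbufs are exhausted or interrupts have stopped arriving.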
We're running FreeBSD 9.0-RELEASE-p3. The network controller is an Intel(R)
PRO/1000 Network Connection version - 2.2.5, configured with 6 IPs using
aliases, five of which are used for jails.
Thank you for the help!!
Andy