Thursday, September 01, 2011

OpenSuse 11.4 Reload & Problem Is Back

On Tuesday we moved our OpenSuse 11.4 users to a VM copy of the new GNOME desktop and wiped the physical machine completely. We pulled the drives from the RAID array, put them in a new order, and then added a new RAID (1+0) partition. Our customizations always live in /u and /u2, so after a few well-placed tarballs and tweaks we had a brand new machine with the same functionality as the old one.

Very sadly, the same issues have come back. I appreciate all of the ideas and tips in the comments on prior posts. I do have a bit more information, and it's very odd. This server should run 200 users easily, but once we get past about 20-30 users, and especially at 40, performance gets very poor. Here is the new information: if I vi /etc/services, it blinks for a few seconds and then opens. If I copy /etc/services to /tmp/services and /home/services and vi those copies, they open immediately. So there seems to be some kind of contention or lock on the /etc/ directory, and this contention appears to be the core of the problem: so many services are constantly looking at those files, and they are somehow bottlenecked. If you have any ideas to assist, the bug report is here.
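One way to take vi out of the picture is to time a plain open and read of the same three files. The following is only a minimal sketch, assuming Python 3 is available on the server and that the copies in /tmp and /home already exist; the function name time_open_and_read is made up for illustration. If /etc/services is still slow here, the delay is in the file access itself rather than in something the editor does.

```python
import os
import time


def time_open_and_read(path, repeats=5):
    """Return (fastest, slowest) wall-clock seconds to open and fully read path."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as handle:
            handle.read()
        samples.append(time.perf_counter() - start)
    return min(samples), max(samples)


if __name__ == "__main__":
    # Paths from the post; the two copies are assumed to have been made by hand.
    for candidate in ("/etc/services", "/tmp/services", "/home/services"):
        if not os.path.exists(candidate):
            print(f"{candidate}: missing, skipped")
            continue
        fastest, slowest = time_open_and_read(candidate)
        print(f"{candidate}: fastest {fastest:.4f}s, slowest {slowest:.4f}s")
```

Repeated reads will mostly hit the page cache, so the slowest sample is usually the more interesting number.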

Networking is still sub-optimal; note the dropped RX packets.

eth0 Link encap:Ethernet HWaddr 00:1C:C4:93:DF:72
inet addr: Bcast: Mask:
RX packets:12191295 errors:0 dropped:73239 overruns:0 frame:0
TX packets:12731836 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5112551787 (4875.7 Mb) TX bytes:10726315924 (10229.4 Mb)
Interrupt:16 Memory:f8000000-f8012800

eth1 Link encap:Ethernet HWaddr 00:1C:C4:93:DF:74
inet addr: Bcast: Mask:
RX packets:73278 errors:0 dropped:238 overruns:0 frame:0
TX packets:8009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:15639151 (14.9 Mb) TX bytes:827822 (808.4 Kb)
Interrupt:17 Memory:fa000000-fa012800
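To put the dropped counters in proportion, the RX drop rate can be computed from the output above. This is a rough sketch, assuming Python 3 and the older ifconfig text layout shown here; the helper name rx_drop_rates is hypothetical, and the sample text is trimmed to the lines the parser needs.

```python
import re

# Lines copied from the ifconfig output above (address lines omitted).
SAMPLE = """\
eth0 Link encap:Ethernet HWaddr 00:1C:C4:93:DF:72
RX packets:12191295 errors:0 dropped:73239 overruns:0 frame:0
TX packets:12731836 errors:0 dropped:0 overruns:0 carrier:0

eth1 Link encap:Ethernet HWaddr 00:1C:C4:93:DF:74
RX packets:73278 errors:0 dropped:238 overruns:0 frame:0
TX packets:8009 errors:0 dropped:0 overruns:0 carrier:0
"""

IFACE = re.compile(r"^(\S+)\s+Link encap:")
RX = re.compile(r"RX packets:(\d+) errors:\d+ dropped:(\d+)")


def rx_drop_rates(text):
    """Map interface name to the percentage of received packets that were dropped."""
    rates = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        iface = IFACE.match(line)
        if iface:
            current = iface.group(1)
            continue
        rx = RX.search(line)
        if rx and current is not None:
            packets, dropped = int(rx.group(1)), int(rx.group(2))
            rates[current] = 100.0 * dropped / packets if packets else 0.0
    return rates


if __name__ == "__main__":
    for name, rate in sorted(rx_drop_rates(SAMPLE).items()):
        print(f"{name}: {rate:.2f}% of RX packets dropped")
```

For the counters above this works out to roughly 0.6% on eth0 and 0.3% on eth1.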

There are a few more steps that we can take, including installing the 3.0 kernel to see if that helps. We might have to start looking at other distributions if we don't see progress soon.

Other projects continue: writing some code for the support portal, testing NX, looking at upgrading our Moin Wiki, and WiFi upgrades. I have also been poking at some ideas to improve NX performance to our thin clients; I will blog about that if it pans out.


Anonymous said...

What does seekwatcher say ?

Anonymous said...

One other thought - are things just as slow with busybox tools? Perhaps there's something network related being caused by glibc for each binary?

Dave Richards said...

It should also be noted that /etc/, /tmp/ and /home are all in a single file system. We only have / mounted, with an ext3 file system. Will check out busybox and seekwatcher.

Anonymous said...

That almost scales as bad as Windows...

Tommi Tervo said...

What is your NIC vendor? Also check /proc/sys/vm/zone_reclaim_mode: is it on or off?
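For anyone who wants to check that setting from a script, here is a small sketch, assuming Python 3; the function names are made up for illustration. The bit meanings follow the kernel's sysctl documentation for vm.zone_reclaim_mode (0 is off; 1, 2 and 4 are flags that can be combined).

```python
def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Read the integer value of the sysctl from procfs."""
    with open(path) as handle:
        return int(handle.read().strip())


def describe_zone_reclaim_mode(mode):
    """Turn the bitmask into a short human-readable description."""
    if mode == 0:
        return "off"
    flags = []
    if mode & 1:
        flags.append("zone reclaim on")
    if mode & 2:
        flags.append("writes dirty pages during reclaim")
    if mode & 4:
        flags.append("swaps pages during reclaim")
    return ", ".join(flags) if flags else f"unknown value {mode}"


if __name__ == "__main__":
    try:
        value = read_zone_reclaim_mode()
        print(f"zone_reclaim_mode = {value} ({describe_zone_reclaim_mode(value)})")
    except OSError as error:
        # The file is absent on kernels built without NUMA support.
        print(f"could not read zone_reclaim_mode: {error}")
```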

Anonymous said...

sometimes you go down a path and moving to another is too much effort. so i say this cautiously. but CentOS/ScientificLinux are kissing cousins of SUSE. Too much effort to migrate?

Anonymous said...

btw, the link is dead

pbrobinson said...

The first thing that comes to mind is RAID levels. HW or SW RAID? RAID 10? Or some other? What about the HW RAID firmware and configuration? I've seen very similar issues before with HW RAID, where a firmware bug caused the caching to be disabled because the controller thought the RAID battery was dead, and hence it was going to disk for everything. With /etc likely all on a similar area of the disk, that would destroy the performance with so many things accessing /etc/ constantly.

Also the scheduler might be worth reviewing. With newer kernels you can select the type of scheduling used at run time.
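Assuming this refers to the block I/O scheduler, the active one can be read from sysfs, where the kernel lists the available schedulers and marks the current one in square brackets. The sketch below assumes Python 3; the device name sda and both function names are placeholders rather than details from the post.

```python
import re


def parse_scheduler_line(line):
    """Return (active, available) from text such as 'noop deadline [cfq]'."""
    active = re.search(r"\[([^\]]+)\]", line)
    available = line.replace("[", "").replace("]", "").split()
    return (active.group(1) if active else None), available


def read_scheduler(device="sda"):
    """Read the scheduler list for one block device (device name is an assumption)."""
    with open(f"/sys/block/{device}/queue/scheduler") as handle:
        return parse_scheduler_line(handle.read())


if __name__ == "__main__":
    try:
        active, available = read_scheduler()
        print(f"active: {active}; available: {', '.join(available)}")
    except OSError as error:
        print(f"could not read scheduler: {error}")
```

Switching is done by writing one of the listed names back to the same file as root, which this sketch deliberately does not do.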

What does something like iotop or atop report?

Also I wonder if you could try a RHEL/CentOS 6 kernel (no idea if that is even possible).