I'm leaving for a vacation tomorrow so I thought I'd post a blog concerning the status of our OpenSuse 11.4 GNOME server. We are in the home stretch of the deployment, but unfortunately the underlying Linux is giving us problems. I thought I'd relay the information and maybe someone has some ideas on getting this resolved. When I get back in just over a week, we'll start doing some major changes one at a time to resolve the issue. I'm not concerned about project success, but the best path is not yet clear.
All of the software is installed and for the most part GNOME and avant-window-navigator are working as expected. The hardware is beefy and should support at least 200 users+, but as soon as get to about 20 users - things really start slowing. We are seeing problems with the networking and disk performance. They may be related, but I haven't yet been able to prove it.
We have cloned the server and moved it VMWare and it's suffering the same problems. So I feel pretty confident it's not hardware. We are using this hardware elsewhere with no problems.
Disk IO Problem
For the first time we tried to deploy ext4, all of my Googling seems to indicate no major problems and no complaints about performance. I did not see any "multi user" issues reported. If we have a few (10ish) people on the server things work just fine. Once you get more, the disk becomes VERY slow. If I vi /etc/hosts (shot below), it just sits for 3-4 seconds with a blinking cursor before the file appears.
With a user load, if I run the passwd command it sits for5 seconds after entering a password before it completes. On other servers even with hundreds of people, all of this happens instantly. if I run yast2 and install software with a load, the whole server becomes very slow and non responsive. On other releases of OpenSuse we have never seen this before.
While this might seem a clear IO problem, what also enters my head is that we are having some networking problems too; and I'm wondering if the NFS mounts are having problems which is affecting local disk IO. Unfortunately I cannot dismount the NFS mounts to prove that idea, too many necessary pieces are on remote file systems.
It seems like others are having OpenSuse 11.4 dropped RX woes too. The first shot below is the desktop server, and note the packets that are being dropped. This number climbs constantly every few seconds.
But note on this shot below from OpenSuse 11.3 how it *should* behave. This browser server has over 100 people using Firefox 5 and you can see it's being hammered on the network side and there are almost no dropped packets. This reflects how our other Linux servers work.
I have noticed that certain NFS activities really slow the machine, when they are in Nautilus and generating thumbnails there is a noticeable slowdown for the other users. But it's not clear if this is "networking" or "disk io"
Aside from someone out there having intimate knowledge of OpenSuse 11.4 and a quick fix, we'll begin making drastic changes when I returned. The first thing I'll do is grab the experimental 3.X kernel that I see on some of the OpenSuse channels. I have one inclination that we are having a kernel/scheduling problem because it's affecting two subsystems. Currently I have upgraded to the latest stable release of kernel-desktop-188.8.131.52-0.5.1.x86_64. If that doesn't work, the next thing that I'm going to do is reformat the server with ext3 and rule out a file system problem. This is a nasty thing, and going to require lots of redoing work that is already complete. But what we have now cannot be moved into production.
One other idea I had was that somehow the -desktop kernel is not suited for multiple users, and I could try moving to the more vanilla version; but it sure seems like this change won't make disk and networking better.
If you work on OpenSuse and have any information, it's greatly appreciated. We are anxious to move this server into production.