Thursday, February 14, 2008

GNOME Scaling Problem

In March we will be putting the new openSUSE 10.2 server live for our thin clients, with support for 3D/Compiz.

Federico is always very cool about letting me bounce off him the odd things we sometimes see when running hundreds of concurrent GNOME users on the same server. I found such an issue, and have come up with a workaround. When you have around 30-40 people logged into one server and push a new .desktop file live, the whole computer gets very slow as all of the application-browsers update at the same time. The display freezes for all users for about 4-5 seconds while everyone updates, and the more people on the server, the worse it gets.

In case anyone is interested in improving this situation, I have snapped a screenshot that shows it happening. In the gnome-terminal window I am just duplicating an icon to simulate a new program being added. In the "top" window, you can see the CPU hitting 90% and all of the application-browser processes appearing at once. This sample was done with about 50 people; when I put on a full load next month, it will be far worse. The workaround is low tech: icons will be added only at night when no users are on. :)
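For anyone who wants to try the "duping an icon" test themselves, here is a rough sketch. To keep it runnable anywhere it stages a fake applications directory; on a real server the directory would be something like /usr/share/applications, and the file names here are made up for illustration.

```shell
#!/bin/sh
# Stage a fake applications directory with one launcher in it.
APPS_DIR=$(mktemp -d)
printf '[Desktop Entry]\nName=Demo\nType=Application\nExec=true\n' \
    > "$APPS_DIR/demo.desktop"

# "Duping an icon": copying an existing launcher simulates a new program
# being pushed live, which is what wakes every application-browser at once.
cp "$APPS_DIR/demo.desktop" "$APPS_DIR/demo-copy.desktop"

# On the real server you would then watch the load spike from another
# terminal, e.g.:
#   top -b -n 1 | head -20
```

With dozens of sessions watching the same directory, that single copy is all it takes to make every session rescan at once.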

I'll say it again: all of this technology is working extremely well. This one small issue is hardly a show-stopper.

10 comments:

Frej said...

Just for info :). No solutions. ;).
The problem is probably disk-I/O limited, not CPU.

The "D" in ps means Uninterruptible sleep (usually IO)
(from man ps). A process won't get that status if it's in the run queue (waiting for CPU time).

This is pure guessing!!!
The program that gets notified when you add a *.desktop file probably does something stupid like reading _all_ *.desktop files again. Multiply that by the number of users and your disk hurts.

Justin said...

You always log in as root over telnet? :-)

Anonymous said...

Try stracing one of these programs during an update, so you can see what they are doing. There's probably a reasonably easy optimisation waiting to happen.

Things I can think of that it could be doing:
- Updating or even recreating some kind of per-user cache
- Reading all the .desktop files instead of just the new/updated one
- Reading/writing files in a silly way (one character at a time, scanning gigantic amounts of dirs/files unnecessarily, repeatedly opening/closing the same files etc.)
- Rescanning once for each added .desktop file; adding a tiny, preferably random, delay before rescanning would coalesce a burst of changes into one rescan
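That last point - waiting a small random interval before rescanning - could look something like the sketch below. The class and method names are hypothetical; the real fix would live inside the menu code's file-change handler, but the idea is the same: batch a burst of changes into one rescan, and jitter the delay so hundreds of sessions don't all hit the disk at the same instant.

```python
import random
import threading

class DebouncedRescanner:
    """Coalesce bursts of .desktop changes into a single, jittered rescan.

    Instead of rescanning once per changed file, each change (re)arms a
    timer with a small random delay.  Only when the burst goes quiet does
    the rescan actually run, and the random delay spreads many users'
    rescans out in time.
    """

    def __init__(self, rescan, min_delay=0.5, max_delay=3.0):
        self._rescan = rescan        # placeholder for "re-read the menu"
        self._min = min_delay
        self._max = max_delay
        self._timer = None
        self._lock = threading.Lock()

    def notify_change(self):
        """Called once per changed .desktop file."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # a new change restarts the wait
            delay = random.uniform(self._min, self._max)
            self._timer = threading.Timer(delay, self._fire)
            self._timer.start()

    def _fire(self):
        with self._lock:
            self._timer = None
        self._rescan()               # one rescan for the whole burst
```

With 50 sessions each using something like this, pushing one new icon would trigger 50 rescans spread over a few seconds instead of 50 simultaneous ones.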

Anonymous said...

Is this really a GNOME scaling problem, or a SUSE-specific one? I don't have anything like an application-browser on my GNOME installation here (Arch Linux).

Dave Richards said...

frej: Agreed and thanks. It looks like it's hammering the disk.

justin: busted. :) I was thinking about the test and just telneted over there.

anonymous: I have an strace and will be attaching it to my bug report on BGO.

anonymous: I believe this is part of main-menu (slab), which I thought was being merged into GNOME as an alternate menu to replace the normal cascading ones.

Basil Mohamed Gohar said...

Just out of curiosity, what kind of a deployment is that? When you say multiple users are logged in at the same time, do you mean multiple users' X sessions are running on one server? And if that is the case, what hardware are the users running - dedicated thin clients or their own machines? I'm really interested in this.

--Basil (abu_hurayrah@hidayahonline.org)

Dave Richards said...

Basil Mohamed Gohar: We are using XDMCP with HP thin clients. My main blog page has complete details. Just go back through the archive and you can see my work files and testing.

http://davelargo.blogspot.com

Basil Mohamed Gohar said...

Dave,

Thanks for the reply! I'm really fascinated with the success you are having. Do you find that a regular XDMCP connection does not saturate your network? Are these users on the same LAN or do they connect remotely? I have tried setups using X sessions redirected over a compressed SSH connection, and on a LAN, the response time is usually pretty decent, but when I connect from work to home (which has a simple cable modem connection), response is very laggy, for obvious reasons.

Does your setup overcome these kinds of issues?

Dave Richards said...

Basil Mohamed Gohar: We normally have around 300 concurrent sessions running on our network, which is flat (no subnets). There are no performance problems at all, even when running Compiz and 3D. We have fiber optic lines running to all closets at 1Gb, and 100Mb to the desktops. These specs are pretty normal.

For users not on our local network, we are currently using Citrix MetaFrame for Unix (Solaris) for bandwidth compression. It works very well. We are also evaluating NX/NoMachine because it offers the RENDER extension and therefore has anti-aliased fonts. NX also runs natively on Linux.

Anonymous said...

I guess you're on something bigger than a /24 then. 10.x.x.x/16 I'd assume.
Quite interested in hearing details of your experimenting with NX. Do you have any legacy Windows apps that you have the thin clients connecting to via Citrix or 2X?