Friday, August 26, 2011

Server Reload & NX Testing

Server Reload

I have mentioned in previous posts some problems with our OpenSuse 11.4 server which is going to be used for running GNOME. It's been running slower than expected and is having some very odd disk and networking issues. The OpenSuse kernel guys have been great at reviewing all of our files and data, and right now everything looks like it should work fine...yet performance is not as expected. So we are going to rule out any chance that something failed during the upgrade from OpenSuse 11.3 to 11.4. As everyone that upgrades knows, sometimes what you get after an upgrade isn't the same as a fresh install. So next week we are going to make a Clonezilla backup of the server, and do a fresh install of OS 11.4 and then lay our customizations over the top. Our beta testers will be moved to a VM instance and won't have any down time. We'll know more next week. I keep reminding everyone that this is a normal part of upgrades and testing. Previous GNOME upgrades took time to debug and certify as well. The production machines are all working fine, so only those people that have volunteered to test are seeing these issues.

Testing NX To Anticipate Video Expansion

I've mentioned many times that we run old school X for those people on our fiber optic network, which is the vast majority of users. One thing that is happening is that video training, and video playback in general, is exploding. We have noticed that certain streams (Flash, cough) seem to weigh heavily on our network from time to time. Many videos play fine, but at times higher quality (or frame rates?) causes the network to get busier. The other issue we are seeing is that users expect a "video" to be a "TV" and are very often stretching the window as big as possible before there is frame loss or it gets too grainy. If we did not anticipate further growth in this area, we probably could continue to use our current design. But we are expecting this to continue to grow.

We use NX technology at our remote and low bandwidth sites. I have been testing it on the high speed network as a method to compress the video streams. I made a quick graphic below to demonstrate the difference.


The current design is to use X or XV and PulseAudio. With NX, the physical Xserver is on the system console itself and only a compressed stream is delivered to the thin clients. I made some tweaks to the configuration files, disabled XV and tested, and the results are not bad.

The shot below is a Flash video playing from YouTube to a thin client over NX. This type of compression software is intelligent enough not to hammer your network. I had not tested this in several years and really it's working pretty well.



The one caveat with using NX/RDP/VNC type software is that sometimes it gets confused about screen changes and leaves artifacts. This is why it was never deployed on our high speed network. Remote X feels very similar to being on the console in terms of crispness and repaints. In the shot below, while the video was playing I changed tabs in Firefox and the video section remains as an artifact. I had to take this shot with my camera because this is what the eye sees on the thin client. However, if you do a screenshot the artifact is not there. X thinks that area is gone, but the compression formulas don't always know to repaint all areas.



This testing was done with NX 3.x, and we are awaiting the NX 4 Linux native client to see how it works. Pulse is now supported and will reduce our network overhead even more. We have a long way to go before we decide to make this type of change, but we are trying to stay ahead of the curve with trends.

Happy Weekend.

Friday, August 19, 2011

Post Vacation Projects Continue

One returns from vacation with a new sense of energy, so I am working again on the various issues and projects.

OpenSuse 11.4 Kernel Problem

I got a lot of nice ideas to try and get our OpenSuse 11.4 server running better, and sadly none of them have made major improvements. I finally feel like we have ruled out everything except the kernel and have opened a bug report with the kernel guys. None of the admin tools are showing why the machine is running poorly, but it feels as if the whole server is running from the swap device. Copies of this server moved to VM suffer the same results. Unless I can get some kind of movement in this area soon, I am going to have to consider other Linux distributions. A few well placed tarballs and a VM copy will allow me to do that quickly, but it's still not pleasant. Everything is in place and working, we just cannot scale over 35-40 users. This hardware should run 150+ easily.

Evolution Issues Continue

Still communicating with some of the Evolution developers in the hopes of nailing down some nagging issues. Email and calendar events flow and things are working, but it's not yet to a milestone that gives me comfort. Everyone suffers from low resources in the IT field these days it seems.

Firefox 6

It's out already and I have been testing it heavily with various pages and technologies. Another Flash release came out in early August and that is being given the Youtube stress test over remote display with PULSE. While it never will be as robust as a local video card, playback is not terrible and more and more users are playing this type of content. I have some ideas to improve how this works, and will blog about it in the future. Just not enough hours in the day.

NX Testing and Ideas

I was able to connect finally with a thin client to NX 4 and started testing it in preparation of it going live in the coming months. I also had a few product feature ideas that I emailed to them. While we are using NX right now only at remote sites with low bandwidth, we are watching the marketplace and the whole "cloud" thing. If we move anything off site, we'll need NX for compression.

Support Portal Project

One of the best parts about writing software is when people actually use it, and we recoup the time invested in coding. Merging lots of little functions into one UI is saving time for our support staff, and also I feel allows us to provide better service. When there are problems with printers for instance, we know about it within just a few minutes.

Once again I will mention, please no HIG comments. :) I have kind of come to the mindset that projects should not be over-engineered up front. I'm talking about the types of projects where you spend so much time in design and building it "correctly" that the end users don't see any code for months and months. I'm building ideas and testing them with real people. Some of the data is still stored in flat files which is clunky of course; but the end users never see that stuff. Why make them wait for you to build and optimize a database when just experimenting? If I abandon an idea, only a short period of time was used. I'll clean as I progress and like the results.

I have started to write some code for the handling of printers. You can now select a department and all of the printers in that department appear. Each printer offers three options/buttons: the first button launches Firefox and just goes to the IP of the printer. This connects with the HP web admin page found on all of their printers. The second button will display the number of print jobs currently in queue (over all servers), and when you click on the button it takes you to the UI for the viewing and halting of print jobs. The third button (marked in green) is an idea I had to make use of the under-utilized "lpmove" command. Yup, you can move print jobs to other printers...and no one ever uses it. When you have 60 printers, you can barely remember them all, let alone where they are located. So using lpmove on the command line would require digging out some kind of paper document to remember where each printer is located. So I have started to build a simple screen that will allow us to enter in the closest printers to each printer, along with directions from one to another. That will allow our support staff to take the stuck jobs, and fire them to a printer that might be located 50 feet away from the original. You might be wondering: Why not just cancel it and just have the user resend it? The real world answer is that many users tend to print (and expect it works 100% of the time) and delete and empty trash before they see that it really worked. What I envision then is the end user would get an email message alerting them the print jobs were moved to a new printer, with directions on where it's located. It'll be interesting to see how this all works. Printers really continue to be a nightmare, anything I can do to reduce our time on them seems worthwhile.
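For anyone curious what the lpmove idea might look like in code, here is a minimal Python sketch. It is not the portal's actual code: the printer names, the directions and the NEARBY table are all made up for illustration, and it only builds the lpmove argument list and the notice text rather than running anything.

```python
# Hypothetical lookup table: each printer maps to its closest alternates,
# nearest first, with walking directions. Names and directions are invented.
NEARBY = {
    "hp_finance_1": [("hp_finance_2", "Down the hall, about 50 feet on the left")],
    "hp_lobby": [("hp_finance_1", "Through the double doors, first room on the right")],
}

def lpmove_command(job_id, source):
    """Return (argv, destination, directions) for moving a job to the nearest alternate."""
    alternates = NEARBY.get(source)
    if not alternates:
        raise LookupError(f"no nearby printer recorded for {source}")
    destination, directions = alternates[0]
    # lpmove takes the job ID followed by the new destination.
    return ["lpmove", str(job_id), destination], destination, directions

def notice_text(job_id, source, destination, directions):
    """Body of the message the end user could receive after a move."""
    return (f"Print job {job_id} was moved from {source} to {destination}.\n"
            f"Directions: {directions}")

argv, dest, directions = lpmove_command(1234, "hp_finance_1")
print(" ".join(argv))
print(notice_text(1234, "hp_finance_1", dest, directions))
```

The argument list could be handed to subprocess.run on a server where CUPS is installed; keeping the command building separate from the execution makes it easy to show support staff what will happen before anything moves.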



I also wrote some of the initial code to allow for searches of our users by first, last or account name. All matching users appear in the form of buttons. Hovering your mouse over the buttons gives you further details. This continues my pet peeve about making people open a detail screen to see information that should be easily found on the primary UI. If you do click on the user button, a detail screen appears with much of the same information found in the tooltip. It then also scans the servers and looks for their logins and lights up a monitor when found. We allow users to log in multiple times, so it's possible they have 1-4 active logins at a time. Other detail information that will appear on this screen (not yet written) includes information about their print jobs, and then a listing of software applications they are allowed to run. Linux software is deployed to nearly everyone because it's mostly free, Windows software has licenses and must be checked prior to launching. We also are looking at tracking how often all software is used, so that we can review it periodically and remove those packages that are no longer needed. If you click on a user login monitor, it takes you to the thin client detail screen and gives you greater detail about the physical device they are using. This allows us to check software versions, color depth, type of connection and the like.
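The search by first, last or account name can be sketched roughly like this in Python. The user records and field names are invented, and the case-insensitive substring match is my assumption about a reasonable behavior, not a description of the real code.

```python
# Hypothetical user records; the real data source is not shown in this post.
USERS = [
    {"first": "Alice", "last": "Nguyen", "account": "anguyen"},
    {"first": "Bob", "last": "Smith", "account": "bsmith"},
    {"first": "Carol", "last": "Alvarez", "account": "calvarez"},
]

SEARCH_FIELDS = ("first", "last", "account")

def search_users(users, term):
    """Return users whose first, last or account name contains term (case-insensitive)."""
    needle = term.strip().lower()
    if not needle:
        return []
    return [user for user in users
            if any(needle in user[field].lower() for field in SEARCH_FIELDS)]

for user in search_users(USERS, "al"):
    print(user["account"])
```

Each match would then become one button, with the rest of the record feeding the tooltip.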



Happy Friday!

Tuesday, August 09, 2011

OpenSuse 11.4 Woes

I'm leaving for a vacation tomorrow so I thought I'd post a blog concerning the status of our OpenSuse 11.4 GNOME server. We are in the home stretch of the deployment, but unfortunately the underlying Linux is giving us problems. I thought I'd relay the information and maybe someone has some ideas on getting this resolved. When I get back in just over a week, we'll start doing some major changes one at a time to resolve the issue. I'm not concerned about project success, but the best path is not yet clear.

All of the software is installed and for the most part GNOME and avant-window-navigator are working as expected. The hardware is beefy and should support at least 200+ users, but as soon as we get to about 20 users, things really start slowing. We are seeing problems with the networking and disk performance. They may be related, but I haven't yet been able to prove it.

We have cloned the server and moved it to VMware and it's suffering the same problems. So I feel pretty confident it's not hardware. We are using this hardware elsewhere with no problems.

Disk IO Problem

For the first time we tried to deploy ext4. All of my Googling seems to indicate no major problems and no complaints about performance. I did not see any "multi user" issues reported. If we have a few (10ish) people on the server things work just fine. Once you get more, the disk becomes VERY slow. If I vi /etc/hosts (shot below), it just sits for 3-4 seconds with a blinking cursor before the file appears.




With a user load, if I run the passwd command it sits for 5 seconds after entering a password before it completes. On other servers even with hundreds of people, all of this happens instantly. If I run yast2 and install software with a load, the whole server becomes very slow and non responsive. On other releases of OpenSuse we have never seen this before.
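One way to put numbers on these pauses, instead of watching a blinking cursor, is to time a plain file read repeatedly while users are logged in. This is only a rough Python sketch: the function names are mine, and the demo reads a scratch file so it runs anywhere. A read that comes straight from the page cache will look fast even on a sick server, so a slow vi could also be caused by something other than the read itself.

```python
import os
import tempfile
import time

def time_read(path):
    """Return (seconds, byte_count) for reading the whole file at path."""
    start = time.perf_counter()
    with open(path, "rb") as handle:
        data = handle.read()
    return time.perf_counter() - start, len(data)

def slowest_read(path, runs=5):
    """Worst-case read time over several runs, in seconds."""
    return max(time_read(path)[0] for _ in range(runs))

# Demo against a scratch file; on the server the path of interest
# would be something like /etc/hosts.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b"127.0.0.1 localhost\n")
    scratch_path = scratch.name

elapsed, size = time_read(scratch_path)
worst = slowest_read(scratch_path)
os.remove(scratch_path)
print(f"read {size} bytes in {elapsed:.4f}s, worst of 5: {worst:.4f}s")
```

Logging the worst-case value every minute alongside the user count would show at what load the delays begin.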

While this might seem a clear IO problem, what also enters my head is that we are having some networking problems too; and I'm wondering if the NFS mounts are having problems which is affecting local disk IO. Unfortunately I cannot dismount the NFS mounts to prove that idea, too many necessary pieces are on remote file systems.

Networking

It seems like others are having OpenSuse 11.4 dropped RX woes too. The first shot below is the desktop server, and note the packets that are being dropped. This number climbs constantly every few seconds.
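To measure how quickly that number climbs, the receive "drop" counter can be sampled from /proc/net/dev at two points in time. The Python sketch below parses that file's text format; the interface name and the counter values in the demo are made up so that it runs without touching a real server.

```python
HEADER = (
    "Inter-|   Receive                                                |  Transmit\n"
    " face |bytes    packets errs drop fifo frame compressed multicast"
    "|bytes    packets errs drop fifo colls carrier compressed\n"
)

def sample(rx_packets, rx_drop):
    """Build made-up /proc/net/dev text for one interface (illustration only)."""
    return HEADER + (f"  eth0: 9876543 {rx_packets} 0 {rx_drop} 0 0 0 0 "
                     "1234567 30000 0 0 0 0 0 0\n")

def rx_dropped(proc_net_dev_text, iface):
    """Return the receive 'drop' counter for iface from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            # Receive columns: bytes packets errs drop fifo frame compressed multicast
            return int(rest.split()[3])
    raise KeyError(iface)

def drops_per_second(before, after, seconds):
    """Average number of newly dropped packets per second between two samples."""
    return (after - before) / seconds

before = rx_dropped(sample(45000, 300), "eth0")
after = rx_dropped(sample(45900, 330), "eth0")
print(f"{drops_per_second(before, after, 10):.1f} drops/sec")
```

On a live machine the two samples would come from reading /proc/net/dev, sleeping for the interval, and reading it again.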




But note on this shot below from OpenSuse 11.3 how it *should* behave. This browser server has over 100 people using Firefox 5 and you can see it's being hammered on the network side and there are almost no dropped packets. This reflects how our other Linux servers work.




I have noticed that certain NFS activities really slow the machine. When users are in Nautilus and generating thumbnails, there is a noticeable slowdown for the other users. But it's not clear if this is "networking" or "disk io".

What's Next?

Aside from someone out there having intimate knowledge of OpenSuse 11.4 and a quick fix, we'll begin making drastic changes when I return. The first thing I'll do is grab the experimental 3.X kernel that I see on some of the OpenSuse channels. I have one inclination that we are having a kernel/scheduling problem because it's affecting two subsystems. Currently I have upgraded to the latest stable release of kernel-desktop-2.6.37.6-0.5.1.x86_64. If that doesn't work, the next thing that I'm going to do is reformat the server with ext3 and rule out a file system problem. This is a nasty thing, and going to require lots of redoing work that is already complete. But what we have now cannot be moved into production.

One other idea I had was that somehow the -desktop kernel is not suited for multiple users, and I could try moving to the more vanilla version; but it sure seems like this change won't make disk and networking better.

If you work on OpenSuse and have any information, it's greatly appreciated. We are anxious to move this server into production.

Monday, August 01, 2011

Project Updates

It's been a while since my last blog, but as always projects continue to evolve and mature.

Sandboxed/Jailed Firefox Sessions

My last few blogs were concerning us creating sandboxed/jailed Firefox sessions to run certain City applications and that has been fully deployed and working well. I was able to get Firefox 5 pushed live with no problems. Flash 11 was also released and that too has been deployed. I am currently testing Firefox 6 with Java 1.7 on the web and with our internal applications. As is the norm lately with upgrades, 1.7 fails to work with some of our web based software. Wasn't the point of a browser to make it easier to deploy software? :)


Networking Problem, GNOME Desktop

I have been fighting a networking problem on OpenSuse 11.4 where after a certain number of people log into the server it starts to get odd lags and we see dropped RX packets on the NIC.

RX packets:45731224 errors:0 dropped:307554 overruns:0 frame:0

When you issue a command such as "vi /etc/hosts" it sits and blinks for about 3 seconds before it displays. However, as users log off at night it suddenly begins working again. I'm going to update the kernel tonight and the next step will be to install another type of network card and see if the problem goes away. This unfortunately has prohibited us from adding more beta testers. Very odd.
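To put the dropped counter in perspective, the RX line from ifconfig can be parsed and the drops expressed as a percentage of received packets. This is just a small Python sketch; the function names are mine, and the sample line is the one quoted above.

```python
import re

# Sample line copied from the ifconfig output quoted earlier in this post.
SAMPLE = "RX packets:45731224 errors:0 dropped:307554 overruns:0 frame:0"

def parse_rx_line(line):
    """Turn an ifconfig 'RX packets:...' line into a dict of integer counters."""
    return {key: int(value) for key, value in re.findall(r"(\w+):(\d+)", line)}

def drop_percentage(counters):
    """Dropped packets relative to received packets, as a percentage."""
    if counters["packets"] == 0:
        return 0.0
    return 100.0 * counters["dropped"] / counters["packets"]

counters = parse_rx_line(SAMPLE)
print(f"dropped {counters['dropped']} of {counters['packets']} packets "
      f"({drop_percentage(counters):.2f}%)")
```

Running the same calculation against a healthy server gives a baseline to compare with.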


Portal UI And FOG Thin Client Updates

My coworker Brian has been doing a great job in getting our FOG server (running in VM) configured and working. He has been testing various techniques to push out updates to our thin clients. Many combinations are being tested in terms of performance. He is looking for the sweet spot in how many current thin clients to update at the same time. Sometimes there are circumstances where doing 5 thin clients at a time twice is faster than doing 10 all at once. I should have more information on our final analysis in the coming weeks.
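The batch size comparison can be sketched as below. This is a hypothetical Python harness, not anything FOG provides: push_update stands in for whatever actually triggers an update, and the thin client names are invented.

```python
import time

def batches(items, size):
    """Split items into consecutive groups of at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]

def time_rollout(clients, size, push_update):
    """Push updates one batch at a time and return the total elapsed seconds."""
    start = time.perf_counter()
    for group in batches(clients, size):
        push_update(group)  # hypothetical: blocks until the whole group is done
    return time.perf_counter() - start

def fake_push(group):
    """Placeholder that does nothing; a real test would trigger the imaging job."""
    return None

clients = [f"tc{n:03d}" for n in range(1, 11)]
for size in (5, 10):
    groups = len(batches(clients, size))
    print(size, groups, f"{time_rollout(clients, size, fake_push):.4f}s")
```

With a real push function in place, comparing the totals for different sizes is one way to look for the sweet spot.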

While he has been improving pushing updates to the thin clients, I have been spending time as it's available working on our "Support Portal". Having all of your user sessions on a server is usually pretty wonderful to support, but in many cases it relies on command line tools and tinkering by hand. The tools that come with the distros are more designed for configuration versus server monitoring. I have been trying to think of ways to have the servers do most of the work, all of the information is right there; but it shouldn't take a System Admin to review it and know how to process.

I know those of you that build screens all the time will want to post HIG and UI comments. Please, these are back burner ideas that are still being developed. Our support staff already seems pleased with the features and capabilities and it's barely even started. If we have to do more with less (tm), this is certainly one way to obtain that goal.

The server monitoring screen is giving us graphs that show total number of users, load and total print jobs. The latter is probably our most support intensive. I hate having to go into child screens to get information, so the designs are trying to make use of tooltips as much as possible.

Here is live data coming from the servers, user loads are displayed. Hovering your mouse over the load graphs (blue) gives you a tooltip that shows you the user accounts names running that particular software/server.



The lower graph (cyan) turns red when print jobs appear to be stuck. Hovering your mouse over the graph shows you the user and printer that appear to be having problems.



If you do click on a print job graph, it brings up a UI that shows the details of the print jobs on that particular server. The blue area is a toggle button and lets you pick multiple jobs at once. The print jobs will then be capable of being cancelled or moved to another printer. The red area is the name of the printer and when clicked initiates a Firefox session and goes to the IP of the printer. Those of you with HP printers know that they provide an administration console on port 80. The magenta area is the name of the user. Clicking on this area will bring up a user detail screen, which is not yet written.



When print jobs are stuck, we now get notify-send popups alerting us of this fact. A similar popup appears when the server seems to be 10% busy after several samples.
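The "after several samples" part matters, because a single spike should not raise a popup. Here is a minimal Python sketch of that idea, assuming a fixed window of recent samples; the class name, the window of three and the sample values are my own choices, and the notify-send call is only built as an argument list, not executed.

```python
from collections import deque

class SustainedLoadAlert:
    """Signal only when every one of the last N samples exceeds the threshold."""

    def __init__(self, threshold, samples_needed):
        self.threshold = threshold
        self.recent = deque(maxlen=samples_needed)

    def add(self, value):
        """Record a sample; return True when the window is full and all samples are high."""
        self.recent.append(value)
        return (len(self.recent) == self.recent.maxlen
                and all(v > self.threshold for v in self.recent))

def notify_command(summary, body):
    """Argument list for a desktop popup via notify-send (summary, then body)."""
    return ["notify-send", summary, body]

alert = SustainedLoadAlert(threshold=10.0, samples_needed=3)
results = [alert.add(value) for value in [5, 12, 15, 11, 4]]
print(results)
print(notify_command("Server busy", "Load above threshold for 3 samples"))
```

The same pattern would work for stuck print jobs: keep a short history per queue and only alert when the condition holds across the whole window.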



The companion piece to the FOG thin client updates was to create a screen to find and maintain files which contain server side settings for the thin clients. Once they get an update, they attempt to download a flat file which contains their settings and then reboot into the right mode. From the thin clients tab you can enter 3 or 4 octets of the IP address and the results will appear. A wrench symbol appears on thin clients that have server side configuration files completed. If you hover your mouse over the thin client entry, it gives you detailed specs on the device. You don't have to open a child screen to view the data.



You can filter and look for thin clients running in many different modes. If someone asks us how many thin clients are still in 1024x768, for the first time we can easily obtain that information. We can also easily query and find out how many thin clients are using two monitors.
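These lookups boil down to simple filters over the thin client records. A rough Python sketch follows, with invented records and field names, and with the assumption that the octets typed in are the leading octets of the address.

```python
# Hypothetical thin client records; the real ones come from server side flat files.
THIN_CLIENTS = [
    {"ip": "10.1.20.15", "resolution": "1024x768", "monitors": 1},
    {"ip": "10.1.20.16", "resolution": "1920x1080", "monitors": 2},
    {"ip": "10.1.21.15", "resolution": "1024x768", "monitors": 1},
]

def match_ip(clients, octets):
    """Match on leading octets, e.g. '10.1.20' or a full '10.1.20.15'."""
    wanted = octets.strip(".").split(".")
    return [c for c in clients if c["ip"].split(".")[:len(wanted)] == wanted]

def with_resolution(clients, resolution):
    """Thin clients currently set to the given resolution."""
    return [c for c in clients if c["resolution"] == resolution]

def dual_monitor(clients):
    """Thin clients using two monitors."""
    return [c for c in clients if c["monitors"] == 2]

print(len(match_ip(THIN_CLIENTS, "10.1.20")))
print(len(with_resolution(THIN_CLIENTS, "1024x768")))
print(len(dual_monitor(THIN_CLIENTS)))
```

Comparing whole octets rather than string prefixes avoids "10.1.2" accidentally matching "10.1.20.15".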



We can search for the "function" (purpose) of the thin clients. We are using the same hardware for different purposes around the city. Some are full featured workstations and others are set into Kiosk modes. Others are configured for low bandwidth sites and use NX.



If you click on a thin client pushbutton, a child screen appears and returns all of the information about the device. We can see who is using it, how many monitors it has and whether it's using HDMI/VGA/DVI cables. We can reset it back to factory defaults, reboot it, we can request a remote control, and we can do a wake on lan and power it on remotely.



At this point enough is working in this code for us to finish configuring the 650+ thin clients around our sites. We hope to have that complete by September. The UI will progress and advance, and I promise will even start looking nicer. :)