Tuesday, April 24, 2012

Trapping User Crashes And Eliminating False Positives

When I was working on the new GNOME desktop servers, I wrote all of the launch scripts to go through a central set of calls so that we could trap an unprecedented amount of data. Every click and login is recorded, and we can tell when users are having problems. One of the unexpected issues we noted is the number of people who are "stealing" software packages that are running on other workstations. We have two GNOME servers that you can log into, and from there you can launch software applications such as Evolution and Libre/OpenOffice, which run on other "application servers". Neither of these packages can run on multiple heads at the same time. This multiple-login architecture was designed with the expectation that it would be used only occasionally and in a pinch: you are already logged in and running Evolution, you go to a meeting in another building, and you need to 'steal' Evolution from the second workstation. But the data indicated that people were doing this all day as part of normal business practice. When you steal Evolution it crashes, and when you steal Libre/OpenOffice you are 100% assured of getting a recovery dialog of some type. The diagram below shows how Libre/OpenOffice and Evolution run on servers and can then be displayed on either of the two GNOME desktops...but only on one at a time.


We had a lengthy conversation with our Director about this technique, and sadly it cannot be stopped, because there really are times when users need to do this. There are also 16 hours of the day when IT does not have support staff available to assist users. So what we settled on was a consistent dialog, shown wherever users attempt to 'steal' software from one location to another. The dialog warns users of the severity of what they are doing, and they then have to enter a six-digit code as their 'signature' that they accept the risks. The same dialog is used for LibreOffice, and soon it will also appear when users attempt to log into the same GNOME desktop a second time from another workstation.

The beauty of this design is that when they terminate software with this technique, it now simply does an exit 0, so it is no longer logged in our software as a crash and we make no attempt to grab a backtrace. They are on their own, and any problems they have are their own. What they are doing is exactly like pulling the cord on a desktop computer with software running; of course you'll have problems and very probably lose work. We have had a huge drop in people using this technique since the dialog was installed.
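
The exit-status convention described above might look something like the following in a launch wrapper. This is only a sketch, not the actual launch script: the function names and the steal_acknowledged flag are hypothetical, and it assumes the logging layer treats any non-zero exit status from the wrapper as a crash.

```python
def wrapper_exit_status(app_returncode, steal_acknowledged):
    """Pick the exit status the launch wrapper reports.

    If the user signed the warning dialog before stealing the application,
    report a clean exit (0) no matter how the application actually died.
    """
    if steal_acknowledged:
        return 0
    return app_returncode


def should_log_crash(exit_status):
    """Assumed logging rule: any non-zero wrapper exit status is a crash."""
    return exit_status != 0
```

With this arrangement an acknowledged steal never reaches the crash log, so the crash data that remains reflects real failures.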

Now that false positives have been eliminated, it's possible for the first time to grab a backtrace from software that has deadlocked. We know it's frozen because the user is requesting a kill from the same workstation on which it was originally started. In the case of Evolution, when they try to kill a running instance, they now get this dialog. Once they select an option from the UI below, it finds the 'evolution' and 'evolution-data-server' processes, grabs a backtrace, and then closes them down. Whereas previously we only had backtraces of crashes, now we have backtraces of deadlocks. This will allow me to look for locking trends and work with Novell on improving the software.
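
Grabbing a backtrace from a process that is still running but frozen can be done by attaching gdb in batch mode. The sketch below is one way to do it and is not the production script: the function names are hypothetical, and it assumes pgrep and gdb are installed and that the caller is allowed to attach to the user's processes. The pattern "evolution" is matched against process names, so it also picks up evolution-data-server (and any other process with "evolution" in its name).

```python
import subprocess


def parse_pids(pgrep_output):
    """Turn pgrep's one-PID-per-line output into a list of ints."""
    return [int(token) for token in pgrep_output.split()]


def find_pids(user, pattern="evolution"):
    """Find the user's processes whose name matches the pattern."""
    result = subprocess.run(
        ["pgrep", "-u", user, pattern], capture_output=True, text=True
    )
    return parse_pids(result.stdout)  # empty list when nothing matches


def gdb_backtrace_command(pid):
    """Build a gdb command that attaches, dumps every thread's stack, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(user):
    """Return a {pid: backtrace text} mapping for the user's matching processes."""
    traces = {}
    for pid in find_pids(user):
        result = subprocess.run(
            gdb_backtrace_command(pid), capture_output=True, text=True
        )
        traces[pid] = result.stdout
    return traces
```

Something like dump_backtraces("someuser") would run before the processes are closed down, since the stacks are only useful while the deadlocked threads still exist.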


And if you are interested, here is the code that runs prior to starting the Evolution process. When they click on the Evolution icon, it always first tries to raise Evolution to the front of the window stack with wmctrl. It then looks to see if it's already running. If it is, it determines whether it's running on their current workstation or on another workstation. If it's the current workstation, it gives them the dialog to pick the last thing they did before the deadlock; from this dialog they can of course cancel and leave it alone. If they terminate Evolution, it finds their two processes and uses gdb to non-interactively dump a backtrace. If instead it's running on another workstation, it warns them that shutting down software is dangerous, and if they go ahead it shuts the software down and makes no attempt to create crash or deadlock backtraces...they are on their own.
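
The decision flow described above can be summarized in a few lines. This is a simplified sketch rather than the script itself: the names are hypothetical, and it assumes the launcher already knows which workstation (if any) the user's Evolution is displayed on, which the post does not explain.

```python
import subprocess

# Possible outcomes of clicking the Evolution icon.
LAUNCH = "launch"
DEADLOCK_DIALOG = "deadlock_dialog"
STEAL_WARNING = "steal_warning"


def raise_window(title="Evolution"):
    """Ask the window manager to activate a window whose title matches.

    The result is ignored because the launcher goes on to check for a
    running instance either way.
    """
    subprocess.run(["wmctrl", "-a", title])


def decide_action(running_on, current_workstation):
    """Choose what to do next.

    running_on is None when no instance exists; otherwise it is the name
    of the workstation the existing instance is displayed on.
    """
    if running_on is None:
        return LAUNCH
    if running_on == current_workstation:
        # Same seat: the user is probably looking at a frozen window.
        return DEADLOCK_DIALOG
    # Different seat: this is a 'steal', so warn and skip the backtrace.
    return STEAL_WARNING
```

Keeping the decision in a function with no side effects makes the three branches easy to check without a running desktop.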


I'm excited about the quality of data that we will be obtaining now, and this same code will be moved fully into LibreOffice once it's deployed next week.  We're getting complete crash and deadlock backtraces from hundreds of users without them having to mess with bug-buddy; everything is fully automated.

6 comments:

Anonymous said...

Have you considered running this software via something like Xpra, and then switching it to the new workstation transparently without killing it? Or, alternatively, just opening a VNC session to the old workstation or similar?

Dave Richards said...

@anonymous: Indeed we have; that feature is fully supported in NX. The problem is explaining "resolution" to users. Moving from a dual or triple screen to a single screen. Even moving 1440x768 to 1024x768. We perceived this to be support intensive. If all of our monitors were the same resolution, it might be doable. We also like to use native X as our transport. VNC/NX/Citrix look like "pictures" and have artifacts/blinks and aren't as crisp and fast as native X on our network. So yes, we talk about that option but it is not a good fit for us at this time.

Anonymous said...

If you're using native X as your transport, then applications *should* transparently adapt to a change in resolution, the same way they adapt to a local use of RandR to change the resolution. Do they fail to do so?

Anonymous said...

'You appear to be requesting the server terminate ...'

Correct English, this is not.

Dave Richards said...

@all: I appreciate all the comments and will check out Xpra a bit more and experiment. There are a lot of variables at play:
- Time
- Testing all types of window sizes and what happens when they resize
- Some windows launch with fixed size containers and they have to be tested to see how/if the scroll bars work.
- Some applications detect your monitor size when they launch and change settings at launch time, those would all have to be tested.
- Some applications detect your screen size and only launch if a certain monitor resolution is available, because they do not fit in lower resolution.
- Many users will have problems with windows that re-stack and are merged together into a smaller area. We have some people that only open one application at a time because they don't know how to move windows.
- NX is in the mix, we use X for our fibered buildings, and NX at remote sites, so this would come into play as well. NX would have to drop when you connected over X, which would take some testing.
- Support issues with our support staff trying to figure out various problems that might arise as they move through resolutions multiple times.

So until now we have kept it simple: You can log into the two desktop servers once for a total of two concurrent sessions. A few applications just won't run concurrently.

I will reread the dialogs and check grammar and English. :)

Anonymous said...

That's a great way to make it look much more professional without doing a lot of work.

It also would help if you kept some margin between the text and the window borders. I've noticed that in a lot of your screenshots the text almost falls off the window.