Thursday, February 18, 2010

It's never Virus.DOS.Lupus.532

Had a few moments of triumph in diagnosing and fixing some obscure computer problems the last few days that I felt like celebrating here. If you don't want to be geeked out at, skip this post.

The Unreported Symptom: One of the cash registers in an agency was malfunctioning in a moderately common way: it lost some of its CMOS settings. This happens once in a while and it's not something we can fix remotely or talk the customer through fixing, so someone has to go there -- ideally not me, since the people in Retail Operations who are the first point of contact can handle this. Some agencies have it happen more often than others; in rare cases it can be written off as a random belch of fate, but when it happens more often, one suspects a failing CMOS battery, or dirty power. And this agency was one where it had happened a lot.

To fix it, you reboot the register, press a particular key to get into BIOS settings, and redo any missing settings. Normally, you're asked for a password first; this is to prevent the customer from messing the settings up. In this case, there was no password prompt; and then, when our man on the scene tried to change the settings, they wouldn't change.

At first we thought he was getting in via some kind of 'read only' mode by doing the password prompt wrong. It's hard to diagnose things remotely, going only on the report of a person on the scene who might not be describing everything accurately. Well, let's be honest: who is never really describing the problem accurately and completely.

There are a lot of other issues that can make it hard to change the particular settings that needed changing. We needed to change the IRQ that the modem and network cards used, but you can't change an IRQ on one device to something used on another device, or reserved; so sometimes you have to go through several steps, changing one set of IRQs to free up others. You may even have to do this across multiple reboots, and it can be hard to tell which is the simplest path to the desired configuration. So sometimes, if you try to change an IRQ and it won't change, all that tells you is that no other IRQ is available for that device.
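For the curious, the "simplest path" puzzle can be treated as a shortest-path search. This is only a minimal sketch under simplified assumptions of my own: each step moves one device to an IRQ that is currently free and not reserved, and reboots are not modeled. The function name, device names, and IRQ numbers are made up for illustration; a real BIOS has more constraints than this.

```python
from collections import deque


def plan_irq_moves(current, desired, available):
    """Find the shortest list of single-device IRQ changes.

    current, desired: dicts mapping device name -> IRQ number.
    available: set of IRQs devices may use (reserved ones left out).
    Returns a list of (device, old_irq, new_irq) steps, or None if
    the desired layout cannot be reached.
    """
    devices = sorted(current)
    start = tuple(current[d] for d in devices)
    goal = tuple(desired[d] for d in devices)

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, moves = queue.popleft()
        if state == goal:
            return moves
        in_use = set(state)
        for i, dev in enumerate(devices):
            # A device can only move to an IRQ nobody else holds.
            for irq in sorted(available - in_use):
                nxt = state[:i] + (irq,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, moves + [(dev, state[i], irq)]))
    return None


if __name__ == "__main__":
    # Swapping two devices needs a spare IRQ as a parking spot.
    steps = plan_irq_moves(
        {"modem": 5, "network": 3},
        {"modem": 3, "network": 5},
        {3, 5, 10},
    )
    for dev, old, new in steps:
        print(f"{dev}: IRQ {old} -> IRQ {new}")
```

With a spare IRQ the swap takes three steps; with no spare, the search returns None, which is the "it just won't change" situation described above.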

Talking through it on the phone for a while, and finding it impossible to change any of the relevant settings -- they all just stubbornly refused to change -- I started casting about. Sometimes when I'm stuck I just try to work on something else, so I decided to check to see if the system had lost all its CMOS settings by checking another thing. Turns out it had, so I talked him through changing those settings, and they changed -- thus confirming that we were able to change settings generally, which pointed to the more complicated IRQ conflict possibilities for the other problem. (This, by the way, is why it didn't ask for a password; that's another setting it lost.)

We went back to the IRQ problem and tried again, but still no change. In fact, we couldn't even make pointless but harmless changes. Somewhere around there, I had that epiphany that you'll recognize as coming about midway through the last act of any House episode. I had him go back to the first thing we tried to change, and told him, instead of using the spacebar to change it (which cycles through settings), to use the keypad minus key (which cycles in the reverse order). Sure enough, the setting changed. The keypad plus key (which does the same as the spacebar) also worked. The real problem all along had been that the spacebar was broken.

And the customer never noticed because in the course of a regular day, they won't even use the space bar. They would only use it if doing a search for a product or customer by name, or when doing things like month-end reconciliation, breakage reporting, or special orders, and not even often then. It's the classic unreported symptom (because it didn't seem either important or relevant).

Using the numeric keypad we were able to get the settings fixed and the modem working again in no time, and to arrange to have the keyboard cleaned and/or replaced.

How Would You Like To Not Run That?: Most of the things we do on our Unix system are done from the command line, so there's no reason to go to the computer itself when you can just telnet. But the X Window programs that run in the GUI, CDE, require you to go into the server room and use the main console. About the only thing we do there these days is use the CDE account manager, dxaccounts, because it's a lot easier to unlock accounts and reset passwords there than from the command line, since it's all in one place. (This is only true since we changed to C2 Security, which happened recently.)

However, the version of CDE that runs on this version of Digital Unix has the bad habit of, if left sitting for too long (by which I mean months), getting wedged, so that you can click on icons to run programs and nothing happens. This is no big deal. We don't use it very often (or else how would it manage to get wedged that way?) and when this happens all you have to do is log out and back in; it affects nothing other than that rarely-used console login. So I never bothered to see if it can be patched or fixed. It's literally something I use a few times a year.

Sometimes when you run a program that requires root access, it'll ask you if you want to run it as root, or as who you're logged in as... which, on this console, is root. So it's a pretty dumb question. Two buttons saying "run as root" and one saying "cancel".

After a recent reboot done to try to clean up another problem (as yet unfixed), we found ourselves unable to run the Account Manager program. It would throw up the "run as root" question, you'd click either of the "run as root" buttons, and then... nothing. No error message, no task in the process list, no anything. It was a real dead end; there was no way to see why it wasn't running, it just didn't. Logging off and back on didn't help at all, unusually.

With a user clamoring to get online, and several other crises going on, I just turned to doing what needed doing from the command line, and left the problem to sit for a day. When I got back to it, I was staring at the brick wall of having no clues to go on. The program simply didn't run and didn't say why. I had tried deleting and recreating the icon, running it from other places, monitoring the process list while it failed to run, and nothing came up.

Generally speaking you can't run GUI programs from the command line, at least not from a plain telnet session with no display to draw on, but it is possible to do so from a terminal window inside the CDE environment, so my flash of insight was to try that. First, I had to look at the "run as root" message to find out what program it was running, since I had no way to tell by looking at the icon, but fortunately that message shows the full pathname of the program. Then I ran the program from that terminal window, and, lo and behold, an error message came up!
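The general trick, launching the program yourself so its complaints have somewhere to go, can be sketched like this. It assumes the program writes its error to standard error, which is an assumption on my part, and the helper name is made up.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a program directly and hand back its exit code and stderr text.

    A desktop launcher may discard this output; running the program
    ourselves lets us see it.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stderr.strip()


if __name__ == "__main__":
    # Stand-in for a program that fails silently when run from an icon.
    code, err = run_and_report(
        [sys.executable, "-c", "import sys; sys.exit('something went wrong')"]
    )
    print(f"exit code {code}: {err}")
```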

It wasn't very helpful in itself, and in fact, it was downright inaccurate and deceptive, but it was still the chink in the wall that made everything else follow. Searching on the error message online yielded a number of leads, which when further narrowed down led to the problem. A previous run of the Account Manager had closed incorrectly, failing to delete a lock file (/etc/.AM_is_running), and each new run saw the file and refused to open, assuming another copy was running. Since this lock takes the form of a file, and that file survives reboots, the problem can even last through a reboot. Terrible design, huh? Deleting the file immediately resolved the problem.
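One common way to make a file lock like this less fragile is to write the owner's process ID into the file, so a later run can check whether that process still exists. I have no idea how the Account Manager actually implements its lock; the following is just a minimal sketch of the idea for POSIX systems, with made-up function names and a throwaway path rather than the real lock file. A production version would also have to close the small window between creating the file and writing the PID into it.

```python
import os
import tempfile


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks; it sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to someone else
    return True


def acquire_lock(path):
    """Return True if we took the lock, False if a live process holds it."""
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                with open(path) as f:
                    owner = int(f.read().strip())
            except (ValueError, FileNotFoundError):
                owner = None  # unreadable or vanished: treat as stale
            if owner is not None and pid_is_alive(owner):
                return False
            try:
                os.remove(path)  # stale lock left by a crashed run
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return True


def release_lock(path):
    """Remove the lock file if it is still there."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


if __name__ == "__main__":
    lock = os.path.join(tempfile.mkdtemp(), "demo.lock")
    print(acquire_lock(lock))  # first run takes the lock
    print(acquire_lock(lock))  # refused while the owner is alive
    release_lock(lock)
```

With something like this, a lock left behind by a crash gets cleaned up on the next run instead of blocking it forever.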

Sorry for the self-aggrandizement. Some days I need to revel in my victories so as not to feel overwhelmed by my defeats.
