Maddening server problem… Resolved!!!
Starting about 4-5 days ago, one of our web servers began experiencing nightmarish performance issues. This was such a strange problem to deal with, and I wasn’t able to Google anyone having the same issue, so I thought I had better document it in the hope that it may help someone in the future.
Our ColdFusion MX 6.1 server would run for a while, then suddenly requests would just stop returning results at all. Occasionally a user would finally get a response in the browser that was a ‘<’ sign, followed by varying strings of seemingly random characters. The server would hang like this, failing to answer any further requests until the services were restarted. This system has some obvious suspect trouble points. I will list a few:
- About 120 remote (and I mean very remote) DBs, and 1 DB on the same subnet. Those remote databases are housed in various data centers and client sites.
- Of the remote DBs, over half are using an ODBC Socket to Sybase (Yes, JConnect is a much better idea and is in the works).
- Of the remote DBs, a handful of them are using VPN connections.
- A proprietary framework has been used on a couple of the applications on the server that has a number of examples of questionable coding practice in it.
- A new company with 40,000 users was recently migrated to the proprietary framework application.
- Client variables were being used and were being stored in the registry.
I viewed all of the above as places that could potentially have some type of effect on the problem we were experiencing.
Any guesses so far?
Well, for starters, we had Fusion Reactor on the system. We had it set to notify us when a request took longer than 30 seconds, and to kill that request gracefully. There was no consistency whatsoever to the pages in the requests that it was notifying us about. Even simple ‘welcome’ pages would be in that list. It didn’t appear to be helping us with this problem, so to isolate the issue I removed it from the server.
We finally got on the phone with Macromedia…errr.. Adobe and dealt with Ted Zimmerman (and later Swathi C.), who went out of his way to help us and even stayed well after he was supposed to have left for the day. One of the first things we did was to change the Client Variables setting in CF Admin so that ColdFusion stored client variables in a DB rather than the registry. Also, since we have no real need to persist client variables across multiple sessions, we set ColdFusion to store them for 1 day as opposed to the 90 day default. This didn’t seem to fix anything, and if anything our problem just seemed to get worse and worse. Instead of staying up for 10-20 minutes after a restart, the server seemed to be bombing more and more rapidly by the minute.
With Ted’s assistance, I learned a great new technique for viewing threads at a very low level, tracking process IDs over a period of time, and seeing whether they had hung at a particular memory address. We did this by running ColdFusion from the console and sending the output to a text file. That is a very interesting and informative exercise!
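For anyone who wants to try the same kind of comparison, here is a rough sketch of the idea in Python. It is not what Ted and I actually ran; it assumes the dumps look like typical JVM thread dumps (a quoted thread name with a tid=0x... value, followed by "at ..." stack frames), and the sample dump text, thread names, and function names are made up purely for illustration. It flags threads whose stack is identical in two dumps taken some time apart. An unchanged stack is only a hint, since an idle thread will look the same way.

```python
import re

# Assumed header format: "thread-name" ... tid=0x... (typical JVM thread dump)
HEADER = re.compile(r'^"(?P<name>[^"]+)".*\btid=(?P<tid>0x[0-9a-fA-F]+)')

def parse_dump(text):
    """Map each thread id to (thread name, tuple of stack frames)."""
    threads = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        header = HEADER.match(stripped)
        if header:
            current = header.group("tid")
            threads[current] = (header.group("name"), [])
        elif current and stripped.startswith("at "):
            threads[current][1].append(stripped)
        elif not stripped:
            current = None  # a blank line ends the current thread's block
    return {tid: (name, tuple(frames)) for tid, (name, frames) in threads.items()}

def stuck_threads(earlier, later):
    """Names of threads whose non-empty stack is identical in both dumps."""
    a, b = parse_dump(earlier), parse_dump(later)
    return sorted(name for tid, (name, frames) in a.items()
                  if frames and b.get(tid) == (name, frames))

# Made-up sample dumps, taken "a few minutes apart"
DUMP_1 = '''
"jrpp-12" prio=5 tid=0x0a3b4c10 nid=0x9a4 runnable
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)

"jrpp-13" prio=5 tid=0x0a3b5d20 nid=0x9a8 runnable
    at java.lang.String.indexOf(String.java:1234)
'''
DUMP_2 = '''
"jrpp-12" prio=5 tid=0x0a3b4c10 nid=0x9a4 runnable
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)

"jrpp-13" prio=5 tid=0x0a3b5d20 nid=0x9a8 waiting on condition
    at java.lang.Object.wait(Native Method)
'''

print(stuck_threads(DUMP_1, DUMP_2))  # ['jrpp-12']
```

In this made-up sample, the thread sitting in a socket read in both dumps is the one worth a closer look, which is roughly how the page talking to a remote database kept catching our eye.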
We kept finding a process that was talking to a remote database that seemed to be just hanging indefinitely. I did some better trapping on that page by not only catching database exceptions and logging them, but by adding in a timeout to the query, which is something I have not typically done in my code. I found that by doing this I no longer saw that particular page in the metrics information as a trouble maker, but our greater problem still existed. Now, in our thread stack dumps, another page seemed to show up pretty frequently. I put the same traps around that process and still remained in the same boat. That said, even though we didn’t fix the problem with that, we did fix a couple of issues that were definitely in need of attention.
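The pattern I mean (put a time limit on the query, catch the database exception, log it, and move on) looks roughly like the sketch below. To be clear, this is a stand-in written in Python against an in-memory SQLite database so that it can run anywhere; my real pages were ColdFusion talking to remote databases, and the function name and the 5 second default are just assumptions for the example.

```python
import logging
import sqlite3
import time

log = logging.getLogger("query_guard")

def run_with_timeout(conn, sql, params=(), timeout_s=5.0):
    """Run sql and return its rows, or None if it failed or ran too long."""
    deadline = time.monotonic() + timeout_s
    # SQLite calls this handler every 1000 virtual-machine instructions;
    # returning a non-zero value aborts the running query.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000)
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.Error as exc:
        # Log instead of letting the request hang or blow up
        log.error("query failed or timed out (limit %.1fs): %s (%s)",
                  timeout_s, sql, exc)
        return None
    finally:
        conn.set_progress_handler(None, 1000)  # clear the handler again

conn = sqlite3.connect(":memory:")

# A quick query comes back normally
print(run_with_timeout(conn, "SELECT 1"))

# A deliberately slow query is cut off after 0.2 seconds and logged
slow_sql = ("WITH RECURSIVE c(x) AS "
            "(SELECT 1 UNION ALL SELECT x + 1 FROM c WHERE x < 1000000000) "
            "SELECT count(*) FROM c")
print(run_with_timeout(conn, slow_sql, timeout_s=0.2))
```

The point is the same as on my pages: a slow database costs you one logged error and one failed request rather than a thread that hangs indefinitely.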
So, any guesses yet?
I need to preface this next section by stating that, because of the way this company is structured and because the systems store extremely sensitive data related to both loan origination and loan servicing, I do not have any direct access to the production system under normal circumstances, and even in this circumstance, I have no access to the database server at all. (Ironic when you consider the power of code and how counterproductive this approach can be, but I will save that for another discussion).
That said, in the middle of all of this work, we received an email from our network center saying that the SQL Server had reported an error notification of less than 1% of memory available. I assumed that could have just been a weird spike, but nonetheless our tech group looked into it immediately. We found that our SQL Server was showing that it had 16MB of available physical memory. As if that wasn’t a big enough “Uh Oh!”, we were told that it only had 300MB of space available on the C drive which (against my wishes) is where the virtual memory resides.
Upon learning this, our network tech immediately robbed memory from an internal development server and drove down to our NOC to upgrade the memory. Immediately the system went back to normal and has been blazing ever since!
It was interesting thinking back to when we altered the client variables setting. The effect of the change wasn’t obvious at the time, but in hindsight, it definitely exacerbated the problem, although it was clearly the right thing to do.
So, now… 18 hours after the memory upgrade, things are still rocking!
