This document is available on the Internet at: http://urbanmainframe.com/folders/blog/20050327/folders/blog/20050327/
Oh boy, the Urban Mainframe server migration certainly didn't go according to plan. In theory it should have been a simple exercise: copy the MySQL database, CMS and web-server file-system from an aging and under-performing Linux 7.3 server to a new RedHat Enterprise Linux 3 server built on high-end hardware; keep the old server up during the DNS propagation period; shut down the old server once we're sure that all inbound requests are being routed to the new machine.
The plan was to try to perform the migration without incurring any downtime - but my plan was fatally flawed. If only I had just accepted that some downtime was inevitable, if only I hadn't tried to be too bloody clever, if only...
Moving the Urban Mainframe from one server to another dictated a new IP address for the domain. As a consequence of this, the domain's DNS record would have to be updated to reflect the new IP address. Both are relatively trivial undertakings in themselves, but there is one big consequence of the latter: DNS changes are not immediate. They take time to propagate to the Internet's name servers - up to 48 hours or more in some cases.
This meant that, for around 48 hours, some inbound requests to the Urban Mainframe would hit the old server and some would hit the new one. Immediately following the DNS change, most requests would hit the former. But during the course of the subsequent 48 hours the bulk of the requests would begin to shift until, with propagation complete, all requests would be handled by the new box.
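One way to watch that shift is to ask a few name servers what they currently return for the domain and see which box each answer points at. The following is only a sketch: both IP addresses are placeholders from the documentation ranges (not the real servers), the "which_server" helper is a name invented for this illustration, and RESOLVER stands for whichever name server you want to query.

```shell
# Placeholder addresses (documentation ranges), not the real servers.
OLD_IP="192.0.2.10"
NEW_IP="198.51.100.20"

# which_server IP
# Reports which machine a returned A record points at.
which_server() {
    case $1 in
        "$OLD_IP") echo old ;;
        "$NEW_IP") echo new ;;
        *)         echo unknown ;;
    esac
}

# Hypothetical usage: ask one specific resolver what it returns right now.
#   which_server "$(dig +short urbanmainframe.com A @RESOLVER | head -n 1)"
```

Run against several resolvers over a day or two, the answers should drift from "old" to "new" as the cached records expire.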
For a static website, two servers can run in parallel during a DNS change propagation and the end user is generally none the wiser. But the Urban Mainframe, like many of today's websites, isn't static.
The Urban Mainframe doesn't consist of a collection of HTML files. In fact there are fewer than 10 such files in the whole website - a site that, at the time of writing, consists of around 200 individual pages (mostly weblog entries) and countless numbers of pages that are generated as a result of special queries such as searches and indices. These pages don't exist as files in the web-server's file-system; they are instead generated at request time by a CMS (Content Management System), which merges content from a SQL database with HTML templates to produce the output that the web-server then passes on to your browser.
Therein lies the problem. Two web-servers in two different physical locations dictate two instances of the back-end database and, with both servers able to handle requests over a 48-hour time period, there is a strong possibility that these databases would go out of sync. For example, if a user posts a comment while DNS is directing her to the old server, that comment would not appear to a user who was directed to the new one and, for various reasons, is unlikely to ever make it into the new database.
I considered the database synchronisation problem long and hard before I started the migration process. There were a couple of possible solutions but, sadly, I chose entirely the wrong one.
What I wanted to do was to have one database serve both web-servers. If I could do that, then there wouldn't be a synchronisation issue. So I configured the new web-server to read and write to the database of the old one.
Yes, you read that correctly. My heavy mod_perl server was configured to build pages, on a per-request basis, with data requested over the Internet from a database server in a remote location! Everything seemed fine as the first few requests started to hit the new server but, after just a couple of hours and with more requests being routed through to it with every passing minute, both machines ground to a halt - taking around 20 live websites down with them.
Had my two web-servers been in the same physical location, I could have connected them and the database server to the same switch and this plan would have worked perfectly. But it was never going to work with the inherent latency of the Internet and, whilst it's easy to have wisdom retrospectively, I have no idea what possessed me when I configured things this way.
The proper solution is probably obvious to the majority of server administrators. I should have simply redirected any requests made of the old server to the new one. A single, simple mod_rewrite rule on the old server would have done just that. All I needed to do was to redirect to the IP address of the new server and everything should have worked.
To illustrate this, let's imagine we make a request for http://urbanmainframe.com/assets/components/alien.htm. This request comes into the old server where it is handled by a "RewriteRule" designed to forward the request to the new server:
RewriteRule ^(.*)$ http://126.96.36.199/urbanmainframe.com/$1
The new server (at IP address: 126.96.36.199) gets the path of whatever page has been requested, prefixed with "urbanmainframe.com", then tries to retrieve that page from the server's web-root.
Thus the URL of the requested page on the new server is: http://126.96.36.199/urbanmainframe.com/assets/components/alien.htm.
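The mapping described above can be tried out without touching Apache at all. This is a rough shell sketch of the same substitution, not mod_rewrite itself: the "rewrite_target" function is a name made up for this illustration, and NEW_SERVER_IP is a placeholder from the documentation address range rather than the address of the real server.

```shell
# Sketch only: mirrors the path-to-URL mapping described in the text.
# NEW_SERVER_IP is a placeholder, not the real address of the new server.
NEW_SERVER_IP="198.51.100.20"

# rewrite_target DOMAIN PATH
# Prints the URL that the old server would send the request on to.
rewrite_target() {
    domain=$1
    path=${2#/}    # drop the leading slash, if there is one
    printf 'http://%s/%s/%s\n' "$NEW_SERVER_IP" "$domain" "$path"
}

rewrite_target urbanmainframe.com /assets/components/alien.htm
```

The domain name ends up as the first path segment, which is why the new server would need its own rules to turn that prefix back into the right virtual host.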
As it happens, I tried to do just that. But there was a further complication that stymied my efforts. My URLs don't map to physical files on the web-server (see above). I would have had to make considerable changes to my Apache configuration and mod_rewrite rules on the new server in order to correctly service requests - then I would have had to change them again when DNS propagation was complete in order to return to normal operations and I would have had to do that for every website hosted on that server (I'd explain further, but it's complicated). That wasn't an endeavour I wanted to undertake, hence my dismissal of that route.
But, of course, there was a way to successfully (and easily) implement the redirects. Here's what I believe I should have done, for each website hosted on the server:
At this point, the new server would be almost ready to go live. However, I would now have to duplicate the database from the old server to the new one. I believe I'd have to take the old (live) server down at this point, although it wouldn't be offline for too long. There are two reasons why the server would need to be taken offline: 1) I would need to apply a new "RewriteRule" to each of the Virtual Host configuration files and 2) it's imperative that nothing is written to the database while it is in this transitional state.
The new "RewriteRule" would look something like this:
RewriteRule ^(.*)$ http://transitional.urbanmainframe.com/$1
(Obviously the "urbanmainframe.com" would be substituted with the relevant domain name of each Virtual Host.)
Copying the database from one server to the other is easy enough. The following command would be performed on the source server (configuring the relevant database permissions is left as an exercise for the reader):
mysqldump --all-databases --opt --complete-insert --compress --quote-names --user=xxxxxxxx --password=xxxxxxxx | mysql -h 220.127.116.11 --user=xxxxxxxx --password=xxxxxxxx
Essentially, the above command takes the output of "mysqldump" from the source server and pipes it to the MySQL server on host ("-h") 220.127.116.11.
I would now be able to shut down the source database (it would be considered deprecated at this point) and start Apache on the new server. I should then be able to point a web browser at transitional.urbanmainframe.com and, if everything worked, I could then restart the old Apache server. Any requests to urbanmainframe.com that arrive at the old server should then be redirected to transitional.urbanmainframe.com (new server).
Only then, assuming everything was working as expected, should I have performed the DNS changes to the principal domains. Since all users would already be directed to the new server it wouldn't matter how long DNS propagation took. Some users would find themselves at transitional.urbanmainframe.com and some would be at urbanmainframe.com - but both sets of visitors would be served by the same server, with the same database, regardless of the domain name they arrive at. Downtime would be minimal and the impact on visitors negligible.
NOTE: Of course, not having performed the migration in this manner, the above procedure is unproven (at least to myself). If you can see any errors in my reasoning, or can suggest another procedure I could have used, then please let me know and I'll revise this document accordingly.
Once I'd got over the DNS problem, I found I had a whole new set of issues to contend with: version differences.
With the new server, I decided to upgrade my underlying platform with the newest versions of all software:
There are subtle differences between the old and new versions of all three of these critical applications. Most annoying of all was the fact that MySQL 4.xx returns dates (timestamps, etc.) in a totally different format to MySQL 3.23, which broke most of my CMS!
There was nothing else for it, I had to get my hands dirty and patch the CMS accordingly.
The (hopefully) final problem was the big one - the one that caused me the most stress, confusion and consternation.
When I finally got the new server up and running, I couldn't get the server to a stable state. Much to my chagrin, the machine proved to be incredibly flaky and extremely unpredictable. Following a reboot, it would stay up sometimes for a couple of days, sometimes for only a few hours (the following graph from Netcraft illustrates recent downtime, including that which precipitated the migration in the first place).
At first I ignored it, as I worked to overcome the other problems described above. But of course, I couldn't ignore it forever. Even with DNS fully propagated and the CMS patches complete, the server continued to fail at random intervals. I considered emigrating!
Once I started to investigate, I discovered that the machine was failing with "out of memory" errors. This surprised me, as the box is loaded with a healthy gigabyte of RAM. I tried to find an obvious culprit but failed. Meanwhile, our clients began to complain with increasing anger and frequency. I consulted with the incredibly helpful "fanatical" support guys at Rackspace who suggested that a memory leak in one of my Perl applications was the most likely source of the problem - a suggestion I didn't like too much (What? My code? Buggy?)
So it was that I found myself monitoring the CMS' memory usage with "top" for two whole days (between reboots), whilst pounding the server with our benchmark and test suites. I'm happy to report that the CMS wasn't the problem. Memory usage for the individual httpd/mod_perl processes never exceeded 25MB, a profile that's consistent with the CMS' footprint on our other servers. So we weren't leaking memory. But I did notice something strange (eventually). There seemed to be an awful lot of Apache processes.
I finally realised what was happening. Apache was starting too many child processes, while never killing any of the older ones!
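Staring at "top" works, but a one-line summary makes this kind of growth much easier to spot. Here is a small sketch, assuming a Linux box where the Apache processes are named "httpd"; the "summarize_rss" function is a name invented for this illustration.

```shell
# summarize_rss
# Reads one resident-set size (in KB) per line on stdin and prints
# the number of processes and their combined size in MB.
summarize_rss() {
    awk '{ total += $1; count++ }
         END { printf "%d processes, %d MB total\n", count, total / 1024 }'
}

# Hypothetical usage:
#   ps -C httpd -o rss= | summarize_rss
```

Bear in mind that resident-set sizes count shared pages once per process, so the total overstates the real footprint. It is the trend in the process count that matters here.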
Following this epiphany, I began to sift through "httpd.conf". I couldn't find any mis-configured variables, but I did find a missing one. I had somehow failed to add a "MaxClients" setting to the configuration file.
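A missing directive is easy to overlook when reading a long configuration file by eye, so a quick check is worth having. The sketch below is deliberately naive: "has_directive" is a name invented for this illustration, the file path in the usage line is an assumption about where "httpd.conf" lives, and the check looks at one file only, ignoring included files and conditional blocks.

```shell
# has_directive FILE NAME
# Succeeds if FILE contains an uncommented line that starts with the
# directive NAME. Apache directive names are case-insensitive, hence -i.
has_directive() {
    grep -Eiq "^[[:space:]]*$2[[:space:]]" "$1"
}

# Hypothetical usage (the path is an assumption):
#   has_directive /etc/httpd/conf/httpd.conf MaxClients || echo "MaxClients is not set"
```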
For those of you who aren't familiar with the inner workings of Apache, this requires some explanation. Apache is what is known as a "forking httpd server" (stop giggling at the back). That is, the server daemon (a "daemon" is a process that runs in the background) does not handle requests itself, it "forks" (or creates) child processes to do that. The parent process monitors these children and spawns and kills them as required or as dictated by the configuration files.
The parent daemon will usually have enough children to handle the current load, along with a few spares to handle bursts of activity. When the load is reduced the daemon will kill off any children that are no longer required. It will also terminate child processes when they have served a pre-configured number of requests, so as to release unused resources back to the server and protect against "runaways". All of which is configurable in the "httpd.conf" file.
On my web-servers, the child processes are configured to handle an infinite number of requests ("MaxRequestsPerChild 0" in "httpd.conf") so that I can take full advantage of the performance benefits of mod_perl (if you're really interested, read Stas Bekman's "Performance Tuning" pages in the mod_perl User Guide). Thus, under a sustained load, the server does not waste time and resources forking new httpd processes.
However, there obviously needs to be some kind of brake on the httpd daemon, else it will continue to spawn new children when the server's load increases until its resources are completely depleted. Hence we have a "MaxClients" variable - which dictates the maximum number of child processes that the daemon can start.
Still with me?
"MaxClients" can be omitted from "httpd.conf" and Apache will still run without complaint. The httpd server will simply run with its default value of "150". Which is all well and good for the average web-server. Apache can easily handle 150 busy child processes, even on unimpressive hardware. So why was the missing "MaxClients" causing me such problems?
A mod_perl server is a slightly different beast to a plain Apache server. A vanilla Apache server, delivering static HTML files with an out-of-the-box configuration, will rarely have child processes of more than 1MB each. Thus the server can handle the default 150 child processes with ease because 150 x 1MB processes require only 150MB of RAM (of course this isn't strictly true, since the OS and other processes each have their own memory requirements).
However, I've already stated that the httpd/mod_perl processes on my server run at around 25MB each (this is not unusual for a mod_perl server). The math reveals the answer: 150 x 25MB = 3,750MB (or 3.7GB). Therefore, without the throttling value of a limiting "MaxClients" variable in my "httpd.conf", Apache was responding to periods of heavy traffic by trying to spawn up to 3.7GB worth of child processes in a server furnished with just 1GB of RAM. This is obviously impossible and the server would simply spiral down into an out of memory crash whenever the load became too great.
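The same sum can be turned around to estimate how many children a box can actually afford. This is only a back-of-envelope sketch: the 25MB and 1GB figures come from the text above, while RESERVED_MB is an assumed allowance for the OS, MySQL and everything else, and both function names are invented for this illustration.

```shell
# All sizes are in MB. CHILD_MB and RAM_MB come from the figures above.
# RESERVED_MB is an assumption, not a measured value.
CHILD_MB=25
RAM_MB=1024
RESERVED_MB=200

# worst_case_mb CLIENTS
# Memory needed if CLIENTS child processes are all running at once.
worst_case_mb() {
    echo $(( $1 * CHILD_MB ))
}

# safe_max_clients
# Largest number of children that fits in the RAM left after the reserve.
safe_max_clients() {
    echo $(( (RAM_MB - RESERVED_MB) / CHILD_MB ))
}

worst_case_mb 150     # prints 3750
safe_max_clients      # prints 32
```

With that particular reserve the estimate is 32 children, but the result moves with the assumption, so it is a starting point for tuning rather than a derivation.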
Once I restarted the server with a controlled "MaxClients" value (32), stability began to return. It's been running for 72 hours now without fault. Memory usage is constant and performance is great.
Finally, after days of heartache and nights without sleep, I feel I've completed the migration procedure. So now it's back to business as usual.
UPDATED (28th March, 2005): Corrected minor grammatical and spelling errors.