Some of you might have already noticed that opensubtitles.org hasn’t been working correctly for the past couple of days. This was due to moving our servers at the colocation facility to new hardware and a new operating system version, as the old hardware was already heavily loaded and the site was a bit slow. Another big incentive for us was that our colocation partner had updated their hardware offer, so we were able to get much better boxes for the same money.
At the beginning, we ordered just one box to test whether drivers were available for all the necessary devices in the given server configuration, and also because we had been told that the other machines would be available in 2 weeks, since the colocation partner had too many new orders. The people at the colo connected the new box to the network, and I installed FreeBSD on it, configured the required services and fired it up. There were no problems with the new machine (at least we didn’t see any), maybe because it wasn’t under much load and hadn’t been thoroughly tested at that time. We left it running and ordered three more machines. Around this time, we experienced the first network problems with the new machine. The symptoms were weird: database queries from the new box were getting corrupted because network connections were being reset. We thought it was related to the hardware and asked our support to run hardware tests.
To our surprise, the colocation partner was able to provide the new machines sooner, so I installed them, moved the necessary data from the old servers, configured the services and so forth. We updated our DNS entries to point to the new machines, as we expected things to go flawlessly. But it wouldn’t be life if there were no problems 🙂
Everything was ready to be launched, so we fired up the lighttpd instances and MySQL databases and launched our services. All services were running almost perfectly (fortunately, only some minor things needed to be fixed, nothing really major). So we were pretty happy that we had successfully finished our move with no bigger issues. We went to sleep thinking that we would kick ass with these new boxes 😛
The next day, I woke up and checked whether the site was still up and running. I pointed my browser to the site and realized that it was down. I spent some minutes investigating what was going on and why it was offline. Well, guess what? The network problems we had been experiencing with the first machine were present on 2 other boxes. Interestingly, the third one was running fine (I still don’t know why). This was too bad, because DNS was already pointing to the new servers and switching it back to the old ones would of course take some time, so I decided not to bother with that. We asked the colocation support for help; they checked the cables, power-cycled their switch and so on. I spent a lot of time looking for what might be related to our problems and sent some emails to public mailing lists seeking help, but I didn’t have much luck. It seemed like there was a problem with the network adapters (onboard Realtek-based chips, oh well…). And the worst thing was that the problems appeared only from time to time. The only apparent issue was that we had pretty high packet loss.
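For anyone who wants to keep an eye on packet loss the way we ended up doing by hand, here is a minimal sketch in Python. It is not the tooling we actually used. It only assumes that you have captured the text output of `ping` (the summary line looks slightly different on FreeBSD and Linux, and both variants are handled). The function name `packet_loss_percent` is made up for this illustration.

```python
import re

# Matches the ping summary line in both common variants:
#   FreeBSD: "10 packets transmitted, 7 packets received, 30.0% packet loss"
#   Linux:   "10 packets transmitted, 7 received, 30% packet loss, time 9012ms"
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss_percent(ping_output: str) -> float:
    """Return the packet loss in percent, computed from a ping summary line."""
    match = SUMMARY_RE.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent = int(match.group(1))
    received = int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    # Compute the loss from the raw counters instead of trusting the
    # rounded percentage that ping prints itself.
    return 100.0 * (sent - received) / sent
```

Since our problems showed up only from time to time, a single run would not tell much. Something like this would have to run periodically (from cron, for example) with the results logged over a longer period.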
Today we decided to ask our colo partner if it would be possible to put some Intel PCI NICs into our boxes. The colo support did a great job and we finally had 3 of our boxes back online. The fourth one is still down, because this machine is located in a different datacenter and they currently don’t have an Intel NIC available for it (this should be resolved in the near future). To sum up, we are currently running on 2 web servers and 1 database server. The third web server should be online by the end of the weekend, at least this is what we would really like to see happen.
We hope that our network problems are now resolved and that the site will still be online when we get up tomorrow morning 🙂 Cross your fingers and hope for the best!
PS: If you are interested in further details, just ask and I will try to reply, depending on the relevance of your question.
UPDATE: The third web server is online now as well, so everything should be working fine now! Let’s rock 🙂