Leaps and Bounds Forward
Since July 21, the way we think about and plan for our future at FreeForums has shifted. We wanted to make sure that what happened that night could never happen again. At the same time, we could see other problems brewing.
Database Clusters
Due to the rate at which FreeForums was growing and the limitations of MySQL replication, we saw ourselves running into a very severe roadblock. At the time, we ran one master MySQL server and several slave servers. All writes went to the master and were then replicated to the slaves, which served all reads.
The problem we saw was that MySQL replication, on the slaves, was only capable of using a single CPU core to process updates coming from the master. The master could spread its write load across all of its cores, yet each slave had to replay those same writes on just one. Eventually that core would hit a wall at 100% usage and be unable to process updates any faster. The result: slaves would fall out of sync.
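As a rough illustration of that master and slave layout, here is a minimal Python sketch of read/write splitting. The class name, the server names, and the rule that only SELECT statements count as reads are all assumptions made for this example, not a description of our actual code.

```python
import itertools


class ReplicatedPool:
    """Routes writes to one master and spreads reads over the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Cycle through the slaves so reads are spread across them in turn.
        self._slaves = itertools.cycle(slaves)

    def server_for(self, sql):
        # Simplification: anything that is not a SELECT is treated as a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._slaves) if is_read else self.master


# Hypothetical server names, for illustration only.
pool = ReplicatedPool("db-master", ["db-slave-1", "db-slave-2"])
print(pool.server_for("SELECT * FROM posts"))           # db-slave-1
print(pool.server_for("INSERT INTO posts VALUES (1)"))  # db-master
```

Every write still has to reach every slave through replication, which is why adding slaves helps with reads but does nothing for write load.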
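To see why a saturated core turns into slaves falling out of sync, here is a small back-of-the-envelope model in Python. The function name and every number in it are made up for illustration; they are not measurements from our servers.

```python
def replication_backlog(write_rates, slave_capacity):
    """Return the number of unapplied writes left after each interval.

    write_rates: writes arriving from the master in each interval.
    slave_capacity: writes one slave core can apply per interval.
    """
    backlog = 0
    history = []
    for rate in write_rates:
        # Whatever the single core cannot apply carries over to the next interval.
        backlog = max(0, backlog + rate - slave_capacity)
        history.append(backlog)
    return history


# Illustrative numbers only: traffic peaks above what one core can replay.
print(replication_backlog([800, 1200, 1500, 900, 600], slave_capacity=1000))
# [0, 200, 700, 600, 200]
```

As long as writes arrive faster than one core can apply them, the backlog grows, and the slave only catches up once traffic drops below that ceiling again.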
Our solution to this was to break FreeForums up into multiple independent MySQL clusters. This would distribute writes over several master servers and allow us to add even more slave servers. But writing a system for this would not be easy and would take extensive development and testing.
While we were developing this new system, we began hitting the write bottleneck on our slaves. Throughout December, slaves fell out of sync whenever high traffic hit our web servers. Thanks to the holidays, we then had two weeks of lower traffic, which gave us more time to develop and test the system. Come January, that reprieve ended, and for the entire first week and a half, slaves were out of sync every day from 11AM through 7PM.
It was obvious that we were out of time, so we halted testing of the new cluster system and launched it. The tests had been going well, and we would have liked to check a few more things to make certain the system would work as expected. On the morning of January 15, while watching the Cluster-0 slaves fall out of sync at 6AM, I decided that we had to launch immediately. We picked 39 databases accounting for 20% of our daily traffic and moved them to the new cluster.
Lo and behold, the Cluster-0 slaves synced back up, and since that date we haven’t had a single slave fall out of sync. We have now finished testing and development of the clustering system and will be using it from now on. The addition of clustering was a huge step forward for us, as it will guarantee the speed and stability of FreeForums in the coming years.
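The core idea can be sketched in a few lines of Python: a lookup that says which cluster each forum database lives on, so that its reads and writes go to that cluster's own master and slaves. The class, method, and database names below are hypothetical, and the real system involves a good deal more than a dictionary.

```python
class ClusterMap:
    """Maps each forum database to the MySQL cluster that hosts it."""

    def __init__(self, default_cluster):
        self.default_cluster = default_cluster
        self._assignments = {}

    def move(self, database, cluster):
        """Record that a database now lives on another cluster."""
        self._assignments[database] = cluster

    def cluster_for(self, database):
        """Return the cluster whose master and slaves serve this database."""
        return self._assignments.get(database, self.default_cluster)


clusters = ClusterMap(default_cluster="cluster-0")
clusters.move("forum_example", "cluster-1")   # hypothetical database name
print(clusters.cluster_for("forum_example"))  # cluster-1
print(clusters.cluster_for("forum_other"))    # cluster-0
```

Because each cluster replicates only its own databases, moving a database off a cluster removes its writes from that cluster's slaves entirely.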
Web Servers
The second problem we faced was that we needed a more stable environment to serve the websites we host. We had multiple web servers, but we were load balancing them with DNS round robin. With round robin, your DNS records, which the Internet uses to look up where your website is located, list multiple IP addresses. When you type in a web address, your computer picks from that pool of IP addresses, effectively at random, and that is the server you land on. The problem is that this takes no account of how busy each server is, and there is no failover protection.
Those two issues would spell chaos as FreeForums grew. Even though we had three web servers, one of them was receiving 50% of all traffic while the other two received 25% each. And what would happen if the web server receiving 50% of traffic went down? The answer is simple: 50% of viewers wouldn’t be able to access the forums they were trying to view.
The solution to this was simple, in theory; however, it would take time to test and develop. After consulting with SoftLayer network engineers, we were directed to look at the Zeus ZXTM Load Balancer. Zeus ZXTM is a software suite that distributes requests for various TCP and UDP services based on specifications that you set for it.
What Zeus ZXTM gives us is a level of stability and redundancy on our web servers that FreeForums.org and its users have never seen before. By sending all requests to the Zeus ZXTM load balancing server, we are able to evenly distribute requests over all web servers and redirect requests when a web server goes offline. This means a failed web server should never be noticed by our viewers again.
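The failover half of the problem is easy to demonstrate with a toy simulation in Python. The addresses come from the range reserved for documentation, the function names are invented for this example, and a uniform random pick is a simplification that does not reproduce the uneven 50/25/25 split described above.

```python
import random

# Hypothetical addresses from the documentation range (RFC 5737).
POOL = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]


def dns_round_robin(pool, rng):
    """Pick an address the way a client might: blindly, with no health check."""
    return rng.choice(pool)


def failed_requests(pool, down, requests, seed=0):
    """Count requests that land on a server that is down."""
    rng = random.Random(seed)
    return sum(1 for _ in range(requests) if dns_round_robin(pool, rng) in down)


# With one server down, its share of visitors still gets sent to it.
print(failed_requests(POOL, down={"192.0.2.10"}, requests=1000))
```

Nothing in the selection step knows whether a server is alive, so visitors keep being handed the dead address until someone changes the DNS records.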
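To make the difference from round robin concrete, here is a minimal Python sketch of a balancer that tracks which servers are healthy and sends each request to the least busy one. This is not how Zeus ZXTM is implemented or configured; the class, its methods, and the least-connections policy are assumptions chosen purely for illustration.

```python
class LoadBalancer:
    """Sends each request to the healthy backend with the fewest active requests."""

    def __init__(self, backends):
        self.active = {name: 0 for name in backends}
        self.healthy = set(backends)

    def mark_down(self, name):
        """Stop routing to a backend, e.g. after a failed health check."""
        self.healthy.discard(name)

    def mark_up(self, name):
        """Resume routing to a known backend once it has recovered."""
        if name in self.active:
            self.healthy.add(name)

    def route(self):
        """Choose a backend for a new request and count it as active."""
        candidates = [b for b in self.active if b in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends")
        target = min(candidates, key=lambda b: self.active[b])
        self.active[target] += 1
        return target

    def finish(self, name):
        """Call when a request completes so the backend's load drops."""
        self.active[name] -= 1


lb = LoadBalancer(["web1", "web2", "web3"])  # hypothetical server names
lb.mark_down("web1")
print([lb.route() for _ in range(4)])  # ['web2', 'web3', 'web2', 'web3']
```

Because the balancer sits in front of every request, a server that goes offline simply stops receiving traffic, and the remaining servers share the load evenly.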
As of January 26, all viewers are accessing FreeForums.org by way of the Zeus ZXTM load balancer, and we couldn’t be happier.
The Future
FreeForums.org is well on its way to making all points of service HA (high availability) by making every single service redundant with backup servers, but we still have a ways to go before we achieve this.
First, the global MySQL servers that house the data shared between all clusters are still set up in a master+slave arrangement, with a slave keeping a mirrored copy of the data. Should the master go down, a human will need to convert the slave to a master server. The solution is to use MySQL Cluster, which allows all servers in the mix to act as masters. Once this is done, if the primary server ever goes down, the others can take over immediately without anyone needing to intervene.
Next, the file storage server is currently standalone. There is no failover, there is no mirror. Just backups being made every fifteen minutes. But, aha!, yet another easy fix. We will be looking into using the Lustre File System as a means to create several file servers which act as one unified server. Should a server ever fail, go offline, or do anything that makes it unavailable, all requests to that server will be directed to the others. No one would ever know the difference.
Lustre is developed by Sun Microsystems, the same company behind ZFS, a very powerful and robust filesystem. Lustre is currently in use by several of the world’s largest supercomputers, hosting hundreds of millions of files totaling petabytes of data. So we’re confident that it can handle our humble hundreds of thousands of files.
Many more changes are on the way, and things are sure to be very exciting as we move forward.







March 4th, 2008 at 3:03 pm
That sounds all terribly complicated. I’ll take your word for it.