The Amazon outage shows that sh1t eventually happens, so this post is about some of the ways to prepare for when it does happen.
First some terms to define: Redundancy, Resilience, N+1, Failover, High Availability, Quality of Service (QoS) and Disaster Recovery.
Redundancy means having something you don’t need until something goes wrong. Like a fireman or a lifeguard, redundant servers are there for when the servers that meet your capacity, performance and other requirements fail. Like a fireman, they seem costly to management and are sometimes cut when times are tough, but they are essential when things go wrong. N+1 redundancy is the usual form: if N servers are needed to carry the load, you run one more, so normal operation continues when the inevitable server failure happens.
SharePoint 2010 farms can have redundant Web Front Ends (WFEs), Application Servers (running Service Applications) and SQL Servers. Those farms also need redundant network, power and other infrastructure to keep going. Resilience means being able to cope with lots of different kinds of threats.
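To make the N+1 idea concrete, here is a minimal sketch of the sizing arithmetic. The function name and the load figures are made up for illustration; real sizing would come from your own capacity testing.

```python
import math

def servers_needed(peak_load: float, per_server_capacity: float, spares: int = 1) -> int:
    """Return N + spares, where N is the minimum server count for the peak load.

    All inputs are hypothetical planning figures, not SharePoint guidance.
    """
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    n = math.ceil(peak_load / per_server_capacity)  # N: just enough to carry the load
    return n + spares                               # N+1 when spares == 1

# Hypothetical: 900 requests/s at peak, each WFE handles about 400 requests/s.
print(servers_needed(900, 400))  # N is 3, so N+1 is 4
```

With these numbers, losing any one of the four servers still leaves enough capacity for the peak.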
High Availability (HA) with SharePoint 2010 is a goal usually enforced with Service Level Agreements (SLAs), which are contracts between the customer and the provider of the data centre or the Cloud/Infrastructure as a Service (IaaS). The SLA will include a metric that states how much “down-time” the customer can bear in a year, which could range from seconds to days. Microsoft measures this in “nines” (99.9% uptime is three nines); others use Platinum, Gold, Silver and Bronze service levels. Quality of Service (QoS) is similar in meaning.
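The 9s metric is easier to argue about once it is turned into minutes. This is a rough sketch of the conversion, assuming a 365-day year and counting all downtime, planned or not (an SLA may well count it differently). The function name is my own.

```python
def downtime_per_year(availability_percent: float) -> float:
    """Allowed downtime in minutes per 365-day year for a given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_percent / 100)

# Two nines through five nines.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_per_year(pct):.1f} minutes/year")
```

Three nines works out to roughly 8.76 hours a year, and five nines to a little over 5 minutes, which is why each extra nine costs so much more.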
Failover is when one system automatically takes over when the main one breaks. Disaster Recovery is what you do when the whole data centre has gone down for whatever reason.
So to keep a SharePoint farm going, what can you do other than run multiple WFEs and App servers? The data is mostly in SQL Server now, so that is what should be protected most.
SQL Clustering is where an Instance is spread over multiple servers but appears as one because it has a virtual IP address. Sometimes some of the servers (AKA nodes) just wait for the primary ones to fail; other times they are all working. Clusters can also host multiple Instances. The terms active/passive, active/active or even active/active/active are loosely used to describe these setups. The main weakness is that the nodes share the same storage or SAN in the same data centre, so if it fails you are still in a restore-from-tape scenario. The main advantage is easier maintenance, because you can patch individual servers without taking the cluster down. So sys admins like them. Clusters are not the same as network load balancing (NLB), which is for distributing web traffic.
SQL Log Shipping is a scheduled backup and restore of the databases’ transaction logs: back up on the primary, copy to the filesystem, restore on the other SQL Server. Because there is a lag between the backup and the restore, it is asynchronous rather than synchronous. The main weakness from a SharePoint 2010 point of view is that you can’t do this with the configuration database and, I think, the Central Admin site db, among others. Another weakness is that it is resource heavy. The main advantage is that you have full control over how and when the backups and restores run.
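Because log shipping lags, it helps to estimate how much work you could lose. The sketch below is a deliberately simplified model, not how SQL Server reports it: it assumes a backup only counts once it has been copied off the primary, and it ignores how long the backup itself takes. All names and timings are hypothetical.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval: timedelta, copy_time: timedelta) -> timedelta:
    """Upper bound on lost work if the primary dies just before a copy finishes.

    Simplified model: the newest backup is lost with the primary, so you fall
    back to the previous one, a full interval plus one copy time ago.
    """
    return backup_interval + copy_time

def worst_case_staleness(backup_interval: timedelta, copy_time: timedelta,
                         restore_time: timedelta) -> timedelta:
    """Upper bound on how far the secondary trails the primary in normal running."""
    return backup_interval + copy_time + restore_time

# Hypothetical schedule: back up every 15 minutes, 2 minutes to copy, 3 to restore.
interval = timedelta(minutes=15)
copy = timedelta(minutes=2)
restore = timedelta(minutes=3)
print(worst_case_data_loss(interval, copy))           # 0:17:00
print(worst_case_staleness(interval, copy, restore))  # 0:20:00
```

Shortening the interval shrinks the exposure but increases the load on both servers, which is the resource cost mentioned above.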
Database Mirroring is now an option in SharePoint 2010. Most of the DBs can be mirrored, including the configuration database and, I think, the Central Admin site db; in fact, most service app ones can too. It is automatic, so there’s less control, but it is fast. You can use it with clustering too, but it’s not necessary in my opinion.
So finally, which option is best? Well of course the answer is “it depends”!
I like the idea that you can have N+1 in your Production farm, and use SQL Clustering, then have an identical DR farm in another data centre and use Mirroring or Log Shipping to keep it in sync with Production.
Naturally that requires all the same servers, capacity and maintenance overhead as Production. So cost will be a factor.
Then there’s the Spread Farm (Microsoft calls it a stretched farm), which is a bit of a glass half full/empty option. It puts some of the WFE/App server/SQL nodes in a different data centre. This requires one-way (not round-trip, like ping) latency of less than 1 ms between the WFE and SQL, and at least 1 Gbps of bandwidth between the data centres. The main weakness is that the SAN is still in the main data centre, so if it goes you’re back to tape backups. So what is the advantage of spreading the servers? It just sounds like a good idea and is cheap, so clients think it is good, but I think it is false security.
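If you do measure a link between two data centres, the check against those two figures is simple enough to script. This is a minimal sketch with hypothetical names and sample values. It reads the 1 ms figure as a ceiling that every one-way sample must stay under, which is my assumption, and it does not measure anything itself.

```python
def meets_spread_farm_limits(one_way_latency_ms: list[float],
                             bandwidth_gbps: float,
                             max_latency_ms: float = 1.0,
                             min_bandwidth_gbps: float = 1.0) -> bool:
    """Check measured link figures against the latency and bandwidth limits.

    Assumption: every one-way latency sample must be under max_latency_ms.
    """
    if not one_way_latency_ms:
        raise ValueError("need at least one latency sample")
    latency_ok = all(sample < max_latency_ms for sample in one_way_latency_ms)
    bandwidth_ok = bandwidth_gbps >= min_bandwidth_gbps
    return latency_ok and bandwidth_ok

# Hypothetical measurements in milliseconds (one-way, so half a ping round trip at best).
print(meets_spread_farm_limits([0.4, 0.7, 0.9], bandwidth_gbps=1.0))  # True
print(meets_spread_farm_limits([0.4, 1.2], bandwidth_gbps=10.0))      # False
```

Passing this check says nothing about the shared SAN, so it does not change the argument above.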
For more reading, this blog is very good: http://www.jeremytaylor.net/
On SQL mirroring and clustering, read here: http://msdn.microsoft.com/en-us/library/ms191309(v=sql.100).aspx
For DR: http://technet.microsoft.com/en-us/library/ff628971.aspx
For Availability: http://technet.microsoft.com/en-us/library/cc748824.aspx