Tuesday, September 1, 2015

#vmworld Architecting Site Recovery Manager 6.1

SRM 6.1 delivers policy-driven protection groups. Rather than explicitly adding each VM to a protection group, you simply select the storage volume and any VM on it is protected automatically. The same applies when a VM is deployed to the storage volume or moved onto it with Storage vMotion.
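To make the idea concrete, here is a minimal sketch of membership that follows the datastore rather than an explicit VM list. It is a toy model only, not the SRM API: the `ProtectionGroup` class, the datastore names and the VM names are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ProtectionGroup:
    """Hypothetical model of a storage-driven protection group."""
    name: str
    datastores: Set[str] = field(default_factory=set)

    def protects(self, vm_datastore: str) -> bool:
        # A VM is protected purely because of where it lives.
        return vm_datastore in self.datastores


def protected_vms(group: ProtectionGroup, placement: Dict[str, str]) -> List[str]:
    """Return the VMs whose current datastore belongs to the group."""
    return sorted(vm for vm, ds in placement.items() if group.protects(ds))


group = ProtectionGroup("tier1", {"ds-replicated-01"})
placement = {"web01": "ds-replicated-01", "db01": "ds-local-02"}
before_move = protected_vms(group, placement)

# Simulate a Storage vMotion of db01 onto the protected volume.
placement["db01"] = "ds-replicated-01"
after_move = protected_vms(group, placement)

print(before_move)
print(after_move)
```

Nothing about the group changes when `db01` moves; it simply shows up as protected because its datastore is part of the group.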

The more Protection Groups you create in your SRM deployments, the more granular flexibility you have for testing failover. Creating fewer protection groups is less complex, but provides less flexibility. The right combination will vary by customer.

SRM supports Active-Passive, Active-Active (production in one site, development in the other), Bi-directional Failover (production in both sites, with each one serving as the failover site for the other) and multi-site (think remote branch to central site) topologies. In the past there was no way to leverage stretched storage with SRM; in SRM 6.1 you can. Failover differs in this model because it can now be orchestrated through a cross-vCenter vMotion (latency is typically 5-10 ms, or 50 to 100 km between sites, in this model).

SRM is a paired topology, so in a multi-site topology you need one SRM server in the central datacenter for each remote branch. You can also consolidate several remote sites into a single vCenter and SRM instance before failing over to the central site. Keep in mind that each VM can only be replicated once, so multi-hop scenarios are not natively supported in SRM. It is recommended that you do not make these topologies any more complicated than you need to.
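As a rough sketch of what the pairing rule means for server counts, the snippet below lists one pair per branch. The site names and the `srm-...` naming scheme are invented for illustration; only the one-pair-per-branch rule comes from the session.

```python
from typing import List, Tuple


def srm_pairings(central_site: str, branches: List[str]) -> List[Tuple[str, str]]:
    """Return one (central-side SRM server, branch-side SRM server) pair per branch."""
    return [(f"srm-{central_site}-{branch}", f"srm-{branch}") for branch in branches]


pairs = srm_pairings("hq", ["branch-a", "branch-b", "branch-c"])
central_servers = len(pairs)

for central, remote in pairs:
    print(f"{central} <-> {remote}")
print("SRM servers needed in the central datacenter:", central_servers)
```

With three branches the central datacenter ends up hosting three SRM servers, which is the reason to consider consolidating remote sites first.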

Recovery Time Objective (RTO) is a very important measurement when designing your DR strategy. RTO is the time between when the disaster occurs and when the system is fully recovered. IP customization (changing IPs during the recovery) involves a number of steps and takes a fair bit of time. One way around this is to use technologies like OTV or NSX to stretch layer 2 networks between datacenters so the IP addresses stay unchanged. With the integration of NSX and SRM 6.1 you have the concept of a universal logical switch, which enables networks to be mapped automatically between sites.

Other things you can consider for a lower RTO in an SRM architecture are:

  • Fewer larger NFS datastores (a large NFS datastore can take up to 10 seconds to mount)
  • Fewer Protection Groups
  • Don’t replicate VM swap files; put them on non-replicated datastores (weigh this against the added complexity)
  • Fewer Recovery Plans
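
To see why the datastore count matters, here is a back-of-the-envelope sketch using the 10 second figure from the list above. It assumes datastores are mounted one after another, which is a simplification for illustration and not a statement about how SRM actually schedules mounts; the datastore counts are made up.

```python
MOUNT_SECONDS_MAX = 10  # from the list above: up to 10 seconds per NFS datastore


def worst_case_mount_seconds(datastore_count: int, per_mount: int = MOUNT_SECONDS_MAX) -> int:
    """Upper bound on total mount time, assuming sequential mounts (an assumption)."""
    if datastore_count < 0:
        raise ValueError("datastore_count must be non-negative")
    return datastore_count * per_mount


many_small = worst_case_mount_seconds(40)  # hypothetical: 40 small datastores
few_large = worst_case_mount_seconds(8)    # hypothetical: 8 larger datastores

print(f"40 datastores: up to {many_small} s")
print(f" 8 datastores: up to {few_large} s")
```

Under that assumption, consolidating from 40 datastores to 8 trims the worst case from 400 seconds to 80.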

Recommended VM Considerations

  • Install VMware Tools in all VMs
  • Suspend VMs on Recovery. Although this can increase your RTO, it frees up resources at your recovery site (works best with an active-active model, where the failover site normally runs development)
  • Power off VMs (the consideration is similar to suspending)

Recovery Site

  • Ensure that the vCenter is sized properly; it works hard during recovery situations
  • If you have an active-active model you may need more hosts as you potentially double the workload during failover
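
A quick sizing sketch shows how the active-active point interacts with suspending VMs on recovery. All of the numbers (GHz of demand, capacity per host) are hypothetical, and real sizing would also need to account for memory, HA headroom and so on.

```python
import math


def hosts_needed(local_ghz: float, failover_ghz: float, host_ghz: float,
                 suspended_ghz: float = 0.0) -> int:
    """Hosts required at the recovery site during a failover.

    local_ghz:     demand the recovery site already carries (e.g. development)
    failover_ghz:  demand arriving from the failed site
    suspended_ghz: local demand freed by suspending or powering off VMs
    """
    if suspended_ghz > local_ghz:
        raise ValueError("cannot suspend more than the local workload")
    demand = local_ghz - suspended_ghz + failover_ghz
    return math.ceil(demand / host_ghz)


without_suspend = hosts_needed(local_ghz=200, failover_ghz=200, host_ghz=50)
with_suspend = hosts_needed(local_ghz=200, failover_ghz=200, host_ghz=50,
                            suspended_ghz=150)

print("Hosts needed, nothing suspended:", without_suspend)
print("Hosts needed, dev VMs suspended:", with_suspend)
```

In this made-up case the workload doubles during failover and needs 8 hosts, while suspending most of the development VMs brings that down to 5.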

VMware has a few best practices for implementing SRM, such as being clear with the business by providing a menu of SLAs.
