The idea with Cloud is that each layer is responsible for its own availability and by combining these loosely coupled layers you get higher availability. It should be a consideration before you begin deployment. For example if you start by considering maintenance and how and when you would take a planned outage it informs your design. You should predict the types of failures you can experience so you are mitigating it in your architecture.
There is an emerging concept on Hyper-scale Cloud platforms called grey failures. A grey failure is when your VM, workloads or applications are not down but are not getting the resources they need.
Monitoring and Alerting should be on for any VMs running in IaaS. When you open a ticket in Azure any known issues are surfaced as part of the support request process. This is part of Azure’s automated analytics engine providing support before you input the ticket.
Backup and DR plans should be applied to your VM. Azure allows you to create granular retention policies. When you recover VMs you can restore over the existing or create a new VM. For DR you can leverage ASR to replicate VMs outside another region. ASR is not multi-target however so you could not replicate VMs from the enterprise to an Azure region and then replicate them to a different one. It would be two distinct steps, first replicate and failover the VM to Azure and then you can setup replication between two regions.
Maintenance Microsoft now provides a local endpoint in a region with a simple REST API that provides information on upcoming maintenance events. These can be surfaced within the VM so you can trigger soft shutdowns of your virtual instance. For example, if you have a single VM (outside an availability set) and the host is being patched the VM can complete a graceful shutdown.
Azure uses VM preserving technology when they do underlying host maintenance. For updates that do not require a reboot of the host the VM is frozen for a few seconds while the host files are updated. For most applications this is seamless however if it is impactful you can use the REST API to create a reaction.
Microsoft collects all host reboot requirements so that they are applied at once vs. periodically throughout the year to improve platform availability. You are preemptively notified 30 days out for these events. One notification is sent per subscription to the administrator. The customer can add additional recipients.
Availability Sets is a logical grouping of VMs within a datacenter that allows Azure to understand how your application is built to provide for redundancy and availability. Microsoft recommends that two or more VMs are created within an availability set. To have a 99.95% SLA you need to deploy your VMs in an Availability Sets. Availability Sets provide you fault isolation for your compute.
An Availability Set with Managed Disks is call a Managed Availability Set. With Managed Availability Set you get fault isolation for compute and storage. Essentially it ensures that the managed VM Disks are not on the same storage.