Guest Column | August 8, 2016

8 Virtual Steps To Ease The Pain Of Disaster Recovery


IT management, virtualized infrastructure teams, and service providers should offer disaster-recovery-as-a-service (DRaaS), make DR environments easy to operate, and focus on enhancement services that add value to replication technology and streamline the DR process.

By Matt Sprague, Infrastructure Services Manager, Computer Design & Integration LLC

Hollywood and the media sometimes leverage the shock value of unpleasant events such as fires, floods, blizzards, and hurricanes. However, we may need to remind ourselves that disasters are real and happen every day, and that we offer solutions that actually help businesses and families recover from these tragedies.

Imagine your business lost its servers, inventory, and customer database. You would want to protect your assets from unforeseen events and natural disasters too. FEMA estimates that up to one-half of all businesses close within a year after a major disaster. Losing data is not something we want to think about but, ironically, it’s only when customers reconsider the high cost of losses that they come looking to us as VARs and MSPs for solutions.

While our customers develop their business continuity plans, we, as providers, also need to prepare for disaster recovery (DR) service offerings. I’d like to share my eight simple steps.

Step 1: Recognize The Power Of Virtualization
Virtualization has been a major force of change in IT over the last decade and no service has benefited as greatly as DR. Replicating the data and functionality of a production environment was once the hardest part of DR. In today’s 100 percent virtualized environments, it’s as easy as replicating VM data and settings to a DR site. We also enjoy a plethora of software and methods for replicating a copy of VM data and configurations, and for keeping them in sync.

Today, the hard part is over and providers who recognize this shift are offering DRaaS. They have switched their focus to enhancement services that add value to replication technology by streamlining recovery and making DR environments simple to operate. Two tips we can learn from them are:

  • Select a proven replication technology and provide your customers with a frame of reference for DR environments and methodologies.
  • Recognize the power of virtualization: read the real-world case studies, brainstorm with your teams, and take advantage of free trials.

Step 2: Mimic The Production Environment
Here is a simple, uncompromising philosophy that allows you to provide streamlined DR for mission-critical environments successfully without added complexity: model the virtual DR environment on the actual production environment, detail by detail.

The guest VM must not be able to tell whether it’s running in the production or DR environment. Treat the guest VM as sacred in the DR environment. No aspect of DR should require logging in to a VM. Nothing can be changed: no IP changes, no DNS changes, no configuration changes. This is the best way to guarantee the functionality is the same in DR as it is in production. In addition to promoting good DR functionality, this approach offers several other benefits:

  • Faster Failover: because changes inside of the VMs are not allowed, DR failover is faster.
  • Fewer Mistakes: disasters are no time for critical thinking. Fewer changes in an environment also mean fewer potential mistakes from engineers operating under high-stress conditions.
  • Higher Reliability: clean guest VMs mean a more reliable and streamlined failback to production.
  • Less Maintenance: environment variables remain in the VMs that are actually replicated. The DR environment requires less maintenance.

Step 3: Plan Ahead For Internal Networking Detail
Comprehensive DR design planning is not optional. Successful implementations do not skimp on the planning phase. If we maintain the philosophy I am suggesting, network discussions are clear and simple. For example, all of the following tips apply:

  • DR network IP addresses, subnets, and gateways must be exactly the same as the production network.
  • The layer 2 segments should be completely isolated from production so they do not collide.
  • When your VMs boot up in DR, they should see network surroundings that are as close to production as possible.
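The three tips above lend themselves to an automated check during the planning phase. The following is a minimal sketch in Python, not a definitive implementation: the `Segment` model, its field names, and the `check_dr_network` function are hypothetical and would need to be fed from whatever inventory your environment actually uses. It flags any DR segment whose subnet or gateway differs from production, and any DR segment that shares a layer 2 domain with production.

```python
from dataclasses import dataclass
import ipaddress


@dataclass(frozen=True)
class Segment:
    """One network segment as a guest VM would see it (hypothetical model)."""
    name: str
    subnet: str      # CIDR notation, e.g. "10.10.20.0/24"
    gateway: str     # default gateway address for the subnet
    l2_domain: str   # identifier of the underlying layer 2 domain


def check_dr_network(prod, dr):
    """Return a list of problems; an empty list means the DR plan passes."""
    problems = []
    dr_by_name = {segment.name: segment for segment in dr}
    prod_domains = {segment.l2_domain for segment in prod}

    for p in prod:
        d = dr_by_name.get(p.name)
        if d is None:
            problems.append(f"{p.name}: missing in DR")
            continue
        # Addressing must match exactly so guests need no reconfiguration.
        if ipaddress.ip_network(p.subnet) != ipaddress.ip_network(d.subnet):
            problems.append(f"{p.name}: subnet {d.subnet} differs from {p.subnet}")
        if ipaddress.ip_address(p.gateway) != ipaddress.ip_address(d.gateway):
            problems.append(f"{p.name}: gateway {d.gateway} differs from {p.gateway}")
        # The layer 2 domain must differ, or identical IPs could collide.
        if d.l2_domain in prod_domains:
            problems.append(f"{p.name}: shares layer 2 domain {d.l2_domain} with production")
    return problems


prod = [Segment("app", "10.10.20.0/24", "10.10.20.1", "prod-vlan-120")]
dr = [Segment("app", "10.10.20.0/24", "10.10.20.1", "dr-vlan-120")]
print(check_dr_network(prod, dr))
```

Running a check like this whenever the production network changes is one way to keep the DR design from drifting quietly out of sync.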

Step 4: Virtualize The Physical Networking Devices
Consider virtualizing your networking devices, including firewalls, load balancers, IDS/IDP systems, and VPN devices. If production already uses virtual devices, your DR plan should treat them like the virtual machines they are and just replicate them. Otherwise, the DR environment has to mimic the physical devices so that the replicated VMs think they are still in the production environment when they boot up.

Use the following tips for mimicking all the functionality provided by the production devices in the DR environment:

  • To mimic physical devices, deploy simple virtual versions of those devices as often as possible. The configuration is based on production but pared down to address essential VMs/functionality.
  • When appropriate, use a multi-tenant or shared physical device to closely match customer production environments (for example, an F5 load balancer requirement).
  • Only mimic what is required. Don’t worry about trying to mimic discretionary devices like printers or copiers; no one will be printing during disaster recovery.
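The idea of a configuration that is "based on production but pared down" can be illustrated with a small filter. This is only a sketch under stated assumptions: the rule format (dictionaries with `name`, `dest_vm`, and `port` keys) and the `pare_down_rules` function are invented for illustration and do not correspond to any vendor's export format.

```python
def pare_down_rules(prod_rules, essential_vms):
    """Keep only the production rules whose destination VM is replicated to DR."""
    return [rule for rule in prod_rules if rule["dest_vm"] in essential_vms]


# Hypothetical production rule set exported from a physical firewall.
prod_rules = [
    {"name": "allow-https", "dest_vm": "web1", "port": 443},
    {"name": "allow-sql", "dest_vm": "db1", "port": 1433},
    {"name": "allow-print", "dest_vm": "print1", "port": 9100},
]

# Only the VMs that are actually replicated to DR.
essential_vms = {"web1", "db1"}

for rule in pare_down_rules(prod_rules, essential_vms):
    print(rule["name"])
```

Here the print server rule is dropped because nothing in DR depends on it, while the rules the essential VMs rely on are carried over unchanged.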

Step 5: Test Your DR Setup
Testing is critical since we’re building an environment to match production rather than replicating the production environment. It’s the only way to know for sure that our setup is right.

Another benefit of this streamlined approach is that it requires no separate network for bubble testing. If we were changing VM IP addresses to a dedicated DR network that might have connectivity back to production networks, we might want to have another separate, isolated network for testing purposes. This network would allow the VMs to boot up so we could verify functionality, but would not interfere with production environments. Since our DR environment is an exact copy of production and the two networks are isolated, there is no need for a separate bubble-test network.

Step 6: Try Partial Failover And Layer 2 Stretch
If partial failover to DR is required, stretch a layer 2 segment/VLAN between sites to allow VMs to come up in DR and communicate with VMs still running in production. Stretching layer 2 segments across physical boundaries is ordinarily considered a bad idea, and I agree; however, a disaster situation is generally a temporary one, so it can be tolerated for a short time.

To compromise, use a virtualized stretch like Layer 2 VPN or Cisco Overlay Transport Virtualization (OTV) to allow the DR and production environments to communicate. Often this grants us more control over the communication and mitigates a lot of the issues created by stretching a VLAN to DR. And when there is no physical connectivity between the production and DR sites, this is the only option.
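When planning a partial failover, it helps to know in advance which segments would end up split across the two sites. The sketch below is one rough way to work that out; the `segments_to_stretch` function and the VM and VLAN names are hypothetical, and it only covers segments that have VMs on both sites, not routed traffic between different segments.

```python
from collections import defaultdict


def segments_to_stretch(vm_segment, failed_over):
    """Return the segments that would have VMs running in both production and DR."""
    sites_per_segment = defaultdict(set)
    for vm, segment in vm_segment.items():
        site = "dr" if vm in failed_over else "prod"
        sites_per_segment[segment].add(site)
    # A segment needs a stretch only if it is split across the two sites.
    return sorted(
        segment
        for segment, sites in sites_per_segment.items()
        if sites == {"prod", "dr"}
    )


# Hypothetical inventory: which segment each VM lives on.
vm_segment = {"web1": "vlan10", "web2": "vlan10", "db1": "vlan20"}

# Partial failover: only web1 moves to DR.
print(segments_to_stretch(vm_segment, failed_over={"web1"}))
```

In this example only `vlan10` is split, so it is the only candidate for a temporary stretch.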

Step 7: Preconfigure Public Networking
Maintaining the same public networking between production and DR sites is typically impossible. An exception occurs when an organization owns a large enough IP block to advertise via BGP to the Internet. In most cases, you are changing your public-facing IPs in DR. Here are some additional networking tips:

  • Preconfigure public IPs and DNS names for the DR site. In most cases, tell people to access the public service by a different DNS name while in DR. In some cases, companies prefer this convention because it indicates when they are using DR or production.
  • If DNS changes are required, stage them with a script or standby zone file. Again, disasters are no time for lengthy critical thinking.
  • Lower your DNS Time-to-Live (TTL) settings for critical public facing services. A smaller TTL allows changes to propagate faster across the Internet.
  • Leverage a global load balancing solution or CDN that pulls and balances across multiple sites. These options are generally higher-end solutions for large website deployments but they are effective in streamlining the switchover to a DR site with little or no loss in availability.
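As an illustration of staging DNS changes ahead of time, the sketch below builds standby A records for the DR site and warns about production TTLs that are too high to propagate quickly. Everything here is an assumption made for the example: the `stage_dr_zone` function, the 300-second limit, the host names, and the addresses (taken from the ranges reserved for documentation). The output lines follow the common `name TTL IN A address` zone-file layout; adapt them to whatever your DNS platform expects.

```python
def stage_dr_zone(prod_records, dr_ip_for, ttl_limit=300):
    """Build standby A records for DR and collect warnings about the current setup.

    prod_records: iterable of (name, ttl_seconds, production_ip) tuples.
    dr_ip_for:    mapping of production public IP -> preconfigured DR public IP.
    """
    standby_lines, warnings = [], []
    for name, ttl, prod_ip in prod_records:
        # A high production TTL delays the switchover, so flag it now.
        if ttl > ttl_limit:
            warnings.append(f"{name}: production TTL {ttl}s exceeds {ttl_limit}s")
        if prod_ip not in dr_ip_for:
            warnings.append(f"{name}: no DR address preconfigured for {prod_ip}")
            continue
        standby_lines.append(f"{name} {ttl_limit} IN A {dr_ip_for[prod_ip]}")
    return standby_lines, warnings


records = [
    ("www.example.com.", 3600, "192.0.2.10"),
    ("api.example.com.", 300, "192.0.2.11"),
]
dr_ip_for = {"192.0.2.10": "198.51.100.10"}

lines, warnings = stage_dr_zone(records, dr_ip_for)
print("\n".join(lines))
print("\n".join(warnings))
```

The point of a script like this is that the thinking happens before the disaster: the standby records and the list of gaps both exist long before anyone has to use them.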

Step 8: Automate
No longer just another buzzword, automation is a key aspect of DR and ultimately requires that all the thinking and planning are done ahead of time. With workflow, process, and asset automation, even in an emergency situation, we just have to press the panic button, watch failover, and wait for our DR environment to come back online.

During difficult high-stress disasters, it may not be feasible to rely on a technician to follow a procedure to recover a system in DR. Keep automation in mind to empower you to implement these DR methodologies.
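To make the panic button idea concrete, here is a minimal sketch of a runbook runner in Python. It is not tied to any particular replication or orchestration product; the `run_failover` function and the step names are hypothetical, and the lambda placeholders stand in for real calls to your own tooling. Steps run in a fixed order, and the run stops at the first failure so that an operator sees exactly where things went wrong.

```python
from typing import Callable, List, Tuple

# A step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run the steps in order, stop at the first failure, and return a log."""
    log = []
    for name, action in steps:
        try:
            ok = action()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            break
        if not ok:
            log.append(f"FAILED {name}")
            break
        log.append(f"OK {name}")
    return log


# Placeholder actions; real ones would call your replication and DNS tooling.
steps = [
    ("power on replicated VMs", lambda: True),
    ("apply staged DNS changes", lambda: True),
    ("verify public services", lambda: True),
]

for line in run_failover(steps):
    print(line)
```

Because every decision is encoded in the step list ahead of time, the person pressing the button only has to read the log.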

The Last Word on Disaster Recovery
DR environments are easier to implement thanks to strong virtualization trends. Technology empowers organizations to ship VM data to DR environments. Providers must focus on adding value to the services surrounding VM replication. DR environments can be streamlined to enable efficient and effortless failover during a disaster by adhering to a philosophy of not making changes to the guest VMs that are replicating from production.

I encourage you to leverage the tips in this article and apply them to the shared vision of a simple DR panic button with automated (or at least clearly defined) failover that gives your organization and customers a strong return on their DR services, investments, and business continuity solutions.

Matt Sprague is Manager of Infrastructure Services at Computer Design & Integration LLC (CDI LLC), a hybrid IT cloud and managed services provider, where he manages a cross-functional team of infrastructure engineers that maintain datacenter and service offering infrastructure. An accomplished team leader with expertise in designing, implementing, and managing complex IT environments, Mr. Sprague has a proven track record of implementing sound, practical IT solutions designed to aid businesses in reaching their growth and profitability objectives. He is highly skilled at creating strategies that align technology and operational methodologies with organizational needs that maximize uptime and productivity.