What Are the Best Approaches to Scale Virtual Machine (VM) Networking Beyond the Data Center?

Many of the components of cloud computing have been done before: Infrastructure-as-a-Service looks a lot like time sharing, and Software-as-a-Service looks a lot like what we used to call Application Service Providers. However, one component of cloud computing that is genuinely new is the ability to move virtual machines between physical servers, both within a data center and between data centers.

Our research clearly shows that IT organizations are very interested in moving VMs between data centers but that there are a number of barriers that limit their ability to do so. This month's discussion will identify those barriers and will discuss what IT organizations can do to limit their impact.

To make this discussion interactive, please feel free to send us questions or comments.

To comment on the discussion or to suggest further questions, please send email to Jim Metzler or enter your comment into the form below.



What are the primary challenges that limit the ability of an IT organization to move VMs between data centers?

VM migration between data centers is a bandwidth-intensive operation, and distance exacerbates the challenge. In most cases until recently, IT organizations either shut down a VM or put it into a “quiet state” before moving it, to minimize the burden on the WAN during the move. Today there are emerging products and solutions, many of them advancements from networking vendors such as Brocade, that allow VMs to be moved between data centers “live”. There are three key areas of consideration when designing for live migrations.

  1. Enabling VM mobility between data centers regardless of their location and distance in between.

    VM migration can break network and application access. You need to ensure that applications residing on a VM will have adequate network resources and that configurations such as VLAN profiles, QoS policies, and security ACLs are available and applied throughout the network when a VM is moved. Also, today’s VM migration solutions are limited to round-trip latencies of roughly 5 ms. Application acceleration and optimization techniques are required to enable VM migrations at latencies beyond 10 ms while ensuring uninterrupted access.
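To give a feel for what those latency numbers mean in geographic terms, the sketch below converts a round-trip latency budget into an approximate maximum fiber distance. The 5 µs/km one-way propagation figure is a common rule of thumb for light in fiber; real paths add switching and queuing delay, so treat the result as an upper bound, not a design guarantee.

```python
# Rough estimate of the fiber distance supportable under a migration RTT
# budget. Assumes ~5 microseconds of one-way propagation delay per km of
# fiber (a standard rule of thumb); real paths add equipment delay.

FIBER_DELAY_US_PER_KM = 5.0  # one-way, approximate

def max_fiber_km(rtt_budget_ms: float) -> float:
    """Max one-way fiber distance (km) that fits within the RTT budget."""
    one_way_us = (rtt_budget_ms * 1000.0) / 2.0
    return one_way_us / FIBER_DELAY_US_PER_KM

print(round(max_fiber_km(5.0)))   # ~500 km under a 5 ms RTT limit
print(round(max_fiber_km(10.0)))  # ~1000 km under a 10 ms RTT limit
```

This is why a 5 ms limit effectively confines live migration to metro distances, while pushing toward 10 ms (with acceleration techniques) roughly doubles the feasible radius.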

  2. A network infrastructure that provides high performance, high reliability, security, and proper layer 2 extension capabilities to interconnect the data centers.

    VM migration is a bandwidth-intensive operation. A VM migration task needs to be transported across all available WAN links simultaneously, using trunking to aggregate sufficient bandwidth. At the same time, hierarchical QoS and adaptive rate limiting are required to prevent VM migration traffic from choking other mission-critical traffic on the WAN. In addition, high-speed (10 Gbps) encryption of data in flight is required to protect against a man-in-the-middle attacker viewing or manipulating a VM undergoing live migration over a public IP network and targeting credentials and data in the migrating VM's memory.
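The bandwidth point above lends itself to a back-of-envelope calculation: transfer time scales with VM memory size and inversely with the aggregated, usable WAN bandwidth. The 70% efficiency factor and the example sizes below are illustrative assumptions, not measurements from any particular product.

```python
# Back-of-envelope estimate of live-migration transfer time versus
# aggregated WAN bandwidth. The efficiency factor models protocol and
# QoS overhead and is an illustrative assumption.

def migration_seconds(vm_memory_gb, link_gbps, links=1, efficiency=0.7):
    """Seconds to copy a VM's memory over `links` trunked WAN links."""
    bits_to_move = vm_memory_gb * 8e9            # GB -> bits
    usable_bps = link_gbps * 1e9 * links * efficiency
    return bits_to_move / usable_bps

# A 16 GB VM over a single 1 Gbps link vs. a 4-link trunk:
print(round(migration_seconds(16, 1.0)))           # ~183 s
print(round(migration_seconds(16, 1.0, links=4)))  # ~46 s
```

The four-fold speedup from trunking is what makes link aggregation, rather than a single fat pipe, the typical design choice for migration traffic.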

  3. IP traffic management of client network access to the site where the application server resides

    Live migration of VMs requires visibility into, and orchestration of, the workflow, for example dynamically routing client traffic to the new VM location once the migration completes. This is done by updating GSLB/SLB configurations in response to the VM migration event.
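The GSLB-update step described above can be sketched as an event handler: when the hypervisor reports that a migration has completed, the service name is repointed at the VIP fronting the new site. The `GslbClient` class and its methods below are invented for illustration; a real deployment would call the load balancer's own API.

```python
# Hypothetical sketch of redirecting client traffic after a completed
# migration by updating a GSLB record. GslbClient is an invented stand-in
# for a real load balancer API, not any vendor's actual interface.

class GslbClient:
    def __init__(self):
        self.records = {}  # fqdn -> data-center VIP currently answering

    def update(self, fqdn, vip):
        self.records[fqdn] = vip

def on_migration_complete(gslb, fqdn, new_site_vip):
    """Handler for the VM-migration-complete event: repoint the service."""
    gslb.update(fqdn, new_site_vip)

gslb = GslbClient()
gslb.update("app.example.com", "203.0.113.10")                    # DC-A VIP
on_migration_complete(gslb, "app.example.com", "198.51.100.20")   # DC-B VIP
print(gslb.records["app.example.com"])  # 198.51.100.20
```

The key design point is that the update is driven by the migration event itself, so no human has to notice the move and reconfigure traffic steering by hand.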

Live migration of virtual machines is a bandwidth-intensive operation and used to be cost-prohibitive except over short distances. Products and solutions from vendors such as Brocade are addressing these challenges and enabling IT organizations to embrace and successfully perform this operation today.

The primary challenges of moving VMs include Layer 2/3 extension, latency, security, performance, and management. VM mobility challenges are best understood by looking at use-case requirements.

Live Migration/Continuous Data Availability: Live migration consists of migrating the memory content of the virtual machine, maintaining access to the existing storage infrastructure, and providing continuous connectivity to the existing domain. Challenges include sub-millisecond round-trip latency, maintaining virtual machine performance before, during, and after the live migration, and maintaining storage availability.
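The memory-migration step mentioned above is typically done by iterative pre-copy: copy all pages while the VM keeps running, then repeatedly re-copy the pages the guest dirtied during the previous pass, until the remaining dirty set is small enough to send during a brief stop-and-copy pause. The sketch below models that convergence; the page counts and the constant dirty ratio are illustrative assumptions, not the behavior of any specific hypervisor.

```python
# Minimal model of iterative pre-copy live migration. Each round re-sends
# the pages dirtied during the previous round; the loop ends with a short
# stop-and-copy phase once the dirty set falls below a threshold.

def precopy_rounds(total_pages, dirty_ratio, threshold):
    """Return per-round page transfer counts, ending with stop-and-copy.

    dirty_ratio: fraction of just-copied pages dirtied again each round
    (an illustrative constant; real workloads vary round to round).
    """
    transfers = [total_pages]                  # round 0: copy everything
    pending = int(total_pages * dirty_ratio)
    while pending > threshold:
        transfers.append(pending)              # VM still running
        pending = int(pending * dirty_ratio)
    transfers.append(pending)                  # final round: VM briefly paused
    return transfers

print(precopy_rounds(1_000_000, 0.1, 1_000))
# [1000000, 100000, 10000, 1000] -> downtime covers only ~1000 pages
```

The takeaway is that perceived downtime depends on the final round's size, not the VM's total memory, which is why the technique converges only when available bandwidth outpaces the guest's dirtying rate.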

Non-Disruptive: A non-disruptive design requires consideration of both existing and new server transactions when architecting the solution. Virtual machine transactions must be maintained, so the solution must anticipate both obvious and unusual scenarios that may occur.

Challenges include:

LAN Extension: Data center applications are often legacy or use embedded IP addressing, which drives Layer 2 extension across data centers.

Path Optimization: Every time a specific VLAN (subnet) is stretched between two remote locations, the routing path must be considered.

Layer 3 Extension: Provides routed connectivity between data centers, used for segmentation/virtualization and file-server backup applications.

SAN Extension: Presents different types of challenges and considerations because of distance and latency requirements.

Several challenges must be overcome to enable VM migration between data centers, and there are two models to consider. The first is live VM migration, where a VM is migrated in real time while keeping its applications up and running. Since live VM migration requires shared storage, storage must either be shared across data centers or synchronized between them in real time. This is a big challenge: shared storage across data centers suffers from latency and performance constraints, and real-time storage synchronization poses similar problems.

Another challenge is that live VM migration typically does not work across Layer 3 boundaries (applications need to keep their IP addresses for continuity), so the data centers must share a common VLAN or subnet spanning the different sites. Stretching VLANs or subnets across data centers can be problematic for several reasons. First, all data centers would potentially need to learn MAC addresses from one another, possibly leading to MAC address explosion. Second, problems such as broadcast storms in one data center may now impact another data center.
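The MAC-address-explosion concern above comes down to simple multiplication: once a VLAN is stretched, every switch in the shared L2 domain may learn addresses from every site rather than just its own. The numbers below are hypothetical, chosen only to illustrate the scaling.

```python
# Back-of-envelope illustration of MAC table growth when a VLAN is
# stretched across data centers. All figures are hypothetical examples.

def learned_macs(sites, vms_per_site, stretched=True):
    """MAC entries a switch in the L2 domain may need to hold."""
    return sites * vms_per_site if stretched else vms_per_site

print(learned_macs(1, 4000))               # single DC: 4000 entries
print(learned_macs(3, 4000))               # three stretched DCs: 12000 entries
print(learned_macs(3, 4000, stretched=False))  # L3-separated DCs: 4000 each
```

Since switch MAC tables are finite hardware resources, this multiplication is one of the practical ceilings on how far a stretched VLAN design can scale.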

Note that most of the above problems arise when attempting to move VMs across data centers while keeping applications running. Offline VM migration is easier, as it does not necessarily require IP addresses to be maintained; as such, the different data centers do not necessarily need to share the same VLAN or subnet. Additionally, replicating VM state and storage offline, or in non-real time, is an easier challenge to address.

The flexibility and scale of hypervisor technologies are bounded by a fundamental constraint of virtualization: vMotion cannot move VMs outside of the L2 network that contains them. To optimize virtualized environments and architect larger, more flexible data center 'application pods', one needs larger and more flexible L2 networking domains.

The technologies used for creating L2 domains and controlling L2 traffic are well established. Spanning Tree Protocol (STP) and its derivatives were adequate for most networking needs, but they have drawbacks in virtualized data centers.

HP Networking is addressing these issues by taking an active role in both IEEE and IETF efforts to standardize new STP-free L2 technologies: 802.1aq SPB and TRILL, respectively.

While compute and storage resources are virtualized in most data centers, legacy network elements remain limited in their ability to support the workload mobility needed for cloud computing.

To overcome these limitations, HP is engaged in industry activities to deliver multi-tenant enabled network technologies to support global networks suited for cloud infrastructures. Multi-tenancy requirements challenge traditional networking architectures:

● Multi-tenancy requires secure separation of traffic flows between tenants of a common network infrastructure
● Virtual machine mobility requires large layer 2 networks that are difficult to scale locally or globally
● With a limit of 4,096 tags, traditional IEEE 802.1Q VLANs lack the scale required to differentiate the thousands or millions of tenant virtual networks

To address these limitations, the industry is developing standards-based encapsulation methods that provide Layer 2 network abstractions to virtual machines, independent of their location in the network. L2 Ethernet frames sent by VMs are encapsulated completely within another Ethernet frame, with a new outer Ethernet header, VLAN tag, IP header, and encapsulation header. Likewise, frames received from external networks are decapsulated, and the original L2 frames are delivered to the appropriate VMs.
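VXLAN, the best-known of these encapsulation methods, makes the scale gain concrete: its header carries a 24-bit VXLAN Network Identifier (VNI), giving roughly 16 million segments versus the 4,096 of an 802.1Q tag. The sketch below packs the 8-byte VXLAN header per RFC 7348 and totals the per-packet overhead of the outer headers (assuming no outer 802.1Q tag).

```python
import struct

# Pack the 8-byte VXLAN header (RFC 7348): 1 flags byte (0x08 = VNI valid),
# 3 reserved bytes, 3-byte VNI, 1 reserved byte.

VXLAN_FLAG_VALID_VNI = 0x08

def vxlan_header(vni: int) -> bytes:
    if not 0 <= vni < 2**24:
        raise ValueError("VNI is a 24-bit field")
    return struct.pack("!II", VXLAN_FLAG_VALID_VNI << 24, vni << 8)

# Per-packet overhead added to every encapsulated frame
# (outer Ethernet + IP + UDP + VXLAN, no outer 802.1Q tag):
OUTER_ETH, OUTER_IP, OUTER_UDP, VXLAN = 14, 20, 8, 8
print(len(vxlan_header(5000)))                    # 8-byte VXLAN header
print(OUTER_ETH + OUTER_IP + OUTER_UDP + VXLAN)   # 50 bytes of overhead
```

That roughly 50 bytes of fixed overhead is why networks carrying encapsulated traffic are usually configured with a larger MTU, so that VMs can keep sending standard 1,500-byte frames unfragmented.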

Today’s Data Center challenges are primarily those of improving end-to-end time-to-service while avoiding wholesale, service-affecting complexity. It’s a crude reality that most problems are solvable given sufficient resources and funding; however, the business-centric goal must be to achieve streamlined, sustainable, and efficient service delivery.

The main limitation impacting Data Center networks is that of achieving scale, whether that is at Layer 2 or Layer 3. Virtual Machines are among the most powerful tools available to Data Center architects today; however, the underlying infrastructure has restricted them to local significance. These environments and their limitations have spawned the concept of pods that are often as limited in reach as a single rack.

While stitching together various protocols or, even better, deploying an end-to-end Fabric – between multiple Data Centers and beyond into the wider Enterprise – may address virtual boundaries, there remains the issue of integration between the virtual environment and the network environment. At Avaya, we’ve developed a level of integration between VMware’s vCenter and our Virtual Provisioning Server (VPS) that allows the migration process for a VM to seamlessly trigger corresponding adaptation of the network. Thus, the policies associated with the port from which the VM is moving are applied on the new port. This is crucial, as 37% of outages in the Data Center are caused by human intervention, and the major cause of this intervention is day-to-day administration. Remove the cause and you’ll remove the effect; complexity is of course the enemy of availability, and actively simplifying the various interfaces will promote improved uptime.

Looking forward, VMware appears to be continuing its focus on VXLAN as the dynamic interface between the traditionally disparate environments of compute and network. Therefore, it will be imperative for Data Center architects to choose flexible solutions that provide a complementary Fabric-based architecture, one that addresses the underlying complexity of VXLAN transport and enables communication between both VXLAN and non-VXLAN environments.

