The Evolution of the Data Center LAN

This is the sixth and last of the monthly discussions of data center LAN switching. Each of the five previous months focused on a specific technical topic, such as the best alternative to the spanning tree protocol. This month's discussion is an interview that Jim conducted with each of the six vendors, and it covers a range of topics, both technical and non-technical, that relate to the evolution of the data center LAN.

In order to comment on the discussion here and/or to suggest further questions, please send email to Jim Metzler or simply enter your comment into the form below.



The trade press talks a lot about the need to flatten the data center in order to reduce latency primarily for east-west applications. Other than for certain well-discussed transactions in the financial industry, can you put a monetary value on cutting data center switch latency by a few microseconds?

There are some applications for which it’s possible to put a monetary value on cutting the latency of the data center switch by a few microseconds. You mentioned one such application - financial transactions like high frequency trading (HFT) readily come to mind. Another application is high performance computing (HPC). The monetary value that is placed on reducing data center switch latency by a few microseconds is generally a function of how the company monetizes the business impact of latency.

While some IT organizations place a value on reducing switch latency by a few microseconds, most IT organizations are looking for predictable latency, not just at a box level, but end-to-end across the network and applications. Based on the data flow, there generally is some fine-tuning that IT organizations can do to improve performance, and that fine-tuning is worthwhile as long as it improves the user or application experience. If, however, the fine-tuning doesn’t improve application performance, IT organizations shouldn’t look to optimize further.

HP has a team that is entirely focused on just the issue of very low latency switching and the applications that require this low latency. While there is interest in very low latency switching, that interest is very narrow. We see the interest currently as being largely limited to the high frequency trading markets. For most applications, cutting the data center switch latency by 700 nanoseconds or 1 microsecond doesn’t have a real impact.

It isn’t possible to put an absolute dollar value on reducing switch latency by a few microseconds. However, there are certain application suites that could benefit from lower switch latency. One such suite is data mining, which requires the sorting of huge volumes of data in a short period of time. One measure of the scale of sorting that is both necessary and possible is that in 2009 Hadoop set a record by sorting data at over half a terabyte per minute.

A possible way to quantify the value of switch latency involves the bandwidth-delay product, which is the product of a data link's capacity (in bits per second) and its end-to-end delay (in seconds). The result, an amount of data measured in bits (or bytes), is equivalent to the maximum amount of data on the network circuit at any given time, i.e., data that has been transmitted but not yet received. When designing LAN switches, manufacturers set the buffer size to some multiple of the TCP window size. Lowering the latency in the data center LAN switches that connect to the Top of Rack (ToR) switches reduces the number of buffers that are needed in the ToR switches, which in turn reduces the cost of those switches.
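The bandwidth-delay product described above can be worked through with a short sketch. The link rate (10 GbE) and the two delay values are illustrative assumptions and were not supplied by the vendor; the function name is likewise hypothetical. Integer arithmetic is used so the results are exact.

```python
def bdp_bits(link_bps: int, delay_us: int) -> int:
    """Bandwidth-delay product in bits.

    link_bps: link capacity in bits per second
    delay_us: end-to-end delay in microseconds
    """
    return link_bps * delay_us // 1_000_000


LINK_BPS = 10_000_000_000  # assumed 10 GbE link

# Assumed delays: a baseline path and one that is 5 microseconds faster.
for delay_us in (50, 45):
    bits = bdp_bits(LINK_BPS, delay_us)
    print(f"{delay_us} us delay: {bits} bits in flight ({bits // 8} bytes)")
```

Under these assumptions, shaving 5 microseconds off the path reduces the data in flight on each 10 GbE link from 62,500 bytes to 56,250 bytes, which is the sense in which lower latency can reduce the buffering a switch needs.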

There is clearly a monetary value in being able to support east-west traffic without having to access a core switch as this saves switch ports. Relative to switch latency, the way that we look at this is that it is difficult to put an absolute dollar value on cutting the latency of a data center switch by a few microseconds. Our position is that the value of low latency switching is largely determined by the application. For example, there is a lot of interest in low latency switching from IT organizations that want to reduce cost by converging their LANs and SANs. We also see interest in low latency LAN switching from the health care and financial sectors. However, in a lot of cases reducing the latency of data center switching by a few microseconds doesn’t provide any monetary benefit.

For certain financial applications even a nanosecond of reduced delay has a monetary value. In similar fashion, in many instances of high performance computing (HPC) it is possible to put a monetary value on cutting data center switch latency by a few microseconds.

What do you see as the primary use cases for either 40 GbE or 100 GbE? How much of the market, if any, will there be over the next two years for these speeds?

HP just announced a new generation of enterprise servers based on Intel’s Romley chipset. These new servers feature native 10 GbE on the motherboard. Having 10 GbE on the motherboard will drive down the price of 10 GbE and correspondingly increase the size of the market. In addition, having 10 GbE in top of rack switches will drive the need for 40 GbE in the aggregation layer.

Except for some very high-end research centers and some service providers, there is not much customer interest today in 100 GbE. That lack of interest stems in part from the fact that 100 GbE is currently extremely expensive. We expect that the price of 40 GbE will go down faster than the cost of 100 GbE and that it will be at least 2013 before we see significant price reduction for 100 GbE. We expect that those price reductions will be accompanied by a reduction in the power requirement and an increase in the density of 100 GbE solutions.

We are beginning to see the start of a market for 100 Gb Ethernet, primarily from the Internet exchanges and we are seeing somewhat more of a market for 40 Gb Ethernet. One of the factors that will give a boost to the 40 Gb Ethernet market is that Intel should soon be shipping their Romley chips which can enable a lot of high capacity functions, such as VMotion at speeds greater than 10 Gbit/second.

The real market for 40 GbE will not develop for twenty-four to thirty-six months. The primary driver of the market will be for points of aggregation. One advantage of a 40 Gb/second Ethernet solution is that it avoids the hashing problems that can occur with using four 10 Gb/second connections. A tipping point in terms of when the 40 Gb/second Ethernet market takes off will be when the total cost of a 40 Gb/second Ethernet connection is equal to or less than four times the cost of a 10 Gb/second Ethernet connection.
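The hashing problem mentioned above can be illustrated with a minimal sketch. It assumes a link aggregation group that pins each flow to one member link by hashing the flow's addresses and ports; CRC32 stands in for whatever hash a real switch uses, and the flow tuple, rates, and function names are hypothetical.

```python
import zlib

MEMBER_GBPS = 10       # capacity of each member link in the bundle
NUM_MEMBERS = 4        # four 10 Gb/second links
SINGLE_LINK_GBPS = 40  # one 40 Gb/second link


def pick_member(flow, num_members=NUM_MEMBERS):
    """Map a flow (src, dst, src_port, dst_port) to one member link."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_members  # stand-in for a vendor hash


def delivered_on_lag(flows):
    """Gb/second delivered when every flow is pinned to a single member."""
    load = [0] * NUM_MEMBERS
    for flow, gbps in flows.items():
        load[pick_member(flow)] += gbps
    # Each member can carry at most MEMBER_GBPS, however idle the others are.
    return sum(min(member_load, MEMBER_GBPS) for member_load in load)


def delivered_on_single_link(flows):
    """Gb/second delivered on one 40 Gb/second link shared by all flows."""
    return min(sum(flows.values()), SINGLE_LINK_GBPS)


# One large flow offering 25 Gb/second (an assumed figure).
flows = {("10.0.0.1", "10.0.1.1", 49152, 8080): 25}
print(delivered_on_lag(flows))          # 10
print(delivered_on_single_link(flows))  # 25
```

In this simplified model a single flow can never exceed the speed of one member link, and unlucky hashing can place several large flows on the same member while others sit idle. A native 40 Gb/second link avoids both effects.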

The customers that are currently purchasing data center LAN switches are telling us that the equipment that they buy has to last seven to ten years. So, we get questions such as “If I buy your equipment today, can I later just buy a 40 Gb/second Ethernet card and pop it in?”

We don’t see high end Web properties (e.g., Google, Yahoo) as being big enough to drive the 40 Gb Ethernet market. We believe that storage is the killer application for data centers and that the convergence of the LAN and SAN will be the primary driver of the adoption of 40 Gb Ethernet. However, in today’s environment a 40 Gb/second Ethernet card costs more than four 10 Gb/second Ethernet cards. We don’t believe that there will be broad market adoption of 40 Gb Ethernet until the costs come down to where a 40 Gb/second Ethernet card costs less than four 10 Gb/second Ethernet cards.

Many IT organizations either currently use or are moving towards using 10 GbE at the access layer of the network. The primary use case for 40 GbE and 100 GbE will be for aggregation of the access layer of the network with higher speed ports.

Today many vendors support 40 GbE and the price points for it are falling. 100 GbE is still quite expensive and its use is largely constrained to service providers. In the next two or three years 100 GbE will start seeing traction in enterprises.

Mega-trends like video, BYOD, Cloud, Virtualization and workload mobility are all contributing to increasing bandwidth requirements. We’re seeing this reflected in the growth of 10 GbE across the switching portfolio. The price of 10 GbE has also come down which is helping to drive that market and we believe that eventually there will be a significant market for both 40 GbE and 100 GbE. The market for 40 GbE and 100 GbE will be driven in large part by the data deluge that is beginning to impact most IT organizations. For example, as the access points become increasingly bandwidth hungry, they will drive the need for higher speeds in the aggregation and core layers.

Many of our enterprise clients want to make sure that they can implement 40 GbE when required. One way that we are helping future proof their environment is by providing flexibility where they buy a 40G port, but can use it as four 10G ports today and convert it to a single 40G port later. This functionality was announced on our Catalyst 6500 platform and the Nexus 3000 series. Cisco has also announced 100 GbE on the Nexus 7000 platform. This is targeted primarily at cloud providers and data centers that are looking to minimize chokepoints from the core of their data center to the cloud or service provider.

One thing that we didn’t talk about in previous months was the impact of Big Data on data center LAN design. What impact, if any, do you think that Big Data will have on data center LAN design over the next two years?

One impact that big data is having on data center LAN design is that it creates more server-to-server traffic, which in turn drives the need for a flatter network. HP’s Intelligent Resilient Framework (IRF) is intended to help IT organizations build flatter data center LANs. IRF does this by providing a common, virtualized fabric spanning data center core, distribution, and access layers.

We believe that Big Data is transformational for network architectures. Hadoop, for example, excels at doing complex analyses, including detailed, special-purpose computation, across large collections of data and it relies on DAS, not on a SAN. One thing that Hadoop does that is unique is that it moves the processing to the data. Another way that Hadoop is unique is that it is the world’s first topology aware file system. It is also not very latency sensitive and you can route it. It does, however, drive very high levels of buffer utilization.

Big data is about having the right information and using it to make the right decisions. A lot of our customers have been doing this for a long time. For example, we have been working with the automobile industry for quite a while on applying business analytics. The phrase “Big Data” is just a new term with a lot of hype behind it for an old concept.

In addition to not being new, we view Big Data as just another service along with services such as video and SAP. Like any service, you have to provision Big Data with the appropriate policies; i.e., QoS, security. That can be done today with the traditional physical data center infrastructure, but it is very difficult.

Big Data has become a top catchword like cloud used to be. In a traditional environment, IT organizations tend to make decisions about storage, networking and computing in relative isolation. We see many IT organizations that are supporting big data implementing a converged infrastructure that is optimized for big data. In most cases that infrastructure is one or more racks, each of which hosts integrated storage, networking and computing.

There is no doubt that Big Data is being talked about by a lot of vendors and that as a result there is a lot of buzz and awareness around it. Today the phrase Big Data refers primarily just to analyzing large amounts of raw data at rest and applying business intelligence to that analysis. Another way to look at big data goes back to the concept of the forthcoming data deluge and the need to analyze data in motion. This could be real-time analysis of sensor data from machine-to-machine traffic or the Internet of Things (IoT), or it could be the analysis of video. In the case of sensor data, very often companies want to take some action in real-time based on that data. We will increasingly see applications tap into this. In the case of video, one of the purposes of performing real-time analysis is to determine if the video can be optimized and/or intelligently re-directed.

What do you see as either the biggest fallacy in the press or in analyst reports about how IT organizations are re-designing their data center LANs, or the data center LAN technology that is the most over-hyped? Why?

One fallacy is that everybody needs a large, flat Layer 2 network to support virtualization in the data center. A number of vendors are pushing highly proprietary fabrics as the way to build large, flat Layer 2 networks. Part of the problem is that in many cases these proprietary fabrics don’t actually solve the problem. For example, the Broadcom Trident+ chip only supports 16,000 ARP entries. That is not enough ARP entries to support a data center that has several thousand servers with ten or twenty virtual machines and possibly multiple virtual NICs per physical server.
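The ARP arithmetic behind that claim can be sketched as follows. The 16,000-entry limit is the figure quoted above; the server counts, the VM densities, and the assumption that every virtual NIC consumes exactly one ARP entry in a single flat Layer 2 domain are illustrative simplifications, and the function names are hypothetical.

```python
ARP_TABLE_LIMIT = 16_000  # figure quoted in the interview


def arp_entries_needed(servers: int, vms_per_server: int, vnics_per_vm: int = 1) -> int:
    """Entries required if every virtual NIC needs its own ARP entry (assumption)."""
    return servers * vms_per_server * vnics_per_vm


def fits(servers: int, vms_per_server: int, vnics_per_vm: int = 1) -> bool:
    """True if the assumed deployment fits within the ARP table limit."""
    return arp_entries_needed(servers, vms_per_server, vnics_per_vm) <= ARP_TABLE_LIMIT


# Assumed deployments: 3,000 servers at two VM densities.
for vms, vnics in ((10, 1), (20, 2)):
    needed = arp_entries_needed(3_000, vms, vnics)
    print(f"3000 servers x {vms} VMs x {vnics} vNICs = {needed} entries, fits: {fits(3_000, vms, vnics)}")

# Largest server count that fits at 10 VMs per server, one vNIC each.
print(ARP_TABLE_LIMIT // 10)
```

Under these assumptions, 3,000 servers with ten single-NIC virtual machines each already need 30,000 entries, nearly twice the table size, and the limit is reached at 1,600 servers.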

Another huge fallacy is that OpenFlow is the answer to every problem. There are some obvious use cases for OpenFlow. One such use case is to send flows to a central location for analysis thus creating a very intelligent, centralized sniffer. However, we don’t believe that OpenFlow will commoditize switches nor will it introduce open semantics into the industry. In addition, while OpenFlow has some cool potential, you have to be a computer science guru to actually utilize it. As a result, it is a protocol for vendors, not for mainstream IT organizations.

Fibre Channel over Ethernet (FCoE) has been over-hyped for years. The fact is that FCoE is still an immature technology and as a result, the amount of true FCoE that has been deployed is really very small. In addition, there are other important issues that impact IT organizations that are looking to converge their storage and data traffic. This includes the need to have a system that can manage the converged infrastructure from a single console. It also includes the resistance of the organization to converging technologies. If indeed IT organizations do converge their data and storage networks, there are multiple ways they can do this including FCoE, iSCSI, NAS and ATA over Ethernet. Whether or not Ethernet will be the dominant solution for converging the LAN and the SAN is hard to say.

One of the biggest areas of hyperbole that we have seen for the last few years is that it will be common to have an end-to-end FCoE infrastructure in data centers. What we are seeing is that the traditional SAN model is doing very well and that some IT organizations are implementing FCoE at the edge of the network to enable the convergence of I/O ports.

We don’t find that much sustained hype around technologies. What typically happens is that the technology either gets through the hype cycle or else it fades away. That said, driven by the trade press and some vendors, today there is a lot of hype around Software Defined Networks (SDNs) and OpenFlow. The reality is that both SDN and OpenFlow are still at a fairly embryonic stage of development and the market needs to get educated to sift through the confusion. What is the killer application for SDN and/or OpenFlow? How will either or both of these approaches benefit mainstream IT organizations for their deployment use-case? What would be good to incorporate into a production environment? Answering these questions will help clear the air and provide a more pragmatic approach to designing the data center LAN.

While we don’t often find sustained hype around technology, we do often find sustained hype around some of the business aspects of networking. An example of that is the myth that if an IT organization buys a switch that costs less, the total cost of ownership (TCO) of the network is less. In reality, the majority of the cost is operational expenditure, and that cost is driven by operational complexity. To avoid getting caught up in this type of hype, IT organizations should take a holistic look at the TCO of the network, including the operational cost, and not just look at the TCO as a function of list price on a box-by-box basis.

The biggest fallacy that we see is that customers must extend Layer 2 across their networks. While extending Layer 2 across networks has some advantages, it also creates some concerns about the stability of the network. In some cases, a better approach is to implement a protocol such as VXLAN or NVGRE.

With all of the emerging data center LAN technologies, is a multi-vendor data center LAN desirable? Possible?

We believe that when IT organizations redesign their data center LANs, it is important that they implement standards-based solutions. One of the reasons for this is that it makes it easier for the IT organization to implement a multi-vendor solution if that is one of their goals. We don’t believe, however, that a lot of IT organizations will buy a top of rack switch from one vendor and a core switch from another. We also believe that, in spite of all of the discussion, most IT organizations have not implemented an Ethernet fabric in the data center. From what we see, however, the industry is poised to cross the chasm relative to deploying Ethernet fabrics in the near term. That said, the movement to implement an Ethernet fabric is very challenging. For example, it is notably more difficult than just going from RIP-1 to RIP-2.

There is a binary decision that customers make whether to implement a best-of-breed solution in their data centers or to go with a one-stop-shopping approach whereby they implement products from just one company. (A best-of-breed solution is typically multi-vendor.) Whichever approach they take, there are a few large vendors that cater to that approach.

In the long run, companies are better off with a best-of-breed approach because networking, storage and computing have different technology cycles and a best-of-breed approach enables them to be on the cutting edge with all three technologies. This is typically not possible with a one-stop-shopping approach.

The same decision about best-of-breed vs. one-stop-shopping also applies just to networking. For example, IT organizations need to decide if they will get their L2/L3 functionality from the same vendor that supplies them their L4 – L7 functionality. In the last ten years we have seen a gradual shift to where IT organizations increasingly take a best-of-breed approach to acquiring networking functionality.

If you take a close look at today’s data center LAN environment, it is usually multi-vendor. Whether or not it is desirable to have the data center LAN be multi-vendor is a business decision. For example, in some cases an IT organization can reduce its CAPEX by having a multi-vendor environment. However, customers need to ask whether having a multi-vendor environment makes the environment simpler or more complex. Does it add overhead and increase operational costs? What impact does it have on support and testing?

What most customers find is that while having a multi-vendor environment may reduce CAPEX it will usually increase OPEX by a greater amount. That is why most customers look at cost holistically over a long period of time. In addition, most IT organizations place a high value on having a vendor that has a roadmap to help them migrate from their current environment to their new environment in an evolutionary manner. Most customers also appreciate having a vendor who will solve their problems before there is a standards-based solution and then offer them a roadmap to the standards-based solution when it is available.

We see customers that are concerned about implementing proprietary technologies because if they implement one of these technologies they are locked in for a very long time. We see TRILL as having more of an enterprise focus and SPB having more of a service provider focus. At HP, we will support both technologies. We will also combine technology such as TRILL with our IRF functionality to allow us to deploy networks with a high degree of scalability.

A lot of vendors are pushing proprietary fabrics or technologies that enable a proprietary fabric. This includes Juniper’s QFabric and Cisco’s use of a proprietary version of TRILL, which they refer to as FabricPath. The primary value of those fabrics has nothing to do with technology. Their primary value is that they give the vendors an excuse to go back and talk to their customers and try to sell them something.

MC-LAG (Multi-Chassis Link Aggregation Group) is a technology that IT organizations can utilize today to evolve their data centers. MC-LAG does involve some proprietary technology, but only between pairs of switches. As a result, there is dramatically less vendor lock-in associated with implementing MC-LAG than there is with implementing a pre-standard version of TRILL. What is ironic is that in the current environment, the largest network that you can build using TRILL is smaller than the largest network that you can build using MC-LAG even though one of the key factors driving the development of TRILL is to enable large scale LANs.

What do you see as the biggest mistake being made by enterprise IT organizations as they re-design their data center LANs?

There has been somewhat of a cookie cutter approach to data center design and we see a lot of IT organizations that are resistant to move away from that traditional approach to designing data center LANs.

To us, that resistance to change is a red flag because the evolving requirements being placed on the data center are not well supported by a traditional design. For example, applications are changing, where the data resides is changing and traffic patterns are becoming more east-west than north-south. We see this as a good time for IT organizations to go beyond traditional approaches such as the use of the spanning tree protocol and look for more efficient ways to enable communications between servers.

Customers find rightsizing to be challenging. We see many IT organizations focus just on bandwidth and not on understanding the demands of both physical and virtual workloads. A related mistake is that many IT organizations overestimate the required capacity because they don’t leverage the intelligence that is in their operating systems. Other IT organizations ignore the forthcoming data deluge and underestimate their capacity requirements, which also has implications for security and management.

We also see some IT organizations making buying decisions without looking at the track record of vendors including the resiliency of the underlying hardware or software. For example, we have seen some IT organizations buy into an emerging technology that looks good on PowerPoint slides, without evaluating the support and service models that are in place to help the IT organization implement the new technology.

That goes back to the previous question. We see that too many IT organizations just pick a solution without looking at all of the impact. In some cases IT organizations regard their role as merely implementing a solution that is designed and architected by their vendor. This approach tends to result in the deployment of proprietary solutions that lock the customer in going forward.

This is a good time for IT organizations to look around at multiple vendors and ensure that the architecture that they adopt can evolve as their needs evolve. It is also a good time for IT organizations to look at issues such as how the consumerization of IT will impact networking as well as the trend to implement an architecture that tightly integrates networking, storage and computing.

The biggest mistake is that IT organizations are buying into the hype around proprietary fabrics and technologies such as TRILL. By the way, we have similar concerns with Shortest Path Bridging (SPB) as we have with TRILL, but we see much less support for it. Part of why this approach galls me is that when I see an IT organization design a network around some of these proprietary technologies, I know that means that going forward they have lost all freedom of choice.

We don’t believe that Arista or any vendor has the perfect solution for all use cases. We do believe, however, that IT organizations are making a mistake if they don’t talk to multiple vendors when they are redesigning their data center LANs. First off, if they do talk to multiple vendors they will get a lower price from their chosen vendor. In addition, by talking to multiple vendors they are more likely to get a solution which is right for them and hopefully they can also avoid implementing proprietary technologies that severely limit their choices going forward.

The biggest mistake we see is that some IT organizations are not receptive to implementing a new architecture because they are used to doing things the way they always have done them and because they are risk averse. We believe that by not being receptive to new architectures, IT organizations are short-changing their companies. For example, a lot of IT organizations don’t allow VMotion and hence don’t realize the benefits of VMotion. That is a good example of how an IT organization can short-change the company that it supports.

Another big mistake that we see IT organizations make when they re-design their LANs is that they look to do it entirely on their own. Many IT organizations would benefit from using consultative services to help them understand new technologies and new approaches to data center LAN design.

Any last pieces of advice for IT organizations that are contemplating re-designing their data center LANs?

In most cases, a data center is not just a cost center but it is also a strategic asset. IT organizations should recognize that and go beyond analyzing just speeds and feeds when re-designing their data center LANs. Mapping IT decisions to business impact is therefore critical. They need to ask questions like “Will this approach help the company become more agile?” or “Will this approach help the IT organization break down the organizational silos?”

We also encourage IT organizations to look at implementing a design that can evolve as their requirements evolve. For example, a LAN switch that is purpose built for low latency may not be the best solution for convergence in the data center or virtualization requirements like workload mobility. IT organizations should look at the holistic breadth of what a given design can deliver and choose solutions that offer flexibility and investment protection.

Now is the time for IT organizations to look at the different data center LAN architectural options. In many cases, once an IT organization makes a choice of architecture, it will be very difficult to make changes.

First of all, IT organizations need to identify the problem they are trying to solve and make sure that the organization and the organization’s stakeholders are lined up behind that problem statement. If that doesn’t happen, IT usually ends up with a set of requirements that is way too broad. Secondly, IT organizations should bring in at least three or four vendors. If they do that, the IT organization will get a better solution and will pay less for it.

When IT organizations are re-designing their data center LANs, and in most cases that means introducing a virtualized architecture, they should not underestimate the importance of network management. Too often an IT organization’s approach to network management is something like “I bought it, but I never installed it,” or “All I need is some CLIs and I can manage this.”

We are at a time when multiple components of IT are converging and IT organizations need the proper tools that can help them to understand the applications, the storage and where virtual machines are located. However, sometimes network management gets overlooked.

We believe that when IT organizations think about building the next generation data center, they need to think about agility in a very different way. For example, in the traditional environment it was common for it to take days or weeks to add a new service or make some other type of change to the infrastructure. That era is ending, and today adding a new service or making a change must be done in minutes.

To meet these requirements, IT organizations need to put the agility of provisioning at the front and center of their design philosophy. To obtain this agility, new technologies must be used and the data center LAN has to be more fabric-based and it must allow for multiple paths between any two points in the network.

