March 31, 2010

Compliance for Internal Wi-Fi SLAs


Networks are living creatures, ever-growing and ever-changing, and Wi-Fi networks change even more quickly and in more ways than wired networks. New APs are deployed, neighboring networks pop up, clients roam, data rates change, and flash crowds converge on one part of the network. Users love the freedom and mobility, but they also demand that IT consistently deliver predictable performance in spite of the dynamic nature of the network. Wi-Fi networks are getting faster and smarter, and that helps. But Wi-Fi administrators are still hampered by minimal visibility into how well they are actually delivering what their users expect: they cannot see the actual performance level the infrastructure is providing to Wi-Fi clients, and the infrastructure itself has only a limited ability to adjust dynamically to deliver the experience users demand. Today's Wi-Fi administrators can only hope their clients are getting the planned level of performance - and that isn't good enough.

Before Wi-Fi protocol analyzers, administrators and consultants alike could only troubleshoot by continually reviewing the network design and the operation of devices within the infrastructure. Gathering meaningful performance statistics and performing trouble analysis and repair was difficult, if not impossible. With the introduction of Wi-Fi protocol analyzers, these professionals had the equivalent of RF goggles: they could now see what was happening and could reactively troubleshoot problems. The problem with this approach is the inability to diagnose and repair performance problems in near real-time. With this in mind, Aerohive has introduced the next level in network visibility and reactive response. Aerohive's new infrastructure-side performance monitoring and response system, dubbed SLA Compliance, increases troubleshooting granularity and active response speed far beyond what any IT professional could accomplish manually, and paves the way for IT to move toward actual performance guarantees.

Download Paper
(Webtorials registration required.)


9 Comments

Service Level Agreements (SLAs) have been a staple of the WAN menu for a couple of decades. These began primarily as a method for service providers to "guarantee" a certain set of expectations for their customers.

Of course, many, if not most, IT shops act as a service provider to their customers. Consequently, it became common throughout the 1990s for the IT department to offer SLAs to their internal customers. This ensured that the internal customers were getting the services for which they "paid," whether the "payment" was direct or indirect.

Now that Wi-Fi is an integral part of the corporate network, it's only natural that users expect certain levels of service from it.

This document does a great job of exploring the options and parameters for internal SLAs for Wi-Fi services.

In the example in the paper, it seems that there is a singular value, bandwidth, which is measured in the SLA. Is this the only value that needs to be monitored? What about issues like latency and error rates?

In the first iteration of our SLA feature we used throughput as the best indicator of client health, but we also said that this is just the first metric. With SLA we built an engine that monitors every client on the network and reports proactively to the administrator; now that the engine is in place, adding other metrics is easy.

Also, just as a clarification, we look at throughput rather than bandwidth - a subtle difference in meaning. When we look at throughput, we evaluate whether a client (a) is below the throughput threshold established by the SLA and (b) wants more throughput but can't get it. If both (a) and (b) are true, then there is a problem. This is important because some clients may want only 1 Mbps even though the SLA may be set higher - SLA is smart enough to realize that no alarm should be triggered in this state.
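The two-condition check described above can be sketched in a few lines. This is an illustrative sketch only, not Aerohive's implementation; the function and parameter names are assumptions.

```python
def sla_violation(measured_mbps: float, demanded_mbps: float,
                  sla_threshold_mbps: float) -> bool:
    """A client is only in violation when it is (a) below the SLA
    threshold AND (b) actually trying to use more than it is getting.

    Hypothetical sketch; names and semantics are illustrative.
    """
    below_threshold = measured_mbps < sla_threshold_mbps
    wants_more = demanded_mbps > measured_mbps
    return below_threshold and wants_more

# A client idling at 1 Mbps under a 3 Mbps SLA raises no alarm:
assert not sla_violation(1.0, 1.0, 3.0)
# A client stuck at 1 Mbps while trying to push 5 Mbps does:
assert sla_violation(1.0, 5.0, 3.0)
```

The key design point is condition (b): without it, every lightly loaded client would look like a violation.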

It's common in an SLA when measuring the bandwidth for there to be a certain threshold over which the throughput is measured. From a network design perspective, there's a lot of difference in guaranteeing 3 Mbps with a "reset" every second as compared to 3 Mbps with a "reset" once an hour. (This is an important parameter since it's integral to any overbooking of available bandwidth.)

This leads to two questions. First, what is the timeframe over which you measure the bandwidth, and secondly, what do you recommend as parameters for the extent to which one determines the allowable amount of overbooking in the SLA?

Good point, and this is something we have not highlighted. We check the average throughput every 5 seconds for every client on the network. Before we raise an alarm, though, we like to see consistent congestion for 3 periods in a row. We evaluated both shorter and longer time frames, and we found that 5 seconds was the best window to eliminate false positives while still being responsive to client issues.

Up until relatively recently, it was a pretty safe assumption that the majority of data over a Wi-Fi network was variable bit rate and non-realtime. However, with applications like VoIP and video streaming, the demand for realtime and nearly constant bit rate is increasing. Should these be treated separately in an SLA?

That is the challenge of Wi-Fi - data rates can fluctuate instantaneously based on localized and transient signal and noise conditions, and while there are ways to mitigate this, there is no way to eliminate it. One of the attractive things about SLA is that it allows an administrator to test the network: set the limit to 5, 10, or 20 Mbps for each client and see if those throughput levels can be met - if they can't, the system gives an understanding of why not.

SLA provides a management and enforcement layer over a non-deterministic medium, enabling the establishment and monitoring of deterministic thresholds. If the network is not able to achieve the determinism required, it can respond automatically to reach the threshold. In the unlikely event that the threshold still cannot be met, the reporting gives the information required to improve the network design so that it does not happen in the future.

I think that Steve is headed toward an important question. The 802 committee has defined a way to do QoS in 802.11e. The white paper sings the praises of protocols (low power, low maintenance, obsolescence-proof, etc.). The downside of protocols is that there are a lot of them and some of them overlap in function. So -- how does Aerohive dequeuing relate to the existing standard?

Second, looking backwards we see that 80+ percent of the airtime is used to send data toward clients. Probably more like 90 percent. How reliant is the Aerohive method on having the queue in the AP as opposed to queuing in clients waiting to send inbound packets? Is this one-way QoS?

-jim

Good question, and you are right that this is very important. I will answer in the context of WMM, which is the only portion of 802.11e widely adopted by both clients and APs. WMM, while useful, is less a QoS system and more a hardware buffer, which is a very different thing. WMM provides strict priority and preemption based on the hardware queues. For this to be an intelligent system, there needs to be something behind WMM, and that is what Aerohive's scheduler does: we feed the hardware queues packet by packet to ensure that WMM does what we want. Think of WMM as complementary to SLA. Another important point is that we try not to do anything non-standard or unintended with the 802.11 protocol. There is an incredible diversity of Wi-Fi clients, and with that comes a range in quality; some clients aren't tolerant of any deviation from expected AP behavior. To maximize compatibility we follow the standard as closely as possible and add things like SLA at a higher layer.

On the second point, we queue downstream but are also able to affect upstream, because our queuing mechanisms look at aggregate airtime (both send and receive); we decrement a client's airtime budget whether the airtime is used upstream or downstream. At the TCP level (or the session or application layer for UDP protocols) the protocols react, which changes the upstream utilization. This works quite well. In addition, the PER gives the AP precedence, which means that the AP can usually send when it needs to.
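The aggregate-airtime accounting described above can be sketched roughly as follows: each client carries an airtime budget that is charged for frames in either direction, and the scheduler serves the client with the most remaining credit. This is a simplified illustration under assumed names, not Aerohive's actual scheduler.

```python
class AirtimeLedger:
    """Toy aggregate-airtime account, as a sketch of the idea above.

    Names and the max-credit selection policy are illustrative assumptions.
    """

    def __init__(self, clients):
        self.credit = {c: 0.0 for c in clients}

    def charge(self, client, airtime_ms):
        """Charge airtime whether the frames went upstream or downstream."""
        self.credit[client] -= airtime_ms

    def replenish(self, quantum_ms):
        """Grant every client the same airtime quantum each cycle."""
        for c in self.credit:
            self.credit[c] += quantum_ms

    def next_client(self):
        """Serve the client with the largest remaining credit."""
        return max(self.credit, key=self.credit.get)

led = AirtimeLedger(["laptop", "phone"])
led.replenish(10.0)
led.charge("laptop", 8.0)            # heavy downstream user
assert led.next_client() == "phone"  # the lighter user is served first
```

Because the charge is direction-agnostic, a client saturating the uplink is deprioritized for downstream service as well, and transport protocols then back off upstream.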

SLA is actually the culmination of a bunch of preceding work on airtime, monitoring, QoS, etc., and depends on those technologies to deliver this level of determinism.

