Friday, July 27, 2007

Cisco Ripples - DCA and RRM - Help is on the way

Since I first published "The Ripple Effect" back in February, I have heard from many folks who have validated the effect, but to my chagrin I have had no solution to offer. Thankfully, there are smarter people than me out there, and solutions have started to appear.



I was alerted to the fact that Medical Connectivity consulting recently put Cisco in their sights and quoted my blog with regard to Dynamic Channel Assignment and RRM causing issues. The Web, being the great time waster that it is, led me on a journey. As I read the article I clicked here and there, and the next thing I knew I was looking at a Cisco forum that was discussing this exact phenomenon.



One of the forum posters had some great suggestions to eliminate this problem in the future. Bruce Johnson at Partners Healthcare offered this solution,



"We saw the majority of DCA events were triggered by Interference from Rogue APs. After we disabled Foreign AP Avoidance the number of channel changes dropped by an entire order of magnitude (1000s to 100s). We disabled Cisco AP Load Avoidance and this reduced the number of DCAs within an order of magnitude (100s less).



DTPC will power-up APs to max levels to provide a 3-neighbor -65 RSSI coverage "grid" and 7921s will power up to follow suit (up to their max Tx Power). Other clients with higher Tx power may send the APs to max power causing a mismatch with IP phones.



You can decrease the tx-power-threshold so the "grid" won't be as hot (default is -65, change to -71 or -74):



config advanced 802.11a tx-power-control-thresh <-50 to -80>
config advanced 802.11b tx-power-control-thresh <-50 to -80>



and reduce the coverage hole detection threshold (reduce Min SNR level in RRM Thresholds) to suppress the power-up activity."

Bruce seemed on track with this fix. The problem is that it isn't a fix. It shuts off RRM and DCA so that the WLAN remains stable. So where is the benefit of a controller-based system?



He does note that a fix is forthcoming from Cisco: "They are revamping the behavior of RRM in the WLC 4.1 Maintenance release." This was later confirmed by a Cisco employee, Saurabh Bhasin, a TME:



"With the 4.1 Maintenance Release(MR) due out on cisco.com shortly, many improvements based on such feedback have been brought into RRM's algorithms - improvements aimed at allowing administrators to fine-tune their RRM-run WLANs where desired. These enhancements will allow for greater control over both the channel and power output selection algorithms, so administrators may assist RRM in being either more or less aggressive in such decisions, depending on application and network needs. Additionally, enhancements have been made to the management and reporting of all RRM information and configuration alterations to allow for better tracking of RF environmental fluctuations and to assist in keeping track of RRM activity. Further technical detail on the inner workings of these enhancements will be available very soon in an update to the above-mentioned RRM Whitepaper."
The paper he references is found here http://www.cisco.com/warp/public/114/rrm.html and explains a lot of what we are all seeing. (here is the PDF version)



So here's hoping that the WLC 4.1 Maintenance Release fixes it. As an aside, Bruce Johnson is skeptical:


"Its all well and good to make things work for Intel and the CCX/CCKM compliant crew, but if you have any of the other brands of WLAN NICs (like those made by medical device manufacturers, who won't subscribe to fast roaming features until they're adopted by the IEEE) you are best keeping RRM disabled until it delivers on its promise as stated in the following 802.11TGv Objectives draft:

Service and Function Objectives

Solutions shall define mechanisms to provide the service listed below.

[Req2000] TGv shall support Dynamic Channel Selection, to allow STAs to avoid interference. Solution shall be able to change the operating channel (and/or band) for the entire BSS during live system operation and be done seamlessly with no intermittent loss of connectivity from the perspective of an associated STA. Solution shall not define algorithm for channel selection."


Thursday, May 3, 2007

Ripple Effect - Redux

Early in the year I posted an article about how the Cisco WLAN controller system may behave strangely in some conditions. I got some email from folks who had major issues with it. One poster said, "Before Cisco purchased the technology from Airspace, they had already put dampeners in the RRM so the hysteresis you describe wouldn't occur." This is just plain wrong. Cisco wants to sell more switches and routers, and they found that purchasing the Airespace system would help them do just that, but they did not make this significant change before releasing it with their name on it. And they are still changing the behavior of the WCS today because this problem still exists.



Did I lose you? As a refresher for those who did not see the original article, it is posted HERE.



Since I published that comment back in early February I have spoken to quite a few people who have seen the same effect in their environments in recent months. One network engineer wrote, "I can vouch for having observed this recurrent DCA behavior, also in a hospital environment (12-24 channel changes per day across 10 floors of APs, as you depict in your example). The architecture is not alerting us to this being the result of interference or noise (no WLC or WCS events of either type), and the RSSI of rogue APs is above the threshold required for triggering DCA (neg 85dB)."



I was asked by the naysayers what Cisco told its customers to do, and here is what that same engineer said: "We have been told by Cisco that the 100mW AP neighbor beacons, used to determine the picture of the network, does not get input into DCA. Cisco claims these 100mW beacons are used only for dynamic power control, which we hold static -- do you think this voids the dynamic algorithms? Other docs say the RSSI of neighbor APs is the most important criterion in DCA behavior! In lieu of noise and interference alerts we can only surmise its the APs themselves that are the cause of their own DCA ripple effect."



This is just one example. I have also spoken to other folks who say that the Aruba system they are running does not do this. They say it is much more stable: after the original "learning" time it settles down and stays that way as long as the network is in use. I think this makes sense. Why change the whole network because of one interferer? Better to be alerted to the fact and deal with it yourself.



I am collecting comments on this and would like to post more testimonials about this effect. If anyone wants to support this claim publicly, please feel free to drop me a line at bruce@hubbert.org or comment on this post. My goal here is not to raise hysteria but to get things fixed and level the playing field. The infrastructure vendors tend to pitch the idea that they offer a panacea for all Wi-Fi woes, and I feel that is just a flavor of "Kool-Aid" I am unwilling to drink.




Saturday, February 3, 2007

The Ripple Effect - Problems with Cisco’s Radio Resource Management (RRM)

Introduction:

In its Unified Wireless Network architecture, Cisco has developed patent-pending technology for dealing with interference detection and avoidance, dynamic channel assignment, dynamic power adjustment, coverage-hole detection and correction, rogue detection and client load balancing. This system is known as RRM, or Radio Resource Management. Its stated goal is to avoid problems in the fixed ISM band of 802.11b/g, where only 11 channels are available to U.S. WLANs. This system, though sound in theory, has problems when applied to large WLANs in urban areas or locales with heavily deployed WLANs, such as Metro WiFi, skyscrapers, hospitals, universities and businesses near residential neighborhoods.

Background on Channel Overlap:

Anyone who has configured their own home access point (AP) knows they are allowed to choose a channel for the AP to transmit on. Since APs use Direct Sequence Spread Spectrum (DSSS) technology, each AP's roughly 22 MHz-wide signal actually spills across about 5 of the 5 MHz-spaced channels.

If an admin were to configure APs to use all channels in the 802.11b/g spectrum, a serious decrease in available bandwidth would occur and users would experience severe throughput loss. Thus an admin is effectively restricted to configuring APs on the 3 non-overlapping channels: 1, 6 and 11. In some cases an admin may opt, out of necessity, to accept a slight overlap and configure a 4-channel plan consisting of channels 1, 4, 7 and 11.
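To make the overlap concrete, here is a minimal Python sketch. It assumes the usual 2.4 GHz numbering (channel n centered at 2407 + 5n MHz) and a nominal 22 MHz signal width; the function names are made up for illustration, and real-world overlap also depends on spectral masks and signal strength, which this ignores.

```python
CHANNEL_WIDTH_MHZ = 22  # nominal width of an 802.11b DSSS signal (assumption)


def center_mhz(channel: int) -> int:
    """Center frequency of a U.S. 2.4 GHz channel (1-11)."""
    if not 1 <= channel <= 11:
        raise ValueError("U.S. 802.11b/g channels are 1 through 11")
    return 2407 + 5 * channel


def overlaps(a: int, b: int) -> bool:
    """True if two channels are closer together than one signal width."""
    return abs(center_mhz(a) - center_mhz(b)) < CHANNEL_WIDTH_MHZ


def overlapping_pairs(plan):
    """List every pair of channels in a plan that overlap each other."""
    return [(a, b) for i, a in enumerate(plan) for b in plan[i + 1:] if overlaps(a, b)]


print(overlapping_pairs([1, 6, 11]))     # the 3-channel plan
print(overlapping_pairs([1, 4, 7, 11]))  # the 4-channel plan
```

Under these assumptions the 1, 6, 11 plan has no overlapping pairs (the centers are 25 MHz apart), while the 1, 4, 7, 11 plan overlaps between each pair of neighboring channels.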

WLAN planning and Site Surveying:

Administrators then need to plan out their deployment so that each AP avoids overlapping its coverage with another AP on the same channel. APs must have their power adjusted to compensate for walls and for coverage gaps that may ensue when a building is not a standard rectangular shape, or when neighbors move in and configure their AP on a channel already used by the admin's organization. Each power adjustment may increase or decrease the size of that AP's cell, which in turn requires further adjustments to the surrounding APs. Lastly, the admin must plan for areas where usage may change very dynamically, such as conference rooms and auditoriums. As one can see, this is really an art, and a whole industry has evolved around designing wireless networks. Usually a Site Survey is needed to map out the existing neighbor APs as well as to plan where to place and map the new APs. Surveys are also recommended from time to time to adjust to changes that may happen around the organization as well as within it.

Cisco's Solution:

The Cisco Unified Wireless Network (UWN) architecture hopes to avoid this problem by sensing the types of problems that occur in WLANs and automatically compensating. Problems such as:


  • A neighbor moving in next door or upstairs and implementing APs that overlap yours
  • Coverage gaps that occur when walls, cubicles and other furniture are moved, added or removed
  • Loss in throughput when people, who are mostly water, move around in a company and group together in conference rooms or other areas (water attenuates or "blocks" radio waves)

Cisco has a brief description on its website HERE and a much more in-depth description HERE.

On that second page Cisco describes how this works under the section entitled "Radio Resource Monitoring":

Management of an RF network requires strong visibility into the factors affecting the air space. Cisco lightweight access points are specially designed to not only offer service, but to also monitor all channels at the same time. This is a result of the extensive development work Cisco has performed on the 802.11 MAC layer as part of its split MAC architecture.

In addition to offering service, Cisco lightweight access points can simultaneously scan all valid 802.11a/b/g channels for the country of operation, as well as for channels valid in other geographies. This provides the highest level of protection - the system will discover rogue access points that might be imported from other countries, or a hacker that knows how to change the country of operation such that the rogue would be out of band and not detected by most WLAN intrusion detection systems (IDSs).

The Cisco lightweight access point goes "off-channel" for a period not greater than 60 ms to listen to these channels. Packets collected during this time are sent to the Cisco Wireless LAN Controller, where they are analyzed to detect rogue access points (whether service set identifiers [SSIDs] are broadcast or not), rogue clients, ad-hoc clients, and interfering access points.

By default, each access point spends only 0.2 percent of its time off-channel. This is statistically distributed across all access points so that adjacent access points are not scanning at the same time, which could adversely affect WLAN performance. This enables administrators to build a picture of what is happening in their WLANs from the perspective of every access point, and increases network visibility beyond what an overlay network can provide, eliminating the "hidden node" problem that can result when air monitors are deployed for every three to five access points.
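As a rough sanity check on the figures quoted above, the back-of-the-envelope arithmetic below shows what a 60 ms dwell and a 0.2 percent off-channel budget imply. This is not Cisco's scheduler: the function names are hypothetical, and it assumes each off-channel visit listens to exactly one channel for the full 60 ms.

```python
MAX_DWELL_S = 0.060           # quoted: off-channel for no more than 60 ms
OFF_CHANNEL_FRACTION = 0.002  # quoted: 0.2 percent of time off-channel


def min_interval_between_visits(dwell_s=MAX_DWELL_S, fraction=OFF_CHANNEL_FRACTION):
    """Shortest average spacing between off-channel visits that stays within the budget."""
    return dwell_s / fraction


def min_sweep_time(other_channels, dwell_s=MAX_DWELL_S, fraction=OFF_CHANNEL_FRACTION):
    """Time to visit every other channel once, one channel per visit (assumption)."""
    return other_channels * min_interval_between_visits(dwell_s, fraction)


# About one 60 ms visit every 30 seconds.
print(round(min_interval_between_visits(), 1))
# With 10 other U.S. 2.4 GHz channels to check, a full sweep takes about 5 minutes.
print(round(min_sweep_time(10) / 60, 1))
```

In other words, under these assumptions an AP looks at any given foreign channel only once every few minutes, which is worth keeping in mind when judging how current its picture of the neighborhood is.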

I will not debate the issues around part time scanning in this article; many others have addressed that already. But I will address the next issue which is how Cisco responds once it has discovered any of the aforementioned problems.

When a station has something to say, it announces it to the media. An access point will allow the station to send its data if the medium is open. If not, the station will be told to wait to transmit until other stations using that medium are finished with it. This prevents two clients from transmitting on the same channel at the same time, which would result in corrupted frames.

With CSMA/CA, two access points on the same channel (in the same vicinity) will get half the capacity of two access points on different channels. This becomes an issue, for example, when someone reading e-mail in a café affects the performance of the access point in a neighboring business. Even though these are completely separate networks, someone sending traffic to the café on Channel 1 can cause data corruption in an enterprise using the same channel. Cisco wireless LAN controllers address this problem and other co-channel interference issues by dynamically allocating access point channel assignments to avoid conflict. Since the Cisco lightweight solution has enterprisewide visibility with its RRM tools, channels are "reused" to avoid wasting scarce RF resources. In other words, Channel 1 will be allocated to a different access point far from the café. This is much more effective than not using Channel 1 altogether, which is what other WLAN systems often do.

Figure 2. Dynamic Channel Assignment

Later in the same document it describes a similar situation as Interference.

"Interference" is defined as any 802.11 traffic that is not part of the Cisco WLAN system, including a rogue access point, a Bluetooth device, or a neighboring WLAN. Cisco lightweight access points are constantly scanning all channels looking for major sources of interference (Figure 3).

If the amount of 802.11 interference exceeds a predefined threshold (the default is 10 percent), a trap is sent to the Cisco Wireless Control System (WCS). The Cisco Wireless LAN Controller will attempt to rearrange channel assignments to increase system performance in the presence of the interference.

Figure 3. Dynamic Channel Assignment Reacting to Interference

Again, I will refrain from diving too deep into interference sources, as Cisco does not even have a way to detect, much less respond to, such non-802.11 interferers as cordless phones, baby monitors, wireless cameras, DECT phones and headsets, etc.

The Problem:

When you have a large number of APs implemented and you are covering a large area, the Cisco system will adjust to compensate for rogues, neighbors and interferers almost continuously. As you add more and more interferers in and around the WLAN, more and more adjustments must be made to compensate for them. As the compensations take place, they run into adjustments coming from the other side of the building, and you get a huge ripple effect that will in some cases cancel out adjustments and in others build up into over-adjustments. The WLAN starts to behave like a wave phase experiment.

Example:

Let us say that we are in a hospital in San Francisco where the average number of APs per block is around a hundred. The hospital has 20 APs per floor and 10 floors in the main building. That's 200 APs, which is quite a large number. This hospital, since it is in an urban setting has many neighbors, many of whom also have APs.

In a typical situation a neighbor to the hospital puts an AP on Channel 1. The Cisco architecture senses this and adjusts to compensate, moving APs from adjacent channels to ones farther away. At or around the same time but on the other side of the hospital, another neighbor appears, but this time the AP is on Channel 11. A similar situation occurs there. At some point the two waves of adjustments meet or cross in the middle. This is made possible because the split MAC architecture of the Cisco UWN has many decisions made in its WLAN controllers. These controllers are distributed and can act semi-independently. By the time the wave reaches the other side of the hospital, the system realizes it is again being interfered with and readjusts.






This wave or ripple action, because it moves across floors and up stories, may go on forever. As more neighbors or interferers come online, more waves are sent out. The larger the implementation, the worse the problem gets. The effect is readily visible and measurable to anyone with a WLAN analyzer. They will see MAC addresses hopping from one channel to the next on a second-by-second basis. The APs will also be changing output power continuously, so the signal will be rising and falling.
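The cascade described above can be illustrated with a deliberately simple toy model. This is not Cisco's DCA algorithm: it assumes a single row of APs on a 1, 6, 11 plan, a made-up rule that any two radios within two positions of each other must use different channels, and a greedy left-to-right fix-up. All names are hypothetical.

```python
CHANNELS = (1, 6, 11)


def static_plan(n):
    """A repeating 1, 6, 11 plan for a row of n APs."""
    return [CHANNELS[i % 3] for i in range(n)]


def ripple(plan, interferer_channel):
    """Greedy fix-up after a neighbor's AP appears just left of AP 0.

    Toy rule (assumption): a radio may not share a channel with either of
    the two radios to its left; the interferer counts as position -1.
    Returns the new plan and the number of APs that changed channel.
    """
    new_plan = list(plan)
    changes = 0
    for i in range(len(new_plan)):
        left = [
            new_plan[j] if j >= 0 else interferer_channel
            for j in (i - 1, i - 2)
            if j >= -1
        ]
        if new_plan[i] in left:
            # Three channels and at most two forbidden, so one is always free.
            new_plan[i] = next(c for c in CHANNELS if c not in left)
            changes += 1
    return new_plan, changes


floor = static_plan(20)  # 20 APs per floor, as in the example
new_floor, changed = ripple(floor, interferer_channel=1)
print(changed, "of", len(floor), "APs changed channel")
print(new_floor[:6])
```

In this toy model a single neighbor on Channel 1 forces every AP in the row to move, because each change pushes a conflict onto the next AP. Had the plan been held static, only AP 0 would share a channel with the newcomer.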

Effects of the "Ripple"

The net effect of this phenomenon is a serious decrease in throughput and a large increase in latency. If you use your WLAN for applications that need low latency or high throughput, such as VoIP over a WLAN (known as VoWLAN or VoFi), or you have low-power handhelds such as the kind used for barcode scanning, this network is unusable. The VoFi traffic will be filled with jitter and conversations will be choppy at best. The handhelds will never be able to sleep or go to low power, as they will always be probing for changes to the environment. If the system had been statically mapped to specific channels that do not change, the WLAN would have had problems, for certain, but these problems would be affecting just the few APs that face the neighbors. Now that all the APs are reconfiguring continuously, the whole WLAN is affected all the time.

WLAN STAs that are associated and attempting to pass data will continuously be probing for new channels and APs to associate with. The amount of roaming will go up dramatically. Roaming takes a few seconds to complete so the problem will be very serious for the end user.

Cisco even mentions this problem in one of their release notes for the CB21AG card, found HERE:

CSCse49324-CB21AG retransmission mechanism has problems with RRM in LWAPP network

A CB21AG client that is operating in an LWAPP infrastructure loses connection for small periods of time. When the AP is performing radio resource management (RRM), the AP goes off channel. During these periods, the AP cannot hear and answer ACK and RTS frames from the client. The client card initiates a scan for another AP, and network traffic for the client is affected.

Workaround: Increase the HwTxRetries value from 4 to 14 (registry entry) so that the client card continues to retry for the 20 to 30 milliseconds that the AP is off channel.

SpectraLink and other VoWLAN vendors specifically warn their customers not to deploy their Cisco UWN architecture with RRM enabled. When a WLAN needs to support voice, the requirements for stability increase dramatically.

Conclusion:

The idea behind automatically adjusting and configuring networks is a good one. Maybe sometime in the near future Cisco will program their controllers to avoid this type of effect, but in the meantime, unless they have a fairly small network or are located far from interference sources and neighbors, admins are urged to complete a thorough site survey, statically map each AP to a channel, and resurvey from time to time.
