Friday, July 27, 2007

Cisco Ripples - DCA and RRM - Help is on the way

Since I first published "The Ripple Effect" back in February, I have heard from many folks who have validated the effect, but to my chagrin I have had no solution to offer. Thankfully, there are smarter people than me out there, and solutions have started to appear.

I was alerted to the fact that Medical Connectivity Consulting recently put Cisco in their sights and quoted my blog with regard to Dynamic Channel Assignment (DCA) and RRM causing issues. The Web, being the great time waster that it is, led me on a journey. As I read the article I clicked here and there, and the next thing I knew I was looking at a Cisco forum thread discussing this exact phenomenon.

One of the forum posters had some great suggestions for eliminating this problem in the future. Bruce Johnson at Partners Healthcare offered this solution:

"We saw the majority of DCA events were triggered by Interference from Rogue APs. After we disabled Foreign AP Avoidance the number of channel changes dropped by an entire order of magnitude (1000s to 100s). We disabled Cisco AP Load Avoidance and this reduced the number of DCAs within an order of magnitude (100s less).

DTPC will power-up APs to max levels to provide a 3-neighbor -65 RSSI coverage "grid" and 7921s will power up to follow suit (up to their max Tx Power). Other clients with higher Tx power may send the APs to max power causing a mismatch with IP phones.

You can decrease the tx-power-threshold so the "grid" won't be as hot (default is -65, change to -71 or -74):

config advanced 802.11a tx-power-control-thresh <-50 to -80>
config advanced 802.11b tx-power-control-thresh <-50 to -80>

and reduce the coverage hole detection threshold (reduce Min SNR level in RRM Thresholds) to suppress the power-up activity."
Bruce seemed on track with this fix. The problem is that it isn't a fix. It shuts off RRM and DCA so that the WLAN remains stable. So where is the benefit of a controller-based system?
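
To put the threshold change from Bruce's quote in perspective, here is a small sketch of the decibel arithmetic. It assumes the threshold values are in dBm (the quote does not state the unit), and the function names are my own, not anything from the WLC.

```python
def dbm_to_mw(dbm: float) -> float:
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (dbm / 10)


def power_ratio(from_dbm: float, to_dbm: float) -> float:
    """Linear ratio between two power levels given in dBm."""
    return 10 ** ((from_dbm - to_dbm) / 10)


# Assumption: thresholds are dBm. -65 is the default cited in the quote;
# -71 and -74 are the suggested replacements.
for new_thresh in (-71, -74):
    ratio = power_ratio(-65, new_thresh)
    print(f"-65 -> {new_thresh}: target signal is about {ratio:.1f}x weaker")
```

Moving the threshold from -65 to -71 is a 6 dB change, so the target signal level between neighboring APs is roughly a quarter of the default. At -74 it is roughly an eighth.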

He does note that a fix is forthcoming from Cisco: "They are revamping the behavior of RRM in the WLC 4.1 Maintenance release." This was later confirmed by Saurabh Bhasin, a TME at Cisco:

"With the 4.1 Maintenance Release (MR) due out on cisco.com shortly, many improvements based on such feedback have been brought into RRM's algorithms, improvements aimed at allowing administrators to fine-tune their RRM-run WLANs where desired. These enhancements will allow for greater control over both the channel and power output selection algorithms, so administrators may assist RRM in being either more or less aggressive in such decisions, depending on application and network needs. Additionally, enhancements have been made to the management and reporting of all RRM information and configuration alterations to allow for better tracking of RF environmental fluctuations and to assist in keeping track of RRM activity. Further technical detail on the inner workings of these enhancements will be available very soon in an update to the above-mentioned RRM Whitepaper."
The paper he references is at http://www.cisco.com/warp/public/114/rrm.html and explains a lot of what we are all seeing. (A PDF version is also available.)

So here's hoping that the WLC 4.1 Maintenance Release fixes it. As an aside, Bruce Johnson is skeptical:
"Its all well and good to make things work for Intel and the CCX/CCKM compliant crew, but if you have any of the other brands of WLAN NICs (like those made by medical device manufacturers, who won't subscribe to fast roaming features until they're adopted by the IEEE) you are best keeping RRM disabled until it delivers on its promise as stated in the following 802.11TGv Objectives draft:

Service and Function Objectives

Solutions shall define mechanisms to provide the service listed below.

[Req2000] TGv shall support Dynamic Channel Selection, to allow STAs to avoid interference. Solution shall be able to change the operating channel (and/or band) for the entire BSS during live system operation and be done seamlessly with no intermittent loss of connectivity from the perspective of an associated STA. Solution shall not define algorithm for channel selection."



Thursday, May 3, 2007

Ripple Effect - Redux

Early in the year I posted an article about how the Cisco WLAN controller system may behave strangely under some conditions. I got email from some folks who had major issues with it. One poster said, "Before Cisco purchased the technology from Airspace, they had already put dampeners in the RRM so the hysteresis you describe wouldn't occur." This is just plain wrong. Cisco wants to sell more switches and routers, and they found that purchasing the Airespace system would let them do just that, but they did not make this significant change before releasing it with their name on it. And they are still changing the behavior of the WCS today because this problem still exists.

Did I lose you? As a refresher for those who did not see the original article, it is posted HERE.

Since I published that comment back in early February I have spoken to quite a few people who have seen the same effect in their environments in recent months. One network engineer wrote, "I can vouch for having observed this recurrent DCA behavior, also in a hospital environment (12-24 channel changes per day across 10 floors of APs, as you depict in your example). The architecture is not alerting us to this being the result of interference or noise (no WLC or WCS events of either type), and the RSSI of rogue APs is above the threshold required for triggering DCA (neg 85dB)."

I was asked by the naysayers what Cisco told its customers to do, and here is what that same engineer said: "We have been told by Cisco that the 100mW AP neighbor beacons, used to determine the picture of the network, does not get input into DCA. Cisco claims these 100mW beacons are used only for dynamic power control, which we hold static -- do you think this voids the dynamic algorithms? Other docs say the RSSI of neighbor APs is the most important criterion in DCA behavior! In lieu of noise and interference alerts we can only surmise its the APs themselves that are the cause of their own DCA ripple effect."

This is just one example. I have also spoken to other folks who say that the Aruba system they are running does not do this. They say it is much more stable: after the original "learning" time it settles down and stays that way as long as the network is in use. I think this makes sense. Why change the whole network because of one interferer? Better to be alerted to the fact and deal with it yourself.
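
To make the "dampener" idea concrete, here is a toy sketch of a switching margin (hysteresis) on a single AP's channel choice. This is not Cisco's or Aruba's algorithm, which I have not seen. The function name, the 5 dB margin, and the sample readings are all made up for illustration.

```python
def count_channel_changes(samples, margin_db):
    """Count channel switches for one AP over a series of readings.

    samples: list of dicts mapping channel -> measured interference in dBm
             (more negative is quieter). Hypothetical data, not real logs.
    margin_db: the AP only moves if the best channel is quieter than its
               current channel by more than this many dB.
    """
    current = min(samples[0], key=samples[0].get)  # start on quietest channel
    changes = 0
    for reading in samples[1:]:
        best = min(reading, key=reading.get)
        if best != current and reading[current] - reading[best] > margin_db:
            current = best
            changes += 1
    return changes


# Made-up readings: small 1-2 dB wobble, plus one strong interferer on
# channel 6 in the last two samples.
samples = [
    {1: -84, 6: -85, 11: -80},
    {1: -84, 6: -86, 11: -80},
    {1: -86, 6: -84, 11: -81},
    {1: -83, 6: -85, 11: -80},
    {1: -85, 6: -70, 11: -80},
    {1: -84, 6: -71, 11: -86},
]

print(count_channel_changes(samples, margin_db=0))  # 4
print(count_channel_changes(samples, margin_db=5))  # 1
```

With no margin the AP chases every small fluctuation and changes channel four times. With a 5 dB margin it moves once, when the strong interferer shows up. In a real deployment each change can also alter what neighboring APs measure, which is where the ripple comes from. This sketch does not model that.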

I am collecting comments on this and would like to post more testimonials about this effect. If anyone wants to support this claim publicly, please feel free to drop me a line at bruce@hubbert.org or comment on this post. My goal here is not to raise hysteria but to get things fixed and level the playing field. The infrastructure vendors tend to pitch the idea that they offer a panacea for all Wi-Fi woes, and I feel that is just a flavor of "Kool-Aid" I am unwilling to drink.
