
Approved for Public Release; Distribution Unlimited Case # 06-0602

Improving Multi-homed SCTP Mobile Communication Performance


Kevin H. Grace, Dylan Pecelli, Jeffrey D. D'Amelia
The MITRE Corporation {kgrace, dpecelli, jdamelia}@mitre.org

Abstract: The growing availability of different wireless access technologies such as WiFi 802.11, cellular broadband EV-DO, and the soon-to-be-deployed WiMax 802.16 provides the opportunity for users to carry multiple radio types and potentially benefit from better connectivity and reliability. The multi-homing features of the Stream Control Transmission Protocol (SCTP) appear to be key enablers for improving mobile communications for such multi-radio equipped users. However, satisfactory performance of SCTP in a mobile wireless environment depends heavily on the settings of configurable protocol parameters, and questions remain whether multi-homing may actually perform worse than single-homing in some circumstances. This paper examines the complex interactions between various configurable protocol parameters and their effect on performance, with the goal of making recommendations for how to set these knobs in an informed way. Using simulation, we investigate SCTP's throughput performance between a pair of users in three scenarios: each user equipped with a single radio type, each user equipped with a pair of radios that provide identical data rates, and each user equipped with a pair of radios where one radio's data rate is one tenth of the other's. For the last scenario, we find that SCTP's default recovery performance is lacking to such an extent that better throughput is actually achieved when users are equipped with only a single radio. To mitigate this problem, we propose a change to the protocol's heartbeat mechanism and present simulation data showing the resulting improvement.

I. INTRODUCTION

For telecommunication network operators in the late 1990s, the potential operational, managerial, and financial benefits of moving towards a converged IP network that would carry both data and voice traffic motivated numerous research and development efforts. Of particular interest was the unsuitability of existing IP-based transport protocols for delivering time-sensitive and loss-sensitive call signaling messages. Neither UDP, which does not provide a reliable delivery service, nor TCP, which suffers from head-of-line blocking during loss events, was deemed adequate [5]. Moreover, as telecommunication networks emphasize reliability, it was imperative that a transport solution include some form of redundancy to ensure that device or path failures did not degrade service. End-to-end path redundancy requires that end systems be multi-homed and that alternate paths exist between systems. As TCP only allows each system to bind a single IP address to a connection, it cannot support multi-homing. The lack of an adequate transport solution for converged telecommunication networks resulted in the IETF's development of SCTP (RFC2960) [7,8], a multi-homing enabled transport layer protocol that provides a reliable, end-to-end, connection-oriented, message-based delivery service to higher layer applications. While SCTP was originally targeted at converged telecommunication networks, the lack of any alternative multi-homing transport protocol makes it an attractive solution for other domains as well.

The growing availability of different wireless access technologies such as WiFi 802.11, cellular broadband EV-DO, and the soon-to-be-deployed WiMax 802.16 provides the opportunity for users to carry multiple radio types and potentially benefit from improved connectivity and reliability. The multi-homing features of SCTP appear to be key enablers for improving mobile communications for such multi-radio equipped users. However, satisfactory performance of SCTP in a mobile wireless environment depends heavily on the settings of configurable protocol parameters, and questions remain whether multi-homing may actually perform worse than single-homing in some circumstances.

In [4], the authors investigated large file transfer performance using default SCTP protocol settings across satellite paths with temporary link outages and found that transfer times were often shorter for single-homed than for multi-homed scenarios. Multi-homing yields definite improvements, especially when permanent link or path failures occur; when loss is only temporary, it should not, at least fundamentally, impose a penalty. This paper examines the complex interactions between various configurable SCTP protocol parameters and their effect on performance, with the goal of making recommendations for how to set these knobs in an informed way.
Using simulation, we investigate SCTP's throughput performance between a pair of users in three scenarios: each user equipped with a single radio type, each user equipped with a pair of radios that provide identical data rates, and each user equipped with a pair of radios where one radio's data rate is one tenth of the other's. For the last scenario, we find that SCTP's default recovery performance is lacking to such an extent that better throughput is actually achieved when users are equipped with only a single radio. To mitigate this problem, we propose a change to the protocol's heartbeat mechanism and present simulation data showing the resulting improvement.

The rest of this paper is structured as follows. In Section II, we provide a brief overview of SCTP's error control operation and the various protocol parameters that govern it. We also briefly review recent literature on SCTP's performance that suggests ways of tuning SCTP to cope with random packet loss. In Section III, we present our findings using a loss model based upon temporary link outages rather than random packet loss. Our goal in this section is to determine whether the tuning improvements suggested in the literature hold when the loss model more closely approximates one where terrain blockages, building obscurations, or other forms of shadowing cause the wireless link to temporarily drop out. While the findings in the available literature were obtained under the assumption that the bandwidth available on redundant paths was identical, we are also interested in cases where this may not be true. We believe a typical scenario for multi-homed wireless users will likely employ different wireless access services whose data rates vary by an order of magnitude or more. In Section IV, we present results for an asymmetric multi-homed topology where the alternate path's data rate is limited to 1/10th of the primary path's rate. Troublingly, the resulting throughput for this multi-homed case is worse than that of the single-homed case. In Section V, we propose a modification to SCTP's heartbeat mechanism that shortens path recovery times. We then present results showing this adjustment's effectiveness at erasing the disparity between asymmetric multi-homed and single-homed throughputs. In Section VI, we summarize our results.

II. SCTP ERROR CONTROL OVERVIEW

SCTP's multi-homing allows end systems to utilize multiple divergent paths in order to provide a measure of fault tolerance against device or path failures. One of the peer's addresses is designated the primary destination address, and during normal operation all user data is sent along this primary path. The other paths serve as alternates that may be used for sending retransmissions.
In the event that the primary path fails, a failover is performed, and one of the alternate paths will then be used for transmitting user data. To aid in determining the reachability status of the peer's addresses, a heartbeat (HB) may be periodically transmitted to each address. This mechanism helps to prevent sending retransmissions to alternate addresses that are unreachable and helps ensure that failovers are performed only to a known reachable destination address. During times of failover, the HB mechanism also serves to discover when the primary address once again becomes reachable; when it does, SCTP performs a failover restoration back to the primary path, allowing subsequent data packets to be sent along it.

A number of user-configurable protocol parameters are involved in SCTP's operation. Here we highlight the ones most important for our discussion and indicate their default values according to RFC2960:

Path.Max.Retrans (PMR, default 5): The maximum number of consecutive retransmissions to a particular destination address before marking it inactive; if the path was currently selected for sending user data, a failover is performed. Whenever a valid heartbeat acknowledgement or an acknowledgement for outstanding user data is received from the peer, this error counter is cleared.
RTO.Min (default 1): The minimum number of seconds for a retransmission timeout.
RTO.Max (default 60): The maximum number of seconds for a retransmission timeout.
HB.interval (default 30): Used in combination with the current RTO to schedule when the next HB will be sent to an idle destination address.

One of the most critical parameters for any reliable end-to-end transport protocol is the value used to determine when an unacknowledged transmission should result in a retransmission: the retransmission timeout (RTO). The RTO plays a role not only in error control, but also in congestion control. For error control purposes, RTO should be just large enough to accommodate the round trip time and processing needed to send a packet and receive an acknowledgment, plus some additional margin to account for delay variability. For congestion control purposes, however, RTO is also used to implement an exponential back-off algorithm (doubling its value after each consecutive loss) and, as such, should be allowed to expand to relatively large values to allow short-term periods of network congestion to dissipate. Unfortunately, this dual role of RTO can impede precise error control, especially in sending HBs. This obstacle is discussed further in Section V.

A separate RTO is maintained for each path. Round-trip time measurements are sampled, and a smoothed average (SRTT) and round-trip time variation (RTTVAR) are computed. From these, RTO is computed as: RTO = SRTT + 4 * RTTVAR. If the resulting RTO is smaller than RTO.Min, it is raised to RTO.Min; similarly, if it is greater than RTO.Max, it is lowered to RTO.Max. As RTO can lie anywhere in the range [RTO.Min, RTO.Max], dramatic differences in protocol performance can result depending upon its actual value.
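The per-path RTO computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's implementation: the class and method names are hypothetical, and the smoothing gains (1/8 and 1/4), the first-sample rule, and the 3-second initial RTO are taken from the RFC2960 defaults rather than from this paper.

```python
from dataclasses import dataclass
from typing import Optional

RTO_ALPHA = 1 / 8  # assumed RFC2960 default gain for SRTT
RTO_BETA = 1 / 4   # assumed RFC2960 default gain for RTTVAR


@dataclass
class PathRto:
    """Hypothetical per-path RTO state; one instance per destination address."""
    rto_min: float = 1.0    # RTO.Min, seconds
    rto_max: float = 60.0   # RTO.Max, seconds
    srtt: Optional[float] = None
    rttvar: Optional[float] = None
    rto: float = 3.0        # assumed RTO.Initial default

    def _clamp(self, value: float) -> float:
        # Keep RTO inside [RTO.Min, RTO.Max].
        return max(self.rto_min, min(self.rto_max, value))

    def on_rtt_sample(self, rtt: float) -> float:
        """Update SRTT and RTTVAR from a new measurement and recompute RTO."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - RTO_BETA) * self.rttvar + RTO_BETA * abs(self.srtt - rtt)
            self.srtt = (1 - RTO_ALPHA) * self.srtt + RTO_ALPHA * rtt
        self.rto = self._clamp(self.srtt + 4 * self.rttvar)
        return self.rto

    def on_timeout(self) -> float:
        """Exponential back-off: double RTO after a loss, capped at RTO.Max."""
        self.rto = self._clamp(self.rto * 2)
        return self.rto
```

With a 200 ms round trip, the raw value SRTT + 4 * RTTVAR is 0.6 seconds, so the clamp raises RTO to RTO.Min = 1 second; repeated calls to `on_timeout()` then double it until it saturates at RTO.Max.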
For example, assuming the default Path.Max.Retrans = 5, RTO = RTO.Min = 1, and RTO.Max = 60, an SCTP failover starting from the minimal RTO will take 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds. However, if RTO began at the slightly larger value of 2, SCTP failover would take 2 + 4 + 8 + 16 + 32 + 60 = 122 seconds, nearly twice as long.

For many situations, the failover time using default SCTP settings is likely to be unacceptable to users. Shortening this time can be achieved by setting the knobs (i.e., the relevant protocol parameters), namely RTO.Min, RTO.Max, and Path.Max.Retrans, to smaller values. In [6], the authors investigate performance improvements resulting from aggressively setting parameters (RTO.Min=10ms, RTO.Max=250ms, Path.Max.Retrans=2) in order to allow an IP network to meet the time-sensitive requirements of Signaling System No. 7 (SS7). The authors caution, however, that by aggressively setting RTO.Max, the exponential back-off strategy, which is an important part of congestion control, is effectively disabled. While aggressively setting RTO.Min and RTO.Max may be useful in small, limited domains where the end systems and the network paths are well characterized and administered by the same organization, in more general scenarios involving more than a single administrative domain, the negative ramifications of such aggressive settings are not well understood, and such settings should be avoided.

Aggressively setting just Path.Max.Retrans may lead to unnecessary or spurious failovers when a small number of lost packets is interpreted to mean that the destination address is no longer reachable. On the surface, it would seem that spurious failovers are something to be avoided. However, despite the prevalence of spurious failovers for smaller PMRs, in [3], where the impact of setting PMR to values in the range {0,1,2,3,4,5} was investigated, file transfer times were not negatively affected by unnecessary failovers. Scenarios including two paths with symmetric path delays (both 90ms) and asymmetric path delays (90ms and 1040ms) were tested across all combinations of (primary loss rate, alternate loss rate) for uniformly distributed packet loss rates of {0%,1%,2%,10%}. In all cases, the setting of PMR=0 resulted in the shortest transfer times, even when failing over to an alternate path with a higher loss rate than the primary. The explanation for this surprising observation was that time spent while SCTP waited for additional retransmission timeouts was time largely wasted. Instead, at the first sign of a problem (i.e., the first retransmission timeout), it was more advantageous to try a different path, which would likely have a smaller RTO and thereby would waste less time if another loss occurred. These important results seem to point towards a simple solution for setting PMR: always set it to 0. Yet we wondered whether this was good advice for all scenarios.
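To make the failover-time arithmetic from earlier in this section concrete, and to show how strongly it depends on PMR, the following Python sketch sums the consecutive retransmission timeouts that elapse before a path is marked inactive. The function name is hypothetical, and the sketch assumes that failover is triggered by the (PMR+1)-th consecutive timeout, which is the reading that matches the six-term sums in the worked example.

```python
def failover_time(initial_rto: float, pmr: int = 5, rto_max: float = 60.0) -> float:
    """Seconds spent in consecutive timeouts before failover (hypothetical helper).

    Assumes the RTO doubles after every timeout, is capped at rto_max,
    and that the (pmr + 1)-th consecutive timeout triggers the failover.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(pmr + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    # Failover time for each PMR value studied in [3], starting from RTO.Min = 1.
    for pmr in range(6):
        print(f"PMR={pmr}: {failover_time(1.0, pmr):.0f} s")
```

Under these assumptions, starting from RTO = 1 second, the failover times for PMR = 0 through 5 are 1, 3, 7, 15, 31, and 63 seconds, which illustrates why PMR=0 abandons a failed path so much sooner than the default.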
Specifically, we wondered whether the advice to set PMR=0 would apply when the loss model was not one of uniform packet loss but instead one where link outages of various durations occurred, as would likely be the case when mobile nodes were hindered by buildings and other obstructions. We also wondered how the advice to set PMR=0 would affect performance relative to the case where the endpoints were only single-homed; would we see results similar to those in [4], in which multi-homed performance was worse than single-homed? Finally, we wondered what would happen if an asymmetry in bandwidth existed between the primary and alternate paths, as is likely in wireless access scenarios; would failing over at the first sign of trouble still be the best strategy? In the next section, we begin to answer these questions for the symmetric case where the primary and alternate paths have identical data rates, and in Section IV we examine the asymmetric case where the alternate path is disadvantaged with only 1/10th the data rate of the primary.

III. SYMMETRIC PATH PERFORMANCE

We evaluated SCTP throughput performance using the University of Delaware's SCTP module [2] for the ns-2 network simulator [1]. Figure 1 illustrates the network topology, in which each endpoint had two interfaces attached to separate networks. For simplicity, we used ns-2 duplex-links to model each network path. The primary and alternate path data rates were 1 Mb/s and one-way delays were 100 ms. One node was a bulk data sender and the other was the recipient.

[Figure 1 diagram: Sender and Receiver joined by a Primary path and an Alternate path; interface labels IPA1, IPA2, IPB1, IPB2.]

Figure 1 Multi-homed simulation topology. The Sender and Receiver each have two IP interfaces that attach to separate communication paths.

We modeled link outages using a two-state Markov process in which link up and down times were exponentially distributed. All combinations of average link up times of {1,2,5,10,20,40,60} seconds and average link down times of {1,2,5,10,20,40} seconds were evaluated. The primary and alternate paths each had their own Markov outage process configured with the same (up, down) settings but different random number generator seeds, so that outage events were independent between paths. Each simulation run executed for 5 simulated minutes, and the number of bytes transferred was tallied. For each (up,down) combination, 30 to 60 runs were performed and their results averaged into an effective throughput normalized by the 1 Mb/s data rate of the paths. We computed 95% confidence intervals and, while they are not shown in the graphs that follow, used them to interpret whether measured differences were statistically significant.

The first question we set out to answer was whether setting PMR to 0 would be good advice for the link outage model, as it was for the uniform packet loss model in [3]. To determine whether this was the case, we evaluated the scenario in Figure 1 for PMR=5 and again for PMR=0 and computed the difference in normalized throughputs, which we refer to as Throughput Improvement, for all (up,down) outage combinations. Note that we intentionally refrain from reporting the improvement as a ratio, because a ratio tends to overemphasize and obscure results when normalized throughputs are small but their ratio is not. The results in Figure 2 show that in most cases PMR=0 did in fact result in improved throughput over PMR=5, and in all cases it performed at least as well (based upon 95% confidence intervals not shown here). The most dramatic improvements occurred when up and down times were about equal.
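The two-state outage process described above can be sketched as follows. This is a minimal Python illustration and not the ns-2 error model used for the reported results: the function names are hypothetical, and the sketch assumes that each link starts in the up state.

```python
import random
from typing import List, Tuple

Interval = Tuple[float, float, bool]  # (start, end, is_up)


def outage_trace(mean_up: float, mean_down: float,
                 duration: float, seed: int) -> List[Interval]:
    """Alternate exponentially distributed up and down periods for one link."""
    rng = random.Random(seed)  # a separate seed per path keeps outages independent
    trace: List[Interval] = []
    t = 0.0
    is_up = True  # assumption: the link is up at time zero
    while t < duration:
        mean = mean_up if is_up else mean_down
        dwell = rng.expovariate(1.0 / mean)  # exponential with the given mean
        end = min(t + dwell, duration)
        trace.append((t, end, is_up))
        t = end
        is_up = not is_up
    return trace


def fraction_up(trace: List[Interval]) -> float:
    """Share of the traced time during which the link was up."""
    total = trace[-1][1] - trace[0][0]
    up = sum(end - start for start, end, state in trace if state)
    return up / total


if __name__ == "__main__":
    # One 5-minute run with (up=5, down=2) and independent seeds per path.
    primary = outage_trace(5.0, 2.0, 300.0, seed=1)
    alternate = outage_trace(5.0, 2.0, 300.0, seed=2)
    print(fraction_up(primary), fraction_up(alternate))
```

Over a long run, the fraction of time such a link spends up approaches mean_up / (mean_up + mean_down), so the (up, down) pairs in the study span links that are available from a small fraction of the time to nearly all of it.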
Figure 2 Throughput improvement for PMR=0 over PMR=5 with symmetric paths. For all (up,down) combinations, PMR=0 outperforms or matches PMR=5.

Figure 4 Throughput improvement for multi-homing with symmetric paths (PMR=0) over single-homing. For most loss combinations, multi-homing substantially outperforms single-homing.

The maximum normalized throughput difference was 0.22, which occurred at (up=5, down=2); the average throughputs for PMR=0 and PMR=5 at this point were 466 kb/s and 243 kb/s, respectively.

The next question we wanted to answer was how well the multi-homed topology in Figure 1 would perform in comparison to a single-homed topology. We were also interested in what effect PMR had on this outcome. It turned out that for a setting of PMR=5, as shown in Figure 3, the single-homed topology sometimes outperformed the multi-homed topology, though in many circumstances their performances were comparable. The maximum difference of 0.11 occurred at (up=5, down=2); the average throughputs for the single-homed and multi-homed cases at this point were 357 kb/s and 243 kb/s, respectively. These results appear to support the general conclusion first reported in [4] that single-homed topologies may outperform multi-homed topologies, at least when default RFC2960 settings are used.
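The Throughput Improvement metric used in these comparisons is simply a difference of normalized throughputs. The short Python sketch below reproduces the two reported maxima from the quoted averages and shows why a ratio can mislead; the function names are hypothetical, and the 20 kb/s and 10 kb/s figures in the last example are invented purely for illustration.

```python
PATH_RATE_KBPS = 1000.0  # 1 Mb/s path data rate used for normalization


def normalized(throughput_kbps: float, rate_kbps: float = PATH_RATE_KBPS) -> float:
    """Throughput expressed as a fraction of the path data rate."""
    return throughput_kbps / rate_kbps


def throughput_improvement(a_kbps: float, b_kbps: float,
                           rate_kbps: float = PATH_RATE_KBPS) -> float:
    """Difference in normalized throughput of configuration A over B."""
    return normalized(a_kbps, rate_kbps) - normalized(b_kbps, rate_kbps)


if __name__ == "__main__":
    # PMR=0 versus PMR=5 at (up=5, down=2), symmetric paths.
    print(round(throughput_improvement(466, 243), 2))
    # Single-homed versus multi-homed (PMR=5) at (up=5, down=2).
    print(round(throughput_improvement(357, 243), 2))
    # Hypothetical low-throughput case: the ratio is 2.0, the difference only 0.01.
    print(20 / 10, round(throughput_improvement(20, 10), 2))
```

The first two calls evaluate to 0.22 and 0.11, matching the values quoted in the text, while the hypothetical third case shows a doubling in ratio terms that amounts to only one percent of the path rate.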

When we instead used the more aggressive setting of PMR=0, however, as shown in Figure 4, multi-homed throughput improved dramatically relative to single-homed throughput for many loss combinations and, in every loss case, multi-homing performed at least as well, statistically, as single-homing. This is encouraging, as we now have additional evidence supporting the tuning advice that, to maximize throughput, at least when symmetric bandwidth paths are employed, one must simply set PMR to 0.

Figure 3 Throughput improvement for single-homing over multi-homing with symmetric paths when PMR=5. For some smaller (up,down) combinations, single-homing substantially outperforms multi-homing.

The next logical question to answer was what happens when the bandwidths of the two paths differ, possibly by a very substantial amount. With the growing availability of different wireless access technologies, it is more likely that mobile multi-homing users will have asymmetric bandwidth paths available than symmetric paths.

IV. ASYMMETRIC PATH PERFORMANCE

Using the same topology from Figure 1, we decreased the alternate path's data rate by an order of magnitude to 100 kb/s and kept the same one-way delays (100 ms) and primary path data rate (1 Mb/s). We first wanted to understand whether PMR=0 still outperformed PMR=5. Unfortunately, as Figure 5 shows, with a bandwidth-disadvantaged alternate path, it is no longer the case that setting PMR=0 results in higher throughput. On the contrary, setting PMR to higher values results in improved throughput. There is a substantial advantage to sticking with the higher speed primary path, despite the fact that it is not functioning, and waiting for it to be restored, rather than switching over to the lower speed alternate path. The reason is that when SCTP stays with the primary path, it will more quickly discover when the path is again functional (by retransmitting user data using exponential back-off) than if it fails over to the alternate and relies upon the slower HB mechanism to probe for the primary's recovery.

While PMR=5 shows a substantial improvement over PMR=0 for most loss combinations, this advantage diminishes as the outage duration grows, and for permanent or very long duration outages it would certainly be more advantageous to switch over to the alternate path quickly. In other words, there is still a motivation for setting PMR=0, but doing so would result in diminished throughput when outages are not too long. This provides some of the motivation for an improved HB path recovery detection mechanism, which we present in Section V.

Figure 5 Throughput improvement for PMR=5 over PMR=0 with asymmetric paths and HB.interval=30s. For most loss combinations, except when path up and down times are very short, PMR=5 outperforms PMR=0.

The next question we wanted to answer was how single-homing throughput would compare to multi-homing throughput when a bandwidth asymmetry existed between the primary and alternate paths. Since we have just seen that PMR=5 happens to provide better throughput than PMR=0, the former was used to represent multi-homing performance in the comparison. While we omit the data due to page limits, we found results very similar to those shown in Figure 3: single-homing again sometimes outperformed multi-homing, and many times both had similar performance. The situational superiority of single-homing, as well as the lack of a single ideal PMR tuning covering all path bandwidth and reliability conditions, provided additional motivation for the improved HB path recovery detection mechanism presented in the next section.

V. ASYMMETRIC PERFORMANCE WITH MODIFIED HB

The disappointing throughput of multi-homing when paths have bandwidth asymmetries, which is a likely scenario for mobile users employing heterogeneous wireless access technologies, is the direct result of a slow SCTP path recovery detection process. Once a failover to the alternate path occurs, the goal should be to return to the faster primary path as soon as it becomes viable. As described in Section II, after a failover to the alternate path, the SCTP heartbeat mechanism is responsible for detecting when the primary path has recovered. According to [8], which clarifies implementation issues for RFC2960, the recommended approach to sending HBs is once per RTO of that destination address plus the protocol parameter HB.interval, with jittering of +/- 50% of the RTO value, and exponential back-off of the RTO if the previous HEARTBEAT is unanswered. The exponential back-off of the RTO is done when loss occurs, based on the tacit assumption that congestion along the path is responsible for the loss.
While this assumption is often correct in wired networks, in wireless networks, where loss may be the result of link outage or interference, backing off transmissions often leads to underutilization. For quickly discovering when the higher speed primary path becomes available, it is desirable to keep the interval between HBs relatively small. While setting HB.interval to a more aggressive value might initially seem to achieve this, the inclusion of, on average, one RTO in the period between HBs will progressively inflate the spacing as the RTO grows exponentially. Thus, even when aggressively setting HB.interval to 1 second, for example, an outage that lasts approximately 1 + 2 + 4 + 8 + 16 = 31 seconds, resulting in an RTO of 32 seconds, will require (HB.interval + RTO +/- 0.5*RTO) seconds, or between 17 and 33 seconds after the path is actually restored, to detect the path's recovery. During this time, the available but as yet undetected higher speed primary path sits idle while user traffic continues to flow across the slower alternate path.

We propose to reduce the dithering applied to HBs for the purpose of speeding up path restoration detection. Rather than always spreading out HBs using a full RTO, we believe that less conservative dithering is possible without effectively disabling congestion control. A new protocol parameter, HB.MaxDither, could be used to limit the amount of dithering such that when the RTO is smaller than HB.MaxDither, dithering is based upon RTO. However, when RTO exceeds HB.MaxDither, dithering is based solely on HB.MaxDither. This new protocol parameter would provide an additional degree of freedom for tuning purposes and provide looser coupling between the HB mechanism and congestion control. Moreover, it would avoid having to make changes to RTO.Max, which, as discussed earlier, also plays a crucial role in congestion avoidance for user data. Protection from possible congestion caused by the increased frequency of HBs is still a concern, especially as both HB.interval and HB.MaxDither might be set very aggressively.
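One way to read the proposed rule is sketched below in Python. It is an interpretation under stated assumptions rather than the authors' implementation: the function name is hypothetical, and the sketch assumes that HB.MaxDither caps the RTO-dependent term of the HB period, so that the period becomes HB.interval + min(RTO, HB.MaxDither), jittered by +/- 50% of that capped term.

```python
from typing import Optional, Tuple


def hb_spacing_bounds(hb_interval: float, rto: float,
                      hb_max_dither: Optional[float] = None
                      ) -> Tuple[float, float, float]:
    """Return (min, mean, max) seconds between consecutive heartbeats.

    With hb_max_dither=None the spacing follows the rule quoted from [8]:
    HB.interval + RTO, jittered by +/- 50% of RTO. Otherwise the
    RTO-dependent term is capped at hb_max_dither (assumed reading of
    the proposed HB.MaxDither parameter).
    """
    dither_base = rto if hb_max_dither is None else min(rto, hb_max_dither)
    mean = hb_interval + dither_base
    half_jitter = 0.5 * dither_base
    return (mean - half_jitter, mean, mean + half_jitter)


if __name__ == "__main__":
    # Backed-off RTO of 32 s with an aggressive HB.interval of 1 s.
    print(hb_spacing_bounds(1.0, 32.0))
    # Same situation with the dithering capped at HB.MaxDither = 1 s.
    print(hb_spacing_bounds(1.0, 32.0, hb_max_dither=1.0))
```

For HB.interval = 1 and RTO = 32, the uncapped rule gives a spacing of 17 to 49 seconds with a mean of 33; under this reading, the 17 and 33 second figures quoted above appear as the minimum and mean spacing. Capping the dithering at 1 second shrinks the spacing to between 1.5 and 2.5 seconds, independent of how far the RTO has backed off.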
Determining appropriate values for these parameters is beyond the scope of this paper and would be an interesting area for future work. We are not necessarily advocating such aggressive values, though we believe that the current default HB.interval of 30 seconds is likely to be overly conservative, especially when combined with dithering across a full RTO.

To better understand how this new proposal for HB dithering would perform, we evaluated the asymmetric path scenario for PMR=0 and PMR=5, with HB.MaxDither=1 second and the RFC2960 default HB.interval of 30 seconds. While we omit the resulting graphs here due to page limits, we observed results similar to those from Section IV: PMR=5 performed better than PMR=0, yet single-homing outperformed both. We then adjusted HB.interval to an aggressive setting of 1 second. Under these conditions, as shown in Figure 6, PMR=0 modestly outperformed PMR=5 for most loss combinations. Based upon 95% confidence intervals, there were no loss combinations where PMR=0 did not perform at least as well as PMR=5.

Comparing multi-homing (PMR=0) with single-homing, as shown in Figure 7, asymmetric multi-homing finally has a modest advantage over single-homing, at least for most loss combinations. The only cases where multi-homing performed worse occurred for loss combinations where the average outage was short: 1 second. In these cases, the improved HB mechanism, which used HB.interval=1 and HB.MaxDither=1, resulting in an average recovery detection time of 2 seconds, was not faster than the user data exponential back-off detection employed by single-homing. The slight normalized throughput advantage offered by single-homing in these cases, about 0.04, could be diminished using even more aggressive settings, but the trade-off in overhead and possible HB-induced congestion may not warrant doing so.

Figure 6 Throughput improvement for PMR=0 over PMR=5 with asymmetric paths and HB.interval=1s. For most loss combinations, PMR=0 modestly outperforms PMR=5.

Figure 7 Throughput improvement for multi-homing with asymmetric paths (PMR=0) over single-homing. For most loss combinations, multi-homing modestly outperforms single-homing.

VI. CONCLUSION

This paper has examined the throughput performance of multi-homed SCTP with the goal of offering tuning advice for wireless access environments where link outages may occur. Recent results in [3] showed that setting PMR=0 resulted in improved performance regardless of the packet error rates on primary and secondary paths. Our hope was that this could be translated into a general-purpose tuning recommendation, yet we had concerns about whether such advice would still hold when bandwidth asymmetries existed between the primary and alternate paths. We also had lingering concerns based upon [4], which showed single-homing outperforming multi-homing. Our investigation revealed that when a bandwidth asymmetry did exist, setting PMR=0 was not good advice and single-homing was better than multi-homing. Both of these results, however, were due to the slow recovery detection offered by SCTP's heartbeat mechanism. We proposed a modification that introduced a new tunable protocol parameter and evaluated the proposal. Based upon simulation results, it appears that the proposed HB mechanism works well and provides the desired improvements. By using such a modified HB mechanism, we are confident that simply setting PMR=0 provides the best performance potential, regardless of whether bandwidth asymmetries exist between primary and alternate paths. The modified HB mechanism also diminishes any hesitation to embrace multi-homing for fear that it might perform worse than single-homing.

REFERENCES

[1] UC Berkeley, LBL, USC/ISI, and Xerox PARC, "ns-2 Simulator," Version 2.29, Oct. 2005, http://www.isi.edu/nsnam/ns
[2] A. Caro, J. Iyengar, "ns-2 SCTP module," http://pel.cis.udel.edu
[3] A. L. Caro, P. D. Amer, R. R. Stewart, "Rethinking End-to-End Failover with Transport Layer Multihoming," Annals of Telecommunications Transport Protocols for NGNs, Vol. 61, Jan-Feb 2006
[4] M. H. Duke, T. R. Henderson, P. A. Spagnola, J. H. Kim, "Stream Control Transmission Protocol (SCTP) Performance Over The Land Mobile Satellite Channel," Proceedings of IEEE MILCOM 2003, Oct. 2003, Vol. 2, pp. 1325-1331
[5] K. Grinnemo, T. Andersson, A. Brunstrom, "Performance Benefits of Avoiding Head-Of-Line Blocking in SCTP," Proceedings of the Joint International Conference on Autonomic and Autonomous Systems and International Conference on Networking and Services, Oct. 2005
[6] R. Rembarz, S. Baucke, P. Mahonen, "Enhancing Resilience for High Availability IP-based Signaling Transport," Proceedings of the 16th Annual IEEE International Symposium on Personal Indoor and Mobile Radio Communications, Sept. 2005
[7] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson, "Stream Control Transmission Protocol," RFC2960, Oct. 2000
[8] R. Stewart, I. Arias-Rodriguez, K. Poon, A. Caro, M. Tuexen, "Stream Control Transmission Protocol (SCTP) Specification Errata and Issues," Work In Progress, Internet Draft draft-ietf-tsvwg-sctpimpguide-16.txt, Oct. 2005
