by
Preethi Natarajan
February 2009
Approved: __________________________________________________________
B. David Saunders, Ph.D.
Chair of the Department of Computer and Information Sciences
Approved: __________________________________________________________
Tom Apple, Ph.D.
Dean of the College of Arts and Sciences
Approved: __________________________________________________________
Debra Hess Norris, M.S.
Vice Provost for Graduate and Professional Education
I certify that I have read this dissertation and that in my opinion it meets
the academic and professional standard required by the University as a
dissertation for the degree of Doctor of Philosophy.
Signed: __________________________________________________________
Paul D. Amer, Ph.D.
Professor in charge of dissertation
I certify that I have read this dissertation and that in my opinion it meets
the academic and professional standard required by the University as a
dissertation for the degree of Doctor of Philosophy.
Signed: __________________________________________________________
Adarshpal S. Sethi, Ph.D.
Member of dissertation committee
I certify that I have read this dissertation and that in my opinion it meets
the academic and professional standard required by the University as a
dissertation for the degree of Doctor of Philosophy.
Signed: __________________________________________________________
Phillip T. Conrad, Ph.D.
Member of dissertation committee
I certify that I have read this dissertation and that in my opinion it meets
the academic and professional standard required by the University as a
dissertation for the degree of Doctor of Philosophy.
Signed: __________________________________________________________
Stephan Bohacek, Ph.D.
Member of dissertation committee
I certify that I have read this dissertation and that in my opinion it meets
the academic and professional standard required by the University as a
dissertation for the degree of Doctor of Philosophy.
Signed: __________________________________________________________
Randall R. Stewart
Member of dissertation committee
ACKNOWLEDGMENTS
improve my ideas, paper drafts, and presentation slides. His “hands off” approach to
advising has been both a challenging and a rewarding experience. Thanks to Prof.
I would like to acknowledge the financial support that made this
dissertation possible. This research was sponsored in part by the U.S. Army Research
Laboratory, and by Cisco Systems’ University Research Program.
This dissertation would not have been possible without the support from
family and friends. My parents, Raju and Gowri, were steadfast in giving me a good
education, even if it meant pushing their needs to the back burner. Raju never ceases to
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT

Chapter

1. INTRODUCTION

2.1 Introduction
2.2 Head-of-line Blocking
2.4.2 Adapting Apache
2.7 Single TCP Connection vs. Single Multistreamed SCTP Association
2.8 Multiple TCP Connections vs. Single Multistreamed SCTP Association
2.8.1 Background
2.8.2 In-house HTTP 1.1 Client
2.8.3 Experiment Parameters
2.8.4 Results: HTTP Throughput

3.1 Introduction
3.2 Problem Description
3.2.1 Background
3.2.2 Unordered Data Transfer using SACKs
3.2.3 Implications to CMT
3.5 Results

4.5.1 Simulation Setup
4.5.2 Evaluations during Symmetric Loss Conditions
4.5.3 Evaluations during Asymmetric Loss Conditions

REFERENCES
LIST OF FIGURES

Figure 2.1: Model for HTTP 1.1 Persistent, Pipelined Transfer
Figure 2.11: Fast Retransmits during SACK Recovery (Object Size = 5K)
Figure 2.13: HTTP Throughput (Object Size = 10K)
Figure 2.14: RTO Expirations on Data at Server (1Mbps.200ms; Object Size = 10K)
Figure 4.4: CMT-PF Reduces Rbuf Blocking during Failure
Figure 4.6: CMT vs. CMT-PF during Permanent Failure
Figure 4.7: CMT vs. CMT-PF under Varying PMR Values
Figure 4.8: CMT vs. CMT-PF during Short-term Failure
Figure 4.9: CMT vs. CMT-PF under Varying Rbuf Sizes
Figure 4.11: CMT-PF1 Data Transfer during no Rbuf Blocking
Figure 4.12: CMT-PF2 Data Transfer during no Rbuf Blocking
Figure 4.15: CMT vs. CMT-PF Goodput Ratios during Symmetric Loss and Asymmetric RTT Conditions
Figure 4.16: CMT vs. CMT-PF Rbuf Blocking Durations
Figure 4.18: CMT vs. CMT-PF during Asymmetric Loss Conditions
Figure 4.19: Emulation Topology for CMT vs. CMT-PF Experiments
Figure 4.20: CMT vs. CMT-PF during Permanent Path Failure
Figure 4.21: CMT vs. CMT-PF during Symmetric Loss Conditions
LIST OF TABLES

Table 4.1: CMT vs. CMT-PF Mean Consecutive Data Timeouts on Path 2
ABSTRACT
We investigate three issues related to the transport layer, and address these
issues using the innovative transport layer services offered by the Stream Control
Transmission Protocol (SCTP) [RFC4960].
The first issue concerns HTTP over TCP. A TCP connection offers a single
sequential bytestream, and in-order data delivery within the bytestream. Transferring
independent web objects over a single TCP connection results in head-of-line (HOL)
blocking, and worsens web response times. In contrast, transferring these objects
over different SCTP streams eliminates inter-object HOL blocking. We propose a
design for HTTP over SCTP streams, and implement this design in the open source
Apache web server and Firefox browser. Using emulation, we show that persistent and
pipelined HTTP 1.1 transfers over a single multistreamed SCTP association improve
web response times when compared to similar transfers over a single TCP connection.
The difference in TCP vs. SCTP response times increases and is more visually
perceivable in high latency and lossy browsing conditions, as found in the developing
world.
The current workaround to improve an end user’s perceived WWW
performance is to download an HTTP transfer over multiple TCP connections. While
we expect multiple TCP connections to improve HTTP throughput, emulation results
show that the competing and bursty nature of multiple TCP senders degrades HTTP
performance especially in end-to-end paths with low bandwidth last hops. In such
browsing conditions, a single multistreamed SCTP association not only eliminates HOL
blocking, but also boosts throughput compared to multiple TCP connections.
In the second issue, we explore how SCTP’s (or TCP’s) SACK mechanism
degrades end-to-end performance when out-of-order data is non-renegable. Using
simulation, we show that SACKs result in inevitable send buffer wastage, which
increases as the frequency of loss events and loss recovery durations increase. We
propose Non-Renegable Selective Acknowledgments (NR-SACKs) to eliminate this wastage.
The third issue concerns Concurrent Multipath Transfer (CMT) during path failures.
Using simulation, we demonstrate how CMT suffers from significant “rbuf blocking” which
degrades performance during permanent and short-term path failures. To improve
performance, we introduce a new destination state called the “Potentially Failed” (PF)
state. CMT’s failure detection and (re)transmission policies are augmented to include
the PF state, and the modified CMT is called CMT-PF. Using simulation, we
demonstrate that CMT-PF outperforms CMT during failures − even under aggressive
failure detection thresholds. We also show that CMT-PF performs at least as well as,
and sometimes better than, CMT during non-failure scenarios. In light of these findings, we
recommend CMT be replaced by CMT-PF in existing and future CMT implementations
and RFCs.
Chapter 1
INTRODUCTION
When HTTP's design required a reliable transport protocol, TCP was the only available option and
was ‘chosen’ for HTTP transfers. However, transferring independent web objects over
TCP results in sub-optimal response times, since a TCP connection (i) offers a single
sequential bytestream to the application, and (ii) provides in-order delivery within the
bytestream: if a piece of one web object is lost in the network, successively
transmitted web objects will not be delivered to the client until the lost piece is
retransmitted and received.
Data that has been delivered to the application, by definition, is non-
renegable by the transport receiver. Unlike TCP, which never delivers out-of-order data
to the application, SCTP’s multistreaming and unordered data delivery services result
in out-of-order data being delivered to the application and thus becoming non-
renegable. Interestingly, TCP and SCTP implementations can be configured such that
the receiver is not allowed to and therefore never reneges on out-of-order data.
characteristics. However, [Iyengar 2006] did not consider path failures, which is the scope of
our work.
Both TCP and UDP are unaware of multihoming. Hence, [Iyengar 2006]
used SCTP, a multihoming-aware transport protocol, to perform CMT at the
transport layer. Since this research is a continuation of [Iyengar 2006], our
investigations also use SCTP. Incidentally, SCTP also supports path failure detection.
1.2 An SCTP Primer
SCTP was originally developed to carry telephony signaling messages over
IP networks. With continued work, SCTP evolved into a general purpose transport
protocol with advanced delivery options [RFC4960]. Similar to TCP, SCTP provides a
reliable, full-duplex, congestion and flow-controlled connection, called an association.
An SCTP packet, or more generally, protocol data unit (PDU), consists of one or more
concatenated building blocks called chunks: either control or data. For the purposes of
reliability and congestion control, each data chunk in an association is assigned a
unique Transmission Sequence Number (TSN). Since chunks are atomic, TSNs are
associated with chunks of data, as opposed to TCP, which associates a sequence
number with each data octet in the bytestream.
Unlike TCP, SCTP offers innovative transport layer services such as
multihoming and multistreaming.
All streams within an association are subject to shared congestion control, and thus
SCTP's multistreaming adheres to TCP's fairness principles.
However, maintaining order of delivery between transport protocol data units (TPDUs)
transmitted on different streams is not a constraint. That is, data arriving in-order
within an SCTP stream is delivered to an application without regard to data arriving on
other streams.
transmission of new data. Note that a single port number is used at each endpoint
regardless of the number of IP addresses.
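To make the multihoming service concrete, the following is a minimal sketch (not taken from the dissertation's code) of binding one SCTP endpoint to two local addresses with a single port, using the sctp_bindx() call from the SCTP sockets API; the addresses and port are illustrative.

```c
/* Sketch: one SCTP endpoint, one port, two local IP addresses.
 * Addresses and port are illustrative, not from the dissertation. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <string.h>
#include <sys/socket.h>

int make_multihomed_endpoint(void)
{
    int sd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);

    struct sockaddr_in addrs[2];
    memset(addrs, 0, sizeof(addrs));
    for (int i = 0; i < 2; i++) {
        addrs[i].sin_family = AF_INET;
        addrs[i].sin_port   = htons(80);       /* same port on both addresses */
    }
    inet_pton(AF_INET, "10.0.0.1", &addrs[0].sin_addr);     /* e.g., wired    */
    inet_pton(AF_INET, "192.168.0.1", &addrs[1].sin_addr);  /* e.g., wireless */

    /* Associate both local addresses with this single endpoint. */
    sctp_bindx(sd, (struct sockaddr *)addrs, 2, SCTP_BINDX_ADD_ADDR);
    return sd;
}
```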
path failure. For instance, users may be simultaneously connected through dial-
up/broadband, or via multiple wireless technologies such as 802.11b and GPRS.
Concurrent Multipath Transfer (CMT) [Iyengar 2006] is an experimental extension to
SCTP that assumes multiple independent paths, and exploits these paths for
simultaneous transfer of new data between end hosts. A naïve version of CMT, where
a data sender simply transfers new data over multiple paths, increases data reordering
and adversely affects performance. [Iyengar 2006] investigates these negative effects
and proposes algorithms and retransmission policies that improve application
throughput.
Chapter 2 details the first issue. The chapter proposes a design for HTTP over SCTP streams, and discusses our
efforts to implement the design in the popular Apache web server and Firefox browser.
Using emulation, we show that persistent and pipelined HTTP 1.1 transfers over a
single multistreamed SCTP association improve web response times when compared
to similar transfers over a single TCP connection. The difference in TCP vs. SCTP
response times increases and is more visually perceivable in high latency and lossy
browsing conditions, as found in the developing world.
The current workaround to improve an end user’s perceived WWW
performance is to download an HTTP transfer over multiple TCP connections. While
we expect multiple TCP connections to improve HTTP throughput, emulation results
show that the competing and bursty nature of multiple TCP senders degrades HTTP
performance especially in end-to-end paths with low bandwidth last hops. In such
browsing conditions, a single multistreamed SCTP association not only eliminates HOL
blocking, but also boosts throughput compared to multiple TCP connections. These
experiments were performed as part of this author’s summer 2008 internship at Cisco
Systems.
Our body of work in HTTP over SCTP has triggered significant interest in
the area. The Protocol Engineering Lab has secured additional funding from Cisco
Systems to pursue some of the ongoing and future work discussed in Chapter 2.
Chapter 3 discusses the second issue – how the existing SACK mechanism
degrades end-to-end performance when out-of-order data is non-renegable – and
proposes Non-Renegable Selective Acknowledgments (NR-SACKs). Simulation
comparisons show that NR-SACKs enable more efficient utilization of a transport
sender’s memory. Further investigations show that NR-SACKs also improve
throughput in CMT. The final section of Chapter 3 discusses ongoing activity,
including our efforts within the IETF to standardize NR-SACKs for SCTP, and at UD
to implement NR-SACKs in FreeBSD SCTP.
Chapter 4 presents our work on the third issue – CMT performance during
path failures. Using simulation, we demonstrate how CMT suffers from significant
“rbuf blocking” which degrades performance during permanent and short-term path
failures. To improve performance, we introduce a new destination state called the
“Potentially Failed” (PF) state. CMT’s failure detection and (re)transmission policies
are augmented to include the PF state, and the modified CMT is called CMT-PF. Using
simulation, we demonstrate that CMT-PF outperforms CMT during failures − even
under aggressive failure detection thresholds. We also show that CMT-PF performs at
least as well as, and sometimes better than, CMT during non-failure scenarios. In light of these
findings, we recommend CMT be replaced by CMT-PF in existing and future CMT
implementations and RFCs.
Chapter 2
This chapter discusses the first problem – HTTP over SCTP streams.
Sections 2.1 and 2.2 explain the head-of-line (HOL) blocking problem and its negative
consequences in HTTP over TCP. Section 2.3 describes our design of HTTP over
multistreamed SCTP. Sections 2.4 and 2.5 discuss HTTP over SCTP implementation
specifics in the Apache web server and Firefox web browser, respectively. Section 2.6
explains evaluation preliminaries and Sections 2.7 and 2.8 present results. Section 2.9
concludes and presents ongoing and future work. Section 2.10 discusses related work.
2.1 Introduction
HTTP [RFC2616] requires a reliable transport protocol for end-to-end
communication. While historically TCP has been used for this purpose, HTTP does not
require TCP. A TCP connection offers a single sequential bytestream to a web server.
In the case of HTTP 1.1 with persistence and pipelining, the independent HTTP
responses are serialized and sent sequentially over a single connection (i.e., one TCP
bytestream). In addition, a TCP connection provides in-order delivery within the
bytestream ─ if a TPDU containing HTTP response i is lost in the network, successive
TPDUs containing HTTP responses i+n (n≥1) will not be delivered to the web client
until the lost TPDU is retransmitted and received. This situation, known as head-of-
line (HOL) blocking, occurs because TCP cannot logically separate independent HTTP
responses in its transport and delivery mechanisms.
Transport layer multistreaming is the ability of a transport protocol to
support multiple streams, where each stream is a logical data flow with its own
sequencing space. Within each stream, the transport receiver delivers data in-sequence
to the application, without regard to the relative order of data arriving on other
streams. SCTP [RFC4960] is a standardized reliable transport protocol which provides
multistreaming. Independent HTTP responses transmitted over different streams of an
SCTP association can be delivered to the web browser without HOL blocking.
While most web users in developed nations experience excellent browsing
conditions, a large and growing portion of WWW users in developing nations
experience high end-to-end delays and loss rates. In such network conditions, persistent
and pipelined HTTP 1.1 transfers over TCP suffer from exacerbated HOL blocking,
resulting in poor browsing experience (discussed in the next section). In this work, we
evaluate multistreamed web transport’s ability to reduce HOL blocking and improve a
web user’s browsing experience in developing regions.
Figure 2.1: Model for HTTP 1.1 Persistent, Pipelined Transfer
objik = kth piece of obji, 0 ≤ k ≤ M; obji0 denotes the response header, and
obji1..objiM denote the different pieces of obji. Note that M depends on the size of obji. In
our emulations, we assume all objects are the same size (M).
rspik = time when the transport delivers objik to the web client.
renik = time when the web client renders objik on the user's monitor.
procik = (renik − rspik) denotes the web client's processing time (e.g., parsing and rendering).
In TCP, loss recovery delays delivery of subsequent data: retransmission after 3
duplicate acks (fast retransmit) takes ~1 round-trip time (RTT),
and retransmission after timeout expiration (timeout retransmit) takes between the
maximum of (1 RTT, min RTO of 1 second) and the initial retransmission timeout value
(RTO) of 3 seconds [RFC2988]. Note that the loss recovery period increases as the
path’s RTT increases. Also, the frequency of HOL blocking increases as the loss rate
on the end-to-end path increases. Intuitively, HOL blocking would be exacerbated over
a high RTT, lossy path.
Apart from end-to-end path characteristics, individual object sizes also
influence the degree of HOL blocking. As object size increases, the probability that a
piece of the object is lost also increases. Hence, a large object in a pipelined transfer is
more likely to block delivery of subsequent objects than a smaller object would.
Due to a multitude of factors, VSAT solutions (Figure 2.2) are the most
cost-effective and efficient method of providing Internet connectivity for commercial
customers, governments and consumers in developing nations and other areas where a
land-based infrastructure does not exist [WiderNet, CAfrica, Tarahaat, VSAT-systems].
The successful deployment of VSAT systems and services in more than 120
countries provides communities with access to information, knowledge, education and
business opportunities, and has been crucial in the communities’ socio-economic
development [Rahman 2002].
The propagation delay from ground station to geostationary satellite to
ground station is ~280ms [Gurtov 2004, RFC2760]. Therefore, the delay over a VSAT
link increases the RTT by ~560ms. The bandwidth-limited VSAT link is most likely the
bottleneck in the transmission path. Any resulting queuing and/or processing delays
within the satellite further increase the RTT. The delay caused by shared channel access
over a VSAT link can sometimes increase the RTT on the order of a few seconds
[RFC3135].
GPRS and 3G links are characterized by variable and high latencies; the
RTTs in such networks can vary from a few hundred milliseconds to 1 second
[Chakravorty 2002, Chan 2002, RFC3481]. The proliferation of mobile phones in
developing regions, and the increasing use of web browsers and other web applications
on mobile phones is another example of web transfers over high latency paths. High
Speed Downlink Packet Access (HSDPA) technology is the successor to 3G, and is
emerging from research to deployment. HSDPA offers improved broadband Internet
access (~1Mbps per user per cell), and is targeted as a viable option for regular Internet
connectivity to both residential and mobile customers. However, channel access and/or
propagation delay on an HSDPA link adds ~80ms to the path RTT [Jurvansuu 2007],
which is significantly higher than current wired last hop delays.
In addition to propagation delays, sub-optimal traffic routing increases
latency of Internet traffic in developing nations [Baggaley 2007, Cottrell 2006]. For
example, sub-optimal routing for intra-African traffic results in Internet traffic
traversing multiple VSAT links, and/or being routed through North America or
Europe, leading to RTTs as high as 2.5 seconds [PingER]. Furthermore, Internet traffic
to/from developing regions traverses through lossy paths, and experiences significant
end-to-end loss rates [Cottrell 2006, PingER].
Online U.S. shoppers consider 4 seconds as the maximum acceptable page
download time before potentially abandoning a retail site [Akamai 2006]. Response
times above 4 seconds interrupt the user experience, causing the user to leave the site
or system. While web users over high latency and lossy paths in developing nations
must be more tolerant of long response times, these users will still prefer a system that
provides a better browsing experience.
web server [Faber 1999]. Also, SCTP’s COOKIE mechanism prevents SYN attacks,
and SCTP multihoming provides fault-tolerance and the possibility of multipath transfer
[Natarajan 2006a].
Two guidelines governed our HTTP over SCTP design:
• Make no changes to the existing HTTP specification, to reduce deployment
concerns.
server. Further, the client is better positioned to make scheduling decisions that rely on
user perception and the operating environment. We therefore concluded that the client
should decide object scheduling on streams.
We considered two designs by which the client conveys the selected SCTP
stream to the web server: (1) the client specifies an SCTP stream number in the HTTP
GET request and the server sends the corresponding response on this stream, or (2) the
server transmits the HTTP response on the same stream number on which the
corresponding HTTP request was received. Design (1) requires just one incoming
stream and several outgoing streams at the server, but requires modifications to the
HTTP GET request specification. Design (2) requires the server to maintain as many
incoming streams as there are outgoing streams, increasing the memory overhead at the
server. Every inbound or outbound stream requires additional memory in the SCTP
Protocol Control Block (PCB), and the amount of memory required varies with the
SCTP implementation. The reference SCTP implementation on FreeBSD (version 6.1)
requires 25 bytes for every inbound stream and 33 bytes for every outbound stream
[FreeBSD]. We considered this memory overhead per stream to be insignificant
compared to the effort to modify the HTTP specification, and chose option (2).
Figure 2.3 gives an overview of our HTTP over SCTP design. A web
client and server first negotiate the number of SCTP streams to use for the web
transfer. During association establishment, the web client requests m inbound and m
outbound streams. The INIT-ACK from the server carries the web server’s offer on the
number of inbound/outbound streams (n). After association establishment, the number
of inbound and outbound streams available for HTTP transactions is s =
MIN(m, n). Note that an SCTP end point can initially offer a lower number of streams
and later increase the offer using the streams reset functionality [Stewart 2008a].
When a web server receives a request on inbound SCTP stream a (a < s),
the server sends the corresponding response on outbound stream a. If s is less than the
number of pipelined requests, the web client must schedule the requests over the
available SCTP streams using a scheduling policy, such as round-robin.
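The negotiation and scheduling just described can be sketched with the SCTP sockets API. The sketch below is illustrative only (the request format, host name, and error handling are simplified): it requests m streams via the SCTP_INITMSG socket option, reads the negotiated outbound stream count via SCTP_STATUS, and round-robins pipelined GET requests across streams with sctp_sendmsg().

```c
/* Sketch of stream negotiation and round-robin request scheduling.
 * Simplified: connect() details and error handling omitted. */
#include <netinet/in.h>
#include <netinet/sctp.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

void pipeline_requests(int sd, unsigned short m, char **urls, int nurls)
{
    /* Request m inbound and m outbound streams in the INIT chunk. */
    struct sctp_initmsg init;
    memset(&init, 0, sizeof(init));
    init.sinit_num_ostreams  = m;
    init.sinit_max_instreams = m;
    setsockopt(sd, IPPROTO_SCTP, SCTP_INITMSG, &init, sizeof(init));

    /* ... connect(sd, ...) here; the server's INIT-ACK carries its offer n ... */

    /* After establishment, SCTP_STATUS reports the negotiated stream count. */
    struct sctp_status status;
    socklen_t slen = sizeof(status);
    memset(&status, 0, sizeof(status));
    getsockopt(sd, IPPROTO_SCTP, SCTP_STATUS, &status, &slen);
    unsigned short s = status.sstat_outstrms;   /* = MIN(m, n) */

    /* Round-robin: request i goes on stream (i mod s); per the design,
     * the server returns the response on the same stream number. */
    for (int i = 0; i < nurls; i++) {
        char req[1024];
        int n = snprintf(req, sizeof(req),
                         "GET %s HTTP/1.1\r\nHost: www.example.com\r\n\r\n",
                         urls[i]);
        sctp_sendmsg(sd, req, (size_t)n, NULL, 0,
                     0 /* ppid */, 0 /* flags */,
                     (uint16_t)(i % s) /* stream */, 0 /* ttl */, 0 /* ctx */);
    }
}
```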
Functions such as authentication and dynamic content handling are performed by separate modules. The core
module relies on Apache Portable Runtime (APR), a platform independent API, for
network, memory and other system dependent functions.
Apache uses filters ─ functions through which different modules process
an incoming HTTP request (input filters) or an outgoing HTTP response (output
filters). The core module’s input filter calls APR’s read API to read HTTP requests.
During request processing, all state information related to the request are maintained in
a request structure. Once the response is generated, the core module’s output filter
calls APR’s send API for transmitting the response.
Apache has a set of multi-processing architectures that can be enabled at
compile time. We considered the following architectures: (1) prefork ─ non-threaded
pre-forking server and (2) worker ─ hybrid multi-threaded multi-processing server.
With prefork, a configurable number of processes are forked during server
initialization, and are set up to listen for connections from clients. With worker, a
configurable number of server threads and a listener thread are created per process.
The listener thread listens for incoming connections from clients, and passes the
connection to a server thread for further processing. In both architectures, the server
processes or threads handle requests sequentially from a transport connection.
initialization.
(mail/news reader), belong to the top layer, and rely on the services layer for access to
network and file I/O. The services layer uses platform independent APIs offered by the
Netscape Portable Runtime (NSPR) library.
Firefox has a multi-threaded architecture. To render a web page, the HTTP
module in the services layer parses the URL, uses NSPR to open a TCP connection to
the appropriate web server, and downloads the web page. While parsing the web page,
the HTTP module opens additional TCP connections as required, and pipelines HTTP
GET requests for the embedded objects.
Adapting Firefox to work over SCTP streams involved modifications to
both NSPR and the HTTP module.
connectivity through GPRS, a better scheduling policy might be ‘smallest pending
object first’ where the next GET request goes on the SCTP stream that has the smallest
sum of object sizes pending transfer. Such a policy reduces the probability of HOL
blocking among the responses downloaded on the same SCTP stream.
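A sketch of this policy follows; the bookkeeping structure is hypothetical (one running sum of pending object bytes per stream), and the next GET request is assigned to the stream with the smallest sum.

```c
/* 'Smallest pending object first': hypothetical bookkeeping sketch.
 * pending_bytes[s] = sum of object sizes still pending on stream s. */
static int pick_stream(const long pending_bytes[], int nstreams)
{
    int best = 0;
    for (int s = 1; s < nstreams; s++)
        if (pending_bytes[s] < pending_bytes[best])
            best = s;
    return best;  /* caller then adds the new object's size to the sum */
}
```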
1. Pieces of different responses are not delivered in an interspersed fashion.
2. Responses are delivered in the same sequence in which the pipelined
requests were transmitted.
These assumptions hold when the underlying transport is TCP – a reliable
protocol delivering in-order data to nsHttpPipeline. However, various factors result in
out-of-order response delivery in HTTP over SCTP streams.
Figure 2.4: Modifications to Firefox HTTP Module
2.5.2.1.1 Non HOL Blocked Requests
Loss of HTTP requests transmitted on stream i does not prevent delivery
of successfully received requests on stream j. During request losses, the order in
which requests arrive at the server (server_request) will differ from the order in which
the client transmitted them (client_request). Therefore, the generated server_response
order, and the resulting client_response order, will differ from client_request, violating
nsHttpPipeline's assumption (2).
The receiving SCTP uses the (i) (B)eginning fragment bit, (ii) sequential TSNs, and (iii)
(E)nding fragment bit for correct reassembly [RFC4960]. In effect, SCTP’s
fragmentation and reassembly creates dependencies in message transmission. A
fragment of message i+1 cannot be transmitted until all fragments of message i have
been transmitted.
Apache’s request processing rate is often higher than SCTP’s data
transmission rate, especially when SCTP’s data transmission is limited by low
bandwidth/high latency links and/or packet losses. In such scenarios, as long as the
SCTP socket’s send buffer allows, Apache writes multiple HTTP responses on the
socket, and these responses await transmission at the SCTP send buffer. If Apache
writes a 100K response on stream i followed by a 1K response on stream j, SCTP will
not transmit the 1K response until all fragments of the 100K response are successfully
transmitted. Note that the transmission time of the 100K response increases in low
bandwidth/high latency/high loss scenarios. Since the 100K and 1K responses are self-
regulating, it is highly desirable that browser’s rendering of the 1K response does not
depend on transmission/arrival/rendering of the 100K response.
To overcome this issue, we relocated message fragmentation from the
SCTP layer to HTTP response fragmentation at Apache. Apache writes an HTTP
response as multiple application messages, such that each message at the SCTP layer
fits within a single SCTP packet and avoids SCTP-level fragmentation. FreeBSD SCTP
serves the outbound stream queues in round-robin order: after transmitting from stream
i's queue, stream ((i+1) mod m)'s queue is considered for transmission. When Apache's request processing rate
is higher than SCTP’s transmission rate, multiple SCTP stream queues contain
messages (pieces of HTTP responses) awaiting transmission. Due to FreeBSD SCTP’s
round-robin transmission, the HTTP response pieces are transmitted in an interspersed
fashion, and arrive in the same fashion at Firefox’s SCTP layer. In fact, even under no
loss conditions, delivery of a piece of response i can be followed by a piece of response
j (i ≠ j). We call case (ii) object interleaving and discuss its advantages in [Natarajan 2006a].
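The relocation described above can be sketched as follows; the message payload size is an assumption (anything that keeps each SCTP message within one packet), and the function name is hypothetical.

```c
/* Sketch: write one HTTP response as many small application messages,
 * each fitting in a single SCTP packet, so the SCTP layer never
 * fragments and its round-robin scheduler can interleave streams.
 * MSG_PAYLOAD is an assumed value (path MTU minus header overhead). */
#include <netinet/sctp.h>
#include <stdint.h>
#include <sys/socket.h>

#define MSG_PAYLOAD 1400

void send_response(int sd, uint16_t stream, const char *resp, size_t len)
{
    size_t off = 0;
    while (off < len) {
        size_t piece = len - off;
        if (piece > MSG_PAYLOAD)
            piece = MSG_PAYLOAD;
        /* One self-contained SCTP message per call, on `stream`. */
        sctp_sendmsg(sd, resp + off, piece, NULL, 0, 0, 0, stream, 0, 0);
        off += piece;
    }
}
```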
Now, assume that the web server does HTTP response fragmentation and
both responses are transmitted on the same SCTP stream. In case (i), the server writes
all pieces of response 1 on the stream before writing response 2. Therefore, all pieces
of response 1 are transmitted (and delivered) to Firefox before any piece of response 2.
However, in case (ii), the two server threads write concurrently over the same SCTP
stream. Therefore, the response pieces can be transmitted and delivered in an
interspersed fashion at Firefox, violating nsHttpPipeline’s assumption (1).
• nsHttpConnection uses the MSG_PEEK flag and/or SCTP's extended receive
information structure [Stewart 2008b] to learn the SCTP stream on which a
piece of response arrived.
• Once nsHttpConnection knows the SCTP input stream, nsHttpConnection
associates the received piece of response to the nsHttpTransaction at the head
of the stream’s queue.
• When the nsHttpTransaction object is read completely, nsHttpConnection
deletes this transaction from the head of the stream queue, so that the next
piece of response on the stream is delivered to the new head of queue.
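The bullets above translate roughly to the following sketch; sctp_recvmsg() and the sinfo_stream field come from the SCTP sockets API, while the per-stream transaction queue helpers (queue_head, queue_pop, txn_append, txn_complete) are hypothetical placeholders for Firefox's internal bookkeeping.

```c
/* Sketch: demultiplex a received response piece to the transaction at
 * the head of its SCTP stream's queue. */
#include <netinet/sctp.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

typedef struct transaction transaction_t;              /* app-defined */
extern transaction_t *queue_head(uint16_t stream);
extern void queue_pop(uint16_t stream);
extern void txn_append(transaction_t *t, const char *buf, size_t n);
extern int  txn_complete(const transaction_t *t);

void read_response_piece(int sd)
{
    char buf[4096];
    struct sctp_sndrcvinfo sinfo;
    int flags = 0;

    ssize_t n = sctp_recvmsg(sd, buf, sizeof(buf), NULL, NULL, &sinfo, &flags);
    if (n <= 0)
        return;

    uint16_t stream = sinfo.sinfo_stream;   /* inbound stream of this piece */
    transaction_t *txn = queue_head(stream);
    txn_append(txn, buf, (size_t)n);

    if (txn_complete(txn))    /* response fully read: advance to next head */
        queue_pop(stream);
}
```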
Web workload characterization aids capacity planning and the design of more
efficient algorithms for improved WWW performance.
Using server logs from six different web sites, Arlitt et al. identified
several key web server workload attributes that were common across all six servers
[Arlitt 1997]. Their work also predicted that these attributes would likely “persist over
time”. Of these attributes, the following are most relevant to our study: (i) both file size
and transferred file size distributions are heavy-tailed (Pareto), and (ii) the median
transferred file size is small (≤5KB). A similar study conducted several years later
confirmed that the above two attributes remained unchanged over time [Williams
2005]. Also [Williams 2005] found that the mean transferred file size had slightly
increased over the years, due to an increase in the size of a few large files. Other
studies such as [Houtzager 2003, Williamson 2003] agree on [Arlitt 1997]’s findings
regarding transferred file size distribution and median transferred file size.
These measurement studies lead to a consensus that unlike bulk file or
multimedia transfers, HTTP transfers are short-lived flows, where a typical web object
consists of a small number of TPDUs and can be transferred in a few RTTs.
Figure 2.5: Emulation Setup
This section compares HTTP transfers over a single TCP connection vs. over a single
multistreamed SCTP association. The impact of multiple transport connections is
discussed in Section 2.8.
The following high latency browsing environments are considered for
evaluation [Cottrell 2006, PingER]. Results for other high latency environments such
as High Speed Downlink Packet Access (HSDPA) links are available in [Natarajan
2007].
• 1Mbps link with 350ms RTT (1Mbps.350ms): User in South Asia, accessing a
web server in North America over a land line.
• 1Mbps link with 850ms RTT (1Mbps.850ms): User in Africa, sharing a VSAT
link to access a web server in North America.
• 1Mbps link with 1100ms RTT (1Mbps.1100ms): User in Africa, sharing a
VSAT link to access a web server within Africa. The web traffic traverses at
least 2 VSAT links; the RTT over each VSAT link is ~550ms.
Figure 2.6: Page Rendering Times (N=10); panels (a) 1Mbps.350ms, (b) 1Mbps.850ms, (c) 1Mbps.1100ms
Our initial hypotheses about SCTP and TCP’s page rendering times were
as follows:
• Both SCTP and TCP have similar values for their initial cwnd, and employ
delayed acks with a 200ms timer. Therefore, we expected both TCP and
SCTP’s page rendering times to be identical when no losses occur.
• Though SCTP and TCP congestion control are similar, minor differences enable
better loss recovery and increased throughput in SCTP [Alamgir 2002]. Unlike
TCP whose SACK info is limited by the space available for TCP options, the
size of SCTP’s SACK chunk is larger (only limited by the path MTU), and
therefore at times contains more information about lost TPDUs than TCP’s
SACK. Also, FreeBSD’s SCTP stack implements the Multiple Fast Retransmit
algorithm (MFR), which reduces the number of timeout recoveries at the sender
[Caro 2006]. Therefore, as loss rates increase, we expected the enhanced loss
recovery features to help SCTP outperform TCP.
Figure 2.6 shows the page rendering times for N=10, averaged over 50
runs with 95% confidence. Similar results for N=5 and 15 can be found in [Natarajan
2007]. Interestingly, in all 3 graphs, the results for the no loss case contradict our first
hypothesis: TCP's rendering times are slightly (but not perceivably) better than SCTP's. Detailed
investigation revealed the following difference between the FreeBSD 6.1 SCTP and
TCP implementations. SCTP implements Appropriate Byte Counting (ABC) with L=1.
During slow start, a sender increments cwnd by 1MSS bytes for each delayed ack. The
TCP stack does packet counting, which results in a more aggressive cwnd increase
when the client acks TCP PDUs smaller than 1MSS (such as HTTP response headers).
We expect SCTP to perform similar to TCP when the TCP stack implements ABC with
L=1.
As the loss rate increases, SCTP’s enhanced loss recovery offsets the
difference in SCTP vs. TCP cwnd evolution. SCTP begins to perform better; the
difference is even more pronounced for transfers containing larger objects (10K and
15K). For the 1Mbps.1100ms case, the difference between SCTP and TCP page
rendering times for 10K and 15K transfers is ~6 seconds at 3% loss, and as high as ~15
seconds at 10% loss. For the same types of transfers, the difference is ~8-10 seconds
for 10% loss in 1Mbps.350ms scenario. Similar trends are observed in results for N=5
and 15 as well [Natarajan 2007].
To summarize, SCTP’s page rendering times are comparable to TCP’s
during no loss, and SCTP’s enhanced loss recovery enables faster page rendering times
during lossy conditions. More importantly, the absolute page rendering time difference
increases, and is more visually perceivable as the end-to-end delay, loss rate, and
object size increase.
Since a single TCP connection serializes all objects in the pipelined transfer, these
independent objects are delivered to Firefox
only in a sequential manner, such that Firefox processes and renders at most one object
at a time. Packet losses cause HOL blocking and further delay the sequential delivery of
independent objects. On the other hand, SCTP streams provide concurrency in the
transfer and delivery of independent objects: an SCTP receiver can deliver object i+1
to Firefox even before object i is completely delivered, as long as these two objects are
transmitted over different SCTP streams. This concurrency enables Firefox to render
multiple objects in parallel, a.k.a., concurrent rendering.
While browsers have to open multiple TCP connections to achieve
concurrent rendering, concurrent rendering is innate to a multistreamed web transport.
The browser tunes the concurrency level by simply adjusting the number of streams. An
SCTP association with one stream provides the same concurrency as a single TCP
connection, and results in sequential rendering. An SCTP association with two streams
provides twice as much concurrency as sequential rendering. A multistreamed
association provides maximum concurrency for a pipelined transfer when the number
of streams equals the number of objects in the transfer. Note that concurrent rendering
remains unaffected by a further increase in concurrency.
In our initial investigations, we discovered that a multistreamed web
transport enables concurrent rendering even during no losses. Irrespective of packet
losses, the interaction between Apache’s HTTP response fragmentation and FreeBSD
SCTP (Section 2.5.2.1.3) causes Firefox’s SCTP layer to receive pieces of multiple
objects in an interleaved fashion. The SCTP receiver delivers these pieces of multiple
objects in an interspersed fashion to Firefox, resulting in concurrent rendering even
during no losses. During packet losses, SCTP streams eliminate or reduce HOL
blocking, thus increasing the degree of concurrent rendering. Concurrent rendering is
demonstrated in a number of movies available online at [Movies].
To reiterate, the fundamental difference between sequential and concurrent
rendering is that in sequential rendering, a piece of object i is rendered only after
objects 1 through i-1 are completely rendered, whereas in concurrent rendering,
pipelined objects are displayed independent of each other. We use the following metric
to capture the concurrency and progression in the appearance of all pipelined objects
on the user’s screen. Recall terminology from Section 2.2,
req0 = time when browser sends HTTP GET request for
index.html.
(Preni – req0) = time elapsed from the beginning of the page download
(req0) to the earliest time when at least P% of object i is rendered.
PPage is defined as the time elapsed from the beginning of page download
to the earliest time when at least P% of all pipelined objects are rendered on the screen,
i.e., PPage = MAX[(Preni − req0); 1 ≤ i ≤ N]
Figure 2.7 plots the 25%Page, 50%Page, 75%Page and 100%Page values for
N=10, averaged over 50 runs. Transfers over SCTP consider maximum concurrency,
i.e., enough SCTP streams are opened so that every pipelined object is downloaded on
a different stream. Results for N=5 and 15 can be found in [Natarajan 2007]. As
expected, 100%Page values for both concurrent (solid points connected by dotted lines)
and sequential (hollow points connected by dashed lines) rendering equal the
corresponding transport's page rendering times (T). Also, the PPage times in
concurrent rendering are spread out vs. clustered together in sequential rendering.
Concurrent rendering's dispersion in PPage values signifies the parallelism in the
appearance of all 10 pipelined objects.
Figure 2.7: PPage Values for N=10; panels (a) 1Mbps.350ms, (b) 1Mbps.850ms, (c) 1Mbps.1100ms
Both sequential and concurrent rendering schemes' PPage values are comparable
at 0% loss. As loss rate increases, the difference in the two rendering schemes' PPage
values increases. In addition, we find that concurrent rendering displays 25%-50% of all
pipelined objects much sooner (relative difference ~4 – 2 times for 15K, 10K and 5K
objects) than sequential rendering. This result holds true for N=5 and 15 as well. In the
following subsection, we demonstrate how this result can be leveraged to significantly
improve response times for objects such as progressive images, whose initial 25%-50%
contain sufficient information for the human eye to perceive the object contents.
In the snapshots shown in Figure 2.8, both sequential (left) and concurrent (right) runs
experienced ~4.3% loss. Both rendering schemes start the download at t=0s. At t=6s
(Figure 2.8a), the sequential scheme rendered a complete image followed by a good
quality 2nd image, and the concurrent scheme displayed a complete image on the
browser window.
Figure 2.8 (a): Snapshot at t=6 seconds
At t=12s (Figure 2.8c), sequential rendering displays 4 complete images, whereas
concurrent rendering presents the user with all 10 images of good quality. With
concurrent rendering, the complete page is rendered only at ~t=23s. From t=12s to 23s,
all 10 images get refined, but the value added by the refinement is negligible to the
human eye. Therefore, the user “perceives” all images to be complete by t=12s, while
the page rendering time is actually t=23s. In the sequential run, all 10 images do not
sequential rendering.
This section evaluates HTTP performance over multiple TCP connections vs. a single multistreamed SCTP
association. Similar to Section 2.7, investigations here focus on browsing conditions
most likely to exist in the developing world.
2.8.1 Background
In congestion-controlled transports such as TCP and SCTP, the amount of
outstanding (unacknowledged) data is limited by the data sender’s cwnd. Immediately
after connection establishment, the sender can transmit up to initial cwnd bytes of
application data [RFC3390, RFC4960]. Until congestion detection, both TCP and
SCTP employ the slow start algorithm that doubles the cwnd every RTT.
Consequently, the higher the initial cwnd, the faster the cwnd growth and more data
gets transmitted every RTT. When an application employs N TCP connections, during
the slow start phase, the connections’ aggregate initial cwnd and their cwnd growth
increase N-fold. Therefore, until congestion detection, an application employing N
TCP connections can, in theory, experience up to N times more throughput than an
application using a single TCP connection.
When a TCP or SCTP sender detects packet loss, the sender halves the
cwnd, and enters the congestion avoidance phase [Jacobson 1988, RFC4960]. If an
application employing N TCP connections experiences congestion on the transmission
path, not all of the connections may suffer loss. If M of the N open TCP connections
suffer loss, the multiplicative decrease factor for the connection aggregate is (1 − M/(2N))
[Balakrishnan 1998a]. Since this decrease factor exceeds one-half whenever M < N
(i.e., unless all N connections experience loss), the connections' aggregate cwnd
and throughput after congestion detection remain more than N times those of a single
TCP connection.
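The factor follows from simple bookkeeping; assuming, for illustration, that all N connections hold equal cwnd w when loss strikes:

```latex
% Aggregate cwnd before and after M of N equal-cwnd connections halve:
W_{before} = N w, \qquad
W_{after} = (N - M)\,w + M\,\tfrac{w}{2} = \left(N - \tfrac{M}{2}\right) w,
\qquad
\frac{W_{after}}{W_{before}} = 1 - \frac{M}{2N}.
% Example: N = 4, M = 1 gives 7/8, well above the 1/2 of a single connection.
```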
On the whole, an application employing multiple TCP senders exhibits an
aggressive sending rate, and consumes a higher share of the bottleneck bandwidth than
an application using fewer or a single TCP connection [Mahdavi 1997, Balakrishnan
1998a]. Multiple TCP connections' aggressive sending behavior has been shown to
increase throughput for various applications. [Tullimas 2008] employs multiple
TCP connections to maintain the data streaming rate in multimedia applications.
[Sivakumar 2000] proposes the PSockets library, which employs parallel TCP
connections to increase throughput for data intensive computing applications.
Likewise, we expect multiple TCP connections to improve HTTP throughput.
at least 1 RTT, and packet losses during the first HTTP transaction further increase the
transfer time. Clearly, this behavior is detrimental to HTTP throughput over multiple
TCP connections. Also, this behavior interferes with the dynamics we are interested in
investigating – interaction between multiple TCP connections and HTTP performance.
Therefore, we developed a simple HTTP 1.1 client, which better models the general
behavior of HTTP 1.1 over multiple transport connections, and does not bias results.
Among other details, the client employs non-blocking reads/writes, and disables the
Nagle algorithm [RFC896]. The following algorithm
describes the client in detail:
1. Setup a TCP or SCTP socket.
2. If SCTP, set appropriate data structures to request the required number of
input and output streams during association establishment.
3. Connect to the server.
4. Timestamp “Page Download Start Time”.
5. Request for index.html.
6. Receive and process index.html.
7. Make the socket non-blocking, and disable Nagle.
8. While there are more transport connections to be opened:
8.1. Setup a socket (non-blocking, disable Nagle).
8.2. Connect to the server.
9. While the complete page has not been downloaded:
9.1. Poll for read, write, or error events on socket(s).
9.2. Transmit pending requests on TCP connections or SCTP streams.
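A sketch of steps 7-9 in C follows; the event handlers and the completion predicate (page_complete, send_pending_requests, read_responses) are hypothetical placeholders for the client's bookkeeping, and FreeBSD SCTP offers an SCTP_NODELAY analog of the TCP_NODELAY option shown.

```c
/* Sketch of steps 7-9: non-blocking sockets, Nagle disabled, poll loop. */
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <poll.h>
#include <sys/socket.h>

extern int  page_complete(void);            /* hypothetical helpers */
extern void send_pending_requests(int fd);
extern void read_responses(int fd);

void prepare_socket(int fd)
{
    int on = 1;
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);    /* step 7/8.1 */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)); /* no Nagle  */
}

void download_page(struct pollfd *fds, int nfds)
{
    while (!page_complete()) {
        poll(fds, nfds, -1);                                   /* step 9.1 */
        for (int i = 0; i < nfds; i++) {
            if (fds[i].revents & POLLOUT) send_pending_requests(fds[i].fd);
            if (fds[i].revents & POLLIN)  read_responses(fds[i].fd);
        }
    }
}
```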
The following RTT scenarios are considered for evaluation [Cottrell 2006, PingER]:
• 200ms RTT: User in East Asia, accessing a web server in North America over a
land line.
• 350ms RTT: User in South Asia, accessing a web server in North America over
a land line.
• 650ms RTT: User accessing a web server over a shared VSAT link.
Figure 2.9: HTTP Throughput (Object Size = 5K); panels (a) 64Kbps.200ms, (b) 128Kbps.200ms, (c) 1Mbps.200ms
The FreeBSD TCP implementation tracks numerous sender and receiver
related statistics including the number of timeout recoveries, and fast retransmits. After
each TCP run, some of these statistics were gathered either directly from the TCP stack
or using the netstat utility.
Figure 2.9 compares page download times over a single multistreamed SCTP association (a.k.a. SCTP) vs.
N TCP connections (N=1, 2, 4, 6, 8, 10; a.k.a. N-TCP) for the 64Kbps, 128Kbps and
1Mbps bandwidth scenarios. Results for 256Kbps bandwidth scenario can be found in
[Natarajan 2008d]. Note that each embedded object is transmitted on a different TCP
connection in 10-TCP, and employing more TCP connections is unnecessary. The
values in Figure 2.9 are averaged over 40 runs (up to 60 runs for the 10% loss case),
and plotted with 95% confidence intervals.
multiple TCPs during congestion. As mentioned earlier, the initial cwnds of both TCP
and SCTP are similar ─ 4MSS. Since there is no loss, both transports employ slow
start during the entire page download. This equivalent behavior results in similar
throughputs between SCTP and 1-TCP in 64Kbps and 128Kbps bandwidths. Recall
from Section 2.7.2 that the packet-counting FreeBSD 6.1 TCP sender increases its
cwnd more aggressively than an SCTP sender. As the available bandwidth increases
(256Kbps, 1Mbps), this difference in cwnd growth facilitates 1-TCP to slightly
outperform SCTP [Natarajan 2008d].
As mentioned in Section 2.8.1, N-TCP’s aggressive sending rate can
increase an application’s throughput by up to N times during slow start. Therefore, as
the number of TCP senders increases, we expected multiple TCPs to outperform both 1-
TCP and SCTP. Surprisingly, the results indicate that multiple TCPs perform similar to,
or worse than, 1-TCP and SCTP in the 64Kbps (N=10), 128Kbps (N=8, 10), and
256Kbps (N>2) bandwidths [Natarajan
2008d]. The 1Mbps bottleneck is completely utilized by the initial cwnd of N=4 TCP
senders (~16 1500-byte PDUs per RTT). Therefore, 2 ≤ N ≤ 4 TCP senders slightly
improve page download times when compared to 1-TCP, and N > 4 TCP senders do not
further reduce page download times.
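As a sanity check of that claim (our arithmetic, using the 4 MSS initial cwnd cited in Section 2.7):

```latex
% 1 Mbps, 200 ms path in 1500-byte PDUs vs. four senders' initial cwnds:
\frac{10^{6}\ \mathrm{b/s} \times 0.2\ \mathrm{s}}{8 \times 1500\ \mathrm{B}}
\approx 16.7\ \text{PDUs per RTT},
\qquad 4 \times 4\ \mathrm{MSS} = 16\ \text{PDUs}.
```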
As the propagation delay and RTT increase, the bottleneck router forwards
more packets per RTT. For example, the 1Mbps pipe can transmit ~53 PDUs per RTT
in the 650ms scenario vs. ~16 PDUs per RTT in the 200ms scenario. Consequently,
more TCP senders help fully utilize the 1Mbps pipe at 650ms RTT, and N-TCPs
decrease page download times [Natarajan 2008d]. However, similar to the 200ms RTT
scenario, lower bandwidths limit HTTP throughput, and N-TCPs perform similar to 1-TCP.
Figure 2.10 plots the number of RTO expirations on data at the server.
Since no packets were lost, these timeouts must be spurious, and are due to the
following.
During connection establishment, a FreeBSD TCP sender estimates the
RTT, and calculates the retransmission timeout value (RTO) [FreeBSD, RFC2988].
For a 200ms RTT, the calculated RTO equals the recommended minimum of 1 second
[RFC2988].
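The 1-second figure follows from RFC 2988's initialization rule for the first RTT sample R (SRTT = R, RTTVAR = R/2, RTO = SRTT + 4·RTTVAR, floored at 1 second):

```latex
% First RTT sample R = 0.2 s:
SRTT = 0.2\ \mathrm{s},\quad RTTVAR = 0.1\ \mathrm{s},\quad
RTO = 0.2 + 4 \times 0.1 = 0.6\ \mathrm{s}
\;\Rightarrow\; RTO = \max(0.6,\, 1) = 1\ \mathrm{s}.
```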
Figure 2.10: RTO Expirations on Data at Server (Object Size = 5K); panels (a) 64Kbps.200ms, (b) 128Kbps.200ms, (c) 1Mbps.200ms
Connection establishment is soon followed by data transfer from the server. Lower
bandwidth translates to higher transmission and queuing delays. In a 64Kbps pipe, the
transmission of one 1500-byte PDU takes ~186ms, and a queue of ~5 such PDUs
gradually increases the queuing delay and the RTT to more than 1 second. When
outstanding data remains unacknowledged for more than the 1 second RTO, the TCP
sender(s) (wrongly) assume data loss, and spuriously timeout and retransmit
unacknowledged data.
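The numbers behind this argument (our arithmetic from the figures quoted above):

```latex
% 64 Kbps last hop, 1500-byte PDUs, min RTO = 1 s:
t_{tx} = \frac{1500 \times 8}{64{,}000} \approx 0.19\ \mathrm{s\ per\ PDU};
\quad 5\ \text{queued PDUs} \Rightarrow \approx 0.94\ \mathrm{s\ queuing},
\quad RTT \approx 0.2 + 0.94\ \mathrm{s} > 1\ \mathrm{s}\ RTO.
```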
As the number of TCP senders increases, more packets arrive at the
bottleneck, and the increased queuing delay triggers spurious timeouts at a greater
number of TCP senders. Of the 4 bandwidth scenarios considered, the 1Mbps transfers
experience the smallest queuing delay, and do not suffer from spurious timeouts. As the
bottleneck bandwidth decreases, queuing delay increases. Therefore HTTP transfers
over smaller bandwidths experience more spurious timeouts.
A spurious timeout is followed by unnecessary retransmissions and cwnd
reduction. If the TCP sender has more data pending transmission, spurious timeouts
delay new data transmission, and increase page download times (N=2, 4, 6, 8 TCP in
64Kbps, and N=4, 6 TCP in 128Kbps). As the number of TCP connections increase,
fewer HTTP responses are transmitted per connection. For example, each HTTP
response is transmitted on a different connection in 10-TCP. Though the number of
spurious timeouts (and unnecessary retransmissions) is highest in 10-TCP, the TCP
receiver delivers the first copy of data to the HTTP client, and discards the spuriously
retransmitted copies. Therefore, 10-TCP’s page download times are unaffected by the
spurious timeouts. Nonetheless, spurious timeouts cause wasteful retransmissions that
compete with other flows for the already scarce available bandwidth.
As the propagation delay increases, the RTO calculated during connection
establishment is increased (> 1 second). Since transmission and queuing delays remain
unaffected, they impact the RTT less at higher propagation delays. Consequently,
spurious timeouts slightly decrease at 350ms and 650ms RTTs, but still remain
significant at lower bandwidths, and increase page download times [Natarajan 2008d].
To summarize, the aggressive sending rate of multiple TCP senders during
slow start does NOT necessarily translate to improved HTTP throughput in low
bandwidth last hops. Bursty data transmission from multiple TCP senders increases
queuing delay causing spurious timeouts. The unnecessary retransmissions following
spurious timeouts (i) compete for the already scarce available bandwidth, and (ii)
adversely impact HTTP throughput when compared to 1-TCP or SCTP. The
throughput degradation is more noticeable as the bottleneck bandwidth decreases.
Recall from Section 2.8.1 that N-TCPs’ (N>1) aggressive sending rate
during congestion avoidance can, in theory, increase throughput by more than N times.
Therefore, we expected multiple TCPs to outperform both 1-TCP and SCTP. On the
contrary, multiple TCP connections worsen HTTP page download times, and the
degradation becomes more pronounced as loss rate increases. This observation is true
for all 4 bandwidth scenarios studied. Further investigation revealed the following
reasons.
Higher loss rates trigger more SACK recovery episodes, and increase retransmissions during SACK recoveries
(Figure 2.11). However, for a given loss rate, the retransmissions decrease as the
number of TCP connections increase. That is, for the same fraction of lost HTTP data
(same loss rate), loss recoveries based on fast retransmits decrease as the number of
TCP senders increase.
Figure 2.11: Fast Retransmits during SACK Recovery (Object Size = 5K); panels (a) 64Kbps.200ms, (b) 1Mbps.200ms
Note that loss recovery based on fast retransmit relies on dupack
information from the client. As the number of TCP connections increase, data
transmitted per connection decreases, thus reducing the number of potential dupacks
arriving at each TCP sender. Ack losses on the reverse path further decrease the
number of dupacks received. While the TCP senders implement Limited Transmit
[RFC3042] to increase dupack information, the applicability of Limited Transmit
diminishes as the amount of data transmitted per TCP connection decreases.
In summary, increasing the number of TCP connections decreases per
connection dupack information. Fewer dupacks reduce the chances of fast retransmit-
based loss recovery, resulting in each sender performing more timeout-based loss
recoveries.
As shown in Figure 2.12, the number of SYN or SYN-ACK retransmissions tends to
increase as the number of TCP connections increases.
A SYN or SYN-ACK loss can be recovered only after the recommended
initial RTO value of 3 seconds [RFC2988], and increases the HTTP page download
time by at least 3 seconds. Consequently, losses during connection establishment
degrade HTTP throughput more when the time taken to download HTTP responses is small.
Figure 2.12: SYN or SYN-ACK Retransmissions (Object Size = 5K); panels (a) 64Kbps.200ms, (b) 1Mbps.200ms
Figure 2.13: HTTP Throughput (Object Size = 10K); panels (a) 64Kbps.200ms, (b) 128Kbps.200ms, (c) 1Mbps.200ms
Figure 2.14: RTO Expirations on Data at Server (1Mbps.200ms; Object Size = 10K)
The increased flow of acks in the 10K transfers triggered more fast retransmissions in
SACK recovery episodes, and fewer timeout-based recoveries compared to the 5K
transfers (Figure 2.14). Consequently, N-TCPs improved HTTP throughput in the 10K
transfers. However, as the last hop bandwidth decreases, the negative consequences of
multiple TCP senders, such as increased queuing delay and connection establishment
latency, increase the page download times, and N-TCPs perform similar to or worse
than 1-TCP. More importantly, note that SCTP's enhanced loss recovery helps it
outperform N-TCPs even in the 10K transfers.
To summarize, object size affects HTTP throughput over multiple TCP
connections. Smaller objects reduce dupack information per TCP connection and
degrade HTTP throughput more than bigger objects. However, the impact of object
size decreases, and the negative consequences of multiple TCP senders dominate,
bringing down HTTP throughput at lower bandwidths.
To conclude, the competing and bursty nature of multiple TCP senders degrades
HTTP performance especially in low bandwidth last hops. In such browsing conditions,
a single multistreamed SCTP association not only eliminates HOL blocking, but also
boosts throughput compared to multiple TCP connections.
Our body of work in HTTP over SCTP has stimulated significant interest
in the area. The Protocol Engineering Lab has also secured funding through Cisco
Systems’ University Research Program for some of the ongoing activity discussed
below.
We are working to integrate our HTTP over SCTP design and implementation into the Firefox distribution from mozilla.org,
and the Apache distribution from apache.org. The current activity is focused on
integrating SCTP related APIs in the Netscape Portable Runtime (NSPR) API and the
Apache Portable Runtime (APR) API, which offer platform independent network
implementations to Firefox and Apache, respectively. Subsequent work will focus on
modifying Firefox and Apache to take advantage of these SCTP related APIs, and
enabling appropriate SCTP related compile options for various platforms and SCTP
implementations.
Supporting more SCTP streams (inbound/outbound) increases the processing and resource overhead at the web
server or proxy. However, the resources required to support a new pair of SCTP
streams are much less than those for a new TCP connection. For example, on FreeBSD
each inbound or outbound SCTP stream requires an additional 28 or 32 bytes,
respectively, in the SCTP Protocol Control Block (PCB), while a new TCP PCB
requires ~700 bytes [FreeBSD]. The difference in TCP vs. SCTP resource
requirements increases with the number of clients, and can be significant at a web
server farm handling thousands of clients. This difference can also be significant at
intermediate entities such as web caches that serve many web clients and/or other
caches [Squid].
The absolute difference in TCP vs. SCTP resource requirements depends
not only on the respective protocols but also on how optimized their
implementations are. While the TCP stack has been optimized over the past two
decades, the SCTP stack is relatively new, and the SCTP reference implementation on
FreeBSD can be optimized further. For example, Randall Stewart, the designer of
FreeBSD SCTP, estimates that the FreeBSD SCTP PCB size can be reduced by ~600
bytes. Evaluating TCP vs. SCTP resource usage makes more sense after such
optimizations are in place.
We are also investigating a low-cost, gateway-based solution that translates HTTP over
TCP to HTTP over SCTP streams for easier and localized deployment. The solution
assumes that the web
browser is capable of HTTP over SCTP, similar to the SCTP-enabled, freely available
Firefox browser used in our emulations. The gateway is physically positioned between
the server and client, such that the gateway talks SCTP to clients over the last hop
with high propagation delay and/or low bandwidth, and talks TCP to web servers in the
outside world. For the architecture shown in Figure 2.2, the gateway is positioned
between the VSAT ground station (on the left) and the Internet cloud. We believe that
the “proxy” configuration in the SCTP-enabled Apache server is a good starting point
to achieve the gateway functionality at minimal monetary cost [Apache].
At a minimum, a gateway solution should provide faster page downloads
than HTTP over TCP. This solution can be extended to further enhance pipelined
objects’ response times. For example, the gateway could use batch image conversion
software [Gimp] to convert embedded non-progressive JPEG or PNG images to their
corresponding progressive versions before forwarding them to the clients. Image
conversion at the gateway takes on the order of milliseconds per image, but can
improve a user’s response times on the order of seconds.
2.10 Related Work
Significant interest exists in designing new transport and session protocols
that better suit the needs of HTTP-based client-server applications than TCP does. As
mentioned earlier, several experts agree that the best transport scheme for HTTP
would be one that supports datagrams, provides TCP compatible congestion control on
the entire datagram flow, and facilitates concurrency in GET requests [Gettys 2002].
WebMUX [Gettys 1998] was one such session management protocol that was a
product of the (now historic) HTTP-NG working group [HTTP-NG]. WebMUX
proposed using a reliable transport protocol to provide web transfers with streams for
transmitting independent objects. However, the WebMUX effort did not mature.
[Ford 2007] proposes the use of Structured Stream Transport (SST) for
web transfers. SST was proposed after [Natarajan 2006a] and functions similarly to
SCTP streams. SST extends TCP to provide multiple streams over a TCP-friendly
transport connection. Simulation-based evaluations in [Ford 2007] show that SST
provides similar page download times as TCP. The primary contribution of a
multistreamed web transport is the reduction in HOL blocking, which is the focus of
our work. Using real implementations, we show that reduced HOL blocking in HTTP
over SCTP results in visually perceivable improvements to individual objects’ response
times in browsing conditions typical of developing regions. Also, we note that SCTP is
a standardized transport protocol with several existing implementations.
The Congestion Manager (CM) performs integrated congestion
control at the end host, thereby enforcing a fair sending rate when an HTTP transfer
employs multiple TCP connections. “TCP Session” [Padmanabhan 1998] proposes
integrated loss recovery across multiple TCP connections to the same web client (these
multiple TCP connections are together referred to as a TCP session). All TCP
connections within a session are assumed to share the transmission path to the web
client. A Session Control Block (SCB) is maintained at the sender to store information
about the shared path such as its cwnd and RTT estimate. While CM and TCP Session
reduce the adverse effects of parallel TCP connections on the network and the
application, these solutions still require a web browser to open multiple TCP
connections, thereby increasing the web server’s resource requirements.
Content Delivery Networks (CDNs) replicate web content across
geographically distributed servers, and reduce response times for web users by
redirecting requests to a server closest to the client. [Krishnamurthy 2001] confirms
that CDNs reduce average web response times for web users along USA’s east coast
for static content. Unfortunately, little research exists on the prevalence of CDNs for
content providers and web users outside of developed nations. Also, CDNs cannot
lessen web response times when latency is due to (i) propagation delay and/or low
bandwidth last hop, as is the case in developing regions, or (ii) sub-optimal traffic
routing that increases end-to-end path RTTs [Baggaley 2007].
Chapter 3

Section 3.2 describes data reneging and the inefficiencies with TCP and SCTP SACK
mechanisms when received data is
non-renegable. Section 3.3 proposes NR-SACKs for SCTP, and discusses the specifics
of SCTP’s NR-SACK chunk. Sections 3.4 and 3.5 discuss simulation preliminaries and
present results comparing SACKs vs. NR-SACKs in both SCTP and CMT. Finally,
Section 3.6 concludes and presents ongoing and future work.
3.1 Introduction
Reliable transport protocols such as TCP and SCTP employ two kinds of
data acknowledgment mechanisms: (i) cumulative acks (cum-acks) indicate data that
has been received in-sequence, and (ii) selective acknowledgments (SACKs) indicate
data that has been received out-of-order. In both TCP and SCTP, while cum-acked
data is the receiver’s responsibility, SACKed data is not, and SACK information is
advisory [RFC3517, RFC4960]. While SACKs notify a sender about the reception of
specific out-of-order TPDUs, the receiver is permitted to later discard those TPDUs.
Discarding data that has been previously SACKed is known as reneging. Though
reneging is a possibility, the conditions under which current transport layer and/or
operating system implementations renege, and the frequency with which these conditions
occur in practice (if any), are unknown and need further investigation.
Data that has been delivered to the application, by definition, is non-
renegable by the transport receiver. Unlike TCP, which never delivers out-of-order data
to the application, SCTP's multistreaming and unordered data delivery services
(Chapter 1) result in out-of-order data being delivered to the application, and thus
being non-renegable.
3.2.1 Background
The SCTP (or TCP) send buffer, or the sender-side socket buffer (Figure
3.1), consists of two kinds of data: (i) new application data waiting to be transmitted
for the first time, and (ii) copies of data that have been transmitted at least once and are
waiting to be cum-acked, a.k.a. the retransmission queue (RtxQ). Data in the RtxQ is
the transport sender’s responsibility until the receiver has guaranteed its delivery to
the receiving application, and/or the receiver guarantees not to renege on the data.
In a multistreamed SCTP association, data arriving in-order within a stream can be
delivered to the receiving application even if the data is out-of-
order relative to the association’s overall flow of data. Also, data marked for unordered
delivery can be delivered immediately upon reception, regardless of the data’s position
within the overall flow of data. Thus, SCTP’s data delivery services result in situations
where out-of-order data is delivered to the application, and is thus non-renegable.
Operating systems allow configuration of transport layer implementations
such that received out-of-order data is never reneged. For example, in FreeBSD, the
net.inet.tcp.do_tcpdrain or net.inet.sctp.do_sctp_drain sysctl parameters can be
configured to never revoke kernel memory allocated to TCP or SCTP out-of-order data.
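For instance, on FreeBSD these parameters could be set as follows (a minimal sketch; we assume a value of 0 disables the drain routine, so kernel memory holding out-of-order data is never reclaimed):

    # Assumption: 0 disables draining of out-of-order queues,
    # so received out-of-order data is never reneged.
    sysctl net.inet.tcp.do_tcpdrain=0
    sysctl net.inet.sctp.do_sctp_drain=0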
In SCTP, each data TPDU is assigned a Transmission Sequence Number (TSN). The
timeline slice shown in Figure 3.2 picks up the data
transfer at a point when the sender’s cwnd C=8, allowing transmission of 8 TPDUs
(arbitrarily numbered with TSNs 11-18). Note that when TSN 18 is transmitted, the
RtxQ grows to fill the entire send buffer.
In this example, TSN 11 is presumed lost in the network. The other TSNs
are received out-of-order and immediately SACKed by the SCTP receiver. The SACKs
shown have the following format: (S)ACK: CumAckTSN; GapAckStart-GapAckEnd.
Each gap-ack start and gap-ack end value is relative to the cum-ack value, and together
they specify a block of received TSNs.
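To make the relative encoding concrete, the following C sketch (hypothetical structure and names, not from any particular SCTP stack) converts one gap-ack block into the absolute range of TSNs it acknowledges:

    #include <stdint.h>

    /* One gap-ack block; start/end offsets are relative to the
       cumulative ack TSN, as in the (S)ACK notation above. */
    struct gap_ack_block {
        uint16_t start;   /* GapAckStart */
        uint16_t end;     /* GapAckEnd   */
    };

    /* Compute the absolute TSN range acknowledged by one block.
       For the SACK (S:10;2-2): 10 + 2 = 12, i.e., only TSN 12. */
    static void gap_ack_range(uint32_t cum_ack_tsn, struct gap_ack_block b,
                              uint32_t *first_tsn, uint32_t *last_tsn)
    {
        *first_tsn = cum_ack_tsn + b.start;
        *last_tsn  = cum_ack_tsn + b.end;
    }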
At the sender, the first SACK (S:10;2-2) is also a dupack and gap-acks
TSN 12. Though data corresponding to TSN 12 has been delivered to the receiving
application, the SACK does not convey the non-renegable nature of TSN 12, requiring
the sender to continue being responsible for this TSN. Starting from the time this
SACK arrives at the sender, the copy of TSN 12 in the sender’s RtxQ is unnecessary.
The gap-ack for TSN 12 reduces the amount of outstanding data (O) to 7 TPDUs.
Since O<C, the sender could in theory transmit new data, but in practice cannot do so
since the completely filled send buffer blocks the sending application from writing new
data into the transport layer. We call this situation send buffer blocking. Note that send
buffer blocking prevents the sender from fully utilizing the cwnd.
The second and third dupacks (S:10;2-3, S:10;2-4) increase the number of
unnecessary TSNs in the RtxQ, and send buffer blocking continues to prevent new data
transmission. On receipt of the third dupack, the sender halves the cwnd (C=4), fast
retransmits TSN 11, and enters fast recovery. Dupacks received during fast recovery
further increase the amount of unnecessary data in the RtxQ, prolonging inefficient
RtxQ usage. Note that though these dupacks reduce outstanding data (O<C), send
buffer blocking prevents new data transmission.
The sender eventually exits fast recovery when the SACK for TSN 11’s
retransmission (S:18) arrives. The sender removes the unnecessary copies of TSNs 12-
18 from the RtxQ, and concludes the current instance of send buffer blocking. Since
send buffer blocking prevented the sender from fully utilizing the cwnd before, the new
cum ack (S:18) does not increase the cwnd [RFC4960]. The application writes data
into the newly available send buffer space and the sender now transmits TSNs 19-22.
Based on the timeline in Figure 3.2, the following observations can be
made regarding transfers with non-renegable out-of-order data:
• The unnecessary copies of non-renegable out-of-order data waste kernel
memory (RtxQ). The amount of wasted memory is a function of flightsize
(amount of data “in flight”) during a loss event; a larger flightsize wastes more
memory.
• When the RtxQ grows to fill the entire send buffer, send buffer blocking ensues,
which can degrade throughput.
Non-Renegable SACKs (NR-SACKs) enable a transport receiver to explicitly convey
non-renegable information on out-of-order data. In SCTP, NR-SACKs provide the same
information as SACKs for congestion and flow control, and the sender is expected to
process this information identically to SACK processing. In addition, NR-SACKs
provide the added option to report some or all of the out-of-order data as being
non-renegable.
An endpoint supporting the NR-SACK extension lists the NR-SACK chunk in the
Supported Extensions Parameter carried in
the INIT or INIT-ACK chunk [RFC5061]. During association establishment, if both
endpoints support the NR-SACK extension, then each endpoint acknowledges received
data with NR-SACK chunks instead of SACK chunks.
The proposed NR-SACK chunk for SCTP is shown in Figure 3.3. Since
NR-SACKs extend SACK functionality, the NR-SACK chunk has several fields
identical to the SACK chunk: the Cumulative TSN Ack, the Advertised Receiver
Window Credit, Gap Ack Blocks, and Duplicate TSNs. These fields have identical
semantics to the corresponding fields in the SACK chunk [RFC4960]. NR-SACKs also
report non-renegable out-of-order data chunks in the NR Gap Ack Blocks, a.k.a. “nr-
gap-acks”. Each NR Gap Ack Block acknowledges a continuous subsequence of non-
renegable out-of-order data chunks. All data chunks with TSNs ≥ (Cumulative TSN
Ack + NR Gap Ack Block Start) and ≤ (Cumulative TSN Ack + NR Gap Ack Block
End) of each NR Gap Ack Block are reported as non-renegable. The Number of NR
Gap Ack Blocks (M) field indicates the number of NR-Gap Ack Blocks included in the
NR-SACK chunk.
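As an illustration, the following C sketch (hypothetical names; the 16-bit offsets mirror the SACK chunk's gap-ack blocks) tests whether a TSN is reported non-renegable by a chunk's NR Gap Ack Blocks:

    #include <stdint.h>

    /* One NR Gap Ack Block; offsets are relative to the Cumulative
       TSN Ack, exactly like ordinary Gap Ack Blocks. */
    struct nr_gap_ack_block {
        uint16_t start;   /* NR Gap Ack Block Start */
        uint16_t end;     /* NR Gap Ack Block End   */
    };

    /* A TSN is non-renegable if it lies within
       [cum_ack + start, cum_ack + end] of any NR Gap Ack Block. */
    static int tsn_is_nonrenegable(uint32_t tsn, uint32_t cum_ack,
                                   const struct nr_gap_ack_block *blk,
                                   int num_blocks /* the M field */)
    {
        for (int i = 0; i < num_blocks; i++)
            if (tsn >= cum_ack + blk[i].start && tsn <= cum_ack + blk[i].end)
                return 1;
        return 0;
    }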
Note that each sequence of TSNs in an NR Gap Ack Block will be a
subsequence of one of the Gap Ack Blocks, and there can be more than one NR Gap
Ack Block per Gap Ack Block. Also, non-renegable information cannot be revoked. If
a TSN is nr-gap-acked in any NR-SACK chunk, then all subsequent NR-SACKs gap-
acking that TSN should also nr-gap-ack that TSN. Complete details of the NR-SACK
chunk can be found in [Natarajan 2008a].
The second least significant bit in the Chunk Flags field is the (A)ll bit. If
the ‘A’ bit is set to '1', all out-of-order data blocks acknowledged in the NR-SACK
chunk are non-renegable. The ‘A’ bit enables optimized sender/receiver processing and
reduces the size of NR-SACK chunks when all out-of-order TPDUs at the receiver are
non-renegable.
3.3.2 Unordered Data Transfer using NR-SACKs
NR-SACKs provide an SCTP receiver with the option to convey non-
renegable information on out-of-order data. When a receiver guarantees not to renege
an out-of-order data chunk and nr-gap-acks the chunk, the sender no longer needs to
keep that particular data chunk in its RtxQ, thus allowing the sender to free up kernel
memory sooner than if the data chunk were only gap-acked.
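A sender-side sketch of this rule might look as follows (hypothetical RtxQ interface, reusing the NR Gap Ack Block layout sketched earlier; real implementations will differ):

    #include <stdint.h>

    struct rtxq;                                       /* opaque retransmission queue */
    void rtxq_free_tsn(struct rtxq *q, uint32_t tsn);  /* frees one buffered TSN      */
    struct nr_gap_ack_block { uint16_t start, end; };

    /* On receipt of an NR-SACK, free every TSN covered by an NR Gap Ack
       Block. TSNs that are only gap-acked stay buffered, since the
       receiver may still renege on them. */
    static void process_nr_gap_acks(struct rtxq *q, uint32_t cum_ack,
                                    const struct nr_gap_ack_block *blk,
                                    int num_blocks)
    {
        for (int i = 0; i < num_blocks; i++)
            for (uint32_t tsn = cum_ack + blk[i].start;
                 tsn <= cum_ack + blk[i].end; tsn++)
                rtxq_free_tsn(q, tsn);   /* no longer the sender's responsibility */
    }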
Figure 3.4 is analogous to Figure 3.2’s example, this time using NR-
SACKs. The sender and receiver are assumed to have negotiated the use of NR-
SACKs during association establishment. As in the example of Figure 3.2, TSNs 11-18
are initially transmitted, and TSN 11 is presumed lost. For each TSN arriving out-of-
order, the SCTP receiver transmits an NR-SACK chunk instead of a SACK chunk. Since
all out-of-order data are non-renegable in this example, every NR-SACK chunk has the
‘A’ bit set, and the nr-gap-acks report the list of TSNs that are both out-of-order and
non-renegable.
On receipt of the second and third dupacks that newly (nr-)gap-ack TSNs
13 and 14, the sender removes these TSNs from the RtxQ. On receiving the second
dupack, the sender transmits new data – TSN 20. On receipt of the third dupack, the
sender halves the cwnd (C=4), fast retransmits TSN 11, and enters fast recovery.
Dupacks received during fast recovery (nr-)gap-ack TSNs 15-20. The sender frees
RtxQ accordingly, and transmits new TSNs 21, 22 and 23. The sender exits fast
recovery when the NR-SACK with new cum-ack (N:20) arrives. This new cum-ack
increments C=5, and decrements O=3. The sender now transmits new TSNs 24 and 25.
The explicit non-renegable information in NR-SACKs ensures that the
RtxQ contains only necessary data − TPDUs that are actually in flight or “received and
renegable”. Comparing Figures 3.2 and 3.4, we observe that NR-SACKs use the RtxQ
more efficiently.
This section discusses the simulation preliminaries in detail.
The SCTP evaluations use the dumb-bell topology shown in Figure 3.5,
which models the access link scenario specified in [Andrew 2008]. The central
bottleneck link connects routers R1 (left) and R2 (right), has a 100Mbps capacity, and
2ms one-way propagation delay. Both routers employ drop tail queuing and the queue
size is set to the bandwidth-delay product of a 100ms flow. Each router is connected to
three cross-traffic generating edge nodes via 100Mbps edge links with the following
propagation delays: 0ms, 12ms, 25ms (left) and 2ms, 37ms, 75ms (right). Each left
edge node generates cross-traffic destined to every right edge node and vice-versa.
Thus, without considering queuing delays, the RTTs for cross-traffic flows sharing the
bottleneck link range from 8ms to 204ms.
[Andrew 2008] recommends application level cross-traffic generation over
packet level generation, since, in the latter scenario, cross-traffic flows do not respond
to the user/application/transport behavior of competing flows. Also, [Andrew 2008]
proposes the use of Tmix [Weigle 2006] traffic generator. However, the recommended
Tmix connection vectors were unavailable at the time of performing our evaluations.
Therefore, we decided to employ existing ns-2 application level traffic generation tools,
recommended by [Wang 2007a, Wang 2007b]. Since our simulation setup uses
application level cross-traffic, we believe that the general conclusions from our
evaluations will hold for evaluations using the Tmix traffic generator.
The cross-traffic comprises HTTP transfers, video streaming, and responsive bulk file
transfer sessions over TCP. We are unaware of existing
measurement studies on the proportion of each kind of traffic observed in the Internet.
Therefore, the simulations assume a simple, yet reasonable rule for the traffic mix
proportion − more HTTP traffic than video or FTP traffic.
Each edge node runs a PackMime session to every edge node on the other
side, and the amount of generated HTTP traffic is controlled via the PackMime rate
parameter. Similarly, each edge node establishes video and FTP sessions to every edge
node on the other side, and the number of video/FTP sources on each node impacts the
amount of video/FTP traffic. To avoid synchronization issues, the PackMime, video,
and FTP sessions start at randomly chosen times during the initial 5 seconds of the
simulation. The default segment size for all TCP traffic results in 1500 byte IP PDUs;
the segment size for 10% of the FTP flows is modified to result in 576 byte IP PDUs.
Also, the PackMime request and response size distributions are seeded in every
simulation run, resulting in a range of packet sizes at the bottleneck [Andrew 2008].
The bottleneck router load is measured as (L) = (mean queue length ÷ total
queue size). Four packet-level load/congestion variations are considered: (i) Low
(~15% load, < 0.1% loss), (ii) Mild (~45% load, 1-2% loss), (iii) Medium (~60% load,
3-4% loss), (iv) Heavy (~85% load, 8-9% loss).
Topology 1 (Figure 3.5) is used to evaluate SCTP flows. CMT evaluations
are over the dual-dumbbell topology shown in Figure 3.6 (topology 2). Topology 2
consists of two independent bottleneck links between routers R1-R2 and R3-R4. Similar
to topology 1, each router in topology 2 is attached to 3 cross-traffic generating edge
nodes, with similar bottleneck and edge link bandwidth/delay characteristics. In both
topologies, nodes S and R are the SCTP or CMT sender and receiver, respectively. In
topology 2, both S and R are multihomed, and the CMT sender uses the two
independent paths (paths 1 and 2) for simultaneous data transfer. In both topologies, S
and R are connected to the bottleneck routers via 100Mbps duplex edge links, with
14ms one-way delay. Thus, the one-way propagation delay experienced by the SCTP
or the CMT flow corresponds to 30ms, approximating the US coast-to-coast
propagation delay [Shakkottai 2004].
In topology 2, the bottlenecks experience asymmetric path loads; path 1 cross-traffic load
varies from low to heavy, while path 2 experiences low load.
The SCTP or CMT flow initiates an unordered data transfer ~18-20
seconds after the simulation begins, such that all data received out-of-order at R is
deliverable, and thus non-renegable. Trace collection begins after a 20 second warm-
up period from the start of SCTP or CMT traffic, and ends when the simulation
completes after 70 seconds. The CMT sender uses the recommended RTX-
SSTHRESH retransmission policy, i.e., retransmissions are sent on the path with the
highest ssthresh [Iyengar 2006].
The sender frees RtxQ memory as acks arrive. The RtxQ size varies during the course
of a file transfer, but can never
exceed the send buffer size. For time duration ti in the transfer, let,
ri = size of retransmission queue, and
ki = amount of necessary data in the RtxQ.
During ti, only ki ÷ ri of the RtxQ is efficiently utilized, and the efficiency
changes whenever ki or ri changes.
Let k0/r0, k1/r1, …, kn/rn be the efficient RtxQ utilization values during time
durations t0, t1, …, tn (∑ti = T), respectively. The time weighted efficient RtxQ
utilization averaged over T is calculated as RtxQ_Util = (∑ ti × ki/ri) ÷ T. To measure
RtxQ utilization, the ns-2 SCTP (or CMT) sender tracks ki, ri, and ti until association
shutdown. Let,
W = time when trace collection begins after the initial warm-up time, and
E = simulation end time.
In the following discussions, the time weighted efficient RtxQ utilization
averaged over the entire trace collection time, i.e., T = (E – W), is referred to as
RtxQ_Util.
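For concreteness, the metric can be computed as in the following C sketch (our rendering of the bookkeeping; treating durations with an empty RtxQ as fully utilized is an assumption):

    /* Time-weighted efficient RtxQ utilization:
       RtxQ_Util = (∑ t_i × (k_i / r_i)) ÷ T, where T = ∑ t_i. */
    static double rtxq_util(const double *k, const double *r,
                            const double *t, int n)
    {
        double weighted = 0.0, T = 0.0;
        for (int i = 0; i < n; i++) {
            weighted += (r[i] > 0.0) ? t[i] * (k[i] / r[i]) : t[i];
            T += t[i];
        }
        return (T > 0.0) ? weighted / T : 1.0;
    }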
In an unordered transfer using NR-SACKs, all out-of-order data will be nr-
gap-acked and the RtxQ should contain only necessary data. Therefore, we expect an
SCTP or CMT flow using NR-SACKs to most efficiently utilize the RtxQ (RtxQ_Util
= 1) under all circumstances.
Computing utilization during loss recovery requires the sender to track every entry/exit
to/from loss recovery. Since none of the routers reorder packets in our
simulations, the SCTP sender uses the following naive rule − the sender enters loss
recovery on the receipt of SACKs (or NR-SACKs) with at least one gap-ack block,
and exits loss recovery on the receipt of SACKs (or NR-SACKs) with a new cum-ack
and zero gap-acks. We found that this simple rule resulted in a good approximation of
the actual loss recovery periods.
Let k0/r0, k1/r1, …, km/rm be the efficient RtxQ utilization values during the loss
recovery periods l0, l1, …, lm (∑li = L), respectively. The time weighted efficient RtxQ
utilization averaged over only the loss recovery durations of trace collection (L) is
referred to as RtxQ_Util_L, and is calculated as RtxQ_Util_L = (∑ li × ki/ri) ÷ L.
3.5 Results
For each type of sender (SCTP or CMT), different send buffer sizes
imposing varying levels of memory constraints are considered: 32K, 64K and INF
(unconstrained space) for SCTP, and 128K, 256K and INF for CMT. The results
presented here are averaged over 30 runs, and plotted with 95% confidence intervals.
In the following discussions, an SCTP flow using SACKs or NR-SACKs is referred to
as SCTP-SACKs and SCTP-NR-SACKs, respectively. Similarly, a CMT flow using
SACKs or NR-SACKs is referred to as CMT-SACKs and CMT-NR-SACKs.
The RtxQ_Util_L values indicate that irrespective of path loss rate, SCTP-
SACKs efficiently utilize only ~50% of RtxQ during loss recovery; ~50% of RtxQ is
wasted buffering unnecessary data. At lower congestion levels (lower cross-traffic), the
frequency of loss events and the fraction of transfer time spent in loss recovery are
smaller, resulting in negligible RtxQ wastage during the entire trace collection period
(RtxQ_Util). As loss recoveries become more frequent, SCTP-SACKs’ inefficient
RtxQ utilization during loss recovery lowers the corresponding RtxQ_Util values. The
simulation results show that, on average, SCTP-SACKs waste ~20% of the RtxQ
during moderate congestion and ~30% during heavy congestion conditions. The
amount of wasted kernel memory increases as the number of transport connections
increase, and can be significant at a server handling large numbers of concurrent
connections, such as a web server.
The simulation results confirm this hypothesis. Under all traffic loads, RtxQ_Util values for
both SCTP-NR-SACKs and CMT-NR-SACKs (Figure 3.9) are unity.
In CMT evaluations, path 2 experiences low traffic load, while path 1’s
traffic load varies from low to heavy (Figure 3.6). Recall that a CMT sender transmits
data concurrently on both paths. Asymmetric path congestion levels aggravate data
reordering in CMT. As path 1's congestion level increases, TPDU losses on the more
congested path 1 cause data transmitted on the less congested path 2 to arrive out-
of-order at the receiver. CMT congestion control is designed such that losses on path 1
do not affect the cwnd/flightsize on path 2 [Iyengar 2006]. While losses on path 1 are
being recovered, the sender continues data transmission on path 2, increasing the amount
of non-renegable out-of-order data in the RtxQ. As the paths become increasingly
asymmetric in their congestion levels, the amount of non-renegable out-of-order data in
the RtxQ increases, and brings down CMT-SACKs’ RtxQ_Util (Figure 3.9).
Increasing the send buffer/RtxQ space improves SCTP-SACKs’ or CMT-
SACKs’ kernel memory (RtxQ) utilization only to a certain degree. In Figures 3.8 and
3.9, RtxQ_Util for the INF send buffer is essentially the upper bound on how efficiently
SCTP or CMT employing SACKs can utilize the RtxQ. Therefore, we conclude that
TPDU reordering results in inevitable RtxQ wastage in transfers using SACKs. The
amount of wasted memory increases as TPDU reordering and loss recovery durations
increase. Also, smaller send buffer sizes further degrade RtxQ_Util_L and RtxQ_Util
values in transfers using SACKs. This degradation is more pronounced in CMT (Figure
3.9). Further investigations reveal this effect to be due to send buffer blocking,
discussed next.
3.5.2 Send Buffer Blocking in CMT
When the RtxQ grows to fill the entire send buffer, send buffer blocking
ensues, preventing the application from writing new data into the transport layer
(Section 3.2.2). In both SCTP and CMT, send buffer blocking increases as the send
buffer is more constrained (decreases). In addition, CMT employs multiple paths for
data transfer, increasing a sender’s total flightsize in comparison to SCTP. Therefore,
we hypothesized that CMT would suffer more send buffer blocking than SCTP
(Section 3.2.3). Indeed, in the simulations, CMT suffered significant send buffer
blocking even for 128K and 256K send buffer sizes. In this section, we focus on the
effects of send buffer blocking in CMT.
CMT using either acknowledgment scheme suffers from send buffer
blocking for 128K and 256K buffer sizes. In CMT-SACKs, send buffer blocking
continues until the cum-ack point moves forward, i.e., until loss recovery ends. As path 1
congestion level increases, timeout recoveries become more frequent, causing longer
loss recovery durations. Therefore, as congestion increases, the CMT-SACKs sender is
blocked for longer periods of transfer time. On the other hand, send buffer blocking in
CMT-NR-SACKs is unaffected by the congestion level on path 1. As and when NR-
SACKs arrive (on path 2), the CMT-NR-SACK sender removes nr-gap-acked data
from the RtxQ, allowing more data transmission. CMT-SACKs’ longer send buffer
blocking durations adversely impact performance as discussed below.
Figure 3.10: RtxQ Evolution in CMT-SACKs
Figures 3.10 and 3.11 illustrate a CMT sender's RtxQ evolution over 40
seconds of a transfer using SACKs and NR-SACKs, respectively. The figures show
that both CMT-SACKs and CMT-NR-SACKs suffer from send buffer blocking − the
maximum RtxQ size in the figures corresponds to 100% of send buffer (128K).
However, the RtxQ evolution in CMT-SACKs (Figure 3.10) exhibits more variance – it
reaches the maximum and drops to 0 multiple times, while CMT-NR-SACKs’ RtxQ
size stays close to 128K most of the time (Figure 3.11).
only 4 MTU sized TPDUs − TSNs 20460-20463. Once the sender transmits data on
both paths, RtxQ size increases to ~8.6K, shown by point C. Subsequent SACKs allow
more data transmission and at point D the sender’s RtxQ reaches the maximum causing
the next instance of send buffer blocking.
Though CMT-NR-SACKs also incurs send buffer blocking (Figure 3.11),
nr-gap-acks free up RtxQ space allowing the sender to steadily clock out more data. A
constrained send buffer is better utilized, and the transmission is less bursty with NR-
SACKs than SACKs. The improved send buffer use contributes to throughput
improvements (discussed later).
CMT-SACKs’ inefficient send buffer usage increases the number of timeout
recoveries, and degrades throughput when compared to CMT-NR-SACKs.
3.5.2.3 Throughput
When the send buffer never limits RtxQ growth (INF send buffer size),
neither CMT-SACKs nor CMT-NR-SACKs experiences send buffer blocking, and the two
perform similarly (Figure 3.14). However, CMT-SACKs achieves the same throughput
as CMT-NR-SACKs at the cost of larger RtxQ sizes.
Using terminology defined in Section 3.4.2, the average RtxQ size RtxQ
over the entire trace collection period (T) is calculated as RtxQ = (∑ ti × ri) ÷ T.
Figure 3.15 plots CMT-SACKs vs. CMT-NR-SACKs RtxQ for the INF case. As path 1
cross-traffic load increases, the bandwidth available for the CMT flow decreases, and
CMT-NR-SACKs’ RtxQ decreases (Figure 3.15).
Figure 3.14: CMT-SACKs vs. CMT-NR-SACKs Throughput
Similarly, CMT-SACKs’ RtxQ decreases as traffic load increases from low to mild.
However, a different factor dominates and increases CMT-SACKs’ RtxQ during
medium and heavy traffic conditions. Note that RtxQ growth is never constrained in the
INF case, enabling the CMT sender to transmit as much data as possible on path 2
while recovering from losses on path 1. At medium and heavy cross-traffic loads, loss
recovery durations increase due to increased timeout recoveries, and the CMT-SACKs
sender transmits more data on path 2 compared to mild traffic conditions. This factor
increases CMT-SACKs’ RtxQ during medium and heavy traffic conditions.
Going back to Figure 3.14, when the send buffer size limits RtxQ growth,
CMT-NR-SACKs’ efficient RtxQ utilization enables CMT-NR-SACKs to perform
better than CMT-SACKs. The throughput improvements in CMT-NR-SACKs grow
as conditions that aggravate send buffer blocking intensify. That is, NR-SACKs improve
throughput more as the send buffer becomes more constrained and/or the paths
become more asymmetric in their congestion levels. Alternately, CMT-NR-SACKs
achieve throughput similar to CMT-SACKs using smaller send buffer sizes. For
example, during mild, medium and heavy path 1 cross-traffic load, CMT-NR-SACKs
with a 128K send buffer performs similarly to or better than CMT-SACKs with a 256K
send buffer. Also, CMT-NR-SACKs with a 256K send buffer performs similarly to CMT-
SACKs with a larger (unconstrained) send buffer.
Note that a transfer employing NR-SACKs never performs worse than a
transfer using SACKs. When out-of-order data is non-renegable, NR-SACKs perform
better than SACKs. Simulations confirmed that in both SCTP and CMT, NR-SACKs
utilize send buffer and RtxQ space most efficiently. Send buffer blocking in CMT with
SACKs adversely impacts end-to-end performance, while efficient send buffer use in
CMT with NR-SACKs alleviates send buffer blocking. Therefore, NR-SACKs not only
reduce sender’s memory requirements, but also improve throughput in CMT. The only
negative with NR-SACKs is the added complexity of implementation, and the extra
overhead to generate and process NR-SACKs. We argue these negatives are negligible.
We are currently adding NR-SACK support to the FreeBSD and ns-2 SCTP
implementations, and defining a test suite to debug the NR-SACKs implementation.
future, we plan to draw on the FreeBSD implementation to compare SACKs vs. NR-
SACKs performance for both SCTP and CMT.
Chapter 4
4.1 Motivation
As discussed in Chapter 1, SCTP natively supports transport layer
multihoming for fault-tolerance purposes. Concurrent Multipath Transfer (CMT)
[Iyengar 2006] is an experimental SCTP extension that assumes multiple independent
paths between multihomed end points, and exploits the independent paths for
simultaneous transfer of new data (see Chapter 1).
Path failures arise when a router or a link connecting two routers fails due
to planned maintenance activities or unplanned accidental reasons such as hardware
malfunction or software error. Ideally, the routing system detects unplanned link
failures, and reconfigures the routing tables to avoid routing traffic via the failed link.
Using data from an ISP’s routing logs, [Markopoulou 2004] observes that link failures
are part of everyday operation. Around 80% of the failures are unplanned, and the
time-to-repair for any particular failure can be on the order of hours. Existing research
also highlights problems with Internet backbone routing that result in long route
convergence times. [Labovitz 2000] shows that the Internet's interdomain routers may take
as long as tens of minutes to reconstruct new paths after a failure. During these delayed
convergences, end-to-end Internet paths experience intermittent loss of connectivity in
addition to increased packet loss, latency, and reordering.
Using probes, [Paxson 1997] and [Zhang 2000] find that “significant
routing pathologies” prevent selected pairs of hosts from communicating about 1.5% to
3.3% of the time. Importantly, the authors also find that this trend has not improved
with time. Reference [Labovitz 1999] examines routing table logs of Internet
backbones to find that 10% of all considered routes were available less than 95% of the
time, and more than 65% of all routes were available less than 99.99% of the time. The
durations of these path outages were heavy-tailed, and about 40% of path outages took
more than 30 minutes to repair. In [Chandra 2001], the authors use probes to confirm
that failure durations are heavy-tailed, and report that 5% of detected failures last more
than 2.75 hours, and as long as 27.75 hours. The pervasiveness of path failures in
practice motivates us to study their impact on CMT.
4.2.1 Failure Detection in CMT
Since CMT is an extension to SCTP, CMT retains SCTP’s failure
detection process. A CMT sender uses a tunable failure detection threshold called
Path.Max.Retrans (PMR) [RFC4960]. As shown in the finite state machine of Figure
4.1, a destination is in one of two states – active or failed (inactive). A destination
is active as long as acks come back for data or heartbeats (probes) sent to that
destination. When a sender experiences more than PMR consecutive timeouts while
trying to reach a specific active destination, that destination is marked as failed. Only
heartbeats (i.e., no data) are sent to a failed destination. A failed destination returns to
the active state when the sender receives a heartbeat ack. RFC4960 proposes a default
PMR value of 5, which translates to at least 63 seconds (6 consecutive timeouts) for
failure detection.
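The 63-second figure follows from SCTP's minimum RTO of 1 second and the exponential backoff applied after every expiration; assuming each of the 6 consecutive timeouts fires at the earliest possible time:

    1 + 2 + 4 + 8 + 16 + 32 = 63 seconds

Each term is the (doubled) RTO preceding the corresponding timeout, so detection takes at least 63 seconds, and longer if the RTO was above its minimum when the failure began.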
During failure detection, the receive buffer (rbuf) fills with out-of-order data. Even
though the cwnd would allow new data to be transmitted, rbuf blocking (i.e., flow
control) stalls the sender, causing throughput degradation.
The rbuf blocking problem cannot be eliminated in CMT [Iyengar 2005]. To
reduce rbuf blocking’s negative impact during congestion, [Iyengar 2005] proposes
different retransmission policies that use heuristics for faster loss recovery. These
policies consider different path properties such as loss rate and delay, and try to reduce
rbuf blocking by sending retransmissions on a path with lower loss or delay. In practice,
the loss rate of a path can only be estimated, so [Iyengar 2005] proposed the
RTX_SSTHRESH policy, where retransmissions are sent on the path with the largest
slow-start threshold. Since RTX_SSTHRESH outperformed other retransmission
policies during congestion, [Iyengar 2005] recommended the RTX_SSTHRESH policy
for CMT. However, [Iyengar 2005] did not consider CMT performance during path
failures. As we shall show, CMT with the RTX_SSTHRESH policy suffers from
significant rbuf blocking during path failures.
For simplicity, the example in Figure 4.2 assumes (a) a one-to-one correspondence
between an SCTP PDU and a TSN, and (b) each SCTP PDU is MTU-sized.
In Figure 4.2, a SACK labeled <Sa, b-c; Rd> acknowledges all TSNs up to
and including the cumulative TSN value of a, in-order arrival of TSNs b through c
(missing report for TSNs a+1 through b-1), and an advertised receiver window1
capable of buffering d more TSNs. On receiving a SACK, sender A subtracts the
number of outstanding TSNs from the advertised receiver window, and calculates the
amount of new data that can be sent without overflowing the receive buffer. The
transport layer receive buffer for this example can hold a maximum of 5 TSNs, and its
contents are listed after the reception of every TSN.
In the example, both forward and reverse paths between A1 and B1 fail just
after TSN 2 enters the network. Hence, TSN 2 and the SACK for TSN 1 are presumed
lost. TSNs 3 and 4 arrive out of order; each triggers a SACK, and both are stored in the
receive buffer. The CMT sender uses the Cwnd Update for CMT (CUC) algorithm
[Iyengar 2006] to decouple a path’s cwnd evolution and data ordering. On receiving
the SACK triggered by TSN 3, the sender uses CUC to increment C2 to 3, and
decrement O1 and O2 to 1. The available receive buffer space for new data is calculated
as advertised receive window (4) – total outstanding TSNs in the association (2). This
available receive buffer space allows the sender to transmit two TSNs, 5 and 6, on path
2. (Note that the advertised receiver window (a_rwnd) has different connotations in TCP
and SCTP. TCP’s a_rwnd denotes the available memory in rbuf, starting from the left
edge of the received sequence space [RFC793]. SCTP’s a_rwnd denotes the available
memory after considering all TPDUs not yet delivered to the application layer,
including the out-of-order TPDUs [RFC4960].) On path 1, even though 1 MTU worth of
new data can be transmitted (C1 > O1), rbuf blocking, i.e., flow control, stalls data
transmission.
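The flow control arithmetic used throughout this example can be summarized by the following C sketch (a hypothetical helper; quantities are counted in TSNs, as in Figure 4.2):

    #include <stdint.h>

    /* New-data allowance under SCTP flow control: the advertised
       receiver window minus the TSNs already outstanding, floored at 0. */
    static uint32_t new_data_allowance(uint32_t a_rwnd, uint32_t outstanding)
    {
        return (a_rwnd > outstanding) ? a_rwnd - outstanding : 0;
    }

With the values after TSN 3's SACK (a_rwnd = 4, outstanding = 2), the allowance is 2 TSNs − hence TSNs 5 and 6. Once the allowance reaches 0, rbuf blocking stalls the sender on every path, regardless of the cwnds.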
The limited receive buffer space (the advertised receive window minus the number of
outstanding TSNs) continues to prevent transmission of new data on path 2.
Since O2 < C2, the SACKs triggered by TSNs 5 and 6 do not increment C2 [RFC4960]
(discussed later). But these SACKs decrement O2. Even though O2 < C2, rbuf blocking
stalls data transmission on path 2.
Path 1’s retransmission timer expires and the sender detects the loss of
TSN 2. Note that this timeout is the first of the 6 (PMR = 5) consecutive timeouts
needed to detect path 1 failure. After this timeout, C1 is set to 1, O1 is set to 0, and path
1’s RTO value is doubled [RFC4960]. The CMT sender employs the
RTX_SSTHRESH policy and retransmits TSN 2 on path 2. Data cannot be transmitted
on path 1 due to rbuf blocking.
On receiving TSN 2, the receiver delivers data from TSNs 2-6 to the
application. The corresponding SACK advertises a receive window of 5 TSNs, and
concludes the current rbuf blocking instance. The sender now transmits TSN 7 on path
1, and TSNs 8-11 on path 2. Due to path 1 failure, TSN 7 is lost, and TSNs 8-11 are
received out-of-order and stored in the receiver’s buffer. The SACK triggered by TSN
8 increments C2 to 5 and decrements O2 to 3. The available receive buffer space for
new data=0, triggering another instance of rbuf blocking, which stalls data transmission
until TSN 7 is successfully retransmitted. Note that the loss of TSN 7 can be recovered
only after a timeout on path 1, and due to the exponential backoff algorithm, path 1’s
current RTO value is twice the previous value.
To generalize, sender A transmits new data on path 1 until (PMR + 1)
consecutive timeouts mark path 1 as failed. During failure detection, data
transmitted on non-failed path(s) arrive out-of-order, resulting in consecutive rbuf
blocking instances. Each rbuf blocking instance concludes when the sender retransmits
lost TPDUs after an RTO. The length of an rbuf blocking instance is therefore
proportional to the failed path’s RTO. Also, each rbuf blocking instance is
exponentially longer than the previous instance due to the exponential backoff of RTO
values.
Rbuf blocking results in the following side-effects that further degrade
CMT’s throughput:
Preventing congestion window growth: Note that rbuf blocking prevents
the sender from fully utilizing the cwnd. When the amount of outstanding data is less
than the cwnd, RFC4960 prevents the sender from increasing the cwnd for future
SACKs. For example, in Figure 4.2, when the SACKs for TSNs 5, 6, and 9-11 arrive,
the sender cannot increment C2.
Reducing congestion window: To reduce burstiness in data transmission,
an SCTP sender employs a congestion window validation algorithm similar to
[RFC2861]. During every transmission, the sender uses the MaxBurst parameter
(recommended value of 4) as follows:
If ((outstanding + MaxBurst * MTU) < Cwnd)
Cwnd = outstanding + MaxBurst * MTU
This algorithm reduces the cwnd during idle periods so that at the next
sending opportunity, the sender cannot transmit more than (MaxBurst * MTU) bytes of
data. During rbuf blocking, the amount of outstanding data can become smaller than
the cwnd. In such cases, the above rule is triggered and further reduces the cwnd. In
Figure 4.2, when the SACK triggered by TSN 11 arrives at the sender, O2 decrements
to 0. The window validation algorithm causes C2 to be reduced to 4 (O2 (0) +
MaxBurst (4)).
4.3 CMT with Potentially Failed Destination State
[Caro 2005] recommends lowering the value of PMR for SCTP flows in
Internet-like environments. Correspondingly, lowering the PMR for CMT flows
reduces the number of rbuf blocking episodes during failure detection. However,
lowering the PMR is an incomplete solution to the problem since a CMT flow is rbuf
blocked for any PMR > 0 (discussed later). Also, a tradeoff exists on deciding the value
of PMR – a lower value reduces rbuf blocking but increases the chances of spurious
failure detection, whereas a higher PMR increases rbuf blocking and reduces spurious
failure detection in a wide range of environments.
To reduce rbuf blocking during failure detection, we propose CMT-PF, a CMT
extension that adds a new "potentially failed" (PF) destination state. CMT-PF operates
as follows:
• If a TPDU loss is detected by a timeout, the corresponding destination
transitions to the PF state (Figure 4.3). The sender does not transmit data
to a PF destination. However, when all destinations are in the PF state, the
sender transmits data to the destination with the least number of
consecutive timeouts. In case of tie, data is sent to the last active
destination. This exception ensures that CMT-PF does not perform worse
than CMT when all paths have potentially failed (discussed further in
Section 4.6).
• Once a heartbeat ack indicates a PF destination is alive, the destination’s
cwnd is set to either 1 MTU (CMT-PF1), or 2 MTUs (CMT-PF2), and the
sender follows the slow start algorithm to transmit data to this destination.
Detailed analysis on the cwnd evolution of CMT-PF1 vs. CMT-PF2 can be
found in Section 4.6.
• Acks for retransmissions do not transition a PF destination to the active
state, since a sender cannot determine whether the ack was for the original
transmission or the retransmission(s). (A sketch of these state transitions
follows this list.)
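The per-destination transitions above can be sketched as follows (a hypothetical data structure and helper names; the actual FreeBSD and ns-2 implementations differ in detail):

    enum dest_state { ACTIVE, PF, FAILED };

    struct destination {
        enum dest_state state;
        int consec_timeouts;    /* consecutive timeouts toward this destination */
    };

    /* On a timeout toward a destination: CMT-PF marks it PF; after more
       than PMR consecutive timeouts, SCTP's existing rule marks it failed. */
    static void on_timeout(struct destination *d, int pmr)
    {
        d->consec_timeouts++;
        d->state = (d->consec_timeouts > pmr) ? FAILED : PF;
    }

    /* Only a heartbeat ack returns a PF destination to the active state;
       acks for retransmissions do not. */
    static void on_heartbeat_ack(struct destination *d)
    {
        d->consec_timeouts = 0;
        d->state = ACTIVE;  /* cwnd restarts at 1 MTU (CMT-PF1) or 2 MTUs (CMT-PF2) */
    }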
Figure 4.4: CMT-PF Reduces Rbuf Blocking during Failure
In the simulation topology (Figure 4.5), the multihomed sender, A, has two
independent paths to the multihomed receiver, B. The edge links between A (or B) to
the routers represent last-hop link characteristics. The end-to-end one-way delay is
45ms on both paths, representing typical coast-to-coast delays experienced by a
significant fraction of the flows in the Internet [Shakkottai 2004]. We note that the final
conclusions regarding CMT vs. CMT-PF are independent of the actual bandwidth and
delay configurations used in the topology, as long as these configurations are similar on
both paths.
Figure 4.5: Topology for Failure Experiments
The sender A transfers an 8MB file to receiver B using both path 1 and
path 2. Path 2 fails during the file transfer; this failure is simulated by bringing down the
bidirectional link between routers R20 and R21. Unless stated otherwise, PMR=5,
rbuf=64KB, and both paths experience Bernoulli losses with a low loss rate (1%). We
acknowledge that the Bernoulli loss model is less realistic than the nature of losses
observed in the Internet. Since the evaluations in this section assume failure scenarios
and rare loss events (1% or no loss), we expect the final conclusions between CMT vs.
CMT-PF to remain similar even with a more realistic loss model.
4.4.1.1 Evaluations during Single Permanent Failure (without Congestion)
This experiment highlights the essential differences between CMT and
CMT-PF during a permanent path failure. To eliminate the influence of congestion-
induced rbuf blocking, the simulation is set up such that the sender does not experience
any congestion losses on either path.
Reduced rbuf blocking helps CMT-PF complete the file transfer (~15 seconds) using
path 1 alone,
even before path 2 failure is detected.
In summary, during a single permanent failure, CMT-PF performs as well as CMT for
PMR=0, and better than
CMT for PMR > 0.
When the failed path recovers at 10 seconds, CMT’s data and CMT-PF’s heartbeat
transmissions on the path (after
the 3rd timeout − ~12.5 seconds) are successful, and both CMT and CMT-PF complete
the file transfer without further rbuf blocking.
CMT-PF's ability to alleviate rbuf blocking is more valuable at smaller rbuf sizes, and
CMT-PF performs
increasingly better than CMT as rbuf size decreases.
Sender is limited by rbuf: Both CMT and CMT-PF senders cannot transmit
new data until the rbuf blocking is cleared, i.e., until after successful retransmission(s)
of lost data. The only difference is that CMT considers p for retransmissions, whereas
CMT-PF transmits a heartbeat on p, and tries to retransmit lost data on other active
paths. (If all destinations are in the PF state, the CMT-PF sender transitions the
destination with the least number of consecutive timeouts to the active state (Section
4.3), and retransmits lost data to this new active destination.)
Sender is not limited by rbuf: Assume that SCTP PDUs (data or
heartbeats) transmitted after the first timeout on path p successfully reach the receiver.
In CMT, the cwnd allows 1 MTU worth of new data transmission on p (Figure 4.10),
and the corresponding SACK increments path p’s cwnd by 1 MTU. At the end of 1
RTT after the timeout (shown by point A in Figure 4.10), (i) the cwnd on p=2 MTUs,
and (ii) 1 MTU worth of new data has been successfully sent on p.
CMT-PF transmits a heartbeat on p and new data on other active path(s).
(Note: if all destinations are marked PF, the CMT-PF sender transitions a PF
destination to the active state.) Path p is marked active when the heartbeat ack reaches
the sender. Therefore, after 1 RTT from the timeout (shown by point B in Figure 4.11),
(i) cwnd on p =1 MTU (CMT-PF1), and (ii) no new data has been sent on p.
Comparing points A and B in Figures 4.10 and 4.11, respectively, it can be seen that
CMT has a 1 RTT “lead” in path p’s cwnd growth. Assuming no further losses on p,
after n RTTs, the cwnd on p will be 2^n MTUs with CMT, and 2^(n-1) MTUs with CMT-PF1.
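Since slow start doubles the cwnd every RTT, the one-RTT lead persists as a constant factor (under the stated assumption of no further losses on p):

    cwnd_CMT(n) = 2^n MTUs        cwnd_CMT-PF1(n) = 2^(n-1) MTUs

CMT-PF1's cwnd thus trails CMT's by exactly a factor of two. Setting the cwnd to 2 MTUs after the heartbeat ack removes this factor, motivating the CMT-PF2 variant shown in Figure 4.12.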
Figure 4.10: CMT Data Transfer during no Rbuf Blocking
Figure 4.12: CMT-PF2 Data Transfer during no Rbuf Blocking
The edge links' one-way propagation delays are drawn from a uniform
distribution between 5-20 ms, resulting in end-to-end one-way propagation delays
ranging ~35-65ms [Shakkottai 2004]. All links (both edge and core) have a buffer size
twice the link's bandwidth-delay product, which is a reasonable setting in practice.
For both CMT and CMT-PF flows, rbuf=64KB, PMR=5, and loss rates are
controlled by varying the cross-traffic load. The graphs in the subsequent discussions
plot the average goodput (file size ÷ transfer time) of CMT vs. CMT-PF with 5% error
margin.
Figure 4.14: CMT vs. CMT-PF during Symmetric Loss and RTT Conditions
As the cross-traffic load and loss rate increase, the number of timeouts on each path
increases. Under such conditions, the probability that both paths are simultaneously
marked “potentially-failed” increases in CMT-PF. To ensure that CMT-PF does not
perform worse when all destinations are marked PF, CMT-PF transitions the
destination with the smallest number of consecutive timeouts to the active state,
allowing data to be sent to that destination (refer to Section 4.3). This modification
guarantees that CMT-PF performs on par with CMT even when both paths experience
high loss rates (Figure 4.14).
How do asymmetric path RTTs affect CMT vs. CMT-PF performance? More importantly,
does CMT-PF perform worse when the paths have asymmetric RTTs?
Using the topology in Figure 4.5, we performed the following Bernoulli loss
model experiment to gain insight. The Bernoulli loss model simulations, while less
realistic, take much less time than cross-traffic ones, and initial investigations revealed
that both loss models resulted in similar trends between CMT and CMT-PF. Path 1’s
one-way propagation delay was fixed at 45ms while path 2’s one-way delay varied as
follows: 45ms, 90ms, 180ms, 360ms, and 450ms. Both paths experience identical loss
rates ranging from 1%-10%.
Figure 4.15: CMT vs. CMT-PF Goodput Ratios during Symmetric Loss and
Asymmetric RTT Conditions
Figure 4.15 plots the ratio of CMT’s goodput over CMT-PF’s (relative
performance difference) with 5% error margin. As expected, both CMT and CMT-PF
perform equally well during symmetric RTT conditions. As the asymmetry in paths’
RTTs increases, an interesting dynamic dominates and CMT-PF performs slightly
better than CMT (goodput ratios < 1).
Further investigation revealed the following about CMT vs. CMT-PF rbuf
blocking durations, shown in Figure 4.16. For each combination of path 2’s delay and
loss rate, Figure 4.16 plots the ratio of rbuf blocked durations (CMT over CMT-PF)
during timeout recoveries. As path 2 one-way delay and loss rate increases, the ratio
becomes increasingly greater than 1, signifying that a CMT sender suffers longer rbuf
blocking durations than CMT-PF.
Figure 4.16: Ratio of Rbuf Blocked Timeout Recovery Durations (CMT/CMT-PF) vs.
Path 2 One-way Delay (ms), for 6%, 7%, and 8% Loss Rates
Note that rbuf blocking depends on the frequency of loss events (loss rate),
and the duration of loss recovery. As loss rate increases, the probability that a sender
experiences consecutive timeout events on the path increases. After the first timeout,
CMT-PF transitions the path to PF, and avoids data transmission on the path (as long
as another active path exists) until a heartbeat-ack confirms the path as active. But, a
CMT sender suffers back-to-back timeouts on data sent on the path, with exponential
backoff of the timeout recovery period. As path 2’s RTT increases, path 2’s RTO
increases, and the back-to-back timeouts on data result in longer rbuf blocking
durations in CMT than in CMT-PF. Therefore, as path 2's RTT increases, CMT's
goodput degrades more than CMT-PF's, and the goodput ratio decreases (Figure
4.15).
In summary, during symmetric loss conditions, CMT and CMT-PF
perform equally well when paths experience symmetric RTT conditions. As the RTT
asymmetry increases, CMT-PF demonstrates a slight advantage at higher loss rates.
Table 4.1: CMT vs. CMT-PF Mean Consecutive Data Timeouts on Path 2
As discussed in the previous sub-section, as path 2’s cross-traffic load
increases, the probability that a sender experiences back-to-back timeouts on path 2
increases. CMT suffers a higher number of consecutive timeouts on data (Table 4.1)
resulting in more extended rbuf blocking periods when compared with CMT-PF.
Therefore, as path 2’s cross-traffic load increases, CMT-PF performs better than CMT
(Figure 4.18).
A similar transmission strategy is applied to new data: CMT-PF avoids new data
transmissions on the PF path. As shown in Table 4.2, when compared to CMT, CMT-
PF reduces the number of (re)transmissions on the higher loss rate path 2 and
(re)transmits more on the lower loss rate path 1. This transmission difference (ratio of
transmissions on path 1 over path 2) between CMT-PF and CMT increases as the paths
become more asymmetric in their loss conditions.
In summary, CMT-PF does not perform worse than CMT during
asymmetric path loss conditions. In fact, CMT-PF is a better transmission strategy
than CMT, and performs better as the asymmetry in path loss increases.
4.6.1 CMT-PF Implementation in FreeBSD
Joseph Szymanski extended the FreeBSD CMT implementation to include
CMT-PF. The following emulation experiments were performed using this FreeBSD
implementation.
4.6.1.1 Single Failure Scenario
To validate the behavioral differences between CMT and CMT-PF, we
emulated a single failure scenario, similar to the scenario described in Section 4.4.1.1.
Neither path experiences loss in this experiment. At time t=5, path 2 fails; this failure is
emulated by setting up appropriate Dummynet rules to block all packets traversing
path 2 to and from the client and server, respectively. Figure 4.20 plots the cumulative
bytes received at the client during this transfer.
CMT-PF transitions path 2 to PF after the first timeout (~6.5 seconds), and transmits
only heartbeats on path 2. Data transmission continues on path 1 and the file transfer
finishes in ~18 seconds.
loss conditions. Further investigation exposed a few potential bugs in the CMT-PF
implementation. We are currently exploring these issues.
Chapter 5
This dissertation investigated three issues related to the transport layer and
proposed solutions to address these issues. This chapter summarizes our contributions
for each issue, and concludes the dissertation.
In browsing conditions typical of developing regions, a single multistreamed SCTP
association not only eliminated HOL blocking, but also
boosted throughput compared to multiple TCP connections.
Our body of work in HTTP over SCTP has triggered significant interest in
the area. We are currently working with the IETF to standardize our HTTP over SCTP
streams design.
In failure scenarios, CMT-PF outperformed CMT over a range of failure
detection thresholds. During non-failure scenarios such as congestion, CMT-PF
performed on par or better but never worse than CMT. In light of these findings, we
recommend CMT be replaced by CMT-PF in existing and future CMT implementations
and RFCs.
REFERENCES
[Barford 1999] P. Barford, A. Bestavros, A. Bradley, M. Crovella, "Changes in Web
Client Access Patterns: Characteristics and Caching Implications," World Wide
Web Journal, 2(1-2), pp. 15-28, 1999.
[Bickhart 2005] R. Bickhart, “SCTP Shim for Legacy TCP Applications”, MS Thesis,
Department of Computer & Information Sciences, University of Delaware,
USA, August 2005.
[Cao 2004] J. Cao, W.S. Cleveland, Y. Gao, K. Jeffay, F.D. Smith, M.C. Weigle,
"Stochastic Models for Generating Synthetic HTTP Source Traffic," IEEE
INFOCOM, Hong Kong, China, March 2004.
[Chan 2002] M. Chan, R. Ramjee, "TCP/IP Performance over 3G Wireless Links with
Rate and Delay Variation," 8th International Conference on Mobile Computing
and Networking, Georgia, USA, September 2002.
[Chandra 2001] B. Chandra, M. Dahlin, L. Gao, A. Nayate, "End-to-End WAN
Service Availability," 3rd USENIX Symposium on Internet Technologies and
Systems, San Francisco, USA, March 2001.
[Ekiz 2007] N. Ekiz, P. Natarajan, J. Iyengar, A. Caro, “ns-2 SCTP Module,” Version
3.7, September 2007. pel.cis.udel.edu.
[Faber 1999] T. Faber, J. Touch, W. Yue, "The TIME-WAIT State in TCP and Its
Effect on Busy Servers," IEEE INFOCOM, New York, USA, March 1999.
[Gettys 1998] J. Gettys, H. Nielsen, "The WebMUX Protocol," IETF Internet Draft
(expired), August, 1998.
[Houtzager 2003] G. Houtzager, C. Williamson, "A Packet-Level Simulation Study of
Optimal Web Proxy Cache Placement," 11th IEEE International Symposium on
Modeling, Analysis, and Simulation of Computer and Telecommunications
Systems, Orlando, USA, October 2003.
[Koh 2004] S. Koh, M. Chang, M. Lee, "mSCTP for Soft Handover in Transport
Layer," IEEE Communications Letters, 8(3), pp. 189-191, March 2004.
[Koh 2005] S. Koh, Q. Xie, "Mobile SCTP (mSCTP) for IP Handover Support,"
draft-sjkoh-msctp, IETF Internet Draft (expired), October 2005.
[Mahdavi 1997] J. Mahdavi, S. Floyd, "TCP-Friendly Unicast Rate-Based Flow
Control," Technical note sent to the end2end-interest mailing list, January 1997.
[Movies] HTTP over SCTP versus HTTP over TCP Movies, October 2007.
http://www.cis.udel.edu/%7Eamer/PEL/leighton.movies/index.html
[Natarajan 2007] P. Natarajan, P. Amer, R. Stewart, "The Case for Multistreamed Web
Transport in High Latency Networks," TR2007-342, Department of Computer
& Information Sciences, University of Delaware, USA, October 2007.
[Natarajan 2008e] P. Natarajan, N. Ekiz, E. Yilmaz, P. Amer, J. Iyengar, R. Stewart,
"Non-Renegable Selective Acknowledgements (NR-SACKs) for SCTP," 16th
International Conference on Network Protocols, Orlando, USA, October 2008.
[Nielsen 1999] J. Nielsen, "Designing Web Usability: The Practice of Simplicity," New
Riders, 1999, ISBN: 156205810X.
[RFC2760] M. Allman, S. Dawkins, D. Glover, J. Griner, D. Tran, T. Henderson, J.
Heidemann, J. Touch, H. Kruse, S. Ostermann, K. Scott, J. Semke, "Ongoing
TCP Research Related to Satellites," RFC 2760, February 2000.
[RFC896] J. Nagle, "Congestion Control in IP/TCP Internetworks," RFC 896, January
1984.
[Wang 2007a] G. Wang, Y. Xia, D. Harrison, "An ns-2 TCP Evaluation Tool:
Installation Guide and Tutorial," April 2007. http://labs.nec.com.cn/tcpeval.htm.
[Wang 2007b] G. Wang, Y. Xia, D. Harrison, "An NS2 TCP Evaluation Tool," draft-
irtf-tmrg-ns2-tcp-tool, IETF Internet Draft (expired), November 2007.