Chapter 3: Transport Layer

Our goals:
ˆ understand principles behind transport layer services:
   multiplexing/demultiplexing
   reliable data transfer
   flow control
   congestion control
ˆ learn about transport layer protocols in the Internet:
   UDP: connectionless transport
   TCP: connection-oriented transport
   TCP congestion control

Transport Layer 3-1


Transport services and protocols
ˆ provide logical communication between app processes running on different hosts
ˆ transport protocols run in end systems
   send side: breaks app messages into segments, passes to network layer
   rcv side: reassembles segments into messages, passes to app layer
ˆ more than one transport protocol available to apps
   Internet: TCP and UDP

[Figure: logical end-end transport between two end systems, each with an application/transport/network/data link/physical stack; the nodes in between show only network, data link, physical]


Transport vs. network layer
ˆ network layer: logical communication between hosts
ˆ transport layer: logical communication between processes
   relies on, enhances, network layer services

Household analogy: 12 kids sending letters to 12 kids
ˆ processes = kids
ˆ app messages = letters in envelopes
ˆ hosts = houses
ˆ transport protocol = Ayşe and Bülent
ˆ network-layer protocol = postal service


Internet transport-layer protocols
ˆ reliable, in-order delivery (TCP)
   congestion control
   flow control
   connection setup
ˆ unreliable, unordered delivery: UDP
   no-frills extension of “best-effort” IP
ˆ services not available:
   delay guarantees
   bandwidth guarantees

[Figure: logical end-end transport between the protocol stacks of two end systems]


Multiplexing/demultiplexing
Demultiplexing at rcv host: delivering received segments to correct socket
Multiplexing at send host: gathering data from multiple sockets, enveloping data with header (later used for demultiplexing)

[Figure: host 1, host 2, host 3, each with an application/transport/network/link/physical stack; processes P1 to P4 attach to the transport layer through sockets]

How demultiplexing works
ˆ host receives IP datagrams
   each datagram has source IP address, destination IP address
   each datagram carries 1 transport-layer segment
   each segment has source, destination port number (recall: well-known port numbers for specific applications)
ˆ host uses IP addresses & port numbers to direct segment to appropriate socket

TCP/UDP segment format (each row 32 bits wide):
  source port # | dest port #
  other header fields
  application data (message)
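
As a rough illustration of the segment format above, the sketch below reads the two port fields from the front of a segment. It assumes, as in the TCP and UDP headers, that each port is a 16-bit big-endian integer; the function name `parse_ports` and the sample bytes are made up for the example.

```python
import struct

def parse_ports(segment):
    """Return (source_port, dest_port) read from the first 4 bytes of a segment."""
    # "!HH" = network byte order, two unsigned 16-bit integers.
    src_port, dst_port = struct.unpack("!HH", segment[:4])
    return src_port, dst_port

# Hypothetical segment: source port 9157, destination port 80, then other bytes.
segment = struct.pack("!HH", 9157, 80) + b"other header fields and data"
print(parse_ports(segment))
```

The receiving host would use the destination port (together with IP addresses from the datagram) to pick the socket.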


Connectionless demultiplexing
ˆ Create sockets with port numbers:
DatagramSocket mySocket1 = new DatagramSocket(9111);
DatagramSocket mySocket2 = new DatagramSocket(9222);
ˆ UDP socket identified by two-tuple:
(dest IP address, dest port number)
ˆ When host receives UDP segment:
   checks destination port number in segment
   directs UDP segment to socket with that port number
ˆ IP datagrams with different source IP addresses and/or source port numbers directed to same socket
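
A minimal sketch of the two-tuple lookup described above, written as a plain in-memory table rather than real sockets. The class name `UdpDemux` and the host labels are assumptions for illustration only.

```python
class UdpDemux:
    """Toy demultiplexer: a UDP socket is found by (dest IP, dest port) alone."""

    def __init__(self):
        self.sockets = {}  # (dest_ip, dest_port) -> list of (source, payload)

    def bind(self, ip, port):
        self.sockets[(ip, port)] = []

    def deliver(self, src_ip, src_port, dst_ip, dst_port, payload):
        # The source address is NOT part of the lookup key.
        queue = self.sockets.get((dst_ip, dst_port))
        if queue is None:
            return False  # no socket bound to that port
        queue.append(((src_ip, src_port), payload))
        return True

demux = UdpDemux()
demux.bind("C", 9111)
demux.deliver("A", 9157, "C", 9111, b"from A")
demux.deliver("B", 5775, "C", 9111, b"from B")
print(len(demux.sockets[("C", 9111)]))  # both datagrams reach the same socket
```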


Connection-oriented demux
ˆ TCP socket identified by 4-tuple:
   source IP address
   source port number
   dest IP address
   dest port number
ˆ recv host uses all four values to direct segment to appropriate socket
ˆ Server host may support many simultaneous TCP sockets:
   each socket identified by its own 4-tuple
ˆ Web servers have different sockets for each connecting client
   non-persistent HTTP will have different socket for each request
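
For contrast with the UDP case, here is a small sketch of a 4-tuple lookup. The class name `TcpDemux` and its methods are hypothetical; a real TCP stack also tracks connection state, which is left out.

```python
class TcpDemux:
    """Toy demultiplexer: a TCP socket is found by the full 4-tuple."""

    def __init__(self):
        # (src_ip, src_port, dst_ip, dst_port) -> list of payloads
        self.connections = {}

    def open(self, src_ip, src_port, dst_ip, dst_port):
        self.connections[(src_ip, src_port, dst_ip, dst_port)] = []

    def deliver(self, src_ip, src_port, dst_ip, dst_port, payload):
        queue = self.connections.get((src_ip, src_port, dst_ip, dst_port))
        if queue is None:
            return False  # no connection matches all four values
        queue.append(payload)
        return True

server = TcpDemux()
for src in [("A", 9157), ("B", 9157), ("B", 5775)]:
    server.open(src[0], src[1], "C", 80)
server.deliver("B", 9157, "C", 80, b"GET /")
print(len(server.connections))  # three sockets share destination port 80
```

Two clients may pick the same source port (9157 here) and still land on different sockets because their source IP addresses differ.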


Connection-oriented demux (cont)

[Figure: server (IP: C) with processes P4, P5, P6; client A (IP: A) and client B (IP: B). Three segments arrive at the server, all with DP: 80 and D-IP: C:
  SP: 9157, S-IP: A
  SP: 9157, S-IP: B
  SP: 5775, S-IP: B
Each segment is directed to a different socket.]


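Connection-oriented demux: Threaded Web Server

[Figure: server (IP: C) with a single process P4; client A (IP: A) and client B (IP: B). Three segments arrive at the server, all with DP: 80 and D-IP: C:
  SP: 9157, S-IP: A
  SP: 9157, S-IP: B
  SP: 5775, S-IP: B
Each segment is directed to its own socket within the one server process.]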


UDP: User Datagram Protocol [RFC 768]
ˆ “no frills,” “bare bones” Internet transport protocol
ˆ “best effort” service, UDP segments may be:
   lost
   delivered out of order to app
ˆ connectionless:
   no handshaking between UDP sender, receiver
   each UDP segment handled independently of others

Why is there a UDP?
ˆ no connection establishment (which can add delay)
ˆ simple: no connection state at sender, receiver
ˆ small segment header
ˆ no congestion control: UDP can blast away as fast as desired


UDP: more
ˆ often used for streaming multimedia apps
   loss tolerant
   rate sensitive
ˆ other UDP uses
   DNS
   SNMP
ˆ reliable transfer over UDP: add reliability at application layer
   application-specific error recovery!

UDP segment format (each row 32 bits wide):
  source port # | dest port #
  length | checksum
  application data (message)
length: length, in bytes, of UDP segment, including header


UDP checksum
Goal: detect “errors” (e.g., flipped bits) in transmitted segment

Sender:
ˆ treat segment contents as sequence of 16-bit integers
ˆ checksum: 1’s complement of the sum (1’s complement addition, with wraparound of carry out bit) of segment contents
ˆ sender puts checksum value into UDP checksum field

Receiver:
ˆ compute checksum of received segment
ˆ check if computed checksum equals checksum field value:
   NO - error detected
   YES - no error detected.
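
The sender and receiver steps above can be sketched as follows. This is a simplified illustration: it checksums only the bytes it is given, whereas real UDP also covers a pseudo-header, and the function names are invented for the example.

```python
def ones_complement_sum16(data):
    """Add 16-bit big-endian words, wrapping any carry back into the low bits."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # wraparound of carry out
    return total

def checksum16(data):
    """Sender side: 1's complement of the 1's complement sum."""
    return ~ones_complement_sum16(data) & 0xFFFF

def looks_intact(data, checksum_field):
    """Receiver side: recompute and compare with the checksum field."""
    return checksum16(data) == checksum_field

payload = b"\x12\x34\x56\x78"
field = checksum16(payload)
print(hex(field), looks_intact(payload, field))
```

A match means only that no error was detected; some multi-bit errors cancel out in the sum.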


Principles of Reliable data transfer
ˆ important in app., transport, link layers
ˆ top-10 list of important networking topics!
ˆ characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Reliable data transfer: getting started
send side:
ˆ rdt_send(): called from above (e.g., by app.). Passed data to deliver to receiver upper layer
ˆ udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
receive side:
ˆ rdt_rcv(): called when packet arrives on rcv-side of channel
ˆ deliver_data(): called by rdt to deliver data to upper layer


Reliable data transfer: getting started
We’ll:
ˆ incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
ˆ consider only unidirectional data transfer
   but control info will flow in both directions!
ˆ use finite state machines (FSM) to specify sender, receiver

FSM notation:
ˆ state: when in this “state”, next state uniquely determined by next event
ˆ each arrow from state 1 to state 2 is labeled with the event causing the state transition, above the actions taken on the state transition


Rdt1.0: reliable transfer over a reliable channel
ˆ underlying channel perfectly reliable
   no bit errors
   no loss of packets
ˆ separate FSMs for sender, receiver:
   sender sends data into underlying channel
   receiver reads data from underlying channel

sender FSM
[Wait for call from above]
  on rdt_send(data):
    packet = make_pkt(data)
    udt_send(packet)

receiver FSM
[Wait for call from below]
  on rdt_rcv(packet):
    extract(packet,data)
    deliver_data(data)


Rdt2.0: channel with bit errors
ˆ underlying channel may flip bits in packet
 recall: checksum to detect bit errors
ˆ the question: how to recover from errors:
 acknowledgements (ACKs): receiver explicitly tells sender
that pkt received OK
 negative acknowledgements (NAKs): receiver explicitly
tells sender that pkt had errors
 sender retransmits pkt on receipt of NAK
ˆ new mechanisms in rdt2.0 (beyond rdt1.0):
 error detection
 receiver feedback: control msgs (ACK,NAK) rcvr->sender



rdt2.0: FSM specification

sender FSM
[Wait for call from above]
  on rdt_send(data):
    sndpkt = make_pkt(data, checksum)
    udt_send(sndpkt)
    go to [Wait for ACK or NAK]
[Wait for ACK or NAK]
  on rdt_rcv(rcvpkt) && isNAK(rcvpkt):
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && isACK(rcvpkt):
    Λ
    go to [Wait for call from above]

receiver FSM
[Wait for call from below]
  on rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    udt_send(NAK)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    extract(rcvpkt,data)
    deliver_data(data)
    udt_send(ACK)


rdt2.0: operation with no errors

[Figure: the rdt2.0 FSM with the error-free path marked. Sender: rdt_send(data), sndpkt = make_pkt(data, checksum), udt_send(sndpkt). Receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt), extract(rcvpkt,data), deliver_data(data), udt_send(ACK). Sender: rdt_rcv(rcvpkt) && isACK(rcvpkt), back to “Wait for call from above”.]


rdt2.0: error scenario

[Figure: the rdt2.0 FSM with the error path marked. Sender: rdt_send(data), sndpkt = make_pkt(data, checksum), udt_send(sndpkt). Receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt), udt_send(NAK). Sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt), udt_send(sndpkt). Receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt), extract(rcvpkt,data), deliver_data(data), udt_send(ACK).]


rdt2.0 has a fatal flaw!
What happens if ACK/NAK corrupted?
ˆ sender doesn’t know what happened at receiver!
ˆ can’t just retransmit: possible duplicate

What to do?
ˆ sender ACKs/NAKs receiver’s ACK/NAK? What if sender ACK/NAK lost?
ˆ retransmit, but this might cause retransmission of correctly received pkt!

Handling duplicates:
ˆ sender adds sequence number to each pkt
ˆ sender retransmits current pkt if ACK/NAK garbled
ˆ receiver discards (doesn’t deliver up) duplicate pkt

stop and wait: Sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs
[Wait for call 0 from above]
  on rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    go to [Wait for ACK or NAK 0]
[Wait for ACK or NAK 0]
  on rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ):
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
    Λ
    go to [Wait for call 1 from above]
[Wait for call 1 from above]
  on rdt_send(data):
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    go to [Wait for ACK or NAK 1]
[Wait for ACK or NAK 1]
  on rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ):
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
    Λ
    go to [Wait for call 0 from above]


rdt2.1: receiver, handles garbled ACK/NAKs
[Wait for 0 from below]
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    go to [Wait for 1 from below]
  on rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
[Wait for 1 from below]
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    go to [Wait for 0 from below]
  on rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)


rdt2.1: discussion
Sender:
ˆ seq # added to pkt
ˆ two seq. #’s (0,1) will suffice. Why?
ˆ must check if received ACK/NAK corrupted
ˆ twice as many states
   state must “remember” whether “current” pkt has 0 or 1 seq. #

Receiver:
ˆ must check if received packet is duplicate
   state indicates whether 0 or 1 is expected pkt seq #
ˆ note: receiver can not know if its last ACK/NAK received OK at sender


rdt2.2: a NAK-free protocol
ˆ same functionality as rdt2.1, using ACKs only
ˆ instead of NAK, receiver sends ACK for last pkt received OK
   receiver must explicitly include seq # of pkt being ACKed
ˆ duplicate ACK at sender results in same action as NAK: retransmit current pkt


rdt2.2: sender, receiver fragments

sender FSM fragment
[Wait for call 0 from above]
  on rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    go to [Wait for ACK 0]
[Wait for ACK 0]
  on rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) ):
    udt_send(sndpkt)
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0):
    Λ

receiver FSM fragment
[Wait for 0 from below]
  on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)):
    udt_send(sndpkt)
transition into [Wait for 0 from below]
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(ACK1, chksum)
    udt_send(sndpkt)

rdt3.0: channels with errors and loss
New assumption: underlying channel can also lose packets (data or ACKs)
   checksum, seq. #, ACKs, retransmissions will be of help, but not enough
Q: how to deal with loss?
   sender waits until certain data or ACK lost, then retransmits
   drawbacks?

Approach: sender waits “reasonable” amount of time for ACK
ˆ retransmits if no ACK received in this time
ˆ if pkt (or ACK) just delayed (not lost):
   retransmission will be duplicate, but use of seq. #’s already handles this
   receiver must specify seq # of pkt being ACKed
ˆ requires countdown timer


rdt3.0 sender
[Wait for call 0 from above]
  on rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    start_timer
    go to [Wait for ACK0]
  on rdt_rcv(rcvpkt):
    Λ
[Wait for ACK0]
  on rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) ):
    Λ
  on timeout:
    udt_send(sndpkt)
    start_timer
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0):
    stop_timer
    go to [Wait for call 1 from above]
[Wait for call 1 from above]
  on rdt_send(data):
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    start_timer
    go to [Wait for ACK1]
  on rdt_rcv(rcvpkt):
    Λ
[Wait for ACK1]
  on rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) ):
    Λ
  on timeout:
    udt_send(sndpkt)
    start_timer
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1):
    stop_timer
    go to [Wait for call 0 from above]
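
To see the timer and the alternating sequence bit working together, here is a small deterministic simulation. It is a sketch under simplifying assumptions: the channel only drops packets (no corruption, no premature timeouts), every loss ends in a timeout, and the `lost` parameter is a hypothetical way of scripting which transmissions disappear.

```python
def stop_and_wait(messages, lost=()):
    """Deliver messages with an alternating-bit, stop-and-wait sender.

    lost: 0-based indices of channel uses (data packets and ACKs counted
    together, in order of transmission) that the channel drops.
    Returns (messages delivered to the upper layer, data transmissions made).
    """
    lost = set(lost)
    channel_use = 0      # every data packet or ACK put on the channel
    transmissions = 0    # data packets sent, including retransmissions
    expected = 0         # receiver state: sequence bit it is waiting for
    delivered = []
    seq = 0              # sender state: sequence bit of the current packet
    for msg in messages:
        acked = False
        while not acked:
            transmissions += 1              # udt_send(sndpkt); start_timer
            data_dropped = channel_use in lost
            channel_use += 1
            if data_dropped:
                continue                    # timeout fires, retransmit
            # Receiver: deliver new data, discard duplicates, ACK either way.
            if seq == expected:
                delivered.append(msg)
                expected ^= 1
            ack_dropped = channel_use in lost
            channel_use += 1
            acked = not ack_dropped         # a lost ACK also ends in a timeout
        seq ^= 1
    return delivered, transmissions

print(stop_and_wait(["a", "b"], lost={1}))  # first ACK is lost
```

With the first ACK lost, the sender retransmits "a", and the receiver recognizes the duplicate by its sequence bit instead of delivering it twice.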


rdt3.0 in action

[Figure: sender/receiver timing diagrams of rdt3.0]


Performance of rdt3.0
ˆ rdt3.0 works, but performance stinks
ˆ example: 1 Gbps link, 15 ms e-e prop. delay, 1KB packet:

$$T_{transmit} = \frac{L\ \text{(packet length in bits)}}{R\ \text{(transmission rate, bps)}} = \frac{8\ \text{kb/pkt}}{10^{9}\ \text{b/sec}} = 8\ \text{microsec}$$

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

   U sender: utilization – fraction of time sender busy sending
   1KB pkt every 30 msec -> 33kB/sec thruput over 1 Gbps link
   network protocol limits use of physical resources!
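
The arithmetic in this example can be checked with a few lines of code. The function name is made up; the inputs are the slide's values (8000-bit packet, 1 Gbps link, 30 msec RTT).

```python
def stop_and_wait_utilization(packet_bits, rate_bps, rtt_s):
    """Fraction of time the sender is busy: (L/R) / (RTT + L/R)."""
    transmit_time = packet_bits / rate_bps
    return transmit_time / (rtt_s + transmit_time)

L_BITS = 8000        # 1KB packet
R_BPS = 1e9          # 1 Gbps link
RTT_S = 0.030        # 2 x 15 ms propagation delay

u = stop_and_wait_utilization(L_BITS, R_BPS, RTT_S)
throughput_bytes_per_s = (L_BITS / 8) / (RTT_S + L_BITS / R_BPS)
print(round(u, 5), round(throughput_bytes_per_s))
```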


rdt3.0: stop-and-wait operation
sender:
ˆ first packet bit transmitted, t = 0
ˆ last packet bit transmitted, t = L / R
ˆ ACK arrives, send next packet, t = RTT + L / R
receiver:
ˆ first packet bit arrives
ˆ last packet bit arrives, send ACK

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$


Pipelined protocols
Pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
   range of sequence numbers must be increased
   buffering at sender and/or receiver
ˆ Two generic forms of pipelined protocols: Go-Back-N, selective repeat

Pipelining: increased utilization
sender:
ˆ first packet bit transmitted, t = 0
ˆ last bit transmitted, t = L / R
ˆ ACK arrives, send next packet, t = RTT + L / R
receiver:
ˆ first packet bit arrives
ˆ last packet bit arrives, send ACK
ˆ last bit of 2nd packet arrives, send ACK
ˆ last bit of 3rd packet arrives, send ACK

Increase utilization by a factor of 3!

$$U_{sender} = \frac{3 \cdot L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$$


ˆ Utilization = N(L/R)/(RTT+L/R) if NL/R < RTT+L/R, and the sender pauses after it transmits a window of packets until it receives first ACK
ˆ Utilization = 1 if NL/R > RTT+L/R, and the sender does not pause
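
The two cases above collapse into one expression capped at 1. The sketch below assumes the same link parameters as the earlier example (8000-bit packets, 1 Gbps, 30 msec RTT); the function name is invented.

```python
def pipelined_utilization(window, packet_bits, rate_bps, rtt_s):
    """min(1, N*(L/R) / (RTT + L/R)) for a window of N packets."""
    transmit_time = packet_bits / rate_bps
    return min(1.0, window * transmit_time / (rtt_s + transmit_time))

for n in (1, 3, 4000):
    print(n, round(pipelined_utilization(n, 8000, 1e9, 0.030), 5))
```

With these numbers a window of a few thousand packets is needed before the sender stops pausing.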


Go-Back-N
Sender:
ˆ k-bit seq # in pkt header
ˆ “window” of up to N consecutive unack’ed pkts allowed
ˆ ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK”
   may receive duplicate ACKs (see receiver)
ˆ timer for the entire window
ˆ timeout(n): retransmit pkt n and all higher seq # pkts in window


GBN: sender extended FSM
initial transition into [Wait]
  Λ
    base=1
    nextseqnum=1
[Wait]
  on rdt_send(data):
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum,data,chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  on timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  on rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ
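
The sender FSM translates fairly directly into code. The class below is a sketch, not a full protocol: sequence numbers grow without wrapping, the timer is just a flag, and the `sent` log stands in for `udt_send`. The `max()` guard against an ACK that would move `base` backwards is an added assumption.

```python
class GbnSender:
    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> data, kept until ACKed
        self.timer_running = False
        self.sent = []              # log of seq numbers handed to the channel

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data(data)
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)   # go back N: resend every unACKed packet

    def ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is confirmed.
        self.base = max(self.base, acknum + 1)
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]
        self.timer_running = self.base != self.nextseqnum

s = GbnSender(window=3)
for item in "abcd":
    s.rdt_send(item)                # the fourth call is refused
s.ack(1)
s.rdt_send("d")
s.timeout()
print(s.sent)
```
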
GBN: receiver extended FSM
initial transition into [Wait]
  Λ
    expectedseqnum=1
    sndpkt = make_pkt(0,ACK,chksum)
[Wait]
  on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt,expectedseqnum):
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum,ACK,chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default:
    udt_send(sndpkt)

ˆ ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
   may generate duplicate ACKs
   need only remember expectedseqnum
ˆ out-of-order pkt:
   discard (don’t buffer) -> no receiver buffering!
   Re-ACK pkt with highest in-order seq #

GBN in action

[Figure: sender/receiver timing diagram of Go-Back-N]


Selective Repeat
ˆ receiver individually acknowledges all correctly
received pkts
 buffers pkts, as needed, for eventual in-order delivery
to upper layer
ˆ sender only resends pkts for which ACK not
received
 sender timer for each unACKed pkt
ˆ sender window
 N consecutive seq #’s
 again limits seq #s of sent, unACKed pkts



Selective repeat: sender, receiver windows



Selective repeat

sender
data from above:
ˆ if next available seq # in window, send pkt
timeout(n):
ˆ resend pkt n, restart timer
ACK(n) in [sendbase,sendbase+N-1]:
ˆ mark pkt n as received
ˆ if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]
ˆ send ACK(n)
ˆ out-of-order: buffer
ˆ in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N,rcvbase-1]
ˆ ACK(n)
otherwise:
ˆ ignore
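
The receiver rules above can be sketched as a small class. It assumes unbounded sequence numbers starting at 0 (no wraparound) and uncorrupted packets; the class and method names are hypothetical.

```python
class SrReceiver:
    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}       # out-of-order packets waiting for delivery
        self.delivered = []    # data passed to the upper layer, in order

    def receive(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer.setdefault(n, data)
            # Deliver any run of in-order packets and advance the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: re-ACK so sender can advance
        return None

r = SrReceiver(window=4)
print(r.receive(1, "b"), r.delivered)   # buffered, nothing delivered yet
print(r.receive(0, "a"), r.delivered)   # gap filled, both delivered
```

Re-ACKing packets below `rcvbase` matters because the earlier ACK may have been lost, leaving the sender's window stuck.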


Selective repeat in action



Selective repeat: dilemma
Example:
ˆ seq #’s: 0, 1, 2, 3
ˆ window size=3
ˆ receiver sees no difference in two scenarios!
ˆ incorrectly passes duplicate data as new in (a)

Q: what relationship between seq # size and window size?

Sequence Number vs. Window Size
Suppose we use k bits to represent SN
Question: What’s the minimum number of bits k necessary for a window size of N?

Go-Back-N
Q: For a given expectedSN, what’s the largest possible value for snd_base?
A: If all the last N ACKs sent by the receiver are received, snd_base = expectedSN
[Figure: sender’s window spans snd_base = expectedSN to expectedSN+N-1; receiver is at expectedSN]

Sequence Number vs. Window Size
Go-Back-N
Q: For a given expectedSN, what’s the smallest possible value for snd_base?
A: If all the last N ACKs sent by the receiver are not received, snd_base = expectedSN-N
[Figure: sender’s window spans snd_base = expectedSN-N to expectedSN-1; receiver is at expectedSN]

Sequence Number vs. Window Size
Go-Back-N
All SNs in the interval [expectedSN-N,expectedSN+N-1] (an interval of size 2N) can be received by the receiver. Since the receiver accepts only the packet with SN=expectedSN, there should be no other packet within this interval with SN=expectedSN. Therefore,

$$2^k \ge N+1$$

[Figure: possible sender windows range from snd_base = expectedSN-N up to expectedSN+N-1; receiver is at expectedSN]


Sequence Number vs. Window Size
Selective Repeat
Q: For a given rcv_base, what’s the largest possible value for snd_base?
A: If all the last N ACKs sent by the receiver are received, snd_base = rcv_base (same as Go-Back-N)
[Figure: sender’s window spans snd_base = rcv_base to rcv_base+N-1; receiver’s window spans rcv_base to rcv_base+N-1]

Sequence Number vs. Window Size
Selective Repeat
Q: For a given rcv_base, what’s the smallest possible value for snd_base?
A: If all the last N ACKs sent by the receiver are not received, snd_base = rcv_base-N (same as Go-Back-N)
[Figure: sender’s window spans snd_base = rcv_base-N to rcv_base-1; receiver’s window spans rcv_base to rcv_base+N-1]

Sequence Number vs. Window Size
Selective Repeat
All SNs in the interval [rcv_base-N,rcv_base+N-1] (an interval of size 2N) can be received by the receiver. Since the receiver should be able to distinguish between all packets in this interval and take corresponding action, there should be no two packets within this interval having the same SN. Therefore,

$$2^k \ge 2N$$

[Figure: possible sender windows range from snd_base = rcv_base-N up to rcv_base+N-1; receiver’s window spans rcv_base to rcv_base+N-1]
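
The two bounds (at least N+1 distinct sequence numbers for Go-Back-N, at least 2N for Selective Repeat) give the minimum k directly. The helper names below are made up for this sketch.

```python
def bits_for(distinct_values):
    """Smallest k such that 2**k >= distinct_values."""
    return (distinct_values - 1).bit_length()

def min_seq_bits_gbn(window):
    return bits_for(window + 1)      # 2^k >= N + 1

def min_seq_bits_sr(window):
    return bits_for(2 * window)      # 2^k >= 2N

for n in (1, 2, 3, 4, 8):
    print(n, min_seq_bits_gbn(n), min_seq_bits_sr(n))
```

For a window of 3, Selective Repeat needs 3 bits, which is why the example with seq #'s 0 to 3 (2 bits) and window size 3 fails.
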
TCP: Overview RFCs: 793, 1122, 1323, 2018, 2581
ˆ point-to-point:
   one sender, one receiver
ˆ reliable, in-order byte stream:
   no “message boundaries”
ˆ pipelined:
   TCP congestion and flow control set window size
ˆ send & receive buffers
ˆ full duplex data:
   bi-directional data flow in same connection
   MSS: maximum segment size
ˆ connection-oriented:
   handshaking (exchange of control msgs) init’s sender, receiver state before data exchange
ˆ flow controlled:
   sender will not overwhelm receiver

[Figure: application writes data through the socket door into the TCP send buffer; segments carry it to the TCP receive buffer, from which the application reads data through its socket door]


TCP segment structure
Header layout (each row 32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | Receive window
  checksum | Urg data pnter
  Options (variable length)
  application data (variable length)

ˆ sequence number, acknowledgement number: counting by bytes of data (not segments!)
ˆ URG: urgent data (generally not used)
ˆ ACK: ACK # valid
ˆ PSH: push data now (generally not used)
ˆ RST, SYN, FIN: connection estab (setup, teardown commands)
ˆ Receive window: # bytes rcvr willing to accept
ˆ checksum: Internet checksum (as in UDP)


TCP seq. #’s and ACKs
Seq. #’s:
   byte stream “number” of first byte in segment’s data
ACKs:
   seq # of next byte expected from other side
   cumulative ACK
Q: how does receiver handle out-of-order segments?
   A: TCP spec doesn’t say; up to implementation
   Widely used implementations of TCP buffer out-of-order segments


TCP Round Trip Time and Timeout
Q: how to set TCP timeout value?
ˆ longer than RTT
   but RTT varies
ˆ too short: premature timeout
   unnecessary retransmissions
ˆ too long: slow reaction to segment loss

Q: how to estimate RTT?
ˆ SampleRTT: measured time from segment transmission until ACK receipt
   ignore retransmissions
ˆ SampleRTT will vary, want estimated RTT “smoother”
   average several recent measurements, not just current SampleRTT


TCP Round Trip Time and Timeout
EstimatedRTT = (1- α)*EstimatedRTT + α*SampleRTT

ˆ Exponential weighted moving average
ˆ influence of past sample decreases exponentially fast
ˆ typical value: α = 0.125


Example RTT estimation:
[Figure: RTT: gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350; series: SampleRTT, Estimated RTT]


TCP Round Trip Time and Timeout
Setting the timeout
ˆ EstimatedRTT plus “safety margin”
 large variation in EstimatedRTT -> larger safety margin
ˆ first estimate of how much SampleRTT deviates from
EstimatedRTT:

DevRTT = (1-β)*DevRTT +
β*|SampleRTT-EstimatedRTT|

(typically, β = 0.25)

Then set timeout interval:

TimeoutInterval = EstimatedRTT + 4*DevRTT

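
The EstimatedRTT, DevRTT and TimeoutInterval formulas can be combined into one small estimator. This is a sketch with two assumptions the slides do not state: DevRTT starts at 0, and DevRTT is updated using the EstimatedRTT value from before the current sample is folded in.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = 0.0   # assumed starting value

    def update(self, sample_rtt):
        # Deviation first, against the previous estimate (assumed ordering).
        deviation = abs(sample_rtt - self.estimated_rtt)
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * deviation
        self.estimated_rtt = (
            (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt
        )

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt

est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)                        # one unusually slow sample
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single outlier moves the estimate only slightly, but it widens the safety margin noticeably through DevRTT.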


TCP reliable data transfer
ˆ TCP creates rdt service on top of IP’s unreliable service
ˆ Pipelined segments
ˆ Cumulative acks
ˆ TCP uses single retransmission timer; however it just retransmits the first segment in the window
ˆ Retransmissions are triggered by:
   timeout events
   duplicate acks
ˆ Initially consider simplified TCP sender:
   ignore duplicate acks
   ignore flow control, congestion control


TCP sender events:
data rcvd from app:
ˆ Create segment with seq #
ˆ seq # is byte-stream number of first data byte in segment
ˆ start timer if not already running (think of timer as for oldest unacked segment)
ˆ expiration interval: TimeOutInterval
timeout:
ˆ retransmit segment that caused timeout (first segment in the window)
ˆ restart timer
Ack rcvd:
ˆ If acknowledges previously unacked segments
   update what is known to be acked
   start timer if there are outstanding segments


TCP sender (simplified)

NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum

loop (forever) {
  switch(event)

  event: data received from application above
    create TCP segment with sequence number NextSeqNum
    if (timer currently not running)
      start timer
    pass segment to IP
    NextSeqNum = NextSeqNum + length(data)

  event: timer timeout
    retransmit not-yet-acknowledged segment with smallest sequence number
    start timer

  event: ACK received, with ACK field value of y
    if (y > SendBase) {
      SendBase = y
      if (there are currently not-yet-acknowledged segments)
        start timer
    }

} /* end of loop forever */

Comment:
• SendBase-1: last cumulatively ack’ed byte
Example:
• SendBase-1 = 71; y = 73, so the rcvr wants 73+; y > SendBase, so that new data is acked
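
The event loop above maps onto three handler methods. The class below is a sketch only: the timer is a boolean, "pass segment to IP" is a list append, and all names are hypothetical.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = []            # (seq, data) in send order
        self.timer_running = False
        self.wire = []               # (seq, length) of every segment sent

    def on_app_data(self, data):
        self.unacked.append((self.next_seq_num, data))
        self.wire.append((self.next_seq_num, len(data)))
        if not self.timer_running:
            self.timer_running = True
        self.next_seq_num += len(data)

    def on_timeout(self):
        if self.unacked:
            seq, data = self.unacked[0]   # smallest not-yet-ACKed seq #
            self.wire.append((seq, len(data)))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y
            # Keep only segments with bytes at or beyond y.
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)

sender = SimpleTcpSender(initial_seq=92)
sender.on_app_data(b"x" * 8)     # Seq=92, 8 bytes
sender.on_app_data(b"y" * 20)    # Seq=100, 20 bytes
sender.on_ack(120)               # one cumulative ACK covers both segments
print(sender.send_base, sender.timer_running)
```
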
TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
ˆ Host A sends Seq=92, 8 bytes data
ˆ Host B replies ACK=100, which is lost (X)
ˆ Seq=92 timeout: Host A resends Seq=92, 8 bytes data
ˆ Host B replies ACK=100; SendBase = 100

premature timeout (Host A, Host B):
ˆ Host A sends Seq=92, 8 bytes data, then Seq=100, 20 bytes data
ˆ Host B replies ACK=100 and ACK=120, but the Seq=92 timeout fires before they arrive
ˆ Host A resends Seq=92, 8 bytes data
ˆ the ACKs arrive: SendBase = 100, then SendBase = 120
ˆ Host B answers the retransmission with ACK=120; SendBase = 120

TCP retransmission scenarios (more)

Cumulative ACK scenario (Host A, Host B):
ˆ Host A sends Seq=92, 8 bytes data, then Seq=100, 20 bytes data
ˆ Host B’s ACK=100 is lost (X)
ˆ ACK=120 arrives before the timeout; SendBase = 120


TCP ACK generation [RFC 1122, RFC 2581]

Event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

Event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

Event at receiver: arrival of out-of-order segment with higher-than-expected seq. #. Gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq. # of next expected byte

Event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap


Fast Retransmit
ˆ Time-out period often relatively long:
   long delay before resending lost packet
ˆ Detect lost segments via duplicate ACKs.
   Sender often sends many segments back-to-back
   If segment is lost, there will likely be many duplicate ACKs.
ˆ If sender receives 3 duplicate ACKs for the same data, it supposes that segment after ACKed data was lost:
   fast retransmit: resend segment before timer expires


Fast Retransmit
ˆ Resend a segment after 3 duplicate ACKs since a duplicate ACK means that an out-of-sequence segment was received
ˆ duplicate ACKs due to packet reordering!
ˆ if window is small don’t get duplicate ACKs!

[Figure: Host A sends seq # x1, x2, x3, x4, x5 to Host B; x2 is lost (X). Host B returns ACK x1 four times (the last three are the triple duplicate ACKs), and Host A resends seq x2 before the timeout.]

Fast retransmit algorithm:

event: ACK received, with ACK field value of y
  if (y > SendBase) {
    SendBase = y
    if (there are currently not-yet-acknowledged segments)
      start timer
  }
  else {   /* a duplicate ACK for already ACKed segment */
    increment count of dup ACKs received for y
    if (count of dup ACKs received for y == 3) {
      resend segment with sequence number y   /* fast retransmit */
    }
  }
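
A compact sketch of the ACK handler above. It assumes a single duplicate counter that is reset whenever new data is ACKed, and it counts only ACKs equal to SendBase; the `retransmitted` list is a stand-in for actually resending a segment.

```python
class FastRetransmitSender:
    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = 0
        self.retransmitted = []   # seq #s resent by fast retransmit

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y    # new data ACKed
            self.dup_acks = 0
        elif y == self.send_base:
            self.dup_acks += 1    # duplicate ACK for already ACKed data
            if self.dup_acks == 3:
                self.retransmitted.append(y)   # fast retransmit

s = FastRetransmitSender(send_base=100)
for _ in range(4):
    s.on_ack(100)
print(s.retransmitted)   # the segment is resent once, on the third duplicate
```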


TCP Flow Control
ˆ receive side of TCP connection has a receive buffer
ˆ app process may be slow at reading from buffer

flow control: sender won’t overflow receiver’s buffer by transmitting too much, too fast
ˆ speed-matching service: matching the send rate to the receiving app’s drain rate

TCP Flow control: how it works
(Suppose TCP receiver discards out-of-order segments)
ˆ spare room in buffer
= RcvWin
= RcvBuffer-[LastByteRcvd - LastByteRead]
ˆ Rcvr advertises spare room by including value of RcvWin in segments
ˆ Sender limits unACKed data to RcvWin
   guarantees receive buffer doesn’t overflow
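
The RcvWin formula and the sender-side limit can be written out directly. The function names and the 4096-byte buffer are illustrative assumptions.

```python
def rcv_win(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Spare room in the receive buffer."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)

def sender_may_send(last_byte_sent, last_byte_acked, advertised_win, nbytes):
    """True if sending nbytes more keeps unACKed data within RcvWin."""
    return (last_byte_sent - last_byte_acked) + nbytes <= advertised_win

BUFFER = 4096
print(rcv_win(BUFFER, 2048, 0))      # 2K arrived, nothing read yet
print(rcv_win(BUFFER, 4096, 0))      # buffer full: sender must stop
print(rcv_win(BUFFER, 4096, 1024))   # app read 1K, 1K of room opens up
```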


Sliding Window Flow Control Example
Receiver buffer: 4K

ˆ Sender sends 2K of data (SeqNo=0); receiver buffer holds 2K
ˆ Receiver replies AckNo=2048, RcvWin=2048
ˆ Sender sends 2K of data (SeqNo=2048); receiver buffer holds 4K
ˆ Receiver replies AckNo=4096, RcvWin=0; sender blocked
ˆ Receiver buffer drains to 3K
ˆ Receiver replies AckNo=4096, RcvWin=1024


Principles of Congestion Control

Congestion:
ˆ informally: “too many sources sending too much
data too fast for network to handle”
ˆ different from flow control!
ˆ manifestations:
 lost packets (buffer overflow at routers)
 long delays (queueing in router buffers)
ˆ a top-10 problem!



Causes/costs of congestion: scenario 1
ˆ two senders, two receivers
ˆ one router, infinite buffers
ˆ no retransmission

ˆ large delays when congested
ˆ maximum achievable throughput

[Figure: Host A and Host B send λin (original data) through one router with unlimited shared output link buffers; λout is the delivered rate]

Causes/costs of congestion: scenario 2
ˆ one router, finite buffers
ˆ sender retransmission of lost packet

[Figure: Host A and Host B share a router with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: delivered rate]


Causes/costs of congestion: scenario 2
ˆ always: λin = λout (goodput)
ˆ “perfect” retransmission only when loss: λ'in > λout
ˆ retransmission of delayed (not lost) packet makes λ'in larger (than perfect case) for same λout

[Figure: three plots of λout against λin, both axes running to R/2: a. λout reaches R/2; b. λout reaches R/3; c. λout reaches R/4]

“costs” of congestion:
ˆ more work (retrans) for given “goodput”
ˆ unneeded retransmissions: link carries multiple copies of pkt

Causes/costs of congestion: scenario 3
ˆ four senders
ˆ multihop paths
ˆ timeout/retransmit

Q: what happens as λin and λ'in increase?

[Figure: Host A and Host B among four senders on multihop paths through routers with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: delivered rate]


Causes/costs of congestion: scenario 3

[Figure: λout for Host A and Host B]

another “cost” of congestion:
ˆ when packet dropped, any “upstream” transmission capacity used for that packet was wasted!


Approaches towards congestion control
Two broad approaches towards congestion control:

End-end congestion control:
ˆ no explicit feedback from network
ˆ congestion inferred from end-system observed loss, delay
ˆ approach taken by TCP

Network-assisted congestion control:
ˆ routers provide feedback to end systems
   single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
   explicit rate sender should send at


TCP Congestion Control
ˆ end-end control (no network assistance)
ˆ sender limits transmission:
LastByteSent-LastByteAcked ≤ CongWin
ˆ CongWin is dynamic, function of perceived network congestion

How does sender perceive congestion?
ˆ loss event = timeout or 3 duplicate acks
ˆ TCP sender reduces rate (CongWin) after loss event
two modes of operation:
   Slow Start (SS)
   Congestion avoidance (CA) or Additive Increase Multiplicative Decrease (AIMD)

TCP congestion control: bandwidth probing
ˆ “probing for bandwidth”: increase transmission rate
on receipt of ACK, until eventually loss occurs, then
decrease transmission rate
 continue to increase on ACK, decrease on loss (since available
bandwidth is changing, depending on other connections in
network)
[Figure: sending rate versus time, showing TCP’s “sawtooth” behavior: ACKs being received, so increase rate; loss (marked X), so decrease rate]

ˆ Q: how fast to increase/decrease?


 details to follow
TCP Congestion Control: details

ˆ sender limits rate by limiting number of unACKed bytes “in pipeline”:
LastByteSent-LastByteAcked ≤ cwnd
 cwnd: differs from rwnd (how, why?)
 sender limited by min(cwnd,rwnd)
ˆ roughly,
rate = cwnd / RTT bytes/sec
ˆ cwnd is dynamic, function of perceived network congestion
[Figure: sender transmits cwnd bytes, then waits one RTT for the ACK(s)]
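The window limit and the rough rate estimate above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not a real TCP implementation: the function names `usable_window` and `approx_rate` are hypothetical, and all sizes are assumed to be in bytes and times in seconds.

```python
def usable_window(last_byte_sent, last_byte_acked, cwnd, rwnd):
    """Bytes the sender may still put in flight.

    The sender is limited by min(cwnd, rwnd); bytes already sent but
    not yet ACKed count against that limit.
    """
    in_flight = last_byte_sent - last_byte_acked
    return max(0, min(cwnd, rwnd) - in_flight)


def approx_rate(cwnd, rtt_sec):
    """Rough sending rate in bytes/sec: one window per round-trip time."""
    return cwnd / rtt_sec
```

For example, with 2000 bytes in flight, cwnd = 4000 and rwnd = 8000, the congestion window is the tighter limit and 2000 more bytes may be sent.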



TCP Congestion Control: more details

segment loss event: reducing cwnd
ˆ timeout: no response from receiver
 cut cwnd to 1
ˆ 3 duplicate ACKs: at least some segments getting through (recall fast retransmit)
 cut cwnd in half, less aggressively than on timeout

ACK received: increase cwnd
ˆ Two modes of operation:
 slowstart phase:
• increase exponentially fast (despite name) at connection start, or following timeout
 congestion avoidance:
• increase linearly
TCP Slow Start Phase
ˆ when connection begins, cwnd = 1 MSS
 example: MSS = 500 bytes & RTT = 200 msec
 initial rate = 20 kbps
ˆ available bandwidth may be >> MSS/RTT
 desirable to quickly ramp up to respectable rate
ˆ increase rate exponentially until first loss event or when threshold reached
 double cwnd every RTT
 done by incrementing cwnd by 1 for every ACK received
[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs; the vertical axis is time]
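The per-ACK increment and the resulting doubling can be traced with a short Python sketch. It assumes an idealized round: every segment in the window is ACKed individually, with no loss and no delayed ACKs. The function names are hypothetical, and cwnd is counted in segments.

```python
def slow_start_rounds(ssthresh_segments, initial_cwnd=1):
    """Return cwnd (in segments) at the start of each RTT during slow start."""
    cwnd = initial_cwnd
    history = [cwnd]
    while cwnd < ssthresh_segments:
        acks_this_round = cwnd          # one ACK per segment sent (assumption)
        for _ in range(acks_this_round):
            cwnd += 1                   # increment cwnd by 1 for every ACK
        history.append(cwnd)
    return history


def initial_rate_bps(mss_bytes, rtt_ms):
    """Rate in bits/sec when exactly one MSS is sent per RTT."""
    return mss_bytes * 8 * 1000 / rtt_ms
```

With the slide's numbers, `initial_rate_bps(500, 200)` gives 20000 bits/sec, that is 20 kbps, and `slow_start_rounds(8)` gives the window sequence 1, 2, 4, 8.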



Slow Start Example
ˆ The congestion window size grows very rapidly
 For every ACK, we increase CongWin by 1 irrespective of the number of segments ACK’ed
 double CongWin every RTT
 initial rate is slow but ramps up exponentially fast
ˆ TCP slows down the increase of CongWin when CongWin ≥ ssthresh
[Figure: cwnd = 1: segment 1 is sent and its ACK arrives; cwnd = 2: segments 2 and 3 are sent, and their ACKs raise cwnd to 3 and 4; segments 4 to 7 are sent, and their ACKs raise cwnd to 5, 6, 7, and 8]


TCP Congestion Avoidance Phase
ˆ when cwnd ≥ ssthresh grow cwnd linearly
 increase cwnd by 1 MSS per RTT
 approach possible congestion slower than in slowstart
 implementation: cwnd = cwnd + MSS²/cwnd for each ACK received

AIMD
ˆ ACKs: increase cwnd by 1 MSS per RTT: additive increase
ˆ loss: cut cwnd in half (non-timeout-detected loss): multiplicative decrease
AIMD: Additive Increase Multiplicative Decrease



Congestion Avoidance
ˆ Congestion avoidance phase is started if CongWin has
reached the slow-start threshold value

ˆ If CongWin >= ssthresh then each time an ACK is received, increment CongWin as follows:
• CongWin = CongWin + 1/CongWin (CongWin in
segments)
• In actual TCP implementation CongWin is in Bytes
CongWin = CongWin + MSS * (MSS/CongWin)
ˆ So CongWin is increased by one only if all CongWin
segments have been acknowledged.
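To see why the per-ACK increment adds up to about one MSS per round trip, the byte-based update can be applied once for every ACK in a window. The sketch below assumes one ACK per full-size segment and no loss; the function name is hypothetical.

```python
def congestion_avoidance_round(cong_win_bytes, mss_bytes):
    """Apply CongWin += MSS * (MSS / CongWin) once per ACK in one RTT."""
    acks = int(cong_win_bytes // mss_bytes)   # one ACK per segment in the window
    for _ in range(acks):
        cong_win_bytes += mss_bytes * (mss_bytes / cong_win_bytes)
    return cong_win_bytes
```

Starting from a window of 10 segments of 1000 bytes, one round adds slightly less than 1000 bytes, because the window grows a little after every ACK and each later increment is therefore a little smaller.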



Example Slow Start/Congestion Avoidance
Assume that ssthresh = 8
[Figure: cwnd (in segments) plotted against roundtrip times, with the ssthresh level marked; the cwnd values shown run from cwnd = 1 up to cwnd = 10]
Slow Start / Congestion Avoidance
ˆ A typical plot of CongWin for a TCP connection
(MSS = 1500 bytes) with TCP Tahoe:

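[Figure: CongWin plot with the slow start (SS) region below ssthresh and the congestion avoidance (CA) region above it]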



Responses to Congestion
ˆ TCP assumes there is congestion if it detects a packet
loss
ˆ A TCP sender can detect lost packets via loss events:
• Timeout of a retransmission timer
• Receipt of 3 duplicate ACKs (fast retransmit)
ˆ TCP interprets a Timeout as a binary congestion signal.
When a timeout occurs, the sender performs:
 ssthresh is set to half the current size of the congestion
window:
ssthresh = CongWin / 2
 CongWin is reset to one:
CongWin = 1
 and slow-start is entered



Fast Recovery (differentiation btwn two loss events)
ˆ After 3 dup ACKs (fast Retransmit):
 ssthresh = CongWin/2
 CongWin = CongWin/2
 window then grows linearly
ˆ But after timeout event:
 CongWin = 1 MSS;
 window then grows exponentially
 to the threshold, then grows linearly

Philosophy:
• 3 dup ACKs indicates network capable of delivering some segments
• timeout before 3 dup ACKs is “more alarming”
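The practical difference between the two loss responses can be made concrete by counting how many round trips the sender needs to get back to the window it had when the loss occurred. The sketch below is an idealized model, not measured behavior: it assumes whole-segment windows, a doubling per RTT in slow start (capped at the threshold), an increase of one segment per RTT afterwards, and no further losses. The function name and event labels are hypothetical.

```python
def rtts_to_regain(window_at_loss, event):
    """Round trips needed to grow back to the window size at the loss."""
    if event not in ("triple_dup_ack", "timeout"):
        raise ValueError("event must be 'triple_dup_ack' or 'timeout'")
    ssthresh = window_at_loss // 2
    # 3 dup ACKs: restart from half the window; timeout: restart from 1.
    cwnd = ssthresh if event == "triple_dup_ack" else 1
    rtts = 0
    while cwnd < window_at_loss:
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)   # exponential growth to threshold
        else:
            cwnd += 1                        # linear growth
        rtts += 1
    return rtts
```

For a loss at a window of 16 segments, this model needs 8 round trips after three duplicate ACKs and 11 after a timeout, since the timeout case first has to climb from 1 back to the threshold.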
TCP Congestion Control
Initially:
    CongWin = 1;
    ssthresh = advertised window size;
New Ack received:
    if (CongWin < ssthresh) /* Slow Start */
        CongWin = CongWin + 1;
    else /* Congestion Avoidance */
        CongWin = CongWin + 1/CongWin;
Timeout:
    ssthresh = CongWin/2;
    CongWin = 1;
Fast Retransmission:
    ssthresh = CongWin/2;
    CongWin = CongWin/2;

Slow Start (exponential increase phase) is continued until CongWin reaches half of the level where the loss event occurred last time. CongWin is increased slowly after (linear increase in Congestion Avoidance phase).
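The pseudocode above translates almost line for line into a small Python class. This is a sketch for experimenting with the window rules only; it sends no packets, keeps CongWin in segments as the pseudocode does, and the class and method names are hypothetical.

```python
class CongestionWindow:
    """Tracks CongWin and ssthresh (in segments) across sender events."""

    def __init__(self, advertised_window):
        self.cong_win = 1.0
        self.ssthresh = float(advertised_window)

    def on_new_ack(self):
        if self.cong_win < self.ssthresh:
            self.cong_win += 1                    # slow start
        else:
            self.cong_win += 1 / self.cong_win    # congestion avoidance

    def on_timeout(self):
        self.ssthresh = self.cong_win / 2
        self.cong_win = 1.0

    def on_fast_retransmit(self):
        self.ssthresh = self.cong_win / 2
        self.cong_win = self.cong_win / 2
```

Feeding seven new ACKs to `CongestionWindow(8)` takes CongWin from 1 to 8; the next ACK adds only 1/8 of a segment because the sender has reached the threshold.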
Popular “flavors” of TCP
[Figure: cwnd window size (in segments) versus transmission round for TCP Tahoe and TCP Reno, with the ssthresh levels marked]



Summary: TCP Congestion Control
ˆ When CongWin is below Threshold, sender in slow-
start phase, window grows exponentially.
ˆ When CongWin is above Threshold, sender is in
congestion-avoidance phase, window grows linearly.
ˆ When a triple duplicate ACK occurs, Threshold set
to CongWin/2 and CongWin set to Threshold.
ˆ When timeout occurs, Threshold set to CongWin/2
and CongWin is set to 1 MSS.
ˆ The actual sender window size is determined based
on the congestion and flow control algorithms

SenderWin=min(RcvWin,CongWin)



TCP Congestion Control Summary
Event: ACK receipt for previously unacked data
State: Slow Start (SS)
TCP Sender Action: CongWin = CongWin + MSS, If (CongWin ≥ Threshold) set state to “Congestion Avoidance”
Commentary: Resulting in a doubling of CongWin every RTT

Event: ACK receipt for previously unacked data
State: Congestion Avoidance (CA)
TCP Sender Action: CongWin = CongWin + MSS * (MSS/CongWin)
Commentary: Additive increase, resulting in increase of CongWin by 1 MSS every RTT

Event: Loss event detected by triple duplicate ACK
State: SS or CA
TCP Sender Action: Threshold = CongWin/2, CongWin = Threshold, Set state to “Congestion Avoidance”
Commentary: Fast recovery, implementing multiplicative decrease. CongWin will not drop below 1 MSS.

Event: Timeout
State: SS or CA
TCP Sender Action: Threshold = CongWin/2, CongWin = 1 MSS, Set state to “Slow Start”
Commentary: Enter slow start

Event: Duplicate ACK
State: SS or CA
TCP Sender Action: Increment duplicate ACK count for segment being acked
Commentary: CongWin and Threshold not changed



TCP throughput
ˆ Q: what’s average throughput of TCP as function of window size, RTT?
 ignoring slow start
ˆ let W be window size when loss occurs.
 when window is W, throughput is W/RTT
 just after loss, window drops to W/2, throughput to W/(2·RTT).
 average throughput: 0.75 W/RTT
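The 0.75 factor is simply the mean of a window that ramps linearly from W/2 up to W. A quick numerical check, assuming the idealized sawtooth in which the window grows by one segment per RTT and is halved when it reaches W (the function name is hypothetical):

```python
def average_sawtooth_throughput(w_segments, rtt_sec):
    """Mean throughput (segments/sec) over one sawtooth cycle W/2 -> W."""
    windows = range(w_segments // 2, w_segments + 1)   # one value per RTT
    mean_window = sum(windows) / len(windows)
    return mean_window / rtt_sec
```

For W = 100 segments and an RTT of 1 second, the per-RTT windows 50, 51, ..., 100 average to 75 segments, which matches 0.75 W/RTT.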



TCP Futures: TCP over “long, fat pipes”

ˆ example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
ˆ requires window size W = 83,333 in-flight segments
ˆ throughput in terms of loss rate L:
throughput = (1.22 · MSS) / (RTT · √L)
ˆ L = 2·10^-10 Wow
ˆ new versions of TCP for high-speed
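Both numbers in the example can be reproduced directly from the two formulas on this slide. The sketch assumes throughput in bits/sec, RTT in seconds, and MSS in bytes (converted to bits inside the functions); the function names are hypothetical.

```python
def required_window_segments(throughput_bps, rtt_sec, mss_bytes):
    """In-flight segments needed to sustain the target throughput."""
    return throughput_bps * rtt_sec / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_sec, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_sec * throughput_bps)) ** 2
```

With 1500-byte segments, a 100 ms RTT, and a 10 Gbps target, the window comes out at about 83,333 segments and the tolerable loss rate at roughly 2·10^-10, about one lost segment in five billion.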



TCP Fairness
Fairness goal: if K TCP sessions share same
bottleneck link of bandwidth R, each should have
average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]



Why is TCP fair?
Two competing sessions:
ˆ Additive increase gives slope of 1, as throughput increases
ˆ multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes running to R, with the “equal bandwidth share” line; the trajectory alternates between “congestion avoidance: additive increase” and “loss: decrease window by factor of 2”]
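The convergence argument for two competing sessions can be checked with a small simulation. This is a deliberately simplified model, assuming both connections see the same RTT, both detect a loss in the same round whenever their combined rate exceeds the link capacity, and rates are in arbitrary units. The function name is hypothetical.

```python
def aimd_two_flows(rate1, rate2, capacity, rounds):
    """Simulate two AIMD flows sharing one bottleneck link."""
    for _ in range(rounds):
        if rate1 + rate2 > capacity:
            # loss: both flows decrease by a factor of 2
            rate1, rate2 = rate1 / 2, rate2 / 2
        else:
            # congestion avoidance: both flows add the same increment
            rate1, rate2 = rate1 + 1, rate2 + 1
    return rate1, rate2
```

Additive increase leaves the gap between the two rates unchanged, while every multiplicative decrease halves it, so flows that start at 1 and 80 on a link of capacity 100 end up with practically equal rates after a few dozen loss events.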



Fairness (more)
Fairness and UDP
ˆ Multimedia apps often do not use TCP
 do not want rate throttled by congestion control
ˆ Instead use UDP:
 pump audio/video at constant rate, tolerate packet loss
ˆ Research area: TCP friendly

Fairness and parallel TCP connections
ˆ nothing prevents app from opening parallel connections between 2 hosts.
ˆ Web browsers do this
ˆ Example: link of rate R supporting 9 connections;
 new app asks for 1 TCP, gets rate R/10
 new app asks for 11 TCPs, gets roughly R/2 (11 of the 20 connections)!



TCP Connection Management
Recall: TCP sender, receiver establish “connection” before exchanging data segments
ˆ initialize TCP variables:
 seq. #s
 buffers, flow control info (e.g. RcvWindow)
ˆ client: connection initiator
Socket clientSocket = new Socket("hostname","port number");
ˆ server: contacted by client
Socket connectionSocket = welcomeSocket.accept();

Three way handshake:
Step 1: client host sends TCP SYN segment to server
 specifies initial seq #
 no data
Step 2: server host receives SYN, replies with SYNACK segment
 server allocates buffers
 specifies server initial seq. #
Step 3: client receives SYNACK, replies with ACK segment, which may contain data
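The three steps can be traced with a toy model of the segments exchanged. Nothing is sent over a network here; the dictionary field names are hypothetical, and the acknowledgment numbers (the peer's initial sequence number plus one, modulo 2^32) are standard TCP behavior that the slide does not spell out.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32-bit values


def three_way_handshake(client_isn, server_isn):
    """Return the SYN, SYNACK, and ACK segments as plain dictionaries."""
    syn = {"flags": {"SYN"}, "seq": client_isn}
    synack = {
        "flags": {"SYN", "ACK"},
        "seq": server_isn,
        "ack": (syn["seq"] + 1) % SEQ_SPACE,      # SYN consumes one seq number
    }
    ack = {
        "flags": {"ACK"},
        "seq": (client_isn + 1) % SEQ_SPACE,
        "ack": (synack["seq"] + 1) % SEQ_SPACE,
    }
    return [syn, synack, ack]
```

For initial sequence numbers 100 (client) and 300 (server), the SYNACK acknowledges 101 and the final ACK acknowledges 301.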



TCP Connection Management (cont.)

Closing a connection:
client closes socket:
clientSocket.close();
Step 1: client end system sends TCP FIN control segment to server
Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN.
[Figure: client/server timeline: client close sends FIN, server replies with ACK; server close sends FIN, client replies with ACK; client enters timed wait, then closed]



TCP Connection Management (cont.)

Step 3: client receives FIN, replies with ACK.
 Enters “timed wait” - will respond with ACK to received FINs
Step 4: server receives ACK. Connection closed.
[Figure: client/server timeline: client closing sends FIN, server replies with ACK; server closing sends FIN, client replies with ACK; client in timed wait, then both sides closed]



TCP Connection Management (cont)

[Figure: TCP server lifecycle]
[Figure: TCP client lifecycle]



Tuning TCP/IP Parameters
ˆ TCP/IP parameters
 A set of default values may not be optimal for all applications.
 The network administrator may wish to turn on or off some
TCP/IP functions for performance or security considerations.
ˆ Many Unix and Linux systems provide some flexibility in
tuning the TCP/IP kernel.
ˆ /sbin/sysctl is used to configure the Linux kernel parameters at runtime.
 Default kernel configuration file is /etc/sysctl.conf.
 Frequently used sysctl options:
• sysctl -a or sysctl -A: list all current values.
• sysctl -p file_name: load the sysctl settings from a configuration file.
• sysctl -w variable=value: change the value of the parameter
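On Linux the same parameters are also exposed as files under /proc/sys, with the dots in a variable name mapped to directory separators. The Python sketch below reads a value that way instead of calling the sysctl command. The helper names are hypothetical, and the read simply returns None on systems where the file does not exist.

```python
from pathlib import Path, PurePosixPath


def sysctl_path(name, root="/proc/sys"):
    """Map a sysctl variable name to its file path under /proc/sys."""
    return PurePosixPath(root) / name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if unavailable."""
    path = Path(str(sysctl_path(name, root)))
    return path.read_text().strip() if path.is_file() else None
```

For example, `sysctl_path("net.ipv4.tcp_fin_timeout")` yields `/proc/sys/net/ipv4/tcp_fin_timeout`, the file that backs the variable of the same name.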
Some TCP Parameters in Linux Kernel
ˆ tcp_syn_retries
 Number of SYN packets the kernel will send before giving up on
the new connection.
ˆ tcp_synack_retries
 number of SYN+ACK packets sent before the kernel gives up on
the connection.
ˆ tcp_window_scaling
 Maximum window size of 65535 bytes is not enough for really fast networks. The window scaling option allows for almost gigabyte windows, which is good for connections with large delay-bandwidth product.
ˆ tcp_max_syn_backlog
 Maximum number of remembered connection requests that have not yet received an acknowledgment from the connecting client.
ˆ tcp_fin_timeout
 How many seconds to wait for a final FIN packet before
the socket is closed; required to prevent denial-of-service
(DoS) attacks. Default value is 60 seconds.
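The "almost gigabyte" figure quoted for tcp_window_scaling follows from the TCP window scale option: the 16-bit window field is shifted left by a negotiated count of at most 14. A short check, with hypothetical constant and function names:

```python
MAX_WINDOW_FIELD = 65535   # largest value of the 16-bit window field
MAX_SHIFT = 14             # largest shift count the option allows


def scaled_window(window_field, shift):
    """Effective window in bytes for a given window field and shift count."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    return window_field << shift
```

`scaled_window(65535, 14)` is 1,073,725,440 bytes, just under 2^30, compared with 65,535 bytes without scaling.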
Some TCP Parameters in Linux Kernel
ˆ tcp_rmem
 This is a vector of 3 integers: [min, default, max].
These parameters are used by TCP to dynamically adjust
receive buffer sizes.
 min - minimum size of the receive buffer used by each
TCP socket. The default value is 4K.
 default - the default size of the receive buffer for a
TCP socket. The default value is 87380 bytes, and is
lowered to 43689 in low memory systems. If larger
receive buffer sizes are desired, this value should be
increased.
 max - the maximum size of the receive buffer used by
each TCP socket. The default value of 87380*2 bytes is
lowered to 87380 in low memory systems.
ˆ tcp_wmem
 Send buffer parameters [min, default, max] similar to tcp_rmem.
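Because these buffer parameters are vectors of three integers, a tool that inspects them has to split the value into its min, default, and max parts. A minimal parsing sketch follows; the type and function names are hypothetical, and the sample string uses the default figures quoted above with 4K taken as 4096 bytes.

```python
from typing import NamedTuple


class BufferSizes(NamedTuple):
    minimum: int
    default: int
    maximum: int


def parse_buffer_vector(text):
    """Parse a '[min, default, max]' style value such as '4096 87380 174760'."""
    parts = text.split()
    if len(parts) != 3:
        raise ValueError("expected exactly three integers")
    return BufferSizes(*(int(p) for p in parts))
```

Parsing `"4096 87380 174760"` gives a minimum of 4096, a default of 87380, and a maximum of 174760, which is 87380*2.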
