
Protocol Analysis in a Complex Enterprise: The Importance of the Art of Recognition

June 16th, 2009

Hansang Bae
Senior VP | Citi (f.k.a. Citigroup)
hbae@nyc.rr.com
SHARKFEST '09
Stanford University
June 15-18, 2009

Challenges:
As it turns out, size does matter!
Citi's branch network spans 5,000+ locations in the US.
Citi's network infrastructure includes 30,000+ devices.
300,000 users are located in over 100 countries.
The number of servers in use is mind-numbingly large!
Compliance/Security Quagmire
Doing a full packet capture is difficult.
Tools in use include NetVCR and OPNET's ACE.
Wireshark is the only approved protocol analyzer at Citi. It dislodged past market

Act I: Much Ado about Nothing!


Old medical school saying: When you hear hooves beating, think horses and not zebras!
Server SA reports extreme slowness during file transfers: ICMP_BHNew2.pcap
What are the top issues that come to mind?
Server SA started a ping script, and it showed... ICMP_BHNew2ICMPOnly.pcap
Lessons Learned:
Learn to recognize what should and should not change as you go through the trace files.
RFC 1323 was not in play because they are on the same switch!
Take a few minutes to scan the trace files. Learn to trust your brain's ability to spot differences.
Know how protocols work so you can rule out red herrings. This is what separates techs from engineers.
Try not to filter. You might have missed the ARP frames in this trace. This is different than capturing in promiscuous mode.

Act II: Taming the SSH


Logging into a server via ssh takes over two minutes:
What are the top issues that come to mind for slow telnet/ssh login?
Let's capture and find out. Packet captures are like Shakira's hips. They don't lie!
SlowSSHLogin2.pcap
Lessons Learned:
Scroll through the trace to look for patterns. Again, trust your brain.
Develop a technique: a list of common filters to run through when troubleshooting, e.g. tcp.flags==0x02, tcp.analysis.flags
Don't forget UDP. What important function runs on UDP?
Do not blindly trust the TCP analysis. Wireshark can only know what you feed it. It too suffers from GIGO (Garbage In, Garbage Out).
Use the graphical tools available in Wireshark. A picture *IS* worth a thousand words!
Capture placement is important. If I captured at the client, I
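One way to keep such a filter checklist handy is to script it. The sketch below only builds `tshark -r <file> -Y <filter>` command lines and does not run them. The first two display filters are the ones named on the slide; the `udp` and `dns` entries and the helper name `build_commands` are assumptions added for illustration, not the presenter's list.

```python
import shlex

# Display filters to run through, in order. The first two come from the
# slide; the UDP and DNS entries are assumed additions for illustration.
CHECKLIST = [
    ("SYN packets only", "tcp.flags == 0x02"),
    ("TCP analysis flags", "tcp.analysis.flags"),
    ("All UDP traffic", "udp"),
    ("DNS lookups", "dns"),
]


def build_commands(capture_file, checklist=CHECKLIST):
    """Return (label, command line) pairs; nothing is executed here."""
    commands = []
    for label, display_filter in checklist:
        argv = ["tshark", "-r", capture_file, "-Y", display_filter]
        commands.append((label, shlex.join(argv)))
    return commands


if __name__ == "__main__":
    for label, command in build_commands("SlowSSHLogin2.pcap"):
        print(f"{label}: {command}")
```

Running the same list against every trace, before forming a theory, is one way to apply the "don't forget UDP" advice consistently.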

Act III: To Stream or Not to Stream?

Application developers report extreme slowness when FTPing a file.
SlowFtpAnon.pcap
What are the top issues that come to mind for slow ftp sessions?
Lessons Learned:
Scroll through the trace to look for patterns. Again, trust your brain.
Develop a technique: a list of common filters to run through when troubleshooting, e.g. tcp.flags==0x02, tcp.analysis.flags
Buffer tearing is pretty common. Applications are constantly trying to do TCP's job. App bytes can help you identify it. Learn to recognize it! (Oracle, MS SQL, Sybase: they all do it.)
Understand what streaming really means. TCP *HAS NO* byte boundaries.
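Because TCP delivers a byte stream, a receiver cannot assume that one application write arrives as one read. A minimal sketch of the usual remedy, length-prefixed framing, follows. The 2-byte big-endian length header, the sample messages, and the names `frame` and `Deframer` are assumptions for illustration and are not taken from the trace.

```python
import struct


def frame(message: bytes) -> bytes:
    """Prefix a message with its length as a 2-byte big-endian integer."""
    return struct.pack("!H", len(message)) + message


class Deframer:
    """Reassemble length-prefixed messages from arbitrarily sized chunks."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk: bytes):
        """Add received bytes; return the messages that are now complete."""
        self._buffer.extend(chunk)
        messages = []
        while len(self._buffer) >= 2:
            (length,) = struct.unpack("!H", self._buffer[:2])
            if len(self._buffer) < 2 + length:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buffer[2:2 + length]))
            del self._buffer[:2 + length]
        return messages


if __name__ == "__main__":
    stream = frame(b"USER anonymous") + frame(b"RETR file.bin")
    receiver = Deframer()
    # Deliver the bytes in 5-byte chunks, unrelated to the message sizes.
    for start in range(0, len(stream), 5):
        for message in receiver.feed(stream[start:start + 5]):
            print(message)
```

The receiver recovers the same two messages whatever chunk size the network happens to deliver, which is the property an application has to provide for itself on top of TCP.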

Act IV: Windows Tale


Call center servers are not able to keep up with call volume after a data center migration.
The servers are not getting the data fast enough, causing a backlog. What simple change can increase the throughput?
The path after the migration is longer by 50 ms.
MQSlow.pcap MQSlowPrint.txt
Lessons Learned:
If latency is causing a problem, look for RFC 1323 related problems.
Know what affects a transfer's throughput: buffer tearing, window sizes, or packet loss.
Use the graphical plots to zoom in on the problem, so let's look at the window size. Should we look at the receive or send window?
Argue your case. If you're right, you're right! But you had better be right. You earn your cred over time, but you can blow it in one shot!
Use the graphical tools available in Wireshark. A picture *IS* worth a thousand words!
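The effect of added latency on a window-limited transfer can be estimated with simple arithmetic: at most one window of data can be in flight per round trip. The sketch below assumes a 65,535-byte window (the largest possible without RFC 1323 window scaling), treats the extra 50 ms as round-trip time, and uses a 1 ms round trip before the migration purely for illustration. Only the 50 ms figure comes from the slide.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when limited by the window: W / RTT."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return target_bps * rtt_seconds / 8


if __name__ == "__main__":
    window = 65_535                  # largest window without window scaling
    rtt_before = 0.001               # assumed 1 ms round trip before the move
    rtt_after = rtt_before + 0.050   # path is longer by 50 ms

    for label, rtt in (("before", rtt_before), ("after", rtt_after)):
        mbps = max_throughput_bps(window, rtt) / 1e6
        print(f"{label}: RTT {rtt * 1000:.0f} ms, at most {mbps:.1f} Mbit/s")

    needed = window_needed_bytes(100e6, rtt_after)
    print(f"window needed for 100 Mbit/s after the move: {needed:.0f} bytes")
```

If the window needed exceeds 65,535 bytes, window scaling (or a larger socket buffer on the receiver) is the kind of simple change worth checking in the trace.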

Act IV: Windows Tale

Use Statistics > IO Graph to bring up this graph. Modify the highlighted items to bring up this view.


Act V: A User's Complaint


Smith Barney Financial Consultants are complaining of slow page load times for their home page. The problem is sporadic and random but happens enough that it's impacting their productivity.
The problem is widespread and not easily reproducible. Where do you start? What do you do? Who you gonna call?
What's common in the problem? Home page; use of load balancer; common backend servers; affecting many users.
What's the job of a load balancer?
Where should we take the trace?
What bad things can happen if you are using a load balancer with Source NAT configured?


Act V: A User's Complaint (cont)

LBProblemNew.pcap


Act V: A User's Complaint (cont)


Lessons Learned:
Start by looking at what infrastructure is in common for all users experiencing the problem.
What constitutes a TCP connection? 2-tuple? 4-tuple?
Remember that sequence numbers are nothing more than the number of bytes transferred. An acknowledgement is nothing more than an indication of how much of the data you received. If you receive something outside of what's expected, something went horribly wrong!
When you have a 22,000 user base, an ephemeral port range of 1024-5000 can be exhausted quickly.
Sometimes, you have to resort to turning off relative sequence numbers for analysis. This is especially true when load balancers or any device that NATs is in the data path.
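To see why source NAT makes a narrow ephemeral range dangerous: once the load balancer rewrites every client to its own address, three of the four tuple fields on the server side are constant, and only the source port distinguishes one connection from another. A rough sketch of that arithmetic follows. The addresses are the ones listed in the appendix; destination port 80, sequential port allocation, and the connection rate are assumptions made for illustration.

```python
from collections import namedtuple

# A TCP connection is identified by this 4-tuple.
Conn = namedtuple("Conn", "src_ip src_port dst_ip dst_port")


def snat(conn, lb_ip, lb_port, server_ip):
    """Rewrite a client-side connection as the LB's server-side connection."""
    return Conn(lb_ip, lb_port, server_ip, conn.dst_port)


def port_count(low=1024, high=5000):
    """Number of ports in an inclusive ephemeral range."""
    return high - low + 1


def seconds_until_reuse(new_conns_per_second, low=1024, high=5000):
    """Time for a sequentially allocated port range to wrap around."""
    return port_count(low, high) / new_conns_per_second


if __name__ == "__main__":
    clients = [
        Conn("10.2.53.102", 3011, "172.16.10.10", 80),
        Conn("10.17.97.111", 3011, "172.16.10.10", 80),
    ]
    # After source NAT, only the LB-chosen port tells the two users apart.
    for lb_port, conn in enumerate(clients, start=1024):
        print(conn, "->", snat(conn, "172.16.20.20", lb_port, "172.16.254.254"))

    print("ports available:", port_count())
    rate = 300  # hypothetical new connections per second through the LB
    print(f"at {rate}/s a port comes around again in "
          f"{seconds_until_reuse(rate):.1f} s")
```

If that wrap-around time is shorter than the time the web server holds a closed socket, a new connection can land on a 4-tuple the server still considers in use.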

Act V: A User's Complaint (cont)

LBTCPHandshake.pcap

Lessons Learned (cont): (Turn off relative sequence numbers)
Frames 1-8 contain the orderly close of a connection.
Frame 9, which occurs approx. 14 seconds later, is an attempt by a new client to open a connection to the LB. (Frame 10 is the LB-translated request to the web server.)
Frame 11 is an acknowledgement for the prior connection. This occurs because the web server still has this socket in FIN-WAIT. (Frame 12 is the translated frame from the LB to the client.)
Frames 13 and 14 are the RST generated by the client and its translation by the LB, respectively.
Frames 15-18 contain a connection creation. This is allowed to occur because of the RST. However, this causes the client to pause for approx. 3 seconds.
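Relative sequence numbers are computed per connection by subtracting the initial sequence number, which is why they can hide the fact that a stray acknowledgement belongs to an earlier connection on the same 4-tuple. The sketch below illustrates this with invented sequence numbers. The modulo 2^32 arithmetic and the SYN-SENT acceptability rule (ISS < SEG.ACK <= SND.NXT, otherwise send a reset) follow RFC 793; every numeric value is hypothetical.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def relative_seq(absolute_seq: int, isn: int) -> int:
    """Sequence number as an offset from the connection's initial value."""
    return (absolute_seq - isn) % SEQ_SPACE


def ack_acceptable_in_syn_sent(seg_ack: int, iss: int, snd_nxt: int) -> bool:
    """RFC 793 SYN-SENT check: acceptable only if ISS < SEG.ACK <= SND.NXT."""
    return 0 < (seg_ack - iss) % SEQ_SPACE <= (snd_nxt - iss) % SEQ_SPACE


if __name__ == "__main__":
    old_iss = 1_000_000          # hypothetical ISS of the closed connection
    new_iss = 3_500_000_000      # hypothetical ISS of the new SYN
    snd_nxt = new_iss + 1        # a SYN consumes one sequence number

    stale_ack = old_iss + 5_001  # server still acknowledging the old session
    good_ack = new_iss + 1       # what a proper SYN-ACK would acknowledge

    for name, ack in (("stale", stale_ack), ("expected", good_ack)):
        ok = ack_acceptable_in_syn_sent(ack, new_iss, snd_nxt)
        print(f"{name} ack {ack}: "
              f"{'accepted' if ok else 'rejected, client sends RST'}")
```

With absolute numbers displayed, the stale acknowledgement is visibly unrelated to the new SYN, which matches the advice to turn relative sequence numbers off when a NAT device is in the path.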

Act VI: As You Log It

DCMove_Original_LookAt197.pcap
DCMove_OneSideLookAt10-11-12.pcap
DCMove_BothSideLookAt918.pcap

After a data center migration, an application was no longer able to support the production traffic. The new data center was separated by 11 ms of round-trip latency. Before the move, both servers were located in the same DC.
Naturally, the first inclination was to blame the network! After all, the problem started after the migration.
The application generates a 3-byte alert message followed by another small packet with the actual data. What should be the first problem that comes to your mind?
What looked like a slam dunk turned out to be quite complicated!
In the Army, we had a saying: Be, Know, Do. It applies to packet analysis.
At the end of the day, in-depth knowledge of how TCP should work allowed us to find the problem.

Act VI: As You Log It (cont)


Lessons Learned:
Nagle and Delayed Acknowledgment deadlock is very common when TCP is used to shuttle small amounts of data.
This can be a killer when trading programs are involved.
Turning on application-level logging can help, but don't forget to turn it off!
Know what impact you can have if you decide to log. For us router jockeys, it's equivalent to doing a debug ip ospf on a production backbone router. Hint: not a good idea. It's a self-correcting error: if you do it once, you'll never do it again!
If you know how TCP really works, you can argue your point with conviction because deep down inside, you know you're right.
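Two common application-side mitigations for a small write followed by another small write (such as a 3-byte alert followed by its data) are to send both pieces in a single write, or to disable Nagle's algorithm with the `TCP_NODELAY` socket option. The sketch below shows both in Python. The message contents and helper names are hypothetical, and the slides do not say which remedy, if either, was applied to this application.

```python
import socket


def coalesce(alert: bytes, payload: bytes) -> bytes:
    """Join the small alert and its data so they leave in one write."""
    return alert + payload


def new_nodelay_socket() -> socket.socket:
    """Create an unconnected TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    message = coalesce(b"ALR", b"order-data")  # hypothetical contents
    print(len(message), "bytes would go out in a single send")

    sock = new_nodelay_socket()
    enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
    print("TCP_NODELAY set:", enabled)
    sock.close()
```

Coalescing the writes keeps Nagle's protection for other traffic, while `TCP_NODELAY` trades more small packets on the wire for lower latency.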

Appendix: IPs used in the examples

ACT I: ICMP_BHNew*.pcap
192.168.1.1 and 192.168.1.254 are servers on the same switch.
ACT II: SlowSSHLogin2.pcap
192.168.1.1 is the client. 172.16.50.50 is the ssh server. 192.168.75.75 and 192.168.200.200 are NIS+ servers.
ACT III: SlowFtpAnon.pcap
10.10.10.10 is the ftp server. 192.168.1.1 client is pulling the file from the server.
ACT IV: MQSlow.pcap
172.16.50.50 is the MQ server. 192.168.1.1 is the MQ client. The server is pushing the file to
the client.
ACT V: LBProblemNew.pcap
10.2.53.102 and 10.17.97.111 are users in different branches. 172.16.10.10 and 172.16.20.20 belong to the load balancer. 172.16.254.254 is the real web server. 172.16.10.10 is the end-user-facing IP of the LB, and 172.16.20.20 is the IP the LB uses for source NAT when talking to the real web server.
ACT VI: DCMove_*.pcap
192.168.1.102 and 172.16.1.125 are two servers involved in the transfer. Both send data
independently of one another.

Please email me at hbae@nyc.rr.com if you would like "The Tool" Visio macro.
