1 Introduction

Computers and networks need maintenance. They need hardware and software
upgrades, patches, and periodic re-architecting. This paper looks at a
structured way to perform those tasks. All system administrators (SAs) need
to do maintenance, and need to minimize the impact the maintenance has on
their customers.

These days people rely heavily on the computer and network infrastructure
to perform their daily work. They are not tolerant of ad-hoc approaches to
maintenance. Bringing a server down in the middle of the day in order to
replace a broken tape drive, or to add more disk space, is no longer
acceptable. SAs are expected to perform these tasks outside of normal
working hours. What we recommend is a structured approach to maintenance.
Schedule periodic maintenance windows, delay intrusive tasks until the next
maintenance window, and plan upgrades in advance of the need for them.

Scheduled maintenance windows will look different for different companies.
Small companies typically have less equipment and therefore only need short
maintenance windows. Fast-growing companies need frequent maintenance
windows. High-availability companies need maintenance windows that use
redundant systems to maintain service availability during the window.
Mid-sized companies need infrequent, but long maintenance windows. Large
companies need a combination of location-specific maintenance windows, and
"high-availability" maintenance windows for shared infrastructure.

This paper will concentrate on the long, infrequent maintenance windows,
and will include some specifics for high-availability sites. Other sites
use a subset of these techniques. We call the technique described here the
"flight director" technique, named after the role of the flight director in
NASA space launches.1

The flight director technique guides the activities before the window,
during execution, and after. It is summarized in Table 1.
Preparation   Schedule the window
              Pick a flight director
              Prepare change proposals
              Build a master plan
Execution     Disable access
              Shutdown sequence
              Execute plan
              Perform testing
Resolution    Announce completion
              Enable access
              Have a visible presence
              Be prepared for problems
2 Motivation

Some companies are willing to schedule regular maintenance windows for major
systems and networking work in return for better availability during normal
operations. Depending on the size of the site, this could be one evening and
night per month, or perhaps an entire weekend, from Friday evening to Monday
morning, once a quarter.

System administrators often like to have a maintenance window during which
they can take down any and all systems, and stop all services, because it
reduces complexity and makes testing easier. For example, in cutting email
services over to a new system, you need to transfer existing mailboxes as
well as switching the incoming mail feed to the new system. Trying to
transfer the existing mailboxes without cutting the feed and the read access
for those mailboxes, and yet ensure consistency is a very tricky problem.
But, if you can bring email services down while you do the transfer, it
becomes a lot easier. In addition, it is a lot easier to check that the
system is working correctly before you turn the mail feed and the read
access back on again than it is to deal with having dropped or bounced mail
if something didn't work quite right with the live cutover.

However, you will have to sell the concept to the company at large in terms
of a benefit to them, not in terms of it making the SA's life easier. That
means that you need to be able to promise better service availability the
rest of the time. In other words, you need to plan in advance: if you have
one maintenance window per quarter, you need to make sure that the work you
do this quarter will hold you through the end of the next quarter, so that
you won't need to bring the system down again. You should also be prepared
to provide metrics to back up your claims of higher availability2 from
before and after you have succeeded in getting scheduled maintenance
windows. A single large outage also can be much less annoying to users than
many little outages [LRNL97].

1 The origin of this terminology, and a number of the approaches used, was
with Paul Evans. Paul is an avid observer of the space program.
3 Scheduling

A maintenance window is by definition a short period of time in which a lot
of systems work needs to be performed. It is disruptive to the rest of the
company, and so the scheduling must be done in cooperation with the
customers. You need to find out what dates do and don't work for them. In
particular, you will almost certainly need to avoid the end-of-month,
end-of-quarter and end-of-fiscal-year dates so that the sales team can get
all of the rushed orders entered, and the accounting group can produce the
financials for that period. You will also need to avoid product release
dates, if that is relevant to your business. There may be major trade shows
or other demos that must be avoided. Universities have different constraints
around the academic year. Some businesses, such as toy and greeting card
manufacturers, may have seasonal constraints.
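The date constraints above can be screened mechanically before the dates are
discussed with customers. The following Python sketch is an illustration
only, not part of the technique as published. It assumes a weekend-long
window starting on a Friday, treats the last five days of any month as
off-limits (a stand-in for month, quarter and fiscal-year ends), and takes a
hypothetical set of blackout dates such as trade shows or product releases.
The names is_period_end and candidate_windows are our own.

```python
from datetime import date, timedelta


def is_period_end(day: date, margin_days: int = 5) -> bool:
    """True if `day` is one of the last `margin_days` days of its month.

    Assumption: month ends also cover quarter ends and, for a calendar
    fiscal year, the fiscal year end.
    """
    # Day 28 plus 4 days always lands in the following month.
    next_month_start = (day.replace(day=28) + timedelta(days=4)).replace(day=1)
    return (next_month_start - day).days <= margin_days


def candidate_windows(start: date, end: date, blackouts: set[date]) -> list[date]:
    """Return the Fridays in [start, end] whose Friday-to-Sunday weekend
    avoids every blackout date and every period end."""
    candidates = []
    day = start
    while day <= end:
        if day.weekday() == 4:  # Monday is 0, so 4 is Friday
            weekend = [day + timedelta(days=i) for i in range(3)]
            if not any(d in blackouts or is_period_end(d) for d in weekend):
                candidates.append(day)
        day += timedelta(days=1)
    return candidates
```

The output is only a list of candidates. The final choice still has to be
made in cooperation with the customers.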
4 Planning

As with all work on important systems, the tasks need to be planned by the
individuals performing them so that no original thought, or problem solving,
should be involved in performing the task on the day. There should be no
unforeseen events, only planned contingencies.

Planning for a maintenance window also has another dimension, however. Since
they occur only occasionally, the system administrators need to plan far
enough in advance so that they can get quotes, submit purchase orders, get
them approved, and have any new equipment arrive a week or so before the
maintenance window. Since the lead-time on some equipment can be 6 weeks or
more, this means starting to plan for the next maintenance window almost
immediately after the preceding one has just ended.
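The arithmetic behind that claim can be made explicit. The sketch below is
illustrative: the 6-week lead time and the 1-week arrival buffer come from
the text above, while the 2 weeks allowed for quotes and purchase-order
approval is an assumed figure, and the function name latest_order_date is
hypothetical.

```python
from datetime import date, timedelta


def latest_order_date(window_start: date,
                      lead_time_weeks: int = 6,
                      buffer_weeks: int = 1,
                      approval_weeks: int = 2) -> date:
    """Latest date to start the purchasing process for a given window.

    lead_time_weeks: vendor delivery time (6 weeks or more, per the text)
    buffer_weeks:    equipment should arrive about a week before the window
    approval_weeks:  assumed time for quotes and purchase-order approval
    """
    total_weeks = lead_time_weeks + buffer_weeks + approval_weeks
    return window_start - timedelta(weeks=total_weeks)


# A window starting Friday 14 June 2024 needs purchasing to begin by 12 April.
deadline = latest_order_date(date(2024, 6, 14))
```

With these figures the purchasing process has to start nine weeks ahead. For
a quarterly window, roughly thirteen weeks apart, that leaves about four
weeks after the previous window to decide what to buy.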
2 The most effective way of doing this is to use an existing monitoring
system that can produce historical graphs. The most popular ones at the time
of writing are MRTG [Oet98] and Cricket [All99].
5 Flight director

The flight director is the one who is responsible for crafting the
announcement notices and making sure that they go out on time. She is also
responsible for scheduling the submitted work proposals based on the
interactions between them and the staff required, deciding which ones (if
any) don't make the cut for that maintenance window, monitoring the progress
of the tasks during the maintenance window, ensuring that the testing occurs
correctly, and communicating status to the rest of the company at the end of
the maintenance window.

The person who fills the role of flight director must be a senior system
administrator who is capable of assessing work proposals from other members
of the SA team, and spotting dependencies and effects that may have been
overlooked. The flight director must also be capable of making judgment
calls on the level of risk versus need for some of the more critical tasks
that impact the infrastructure. She must have a good overview of the site
and the implications of all the work.

In addition, the flight director cannot perform any technical work of her
own during that maintenance window. This means that typically the flight
director is a member of a multi-person team, and the other members of the
team take on the work that would normally have been the responsibility of
that individual. The flight director is not normally a manager, unless the
manager was recently promoted from a senior SA position, because of the
skill requirements described above.
6 Change proposals

One week prior to the maintenance window, all change proposals should have
been submitted. A good way of managing the change proposal process is to
have all the change proposals online in a revision-controlled area. Each SA
edits documents in a directory with his name on it. The documents supply all
the required information. One week before the change, this
revision-controlled area is frozen and all subsequent requests to make
changes to the documents have to be made through the flight director.

A change proposal form should answer at least these questions:

- What changes are going to be made?
- What machines will you be working on?
- What are the pre-maintenance window dependencies, and due dates?
- What needs to be up for the change to happen?
- What will be affected by the change?
- Who is performing the work?
- How long will the change take in active time and elapsed time, including
  testing, and how many additional helpers will be needed?
- What are the test procedures? What equipment do they require?
- What is the back-out procedure and how long will it take?
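One way to keep proposals uniform is to give the form a fixed structure. The
Python sketch below is a possible layout, not a format the paper prescribes.
The class name, the field names and the checks in problems() are all our own
assumptions; each field maps to one of the questions listed above.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class ChangeProposal:
    """A single change proposal (hypothetical layout)."""
    summary: str                 # What changes are going to be made?
    machines: list[str]          # What machines will you be working on?
    owner: str                   # Who is performing the work?
    active_time: timedelta       # Hands-on time, including testing
    elapsed_time: timedelta      # Wall-clock time, including testing
    test_procedure: str          # What are the test procedures?
    backout_procedure: str       # What is the back-out procedure?
    backout_time: timedelta      # How long will the back-out take?
    prerequisites: list[str] = field(default_factory=list)   # Pre-window dependencies and due dates
    requires_up: list[str] = field(default_factory=list)     # What needs to be up?
    affects: list[str] = field(default_factory=list)         # What will be affected?
    helpers_needed: int = 0                                   # Additional helpers
    test_equipment: list[str] = field(default_factory=list)  # Equipment the tests require

    def problems(self) -> list[str]:
        """Return the reasons this proposal is not ready for the freeze."""
        issues = []
        if not self.machines:
            issues.append("no machines listed")
        if self.active_time > self.elapsed_time:
            issues.append("active time exceeds elapsed time")
        if not self.test_procedure.strip():
            issues.append("no test procedure")
        if not self.backout_procedure.strip():
            issues.append("no back-out procedure")
        return issues
```

A check of this kind could run when a proposal is committed to the
revision-controlled area, so that incomplete proposals are caught before the
freeze rather than by the flight director afterwards.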
7 The master plan

One week before the maintenance window, the flight director freezes the
change proposals, and starts working on a master plan. The master plan takes
into account all the dependencies, elapsed and active times for the change
proposals. The end result is a series of tables, one for each person,
showing them what task they will perform during what time interval and who
the coordinator for that task is. There is also a master chart that shows
all the tasks that are being performed over the entire time, who is
performing them, the team lead, and what the dependencies are. The master
plan must also take into account complete system-wide testing after all work
has been completed.
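A first draft of such a plan can be generated from the proposals. The sketch
below is a simplified illustration, not the authors' procedure: it orders
tasks by dependency with Python's graphlib.TopologicalSorter, starts each
task as soon as its prerequisites are done and its owner is free, and then
groups the result into one table per person. It assumes one owner per task,
whole-hour durations, and that every dependency is itself a task in the
plan. The function names and the task layout are hypothetical.

```python
from collections import defaultdict
from graphlib import TopologicalSorter


def build_master_plan(tasks):
    """tasks maps a task name to {"owner": str, "hours": int, "after": [names]}.

    Returns a dict of task name -> (start_hour, end_hour), counted from the
    start of the maintenance window. A dependency cycle raises
    graphlib.CycleError.
    """
    order = TopologicalSorter({name: t["after"] for name, t in tasks.items()})
    schedule = {}
    owner_free = defaultdict(int)  # hour at which each person is next free
    for name in order.static_order():  # prerequisites always come first
        task = tasks[name]
        deps_done = max((schedule[d][1] for d in task["after"]), default=0)
        start = max(deps_done, owner_free[task["owner"]])
        schedule[name] = (start, start + task["hours"])
        owner_free[task["owner"]] = start + task["hours"]
    return schedule


def per_person_tables(tasks, schedule):
    """Group the schedule into one time-ordered table per person."""
    tables = defaultdict(list)
    for name, (start, end) in sorted(schedule.items(), key=lambda kv: kv[1]):
        tables[tasks[name]["owner"]].append((start, end, name))
    return dict(tables)
```

Making system-wide testing a task that depends on every other task places it
at the end of the plan automatically. A real plan would also need to record
the coordinator and team lead for each task, which this sketch omits.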
If there are too many change proposals, the flight director will find that
scheduling all of them produces too many conflicts, either in terms of
machine availability, or in terms of the people required. There needs to be
slack in the schedule to allow for things to go wrong. The difficult
decisions about which projects should go ahead and which ones have to wait
should be made beforehand, rather than in the heat of the moment when
something is taking too long and blowing the schedule, and everyone is tired
and stressed. The flight director makes the call on when some change
proposals need to be cut out, and gets all the parties involved in the
scheduling conflict to discuss it and pick the best course for the company.
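Conflicts of this kind can be listed ahead of time so that the discussion
happens before the window. The sketch below is one possible check, under
assumptions of our own: each scheduled slot names a single resource (a
machine or a person), times are hours from the start of the window, and a
fixed slack_hours is required after every task. The name find_conflicts and
the slot layout are hypothetical.

```python
from itertools import combinations


def find_conflicts(slots, slack_hours=1):
    """slots is a list of (task, resource, start_hour, end_hour).

    Two tasks conflict when they use the same resource and their intervals
    overlap once each is extended by slack_hours.
    Returns a list of (resource, task_a, task_b).
    """
    conflicts = []
    for (t1, r1, s1, e1), (t2, r2, s2, e2) in combinations(slots, 2):
        if r1 == r2 and s1 < e2 + slack_hours and s2 < e1 + slack_hours:
            conflicts.append((r1, t1, t2))
    return conflicts


slots = [
    ("mail cutover", "mailhost", 0, 3),
    ("disk upgrade", "mailhost", 3, 5),
    ("router swap", "core-router", 0, 2),
]
tight = find_conflicts(slots, slack_hours=1)
```

Here the two mailhost tasks are back to back, so they are flagged as soon as
any slack is required. The list only identifies who needs to talk to whom.
Which proposal is cut remains the flight director's call.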
8 Disabling access

The very first task in the maintenance window is to disable or discourage
system access, and provide reminders that it is a maintenance window.
Depending on what the site looks like, and what facilities are available,
this process may involve:

- Placing notices on all doors into the campus buildings with the
  maintenance window times clearly visible.
- Disabling all remote access to the site, whether by dial-in, dedicated
  lines, wireless or over a public network.
- Making an announcement over the public address system in the campus
  buildings to remind everyone that systems are about to go down.
- Changing the helpdesk voicemail message to announce that this is a
  maintenance window, and stating when normal service should be restored.

These steps reduce the chance that someone will try to use the systems
during the maintenance window, which could cause inconsistencies, or
accidental loss or damage of their work. It also reduces the chance that the
person carrying the on-call pager will have to respond to urgent helpdesk
voicemails saying that the network is down.
9.3 Radios

Since the maintenance window is tightly scheduled, there are many
dependencies, and system administration work can be unpredictable at times,
everyone has to check with the flight director to let him or her know when
they are finished with a task, and before they start a new task, to make
sure that the prerequisite tasks have all been completed.
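The bookkeeping behind these check-ins is simple enough to sketch. The class
below is a hypothetical aid for the flight director, not something the paper
describes. It assumes the prerequisites of each task are already known from
the master plan, and records which tasks have been reported finished.

```python
class CheckInBoard:
    """Tracks finished tasks and answers "may I start?" questions."""

    def __init__(self, prerequisites):
        # prerequisites maps each task to the tasks that must finish first
        self.prerequisites = prerequisites
        self.finished = set()

    def report_finished(self, task):
        """An SA checks in to say a task is complete."""
        self.finished.add(task)

    def may_start(self, task):
        """Return (ok, outstanding), where outstanding lists the
        prerequisite tasks that have not yet been reported finished."""
        outstanding = [p for p in self.prerequisites.get(task, [])
                       if p not in self.finished]
        return (not outstanding, outstanding)


board = CheckInBoard({"restore mailboxes": ["shutdown", "install disks"]})
board.report_finished("shutdown")
ok, outstanding = board.may_start("restore mailboxes")
```

In this example the answer is "not yet", with "install disks" still
outstanding. A paper copy of the master chart with tasks ticked off serves
the same purpose.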
We recommend using hand-held radios to communicate within the group. Rather
than seeking out the flight director, an SA can just call her over the radio
to check in. Likewise, the flight director can contact the SAs to find out
status, and team members and team leaders can find each other, and
coordinate, over the radio. If someone needs extra help, they can also ask
for it over the radio. There are multiple radio channels, and long
conversations can move to another channel to keep the primary one free.

The radios are also essential for system-wide testing at the end of the
maintenance window, which will be described later, in Section 11.

If radios won't work, or work badly, in your data center due to RF shielding
put an internal phone extension with a long cord at the end of every row.
That way SAs in the data center can still communicate with other SAs while
working in the data center. At worst, they can go outside the data center,
raise someone on the radio and arrange to talk to her on a specific
telephone inside the data center.
6 Usually high availability sites avoid dependencies between machines as
much as possible.

7 For instance, how long does the routing system take to reach convergence
when one of the routers goes down or comes back up?
References

[All99] Jeff R. Allen. Driving by the rear-view mirror: Managing a network
with cricket. In First Conference on Network Administration (NETA '99),
pages 1-10, Santa Clara, California, April 7-10 1999. USENIX.

[LH01] Thomas A. Limoncelli and Christine Hogan. The Practice of System and
Network Administration. Addison-Wesley, August 2001, ISBN: 0201702711.

[LRNL97] Thomas A. Limoncelli, Tom Reingold, Ravi Narayan, and Ralph Loura.
Creating a Network for Lucent Bell Labs Research south. In Eleventh Systems
Administration Conference (LISA '97), page 123,