Project2 Talk

Checkpoint Based Recovery from Power Failures
Christopher Sutardja
Emil Stefanov
Goals
Consistent checkpoint
A consistent snapshot of memory for a specific time in the past.
Safe even under power failure
The checkpoint is never left in a transitional (partially written) state.
Small storage overhead
Not much more than double the memory.
Low performance overhead
Should not stall the processor for too long.
Scalable
Scales well in large core networks such as meshes.
Related Work
On the feasibility of incremental checkpointing
for scientific computing by J. Sancho et al.
Speculates about the future role of checkpointing
in parallel machines.
As the number of processing nodes grows
exponentially, failure of any one node becomes
much more likely.
Error correction codes and other redundancies
would introduce too much overhead when used
alone.
As a result, research on checkpoint recovery is
growing in importance.
Related Work
Modular Checkpointing for Atomicity
by L. Ziarek et al.
Introduces an abstraction called
stabilizers to make checkpointing easier.
Targets message-passing machines, which make
consistent checkpointing more challenging.
Related Work
SafetyNet: improving the availability of
shared memory multiprocessors with
global checkpoint/recovery by D. Sorin et al.
Explores the concept of checkpointing in
logical time.
Multiple checkpoints.
Each dirty cache line has a tag indicating
when it was modified relative to a checkpoint.
Low execution overhead.
Not safe from power failures.
Related Work
ReVive: cost-effective architectural support
for rollback recovery in shared-memory
multiprocessors by M. Prvulovic et al.
Explores different ways of rollback recovery in
shared-memory multiprocessor systems.
Considers:
the scope of the checkpoint
memory
checkpointing mechanism.
Achieves about 6% checkpointing overhead.

Not safe from power failures.
Not geared towards non-volatile memory:
requires fast writes.
Related Work
Efficient Initialization and Crash
Recovery for Log-based File Systems
over Flash Memory by Chin Wu et al.
As Flash Memory becomes cheaper and
denser, the uses for Flash increase.
Uses flash for recovering file systems.
Yet another use of flash for recovery.
Use a log-based method to accelerate
remounting after system crash by
minimizing the amount of information that
has to be changed upon reboot.
[Figures: system architecture diagrams. Labels: Core, L1, L2, DRAM, DRAM Checkpointer, Checkpoint Coordinator, Cache Checkpoint Controller, Checkpoint A, Checkpoint B, Address Decoder, Buffer (x4), Log (x4), Checkpoint (x4).]
Checkpointing Techniques
For Caches and Cores:
Each cache/core has two flash storage areas adjacent to it.
One is for the previous checkpoint.
One is for the current checkpoint.
During a checkpoint, the cache/core internal state is copied to flash storage.
For DRAM:
The checkpointing system snoops on DRAM.
DRAM changes are continuously logged to flash
memory.
A chain of parallel buffers ensures that DRAM
checkpointing almost never causes a stall.
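The DRAM logging path described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the talk: the class name `DramCheckpointer`, the modulo address decoder, the buffer capacity, and the stall counter are all hypothetical, and the log is assumed to record new values (a redo log). It shows how parallel buffers in front of the flash log absorb writes so that a stall only happens when one buffer fills up.

```python
from collections import deque


class DramCheckpointer:
    """Toy model: snooped DRAM writes pass through small parallel buffers
    into append-only flash logs (one buffer and one log per lane)."""

    def __init__(self, num_buffers=4, buffer_capacity=2):
        self.buffers = [deque() for _ in range(num_buffers)]
        self.logs = [[] for _ in range(num_buffers)]  # stands in for flash
        self.capacity = buffer_capacity
        self.stalls = 0  # how often a write had to wait for flash

    def _decode(self, addr):
        # Hypothetical address decoder: pick a lane by address modulo.
        return addr % len(self.buffers)

    def _drain_one(self, lane):
        # Move the oldest buffered change of this lane into its flash log.
        self.logs[lane].append(self.buffers[lane].popleft())

    def snoop_write(self, addr, value):
        lane = self._decode(addr)
        if len(self.buffers[lane]) >= self.capacity:
            # Buffer full: the write waits for one flash write (a stall).
            self.stalls += 1
            self._drain_one(lane)
        self.buffers[lane].append((addr, value))

    def background_drain(self):
        # Models flash absorbing one entry per lane while the core runs.
        for lane, buf in enumerate(self.buffers):
            if buf:
                self._drain_one(lane)

    def flush(self):
        # At a checkpoint, every buffered change must reach the flash log.
        for lane, buf in enumerate(self.buffers):
            while buf:
                self._drain_one(lane)
```

In this model, spreading writes over several lanes keeps each buffer below capacity, which is the intuition behind the "almost never causes a stall" claim; real stall behavior would depend on flash write latency and the write rate.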
Responsibilities of the Main Components
Checkpoint Coordinator
Notifies the nodes and DRAM checkpointers
that a checkpoint is beginning.
DRAM Checkpointer
Continuously logs DRAM changes.
Checkpoints when instructed by the
coordinator.
Cache Checkpoint Controller
Checkpoints the adjacent cache when
instructed by the coordinator.
Steps for Checkpointing (1 of 2)
1. The coordinator sets the checkpoint
signal to 1.
2. In parallel, each:
a. Core:
i. Pauses processing instructions.
ii. Copies internal state to flash memory.
b. Cache Checkpoint Controller:
i. Copies cache internal state to flash memory (data is copied one line at a time).
c. DRAM Checkpointer:
i. Flushes buffer to flash log.
ii. Notifies checkpoint coordinator that the buffer has been flushed.
Steps for Checkpointing (2 of 2)
3. The coordinator sets the checkpoint signal
to 0.
4. In parallel, each:
a. Core:
i. Flips flash memory bit to indicate the new checkpoint buffer.
b. Cache Checkpoint Controller:
i. Flips flash memory bit to indicate the new checkpoint buffer.
c. DRAM Checkpointer:
i. Marks checkpoint boundary in flash log.
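The four steps can be sketched as a two-phase protocol in software. This is an illustrative model, not the talk's implementation: `DoubleBufferedFlash`, `Coordinator`, the `BOUNDARY` marker, and the `power_fails_before_commit` flag are hypothetical names introduced here. It shows why the previous checkpoint stays usable if power is lost while the new one is still being written: new state goes only into the inactive slot, and a single bit flip publishes it.

```python
BOUNDARY = "CHECKPOINT_BOUNDARY"  # hypothetical marker written to the flash log


class DoubleBufferedFlash:
    """Two flash slots (Checkpoint A and B) plus one bit naming the valid slot."""

    def __init__(self, initial_state):
        self.slots = [dict(initial_state), {}]
        self.valid = 0  # index of the last completed checkpoint

    def write_inactive(self, state):
        # Step 2: copy live state into the slot that is NOT marked valid.
        self.slots[1 - self.valid] = dict(state)

    def commit(self):
        # Step 4: flipping one bit makes the new slot the valid checkpoint.
        self.valid = 1 - self.valid

    def latest(self):
        return self.slots[self.valid]


class Coordinator:
    def __init__(self, stores, dram_buffer, flash_log):
        self.stores = stores            # name -> DoubleBufferedFlash
        self.dram_buffer = dram_buffer  # buffered DRAM changes (list)
        self.flash_log = flash_log      # append-only DRAM change log (list)
        self.signal = 0

    def checkpoint(self, live_states, power_fails_before_commit=False):
        self.signal = 1                                # step 1
        for name, store in self.stores.items():        # step 2a, 2b
            store.write_inactive(live_states[name])
        self.flash_log.extend(self.dram_buffer)        # step 2c: flush buffer
        self.dram_buffer.clear()
        if power_fails_before_commit:
            return False                               # simulated power loss
        self.signal = 0                                # step 3
        for store in self.stores.values():             # step 4a, 4b
            store.commit()
        self.flash_log.append(BOUNDARY)                # step 4c
        return True
```

The sketch runs the "in parallel" steps sequentially and does not model a power failure in the middle of step 4, where some components have flipped their bit and others have not; how the real design handles that window is not stated in the slides.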
[Figures: per-core checkpoint storage and DRAM log layout. Labels: Core, L1, L2, Cache Checkpoint Controller, Checkpoint A, Checkpoint B, Address Decoder, Buffer (x4), Log (x4), Checkpoint (x4), FFFFFFFF, start, end, Buffered Changes, Previous Checkpoint Changes, Next Checkpoint Changes, Previous Checkpoint (random access).]
Recovering
1. Determining which checkpoint to use
a. System checks which checkpoint is the most recent.
b. If the most recent checkpoint was in progress during the crash, the older checkpoint is used.
2. Restoring previous state
a. Each architectural register is rewritten.
b. Each cache is written to by its adjacent flash buffer (one cache line at a time).
c. Main memory is recovered.
d. Take advantage of pipelined writes if available.
3. Resume execution
a. Resume program counter.
b. Notify the CPUs that the system is restoring from a checkpoint (single bit).
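The checkpoint selection and main-memory recovery steps can be sketched as follows. This is a minimal model under assumptions that the slides do not confirm: each checkpoint slot is assumed to carry a hypothetical sequence number `seq` and a `complete` flag, the flash log is assumed to hold new values (a redo log) separated by a hypothetical `BOUNDARY` marker, and changes logged after the last boundary are treated as belonging to an unfinished checkpoint and are discarded.

```python
BOUNDARY = "CHECKPOINT_BOUNDARY"  # hypothetical marker separating checkpoints in the log


def choose_checkpoint(slots):
    """Pick the most recent checkpoint that was fully written.

    Each slot is a dict with hypothetical fields 'seq', 'complete', 'state'.
    A slot that was still being written at the crash has complete=False.
    """
    complete = [slot for slot in slots if slot["complete"]]
    if not complete:
        raise RuntimeError("no complete checkpoint available")
    return max(complete, key=lambda slot: slot["seq"])


def recover_memory(base_image, log):
    """Rebuild main memory from a base image plus the logged changes.

    Changes are applied only once their closing BOUNDARY is seen, so
    entries after the last boundary never reach the recovered memory.
    """
    memory = dict(base_image)
    pending = {}
    for entry in log:
        if entry == BOUNDARY:
            memory.update(pending)  # this checkpoint completed; keep it
            pending.clear()
        else:
            addr, value = entry
            pending[addr] = value   # a later write to the same address wins
    return memory
```

For example, with a base image `{0: 1, 1: 2}` and a log `[(0, 5), BOUNDARY, (1, 9)]`, only the write to address 0 is restored, because the write to address 1 was logged after the last completed checkpoint.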
