
Checkpoint Based Recovery from Power Failures
Christopher Sutardja
Emil Stefanov

Goals
Consistent checkpoint
A consistent snapshot of memory for a specific time in the past.

Safe even under power failure
The checkpoint is never in transition.

Small storage overhead
Not much more than double the memory.

Low performance overhead
Should not stall the processor for too long.

Scalable
Scales well in large core networks such as meshes.

Related Work
On the feasibility of incremental checkpointing for scientific computing by J. Sancho et al.
Speculates about the future role of checkpointing in parallel machines.
As the number of processing nodes grows exponentially, failure of any one node becomes much more likely.
Error correction codes and other redundancies would introduce too much overhead when used alone.
As a result, research on checkpoint recovery is growing in importance.

Related Work
Modular Checkpointing for Atomicity by L. Ziarek et al.
Introduces an abstraction called stabilizers to make checkpointing easier.
Targets message-passing machines, where consistent checkpointing is more challenging.

Related Work
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery by D. Sorin et al.
Explores the concept of checkpointing in logical time.
Multiple checkpoints.
Each dirty cache line has a tag indicating when it was modified relative to a checkpoint.
Low execution overhead.
Not safe from power failures.

Related Work
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors by M. Prvulovic et al.
Explores different ways of rollback recovery in shared-memory multiprocessor systems.
Considers:
the scope of the checkpoint
memory
checkpointing mechanism
Achieves about 6% checkpointing overhead.
Not safe from power failures.
Not geared towards non-volatile memory: requires fast writes.

Related Work
Efficient Initialization and Crash Recovery for Log-based File Systems over Flash Memory by Chin Wu et al.
As flash memory becomes cheaper and denser, the uses for flash increase.
Uses flash for recovering file systems.
Yet another use of flash for recovery.
Uses a log-based method to accelerate remounting after a system crash by minimizing the amount of information that has to be changed upon reboot.

[Figure: system overview. Labels: DRAM modules, Cores (each with L1 and L2), DRAM Checkpointers, Checkpoint Coordinator.]

[Figure: core and cache checkpointing. Labels: Core, L1, L2, Cache Checkpoint Controller, Checkpoint A, Checkpoint B.]

[Figure: DRAM Checkpointer. Labels: Address Decoder feeding four parallel lanes, each with Buffer, Log, and Checkpoint.]
Checkpointing Techniques
For Caches and Cores:
Each cache/core has two flash storages adjacent to it.
One is for the previous checkpoint.
One for the current checkpoint.
During a checkpoint, the cache/core internal state is copied to flash storage.

For DRAM:
The checkpointing system snoops on DRAM.
DRAM changes are continuously logged to flash memory.
A chain of parallel buffers ensures that DRAM checkpointing almost never causes a stall.
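The DRAM logging path can be sketched roughly as follows. This is a minimal software model under stated assumptions, not the hardware design from the slides: the class name `DramCheckpointer`, the `buffer_capacity` parameter, and the use of Python lists for the buffer and the flash log are all hypothetical, and only a single buffer is modeled rather than a chain of parallel ones.

```python
class DramCheckpointer:
    """Hypothetical software model of a snooping DRAM checkpointer."""

    def __init__(self, buffer_capacity=4):
        self.buffer_capacity = buffer_capacity  # assumed size; the slides give none
        self.buffer = []      # fast volatile buffer absorbing snooped writes
        self.flash_log = []   # stands in for the slower non-volatile flash log

    def snoop_write(self, address, value):
        # Record the observed DRAM write; drain to flash only when the buffer fills.
        self.buffer.append((address, value))
        if len(self.buffer) >= self.buffer_capacity:
            self.flush()

    def flush(self):
        # Append buffered writes to the flash log in their original order.
        self.flash_log.extend(self.buffer)
        self.buffer.clear()


ckpt = DramCheckpointer(buffer_capacity=2)
ckpt.snoop_write(0x10, 1)
ckpt.snoop_write(0x14, 2)   # buffer is full, so both writes drain to the log
ckpt.snoop_write(0x18, 3)   # stays in the buffer until the next flush
```

The point of the buffer is that a DRAM write only has to wait for a fast append; the slow flash write happens in batches, which is one plausible reading of why stalls are rare.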

Responsibilities of the Main Components
Checkpoint Coordinator
Notifies the nodes and DRAM checkpointers that a checkpoint is beginning.

DRAM Checkpointer
Continuously logs DRAM changes.
Checkpoints when instructed by the coordinator.

Cache Checkpoint Controller
Checkpoints the adjacent cache when instructed by the coordinator.

Steps for Checkpointing (1 of 2)
1. The coordinator sets the checkpoint signal to 1.
2. In parallel, each:
   a. Core:
      i. Pauses processing instructions.
      ii. Copies internal state to flash memory.
   b. Cache Checkpoint Controller:
      i. Copies cache internal state to flash memory (data is copied one line at a time).
   c. DRAM Checkpointer:
      i. Flushes buffer to flash log.
      ii. Notifies checkpoint coordinator that the buffer has been flushed.

Steps for Checkpointing (2 of 2)
3. The coordinator sets the checkpoint signal to 0.
4. In parallel, each:
   a. Core:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   b. Cache Checkpoint Controller:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   c. DRAM Checkpointer:
      i. Marks checkpoint boundary in flash log.
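
The two phases (copy state, then flip a bit) can be illustrated with a small simulation. This is only a sketch under assumptions: `NodeCheckpoint`, `run_checkpoint`, and the `fail_before_flip` flag are hypothetical names, nodes are processed sequentially rather than in parallel, and the only failure modeled is a power loss between the two phases.

```python
class NodeCheckpoint:
    """Hypothetical model of a core or cache with two flash storages (A and B)."""

    def __init__(self, state):
        self.state = dict(state)
        self.slots = [dict(state), None]  # slot 0 = Checkpoint A, slot 1 = Checkpoint B
        self.valid_bit = 0                # which slot holds the last complete checkpoint

    def copy_state(self):
        # Step 2: write the live state into the slot that is NOT currently valid.
        self.slots[1 - self.valid_bit] = dict(self.state)

    def flip(self):
        # Step 4: a single bit flip makes the newly written slot the valid one.
        self.valid_bit = 1 - self.valid_bit

    def valid_checkpoint(self):
        return self.slots[self.valid_bit]


def run_checkpoint(nodes, fail_before_flip=False):
    for node in nodes:      # steps 1-2: signal set to 1, every node copies its state
        node.copy_state()
    if fail_before_flip:    # simulated power failure between the two phases
        return
    for node in nodes:      # steps 3-4: signal set to 0, every node flips its bit
        node.flip()


core = NodeCheckpoint({"pc": 100})
core.state["pc"] = 200
run_checkpoint([core], fail_before_flip=True)
interrupted = core.valid_checkpoint()   # the old checkpoint is still the valid one
run_checkpoint([core])
completed = core.valid_checkpoint()     # the new checkpoint is now valid
```

Because the new state is always written to the slot that is not marked valid, an interrupted copy never damages the checkpoint that recovery would use.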

[Figure: core and cache checkpointing. Labels: Core, L1, L2, Cache Checkpoint Controller, Checkpoint A, Checkpoint B.]

[Figure: DRAM Checkpointer flash layout. Labels: Address Decoder; four parallel Buffer / Log / Checkpoint lanes; log regions from "start" to "end" (FFFFFFFF) holding Buffered Changes, Previous Checkpoint Changes, and Next Checkpoint Changes; Previous Checkpoint (random access).]

Recovering
1. Determining which checkpoint to use
   a. System checks which checkpoint is the most recent.
   b. If the most recent checkpoint was in progress during the crash, the older checkpoint is used.
2. Restoring previous state
   a. Each architectural register is rewritten.
   b. Each cache is written to by its adjacent flash buffer (one cache line at a time).
   c. Main memory is recovered.
   d. Take advantage of pipelined writes if available.
3. Resume execution
   a. Resume program counter.
   b. Notify the CPUs that the system is restoring from a checkpoint (single bit).
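
The slides do not spell out how main memory is recovered, so the following is only one possible sketch. It assumes a hypothetical log format in which each entry is an `(address, value)` pair and a `BOUNDARY` marker closes each completed checkpoint; writes logged after the last marker belong to a checkpoint that was still in progress and are discarded. The names `recover_memory` and `base_image` are likewise assumptions.

```python
BOUNDARY = "BOUNDARY"  # hypothetical marker written when a checkpoint completes


def recover_memory(base_image, flash_log):
    """Replay logged writes up to the last checkpoint boundary; drop the rest."""
    memory = dict(base_image)
    # Locate the last completed checkpoint boundary in the log.
    last = -1
    for i, entry in enumerate(flash_log):
        if entry == BOUNDARY:
            last = i
    # Apply only the writes that precede that boundary, in log order.
    for entry in flash_log[:last + 1]:
        if entry != BOUNDARY:
            address, value = entry
            memory[address] = value
    return memory


log = [(0x10, 1), (0x14, 2), BOUNDARY, (0x10, 9)]
restored = recover_memory({0x10: 0, 0x14: 0, 0x18: 0}, log)
```

Here the write of 9 to address 0x10 comes after the last boundary, so it is ignored and memory returns to the state of the last completed checkpoint.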
