Goals
Consistent checkpoint
A consistent snapshot of memory for a specific
time in the past.
Scalable
Scales well in large core networks such as
meshes.
Related Work
On the feasibility of incremental checkpointing
for scientific computing by J. Sancho et al.
Speculates about the future role of checkpointing
in parallel machines.
As the number of processing nodes grows
exponentially, the chance that at least one node
fails becomes much higher.
Error correction codes and other redundancies
would introduce too much overhead when used
alone.
As a result, research on checkpoint recovery is
growing in importance.
Related Work
Modular Checkpointing for Atomicity
by L. Ziarek et al.
Introduces an abstraction called
stabilizers to make checkpointing easier.
Targets message-passing machines, which makes
consistent checkpointing more challenging.
Related Work
SafetyNet: improving the availability of
shared memory multiprocessors with
global checkpoint/recovery by D. Sorin et
al.
Explores the concept of checkpointing in
logical time.
Multiple checkpoints.
Each dirty cache line has a tag indicating
when it was modified relative to a checkpoint.
Low execution overhead.
Not safe from power failures.
Related Work
ReVive: cost-effective architectural support
for rollback recovery in shared-memory
multiprocessors by M. Prvulovic et al.
Explores different ways of rollback recovery in
shared-memory multiprocessor systems.
Considers:
the scope of the checkpoint
memory
checkpointing mechanism.
Related Work
Efficient Initialization and Crash
Recovery for Log-based File Systems
over Flash Memory by Chin Wu et al.
As Flash Memory becomes cheaper and
denser, the uses for Flash increase.
Uses flash for recovering file systems.
Yet another use of flash for recovery.
Uses a log-based method to accelerate
remounting after a system crash by
minimizing the amount of information that
has to be changed upon reboot.
[Figures: system architecture. Cores with L1 and L2 caches; a Cache
Checkpoint Controller per cache, each with Checkpoint A and Checkpoint B
storage; DRAM modules, some paired with a DRAM Checkpointer; a Checkpoint
Coordinator; and an Address Decoder feeding four parallel Buffer, Log, and
Checkpoint stages.]
Checkpointing Techniques
For Caches and Cores:
Each cache/core has two flash storages adjacent
to it.
One is for the previous checkpoint
One for the current checkpoint.
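
The two-storage arrangement above can be sketched in Python as a pair of alternating slots. This is only an illustration under stated assumptions, not the design from the slides: the `Slot` and `TwoSlotCheckpoint` names, the `epoch` counter, the `complete` flag, and the `fail_midway` switch are all hypothetical. The point it shows is that a new checkpoint always overwrites the older slot, so a failure during the write leaves the previous checkpoint intact.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Slot:
    """One flash checkpoint area (hypothetical model)."""
    epoch: int = -1            # checkpoint number stored here; -1 means empty
    complete: bool = False     # set only after the whole image is written
    image: Dict[str, int] = field(default_factory=dict)


class TwoSlotCheckpoint:
    """Alternates between two slots so one complete checkpoint always survives."""

    def __init__(self) -> None:
        self.slots = [Slot(), Slot()]  # "Checkpoint A" and "Checkpoint B"

    def latest_index(self) -> Optional[int]:
        # Index of the newest slot that was written completely, if any.
        done = [i for i, s in enumerate(self.slots) if s.complete]
        return max(done, key=lambda i: self.slots[i].epoch) if done else None

    def _target(self) -> int:
        # Overwrite the slot that does NOT hold the newest complete checkpoint.
        newest = self.latest_index()
        return 0 if newest is None else 1 - newest

    def write(self, epoch: int, state: Dict[str, int],
              fail_midway: bool = False) -> None:
        slot = self.slots[self._target()]
        slot.complete = False          # invalidate before overwriting
        slot.epoch = epoch
        slot.image = {}
        for i, (key, value) in enumerate(state.items()):
            if fail_midway and i == 1:
                return                 # simulate a failure partway through
            slot.image[key] = value
        slot.complete = True           # mark valid only at the very end

    def restore(self) -> Optional[Dict[str, int]]:
        i = self.latest_index()
        return None if i is None else dict(self.slots[i].image)
```

If a write is interrupted, `restore()` falls back to the other slot, which still holds the last checkpoint that finished.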
For DRAM:
The checkpointing system snoops on DRAM.
DRAM changes are continuously logged to flash
memory.
A chain of parallel buffers ensures that DRAM
checkpointing almost never causes a stall.
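
A rough Python model of the DRAM logging path may help make the role of the parallel buffers concrete. Everything specific here is an assumption for illustration: the `DramLogger` name, the choice of four lanes, the buffer depth, and the use of low-order address bits to pick a lane are not given in the slides. The sketch shows snooped writes being spread across several buffers, each draining to its own log, with a stall counted only when the selected buffer is already full.

```python
from collections import deque
from typing import Deque, List, Tuple

Write = Tuple[int, int]  # (address, value)


class DramLogger:
    """Hypothetical model: snooped DRAM writes go to per-lane buffers and logs."""

    def __init__(self, lanes: int = 4, depth: int = 8) -> None:
        self.depth = depth
        self.buffers: List[Deque[Write]] = [deque() for _ in range(lanes)]
        self.logs: List[List[Write]] = [[] for _ in range(lanes)]
        self.stalls = 0

    def lane_of(self, address: int) -> int:
        # Assumed decoding: low-order address bits select the lane.
        return address % len(self.buffers)

    def snoop_write(self, address: int, value: int) -> None:
        lane = self.lane_of(address)
        if len(self.buffers[lane]) >= self.depth:
            self.stalls += 1           # buffer full: the write would have to wait
            self.drain(lane)           # free one entry before accepting it
        self.buffers[lane].append((address, value))

    def drain(self, lane: int, count: int = 1) -> None:
        # Move up to `count` of the oldest buffered writes into the flash log.
        for _ in range(min(count, len(self.buffers[lane]))):
            self.logs[lane].append(self.buffers[lane].popleft())

    def drain_all(self) -> None:
        for lane in range(len(self.buffers)):
            self.drain(lane, len(self.buffers[lane]))
```

With writes spread evenly over the lanes, each buffer fills at a fraction of the total write rate, which is one plausible reading of why stalls would be rare.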
DRAM Checkpointer
Continuously logs DRAM changes.
Checkpoints when instructed by the
coordinator.
c. DRAM Checkpointer:
i. Flushes buffer to flash log.
ii. Notifies checkpoint coordinator that the buffer has
been flushed.
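
The flush-and-notify step can be sketched as a small handshake in Python. The slides list only what the DRAM checkpointer does at this step, so the rest is assumed for illustration: the `Coordinator` class, its `start_checkpoint` and `ack` methods, the `epoch` number, and the `("CHECKPOINT", epoch)` marker appended to the log are all hypothetical.

```python
from typing import List, Set


class DramCheckpointer:
    """Hypothetical DRAM checkpointer: buffers changes, flushes on request."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.buffer: List[tuple] = []
        self.flash_log: List[tuple] = []

    def record(self, address: int, value: int) -> None:
        self.buffer.append((address, value))

    def on_checkpoint(self, epoch: int, coordinator: "Coordinator") -> None:
        self.flash_log.extend(self.buffer)             # i. flush buffer to flash log
        self.buffer.clear()
        self.flash_log.append(("CHECKPOINT", epoch))   # assumed boundary marker
        coordinator.ack(self.name, epoch)              # ii. notify the coordinator


class Coordinator:
    """Hypothetical coordinator: commits once every checkpointer has flushed."""

    def __init__(self, checkpointers: List[DramCheckpointer]) -> None:
        self.checkpointers = list(checkpointers)
        self.pending: Set[str] = set()
        self.committed_epoch = -1

    def start_checkpoint(self, epoch: int) -> None:
        self.pending = {c.name for c in self.checkpointers}
        for c in self.checkpointers:
            c.on_checkpoint(epoch, self)

    def ack(self, name: str, epoch: int) -> None:
        self.pending.discard(name)
        if not self.pending:               # all buffers flushed
            self.committed_epoch = epoch
```

Waiting for every acknowledgement before committing is one simple way to keep the checkpoint consistent across all DRAM checkpointers.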
[Figures: a core with L1 and L2 caches, Cache Checkpoint Controllers, and
Checkpoint A / Checkpoint B storage; and the DRAM checkpointer's flash
layout, with an Address Decoder (address label FFFFFFFF), Buffered Changes,
Previous Checkpoint Changes (start), Next Checkpoint Changes (end), and the
Previous Checkpoint (random access).]
Recovering
1. Determining which Checkpoint to use
a.
b.
3. Resume Execution
a.
b.