
2013 27th International Conference on Advanced Information Networking and Applications Workshops

Encrypted Incremental Backup without Server-Side Software

Shih-Yu Lu
Cloud System Software Institute
Institute for Information Industry
Taipei City, Taiwan (R.O.C)
shiyosylu@iii.org.tw

Abstract: With the development of science and technology, the capacity of hard disks and the quantity of data are constantly increasing. Backup has become an important mechanism for
prevention of data loss. When the amount of data is small, full
backup is appropriate. However, if the amount of data is
considerably large, ensuring full backup every time becomes time
consuming. In this case, incremental backup is used for saving
time. Incremental backup does not create multiple full copies of the data; therefore, it is faster than full backup. To
solve the related security problem, encryption is necessary when
the backup data are stored in the storage server. However, most
incremental backup schemes cannot be encrypted. Thus, this paper presents a method for performing encrypted incremental backup without any additional server-side software. The encrypted
incremental backup collects information from every file and stores
it into a single file called the checksum file. This information
includes filename, checksum, last modified time, file size, delete
stamp, and encryption key. When the backup begins, the client
collects the filename, checksum, last modified time, and file size.
Then, the client gets another checksum file from the storage server.
By comparing these two checksum files, the system will know
which files have been changed and should be transmitted in the
backup this time. Before these files are sent to the storage server,
the client will generate random keys to encrypt the files and store
the encrypted keys in the checksum file. The system will use
another key (key encryption key; KEK) to encrypt the checksum
file; the administrator password is used for encrypting/decrypting
this KEK.
Keywords: Incremental Backup, Encrypted Backup, Restore

I. INTRODUCTION

Extensive network coverage and mobile devices have made information and files readily available. The security of business information and users' data has become pivotal for enterprises [1][2][3][4][5][6][7]. Therefore, there is a
need for security technology that can be used for protecting
backup data in a storage server. The term "encrypted backup" refers to backup data that have been encrypted
to keep data secure. If such a technology is used, then even
if other users get these encrypted data, they will not be able
to access the data content easily. On the other hand, because the amount of data used in day-to-day life has increased considerably in recent years, incremental backup has become increasingly important. The main concept of incremental backup is to back up only the difference between two file versions, as doing so reduces the transmission time and the amount of data transferred.

978-0-7695-4952-1/13 $26.00 © 2013 IEEE
DOI 10.1109/WAINA.2013.178
Unfortunately, as encryption protects the file content, it
may not allow the incremental backup algorithm [8] to
detect the parts of a file that have been changed since the last
backup. Therefore, most incremental backup methods do not
support encryption. In this paper, I propose a backup
mechanism that supports encryption and incremental backup.
In order to achieve this goal, we need to do the following: (1)
support incremental backup, (2) support encrypted backup,
(3) take multiple versions of backup/restoration, and (4)
support different storage servers. In summary, I divide the
proposed method into three main functions in this paper; the
overview is presented in Figure 1:
Check Files: To determine what parts of a file have been
changed since the last backup and need to be transmitted to
the storage server in the current backup.
Encryption: After the system generates the transmitted
file list, all files in this list will be encrypted before sending
files to the storage server.
Storage Server Switcher: Different users have different
storage server requirements, so a storage server switcher will
transmit data through a different application interface on the
basis of the storage server that the user chooses.
After the encryption of a file, the encrypted data will be
treated as normal data in the storage server, and the storage
server will need to have the ability to perform only a few
basic functions such as copy, move, and transmit file. When
a user wants to restore a file, the system can use the
checksum file to retrieve the encrypted file from the server
and decrypt it on the client side. In this way, the server does
not need additional software for backup; it becomes a simple
storage space because most of the computing is performed
on the client side. It is easy to implement this system. We do
not need to implement checksum, split file, and encryption
ourselves. We can use well-designed open-source software
to help us complete this system.
II. CHECK FILES

A. Check Files
Figure 1. Overview of Encrypted Incremental Backup

After the data are encrypted, the system cannot compare the original file with the encrypted file to determine the modified parts of the file. Therefore, the system must have the file information from the last backup to determine what parts of the file have been changed since then. In this
proposed method, the system will collect file information
before the backup; this file information includes the full path
of the file, last modified time, file size, checksum, delete
stamp, and encryption key. Hence, the system can compare
the two checksum lists to determine which part of the file
has been modified since the last backup.
Let T be the file size threshold, F be the collection of backup data, C(Fi) be the checksum of Fi where Fi ∈ F, S(Fi) be the file size, and N be the number of files in F. Assume that there are N files, each of a different size. If S(F1) ≤ T, the checksum of F1 will simply be C(F1). If S(F3) > T, F3 will be split into ⌈S(F3)/T⌉ blocks. Assume that there are I blocks in F3; this implies that F3 will have I + 1 checksums: C(F3_1), C(F3_2), ..., C(F3_I-1), C(F3_I), and C(F3). The details of the sequence are as follows:
1) Check whether there are backup data in the target storage server.
a) If there are no backup data in the storage server, the system will perform a full backup.
b) If there are backup data in the storage server, the system will request the checksum file from the storage server.
2) Start making the checksum file. The system first records the full path, last modified time, and file size of the backup data. Then, it starts comparing the two checksum lists.


a) If the last modified time and the file size of a file are the same in the two checksum lists, the file has not been changed since the last backup, so nothing needs to be done for it. Remove this file's information from the checksum list of the client.
b) If the file information differs between the two checksum files, the file needs to be backed up to the storage server or deleted. If the file information exists only in the checksum list of the storage server, add a delete stamp to this file in that checksum list.
3) Start generating the checksums of the files in the checksum list from the client.
a) If S(Fi) ≤ T, generate C(Fi), add it to the checksum list, and transmit the list.
b) If S(Fi) > T, split Fi into I blocks, generate the checksum for every block, and then go to step 2; do not compare the last modified time and the file size, but compare the checksum of every block.
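Steps 1)-3) above can be sketched as a small comparison routine. This is an illustrative sketch, not the paper's implementation; the record layout (a map from file path to last modified time and file size) and all names are assumptions.

```python
# Hypothetical in-memory checksum lists: path -> (last modified time, size).
# First pass: compare only metadata, as in step 2; checksums are computed
# later (step 3) and only for files that survive this pass.

def find_changed(client, server):
    """Return (to_backup, to_delete) from the two checksum lists."""
    to_backup, to_delete = [], []
    for path, meta in client.items():
        if server.get(path) == meta:
            continue                 # unchanged since last backup (2a)
        to_backup.append(path)       # new or modified file (2b)
    for path in server:
        if path not in client:
            to_delete.append(path)   # only on server: set delete stamp (2b)
    return to_backup, to_delete

client = {"a.txt": ("20120812131425.111", 100),
          "b.txt": ("20120903012256.234", 200)}
server = {"a.txt": ("20120812131425.111", 100),
          "b.txt": ("20120901000000.000", 150),
          "c.txt": ("20120801000000.000", 300)}

changed, deleted = find_changed(client, server)
# changed == ["b.txt"]; deleted == ["c.txt"]
```

Only the surviving entries in `changed` would then go through the per-file (or per-block) checksum generation of step 3.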
The basic principle is that the delete stamp should be changed from 0 to 1 when the file exists only in the checksum list obtained from the storage server, and only the revised parts of files are transmitted. In fact, the check-file step may be the most important part of the entire system, and it takes the most time to implement. There are so many scenarios to handle that, if the developers do not consider them carefully, the system may delete the wrong data or fail to back up necessary data.
B. Checksum File Format
The checksum file is one of the most important components of the proposed method, and Table I shows the format of the checksum file when the file size threshold is 300 kB. The first column, FN (File Name), includes the full
path and the filename of a file. The second column has the
last modified time, and its format is as follows:
YYYYMMDDhhmmss.nnn. FS stands for file size. Because
the file size threshold is 300 kB in this table and file1 and
file2 are smaller than 300 kB, they do not change. File3 is
bigger than 300 kB and hence will be split into two chunks.
Chunk 1 is 300 kB, and chunk 2 is 262 kB (562 kB - 300 kB
= 262 kB). CKS is the checksum value and depends on the
selected hash algorithm [9] [10] [11]. The fifth column D is
for the delete stamp; when D is equal to 1, it implies that this
file or chunk should be deleted or should not be transmitted.
When the system is in the backup step, if the system
modifies the delete stamp from 0 to 1, it implies that this file
no longer exists on the client now but it exists in the
previous version of the backup data in the storage server.
Hence, I use the delete stamp as an indicator for restoring
data from multiple backup versions and letting the system
know which files should be transferred from the storage server to the client when the system begins the restore operation. Key refers to the encryption key and is used for
encrypting a file; it is generated by a random key generator.
The format and the length of encryption key also depend on
the encryption algorithm [12][13][14]. Different algorithms
will generate different keys with different encryption
strengths.
The two most important parameters given in Table 1 are
CKS and Key; these two values depend on the algorithms or
functions that the developers use for implementing the
backup system. Further, they affect the performance of the
backup; in most cases, increased security means increased
computing, and increased computing needs relatively more
time to complete.
TABLE I. CHECKSUM FILE FORMAT

FN         LMT                  FS      CKS      D  Key
Path/F1    20120812131425.111   290390  C(F1)    0  34sdrt6y
Path/F2    20120903012256.234   153787  C(F2)    1  rewg645
Path/F3    20120831173073.729   562000  C(F3)    0  7yjf8qil
Path/F3_1  20120831173073.729   300000  C(F3_1)  0  3098ller
Path/F3_2  20120831173073.729   262000  C(F3_2)  0  u63jduk
III. ENCRYPTION

Encryption is the process of transforming data using an algorithm to make the data unreadable to anyone except
those possessing a special key [16]. Using encryption to
protect certain information and data from other people is a
very common procedure. In this paper, I do not propose my
own security algorithm; I only propose a flow that shows
how to use an encryption function in the backup and restore
process.
Figure 2 shows the detailed process of encryption. After
the check file step, the system will generate a transmitting
file list and encrypt every file in the transmitting file list.
The key generator will produce random keys for encryption.
Every file has different encryption keys, and these keys are
written into the checksum file. The system will encrypt the
checksum file by using KEK; KEK is a hashed combination
of the user password and a random key. For increasing the
difficulty of decryption, the user password is used for
encrypting the random key, and the encrypted key is stored
in the storage server. Thus, even if people know the user
password or the random key, they will still not be able to
decrypt the data easily.
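A minimal sketch of the key handling described above, using only standard-library primitives: the XOR wrapping with a PBKDF2-derived stream stands in for a real cipher such as AES [12], and the parameter values and names are assumptions, not the paper's concrete design.

```python
import hashlib, os

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

password = b"admin-password"
salt = os.urandom(16)

# A random key is generated; the KEK is a hashed combination of the
# user password and this random key.
random_key = os.urandom(32)
kek = hashlib.sha256(password + random_key).digest()

# The password encrypts the random key; only this wrapped form (plus
# the salt) would be stored in the storage server.
stream = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000, dklen=32)
wrapped_key = xor_bytes(random_key, stream)

# At restore time, the password unwraps the key and the KEK is rebuilt.
recovered = xor_bytes(wrapped_key, stream)
rebuilt_kek = hashlib.sha256(password + recovered).digest()
# rebuilt_kek == kek
```

Because the KEK depends on both the password and the wrapped random key, knowing only one of the two is not enough to decrypt the checksum file, which matches the motivation given above.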


Figure 2. Encryption

Encryption is used during backup to store data in the storage server, and decryption is used during restore so that the data owner can use the data. Figure 3 shows the details of the decryption process. If we want to decrypt data, we need to get the encryption keys stored in the checksum file. Because the checksum file is protected by the KEK, we first have to obtain the key combination and then hash it with a specific hash function chosen by the developer. To obtain the key combination, the administrator password is first used to decrypt the encrypted random key; this key is then combined with the administrator password, and the specific hash function is used to hash the key combination. In this way, we obtain the key required for decrypting the checksum file. The system then uses the hashed key combination to decrypt the encrypted checksum file and read the encryption key of every file. After we obtain the encryption key of every file, we can start decrypting the files for the restore operation.

C. Notice
The following points need to be noted. First, the reason the system does not generate checksums while it writes the last modified time and the file size into the checksum file is performance. Computing checksums requires more computing power and costs more time; in particular, with the many files on a modern computer, generating a checksum for every piece of backup data would make the backup process take much longer. To save backup time in the check-file step, I designed two file-check steps in this study. Second, the last modified time and the file size are the indicators in the first file-check step. In order to avoid the collision problem [15], the last modified time should have millisecond accuracy, and the unit of file size should be bytes. Assume that there are two different files with the same last modified time, file size, and path, one saved on the client and one on the storage server; the system will not transmit these files, and this will be a big problem.
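The restore-side ordering in this section (unlock the checksum file with the KEK, then read each file's key) can be sketched like this; the toy hash-counter stream cipher and the one-line checksum-file format are illustrative assumptions, not the paper's design.

```python
import hashlib, os

def stream(key: bytes, n: int) -> bytes:
    """Toy keystream from a hash counter; a stand-in for a real cipher."""
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def xor(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream(key, len(data))))

# The checksum file stores each file's random encryption key.
file_key = os.urandom(32)
checksum_file = b"Path/F1," + file_key.hex().encode()

kek = hashlib.sha256(b"admin-password" + os.urandom(32)).digest()
encrypted = xor(checksum_file, kek)

# Restore: decrypt the checksum file with the KEK, then read out the
# per-file key that will decrypt the file itself.
plain = xor(encrypted, kek)
path, key_hex = plain.split(b",")
recovered_key = bytes.fromhex(key_hex.decode())
# recovered_key == file_key; path == b"Path/F1"
```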

Figure 3. Decryption

IV. STORAGE SERVER SWITCHER

At this point, the system already has the transmitting file list and has encrypted all the files in the list. Now, we have to decide which storage server to store the encrypted data in. Considering that customers have different storage servers that they want to back up to, this backup system must have the ability to store data in different storage servers. In most cases, backup software targets Windows, Linux, and Mac, but customers now have another choice: network storage services. Network storage services have become increasingly popular in the last few years (e.g., Amazon S3, Dropbox, Microsoft SkyDrive, and Google Drive). Users upload, download, back up, and share files using these services. These service providers not only offer application software that can be installed on computers and mobile devices but also offer application interfaces for synchronizing client and server data in real time. Hence, a developer can use these application interfaces to connect to a network storage service. Combining the abovementioned two kinds of storage servers may be a good way to enhance the user experience.

As users now have more choices for storing their data, developers have to implement many sets of application interfaces to communicate with the different storage servers, as shown in Figure 4. The user interface should change when the user chooses a different storage server because the application interfaces of different storage servers are different.

Figure 4. Storage Switcher

V. MULTIPLE VERSIONS OF BACKUP AND RESTORATION

In this section, I will show how to use the checksum file to obtain multiple versions of backup and restoration. Figure 5 shows an example in which the maximum version limitation is two versions.
1) The first backup is a full backup; the system will back up all data to the storage server, and checksum file version 1 has information about every file. We assume that the first backup is stored in a folder named V1.
2) When the second backup begins, the system only transmits the revised parts of the files and chunks to the storage server and generates checksum file version 2, which has the information of the transmitted files and chunks.
3) In this example, the system then begins the third backup. The system first checks the number of backup versions in the storage server. If the number of versions is greater than or equal to the maximum version limitation, the system moves all the data from the V2 folder to the V1 folder, and the two checksum files are merged into a new checksum file version 1. After all the merge processes are completed, the system begins backing up data to the storage server.

Figure 5. Backup process when the maximum version limitation is two
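The merge in step 3) above can be sketched as follows, assuming each checksum file is a map from path to its Table I fields; treating a delete stamp of 1 as "drop from the merged base" is my reading of the scheme, not something the paper states explicitly.

```python
# Merge checksum file version 2 into version 1 to form the new base:
# version 2 entries override the base, and delete-stamped entries are
# dropped because the file no longer exists on the client.

def merge_versions(v1: dict, v2: dict) -> dict:
    merged = dict(v1)
    for path, entry in v2.items():
        if entry["D"] == 1:
            merged.pop(path, None)   # deleted since version 1
        else:
            merged[path] = entry     # revised file or chunk
    return merged

v1 = {"Path/F1": {"CKS": "c1", "D": 0},
      "Path/F2": {"CKS": "c2", "D": 0}}
v2 = {"Path/F2": {"CKS": "c2new", "D": 0},
      "Path/F1": {"CKS": "c1", "D": 1}}

base = merge_versions(v1, v2)
# base == {"Path/F2": {"CKS": "c2new", "D": 0}}
```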

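The storage server switcher of Section IV can be sketched as a table of interchangeable backends behind one small interface (copy, move, and transmit are all the server side needs); the class and method names here are assumptions for illustration, not a real storage API.

```python
class LocalBackend:
    """Stores files in memory; stands in for a plain file server."""
    def __init__(self):
        self.files = {}
    def transmit(self, name: str, data: bytes):
        self.files[name] = data

class CloudBackend(LocalBackend):
    """Would wrap a network storage API (e.g., Amazon S3) in practice."""
    pass

BACKENDS = {"local": LocalBackend, "cloud": CloudBackend}

def make_backend(choice: str):
    """The switcher: pick an implementation by the user's choice."""
    return BACKENDS[choice]()

server = make_backend("local")
server.transmit("V1/F1.enc", b"\x01\x02")
# server.files == {"V1/F1.enc": b"\x01\x02"}
```

Because the encrypted data are opaque bytes to the server, any backend that can store and return them unchanged fits this interface.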

A complete backup system can not only back up data from the client to the storage server but also restore data from the storage server. Hence, in this part, I will discuss the restoration of data from the storage server by using the checksum file. Figures 6, 7, and 8 show the scenarios in which the user restores data from three different versions of the backup. Let Fn be file n, Fn' be the new (modified) file n, and Fn_m be the m-th chunk of file n.
If we want to restore from the incremental backup data,
we need to access the storage server that has the base data
for the restore operation. In this method, I always use the
data in folder V1 as the base data. The other folders store the
modified parts of the files and chunks.
Figure 6 shows the scenario in which the user restores data from the second version of the backup. We consider the data in folder V1 as the base data and the data in folder V2 as the modified data. When the user chooses the version 2 data for restoration, we first request both checksum files, version 1 and version 2, from the storage server. After we get these two checksum files, the system merges them into a new transmitting file list containing information about the files and chunks that need to be transmitted from the server to the client. In this case, the user modified file3 and the third chunk of file2, so we keep these two pieces of data in the transmitting file list; the other data are obtained from folder V1.
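The Figure 6 merge can be sketched as follows: entries found in the version 2 checksum file are fetched from folder V2, delete-stamped entries are skipped, and everything else comes from the base folder V1. The data layout is an assumption for illustration.

```python
def restore_sources(v1_names, v2_entries):
    """Map each file/chunk to the folder it should be fetched from."""
    sources = {name: "V1" for name in v1_names}     # base data
    for name, delete_stamp in v2_entries.items():
        if delete_stamp:
            sources.pop(name, None)   # do not restore deleted files
        else:
            sources[name] = "V2"      # revised copy wins
    return sources

v1 = ["F1", "F2_1", "F2_2", "F2_3", "F3"]
v2 = {"F3": 0, "F2_3": 0}   # the user modified file3 and chunk 3 of file2

src = restore_sources(v1, v2)
# src == {"F1": "V1", "F2_1": "V1", "F2_2": "V1", "F2_3": "V2", "F3": "V2"}
```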

As shown in Figure 7, the user modifies file1 and deletes file4, so there is a delete stamp on the information of file4, and the information of the new file1 is in checksum file version 3. The only difference from Figure 6 is that the version 3 data are based on version 2, so the F2_3 and F3 data are in version 3. Finally, we obtain F1 from folder V3, F2_1 and F2_2 from folder V1, and F2_3 and F3 from folder V2. Because F4 no longer exists in backup data version 3, we do not need to transmit F4. Figure 8 is similar to Figure 7; it shows the restoration from backup data version 4.

Figure 6. Restoration from backup data version 2

Figure 7. Restoration from backup data version 3

Figure 8. Restoration from backup data version 4

VI. CONCLUSION

This paper presents an encrypted incremental backup method and system. By pre-collecting information from every file, developers can implement this method easily. Many parts of the system are functional programs, such as file splitting, file combining, encryption, decryption, and hash functions. If a developer wants to communicate with a cloud storage service, there are many APIs (application interfaces) provided by the service providers that can be used for the implementation.

There are still two issues in this system. The first is key management; the second is the choice of the file size threshold. Both issues stem from the same source: system performance.

VII. ACKNOWLEDGMENT

The authors would like to thank Enago (www.enago.tw) for the English language review.

REFERENCES

[1] S. Chaitanya, B. Urgaonkar, A. Sivasubramaniam, "Multilevel Crypto Disk: Secondary Storage with Flexible Performance Versus Security Trade-offs," 2010 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, pp. 434-436, 2010.
[2] M. Liang, C. Chang, "Research and design of full disk encryption based on virtual machine," 3rd IEEE International Conference on Computer Science and Information Technology, pp. 642-646, 2010.
[3] R. Prabhakar, Seung Woo Son, C. Patrick, S. Narayanan, M. Kandemir, "Securing Disk-Resident Data through Application Level Encryption," Fourth International Workshop on Security in Storage, pp. 46-57, 2007.
[4] C. Gebhardt, A. Tomlinson, "Secure Virtual Disk Images for Grid Computing," Third Asia-Pacific Conference on Trusted Infrastructure Technologies, pp. 19-29, 2008.
[5] J. He, M. Xu, "Research on Storage Security Based on Trusted Computing Platform," International Symposium on Electronic Commerce and Security, pp. 448-452, 2008.
[6] F. Hou, N. Xiao, F. Liu, H. He, "Secure Disk with Authenticated Encryption and IV Verification," Fifth International Conference on Information Assurance and Security, pp. 41-44, 2009.
[7] J. Li, H. Yu, "Trusted full disk encryption model based on TPM," 2nd International Conference on Information Science and Engineering, pp. 1-4, 2010.
[8] A. Tridgell, P. Mackerras, "The rsync algorithm," available at: http://cs.anu.edu.au/techreports/1996/TR-CS-96-05.html, Retrieved 2012-02-28.
[9] R. L. Rivest, "The MD5 Message-Digest Algorithm," Internet RFC 1321, April 1992.
[10] R. Housley, "A 224-bit One-way Hash Function: SHA-224," available at: http://tools.ietf.org/html/rfc3874, Retrieved 2012-05-12.
[11] D. Eastlake, "US Secure Hash Algorithms (SHA and SHA-based HMAC and HKDF)," available at: http://tools.ietf.org/html/rfc6234, Retrieved 2012-05-26.
[12] J. Daemen, V. Rijmen, The Design of Rijndael: AES - The Advanced Encryption Standard, Springer-Verlag, 2002, ISBN 3-540-42580-2.
[13] B. Schneier, "Description of a New Variable-Length Key, 64-bit Block Cipher (Blowfish)," Fast Software Encryption 1993, pp. 191-204.
[14] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, N. Ferguson, "Twofish: A 128-Bit Block Cipher," available at: http://www.schneier.com/paper-twofish-paper.pdf, Retrieved 2012-03-25.
[15] X. Wang, H. Yu, "How to Break MD5 and Other Hash Functions," EUROCRYPT 2005, ISBN 3-540-25910-4.
[16] Wikipedia, "Encryption," available at: http://en.wikipedia.org/wiki/Encryption, Retrieved 2012-07-30.
