
Presented by Anubhav, Saurav, Ravi, Ashutosh (ASRA Group), CSE/2k7
Guided by Prof. Binod Kumar

ASRA Group

13/07/2011

1. Introduction
2. Motivation
3. Achieving Anonymity via Clustering
4. Proposed algorithm
5. Experimental result
6. Conclusion
7. Future Work


Data holders and statistics offices are facing tremendous demand for person-specific data for applications such as:
Data mining
Cost analysis
Fraud detection


How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified, while the data remains practically useful for survey work?

k-Anonymity Model


Quasi-identifiers (approximate foreign keys): Zipcode, Age, Gender. Taken together, they can uniquely identify you!
Sensitive attribute: Disease

Zipcode  Age  Gender  Disease
75275    22   Male    Flu
75277    23   Male    Cold
75278    24   Male    Diabetes
75275    33   Male    Flu
75275    38   Female  Arthritis
75275    36   Female  Heart problem


Identifying attributes: Mobile number, Name
Quasi-identifiers (approximate foreign keys): Zipcode, Gender, Age
Sensitive attribute: Disease

Mobile number  Name    Zipcode  Gender  Age  Disease
9905150112     Amit    75275    Male    22   Flu
9905121223     John    75277    Male    23   Cold
9431103097     Rajan   75278    Male    24   Diabetes
9334292352     Robin   75275    Male    33   Flu
9431109087     Ramesh  75275    Female  38   Arthritis
9421345678     Dhoni   75275    Female  36   Arthritis


Quasi-identifiers (approximate foreign keys): Age, Gender, Zip code
Sensitive attribute: Disease

Age  Gender  Zip code  Disease
22   Male    75275     Flu
23   Male    75277     Cold
24   Male    75278     Diabetes
33   Male    75275     Flu
38   Female  75275     Arthritis
36   Female  75275     Heart problem



Quasi-identifiers (approximate foreign keys): Zip Code, Gender, Age

Zip Code  Gender  Age  Disease   Expense
75277     Male    22   Flu       100
75277     Male    23   Cancer    3000
75277     Male    24   HIV+      5000
75275     Male    33   Diabetes  2500
75275     Female  38   Diabetes  2800
75275     Female  36   Diabetes  2600

Zip Code  Gender  Age      Disease   Expense
7527*     Person  [21-30]  Flu       100
7527*     Person  [21-30]  Cancer    3000
7527*     Person  [21-30]  HIV+      5000
7527*     Person  [31-40]  Diabetes  2500
7527*     Person  [31-40]  Diabetes  2800
7527*     Person  [31-40]  Diabetes  2600
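
The generalization on this slide can be checked mechanically: a table is k-anonymous when every combination of quasi-identifier values appears in at least k rows. The Python sketch below is only an illustration under stated assumptions: `generalize` and `is_k_anonymous` are hypothetical helper names (not from the slides), the zip code is truncated to four digits, gender is replaced by "Person", and age is mapped to a ten-year band.

```python
from collections import Counter

def generalize(record):
    """Generalize one raw record (hypothetical helper mirroring this slide)."""
    zip_code, gender, age, disease, expense = record
    low = (age - 1) // 10 * 10 + 1            # 22 -> 21, 38 -> 31
    return (zip_code[:4] + "*", "Person", f"[{low}-{low + 9}]", disease, expense)

def is_k_anonymous(records, qi_indexes, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[i] for i in qi_indexes) for r in records)
    return all(n >= k for n in counts.values())

# Raw table: Zip Code, Gender, Age, Disease, Expense
raw = [
    ("75277", "Male", 22, "Flu", 100),
    ("75277", "Male", 23, "Cancer", 3000),
    ("75277", "Male", 24, "HIV+", 5000),
    ("75275", "Male", 33, "Diabetes", 2500),
    ("75275", "Female", 38, "Diabetes", 2800),
    ("75275", "Female", 36, "Diabetes", 2600),
]
released = [generalize(r) for r in raw]

QI = (0, 1, 2)                                # Zip Code, Gender, Age columns
print(is_k_anonymous(raw, QI, 3))             # False: every raw row is unique
print(is_k_anonymous(released, QI, 3))        # True: two groups of three rows
```

The sensitive columns (Disease, Expense) are left untouched; only the quasi-identifier columns are generalized and counted.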


Zip Code  Gender  Age      Disease   Expense
7527*     Male    [21-25]  Flu       100
7527*     Male    [21-25]  Cancer    3000
7527*     Male    [21-25]  HIV+      5000
75275     Person  [31-40]  Diabetes  2500
75275     Person  [31-40]  Diabetes  2800
75275     Person  [31-40]  Diabetes  2600

Zipcode  Gender  Age      Disease
83100*   Person  [25-30]  Flu
82530*   Person  [10-15]  Obesity
83400*   Person  [30-35]  Cancer
83100*   Person  [25-30]  HIV+
82530*   Person  [15-20]  Cancer
83400*   Person  [30-35]  Diabetes
82530*   Person  [25-30]  Obesity
83100*   Person  [25-30]  Flu
83400*   Person  [30-35]  Flu


How to decide the number of clusters?


Distance between two numerical values
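
One common normalized definition in clustering-based k-anonymization divides the absolute difference of the two values by the size of the attribute's domain (its maximum minus its minimum), so the distance always lies between 0 and 1. The Python sketch below assumes that definition; the function and parameter names are illustrative and not taken from the slides.

```python
def numeric_distance(v1, v2, domain_min, domain_max):
    """Normalized distance |v1 - v2| / (domain_max - domain_min), in [0, 1]."""
    span = domain_max - domain_min
    if span == 0:                  # a constant attribute carries no distance
        return 0.0
    return abs(v1 - v2) / span

# Ages from the example tables; the domain is taken from the data itself.
ages = [22, 23, 24, 33, 38, 36]
print(numeric_distance(22, 24, min(ages), max(ages)))   # 0.125
print(numeric_distance(22, 38, min(ages), max(ages)))   # 1.0
```

Normalizing by the domain size keeps attributes with large ranges (such as Expense) from dominating attributes with small ranges (such as Age) when distances are summed.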


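Distance between two categorical values

Taxonomy tree of Country, listed level by level from the root to the leaves:
Country
America, Asia
North, South, East, West
USA, Canada, Brazil, Mexico, Iran, Egypt, India, Pakistan

Fig: Taxonomy tree of Country

$\delta_C(v_i, v_j) = H(\Lambda(v_i, v_j)) / H(T_D)$

Here $\Lambda(v_i, v_j)$ is the subtree rooted at the lowest common ancestor of $v_i$ and $v_j$, $T_D$ is the taxonomy tree of the domain, and $H(T)$ is the height of tree $T$.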
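
The categorical distance can be sketched in Python as the height of the subtree under the lowest common ancestor, divided by the height of the whole taxonomy tree. The parent links below are an assumption: the figure layout does not show which leaves belong to which region, so Iran and Egypt are placed under West and India and Pakistan under East purely for illustration. The names `PARENT`, `ancestors`, `height` and `categorical_distance` are likewise hypothetical.

```python
# Child -> parent links of the Country taxonomy (region assignment is assumed).
PARENT = {
    "America": "Country", "Asia": "Country",
    "North": "America", "South": "America",
    "West": "Asia", "East": "Asia",
    "USA": "North", "Canada": "North",
    "Brazil": "South", "Mexico": "South",
    "Iran": "West", "Egypt": "West",
    "India": "East", "Pakistan": "East",
}

def ancestors(node):
    """Path from node up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def height(node):
    """Height of the subtree rooted at node (a leaf has height 0)."""
    children = [c for c, p in PARENT.items() if p == node]
    return 0 if not children else 1 + max(height(c) for c in children)

def categorical_distance(v1, v2, root="Country"):
    """H(subtree at lowest common ancestor of v1, v2) / H(whole tree)."""
    above_v2 = set(ancestors(v2))
    lca = next(n for n in ancestors(v1) if n in above_v2)
    return height(lca) / height(root)

print(categorical_distance("India", "India"))   # 0.0
print(categorical_distance("USA", "India"))     # 1.0
```

Two countries in the same region are therefore closer (1/3) than two countries that only share a continent (2/3), and countries on different continents are at the maximum distance of 1.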

Function greedy_k_member_clustering (S, k)
  If ( |S| ≤ k )
    Return S;
  End if;
  Result = ∅;
  r = a randomly picked record from S;
  While ( |S| ≥ k )
    r = the furthest record from r;
    S = S − {r};
    C = {r};
    While ( |C| < k )
      r = find_best_record(S, C);
      S = S − {r};
      C = C ∪ {r};
    End while;
    Result = Result ∪ {C};
  End while;
  While ( |S| ≠ 0 )
    r = a randomly picked record from S;
    S = S − {r};
    C = find_best_cluster(Result, r);
    C = C ∪ {r};
  End while;
  Return Result;
End;

Function find_best_record (S, c)
Input: a set of records S and a cluster c
Output: a record r ∈ S such that IL(c ∪ {r}) is minimal
  n = |S|; min = ∞; best = null;
  For ( i = 1..n )
    r = i-th record in S;
    diff = IL(c ∪ {r}) − IL(c);
    If ( diff < min )
      min = diff;
      best = r;
    End if;
  End for;
  Return best;
End;


Function find_best_cluster (C, r)
Input: a set of clusters C and a record r
Output: a cluster c ∈ C such that IL(c ∪ {r}) is minimal
  n = |C|; min = ∞; best = null;
  For ( i = 1..n )
    c = i-th cluster in C;
    diff = IL(c ∪ {r}) − IL(c);
    If ( diff < min )
      min = diff;
      best = c;
    End if;
  End for;
  Return best;
End;
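
The three functions can be combined into one short Python sketch. It is an illustration under stated assumptions rather than the presenters' implementation: records are tuples of numeric attributes only, the distance is the sum of range-normalized differences, and the information loss IL of a cluster is taken to be its size times the sum of its normalized attribute ranges. The helper `make_metrics` and the `seed` parameter are hypothetical additions.

```python
import random

def make_metrics(records):
    """Build distance and information-loss functions normalized by attribute ranges."""
    dims = range(len(records[0]))
    spans = [max(r[d] for r in records) - min(r[d] for r in records) or 1
             for d in dims]

    def dist(a, b):
        return sum(abs(a[d] - b[d]) / spans[d] for d in dims)

    def il(cluster):
        # Assumed IL: cluster size times the sum of normalized attribute ranges.
        return len(cluster) * sum(
            (max(r[d] for r in cluster) - min(r[d] for r in cluster)) / spans[d]
            for d in dims)

    return dist, il

def greedy_k_member_clustering(records, k, seed=0):
    rng = random.Random(seed)
    remaining = list(records)
    if len(remaining) <= k:
        return [remaining]
    dist, il = make_metrics(remaining)
    result = []
    r = rng.choice(remaining)
    while len(remaining) >= k:
        r = max(remaining, key=lambda x: dist(x, r))    # furthest record from r
        remaining.remove(r)
        cluster = [r]
        while len(cluster) < k:
            # find_best_record: the record that increases IL the least
            r = min(remaining, key=lambda x: il(cluster + [x]) - il(cluster))
            remaining.remove(r)
            cluster.append(r)
        result.append(cluster)
    while remaining:
        r = remaining.pop(rng.randrange(len(remaining)))
        # find_best_cluster: the cluster whose IL grows the least
        best = min(result, key=lambda c: il(c + [r]) - il(c))
        best.append(r)
    return result

# (Age, Expense) pairs from the example table, with k = 3.
data = [(22, 100), (23, 3000), (24, 5000), (33, 2500), (38, 2800), (36, 2600)]
clusters = greedy_k_member_clustering(data, 3)
print([len(c) for c in clusters])                       # [3, 3]
```

Every cluster ends up with at least k records, so generalizing each cluster to a common value per quasi-identifier yields a k-anonymous table.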


The time complexity of this algorithm is $O((n^2 \log n)/c)$, where n is the number of records and c is the average number of records in each cluster. This is better than the time complexity of the greedy k-member algorithm.

It is difficult to decide a proper value for the user-defined threshold.
This algorithm might delete many records, which in turn causes significant information loss.
This algorithm is less sensitive to outliers.

The main goal of the experiments was to investigate the implementation of the k-anonymity model using a clustering algorithm. We focus mainly on data quality, k-anonymization, and scalability, which are the main considerations of the k-anonymity model.

Finally, data quality is the big problem in k-anonymization. We therefore focus on data quality rather than computational efficiency, since data quality should be the main consideration in the k-anonymity model. We are encouraged by our results, which demonstrate that our algorithm is flexible and able to produce a range of desired anonymizations.

Encouraged by the experimental results, we are currently working on more efficient heuristics to improve the performance of our approach. We are also working to use this clustering algorithm to detect fraud.

1. Sweeney, L.: k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 557-570 (2002)
2. Byun, J.-W. et al.: Efficient k-Anonymization Using Clustering Techniques. In: Kotagiri, R. et al. (Eds.): DASFAA 2007, LNCS 4443, pp. 188-200 (2007)