
ABSTRACT

Duplicate detection is the process of identifying multiple representations of the same
real-world entities. Today, duplicate detection methods need to process ever larger
datasets in ever shorter time, so maintaining the quality of a dataset becomes increasingly
difficult. This project presents two novel, progressive duplicate detection algorithms that
significantly increase the efficiency of finding duplicates when the execution time is limited:
they maximize the gain of the overall process within the time available by reporting
most results much earlier than traditional approaches. Comprehensive experiments show
that our progressive algorithms can double the efficiency over time of traditional
duplicate detection and significantly improve upon related work.
Both the progressive sorted neighborhood method (PSNM) and progressive blocking (PB)
increase the efficiency of duplicate detection in situations with limited execution time;
they dynamically change the ranking of comparison candidates based on intermediate
results so that promising comparisons are executed first and less promising comparisons later.
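
The idea of executing promising comparisons first can be illustrated with a minimal sketch. It is not the project's implementation: it only shows the core ordering of a progressive sorted neighborhood pass, in which records are sorted by a key and pairs at rank distance 1 are compared before pairs at distance 2, and so on. The dynamic re-ranking based on intermediate results is deliberately left out. The names `progressive_snm`, `key`, `is_duplicate`, and `max_distance` are assumptions chosen for illustration.

```python
from typing import Callable, Iterator, Sequence, Tuple, TypeVar

R = TypeVar("R")


def progressive_snm(
    records: Sequence[R],
    key: Callable[[R], str],
    is_duplicate: Callable[[R, R], bool],
    max_distance: int,
) -> Iterator[Tuple[int, int]]:
    """Yield index pairs of likely duplicates, closest sort neighbours first.

    Simplified sketch: no partitioning and no look-ahead on found duplicates.
    """
    # Sort record indices by the (hypothetical) sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))

    # Small rank distances are the most promising, so they are processed first.
    for distance in range(1, max_distance + 1):
        for pos in range(len(order) - distance):
            a, b = order[pos], order[pos + distance]
            if is_duplicate(records[a], records[b]):
                # Report the pair immediately instead of at the end of the run.
                yield (min(a, b), max(a, b))


if __name__ == "__main__":
    names = ["bob", "anna", "carl", "bobb", "anne"]
    # Toy similarity: two names match if their first three letters agree.
    same_prefix = lambda x, y: x[:3] == y[:3]
    for pair in progressive_snm(names, lambda r: r, same_prefix, max_distance=2):
        print(pair)  # prints (1, 4) and then (0, 3)
```

Because the function is a generator, a caller with a time budget can stop consuming results at any point and still keep every duplicate reported so far.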
To determine the performance gain of these algorithms, this project also proposes a
novel quality measure for progressiveness that integrates seamlessly with existing
measures. Using this measure, experiments show that these approaches outperform the
traditional sorted neighborhood method (SNM) by up to 100 percent and related work by
up to 30 percent. In future work, these progressive approaches will be combined with
scalable approaches for duplicate detection to deliver results even faster.

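Contents

Abstract ................................................... ii
Contents ................................................... iii
List of Figures ............................................ iv

Chapter 1: Introduction .................................... 1-3
    1.1 Objective of the Project ........................... 1
    1.2 Existing System .................................... 1-2
    1.3 Proposed System .................................... 2-3
    1.4 Organization of Thesis ............................. 3

Chapter 2: System Requirements and Feasibility Study ....... 4-6
    2.1 Functional Requirements ............................ 4
    2.2 Non-Functional Requirements ........................ 4
    2.3 Other Requirements ................................. 5
    2.4 Pseudo Requirements ................................ 5
        2.4.1 Hardware Requirements ........................ 5
        2.4.2 Software Requirements ........................ 5
    2.5 Feasibility Study .................................. 6

Chapter 3: System Design ................................... 7-15
    3.1 System Design Introduction ......................... 7
    3.2 Input Design ....................................... 7-8
    3.3 Output Design ...................................... 8-9
    3.4 UML Diagrams ....................................... 9-15
        3.4.1 Use-Case Diagram ............................. 10-11
        3.4.2 Class Diagram ................................ 11-12
        3.4.3 Sequence Diagram ............................. 13
        3.4.4 Activity Diagram ............................. 14
        3.4.5 Collaboration Diagram ........................ 14-15

Chapter 4: Implementation .................................. 16-17
    4.1 Modules Description ................................ 16
        4.1.1 Dataset Collection ........................... 16
        4.1.2 Preprocessing Method ......................... 16
        4.1.3 Data Separation .............................. 16
        4.1.4 Duplicate Detection .......................... 17
        4.1.5 Quality Measures ............................. 17

Chapter 5: Software Testing ................................ 18-23
    5.1 Testing Introduction ............................... 18
    5.2 Testing Cycle ...................................... 18
    5.3 White Box Testing .................................. 19
    5.4 Black Box Testing .................................. 20
    5.5 Types of Testing ................................... 20
        5.5.1 Unit Testing ................................. 21
        5.5.2 Integration Testing .......................... 22
        5.5.3 System Testing ............................... 23

Chapter 6: Results ......................................... 24-34
    6.1 Home Page .......................................... 24
    6.2 Dataset Loading Page ............................... 25
    6.3 Data Pre-processing Page ........................... 26
    6.4 Data Separation Page ............................... 27
    6.5 Data Duplicate Detection Page ...................... 28-29
    6.6 Quality Measures Page .............................. 32-34

Chapter 7: Conclusion ...................................... 35
References ................................................. 36
Appendix ................................................... 37-43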

List of Figures

Figure No   Figure Name                       Page No
3.4.1       Use Case Diagram for User         11
3.4.2       Class Diagram for User            12
3.4.3       Sequence Diagram for User         13
3.4.4       Activity Diagram for User         14
3.4.5       Collaboration Diagram for User    15
5.3         White Box Testing                 19
5.4         Black Box Testing                 20
5.5.1       Unit Testing                      21
5.5.2       Integration Testing               22
5.5.3       System Testing                    23
6.1         Home Page                         24
6.2         Dataset Loading Page              25
6.3         Data Pre-processing Page          26
6.4         Data Separation Page              27
6.5.1       Progressive Blocking Page         28
6.5.2       PSNM Comparison Page              29
6.5.3       Delete Duplicate Page             30
6.5.4       Result Page                       31
6.6         Quality Measures Page             32
6.6.1       Effectiveness Page                33
6.6.2       Runtime Results Page              33
6.6.3       Compare Page                      34
