PDT - A Report

PLAGIARISM DETECTION TECHNIQUES
A SEMINAR REPORT
Submitted to GITAM (Deemed-to-be University)
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
Devaki.T.V.S (2210315655)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
GITAM, HYDERABAD CAMPUS

(Declared as Deemed-to-be University u/s 3 of UGC Act 1956)
Rudraram Village, Patancheru Mandal,
Medak District - 502329, Telangana, India.
DECLARATION
I hereby declare that the seminar report called “PLAGIARISM

DETECTION TECHNIQUES” is submitted to GITAM (Deemed-
to-be University) in partial fulfilment is an original work carried out
by me.
The matter embodied in this seminar is a genuine work done by

me and has not been submitted earlier to this or any other university.
Devaki.T.V.S
2210315655
2
ABSTRACT
Plagiarism refers to “the act of copying materials without actually

acknowledging the original source”. Plagiarism has seen a widespread
activity in the recent times. The increase in the number of materials
available now in the electronic form and the easy access to the
internet has increased plagiarism. Manual detection of plagiarism is
not very easy and is time consuming due to the vast amount of
contents available. There are several techniques available now which
help us to detect plagiarism. As the amount of programming code
being created is increasing, plagiarism techniques are available now
to detect plagiarism in source code also.
Using Plagiarism Detection Techniques, we

can compare a given material with any target material which is either
a particular document or in a repository.
3
CONTENTS
1. INTRODUCTION TO PLAGIARISM 5
2. TYPES OF PLAGIARISM 6
3. AVOIDING PLAGIARISM 7
4. PLAGIARISM PREVENTION 7
5. PLAGIARISM DETECTION 8
5.1 PLAGIARISM DETECTION TECHNIQUES 8
5.1.1 TEXT BASED DETECTION 9
5.1.2 CODE BASED DETECTION 13
6. DISADVANTAGES OF 18
PLAGIARISM DETECTION TECHNIQUES
7. CONCLUSION 19
8. REFERENCES 19
4
1. INTRODUCTION TO PLAGIARISM
What is Plagiarism?
◦ Plagiarism the practice of taking someone else's work or ideas

and passing them off as one's own.
◦ You may use another person's words, ideas, or information, but

to do so without acknowledgment is plagiarism.
◦ According to the Merriam Webster Online Dictionary,

“plagiarize" means –
◦ to steal and pass off as one’s (the ideas or words of

another)
◦ use (a created production) without crediting the source
◦ to commit literary theft: present as new and original an

idea or product derived from an existing source
◦ Example of Plagiarism: A writer decides that he wants to create

an Internet website to generate ad revenue. Instead of writing his
own articles, he visits twenty other websites that have articles on
the topic in which he is interested. He copies each of the articles,
changes the titles and the authors' names to his name and posts
the articles on his own website.
◦ The following are considered as Plagiarism:

5
◦ Turning someone else’s work as your own.
◦ Copying words or ideas from someone else without giving

credit.
◦ Giving incorrect information about the source of

something.
◦ Changing words but copying sentence structure.
◦ Copying so many words or ideas from a source that it

makes up the majority of your work, even though by
credit.
2. TYPES OF PLAGIARISM
Plagiarism can be deliberate or accidental.
Deliberate (Intentional) Plagiarism: Steals the property of

somebody else and claims it to be his/her own.
Accidental (Unintentional) Plagiarism: Somebody unknowingly

cites a phrase or copies words without acknowledging the author of
the material.
6
3. AVOIDING PLAGIARISM
Mainly two methods exist to avoid Plagiarism:
 Plagiarism Prevention
 Plagiarism Detection
4. PLAGIARISM PREVENTION
◦ Use your own words and ideas.

◦ Give proper credit for copied, adapted or paraphrased materials.
◦ Avoid using others work with minor cosmetic changes.
Examples: using synonyms such as therefore and thus, reversing
sentence order or changing format or layout of a text.
7
◦ If you use another’s exact words, you must use quotation marks
(“..”) or (‘..’).
5. PLAGIARISM DETECTION
Plagiarism can be detected manually, which takes more effort,

or with the help of software detection. The Detection
Techniques are software programming methods which are
easier, simpler and faster to detect plagiarism.
5.1 PLAGIARISM DETECTION TECHNIQUES
Most existing detectors are specially designed to process natural

language text or program source code. Systems designed for
finding similarities in natural language texts mainly searched the
Internet for the possible matches. Text comparisons use simple
comparison methods aiming mostly at processing speed and
wide coverage. The program source code usually performs a
pair wise comparison between single submissions only. Though
sophisticated procedures are being developed which compares
with multiple source code programs simultaneously.
Therefore, based on the content that is plagiarised, the

Plagiarism Detection Techniques are of two types:
8
◦ Text Based Plagiarism Detection Techniques
◦ Source Code Based Plagiarism Detection Techniques
5.1.1 Text Based Plagiarism Detection Techniques
 Substring Matching:
Substring matching approaches try to identify maximal
matches in pairs which then are used as plagiarism
indicators. Typically, the substrings are represented in
suffix trees, and graph-based measures are employed to
capture the fraction of the plagiarized sections.
 Keyword Similarity
◦ Topic identifying keywords are extracted from a
document.
◦ Compare with keywords of another document.
◦ If the similarity exceeds a threshold, the candidate

documents are said to have similarity, therefore
implying that the document has been plagiarised.
 Exact Fingerprint Match

◦ The documents are partitioned into term sequences
called chunks which are used as plagiarism indicators,
9
from which digital digests are computed that form the
document’s fingerprint.
◦ Digests are inserted into a hash table then collisions

indicate matching sequences.
◦ For the fingerprint computation some standard hashing

suffers from two severe problems:
 Computationally expensive.
 A small chunk size (3-10 words) must be chosen

to identify matching passages
 Text Parsing
◦ Any sentence of the text can be automatically

represented in the form of the tree, which reflects the
structure of the sentence.
◦ Example: The phrase the monkey ate the banana will be

parsed by the software as,
10
◦ Once a parse tree is created, we can invoke a tree matching
procedure
◦ Initially the algorithm builds a flowchart-styled parse tree for

each file to be analysed
◦ Then for each pair of files, the algorithm performs a rough

“abstract comparison”, when only types of the parse tree
elements (like Assignment, Loop, Branching) are taken into
account.
◦ This is done recursively for each level of tree nodes If the

similarity percentage becomes lower, the trees are immediately
treated as not similar.
◦ E.g. the phrases the monkey ate the banana and the banana was
eaten by the monkey will be very close after the tokenization.
Tools used for Text Based Plagiarism
Detection tools are present that operate on free text and also that
finds similarity in spreadsheets, diagrams, scientific
experiments, music or any other non-textual corpora. Tools can
be divided in three based on the type of corpus: tools that
operate only intra-corpally (where the source and copy
documents are both within a corpus), tools that operate only
extra-corpally (where the copy is inside the corpus and the
11
source outside) and tools that operate both – intra- and extra-
corpally.
 CheckForPlagiarism.net
◦ One of the best online plagiarism checkers that used to stop or

prevent of online plagiarism.
◦ The fingerprint-based approach used to analyse and summarize

collection of document and create a kind of fingerprint for it.
◦ Uses its own database that include millions of documents and

articles over World Wide Web.
◦ Support synonym and sentence structure checking.
◦ Can compare set of different documents simultaneously with

other documents
 SIM (Software Similarity Tester)
The SIM system tokenizes source programs and compares strings

using pattern matching algorithms based on work from the human
genome project.
 Turnitin
Corpus-based plagiarism detection software takes as input a

suspect document and an archived corpus of authenticated
documents, compares the suspect document to the corpus, and
12
outputs passages that the suspect document shares with the corpus
and a measure of the likelihood that the author plagiarized material.
At a more operational level, a corpus-based protocol is
implemented in several ways. TurnItIn.com, the most high-profile
company in the field, employs document source analysis to
generate digital fingerprints of documents, those submitted for
authentication, those in the archive, and those available through
ProQuest, and it complements the search of its in-house archive
with searches of the World Wide Web.
5.1.2 Source Code Detection Techniques
 Lexical Similarities
The process of lexical analysis takes the human readable source
code for a program and converts it into a stream of lexical
tokens which a parser or compiler may use to extract meaning
from the source. During the lexical analysis phase, the source
code undergoes a series of transformation. Some of these
transformations, such as the identification of reserved words,
identifiers, and numbers are beneficial for plagiarism detection.
Consider the following two snippets of Java Code:
13
Clearly, there is no semantic difference between the two code
snippets. However, by changing the array A to B and the
looping variable from i to j we have introduced differences on
all but the last lines.
The lexical stream of the 2 snippets of code is the following:
 LITERAL_int LBRACK RBRACK IDENT ASSIGN LCURLY
NUM_INT COMMA NUM_INT COMMA NUM_INT
COMMA NUM_INT RCURLY SEMI
 LITERAL_for LPAREN LITERAL_int IDENT ASSIGN
NUM_INT SEMI IDENT LT IDENT DOT IDENT SEMI
IDENT INC RPAREN LCURLY NUM_INT SEMI
Both the java snippets have the exact lexical stream, indicating
that the code is same. Changes in the whitespace or comments
do not affect the lexical stream.
14
 Parse Tree Similarities
◦ The parse tree or derivation tree built from the program exhibits
structure for a given program.
◦ A compiler, during the compilation process builds a parse tree

which represents the program.
◦ The parse tree will have the same structure for both the snippet
of code as the lexical streams are same.
◦ An algorithm for detecting plagiarism using this method would

first, parse each program.
◦ Next, for each pair of parse trees, it attempts to find as many

common sub trees as possible.
◦ Use this number as a measure of similarity between the two

programs
Example of a Parse Tree:
15
 Metrics
Similarity analysis using software metrics requires that each

fragment be characterized by a set of features measured by
metrics such as number of passed parameters, the number of
statements etc. Metrics computation requires the parsing of
source code to identifying interesting fragments and extract
software metrics. In particular, each fragment was modelled
by the following seven software metrics:
i. Number of used or defined local variables
ii. Number of used or defined non-local variables
iii. Number of parameters
iv. Number of statements
v. Number of branches
vi. Number of loops
vii. Number of function calls
Tools used for Code-based Plagiarism

 JPlag
JPlag finds similarities among multiple sets of source code
files. It compares bytes of text as explained in the text-
based plagiarism tool. JPlag is also aware of the
programming language syntax and program structure. It is
16
more robust. It currently supports Java, C#, C, C++ and
natural language text. It has a very powerful graphical
interface for presenting the results. It does not search the
web rather it searches for plagiarism between two given
content.
 MOSS
MOSS (Measure Of Software Similarity) checks for
plagiarism in programming classes. It has been effectively
used as the algorithm used is significantly better compared
to other cheating algorithms. It can analyse code written in
various languages like C, C++, Python, Visual Basic,
JavaScript, FORTRAN, Lisp, Ada etc.
MOSS is being provided as an internet service. A list of
files is to be given for comparing and this tool checks for
any plagiarism. Currently MOSS submission script is for
Linux. It highlights the passages that are similar.
 CodeMatch
CodeMatch compares thousands of source code files in
multiple directories and subdirectories to determine those
files which are closely correlated. The technique used here
considerably increases the speed of detecting the source
code plagiarism. It is also useful for finding open source
code within proprietary code, determining common
17
authorship of two different programs, and discovering
common, standard algorithms within different programs.
6. DISADVANTAGES OF PLAGIARISM
DETECTION TOOLS
◦ Plagiarism Detection systems are built based on a

few languages. To check for plagiarism with the
same software can be difficult.
◦ Most of the detection software checking is done with
some repository situated in an organization. Other
people are unable to access it and verify for
plagiarism.
◦ As the number of digital copies are going up the
repository size should be large and the plagiarism
Detection software should be able to handle it.
◦ There is some plagiarism detection software
available which ask us to load a file to their link.
Once done the file is copied to their database and
then checked for plagiarism. This also comes with an
inherent chance of our data being leaked or hacked
for other purposes.
18
7. CONCLUSION
Plagiarism is rampant now. With most of the data available to us in

digital format the venues for plagiarism is opening up. To avoid
this kind of cheating and to acknowledge the originality of the
author new detection techniques are to be created. Not only
systems with speed but also new systems should which can be able
to collect information about plagiarism in the web or large
repositories.
As there are a large number of detection tools

available for text-based plagiarism the number of copying incidents
have reduced considerably in this field. Currently we use a lot of
computer-based applications. To protect the intellectual property in
the source code new techniques are to be developed and
implemented.
8. REFERENCES
 https://en.wikipedia.org/wiki/Plagiarism_detection
 http://www.ijirae.com/volumes/vol1/issue7/AUCS10085.0
6.pdf
 http://www.drdobbs.com/architecture-and-
design/detecting-source-code-plagiarism/184405734
19
20

PDT - A Report

Enviado por

Dados do documento

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

PDT - A Report

Enviado por

Direitos autorais:

Formatos disponíveis

PLAGIARISM DETECTION TECHNIQUES

Submitted to GITAM (Deemed-to-be University)

COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

GITAM, HYDERABAD CAMPUS

I hereby declare that the seminar report called “PLAGIARISM

The matter embodied in this seminar is a genuine work done by

Plagiarism refers to “the act of copying materials without actually

Using Plagiarism Detection Techniques, we

5.1 PLAGIARISM DETECTION TECHNIQUES 8

5.1.1 TEXT BASED DETECTION 9

5.1.2 CODE BASED DETECTION 13

PLAGIARISM DETECTION TECHNIQUES

◦ Plagiarism the practice of taking someone else's work or ideas

◦ You may use another person's words, ideas, or information, but

◦ According to the Merriam Webster Online Dictionary,

◦ to steal and pass off as one’s (the ideas or words of

◦ use (a created production) without crediting the source

◦ to commit literary theft: present as new and original an

◦ Example of Plagiarism: A writer decides that he wants to create

◦ The following are considered as Plagiarism:

◦ Copying words or ideas from someone else without giving

◦ Giving incorrect information about the source of

◦ Changing words but copying sentence structure.

◦ Copying so many words or ideas from a source that it

Plagiarism can be deliberate or accidental.

Deliberate (Intentional) Plagiarism: Steals the property of

Accidental (Unintentional) Plagiarism: Somebody unknowingly

Mainly two methods exist to avoid Plagiarism:

◦ Use your own words and ideas.

Plagiarism can be detected manually, which takes more effort,

5.1 PLAGIARISM DETECTION TECHNIQUES

Most existing detectors are specially designed to process natural

Therefore, based on the content that is plagiarised, the

5.1.1 Text Based Plagiarism Detection Techniques

◦ Compare with keywords of another document.

◦ If the similarity exceeds a threshold, the candidate

 Exact Fingerprint Match

◦ Digests are inserted into a hash table then collisions

◦ For the fingerprint computation some standard hashing

 A small chunk size (3-10 words) must be chosen

◦ Any sentence of the text can be automatically

◦ Example: The phrase the monkey ate the banana will be

◦ Initially the algorithm builds a flowchart-styled parse tree for

◦ Then for each pair of files, the algorithm performs a rough

◦ This is done recursively for each level of tree nodes If the

Tools used for Text Based Plagiarism

◦ One of the best online plagiarism checkers that used to stop or

◦ The fingerprint-based approach used to analyse and summarize

◦ Uses its own database that include millions of documents and

◦ Support synonym and sentence structure checking.

◦ Can compare set of different documents simultaneously with

 SIM (Software Similarity Tester)

The SIM system tokenizes source programs and compares strings

Corpus-based plagiarism detection software takes as input a

5.1.2 Source Code Detection Techniques

Consider the following two snippets of Java Code:

◦ A compiler, during the compilation process builds a parse tree

◦ An algorithm for detecting plagiarism using this method would

◦ Next, for each pair of parse trees, it attempts to find as many

◦ Use this number as a measure of similarity between the two

Example of a Parse Tree:

Similarity analysis using software metrics requires that each