
Detecting Plagiarism in Java Code

Andrew Granville
04 May 2002

Supervisor: Yorick Wilks

This report is submitted in partial fulfilment of the requirement for the degree of Bachelor of Engineering with Honours in Software Engineering by Andrew Granville

Declaration
All sentences or passages quoted in this dissertation from other people's work have been specifically acknowledged by clear cross-referencing to author, work and page(s). Any illustrations which are not the work of the author of this dissertation have been used with the explicit permission of the originator and are specifically acknowledged. I understand that failure to do this amounts to plagiarism and will be considered grounds for failure in this dissertation and the degree examination as a whole. Name: Andrew Granville Signature: Date: 04/05/02

Abstract
The thought of copying another person's programming code has always been appealing, offering the perpetrator an instant saving in time and effort. However, because universities and other academic establishments risk unwittingly awarding qualifications on the strength of such plagiarism, it is recognised that there must be an attempt to identify any culprits. This project therefore considers the detection of plagiarism that could occur between pairs of Java algorithms, for example those that have been handed in as part of a university assignment. To achieve this, two methods, one attribute counting and one structure metric, were chosen from a number reviewed. The aim was to apply these to a set of predetermined Java source code files, to identify which type of method performs best, and to establish whether a more concrete set of findings can be obtained when the results of the two methods are combined. The report concludes that although plagiarism was able to be detected, the results are in some respects not as desired.

Acknowledgements
I would like to thank Paul Clough for his tireless enthusiasm on the subject and his willingness to always help.

Contents
1. Introduction
2. Literature Review
3. Choosing the Methods
   Attribute Counting Methods
   Structure Metric Methods
      Dotplot
      YAP3
      Plague
   Final Choices and Justifications
      Attribute Counting Method
      Structure Metric Method
4. Implementing the Methods
   Attribute Counting Method (McCabe's Cyclomatic Complexity)
   Structure Metric Method (GST)
      Requirements
      Quantifying the Similarity
      Tokenising the Source and Target Files
      The Construction of jGST
      The Completed Tool
5. Collecting the Data Set
6. Pre-Testing
   Identifying the Threshold (McCabe's Cyclomatic Complexity)
   Identifying the Thresholds (GST)
7. Testing and Analysis
   Attribute Counting Method (McCabe's Cyclomatic Complexity)
   Structure Metric Method (GST)
8. Evaluation and Conclusions
References

1. Introduction
It is safe to say that the potential for plagiarism in programming code has never been greater. With the explosion in the amount of information available from sources such as the Internet, there no longer seems a need to write code entirely from scratch. This has proved more than a headache for academic establishments, whose aim is to award qualifications to those that have achieved a genuine level of proficiency in certain areas, and not just to those that have recorded acceptable marks. For many individual students, programming courses are seen as a way of achieving creditable results with the minimum of effort, since the temptation simply to copy code from other sources and claim it as their own is high. Although it is very difficult, if not impossible, to trace and prove that listings have been somehow manipulated from external sources, showing that a student has collaborated with another is very much possible.

This project therefore considers the detection of plagiarism that could occur between pairs of Java algorithms, for example those that have been handed in as part of a university assignment. This indicates the first aim, which is to be able to determine distinctly whether any piece of code has in some way been constructed from another. The only constraint at this point is that this project will only seek to find plagiarism in Java code. This is due both to time constraints and to the recognition that the language is widely used and accepted, especially in teaching institutions.

To begin to achieve the above aim, a data set will be compiled. It will include a range of Java algorithms, of which some will be genuinely plagiarised from others and the remainder original. Not only will this set provide the samples that make up the test set, it will also be used to allow thresholds for the chosen detection methods to be calculated: a figure which will indicate plagiarism if surpassed or, in some cases, if not reached.

Before outlining how this project aims to identify plagiarism, it is important to briefly understand the difference between attribute counting systems and structure metric systems. Attribute counting systems simply count the level of a certain attribute contained within a length of code, in contrast to structure metric systems which, as Clough [1] writes, compare string representations of the program structure, assessing the similarity of token strings. With this knowledge, a selection of methods from both types of system are reviewed, and one from each is chosen to be implemented by this project.

This allows two more project aims to be introduced. The first is to analyse the performance of both methods when the test set is used, with the hope of identifying which may have the greater potential for plagiarism detection. The second is to consider both sets of results in a way that, when used together, they can provide a more concrete plagiarism classification. This aim, of course, depends on how well the results can be interpreted, and may prove difficult if contradictory information is returned.

The final aim of the project is to construct a desktop tool that implements one of the chosen plagiarism detection methods. The algorithms for this will be coded from first principles in Java, using descriptions and explanations found in reference material. Because of time constraints, the results for the other method will be acquired through the use of prewritten software.
Overall, this report will offer an insight into the ever-growing field of software plagiarism detection. And finally, in an ironic conclusion, it will demonstrate that the Java lexicon the plagiarist uses to cheat can itself be manipulated in such a way as to help catch them out.

2. Literature Review
Firstly, to reinforce the general project concept of only identifying plagiarism in software code, Clough [1] recognises that detecting such plagiarism is simpler than within natural language, adding that the complete grammar for a programming language can be defined and specified, but natural language is much more complex and ambiguous, making it much harder to build a successful plagiarism system. It was therefore felt that this report was a promising place to begin the project, as his paper continues by reviewing various methods available to achieve successful software plagiarism detection. These include, to name two of the attribute counting techniques, Halstead's software science metrics [4] and McCabe's Cyclomatic Complexity [5], both of which he briefly describes. He then gives an example of a simple algorithm which uses string comparisons (structure metrics) to detect plagiarism, and concludes from his comprehensive review of currently available plagiarism software that structure metric techniques are generally found to be more promising than attribute counting ones.

Sallis et al. [2], in their study of software forensics, narrow the use of the attribute counting methods Clough reviewed to being of primary use in detecting plagiarism in code, as opposed to identifying outright authorship of software. This is stated as being because these metrics produce values that are clearly program-specific, providing more evidence to strengthen the choice of not considering the problems of attributing authorship in this project. Of more interest, however, Sallis goes on to identify a six-tuple vector of programming code characteristics that he believes should enable effective plagiarism detection.

Another important review for detecting plagiarism was carried out by Verco and Wise [3]. They have taken the step of comparing the performance of structure metric and attribute counting methods, with interesting results. They found that attribute-counting metric systems performed better in the detection of plagiarism where very close copies were involved, but were often unable to detect partial plagiarisms. Their overall conclusion was that no single number, or set of numbers, can adequately capture the level of information about program texts that a structure-metric system is able to achieve. This review does, however, indicate that using an attribute counting method may still provide reliable results when very similar versions of code are compared. It is hoped that this will indeed be the case.

Whale [8] is another author who discusses which methods, both attribute counting and structure metric, could be used in the detection of plagiarism. The most important feature of his paper is that he attempts to identify what program similarity actually is and on what levels it occurs. There is also an understanding that eliminating coincidence as a possible cause [of similarity] is the ultimate goal for the user of a similarity detection system. To do this, Whale suggests defining a set of twelve characteristics that are used in the process of attempting to disguise code. These characteristics, which increase in complexity, range from simply changing comments to combining original and copied program fragments, with the most complex change having the least chance of being deemed coincidental. Whale also goes on to describe the most common ways in which the copying of programs can occur. These include people (mainly students) asking to borrow completed work and invariably copying it, poor security measures allowing access to an individual's computer account and, most interestingly, the problem of unsupervised waste bins and printers in and around computer areas.

Another paper which attempts to demonstrate the characteristics that can make up plagiarised code is by Faidhi and Robinson [9]. They develop a plagiarism spectrum, which details six levels of modification that can be made to code to render it plagiarised. These range from simply changing comments and indentation to modifying the decision logic. The overall idea here is that, by taking a code listing, each of the six levels of the spectrum can be applied in order to build up a comprehensive plagiarised copy of the original.

To complete the review, it is worth briefly considering some of the original reports which detail actual methods of plagiarism detection. Note that some of the methods will be discussed in more detail in the subsequent Choosing the Methods chapter. In Thomas McCabe's paper A Complexity Measure [5], McCabe's Cyclomatic Complexity is first introduced. The report shows how this mathematical technique for program modularisation is developed, going through the mathematics behind it and offering ways in which it can be simplified. It is interesting, whilst reading the paper, that the author did not intend the method to be used as a way of detecting plagiarism. With hindsight, however, this can be achieved by viewing with suspicion two programs that have a similar complexity rating.

Moving on, Wise [11] declares his structure metric YAP3 system as one for detecting suspected plagiarism in computer programs and other texts submitted by students. He states the reason for its creation as being that, in spite of years of effort, plagiarism in student assignment submissions still causes considerable difficulties for course designers. The paper primarily discusses the advancements that YAP3 has made over the previous versions, aptly named YAP and YAP2. He describes how, like these previous releases, YAP3 is split into two phases, but in the latest release the second phase introduces the Running-Karp-Rabin Greedy-String-Tiling algorithm, something which the author describes as novel. Finally, the report is concluded with a comparison of the performance of the three YAP programs, and as hoped, YAP3 is found to perform best.

Finally, Prechelt et al. [13], in their review of JPlag, a web service that finds pairs of similar programs among a given set of programs (see the footnote below), find some interesting results that also hold for the YAP3 method discussed above. The JPlag system, which analyses program source text written in Java, Scheme, C or C++, is stated as using roughly the same basic comparison algorithm as YAP3. The report later concludes, after carrying out some extensive testing of the system, that for clearly plagiarised programs, i.e. programs taken completely and then modified to hide the origin, JPlag's results are almost perfect, often even if the programs are less than 100 lines long. The report also gives an excellent review of the graphical interface that is shown on the internet, and describes how the results the system can output are presented and should be interpreted.

See http://wwwipd.ira.uka.de/jplag/

3. Choosing the Methods


Having completed the literature review, one attribute counting method and one structure metric method must now be selected, in line with the project aims laid out in the introduction of this report. This section firstly considers in detail a selection of both types of method, and then selects one from each, justifying why it was chosen and, to a certain extent, why the others were not.

Attribute Counting Methods

The following list of attribute counting methods is outlined as a six-tuple vector by Sallis et al. [2]. Despite the fact that each of them simply counts the level of a certain attribute contained within a length of code, as will be seen from their descriptions, all six can be regarded as valid methods of detecting plagiarism.

1. Volume. This can be quantified using Halstead's software science metrics [4] and is said to be a reflection of the size of the implementation of any algorithm. The idea behind this is to identify and count unique or distinct operands and operators from within the studied code. Once completed, mathematics can be applied to these figures to calculate a measure of volume. Therefore, the following is needed:

   n1 (the number of unique or distinct operators)
   n2 (the number of unique or distinct operands)
   N1 (the total usage of all the operators)
   N2 (the total usage of all the operands)
   n = n1 + n2 (known as the vocabulary n)
   N = N1 + N2 (known as the implementation length N)

from which the volume (V) can be calculated using V = N log2 n.

2. Structure. This measure considers the use of some indicator to illustrate the degree of coupling between modules. This provides a representation of the data and control transfer of a program.

3. Data Dependency. In a similar fashion to the way the control flow of a program can be represented, data dependency can be measured by allowing both predicate clauses and variable definitions to be illustrated as nodes on a flowgraph. All that is further required is a way of understanding this representation, which has been suggested by Bieman and Debnath [6] in the form of a Generalised Program Graph (GPG).

4. Nesting Depth. This is a simple measure which returns the average nesting depth of a program by assigning each line of code a depth value. The average is then calculated by dividing the total of these values by the number of statements in the program.

5. Control Structure. Each type of control structure that occurs in a program is assigned a weight; for example, an IF-THEN construct is worth 5. By applying this principle to an entire program, a sum of all the weightings can be found, and this is said to be the program complexity.

6. Control Flow. This can be measured using McCabe's Cyclomatic Complexity, which works by firstly taking a listing of programming code and converting it into a control graph which has unique entry and exit nodes. Note that this method is language dependent, meaning that the syntax and semantics of the language a piece of code is written in must be understood before a control graph can be created. In the graph, each node corresponds to a block of code where the flow is sequential, and each arc corresponds to branches taken in the program. An example of such a graph, taken from McCabe's own report [5], is as follows:
[Figure, reproduced from McCabe [5]: an example control graph in which node a is the unique entry node and node f is the unique exit node.]

There are two assumptions made when creating a graph: one is that each node can be reached from the entry node (node a), and the other is that each node can reach the exit node (node f). To aid the construction of the control graph, McCabe [5] lists the usual constructs used in programming: sequence, if-then-else, while and until, each with its corresponding control-graph fragment.

[Figure: the control-graph fragments for the sequence, if-then-else, while and until constructs, reproduced from McCabe [5].]

Having created a graph, the overall strategy will be to measure the complexity of a program by computing the number of linearly independent paths v(G) [5] (where G is defined to be the created control graph). There are several properties of the complexity that must also be considered. These are [5]:

1. v(G) >= 1.
2. v(G) is the maximum number of linearly independent paths in G.
3. Inserting or deleting functional statements in G does not affect v(G).
4. If v(G) = 1, then G has only one path.
5. Inserting a new edge in G increases v(G) by unity.
6. v(G) depends only on the decision structure of G.

With all the above considered, v(G) can be defined as

   v(G) = e - n + 2p

where e is the number of edges in the graph, n is the number of nodes in the graph, and p is the number of connected components (for example, p = 1 for one module, p = 2 when two connected modules are considered, and so on; note that p > 0). It therefore follows that McCabe's Cyclomatic Complexity can calculate the complexity of a collection of connected programs if required, by simply altering p accordingly.

Although the method was not developed for detecting plagiarism, it can be used to do so, because code with a similar complexity can be treated with suspicion and subsequently checked by human eye. As McCabe states, it has been interesting to note how individual programmers' style relates to the complexity measure.

Structure Metric Methods

The following structure metric methods were shortlisted as possible techniques that could be implemented.

Dotplot

This is a technique which shows patterns of string matches between two pieces of code (or any text) visually. Importantly, dotplot is not language specific, therefore it is not required to understand the semantics and syntax of the code that is to be compared. This gives the method a great deal of flexibility. The major advantage of dotplot is that it relies on the human visual system to detect patterns of similarity. As Helfman [12] describes, previous approaches to detecting similarity, such as algorithms that find longest common substrings, do not reveal the richness of the similarity structures that have been hidden in our software, data, literature and languages. However, by relying on the human eye to uncover plagiarism, the question of how to quantify the results of each comparison becomes one of dotplot's major disadvantages. It means that interpreting, for example, how much of another person's code someone has copied cannot be represented by a value.

Let us now consider how the method actually works. Firstly, two programs are selected and each is processed into a sequence of tokens. These two sequences are then joined together.
For example, take two programs which, for simplicity's sake, consist only of a comment each:

Program One: //test this
Program Two: //we must test this, now

These would be tokenised into the final sequence
test this we must test this now

This sequence is now plotted against itself and a dot drawn where a match of tokens is found, i.e.,
[Dotplot grid: the sequence test this we must test this now is laid out along both axes, and a dot is drawn at every cell where the row token matches the column token.]

From this, all diagonals (excluding the main diagonal) represent a similarity in structure. In the example above the two diagonals (top right and bottom left of grid) indicate the comment from program one has been found in program two. This shows the possibility that program one has been plagiarised. The potential of the dotplot method for finding plagiarism in code is in theory very good. The two following examples, taken from Helfman [12], show how a plot can cope when for example software is copied and has some other code inserted into it, and when code is simply taken and reordered. Example 1: The sequence of tokens a b c d e f (each of which could be some code syntax) is copied and has Z Y inserted into it. I.e. a b Z Y c d e f. The plot is as follows
[Dotplot grid: the original sequence a b c d e f along one axis and the modified sequence a b Z Y c d e f along the other; two broken diagonals of dots either side of the main diagonal show that a b and c d e f from the original both reappear in the copy.]

It can be seen that the diagonals indicate that the original sequence can be found in the plagiarised one. This shows that the dotplot method is not affected by simply splitting up copied code.
Example 2: The sequence of tokens a b c d e f g (again each of which could be some code syntax) is copied but rearranged into c d a b e f g. The plot is as follows
[Dotplot grid: a b c d e f g along one axis and the rearranged sequence c d a b e f g along the other; off-diagonal runs of dots mark the transposed blocks.]

Once again the diagonals either side of the centre diagonal show that the method is not affected by simply copying and rearranging code. Another important feature of dotplot is the appearance of squares of dots in the grid. These identify unordered matches, or subroutines with lots of matching symbols [12]. The following example demonstrates this. Example 3: A sequence of tokens a a b b a a, which can be thought of as a program consisting of three subroutines, is copied and rearranged into b b a a b b, another program of three subroutines.
[Dotplot grid: a a b b a a along one axis and the rearranged sequence b b a a b b along the other; squares of dots either side of the main diagonal mark the copied subroutines.]

When plotted, the squares either side of the centre diagonal represent where these subroutines have been copied. This means dotplot is immune for example to the case where a person copies a set of routines, modifies them and recreates a program with the routines in a different order.
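Although dotplot is not implemented in this project, the plotting step itself is straightforward to sketch. The short Java program below is illustrative only (the class name and the console printout are choices made for this example, not part of any existing tool): it prints a character-based dotplot of a token sequence against itself, so that the diagonals and squares described above can be seen directly.

import java.util.Arrays;
import java.util.List;

// Illustrative sketch: prints a token sequence plotted against itself,
// marking matching positions with '*' and everything else with '.'.
public class DotplotSketch
{
    public static void main( String[] args )
    {
        // The joined comment sequence from the earlier example.
        List<String> tokens = Arrays.asList( "test", "this", "we", "must", "test", "this", "now" );

        for ( String row : tokens )
        {
            StringBuilder line = new StringBuilder();
            for ( String col : tokens )
                line.append( row.equals( col ) ? '*' : '.' );
            System.out.println( line );
        }
        // Off-diagonal runs of '*' indicate repeated (potentially copied) runs of tokens;
        // squares of '*' indicate unordered matches such as rearranged subroutines.
    }
}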

One final advantage of dotplot is that it can work for very large sets of data, with the interpretation of plagiarism always remaining the same (identifying squares and diagonals) regardless of size.

YAP3

YAP3 [11] is the third release of YAP, a system for detecting suspected plagiarism in computer programs and other texts submitted by students. Even from this description, it would appear that using the methods behind this system would certainly help the project aim of detecting plagiarism in code that has been handed in as part of a university assignment. The biggest change that has occurred in this version of YAP over the previous ones is a switch to the underlying use of the Running-Karp-Rabin Greedy-String-Tiling (RKR-GST) algorithm. It is important to understand that RKR is only an optimisation technique to speed up the GST algorithm from its worst case runtime complexity of O(n³). In simple terms, it works by introducing the use of hash-values and hash-tables to reduce the algorithm to an average complexity that is almost linear. However, the method will not be described any further here. This is because further detailed research as to how to implement it would be required (see the Karp-Rabin reference in the footnote below), and because the GST algorithm can still be constructed without the need for RKR. Therefore the rest of the section will only detail how the GST algorithm functions.

The GST method attempts to compute the degree of similarity between two files of source code. These will be named the source and target files, where the target is suspected of being plagiarised from the source. The overall method works in two stages, with the first being to convert both the source and target files into token strings. This involves in each case [11]:

- Removing comments and string-constants.
- Translating upper case letters into lower case.
- Mapping of synonyms to a common form (i.e. function mapped to procedure).
- Reordering the functions into their calling order. In the process, the first call to each function is expanded to its full token sequence; subsequent calls are replaced by the token FUN.
- Removing all tokens that are not from the lexicon of the target language, i.e. any token that is not a reserved word, built-in function, etc.

The next stage is the comparison phase, where the actual GST algorithm is introduced. It is based on the following important notions [11]:

- It introduces a tile, which is a one-to-one pairing of a substring from the source file (sFile) and a substring from the target file (tFile). Once a token becomes part of a tile it is said to be marked.
- A Maximal Match (MaxM) is defined. This is similar to a tile, but is only a temporary pairing of substrings between the source and target files.
- A Minimal Match Length (MinML) is defined. This is the minimum length of tiles being sought, with potential tiles below this length being ignored.

Overall, what is being looked for by the GST algorithm is a maximal tiling of sFile and tFile, i.e. a coverage of non-overlapping substrings of tFile with non-overlapping substrings of sFile which, bearing in mind the MinML, maximises the number of tokens that have been covered by tiles.

See R. M. Karp and M. O. Rabin, Efficient Randomized Pattern-Matching Algorithms, IBM Journal of Research and Development, 31(2), 249-260, March 1987.

To understand how the method works, the following pseudo code of the algorithm (adapted from [13]) is shown below.

Greedy-String-Tiling(String sFile, String tFile) {
    tiles = {};
    do {
        searchLength = MinML;
        matches = {};
        Forall unmarked tokens sFile[s] in sFile {
            Forall unmarked tokens tFile[t] in tFile {
                j = 0;
                while (sFile[s+j] == tFile[t+j]
                       && unmarked(sFile[s+j]) && unmarked(tFile[t+j]))
                    j++;
                if (j == searchLength)
                    matches = matches ∪ {match(s, t, j)};
                else if (j > searchLength) {
                    matches = {match(s, t, j)};
                    searchLength = j;
                }
            }
        }
        Forall match(s, t, searchLength) in matches {
            For j = 0 ... (searchLength - 1) {
                mark(sFile[s+j]);
                mark(tFile[t+j]);
            }
            tiles = tiles ∪ {match(s, t, searchLength)};
        }
    } while (searchLength > MinML);
    return tiles;
}

We can start by saying that the algorithm is made up of two main stages. Using Wise's [11] terminology, these are entitled scanpattern (the nested match-finding loops in the first half of the pseudo code) and markstrings (the marking loops in the second half). Multiple passes of these stages are completed until no more substrings of length greater than or equal to the MinML can be found between the sFile and tFile.

During a scanpattern phase, all Maximal Matches of a certain size are collected (in the above code, in the matches variable). These are the greatest length substrings that can be found between the unmarked sFile and tFile tokens. Note that this size is denoted in the pseudo code by the searchLength variable. This activity gives rise to the algorithm being classed as greedy, because of its preference for marking the longest substrings first.

Moving on, during a markstrings phase, the Maximal Matches (the contents of matches) are taken one at a time and checked to see if any part of the MaxM is already marked. If not, then a tile is created from the MaxM and the tokens that make up the tile are marked. When all the Maximal Matches have been dealt with, the searchLength and matches variables are reset to the MinML and null respectively, and another pass is begun. Note that, as stated earlier, tokens that have just been marked are no longer available to be matched again, so if all the tokens from either the sFile or the tFile have been marked, the searchLength value will remain unaltered during the next pass. This allows the algorithm to always terminate.
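As a short worked illustration of the two phases, suppose the source token string is a b c d e, the target token string is c d e a b, and the MinML is 2. On the first pass, scanpattern finds the Maximal Match c d e of length three, so searchLength rises to three and markstrings turns this match into a tile, marking its tokens in both files. On the second pass only a b remains unmarked; it is found with length two, equal to the MinML, and is tiled in turn. Because searchLength is now no longer greater than the MinML, the algorithm terminates, having covered every token in both files despite the transposition.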
Plague

The final structure metric technique to be (briefly) considered is called Plague. It works in a similar fashion to the YAP3 method discussed previously, but without using the RKR-GST algorithm. As Clough writes [1], Plague works using the following three phases:

1. Create a sequence of tokens and a list of structure metrics to form a structure profile. The profile summarises the control structures used in the program, representing iteration, selection and statement blocks.
2. An O(n²) phase compares the structure profiles and determines pairs of nearest neighbours.
3. Finally, the token sequences are compared for similarity using a variant of the Longest Common Subsequence (see the Cormen, Leiserson and Rivest reference in the footnote below).

But in summary, as Clough continues, Plague suffers from a number of problems. These include:

1. The fact that it is hard to adapt to new languages.
2. Because of the way it produces its output (two lists of indices that require interpretation), results are not obvious.

Final Choices and Justifications

Having completed a review of possible methods, those chosen to be implemented in this project are the Greedy-String-Tiling algorithm that YAP3 uses as the structure metric method, and McCabe's Cyclomatic Complexity to represent the attribute counting method. The justification for these choices is made below.

Attribute Counting Method (McCabe's Cyclomatic Complexity)

Because of a lack of available evidence as to which attribute counting method actually performs best, this was not an easy decision. However, as McCabe's method is widely accepted as an industry standard for calculating the complexity of code, it was thought of as a good and respectable method to use in the attempt to detect plagiarism.

Structure Metric Method (GST)

At the time of the release of YAP, the forerunner to YAP3, Wise [10] considered how well the program could cope with twelve techniques used to disguise plagiarism in program code. These twelve techniques, defined by Whale [8] and mentioned in the Literature Review, are listed below:

1. Changing comments or formatting.
2. Changing identifiers.
3. Changing the order of operands in expressions.
4. Changing data types (i.e. integers into reals).
5. Replacing expressions by equivalents.
6. Adding redundant statements or variables.
7. Changing the order of independent statements.
8. Changing the structure of iteration statements (i.e. REPEAT and WHILE loops, etc.).
9. Changing the structure of selection statements (i.e. nested IF and CASE statements).
10. Replacing procedure calls by the procedure body.
11. Introducing non-structured statements.
12. Combining original and copied program fragments.

See T. H. Cormen, C. E. Leiserson and R. L. Rivest, Introduction to Algorithms, MIT Press, 1990.

Wise found that YAP was able to cope adequately with all of the above points except number seven, changing the order of independent statements. However, since using the GST algorithm in YAP3, Wise [11] has noted that it is now able to detect transposed subsequences, and that it no longer suffers from the difficulties faced by the well-known algorithms in detecting similarities in the presence of block-moves. Overall, with the possibility of creating a program in this project which can detect plagiarism caused by all twelve techniques listed above, the GST algorithm that YAP3 uses was chosen to be implemented.

The question of why dotplot is not being used comes down, as stated earlier, simply to the problem of quantifying the results it produces. It was recognised that if this method were used, interpreting any final project results would be made much more difficult, if not impossible. Finally, the simple reason why Plague was not selected is that YAP3 can outperform it. As Whale [8] describes, the longest common subsequence is not an ideal indicator of the degree of commonality between two such [token] sequences, as it does not account for relocated blocks. In other words, Plague fails to recognise when plagiarism is being hidden by changing the order of independent statements. This is something which, as we have just seen, the GST algorithm that YAP3 uses is able to cope with.
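A small example makes this concrete. For the token sequences a b c d and c d a b, the longest common subsequence has length two (either a b or c d), suggesting that only half of the material is shared, whereas a greedy tiling with a MinML of two covers all four tokens of each sequence with the two tiles c d and a b, correctly reporting the relocated blocks as wholly shared.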

4. Implementing the Methods


As was stated earlier in the introduction of this report, because of time constraints only one of the two chosen methods for detecting plagiarism would be constructed from first algorithmic principles. It was felt that because McCabe's Cyclomatic Complexity requires a more detailed knowledge of the Java syntax and semantics, in that it needs an intricate control graph to be developed before a result can be found, it would be better to implement it through the use of prewritten software. In contrast, although the Greedy String Tiling (GST) algorithm that YAP3 uses is relatively complex, it was recognised that it should be a more straightforward process to convert it to Java. Therefore the decision was made to construct a Java tool which implements the GST algorithm. This chapter will now discuss how the implementation of both methods was achieved.

Attribute Counting Method (McCabe's Cyclomatic Complexity)

After an extensive search of the internet, a program called RSM (Resource Standard Metrics, see the footnote below) was located. It allows McCabe's Cyclomatic Complexity to be calculated on a piece of Java source code. Below is an example to demonstrate the output of the program. It shows part of the listing it produces, with the relevant complexity figure being the Cyclo Vg value reported for each function.
~~ Function Metrics ~~
~~ Complexity Analysis ~~

File: BinarySearch.java    Date: Thu Feb 14 18:56:16 2002    File Size: 1470 Bytes
________________________________________________________________________

Function: BinarySearch.binarySearch
  Complexity:  Param 2   Return 2   Cyclo Vg 4   Total 8
  Lines:       LOC 14    eLOC 10    lLOC 6    Comment 6   Total Lines 16

Function: BinarySearch.main
  Complexity:  Param 1   Return 1   Cyclo Vg 3   Total 5
  Lines:       LOC 9     eLOC 7     lLOC 6    Comment 1   Total Lines 10

------------------------------------------------------------------------
~~ Total File Summary ~~
  LOC 31   eLOC 23   lLOC 15   Comment 9   Lines 47
------------------------------------------------------------------------
~~ File Functional Summary ~~
  File Function Count ...: 2
  Total LOC Lines LOC ...: 23
  Total eLOC Lines ......: 17
  Total lLOC Lines ......: 12
  Total Function Params .: 3
  Total Function Return .: 3
  Total Cyclo Complexity : 7
  Total Function Complex.: 13
  ----------------------------
  Max Function LOC ......: 14     Average Function LOC ..: 11.50
  Max Function eLOC .....: 10     Average Function eLOC .: 8.50
  Max Function lLOC .....: 6      Average Function lLOC .: 6.00
  ----------------------------
  Max Function Parameters: 2      Avg Function Parameters: 1.50
  Max Function Returns ..: 2      Avg Function Returns ..: 1.50
  Max Interface Complex. : 4      Avg Interface Complex. : 3.00
  Max Cyclomatic Complex.: 4      Avg Cyclomatic Complex.: 3.50
  Max Total Complexity ..: 8      Avg Total Complexity ..: 6.50
________________________________________________________________________
End of File: BinarySearch.java
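As a check on these figures, McCabe's measure for a single structured routine can also be computed with the common simplification v(G) = (number of binary decision points) + 1. The binarySearch method of the original BinarySearch listing (reproduced as the level 0 code in Chapter 5, and assumed here to be the same file that RSM analysed) contains three decision points, the while condition and the two if tests, giving v(G) = 3 + 1 = 4; main contains two for loops, giving v(G) = 2 + 1 = 3. Both values agree with the Cyclo Vg figures reported above.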

http://msquaredtechnologies.com/

Structure Metric Method (GST)

Before discussing how the GST algorithm was implemented in this project, it is important to have read and understood how the algorithm actually functions; a description of this can be found in Chapter 3. Also, from this point, the developed Java tool will be known as jGST, standing for Java Greedy-String-Tiling. This chapter will now proceed to go through each step of the development of jGST, concluding with a section demonstrating how the tool functions under normal use.

Requirements

Firstly, the requirements of the tool can be identified. The first four are of vital importance if the tool is to be of any use at all. These minimum requirements are considered to be:

1. To allow the user to choose source and target Java code files, those that are to be compared for similarity.
2. To allow the user to specify a Minimum Match Length (MinML).
3. To output a quantifiable plagiarism score between two pieces of source code.
4. To be able to run on the PC platform.

In addition to those above, four further desirable requirements were recognised. These were:

5. To display any matching substrings between the source and target files after they have been compared.
6. To be able to save the plagiarism scores of all the comparisons that have been executed during a single session, and in a ranking order.
7. To be able to reset the tool, so that a new session can begin without having to reload the program.
8. To incorporate the tool into a friendly graphical user interface (GUI).

Quantifying the Similarity

In order to satisfy the third requirement of having to output quantifiable plagiarism scores, the following dice score formula [13] will be used.

diceScore(sFile, tFile) = 2 * (sum of length_i over all tiles i) / (|sFile| + |tFile|)

It measures similarity by the fraction of tokens between the source and target files that are covered by matches, with an output given between 0 and 1, where 0 represents no similarity and 1 represents the detection of equivalent files. It was felt that implementing this method would provide realistic results which could be easily interpreted. To demonstrate how it works, consider the following two files (where the MinML = 1):

FileA: int i; static double j;
FileB: static double j; int i;

The two matching token sequences found between them are int i; of length 3, and static double j; of length 4. Using this, and noting that the length of both FileA and FileB is 7, the diceScore formula can be applied as below:

diceScore(FileA, FileB) = 2 * (3 + 4) / (7 + 7) = 1

The output of 1 shows that although FileB is simply a rearrangement of FileA, it is indeed equivalent and should be interpreted as plagiarised.
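The scoring step itself is only a few lines of Java. The sketch below is illustrative rather than the project source (the class and method names are invented here; in jGST the equivalent work is done by the getDiceScore method described later), but it shows the calculation directly.

import java.util.Arrays;
import java.util.List;

// Illustrative helper: dice score = 2 * (total tokens covered by tiles)
//                                   / (tokens in source + tokens in target).
public final class DiceScoreSketch
{
    public static double diceScore( List<Integer> tileLengths,
                                    int sourceTokens, int targetTokens )
    {
        int covered = 0;
        for ( int length : tileLengths )
            covered += length;
        return ( 2.0 * covered ) / ( sourceTokens + targetTokens );
    }

    public static void main( String[] args )
    {
        // The FileA / FileB example above: tiles of length 3 and 4, seven tokens in each file.
        System.out.println( diceScore( Arrays.asList( 3, 4 ), 7, 7 ) );   // prints 1.0
    }
}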
Tokenising the Source and Target Files

As was noted when describing the GST algorithm in Chapter 3, the preliminary stage of the method is to convert both the source and target files into token strings. As this is not actually part of the GST algorithm, it was felt that, to save time, a prewritten tokeniser would be used. One such example was subsequently located (see the footnote below) and adapted for use in this project. It was also found that this tokeniser was able to remove comments and string-constants, something that, as already discussed, was recommended. To demonstrate the actions of the tokeniser, given the following piece of Java code,
// this is a test
public test(int i)
{
    /* constructor */
    intVar = i;
}

the resulting token sequence would be output as,


public test ( int i ) { intVar = i ; }
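The tokeniser actually used by jGST was adapted from the third-party code referenced in the footnote below, and its internals are not reproduced here. As a rough approximation of the same behaviour, the standard java.io.StreamTokenizer class can produce a comparable token stream; the class below is an illustrative sketch only, not the code used in the project.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: reads a Java source file and returns its tokens,
// discarding // and /* */ comments and string constants.
public class SimpleTokeniser
{
    public static List<String> tokenise( String fileName ) throws IOException
    {
        List<String> tokens = new ArrayList<String>();
        try ( BufferedReader reader = new BufferedReader( new FileReader( fileName ) ) )
        {
            StreamTokenizer st = new StreamTokenizer( reader );
            st.slashSlashComments( true );   // drop // comments
            st.slashStarComments( true );    // drop /* */ comments
            st.quoteChar( '"' );             // group string constants into single tokens

            while ( st.nextToken() != StreamTokenizer.TT_EOF )
            {
                if ( st.ttype == '"' )
                    continue;                                    // string constants are removed
                else if ( st.ttype == StreamTokenizer.TT_WORD )
                    tokens.add( st.sval );                       // identifiers and keywords
                else if ( st.ttype == StreamTokenizer.TT_NUMBER )
                    tokens.add( String.valueOf( st.nval ) );     // numeric literals
                else
                    tokens.add( String.valueOf( (char) st.ttype ) );  // punctuation
            }
        }
        return tokens;
    }
}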

The Construction of jGST

The class diagram below shows the relationships between the seven classes that were constructed in order for the jGST tool to be completed. It can be seen that they are split into two groups, with the GSTinterface, InterfaceFrame and JavaFilter classes used for creating a graphical interface, and the Comparer, GSTalgorithm, GSTtoken and GSTtile classes involved in implementing the actual GST algorithm.

The GSTinterface class has the sole purpose of starting the jGST tool, which it achieves through the execution of its main method. This creates a single instance of the InterfaceFrame class and so begins a session of the tool. InterfaceFrame is made up of nineteen attributes, of which the first seventeen control the physical look of the user interface. These include labels, text fields, file choosers, etc. The resultsOnScreen and resultsNum attributes store information regarding the plagiarism tests that have taken place up to the current point of the session. The JavaFilter class is instantiated many times by InterfaceFrame, and simply adds the *.java filter to either the sourceChooser or targetChooser attributes when they are instantiated. An example of this filter in action will be seen later. Finally, the InterfaceFrame class is able to create any number of instances of GSTalgorithm, providing each one is constructed with a valid Minimum Match Length. This is achieved through implementing the actionPerformed method, and is the only way in which the interface can execute the underlying GST algorithm.

The class diagram does not include the classes that make up the imported tokeniser. This is because, as they were not constructed as part of the project, it was not felt relevant to attempt to explain comprehensively how they work. All that is required is an understanding that the GSTalgorithm class can instantiate as many instances of the tokeniser as it needs to complete a session of plagiarism tests.
See http://www.devx.com/premier/mgznarch/javapro/2001/09sep01/cd0109/cd0109-1.asp

[Class diagram of the jGST tool. The seven classes and their members are:

GSTinterface: main(String[])
InterfaceFrame: clearMenuItem, saveMenuItem, exitMenuItem, setSourceMenuItem, setTargetMenuItem, executeMenuItem, saveChooser, sourceChooser, targetChooser, sourceLabel, sourceField, targetLabel, targetField, minMatchLabel, minMatchField, resultsArea, scrollPane, resultsOnScreen, resultsNum; actionPerformed(ActionEvent)
JavaFilter: accept(File), getDescription()
Comparer: compare(Object, Object)
GSTalgorithm: minMatchLength, resWords; GSTalgorithm(int), tokeniseFile(String), impMethod(Hashtable, Hashtable), getDiceScore(Stack, double, double), getMeanTileLength(Stack)
GSTtile: sourceIndex, targetIndex, length; GSTtile(Integer, Integer, int), getSourceIndex(), getTargetIndex(), getLength()
GSTtoken: name, marked; GSTtoken(String), getName(), getMarked(), setMarked()

GSTinterface creates a single InterfaceFrame; InterfaceFrame uses JavaFilter and creates GSTalgorithm instances; GSTalgorithm uses Comparer and creates GSTtoken and GSTtile instances, and can also create many instances of the Tokeniser.]

The GSTalgorithm class has two attributes. The first is the now self-explanatory minMatchLength, and the second is a list of reserved words (resWords). These words, excluding the names of built-in functions, make up the Java lexicon, and they are now the only tokens to be considered when constructing the source and target file token sequences. This helps to remove all tokens that are not from the lexicon of the target language, as was stated as desirable in Chapter 3. To demonstrate this procedure, consider again the token sequence that was produced earlier,
public test ( int i ) { intVar = i ; }

This would now reduce to,


public ( int ) { = ; }

Each of the tokens in the above sequence can be represented by an instance of GSTtoken. This class can be instantiated many times by GSTalgorithm, and is made up of the name of the token and whether or not it is marked. Following on, each instance of GSTtile represents a sequence of matched tokens that occur between the source and target files. Again it can be instantiated many times by the GSTalgorithm class, and is constructed with three attributes. These are sourceIndex and targetIndex - which state where the matched sequence begins in each file, and a length value which indicates how long the matched sequence is.
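For reference, these two data classes are simple value holders. The following reconstruction is a sketch based only on the attributes and methods named in the class diagram; it is not the project source, and the bodies are assumptions.

// Sketch of the two small data classes, reconstructed from the class diagram.
class GSTtoken
{
    private final String name;    // the token text, e.g. "while" or "{"
    private boolean marked;       // true once the token has been covered by a tile

    GSTtoken( String name ) { this.name = name; }

    String getName()    { return name; }
    boolean getMarked() { return marked; }
    void setMarked()    { marked = true; }
}

class GSTtile
{
    private final Integer sourceIndex;   // where the matched sequence begins in the source file
    private final Integer targetIndex;   // where the matched sequence begins in the target file
    private final int length;            // how many tokens the tile covers

    GSTtile( Integer sourceIndex, Integer targetIndex, int length )
    {
        this.sourceIndex = sourceIndex;
        this.targetIndex = targetIndex;
        this.length = length;
    }

    Integer getSourceIndex() { return sourceIndex; }
    Integer getTargetIndex() { return targetIndex; }
    int getLength()          { return length; }
}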

By noting that the Comparer class is instantiated only to help with some internal ordering of token names, the remaining methods that GSTalgorithm implements can be explained. It is this class which has the ability, using hash-tables of the source and target file tokens, to implement the actual GST algorithm. The stack of GSTtiles returned from impMethod, along with the number of tokens in each file, can then be used by the getDiceScore method to calculate the quantified plagiarism figure. For completeness, note that this class also has a method to determine the mean tile length of the GSTtiles returned by impMethod.
The Completed Tool

Having shown how the jGST tool was developed, this section will now demonstrate how it operates under normal conditions and how it meets the requirements laid out earlier in this chapter. Firstly, on executing the GSTinterface class, the following screen is displayed.

Figure 4.1: The jGST main screen

This is the main screen. From here the user can operate the entire tool, with the Method menu, shown below, being the starting point.

Figure 4.2: The Method menu

This allows the user to select a source or target file via a file chooser; see Figure 4.3 below. Notice how this completes the first minimum requirement stated above.

Figure 4.3: Selecting the source file using a file chooser

It can now be seen how the *.java filter is applied, restricting the user to selecting only Java code files. Once a source or target file has been chosen, its path will appear in its respective text field on the main screen (see Figure 4.4). Once both files have been selected and a valid MinML entered into the MML field on the main screen, the Execute item on the Method menu can be selected (see Figure 4.2). Note that it is not possible to execute the GST algorithm until both files and a MinML are set. At this point the second minimum requirement, concerning the option of being able to specify any MinML, has also been achieved. Below is an example of the tool after a similarity comparison between two files has taken place.

Figure 4.4: The main screen after an execution

It can be seen that all the matches greater than or equal to the MinML have been displayed, completing the fifth (desired) requirement. Note how the dice score, along with other potentially useful figures, is also shown. As a reminder, this is our quantified plagiarism score.

Having executed as many similarity comparisons between source and target files as required, the tool allows the current results shown on screen to be saved. This is done using the Save Results item on the File menu as shown in Figure 4.5 below.

Figure 4.5: The File menu

The results can then be saved as a text file to any required destination. This is completed using another file chooser, similar in appearance to the one seen in Figure 4.3. To understand what is being saved, a sample output file is displayed below.
Legend: Rank, Dice Score, Mean Match Length, Source File, Target File, MinML, MaxML

Rank 1: 0.8338762214983714  16.0
        c:\java\testData\binary\Level 0\BinarySearch.java
        c:\java\testData\binary\Level 4\BinarySearch4.java   5   38
Rank 2: 0.6750788643533123  11.88888888888889
        c:\java\testData\binary\Level 0\BinarySearch.java
        c:\java\testData\binary\Level 5\BinarySearch5.java   5   38
Rank 3: 0.6246056782334385  9.0
        c:\java\testData\binary\Level 0\BinarySearch.java
        c:\java\testData\binary\Level 6\BinarySearch6.java   5   18

Using the legend provided, it can be seen how the data from each execution is recorded. The tests are also ranked in order of dice score, i.e. the most plagiarised pair is shown first. This has allowed the sixth requirement, that of being able to save the results in a ranking order, to be achieved. The seventh requirement, of allowing the tool to be reset at any time, has also been met. By selecting the Clear Results item on the File menu (see Figure 4.5), all previous results will be lost and the main display will return blank. This means that there is never a need to restart the tool to begin a new session of plagiarism tests, which was also part of the requirement. Finally, it must be noted that the jGST tool has been executed on the PC platform in order to carry out this demonstration of functionality, and that a graphical user interface has been developed. Because of this, it can be concluded that the tool has successfully implemented all four minimum and all four desired requirements.

5. Collecting the Data Set


Having established an implementation for the chosen attribute counting and structure metric methods, the construction of a data set is now required. The completed set will serve two functions. Firstly, it will provide Java algorithms that can be used for the calculation of a threshold for each method, and secondly it will supply algorithms to make up a test set. To reflect the idea of identifying plagiarism in code that may have been handed in as part of a university assignment, it was decided to consider six general algorithmic problems and compile the data set from implementations of those. These were:

1. A binary search. (BinarySearch)
2. The copying of the contents of one file to another. (CopyFile)
3. The Sieve of Eratosthenes problem, the calculation of prime numbers. (Eratosthenes)
4. The Towers of Hanoi problem, also known as Pedagogic Towers. (TowersOfHanoi)
5. A merge sort. (MergeSort)
6. A shell sort. (ShellSort)

All six were felt to be an adequate representation of at least part of a basic assignment on which a student may attempt to cheat by copying another's code. The problem now, however, is that for the purpose of this project the data set must be composed of both algorithms that are plagiarised and algorithms that are not. This identification of a suitable software sample is, as discussed by Sallis et al., of prime difficulty [2].

At first, the most obvious solution seemed to be to ask within the Department of Computer Science for a set of completed programming assignments. But the main problem here is that we do not know how many, if any, are plagiarised. As Verco and Wise [3] point out, establishing the actual set of positive detections will always be a matter of guesswork. Also, it is highly unlikely that such code would concern itself with one of the six algorithmic problems listed above. Looking elsewhere, Verco and Wise go on to discuss their alternative solution to collecting a data set, which is as stated below.

What might therefore be attempted is to solicit solutions to a typical assignment from readers of Internet newsgroups such as [name]. A request would be broadcast over the net asking respondents to register with an independent adjudicator. Respondents will then be given the specifications of a programming assignment to be done in [language]. Most respondents will simply return their solutions. However, the adjudicator will also arrange a percentage of cheats who will be given other respondents' solutions and asked to use these as the basis for their solution.

This was initially considered as a viable option for the collection of the data sample, as it was felt it could lead to more thorough and accurate project results. But after further investigation, and thoughts as to which newsgroups would be willing to take part and in what time-scale such an elaborate scheme could be set up and implemented, the plan was dropped. (Note in addition that even if this system had been executed as planned, the collection of valid algorithms would not have been guaranteed.)

The time and effort spent researching this proposal was not completely wasted, however. It led to an extensive search of the internet for possible sources of data, taking in more newsgroups, mainly those to do with the field of Java programming, and many academic institutions. Overall, despite the problem of potential material being copyrighted, the search resulted in two implementations of each of the six algorithms being located. With these algorithms, the method of creating the remaining entries for the data set was chosen. This was to use Faidhi and Robinson's six-level plagiarism spectrum [9] to create the required plagiarised algorithms ourselves. It was felt that this was the only way in which a reliable data set
was going to be constructed given the time constraints of the project. This spectrum is shown below,
[Figure: the plagiarism spectrum of Faidhi and Robinson [9]. The horizontal axis shows the level of plagiarism Li, from L0 (no changes) to L6, with increasing levels of modification; the successive levels add changes to comments, identifiers, variable positions, procedure combination, program statements and, finally, control logic.]

The six levels of the above spectrum are defined as follows [9]:

L0: No changes; this is the original program.
L1: Represents the changes in comments and indentation.
L2: Represents the changes of level 1 and changes in identifiers.
L3: Represents the changes of level 2 and changes in declarations (i.e. declaring extra constants, changing the positions of declared variables, shuffling the functions, etc.).
L4: Represents the changes of level 3 and changes in program modules (i.e. merging two functions into one or creating new functions).
L5: Represents the changes of level 4 and changes in the program statements (i.e. FOR instead of WHILE, etc.).
L6: Represents the changes of level 5 and changes in the decision logic (i.e. changes in expressions).

As the level of modification increases, the code becomes less recognisable from the original level 0 algorithm. This means, of course, that the implemented methods should find detecting plagiarism more difficult when the higher level code is used. This is backed up by the authors' claim that when the original program and the level 6 transformed program have been scanned by several assignment evaluators, they have indicated that their visual inspection would have failed to detect plagiarism [9]. This is further evidence to suggest that this method will allow us to create a valid data set.

In applying the method to complete the collection of the data set, one of the two implementations of each algorithmic problem was run through the spectrum. This resulted in a final data set containing the following algorithms (note that due to the simplicity of some of the algorithms, not all levels of the spectrum could be applied to them):

- Two non-plagiarised versions of the BinarySearch, CopyFile, Eratosthenes, TowersOfHanoi, MergeSort and ShellSort algorithms.
- Six plagiarised versions of the BinarySearch, Eratosthenes and MergeSort algorithms.
- Five plagiarised versions of the CopyFile and TowersOfHanoi algorithms.
- Four plagiarised versions of the ShellSort algorithm.

This data set is then split into seven smaller sets. The first is made up of the non-plagiarised implementations of each algorithm that were not run through the spectrum; this shall be called the non-plagiarised algorithm set. The other six are made up of the listings developed at every level (including level 0) of the spectrum for each algorithmic problem - hence six sets. These shall be called the six plagiarised algorithm sets.

To demonstrate how the spectrum works, one of the six plagiarised algorithm sets - the BinarySearch one - is presented in full. It consists of seven algorithms, increasing in modification from the non-plagiarised level 0 code to the most modified level 6 code. Note that the highlighted code segments show the changes that have occurred at each level.
Level 0: This is the original program and is non-plagiarised.
// Program to implement a Binary Search
// SOURCE - http://www.cs.fiu.edu/~weiss/dsaajava/code/Miscellaneous/Fig02_09.java
import DataStructures.Comparable;
import DataStructures.MyInteger;

public class BinarySearch
{
    public static final int NOT_FOUND = -1;

    /**
     * Performs the standard binary search.
     * @return index where item is found, or -1 if not found
     */
    public static int binarySearch( Comparable [ ] a, Comparable x )
    {
        int low = 0, high = a.length - 1;
        while( low <= high )
        {
            int mid = ( low + high ) / 2;
            if( a[ mid ].compareTo( x ) < 0 )
                low = mid + 1;
            else if( a[ mid ].compareTo( x ) > 0 )
                high = mid - 1;
            else
                return mid;   // Found
        }
        return NOT_FOUND;     // NOT_FOUND is defined as -1
    }

    // Test program
    public static void main( String [ ] args )
    {
        int SIZE = 8;
        Comparable [ ] a = new MyInteger [ SIZE ];
        for( int i = 0; i < SIZE; i++ )
            a[ i ] = new MyInteger( i * 2 );
        for( int i = 0; i < SIZE * 2; i++ )
            System.out.println( "Found " + i + " at " + binarySearch( a, new MyInteger( i ) ) );
    }
}

Level 1: Represents the changes in comments and indentation.
import DataStructures.Comparable;
import DataStructures.MyInteger;

public class BinarySearch1
{
    public static final int NOT_FOUND = -1;

    // Binary Search Algorithm - Returns index if item found, -1 if not found
    public static int binarySearch( Comparable [ ] a, Comparable x )
    {
        int low = 0, high = a.length - 1;
        while( low <= high )
        {
            int mid = ( low + high ) / 2;
            if( a[ mid ].compareTo( x ) < 0 )
                low = mid + 1;
            else if( a[ mid ].compareTo( x ) > 0 )
                high = mid - 1;
            else
                // Return index of found item
                return mid;
        }
        return NOT_FOUND;
    }

    // Program to test Binary Search
    public static void main( String [ ] args )
    {
        int SIZE = 8;
        Comparable [ ] a = new MyInteger [ SIZE ];
        for( int i = 0; i < SIZE; i++ )
            a[ i ] = new MyInteger( i * 2 );
        for( int i = 0; i < SIZE * 2; i++ )
            System.out.println( "Located: " + i + " at index: " + binarySearch( a, new MyInteger( i ) ) );
    }
}

Level 2: Represents the changes in level 1 and changes in identifiers.


import DataStructures.Comparable;
import DataStructures.MyInteger;

public class BinarySearch2
{
    public static final int NOT_IN = -1;

    // Binary Search Algorithm - Returns index if item found, -1 if not found
    public static int search( Comparable [ ] inArray, Comparable toFind )
    {
        int down = 0, up = inArray.length - 1;
        while( down <= up )
        {
            int middle = ( down + up ) / 2;
            if( inArray[ middle ].compareTo( toFind ) < 0 )
                down = middle + 1;
            else if( inArray[ middle ].compareTo( toFind ) > 0 )
                up = middle - 1;
            else
                // Return index of found item
                return middle;
        }
        return NOT_IN;
    }

    // Program to test Binary Search
    public static void main( String [ ] args )
    {
        int numItems = 8;
        Comparable [ ] itemArray = new MyInteger [ numItems ];
        for( int i = 0; i < numItems; i++ )
            itemArray[ i ] = new MyInteger( i * 2 );
        for( int j = 0; j < numItems * 2; j++ )
            System.out.println( "Located: " + j + " at index: " + search( itemArray, new MyInteger( j ) ) );
    }
}


Level 3: Represents the changes of level 2 and changes in declarations (i.e. declaring extra constants, changing the positions of declared variables and shuffling the functions, etc.). In this example, it can also be seen that the main and search methods have been swapped around.
import DataStructures.Comparable;
import DataStructures.MyInteger;

public class BinarySearch3
{
    public static final int NOT_IN = -1;
    public static final int ZERO_FLAG = 0;
    public static final int NUM_ITEMS = 8;

    // Program to test Binary Search
    public static void main( String [ ] args )
    {
        Comparable [ ] itemArray = new MyInteger [ NUM_ITEMS ];
        for( int i = 0; i < NUM_ITEMS; i++ )
            itemArray[ i ] = new MyInteger( i * 2 );
        for( int j = 0; j < NUM_ITEMS * 2; j++ )
            System.out.println( "Located: " + j + " at index: " + search( itemArray, new MyInteger( j ) ) );
    }

    // Binary Search Algorithm - Returns index if item found, -1 if not found
    public static int search( Comparable [ ] inArray, Comparable toFind )
    {
        int up = inArray.length - 1;
        int down = 0;
        int middle;
        while( down <= up )
        {
            middle = ( down + up ) / 2;
            if( inArray[ middle ].compareTo( toFind ) < ZERO_FLAG )
                down = middle + 1;
            else if( inArray[ middle ].compareTo( toFind ) > ZERO_FLAG )
                up = middle - 1;
            else
                // Return index of found item
                return middle;
        }
        return NOT_IN;
    }
}

Level 4: Represents the changes of level 3 and changes in program modules (i.e. merging two functions into one or creating new functions). Note how the searchResults() and fillArray() methods have been created.
import DataStructures.Comparable;
import DataStructures.MyInteger;

public class BinarySearch4
{
    public static final int NOT_IN = -1;
    public static final int ZERO_FLAG = 0;
    public static final int NUM_ITEMS = 8;

    public static void main( String [ ] args )
    {
        searchResults();
    }

    public static void searchResults()
    {
        for( int j = 0; j < NUM_ITEMS * 2; j++ )
            System.out.println( "Located: " + j + " at index: " + search( fillArray(), new MyInteger( j ) ) );
    }

    public static Comparable[] fillArray()
    {
        Comparable [ ] itemArray = new MyInteger [ NUM_ITEMS ];
        for( int i = 0; i < NUM_ITEMS; i++ )
            itemArray[ i ] = new MyInteger( i * 2 );
        return itemArray;
    }

    // Binary Search Algorithm - Returns index if item found, -1 if not found
    public static int search( Comparable [ ] inArray, Comparable toFind )
    {
        int up = inArray.length - 1;
        int down = 0;
        int middle;
        while( down <= up )
        {
            middle = ( down + up ) / 2;
            if( inArray[ middle ].compareTo( toFind ) < ZERO_FLAG )
                down = middle + 1;
            else if( inArray[ middle ].compareTo( toFind ) > ZERO_FLAG )
                up = middle - 1;
            else
                // Return index of found item
                return middle;
        }
        return NOT_IN;
    }
}

Level 5: Represents the changes of level 4 and changes in the program statements (i.e. FOR instead of WHILE, etc.).
import DataStructures.Comparable;
import DataStructures.MyInteger;

public class BinarySearch5
{
    public static final int NOT_IN = -1;
    public static final int ZERO_FLAG = 0;
    public static final int NUM_ITEMS = 8;

    public static void main( String [ ] args )
    {
        searchResults();
    }

    public static void searchResults()
    {
        int j = ZERO_FLAG;
        while( j < NUM_ITEMS * 2 )
        {
            System.out.println( "Located: " + j + " at index: " + search( fillArray(), new MyInteger( j ) ) );
            j++;
        }
    }

    public static Comparable[] fillArray()
    {
        Comparable [ ] itemArray = new MyInteger [ NUM_ITEMS ];
        int i = ZERO_FLAG;
        while( i < NUM_ITEMS )
        {
            itemArray[ i ] = new MyInteger( i * 2 );
            i++;
        }
        return itemArray;
    }

    // Binary Search Algorithm - Returns index if item found, -1 if not found
    public static int search( Comparable [ ] inArray, Comparable toFind )
    {
        int up = inArray.length - 1;
        int down = -1;
        int middle;
        for( int k = down; k <= up; k++ )
        {
            middle = ( down + up ) / 2;
            if( inArray[ middle ].compareTo( toFind ) < ZERO_FLAG )
                down = middle + 1;
            else if( inArray[ middle ].compareTo( toFind ) > ZERO_FLAG )
                up = middle - 1;
            else
                // Return index of found item
                return middle;
        }
        return NOT_IN;
    }
}


Level 6: Represents the changes of level 5 and changes in the decision logic (i.e. changes in expression). In this case an if-then-else statement has been reordered.
import DataStructures.Comparable;
import DataStructures.MyInteger;

public class BinarySearch6
{
    public static final int NOT_IN = -1;
    public static final int ZERO_FLAG = 0;
    public static final int NUM_ITEMS = 8;

    public static void main( String [ ] args )
    {
        searchResults();
    }

    public static void searchResults()
    {
        int j = ZERO_FLAG;
        while( j < NUM_ITEMS * 2 )
        {
            System.out.println( "Located: " + j + " at index: " + search( fillArray(), new MyInteger( j ) ) );
            j++;
        }
    }

    public static Comparable[] fillArray()
    {
        Comparable [ ] itemArray = new MyInteger [ NUM_ITEMS ];
        int i = ZERO_FLAG;
        while( i < NUM_ITEMS )
        {
            itemArray[ i ] = new MyInteger( i * 2 );
            i++;
        }
        return itemArray;
    }

    // Binary Search Algorithm - Returns index if item found, -1 if not found
    public static int search( Comparable [ ] inArray, Comparable toFind )
    {
        int up = inArray.length - 1;
        int down = -1;
        int middle;
        for( int k = down; k <= up; k++ )
        {
            middle = ( down + up ) / 2;
            if( inArray[ middle ].compareTo( toFind ) > ZERO_FLAG )
                up = middle - 1;
            else if( inArray[ middle ].compareTo( toFind ) < ZERO_FLAG )
                down = middle + 1;
            else
                // Return index of found item
                return middle;
        }
        return NOT_IN;
    }
}


6. Pre-Testing
Before full testing and analysis can take place for both the attribute counting and structure metric methods, it is important to identify a plagiarism threshold for each of them. This will consist of a value against which each quantified test result can be compared and declared either plagiarised or non-plagiarised. The data used to calculate the thresholds will be the six plagiarised algorithm sets that were created as described in the previous Collecting the Data Set chapter.

Identifying the Threshold (McCabe's Cyclomatic Complexity)

It is important at this point to fully understand what McCabe's Cyclomatic Complexity is attempting to achieve when analysing a given piece of Java code. To briefly recap, it first of all takes the listing and converts it into a control graph before, as McCabe states, "the overall strategy will be to measure the complexity of a program by computing the number of linearly independent paths v(G)" [5]. (In the standard formulation, v(G) = E - N + 2P for a control graph with E edges, N nodes and P connected components.) Further discussion can be found in the Choosing the Methods chapter of this report if required.

It must also be noted that the highest level of modification from each of the six plagiarised algorithm sets (i.e. level 5 or 6) is not used when calculating the threshold. These six pieces of code are saved to be used as the genuine plagiarised versions of the algorithms during full testing. The graph below shows the complexity figures for each of the remaining levels of each algorithm.
[Figure: Cyclomatic Complexity of Algorithms at each Level of Plagiarism - cyclomatic complexity plotted against plagiarism level for the Binary Search, Copy File, Eratosthenes, Towers of Hanoi, Merge Sort and Shell Sort algorithms.]

It can be seen in all but the Shell Sort algorithm's results that the only change in complexity occurs at the fourth level of plagiarism. Referring back to the Collecting the Data Set chapter (page 21), this is the level where functions are merged or new ones created. It is therefore easy to understand why McCabe's Cyclomatic Complexity recorded such movement, as the introduction or removal of control structures will alter the number of linearly independent paths. In addition, the other levels of plagiarism (1, 2, 3 and 5) are designed more to shuffle or substitute code for a similar structure than to remove or add to it. This allows complexities to remain constant around these levels, which can also be seen in the figure above.

Using these results, the suggested method for obtaining a threshold is to take the difference between the complexities of level 0 and the highest level of modification available (i.e. 4 or 5) for each algorithm, resulting in six different thresholds. (Note that due to the simplicity of the original level 0 Shell Sort algorithm, level 3 was used to calculate its threshold.)
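Expressed as a formula (the notation here is mine rather than the report's), with C_Li(A) denoting the cyclomatic complexity of algorithm A at plagiarism level i, and L_max the highest level of modification available for that algorithm:

$$T_A = \left| \, C_{L_{\max}}(A) - C_{L_0}(A) \, \right|$$

The absolute value simply reflects the fact that the thresholds are applied symmetrically (+/-) in the decision rule that follows.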


Then, to test a target listing: if the difference between the target's complexity and that of the source listing is greater than the pre-calculated threshold, the target is deemed non-plagiarised, whereas if the difference is less than or equal to the threshold, the target should be treated as plagiarised. For the six algorithms used above, the thresholds are calculated to be the following.
Algorithm         Threshold (Complexity +/-)
Binary Search     2
Copy File         2
Eratosthenes      2
Towers of Hanoi   1
Merge Sort        1
Shell Sort        0
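As a minimal illustration of how these per-algorithm thresholds would be applied, a sketch of the decision rule is given below. The class and method names are hypothetical, and the complexity values are assumed to come from whichever metric tool is actually used.

```java
// Hypothetical sketch of the attribute counting decision rule described above.
public class ComplexityThresholdCheck {

    /** Returns true if the target should be treated as plagiarised from the source. */
    public static boolean isPlagiarised(int sourceComplexity, int targetComplexity, int threshold) {
        int difference = Math.abs(targetComplexity - sourceComplexity);
        // Differences inside (or on) the threshold are flagged as plagiarism,
        // differences outside it are treated as non-plagiarised.
        return difference <= threshold;
    }

    public static void main(String[] args) {
        // Example using the Binary Search figures quoted later in the report:
        // level 0 complexity 7, threshold +/-2, suspect complexity 9.
        System.out.println(isPlagiarised(7, 9, 2));   // prints true
    }
}
```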

However, this is recognised as being far from ideal, as it clearly allows two programs which are completely different to be viewed as a plagiarised pair simply because they share the same complexity level. It can also be easily fooled, as we have already seen with the sudden increase in complexity when control structures are added to a program. Another drawback is that should the complexity difference rise and then fall at the most modified level, complexity differences for earlier levels of plagiarism may fall outside the threshold and become false negative results. For example, if the first five levels of an algorithm produced difference figures of 2, 3, 4, 5 and 1 respectively, the threshold would be set to +/-1, leaving levels 1 to 4 outside of it and classed as non-plagiarised.

However, the method does reflect the fact that since the highest level of modification known about is used directly to determine the threshold - and that generally includes the fourth level alterations - generating a fifth or sixth level should, if anything, lead to a decrease in complexity. Therefore the threshold found for each algorithm should be adequate in most cases to detect plagiarism up to the level it was calculated from, and beyond.

One possible improvement to the method described above is to take an average of the complexity differences and use that as the threshold. This, however, also leads to problems. Firstly, there is a risk of calculating a decimal threshold and then having to lose the accuracy gained by rounding off. The second problem is that one difference value (or a number of similar values) can add unwanted weighting to the final threshold figure. This is demonstrated by an example in which the five difference values are 1, 1, 1, 1 and 5 for each level of plagiarism respectively: the threshold is calculated to be 1.8, which is then rounded to 2. This means that the fifth level of plagiarism (with a difference score of 5) would be categorised incorrectly as non-plagiarised. It is because of these problems that this potential method of calculating the thresholds has been rejected, and the earlier method, along with the threshold figures shown above, will be used during the testing of the attribute counting method.



Identifying the Thresholds (GST)


Firstly, as will be discussed in the subsequent Testing and Analysis chapter, four different Minimum Match Lengths (see page 9) will be required to allow for comprehensive testing of the structure metric method. This means that four different thresholds, one for each Minimum Match Length (MinML), will be required.

In identifying the thresholds for the structure metric method, the highest modified level of each of the six plagiarised algorithm sets is again not used; it is reserved, as in the attribute counting method's threshold identification, for later use as positive test data. The method to calculate a threshold starts by using the developed jGST tool to output a quantified score for the level 0 code of each of the six algorithms when paired with every remaining algorithm in the data set. In this case, each level 0 piece of code is also paired with itself; this is deemed a valid procedure, as copying an entire code listing is, if a primitive one, still a form of plagiarism.

The data that is produced from this is displayed on six graphs. Each individual data value from each graph is then placed into one of two sets, depending on whether or not it belongs to an algorithm pair that is known to include plagiarism. The quartile ranges of each set are then calculated. The actual threshold is the average of the upper quartile of the non-plagiarised set and the lower quartile of the plagiarised set. This method is demonstrated in the box plot diagram below.

[Box plot: the non-plagiarised values set and the plagiarised values set, with the threshold lying midway between the upper quartile of the non-plagiarised set and the lower quartile of the plagiarised set.]

If the value is above the threshold it can be said to be plagiarised, whereas if it falls below or on the threshold, it can be classed as non-plagiarised. The benefit of using the quartile ranges in calculating the threshold is that anomalies in each set are discounted, thus hopefully allowing for a more accurate figure to be found. Finally, this procedure is repeated for all MinMLs (1, 3, 5 and 10), resulting in the four thresholds being identified as required. Below are the six graphs produced for each match length, along with its calculated threshold.
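A minimal sketch of the threshold calculation described above is given below. The quartile computation shown uses simple linear interpolation between ranked values; since the exact quartile convention used in the project is not stated, the figures this sketch produces could differ slightly from those quoted in the report.

```java
import java.util.Arrays;

// Hypothetical sketch of deriving a GST threshold from two sets of Dice scores.
public class GstThreshold {

    // Quartile by linear interpolation between ranked values (one common convention).
    static double quartile(double[] values, double q) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double pos = q * (sorted.length - 1);
        int lower = (int) Math.floor(pos);
        int upper = (int) Math.ceil(pos);
        return sorted[lower] + (pos - lower) * (sorted[upper] - sorted[lower]);
    }

    /** Threshold = average of the plagiarised set's lower quartile
        and the non-plagiarised set's upper quartile. */
    static double threshold(double[] plagiarisedScores, double[] nonPlagiarisedScores) {
        double lowerQuartilePlag = quartile(plagiarisedScores, 0.25);
        double upperQuartileNonPlag = quartile(nonPlagiarisedScores, 0.75);
        return (lowerQuartilePlag + upperQuartileNonPlag) / 2.0;
    }
}
```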



Minimum Match Length = 1

[Six graphs, one per level 0 source algorithm (Binary Search, Copy File, Eratosthenes, Towers of Hanoi, Merge Sort, Shell Sort): each shows the Dice score (%) obtained when every available level of every algorithm is paired with that level 0 source, at MinML = 1.]

Lower Quartile of Plagiarised Algorithm Pairs = 91.8575 Upper Quartile of Non-Plagiarised Algorithm Pairs = 72.5725
Threshold = 82.22



Minimum Match Length = 3

[Six graphs, one per level 0 source algorithm (Binary Search, Copy File, Eratosthenes, Towers of Hanoi, Merge Sort, Shell Sort): each shows the Dice score (%) obtained when every available level of every algorithm is paired with that level 0 source, at MinML = 3.]

Lower Quartile of Plagiarised Algorithm Pairs = 89.0025 Upper Quartile of Non-Plagiarised Algorithm Pairs = 35.14
Threshold = 62.07



Minimum Match Length = 5

[Six graphs, one per level 0 source algorithm (Binary Search, Copy File, Eratosthenes, Towers of Hanoi, Merge Sort, Shell Sort): each shows the Dice score (%) obtained when every available level of every algorithm is paired with that level 0 source, at MinML = 5.]

Lower Quartile of Plagiarised Algorithm Pairs = 85.79 Upper Quartile of Non-Plagiarised Algorithm Pairs = 16.1725
Threshold = 50.98



Minimum Match Length = 10

[Six graphs, one per level 0 source algorithm (Binary Search, Copy File, Eratosthenes, Towers of Hanoi, Merge Sort, Shell Sort): each shows the Dice score (%) obtained when every available level of every algorithm is paired with that level 0 source, at MinML = 10.]

Lower Quartile of Plagiarised Algorithm Pairs = 69.5425 Upper Quartile of Non-Plagiarised Algorithm Pairs = 0
Threshold = 34.77


7. Testing and Analysis


With the plagiarism thresholds for both the attribute counting and structure metric methods identified, testing and analysis of the results can now take place. The test set to be used will be made up of code from two sources. The highest modification level algorithm from each of the six plagiarised algorithm sets - those that were not used in calculating the thresholds (i.e. level 5 or 6) - will act as the plagiarised test data. The algorithms from the non-plagiarised algorithm set, as specified in the Collecting the Data Set chapter, will be the non-plagiarised test samples. The objective of the test data is to allow the methods to process realistic copied and original code, meaning that whatever the results, they should at least be reasonably reliable. Of course, there is also the hope that both methods will be able to identify which algorithms are plagiarised and which are not. This chapter is therefore split into two, with the results from both methods being compiled and then analysed.
Attribute Counting Method (McCabe's Cyclomatic Complexity)

To test the attribute counting method, the Cyclomatic Complexities of each of the twelve algorithms in the test set were calculated. Then, using the previously identified thresholds, the difference between each algorithm's complexity and that of the corresponding level 0 code was noted. All results are shown in the table below; entries marked with an asterisk show where the method has failed to categorise an algorithm correctly when compared with the original level 0 code.
Algorithm         Level 0      Threshold   Plagiarised Code          Non-Plagiarised Code
                  Complexity   (+/-)       Complexity (Difference)   Complexity (Difference)
Binary Search     7            2           9 (+2)                    5 (-2) *
Copy File         8            2           9 (+1)                    2 (-6)
Eratosthenes      8            2           10 (+2)                   9 (+1) *
Towers of Hanoi   3            1           4 (+1)                    3 (0) *
Merge Sort        7            1           8 (+1)                    10 (+3)
Shell Sort        6            0           8 (+2) *                  4 (-2)

Firstly, it is easily explained why the plagiarised Shell Sort algorithm has been wrongly classified as non-plagiarised. Because of the simplicity of the original code, only four levels of plagiarism were able to be created from it. This resulted in the level 3 algorithm being used to identify the threshold for the Shell Sort algorithms (see page 27). However, as the graph on the same page shows, the only movement in the complexity figures in each set of plagiarised algorithms was at the level 4 stage, which for the Shell Sort is being used as part of the test set - hence the misclassification. It is also worth noting that if the Shell Sort's threshold were altered to +/-2 to reflect the changes in its level 4 complexity, the non-plagiarised version of the algorithm would then be classified incorrectly as plagiarised.

The main problem with McCabe's Cyclomatic Complexity is that it does not consider each token that helps to make up an algorithm, but simply records the number of linearly independent paths that run through the code. This means that when attempting to categorise a non-plagiarised version of an algorithm, there is a risk of finding that its complexity falls inside the threshold, resulting in a wrong classification despite the fact that the code is known to be clearly different.


A good example of this is shown in the results above, where the non-plagiarised Towers of Hanoi code has a complexity of 3. This is identical to the complexity of the original level 0 algorithm, suggesting that it is the same piece of code - which is not true. With the non-plagiarised Binary Search and Eratosthenes algorithms also being categorised incorrectly, it is clear that this method on its own is not conclusive enough to accurately detect plagiarism. This is something that Whale [8] recognised, in that better performance is possible if the number of [attribute] counters is determined by the size of the program. More discussion of this suggestion, and of how both methods could have their overall results improved in general, is provided in the following chapter.
Structure Metric Method (GST)

To allow for comprehensive testing of the structure metric method, four different Minimum Match Lengths (MinML) of 1, 3, 5 and 10 were considered. The reason each length was chosen is given below.
1: The matching of individual tokens between files provides a plagiarism score which indicates the percentage of the total material found in the source file that also appears in the target file. This is regardless of how the token strings have been organised in either file, and so plagiarism scores calculated using this length are generally only affected by the addition or removal of tokens in the target file. (A sketch of how such a score can be derived from the matched tiles is given after this list.)

3: This figure was the default value chosen by Wise [11] for use during implementation of his YAP3 program. However, when studying Java code there are often many occurrences of token strings of length 3 that appear in any number of programs, for example, break;} or )};. Detecting matches of these token strings between a source and target file was therefore felt not always to constitute plagiarism; in many cases such similarities are simply a coincidence.

5: This was the default MinML chosen by the developed jGST tool. Taking an example of its implementation, no part of the token string private static final int = ; would now be matched against private static final double = ;. The aim of this larger MinML is to remove as much as possible of the plagiarism that was felt to be detected by coincidence when a MinML of 3 was used. It is hoped that this will result in more structurally specific token strings being matched between files, thus producing more accurate plagiarism results.

10: The purpose of this much larger MinML is simply to demonstrate that if the MinML is set too high, results will begin to deteriorate. This is because plagiarism scores will be more affected by the fragmentation that occurs in plagiarised code as it is created. The hope, however, is that genuine plagiarised versions of the algorithms will still, on the whole, be identified correctly due to their main recurring structures.
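To make the role of the MinML concrete, the sketch below shows how a similarity score of the kind plotted on the following graphs could be derived from the lengths of the matched tiles found between two files. It assumes a Dice-style percentage over the two token streams, with only tiles of at least MinML tokens contributing; this is an illustration of the general idea (the method name, parameters and normalisation are assumptions), not the jGST implementation itself.

```java
import java.util.List;

// Hypothetical illustration of how a Dice-style score could be computed from
// the lengths of the matched tiles found between a source and a target file.
public class DiceScoreSketch {

    /**
     * @param tileLengths    lengths (in tokens) of the maximal matches found
     * @param minMatchLength tiles shorter than this are ignored
     * @param sourceTokens   total number of tokens in the source file
     * @param targetTokens   total number of tokens in the target file
     * @return similarity as a percentage
     */
    static double diceScore(List<Integer> tileLengths, int minMatchLength,
                            int sourceTokens, int targetTokens) {
        int matched = 0;
        for (int length : tileLengths) {
            if (length >= minMatchLength) {   // the MinML filter
                matched += length;
            }
        }
        // Dice coefficient: matched material counted in both files,
        // normalised by the combined length of the two token streams.
        return 100.0 * (2.0 * matched) / (sourceTokens + targetTokens);
    }
}
```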

At each of these lengths, the six plagiarised and six non-plagiarised algorithms (the target files) from the test set were paired with their equivalent level 0 algorithm (the source file) and run through the jGST tool to obtain a plagiarism score. The results have been graphed below. Analysis of the results at each MinML in relation to its threshold is discussed later.



Minimum Match Length = 1 (Threshold = 82.22)
[Figure 7.1: Plot of test results (MinML = 1) - Dice score (%) of the plagiarised and non-plagiarised test versions of the Binary Search, Copy File, Eratosthenes, Towers of Hanoi, Merge Sort and Shell Sort algorithms.]

To summarise, it can be seen that when the Minimum Match Length is 1, the plagiarised Copy File and Towers of Hanoi algorithms have been wrongly identified as non-plagiarised.
Minimum Match Length = 3 (Threshold = 62.07)
[Figure 7.2: Plot of test results (MinML = 3) - Dice score (%) of the plagiarised and non-plagiarised test versions of the Binary Search, Copy File, Eratosthenes, Towers of Hanoi, Merge Sort and Shell Sort algorithms.]

In summary, although the non-plagiarised version of the Towers of Hanoi algorithm is close to the threshold, all the test data has been classified correctly.



Minimum Match Length = 5 (Threshold = 50.98)


[Figure 7.3: Plot of test results (MinML = 5) - Dice score (%) of the plagiarised and non-plagiarised test versions of the Binary Search, Copy File, Eratosthenes, Towers of Hanoi, Merge Sort and Shell Sort algorithms.]

As with the previous results, the non-plagiarised Towers of Hanoi algorithm is close to the threshold. But as can be seen, all the test data has once again been categorised correctly.
Minimum Match Length = 10 (Threshold = 34.77)
[Figure 7.4: Plot of test results (MinML = 10) - Dice score (%) of the plagiarised and non-plagiarised test versions of the Binary Search, Copy File, Eratosthenes, Towers of Hanoi, Merge Sort and Shell Sort algorithms.]

It can be seen that when the MinML is 10, both the plagiarised Binary Search and Eratosthenes algorithms have been wrongly identified as non-plagiarised.


To begin to explain the results, it helps to understand why the threshold at each Minimum Match Length obtained the value that it did; without a threshold to compare against, any test results would prove fairly meaningless. To aid this, the following graph was constructed. (Note that it is not just the results from pairings that include known plagiarised code that determine the final threshold (see page 29); the following data is used simply to indicate why a threshold occurred at a given level for each MinML, and how in turn that relates to the test results.)
[Figure 7.5: Pairing of the plagiarised Binary Search algorithms with the level 0 Binary Search algorithm - Dice score (%) at each plagiarism level, for Minimum Match Lengths 1, 3, 5 and 10.]
Figure 7.5 shows the results output from the jGST tool for each of the plagiarised Binary Search algorithms when paired with the level 0 Binary Search code, at each MinML.

Starting with a MinML of 1 in Figure 7.5, even after five levels of plagiarism are applied to the original code the Dice score remains at over 85%, providing an indication as to why the threshold is high, at 82.22. As previously discussed, this is primarily to do with individual tokens being matched between the target and source files. As was also concluded for a MinML of 1, in general only the addition or removal of tokens that the tokeniser accepts will influence the plagiarism score. This provides an explanation as to why the plagiarised Copy File and Towers of Hanoi algorithms were incorrectly categorised as non-plagiarised: they were the only two algorithms of the six which, as part of their level 5 plagiarism changes, had an if-then-else statement replaced with a switch statement. This introduces the tokens switch, case and break, as well as additional colons and semicolons, hence the noticeable reduction in plagiarism level.

Considering the results shown on the previous charts when the MinML is 3 and then 5 (Figures 7.2 and 7.3), it can be seen that all twelve algorithms in the test set were correctly categorised for both match lengths. The thresholds in both figures are also lower, at 62.07 and 50.98 respectively. The reason for this is that both MinMLs are beginning to suffer from the previously unseen problem of target file fragmentation and reshuffling as the level of plagiarism increases. This problem can be seen graphically in Figure 7.5, where for the match lengths of 3 and 5 the plagiarism score decreases further as the plagiarism level becomes more complex.

Also, in both Figures 7.2 and 7.3, there appears to be an anomaly: the non-plagiarised Towers of Hanoi algorithms are both close to their respective thresholds. One explanation for this is that the Towers of Hanoi problem is itself not diverse enough, and is perhaps only implementable in a single way. This is backed up by a comparison of the two versions from the test set, which suggests the only real difference between them is the way in which the output from the algorithms is handled. Fortunately, in this case the similar nature of the two programs was not enough to affect the categorisation of each piece of code, and the calculated thresholds remain generally accurate.


Finally, it was discussed earlier in this chapter that a MinML of 5 was hoped to produce better results than a MinML of 3. A visual comparison of Figures 7.2 and 7.3 suggests this is the case. To quantify it, for each MinML the average difference between the threshold and the scores for the plagiarised and non-plagiarised algorithms was calculated and compared; the higher the average difference, the better the MinML was deemed to have performed. The results were as follows.
MinML   Average Difference Between Threshold   Average Difference Between Threshold
        and Plagiarised Scores                 and Non-Plagiarised Scores
3       14.11                                  14.36
5       16.65                                  19.00
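Expressed as a formula (again, the notation is mine rather than the report's), for a given MinML with threshold T and scores s_1, ..., s_6 for the six algorithms in a group:

$$\bar{d} = \frac{1}{6} \sum_{i=1}^{6} \left| s_i - T \right|$$

computed separately over the plagiarised and the non-plagiarised scores.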

These figures suggest that a MinML of 5 will decide whether an algorithm is plagiarised or non-plagiarised with more confidence than a MinML of 3, as predicted.

Moving on to the results when the MinML is 10 (see Figure 7.4), it can be seen that both the plagiarised Binary Search and Eratosthenes algorithms have been wrongly classified as non-plagiarised. It is, however, difficult to offer a full explanation as to why this has occurred, other than to suggest that these two algorithms were subjected to more fragmentation than the others during their creation. It can also be noted from Figure 7.4 that the threshold has dropped to 34.77, the lowest threshold value of all four MinMLs considered. Evidence to support this fall is found in Figure 7.5: it shows that when the MinML is 10, the plagiarism score decreases sharply to below 40% as the level of modification increases. This accurately reflects the calculated threshold value and is clearly down to the fact that finding token string matches of length 10 or greater becomes more difficult once the addition, removal and shuffling of the target code begins.

The hope was that this MinML would, on the whole, still detect genuine plagiarised versions of algorithms. This has been shown, with four of the six plagiarised pieces of code being classified correctly. The aim of demonstrating that results deteriorate as the MinML increases has also been met. However, care must be taken in accepting these achievements, due to the small size of the test set.


8. Evaluation and Conclusions


As the test results for the attribute counting and structure metric methods have already been analysed, and explained where possible, in the previous chapter, it was felt more useful to combine the final evaluation and conclusions into one discussion. The aim of this chapter, therefore, is to revisit the original aims of the project, as laid out in the introduction, and comment on if and how they have been met. A number of individual aspects of the project will also be picked out and critically discussed and evaluated, thus offering a more constructive series of conclusions for the project, as opposed to one large summing up.

The first main aim was to analyse the performance of both methods when the test set is used, and to try to identify which may have the greater potential for plagiarism detection. To do this, the results of both methods need to be briefly recapped. The attribute counting method (McCabe's Cyclomatic Complexity) was found to be relatively accurate at identifying the plagiarised algorithms, of which it correctly determined 5 out of 6. However, it could only determine 3 out of the 6 non-plagiarised algorithms as being non-plagiarised, leaving the feeling that the method's overall performance was disappointing and rather erratic. The structure metric method (GST) displayed perfect results when the MinML was set to 3 or 5, identifying all twelve algorithms in the test set correctly. Even when the MinML was set to 1 or 10, only two of the algorithms were classified wrongly in each case. It was also calculated that the most concrete classifications were obtained when the MinML was 5 (see page 43); when the GST method is discussed subsequently in this chapter, the results returned using this setting are the ones referred to.

Before concluding the findings of this aim, other similar research was considered. Whale [8] was able to demonstrate the clear superiority of structure-based similarity detection over attribute-counting methods, and noted that the performance of the attribute counting schemes on his experiment was consistently poor. As already discussed, Verco and Wise [3] found that evidence points to the fact that no single number, or set of numbers, can adequately capture the level of information about program texts that a structure-metric system is able to achieve.

In conclusion, there is agreement with those above that attribute counting methods are simply not comprehensive enough in attempting to analyse the structure of source code, and so often provide confused results. There is no doubt that this project can say that the GST algorithm has the better potential for plagiarism detection. To be fair to McCabe's Cyclomatic Complexity, it has been known throughout this project that it was not primarily designed to detect plagiarised code, so perhaps this outcome was not too unexpected.

The second aim was to consider both sets of results in a way that, when used together, provides a more concrete plagiarism classification. We have already seen that on its own the attribute counting method struggles, but perhaps when its results are combined with those produced by the GST algorithm, some similarities might be found. However, if we take the non-plagiarised TowersOfHanoi algorithm, which under the GST method has the closest result to the threshold, we can see that McCabe's Cyclomatic Complexity does not help confirm the classification, as it wrongly identifies the algorithm as plagiarised.
This is obviously not of much use. McCabe's method also fails to help in the classification of the two least plagiarised algorithms that the structure metric method identified: it categorises both the non-plagiarised BinarySearch and Eratosthenes algorithms incorrectly as plagiarised. This again is not a useful result.


As a final example, the attribute counting method believes that the most conclusively identified plagiarised algorithm according to the GST method, the ShellSort, is not plagiarised. (A possible reason for this result was given on page 38.) In conclusion, the attribute counting method seems only to contradict the largely accurate results found by the structure metric method. This suggests that using the GST algorithm alone would be sufficient to achieve good results, and that using the two methods together, as intended in this aim, offers little additional benefit.

So how could improvements be made to the attribute counting side of the project, to make results more accurate? The most obvious solution would seem to be to consider more than one attribute counting metric. Sallis et al. [2] support this theory, believing that using more counters should enable effective plagiarism detection. Earlier in this project, their six-tuple vector was described and McCabe's Cyclomatic Complexity chosen from it. Perhaps then, with the benefit of hindsight, the project should have implemented all six metrics. Whale [8] also comments on the use of more than one metric, stating that it is easily established that the traditional representations based on attribute counters are incapable of detecting sufficient similar programs to be considered effective, and that better performance is possible if the number of counters is determined by the size of the program. However, a problem identified here is that all the algorithms collected for this project are roughly the same size, so implementing this advice would bring little benefit beyond simply increasing the number of attribute counters, as Sallis et al. [2] have already suggested. Overall, it was difficult to read through the often contradictory evidence supporting the use of attribute counting methods, which once again suggests that perhaps it is simply not worthwhile attempting to use them to detect plagiarism.

Another area of concern was the calculation of thresholds. It can be assumed that, as the results for the GST algorithm were good, the method used to calculate its thresholds was adequate. But this cannot be said for the Cyclomatic Complexity method. Firstly, the question of whether it is even possible to determine a threshold for the method must be asked. The opinion here is that, because the method does not record similarity between two pieces of code, it may not be. There is also the opinion that the threshold calculation the project used for the attribute counting method may have been flawed, because it was calculated without the influence of any complexities from known non-plagiarised algorithms. This was recognised at the time of implementation, but no valid alternative calculation could be found.

Despite the excellent results found whilst using GST, the tokenising stage of the method did not implement all of the recommendations specified by its author, Wise [11]. To recap, these were:

1. Removing comments and string-constants.
2. Translating upper case letters into lower case.
3. Mapping of synonyms to a common form (i.e. function mapped to procedure).
4. Reordering the functions into their calling order. In the process, the first call to each function is expanded to its full token sequence. Subsequent calls are replaced by the token FUN.
5. Removing all tokens that are not from the lexicon of the target language, i.e. any token that is not a reserved word, built-in function, etc.
Only numbers 1, 2 and part of 5 (built-in functions were removed) were implemented in this project. This was initially down to time constraints, but as the results were shown to be accurate anyway, it is thought that perhaps there is no actual need to make the additional modifications. Only further testing and analysis on a wider-ranging test set will be able to determine this.
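For illustration, a minimal sketch of the pre-processing that was applied (recommendations 1 and 2: stripping comments and string constants, then lower-casing the remaining text) might look like the following. This is a simplified, hypothetical stand-in rather than the project's actual tokeniser, and the regular expressions shown would not cope with every corner case of Java syntax (for example, comment markers appearing inside string literals).

```java
// Hypothetical, simplified pre-processing step: remove comments and string
// constants, then fold the remaining text to lower case before tokenising.
public class PreProcessSketch {

    static String preProcess(String source) {
        String result = source
            .replaceAll("(?s)/\\*.*?\\*/", " ")           // block comments
            .replaceAll("//[^\\n]*", " ")                 // line comments
            .replaceAll("\"(\\\\.|[^\"\\\\])*\"", "\"\"") // string constants -> empty string
            .replaceAll("'(\\\\.|[^'\\\\])'", "' '");     // character constants
        return result.toLowerCase();
    }

    public static void main(String[] args) {
        String code = "int x = 0; // counter\nString s = \"Hello\"; /* demo */";
        System.out.println(preProcess(code));
    }
}
```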


The next interesting point concerns whether the MinML choices of 1, 3, 5 and 10 were the correct figures to choose. From the results, this would seem conclusively so. It has been shown that the project was right to move away from Wise's [11] original suggestion of a default MinML of 3 and to experiment with the higher MinML of 5. The opinion here is that the improved results were no fluke, and that the discussion concerning Java syntax on page 39 is the main reason behind them.

As Sallis et al. [2] commented regarding their research, in order for any inferences to be meaningful from this work, a large sample of software and authors will be required. This was something that this study was also unable to find, and it should therefore be stated that this is the biggest criticism of the overall project. Of course, an attempt was made to find an alternative way of identifying a better test set (see page 20), but it was noted that this scheme was too elaborate and time consuming. However, the data that was used is felt to be reflective of what would normally be considered in this field of study; Parker and Hamblen in their paper [7] also use the Sieve of Eratosthenes algorithm, suggesting that although small, the test set was appropriate.

The final aim of the project was to construct a desktop tool that implements one of the chosen plagiarism detection methods. This was completed in the form of the jGST tool, a program implementing the GST structure metric algorithm. It was noted, however, that there were a number of problems with the tool. Firstly, as discussed on page 9, the Running-Karp-Rabin (RKR) optimisation for the GST algorithm was not implemented. It is recognised, though, that if incorporated, the tool would not have functioned any differently or produced any better results, but would simply have worked faster and more efficiently. Given more time, it is proposed that this should be considered. The jGST tool could also have gone further and perhaps been able to calculate thresholds and to visualise the data it produces in some way. These suggestions would of course only be desired requirements, but again, had more time been available, they could have been implemented. It is worth noting, however, that the tool was at least able to save the results it had generated.

Finally, it was felt the best way to conclude the evaluation of the jGST tool was to carry out a brief heuristic analysis, using as many of the ten factors identified by Nielsen [14] as possible. This was achieved as follows.
- Use simple and natural dialogue. The tool presents no irrelevant material to the user, i.e. no logos, slogans or unwanted information is shown. The layout of jGST also follows a natural, logical order, with on-screen details running from the top left to the bottom right of the window.

- Speak the user's language. Although some terminology is presented within the system, it is felt that the majority of terms are user friendly and self explanatory, for example the Select Source File menu item.

- Minimise the user's memory load. An example of where the tool is careful to adhere to this heuristic is the resetting of the MML field to 5 if its previous value was not valid, e.g. if the tool is executed with a MinML of 0, the MML field will display 5 after the failed attempt.

- Feedback. The tool does not display warning messages if the results on screen are about to be lost, for example if the Clear Results menu item has been selected. If this tool were to be upgraded, this would be made a requirement. For this reason, it is recognised that jGST fails this heuristic test.

- Clearly marked exits. As well as the standard cross in the top right-hand corner of the window, the tool can also be terminated by selecting Exit from the File menu. This is thought to be quite clear.


- Shortcuts. For the experienced user, Control-key shortcuts have been provided to open either a source or target file, and to execute the tool.

- Help and Documentation. As the tool is considered fairly obvious to operate, and it is thought that potential users will be familiar with the general ideas behind the GST method, no additional help or documentation has been produced. Again, if this tool were to be upgraded, this decision would need to be reconsidered.

This has helped to show that in a future release of the tool, the introduction of warning messages and documentation would improve its overall functionality and usability.

It is not easy to suggest future work that could result from the completion of this project. Certainly, a more thorough review and evaluation of attribute counting methods is a must if they are to be used again in a similar study. It is felt that attempting to use both types of method to identify stronger, more concrete results is not necessarily the right option to take. It was clear that the only real benefit the project gained by doing this was to be able to demonstrate that the approach does not really work. This is, of course, still a valid result, but perhaps not quite the one that was originally desired.


References
[1] P. Clough, Plagiarism in natural and programming languages: an overview of current tools and technologies, Department of Computer Science, University of Sheffield, 2000.
[2] P. Sallis, A. Aakjaer, S. MacDonell, Software Forensics: old methods for a new science, in Proceedings of Software Engineering: Education & Practice (SE:E&P'96), Dunedin, New Zealand, IEEE Computer Society Press (1996), 481-485.
[3] K. L. Verco and M. J. Wise, Software for Detecting Suspected Plagiarism: Comparing Structure and Attribute-Counting Systems, First Australian Conference on Computer Science Education, Sydney, Australia, July 3-5, 1996.
[4] M. H. Halstead, Elements of Software Science, Elsevier North Holland, 1977.
[5] T. J. McCabe, A complexity measure, IEEE Transactions on Software Engineering, SE-2 (4), 308-320, 1976.
[6] J. M. Bieman, N. C. Debnath, An analysis of software structure using a generalised program graph, in Proceedings of COMPSAC'85, 254-259, 1985. Cited in [1].
[7] A. Parker, J. Hamblen, Computer algorithms for Plagiarism Detection, IEEE Transactions on Education, Vol. 32, No. 2, 1989.
[8] G. Whale, Identification of Program Similarity in Large Populations, The Computer Journal, Vol. 33, No. 2, 1990.
[9] J. A. Faidhi and S. K. Robinson, An empirical approach for detecting program similarity and plagiarism within a university programming environment, Computing in Education, Vol. 11, 11-19, 1987.
[10] M. J. Wise, Detection of similarities in student programs: YAPing may be preferable to Plagueing, SIGCSE Technical Symposium, Kansas City, USA, 268-271, March 5-6, 1992.
[11] M. J. Wise, YAP3: improved detection of similarities in computer programs and other texts, presented at SIGCSE'96, Philadelphia, USA, February 15-17, 1996, 130-134.
[12] J. Helfman, Dotplot Patterns: A Literal Look at Pattern Languages.
[13] L. Prechelt, G. Malpohl, M. Philippsen, Finding plagiarisms among a set of programs with JPlag, submission to J. of Universal Computer Science, March 28, 2000.
[14] J. Nielsen, Usability Engineering, Academic Press, 1993.

