Developing machine learning coding similarity indicators for C & C++ corpora
Abstract
Digital content is easy to copy, alter and present as one's own work. In programming
assignments, this practice is referred to as source-code theft or e-plagiarism. After years
of effort, existing similarity detection engines detect plagiarism by novice programmers
well, but they still give insufficient results when a student uses more sophisticated
plagiarism tricks such as word substitution, structure changes, line spacing and
placeholder comments. This thesis research aims to deliver an assistive forensic
engine named ‘SimDec’ that helps evaluators detect similar assignments and so addresses
the aforementioned issues. The system's primary objective is to help assignment
evaluators identify likely cases of code theft and follow the university's academic
dishonesty regulations. The forensic engine is developed in the Java programming language
and detects similarities in C and C++ source code. The research is split into two modules,
labelled ‘software forensic engine development’ and ‘similarity level classification with
machine learning’. The proposed system has a three-stage workflow: lexical analysis,
tokenizer customization, and a final stage that displays the similarity percentage and
the corresponding level of ‘Low’, ‘Average’ or ‘High’. The similarity algorithms
integrated in the engine are Levenshtein distance, the Jaro and Jaro-Winkler measures,
the Dice coefficient and cosine similarity. The first module comprises the lexical
analysis workflow and the application of this set of similarity measures to token
categories. The machine learning algorithms selected for the classification task
are multi-class SVM, logistic regression and a simple neural network. In this second
module, the data gathered and generated by the similarity detection engine is fed to the
ML algorithms to train the models so that they can predict the plagiarism or
similarity level of newly entered data. This hybrid approach is expected to reduce
the time complexity and processing time of the software engine.
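
To illustrate the kind of computation the first module performs, the following Java sketch
applies two of the listed measures, Levenshtein distance and the Dice coefficient, to lists
of tokens and maps a percentage to a level. It is a minimal sketch under stated assumptions,
not the SimDec implementation: the class name SimilaritySketch, the normalized token names
ID and NUM, the normalization of the Levenshtein distance by the longer token list, the
set-based form of the Dice coefficient and the 40/70 thresholds are all hypothetical. In
particular, the abstract assigns the level with trained classifiers, whereas fixed
thresholds are used here only to keep the example short.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names, normalization and thresholds are assumptions.
class SimilaritySketch {

    // Dynamic-programming Levenshtein distance, counted in tokens rather than characters.
    static int levenshtein(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j; // turning an empty list into the first j tokens of b
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i; // deleting the first i tokens of a
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    // Distance rescaled to a percentage (assumed normalization: longer list length).
    static double levenshteinSimilarity(List<String> a, List<String> b) {
        int maxLen = Math.max(a.size(), b.size());
        if (maxLen == 0) {
            return 100.0; // two empty token lists are treated as identical
        }
        return 100.0 * (1.0 - (double) levenshtein(a, b) / maxLen);
    }

    // Dice coefficient over the sets of distinct tokens: 2|A ∩ B| / (|A| + |B|).
    static double diceSimilarity(List<String> a, List<String> b) {
        Set<String> setA = new HashSet<>(a);
        Set<String> setB = new HashSet<>(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 100.0;
        }
        Set<String> common = new HashSet<>(setA);
        common.retainAll(setB);
        return 100.0 * 2.0 * common.size() / (setA.size() + setB.size());
    }

    // Hypothetical fixed thresholds standing in for the trained classifier.
    static String level(double percent) {
        if (percent < 40.0) {
            return "Low";
        }
        if (percent < 70.0) {
            return "Average";
        }
        return "High";
    }

    public static void main(String[] args) {
        // Token lists as a tokenizer might emit them after renaming identifiers to ID
        // and numeric literals to NUM (an assumed normalization).
        List<String> first = Arrays.asList("int", "ID", "=", "NUM", ";");
        List<String> second = Arrays.asList("int", "ID", "=", "NUM", "+", "NUM", ";");
        double lev = levenshteinSimilarity(first, second);
        double dice = diceSimilarity(first, second);
        System.out.println("Levenshtein: " + lev + " -> " + level(lev));
        System.out.println("Dice: " + dice + " -> " + level(dice));
    }
}
```

Working on normalized tokens rather than raw text is what makes such measures less sensitive
to word substitution and spacing changes: two submissions that differ only in identifier
names or whitespace produce the same token lists and therefore score as identical.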