Contact

Norman Meuschke

Data & Knowledge Engineering Group
University of Wuppertal
School of Electrical,
Information and Media Engineering
Rainer-Gruenter-Str. 21
D-42119 Wuppertal
Office: FC 1.20

Phone: +49 (0)202 439 1618

meuschke{at}uni-wuppertal.de


HyPlag - Hybrid Plagiarism Detection

Fig. 1: HyPlag's Results Overview
Fig. 2: HyPlag's Detailed Comparison View

HyPlag realizes a hybrid approach to plagiarism detection for academic documents. The system analyzes mathematical expressions, images, citations, and text to improve the identification of potentially suspicious content similarity, particularly in research publications such as journal articles, PhD theses, and grant proposals.

Try HyPlag (log in to GitHub first; user: guest@hyplag.org | pw: hybridPD).

HyPlag’s code and the resources for our experiments, e.g., test cases, are available from GitHub (log in to GitHub first; user: hyplag-guest | pw: hybridPD).

The two figures above show HyPlag’s main analysis views: the Results Overview (Fig. 1) and the Detailed Comparison View (Fig. 2). The Results Overview enables users to quickly browse all identified similarities and check which parts of the input document are affected. The left part of the screen shows the full text of the input document (see (1) in Fig. 1). The right part shows a list of result summaries (2) for all documents in which similarities to the input document have been identified. Each result summary includes one or more match views (3). Each match view has two panels and represents the similarities that one analysis method identified, e.g., matching citations or similar formulae. The left panel (4a) represents the input document and the right panel (4b) the comparison document. Lines connect matching features in the match views. For example, the match views in Fig. 1 show the similarity of text (left), citations (middle), and mathematical content (right) between a retracted article and two papers by other authors.

The Detailed Comparison View (Fig. 2) allows users to inspect identified similarities in detail. The screen displays the full text of the input document (8) and a selected comparison document (9) side by side. Between the full texts, a match view (10) similar to the match views in the Results Overview highlights all matching features in both documents. However, in this view, each feature match (11a,b) is assigned a separate color. Clicking on any highlight in the full-text panels or the central match view aligns the respective feature matches. Since the central match view represents the entire document, a darker shade indicates the current viewport, i.e., the segment of text visible in the adjacent full-text panel and the position of that segment in the document.

For details on HyPlag’s visualizations or system architecture, see [3].

HyPlag includes the following analysis methods for the different categories of content:

Citations

HyPlag employs four citation-based analysis methods, which our prior research proved effective for discovering concealed forms of academic plagiarism (see our project page on Citation-based Plagiarism Detection or [6-12] for details). The code for the citation-based analysis is available as a separate GitHub repository.

  • Bibliographic Coupling (BC) quantifies the absolute number or fraction of shared references while ignoring the number, position, and order of citations in the text.
  • Longest Common Citation Sequence (LCCS) is the maximum number of citations that match in both documents in the same order, but not necessarily in a contiguous block. We showed that LCCS achieves good results for retrieving longer passages of reused text, in which the sequence of ideas remained unchanged.
  • Greedy Citation Tiling (GCT) identifies all individually longest matching substrings of citations in two documents ('citation tiles'), i.e., all blocks of consecutive shared citations in identical order. Longer citation tiles are a strong indicator for high semantic similarity of text passages, even if the order of the passages was changed.
  • Citation Chunking (CC) is a class of heuristic measures to find variably-sized patterns of matching citations, in which the count and order of matching citations can differ.
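
The first three measures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not HyPlag's implementation: citations are given as lists of reference identifiers in order of appearance, all function names are hypothetical, the BC fraction is normalized by the union of both reference lists, and the tiling follows the general greedy string tiling idea. Citation Chunking is omitted because its heuristics are not specified here.

```python
from typing import List, Sequence, Tuple


def bibliographic_coupling(refs_a: Sequence[str], refs_b: Sequence[str]) -> Tuple[int, float]:
    """Absolute number of shared references and their fraction.

    Assumption: the fraction is computed over the union of both reference sets.
    """
    set_a, set_b = set(refs_a), set(refs_b)
    shared = set_a & set_b
    union = set_a | set_b
    return len(shared), (len(shared) / len(union) if union else 0.0)


def lccs_length(cits_a: Sequence[str], cits_b: Sequence[str]) -> int:
    """Length of the longest common citation sequence (same order, gaps allowed)."""
    prev = [0] * (len(cits_b) + 1)
    for a in cits_a:
        curr = [0]
        for j, b in enumerate(cits_b, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def greedy_citation_tiles(
    cits_a: Sequence[str], cits_b: Sequence[str], min_len: int = 1
) -> List[Tuple[int, int, int]]:
    """Greedy tiling: repeatedly mark the longest unmarked blocks of consecutive
    shared citations. Returns tiles as (start_a, start_b, length)."""
    marked_a = [False] * len(cits_a)
    marked_b = [False] * len(cits_b)
    tiles: List[Tuple[int, int, int]] = []
    while True:
        best_len, best = 0, []
        for i in range(len(cits_a)):
            for j in range(len(cits_b)):
                k = 0
                while (
                    i + k < len(cits_a)
                    and j + k < len(cits_b)
                    and cits_a[i + k] == cits_b[j + k]
                    and not marked_a[i + k]
                    and not marked_b[j + k]
                ):
                    k += 1
                if k > best_len:
                    best_len, best = k, [(i, j)]
                elif k == best_len and k > 0:
                    best.append((i, j))
        if best_len == 0 or best_len < min_len:
            break
        for i, j in best:
            # Skip candidates that overlap a tile marked earlier in this round.
            if any(marked_a[i:i + best_len]) or any(marked_b[j:j + best_len]):
                continue
            for k in range(best_len):
                marked_a[i + k] = True
                marked_b[j + k] = True
            tiles.append((i, j, best_len))
    return tiles
```

If two passages are swapped, e.g., citations A B C D E F in one document and D E F X A B C in the other, the LCCS covers only one of the two blocks, whereas the tiling reports both blocks. This illustrates why citation tiles remain informative when the order of passages changes.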

Images

Currently, HyPlag includes four analysis methods to identify potentially suspicious image similarity (see [4] for details). The code for the image-based analysis is available as a separate GitHub repository.

  • Perceptual hashing (pHash) is a well-established, fast, and reliable method to find highly similar images.
  • Trigram text matching compares the text that has been extracted from images using OCR.
  • Positional text matching improves the similarity analysis for OCR text of figures that includes significant recognition errors. The approach only considers text matches for computing the similarity of two images if the matching text occurs in broadly similar regions in both images.
  • Ratio hashing identifies highly similar bar charts by comparing the relative heights of the bars sorted in decreasing order and calculating the sum of the differences of the bar heights.
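
Two of these methods are simple enough to illustrate with a short Python sketch. It makes several assumptions that do not come from the description above: bar heights are already extracted as numbers, heights are scaled by the tallest bar, charts with different bar counts are padded with zeros, and trigrams are character trigrams compared with the Jaccard coefficient. The function names are hypothetical.

```python
from typing import List, Sequence


def ratio_hash(bar_heights: Sequence[float]) -> List[float]:
    """Relative bar heights, sorted in decreasing order (tallest bar = 1.0)."""
    if not bar_heights:
        return []
    ordered = sorted(bar_heights, reverse=True)
    tallest = ordered[0]
    if tallest <= 0:
        raise ValueError("expected at least one positive bar height")
    return [height / tallest for height in ordered]


def ratio_distance(heights_a: Sequence[float], heights_b: Sequence[float]) -> float:
    """Sum of the differences of the relative bar heights; 0.0 means identical ratios."""
    hash_a, hash_b = ratio_hash(heights_a), ratio_hash(heights_b)
    length = max(len(hash_a), len(hash_b))
    # Assumption: missing bars count as bars of height zero.
    hash_a += [0.0] * (length - len(hash_a))
    hash_b += [0.0] * (length - len(hash_b))
    return sum(abs(x - y) for x, y in zip(hash_a, hash_b))


def trigram_similarity(text_a: str, text_b: str) -> float:
    """Jaccard overlap of character trigrams of two OCR text strings."""
    grams_a = {text_a[i:i + 3] for i in range(len(text_a) - 2)}
    grams_b = {text_b[i:i + 3] for i in range(len(text_b) - 2)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0
```

Because the heights are sorted and scaled, a chart that was rescaled or whose bars were reordered yields the same ratio hash, e.g., `ratio_distance([10, 20, 40], [4, 2, 1])` is 0.0.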

Mathematical Expressions

To create mathematics-based semantic fingerprints of documents, HyPlag uses three similarity measures that analyze mathematical identifiers. In a previous study, we showed that identifiers are the most effective feature for this purpose (see our project page on math-based plagiarism detection or [5] for details).

  • Frequency histograms of mathematical identifiers (Histo) quantify the global overlap of mathematical identifiers in two documents, irrespective of the order of the identifiers. The measure considers the union of the identifiers in both documents and the relative difference in the number of occurrences of each identifier. The number of shared identifiers is normalized by the sum of identifiers in both documents. Thus, achieving high scores requires documents that contain a comparable number of identifiers. Typically, this requirement is only met if the two documents are of similar length.
  • Longest Common Subsequence of Identifiers (LCIS) is the maximum number of identifiers that match in both documents in the same order, but not necessarily in a contiguous block. HyPlag considers the number of identifiers in the query document that are part of the longest common identifier sequence. Like Histo, the LCIS measure quantifies the global similarity of documents, but considers the order while Histo is order-agnostic.
  • Greedy Identifier Tiles (GIT) are the set of all individually longest blocks of shared identifiers in identical order that cannot be extended to the left or right without encountering a non-matching identifier. The GIT score quantifies the number of identifiers in the query document that are part of identifier tiles with a minimum length of five.
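
The following Python sketch illustrates the three identifier-based measures under stated assumptions. Documents are represented as lists of identifier strings in order of appearance, the function names are hypothetical, and the exact Histo normalization is an assumption (one minus the summed count differences divided by the total number of identifiers in both documents). The GIT sketch marks all common blocks of at least five consecutive identifiers and does not resolve overlapping tiles in the comparison document, which simplifies the greedy procedure.

```python
from collections import Counter
from typing import Sequence


def histo_score(ids_a: Sequence[str], ids_b: Sequence[str]) -> float:
    """Order-agnostic overlap of identifier frequency histograms in [0, 1]."""
    counts_a, counts_b = Counter(ids_a), Counter(ids_b)
    total = len(ids_a) + len(ids_b)
    if total == 0:
        return 0.0
    # Iterate over the union of identifiers; a missing identifier counts as 0.
    difference = sum(abs(counts_a[i] - counts_b[i]) for i in set(counts_a) | set(counts_b))
    return 1.0 - difference / total


def lcis_length(query: Sequence[str], other: Sequence[str]) -> int:
    """Number of query identifiers in the longest common identifier sequence."""
    prev = [0] * (len(other) + 1)
    for q in query:
        curr = [0]
        for j, o in enumerate(other, start=1):
            curr.append(prev[j - 1] + 1 if q == o else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def git_score(query: Sequence[str], other: Sequence[str], min_len: int = 5) -> int:
    """Number of query identifiers covered by shared blocks of at least min_len."""
    covered = [False] * len(query)
    prev = [0] * (len(other) + 1)  # run lengths ending at the previous query position
    for i, q in enumerate(query):
        curr = [0] * (len(other) + 1)
        for j, o in enumerate(other):
            if q == o:
                run = prev[j] + 1
                curr[j + 1] = run
                if run >= min_len:
                    for k in range(i - run + 1, i + 1):
                        covered[k] = True
        prev = curr
    return sum(covered)
```

With this normalization, two documents that share all identifiers but differ strongly in length still receive a low Histo score, which matches the observation that high scores typically require documents of similar length.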

Additionally, HyPlag performs pairwise similarity assessments of formulae using three similarity measures introduced by Zhang and Youssef [13]:

  • The coverage measure quantifies the number of matching tokens in two formulae.
  • The match depth measure assigns higher weights to matching concepts in two formulae if the concepts occur at higher levels, i.e., closer to the root of the MathML expression tree. The idea is that higher level concepts are more significant for the nature of the expression.
  • The taxonomic distance measure assigns a higher weight to elements from the same class in a content dictionary. For instance, two trigonometric functions, such as sin and cos, would receive a higher similarity score than sin and log. HyPlag uses the content dictionary of the MathML standard.
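
To make the intuition behind the three measures concrete, the sketch below applies simplified versions to toy expression trees in Python. Everything specific in it is an assumption for illustration and is not taken from Zhang and Youssef or from HyPlag: trees are nested `(label, children)` tuples rather than MathML, the depth weight is 1 / (1 + depth), the class labels in `TOKEN_CLASS` are hypothetical and not the MathML content dictionary, and tokens of the same class score 0.5.

```python
from collections import Counter
from typing import List, Tuple

Tree = Tuple[str, list]  # (label, list of child trees)

# Hypothetical class labels for illustration only.
TOKEN_CLASS = {"sin": "trigonometric", "cos": "trigonometric", "log": "logarithmic"}


def tokens_with_depth(tree: Tree, depth: int = 0) -> List[Tuple[str, int]]:
    """Flatten a tree into (label, depth) pairs; the root has depth 0."""
    label, children = tree
    result = [(label, depth)]
    for child in children:
        result.extend(tokens_with_depth(child, depth + 1))
    return result


def coverage(tree_a: Tree, tree_b: Tree) -> int:
    """Number of matching tokens in two formulae (multiset intersection)."""
    labels_a = Counter(label for label, _ in tokens_with_depth(tree_a))
    labels_b = Counter(label for label, _ in tokens_with_depth(tree_b))
    return sum((labels_a & labels_b).values())


def match_depth(tree_a: Tree, tree_b: Tree) -> float:
    """Sum of depth weights of the tokens in tree_a that also occur in tree_b.

    Matches closer to the root of tree_a receive a higher weight.
    """
    remaining = Counter(label for label, _ in tokens_with_depth(tree_b))
    score = 0.0
    for label, depth in tokens_with_depth(tree_a):
        if remaining[label] > 0:
            remaining[label] -= 1
            score += 1.0 / (1 + depth)
    return score


def taxonomic_similarity(token_a: str, token_b: str) -> float:
    """1.0 for identical tokens, 0.5 for tokens of the same class, else 0.0."""
    if token_a == token_b:
        return 1.0
    class_a = TOKEN_CLASS.get(token_a)
    if class_a is not None and class_a == TOKEN_CLASS.get(token_b):
        return 0.5
    return 0.0
```

For the toy formulae sin(x) + 1 and cos(x) + 2, the root operator and the identifier x match, and the match at the root contributes three times as much to the match depth score as the match two levels below it.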

Text

To find similar text, HyPlag relies on established text retrieval methods.

  • Text fingerprinting performs text chunking using word 3-grams and probabilistically selects a subset of chunks for computing a digital signature of the input text. The mean probability for chunk retention is 1/16. We realized this approach by adapting the Sherlock tool.
  • Encoplot, developed by Grozea et al. [14], is an efficient character 16-gram comparison that achieves a time complexity of O(n) by ignoring repeated matches.
  • Boyer-Moore string matching identifies all matching strings (including repetitions) with 12 or more identical words.
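
The fingerprinting step can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions and does not reproduce the adapted Sherlock code: chunks are hashed with SHA-1 from the standard library, a chunk is retained if the lowest four bits of its hash are zero (which gives a mean retention probability of 1/16), and two fingerprints are compared with the Jaccard coefficient. The function names and the `zero_bits` parameter are hypothetical.

```python
import hashlib
from typing import List, Set


def word_ngrams(text: str, n: int = 3) -> List[str]:
    """Split text into overlapping word n-grams (chunks)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def fingerprint(text: str, n: int = 3, zero_bits: int = 4) -> Set[int]:
    """Hash every word n-gram and keep the hashes whose lowest zero_bits bits are 0.

    With zero_bits=4, each chunk is retained with a probability of about 1/16.
    """
    mask = (1 << zero_bits) - 1
    selected = set()
    for chunk in word_ngrams(text, n):
        digest = hashlib.sha1(chunk.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        if value & mask == 0:
            selected.add(value)
    return selected


def fingerprint_similarity(text_a: str, text_b: str, zero_bits: int = 4) -> float:
    """Jaccard overlap of the two fingerprints (an assumed comparison measure)."""
    print_a = fingerprint(text_a, zero_bits=zero_bits)
    print_b = fingerprint(text_b, zero_bits=zero_bits)
    union = print_a | print_b
    return len(print_a & print_b) / len(union) if union else 0.0
```

Selecting chunks by their hash value rather than at random means that the same chunk is either retained in every document or in none, so shared passages produce shared fingerprint entries. Setting `zero_bits=0` retains all chunks, which is convenient for checking the chunking on short inputs.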

RELATED PUBLICATIONS

  1. Academic Plagiarism Detection: A Systematic Literature Review,
    T. Foltynek, N. Meuschke, and B. Gipp,
    ACM Computing Surveys, vol. 52, iss. 6, pp. 112:1-112:42, 2019.
  2. Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations,
    N. Meuschke, V. Stange, M. Schubotz, M. Kramer, and B. Gipp,
    Proc. ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019.
  3. HyPlag: A Hybrid Approach to Academic Plagiarism Detection,
    N. Meuschke, V. Stange, M. Schubotz, and B. Gipp,
    Proc. Int. ACM SIGIR Conf. on Research & Development in Information Retrieval, 2018.
  4. An Adaptive Image-based Plagiarism Detection Approach,
    N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp,
    Proc. ACM/IEEE-CS Joint Conf. on Digital Libraries (JCDL), 2018.
  5. Analyzing Mathematical Content to Detect Academic Plagiarism,
    N. Meuschke, M. Schubotz, F. Hamborg, T. Skopal, and B. Gipp,
    Proc. ACM Conf. on Information and Knowledge Management (CIKM), 2017.
  6. Reducing Computational Effort for Plagiarism Detection by using Citation Characteristics to Limit Retrieval Space,
    N. Meuschke and B. Gipp,
    Proc. IEEE/ACM Int. Conf. on Digital Libraries (JCDL), 2014.
  7. Citation-based Plagiarism Detection: Practicability on a Large-scale Scientific Corpus,
    B. Gipp, N. Meuschke, and C. Breitinger,
    Journal of the American Society for Information Science and Technology (JASIST), vol. 65, iss. 2, pp. 1527-1540, 2014.
  8. Citation-based Plagiarism Detection - Detecting Disguised and Cross-language Plagiarism using Citation Pattern Analysis,
    B. Gipp,
    Springer Vieweg Research, 2014.
  9. Demonstration of Citation Pattern Analysis for Plagiarism Detection,
    B. Gipp, N. Meuschke, C. Breitinger, M. Lipinski, and A. Nuernberger,
    Proc. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2013.
  10. Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,
    B. Gipp and N. Meuschke,
    Proc. ACM Symposium on Document Engineering (DocEng), 2011.
  11. Comparative Evaluation of Text- and Citation-based Plagiarism Detection Approaches using GuttenPlag,
    B. Gipp, N. Meuschke, and J. Beel,
    Proc. ACM/IEEE Joint Conf. on Digital Libraries (JCDL), 2011.
  12. Citation Based Plagiarism Detection – A New Approach to Identify Plagiarized Work Language Independently,
    B. Gipp and J. Beel,
    Proc. ACM Conf. on Hypertext and Hypermedia (HT), 2010.

    Cited Sources
  13. An Approach to Math-Similarity Search,
    Q. Zhang and A. Youssef,
    Proc. Conf. on Intelligent Computer Mathematics (CICM), 2014.
  14. ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection,
    C. Grozea, C. Gehl, and M. Popescu,
    Proc. PAN Workshop, 2009.

MEDIA COVERAGE

The Frankfurter Allgemeine Zeitung (FAZ) describes how our research on novel plagiarism detection methods and blockchain-backed decentralized trusted timestamping facilitates combating plagiarism and...


The DFG, Germany’s organisation for research funding, has awarded our group a 3-year research grant for our project: 

“Methods and Tools to Advance the Retrieval of Mathematical Knowledge from...


Wikipedia is using a new approach for rendering mathematical formulae - as of May 31st. This approach was developed by our group member Moritz Schubotz.

Bitmap images representing formulae, were...


The national public radio broadcasters Deutschlandfunk and Deutschlandradio Kultur covered our collaborative research on Plagiarism Prevention and Detection and our prototype system CitePlag.


Several media outlets recently covered our research on Plagiarism Prevention and Detection and our prototype system CitePlag.

Articles appeared in the national daily newspaper "Die Welt" and the...


An article about our plagiarism detection system CitePlag appeared in uni'kon #59.


CitePlag is the first prototype of a citation-based Plagiarism Detection (CbPD) System. The prototype was recently demonstrated at the SIGIR conference 2013.

What makes CitePlag novel?

In contrast...

Last edited: 24.10.2019