Detecting Cross-Language Plagiarism using Open Knowledge Graphs

Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA’s performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.
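
The idea of comparing documents through language-independent entity vectors can be illustrated with a minimal sketch. It is not the CL-OSA implementation: the abstract does not specify the similarity measure, so cosine similarity is an assumption here, and the identifiers "E1" to "E5" are placeholders rather than real Wikidata IDs. The entity linking step that maps text in any language to such identifiers is assumed to have already happened.

```python
from collections import Counter
from math import sqrt


def entity_vector(entity_ids):
    """Count how often each (placeholder) entity identifier occurs in a document."""
    return Counter(entity_ids)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse entity-count vectors (an assumed measure)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical entity lists, as if extracted from an English source,
# its Japanese translation, and an unrelated document.
doc_en = entity_vector(["E1", "E2", "E2", "E3"])
doc_ja = entity_vector(["E1", "E2", "E2", "E3"])
doc_other = entity_vector(["E4", "E5"])

sim_translation = cosine_similarity(doc_en, doc_ja)
sim_unrelated = cosine_similarity(doc_en, doc_other)
```

Because both language versions map to the same entity identifiers, their vectors coincide and the similarity is high, while the unrelated document shares no entities and scores zero. No translation step is involved in the comparison itself.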

Keywords

Cross-language plagiarism detection, knowledge graphs, Wikidata

Keywords GND

Conference publication

Conference

2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2021), co-located with JCDL 2021, Online, September 30, 2021

Publication Type

BookPart

Version

publishedVersion

URI

https://oa.tib.eu/renate/handle/123456789/8964
https://doi.org/10.34657/8002

Collections

Information Science
Computer Science

License

CC BY 4.0 International

https://creativecommons.org/licenses/by/4.0/
