Detecting Cross-Language Plagiarism using Open Knowledge Graphs
dc.bibliographicCitation.bookTitle | Proceedings of the 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2021) co-located with JCDL 2021 | eng |
dc.bibliographicCitation.firstPage | 46 | |
dc.bibliographicCitation.journalTitle | CEUR workshop proceedings | eng |
dc.bibliographicCitation.lastPage | 57 | |
dc.bibliographicCitation.volume | 3004 | |
dc.contributor.author | Stegmüller, Johannes | |
dc.contributor.author | Bauer-Marquart, Fabian | |
dc.contributor.author | Meuschke, Norman | |
dc.contributor.author | Ruas, Terry | |
dc.contributor.author | Schubotz, Moritz | |
dc.contributor.author | Gipp, Bela | |
dc.contributor.editor | Zhang, Chengzhi | |
dc.contributor.editor | Mayr, Philipp | |
dc.contributor.editor | Lu, Wei | |
dc.contributor.editor | Zhang, Yi | |
dc.date.accessioned | 2022-05-11T11:11:41Z | |
dc.date.available | 2022-05-11T11:11:41Z | |
dc.date.issued | 2021 | |
dc.description.abstract | Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA’s performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available. | eng |
dc.description.version | publishedVersion | eng |
dc.identifier.uri | https://oa.tib.eu/renate/handle/123456789/8964 | |
dc.identifier.uri | https://doi.org/10.34657/8002 | |
dc.language.iso | eng | |
dc.publisher | Aachen, Germany : RWTH Aachen | |
dc.relation.essn | 1613-0073 | |
dc.relation.uri | http://ceur-ws.org/Vol-3004/paper7.pdf | |
dc.rights.license | CC BY 4.0 International | |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
dc.subject.ddc | 020 | eng |
dc.subject.ddc | 004 | eng |
dc.subject.gnd | Konferenzschrift | ger |
dc.subject.other | Cross-language plagiarism detection | eng |
dc.subject.other | knowledge graphs | eng |
dc.subject.other | Wikidata | eng |
dc.title | Detecting Cross-Language Plagiarism using Open Knowledge Graphs | eng |
dc.type | BookPart | eng |
dc.type | Text | eng |
dcterms.event | 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2021), co-located with JCDL 2021, Online, September 30, 2021 | |
tib.accessRights | openAccess | |
wgl.contributor | FIZ KA | |
wgl.subject | Informatik | |
wgl.type | Buchkapitel / Sammelwerksbeitrag | |