Detecting Cross-Language Plagiarism using Open Knowledge Graphs

dc.bibliographicCitation.bookTitle: Proceedings of the 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2021) co-located with JCDL 2021 [eng]
dc.bibliographicCitation.firstPage: 46
dc.bibliographicCitation.journalTitle: CEUR workshop proceedings [eng]
dc.bibliographicCitation.lastPage: 57
dc.bibliographicCitation.volume: 3004
dc.contributor.author: Stegmüller, Johannes
dc.contributor.author: Bauer-Marquart, Fabian
dc.contributor.author: Meuschke, Norman
dc.contributor.author: Ruas, Terry
dc.contributor.author: Schubotz, Moritz
dc.contributor.author: Gipp, Bela
dc.contributor.editor: Zhang, Chengzhi
dc.contributor.editor: Mayr, Philipp
dc.contributor.editor: Lu, Wei
dc.contributor.editor: Zhang, Yi
dc.date.accessioned: 2022-05-11T11:11:41Z
dc.date.available: 2022-05-11T11:11:41Z
dc.date.issued: 2021
dc.description.abstract: Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA’s performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available. [eng]
dc.description.version: publishedVersion [eng]
dc.identifier.uri: https://oa.tib.eu/renate/handle/123456789/8964
dc.identifier.uri: https://doi.org/10.34657/8002
dc.language.iso: eng
dc.publisher: Aachen, Germany : RWTH Aachen
dc.relation.essn: 1613-0073
dc.relation.uri: http://ceur-ws.org/Vol-3004/paper7.pdf
dc.rights.license: CC BY 4.0 Unported
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject.ddc: 020 [eng]
dc.subject.ddc: 004 [eng]
dc.subject.gnd: Konferenzschrift (conference proceedings) [ger]
dc.subject.other: Cross-language plagiarism detection [eng]
dc.subject.other: knowledge graphs [eng]
dc.subject.other: Wikidata [eng]
dc.title: Detecting Cross-Language Plagiarism using Open Knowledge Graphs [eng]
dc.type: BookPart [eng]
dc.type: Text [eng]
dcterms.event: 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2021), co-located with JCDL 2021, Online, September 30, 2021
tib.accessRights: openAccess
wgl.contributor: FIZ KA
wgl.subject: Computer science
wgl.type: Book chapter / contribution to a collected work
Files
Original bundle
Name: paper7.pdf
Size: 2.29 MB
Format: Adobe Portable Document Format