Embedding large-scale graph and text-based datasets with LLMs

Kunt, Tim; Buchholz, Annika; Khebouri, Imene; Koch, Thorsten; Litzel, Ida; Vu, Thi Huong

doi:https://doi.org/10.34657/23240

dc.bibliographicCitation.seriesTitle	ZIB Report ; 2025,11
dc.contributor.author	Kunt, Tim
dc.contributor.author	Buchholz, Annika
dc.contributor.author	Khebouri, Imene
dc.contributor.author	Koch, Thorsten
dc.contributor.author	Litzel, Ida
dc.contributor.author	Vu, Thi Huong
dc.date.accessioned	2025-10-08T08:08:42Z
dc.date.available	2025-10-08T08:08:42Z
dc.date.issued	2025-07
dc.description.abstract	Large text datasets, such as publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. The latter can be described as a graph structure, which lets us apply tools and methods from graph theory as well as conventional classification methods; the former has recently gained new potential through LLM embedding models. To demonstrate these possibilities and their practicality, we investigate the Web of Science dataset, which contains 56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts. Furthermore, we discuss strategies for combining these emerging methods with traditional graph-based approaches so that each can potentially compensate for the other’s shortcomings.	ger
dc.description.version	publishedVersion
dc.identifier.other	urn:nbn:de:0297-zib-100646
dc.identifier.uri	https://oa.tib.eu/renate/handle/123456789/24223
dc.identifier.uri	https://doi.org/10.34657/23240
dc.language.iso	eng
dc.publisher	Hannover : Technische Informationsbibliothek
dc.relation.affiliation	Zuse Institute Berlin
dc.rights.license	This document may be downloaded, read, stored and printed for your own use within the limits of § 53 UrhG but it may not be distributed via the internet or passed on to external parties.	eng
dc.rights.license	German copyright law applies. The work or its content may be downloaded, consumed, stored, or printed free of charge for personal use, but it may not be made available on the internet or passed on to third parties.	ger
dc.subject.ddc	000 | Computer science, information and general works
dc.title	Embedding large-scale graph and text-based datasets with LLMs	ger
dc.type	Report
dcterms.extent	9 pages
dtf.funding.funder	BMFTR
dtf.funding.program	16WIK2101A
tib.accessRights	openAccess

Files

Original bundle

Name: RO9118_2025_11.pdf
Size: 918.82 KB
Format: Adobe Portable Document Format

Collections

Forschungsberichte ohne Pflichtabgabe (DFG, IGF…)