Embedding large-scale graph and text-based datasets with LLMs

dc.bibliographicCitation.seriesTitle: ZIB Report ; 2025,11
dc.contributor.author: Kunt, Tim
dc.contributor.author: Buchholz, Annika
dc.contributor.author: Khebouri, Imene
dc.contributor.author: Koch, Thorsten
dc.contributor.author: Litzel, Ida
dc.contributor.author: Vu, Thi Huong
dc.date.accessioned: 2025-10-08T08:08:42Z
dc.date.available: 2025-10-08T08:08:42Z
dc.date.issued: 2025-07
dc.description.abstract: Large text data sets, such as publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure, enabling us to utilize tools and methods from graph theory as well as conventional classification methods, the former has recently gained new potential through the use of LLM embedding models. To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, containing 56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts. Furthermore, we discuss strategies for combining these emerging methods with traditional graph-based approaches so that each can potentially compensate for the other’s shortcomings.
dc.description.version: publishedVersion
dc.identifier.other: urn:nbn:de:0297-zib-100646
dc.identifier.uri: https://oa.tib.eu/renate/handle/123456789/24223
dc.identifier.uri: https://doi.org/10.34657/23240
dc.language.iso: eng
dc.publisher: Hannover : Technische Informationsbibliothek
dc.relation.affiliation: Zuse Institute Berlin
dc.rights.license: This document may be downloaded, read, stored and printed for your own use within the limits of § 53 UrhG but it may not be distributed via the internet or passed on to external parties.
dc.rights.license: German copyright law applies. The work or its content may be downloaded, consumed, stored or printed free of charge for personal use, but may not be made available on the internet or passed on to outside parties.
dc.subject.ddc: 000 | Computer science, information and general works
dc.title: Embedding large-scale graph and text-based datasets with LLMs
dc.type: Report
dcterms.extent: 9 pages
dtf.funding.funder: BMFTR
dtf.funding.program: 16WIK2101A
tib.accessRights: openAccess

Files

Original bundle
Name: RO9118_2025_11.pdf
Size: 918.82 KB
Format: Adobe Portable Document Format