Embedding large-scale graph and text-based datasets with LLMs

Series Title

ZIB Report 2025,11

Publisher

Hannover : Technische Informationsbibliothek

Abstract

Large text datasets, such as publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself, whose information is conveyed through its semantics, and (2) its relationships to other texts through links, references, or shared attributes. The latter can be described as a graph structure, which lets us apply tools and methods from graph theory as well as conventional classification methods; the former has recently gained new potential through LLM embedding models. To demonstrate these possibilities and their practicality, we investigate the Web of Science dataset, containing 56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts. Furthermore, we discuss strategies for combining these emerging methods with traditional graph-based approaches so that each can potentially compensate for the other's shortcomings.

File upload by TIB.
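
The two feature types named in the abstract, semantic content and links between texts, can be illustrated with a small sketch. This is not the method proposed in the report, which the abstract does not specify. It assumes made-up three-dimensional vectors as stand-ins for LLM embeddings, a hypothetical citation edge, and a simple weighted blend (`combined_score` with a weight `alpha`) chosen purely for illustration.

```python
import math

# Toy stand-ins for LLM embedding vectors (made-up numbers, not model output).
embeddings = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.8, 0.2, 0.1],
    "paper_c": [0.0, 0.1, 0.9],
}

# Hypothetical citation links, stored as an undirected edge set.
edges = {frozenset(("paper_a", "paper_c"))}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def combined_score(a, b, alpha=0.5):
    """Blend semantic similarity with a 0/1 link indicator.

    alpha = 1.0 uses only the embeddings; alpha = 0.0 uses only the graph.
    The linear blend is an illustrative assumption, not the report's method.
    """
    semantic = cosine_similarity(embeddings[a], embeddings[b])
    linked = 1.0 if frozenset((a, b)) in edges else 0.0
    return alpha * semantic + (1.0 - alpha) * linked


if __name__ == "__main__":
    for pair in [("paper_a", "paper_b"), ("paper_a", "paper_c")]:
        print(pair, round(combined_score(*pair), 3))
```

In this toy setup, `paper_a` and `paper_b` are close in embedding space but unlinked, while `paper_a` and `paper_c` are linked but semantically distant, so the ranking of the two pairs depends on `alpha`. That is one simple way to see how the two signals could compensate for each other.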

License

German copyright law applies. The work or content may be downloaded, consumed, stored, or printed free of charge for your own use, but it may not be made available on the internet or passed on to external parties.