Final Report on DFG Project "Automatic Transcription of Conversations"

Date
2025
Publisher
Hannover : Technische Informationsbibliothek
Abstract

Multi-talker conversational speech recognition is concerned with transcribing meetings recorded with distant microphones. The difficulty of the task can be attributed to three factors. First, the recording conditions are challenging: the speech signal captured by distant microphones is noisy and reverberant and often contains nonstationary acoustic distortions, which makes it hard to decode. Second, a significant percentage of the time contains overlapped speech, where multiple speakers talk simultaneously. Finally, the interaction dynamics of the scenario are challenging, because speakers talk intermittently, with alternating segments of speech inactivity, single-talker, and multi-talker speech.

This project was concerned with developing a transcription system that can operate on arbitrarily long input, correctly handles segments of overlapped as well as non-overlapped speech, and transcribes the speech of different speakers consistently into separate output streams. Such a multi-talker Automatic Speech Recognition (ASR) system typically consists of three components: a source separation and enhancement block, a diarization stage that attributes segments of input speech to speakers, and an ASR stage. Different processing orders have been proposed for these components, which differ mainly in when diarization is performed.

While existing approaches employed separately trained subsystems for diarization, separation, and recognition, our research hypothesis was that a joint approach, optimized under a single training objective, should lead to superior solutions compared to the separate optimization of individual components. Such a coherent formulation, however, does not necessarily mean that the three aforementioned tasks have to be carried out in a single, monolithic (probably neural) integrated system. Indeed, the research showed that it is beneficial to have separate subsystems, albeit with a tight coupling between them. Examples of such systems we developed are:

• TS-SEP, which carries out diarization and separation/enhancement, with a tight coupling between the two.

• The mixture encoder, which leverages explicit speech separation but also forwards the not-yet-separated speech to the ASR module to mitigate error propagation from the separator to the recognizer (a minimal sketch of this idea follows below).

• Joint diarization and separation, realized by a statistical mixture model that integrates a mixture model for diarization and one for separation, sharing a common hidden state variable.

• Transcription-supported diarization, which uses sentence- and word-level boundaries from the ASR module to support speaker turn detection.

Furthermore, we developed new approaches to the individual subsystems and shared several tools and data sets with the research community.
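To make the mixture encoder idea from the list above concrete, the following is a minimal, hypothetical PyTorch sketch, not the architecture from the report. All module choices, names (MixtureEncoder, stream_enc, mixture_enc, fusion), and dimensions are illustrative assumptions; the sketch only demonstrates the stated principle that the recognizer's encoder sees both the separated per-speaker streams and the unseparated mixture.

```python
# Hypothetical sketch of a "mixture encoder": the ASR encoder consumes both
# the separated speaker streams and the original (unseparated) mixture, so
# that separation errors are not the only path into the recognizer.
# All module names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn


class MixtureEncoder(nn.Module):
    """Encodes each separated stream together with the raw mixture."""

    def __init__(self, feat_dim: int = 80, hidden_dim: int = 256):
        super().__init__()
        self.stream_enc = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.mixture_enc = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Fuse per-stream and mixture representations frame by frame.
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, separated: torch.Tensor, mixture: torch.Tensor):
        # separated: (batch, num_speakers, time, feat_dim)
        # mixture:   (batch, time, feat_dim)
        mix_repr, _ = self.mixture_enc(mixture)  # (batch, time, hidden)
        outputs = []
        for k in range(separated.shape[1]):
            stream_repr, _ = self.stream_enc(separated[:, k])
            fused = self.fusion(torch.cat([stream_repr, mix_repr], dim=-1))
            outputs.append(fused)
        # One encoded sequence per speaker, each informed by the mixture.
        return torch.stack(outputs, dim=1)  # (batch, num_speakers, time, hidden)


if __name__ == "__main__":
    enc = MixtureEncoder()
    sep = torch.randn(2, 2, 100, 80)  # two speakers, 100 feature frames
    mix = torch.randn(2, 100, 80)
    print(enc(sep, mix).shape)  # torch.Size([2, 2, 100, 256])
```

Because the mixture branch is shared across speakers, each per-speaker encoding can fall back on information from the unprocessed signal when the separator output is unreliable, which is the error-propagation mitigation described above.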

Keywords
Automatic Speech Recognition, Speech Enhancement, Diarization, Source Separation, Meeting Recognition
License
Under § 53 UrhG (German Copyright Act), this document may be downloaded, read, stored, and printed free of charge for personal use, but it may not be made available on other websites or passed on to third parties.