Final Report on DFG Project "Automatic Transcription of Conversations"

Date
2025
Publisher
Hannover : Technische Informationsbibliothek
Abstract

Multi-talker conversational speech recognition is concerned with transcribing meetings recorded with distant microphones. The difficulty of the task can be attributed to three factors. First, the recording conditions are challenging: the speech signal captured by distant microphones is noisy and reverberant and often contains nonstationary acoustic distortions, which makes it hard to decode. Second, a significant percentage of the time contains overlapped speech, where multiple speakers talk simultaneously. Finally, the interaction dynamics of the scenario are challenging, because speakers talk intermittently, with alternating segments of speech inactivity, single-talker, and multi-talker speech.

This project was concerned with developing a transcription system that can operate on arbitrarily long input, correctly handles segments of overlapped as well as non-overlapped speech, and transcribes the speech of different speakers consistently into separate output streams. Such a multi-talker Automatic Speech Recognition (ASR) system typically consists of three components: a source separation and enhancement block, a diarization stage that attributes segments of input speech to speakers, and an ASR stage. Different processing orders have been proposed for these components, which differ mainly in when diarization is performed.

While existing approaches employed separately trained subsystems for diarization, separation, and recognition, our research hypothesis was that a joint approach, optimized under a single training objective, should lead to superior solutions compared to the separate optimization of individual components. Such a coherent formulation, however, does not necessarily mean that the three aforementioned tasks have to be carried out in a single, monolithic (probably neural) integrated system. Indeed, the research showed that it is beneficial to have separate subsystems, albeit with a tight coupling between them. Examples of such systems we developed are:

• TS-SEP, which carries out diarization and separation/enhancement, with a tight coupling between the two.

• The mixture encoder, which leverages explicit speech separation but also forwards the not-yet-separated speech to the ASR module to mitigate error propagation from the separator to the recognizer (a minimal sketch of this idea follows below).

• Joint diarization and separation, realized by a statistical mixture model that integrates a mixture model for diarization and one for separation, sharing a common hidden state variable.

• Transcription-supported diarization, which uses sentence- and word-level boundaries from the ASR module to support speaker turn detection.

Furthermore, we developed new approaches to the individual subsystems and shared several tools and data sets with the research community.
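To make the mixture encoder idea from the list above concrete, the following is a minimal, hypothetical PyTorch sketch, not the architecture from the report. All module choices, names (MixtureEncoder, stream_enc, mixture_enc, fusion), and dimensions are illustrative assumptions; the sketch only demonstrates the stated principle that the recognizer's encoder sees both the separated per-speaker streams and the unseparated mixture.

```python
# Hypothetical sketch of a "mixture encoder": the ASR encoder consumes both
# the separated speaker streams and the original (unseparated) mixture, so
# that separation errors are not the only path into the recognizer.
# All module names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn


class MixtureEncoder(nn.Module):
    """Encodes each separated stream together with the raw mixture."""

    def __init__(self, feat_dim: int = 80, hidden_dim: int = 256):
        super().__init__()
        self.stream_enc = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.mixture_enc = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Fuse per-stream and mixture representations frame by frame.
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, separated: torch.Tensor, mixture: torch.Tensor):
        # separated: (batch, num_speakers, time, feat_dim)
        # mixture:   (batch, time, feat_dim)
        mix_repr, _ = self.mixture_enc(mixture)  # (batch, time, hidden)
        outputs = []
        for k in range(separated.shape[1]):
            stream_repr, _ = self.stream_enc(separated[:, k])
            fused = self.fusion(torch.cat([stream_repr, mix_repr], dim=-1))
            outputs.append(fused)
        # One encoded sequence per speaker, each informed by the mixture.
        return torch.stack(outputs, dim=1)  # (batch, num_speakers, time, hidden)


if __name__ == "__main__":
    enc = MixtureEncoder()
    sep = torch.randn(2, 2, 100, 80)  # two speakers, 100 feature frames
    mix = torch.randn(2, 100, 80)
    print(enc(sep, mix).shape)  # torch.Size([2, 2, 100, 256])
```

Because the mixture branch is shared across speakers, each per-speaker encoding can fall back on information from the unprocessed signal when the separator output is unreliable, which is the error-propagation mitigation described above.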

Keywords
Automatic Speech Recognition, Speech Enhancement, Diarization, Source Separation, Meeting Recognition
License
Under § 53 UrhG (German Copyright Act), this document may be downloaded, read, stored, and printed free of charge for personal use, but it may not be made available on other websites or passed on to third parties.