Final Report on DFG Project "Automatic Transcription of Conversations"

Häb-Umbach, Reinhold; Schlüter, Ralf

doi:https://doi.org/10.34657/17877

dc.contributor.author	Häb-Umbach, Reinhold
dc.contributor.author	Schlüter, Ralf
dc.date.accessioned	2025-05-07T11:22:09Z
dc.date.available	2025-05-07T11:22:09Z
dc.date.issued	2025
dc.description.abstract	Multi-talker conversational speech recognition is concerned with transcribing meetings recorded with distant microphones. The difficulty of the task can be attributed to three factors. First, the recording conditions are challenging: the speech signal captured by distant microphones is noisy and reverberant and often contains nonstationary acoustic distortions, which makes it hard to decode. Second, a significant fraction of the time contains overlapped speech, where multiple speakers talk at the same time. Finally, the interaction dynamics are challenging because speakers talk intermittently, with alternating segments of speech inactivity, single-talker speech, and multi-talker speech. This project was concerned with developing a transcription system that can operate on arbitrarily long input, correctly handles segments of overlapped as well as non-overlapped speech, and transcribes the speech of different speakers consistently into separate output streams. Such a multi-talker Automatic Speech Recognition (ASR) system typically consists of three components: a source separation and enhancement block, a diarization stage that attributes segments of input speech to speakers, and an ASR stage. Different processing orders have been proposed for these components, which differ in when diarization is performed. While existing approaches employed separately trained subsystems for diarization, separation, and recognition, our research hypothesis was that a joint approach, optimized under a single training objective, should lead to superior solutions compared to the separate optimization of individual components. Such a coherent formulation, however, would not necessarily mean that the three aforementioned tasks had to be carried out in a single, monolithic (probably neural) integrated system.
Indeed, the research showed that it is beneficial to have separate subsystems, provided they are tightly coupled. Examples of such systems we developed are:
• TS-SEP, which carries out diarization and separation/enhancement with a tight coupling between the two.
• Mixture encoder, which leverages explicit speech separation but also forwards the not-yet-separated speech to the ASR module to mitigate error propagation from the separator to the recognizer.
• Joint diarization and separation, realized by a statistical mixture model that integrates a mixture model for diarization and one for separation, which share a common hidden state variable.
• Transcription-supported diarization, which uses sentence- and word-level boundaries from the ASR module to support speaker turn detection.
Furthermore, we developed new approaches to the individual subsystems and shared several tools and data sets with the research community.	eng
dc.description.sponsorship	Deutsche Forschungsgemeinschaft, DFG reference numbers Ha3455/19-1 and SCHL2043/2-1
dc.description.version	publishedVersion
dc.identifier.uri	https://oa.tib.eu/renate/handle/123456789/18860
dc.identifier.uri	https://doi.org/10.34657/17877
dc.language.iso	eng
dc.publisher	Hannover : Technische Informationsbibliothek
dc.rights.license	German copyright law applies. The work or content may be downloaded, consumed, stored or printed for your own use but it may not be distributed via the internet or passed on to external parties.
dc.rights.license	This document may be downloaded, read, stored and printed for your own use within the limits of § 53 UrhG but it may not be distributed via the internet or passed on to external parties.	eng
dc.subject.ddc	4
dc.subject.other	Automatic Speech Recognition	ger
dc.subject.other	Speech Enhancement	ger
dc.subject.other	Diarization	ger
dc.subject.other	Source Separation	ger
dc.subject.other	Meeting Recognition	ger
dc.title	Final Report on DFG Project "Automatic Transcription of Conversations"	eng
dc.type	Report
dc.type	Text
dcterms.extent	10 p.
tib.accessRights	openAccess

Files

Original bundle

Name: final_report_public_part_dfg_converse.pdf
Size: 354.75 KB
Format: Adobe Portable Document Format

Collections

Informatik
Forschungsberichte ohne Pflichtabgabe (DFG, IGF…)