Final Report on DFG Project "Automatic Transcription of Conversations"

dc.contributor.author: Häb-Umbach, Reinhold
dc.contributor.author: Schlüter, Ralf
dc.date.accessioned: 2025-05-07T11:22:09Z
dc.date.available: 2025-05-07T11:22:09Z
dc.date.issued: 2025
dc.description.abstract: Multi-talker conversational speech recognition is concerned with transcribing meetings recorded with distant microphones. The difficulty of the task can be attributed to three factors. First, the recording conditions are challenging: the speech signal captured by distant microphones is noisy and reverberant and often contains nonstationary acoustic distortions, which makes it hard to decode. Second, a significant percentage of the recording time contains overlapped speech, where multiple speakers talk at the same time. Finally, the interaction dynamics are challenging because speakers talk intermittently, so that segments of speech inactivity, single-talker speech, and multi-talker speech alternate. This project was concerned with developing a transcription system that operates on arbitrarily long input, correctly handles both overlapped and non-overlapped speech segments, and transcribes the speech of different speakers consistently into separate output streams. Such a multi-talker Automatic Speech Recognition (ASR) system typically consists of three components: a source separation and enhancement block, a diarization stage, which attributes segments of input speech to speakers, and an ASR stage. Different orders of processing have been proposed, which differ in when diarization is performed. While existing approaches employed separately trained subsystems for diarization, separation, and recognition, our research hypothesis was that a joint approach, optimized under a single training objective, should lead to superior solutions compared to optimizing the individual components separately. Such a coherent formulation, however, would not necessarily mean that the three tasks had to be carried out in a single, monolithic (probably neural) integrated system.
Indeed, our research showed that it is beneficial to keep separate subsystems but to couple them tightly. Examples of such systems we developed are:
• TS-SEP, which carries out diarization and separation/enhancement with a tight coupling between the two.
• Mixture encoder, which leverages explicit speech separation but also forwards the not yet separated speech to the ASR module to mitigate error propagation from the separator to the recognizer.
• Joint diarization and separation, realized by a statistical mixture model that integrates a mixture model for diarization and one for separation, which share a common hidden state variable.
• Transcription-supported diarization, which uses sentence- and word-level boundaries from the ASR module to support speaker turn detection.
Furthermore, we developed new approaches to the individual subsystems and shared several tools and data sets with the research community. [eng]
dc.description.sponsorship: Deutsche Forschungsgemeinschaft, DFG reference numbers Ha3455/19-1 and SCHL2043/2-1
dc.description.version: publishedVersion
dc.identifier.uri: https://oa.tib.eu/renate/handle/123456789/18860
dc.identifier.uri: https://doi.org/10.34657/17877
dc.language.iso: eng
dc.publisher: Hannover : Technische Informationsbibliothek
dc.rights.license: This document may be downloaded, read, stored, and printed free of charge for personal use within the scope of § 53 UrhG (German Copyright Act), but may not be made available on other websites on the internet or passed on to third parties. [translated from German]
dc.subject: Automatic Speech Recognition [eng]
dc.subject: Speech Enhancement [eng]
dc.subject: Diarization [eng]
dc.subject: Source Separation [eng]
dc.subject: Meeting Recognition [eng]
dc.subject.ddc: 004
dc.title: Final Report on DFG Project "Automatic Transcription of Conversations" [eng]
dc.type: Report
dc.type: Text
dcterms.extent: 10 pp.
tib.accessRights: openAccess
Files
Original bundle
Name: final_report_public_part_dfg_converse.pdf
Size: 354.75 KB
Format: Adobe Portable Document Format