We are pleased to announce that

The Kiel Corpus of Spoken German
Read and Spontaneous Speech

is now available as a new, revised and enlarged Edition

by Klaus J. Kohler, Benno Peters, Michel Scheffers (Kiel 2017)

History of the Kiel Corpus

In the 1990s, the Institute of Phonetics and Digital Speech Processing (IPDS) at Kiel University was part of a broad national academic and industrial consortium for computer processing of Spoken German. There were two project phases focusing, in turn, on Read and Spontaneous Speech (PHONDAT90/92, VERBMOBIL), funded by the German Federal Ministry of Research and Technology (BMFT, later BMBF). IPDS recorded large-scale read and spontaneous databases and developed them into computerized formats with systematically related orthographic and phonetic (segmental and prosodic) annotations, for further processing and use within the consortium. The work was directed by Klaus J. Kohler; manual phonetic labelling was carried out with high precision by Wim van Dommelen, Adrian Simpson, Benno Peters, Klaus J. Kohler, and, under their supervision, by phonetically trained students. Michel Scheffers developed and serviced the speech processing software and integrated the data into the Kiel Corpus, which IPDS issued on CD-ROMs as Read Speech (Kiel PHONDAT90/92) in 1994, and as Spontaneous Speech (Kiel VERBMOBIL) in three parts between 1995 and 1997.

The PHONDAT90/92 corpus contains readings of different sets of sentences and of two short story texts. The VERBMOBIL corpus comprises dialogues in a standardized appointment-making scenario. To simplify data processing, a technique was used that prevents the recording of overlapping speech (g-dialogues): holding a button pressed while speaking opens the speaker's recording channel and blocks the channel of the other speaker. Although naturalness increases from the text reading of PHONDAT to the dialogue interaction of VERBMOBIL, the channel selection procedure removed essential aspects of natural spontaneous speech from these exchanges, particularly various forms of turn taking. The result is unscripted speech with some features of spontaneity rather than fully spontaneous speech; the two sections of the Kiel Corpus are nevertheless referred to as Read Speech and Spontaneous Speech.

To get closer to natural spontaneous speech under standardized studio recording conditions, Benno Peters devised a new scenario for a third section of the Kiel Corpus after the VERBMOBIL project had ended: VIDEOTASK. In this scenario, similar but non-identical video material is presented separately to two subjects sitting in different rooms. After the presentation, the subjects discuss differences and similarities in what they have seen and heard. As a basis for the data collection, two tapes were spliced together from a number of episodes of the well-known German television series ‘Lindenstraße’. The selection of subjects was guided by two new principles: subjects knew each other and were familiar with the ‘Lindenstraße’ series. The dialogues were recorded in stereo and then processed in the same way as the VERBMOBIL corpus, with orthographic and phonetic annotations adjusted to the different 2-channel recording. This work was funded in two German Research Council (DFG) projects on Sound Patterns and Prosodic Phrasing in German Spontaneous Speech 1997-2003 under the directorship of K. J. Kohler.

This New Edition of the Kiel Corpus of Spoken German comprises all the previously published Kiel PHONDAT and VERBMOBIL files, with additional data in both sections, as well as the VIDEOTASK corpus. For more information see the Overview. The distribution of the new Corpus is administered through the General Linguistics Section of the Institute of Scandinavian Studies, Frisian and General Linguistics (ISFAS) at Christian-Albrechts-Universität Kiel (CAU).

Overview of the New Edition of the Kiel Corpus of Spoken German

Annotation of the speech data proceeded along the following series of levels:

  1. Orthographic text, given in the Read Speech corpus, or produced by orthographic transliteration for the Spontaneous Speech dialogues, with special characters representing phonetic phenomena such as breathing, pauses and clicks; transliteration was done by advanced students of phonetics under the supervision of Adrian Simpson, Benno Peters and Klaus J. Kohler.
  2. Phonematic transcription, automatically generated by the grapheme-to-phoneme module of the text-to-speech system RULSYS (Kohler 1997).
  3. Segmental labelling, manually produced on the basis of the automatically generated phonematic transcription (Kohler, Pätzold, Simpson 1995).
  4. Prosodic labelling, using the symbolic system PROLAB, which is based on the Kiel Intonation Model (KIM) (Kohler 1997, 2017).

The previous CD-ROM editions of the Kiel Corpus contain only data annotated at levels 1-3. The core of the New Edition of the Kiel Corpus of Spoken German has been annotated at all four levels. The VIDEOTASK data and an additional dialogue (f06), recorded within the VERBMOBIL scenario but prior to the introduction of the button-pressing technique, have been added to the Spontaneous Speech section. Moreover, this New Edition includes additional data that are not annotated at all four levels. To facilitate their identification, they are stored in separate sub-directories according to corpus section and annotation levels:

  • addcorp in Read Speech: with annotation levels (1-2) only
  • VMaddSeg in Spontaneous Speech: 3 additional VERBMOBIL dialogues without annotation level (4)
  • VMaddOrt in Spontaneous Speech: 8 additional VERBMOBIL dialogues with annotation level (1) only
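The relation between the sub-directories and the annotation levels listed above can be written down as a small lookup table. The following Python sketch is purely illustrative: the sub-directory names and level numbers come from the text, while the dictionary and function names are hypothetical and are not part of the corpus distribution.

```python
# Annotation levels as numbered in the overview above.
LEVEL_NAMES = {
    1: "orthographic text",
    2: "phonematic transcription",
    3: "segmental labelling",
    4: "prosodic labelling",
}

# Sub-directory names are taken from the list above; this mapping is an
# illustration, not a file shipped with the corpus.
ADDITIONAL_DATA = {
    "addcorp": ("Read Speech", (1, 2)),
    "VMaddSeg": ("Spontaneous Speech", (1, 2, 3)),
    "VMaddOrt": ("Spontaneous Speech", (1,)),
}


def describe(subdir):
    """Return a one-line summary of the annotation available in a sub-directory."""
    section, levels = ADDITIONAL_DATA[subdir]
    names = ", ".join(LEVEL_NAMES[n] for n in levels)
    return f"{subdir} ({section}): {names}"


def missing_levels(subdir):
    """Return the annotation levels that a sub-directory lacks compared with the core."""
    _, levels = ADDITIONAL_DATA[subdir]
    return [n for n in sorted(LEVEL_NAMES) if n not in levels]


if __name__ == "__main__":
    for name in ADDITIONAL_DATA:
        print(describe(name), "| missing levels:", missing_levels(name))
```

A lookup of this kind may be convenient when a script has to decide whether segmental or prosodic labels can be expected for a given file.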

 

Michel Scheffers reworked the data structure of the Kiel Corpus for the New Edition by

  • grouping the files into two main sections, Read Speech and Spontaneous Speech, and subdividing the latter into VERBMOBIL and VIDEOTASK
  • providing a uniform naming structure for the files in both sections
  • converting audio files to wav-format
  • correcting numerical overflow errors in some audio files in Read Speech
  • correcting errors at the orthographic and label levels

This New Edition has the following data volume: numbers of male (m) and female (f) speakers/speaker pairs and recording times in Read Speech and Spontaneous Speech and their various subsections.

                              nr. of speakers       recording times (hrs:min:sec)
                              m     f     total     m          f          total
Read Speech                   27    26    53        2:43:52    2:57:30    5:41:22
  core (levels 1-4)           27    26    53        1:59:45    2:15:20    4:15:05
  extension (not labelled)    14    14    28        0:44:07    0:42:10    1:26:17

Spontaneous Speech            37    27    64        5:51:16    4:06:49    8:34:35
  core (levels 1-4)           22    21    43        3:10:24    3:17:56    5:04:50
  extension                   15    7     22        2:40:52    0:48:53    3:29:45
    VMaddSeg (levels 1-3)     3     3     6         0:29:21    0:16:47    0:46:08
    VMaddOrt (level 1)        12    4     16        2:11:31    0:32:06    2:43:37
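
The total recording time of each section in the table is the sum of the totals of its core and extension parts. A minimal Python sketch of this arithmetic is given below; the helper names are hypothetical, and the durations are copied from the total column of the table.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'h:mm:ss' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(duration):
    """Convert a timedelta back into an 'h:mm:ss' string."""
    total = int(duration.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def add_hms(*values):
    """Add any number of 'h:mm:ss' strings and return the sum in the same format."""
    return format_hms(sum((parse_hms(v) for v in values), timedelta()))


# Totals of core + extension for each section, as listed in the table.
read_total = add_hms("4:15:05", "1:26:17")         # Read Speech
spont_total = add_hms("5:04:50", "3:29:45")        # Spontaneous Speech
spont_extension = add_hms("0:46:08", "2:43:37")    # VMaddSeg + VMaddOrt

if __name__ == "__main__":
    print(read_total)        # 5:41:22
    print(spont_total)       # 8:34:35
    print(spont_extension)   # 3:29:45
```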

 

The Editors have prepared the corpus documentation Info_KielCorp_2017.pdf, which may be downloaded via the link http://www.isfas.uni-kiel.de/de/linguistik/forschung/kiel-corpus/docs/Info_KielCorp_2017