Categorical Perception in Intonation and the Dynamics of F0 and Intensity Movements: Background Information, Illustrations, Audio Examples

 

 

Oliver Niebuhr (2007). Categorical Perception in Intonation and the Dynamics of F0 and Intensity movements
Background Information, Illustrations, Audio Examples.

O. Niebuhr, Institute of Phonetics and Digital Speech Processing (IPDS), Christian-Albrechts-University, Kiel, Germany



The following explanations provide background information, illustrations (including acoustic analyses), and sound examples for the paper “Categorical perception in intonation – a matter of signal dynamics?”. Accordingly, they may not be self-explanatory. If anything remains unclear, please refer to the paper and to the literature listed at the bottom of the page.

The phonology of the Kiel Intonation Model KIM (Kohler 1991a) for German distinguishes between two F0 contour classes: (falling-)rising F0 valleys and rising-falling F0 peaks. The latter are further differentiated by the synchronization of the peak contour (defined by the F0 maximum) relative to the boundaries of the accented vowel. Early peaks show the maximum before the vowel onset, late peaks after the vowel offset (provided that there are following unaccented syllables). Medial peaks have the maximum within the boundaries of the vowel. The three peak categories convey attitudinal meanings, which can be outlined as ‘given’ or ‘unchangeable’ (for early), ‘new’ (for medial), and ‘unexpected’ (for late). Figure 1 shows characteristic F0 courses for the three peak categories of the KIM in the short utterance “Eine Malerin” (‘a painter’, accented syllable “Ma-”).
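The phonological definition above can be written down as a small decision rule. The following is only a minimal sketch: the function name, the example timings, and the treatment of a maximum that falls exactly on a vowel boundary (counted as medial here) are assumptions made for illustration, not part of the KIM. It captures the definition by synchronization only, not the perceptual cues discussed further below.

```python
def classify_peak(f0_max_time, vowel_onset, vowel_offset):
    """Return the peak category for the time of the F0 maximum (seconds)."""
    if f0_max_time < vowel_onset:
        return "early"    # maximum before the accented-vowel onset
    if f0_max_time > vowel_offset:
        return "late"     # maximum after the accented-vowel offset
    return "medial"       # maximum within the vowel boundaries (assumed inclusive)


# Hypothetical timings: accented vowel from 0.50 s to 0.68 s.
for t in (0.46, 0.58, 0.74):
    print(t, classify_peak(t, vowel_onset=0.50, vowel_offset=0.68))
```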

 

Fig.1 showing acoustic analyses of early, medial, and late peak productions

 

Figure 1: Oscillograms (top) and spectrograms (bottom) for the utterance “Eine Malerin”, produced with an early peak (left), a medial peak (middle), and a late peak (right). Each of the spectrograms additionally contains the F0 course (blue) and the intensity course (yellow). Vertical red lines show the CV boundaries in the accented syllable “Ma-”.




The phonological conceptualization of the KIM is reflected in the peak shift paradigm used in the perception experiments by Kohler (1987, 2005). In this paradigm, the complete rising-falling peak contour is shifted in equal-sized steps across the boundaries of the accented vowel in a constant stimulus utterance (e.g. “Sie hat ja gelogen”, ‘She’s been lying’, or “Er war mal schlank”, ‘He used to be slim’). A stimulus is generated for each of these steps. The stimuli of the resulting series (representing an acoustic peak synchronization continuum) are then integrated into AX discrimination tests and indirect function-based identification tests.

In the latter, the stimuli are preceded by a constant context utterance, naturally produced by the same speaker as the stimulus utterance. Context and stimulus utterances show comparable voice qualities, registers, speaking rates, and styles, so the hearer can in principle interpret them as a connected pair of utterances. However, the two only match if the meaning of the stimulus utterance (an amalgam of its lexical and intonational/prosodic components) fits into the semantic-pragmatic frame established by the preceding context utterance. So, provided that the meanings of the intonation categories under investigation have been outlined beforehand, context utterances can be selected that are compatible with the stimuli on the lexical level but match them only if the stimuli show (the meaning of) a particular intonation category, in this case a particular peak category. The reactions of subjects judging whether context and stimulus match can therefore be interpreted as indirect identifications of the peak category in the stimuli. Starting from this general function-based paradigm, perception experiments are created by repeating the context-stimulus pairs several times (usually 10 times) in randomized order.

In this way, Kohler (1987) found an abrupt change from not matching to matching for F0 peaks that cross the boundary of the accented-vowel onset, representing the change from early to medial. Using a slightly different experimental setup, Kohler (1991b) also found that matching judgements decreased again for F0 peaks with maxima after the accented-vowel offset. Furthermore, by combining the results of the indirect identification tests with the results of the AX discrimination tests, Kohler (1987, 1991b) postulated a categorical change in perception from early to medial and a gradual change in perception from medial to late.
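To make the notion of an identification course concrete, the sketch below turns match/no-match judgements into a percentage of "matching" responses per stimulus of the peak shift continuum. The function name, the data layout, and the toy judgements are invented for illustration; they do not reproduce the analysis or data of the cited experiments.

```python
from collections import Counter


def identification_course(judgements, n_stimuli):
    """Percentage of 'matching' responses for each stimulus 1..n_stimuli.

    judgements: iterable of (stimulus_number, matches) pairs, pooled over
    repetitions and subjects; matches is True for a 'matching' response.
    """
    total = Counter()
    matched = Counter()
    for stimulus, matches in judgements:
        total[stimulus] += 1
        matched[stimulus] += int(matches)
    # None marks a stimulus without any judgement.
    return [100.0 * matched[s] / total[s] if total[s] else None
            for s in range(1, n_stimuli + 1)]


# Invented toy data for a three-step continuum.
toy = [(1, False), (1, False), (2, False), (2, True), (3, True), (3, True)]
print(identification_course(toy, 3))  # [0.0, 50.0, 100.0]
```

An abrupt change corresponds to a course that jumps between two adjacent stimuli, a gradual one to a course that rises over several steps.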

Based on the phonology of the KIM and the indirect function-based identification test paradigm, further perception experiments by Niebuhr (2003, 2006a, b) showed that the F0 peak synchronization with the accented vowel is not the only factor determining the perception of German early, medial, and late peaks. The identification of the peak categories is also influenced by other F0 parameters such as peak shape and height. Moreover, even non-F0 variables such as the intensity levels of the syllables underlying the F0 peak (i.e. of the accented syllable and the two adjacent unaccented ones) contribute to the peak category identification (see this PPT for a summary). These findings converged on the idea that hearers do not identify the German peak categories by perceiving the different peak synchronizations (relative to segmental events) themselves. Instead, early, medial, and late peaks are assumed to be signalled by an interplay of F0 and intensity in which the low portions at the beginning and the end of the rising-falling peak as well as the high portion around the peak maximum are perceptually highlighted to different degrees by the underlying intensity course. The resulting different perceptual prominence patterns (of tonal events) within the pitch peak separate the three German peak categories. This idea is illustrated in Figure 2.

 

Fig.2 illustrates the signalling of early, medial, and late peaks by combinations of pitch and prominence patterns

 

Figure 2: The signalling of German early, medial, and late peaks by combinations of pitch and prominence patterns.




Consequently, creating the characteristic F0 peak synchronizations (as well as the characteristic shapes and heights) found in productions of early, medial, and late peaks (see Fig. 1) may merely be the most efficient strategy for generating the corresponding prominence patterns in the pitch peaks, since it exploits the intensity pattern that is intrinsically given by the segment (or syllable) articulation. Segmental information in terms of formant patterns and the corresponding sound perceptions should therefore not be decisive for the identification of the German peak categories. Instead, the intensity pattern intrinsically established by the segmental string should be sufficient. This was tested in a perception experiment by Niebuhr (2006a). The comparative study started from a stimulus series used in the experiments of Niebuhr (2003), in which a fast rising-falling peak contour was shifted in 11 steps across the vowel onset of the (only) accented syllable “Ma-” in the utterance “Sie war mal Malerin” (‘She was once a painter’). Using Praat, the 11 original speech stimuli were resynthesized as HUM sounds. That is, the F0 and intensity courses of each stimulus were kept and integrated into a constant schwa-like formant pattern (to adjust the intensity courses of the HUM stimuli exactly to those of the original speech stimuli, the amplitude envelopes of the 11 HUM stimuli were further edited in Cool Edit; see also this PPT). If the identification of the German peak categories relies on an interplay of F0 and intensity, the identification course that was obtained for the 11 original speech stimuli in Niebuhr (2003), and that showed an abrupt transition from early to medial peak, should be replicable with the 11 HUM stimuli, which contain only the original F0 and intensity courses. Since HUM stimuli are non-speech stimuli, the perceptual judgements could not be guided by meaning.
That is, the function-based identification test could not be applied to the HUM stimuli. It was therefore decided to integrate the stimuli into an AXB test, in which A and B provide a constant context frame represented by stimuli 1 and 11, i.e. the stimuli with the extreme peak synchronizations. The X stimuli varied between stimulus 1 and stimulus 11. Analogous to the context-stimulus pairs created for the original speech stimuli, the AXB triplets were repeated several times in randomized order in the perception experiment. Subjects were then asked whether the stimulus in the centre of the triplet equals the first or the third stimulus (i.e. A or B). The demands of this task are similar to those of the indirect identification test. As for the results, Niebuhr (2006a) did in fact find that the identification course obtained for the HUM stimuli (showing a transition from X=A to X=B) runs parallel to the one obtained by Niebuhr (2003) for the original speech stimuli. Statistical tests revealed no significant differences between the two courses. This is illustrated at the top of Figure 3. Moreover, it can be seen at the bottom of Figure 3 that the perceptual transition from early to medial peak, or from X=A to X=B, respectively, takes place between the F0 peak positions of stimuli 4 and 8, in which the F0 peak (maximum) is shifted from the low intensity level of the (original) nasal [m] to the high intensity level of the (original accented) vowel [a:].
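The AXB design described here can be sketched as a trial list in which A and B are fixed to the two extreme stimuli and X runs through the whole series. The function name, the number of repetitions, and the fixed random seed are assumptions for illustration; the text only states that the triplets were repeated several times in randomized order.

```python
import random


def build_axb_trials(n_stimuli=11, repetitions=10, seed=0):
    """Return a shuffled list of (A, X, B) stimulus-number triplets."""
    a, b = 1, n_stimuli                   # constant frame: the two extreme stimuli
    trials = [(a, x, b)
              for x in range(1, n_stimuli + 1)
              for _ in range(repetitions)]  # hypothetical repetition count
    random.Random(seed).shuffle(trials)   # fixed seed only for reproducibility
    return trials


trials = build_axb_trials()
print(len(trials))   # 11 X stimuli x 10 repetitions = 110 triplets
print(trials[:3])    # first three triplets of the randomized order
```

Each response ("X equals A" or "X equals B") can then be tallied per X stimulus to obtain the % X=B course shown in Figure 3.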

 

Fig.3 Almost congruent identification curves for speech and HUM stimuli and the F0 peak positions in the transition phase of the two identification curves

 

Figure 3: Top: The grey line represents the identification course (% matching of context-stimulus pairs for all subjects, n=280) for the 11 original speech stimuli obtained in the indirect identification test carried out by Niebuhr (2003). The black line shows the identification course (% X=B for all subjects, n=120) obtained for the 11 HUM stimuli judged in the AXB test of Niebuhr (2006a). Bottom: The F0 peak positions in stimuli 4 and 8 in relation to the intensity course. The transition phase in the two identification courses coincides with the F0 peak positions in the increase in intensity from the low level of the (original) nasal [m] to the high level of the (original accented) vowel [a:].



The fact that the judgement behaviour of the subjects for the 11 original speech stimuli could be replicated with the 11 HUM stimuli, which contain only the F0 and intensity courses of the original speech stimuli, supports the idea that the signalling of German early, medial, and late peaks involves an interplay of F0 and intensity. If this general idea is combined with the further finding that the perceptual transition between the two melodic categories coincides with those F0 peak positions that fall into the increase in intensity from C to V of the (original) accented syllable, it gives rise to a further expectation: the dynamics of the perceptual transitions between the peak categories depend on the dynamics of the F0 and intensity movements. That is,


  • (a): The faster the rising and falling movements in the F0 peak, the more abrupt is the perceptual transition between the peak categories;
  • and analogously, (b): The faster the rising and falling movements in the intensity course, the more abrupt is the perceptual transition between the peak categories.
  • (c): According to the idea of an interplay of F0 and intensity, the effect of (b) is restricted to the intensity movements underlying the F0 peak. So, the rising and falling intensity movements at the boundaries of the accented vowel should be primarily relevant.

Expectations (a) and (b) are addressed in the paper “Categorical perception in intonation – a matter of signal dynamics?”. Accordingly, on the basis of the perception experiments carried out by Niebuhr (2003, 2006a), two stimulus series were selected, one for the perceptual transition from early to medial and another for the perceptual transition from medial to late. These two series represent the basic conditions, i.e. they are marked by a fast rising-falling F0 peak and fast rising and falling intensity movements in the accented syllable. The results obtained for these two stimulus series in the corresponding perception experiments were compared with the results of stimulus series showing either slower F0 or slower intensity movements. All stimulus series were based either on the utterance “Sie war mal Malerin” or on “Sie’s mal Malerin gewesen” (both ‘She was once a painter’). So, the accented syllable “Ma-” as well as the adjacent syllables were segmentally identical in all series. Below, two examples are given of the F0 peak shift continua on which the stimulus series were based (Fig. 4). Series aiming at the contrast early vs. medial result from peak shift continua comprising either 7 or 11 steps of 20ms, centered around the accented-vowel onset. The peak shift continua addressing the perceptual change from medial to late also have a step size of 20ms, but consist of either 6 or 9 steps. In both cases, the last three F0 peaks (in the peak shift from left to right) are located after the accented-vowel offset. In all peak shift continua, the complete rising-falling F0 peak contour was shifted. Moreover, in each of the continua, the shifted F0 peaks were integrated into a slight F0 declination ending at a terminal value between 90Hz and 66Hz (depending on the male speaker). The initial F0 value varied between 110Hz and 98Hz. The F0 maxima of the shifted peaks had values between 125Hz and 150Hz (slightly higher F0 peaks were used for the stimulus series from medial to late peak).
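The construction of an early-to-medial peak shift continuum can be illustrated with a rough sketch. It assumes a linear declination line and a triangular (linearly rising and falling) peak superimposed on it; the utterance duration, the vowel onset time, and the concrete F0 values are hypothetical picks from within the ranges given above. The actual stimuli were resynthesized speech, not the output of a model like this.

```python
def peak_times(vowel_onset, n_steps=11, step=0.020):
    """Times (s) of the F0 maxima, centred on the accented-vowel onset."""
    half = (n_steps - 1) / 2
    return [vowel_onset + (i - half) * step for i in range(n_steps)]


def f0_contour(t_peak, duration=1.2, start_hz=110.0, end_hz=90.0,
               peak_hz=130.0, slope=0.130, dt=0.010):
    """Return (time, F0 in Hz) samples: linear declination plus triangular peak."""
    def baseline(t):
        return start_hz + (end_hz - start_hz) * t / duration

    excursion = peak_hz - baseline(t_peak)   # height of the peak above the baseline
    samples = []
    for k in range(int(round(duration / dt)) + 1):
        t = k * dt
        # Weight is 1 at the maximum and falls linearly to 0 within `slope` seconds.
        weight = max(0.0, 1.0 - abs(t - t_peak) / slope)
        samples.append((t, baseline(t) + excursion * weight))
    return samples


# Hypothetical accented-vowel onset at 0.50 s: 11 peak positions in 20 ms steps.
for t_peak in peak_times(0.50):
    t_max, f_max = max(f0_contour(t_peak), key=lambda p: p[1])
    print(f"peak at {t_max:.2f} s, maximum {f_max:.1f} Hz")
```

Doubling `slope` gives a slower rising-falling peak of the kind used in the less dynamic F0 conditions described below.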

 

Fig.4 Examples of peak shift continua for the stimulus series aiming at the perceptual changes from early to medial or from medial to late, respectively

 

Figure 4: Examples of the peak shift continua used to create the stimulus series for the perceptual changes from early to medial (left) and from medial to late (right).



In each of the comparisons of the results of the respective stimulus series, the outcome is in line with (a) and (b). That is, in the transition from early to medial as well as in the transition from medial to late, less dynamic F0 peak and intensity movements result in identification courses with less dynamic perceptual transitions from one peak category to the other across the peak shift continuum. Moreover, if the reaction time measurements obtained for the context-stimulus pairs (summed over all subjects) are included, the perceptual changes found in the two stimulus series representing the basic (i.e. the most dynamic) conditions can count as categorical. By contrast, the perceptual changes in the less dynamic conditions would likely be regarded as gradual. This means that the quality of the perceptual change (categorical vs. gradual) does not depend on the intonational categories/meanings involved. As far as the German early, medial, and late peaks are concerned, a categorical outcome can be turned into a gradual one and vice versa just by modifying the dynamics of the F0 peak movements and the underlying intensity movements. It follows from this conclusion that categorical perception in intonation does not reflect a special cognitive structure or processing in the sense of a linguistic mode based on discrete contrasts and imposed on the acoustic continuity. Consequently, the paradigm of categorical perception cannot be used as an instrument to filter out the intonational categories contained in acoustic continua and/or to separate linguistic categories from paralinguistic ones.

The annotated Figures 5-8 present the F0 and intensity courses of the stimuli in the two basic conditions (1: early vs. medial, referred to as CVff; 2: medial vs. late, referred to as VCff) and compare them with the less dynamic conditions concerning either the F0 courses (CVss and VCss) or the intensity courses (CVint and VCint). In each of the Figures, the F0 course is represented by the red line, whereas the blue line shows the intensity course. The Figures as well as the acoustic analyses were made in Praat. Since the pitch minimum value was set to 30Hz in all analyses, the F0 and intensity values are based on an analysis window length of 100ms. For the stimulus series addressing the perceptual change from early to medial (Fig. 5-6), the acoustic analyses were made for the stimulus in which the F0 peak is synchronized with the accented-vowel onset. The F0 and intensity properties of the stimulus series aiming at the perceptual change from medial to late (Fig. 7-8), on the other hand, are illustrated by the stimulus having the F0 peak at the accented-vowel offset. F0 values are given in semitones relative to 100Hz, intensity values in dB. Please note that the F0 and intensity ranges covered in the Figures can change according to the properties of the stimuli. For each comparison, the audio examples of the analyzed stimulus as well as the complete stimulus series are provided (feel free to download the files; please note, however, that the indices ff and ss are inverted in the sound file names, since there they refer to the German words "steil" ('steep', i.e. fast) and "flach" ('flat', i.e. slow)).
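Two of the quantities mentioned above follow from simple formulas, sketched here. The semitone conversion is the standard 12 * log2(f / reference). The window length assumes that the pitch analysis window spans three periods of the pitch minimum, which is how Praat's documentation describes its pitch analysis and which would yield the 100ms stated above; this link is an assumption on our part, not something the text spells out. Function names are illustrative.

```python
import math


def hz_to_semitones(f_hz, ref_hz=100.0):
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12.0 * math.log2(f_hz / ref_hz)


def pitch_window_length(pitch_floor_hz, periods=3.0):
    """Analysis window in seconds, assuming `periods` cycles of the pitch floor."""
    return periods / pitch_floor_hz


# F0 maxima at the edges of the 125-150 Hz range mentioned earlier.
for f in (125.0, 150.0):
    print(f"{f:.0f} Hz = {hz_to_semitones(f):.2f} st re 100 Hz")

print(f"window for a 30 Hz floor: {pitch_window_length(30.0) * 1000:.0f} ms")
```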


Figure 5: Stimulus series aiming at perceptual change: early to medial (CV); left: most dynamic basic condition (CVff); right: condition with decreased dynamics in the intensity course of the accented syllable (CVint)

  • Comparably fast rising-falling F0 peaks are used in the stimulus series of both conditions
  • The basic condition (CVff, left) shows a fast increase in intensity from the beginning of [m] to the centre of the accented vowel [a:]. The duration of this increase is about 160ms. After the vowel offset, the intensity course shows a fast decrease of about 80ms.
  • On the one hand, the intensity increase in the less dynamic condition (CVint, right) also ends in the centre of the accented vowel. On the other hand, it additionally includes the complete preaccented syllable ("mal"). Correspondingly, it amounts to approximately 210ms. The decrease in intensity after the vowel offset is also about 20ms longer than in the basic condition, i.e. it takes about 100ms.
  • The combinations of identification and reaction time courses (see below) point to a categorical change in the perception from early to medial peak for the stimulus series of the basic condition CVff and to a gradual change in the stimulus series of the less dynamic condition CVint.
Results for CVff and CVint: identification course in black, reaction time course in grey.



Figure 6: Stimulus series aiming at perceptual change: early to medial (CV); left: most dynamic basic condition (CVff); right: condition with slow rising-falling F0 peak (CVss, doubled durations in both slopes)

  • Comparable intensity courses in the stimulus series of both conditions
  • The basic condition (CVff, left) shows a fast rising-falling F0 peak with slope durations of about 130ms.
  • The less dynamic condition (CVss, right) is marked by a slow rising-falling F0 peak with doubled slope durations, i.e. rise and fall take about 260ms.
  • The combinations of identification and reaction time courses (see below) point to a categorical change in the perception from early to medial peak for the stimulus series of the basic condition CVff and to a gradual change in the stimulus series of the less dynamic condition CVss.
Results for CVff and CVss: identification course in black, reaction time course in grey.



Figure 7: Stimulus series aiming at perceptual change: medial to late (VC); left: most dynamic basic condition (VCff); right: condition with decreased dynamics in the intensity course of the accented syllable (VCint)

  • Comparably fast rising-falling F0 peaks are used in the stimulus series of both conditions
  • The basic condition (VCff, left) shows a fast increase in intensity from the beginning of [m] to the centre of the accented vowel [a:]. The duration of this increase is about 140ms. After the vowel offset, the intensity course is slightly dipped but remains at a high level. This plateau abruptly ends after the vowel of the postaccented syllable "-le-" in a very fast intensity decrease of about 80ms reaching to the beginning of the vowel [I] in the last syllable "-rin" of "Malerin".
  • As in the comparison CVff vs. CVint (Fig. 5), the intensity increase in the less dynamic condition (VCint, right) also ends in the centre of the accented vowel, but the increase additionally spans the complete preaccented syllable ("mal"). Correspondingly, it amounts to approximately 260ms. After the accented-vowel offset, the intensity slowly decreases (in 2 steps) over more than 180ms (80ms+100ms). As in the basic condition (VCff, left), however, this decrease ends at the beginning of the vowel [I] in the last syllable "-rin" of "Malerin".
  • The combinations of identification and reaction time courses (see below) point to a categorical change in the perception from medial to late peak for the stimulus series of the basic condition VCff and to a gradual change in the stimulus series of the less dynamic condition VCint.
Results for VCff and VCint: identification course in black, reaction time course in grey.



Figure 8: Stimulus series aiming at perceptual change: medial to late (VC); left: most dynamic basic condition (VCff); right: condition with slow rising-falling F0 peak (VCss)

  • Similar intensity courses in the stimulus series of both conditions
  • The basic condition (VCff, left) shows a fast rising-falling F0 peak with slope durations of about 130ms.
  • The less dynamic condition (VCss, right) is marked by a slow rising-falling F0 peak with doubled slope durations, i.e. rise and fall take between 260ms and 300ms.
  • The combinations of identification and reaction time courses (see below) point to a categorical change in the perception from medial to late peak for the stimulus series of the basic condition VCff and to a gradual change in the stimulus series of the less dynamic condition VCss.
Results for VCff and VCss: identification course in black, reaction time course in grey.


References

  • Kohler, K.J. (1987). Categorical pitch perception. Proceedings of the 11th ICPhS, Tallinn, Estonia, 331-333.
  • Kohler, K.J. (1991a). Prosody in speech synthesis: the interplay between basic research and TTS application. Journal of Phonetics 19, 121-138.
  • Kohler, K.J. (1991b). Terminal intonation patterns in single-accent utterances in German: phonetics, phonology and semantics. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung 25, 117-185.
  • Kohler, K.J. (2005). Timing and communicative functions of pitch contours. Phonetica 62, 88-105.
  • Niebuhr, O. (2003). Perceptual study of timing variables in F0 peaks. Proceedings of the 15th ICPhS, Barcelona, Spain, 1225-1228.
  • Niebuhr, O. (2006a). The role of the accented-vowel onset in the perception of German early and medial peaks. Proceedings of the 3rd international conference of speech prosody, Dresden, Germany, 109-112.
  • Niebuhr, O. (2006b). Perzeption und kognitive Verarbeitung der Sprechmelodie. Theoretische Grundlagen und empirische Untersuchungen. PhD thesis, University of Kiel, Germany.

Last updated: 06.04.2007
© O. Niebuhr,
on@ipds.uni-kiel.de, Phone: 0431-667-49-29