Alignment Correction Guide

Lucas Annear; Henry Nomeland; Tristan Mahr

1 Introduction

1.1 Forced alignment and textgrids

In the WISC Lab we have audio recordings of children with both typical speech development and speech and motor impairments due to Cerebral Palsy. In these recordings the child is repeating words and phrases from the Test of Children’s Speech (TOCS, Hodge & Daniels, 2007). Examples include phrases like cowboy boots and put all the toys away. Recordings are used to calculate articulation rate, speech sound accuracy, and measures of how intelligible a child is. Figure 1 shows what it looks like when we open a recording of the sentence the sign says keep out in Praat.

Figure 1: Recording of “the sign says keep out” in Praat. This screenshot is taken from a Praat editor window. In this screenshot and all others, we have removed parts of the surrounding Praat interface.

We can see the waveform (top) and the spectrogram (bottom), but what if we want to keep track of how long different words and sounds are?

Marking the start and end points of sounds and words in a recording is useful for a variety of research purposes. However, we can’t annotate sound files directly, so we need to create a separate file that stores all of the locations of events like the start and end of a word or sound. Such a companion file is called a textgrid. This file is designed to be paired with the audio file, and it contains boundaries and labels indicating the locations of the words and phonemes in the audio file.

The tool that we use to create a separate document containing annotations and labels for different words and sounds is called a forced aligner. A forced aligner takes an audio file (typically a .wav file) and a transcription of what was said in the audio file (usually a .txt or .lab file) and uses speech recognition technology to create a textgrid.

A textgrid file is associated with a given audio file and has moveable boundaries and labels to note the occurrence of certain events (like the beginning and end of a vowel) that may be useful for research. Figure 2 shows the phrase from Figure 1 when opened with the textgrid produced by the Montreal Forced Aligner. The textgrids we create for the lab contain separate tiers for words, sounds, and the sentence containing them. Each tier has intervals with vertical boundaries separating words and sounds. These files (the .TextGrid and .wav file) can be used for automated retrieval of acoustic data such as the duration of a word or sound, formant frequencies of vowels, and other variables of interest.
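As a rough illustration of the structure described above, a textgrid can be pictured as a set of named tiers, each holding labeled intervals with a start and an end time. The Python sketch below is not how Praat or MFA store textgrids. The `Interval` class, the `durations_ms` helper, and all time values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    xmin: float  # start time in seconds
    xmax: float  # end time in seconds
    text: str    # label; an empty string marks silence

    @property
    def duration(self) -> float:
        return self.xmax - self.xmin


# Hypothetical fragment of a textgrid for "the sign ..."; all times are invented.
textgrid = {
    "words": [
        Interval(0.00, 0.25, ""),
        Interval(0.25, 0.37, "the"),
        Interval(0.37, 0.80, "sign"),
    ],
    "phones": [
        Interval(0.00, 0.25, ""),
        Interval(0.25, 0.30, "DH"),
        Interval(0.30, 0.37, "AH0"),
        Interval(0.37, 0.50, "S"),
        Interval(0.50, 0.70, "AY1"),
        Interval(0.70, 0.80, "N"),
    ],
}


def durations_ms(tier):
    """Return (label, duration in whole milliseconds) for each labeled interval."""
    return [(iv.text, round(iv.duration * 1000)) for iv in tier if iv.text]


for label, ms in durations_ms(textgrid["words"]):
    print(label, ms)
```

Because every interval carries its own start and end time, a measurement such as word duration is a simple subtraction once the boundaries have been corrected.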

Figure 2: “The sign says keep out” when the recording is opened together with a fully annotated textgrid.

We can open these textgrids along with the audio file in a program called Praat. When the textgrids and audio files are paired together in Praat, it is now easier and quicker for us to make measurements because we have labels for each word and sound, as well as the time intervals during which these events occur.

1.2 Hand-correction

Forced aligners are far from perfect. When a forced aligner creates a textgrid, the initial placement of boundaries for words and sounds may not be accurate. During the hand-correction process, researchers manually adjust the boundaries of textgrids that have been automatically created using the Montreal Forced Aligner (MFA). When MFA pairs a transcription with an audio file, it often produces fairly accurate boundary placements. In these cases the boundaries may only need some slight shifting, or even no adjustment at all. Other times, the boundaries are placed at the wrong times, or even when an entirely different word is being produced. These boundaries require hand-correction.

The goal of the researcher performing hand-correction is to ensure proper boundary placement. This guide is intended to help visually identify alignment issues by looking at spectrograms and waveforms, and is designed to be a guide for decision-making when boundary placement is difficult to determine.

Figure 3 shows a textgrid prior to hand correction, the same textgrid but with black and gold squares overlaid on the image to show where the word boundaries actually are, and the textgrid after hand-correction.

Figure 3: Force-aligned textgrid of *cowboy boots* before and after hand correction.

1.3 Phonetic alphabets

This guide makes use of two different phonetic alphabets. The first is the International Phonetic Alphabet (IPA) and will be notated using conventional forward slashes as in /t/ or /u/. The IPA includes many non-English characters or diacritics, which can cause headaches for computing systems. A more computing-friendly alphabet is ARPABET, which uses ASCII characters, specifically all-caps English letters. We use a particular flavor of ARPABET called CMUBET. It is the alphabet used by the CMU pronunciation dictionary and used in the MFA alignments. CMUBET is the ARPABET but with numbers added onto vowels to indicate stress. We note here that the alignments performed in the WISC Lab and any alignments using the CMU dictionary are based on North American pronunciations. Table 1 and Table 2 show vowels and consonants in North American English in the two alphabets.

Table 1: Vowels in CMUBET (monospaced) and IPA (plain text)

Location	Front		Central		Back
Location	lax	tense	lax	tense	lax	tense
Close	`IH` ɪ	`IY` i			`UH` ʊ	`UW` u
Mid	`EH` ɛ	`EY` eɪ	`AH` ə, ʌ			`OW` oʊ
Open	`AE` æ			`AA` ɑ		`AO` ɔ
Diphthongs	`AW` aʊ `AY` aɪ `OY` ɔɪ
R-colored	`ER` ɝ, ɚ

Table 2: Consonants in CMUBET (monospaced) and IPA (plain text)

Manner	Bilabial	Labiodental	Dental	Alveolar	Postalveolar	Palatal	Velar	Glottal
Plosive	`P` p `B` b			`T` t `D` d			`K` k `G` g
Affricate					`CH` tʃ `JH` dʒ
Nasal	`M` m			`N` n			`NG` ŋ
Fricative		`F` f `V` v	`TH` θ `DH` ð	`S` s `Z` z	`SH` ʃ `ZH` ʒ			`HH` h
Approximant	`W` w				`R` r	`Y` j
Lateral				`L` l
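The two tables can be read as a lookup from CMUBET labels to IPA symbols. The Python sketch below copies the table entries into a dictionary and strips the stress digit from vowel labels before the lookup. The names `CMUBET_TO_IPA`, `split_stress`, and `to_ipa` are made up for this example, and the /h/ sound is keyed as `HH` on the assumption that this is the label that appears in the aligned textgrids.

```python
CMUBET_TO_IPA = {
    # Vowels (Table 1); entries with two IPA symbols are kept as listed.
    "IH": "ɪ", "IY": "i", "UH": "ʊ", "UW": "u",
    "EH": "ɛ", "EY": "eɪ", "AH": "ə, ʌ", "OW": "oʊ",
    "AE": "æ", "AA": "ɑ", "AO": "ɔ",
    "AW": "aʊ", "AY": "aɪ", "OY": "ɔɪ", "ER": "ɝ, ɚ",
    # Consonants (Table 2)
    "P": "p", "B": "b", "T": "t", "D": "d", "K": "k", "G": "g",
    "CH": "tʃ", "JH": "dʒ",
    "M": "m", "N": "n", "NG": "ŋ",
    "F": "f", "V": "v", "TH": "θ", "DH": "ð",
    "S": "s", "Z": "z", "SH": "ʃ", "ZH": "ʒ", "HH": "h",
    "W": "w", "R": "r", "Y": "j", "L": "l",
}


def split_stress(label):
    """Split 'OY1' into ('OY', 1); labels without a digit return (label, None)."""
    if label and label[-1].isdigit():
        return label[:-1], int(label[-1])
    return label, None


def to_ipa(label):
    """Look up the IPA symbol(s) for a CMUBET label, ignoring the stress digit."""
    base, _stress = split_stress(label)
    return CMUBET_TO_IPA[base]


print([to_ipa(p) for p in "N Y UW1".split()])
```

Keeping the stress digit separate from the base label means that labels such as `AH0` and `AH1` share one table row, which mirrors how Table 1 lists them.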

1.4 Montreal Forced Aligner

This guide assumes that you have access to audio files and corresponding textgrids which have been aligned using MFA. If you do not have access to such files or have audio files but do not know how to force align them, it would be best to learn how to do so before continuing any further into the guide. Eleanor Chodroff has published an excellent tutorial on how to use MFA which can be accessed here. The tutorial is part of a larger guide on corpus phonetics which includes theoretical explanations of how forced alignment works.

2 Overarching principles

While the remainder of this guide shows specific examples of where boundaries should be placed in certain circumstances, there are a few overarching principles to keep in mind. These should be used in conjunction with the remainder of the guide to make more efficient decisions regarding boundary placement.

What counts as speech?
We’re counting as speech any sound generated by the articulators that carries information about the sounds produced in the utterance.
When should a boundary be moved?
Boundaries should remain where they were placed by the forced aligner unless you have visual or auditory evidence that a given boundary does not line up with the beginning or end of a sound. Minimizing unnecessary movements promotes consistency across alignments.

3 Using Praat

3.1 Download Praat

If you don’t already have Praat downloaded on your computer, it can be found here.

The download gives you a .zip file, which you can extract to your location of choice (Desktop, Program Files, etc.).

3.2 Set Praat as the default

On Windows machines: right click on any .praat file, and select “Properties”. Under Opens with, select Change and select Praat as the program that automatically opens .praat files (navigate to wherever Praat is saved on your computer).

3.3 Praat shortcuts

Zoom in – ctrl + I
Zoom out – ctrl + O
Zoom to selection – ctrl + N
Select next interval/tier – alt + arrow keys
Remove boundary – click on or after a boundary, and type alt + backspace (do this for each tier that a boundary needs to be deleted on)

4 Hand-correction

4.1 Placement of boundaries

We place boundaries at the beginnings and ends of events. For example, we place a boundary at the beginning and at the end of a sound. In practice, this boundary is often just ever so slightly before the event that we’re marking and ever so slightly after that event. For two continuous events, the end of one sound is normally the beginning of the next sound unless there is a clear pause.
The remainder of this document is a guide to boundary placement for different types of sounds, and for different positions in words and utterances. Use the sidebar and search feature to navigate to relevant sections.
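The idea that the end of one sound is the beginning of the next, with pauses as the only exception, can be checked mechanically. The Python sketch below assumes a tier is a list of `(start, end, label)` tuples with an empty label for silence. The helper names and the time values are invented for illustration.

```python
def is_contiguous(tier, tolerance=1e-6):
    """True if every interval starts where the previous one ends."""
    return all(abs(prev[1] - nxt[0]) <= tolerance
               for prev, nxt in zip(tier, tier[1:]))


def internal_pauses(tier):
    """Return the unlabeled intervals that sit between two labeled ones."""
    labeled = [i for i, (_, _, text) in enumerate(tier) if text]
    if not labeled:
        return []
    first, last = labeled[0], labeled[-1]
    return [iv for iv in tier[first:last + 1] if not iv[2]]


# Hypothetical phone tier for "cowboy boots" with a pause between the words.
phones = [
    (0.00, 0.20, ""),
    (0.20, 0.28, "K"), (0.28, 0.45, "AW1"),
    (0.45, 0.52, "B"), (0.52, 0.70, "OY2"),
    (0.70, 0.82, ""),  # clear pause between the two words
    (0.82, 0.88, "B"), (0.88, 1.02, "UW1"),
    (1.02, 1.08, "T"), (1.08, 1.25, "S"),
    (1.25, 1.50, ""),
]

print(is_contiguous(phones))
print(internal_pauses(phones))
```

In this representation a pause is simply an interval with an empty label, so neighboring sounds never leave an unaccounted stretch of time between them.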

4.2 How sounds appear on the spectrogram

4.2.1 Vowels

Darker energy - Vowels appear as dark regions of energy with visible pulses as a result of vocal fold vibration.

Formants - You will generally see line-like regions called formants. Formants are frequency regions that are emphasized by a given vocal tract configuration, and differ from vowel to vowel. The screenshot below shows the word “toys,” with the vowel OY1 highlighted. Note the formant structure and how one of the formants rises as the diphthong changes from /o/ to /i/.

Figure 4: Vowels appear dark on the spectrogram with distinct darker bands called formants.

4.2.2 Liquids

The lateral approximant L often looks similar to vowels, with the highest energy concentrated in lower frequencies. In Figure 5, L contrasts against the IY1 and AE1 vowels on either side.

Figure 5: L sometimes appears vowel-like as featured, and sometimes has a grey, “hollow” look, similar to a nasal.

Like L, the rhotic R looks much like a vowel, but a clearly articulated R will almost always have a third formant that dips down to 3,000 Hz or lower. Notice the transition that the third band makes from AO1 to R in “or” below.

Figure 6: R in or. Appears vowel-like but notice the third formant dropping down adjacent to the second formant.

4.2.3 Nasals

The nasals M, N, and NG will show patterns similar to vowels and liquids, but will often have a “hollow” look compared to a vowel, as in the highlighted portion below in the word “make.”

Figure 7: Nasals have similar energy to vowels, but will appear sparse and grey on the spectrogram.

4.2.4 Fricatives

Fricatives have noise that is generally distributed throughout the frequency spectrum. S, SH, and Z are typically longer and with very apparent noise in higher frequency ranges. TH, DH, F and V are often less pronounced, but if clearly articulated will have visible noise in the spectrogram. They are also generally shorter than S and SH. DH sometimes appears similar to D.

Figure 8: Fricatives are generally long sounds that appear as noise in the spectrogram.

4.2.5 Stops

The stop sounds P, T, K, B, D, and G will often be characterized by a thin dark line across the frequency range. This is from the release of the consonant (the “burst”). Voiceless stops in English will have a longer period of noisy energy after the burst, and for voiced consonants, the vowel will usually start immediately after the burst.

Figure 9: Stop consonants articulated as such will have a burst which appears as a brief dark band spanning the frequency range. P, T, and K in the onset position of most words will have aspiration that follows.

5 Beginning-of-utterance issues

5.1 Initial consonants

5.1.1 Stop consonants and affricates (P, T, K, B, D, G, CH, and JH)

Voiceless stop consonants. In English, voiceless stops have a longer period of noise following release of the burst than voiced consonants do.

Canonical boundary placement: boundary is placed immediately before the release of the burst for the consonant.

Figure 10: Initial stop consonant boundary placement

Voiced stop consonants – Voiced stop consonants will have a short burst release and the vowel will start very shortly after the release of the consonant. Sometimes, voicing may start before the release of the consonant, which is called prevoicing. Figure 11 shows a voiced consonant with prevoicing before the release, and Figure 12 shows a voiced consonant without prevoicing.

Figure 11: Voiced stop that has prevoicing leading up to the burst release.

Figure 12: Voiced initial stop with no prevoicing (this is typical).

5.1.2 Fricatives (S, Z, HH, SH, ZH, F, V, TH, and DH)

Canonical boundary placement – boundary is placed immediately before the first sign of frication noise for the fricative.

Note that DH most often looks similar to a stop consonant.

Figure 13: Initial fricative boundary placement.

5.1.3 Nasals (M, N, and NG)

Canonical boundary placement – boundary is placed immediately before the onset of phonation/nasalization.

Figure 14: Initial nasal boundary placement.

After zooming in, there appears to be noise related to the beginning of nasalization, but because this could not be confirmed by listening with headphones, the boundary was placed where the voicing bar of phonation begins in the lower frequency region of the spectrogram.

5.1.4 Liquids (L and R)

L - Lateral Approximant

Canonical boundary placement – boundary placed at the onset of the segment. This may be clear phonation and intensity seen as nearly black on the spectrogram, or in the case of Figure 15, some formants started to be present as part of the articulation of L, so the boundary was placed at the onset of this noise and formant structure.

Figure 15: Initial L boundary placement.

R - Rhotic

Canonical boundary placement – Place boundary at the onset of articulation-related noise in the signal. In this case, some articulation of R preceded phonation.

Figure 16: Initial R boundary placement.

5.1.5 Potential issues

5.1.5.1 HH initial boundary is missing

Q: What to do if HH is missing the initial boundary?
A: You’ll need to place a boundary at the beginning of the sound (Boundary > Add on Tier 1, Add on Tier 2).

Figure 17: HH is missing the initial boundary.

Placing the boundaries will put the phone and word-level text to the left of the new boundaries:

Figure 18: With boundaries placed but before text has been moved.

Cut the text from the text field near the top of the window and paste it into the appropriate interval.

Figure 19: Cutting text from the interval.

With text moved:

Figure 20: After text has been moved to appropriate intervals.
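The two steps above (add a boundary, then move the text) can be sketched in Python. This is only a model of the behavior described in this section, where the label stays to the left of a newly added boundary. It is not Praat code, and the function names and time values are invented.

```python
def add_boundary(tier, time):
    """Split the interval containing `time`; the label stays in the left part."""
    result = []
    for xmin, xmax, text in tier:
        if xmin < time < xmax:
            result.append((xmin, time, text))
            result.append((time, xmax, ""))
        else:
            result.append((xmin, xmax, text))
    return result


def move_label_right(tier, index):
    """Move the label of interval `index` into the interval after it."""
    tier = list(tier)
    xmin, xmax, text = tier[index]
    next_xmin, next_xmax, _ = tier[index + 1]
    tier[index] = (xmin, xmax, "")
    tier[index + 1] = (next_xmin, next_xmax, text)
    return tier


# Hypothetical phone tier: HH has no initial boundary, so its interval
# runs from the start of the file.
phones = [(0.00, 0.40, "HH"), (0.40, 0.55, "AH1")]

with_boundary = add_boundary(phones, 0.25)    # label is now left of the boundary
corrected = move_label_right(with_boundary, 0)
print(with_boundary)
print(corrected)
```

The same two steps apply to the word tier, which is why the menu instruction adds a boundary on both Tier 1 and Tier 2.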

5.1.5.2 Voiceless and breathy beginnings

Q: When there is an initial H-like beginning to a sound do we include this as part of the first segment?
A: Yes, especially with liquids and nasals, the initial H-like element typically carries phonetic information from the initial segment (be it L, R, or N).

Figure 21: Boundary placement for breathy HH-like beginning of L.

Figure 22: Boundary placement for breathy HH-like beginning of N.

5.1.5.3 New is transcribed as N Y UW1

American English dialects typically do not pronounce new as N Y UW1 /nju/. The Y segment can be deleted (click in the interval and press backspace), as well as the boundary between Y and UW1 (click on the boundary and press alt + backspace).

Figure 23: Example where the Y in the word *new* should be deleted.
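The effect of deleting the Y label and then removing the boundary between Y and UW1 is that UW1 absorbs the time span that Y used to occupy. A minimal Python sketch of that outcome, assuming a tier is a list of `(start, end, label)` tuples with invented times and a made-up helper name:

```python
def drop_segment(tier, index):
    """Remove interval `index` and let the following interval absorb its time span."""
    xmin, _, _ = tier[index]
    _, next_xmax, next_text = tier[index + 1]
    return tier[:index] + [(xmin, next_xmax, next_text)] + tier[index + 2:]


# Hypothetical alignment of "new" as N Y UW1.
phones = [(0.10, 0.18, "N"), (0.18, 0.23, "Y"), (0.23, 0.40, "UW1")]

print(drop_segment(phones, 1))
```

The remaining boundary between N and UW1 may still need to be checked by eye and ear, since it was originally placed as the boundary between N and Y.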

5.2 Word-initial vowels and glides

5.2.1 Word-initial vowels

Canonical boundary placement – boundary is placed at the onset of phonation/laryngeal activity related to the beginning of the vowel

Figure 24: Initial vowel boundary placement.

5.2.2 Word-initial glides – W and Y

Canonical placement – boundary placed at the onset of phonation or articulation (onset of glide may be something like a voiceless vowel).

Figure 25: Initial glide boundary placement.

5.2.3 Potential Issues with initial vowels and glides

5.2.3.1 “Glottal pop” at the beginning of a vowel

Q: When “glottal pops” begin vowels, do we count this as part of the vowel?
A: Yes. It’s the beginning of the production.

Figure 26: Initial vowel boundary placement when vowel begins with “glottal pop.”

5.2.3.2 Voiceless beginning of glides

Q: Do we include voiceless /w/ leading in to voiced portion of /w/?
A: Yes, for reasons listed above.

Figure 27: Voiceless beginning of glide W, boundary placement.

6 Within-utterance issues

This section focuses on transitions from one sound to another within a word and within utterances.

6.1 Consonant-to-vowel transitions within a word

Canonical placement – boundary placed at the clearest vertical onset of formant structure.

Figure 28: Consonant to vowel transition boundary placement.

6.2 Consonant-to-vowel transitions across words

6.2.1 Consonant to vowel transition with a gap between words

Q: Where to place boundaries when there is a visible gap between a final consonant of one word and an initial vowel of the following word?
A: If there is a visible gap with no audible vowel sound, there should be a pause between the final consonant of the first word and the initial vowel of the following word, as in Figure 29 below.

Figure 29: Small gap between final consonant in *jump* and initial vowel in *over*.

6.2.2 Consonant-to-vowel boundaries with no pause

Q: There is sort of a pause between the final consonant of a word and the initial vowel of the following word, but I can hear the vowel starting, just not fully going yet. Should there be a pause?
A: No, if you can hear the vowel starting to go right after the consonant, even if the vowel isn’t fully going yet, count this as part of the vowel. See Figure 30 below.

Figure 30: What appeared to be a pause between D and AO1 actually has audible/visible information as the vowel is starting. In this case, the beginning of the vowel should be the beginning of this information.

6.2.3 Potential issues

6.2.3.1 When there’s a “notch” of voicing preceding formant structure and phonation

Q: Do we include the little “notch” of voicing as the beginning of the vowel, or start at formant structure? MFA wants to include the notch.
A: We are not going to include the notch. Start the vowel where the formants are visible.

Figure 31: Boundary placement when there is “notch” of voice bar that precedes formant structure in a vowel.

6.2.3.2 Vowel starts before phonation

Q: I can hear vowel starting (in OW1 below) before formants actually start. Should there be a pause?
A: Yes, include a pause. Anything before phonation and formant structure should not be a part of the vowel interval. The figures below show how MFA sometimes aligns these vowels but this pre-phonation area should be a part of the preceding pause.

Figure 32: Vowel starts before phonation.

6.2.3.3 Voiceless vowels

Q: When the first vowel in a word such as potato is voiceless, what do we count as vowel?
A: Look for the part that is most similar to a vowel that doesn’t seem to be aspiration.

6.3 Consonant-to-consonant transitions

6.3.1 Fricative to stop/affricate transitions

Q: Where to put the boundary when there is a transition from a fricative to a stop or affricate?
A: The boundary between a fricative and a stop should be placed at the end of the fricative. There may be a small “silent” period before the stop, which is the closure duration of the stop/affricate, and not actually a pause.

Figure 34: Boundary placement in Fricative > Stop/Affricate transitions. Here between S of “this” and CH of “cheese.”

6.3.2 Fricative to fricative transitions

Q: Where to put the boundary between two fricatives?
A: Look for a shift in the appearance of the noise for the two fricatives. The example below shows the transition from Z in “is” to SH in “showing.”

Figure 35: Fricative to fricative transition. Here there is some noise across the frequency range before, or perhaps as part of the transition to SH, which shows a different noise pattern. This lets us see the shift from Z to SH.

6.3.3 Stop to fricative transitions

Q: Where do we place the boundary between consecutive stops and fricatives (e.g. the boundary between T and S in “cowboy boots”)?
A: The fricative will almost always start immediately after the burst release of the preceding stop, so the boundary belongs directly after the burst.

Figure 36: Stop to fricative boundary in *boots*.

6.3.4 When to start HH in consonant|HH transitions

Q: When do we begin the HH when transitioning from a stop (e.g. D below)?
A: Look for the transition from burst to HH. HH usually starts to have more formant structure.

Figure 37: When to start HH in consonant-to-consonant transition.

6.3.5 Stop to stop transitions (across word boundaries)

Q: Where to place boundaries between stop consonants when there is a word boundary?
A: If the first stop is released, place the boundary after the release of the first stop and at the beginning of the closure for the second stop (see figures below).

In context:

Figure 38: Boundary between G and D is placed after release of G and when amplitude in signal reduces as closure for D begins.

Zoomed in:

Figure 39: Cursor is at boundary between G and D.

6.3.6 Unreleased consonant-to-consonant transition

Q: Where do we place the boundary if final consonants are unreleased like in the G of “hug daddy?”
A: If the boundary for the first consonant is placed at least 50ms before the burst of the second consonant, the boundary placement is okay. In the example below, the boundary was placed less than 50ms before the second consonant, so the boundary was moved to 50ms before the second consonant.

Pre-correction:

Post-correction:

Figure 41: Boundary for unreleased consonant was moved to 50ms preceding the second consonant.

This boundary between G and D would remain where it is because MFA already placed it more than 50ms prior to the release of D:

Figure 42: Consonant-to-consonant, unreleased, no correction required.

Q: Where do we place the boundary when the gap between two consonants doesn’t allow for 50ms before the second consonant?
A: Place the boundary for the end of the first consonant/beginning of the second consonant in the most reasonable place given what you can see/hear.

The boundary between D and DH in Figure 43 below is not placed 50ms before the release of DH, but it’s placed after a relatively clear ending of D, and the audio supported the placement of this boundary.

Figure 43: Boundary placement between D and DH when less than 50ms available before DH.
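The 50ms guideline for unreleased consonants amounts to a small decision rule, sketched below in Python with times in whole milliseconds. The function name, the arguments, and the example times are invented. Using the start of the first consonant to decide whether the gap "allows for 50ms" is an assumption made for this sketch, not something the guideline states.

```python
MIN_GAP_MS = 50  # the 50ms figure from the guideline above


def check_unreleased_boundary(boundary_ms, burst_ms, first_onset_ms):
    """Return (boundary_ms, needs_judgment) for a C1|C2 boundary.

    boundary_ms:    where the aligner put the boundary between the consonants
    burst_ms:       burst release of the second consonant
    first_onset_ms: start of the first consonant (assumed limit for moving left)
    """
    target = burst_ms - MIN_GAP_MS
    if boundary_ms <= target:
        return boundary_ms, False   # already at least 50ms before the burst
    if target > first_onset_ms:
        return target, False        # move to exactly 50ms before the burst
    return boundary_ms, True        # too little room: decide by eye and ear


print(check_unreleased_boundary(1230, 1260, 1100))  # too close, gets moved
print(check_unreleased_boundary(1180, 1260, 1100))  # far enough, left alone
print(check_unreleased_boundary(1245, 1260, 1230))  # no room for 50ms
```

The third case returns the original boundary with a flag rather than a corrected time, because the guideline leaves that placement to what can be seen and heard in the recording.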

6.3.7 Pause between words, when do we start the post-pausal consonant?

Q: When there is a pause between words, are we doing a bit of space before a within-sentence onset consonant?
A: Place the boundary 50ms before beginning of the post-pausal consonant (cf. Trouvain & Werner, 2022).

Figure 44: Within utterance pause before a stop consonant. Initial, post-pausal boundary for consonant should be placed 50ms before release of stop consonant.

6.4 Vowel-to-vowel boundaries

6.4.1 Where to place boundary in continuous V|V transitions

Q: Where should we place the boundary between vowels when there is no pause?
A: Using the visual of the spectrogram as well as what you can hear, look for the border between the two vowels. See “she is…” below, where there is no break between the words and the vowels are sequential and continuous.

Figure 45: Vowel to vowel transition with no pause.

Zoomed in:

Figure 46: The vertical cursor line shows the transition from IY1 in “she” to IH0 in “is.” Note how the second formant lowers going from IY1 to IH0 (you can see where the dotted red lines cross on the spectrogram).

6.4.2 Where to place boundary in V|V transitions with a pause between words

Q: Should we place a pause between V# #V word boundaries when there is a brief pause and no visible/audible information?
A: These should have a pause between them since vowels do not have the same articulatory closure periods that consonants have. Final boundary of the preceding vowel should end at the end of information from that vowel, and initial boundary of the following initial vowel should start at the beginning of audible/visible information related to that vowel.

6.5 Vowel-to-voiceless consonant transitions

Canonical placement – end-of-vowel boundary placed where the formant structure and phonation cease. This sometimes precedes the burst of the consonant by what appears to be significant amounts. See below for instances when there is pre-aspiration adjacent to the vowel and preceding the consonant.

What this looks like in context:

Figure 48: Transition from a vowel to a following voiceless consonant.

Zoomed in, one can see a bit of residual phonation in black at the bottom of the spectrogram that continues into the K, but because the formant structure and associated noise in the middle frequencies shuts off about here, this is where we put the boundary.

Figure 49: Zoomed in on end of vowel and transition into K.

6.5.1 Vowel-to-consonant with preaspiration

Q: In “coffee,” do we attribute the voiceless vowel portion to the vowel or to the consonant as part of transition to [f]? If we’re treating it like a stop, this would be consonant closure, but do we treat fricatives the same?
A: We are going to treat this as part of the consonant. Phonation is shutting off due to laryngeal status of following consonant.

Note that this makes boundary placement in vowel > consonant transitions completely analogous to many transitions from voiceless consonants to vowels (e.g. when there is heavy aspiration on a sound like K, but we start the vowel at onset of phonation for the vowel, even though the vocal tract may already be positioned for the vowel during the aspiration).

Figure 50: Here the phonation of the vowel ends suddenly as the glottis positions for F, resulting in H-like pre-aspiration leading into the F.

6.6 Vowel to voiced consonant transitions

Canonical placement – boundary placed at the end of the vowel where you will typically see a sudden reduction in energy in the spectrogram (goes from near-black to lighter grey) as well as a shift in formant structure.

What this looks like in context:

Figure 51: Vowel to voiced consonant transition in context.

Zoomed in:

There can be varying degrees of voicing between when the vowel ends and the consonant is released. The first, canonical example shows voicing that continues entirely through the consonant closure. The following example shows partial voicing through consonant closure.

In context:

Figure 53: Contextual view of a coda consonant closure with only partial voicing during the closure.

Zoomed in:

Figure 54: Closer view of partial voicing during consonant closure.

6.6.1 Potential issues with vowel-to-voiced consonant transitions

6.6.1.1 Where to put boundary when there is a gap in phonation between a vowel and a consonant.

Q: When there’s a gap between end of vowel and beginning of fricative?
A: Count the gap towards the fricative, and place the end-of-vowel boundary where the vowel itself ends.

Figure 55: Vowel to fricative transition, voicing stops before articulation of consonant.

6.6.1.2 When to end a vowel when voicing is inconsistent

Q: When voicing isn’t consistently modal, how do we decide when a vowel actually ends? (in other instances, we would end the vowel with voicing).
A: Look for end of formant structure. There should still be formant structure even without phonation.

Figure 56: Vowel to consonant transition, inconsistent phonation of the vowel.

6.7 Pauses

6.7.1 Pauses from consonant to vowel across word boundaries

Sometimes transitions from consonants to vowels across word boundaries are continuous, and there is no break between words. At other times there is a visible break with no audible or visible information between the consonant and the vowel. There should be a break between words in these instances:

Figure 57: Break between consonant and vowel across word boundaries.

6.7.2 Pauses from vowel to vowel across word boundaries

Similarly, sometimes there is a break between two vowels that are on either side of a word boundary. If there is a break with no audible or visible information between the vowels, this break should be reflected with a break between words and phones in the textgrid.

6.7.3 Speaker holds a pause and phonation keeps going

Q: When a speaker is holding a pause with voice going, is this counted as “speech?”
A: Seems like inserting a pause here might be best, as it’s not just prolonged coarticulation between two segments.

7 End-of-utterance issues

7.1 Final-consonant to end-of-utterance transition

7.1.1 Stops at end of utterances

Canonical placement – boundary placed at the end of noise associated with the release of the consonant (as long as this noise isn’t exhalation! See “Potential Issues” below).

In context:

Zoomed in:

Figure 61: End of utterance consonant, zoomed.

7.1.2 Fricatives at the ends of utterances

Canonical placement – boundary placed at the end of the noise associated with articulation of the fricative (be sure to listen to make sure that the noise doesn’t include exhalation).

In context:

Zoomed in:

7.1.3 Nasals at ends of utterances

Canonical placement – boundary placed at the end of phonation and articulation associated with the nasal consonant (see “Potential Issues” below for when the speaker starts exhaling while the nasal is still being articulated).

In context:

Zoomed in:

7.1.4 Liquids at ends of utterances

Canonical placement in context:

Figure 66: Boundary placement for liquids at the ends of words.

Zoomed in:

7.1.5 Final consonant potential issues

7.1.5.1 Unreleased consonants

Q: Where to place the boundaries in final stop consonants when the consonant is unreleased?
A: Place the boundary for the beginning of the sound (e.g. T in OUT, below) at the end of formant structure of the vowel, and place the boundary for the end of the sound at the end of visible energy related to the consonant.

Figure 68: Boundary placement for utterance-final unreleased stops.

7.1.5.2 Boundary placement – audible exhale with consonant release

Q: What counts as speech – any sound information? E.g., the little bit of breath escaping at the end of a word?
A: We are not including anything after the burst release that is generally just exhale.
Q: What if the little bit at the end of the word is a new sound (e.g., “bee-ya”)?
A: Don’t count exhale “ya.”

Figure 69: Strong exhalation following final stop consonant - this additional noise should not be included as a part of the stop.

7.1.5.3 Where to place boundary when exhalation starts during articulation of a nasal

Q: When there is nasal exhalation after/during a final nasal, do we need to count the voiceless finish of a nasal?
A: Nasals are often finished this way at the end of an utterance. If the exhaley portion starts while still articulating the nasal, this counts as speech.

Once articulation ends and it is just exhale, then it doesn’t count.

Without articulation:

Figure 70: Post-nasal exhale, no articulation (exhale not included in boundary).

With articulation:

7.1.5.4 Where to place end-of-word boundary when ER0 is still articulated when exhale starts

Q: Should we include exhale on ER0, as in “together?”
A: If it is articulated, and not just exhale, it should count (similarly to nasals and vowels above). If the ER0 sounds complete and the rest is primarily exhale, don’t count it.

Figure 72: Exhale begins during articulation of ER0.

7.2 Final vowel to end-of-utterance transition

7.2.1 Typical end of vowel

Canonical placement – boundary placed at the clearest end of both formant structure and phonation.

In context:

Figure 73: End boundary placement for utterance-final vowel.

Zoomed in:

7.3 Final vowel potential issues

7.3.1 Boundary placement – phonation for vowel ends before end of word

Q: Do we include voiceless end of vowel as part of vowel?
A: As long as it is articulated, and not just exhale, this should be counted as vowel.

Figure 75: Include voiceless portions of vowel if it is still articulated as vowel and not simply exhale.

7.3.2 Where to place end-of-vowel boundary when there is exhale at end of word

Q: Is it right to cut exhalation off at the end of a word?
A: Yes. If it is just neutral vocal tract exhalation, this should not count.

Figure 76: Exhalation immediately following a sound should not count as part of that sound.

8 Segmental issues

This section focuses on issues regarding segment spacing and the number of segments in an utterance.

8.1 Blend coalescence

8.1.1 SM coalesces to F

Q: For coalescences: Do we treat one sound as “deleted” (e.g. “two fall pieces” in place of “two small pieces”)? How do we space them?
A: Let the two segments (S and M below) have a relatively equal amount of the segment.

Figure 77: Segment boundaries when segments are coalesced.
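As a starting point for giving two coalesced segments a relatively equal share, the interval can be divided at its midpoint. The Python sketch below shows that arithmetic. The function name and the time values are invented, and an exact 50/50 split is only one reading of "relatively equal."

```python
def split_evenly(xmin, xmax, labels):
    """Divide the span from xmin to xmax into equal intervals, one per label."""
    step = (xmax - xmin) / len(labels)
    return [(xmin + i * step, xmin + (i + 1) * step, label)
            for i, label in enumerate(labels)]


# Hypothetical case: "small" produced with a single F-like sound spanning
# 1.0 s to 1.5 s, which has to be shared by the S and M intervals.
print(split_evenly(1.0, 1.5, ["S", "M"]))
```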

8.2 Vocalization of L and Syllabic L

8.2.1 Final L is produced as OW1

Q: When there is /o/ in place of /l/ at the end of a word (e.g. “animo” in place of “animal”), what counts as AH0 and L?
A: In general it should all count as L. Make AH0 small, and L longer because the target production would primarily be L.

Figure 78: Vocalic L (pronounced as /o/) should have very short AH0 segment, L should make up nearly all of the sound.

8.2.2 Where to place L boundaries in AH0 L sequence when L is syllabic

Q: Can we justify AH0 L pronunciations over L?
A: We need to keep these due to the standard speech models used in forced aligners and assumed pronunciations. If it’s a syllabic L, just make AH0 shorter.

Figure 79: Syllabic L, if there is no vowel to be heard between M and L in *animal*, AH0 should be minimal and L should take up nearly the entire duration of the sound.

8.3 Syllabic R issues

8.3.1 Where to place boundaries and whether to keep segments when R is syllabic

Q: Should we change the pronunciation from “R” to “AA1|R” in cases when R is syllabic?
A: Just make AA1 very small.

9 References

Hodge, M. M., & Daniels, J. (2007). TOCS+ intelligibility measures [Computer software]. University of Alberta.

Trouvain, J., & Werner, R. (2022). A phonetic view on annotating speech pauses and pause-internal phonetic principles. Transkription und Annotation Gesprochener Sprache und Multimodaler Interaktion: Konzepte, Probleme, Lösungen, 64, 55-73.