Dominic W. Massaro
Joanna Light
University of California, Santa Cruz

The main goal of this study was to implement a computer-animated talking head, Baldi, as a language tutor for speech perception and production for individuals with hearing loss. Baldi can speak slowly; illustrate articulation by making the skin transparent to reveal the tongue, teeth, and palate; and show supplementary articulatory features, such as vibration of the neck to show voicing and turbulent airflow to show frication. Seven students with hearing loss between the ages of 8 and 13 were trained for 6 hours across 21 weeks on 8 categories of segments (4 voiced vs. voiceless distinctions, 3 consonant cluster distinctions, and 1 fricative vs. affricate distinction). Training included practice at the segment and the word level. Perception and production improved for each of the 7 children. Speech production also generalized to new words not included in the training lessons. Finally, speech production deteriorated somewhat after 6 weeks without training, indicating that the training method rather than some other experience was responsible for the improvement that was found.

KEY WORDS: visible speech, language learning, hearing loss, speech perception, speech production
In the United States, 1–2 infants per 1,000 have a moderate to severe hearing loss in both ears (U.S. Department of Health and Human Services, 2002). This loss often goes unnoticed for a considerable period of time. If untreated for too long, hearing loss can have severe effects on language learning. According to the Gallaudet Research Institute (1999–2000), 90% of children with hearing loss are born to parents with normal hearing. These parents are forced to decide what communication method they will choose for their child (oral, manual, or a combination of both). Although there is no consensus on the best medium through which children who are deaf should learn language, the communication method that parents choose for their child is one that should optimize language learning and quality of life. Independently of the communication method of choice, the amount and quality of language in and out of the classroom is the number one factor leading to communication and academic success (National Association of State Directors of Special Education, 1992).

Parents often choose to educate their children through the oral communication method. Although it is possible for some children who are profoundly deaf to develop excellent spoken language, many do not (Dodd, McIntosh, & Woodhouse, 1998).
Children with even moderate hearing loss are not exposed to the wealth of auditory input that is available to the hearing child (Sanders, 1988). Because of their degraded auditory language input, children with hearing loss learning oral language must depend on distorted speech and perhaps insufficiently informative mouth movements.

Correct perception and production of all phonemes in a language is essential for spoken language learning (Jusczyk, 1997). Results have indicated that the better a child with hearing loss can perceive spoken language, the better he/she can approximate the development of spoken language achieved by his/her counterparts with normal hearing (Svirsky, Robbins, Kirk, Pisoni, & Miyamoto, 2000). The better a child is at perceiving and understanding spoken words, the better he/she will be at producing spoken language (Levitt, McGarr, & Geffner, 1987).

Children with early onset deafness generally lag significantly behind their normally hearing peers in all areas involving speech: speech perception and production, oral language development, metaphonological abilities, and reading and spelling (Leybaert, Alegria, Hage, & Charlier, 1998). Listeners often have trouble understanding speakers who are deaf. In one study, inexperienced listeners could understand only about 20% of the speech output of deaf talkers (Gold, 1980). Whether intentional or not, the way one speaks can ultimately affect the way others perceive one (Scherer, 1986). This difficulty in oral communication may result in feelings of social isolation on the part of the deaf individual. Thus, deafness may vastly affect both the child's academic and vocational achievement.

As far back as Hudgins and Numbers (1942, as cited in Ling, 1976), researchers have primarily focused on pinpointing the speech segments that are most difficult for individuals with hearing loss to produce (e.g., Kirk, Pisoni, & Miyamoto, 1997). The most common articulation problems made by individuals with hearing loss are voiced–voiceless errors, omissions/distortions of initial consonants, omission of consonants in clusters, omissions/distortions of final consonants, nasalization, substitution of one consonant for another, and intrusive voicing between neighboring consonants.

Assistive technology is one means by which children experiencing communication difficulties can be helped. Along with the evolving technology already in use (e.g., hearing aids, cochlear implants), technological advancements can potentially provide individuals who are deaf with some of the help they need to perceive and speak more intelligibly. Because speech training is a labor-intensive task, requiring endless hours of one-on-one training between child and clinician, interactive technology may offer a promising and cost-effective means to improve the perception and production skills of speech-impaired individuals. Tailoring training lessons based on the specific needs of the student allows for child-centered instruction, increased time on task, speech training outside of the classroom and treatment setting, and ideally increased competence and confidence in perceiving and producing English speech segments.

Speech and language science evolved under the assumption that speech perception was a solely auditory event (Denes & Pinson, 1963). However, a burgeoning record of research findings reveals that our perception and understanding are influenced by a speaker's face and the accompanying visual information about gestures, as well as the actual sound of the speech (Dodd & Campbell, 1987; Massaro, 1987, 1998; McGurk & MacDonald, 1976). Perceivers expertly use these multiple sources of information to identify and interpret the language input. Information from the face is particularly effective when the auditory speech is degraded because of noise, limited bandwidth, or hearing loss. If only roughly half of a degraded auditory message is understood, for example, adding visible speech can allow comprehension to be almost perfect. The combination of auditory and visual speech has been called superadditive because their combination can lead to accuracy that is much greater than the sum of the accuracies on the two modalities presented alone (Massaro, 1998). Furthermore, the strong influence of visible speech is not limited to situations with degraded auditory input. A perceiver's recognition of an auditory–visual syllable reflects the contribution of both sound and sight. For example, if the nonsense auditory sentence, My bab pop me poo brive, is paired with the nonsense visible sentence, My gag kok me koo grive, the perceiver is likely to hear, My dad taught me to drive. Two sources of nonsense are combined to create a meaningful interpretation (Massaro, 1998; McGurk, 1981).

In addition to the information value of visible speech, there are several reasons why the use of auditory and visual information together is so successful, and why the two sources hold so much promise for language tutoring. These include the robustness of visual speech, the complementarity of auditory and visual speech, and the optimal integration of these two sources of information.

Speechreading, or the ability to obtain speech information from the face, depends somewhat on the talker, the perceiver, and the viewing conditions (Bernstein, Demorest, & Tucker, 2000; Massaro, 1998; Massaro & Cohen, 1999). Even so, empirical findings show that speechreading is fairly robust (Massaro, 1998). Research has shown that perceivers are fairly good at speechreading even when they are not looking directly at the talker's lips (Smeele, Massaro, Cohen, & Sittig, 1998). Furthermore, accuracy is not dramatically reduced when the facial image is blurred (because of poor vision, for example); when the face is viewed from above, below, or in profile.
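To make the superadditivity and complementarity claims above concrete, here is a minimal worked sketch (not from the article) in which audition reveals only voicing and nasality while vision reveals only place of articulation. The consonant set and the two groupings are illustrative assumptions, not data from the study.

```python
# Illustrative only: a toy demonstration that auditory and visual speech cues
# can be complementary, so that bimodal accuracy exceeds the sum of the
# unimodal accuracies. The groupings below are simplified assumptions.

CONSONANTS = ["b", "d", "g", "p", "t", "k", "m", "n"]

# Assumption: audition (e.g., in noise) reveals voicing/nasality but not place.
AUDITORY_GROUPS = {
    "voiced stop": {"b", "d", "g"},
    "voiceless stop": {"p", "t", "k"},
    "nasal": {"m", "n"},
}

# Assumption: vision reveals place of articulation (the viseme) but not voicing/nasality.
VISUAL_GROUPS = {
    "bilabial": {"b", "p", "m"},
    "alveolar": {"d", "t", "n"},
    "velar": {"g", "k"},
}

def candidates(segment: str, groups: dict[str, set[str]]) -> set[str]:
    """Return the consonants consistent with what one modality reveals."""
    return next(members for members in groups.values() if segment in members)

def expected_accuracy(groups_list: list[dict[str, set[str]]]) -> float:
    """Expected accuracy if the perceiver guesses uniformly among the
    consonants consistent with every available modality."""
    total = 0.0
    for segment in CONSONANTS:
        consistent = set(CONSONANTS)
        for groups in groups_list:
            consistent &= candidates(segment, groups)
        total += 1.0 / len(consistent)
    return total / len(CONSONANTS)

if __name__ == "__main__":
    auditory = expected_accuracy([AUDITORY_GROUPS])                 # 0.375
    visual = expected_accuracy([VISUAL_GROUPS])                     # 0.375
    bimodal = expected_accuracy([AUDITORY_GROUPS, VISUAL_GROUPS])   # 1.0
    print(f"auditory alone: {auditory:.2f}")
    print(f"visual alone:   {visual:.2f}")
    print(f"bimodal:        {bimodal:.2f} (sum of unimodal = {auditory + visual:.2f})")
```

Each modality alone leaves the perceiver guessing among two or three candidates, yet the intersection of the two candidate sets is unique, so the combined accuracy (1.0) exceeds the sum of the unimodal accuracies (0.75). Real identification data are noisier than this toy case, but this is the sense in which the two sources are complementary.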
The present approach uses a new noninvasive visual technique to train speech perception and production. We have developed, evaluated, and implemented a computer-animated talking head, Baldi, incorporated it into a general speech toolkit, and used this technology to develop interactive learning tools for language training for children with language challenges (Bosseler & Massaro, 2003; Massaro & Light, in press). The facial animation program controls a wireframe model, which is texture mapped with a skin surface. Realistic speech is obtained by animating the appropriate facial targets for each segment of speech along with the appropriate coarticulation. Baldi can be appropriately aligned with either synthetic or natural speech. Paralinguistic information (e.g., amplitude, pitch, and rate of speech) and emotion are also expressed during speaking (Massaro, Cohen, Tabain, Beskow, & Clark, in press).

Some of the distinctions in spoken language cannot be heard with degraded hearing, even when the hearing loss has been compensated by hearing aids or cochlear implants. To overcome this limitation, we use visible speech when providing our stimuli. Based on reading research (Torgesen et al., 1999), we expected that visible cues would allow for heightened awareness of the articulation of these segments and assist in the training process.

Although many of the subtle distinctions among segments are not visible on the outside of the face, the skin of our talking head can be made transparent so that the inside of the vocal tract is visible, or we can present a cutaway view of the head along the sagittal plane. Baldi has a tongue, hard palate, and three-dimensional teeth, and his internal articulatory movements have been trained with electropalatography and ultrasound data from natural speech (Cohen, Beskow, & Massaro, 1998). These internal structures can be used to pedagogically illustrate correct articulation. The goal is to instruct the child by revealing the appropriate movements of the tongue relative to the hard palate and teeth.

As an example, a unique view of Baldi's internal articulators can be presented by rotating the exposed head and vocal tract to be oriented away from the student. It is possible that this back-of-head view would be much more conducive to learning language production. The tongue in this view moves away from and toward the student in the same way as the student's own tongue would move. This correspondence between views of the target and the student's articulators might facilitate speech production learning. One analogy is the way one might use a map. We often orient the map in the direction we are headed to make it easier to follow (e.g., turning right on the map is equivalent to turning right in reality).

Another characteristic of the training is to provide additional cues for visible speech perception. Baldi can illustrate the articulatory movements, and he can be made even more informative by embellishment of the visible speech with added features. Several alternatives are obvious for distinguishing phonemes that have similar visible articulations, such as the difference between voiced and voiceless segments. For instance, showing visual indications of vocal cord vibration and turbulent airflow can be used to increase awareness of voiced versus voiceless distinctions. These embellished speech cues could make Baldi more informative than he normally is.

Students were trained to discriminate minimal pairs of words bimodally (simultaneous auditory and visual input) and were also trained to produce various speech segments by means of visual information about how the inside oral articulators work during speech production. The articulators were displayed from different vantage points so that the subtleties of articulation could be optimally visualized. The speech was also slowed down significantly to emphasize and elongate the target phonemes, allowing for clearer understanding of how the target segment is produced in isolation or with other segments.

During production training, different illustrations were used to train different distinctions. Although any given speech sound can be produced in a variety of ways, a prototypical production was always used. Supplementary visual indications of vocal cord vibration and turbulent airflow were used to distinguish the voiced from the voiceless cognates. The major differences in production of these sounds are the amount of turbulent airflow and vocal cord vibration that take place (e.g., voiced segments: vocal cord vibration with minimal turbulent airflow; voiceless segments: no vocal cord vibration with significant turbulent airflow). Although the internal views of the oral cavity were similar for these cognate pairs, they differed on the supplementary voicing features. For consonant clusters, we presented a view of the internal articulators during the production to illustrate the transition from one articulatory position to the next. Finally, both the visible internal articulation and supplementary voicing features were informative for fricative versus affricate training. An affricate is a stop followed by a (homorganic) fricative with the same contact, hold, and release phases (Ladefoged, 2001). The time course of articulation and how the air escaped the mouth differed (e.g., fricative: slow, consistent turbulent airflow; affricate: quick, abrupt turbulent airflow).

The production of speech segments was trained in both isolated segments and word contexts. Successful perceptual learning has been reported to depend on the presence of stimulus variability in the training materials (Kirk et al., 1997). In the present study, we varied
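As a rough illustration of the production-training manipulations described above (slowed speech plus supplementary voicing and airflow cues), the sketch below tags each segment of a training item with the visual cues it should trigger and stretches segment durations for a slower speaking rate. The data structures, field names, and durations are hypothetical; this is not the actual animation toolkit interface.

```python
from dataclasses import dataclass

# Hypothetical segment description; the real system drives Baldi's animation
# and is not reproduced here.
@dataclass
class Segment:
    phoneme: str
    duration_ms: float
    voiced: bool
    fricated: bool  # produces audible turbulent airflow (fricatives, affricates)

def supplementary_cues(seg: Segment) -> list[str]:
    """Choose the embellished visual cues described in the article:
    neck vibration for voiced segments, a visible air stream for turbulent airflow."""
    cues = []
    if seg.voiced:
        cues.append("neck_vibration")
    if seg.fricated or not seg.voiced:
        cues.append("air_stream")
    return cues

def slow_down(segments: list[Segment], rate_fraction: float) -> list[Segment]:
    """Stretch segment durations to a fraction of the normal speaking rate,
    e.g. 0.65 (100 wpm) for voicing contrasts, 0.30 (47 wpm) for clusters."""
    return [
        Segment(s.phoneme, s.duration_ms / rate_fraction, s.voiced, s.fricated)
        for s in segments
    ]

# Example item: the word "fat" from the minimal pair fat/vat (durations are made up).
fat = [Segment("f", 120, voiced=False, fricated=True),
       Segment("a", 150, voiced=True, fricated=False),
       Segment("t", 90, voiced=False, fricated=False)]

for seg in slow_down(fat, rate_fraction=0.65):
    print(seg.phoneme, round(seg.duration_ms), supplementary_cues(seg))
```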
Table 1. Individual and average unaided and aided auditory thresholds (dB HL) for four frequencies for the 7 students who participated in the current study.

Table 2. The speech segments that were trained in the present study.
voiced vs. voiceless: /t/ vs. /d/ vs. /b/; example words: rot vs. rod vs. rob, till vs. dill vs. bill, tie vs. die vs. buy (/aɪ/); 18 words
Procedures

Based on articulatory difficulties identified by the participants' teachers, eight programs were developed. Four were used to teach the distinction between voiceless and voiced cognates: /f/ versus /v/, /s/ versus /z/, /t/ versus /d/ versus /b/, and /θ/ versus /ð/. Because the instructors indicated that practice with /p/ (the voiceless counterpart of /b/) was not necessary, we combined the three plosives /t/, /d/, and /b/ into one training program.

As with the traditional method proposed by Ling (1976), our method included training at the segment and word level. We added supplementary features consisting of visible vibrations (quick back-and-forth movements of the virtual larynx) in Baldi's neck whenever the segments were voiced. An air stream expelled from Baldi's mouth was also used to differentiate these segments (e.g., a considerable amount of air for voiceless segments and a limited amount for voiced counterparts; see Figure 1). Baldi's speech rate was slowed down to 100 words per minute, 65% of the normal rate (155 words per minute), to illustrate these distinctions. Three programs involved consonant clusters: two word-initial clusters involving /r/ (e.g., cry, grow, free) and /s/ (e.g., smile, slit, stare), and one word-final cluster involving /l/ (e.g., belch, milk, field). For these three programs, Baldi's speech rate was slowed down even further to 47 words per minute, or 30% of the normal rate. As shown in Figure 2, inside oral articulators were also revealed to teach the articulatory processes involved in producing consonant clusters. A final program taught the difference between the fricative /ʃ/ and the affricate /tʃ/. This program used methods that were involved in teaching both voiced versus voiceless distinctions and consonant clusters. Slowing down Baldi's speech to 47 words per minute, 30% of the normal rate, while revealing Baldi's inside oral articulators provided a perceivable difference between these two segments. With Baldi's instruction, the students were able to visibly determine that the starting position of articulation was different for these two segments and that the affricate was actually a combination of two segments produced in rapid succession (/t/ + /ʃ/ = /tʃ/). Supplementary voicing features were also used in training this distinction. Figure 3 illustrates the procedure of this study in its entirety.

On the first day, before the pretest was given, each student was required to give specific information about him/herself, including name, age, and date, in order to set up a file for his/her data. On each subsequent day,

Training

Each student completed two training lessons per week over the course of 21 weeks, including a 2-week break when the schools were closed for holiday vacation. Occasionally, because of the child's absence from school, the scheduled training lesson was simply presented at the next meeting. Each of the eight training lessons lasted for approximately 15 minutes and was completed three times over the course of the study. Thus, each student completed approximately 45 minutes of each of the eight training lessons, for a total of 6 hours of training.
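The eight training programs and the session schedule just described can be restated as a small configuration table. In the sketch below, only the listed values (categories, segments, speaking rates, lesson length, and number of rotations) come from the text; the structure and field names are hypothetical.

```python
# Hypothetical configuration summarizing the eight training programs described
# in the Procedures section; only the listed values are taken from the text.
NORMAL_RATE_WPM = 155

PROGRAMS = [
    {"category": "voiced vs. voiceless",    "segments": ["f", "v"],           "rate_wpm": 100},
    {"category": "voiced vs. voiceless",    "segments": ["s", "z"],           "rate_wpm": 100},
    {"category": "voiced vs. voiceless",    "segments": ["t", "d", "b"],      "rate_wpm": 100},
    {"category": "voiced vs. voiceless",    "segments": ["θ", "ð"],           "rate_wpm": 100},
    {"category": "consonant cluster",       "segments": ["initial r clusters"], "rate_wpm": 47},
    {"category": "consonant cluster",       "segments": ["initial s clusters"], "rate_wpm": 47},
    {"category": "consonant cluster",       "segments": ["final l clusters"],   "rate_wpm": 47},
    {"category": "fricative vs. affricate", "segments": ["ʃ", "tʃ"],          "rate_wpm": 47},
]

LESSON_MINUTES = 15
ROTATIONS = 3

total_minutes = len(PROGRAMS) * LESSON_MINUTES * ROTATIONS
print(f"{len(PROGRAMS)} lessons x {LESSON_MINUTES} min x {ROTATIONS} rotations "
      f"= {total_minutes} min = {total_minutes / 60:.0f} hours")

for p in PROGRAMS:
    print(f"{p['category']:>24}: {', '.join(p['segments']):<22} "
          f"at {p['rate_wpm']} wpm ({p['rate_wpm'] / NORMAL_RATE_WPM:.0%} of normal)")
```

The schedule arithmetic checks out against the text: 8 lessons x ~15 minutes x 3 rotations is 360 minutes, the 6 hours of training reported.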
Figure 2. The four presentation conditions of Baldi with transparent skin revealing inside articulators (back view, sagittal view, side view, front view).
Figure 3. Sequence of procedures involved in testing and training.
After the students completed a specific training lesson, the program was modified to take into consideration their difficulties. For example, the experimenter noted that the /v/ sound was being produced with a nasal quality by a few students. This program was modified so that, during the next training session of this cognate, Baldi would instruct the students to pinch their noses and produce the /v/ sound. This modification allowed the student to realize that nasality is not a feature of this sound. Although variations were made from one rotation to the next, the general format of the lessons remained constant from one day to the next. Each student completed a speech perception lesson and a speech production lesson during each day of training. The procedure for each training lesson is described below.

Perception

For speech perception, an identification task was given. For the voiceless versus voiced (/f/ vs. /v/, /s/ vs. /z/, and /θ/ vs. /ð/) and fricative versus affricate (/ʃ/ vs. /tʃ/) training lessons, stimuli consisted of six minimal pairs of words contrasting the target phonemes (e.g., fat vs. vat, shoe vs. chew). For the /d/ versus /t/ distinction, /b/ was also included in this program; therefore, six minimal triplets were involved. For the consonant cluster programs, stimuli consisted of six minimal triplets of words contrasting consonant cluster segments (e.g., for /r/ clusters: crown, frown, brown).

Test Phase. During the test phase for all categories (voiced vs. voiceless, fricative vs. affricate, and consonant clusters), Baldi said an isolated word from a pair or triplet while written words were simultaneously presented on the computer monitor. During the first rotation, the experimenter noticed that some of the students were attending to the text rather than to Baldi. In an attempt to redirect the student's attention to Baldi, a delay between the speech and the orthographic presentation was added for the second and third rotations. The experimenter's impression was that this modification

Figure 4. A full-screen view of a typical identification task (/f/ vs. /v/ distinction).
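A sketch of the perception test phase described above is given below: Baldi speaks a word from a minimal pair and, from the second rotation on, the written alternatives appear only after a delay so that attention stays on Baldi. The function stubs, the delay value, and the single example pair stand in for the real program, which is not reproduced here.

```python
import random
import time

# Stub hooks: in the real program these drive Baldi's audiovisual speech and
# the on-screen response alternatives. They are placeholders, not a real API.
def baldi_say(word): pass
def show_written_choices(words): pass
def get_student_choice(): return None   # stub: the real program reads a mouse click

# The article gives "fat vs. vat" for the /f/-/v/ program; each program used
# six such minimal pairs (or triplets).
MINIMAL_PAIRS = [("fat", "vat")]

def identification_block(rotation, text_delay_s=1.0):
    """One identification block: Baldi says a word and the student picks the
    written word that matches. From the second rotation on, the orthographic
    alternatives appear only after a delay (the delay length here is illustrative)."""
    trials = [(pair, target) for pair in MINIMAL_PAIRS for target in pair]
    random.shuffle(trials)
    results = []
    for pair, target in trials:
        baldi_say(target)
        if rotation >= 2:              # delay added for the second and third rotations
            time.sleep(text_delay_s)
        show_written_choices(list(pair))
        results.append(get_student_choice() == target)
    return results
```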
After completion of the speech perception training, the student went on to participate in a speech production training lesson.

Production

Test Phase. Baldi said an isolated word that included the target segment. After a tone, the student was instructed to repeat the word. Approximately 2 s after the tone, if a response could not be detected, Baldi asked the student to "please speak after the tone" and the tone replayed. Once a verbal response was detected, the computer captured the utterance in a sound file and logged it. The production ability of the speaker (i.e., correct vs. incorrect) was determined by the voice recognition system in the CSLU toolkit. Unfortunately, the voice recognizer was not as accurate as we had hoped. Negative feedback was often given for responses that the experimenter judged to be correct. This inaccurate feedback could hinder learning and was discouraging to the students, so we modified the program by implementing a technique whereby the experimenter judged the student's response and input this recognition decision via the computer mouse, without the student being aware of the procedure. The experimenter's input to the computer after each response determined the feedback. The next trial was presented 1 s after feedback was given. Six trials of each target segment in the voiceless/voiced minimal pair (e.g., /f/ vs. /v/) or triplet (e.g., /b/ vs. /d/ vs. /t/) distinction and in the fricative/affricate distinction (e.g., /ʃ/ vs. /tʃ/) were completed (12 trials for all pairs and 18 trials for the triplet), and placement of the target segment within the word was varied. Twelve trials of each consonant cluster were completed. Order of presentation was randomized, and on completion of these test trials, the student moved on to a tutoring phase.

Tutoring Phase. In the tutoring phase, the student was trained in how to produce the target segment. These instructions were composed from various sources (e.g., Ling, 1976; Massaro et al., in press). Different training methods were used to train certain categories. For example, supplementary features such as vocal cord vibration and turbulent airflow were used to visibly indicate the difference between voiceless and voiced contrasts (e.g., /f/ vs. /v/). A side view of Baldi with transparent skin was used during voiced versus voiceless training. This view was most effective for presenting the supplementary voicing features. For consonant cluster training, internal views of the oral cavity were important to show place features of the tongue during production. Slowing down Baldi's speech allowed us to emphasize the articulatory sequence involved in producing a consonant cluster. To teach the fricative versus affricate distinction, supplementary voicing features, internal articulatory views, and slowed-down speech were all used in training. Four different internal views of the oral cavity were shown during consonant cluster and fricative versus affricate training: a view from the back of Baldi's head looking in, a sagittal view of Baldi's mouth alone (static and dynamic), a side view of Baldi's whole face with his skin transparent, and a frontal view of Baldi's face with transparent skin. Each view gave the student a unique perspective on the activity that took place during production (see Figure 2). We expected these multiple views to facilitate learning and to anticipate individual preferences for different views.

During all training lessons, the student was instructed in how to produce the segments being trained (e.g., /f/ and /v/ for a voiceless vs. voiced contrast; /s/, /sm/, /st/, /sl/, and so on for consonant cluster training; /ʃ/ vs. /tʃ/ for a fricative vs. affricate contrast). The students were also required to produce the segment in isolation as well as in words, and during the tutoring of the consonant clusters they could hear their productions of certain words through a playback feature. No feedback was given during the training stage, but "good job" cartoons were given as reinforcement. The appendix gives a more detailed explanation of the processes involved in each type of training.

The tutoring phase for all lessons ended with Baldi saying, "Okay, now let's see what you've learned."

Test Phase. After the tutoring phase was completed, each student performed the repetition phase once again with feedback. This was identical to the first test phase. Six trials of each segment being tested were presented randomly, and placement of the target segment in the word varied. Baldi said a word, and the student was required to say that word back to him.

Posttest

One "rotation" was defined by the completion of all eight training lessons (see Figure 3). After each rotation, the student was given the general test of 104 words. This test was the same as the one given at pretest. The general tests were used as a measure of the degree to which the production abilities of each student changed from pretest to posttest. Three rotations of the eight lessons (6 hours of training), as well as a pretest, two general tests, and a posttest, were completed (see Figure 3).

Follow-Up Test

A follow-up test was given 6 weeks after training ended. This test was exactly the same as the pretest and the posttest (a general test of 104 words). This test was used to see how well production ability was retained once the training lessons ended.

Figure 5. Proportion of correct identifications (and standard error bars) during pretest and posttest for each of the eight training categories. The results are graphed from left to right by order of presentation during each training rotation.
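The production test phase described above, including the switch from the unreliable speech recognizer to a hidden experimenter judgment entered with the mouse, can be summarized as the following event loop. The function stubs are placeholders; the approximately 2-s response window and the 1-s gap before the next trial are the values stated in the text, and everything else is assumed.

```python
import time

# Stub hooks standing in for the talking head, audio capture, logging, and the
# experimenter's hidden mouse-click judgment; none of these is the real toolkit API.
def baldi_say(text): pass
def play_tone(): pass
def record_response(timeout_s): return "utterance.wav"  # stub: pretend a response was captured
def experimenter_judgment(): return True                 # stub for the hidden mouse click
def give_feedback(correct): pass
def log_trial(word, sound_file, correct): pass

def production_test_trial(word):
    """One production test trial: Baldi says the word, the student repeats it after
    a tone, and the experimenter's hidden judgment (which replaced the unreliable
    automatic recognizer) determines the feedback."""
    baldi_say(word)
    play_tone()
    sound_file = record_response(timeout_s=2.0)      # ~2 s to detect a response
    while sound_file is None:                        # no response detected: prompt and retry
        baldi_say("Please speak after the tone.")
        play_tone()
        sound_file = record_response(timeout_s=2.0)
    correct = experimenter_judgment()
    log_trial(word, sound_file, correct)
    give_feedback(correct)
    time.sleep(1.0)                                  # next trial begins 1 s after feedback
    return correct
```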
Figure 6. Intelligibility ratings (and standard error bars) of the pretest and posttest word productions for each of the eight training categories. The results are graphed from left to right by order of presentation during each training rotation.
(/s/ vs. /z/ = .38; /ʃ/ vs. /tʃ/ = .38) and showed the most improvement at posttest (/s/ vs. /z/ = .83, a .45 improvement; /ʃ/ vs. /tʃ/ = .70, a .32 improvement).

Production

Ratings (intelligibility of the auditory stimulus relative to the target text on a scale from 0 to 1) by the nine psychology student judges were used as the measure of production accuracy. To determine how well the students improved in their speech production from pretest to posttest, a three-way repeated-measures analysis of variance (ANOVA) was performed on the judges' ratings of the students' productions. Student (N = 7), test (pretest vs. posttest), and category (/f/ vs. /v/, /s/ vs. /z/, /l/ final clusters, etc.) were the independent variables. Judge (N = 9) served as the random source of variance in this design.

Figure 6 gives the pretest and posttest production ratings for each of the eight training categories. Production ratings showed a .21 increase on the 0 to 1 intelligibility scale from pretest to posttest, F(1, 8) = 67.93, p < .001. To determine the reliability of the judges' ratings, we computed the range of differences between the pretest and posttest scores. These differences varied between .08 and .29 across the nine judges, with seven of the nine judges falling within the .20 to .29 range, showing that there was good reliability across the different judges.

To determine whether each student individually benefited from the program, a separate analysis was performed on each student's results. As can be seen in Table 3, a statistically significant increase in ratings from pretest to posttest was observed for each of the 7 students.

As is shown in Figure 6, performance also varied depending on which training category was involved, F(7, 56) = 8.31, p < .001, and there was an interaction between test and category, F(7, 56) = 20.71, p < .001. Although all categories showed an improvement in ratings from pretest to posttest, the categories that were rated lowest at pretest showed the greatest improvement at posttest.

A second analysis was performed to assess the effectiveness of the differentiating information involved in production training. The information conditions were supplementary voicing features (vocal cord vibration and turbulent airflow), inside articulatory views from multiple angles with slowed-down speech, and a combination of both techniques. A three-way repeated-measures ANOVA was performed on the production ratings (intelligibility on a scale of 0 to 1). Judge (N = 9) was the random source of variance. Information condition (voicing features vs. visible articulation vs. both), test (pretest vs. posttest), and student (N = 7) were the independent measures.

The production rating increased from pretest to posttest for each information condition. There was a significant rating increase from pretest to posttest, F(1, 8) = 81.62, p < .001. Production ratings differed depending

Table 3. Change in ratings for each student from pretest to posttest.

Student   Pretest   Posttest   F(1, 8)    p
S1        .53       .66          7.42    <.05
S2        .34       .60         46.09    <.001
S3        .43       .68         17.237   <.001
S4        .64       .85         78.38    <.001
S5        .57       .83         22.23    <.001
S6        .46       .72        142.764   <.001
S7        .48       .59          8.095   <.05
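To make the structure of the production analysis explicit, the sketch below aggregates a judge x student x test array of intelligibility ratings into the per-student pretest and posttest means of the kind reported in Table 3, the overall pretest-to-posttest change, and the per-judge differences used in the reliability check. The random numbers are stand-ins for the real ratings, and the repeated-measures ANOVAs themselves are not re-implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)

N_JUDGES, N_STUDENTS, N_TESTS = 9, 7, 2          # tests: index 0 = pretest, 1 = posttest

# Stand-in data: intelligibility ratings on a 0-1 scale, judge x student x test.
# In the study these came from nine psychology-student judges rating each word.
ratings = np.clip(rng.normal(loc=[[[0.5, 0.7]]], scale=0.1,
                             size=(N_JUDGES, N_STUDENTS, N_TESTS)), 0, 1)

# Per-student means across judges (the quantities tabulated in Table 3).
student_means = ratings.mean(axis=0)             # shape: (students, tests)
for s, (pre, post) in enumerate(student_means, start=1):
    print(f"S{s}: pretest {pre:.2f}  posttest {post:.2f}  change {post - pre:+.2f}")

# Overall pretest-to-posttest change (the article reports a .21 increase).
overall_change = ratings[..., 1].mean() - ratings[..., 0].mean()
print(f"overall change: {overall_change:+.2f}")

# Judge reliability check used in the article: each judge's mean
# pretest-to-posttest difference, and the range of those differences.
judge_diffs = (ratings[..., 1] - ratings[..., 0]).mean(axis=1)
print(f"judge differences range from {judge_diffs.min():.2f} to {judge_diffs.max():.2f}")
```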
Discussion

The main goal of this study was to implement Baldi as a language tutor for speech perception and production for individuals with hearing loss. The students' ability to perceive and produce words involving the trained segments did change from pretest to posttest. A second analysis revealed an improvement in production ratings no matter which training method was used (e.g., vocal cord vibration and turbulent airflow vs. slowed-down speech with multiple internal articulatory views vs. a combination of both methods). Although training method was confounded with category, an analysis of pretest versus posttest ratings revealed each method to be successful.

Our method of training is similar in some respects to electropalatography (EPG), which has been considered useful in clinical settings because it provides direct visual feedback (in the form of a computer display) on the contact between the tongue and the palate during speech production. The student wears a custom-fitted artificial palate embedded with electrodes, and the clinician may wear one as well. The clinician illustrates a target pattern, and the student attempts to match it. For instance, the student may be presented with a typical contact pattern for /s/, with much contact at the sides of the palate and a narrow constriction toward the front of the palate. Certain speech pathologies result in /s/ being produced as a pharyngeal fricative. The pharyngeal fricative would show up on the screen as a lack of contact on the hard palate. The clinician can then teach the patient how to achieve the target pattern. Dent, Gibbon, and Hardcastle (1995) provide a case study in which EPG treatment improved the production of lingual stops and fricatives in a patient who had undergone pharyngoplasty.

EPG treatment has also proved to be useful in teaching children who are deaf to produce normal-sounding lingual consonants (e.g., Crawford, 1995; Dagenais, Critz-Crosby, Fletcher, & McCutcheon, 1994; Fletcher, Dagenais, & Critz-Crosby, 1991). Although the visual feedback from the EPG is deemed to be extremely important to the significant improvement in production, there have been very few systematic evaluations of its effectiveness. In the current study, however, our method appears to have been more successful, with a 21% improvement overall. Dagenais (1992) trained four different segments (e.g., alveolar stops, velar stops, alveolar sibilants, and palatal sibilants) and found an average 8% improvement in linguapalatal contact across 4 participants from pretest to 6 months after commencement of training (71% at pretest vs. 79% at 6 months). Although Dagenais provided many more hours of training (about 36 hours), we found a 21% improvement in production ratings after just approximately 6 hours of training. Dagenais (1992) also noted that electropalatography methods limited the abilities of the trainees with hearing loss to generalize to novel situations because of the limited tactile feedback that participants received during training. Our untrained words actually showed a somewhat greater improvement from pretest to posttest than did trained words (.24 change and .18 change, respectively). The small difference probably only reflects that the untrained words received lower initial production ratings than did trained words. Learning with our training method therefore appears to generalize to words outside of the training lessons.

The present findings suggest that Baldi is an effective tutor for speech training of students with hearing loss. There are other advantages of Baldi that were not exploited in the present study. Baldi can be accessed at any time, used as frequently as desired, and modified to suit individual needs. Baldi also proved beneficial even though students in this study were continually receiving speech training with their regular and speech teachers before, during, and after this study took place. Baldi appears to offer unique features that can be added to the arsenal of speech-language pathologists.

Ratings of the posttest productions were significantly higher than pretest ratings, indicating significant learning. Given that we did not have a control group, it is always possible that some of this learning occurred independently of our program or was simply based on routine practice. However, the results provided some evidence that at least some of the improvement must be due to our program. Follow-up ratings obtained 6 weeks after our training was complete were significantly lower than posttest ratings, indicating some decrement due to lack of continued use. From these results we can conclude that our training program was a significant contributing factor to the change in ratings seen for production ability. Future studies can directly test the usefulness of adding Baldi to clinicians' treatment methods and focus on which specific training regimens are most effective for particular contrasts.

Acknowledgments

The research and writing of this article were supported by the National Science Foundation (Grant No. CDA-9726363, Grant No. BCS-9905176, Grant No. IIS-0086107), the Public Health Service (Grant No. PHS R01 DC00236), and the University of California, Santa Cruz.

References

Bernstein, L. E., Demorest, M. E., & Tucker, P. E. (2000). Speech perception without hearing. Perception & Psychophysics, 62, 233–252.

Black, A. W., & Taylor, P. A. (1997). The Festival Speech
Svirsky, M. A., Robbins, A. M., Kirk, K. I., Pisoni, D. B., & Miyamoto, R. T. (2000). Language development in profoundly deaf children with cochlear implants. Psychological Science, 11, 153–158.

Torgesen, J. K., Wagner, R. K., Rashotte, C. A., Lindamood, P., Rose, E., Conway, T., & Garvan, C. (1999). Preventing reading failure in young children with phonological processing disabilities: Group and individual responses to instruction. Journal of Educational Psychology, 91, 579–593.

U.S. Department of Health and Human Services. (2002). Retrieved March 9, 2004, from http://www.cdc.gov/ncbddd/dd/ddhi.htm

Received February 10, 2003
Accepted July 21, 2003
DOI: 10.1044/1092-4388(2004/025)

Contact author: Dominic W. Massaro, PhD, Department of Psychology, University of California, Santa Cruz, CA 95064. E-mail: massaro@fuzzy.ucsc.edu
Voiced Versus Voiceless Distinction

In all of the tutoring, the experimenter was present but did not provide any additional instruction other than repeating what Baldi said if the child did not understand (mostly because of their limited hearing and Baldi's synthetic speech). For the voiceless versus voiced distinctions (e.g., /f/ vs. /v/, /s/ vs. /z/), Baldi first showed the student how to produce the target segments. He asked the student whether he/she could hear the difference between the two target sounds in the minimal pair (e.g., "Can you hear the difference between the /f/ in fat and the /v/ in vat?"). A 2-s pause allowed the student to respond before Baldi continued. Baldi told the student to watch carefully as he produced these two segments again and then went on to give the student verbal instructions on how he/she should produce the target segments (e.g., where to position the tongue with respect to the teeth, the shape of the tongue and lips, etc.). It should be noted that the instructions are the same for both segments in terms of tongue, teeth, and lip place features. Baldi then produced the voiced segment while the inside articulators were revealed.

Supplementary features such as vocal cord vibration and turbulent airflow were visible when Baldi produced the segment, to enhance awareness of its articulatory properties. A side view was the only view used during this type of training, for this was the best way to emphasize the supplementary features. The student was then instructed to produce the voiced target segment. No feedback was given at this time. During Baldi's production of the voiced segment, vocal cord vibration was shown. Baldi asked the student if he/she saw his throat vibrate. After a short pause of 2 s, which gave the student a chance to respond, Baldi explained that this segment was voiced, that voicing was made in his throat, and that this is what caused his throat to vibrate. The student was instructed to watch Baldi as he produced the voiced segment and to pay attention to his throat. The student was then asked to "put your hand on your throat. Keep it there and make the X sound." This enabled the student to feel for him/herself whether or not he/she was using his/her throat to make this sound. Baldi told the student that he/she should feel a vibration in his/her own throat. He explained that if he/she didn't feel a vibration, he/she was not making this sound correctly, but not to worry, because he/she would have much opportunity to practice and improve.

The same procedure was carried out for the voiceless counterpart. For the voiceless segment, no vocal cord vibration was shown during Baldi's production. Baldi asked the student if he/she saw his throat vibrate. After a short pause of 2 s, which gave the student a chance to respond, Baldi explained that the reason why no vibration occurred was that this sound was voiceless, and when making voiceless sounds, you do not use your vocal cords. Baldi let the student know that voicing was one feature that distinguished between the two sounds being trained. The student was then instructed to put his/her hand on his/her throat and produce this sound. This enabled the student to feel no vibration and to understand the difference between voiced and voiceless segments in his/her own speech.

Baldi then went through the same procedure for turbulent airflow, showing the difference between the varying degrees of air that escape the mouth for voiced versus voiceless segments (e.g., a large degree of expulsion for voiceless segments and almost no expulsion for voiced segments). He asked the student to "put your hand in front of your mouth. Keep it there and make the X sound." Having a hand in front of his/her mouth during production allowed the student to feel the air hit his/her hand in varying degrees, which allowed the student to better understand the difference between the production of the two sounds in his/her own speech.

Next, Baldi showed the student how to produce various words involving the voiced segment from the pair. After Baldi produced a word involving the target phoneme, he gave the student helpful tips about tongue positioning and so on, and then he asked the student to repeat this word to him. After the student made an effort to produce this word, he/she was presented with a cartoon on which was written one of various reinforcing statements, including "good job," "way to go," "awesome," and so on, to encourage the student to keep trying, regardless of whether or not he/she had produced the segment correctly. The same procedure was carried out for the voiceless counterpart.
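The voiced-versus-voiceless tutoring dialogue in this appendix follows a fixed sequence of prompts, pauses, and cue displays. The sketch below restates that sequence as data so the flow can be seen at a glance; the step structure and field names are our own shorthand for the protocol described above, not the program's internal format.

```python
# A hypothetical restatement of the voiced-vs.-voiceless tutoring script for a
# minimal pair such as fat/vat, following the sequence described in this appendix.
VOICING_TUTORIAL = [
    {"actor": "Baldi", "action": "produce both target segments", "display": "side view"},
    {"actor": "Baldi", "action": "ask: 'Can you hear the difference between the "
                                 "/f/ in fat and the /v/ in vat?'", "pause_s": 2},
    {"actor": "Baldi", "action": "give placement instructions (tongue, teeth, lips)"},
    {"actor": "Baldi", "action": "produce the voiced segment with inside articulators shown",
     "display": "side view + vocal cord vibration + air stream"},
    {"actor": "student", "action": "produce the voiced segment (no feedback)"},
    {"actor": "Baldi", "action": "ask whether the student saw his throat vibrate", "pause_s": 2},
    {"actor": "Baldi", "action": "explain voicing; student puts a hand on his/her throat "
                                 "and makes the sound to feel the vibration"},
    {"actor": "Baldi", "action": "repeat the sequence for the voiceless counterpart "
                                 "(no vibration shown)"},
    {"actor": "Baldi", "action": "repeat the sequence for turbulent airflow "
                                 "(hand in front of the mouth)"},
    {"actor": "Baldi", "action": "model words containing the target segment; the student "
                                 "repeats each word and receives a reinforcing cartoon"},
]

for i, step in enumerate(VOICING_TUTORIAL, start=1):
    pause = f" (then a {step['pause_s']}-s pause)" if "pause_s" in step else ""
    view = f" [{step['display']}]" if "display" in step else ""
    print(f"{i:2d}. {step['actor']}: {step['action']}{view}{pause}")
```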