2000
Abstract
In this paper, we investigate the acoustic prosodic marking of demonstrative and personal pronouns in task-oriented dialog. Although it has been hypothesized that acoustic marking affects pronoun resolution, we find that the prosodic information extracted from the data is not sufficient to predict antecedent type reliably. Inter-speaker variation accounts for much of the prosodic variation that we find in our data. We conclude that prosodic cues should be handled with care in robust, speaker-independent dialog systems.
Introduction
Previous work on anaphora resolution has yielded a rich basis of theories and heuristics for finding antecedents. However, most research to date has neglected an important potential cue that is only available in spoken data: prosody. Prosodic marking can be used to change the antecedent of a pronoun, as demonstrated by this classic example from Lakoff (1971) (capitals indicate a pitch accent):
(1) John_i called Jim_j a Republican, then he_i insulted him_j.
(2) John_i called Jim_j a Republican, then HE_j insulted HIM_i.
But exactly how the antecedent changes due to the prosodic marking on the pronoun, and whether this effect happens consistently, is an open question. If consistent effects do exist, they would be useful for online pronoun interpretation in spoken dialog systems.
Prosodic prominence directs the attention of the listener to what is important for understanding and interpretation. But how should this principle be applied when words that are normally not very prominent, such as pronouns, are accented? More generally, does acoustic marking provide systematic cues to characteristics of antecedents? More specifically, does it imply that the antecedent is "unusual" in some way? These are the two hypotheses we investigate in this paper. Our data consists of 322 pronouns from a large corpus of spontaneous task-oriented dialog, the TRAINS93 corpus (Heeman and Allen, 1995). This corpus allows us to study pronouns as they occur in spontaneous unscripted discourse, and is one of the very few speech corpora to have been annotated with pronoun interpretation information.
The remainder of this paper is structured as follows: In Section 2, we summarize relevant work on pronoun resolution and report on the few proposals for integrating prosody into pronoun resolution algorithms. Next, in Section 3, we present the dialogs used for our study and the attributes available in the annotation data, while Section 4 describes the acoustic measures that were computed automatically from the data. Section 5 explores whether there are systematic correlations between these properties and the acoustic measures fundamental frequency, duration, and intensity. For these measures, we find that most correlations are in fact due to speaker variation, and that speakers differ greatly in their overall prosodic characteristics. Finally, we investigate whether it is possible to use these acoustic features to predict properties of the antecedent using logistic regression. Again, we do not find acoustic features to be reliable predictors for the features of interest. Therefore, we conclude in Section 6 that acoustic measures cannot be used in speaker-independent online anaphora resolution algorithms to predict the features under investigation here.
Background and Related Work
There is a rich literature on resolving personal pronouns. Many approaches are based on a notion of attentional focus. Entities in attentional focus are highly salient, and pronouns are assumed to refer to the most salient entity in the discourse (cf. Brennan et al., 1987; Azzam et al., 1998; Strube, 1998). Centering (Grosz et al., 1995) is a framework for predicting local attentional focus. It assumes that the most salient entity from sentence S_{n-1} that is realized in sentence S_n is most likely to be pronominalized in S_n. That entity is termed the Cb (backward-looking center) of sentence S_n. Finding the preferred ranking criteria is an active area of research. Byron and Stent (1998) adapted this approach, which had previously been applied to text, for spoken dialogs, but with limited success.
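The definition of the Cb given above can be illustrated with a minimal sketch. The salience ranking is simply passed in as an ordered list, and the function name and the example entities are assumptions made for illustration; the sketch does not reproduce any particular ranking scheme from the Centering literature.

```python
from typing import List, Optional, Set


def backward_looking_center(prev_ranked: List[str], current: Set[str]) -> Optional[str]:
    """Return the Cb of the current utterance S_n.

    prev_ranked: entities realized in S_{n-1}, ordered from most to least
                 salient (how this ranking is obtained is left open here).
    current:     entities realized in S_n.
    """
    for entity in prev_ranked:
        if entity in current:
            return entity  # most salient entity of S_{n-1} that recurs in S_n
    return None  # nothing carried over, so S_n has no Cb


# Hypothetical example in the style of the logistics domain:
# S_{n-1}: "Send the engine to Corning."   S_n: "It should pick up the boxcars."
prev = ["engine", "Corning"]
curr = {"engine", "boxcars"}
cb = backward_looking_center(prev, curr)
```

Under these assumptions the entity returned is the one a Centering-based resolver would expect to be pronominalized in S_n.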
In contrast to personal pronouns, demonstratives do not rely on calculations of salience. In fact, Linde (1979) found that while "it" was preferred for entities within the current local focus, "that" was used for items outside the current focus of attention. Passonneau (1989) showed that personal and demonstrative pronouns are used in contrasting situations: personal pronouns are preferred when both the pronoun and its antecedent are in subject position, while demonstrative pronouns are preferred when either the pronoun or its antecedent is not in subject position. She also found that personal pronouns tend to co-specify with pronouns or base noun phrases; the more clause- or sentence-like the antecedent, the more likely the speaker is to choose a demonstrative pronoun.
Pronoun resolution algorithms tend not to cover demonstratives. Notable exceptions are Webber's model for discourse deixis (Webber, 1991) and the model developed for spoken dialog by Eckert and Strube (1999). This algorithm encompasses both personal and demonstrative pronouns and exploits their contrastive usage patterns, relying on syntactic clues and verb subcategorizations as input. Neither study investigated the influence of prosodic prominence on resolution.
Most previous work on prosody and pronoun resolution has focussed on pitch accents and third person singular pronouns that co-specify with persons. Nakatani (1997) examined the antecedents of personal pronouns in a 20-minute narrative monologue. She found that pronouns tend to be accented if they occur in subject position, and if the backward-looking center (Grosz et al., 1995) was shifted to the referent of that pronoun. She then extended this result to a general theory of the interaction between prominence and discourse structure. Cahn (1995) discusses accented pronouns on the basis of a theory about accentual correlates of salience. Kameyama (1998) interprets a pitch accent on pronouns in the framework of the alternative semantics (Rooth, 1992) theory of focus. She assumes that all potential antecedents are stored in a list. Pronouns are then resolved to the most preferred antecedent on that list which is syntactically and semantically compatible with the pronoun. Preference is modeled by an ordering on the set of antecedents. An accent on the pronoun signals that pronoun resolution should not be based on the default ordering, where the default is computed by a number of interacting syntactic, semantic, pragmatic, and attentional constraints.
Compared to "he" and "she", "it" and "that" have been somewhat neglected. There are two reasons for this. First, "it" is not considered to be as accentable as "he" and "she" by native speakers of both British and American English, whereas "that" is more likely than "it" to bear a pitch accent. An informal study of the London-Lund corpus of spoken British English (Svartvik, 1990) confirmed that observation. Second, "that" frequently does not have a co-specifying NP antecedent, and most research on co-specification has focussed on pronouns and NPs. Work on accented demonstratives and pronoun resolution is extremely scarce. Pioneering studies were conducted by Fretheim and his collaborators. They tested the effect of accented sentence-initial demonstratives that co-specify with the preceding sentence on the resolution of ambiguous personal pronouns, and found that the pronoun antecedents switched when the demonstrative was accented (Fretheim et al., 1997). However, to our knowledge, there are no studies that compare the co-specification preferences of accented vs. unaccented demonstratives.
The Corpus: TRAINS93
Our data is taken from the TRAINS93 corpus of human-human problem solving dialogs in the logistics planning domain. In these dialogs, one participant plays the role of the planning assistant and the other attempts to construct a plan for delivering specified cargo to its destination. We used a subset of 18 TRAINS93 dialogs in which the referent and antecedent of third-person non-gendered pronouns had been annotated in a previous study (Byron and Allen, 1998). In the dialogs used for the present study, 322 pronouns (158 personal and 164 demonstrative) have been annotated. Personal pronouns in the dialogs are it, its, itself, them, they, their and themselves. Demonstrative pronouns in the annotation data are that, this, these, those. There are five male and 11 female speakers. One female speaker contributed 89 pronouns, two others produced more than 30 each (one female, one male), and the rest are divided unevenly among the remaining 13 speakers. The set of dialogs chosen for annotation intentionally included a variety of speakers so that no speaker's idiosyncratic discourse strategies would be prevalent in the resulting data. Table 1 describes the attributes captured for each pronoun. These features were chosen for the annotation because many previous studies have shown them to be important for pronoun resolution. Features include attributes of the pronoun, its antecedent (the discourse constituent that previously triggered the referent), and its referent (the entity that should be substituted for the pronoun in a semantic representation of the sentence). Cb was annotated using Model3 from Byron and Stent (1998) with a linear model of discourse structure. Note that annotated pronouns were not limited to those with NP antecedents, as is the case with most other studies. In addition to NP antecedents, pronouns in this data set could have an antecedent of some other phrase or clause type, or no annotatable antecedent at all.
There are two categories of pronouns with no annotatable antecedent. In the simplest case, the pronominal reference is the first mention of the referent in the dialog. That happens when the referent is inferred from the problem solving state. For example, after the utterance "send the engine to Corning and pick up the boxcars", a new discourse entity, the train composed of the engine and boxcars, is available for anaphoric reference. In the more subtle case, the entity was built from a stretch of discourse longer than one utterance. In an effort to achieve an acceptable level of inter-annotator agreement for the annotation, the maximum size for a constituent to serve as an antecedent was defined to be one utterance. Discourse entities that are built from longer stretches of text include objects such as the entire plan or the discourse itself, and such items are less reliable to annotate. Taking the annotated dialogs as a whole, 21.4% of all pronouns have a non-NP antecedent, and 27% do not have an annotatable antecedent at all. Table 2 shows that the default antecedents of personal and demonstrative pronouns follow the predictions of Schiffman (1985). The antecedent of a personal pronoun is most likely to be itself a pronoun or a full NP, while demonstratives are most likely to have no antecedent, or if there is one, it is most likely to be a non-NP. The main role of prosodic information is to help pronoun resolution algorithms identify cases where these default predictions are false.

Table 1: The features available in the annotation data set.
Feature ID | Description | Possible Values
PRONTYPE | Pronoun type | def = the pronoun is one of {it, its, itself, them, they, their, themselves}; dem = the pronoun is one of {that, this, these, those}
PRONSUBJ | Pronoun is subject | Y = pronoun subject of main clause of its utterance; N = pronoun not subject of main clause
ANTEFORM | Antecedent form | PRONOUN = antecedent is pronoun; NP = antecedent is base noun phrase; NON-NP = antecedent is other constituent, at most one utterance long; NONE = pronoun is first mention or antecedent longer than one utterance
DIST | Distance to antecedent | SAME = antecedent and pronoun in same utterance; ADJ = antecedent and pronoun in adjacent utterances; REMOTE = antecedent more than one utterance before pronoun
ANTESUBJ | Antecedent is subject | Y = antecedent subject of the main clause of its utterance; N = antecedent not subject of a main clause
CB | Backward-looking center | Y = pronoun is Cb of its utterance; N = pronoun is not Cb

Table 2: Typical properties of antecedents for personal and demonstrative pronouns in the corpus. All percentages are given relative to the total number of pronouns in that category and rounded. Boldface: most frequent antecedent property.
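The attributes listed in Table 1 map naturally onto a small record type. The following is a minimal sketch of how one annotated pronoun might be represented and validated; the class name, field names, and the example record are hypothetical, and the sketch does not model how the corpus encodes DIST or ANTESUBJ for pronouns without an annotatable antecedent, since the source does not say.

```python
from dataclasses import dataclass

# Allowed values per feature, copied from Table 1.
ALLOWED = {
    "prontype": {"def", "dem"},
    "pronsubj": {"Y", "N"},
    "anteform": {"PRONOUN", "NP", "NON-NP", "NONE"},
    "dist": {"SAME", "ADJ", "REMOTE"},
    "antesubj": {"Y", "N"},
    "cb": {"Y", "N"},
}


@dataclass(frozen=True)
class PronounAnnotation:
    """One annotated pronoun (hypothetical container, not the corpus format)."""
    prontype: str
    pronsubj: str
    anteform: str
    dist: str
    antesubj: str
    cb: str

    def __post_init__(self) -> None:
        # Reject any value that Table 1 does not list for the feature.
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} not in {sorted(allowed)}")


# Invented example: a personal pronoun in subject position whose antecedent
# is a non-subject base NP in the adjacent utterance.
example = PronounAnnotation(
    prontype="def", pronsubj="Y", anteform="NP",
    dist="ADJ", antesubj="N", cb="Y",
)
```

Keeping the allowed values in one table makes it easy to spot annotation typos before any statistics are computed.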
Acoustic Prosodic Cues
Our selection of acoustic measures covers three classic components of prosody: fundamental frequency (F0), duration, and intensity (Lehiste, 1970). The relationship between those cues and prosodic prominence has been demonstrated by, e.g., Fant and Kruckenberg (1989) and Heuft (1999). The main correlate of English stress is F0, the second most important is duration, and the least important is intensity (Lehiste, 1970). Therefore, we will pay more attention to F0 measures. Although experimental results indicate that F0 cues of prominence can depend on the shape of the F0 contour of the utterance (cf. Gussenhoven et al., 1997), we do not control for such interactions. Instead, we restrict ourselves to cues that are easy to compute from limited data, so that a running spoken dialogue system might be able to compute them in real time.
Acoustic Measures
Duration: For duration, we found that the logarithmic duration values are normally distributed, both pooled over all speakers and for those speakers with more than 20 pronouns. Logarithmic duration is also the target variable of many duration models, such as that of van Santen (1992). We assume that speaker-related variation is covered by the variance of this normal distribution; we can control for speaker effects by including a SPEAKER factor in our models.
F0 variables: F0 was computed using the Entropic ESPS Waves tool get_f0 with standard settings and a frame rate of 10 ms. All F0 values were transformed into the log-domain and then pooled into mean, minimum, and maximum F0 values for each word and each utterance. This log domain is well motivated psychoacoustically (Zwicker and Fastl, 1990). F0 range was computed on the values in the log-domain. We assume that the logarithm of F0 has a normal distribution. Therefore, we can normalize for speaker-dependent differences in pitch range by using z-scores, and we can use standard statistical analysis methods such as ANOVA.
Intensity: Intensity is measured as the root-mean-square (RMS) of signal amplitudes. We measure RMS relative to a baseline as given by the formula log(RMS / RMS_baseline). The baseline RMS was computed on the basis of a simple pause detection algorithm, which takes the first maximum in the amplitude histogram to be the average amplitude of background noise. The baseline RMS was slightly above that value.
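Two of the normalizations described above are simple enough to sketch. The code below is an illustration under stated assumptions, not the original analysis scripts: it uses natural logarithms, treats F0 values of zero or below as unvoiced frames to be skipped, reads the intensity formula as log(RMS / RMS_baseline), and takes the baseline RMS as a given number instead of deriving it from an amplitude histogram. All function names and sample values are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def log_f0_zscores(f0_hz: Sequence[float]) -> List[float]:
    """z-scores of log F0 for ONE speaker.

    Frames with F0 <= 0 are assumed to be unvoiced and are skipped.
    At least two voiced frames are needed to estimate the spread.
    """
    logs = [math.log(v) for v in f0_hz if v > 0]
    mu, sigma = mean(logs), stdev(logs)
    return [(x - mu) / sigma for x in logs]


def relative_log_rms(samples: Sequence[float], baseline_rms: float) -> float:
    """log(RMS / RMS_baseline) for the samples of one word."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return math.log(rms / baseline_rms)


# Invented values, for illustration only.
z = log_f0_zscores([180.0, 0.0, 200.0, 220.0, 240.0])
energy = relative_log_rms([0.2, -0.2, 0.2, -0.2], baseline_rms=0.02)
```

Because the z-scores are computed per speaker, they remove differences in overall pitch level and range, but not differences in how individual speakers use pitch on pronouns.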
Inter-Speaker Differences
Since we need to pool data from many different speakers, we need to control for inter-speaker differences. The number of pronouns we have from each speaker varies between 1 for speaker GD and 86 for speaker CK. Speakers PH (male) and CK (female) are the only ones to have produced more than 15 personal pronouns and 15 demonstratives. In order to test whether the SPEAKER factor affects the choice between personal pronouns and demonstratives, we fitted a logistic regression model with the target variable PRONTYPE (personal or demonstrative) and the predictors ANTE, ANTESUBJ, DIST, REFCAT, CB and SPEAKER (in this sequence). REFCAT is an additional variable that describes the semantic category of a pronoun's referent (e.g., domain objects vs. abstract entities). Even though SPEAKER is the last factor in the model, an analysis of deviance shows a significant influence (p < 0.005, F = 2.51, df = 13). A possible explanation for this is that some speakers prefer to use demonstratives in contexts where others would choose a personal pronoun, and vice versa, or perhaps the SPEAKER variable mediates the influence of a far more complex factor such as problem solving strategy. Resolving this question is beyond the scope of this paper.
On the basis of F0, we can establish four groups of speakers: The first group consists of male speakers with a low mean F0 and a low F0 range. In the next group, we find both male and female speakers with a low mean F0, but a far higher range. Speaker PH belongs to this second group. Interestingly, for these speakers, the mean F0 on pronouns is lower than for those of the first group. Groups 3 and 4 consist entirely of female speakers, with group 3 using a lower range than group 4. Speaker CK belongs to group 4.
Exploring Prominent Pronouns
If data about prosodic prominence is to be useful for pronoun resolution, then there must be prosodic cues that carry information about properties of the antecedent. In this section, we investigate whether there are such cues for the properties that we have available in the annotation data, defined in Table 1. More specifically, we hypothesize that prosodic cues will be used if the antecedent is somewhat unusual. For example, the results of Linde and Passonneau would lead us to expect that personal pronouns with non-NP antecedents and demonstratives with NP and pronoun antecedents will be marked. Since the antecedents of pronouns tend to occur no more than 1-2 clauses ago, we would also expect pronouns with more remote antecedents to be marked. A first qualitative look at the data suggests that even if such tendencies are present in the data, they might not turn out to be significant. For example, in Figure 1, the means of lzmeanf0 behave roughly as predicted, but the variation is so large that these differences might well be due to chance.
Figure 1: Distribution of z-score of mean F0 for different values of ANTEFORM and ANTESUBJ.
Correlations between Measures and Properties
Next, we examine whether the measures defined in Section 4 correlate with any particular properties of the antecedent. More precisely, if a property is cued by some aspect of prosody (either duration, F0, or intensity), then the prosody of a pronoun depends to a certain degree on its antecedent. In a statistical analysis, we should find a significant effect of the relevant antecedent property on the prosodic measure. We selected ANOVA as our analysis method, because our prosodic target variables appear to have a normal distribution. For each of the antecedent features defined above, we examined its influence on mean F0 (lmeanf0), the z-score of mean F0 (lzmeanf0), the z-score of F0 range (lzrgf0), logarithmic duration (dur), and normalized energy (energy). In addition, we added the factors PRONTYPE and SPEAKER.
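To make the test concrete, the sketch below computes the F statistic of a one-way ANOVA for a single factor. This is a simplification: the analysis described above includes several factors (the antecedent feature, PRONTYPE, and SPEAKER), whereas the sketch handles only one. The function name and the sample values are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def one_way_anova(groups: Dict[str, List[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented z-scored mean F0 values, grouped by ANTESUBJ.
f_stat, df_b, df_w = one_way_anova({
    "Y": [1.0, 2.0, 3.0],
    "N": [2.0, 3.0, 4.0],
})
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom would indicate that the antecedent property accounts for a significant share of the variation in the prosodic measure.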
Results:
The results are summarized in Table 3. For lzmeanf0 and energy, the influence of SPEAKER is always considerable. There are also consistent effects of the syntactic position of a pronoun: In general, demonstratives are shorter in subject position, and for CK, mean F0 on personal pronouns in subject position is higher than on non-subject ones (228 Hz vs. 190 Hz). But when we turn to the factors that interest us most, properties of the antecedent, we cannot find any consistent correlates, although in almost every data set, there are some prosodic cues to ANTESUBJ for personal pronouns. But what these cues are may well depend on the speaker, as the results for CK show. Her pitch range on pronouns with a subject antecedent is double the range on pronouns with an antecedent in non-subject position. Pronouns with subject antecedents are also considerably louder. All in all, antecedent properties can only account for a very small percentage of the variation in these prosodic cues. Therefore, we should not expect the prosodic cues to be stable, robust indicators for predicting antecedent properties in spoken dialog systems.
Table 3
The results are summarized in Table 5. On all tasks except remote, PRONTYPE and PRONSUBJ performed well. Both features have already been shown to be reliable cues for pronoun resolution (cf. Section 2). On task cb, only PRONTYPE can explain a significant amount of variation. Models which include a speaker factor almost always fare better. In models without speaker information, F0-related measures yield a larger reduction in deviance than the duration measure. The reason for this is that the F0 measures preserve some information about the different speaker strategies. Once SPEAKER has been included as well, only dur leads to significant improvements on task nonNP (p < 0.05). Both demonstratives and personal pronouns are shorter when the antecedent is a non-NP.
Table 5
Inter-Speaker Variation
We have seen that inter-speaker differences explain much of the variation in the prosodic measures. Table 4 gives an idea of the size and direction of these differences. On the complete data set, personal pronouns are shorter than demonstratives, have a lower intensity, and show a higher average F0. A closer examination reveals considerable inter-speaker variation. CK is fairly prototypical. PH barely shows the difference in F0, and for MF, the difference in intensity is actually reversed. MF also has rather short demonstratives. Such speaker-specific variation cannot be eliminated by normalization; it has to be controlled for in the statistical tests. Discovering types of speakers is difficult: two of the 15 speakers, CK and PH, contribute 48 of all pronouns.
Table 4
Inter-speaker variation in prosody. disc.: complete discourse. All speakers: 322 pronouns; CK: 41 personal, 45 demonstrative; PH: 18 personal, 24 demonstrative; MF: 7 personal, 8 demonstrative.
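Why normalization cannot remove this kind of variation can be illustrated with a small sketch. It assumes that z-scores are computed within each speaker, which removes a speaker's overall level and spread but leaves the direction of the personal/demonstrative contrast untouched. The speaker labels A and B, the helper names, and the intensity values are hypothetical and are not taken from the corpus.

```python
from statistics import mean, pstdev

def zscore_by_speaker(rows):
    """Z-score each value against the mean and standard deviation of its own speaker."""
    by_speaker = {}
    for speaker, _, value in rows:
        by_speaker.setdefault(speaker, []).append(value)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}
    return [(s, t, (x - stats[s][0]) / stats[s][1]) for s, t, x in rows]

def type_difference(data, speaker):
    """Mean demonstrative value minus mean personal value for one speaker."""
    dem = [x for s, t, x in data if s == speaker and t == "demonstrative"]
    per = [x for s, t, x in data if s == speaker and t == "personal"]
    return mean(dem) - mean(per)

# Hypothetical intensity values in dB: (speaker, pronoun type, value).
# Speaker B is louder overall and shows the reverse contrast to speaker A.
rows = [
    ("A", "personal", 60.0), ("A", "personal", 62.0),
    ("A", "demonstrative", 66.0), ("A", "demonstrative", 68.0),
    ("B", "personal", 75.0), ("B", "personal", 77.0),
    ("B", "demonstrative", 71.0), ("B", "demonstrative", 73.0),
]

normalized = zscore_by_speaker(rows)
diff_a = type_difference(normalized, "A")
diff_b = type_difference(normalized, "B")
print(f"A: {diff_a:+.2f}  B: {diff_b:+.2f}")
```

After normalization both speakers are on the same scale, yet the contrast still points in opposite directions, which is why a SPEAKER factor is needed in the statistical tests.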
Predicting Properties of the Antecedent
Finally, we examine how much information prosodic cues yield about the antecedent. For this purpose, we set up a prediction task not unlike one that an actual NLU system faces. The input variables are the prosodic properties of the pronoun, whether the pronoun is personal or demonstrative (PRONTYPE), whether it is the subject (PRONSUBJ), and whether it is sentence-initial (PRONINIT). From this, we now have to deduce properties of the antecedent: syntactic role (ANTESUBJ), form (ANTEFORM), and distance (DIST). For prediction, we used logistic regression (Agresti, 1990). This has two advantages: not only can we compare how well the different regression models fit the data, we can also re-analyze the fitted model to determine which factors have a significant influence on classification accuracy.
First, we construct a model on the basis of PRONTYPE, PRONSUBJ, and PRONINIT. Then, we construct a model with these three factors plus SPEAKER. Finally, we train a model with PRONTYPE, PRONSUBJ, PRONINIT, SPEAKER, and one of the three measures lzmeanf0, dur, and energy. The models are trained to predict whether there is an antecedent (task noAnte), whether the antecedent is a non-NP (task nonNP), whether the antecedent is remote (task remote), whether the antecedent is in subject position (task sjante), and whether the antecedent is the current Cb (task cb). All models are computed over the full data set, because the data set for speaker CK alone is not sufficient for estimating the regression coefficients. The models are then compared to see which step yields a significant improvement: adding SPEAKER, or adding the prosodic variable after SPEAKER variation has been accounted for.
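The stepwise comparison of nested models can be sketched as follows. This is an illustration under stated assumptions, not the procedure or software used for the experiments: it fits two logistic regression models by Newton-Raphson, one with PRONTYPE only and one with PRONTYPE plus dur, and compares their deviances with a likelihood-ratio test on one degree of freedom. The helper functions and the data are hypothetical, and PRONSUBJ, PRONINIT, and SPEAKER are left out for brevity.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def predict(beta, rows):
    """Predicted probabilities for design-matrix rows under coefficients beta."""
    return [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, r)))) for r in rows]

def fit_logistic(xs, ys, iterations=25):
    """Fit a logistic regression by Newton-Raphson; return (coefficients, deviance)."""
    rows = [[1.0] + list(x) for x in xs]  # prepend the intercept column
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        probs = predict(beta, rows)
        grad = [sum((y - p) * r[j] for r, y, p in zip(rows, ys, probs)) for j in range(k)]
        hess = [[sum(p * (1 - p) * r[i] * r[j] for r, p in zip(rows, probs))
                 for j in range(k)] for i in range(k)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    probs = predict(beta, rows)
    deviance = -2.0 * sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                          for y, p in zip(ys, probs))
    return beta, deviance

# Hypothetical data: (PRONTYPE: 1 = demonstrative, log duration, antecedent is a non-NP).
data = [
    (1, -2.3, 1), (1, -2.2, 1), (1, -2.1, 0), (1, -2.0, 1),
    (1, -1.9, 1), (1, -1.8, 0), (1, -1.7, 1), (1, -1.6, 0),
    (0, -2.3, 1), (0, -2.2, 0), (0, -2.1, 1), (0, -2.0, 0),
    (0, -1.9, 1), (0, -1.8, 0), (0, -1.7, 0), (0, -1.6, 0),
]
ys = [y for _, _, y in data]

_, dev_base = fit_logistic([(p,) for p, _, _ in data], ys)
_, dev_full = fit_logistic([(p, d) for p, d, _ in data], ys)

# Likelihood-ratio test: the drop in deviance is compared to a chi-square
# distribution with 1 degree of freedom (one added coefficient).
lr_stat = max(dev_base - dev_full, 0.0)
p_value = math.erfc(math.sqrt(lr_stat / 2.0))
print(f"deviance {dev_base:.2f} -> {dev_full:.2f}, p = {p_value:.3f}")
```

The same comparison applies at each step of the procedure: a predictor is worth keeping only if the reduction in deviance it brings is larger than chance would explain.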
Conclusion and Outlook
In this paper, we examined patterns of acoustic prosodic highlighting of personal and demonstrative pronouns in a corpus of task-oriented spontaneous dialog. To our knowledge, this is the first comparative study of this kind. We used a straightforward, theory-neutral operationalization of "prosodic highlighting" that does not depend on complex algorithms for F0 stylization or (focal) accent detection and is thus very easy to incorporate into any real-time spoken dialog system. We chose a spoken dialog corpus that includes demonstrative pronouns because demonstratives are both a prominent feature of problem-solving dialogs and a sorely neglected field of study. In particular, we asked two questions:
Do Speakers Signal Antecedent Properties Acoustically?
Based on our data, the answer to this question is: if they do, they do it in a highly idiosyncratic way. We cannot posit any safe generalizations over several speakers, and from the perspective of an NLP application, such generalizations might even be dangerous. In order to evaluate the impact of speaker strategies on the resolution of pronouns, we need more data: 150 to 200 pronouns from each of 4-5 speakers. Collecting this amount of data in a dedicated corpus is inefficient. Therefore, further acoustic investigations do not make much sense at this point; rather, the data should be examined carefully for tendencies that can form the basis for dedicated production and perception experiments explicitly designed to uncover inter-speaker variation.
Are Acoustic Features Useful for Pronoun Resolution?
The answer is: probably not. At least for this corpus, we were not able to determine any numerical heuristics that could be utilized to aid pronoun resolution. The logistic regression experiments show that on a speaker-independent basis, logarithmic duration might well be a reliable cue to certain aspects of a pronoun's antecedent. In order to incorporate prosodic cues into an actual algorithm, we will need more training material and a principled evaluation procedure. We will also need to take into account other influences, such as dialog acts and dialog structure.