Internal Consistency of Performance Evaluations as a Function of Music Expertise and Excerpt Familiarity



Teaching students to evaluate music performance is a fundamental goal of music education. Indeed, "evaluating music and music performances" is one of the nine U.S. national standards for music education (MENC-The National Association for Music Education, 1994, p. 62). Inherent in this skill is the ability to make consistent, reliable judgments as to the quality of a music performance. It would follow, then, that for one to demonstrate competence in the area of evaluation, reliability of evaluation must exist. If an individual is not able to be consistent (i.e., reliable) in evaluative tasks, it is difficult to place any validity in that individual's assertions about the quality of a music performance.

Because the evaluation of music performance is so intertwined with the processes of learning and teaching music, it is no wonder that this topic is of primary importance to music educators and researchers. As a consequence, the issue of reliability in the evaluation of music performance has received much attention in the research literature. Although there is much overlap, extant research relating to performance evaluation can be classified into three broad categories: (a) the development and validation of performance adjudication/rating scales (e.g., Abeles, 1973; Bergee, 1987, 1988, 1993, 1997, 2003; Fiske, 1975, 1977a; Jones, 1986; Nichols, 1991; Saunders & Holahan, 1997; Zdzinski & Barnes, 2002); (b) individual intrarater reliability (i.e., internal consistency) and interrater reliability in relation to performance evaluation, either between adjudicators of similar expertise or between lesser experienced raters/musicians and experts (e.g., Fiske, 1977b; Hewitt, 2002, 2005; Wapnick, Flowers, Alegant, & Jasinskas, 1993); and (c) factors contributing to improving reliability of performance evaluation (e.g., Davis, 1981; Fiske, 1977a, 1978; Hewitt, 2005; Morrison, Montemayor, & Wiltshire, 2004; Sparks, 1990; Towers, 1980).

With regard to the latter two areas of research on reliability, the results of several studies that have investigated the effects of training in evaluative tasks and experience in music performance on the reliability and accuracy of performance evaluation were mixed. Although some research has demonstrated that greater reliability cannot be attributed to training in adjudicative tasks (Fiske, 1978), expertise in the area of performance (Fiske, 1975; Heath, 1976; Roberts, 1975; Wapnick et al., 1993), performance proficiency (Fiske, 1977b), or theoretical/historical knowledge of music (Fiske, 1977a), others have found higher reliability coefficients for evaluators with greater music experience (Davis, 1981; Fiske, 1977a; Hewitt, 2005; Morrison et al., 2004; Sparks, 1990; Towers, 1980) and for members of older age groups (Towers, 1980).

Although disparate findings in this area of research may speak to the elusive nature of measuring evaluation accuracy, the conflicting results in these previous studies may be explained somewhat when considering the methods that researchers have employed to measure this construct. In some cases, reliability was calculated through correlation coefficients comparing a participant's evaluations of music material with herself or with others in a linear predictive manner (e.g., Bergee, 1993, 1997; Fiske, 1977a; Hewitt, 2002, 2005), whereas other studies compared mean differences between participants' evaluations with those of experts or each other (e.g., Hewitt, 2002, 2005; Wapnick et al., 1993). These two measurement techniques can produce seemingly contradictory results. For example, Hewitt (2005) found high school musicians to be more accurate in their self-evaluations than middle school musicians (i.e., smaller mean differences occurring between their ratings and those of experts); however, middle school musicians evidenced significant moderate correlations with expert evaluations, whereas a lower, nonsignificant correlation between high school musicians and experts was found. Although this significant correlation may have been attributable to greater variance being associated with the responses of both the middle school musicians and experts as compared with the high school musicians' evaluations, the lack of a significant correlation between high school musicians and experts is surprising.

Similar discrepant findings appear to permeate much of the extant literature. Although results of some studies suggest that educational level plays a role in the consistency of performance evaluation (Bergee, 1993, 1997; Byo & Brooks, 1994; Wapnick et al., 2005), authors of other studies have concluded otherwise (Hewitt, 2002; Kostka, 1997; Morrison et al., 2004). Bergee (1993, 1997), for example, found moderately high correlations between faculty and collegiate musicians' evaluations of performances, whereas Kostka (1997) found less agreement when investigating a similar population. Byo and Brooks (1994) and Hewitt (2002) found little to moderate agreement between younger musicians and experts, whereas Morrison et al. (2004) found that middle/junior high and high school students evaluated performances similarly to experts.

Although comparisons among groups of differing levels of expertise provide important information, many studies of this nature often assume that assessments of music performances would be internally consistent and/or stable over time for each individual group. Fiske (1977a), however, found that this was not necessarily the case. In fact, music education graduates' internal consistency in trumpet performance adjudication was found to be highly variable. He also found that internal consistency was greater for brass players than nonbrass players, suggesting an effect of expertise on internal consistency. Similarly, Hewitt (2002) found that adjudication reliability actually diminished over time for junior high musicians, although these findings reflected the degree to which participants' evaluations corresponded to that of experts.

No research since Fiske has directly examined the effect of training on the internal consistency of performance adjudication, although ancillary findings of more recent studies by Wapnick et al. (2005) and Kinney (2004) have evidenced similar trends. Kinney found that students who had participated for a minimum of 2 years in a high school performing ensemble demonstrated greater internal consistency in performance evaluation than nonparticipating students, although both groups' internal consistency was poor compared with that of music faculty. Furthermore, these results suggested that participants' internal consistency diminished when evaluating unfamiliar pieces of music - a variable yet to be explored directly in this area of research.

Given the overall lack of research in the extant literature relating specifically to evaluator internal consistency, a direct investigation of the effects of formal music training on the internal consistency of performance evaluations seemed warranted. Likewise, the effects of prior knowledge of music material on the internal consistency of performance evaluations should also be examined. Considering that Duke and Simmons (2006) have suggested that expert music teachers possess an auditory image of the music material they are teaching or evaluating, it would seem that being familiar with the music material to be evaluated would have a direct effect on evaluation consistency, perhaps manifested across different levels of expertise, as well. Although Wapnick et al. (1993) provided adjudicators with scores during performance evaluation, which may have functioned to make the pieces familiar, no direct investigation comparing the adjudication of familiar and unfamiliar pieces has been undertaken.

This study was designed to expand on the extant research by examining the effects of music training on the internal consistency of performance evaluations. Because the elective nature of music training in public secondary schools usually is centered on performing ensemble participation, participants for this study included college students who had participated in high school performing ensembles and college students who had received no formal training in music outside of elementary and middle school general music experiences. To increase the cross section of music expertise being examined, undergraduate music majors, graduate music majors, and music faculty were included as comparison groups. Furthermore, this study included stimuli that were both familiar and unfamiliar to participants to examine the effects of this factor on internal consistency.

Method

Participants

Participants in this study were undergraduate nonmusic majors (n = 63), undergraduate music majors (n = 42), graduate music majors (n = 17), and music faculty (n = 9). To achieve more precision and clarify findings relating to nonmusic majors, these participants were divided into two groups based on past music experiences. Nonmusic majors who had no previous formal training in music beyond typical elementary/middle general music curricula were classified as nonparticipants (n = 28) and included 5 freshmen, 12 sophomores, 7 juniors, and 4 seniors. Twelve were male; 16 were female. Those who were not music majors but identified themselves as having at least 2 years of formal study in a high school performing ensemble (i.e., band, orchestra, or choir) as their only means of music instruction beyond elementary/middle general music curricula were categorized as ensemble participants (n = 35). Participants in this group included 7 freshmen, 15 sophomores, 8 juniors, and 5 seniors. Sixteen were male; 19 were female. Any participant in the ensemble participant group whose music training occurred longer than 5 years prior to the study was removed from the sample to ensure that music instruction would be relatively recent. Moreover, none of the nonmusic major participants had received any private instruction in music or had participated in any formal music-making experience outside of school performing ensembles. Both groups of nonmusic majors were recruited from undergraduate music appreciation and world music courses from two large state universities.

Undergraduate music majors (n = 42) attended the same two state universities and were recruited from Introduction to Music Education, Music Theory, Music History, Teaching Methods, and Student Teaching Seminar courses. Fourteen were majoring in performance, whereas 26 were majoring in music education. Two students were majoring in composition/theory. Of the undergraduate music majors, 6 were freshmen, 17 were sophomores, 11 were juniors, and 8 were seniors. Twenty were male; 22 were female.

Graduate music majors were masters (n = 13) and doctoral students (n = 4) in music at the same two state universities. These students were majoring in music education (n = 10), performance (n = 3), or conducting (n = 4) and had an average of 5.6 years of professional experience as either teachers or performers. Ten were male; seven were female. Music faculty (n = 9) were collegiate faculty members in music at several National Association of Schools of Music-accredited music programs throughout the United States and had an average of 17.4 years of teaching experience in music. Four were male; five were female. Teaching assignments for faculty were in the areas of music education, performance, conducting, history, and/or theory. Because sample sizes were small for graduate students and faculty, these groups were combined and labeled as experts for analysis.

Stimuli

Stimuli for the study included keyboard performances of Amazing Grace, America the Beautiful, and Edward MacDowell's To a Wild Rose (see Figure 1), which were created by nonmusicians using a MIDI software performance device with a sampled piano timbre (Friedman, Kent, & Dudek, 1992); these performances resulted from a previous study (cf. Kinney, 2004). The software program allowed the performer to have control over several major aspects of musical expression (i.e., dynamics, tempo, note lengths, and articulations), including the potential for subtle variations in each (e.g., phrase structure could be demonstrated through dynamic changes and tempo alterations such as rubato). However, pitch accuracy was controlled by the software in these harmonized, prepared renditions. Results of the previous experiment revealed that the stimuli sounded similar to acoustic piano performances and demonstrated sufficient variability in participants' evaluations in terms of both accuracy and expression. Furthermore, all participants in the prior investigation identified Amazing Grace and America the Beautiful as familiar and To a Wild Rose as unfamiliar.

From the excerpt pool of 145 performances created in the previous experiment, 10 performances of each song (i.e., 30 total excerpts) were chosen randomly for use in this investigation. From these, 5 performances of each song were chosen randomly to repeat at some time during stimuli presentation. Thus, the entire presentation of stimuli comprised 45 excerpts, 15 of which were exact repetitions of a previous excerpt so that internal consistency could be calculated for each participant. To reduce the possibility of order effects, three different orders of presentation were used, with each presentation made up of a different, randomized order of stimuli. Care was taken to ensure that no repeated excerpt immediately followed the excerpt's original presentation. Excerpt length averaged 34.3 seconds and ranged from approximately 29 to 41 seconds each.
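The order-construction procedure described above (5 repeats per song, randomized order, no repeat immediately following its earlier presentation) can be sketched in code. This is a minimal illustration, not the authors' actual procedure; the song and performance identifiers are hypothetical:

```python
import random

def build_presentation_order(excerpts, n_repeat_per_song=5, seed=0):
    """Build one randomized 45-item presentation order.

    `excerpts` maps each song name to a list of 10 performance IDs.
    The result contains all 30 unique excerpts plus 15 repeats
    (5 randomly chosen per song), shuffled so that no excerpt is
    immediately followed by its own repetition.
    """
    rng = random.Random(seed)
    pool = [e for ids in excerpts.values() for e in ids]        # 30 excerpts
    repeats = [e for ids in excerpts.values()
               for e in rng.sample(ids, n_repeat_per_song)]     # 15 repeats
    while True:
        order = pool + repeats
        rng.shuffle(order)
        # Reject any shuffle in which a repeated excerpt directly
        # follows the excerpt's other occurrence.
        if all(order[i] != order[i + 1] for i in range(len(order) - 1)):
            return order

# Hypothetical excerpt pool: 3 songs x 10 performances each
excerpts = {f"song{s}": [f"s{s}_perf{p}" for p in range(10)] for s in range(3)}
order = build_presentation_order(excerpts)
assert len(order) == 45
```

Rejection sampling is used for the adjacency constraint: with only 15 repeated items among 45, most shuffles already satisfy it, so the loop terminates quickly.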

Procedures

Student participants in the study were tested in classroom settings using one of the three stimulus CDs, counterbalanced across groups. Prior to the presentation of stimuli, students were read the following instructions:

This is a project regarding the adjudication of musical performance. You will be asked to evaluate 45 short musical excerpts performed on a synthesizer using a piano sound. These excerpts consist of three songs, two of which you will probably know, Amazing Grace and America the Beautiful, and one you may or may not know. After hearing each performance, please indicate the degree to which you thought the performance was accurate and musically expressive. For accuracy you will see that the scale spans from one, for a performance you consider not accurate at all, to seven, for a performance you consider very accurate. Likewise, for expression you will see that the scale spans from one, for a performance you consider to be without musical expression, to seven, for a performance you consider very musically expressive. The examples occur fairly quickly so you will not have much time to think about your answers - I am simply interested in your first impressions of the excerpt, not whether your judgments are right or wrong. After the last excerpt I ask that you complete the brief demographic questionnaire attached to the last page of your response sheet. Are there any questions?

Before beginning, I will provide a computer-generated performance of each song so that you can be familiar with the excerpts you will hear. Being rendered by a computer, these performances are rhythmically/temporally accurate (i.e., aligned by computer software) but void of dynamic variation (i.e., changes in volume were equalized across all note onsets via the computer software). They serve only to familiarize you with the excerpts. It is important to remember throughout your listening that individual performers of these songs were allowed to interpret them as they wished. In other words, it is okay to judge a song as accurate without it being identical to the computer-generated example you are about to hear in terms of its speed, speed fluctuations, and degree of loudness or softness.

Participants listened to a computer-generated performance of each song immediately following the instructions. For the familiar song exemplars, each was identified by name prior to its computer-generated performance. The unfamiliar exemplar was performed without identifying it by name. Instead, it was labeled as "Song Number 3." Participants were asked again if there were any questions after listening to the three examples, and then the 45 stimuli were presented in their entirety. Upon completion of the stimuli, participants completed a brief questionnaire in which they indicated demographic information, past music experiences, and whether they had heard the familiar and unfamiliar stimuli previously.

Participants indicated responses on prepared evaluation forms that included two 7-point Likert-type scale items for each stimulus. The first scale was for accuracy, where a rating of 1 corresponded to not accurate and a rating of 7 corresponded to very accurate. The second scale was labeled "musical expression" and ranged from 1 (no expression/without musical expression) to 7 (very musically expressive). The entire experiment, including preexperiment instructions and examples and the postexperiment questionnaire, lasted approximately 35 to 40 minutes.

Music faculty participants received written instructions identical to those above and were asked to complete the adjudication of excerpts, on their own, in one sitting. Each faculty participant received two CDs; one contained the examples of the songs to be evaluated and was labeled "Examples - Play First," whereas the other contained the 45 stimuli to be evaluated. A short demographic questionnaire also was completed. All faculty participants reported that the task took between 35 and 45 minutes to complete.

Results

I calculated internal consistency for each individual through Pearson product-moment procedures (r), correlating each individual participant's evaluations on the 15 repeated stimuli. Accuracy and expression were calculated separately. I then calculated descriptive statistics for each group from these individual correlation coefficients (see Table 1) and used two separate repeated measures analyses of variance to compare internal consistency means on accuracy and expression, independently. Both analyses included two between-subjects factors (expertise and the three orders of presentation) and one within-subjects factor (familiar/unfamiliar stimulus conditions). Because 23 of the 26 experts indicated on the postexperiment questionnaire that they had heard the unfamiliar stimulus previously, only three levels of expertise were included in these analyses (nonparticipants, ensemble participants, and music majors). Four music majors also were excluded from analyses for the same reason. Finally, I examined the interrelationship between ratings of accuracy and expression for each category (i.e., familiar and unfamiliar) by each subgroup of adjudicators via separate Pearson product-moment procedures. As in previous analyses, I did not analyze results pertaining to the unfamiliar stimulus for experts because of their prior knowledge of this excerpt.
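The per-participant statistic described above is simply a Pearson product-moment correlation between a rater's two evaluations of the 15 repeated excerpts. As a sketch (the ratings shown are hypothetical, not data from the study):

```python
from math import sqrt

def internal_consistency(first_pass, second_pass):
    """Pearson r between a participant's ratings of the repeated
    excerpts on first hearing and on repetition (computed
    separately for accuracy and for expression)."""
    n = len(first_pass)
    assert n == len(second_pass)
    mx = sum(first_pass) / n
    my = sum(second_pass) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(first_pass, second_pass))
    sx = sqrt(sum((x - mx) ** 2 for x in first_pass))
    sy = sqrt(sum((y - my) ** 2 for y in second_pass))
    return cov / (sx * sy)

# Hypothetical 7-point accuracy ratings for the 15 repeated excerpts
first = [5, 3, 6, 2, 7, 4, 5, 6, 3, 2, 4, 5, 6, 3, 7]
second = [5, 4, 6, 2, 6, 4, 5, 5, 3, 3, 4, 6, 6, 2, 7]
r = internal_consistency(first, second)  # a fairly consistent rater
```

A rater who gave identical ratings both times would obtain r = 1.0; ratings unrelated across hearings would yield r near zero.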

Accuracy

Significant main effects were found for the variables of excerpt familiarity, F(1, 92) = 55.54, p < .001, partial η² = .38; and expertise, F(2, 92) = 399.28, p < .0001, partial η² = .89. Internal consistency means were significantly higher for familiar excerpts overall, although the difference between these means was not large (M = .38 and M = .33, respectively). All three means for expertise were significantly different from each other, p < .001, using post hoc Scheffé procedures for multiple comparisons. Here, music majors' internal consistency was strongest (M = .62), followed by ensemble participants' (M = .35) and then nonparticipants' (M = .10). There was no significant main effect for order of stimuli presentation.
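The reported effect sizes follow from the standard relation partial η² = (df1 × F) / (df1 × F + df2). As a quick arithmetic check (a sketch using the F ratios reported above, which reproduces the reported values to within rounding):

```python
def partial_eta_squared(F, df1, df2):
    """Recover partial eta squared from an F ratio and its
    degrees of freedom: (df1 * F) / (df1 * F + df2)."""
    return (df1 * F) / (df1 * F + df2)

# Main effects on accuracy ratings reported above
familiarity = partial_eta_squared(55.54, 1, 92)  # about .38
expertise = partial_eta_squared(399.28, 2, 92)   # about .90 (reported as .89)
```

The same relation applies to the expression analyses, where the much larger effect size for expertise than for familiarity reflects the relative influence of the two factors.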

Two significant first-order interactions were also found. Perhaps most important, there was a significant two-way interaction between expertise and familiarity, F(1, 92) = 8.32, p < .001, partial η² = .15. Figure 2 illustrates this interaction (means are shown in Table 1). Means for internal consistency increased across levels of expertise for both familiar and unfamiliar excerpts. Also, although all groups' internal consistency decreased when evaluating unfamiliar excerpts, differences between familiar and unfamiliar internal consistency means were smaller for music majors (mean difference = .02) than for ensemble participants and nonparticipants (mean difference = .07, mean difference = .08, respectively; see Table 1). Furthermore, there was a significant two-way interaction between expertise and order, F(4, 92) = 4.90, p < .01, partial η² = .10. Whereas means for ensemble participants and music majors remained stable over the three presentation orders, means for nonparticipants varied across presentations. Specifically, the nonparticipants' internal consistency mean for presentation order two (M = .05) was lower than the nonparticipants' means for presentation orders one (M = .13) and three (M = .11). No other significant first-order or second-order interaction effects were found.

Expression

Analysis of internal consistency means for expression revealed significant main effects for excerpt familiarity, F(1, 92) = 7.08, p < .01, partial η² = .07; and expertise, F(2, 92) = 288.98, p < .001, partial η² = .86. As with results for accuracy, internal consistency means were higher for familiar excerpts, although the difference between these means was not large (M = .42 and M = .40, respectively). Again, all three means for expertise were significantly different, p < .001, using Scheffé post hoc comparisons. Music majors' internal consistency was strongest (M = .64), followed by ensemble participants' (M = .41) and then nonparticipants' (M = .18). There was no significant main effect for order of stimuli presentation.

One significant first-order interaction was found between expertise and excerpt familiarity, F(1, 92) = 4.23, p < .01, partial η² = .09. Figure 3 illustrates this interaction (means are shown in Table 1). As with accuracy, means for internal consistency increased across levels of expertise for both familiar and unfamiliar excerpts. Differences between familiar and unfamiliar internal consistency means were more similar than those evidenced for accuracy. However, nonparticipants had slightly higher means for unfamiliar excerpts, whereas ensemble participants and music majors had higher means for familiar excerpts. There were no other significant first-order or second-order interactions.

Correlational Analyses

Correlations between accuracy and expression for familiar stimuli showed linear trends across groups, with correlation coefficients decreasing as music expertise increased. Nonparticipants' ratings demonstrated the strongest correlation, r = .85, p < .001, followed by ensemble participants', r = .65, p < .001, and music majors', r = .57, p < .001. Experts evidenced the weakest correlation between these factors, and it was the only group not to show a significant relationship, r = .32, p = .21. Correlations between accuracy and expression for unfamiliar stimuli also followed a similar linear trend but were more modest, comparatively. Here, nonparticipants' ratings resulted in the highest correlation, r = .57, p < .01, followed by ensemble participants', r = .48, p < .01, and music majors', r = .30, p = .08.

Discussion

Results of this investigation suggest that internal consistency of performance evaluation is related to music experience and training. Although previous findings in this area have been mixed, data from this study support prior studies with similar results (Bergee, 1993, 1997; Byo & Brooks, 1994; Fiske, 1977a; Hewitt, 2005; Kinney, 2004; Wapnick et al., 2005). In this investigation, internal consistency means reflected linear trends from nonparticipants to experts, with more experienced groups demonstrating greater internal consistency across both accuracy and expression evaluations. Notably, the differences in internal consistency associated with level of expertise were dramatic, with greater expertise corresponding to higher internal consistency. Such differences are responsible, in part, for the large effect size associated with expertise (partial η² = .89, accuracy; partial η² = .86, expression) and indicate that this factor was a salient influence on the consistency of performance evaluation.

Excerpt familiarity proved to influence internal consistency means, as well. Familiar excerpts were associated with higher consistency means for accuracy and expression ratings, although mean differences between familiar and unfamiliar stimuli were modest. Moreover, effect sizes associated with this variable were small in comparison with those found for expertise, indicating less influence of this factor on internal consistency. It is interesting, however, that the effect size for excerpt familiarity found for accuracy (partial η² = .38) was larger than that obtained for expression (partial η² = .07). Considering this, it appears that the degree of familiarity that a listener has with material to be evaluated is of greater importance when consistent listening for accuracy only is required. On the other hand, it appears that consistent evaluations of musical expression are less influenced by familiarity. Perhaps, with regard to consistent evaluations of expression in music, listeners hold a preconceived idea as to the nature and degree of appropriate expressive qualities found in a music performance that embodies general music acculturation and/or personal taste. Thus, evaluations of expression may reflect individual preferences for expressive parameters that remain consistent across familiar and unfamiliar music material, whereas accuracy evaluations may require prior knowledge of music material to develop a basis for consistent judgments.

The significant interactions between expertise and excerpt familiarity for both accuracy and expression also parallel previous findings (Duke & Simmons, 2006; Kinney, 2004) and lend further support to the interpretation of results discussed above. In the case of evaluations of accuracy, internal consistency means all were lower for unfamiliar as compared with familiar excerpts; however, this difference was smaller for music majors than for ensemble participants and nonparticipants. It seems apparent that differences in evaluation consistency for familiar and unfamiliar music material are contingent on degree of music expertise. If this is the case, it follows that the effects of excerpt familiarity on the consistency of performance evaluation may be related inversely to training and expertise in music. Trained musicians may be able to transfer previously acquired music knowledge to novel situations and render consistent judgments concerning the accuracy of performances.

The interaction of expertise and excerpt familiarity on the consistency of the expression evaluations proved somewhat different. Here, internal consistency mean differences were more similar for familiar and unfamiliar excerpts across differing levels of expertise. However, unlike other groups, nonparticipants were more consistent in evaluating the musical expression of the unfamiliar excerpts, although the magnitude of this difference was not large. The similarity of respective groups' consistency means across familiar and unfamiliar stimuli supports the idea that evaluations of expressive qualities in music are less contingent on the listener's prior knowledge of music material than evaluations of accuracy, although degree of expertise may be an overriding factor as to the overall acuity of evaluation consistency.

Results pertaining to familiarity deserve replication and further investigation, because little research has examined the effects of this variable on performance adjudication specifically. Although findings of this study are congruent with those of Kinney (2004) and Duke and Simmons (2006), some studies (Wapnick et al., 1993; Wapnick et al., 2005) have produced ancillary findings suggesting the opposite - more familiar repertoire might produce more divergent biases concerning performance criteria and, as a consequence, less consistent evaluations. Considering that this study was limited to compositions associated with the Instant Pleasure software device, future studies may wish to investigate familiarity using originally composed material. Because a majority of those in the expert group recognized the unfamiliar stimulus in this study, originally composed material would allow for experts to be investigated with this type of adjudicative task. Because varying levels of familiarity with repertoire often exist at adjudicated events, this line of research would prove informative for music educators.

Although there was an attempt to control for possible order effects through the use of three different presentation orders, a significant interaction effect between order and expertise was manifested, nonetheless. Finding an order effect is consistent with previous research in this and related areas (Bergee, 2006; Bergee & McWhirter, 2005; Bergee & Westfall, 2005; Duerksen, 1972; Geringer, Madsen, MacLeod, & Droe, 2006; Wapnick et al., 1993). Unique to this investigation, however, was that order interacted with expertise on consistent evaluations of music accuracy. Although internal consistency was stable across all stimuli presentation orders in the more music-experienced groups, nonparticipants evidenced greater variability across stimuli presentation orders. For this group, internal consistency was influenced by presentation order, suggesting that context (i.e., recency of similar music excerpts) had a pronounced effect on consistent judgments of music accuracy. Thus, when confronted with this type of music task, those with less music experience may be more apt to evaluate a performance in comparison with a similar one in close proximity rather than relying on what Duke and Simmons (2006) refer to as a preconceived, "vivid auditory image . . . which detect[s] even the smallest deviations from the ones they have in their mind" (p. 14). Further research examining the interaction of these variables more directly is necessary to test this hypothesis.

Finally, correlations between ratings of accuracy and expression followed linear trends, with less of a relationship between the accuracy and expression ratings of the more experienced musicians for both familiar and unfamiliar stimuli. Results of this nature imply that those with greater experience are able to evaluate accuracy and expression independently, whereas those with less training tend to give more global ratings. Thus, experts seem better able to evaluate elements of a music performance as discrete components when faced with an adjudicative task. Conversely, those with less training may judge music performances in a holistic manner, even when asked to evaluate discrete performance parameters. Further research could examine the nature of these findings to determine how judgments concerning performance are made by less experienced musicians.

These results, which should be replicated with acoustic instruments, suggest that music expertise has an influence on the internal consistency of performance evaluation and that the familiarity of the music material to be evaluated also plays a role, albeit a smaller one. Furthermore, these two factors interact. Those with more music experience were more consistent in their evaluations of accuracy for both familiar and unfamiliar music material, whereas those with less music training were more consistent when evaluating the accuracy of familiar, as opposed to unfamiliar, music material. This is perhaps explained by previous research that has suggested that experts often tend to have an auditory image when evaluating familiar music that guides their evaluation of performance and that they transfer existing music knowledge to unfamiliar music material (Duke & Simmons, 2006). On the other hand, findings of this nature might simply be an artifact of task familiarity. Being familiar with this type of task and having had opportunities to develop it, those with extensive music training might take the task more seriously, be better able to sustain their concentration over time, or simply employ other strategies to remain internally consistent. Further research into the strategies these individuals use to attain consistency of performance evaluation would prove useful to a profession where this skill is paramount.
