March 2002
Volume 15, Number 3
In Defense of Self Reports
It has been more than 20 years since the validity of self-report data was first seriously called into question. Back then, Nisbett and Wilson (1977) offered a critical examination of the many weaknesses of verbal reports of mental processes. Their review of self-report data encompassed many types of self-report methods, and invited the question of how self-reports should be best used given their limitations.
More recently, self-reports have again been called into question, this time in a New York Times article featuring psychophysicist Linda Bartoshuk.1 But in contrast to Nisbett and Wilson, Bartoshuk's comments focus almost exclusively on adjective rating scales, and she suggests that perhaps they ought not to be used at all. We believe that Bartoshuk's views, as reported in the article, are extreme, and we argue instead that adjective rating scale data do have a place within psychology.
The proper question is not whether such rating scales are appropriate, but when and how they are to be best used. There is a traceable history in psychology that points to the value of this method of data collection. For example, psycholinguists have long used semantic differential tasks in their examinations. In a broad, cross-cultural examination of the affective meanings of colors, Adams and Osgood (1973) successfully explored several trends in the attribution of affect, through the use of adjective rating scales. Bartoshuk neglects such benefits of adjective rating scales in her criticisms.
Bartoshuk has said that the use of adjective rating scales can be a "dreadful mistake" (Bartoshuk, 2000b). However, "mistakes" can be prevented with a few simple precautions. One of Bartoshuk's main criticisms of this form of self-report data is that internal experiences are subjective. For example, one person's "very hungry" (e.g., a famine victim) might mean something very different from another person's "very hungry" (e.g., a CEO late for dinner).
This problem does not annul the value of subjective reports. In fact, it is the exact reason that psychology so values random assignment of participants to conditions. With proper randomization, there should be roughly the same number of famine victims, tardy CEOs, or any other type of person Bartoshuk can imagine in each experimental condition. In addition, there are times when we want or need to know people's subjective theories about their situations; self-report data are uniquely suited to tasks such as obtaining individuals' personal theories about their experiences and their feelings. Through the proper use of randomization, adjective rating scales can be valid and useful methods by which to learn about these beliefs.
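The logic of random assignment can be illustrated with a small simulation. The sketch below is ours and purely hypothetical: it assumes each participant carries a personal "anchor" (how high or low he or she tends to rate on an adjective scale), draws those anchors from an arbitrary normal distribution, and deals participants at random into two conditions. The function names, sample size, and seeds are illustrative assumptions, not part of any study cited here.

```python
import random


def assign_randomly(participants, n_conditions, seed=0):
    """Shuffle participants, then deal them round-robin into conditions."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return [shuffled[i::n_conditions] for i in range(n_conditions)]


def mean(values):
    return sum(values) / len(values)


# Hypothetical population: each number is one person's idiosyncratic
# rating anchor (an assumption made only for this illustration).
population_rng = random.Random(42)
anchors = [population_rng.gauss(0.0, 1.0) for _ in range(2000)]

groups = assign_randomly(anchors, n_conditions=2, seed=1)

# Difference between the conditions' average anchors. With random
# assignment and a reasonably large sample this gap should be small
# relative to the spread of anchors across individuals.
gap = abs(mean(groups[0]) - mean(groups[1]))
print(f"group sizes: {len(groups[0])}, {len(groups[1])}; anchor gap: {gap:.3f}")
```

Under these assumptions, individual differences in scale use are spread about evenly across conditions, so they add noise but should not systematically favor one condition over another.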
In addition to randomization, repeated measures designs provide further protection against the concern that different people use adjective rating scales differently. In a repeated measures design, each participant's answer is compared to his or her own later answer, not to a different person's responses. Kernis (1995), for example, uses a repeated measures design to evaluate participants' self-esteem stability. In this case and others, the specific numbers on a scale are far less important than whether subsequent ratings increased, decreased, or did not change (Spector, 1994).
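A toy example may make the repeated measures point concrete. The numbers below are invented for illustration: three hypothetical participants use the scale very differently (personal offsets of 1, 4, and 6), yet each experiences the same underlying change. Comparing each person only with himself or herself removes the offset. The helper names are our own assumptions.

```python
def change_scores(before, after):
    """Within-person change: each later rating minus that person's earlier rating."""
    return [a - b for b, a in zip(before, after)]


def direction(delta):
    """Classify a change score as increased, decreased, or unchanged."""
    if delta > 0:
        return "increased"
    if delta < 0:
        return "decreased"
    return "unchanged"


# Hypothetical data: personal offsets reflect idiosyncratic scale use.
offsets = [1, 4, 6]
baseline, shift = 2, 2

before = [offset + baseline for offset in offsets]         # [3, 6, 8]
after = [offset + baseline + shift for offset in offsets]  # [5, 8, 10]

deltas = change_scores(before, after)
directions = [direction(d) for d in deltas]
print(deltas, directions)
```

The raw ratings differ widely across the three people, but the change scores agree, which is the sense in which the direction of change matters more than the specific numbers on the scale. Real data would of course be noisier than this idealized case.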
Researchers can, and should, protect against the drawbacks of all types of self-report scales by examining how well these measures converge with other methods of data collection. The self-esteem literature benefited from the finding that self-report data correlated well with data collected by other means (Baumeister, 1996; Coopersmith, 1967; Kernis, 1995; Rosenberg, 1979), allowing for the expansion and clarification of the self-esteem concept. Many important lessons can also be learned when different methods that are thought to be measuring the same phenomenon do not empirically converge.
Banaji and colleagues' work with the Implicit Association Test (IAT) provides a good example of a line of research benefiting from both consistent and contradictory results between self-reports and other measures. Through recent work examining multiple measures of attitudes, a clearer understanding of the differences between implicit and explicit attitudes has emerged. Cunningham, Preacher, and Banaji (2001) found that several implicit measures tapped into a general implicit prejudice construct, distinct from explicit attitudes. Through both convergence and divergence, researchers have been better able to determine what the IAT is, as well as what it is not. More generally, it is clear that psychological constructs can be more finely delineated through converging and diverging lines of evidence.
Finally, it seems arbitrary to cite adjective rating scales as imperfect tools and advocate an end to their use on that basis, since realistically, all methods are flawed in some way. Although Bartoshuk advocates the use of other self-report measurements over adjective rating scales (e.g., Labeled Magnitude Scales, in Carpenter, 2000), Bartoshuk herself acknowledges that "it is unlikely that we will ever find a standard that is genuinely perceived identically by all subjects" (Bartoshuk, 2000a, p. 450). Thus, a dismissal of self-report data collected by adjective rating scales seems illogical in the face of other, similarly flawed methods. It does not make sense to say that rating scales are more flawed than other methods, such as neuroimaging, without reference to the specific goals of the researcher. Avoiding the use of this technique would limit what we can learn by eschewing a unique method of data collection.
We would do well to take a lesson from Nisbett and Cohen (1996) who, in their "culture of honor" research, used a multi-method approach consisting of physiological measures and behavioral measures, as well as self-reports, to make a powerful claim for their theory. Each (imperfect) method contributed to the creation of a clear picture of the phenomenon they were studying, resulting in an impressively tight and cogent profile of the culture of honor.
Throughout the history of our field, potentially useful methods have been ignored or left by the wayside due to a rise in popularity of a single, alternative measure (e.g., depression measures and the Beck Depression Inventory; see Beck, Rial & Rickels, 1974). At times, this focus on a single measure has been so limited that the measure itself has become equated with the construct, instead of simply serving as a tool with which to gain data about the topic. This seems to us to be one of the more shortsighted, and least productive, ways out of an intellectual dilemma; such thinking promotes a mindset that only takes into account those measures and findings that conform to our preconceptions. Learning and growth are not possible with such an orientation, and some of the greatest findings occur when we are least expecting them. Indeed, an extensive program of research investigating the powerful influence of expectancy effects on behavior was launched by Rosenthal's discovery of experimenter bias in his dissertation (Rosenthal, 1993).
There is never an excuse for using a measure that we, as scientists, know invites biases, produces misleading results, or otherwise threatens the integrity of our field. However, there is also no excuse for dismissing a potentially important source of insight into human experience just because it is inconvenient or it requires care to put into practice. Instead, if we as researchers remain open to all methods, even those that are flawed, then we will continue to improve both our methods and our science.
1"Researcher challenges a host of psychological studies," by Erica Goode, The New York Times, January 2, 2001, p. F1.
REFERENCES
Adams, F M & Osgood, C E (1973). A cross-cultural study of the affective
meanings of color. Journal of Cross-Cultural Psychology, 4, 135-156.
Bartoshuk, L M (2000a). Comparing sensory experiences across individuals: Recent
psychophysical advances illuminate genetic variation in taste perception.
Chemical Senses, 25, 447-460.
Bartoshuk, L M (2000b, August). From sweets to hot peppers: Things don't taste
the same to everyone. Talk presented at the meeting of the American
Psychological Association, Washington, D.C.
Baumeister, R F (1996). Relation of threatened egotism to violence and aggression:
The dark side of high self-esteem. Psychological Review, 103, 5-33.
Beck, A T, Rial, W Y & Rickels, K (1974). Short form of Depression Inventory:
Cross-validation. Psychological Reports, 34 (3), 1184-1186.
Carpenter, S (2000). A taste expert sniffs out a long-standing measurement
oversight. Monitor on Psychology, 31, 20-21.
Coopersmith, S (1967). The antecedents of self-esteem. San Francisco: Freeman.
Cunningham, W A, Preacher, K J, & Banaji, M R (2001). Implicit attitude measures:
Consistency, stability, and convergent validity. Psychological Science, 12,
163-170.
Kernis, M H (Ed.) (1995). Efficacy, agency, and self-esteem. New York: Plenum.
Nisbett, R E & Cohen, D (1996). Culture of honor: The psychology of violence in
the South. Boulder, CO: Westview Press.
Nisbett, R E & Wilson, T D (1977). Telling more than we can know: Verbal reports
on mental processes. Psychological Review, 84, 231-259.
Rosenberg, M (1979). Conceiving the self. New York: Basic Books.
Rosenthal, R (1993). Interpersonal expectations: Some antecedents and some
consequences. In PD Blanck (Ed.), Interpersonal expectations: Theory, research,
and applications. New York: Cambridge University Press.
Spector, P E (1994). Using self-report questionnaires in OB research: A comment on
the use of a controversial method. Journal of Organizational Behavior, 15,
385-392.





