In Defense of Self-Reports
BY REBECCA NORWICK, Y. SUSAN CHOI, & TAL BEN-SHACHAR
Special to the Observer
It has been more than 20 years since the validity of self-report data was first seriously called into question. Back then, Nisbett and Wilson (1977) offered a critical examination of the many weaknesses of verbal reports of mental processes. Their review of self-report data encompassed many types of self-report methods, and invited the question of how self-reports should be best used given their limitations.
More recently, self-reports have again been called into question, this time in a New York Times article featuring psychophysicist Linda Bartoshuk.1 But in contrast to Nisbett and Wilson, Bartoshuk’s comments focus almost exclusively on adjective rating scales, and she suggests that perhaps they ought not to be used at all. We believe that Bartoshuk’s views, as reported in the article, are extreme, and we argue instead that adjective rating scale data do have a place within psychology.
The proper question is not whether such rating scales are appropriate, but when and how they are to be best used. There is a traceable history in psychology that points to the value of this method of data collection. For example, psycholinguists have long used semantic differential tasks in their examinations. In a broad, cross-cultural examination of the affective meanings of colors, Adams and Osgood (1973) successfully explored several trends in the attribution of affect, through the use of adjective rating scales. Bartoshuk neglects such benefits of adjective rating scales in her criticisms.
Bartoshuk has said that the use of adjective rating scales can be a “dreadful mistake” (Bartoshuk, 2000b). However, “mistakes” can be prevented with a few simple precautions. One of Bartoshuk’s main criticisms of this form of self-report data is that internal experiences are subjective. For example, one person’s “very hungry” (e.g., a famine victim) might mean something very different from another person’s “very hungry” (e.g., a CEO late for dinner).
This problem does not annul the value of subjective reports. In fact, it is the exact reason that psychology so values random assignment of participants to conditions. With proper randomization, there should be the same number of famine victims, tardy CEOs, or any other type of person Bartoshuk can imagine in any given experimental condition. In addition, there are times when we want or need to know people’s subjective theories about their situations; self-report data is uniquely suited to tasks such as obtaining individuals’ personal theories about their experiences and their feelings. Through the proper use of randomization, adjective rating scales can be valid and useful methods by which to learn about these beliefs.
In addition to randomization, repeated measures designs provide a further safeguard against this concern about adjective rating scales. In a repeated measures design, each participant’s answer is compared to his or her later answer, not to a different person’s responses. Kernis (1995), for example, uses a repeated measures design to evaluate the stability of participants’ self-esteem. In this case and others, the specific numbers on a scale matter far less than whether subsequent ratings increased, decreased, or did not change (Spector, 1994).
Researchers can, and should, protect against the drawbacks of all types of self-report scales by examining how well these measures converge with other methods of data collection. The self-esteem literature benefited from the finding that self-report data correlated well with data collected by other means (Baumeister, 1996; Coopersmith, 1967; Kernis, 1995; Rosenberg, 1979), allowing for the expansion and clarification of the self-esteem concept. Many important lessons can also be learned when different methods that are thought to be measuring the same phenomenon do not empirically converge.
Banaji and colleagues’ work with the Implicit Associations Test (IAT) provides a good example of a line of research benefiting from both consistent and contradictory results between self-reports and other measures. Through recent work examining multiple measures of attitudes, a clearer understanding of the differences between implicit and explicit attitudes has emerged. Cunningham, Preacher, and Banaji (2001) found that several implicit measures tapped into a general implicit prejudice construct, distinct from explicit attitudes. Through both convergence and divergence, researchers have been better able to determine what the IAT is, as well as what it is not. More generally, it is clear that psychological constructs can be more finely delineated through converging and diverging lines of evidence.
Finally, it seems arbitrary to cite adjective rating scales as imperfect tools and advocate an end to their use on that basis, since realistically, all methods are flawed in some way. Although Bartoshuk advocates the use of other self-report measurements over adjective rating scales (e.g., Labeled Magnitude Scales, in Carpenter, 2000), Bartoshuk herself acknowledges that “it is unlikely that we will ever find a standard that is genuinely perceived identically by all subjects” (Bartoshuk, 2000a, p. 450). Thus, a dismissal of self-report data collected by adjective rating scales seems illogical in the face of other, similarly flawed methods. It does not make sense to say that rating scales are more flawed than other methods, such as neuroimaging, without reference to the specific goals of the researcher. Avoiding the use of this technique would limit what we can learn by eschewing a unique method of data collection.
We would do well to take a lesson from Nisbett and Cohen (1996) who, in their “culture of honor” research, used a multi-method approach consisting of physiological measures and behavioral measures, as well as self-reports, to make a powerful claim for their theory. Each (imperfect) method contributed to the creation of a clear picture of the phenomenon they were studying, resulting in an impressively tight and cogent profile of the culture of honor.
Throughout the history of our field, potentially useful methods have been ignored or left by the wayside due to a rise in popularity of a single, alternate measure (e.g., depression measures and the Beck Depression Inventory: see Beck, Rial & Rickets, 1974). At times, this focus on a single measure has been so limited that the measure itself has become equated with the construct, instead of simply serving as a tool with which to gain data about the topic. This seems to us to be one of the more shortsighted, and least productive, ways out of an intellectual dilemma; such thinking promotes a mindset that takes into account only those measures and findings that conform to our preconceptions. Neither learning nor growth is possible with such an orientation, and some of the greatest findings occur when we are least expecting them. Indeed, an extensive program of research investigating the powerful influence of expectancy effects on behavior was launched by Rosenthal’s discovery of experimenter bias in his dissertation (Rosenthal, 1993).
There is never an excuse for using a measure that we, as scientists, know invites biases, produces misleading results, or otherwise threatens the integrity of our field. However, there is also no excuse for dismissing a potentially important source of insight into human experience just because it is inconvenient or it requires care to put into practice. Instead, if we as researchers remain open to all methods, even those that are flawed, then we will continue to improve both our methods and our science.
1“Researcher challenges a host of psychological studies,” by Erica Goode, The New York Times, January 2, 2001, p. F1.
Adams, F M & Osgood, C E (1973). A cross-cultural study of the affective meanings of color. Journal of Cross-Cultural Psychology, 4, 135-156.
Bartoshuk, L M (2000a). Comparing sensory experiences across individuals: Recent psychophysical advances illuminate genetic variation in taste perception. Chemical Senses, 25, 447-460.
Bartoshuk, L M (2000b, August). From sweets to hot peppers: Things don’t taste the same to everyone. Talk presented at the meeting of the American Psychological Association, Washington, D.C.
Baumeister, R F (1996). Relation of threatened egotism to violence and aggression: The dark side of high self-esteem. Psychological Review, 103, 5-33.
Beck, A T, Rial, W Y & Rickets, K (1974). Short form of Depression Inventory: Cross-validation. Psychological Reports, 34 (3), 1184-1186.
Carpenter, S (2000). A taste expert sniffs out a long-standing measurement oversight. Monitor on Psychology, 31, 20-21.
Coopersmith, S. (1967). The antecedents of self-esteem. San Francisco: Freeman.
Cunningham, W A, Preacher, K J, & Banaji, M R (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12, 163-170.
Kernis, M H (Ed.) (1995). Efficacy, agency, and self-esteem. New York: Plenum.
Nisbett, R E & Cohen, D (1996). Culture of honor: The psychology of violence in the South. Boulder, CO: Westview Press.
Nisbett, R E & Wilson, T D (1977). Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84, 231-259.
Rosenberg, M (1979). Conceiving the Self. New York: Basic Books.
Rosenthal, R (1993). Interpersonal expectations: Some antecedents and some consequences. In PD Blanck (Ed.), Interpersonal expectations: Theory, research, and applications. Paris: Cambridge University Press.
Spector, P E (1994). Using self-report questionnaires in OB research: A comment on the use of a controversial method. Journal of Organizational Behavior, 15, 385-392.
Self-Reports and Across-Group Comparisons
A Way Out of the Box
BY LINDA BARTOSHUK
Special to the Observer
Norwick, Choi and Ben-Shachar’s passionate support of self-reports is music to my ears. Self-reports are the foundation of my own area, psychophysics; it is crucial that the inferences we draw from them be valid. The problem I have been addressing concerns a misuse of self-reports that can lead to erroneous conclusions.1 Let me briefly summarize the problem as I see it along with the solutions that are under examination.
We cannot share experiences, so we cannot make direct comparisons of sensory or hedonic perceived intensities across individuals. However, we can make such comparisons indirectly if we can identify a standard known to be equally intense to all. We need only ask subjects to rate the experience of interest relative to the standard. But we cannot ever prove we have such a standard because we cannot share experiences directly.
Is there a way out of this box? There is. We can focus on comparisons across groups. We do not need to assume that the standard is equally intense to every individual; we need only assume that the standard and the stimuli of interest are not systematically related. This means that, on average, the standard will be the same to each group. We can then express the perceived intensities of the stimuli of interest relative to the standard. This gives us valid comparisons across the groups.
Magnitude matching. We’ve been able to trace the use of a sensory standard back to the 1960s, and it may well be even older. This was formalized into the method of “magnitude matching” by Marks and Stevens (Marks & Stevens, 1980; Stevens & Marks, 1980; Marks et al., 1988). We use this method currently to study genetic variation in taste. Since the 1930s we have known that some individuals cannot taste PROP (6-n-propylthiouracil) while others perceive it to be bitter. More recently, we discovered that the tasters can be subdivided into medium and supertasters based on how bitter PROP tastes to them. This is related to tongue anatomy. Supertasters have the most fungiform papillae (structures that house taste buds); nontasters have the fewest.
We study this genetic variation by asking subjects to rate the intensities of tastes and tones on a common scale. By assuming that the ability to hear is not related to the ability to taste we can express the taste intensities relative to the tones; this allows us to compare average taste intensities across nontasters, medium tasters and supertasters of PROP.
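The logic of magnitude matching can be sketched numerically. The following is a toy illustration with invented ratings, not real data: each hypothetical subject rates a taste and a tone on the same magnitude scale, and because hearing is assumed unrelated to PROP taster status, the taste/tone ratios can be averaged and compared across groups even though each subject uses the raw scale differently.

```python
# Toy sketch of magnitude matching (invented numbers, hypothetical subjects).
# Each subject supplies a (taste, tone) pair of ratings on a common scale.
# Expressing taste relative to tone removes each subject's idiosyncratic
# scale use, under the assumption that taste and hearing are unrelated.

def relative_taste(ratings):
    """Express each subject's taste rating relative to their tone rating."""
    return [taste / tone for taste, tone in ratings]

def mean(xs):
    return sum(xs) / len(xs)

# (taste, tone) rating pairs; numbers are invented for illustration only.
nontasters = [(20, 40), (15, 30), (25, 50)]
supertasters = [(60, 40), (45, 30), (80, 50)]

print(mean(relative_taste(nontasters)))    # 0.5
print(mean(relative_taste(supertasters)))  # about 1.53
```

The group difference in the taste/tone ratios survives even though, in the raw numbers, the two groups might have used the scale quite differently.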
Supertasters perceive the most intense tastes. They also perceive the most intense oral burn (chilis) and oral touch (fats in foods) because fungiform papillae are innervated by pain and touch as well as taste fibers (see Bartoshuk, 2000 and Prutkin et al., 2000 for reviews of our work on nontasters, medium tasters and supertasters). Needless to say, these three groups live in very different taste worlds.
Intensity labels. Another approach to making comparisons across subjects/groups has been used much more extensively than the approach above. This method uses intensity labels (“That tastes very strong to me; is it very strong to you?”). There are many different kinds of labeled scales (e.g., Likert, category, visual analogue) with a variety of properties. The oldest category scale, to the best of my knowledge, dates back to the Greek astronomer Hipparchus (190-120 B.C.) who classified stars into six categories by their brightness. One of the most common modern types is the visual analogue scale (VAS). Although its roots go back farther, it came into common use in the 1960s (e.g., Aitken, Ferres, & Gedye, 1963) and was soon applied widely to measure feelings (Aitken, 1969), pain (Huskisson, 1974), appetitive sensations like hunger (Silverstone & Stunkard, 1968), etc. It is typically described as a line labeled at its ends with the “minimum and the maximum rating” for a particular experience (Hetherington & Rolls, 1987). One form of the VAS familiar to patients is a line that extends from “no pain” to the “worst pain you have ever experienced” in common use in hospitals.
The various scales labeled with intensity descriptors were usually devised to make within subject comparisons or comparisons across groups when the members of the groups have been randomly assigned. Thus we can use the pain scale to observe a change in pain following a medication in a given patient or we can compare the efficacy of two medications if we randomly assign patients to the medication groups. This is just the type of proper usage of scales that Norwick, Choi and Ben-Shachar noted in their letter. But what happens when investigators decide to ask questions that require comparisons across groups? Suppose we want to compare perceived pain intensities in females and males. And suppose that our female subjects have all borne children and our male subjects have fortunately escaped life’s nastier conditions (kidney stones, etc.). Do we really believe that the “worst pain you have ever experienced” is the same, on average, to both groups? Of course not. And this is the crux of the problem.
Intensity adjectives (weak, moderate, strong, etc.) and the adverbs that modify them (very, extremely, etc.) do not denote absolute intensity until they are applied to a domain (i.e., provided with the noun they modify). S.S. Stevens made this point with a wonderful remark more than forty years ago (Stevens, 1958): “Mice may be called large or small, and so may elephants, and it is quite understandable when someone says it was a large mouse that ran up the trunk of the small elephant.” We now know quite a lot about the properties of adjective/adverb intensity descriptors but there is still a lot to learn. Let me suggest where I think this is going. Think of a scale labeled with intensity descriptors (e.g., weak, moderate, strong, very strong) printed on elastic. Imagine that the scale can be stretched or compressed to fit any given domain. The relative spacing of the descriptors would be constant but the absolute size of the domain would vary. For example, think about the odor of roses and the pain of migraine headaches. We can smell a weak or strong rose odor. We can also experience a weak or strong migraine. But clearly, “weak” and “strong” denote very different absolute intensities depending on whether we are speaking of roses or migraines.
Now apply the same reasoning to different people experiencing a given domain in very different ways: e.g., nontasters, medium tasters and supertasters. The absolute intensities of the descriptors on scales will vary depending on the experience of the individual (as suggested in the pain example above). On average, a “very strong” taste to a supertaster will be a more intense sensation than a “very strong” taste to a nontaster. This was tested by asking subjects to rate tastes, taste intensity descriptors and tones on a common intensity scale. Not surprisingly, the supertasters matched a “very strong” taste to a louder sound than did the nontasters.
Figure 1 illustrates the consequences of using labeled intensity scales to make across-group comparisons when the descriptors do not denote the same absolute intensity, on average, to all groups.
The functions labeled A through E are drawn to reflect differences in the size of PROP effects. For example, function A reflects a very large effect like that produced by quinine (bitterness), and functions B through D reflect smaller effects like those for sucrose or NaCl (Ko et al., 2000). Function E reflects a sensation that is not correlated with PROP (e.g., oral burn from a non-tongue location; see Karrer et al., 1992). The right side of Figure 1 shows what happens under the erroneous assumption that “very strong” indicates the same absolute intensity to all. For the largest effects (A and B), assuming that “very strong” is the same to all reduces the size of the PROP effect. However, for C, the PROP effect is completely lost. Note that D and E now erroneously appear to suggest that nontasters actually perceive more intense sensations than do medium and supertasters: a reversal artifact.
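The reversal artifact can be reproduced with a few lines of arithmetic. The numbers below are invented for illustration: suppose “very strong” corresponds to a larger absolute intensity for supertasters than for nontasters, because their entire taste world is more intense. A stimulus that supertasters truly perceive as stronger can then come out *weaker* on the labeled scale.

```python
# Toy illustration of the reversal artifact (all numbers invented).
# Assumed absolute intensity denoted by "very strong" for each group:
VERY_STRONG = {"nontaster": 40.0, "supertaster": 80.0}

def labeled_rating(true_intensity, group):
    """Rating on a scale anchored at 'very strong', analyzed as if that
    label meant the same absolute intensity to everyone."""
    return true_intensity / VERY_STRONG[group]

# True absolute intensities: supertasters really do perceive more (36 > 30).
true_nt, true_st = 30.0, 36.0

print(labeled_rating(true_nt, "nontaster"))    # 0.75
print(labeled_rating(true_st, "supertaster"))  # 0.45 -- apparent reversal
```

Because the supertasters’ anchor is stretched to cover a more intense range, their genuinely stronger sensation lands lower on the labeled scale, producing exactly the kind of spurious reversal shown for functions D and E.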
We have identified some instances in which effects are lost or the reversal artifact appears to have occurred in PROP studies (Bartoshuk et al., 2002, in press). How common is the invalid use of labeled scales to make comparisons across groups? This is hard to evaluate without a careful study of several journals. So far in my screening of several volumes of one journal (over 800 studies), 25 percent of the papers including human psychophysical scaling made across-group comparisons where the group assignments were not random. Some of these comparisons concerned only minor points in the study; however, in a few cases the main point of the paper was involved.
Other psychophysicists have worried about this problem. Narens and Luce (1983) disputed some economists’ belief in the “intercomparability of utility,” the idea that the magnitude of value can be compared across individuals. Biernat and Manis (1994) noted that “very tall” indicates a different height when applied to a woman than when applied to a man. Birnbaum (1999) addressed the absurd results that can occur when comparisons are made across groups using different contexts to make judgments. Birnbaum asked different groups to judge the size of the numbers 9 and 221 on a 10-point scale ranging from “very very small” to “very very large.” Because of the different contexts brought to mind by the two numbers, the average scale values for 9 and 221 were 5.13 and 3.10, respectively, leading to the apparent conclusion that 9 > 221.
Frequency and probability descriptors. Frequency and probability descriptors can also reflect different magnitudes across subjects/groups (e.g., Hakel, 1968; Kong, Barnett, Mosteller, & Youtz, 1986; Mapes, 1979; Simpson, 1944; Wallsten & Budescu, 1995; Wallsten, Budescu, Rapoport, Zwick, & Forsyth, 1986). The movie “Annie Hall” (Allen & Brickman, 1977) immortalized this when Alvy (Woody Allen) and Annie (Diane Keaton), in split-screen visits with their respective psychiatrists, describe the frequency with which they have sex: Alvy says “Hardly ever,” Annie says “Constantly,” and then both add, “three times a week.”
What can we do about this? The key to making valid across-group comparisons is to express the sensations of interest relative to a standard (sensory or adjective/adverb intensity descriptor) in an unrelated domain. Note that this is just what scales like the VAS fail to do. When the scale is labeled in terms of the sensation to be measured (e.g., “none” to “worst pain experienced”), it should not be used to make across-group comparisons when the groups vary in some systematic way (e.g., sex, age, clinical status, taste anatomy, etc.).
We are exploring a variety of ways to correct this problem with labeled scales. For example, we have tried a labeled scale anchored at the top with “strongest imaginable sensation of any kind.” This scale works for genetic variation in taste; that is, the differences across nontasters, medium tasters and supertasters are similar whether we use this scale or magnitude matching with a sound standard (Bartoshuk, Green et al., 2000). We have most recently begun to explore using an experience common to most as an anchor for taste studies: brightness of the sun (Fast et al., 2001).
As Norwick, Choi and Ben-Shachar noted, one of the ways we can check the validity of scaling is to check conclusions with other techniques. The connection between tongue anatomy and perceived taste intensity provides one way to do this in taste. We can ask subjects to rate the bitterness of PROP with a variety of scales and correlate the perceived bitterness with density of fungiform papillae (the structures that house taste buds). If a scale is providing valid comparisons across nontasters, medium tasters and supertasters, it will show this association (Prutkin et al., 2000). There is much yet to do. We need to detect and correct any errors that have gotten into the scientific literature. We also need to search for more ways to improve the validity of comparisons across groups. We welcome Norwick, Choi and Ben-Shachar to this enterprise.
1 Much of the recent work my students, colleagues and I have been doing in this area is available in abstracts (Bartoshuk, Fast et al., 2000; Bartoshuk, Duffy, Fast, Green, & Prutkin, 2001; Bartoshuk, Duffy, Fast, Green, & Snyder, 2001; Fast, Green, Snyder, & Bartoshuk, 2001; Snyder, Duffy, Fast, Hoffman et al., 2001; Snyder, Duffy, Fast, Weiffenbach, & Bartoshuk, 2001) but the full-length papers are still in press (Bartoshuk, Duffy, Fast, Green, & Snyder, 2002, in press; Fast, Duffy, & Bartoshuk, 2002, in press). However, one of the papers in press was presented at a conference and the lecture is available on the web (www.calliscope.com/danone/frameset.html?1). Incidentally, presentations at meetings (Bartoshuk, 1999, 2000) and the resulting press coverage (Carpenter, 2000; Goode, 2001) helped us find others who have encountered the same problem in their fields. I briefly note their contributions as well. I suspect that there are many more whose work I do not yet know. I hope they will get in touch.
Aitken, R C B (1969). Measurement of feelings using visual analogue scales. Proceedings of the Royal Society of Medicine, 62, 989-993.
Aitken, R C B, Ferres, H M, & Gedye, J L (1963). Distraction from flashing lights. Aerospace Medicine, 34, 302-306.
Allen, W, & Brickman, M. (1977). Annie Hall (W. Allen, Director).
Bartoshuk, L M (2000). Psychophysical advances aid the study of genetic variation in taste. Appetite, 34, 105.
Bartoshuk, L M (1999). Presidential Symposium (Part Biology, Part Social: How Eating Habits Are Learned). “Listening to patients: What experiments of nature can tell us about taste.” Paper presented at the American Psychological Society 12th Annual Convention, Miami, FL.
Bartoshuk, L M (2000). Neal Miller Distinguished Lecture. “Do You Hear What I Hear? Or Taste?” Paper presented at the 108th Annual Convention of the American Psychological Association, Washington, D.C.
Bartoshuk, L M, Duffy, V B, Fast, K, Green, B G, & Prutkin, J M (2001). Invalid sensory comparisons across groups: Examples from PROP research. Chemical Senses, 26, 761-762.
Bartoshuk, L M, Duffy, V B, Fast, K, Green, B G, & Snyder, D J (2001). The General Labeled Magnitude Scale provides valid measures of genetic variation in taste and may be a universal psychophysical ruler. Appetite, 37, 126.
Bartoshuk, L M, Duffy, V B, Fast, K, Green, B G, & Snyder, D J (2002, in press). Hormones, age, genes and pathology: How do we assess variation in sensation and preference? Food Selection, from Genes to Culture. Paris, France: Danone.
Bartoshuk, L M, Fast, K, Duffy, V B, Prutkin, J M, Snyder, D J, & Green, B G (2000). Magnitude matching and a modified LMS produce valid sensory comparisons for PROP studies. Appetite, 35, 277.
Bartoshuk, L M, Green, B G, Snyder, D J, Lucchina, L A, Hoffman, H J, Weiffenbach, J M, & Ko, C W (2000). Valid across-group comparisons: Supertasters perceive the most intense taste sensations by magnitude matching or the LMS scale. Chemical Senses, 25, 639.
Biernat, M, & Manis, M (1994). Shifting standards and stereotype-based judgments. Journal of Personality and Social Psychology, 66, 5-20.
Birnbaum, M H (1999). How to show that 9 > 221: Collect judgments in a between-subjects design. Psychological Methods, 4, 243-249.
Carpenter, S (2000). A taste expert sniffs out a long-standing measurement oversight. Monitor on Psychology, 31, 20-21.
Fast, K, Duffy, V B, & Bartoshuk, L M (2002, in press). New psychophysical insights in evaluating genetic variation in taste. In C Rouby & B Schaal & D Dubois & R Gervais & A Holley (Eds.), Olfaction, Taste and Cognition.
Fast, K, Green, B G, Snyder, D J, & Bartoshuk, L M (2001). Remembered intensities of taste and oral burn correlate with PROP bitterness. Chemical Senses, 26, 1069.
Goode, E (2001, January 2). Researcher challenges a host of psychological studies. New York Times, pp. F1, F7.
Hakel, M D (1968). How often is often? American Psychologist, 23, 533-534.
Hetherington, M M, & Rolls, B J (1987). Methods of investigating human eating behavior. In F M Toates & N. E. Rowland (Eds.), Feeding and Drinking (pp. 77-109). New York: Elsevier Science Publishers (Biomedical Division).
Huskisson, E C (1974). Measurement of pain. Lancet, 2, 1127-1131.
Karrer, T, Bartoshuk, L M, Conner, E, Fehrenbaker, S, Grubin, D, & Snow, D (1992). PROP status and its relationship to the perceived burn intensity of capsaicin at different tongue loci. Chemical Senses, 17, 649.
Ko, C W, Hoffman, H J, Lucchina, L A, Snyder, D J, Weiffenbach, J M, & Bartoshuk, L M (2000). Differential perceptions of intensity for the four basic taste qualities in PROP supertasters versus nontasters. Chemical Senses, 25,
Kong, A, Barnett, G O, Mosteller, F, & Youtz, C (1986). How medical professionals evaluate expressions of probability. The New England Journal of Medicine, 315, 740-744.
Mapes, R E A (1979). Verbal and numerical estimates of probability in therapeutic contexts. Social Science & Medicine, 13A, 277-282.
Marks, L E, & Stevens, J C (1980). Measuring sensation in the aged. In L W Poon (Ed.), Aging in the 1980’s: Psychological issues (pp. 592-598). Washington: American Psychological Association.
Marks, L E, Stevens, J C, Bartoshuk, L M, Gent, J G, Rifkin, B, & Stone, V K (1988). Magnitude matching: The measurement of taste and smell. Chemical Senses, 13, 63-87.
Narens, L, & Luce, R D (1983). How we may have been misled into believing in the interpersonal comparability of utility. Theory and Decision, 15, 247-260.
Prutkin, J, Duffy, V B, Etter, L, Fast, K, Gardner, E, Lucchina, L A, Snyder, D J, Tie, K, Weiffenbach, J, & Bartoshuk, L M (2000). Genetic variation and inferences about perceived taste intensity in mice and men. Physiology and Behavior, 69,
Silverstone, T, & Stunkard, A J (1968). The anorectic effect of dexamphetamine sulphate. British Journal of Pharmacology and Chemotherapy, 33, 513-522.
Simpson, R H (1944). The specific meanings of certain terms indicating differing degrees of frequency. The Quarterly Journal of Speech, 30, 328-330.
Snyder, D J, Duffy, V B, Fast, K, Hoffman, H J, Ko, C W, Weiffenbach, J M, & Bartoshuk, L M (2001). Food preferences vary with age and sex: A new analysis using the general Labeled Magnitude Scale. Chemical Senses, 26, 1050.
Snyder, D J, Duffy, V B, Fast, K, Weiffenbach, J M, & Bartoshuk, L M (2001). PROP genetics interact with age and sex to influence food preferences. Appetite, 37, 164.
Stevens, J C, & Marks, L E (1980). Cross-modality matching functions generated by magnitude estimation. Perception and Psychophysics, 27, 379-389.
Stevens, S S (1958). Adaptation-level vs the relativity of judgment. The American Journal of Psychology, 71, 633-646.
Wallsten, T S, & Budescu, D V (1995). A review of human linguistic probability processing: General principles and empirical evidence. The Knowledge Engineering Review, 10, 43-62.
Wallsten, T S, Budescu, D V, Rapoport, A, Zwick, R, & Forsyth, B (1986). Measuring the vague meanings of probability words. Journal of Experimental Psychology: General, 115, 348-365.