From the Beck Depression Inventory to the Positive and Negative Affect Schedule (PANAS), psychological scientists regularly use scales, schedules, and inventories in published empirical papers. But how can we be certain that these questionnaires actually measure the same construct across all respondents?
Take shame and guilt, two indicators of negative affect on the PANAS. They are generally considered negative emotions in individualistic cultures. But in collectivistic cultures, shame and guilt are seen somewhat positively; they represent self-reflection and self-improvement rather than sheer wrongfulness (Eid & Diener, 2001; Mesquita & Leu, 2007). Such equivalence issues eventually prompted the development of an international version of the PANAS that excludes items carrying different meanings across cultures (Thompson, 2007). Even so, the original PANAS, which doesn’t account for those variations, remains in common use (Chan, 2007; Spencer-Rodgers, Peng, & Wang, 2010).
While many well-established measures have already withstood rigorous tests of measurement invariance and are normed across age (Bowden, Weiss, Holdnack, & Lloyd, 2006), gender (Byrne, Baron, & Campbell, 1993) and culture (Runyan, Ge, Dong, & Swinney, 2012), they are merely a few of the ever-growing number of scales that are being developed and used in psychological research. It’s important for scientists to understand the basic tenets of measurement invariance testing to produce more comprehensive, broadly applicable results in research and practice.
Measurement Invariance Testing: Multigroup Confirmatory Factor Analysis
To test measurement invariance across participants from various groups, researchers use a statistical technique called “multigroup confirmatory factor analysis” (CFA; Milfont & Fischer, 2015). Essentially, multigroup CFA is an extension of the typical CFA; however, instead of fitting a single model to your data set, you divide the data set into groups (e.g., young adult, middle-aged adult, and older adult), determine model fit for each group separately, and then make multi-group comparisons. This procedure allows researchers to examine whether respondents from different groups interpret the same measure in a conceptually similar way (Bialosiewicz, Murphy, & Berry, 2013).
The three typical phases of measurement invariance testing are as follows.
Using age as an example, a configural invariance test allows you to examine whether the overall factor structure stipulated by your measure fits well for all age groups in your sample. As with a typical CFA, you start by specifying the relationships between each item in the measure you’re using and the latent factor(s) that the items are stipulated to measure. Take, for example, the five-item Satisfaction with Life Scale (Diener, Emmons, Larsen, & Griffin, 1985). The latent construct of “life satisfaction” is indicated by each of the five scale items (e.g., “in most ways, my life is close to ideal”). The strength of the relationship between each scale item and the latent factor is termed the “factor loading,” and each item’s expected value when the latent factor equals zero is termed the “item intercept” (similar to the concepts of beta-coefficient and y-intercept, respectively, in linear regression analysis). To test configural invariance, you fit the model you have specified to each of the age groups, leaving all factor loadings and item intercepts free to vary for each group. You then evaluate the fit of this multi-group model; a good fit suggests that the overall factor structure holds up similarly for all ages.
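The measurement model behind this step can be written out directly. The following is a minimal sketch in plain Python, assuming a one-factor model in which each item score equals its intercept plus its loading times the latent score plus a residual. The function name `implied_moments` and every numeric value are hypothetical and chosen only for illustration; they are not estimates from any real data set. In the configural model each group keeps its own loadings and intercepts, and only the pattern (five items, one factor) is shared.

```python
def implied_moments(loadings, intercepts, residual_vars,
                    factor_mean=0.0, factor_var=1.0):
    """Model-implied item means and covariances for a one-factor model.

    mean_i = intercept_i + loading_i * factor_mean
    cov_ij = loading_i * loading_j * factor_var (+ residual_var_i if i == j)
    """
    n = len(loadings)
    means = [intercepts[i] + loadings[i] * factor_mean for i in range(n)]
    cov = [[loadings[i] * loadings[j] * factor_var
            + (residual_vars[i] if i == j else 0.0)
            for j in range(n)]
           for i in range(n)]
    return means, cov


# Hypothetical parameters for two age groups on a five-item scale.
# Configural model: same structure, but every parameter is free per group.
groups = {
    "young": dict(loadings=[0.8, 0.7, 0.9, 0.6, 0.5],
                  intercepts=[4.0, 4.2, 3.9, 4.5, 3.5],
                  residual_vars=[0.4, 0.5, 0.3, 0.6, 0.7]),
    "older": dict(loadings=[0.7, 0.8, 0.9, 0.4, 0.6],
                  intercepts=[4.1, 4.0, 3.9, 5.0, 3.6],
                  residual_vars=[0.5, 0.4, 0.3, 0.8, 0.6]),
}

for name, params in groups.items():
    means, cov = implied_moments(**params)
    print(name, "implied item means:", [round(m, 2) for m in means])
    print(name, "implied variance of item 1:", round(cov[0][0], 2))
```

Fitting the model means finding parameter values whose implied means and covariances best reproduce each group's observed ones; dedicated structural equation modeling software handles that estimation step.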
The next step is to test for metric invariance to examine whether the factor loadings are equivalent across the groups. This time, you constrain the factor loadings to be equivalent across groups, while still allowing the item intercepts to vary freely as before. A good multi-group model fit indicates metric invariance. If constraining the factor loadings in this way results in a poorer fit than the configural model, it suggests that the factor loadings are not similar across age groups.
Ascertaining metric invariance allows you to substantiate multi-group comparisons of factor variances and covariances, since metric invariance indicates that each item of the scale loads onto the specified latent factor in a similar manner and with similar magnitude across groups. As such, you can assume that differences in factor variances and covariances are not attributable to age-based differences in the properties of the scales themselves.
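A small numeric illustration shows why equal loadings matter for comparing factor variances. It assumes the one-factor model, in which the covariance between two items equals the product of their loadings and the factor variance. The helper `item_covariance` and all values are hypothetical.

```python
def item_covariance(loading_a, loading_b, factor_var):
    """Model-implied covariance between two items of a one-factor model."""
    return loading_a * loading_b * factor_var


# Case 1: metric invariance holds (same loadings in both groups).
# The ratio of item covariances recovers the ratio of factor variances.
young = item_covariance(0.8, 0.7, factor_var=1.0)
older = item_covariance(0.8, 0.7, factor_var=2.0)
print(round(older / young, 2))  # 2.0, the true factor-variance ratio

# Case 2: loadings differ across groups, factor variances are identical.
# The item covariances still differ, so the gap reflects the scale, not the trait.
young = item_covariance(0.8, 0.7, factor_var=1.0)
older = item_covariance(0.4, 0.7, factor_var=1.0)
print(round(older / young, 2))  # 0.5, despite equal factor variances
```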
The final step is to test for scalar invariance to examine whether the item intercepts are equivalent across groups. In this case, you constrain the item intercepts to be equivalent across groups, while keeping the factor loadings constrained as in the previous step. If this results in a poorer multi-group model fit than the metric model, you can conclude that the item intercepts are not similar for people of different ages.
Ascertaining scalar invariance allows you to substantiate multi-group comparisons of factor means (e.g., t-tests or ANOVA), and you can be confident that any statistically significant differences in group means are not due to differences in scale properties at different ages.
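The same logic can be sketched for means. The example below assumes the one-factor model, in which an item's expected score is its intercept plus its loading times the latent mean. The function names and all values are hypothetical; the point is only that a single shifted intercept can produce a group difference in scale scores when the latent means are identical.

```python
def expected_item_mean(intercept, loading, factor_mean):
    """Model-implied mean of one item: intercept + loading * latent mean."""
    return intercept + loading * factor_mean


def expected_scale_mean(intercepts, loadings, factor_mean):
    """Average of the implied item means, i.e. the expected scale score."""
    item_means = [expected_item_mean(t, l, factor_mean)
                  for t, l in zip(intercepts, loadings)]
    return sum(item_means) / len(item_means)


loadings = [0.8, 0.7, 0.9, 0.6, 0.5]          # equal across groups (metric invariance)
intercepts_young = [4.0, 4.2, 3.9, 4.5, 3.5]
intercepts_older = [4.0, 4.2, 3.9, 5.5, 3.5]  # item 4 intercept is 1.0 higher

# Both groups have the SAME latent mean, yet the expected scale means differ.
young = expected_scale_mean(intercepts_young, loadings, factor_mean=0.0)
older = expected_scale_mean(intercepts_older, loadings, factor_mean=0.0)
print(round(older - young, 2))  # 0.2, produced entirely by the shifted intercept
```

A t-test on the observed scale scores would attribute this 0.2-point gap to age, which is exactly the misreading that scalar invariance rules out.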
These steps are necessarily sequential, and scientists typically stop testing when any of these steps produces evidence of noninvariance. Scientists would then examine the factor loadings and item intercepts on an item-by-item basis to determine which items are the main contributors toward measurement noninvariance. Although additional steps can offer an even stricter test of measurement invariance, researchers generally agree that assessing configural, metric, and scalar invariance is sufficient for establishing measurement invariance (Bialosiewicz et al., 2013; Milfont & Fischer, 2015).
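The stop-at-first-failure logic can be sketched as a short decision routine. The article does not prescribe a particular fit index or cutoff, so this sketch makes assumptions: it uses the comparative fit index (CFI), treats a configural CFI of at least .90 as acceptable, and treats a CFI drop of more than .01 between successive models as evidence of noninvariance. Both thresholds are rule-of-thumb placeholders, the function name is hypothetical, and the CFI values are invented for illustration.

```python
STEPS = ["configural", "metric", "scalar"]


def highest_invariance_level(cfi_by_step, min_cfi=0.90, max_cfi_drop=0.01):
    """Return the strictest invariance level supported, or None.

    cfi_by_step maps each step name to the CFI of that multi-group model.
    The configural model must fit acceptably on its own; each later model
    is accepted only if CFI drops by no more than max_cfi_drop relative
    to the previous, less constrained model. Testing stops at the first
    failure, because the steps are sequential.
    """
    supported = None
    previous_cfi = None
    for step in STEPS:
        cfi = cfi_by_step[step]
        if previous_cfi is None:
            ok = cfi >= min_cfi
        else:
            # Round to avoid floating-point noise in the difference.
            ok = round(previous_cfi - cfi, 4) <= max_cfi_drop
        if not ok:
            break
        supported = step
        previous_cfi = cfi
    return supported


# Hypothetical fit results: loadings can be equated, intercepts cannot.
fits = {"configural": 0.960, "metric": 0.955, "scalar": 0.930}
print(highest_invariance_level(fits))  # metric
```

With these made-up numbers, comparisons of factor variances and covariances would be defensible, but comparisons of factor means would not, and the next move would be the item-by-item inspection of intercepts described above.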
Testing for measurement invariance plays an integral role in psychological research, ensuring that comparisons across various groups of participants are both meaningful and valid. Chan (2011) states that “we cannot assume the same construct is being assessed across groups by the same measure” without tests of measurement invariance (p. 108). Measurement invariance testing is, therefore, a critical addition to our arsenal of statistical procedures that help to increase the robustness and validity of our research, regardless of field or discipline.
References
Bialosiewicz, S., Murphy, K., & Berry, T. (2013). An introduction to measurement invariance testing: Resource packet for participants. Retrieved from http://comm.eval.org/HigherLogic/System/DownloadDocumentFile.ashx?DocumentFileKey=63758fed-a490-43f2-8862-2de0217a08b8
Bowden, S. C., Weiss, L. G., Holdnack, J. A., & Lloyd, D. (2006). Age-related invariance of abilities measured with the Wechsler Adult Intelligence Scale-III. Psychological Assessment, 18, 334–339. doi:10.1037/1040-3590.18.3.334
Byrne, B. M., Baron, P., & Campbell, T. L. (1993). Measuring adolescent depression: Factorial validity and invariance of the Beck Depression Inventory across gender. Journal of Research on Adolescence, 3, 127–143. doi:10.1207/s15327795jra0302_2
Chan, D. W. (2007). Positive and negative perfectionism among Chinese gifted students in Hong Kong: Their relationships to general self-efficacy and subjective well-being. Journal for the Education of the Gifted, 31, 77–102. doi:10.4219/jeg-2007-512
Chan, D. (2011). Advances in analytical strategies. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology (Vol. 1, pp. 85–113). Washington, DC: American Psychological Association. doi:10.1037/12169-004
Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction With Life Scale. Journal of Personality Assessment, 49, 1–5.
Eid, M., & Diener, E. (2001). Norms for experiencing emotions in different cultures: Inter- and intranational differences. Journal of Personality and Social Psychology, 81, 869–885. doi:10.1037/0022-3514.81.5.869
Lim, F. M. H. (2007). An exploratory study of students’ positivity in Singapore (Thesis). Retrieved from https://repository.nie.edu.sg//handle/10497/809
Mesquita, B., & Leu, J. (2007). The cultural psychology of emotion. In S. Kitayama & D. Cohen (Eds.), Handbook of cultural psychology (pp. 734–759). New York, NY: Guilford Press.
Milfont, T. L., & Fischer, R. (2015). Testing measurement invariance across groups: Applications in cross-cultural research. International Journal of Psychological Research, 3, 111–130. doi:10.21500/20112084.857
Runyan, R. C., Ge, B., Dong, B., & Swinney, J. L. (2012). Entrepreneurial orientation in cross-cultural research: Assessing measurement invariance in the construct. Entrepreneurship Theory and Practice, 36, 819–836. doi:10.1111/j.1540-6520.2010.00436.x
Spencer-Rodgers, J., Peng, K., & Wang, L. (2010). Dialecticism and the co-occurrence of positive and negative emotions across cultures. Journal of Cross-Cultural Psychology, 41, 109–115. doi:10.1177/0022022109349508
Thompson, E. R. (2007). Development and validation of an internationally reliable short-form of the positive and negative affect schedule (PANAS). Journal of Cross-Cultural Psychology, 38, 227–242. doi:10.1177/0022022106297301