Testing for Measurement Invariance: Does your measure mean the same thing for different participants?

Sean T. H. Lee

Student Notebook

September 28, 2018

From Beck’s Depression Inventory to the Positive and Negative Affect Schedule (PANAS), psychological scientists regularly use scales, schedules, and inventories in published empirical papers. But how can we be certain that these questionnaires actually measure the same construct across all respondents?

Take shame and guilt, two indicators of negative affect on the PANAS. In individualistic cultures, they are generally considered negative emotions. In collectivistic cultures, however, shame and guilt are viewed somewhat positively, because they signal self-reflection and self-improvement rather than sheer wrongdoing (Eid & Diener, 2001; Mesquita & Leu, 2007). Such equivalence issues eventually prompted the development of an international short form of the PANAS that excludes items whose meanings differ across cultures (Thompson, 2007). Nevertheless, the original PANAS, which does not account for these variations, remains in common use (Chan, 2007; Spencer-Rodgers, Peng, & Wang, 2010).

Many well-established measures have already withstood rigorous tests of measurement invariance across age (Bowden, Weiss, Holdnack, & Lloyd, 2006), gender (Byrne, Baron, & Campbell, 1993), and culture (Runyan, Ge, Dong, & Swinney, 2012). However, they make up only a small fraction of the ever-growing number of scales being developed and used in psychological research. Understanding the basic tenets of measurement invariance testing therefore helps scientists produce more comprehensive and broadly applicable results in research and practice.

Measurement Invariance Testing: Multigroup Confirmatory Factor Analysis

To test measurement invariance across participants from different groups, researchers use a statistical technique called multigroup confirmatory factor analysis (multigroup CFA; Milfont & Fischer, 2015). Multigroup CFA extends the typical CFA: instead of fitting a single model to the whole data set, you divide the data set into groups (e.g., young, middle-aged, and older adults), check how well the model fits each group separately, and then fit the model to all groups simultaneously so that its parameters can be compared across groups. This procedure allows researchers to examine whether respondents from different groups interpret the same measure in a conceptually similar way (Bialosiewicz, Murphy, & Berry, 2013).
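When a multigroup CFA is estimated by maximum likelihood, the model is fit to each group's sample item means and item covariance matrix. The sketch below only prepares those per-group statistics; it does not fit a CFA, which is done in dedicated SEM software. The age cutoffs in `AGE_BANDS`, the function name `group_moments`, and the simulated data are all hypothetical placeholders.

```python
import numpy as np

# Hypothetical age bands: lower bound inclusive, upper bound exclusive.
AGE_BANDS = {"young": (18, 40), "middle": (40, 65), "older": (65, 120)}


def group_moments(items: np.ndarray, ages: np.ndarray) -> dict:
    """Sample size, item means, and item covariance matrix for each age band.

    items: (n_respondents, n_items) array of item scores
    ages:  (n_respondents,) array of respondent ages
    """
    moments = {}
    for label, (lo, hi) in AGE_BANDS.items():
        group = items[(ages >= lo) & (ages < hi)]
        moments[label] = {
            "n": group.shape[0],
            "means": group.mean(axis=0),
            # rowvar=False: columns are items, rows are respondents
            "cov": np.cov(group, rowvar=False),
        }
    return moments


# Simulated placeholder data: 300 respondents answering a 5-item scale.
rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=300)
items = rng.normal(loc=4.0, scale=1.0, size=(300, 5))
stats = group_moments(items, ages)
for label, m in stats.items():
    print(label, m["n"], m["means"].round(2))
```

Each group's means and covariances are what the multigroup model is later asked to reproduce, with or without equality constraints across groups.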

The three typical phases of measurement invariance testing are as follows.

Configural Invariance

Using age as an example, a configural invariance test examines whether the overall factor structure stipulated by your measure fits well in every age group in your sample. As with a typical CFA, you start by specifying the relationships between each item in the measure and the latent factor(s) the items are intended to measure. Take, for example, the five-item Satisfaction with Life Scale (Diener, Emmons, Larsen, & Griffin, 1985), in which the latent construct of "life satisfaction" is indicated by each of the five items (e.g., "In most ways my life is close to my ideal"). The strength of the relationship between an item and the latent factor is termed the item's "factor loading," and the item's expected score when the latent factor is at zero is termed its "item intercept" (analogous to the slope, or beta coefficient, and the y-intercept in linear regression). To test configural invariance, you fit the specified model to each age group, leaving all factor loadings and item intercepts free to vary across groups. You then evaluate the fit of this multigroup model: a good fit suggests that the same overall factor structure holds for all ages.
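The regression analogy can be made concrete. Under a one-factor model, each item score is its intercept plus its loading times the latent factor, plus a residual, so the model implies a specific mean vector and covariance matrix for each group. The sketch below computes those implied moments; the function name, the parameter values, and the choice to identify the model by fixing the latent mean to 0 and variance to 1 in each group are illustrative assumptions.

```python
import numpy as np


def implied_moments(tau, lam, kappa, phi, theta):
    """Model-implied item means and covariance matrix for a one-factor CFA.

    tau:   item intercepts, shape (p,)
    lam:   factor loadings, shape (p,)
    kappa: latent factor mean
    phi:   latent factor variance
    theta: residual (unique) variances of the items, shape (p,)
    """
    tau = np.asarray(tau, dtype=float)
    lam = np.asarray(lam, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # Each item behaves like a regression on the factor: intercept + loading * factor.
    mu = tau + lam * kappa
    # Items covary only through the shared factor; residuals are uncorrelated here.
    sigma = phi * np.outer(lam, lam) + np.diag(theta)
    return mu, sigma


# Configural model: loadings and intercepts are estimated separately per group.
# Illustrative values only; the latent mean is fixed to 0 and the variance to 1
# in each group, one common way of identifying the model.
young = implied_moments(tau=[5.0, 4.8, 5.2, 4.9, 4.5], lam=[0.9, 0.8, 1.0, 0.7, 0.6],
                        kappa=0.0, phi=1.0, theta=[0.5] * 5)
older = implied_moments(tau=[5.1, 4.6, 5.0, 5.3, 4.4], lam=[0.8, 0.9, 0.9, 0.5, 0.7],
                        kappa=0.0, phi=1.0, theta=[0.6] * 5)
```

Estimation adjusts these parameters until the implied moments match each group's sample moments as closely as possible; configural fit is judged by how close that match can get.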

Metric Invariance

The next step is to test for metric invariance, which examines whether the factor loadings are equivalent across groups. This time, you constrain the factor loadings to be equal across groups while still allowing the item intercepts to vary freely. If the constrained model still fits well, and its fit is not substantially worse than that of the configural model, metric invariance is supported; if constraining the factor loadings results in a noticeably poorer fit, the factor loadings likely differ across age groups.
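One common way to judge whether the metric model fits "noticeably" worse is a chi-square difference test between the two nested models. The sketch below implements it with the standard library only; the function names and the fit statistics in the usage line are hypothetical, and it assumes ordinary maximum-likelihood chi-square values (robust, scaled chi-squares call for a scaled difference test instead).

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) of a chi-square distribution.

    Computed as 1 - P(df/2, x/2), where P is the regularized lower incomplete
    gamma function, evaluated with its power series.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    for n in range(1, 10_000):
        term *= z / (a + n)
        total += term
        if term < total * 1e-15:
            break
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def chi2_difference_test(chisq_free, df_free, chisq_constrained, df_constrained, alpha=0.05):
    """Compare a constrained model (e.g., metric) with the less constrained
    model it is nested in (e.g., configural)."""
    delta_chisq = chisq_constrained - chisq_free
    delta_df = df_constrained - df_free
    p = chi2_sf(delta_chisq, delta_df)
    # A nonsignificant increase in chi-square supports the added constraints.
    return delta_chisq, delta_df, p, p >= alpha


# Hypothetical fit statistics, not taken from any real study.
result = chi2_difference_test(chisq_free=12.4, df_free=10, chisq_constrained=15.1, df_constrained=14)
print(result)
```

Because the chi-square statistic grows with sample size, researchers often also look at the change in fit indices such as the CFI rather than relying on the difference test alone.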

Ascertaining metric invariance allows you to substantiate multi-group comparisons of factor variances and covariances, since metric invariance indicates that each item of the scale loads onto the specified latent factor in a similar manner and with similar magnitude across groups. As such, you can assume that differences in factor variances and covariances are not attributable to age-based differences in the properties of the scales themselves.

Scalar Invariance

The final step is to test for scalar invariance, which examines whether the item intercepts are equivalent across groups. Here you constrain the item intercepts to be equal across groups, in addition to the factor loadings constrained in the previous step. If this results in a substantially poorer fit than the metric model, you can conclude that the item intercepts differ for people of different ages.
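Each step adds equality constraints, so each model estimates fewer parameters and has more degrees of freedom than the one before it; that is why fit can only stay the same or get worse. The sketch below counts degrees of freedom for a one-factor model with `p` items and `G` groups. The function name is hypothetical, and it assumes one common identification choice: latent variance fixed to 1 and latent mean fixed to 0 in a reference group (in every group for the configural model).

```python
def model_df(p: int, groups: int) -> dict:
    """Degrees of freedom of configural, metric, and scalar one-factor models."""
    # Observed information per group: p item means plus p(p+1)/2 (co)variances.
    moments = groups * (p + p * (p + 1) // 2)
    free = {
        # Per group: p loadings, p intercepts, p residual variances.
        "configural": groups * 3 * p,
        # Loadings shared; latent variances freed in the non-reference groups.
        "metric": p + groups * 2 * p + (groups - 1),
        # Loadings and intercepts shared; latent means also freed outside the reference group.
        "scalar": 2 * p + groups * p + 2 * (groups - 1),
    }
    return {name: moments - k for name, k in free.items()}


# Five items (as in the Satisfaction with Life Scale), two age groups.
print(model_df(5, 2))  # {'configural': 10, 'metric': 14, 'scalar': 18}
```

Under these assumptions, each step adds (G - 1)(p - 1) degrees of freedom, which is the difference in degrees of freedom used when comparing adjacent models.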

Ascertaining scalar invariance allows you to substantiate multi-group comparisons of factor means (e.g., t-tests or ANOVA), and you can be confident that any statistically significant differences in group means are not due to differences in scale properties at different ages.
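To see why mean comparisons depend on scalar invariance, consider two groups with identical loadings and identical latent means, where a single item's intercept is higher in one group. The illustrative numbers and the function name below are hypothetical; the sketch shows that the observed scale mean still differs, so a t-test on scale scores could flag a "difference" that lies in the item, not in life satisfaction.

```python
def observed_scale_mean(tau, lam, kappa):
    """Expected mean of the averaged scale score under a one-factor model."""
    return sum(t + l * kappa for t, l in zip(tau, lam)) / len(tau)


lam = [0.9, 0.8, 1.0, 0.7, 0.6]          # loadings equal across groups (metric invariance holds)
tau_young = [5.0, 4.8, 5.2, 4.9, 4.5]
tau_older = [5.0, 4.8, 5.2, 4.9, 5.0]    # item 5 intercept is 0.5 higher: scalar noninvariance
kappa = 0.0                              # identical latent means in both groups

diff = observed_scale_mean(tau_older, lam, kappa) - observed_scale_mean(tau_young, lam, kappa)
print(round(diff, 3))  # prints 0.1
```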

These steps are necessarily sequential, and scientists typically stop testing as soon as one of them produces evidence of noninvariance. They then examine the factor loadings or item intercepts item by item to determine which items contribute most to the noninvariance. Although additional steps, such as testing whether item residual variances are equal across groups, offer an even stricter test of measurement invariance, researchers generally agree that assessing configural, metric, and scalar invariance is sufficient for establishing measurement invariance (Bialosiewicz et al., 2013; Milfont & Fischer, 2015).
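The stop-at-first-failure logic can be sketched as a short loop over the three models. This sketch swaps in change in CFI as the decision criterion; the cutoffs (configural CFI of at least .95, and a CFI drop of no more than .01 between adjacent steps) are widely used rules of thumb rather than part of the procedure described here, and the function name and CFI values are hypothetical.

```python
ORDER = ("configural", "metric", "scalar")


def highest_invariance_level(cfi, min_cfi=0.95, max_drop=0.01):
    """Return the most restrictive invariance level supported by the CFI values.

    cfi:      mapping from model name to its CFI
    min_cfi:  assumed cutoff for acceptable configural fit
    max_drop: assumed largest tolerated CFI decrease between adjacent steps
    """
    if cfi["configural"] < min_cfi:
        return "none"                  # the factor structure itself does not hold
    supported = "configural"
    for previous, current in zip(ORDER, ORDER[1:]):
        if cfi[previous] - cfi[current] > max_drop:
            break                      # stop: this constraint worsened fit too much
        supported = current
    return supported


# Hypothetical CFI values for the three models.
print(highest_invariance_level({"configural": 0.985, "metric": 0.980, "scalar": 0.955}))
```

With these inputs the scalar step fails, so the next move would be the item-by-item inspection of intercepts described above.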

Testing for measurement invariance plays an integral role in psychological research, ensuring that comparisons across various groups of participants are both meaningful and valid. Chan (2011) states that “we cannot assume the same construct is being assessed across groups by the same measure” without tests of measurement invariance (p. 108). Measurement invariance testing is, therefore, a critical addition to our arsenal of statistical procedures that help to increase the robustness and validity of our research, regardless of field or discipline.

References

Bialosiewicz, S., Murphy, K., & Berry, T. (2013). An introduction to measurement invariance testing: Resource packet for participants. Retrieved from http://comm.eval.org/HigherLogic/System/DownloadDocumentFile.ashx?DocumentFileKey=63758fed-a490-43f2-8862-2de0217a08b8

Bowden, S. C., Weiss, L. G., Holdnack, J. A., & Lloyd, D. (2006). Age-related invariance of abilities measured with the Wechsler Adult Intelligence Scale-III. Psychological Assessment, 18, 334–339. doi:10.1037/1040-3590.18.3.334

Byrne, B. M., Baron, P., & Campbell, T. L. (1993). Measuring adolescent depression: Factorial validity and invariance of the Beck Depression Inventory across gender. Journal of Research on Adolescence, 3, 127–143. doi:10.1207/s15327795jra0302_2

Chan, D. W. (2007). Positive and negative perfectionism among Chinese gifted students in Hong Kong: Their relationships to general self-efficacy and subjective well-being. Journal for the Education of the Gifted, 31, 77–102. doi:10.4219/jeg-2007-512

Chan, D. (2011). Advances in analytical strategies. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology (Vol. 1, pp. 85–113). Washington, DC: American Psychological Association. doi:10.1037/12169-004

Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The satisfaction with life scale. Journal of Personality Assessment, 49, 1–5.

Eid, M., & Diener, E. (2001). Norms for experiencing emotions in different cultures: Inter- and intranational differences. Journal of Personality and Social Psychology, 81, 869–885. doi:10.1037/0022-3514.81.5.869

Lim, F. M. H. (2007). An exploratory study of students’ positivity in Singapore (Thesis). Retrieved from https://repository.nie.edu.sg//handle/10497/809

Mesquita, B., & Leu, J. (2007). The cultural psychology of emotion. In S. Kitayama & D. Cohen (Eds.), Handbook of cultural psychology (pp. 734–759). New York, NY: Guilford Press.

Milfont, T. L., & Fischer, R. (2015). Testing measurement invariance across groups: Applications in cross-cultural research. International Journal of Psychological Research, 3, 111–130. doi:10.21500/20112084.857

Runyan, R. C., Ge, B., Dong, B., & Swinney, J. L. (2012). Entrepreneurial orientation in cross-cultural research: Assessing measurement invariance in the construct. Entrepreneurship Theory and Practice, 36, 819–836. doi:10.1111/j.1540-6520.2010.00436.x

Spencer-Rodgers, J., Peng, K., & Wang, L. (2010). Dialecticism and the co-occurrence of positive and negative emotions across cultures. Journal of Cross-Cultural Psychology, 41, 109–115. doi:10.1177/0022022109349508

Thompson, E. R. (2007). Development and validation of an internationally reliable short-form of the positive and negative affect schedule (PANAS). Journal of Cross-Cultural Psychology, 38, 227–242. doi:10.1177/0022022106297301

Published in the Observer, October 2018.