“Lesser of Two Evils”: Applying Artificial Intelligence to Move Beyond Self-Reports



Quick Take

When selecting measures for use in new research, psychologists have overwhelmingly turned to self-reports. Though questionnaires and interview-style measures have advantages, studies exploring relationships between multiple self-reports can produce inflated effects.  

Progress in artificial intelligence (AI) and machine learning has the potential to transform measurement in psychology by offering new ways to analyze data created naturally through everyday human activities, including social media posts, videos, and photographs. In this essay, coauthored by a behavioral scientist with a doctoral degree in psychology (Catharine Fairbairn) and a machine-learning researcher with a doctorate in computer science (Nigel Bosch), we argue for increased uptake of automated measures in psychological science. Specifically, we advocate for these new AI-based measures not because they offer measurement free from error, but rather because they avoid specific problematic forms of error linked to overreliance on self-reports.  

Self-reports 

Psychology has long been a science of self-reports. When applied to subjective constructs, self-reports can provide a view of human experience that is tremendously valuable and difficult to capture via other means (Garcia & Gustavson, 1997). Self-reports are also cost-effective and scalable, and when they include multiple choice and Likert-style response options, self-reports can easily be analyzed with conventional statistical approaches (Paulhus & Vazire, 2007).

Although self-reports have advantages, their use in psychological science might be said to have gotten out of hand (Baumeister et al., 2007). The application of self-reports in behavioral research now extends well beyond the measurement of constructs that are inherently subjective to the measurement of behavior, events, skills/abilities, and even physiology (Fairbairn & Bosch, in press).  

Psychological processes often operate at levels below conscious awareness, such that we as humans may find ourselves unable to accurately report on our internal thoughts and feelings and external behaviors (Nisbett & Wilson, 1977). Limitations in memory and self-perception interfere with the accuracy of self-reports, with recall for distant life events, aggregation of information across time, and self-evaluation of our own performance emerging as particularly biased (Baldwin et al., 2019; Dunning et al., 2004; Schwarz, 1999).  

Even when we are capable of accurately reporting our experiences, we may not always be willing to do so, with misreporting on sensitive topics (e.g., drug use) reaching levels as high as 50% (Tourangeau & Yan, 2007). Fixed-choice items are no help in this regard. Participant responses to multiple-choice and Likert-type items vary substantially depending on the specific response options provided by researchers (Schwarz, 1999).  

In sum, we know that self-reports contain error. But the same could be said of every assessment technique deployed in the history of science. When it comes to measurement, we never deal in the realm of absolute truth, but rather in one fumbling attempt at approximation after another.

Potential harm from measurement error can vary substantially depending on not only its quantity but also, and importantly, its specific characteristics or quality. Psychology features a high proportion of studies in which both predictor and outcome are assessed via self-report, with the prevalence of such designs surpassing 50% in some subdisciplines (Fairbairn & Bosch, in press). 

Measurement error shared across a predictor and outcome can artificially inflate the size of observed effects. For example, a significant relationship between self-reports of alcohol problems and self-reports of marital distress might emerge because of a true underlying relationship between these factors or, alternatively, because of systematic forms of measurement error shared across the two self-reports (Campbell & Fiske, 1959; Podsakoff et al., 2003, 2012). These might include individual differences in self-presentational concern (some participants may be unwilling to disclose alcohol use and marital distress), memory/attention (some may struggle to remember instances of either), mood (everything seems bad, including marriage and behaviors), or personal lay theories (“I believe my spouse is driving me to drink.”). Such systematic error has the potential to move beyond the realm of noise into that of true confound. 
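The inflation described above can be illustrated with a small simulation. This is a minimal sketch under assumed, purely illustrative parameters rather than an analysis drawn from any cited study: the true correlation (`true_r`), the size of a respondent-level shared bias (`bias_sd`), and the item-level noise (`noise_sd`) are all hypothetical values chosen for demonstration. Each simulated respondent has one response tendency that shifts both self-reports in the same direction, and the outcome is also measured a second way with error of equal total variance that is independent of the predictor's error.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n=20000, true_r=0.2, bias_sd=1.0, noise_sd=0.5, seed=0):
    """Compare observed correlations under shared vs. independent error.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # Independent error with the same total variance as bias + noise,
    # so the two outcome measures are equally noisy overall.
    independent_sd = math.sqrt(bias_sd ** 2 + noise_sd ** 2)

    true_x, true_y = [], []
    self_x, self_y, other_y = [], [], []
    for _ in range(n):
        # True scores with population correlation true_r.
        x = rng.gauss(0, 1)
        y = true_r * x + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
        # One response tendency per respondent, shared by both self-reports.
        bias = rng.gauss(0, bias_sd)

        true_x.append(x)
        true_y.append(y)
        self_x.append(x + bias + rng.gauss(0, noise_sd))
        self_y.append(y + bias + rng.gauss(0, noise_sd))
        # Outcome measured by a second method whose error is unrelated
        # to the respondent's self-report bias.
        other_y.append(y + rng.gauss(0, independent_sd))

    return {
        "true": pearson(true_x, true_y),
        "both_self_report": pearson(self_x, self_y),
        "mixed_methods": pearson(self_x, other_y),
    }


if __name__ == "__main__":
    for label, r in simulate().items():
        print(f"{label}: r = {r:.2f}")
```

Under these assumed values, the population correlation between the two self-reports is (0.2 + 1) / 2.25, or about 0.53, against a true value of 0.20, whereas pairing the self-reported predictor with the independently measured outcome gives 0.2 / 2.25, or about 0.09. In this sketch, shared error inflates the effect, while independent error of the same size attenuates it. The sketch assumes the shared bias pushes both reports in the same direction; a bias that pushed them in opposite directions would diminish the effect instead.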

In theoretical terms, shared measurement error could either inflate or diminish the size of effects. However, when applied to self-reports, research suggests that shared measurement effects are overwhelmingly inflationary (Podsakoff et al., 2024). As such, overreliance on self-reports has the potential to lead to false positive effects. In a discipline recently rocked by a replicability crisis (Open Science Collaboration, 2015), this possibility is one worthy of grave concern. 

A veritable “Who’s Who” of psychological science has raised concerns about overreliance on self-reports, from Floyd Henry Allport to Allen Edwards to Richard Nisbett and APS President James Pennebaker. Although limitations of self-reports have been widely acknowledged, researchers across psychological subdisciplines have continued to use them. Thus, within behavioral research, the measure that is most universally lambasted also represents the one most universally deployed.  

Such a seeming contradiction might be explained in part by necessity. Although biological psychologists have recourse to brain scans and biological assays, options available to psychosocial researchers are comparatively sparse. Historically, alternatives to self-reports in psychosocial research required large teams of human coders and costly experimental equipment, constraining investigations to laboratory contexts and requiring coding efforts that span months or even years. As a result, depending on the study and the domain of assessment, psychological researchers have often been faced with a choice between self-report data or no data at all.  

Machine learning  

Recent developments in AI have the potential to transform the measurement landscape of psychological science, offering a smorgasbord of measurement options to behavioral researchers beyond surveys. Advanced machine-learning subtypes such as deep learning can accurately model relationships characterized by extraordinary levels of complexity, including nonlinear associations and millions of interacting predictors (LeCun et al., 2015). These data-driven model types can help us move beyond the constraints of “designed” data types, such as closed-ended self-reports, into the analysis of rich, organic data sources created naturally through human activities (e.g., social media posts; Adjerid & Kelley, 2018; Tay et al., 2022).  

Despite these possibilities, discourse on machine learning in behavioral research has often focused on its potential as a tool for analysis of preexisting “designed” data types, and widely cited applications involve the use of older machine-learning subtypes (e.g., using a random forest algorithm to predict suicide risk from survey responses; see Jacobucci et al., 2021). This focus fails to leverage key advantages of newer machine-learning models, which require massive training datasets rich in both observations and reliably measured predictors—attributes unlikely to characterize even the largest of survey studies (Fairbairn & Bosch, in press; Jacobucci & Grimm, 2020).

In tandem with progress in deep learning, the increased use of smartphones, wearables, and the internet has exponentially expanded access to organic data sources. Therefore, in behavioral research, machine-learning applications are likely to be most impactful not in transforming the manner in which we analyze premeasured constructs, but rather in transforming how we measure these constructs to begin with.  

Researchers can apply deep and generative learning methods within complex datasets for a wide range of measurement tasks, including recognition of patterns within predictors spanning time, space, and ordered sequences (LeCun et al., 2015). For example, machine-learning models have surpassed human accuracy in identifying objects and individuals within images (Norvig & Russell, 2021), offering unprecedented possibilities for analyzing environmental and social factors within photographs (e.g., Ariss et al., 2025). Machine learning has also been fruitfully applied to video data, enabling automated analysis of action sequences, body posture, physical proximity, and facial movements over time (e.g., Gurrieri et al., 2021).  

Deep learning has been particularly impactful for speech and language analysis, with models for speech recognition now capable of performing advanced transcription tasks, such as parsing individual speakers during social exchanges and accurately recognizing language in noisy recording environments. Furthermore, natural language processing models can now detect broad emotional tone as well as the content and structure of language (e.g., Rathje et al., 2024).  

Finally, researchers are using machine-learning models to identify patterns within sequences of events as well as to extract behaviorally relevant constructs from data produced by wearable technology (Fairbairn et al., in press). 

Importantly, measures based in machine learning do not and will not ever offer a true “objective” view of human experience and behavior. These models are only as robust as the datasets they are trained on, which were inevitably created and curated by humans. Although comparatively resistant to shared-methods bias, they are often trained on human reports as ground truth and therefore can share other limitations linked to these. Automated measures can also be misapplied. And, where individual errors arise, the complexity of machine-learning models means the mechanism of error may at times be difficult to trace. 

As such, machine-learning models have access to no truthier sort of truth than do self-reports—always looking through a lens and never directly. Yet in a field where tests of theory all too often rely on reports from the same individuals measured in the same context via similarly structured closed-ended questionnaires, automated measures offer us something infinitely valuable—a source of error variance more likely to be random than systematic.  

Conclusion 

New technological developments can trigger competing drives. On the one hand, there is the draw, the allure of the unknown, the longed-for fix. Here, technology-driven measures are an escape from a marriage that has long turned stale, a fresh start with the shiny and new. On the other hand is the resistance, the attachment born of familiarity, the fear of the unknown. Here, we compare new solutions not to the true array of flawed alternatives, but to the fallacy of a limitation-less tool, a measure that will never exist. Guided in part by these competing drives, attitudes toward new measurement technology follow an oscillating cycle (Borup et al., 2006; Maclure, 2020), swinging from undiscriminating uptake to wholesale dismissal, with little space for a wider view.

In this essay, we aim to offer an alternative. Specifically, we present an argument for the integration of AI-based measurement into psychological science grounded not in an inflated or rose-colored view of the future, but rather a reasoned vision of measures now. Measures based in AI offer great promise, and yet they are associated with limitations, both known and unknown. At the same time, these measures are less likely to be vulnerable to systematic forms of error and false-positive findings when compared with self-reports. As such, these new tools may offer behavioral scientists a relatively unglamorous, but nonetheless precious, “lesser of two evils.” 
