New Content From Advances in Methods and Practices in Psychological Science


Does Truth Pay? Investigating the Effectiveness of the Bayesian Truth Serum With an Interim Payment: A Registered Report 
Claire M. Neville, Matt N. Williams 

Self-report data are vital in psychological research, but biases such as careless responding and socially desirable responding can compromise their validity. Although various methods are employed to mitigate these biases, they have limitations. The Bayesian truth serum (BTS) offers a survey scoring method to incentivize truthfulness by leveraging correlations between personal and collective opinions and rewarding “surprisingly common” responses. In this study, we evaluated the effectiveness of the BTS in mitigating socially desirable responding to sensitive questions and tested whether an interim payment could enhance its efficacy by increasing trust. In a between-subjects experimental survey, 877 participants were randomly assigned to one of three conditions: BTS, BTS with interim payment, and regular incentive (RI). Contrary to the hypotheses, participants in the BTS conditions displayed lower agreement with socially undesirable statements compared with the RI condition. The interim payment did not significantly enhance the BTS’s effectiveness. Instead, response patterns diverged from the mechanism’s intended effects, raising concerns about its robustness. As the second registered report to challenge its efficacy, this study’s results cast serious doubt on the BTS as a reliable tool for mitigating socially desirable responding and improving the validity of self-report data in psychological research. 
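To make the "surprisingly common" idea concrete, here is a minimal sketch of BTS-style scoring in its commonly cited form: an information score that rewards answers whose actual frequency exceeds the geometric mean of predicted frequencies, plus a prediction score for forecasting the answer distribution accurately. The function name `bts_scores`, the `alpha` weight, and the toy data are illustrative assumptions, not the scoring code or parameters used in the study.

```python
import math

def bts_scores(answers, predictions, alpha=1.0, eps=1e-9):
    """Sketch of Bayesian-truth-serum scoring (assumed textbook form).

    answers:     list of chosen option indices, one per respondent
    predictions: list of per-respondent lists giving the predicted
                 share of the sample choosing each option
    alpha:       weight on the prediction score (hypothetical default)
    """
    n = len(answers)
    k = len(predictions[0])
    # Actual share of respondents endorsing each option.
    xbar = [sum(1 for a in answers if a == j) / n for j in range(k)]
    # Log of the geometric mean of predicted shares for each option.
    log_ybar = [
        sum(math.log(max(p[j], eps)) for p in predictions) / n
        for j in range(k)
    ]
    scores = []
    for a, p in zip(answers, predictions):
        # Information score: positive when the chosen answer is more
        # common than collectively predicted ("surprisingly common").
        info = math.log(xbar[a]) - log_ybar[a]
        # Prediction score: penalizes forecasts far from actual shares.
        pred = alpha * sum(
            xbar[j] * math.log(max(p[j], eps) / xbar[j])
            for j in range(k)
            if xbar[j] > 0
        )
        scores.append(info + pred)
    return scores

# Toy example: three of four respondents pick option 0,
# while everyone predicts a 50/50 split.
scores = bts_scores([0, 0, 0, 1], [[0.5, 0.5]] * 4)
```

In this toy case, option 0 is more common than predicted, so respondents who chose it score higher than the respondent who chose option 1.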

Bestiary of Questionable Research Practices in Psychology 
Tamás Nagy, Jane Hergert, Mahmoud M. Elsherif, et al. 

Questionable research practices (QRPs) pose a significant threat to the quality of scientific research. However, historically, they remain ill-defined, and a comprehensive list of QRPs is lacking. In this article, we address this concern by defining, collecting, and categorizing QRPs using a community-consensus method. Collaborators of the study agreed on the following definition: QRPs are ways of producing, maintaining, sharing, analyzing, or interpreting data that are likely to produce misleading conclusions, typically in the interest of the researcher. QRPs are not normally considered to include research practices that are prohibited or proscribed in the researcher’s field (e.g., fraud, research misconduct). Neither do they include random researcher error (e.g., accidental data loss). Drawing from both iterative discussions and existing literature, we collected, defined, and categorized 40 QRPs for quantitative research. We also considered attributes such as potential harms, detectability, clues, and preventive measures for each QRP. The results suggest that QRPs are pervasive and versatile and have the potential to undermine all stages of the scientific enterprise. This work contributes to the maintenance of research integrity, transparency, and reliability by raising awareness for and improving the understanding of QRPs in quantitative psychological research. 

From Embeddings to Explainability: A Tutorial on Large-Language-Model-Based Text Analysis for Behavioral Scientists 
Rudolf Debelak, Timo K. Koch, Matthias Aßenmacher, Clemens Stachl 

Large language models (LLMs) are transforming research in psychology and the behavioral sciences by enabling advanced text analysis at scale. Their applications range from the analysis of social media posts to infer psychological traits to the automated scoring of open-ended survey responses. However, despite their potential, many behavioral scientists struggle to integrate LLMs into their research because of the complexity of text modeling. In this tutorial, we aim to provide an accessible introduction to LLM-based text analysis, focusing on the Transformer architecture. We guide researchers through the process of preparing text data, using pretrained Transformer models to generate text embeddings, fine-tuning models for specific tasks such as text classification, and applying interpretability methods, such as Shapley additive explanations and local interpretable model-agnostic explanations, to explain model predictions. By making these powerful techniques more approachable, we hope to empower behavioral scientists to leverage LLMs in their research, unlocking new opportunities for analyzing and interpreting textual data. 

Side Effects of Experience-Sampling Protocols: A Systematic Analysis of How They Affect Data Quality, Data Quantity, and Bias in Study Results 
Thomas Reiter, Sophia Sakel, Julian Scharbert, et al. 

In studies using the increasingly popular experience-sampling method (ESM), design decisions are often guided by theoretical or practical considerations. Yet limited empirical evidence exists on how these choices affect data quantity (e.g., response probabilities), data quality (e.g., response latency), and potential biases in study outcomes (e.g., characteristics of study variables). In a preregistered, 4-week study (N = 395), we experimentally manipulated two key ESM protocol characteristics for sending ESM surveys: timing (fixed vs. varying times) and contingency (directly vs. indirectly after unlocking the smartphone). We evaluated the ESM protocols resulting from the combination of these two characteristics regarding different criteria: As hypothesized for contingency, indirect protocols resulted in higher response probabilities (increased data quantity). But they also led to higher response latencies (reduced data quality). Contrary to our expectations, the combined effect of contingency and timing did not significantly influence response probability. We also did not observe other effects of timing or contingency on data quality. In exploratory follow-up analyses, we discovered that timing significantly affected response probability and smartphone-usage behaviors, as measured by screen logs; however, these effects were likely attributable to time-of-day effects. Self-reported states showed no differences based on the chosen ESM protocol, and similar trends were found when correlating primary outcomes with external criteria, such as trait affect and well-being. Based on the study’s findings, we discuss the trade-offs that researchers should consider when choosing their ESM protocols to optimize data quantity, data quality, and biases in study outcomes. 

The Response-Process-Evaluation Method: A New Approach to Survey-Item Validation 
Melissa G. Wolf, Elliott Ihm, Andrew Maul, Ann Taves 

Pretesting survey items for interpretability and relevance is a commonly recommended practice in the social sciences. The goal is to construct items that are understood as intended by the population of interest and test if participants use the expected cognitive processes when responding to a survey item. Such evidence forms the basis for a critical source of validity evidence known as the “response process,” which is often neglected in favor of quantitative methods. This may be because existing methods of investigating item comprehension, such as cognitive interviewing and web probing, lack clear guidelines for retesting revised items and documenting improvements and can be difficult to implement in large samples. To remedy this, we introduce the response-process-evaluation (RPE) method, a standardized framework for pretesting multiple versions of a survey. This iterative, evidence-based approach to item development relies on feedback from the population of interest to quantify and qualify improvements in item interpretability across a large sample. The result is a set of item-validation reports that detail the intended interpretation and use of each item, the population it was validated on, the percentage of participants that interpreted the item as intended, examples of participant interpretations, and any common misinterpretations to be cautious of. We also include an empirical study that compares the RPE method with cognitive interviewing in terms of the quality of data gathered and the resources expended. Researchers may find that they have more confidence in the inferences drawn from survey data after engaging in rigorous item pretesting. 

Six Fallacies in Substituting Large Language Models for Human Participants 
Zhicheng Lin 

Can artificial-intelligence (AI) systems, such as large language models (LLMs), replace human participants in behavioral and psychological research? Here, I critically evaluate the replacement perspective and identify six interpretive fallacies that undermine its validity. These fallacies are (a) equating token prediction with human intelligence, (b) treating LLMs as the average human, (c) interpreting alignment as explanation, (d) anthropomorphizing AI systems, (e) essentializing identities, and (f) substituting model data for human evidence. Each fallacy represents a potential misunderstanding about what LLMs are and what they can tell researchers about human cognition. In the analysis, I distinguish levels of similarity between LLMs and humans, particularly functional equivalence (outputs) versus mechanistic equivalence (processes), while highlighting both technical limitations (addressable through engineering) and conceptual limitations (arising from fundamental differences between statistical and biological intelligence). For each fallacy, specific safeguards are provided to guide responsible research practices. Ultimately, the analysis supports conceptualizing LLMs as pragmatic simulation tools—useful for role-play, rapid hypothesis testing, and computational modeling (provided their outputs are validated against human data)—rather than as replacements for human participants. This framework enables researchers to leverage language models productively while respecting the fundamental differences between machine intelligence and human thought. 

A Cross-Sectional Study of the Completeness of Preregistrations by Psychological Authors From German-Speaking Institutions 
Lena Hahn, Andreas Glöckner, Mario Gollwitzer, et al. 

Preregistering confirmatory research aims at reducing researchers’ degrees of freedom and increasing transparency to ultimately increase replicability. Yet the extent to which preregistrations actually achieve these goals depends on the completeness of a preregistration. To scrutinize the completeness of current preregistrations, we coded all preregistrations mentioned in journal articles published by psychologists from institutions in German-speaking countries in 2020 as to whether they contain six procedural specifications: (a) the hypothesized pattern of results, (b) the measures, (c) planned sample size, (d) exclusion criteria, (e) planned analyses to test the hypotheses, and (f) a time stamp. In addition, we consider transparency-related elements. Our results show that the completeness of preregistration was associated with neither the journal’s impact factor nor its transparency and openness promotion factor. Approximately half of the preregistrations contained all six procedural specifications. Hence, in line with previous research, our findings indicate that when considering publications from diverse subdisciplines of psychology, there was room for improvement regarding the completeness of preregistrations in psychology. We discuss steps to improve preregistration completeness.

The DECIDE Framework: Describing Ethical Choices in Digital-Behavioral-Data Explorations 
Heather Shaw, Olivia Brown, Joanne Hinds, Sophie J. Nightingale, John Towse, David A. Ellis 

Behavioral sciences now routinely rely on digital data, supported by digital technologies and platforms. This has resulted in an abundance of new ethical challenges for researchers and ethical-review boards. Several contemporary high-profile cases emphasize that ethical issues often surface after the research is published, once harm has already occurred. Consequently, implementing safeguards in digital-behavioral research is often reactionary and fails to adequately prevent harm. In response, we propose the DECIDE (Describing Ethical Choices in Digital-Behavioural Data Explorations) framework, which encourages ethical reflections and discussions throughout all stages of the research process. The framework presents several questions designed to help researchers view their work from new perspectives and uncover ethical issues they might not have anticipated. We provide several resources to support researchers with their ethical reflections and discussions, including (a) the DECIDE framework spreadsheet, (b) the DECIDE desktop app, (c) information documents, and (d) flowcharts. In this article, we provide suggestions on how to use each resource to encourage proactive discussions of how ethical issues may apply to specific research contexts. By promoting continuous ethical considerations, safeguards can be put in place throughout the research project, even after research commencement. The DECIDE framework shifts ethical reflection away from being reactive toward a more proactive endeavor, reducing the risk of harm and the misuse of digital-behavioral data. 

Citing Decisions in Psychology: A Roadblock to Cumulative and Inclusive Science 
Katherine M. Lawson, Brett A. Murphy, Jovani Azpeitia, Ella J. Lombard, Terrènce J. Pope 

Citations are the main avenue through which scholarly contributions are recognized. However, decisions about what to cite (or not cite) are often made without much systematic thought. Suboptimal citing practices undermine psychological science. Yet psychological science as a field has yet to comprehensively discuss ways to improve authors’ citing decisions. We outline the importance of citing for promoting the cumulativeness of the scientific endeavor, which encompasses promoting diversity, equity, and inclusion in the field. We describe how psychologists make citing decisions and some negative consequences when citation decisions are negligent (or even fraudulent). Moreover, we describe how citations driven by insular professional networks can reinforce historical exclusion and result in reference sections that reflect a failure to meaningfully search and engage with existing literature. Then, we review some potential causes of problematic citing behaviors, which include factors that manifest at the level of the individual, such as a desire to elevate one’s own professional profile, and systemic factors, such as the exponential growth in published literature. Finally, we offer strategies for the field, journals, labs, and individuals to improve citations. In framing our arguments and recommendations, we refer to empirical data collected on citing decisions from editorial-board members (N = 213) at 23 psychology journals. 

Bridging Null Hypothesis Testing and Estimation: A Practical Guide to Statistical Conclusion Drawing From Research in Psychology 
Henk A. L. Kiers, Jorge N. Tendeiro 

A well-known problem of null hypothesis significance testing is that it cannot be used to find support for the null hypothesis. A common solution for this is to replace the exact 0 value by an interval associated with values that are close to 0. This approach is denoted as equivalence testing and is a special case of procedures that test intervals of values against each other. Smiley et al. recently published a unified framework of statistical inference and suggested a straightforward method of testing all sorts of interval-based hypotheses in a unified way. In the present article, we discuss three alternative general approaches, based on Bayesian analysis, that have the advantage that the ensuing probabilities can be interpreted as probabilities of the population parameters rather than probabilities of the data (as is the case with frequentist methods). These methods (in some form) have been previously suggested, but here, we bring them together and show how they can be used for Smiley et al.’s full unified framework of statistical inference, now complementing it with three Bayesian counterparts. In particular, we show how each of the methods works in the analysis of a leading example data set involving a test on proportions. Subsequently, their relative pros and cons are discussed, and it is explained how the methods can be used for many statistical-analysis questions in practice using R and/or JASP. This is illustrated on an empirical data set for comparing means of two groups.
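As a generic illustration of reading a Bayesian result as a probability about the population parameter, the sketch below computes the posterior probability that a proportion lies inside an interval, using a Beta prior and simple numerical integration. It is not one of the three approaches from the article, which the abstract does not specify; the function names, the uniform Beta(1, 1) prior, and the example counts are all assumptions.

```python
import math

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
    )

def posterior_interval_prob(successes, n, lower, upper,
                            prior_a=1.0, prior_b=1.0, steps=20000):
    """Posterior probability that a proportion lies in [lower, upper].

    Assumes a binomial likelihood and a Beta(prior_a, prior_b) prior,
    so the posterior is Beta(prior_a + successes,
    prior_b + n - successes). Integrated with the midpoint rule.
    """
    a = prior_a + successes
    b = prior_b + n - successes
    width = (upper - lower) / steps
    total = sum(
        beta_pdf(lower + (i + 0.5) * width, a, b) for i in range(steps)
    )
    return total * width

# Hypothetical data: 50 successes in 100 trials. How probable is it
# that the true proportion is within 0.10 of 0.5?
p_near_half = posterior_interval_prob(50, 100, 0.4, 0.6)
```

Unlike a p value, `p_near_half` is a statement about the parameter itself, given the data and the assumed prior.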
