A More Mature Approach to Credibility Is Needed to Build Trust in Science

APS’s Advocacy Task Force is committed to bringing psychological science into the forefront of both public policy and public understanding. This regular column features insights from psychological scientists who have become vocal advocates for science-driven policy.

For our latest column, managing editor of the Observer, Hannah O. Brown, spoke with Brian Nosek. Nosek is a psychology professor at the University of Virginia and executive director of the Center for Open Science (COS).

From 2019 to 2023, the Center for Open Science led a project called Systematizing Confidence in Open Research and Evidence (SCORE) that worked with 865 researchers to predict whether scientific studies were credible. The program was funded by a grant from the U.S. Defense Advanced Research Projects Agency (DARPA). It was designed to assess the ability of both human and machine reviewers to determine whether claims from 3,900 papers were replicable. The findings have been published in a series of articles in Nature and other journals.

In May 2025, COS released a statement in response to the Trump administration’s executive order titled “Restoring Gold Standard Science.”

“Ordinarily, such prominent promotion of these practices would be cause for COS to celebrate advancement of its mission,” the May 2025 statement says. “Unfortunately, their application in this Executive Order is counterproductive for open science’s purpose to accelerate discovery, advance treatments, and create knowledge.”

In this interview, Nosek touches on big takeaways from the SCORE program, how evidence from the program contradicts the 2025 executive order, and his vision for a more trustworthy scientific landscape.

Brown and Nosek spoke on April 9, 2026. Edited excerpts of their conversation follow.

What problem was SCORE designed to solve, and why did it take a DARPA grant to support it?

I don’t speak for DARPA, but the goal of the program was to see if we could create automated indicators of research credibility. I believe that they made this big investment because understanding the social behavioral sciences is very important for their goals. Providing for national security, they have to make a lot of behavioral and social decisions about how to help manage and advance national interests.

The challenge for DARPA, for the Department of Defense, for the world more broadly, is that understanding how trustworthy a finding is, is a hard problem. Replicability is only one part of the credibility or trustworthiness of a finding more generally. For practical purposes, taking credibility down to just, “Can we predict replicability?” was like, OK, we can start somewhere. This was justified based on work that had been done primarily in the psychological literature, suggesting that establishing replicability is harder than people might realize.

We published the Reproducibility Project in Psychology in 2015. That was a collaboration of 280-something psychologists who tried to replicate findings in our field and found that it was challenging. Findings like that prompted this investment. But conducting replications is expensive and takes a lot of time, so if we had automated indicators that were reasonably valid for predicting outcomes like replicability, then that could at least help us direct attention and resource allocation.

When you say the need to understand the credibility of different studies, do you mean for researchers, the public, or both?

It is ultimately for everyone, right? Researchers, policymakers, practitioners, and people in general have interests in research findings. Everyone should be asking: Should I trust them? Should I use them? Should I apply them?

Establishing credibility is a hard problem, and it takes a long time, and our current indicators are simplistic, such as: “Is it published? Was it peer reviewed?” Just being published doesn’t mean it’s true. We all know that, yet we don’t have other indicators of credibility. There are a lot of dimensions to credibility. So, an implication of SCORE is to stimulate innovation in developing credibility indicators that go beyond the simplistic “published or not” to help researchers and the public understand this better.

This seems to bring up the concept of trust with the public. Communicating with them and trying to earn their trust is a whole other can of worms. Was that part of the equation for SCORE or is that a next step after creating this way of having a stable measure of credibility?

That’s obviously a challenge for the scientific community in the current social climate, because now there are a lot more truth claims from people who are using more or less evidence. With the decentralization of information, the internet has made it possible for any of us to make truth claims and broadcast them publicly. Some of us are really creative in how we create videos about our claims, or create websites about our claims, or create personas about our claims. Some of those have no basis in evidence or science, and others do. That creates a broad social challenge. I think psychological science is very well positioned to show its work in a way that competing sources of truth claims about psychology cannot. Even so, delivery of new credibility indicators to the public is probably multiple steps away. There’s a lot of work to do to establish the validity and reliability of indicators of credibility.

In a recent New York Times article, you say AI “is not there yet” when it comes to making predictions on replicability. Human experts predicted replication outcomes with about 76%–78% accuracy, while machines didn’t perform consistently. Why are machines so bad at this?

The easy answer is we don’t know, but the more elaborate answer is there’s a lot that goes into predicting whether something is replicable or not. There’s important context for SCORE, which began in 2019, and the world of AI at that time did not include what we now understand AI to be. There were no large language models (LLMs) being used in public settings or research, so the types of AI solutions that are part of SCORE directly were whatever was the leading edge in 2019: knowledge graphs, a bot trading market, and semantic parsing of the text of the papers themselves.

So much has changed in the AI space. Now, obviously, the question is how well can LLMs do with these data? We took the opportunity that the SCORE data had not been made public to run an open competition. We gave the same challenge to current AI researchers that we gave within the SCORE program. Anybody with an AI solution who wants to try to predict these outcomes can do so.

What we saw in round one of the competition was that the machines did terribly. Even the LLMs. Then, in round two, they got a lot better. They’re getting close to how well the human assessors did.

Will they get even better? Maybe. Who knows? My bottom line is that humans have a lot of capacity already. We know things. We can evaluate evidence.

AI systems are pulling in whatever information is already available to achieve similar capabilities. It isn’t surprising that they start worse on every dimension, and the rate of change is quite remarkable. It is entirely possible that in round three, we will see the machines outperforming humans. It’s also entirely possible that there is a ceiling that the machines will never be able to exceed for reasons we don’t understand yet. This is a very active area of research.

SCORE found there is no single indicator of research credibility—credibility is multidimensional and context-dependent. Yet the 2025 executive order essentially demands a checklist of criteria every study must meet. How directly does SCORE’s evidence contradict that framing?

There are several things in the executive order. The things that we like are the articulation of important principles of how to get good evidence. The order is positive in that it can help to increase transparency, credibility, and assessment of credibility of research.

Where the order goes off target is treating those principles as a checklist, as if all of them need to be met in order to treat a research finding as credible and to use that finding in policymaking. If we adopt that standard where it’s all or nothing, then almost no research will ever be used for decision making.

There is no perfect study. Responsible decision making uses the best available evidence and surfaces the uncertainties of that evidence. Decision makers should have the best available evidence and make decisions with humility and understanding of the evidence gaps that exist.

If we are not willing to use scientific evidence unless it meets this incredibly high bar, then what are we basing our decisions on? Ideology. Intuition. My primary worry is that the executive order creates a double standard. It imposes very high standards for scientific evidence and no standards at all for any other basis for making decisions.

The Center for Open Science’s statement says that a pattern of “sound” or “transparent” science language has been historically used to suppress inconvenient findings. How does this executive order fit that pattern?

I don’t mind setting a high bar, if we apply it universally. We would say, “Wow, almost nothing we have meets this standard, so we really need to invest more in science, because we want to meet this very high bar.” I don’t mind being aspirational. The problem is when high standards are turned into political weapons that are applied to specific evidence or claims that I don’t like. That’s where it becomes a weapon rather than a tool.

Ten years from now, what does a more trustworthy scientific landscape look like?

My primary goal is to promote transparency across the entire research lifecycle, from plans to the data, the materials, the code that gets generated, to the outcomes that are produced, and to the evaluation process as the community assesses that evidence. If we can make the lifecycle more universally available, it will be easier to assess the evidence and improve over time. Because we can see each other’s work, we can debate it more productively, we can see the insights that are gained and converge toward truth more readily. And, with lifecycle open science, we can engage in the public sphere by showing work and illustrating what credible work looks like.

The current opaque system enables bad actors. It enables predatory publishers that will publish articles for a fee without any review at all because the authors get the reward they need—it’s published. This is corrosive. Paper mills exploit the opaque system by creating fake papers and then selling authorship to researchers. These destructive services succeed because publication is the currency of advancement. If all you need is it to be published, then why not just pay for it rather than doing all that hard work of conducting research?

Showing our work makes clear how hard we’re working to get to trustworthy knowledge. And, if showing your work is standard practice, it will be much harder for bad actors to meet the same standard. The dysfunctional markets are effective because they exploit the fact that papers can be generated and published without doing the hard work. So, in my view, the people who will benefit the most from a fully transparent research system are those who are working hard to conduct good research.

Do you think that in order for that trust of science to happen on a broader scale, the average person needs to be willing to do more work?

I think it’s too much to ask because most of what we trust is based on testimony. Very little of what I know is because I figured it out myself. Even in my own area of research, I depend on others to accurately report what they did, and others to review the data and code. If I am dependent on testimony, even for stuff that’s relatively close to my area of work, then what are we going to expect of anybody else?

We have to mature the way that testimony becomes reliable and surface where it isn’t reliable. This is a big challenge. It is a challenge that psychology as a field is well positioned to take on, contributing to trust in science writ large.

So more of a systems change than an expectation that individuals need to change their behavior?

Yeah. Exactly right. I need to be able to trust the testimony of others. I don’t have time to read and reanalyze all vaccine data or climate science. Putting the onus on people to study more and become more informed about literally everything is not going to happen, right?
