Major advances in computing technology, combined with vast digital networks and the immense popularity of social media platforms, have given rise to unimaginably large troves of information about people. It’s estimated that the amount of digital data in existence today is in the thousands of exabytes, where one exabyte is 10 to the 18th power bytes.
This era of Big Data has enormous potential to change the way psychological scientists observe human behavior. But just as it creates new opportunities, access to huge chests of information also creates new challenges for research, said Michael N. Jones of Indiana University Bloomington, introducing a theme program on Big Data at the 2014 APS Annual Convention in San Francisco.
“Each little piece of data is a trace of human behavior and offers us a potential clue to understanding basic psychological principles,” said Jones. “But we have to be able to put all those pieces together correctly to better understand the basic psychological principles that generated them.”
The study of language development, one of Jones’s own research interests, is a great example of a line of research poised to benefit from Big Data. Collecting data from infants in naturalistic settings is extremely time-consuming and typically yields small samples. As a result, testing theories about the way children learn language takes a long time.
Big Data can help expedite the process. As a proof of concept, Jones showed how more than 100,000 words from natural language could be fed into a computer model based on the theory of associative learning — the idea that children group words together based on how often they’re used near other words. Jones showed that, as its analysis progressed, the model indeed recognized that “computer” and “data” were more closely related as word categories than, say, “computer” and “aardvark.”
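The idea of grouping words by how often they appear near other words can be illustrated with a minimal sketch. This is not Jones's model: the toy corpus, the stopword list, the window size, and the function names are all assumptions made for illustration, and cosine similarity over co-occurrence counts stands in for whatever learning mechanism the actual model uses.

```python
from collections import Counter, defaultdict
import math

# Hypothetical stopword list; function words are dropped so that they
# do not dominate the co-occurrence counts in such a tiny corpus.
STOPWORDS = {"the", "is", "by"}

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words that appear within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t not in STOPWORDS]
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus, invented for illustration only.
corpus = [
    "the computer stores the data",
    "the data is processed by the computer",
    "the aardvark eats the ants",
    "the ants fear the aardvark",
]

vectors = cooccurrence_vectors(corpus)
sim_related = cosine(vectors["computer"], vectors["data"])
sim_unrelated = cosine(vectors["computer"], vectors["aardvark"])
```

In this toy corpus, “computer” and “data” share neighboring words, so their similarity is higher than that of “computer” and “aardvark,” which share none. A realistic model would need far more text, which is where Big Data comes in.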
Ultimately, said Jones, a similar analysis can be done to study associative learning theories in direct samples of child conversation. “These models are quite good at learning things from noise, as long as they have enough data to go on,” he said.
Big Data might help researchers get to a point where they can collect behavioral information without sampling human participants at all, Cornell University information scientist Tanzeem Choudhury said. Technology such as smartphones and wearable sensors can gather information on physical activity, social interactions, geographic location, and so on.
The advantage of this type of data collection is that it’s effectively invisible to users; it doesn’t require their time or energy, and it drastically reduces self-report errors.
“We can continuously get measurements of behavior without bugging people to fill out surveys,” said Choudhury. “We can potentially get continuous measurement without actually having to engage users all the time and rely on their self-input.”
Choudhury has been involved in a number of such projects already. StressSense tracks where people experience stress most frequently throughout the day to help them avoid anxious situations. MyBehavior uses physical activity patterns to suggest ways to stay in shape — walking to work more often along a route users seem to enjoy, for instance. MoodRhythm lets patients with bipolar disorder monitor sleep and social interactions to maintain balanced mood and energy levels, a major improvement over pen-and-paper tracking of daily behavior. (The programs remain in development as smartphone apps.)
The goal is to make it easier than ever for people to improve their lives, said Choudhury: “Just like sensing [technology] has become invisible, can we actually make behavioral change invisible?”
Big Data also enables researchers to reconsider past problems in fresh ways, said APS Fellow Brian M. D’Onofrio of Indiana University Bloomington. In particular, he said, researchers should consider repurposing data that might have been collected for other reasons. Repurposing large data samples can help researchers produce insights that traditional samples can’t, as well as achieve the statistical power many lab studies lack, a big challenge as psychology makes a push to improve its methodology and replication process.
“With Big Data, it gives you the opportunity to use several different types of quasi-experimental designs, to help rule out alternative explanations,” D’Onofrio said.
D’Onofrio and collaborators recently repurposed millions of personal records compiled in Sweden to challenge the conventional notion that smoking during pregnancy directly causes poor behavioral outcomes, such as criminality, later in life. In one study, the researchers analyzed 50,000 siblings whose mothers smoked during one pregnancy but not another. They determined that family background factors, as opposed to exposure to smoking during pregnancy, accounted for the association with criminal convictions. Such realizations can greatly improve interventions: In this case, getting women to quit smoking should be only part of the focus of a broader suite of social services.
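The logic of a sibling comparison can be sketched with a small, entirely invented dataset. This is not the analysis the researchers ran on the Swedish records. The family labels, the outcome values, and the function names below are hypothetical, and the sketch only shows how an association seen across the whole sample can vanish when siblings are compared within the same family.

```python
from collections import defaultdict

# Invented records: (family_id, exposed_during_pregnancy, outcome).
# Families A and B have a high-risk background regardless of exposure.
records = [
    ("A", True, 1), ("A", False, 1),
    ("B", True, 1), ("B", False, 1),
    ("C", False, 0), ("C", False, 0),
    ("D", False, 0), ("D", False, 0),
    ("E", True, 0), ("E", False, 0),
]

def mean(values):
    return sum(values) / len(values)

def naive_difference(rows):
    """Outcome rate among exposed minus unexposed, ignoring family."""
    exposed = [y for _, x, y in rows if x]
    unexposed = [y for _, x, y in rows if not x]
    return mean(exposed) - mean(unexposed)

def sibling_difference(rows):
    """Average exposed-minus-unexposed difference within discordant families."""
    families = defaultdict(lambda: {True: [], False: []})
    for family_id, x, y in rows:
        families[family_id][x].append(y)
    diffs = [
        mean(group[True]) - mean(group[False])
        for group in families.values()
        if group[True] and group[False]  # keep families with both kinds of sibling
    ]
    return mean(diffs)

naive = naive_difference(records)
within_family = sibling_difference(records)
```

Here the naive comparison shows a higher outcome rate among exposed children, while the within-family difference is zero, the pattern one would expect if family background rather than exposure drives the association.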
Big Data is already producing positive change in the world of Web search. The billions of Internet searches that occur each day leave behind behavioral logs that analysts use to improve search engines over time, said Susan T. Dumais of Microsoft Research. Without that vast record, sites like Google and Bing would never be able to take the 2.4 words in an average Internet search and convert them into something useful.
“Behavioral logs allow us to characterize, with a richness and fidelity that we’ve never had before, what it is people are trying to do with the tools and systems they’re interacting with,” said Dumais.
By mining behavioral logs, analysts can create personalized algorithms that improve the search experience for users. If Dumais searches for “sigir,” for instance, she probably wants the homepage of the Special Interest Group on Information Retrieval (abbreviated SIGIR). If Stuart Bowen Jr. performs the same search, he probably wants the website for his position: Special Inspector General for Iraq Reconstruction (also abbreviated SIGIR).
In other words, systems can learn that words and acronyms in isolation aren’t always the best way to predict what a user wants from a search. Modeling the context in which a query is issued is important for improving Web search. Previous search activity matters, as do the location and time of the query. A search for “US Open” performed in late spring likely refers to golf, for instance, while the same search in late summer likely refers to tennis.
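A minimal sketch of how timing can disambiguate a query, assuming a hypothetical click log in which each entry records the query, the month it was issued, and the kind of result the user clicked. The log entries and function names are invented, and real search engines combine many more signals than this.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (query, month issued, category of the clicked result).
click_log = [
    ("us open", 6, "golf"),
    ("us open", 6, "golf"),
    ("us open", 6, "tennis"),
    ("us open", 9, "tennis"),
    ("us open", 9, "tennis"),
    ("us open", 9, "golf"),
]

def build_context_model(log):
    """Count clicked categories separately for each (query, month) context."""
    counts = defaultdict(Counter)
    for query, month, clicked in log:
        counts[(query, month)][clicked] += 1
    return counts

def most_likely_intent(model, query, month):
    """Return the most-clicked category for this context, or None if unseen."""
    clicks = model.get((query, month))
    if not clicks:
        return None
    return clicks.most_common(1)[0][0]

model = build_context_model(click_log)
june_intent = most_likely_intent(model, "us open", 6)
september_intent = most_likely_intent(model, "us open", 9)
```

The same two words map to different intents once the month is part of the lookup key. Location or previous searches could be added to the key in the same way.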
“Before you were able to collect Big Data, the person who spoke loudest, or the highest-paid person’s opinion, would dominate,” said Dumais. “Now the data, especially when derived from carefully controlled Web-scale experiments, dominates.”
Big Data can even help psychological scientists study studies, said Tal Yarkoni of the University of Texas at Austin. Yarkoni and others recently developed Neurosynth, an online program that analyzes huge amounts of fMRI data to guide users toward a subject of interest. To date, said Yarkoni, Neurosynth has synthesized research from over 9,000 neuroimaging studies and about 300,000 brain activations.
One major goal of Neurosynth is to distinguish between brain activity that is consistently associated with a particular psychological process, but is nonspecific, and brain activity that implies a high probability that a specific psychological process is present. For example, painful physical stimulation might consistently produce a certain pattern of brain activity, and yet that pattern of activity need not imply the presence of pain; other mental states potentially produce a similar pattern. Inferring mental processes from observed brain activity — a process known as “reverse inference” — is very difficult to do in any individual neuroimaging study.
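The gap between the two kinds of brain activity can be made concrete with Bayes' rule. The sketch below is not Neurosynth's implementation, and the probabilities are invented for illustration. It only shows why a pattern that reliably appears during pain may still be weak evidence of pain if other mental states produce it too.

```python
def reverse_inference(p_act_given_process, p_process, p_act_given_other):
    """P(process | activation) by Bayes' rule.

    p_act_given_process: how consistently the process produces the activation
    p_process:           prior probability that the process is present
    p_act_given_other:   how often the activation appears without the process
    """
    p_activation = (
        p_act_given_process * p_process
        + p_act_given_other * (1 - p_process)
    )
    return p_act_given_process * p_process / p_activation

# Illustrative numbers only. Both patterns appear in 90% of pain trials.
# The first also appears in 60% of non-pain trials (consistent but nonspecific).
nonspecific = reverse_inference(0.9, 0.2, 0.6)
# The second appears in only 5% of non-pain trials (consistent and specific).
specific = reverse_inference(0.9, 0.2, 0.05)
```

With these assumed numbers, the nonspecific pattern implies pain with a probability under 0.3, while the specific pattern implies it with a probability above 0.8. Estimating how often a pattern appears across many other mental states requires data from many studies, which is difficult for any single experiment to supply.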
Neurosynth makes reverse inference possible by amassing loads of images and study data in one place. For example, the database helps researchers identify brain regions that are specifically related to pain instead of working memory or emotion, even if some of the active brain regions overlap in all three cases. Tests show that in many cases, Neurosynth performs as well as analyses done manually by sifting through the research literature, but it replaces hundreds of hours of research with the push of a button, said Yarkoni.
“That’s the long-term goal,” Yarkoni said. “To do this in a quantitative, automated way instead of a manual, qualitative way.”
Long-term goals were the theme of the program, since Big Data is still emerging as a scientific presence. Not everyone believes it will create a paradigm shift. (Behavioral scientist Dan Ariely of Duke University has compared Big Data to teenage sex, in that “everyone talks about it, nobody really knows how to do it.”) Even if Big Data does change statistical analysis, it can’t replace strong behavioral theories or experimentation, said Jones. But insofar as Big Data can refine those theories or sharpen those experiments, researchers can’t afford to ignore it.