Beyond Words: Why TalkBank is Crucial for Spoken Language Research

May 1, 2025

Most linguistic datasets focus on written text, but what about the way we actually speak? TalkBank, the world’s largest open-access repository of spoken language, is helping researchers understand everything from child development to dementia, bilingualism, and even classroom learning.

In this episode of Under the Cortex, host Özge Gürcanlı Fischer Baum speaks with Brian MacWhinney from Carnegie Mellon University, who recently published an article in APS’s journal Current Directions in Psychological Science. MacWhinney, the creator of TalkBank, highlights how spoken language research is transforming our understanding of psychology and communication. The conversation evolves into a discussion of the importance of open-access language databases, the role of artificial intelligence (AI) in analyzing speech, and the future of spoken language research across cultures and disciplines.

Send us your thoughts and questions at [email protected].

Unedited Transcript

[00:00:07.160] – APS’s Özge Gürcanlı Fischer Baum

How can we systematically study spoken language on a large scale? One powerful tool researchers use is TalkBank, a vast open-access database of spoken language samples from diverse contexts, speakers, and languages. By analyzing these recordings, scientists can uncover patterns in communication, language development, and even cognitive processes. This is Under the Cortex. I’m Özge Gürcanlı Fischer Baum with the Association for Psychological Science. Joining me is Brian MacWhinney from Carnegie Mellon University, who has a recent article on this topic in APS’s journal, Current Directions in Psychological Science. Together, we will explore how TalkBank is transforming linguistic research and what it can tell us about human communication. Brian, thank you for joining me today. Welcome to Under the Cortex.

[00:01:00.010] – Brian MacWhinney

Well, thank you for inviting me, Ozge. By the way, I’m over here in Hong Kong right now, and it’s great to be able to talk with you.

[00:01:07.540] – APS’s Özge Gürcanlı Fischer Baum

Yeah. Let me start with our first question. Could you tell us a little bit about yourself? What type of psychologist are you?

[00:01:16.890] – Brian MacWhinney

Well, I began with child language, as a child language acquisition researcher. I even did my dissertation on how Hungarian children learn to speak. But I then became more and more interested in the idea that adults cannot learn like children. And I started to study second language acquisition, and I’m still working on that to some degree. But over time, I just keep on getting interested in more and more things. And this whole TalkBank thing has certainly changed my life because we’ve been brought into the study of everything in the world, like aphasia and dementia and stuttering and second language. And oh, my gosh. So I think over time, I’ve really broadened out into all the areas from which we have data. That keeps me from getting retired.

[00:02:06.090] – APS’s Özge Gürcanlı Fischer Baum

Yeah, that’s great. Let’s talk about TalkBank. What inspired you to create TalkBank and how has it evolved since its inception?

[00:02:14.170] – Brian MacWhinney

We really began with CHILDES for many years. We were just doing child language data. The predecessor was Roger Brown, who sent around data from his study of three children on mimeographed sheets.

[00:02:28.350] – APS’s Özge Gürcanlı Fischer Baum

For those of our listeners who don’t know, Brown Corpus is a very famous one, right?

[00:02:34.240] – Brian MacWhinney

Roger Brown was a professor at Harvard, and he studied three children, Adam, Eve, and Sarah. This was really the first really detailed study of children with recordings and transcripts, because a lot of it before that had been very anecdotal. And so that was a beginning of really quantitative study of child language. And of course, it was 1970. This was a long time ago. This was way back before there were even floppy disks. There was mimeograph, and he sent me a copy of all the data, but it was all extremely paper and pencil. And so as the IBM PC came in, we really said, Well, look, we could really put this all on computer and people could start working on the data and share the data. And that’s what really inspired us. So that began really in 1984, when Catherine Snow and I were at a MacArthur meeting, and both of us had the same idea at the same time, which is to start to share data. Also, Dan Slobin had some idea along this line, and we had a meeting in Nijmegen, and I think all of us realized that this was a great thing to do.

[00:03:43.890] – APS’s Özge Gürcanlı Fischer Baum

Yeah, we will come to the open science principle of TalkBank, but let me start with a more basic question. How does TalkBank differ from other large-scale linguistic data sets?

[00:03:58.930] – Brian MacWhinney

Great question. This is really very fundamental because most of the other large databases are of written language. Now, there is an exception, of course, language… The LDC, the Linguistic Data Consortium, has a lot of spoken language. But one of the biggest differences is that all of the data in TalkBank are in the same format. So you just have to learn the format, which takes a while. But once you’ve learned it, then you can look at everything across languages, across different types of speakers, ages, and so on. Whereas other databases have one corpus that was contributed by one person in one format, and then you have no way of comparing across. So these are two huge differences, the emphasis on spoken language and then also the use of a consistent format is a really big difference from all other databases, I would say.

[00:04:54.970] – APS’s Özge Gürcanlı Fischer Baum

Yeah, definitely. The consistent format brings in predictability for the researchers, so they don’t need to reorganize the data themselves. That is a really strong side of this database. Let’s talk about spoken language a little bit. Why has spoken language been historically underrepresented in large linguistic databases in your opinion?

[00:05:18.250] – Brian MacWhinney

Well, this, of course, gets to the core of many of the issues here, which is, well, obviously, the first thing people think about is privacy, and they say, Well, people don’t want to share data. But it turns out that that’s not really very true. We include people with aphasia, children in the home, children at school, second language learners. All of those people have given informed consent to share their data. So it’s not really true that the participants are worried because actually, in many cases, they’re proud to have themselves included. And they want other people to, in the case of aphasics, they want other people to understand what it’s like to be aphasic. So that is not the problem. The problem is often that researchers don’t ask for informed consent. And at that point, then they don’t pass their IRB review. So the real barrier is often that the researcher is not really thinking about data sharing. And that is changing, but very slowly. So that is a problem, and it’s a shame. It’s not that we would want everything. That would be too much data, wouldn’t it? But we would literally like to have more data, particularly on underrepresented languages and underrepresented groups.

[00:06:29.340] – Brian MacWhinney

So one of our current real goals is to really broaden out representation, actually.

[00:06:35.740] – APS’s Özge Gürcanlı Fischer Baum

Yeah, definitely. TalkBank is a great start to address spoken language and conduct studies on that. I would like to talk about the content of TalkBank a little bit. It covers everything from child language development to aphasia and bilingualism. What are some of the most surprising discoveries that have come from its data?

[00:06:58.830] – Brian MacWhinney

Well, this was the toughest question that you gave me because really, we know a lot about humans already. Often in studies of humans, we confirm what we’ve suspected. But having said that, I do think that the application of large language models to our data is really quite remarkable and surprising, and particularly right now in the area of dementia, where we have something like 500 computer science labs around the world, not to mention all the graduate students, who are developing machine learning and large language models to understand the onset of dementia. The only corpus that is available for their work is the one in TalkBank. There is no other publicly available data on dementia. So this is really such a change. I never really foretold that suddenly TalkBank would become the focus of this whole AI revolution. Quite honestly, we backed into this. But it’s quite exciting, and it’s great to see how well they’re doing. You can really use language as a really non-invasive way of studying the onset or non-onset of dementia. So as I say, this is being picked up by pharmaceutical companies, by computer science labs, by all the big tech companies, and they’re all based on our data.

[00:08:23.600] – Brian MacWhinney

That’s surprising, I think.

[00:08:26.190] – APS’s Özge Gürcanlı Fischer Baum

Yeah, and it is impressive, like you said. Even in passing, you talked a little bit about aphasia. How has TalkBank contributed to understanding speech disorders like aphasia and stuttering?

[00:08:42.160] – Brian MacWhinney

Well, I mean, there is the classical theory of aphasia, which is Broca’s, Wernicke’s, anomia, conduction, and so on. And we do have data that pretty much conforms to that. But even when you go into any of those, you find enormous individual differences. And first of all, the videos are available on the web for researchers. They can actually study in detail the problems or the possibilities that these people have in terms of communicating often with their gestures, what kinds of retail patients they have. And this is an enormous boon for teaching clinicians about how to deal with aphasia, because otherwise, they have really no contact. And we have some real tools for really teaching clinicians about what aphasia is like and really sharpening their understanding. So I think that’s one real important part. I may be jumping ahead a bit here, but I think one of the biggest gaps in our understanding of things like aphasia, stuttering, even child language, second language learning, is the lack of longitudinal data. So we really want to understand how an aphasia develops over time, and that’s particularly in dementia, we want to know that, but also child language. Now, that’s actually better in child language.

[00:10:01.730] – Brian MacWhinney

We do now have some longitudinal data, but for second language, we don’t have enough. We want to see how people are progressing. And so that is really one of our goals. And we’re starting in that direction. I think that would be a big thing for the future is to have more longitudinal data. Hopefully, these new methods, these AI and all these recordings and everything and our interacting Zoom and all that will make that easier. I think that’s going to be great.

[00:10:30.110] – APS’s Özge Gürcanlı Fischer Baum

Yeah, you are right. Having longitudinal studies in dementia and dementia-related language disorders is definitely a gap in the literature. That is great news for the future. I would like to add that this is impossible to study with written language, right? Without TalkBank, there is no other way to have this open database for it.

[00:10:56.430] – Brian MacWhinney

Yeah, you could study little clips. You could do some very tight, acoustic analysis on a few sentences and stuff. But if you really want… And I should say there’s another great field called conversation analysis, and those people have a lot of data. Unfortunately, they have not decided to share that data, not in a comprehensive way. We have a lot of CA data, but given the amount and the wonderful work they do in that field, it’s a shame. Another great area that I love is sociolinguistics, and that area has also not gotten off into data sharing. So we move forward step by step. One of our new areas is Psychosis Bank, and those people are agreeing now to study data on people with psychosis, which is the whole idea of language in psychosis is a very interesting one because it does reflect patterns of thought. And that is an international effort that’s being run by a fellow, Lena Palaniyappan, and Miguel and many of his colleagues. That’s quite exciting, too.

[00:12:03.010] – APS’s Özge Gürcanlı Fischer Baum

Yeah. Like you said, with the new technologies, language could be one of the predictors of these episodes. I know that there is already data on that. There are already studies showing us the predictive power of language use. Yeah, that’s great. I mean, there are so many studies that are done using this type of data. I believe TalkBank has been used in over 12,000 published studies.

[00:12:32.820] – Brian MacWhinney

Yeah, well, that number is always growing, so it’s probably old.

[00:12:36.610] – APS’s Özge Gürcanlı Fischer Baum

Right. Here is another potential difficult question for you, Brian. Can you share an example of a study that particularly stood out to you?

[00:12:50.140] – Brian MacWhinney

Well, of course, there’s so many. I read, I probably read one eighth of all the published studies in detail. But of course, in child language, I really love that. One of the studies I really love is from Elena Lieven and colleagues, where they looked at the use of “a” and “the” by children. It turns out that you would say, well, “a” and “the” are going to be used in the same way across all different words. But it turns out that “a” is used in a very different way from “the” with different words. I have an appointment, but I don’t say the appointment, or where is the door, but I don’t say a door so much. So words are going to go… This is what they call usage-based linguistics, and I think this is such a great example of the fact that words have their own attachments and little ecosystems, that each word has a little ecosystem that it lives in. So it’s, I think, a great reflection of, yes, there are categories, I wouldn’t deny that, but there’s also this very low level of basics. And this is what large language models are doing, too.

[00:13:59.390] – Brian MacWhinney

They’re taking the data in this way and they’re making abstractions, but they also keep this lower level association. I think that’s a great example of how usage-based linguistics is really informing us.

[00:14:14.530] – APS’s Özge Gürcanlı Fischer Baum

I am so glad you mentioned this study about “a” and “the”, because as a second language speaker, I still confuse them a lot.

[00:14:23.150] – Brian MacWhinney

Turkish. Yes, Turkish has a problem with that. I know.

[00:14:25.670] – APS’s Özge Gürcanlı Fischer Baum

Yeah, because in Turkish, it is given in the language or we have a suffix.

[00:14:31.280] – Brian MacWhinney

Chinese is even worse. Yeah. And Russian also. Yeah.

[00:14:37.590] – APS’s Özge Gürcanlı Fischer Baum

I love it when children also get confused. I’m like, Okay, we are in the same boat. I would like to talk about these open science principles a little bit. TalkBank definitely follows those. It’s a problem when we have all this great data, you mentioned some examples, and when people choose not to share them. For our novice listeners who are not from the field, I would like to ask you, why is it important to make language data freely available to researchers?

[00:15:11.350] – Brian MacWhinney

Well, the core of this is that it’s so that you can do science. Science is based on the idea that different people can run the same experiment and make sure that it’s really working. They may not always want to replicate the same experiment, but they want to do something very similar. And so they need to make sure that the data that they’re looking at is really constant. I mean, think about geology. They really know where the rocks came from, and they have core samples and everything. So there’s no real worry about what the basis is. I know that paleontologists sometimes don’t share their skeletons for a few years, which is a shame, and there’s some discussion of that. But in the end, you get to see all the skeletons. And without being able to actually look at the data, a person may have an idea about gender relations and speech, and you say, Well, if I saw the data, I might not agree with that. And so that’s the whole thing in science, that you want to make sure that other people are on the same plane and that they agree with your findings by criticizing.

[00:16:16.170] – Brian MacWhinney

I mean, we do, as scientists, have to criticize and make sure that all the findings are really right. This is just core to science, I think.

[00:16:24.390] – APS’s Özge Gürcanlı Fischer Baum

Yeah. At the APS, we definitely support open science as well. Why not share the love? We are all- Yeah, that’s great. We are all in search of the truth or scientific analysis. In the beginning, Brian, you talked a little bit about the IRB process. Let me ask a question to you and see if you have more things to add. What other ethical challenges come with collecting and sharing spoken language data?

[00:17:00.510] – Brian MacWhinney

Right, this is a huge topic, probably an hour of podcast on this one. But first thing is that, as I mentioned earlier, you really must ask for informed consent. Now, that’s not always true. There are some types of data, there are survey data, because they’re so de-identified that you don’t care. Actually, most of the data in TalkBank, the ones that are just auditory data, not video, are essentially de-identified. There are people who claim that voice can be identified, but that’s not actually true. I mean, it might be true in a very small community. If you knew exactly that there were a group of 30 Mennonites in Western Pennsylvania, and this is one of those people, then you can say which of those it is. But if you’re talking about a child in Chicago, you can’t identify a single voiceprint unless we had a national database of voiceprints. Even that would be very difficult. I mean, it’s not like a fingerprint. So there’s a lot of misunderstanding about that, I think. So de-identification is indeed possible. And in that case, you’re not even required to have informed consent. But having informed consent really is, I think, the right way to go.

[00:18:12.600] – Brian MacWhinney

You don’t want to violate people’s trust. But then beyond that, there has also been this thing in Europe called the GDPR, which is a really fairly reasonable data privacy act. And all of the people in Europe don’t understand what it says. It says that scientific data can be shared, particularly when they’re de-identified. But even if not de-identified, they can be shared if there’s informed consent. And then it says, well, and then also they can be shared if they’re for scientific purposes. So it makes it very clear that you can. But unfortunately, the universities in Europe don’t read their own regulations very carefully, and they block data sharing because it’s easier to block than it is to allow. Everybody has a set of lawyers that are telling them, oh, be careful, be careful. And so at the end of the day, the data sharing loses out. The only good news on this is that the federal agencies and the funding agencies, including Gates Foundation and so on, are all really pushing for data sharing. They’ve made this really a requirement. Still, people try to get out of it. It’s funny. It really is funny. But we move forward step by step.

[00:19:28.620] – APS’s Özge Gürcanlı Fischer Baum

Yeah, I’m hopeful because I think as people see the benefits of it more and more, I think they will find solutions for it. We talked about the type of research that stood out to you and the number of published studies that came out of TalkBank. But of course, it’s a big field, right? And there are still areas that are understudied. In your opinion, what areas of spoken language research are still underdeveloped, and how can TalkBank help address them?

[00:20:03.440] – Brian MacWhinney

Yeah, I mean, that’s a really good question. There are really many areas that are really begging for further data. Well, of course, there is this one type of data that has been collected called HomeBank, which is a child language database, but it includes 16 hours a day recordings that are taken in the home. And so there is a real privacy issue there. So we have to really vet the people who use it. And also we go over the transcripts of these or as much as we can to see if there’s anything embarrassing. Now, we have informed consent, of course, but still we want to make sure there’s nothing too embarrassing. But having more of that would be really fantastic. And it is out there. So that’s an area where complete coverage. And even a fellow named Uri Hasson has this 1,000K, a thousand days of video recording, complete video recording in the home. Yeah, but this is an area, this huge data. This idea of huge data collection is one. Another, there are all sorts of minor areas. We actually have some data on swallowing, of all things. I’m not really into the intervoicement swallowing. But the other is generally apraxia, people with problems with articulation, and we need much more data.

[00:21:22.750] – Brian MacWhinney

We haven’t got enough data sharing in that area, but the data is out there. That would be one area where we really have to… Of course, the biggest is this longitudinal data, as I mentioned, and that is so important. Then there’s all the data from these indigenous languages. We have now, hopefully, a wonderful collection of Mixtec and Totonac data from Mexico, from Jonathan Amith. We hope that that work gets funded. So there are a lot of anthropologists, nativists who have been collecting these data, and I think many of them are willing to add data, but there are some technical problems with those languages. And also you do need buy-in from the communities and those people, too, to extend our coverage to the 7,000 languages of the world. We’ve developed computational tools for analyzing the foreign languages that are in TalkBank. But to extend to all these what we call under-resourced languages will be a big challenge, but a really interesting one. A lot of computer scientists are working on that. That’s a big area. There are so many areas, I got to tell you. There are so many things we could do.

[00:22:37.900] – APS’s Özge Gürcanlı Fischer Baum

You partially answered my next question, but let me, for our listeners, reemphasize the cross-linguistic nature of TalkBank. There are many different languages there. What languages or communities would you most like to add?

[00:22:57.890] – Brian MacWhinney

Well, I guess certainly… Oh, gosh. I’d like to add everything. But the problem really we’ve run into is that the computational tools for some of them are weak. So one of the things we’ve been doing most recently is that we get data from researchers that haven’t yet been transcribed, and we can use automatic speech recognition to pull them in to transcripts right away. And then we can also use natural language processing tools to do a complete grammatical analysis. I mean, everything is becoming so automatic. It’s crazy. But not for these under-resourced languages. But at the same time, people at Stanford are doing great work, and CMU, too, Carnegie Mellon, on developing tools for these under-resourced languages. Sometimes if you know one language in a group, like the Bantu languages, you can then extend your AI, this is all AI, right, to these new languages. So the computational tools are fantastic. So I think, I mean, I’d like to add anything, everything we can. What we need is that sometimes we have the data, but then it takes more work to bring it up to the standard of the other languages, all the speech recognition and the language processing.

[00:24:14.330] – Brian MacWhinney

At some point, this is a job for the whole community, and we’re getting a lot of computer scientists helping us, which is great.

[00:24:21.460] – APS’s Özge Gürcanlı Fischer Baum

Yeah, that’s great. We are at my last question for the podcast, even though I have so many other questions, but our time is limited, unfortunately. I would like to ask you about your predictions about the future. If you could make one big leap in spoken language research within the next decade, what would it be?

[00:24:48.220] – Brian MacWhinney

Well, my own preference is to have longitudinal data. There’s a problem with funding there because the funding often is for two or three years, but that’s still… On the other hand, that’s really my perspective. I think if you step back further, it’s really AI that is going to be the biggest contributor. As we move forward, there will be also autonomous agents who are going to be able to speak with people over phone calls or over the web, and people are already working on this. And I think this will gather data of a very interesting type because right now, those agents can’t really individualize their work. But as they develop individualized AI, individualized essentially with a user model, this gets very technical. But people are up to this, and I think that will be a big leap. When those tools are ready, we’ll have such fantastic data. I mean, in their native language, communicating with them and then recording all their interactions with a computer system. So we’re going… It’s a little scary. I know the idea that we’re getting human language based on questions from a computer system, but believe me, it’s going to be happening and it will work.

[00:26:06.810] – APS’s Özge Gürcanlı Fischer Baum

Yeah. All right, Brian, thank you very much. This was a pleasure. Thank you for answering all my questions.

[00:26:14.130] – Brian MacWhinney

It’s been a pleasure talking, and I hope people have an idea. You can register at TalkBank. Some of the data sets are more protected, and we have to make sure that you’re an established researcher. But the general large amount of data is really fairly open access. So invite people in. Thank you for letting me explain all this.

[00:26:38.110] – APS’s Özge Gürcanlı Fischer Baum

Yeah. Thank you very much again. This is Özge Gürcanlı Fischer Baum with APS, and I’ve been speaking to Brian MacWhinney from Carnegie Mellon University. If you want to know more about this research, visit psychologicalscience.org. Would you like to reach us? Send us your thoughts and questions at [email protected].
