Cruz interviews Cash and Oppenheimer on their M&C paper on certainty and LLMs.
The Psychonomic Society (Society) is providing information through this podcast as a benefit and service in furtherance of the Society’s nonprofit and tax-exempt status. The Society does not exert editorial control over such materials, and any opinions expressed in the podcast are solely those of the individual contributors and do not necessarily reflect the opinions or policies of the Society. The Society does not guarantee the accuracy of the content contained in the podcast and specifically disclaims any and all liability for any claims or damages that result from reliance on such content by third parties.
Transcript
Hello, you're listening to All Things Cognition, a Psychonomic Society podcast. I will be your host today, Anthony Cruz. Confidence plays a huge role in how we engage with others. If you have two friends on a trivia team shouting different answers, you'll probably write down the answer of the person who seems most confident. You might even ask them how sure they are of their answers. But what if that trivia teammate was an AI model? More and more we're getting information from AI models, and so we ought to ask questions like, can an AI tell us how confident it is in the information it's giving us? That is the subject of a paper published in the Psychonomic Society journal Memory & Cognition titled "Quantifying Uncert-AI-nty: Testing the Accuracy of LLMs' Confidence Judgments." Today, we have the pleasure of being joined by Trent Cash and Daniel Oppenheimer, two authors on this paper. Trent, Daniel, thank you so much for being here.
Trent: Yeah, thanks for having us.
Daniel: Excited to be here.
Anthony: Thrilled to have you. To get us started, what inspired you to jump into this line of research?
Trent: So I'd say that the motivating factor for this project was this idea that people are starting to ask LLMs all kinds of new questions that maybe five years ago we wouldn't have thought about asking a computer system. And sometimes we have quite a bit of evidence that these LLMs, so large language models, which are ChatGPT, Gemini, Claude, things like that, can come up with information that's not necessarily always true. So they might hallucinate information that's just completely false, or they could just be wrong about a fact that you asked them to answer. So we realized that one of the most important ways that we deal with this uncertainty in human communication is that we ask the people around us, as you pointed out, to make those confidence judgments. So with your friend at trivia, if we don't wanna pay attention to just who has the louder voice, we ask them, are you 80% sure? Are you 20% sure? And if somebody's 80% sure, that tells us maybe we should believe what they say a little bit more. We wanted to see if that kind of ability within human communication would also work with LLMs. So can LLMs give us this estimate of how confident they are in what they say?
Daniel: Building on that a little bit, I have in the past started asking LLMs for their confidence judgments on various things that I use to guide my decision making. So for example, I write op-eds, and you always wanna place your op-ed in the top place possible, the top venue, but the top venues are very hard to get an op-ed placed in, and I don't wanna waste a lot of time sending my op-eds to places where they have no chance. So I tried this: I gave my op-ed to an LLM, to GPT, and I said, tell me which venues on the following list you think I have a greater than 30% chance of placing this in. And I thought that way I could only send the op-eds to the places that it was worth my time to send them to. And then GPT told me in response which ones I should send it to.
Daniel: I followed its advice, and the rate was closer to 5% acceptance. And that was really disappointing to me, because I thought I had just wasted a lot of time, and I wish that GPT had been more accurate in its portrayal of how effective I would be at this. So looking at how well GPT and other LLMs are capable of determining likelihood, and how confident they are in their answers and in the various information they give us, is actually really important, because people use this information to guide their decisions.
Trent: And to jump off of that, I do think it's interesting to note that the LLMs are designed to be vocally very confident. So the language they use is quite confident. They act like a person who knows exactly what they're talking about. So in order to counter that inherent overconfidence in its language, it might be helpful to just straight up ask it how confident it is.
Anthony: Yeah, I mean, like you said, they always seem so confident to me. I'm actually just finding out that they don't always tell the truth. But these two types of things you're describing, so Trent, you talked about trivia, and Daniel, you talked about this sort of likelihood of op-ed acceptance, and those seem like very different things to judge confidence on.
Trent: Yeah. To get at this question, what we did is ask the LLMs to give their confidence judgments in a bunch of different domains. So we asked them to predict who would win football games. We asked them to predict who would win the Oscars. We asked them to play Pictionary. And then we did two trivia tasks where the information wasn't just your normal trivia; it wasn't what's the capital of Bulgaria. Instead, we asked them trivia questions that they couldn't possibly look up on the internet. And we designed the stimuli so that there would be two different kinds of uncertainty: aleatory uncertainty and epistemic uncertainty. Aleatory uncertainty essentially boils down to information that can't possibly be known. So it's uncertain because, for example, it hasn't happened yet. You're not uncertain because of some lack of information; it's a lack of knowledge about something that hasn't happened yet. Whereas on the other hand, epistemic uncertainty is uncertainty about how much you know or what you know. So it really comes down to your uncertainty about yourself. And we found that the results might be a little bit different depending on those two different types of uncertainty.
Anthony: Yeah, that's really cool. It sounds like epistemic uncertainty is like a fact you could check, versus aleatory, which is something that you don't quite know yet and can't know yet.
Trent: Yeah, exactly.
Daniel: With epistemic uncertainty, you can't always know it. If I were to ask you how much money I had in my bank account, that is a knowable piece of information. In theory, I know it. But that doesn't mean that everybody knows it; hopefully not everybody knows it. So the fact that something is knowable doesn't mean that the LLMs necessarily know it. And we deliberately chose domains in which the LLMs couldn't know it. So private data sets, for example. Another set we used was Pictionary. We had people draw pictures, and we asked the LLM to try to identify what the author, or actually I should say the artist, had in mind when they drew it. There is an answer to that; it is a known outcome. We know what the person was trying to draw, they told us, but the LLM doesn't know that. And that differs from something like predicting the score of a game that hasn't happened yet, where it is impossible to know what that is until the event has happened.
Trent: And I do think we should point out that if we just asked the LLMs really straightforward questions, again going back to asking the capital of a country or asking it to do a math equation, chances are the LLMs are gonna get those questions right a hundred percent of the time, be quite well calibrated in their confidence, and know that they're gonna get them right a hundred percent of the time, because those are tasks that LLMs are really, really good at. So if it's something you can just look up, something with a one-word answer, LLMs tend to excel at that kind of task. So that's not gonna be the task where we could really dive into how accurate their confidence judgments are. And that's why we chose some of the tasks that we did, tasks that are a little bit more complex, that you might not think of as things that LLMs are good at.
Daniel: It's worth noting that this is true of humans too. If I were to give humans access to a map and then ask them to name the capitals of various countries, the humans would say with a hundred percent accuracy, yeah, well, this is what the answer is. They would look on the map, they would get it right. LLMs would look it up on the internet, get it right, and tell you they're a hundred percent sure. Those aren't the interesting cases. The interesting cases are the places where there is a possibility for uncertainty, and when there is uncertainty, how confident are we, and especially comparing how well humans do versus how well LLMs do.
Anthony: Do you think that humans and LLMs are using the same kinds of processes, the same kinds of information, to make these confidence judgments?
Daniel: I'm not sure that the authorship team agrees entirely on this, which is why we had an awkward silence before we answered you. I would say this: I think that it is hard to know what the LLMs are doing, at least not with any confidence. Maybe the LLMs would give confidence estimates on that. But it's hard to know what the LLMs are doing. We do seem to know that certain information that humans use, the AI does not. In particular, we have internal metacognitive cues. Something feels easy or it feels hard. If I ask you to try to remember something that happened earlier today, it will feel easy to remember. If I ask you to remember something that happened three weeks ago, it might feel harder to remember. So we have this sense of, I know this or I don't. And that feeling appears to be absent for the AI. And the reason we know that is because when we give humans tasks, let's say we have you estimate how many Pictionary questions you're gonna get right.
Daniel: And so we give you an example of three or four Pictionary questions. We say, we're gonna do 20 of these, how many are you gonna get right? You make an estimate, then we have you do 20, and we ask, without having given you feedback, how many do you think you got right? Well, now you have access to: did it feel hard, did it feel easy when I was doing it, what was the experience like? I had no idea on number three, and number 17 was really simple. And so then you give a new estimate, and for humans that estimate becomes much better calibrated. What's interesting is that for AI, it doesn't become better calibrated. So you give them the original task, they estimate, they do it, and oftentimes they get worse after having done it. They don't seem to realize that they didn't do very well. And this was, I believe, particularly true of Gemini. Is that right, Trent?
Trent: Yeah. So I think the case where this was most evident was in our Pictionary study, as you've been highlighting so far. We told Gemini ahead of time that there were something like 20 questions it was gonna answer and asked, how many of these 20-ish do you think you'll get right? It said, okay, ahead of time, I think I'm gonna get something like 12 right. It then proceeded to get only one right out of the 20. And then we asked it afterwards, and it said, okay, I think I got 16 right; I think I did even better than I thought I would do. And that's something that you don't tend to see in humans nearly as often. Because if a human comes into a math test thinking they're gonna get 60% and gets only one question right, they're probably gonna adjust their confidence down a bit and just say, okay, that was really, really hard. So they learn from their experiences in a way that AI, at least currently, does not.
Anthony: That's really interesting. It almost relates to that idea you were suggesting earlier: like they know it already, they are confident, they're gonna present it in that way. You've kind of hinted at asking the AIs to predict NFL games, Oscar winners, Pictionary performance, and trivia questions. But how exactly did you answer your research questions? You asked them to do these things and then what?
Trent: Yeah, so what we did in each study was bring each sample in, so humans and AI. For the AI, we used the chatbot interface just like any normal user would when they log onto chatgpt.com and start chatting with the chatbot. So we'd start by giving the participant the instructions for what they were gonna be doing. Then we gave them a few practice questions, except in the NFL study, since we didn't have too many categories, but we tried to give them the practice questions. Then we'd say, okay, now that you have an idea of how this task works, we are gonna give you X number more questions like this. So in Pictionary it was 20 more; in the trivia it was 22 more. And we just said, you're gonna be doing this many more questions.
Trent: And then we said, out of that X number of questions, how many do you think you'll get right? So you're gonna do 20; tell us, on a scale of zero to 20, how many you think you'll get right. Then we had them do the actual task. So in the NFL games, it would be: here are two options, which one do you think is gonna win this game? For the Oscars, it was: here are five nominees, which of the nominees do you think is gonna win? And then after each individual question, we asked them how confident they were about that particular item. So we'd ask, for example, how confident are you about the Best Actor Oscar versus the Best Actress Oscar? And so for each individual item, we would also get a confidence judgment. Then after they did all of the items and all of their item-level confidence judgments, we would ask them for that post-task confidence judgment.
Trent: So, how many in total do you think you got right? And these two different types of confidence judgments allow us to get at two different kinds of metacognitive accuracy. There's absolute metacognitive accuracy, which is those overall numbers: how many you thought you would get right, or thought you did get right, versus how many you actually got right. And then there's relative metacognitive accuracy, which is the idea of, are you more confident on the questions that you do better on? So is there a correlation between accuracy and confidence? And that same general setup was used for each of the studies.
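For readers who want to make those two measures concrete, here is a minimal sketch with made-up numbers. It is not the authors' analysis code, just an illustration of how a single 20-item session could be scored.

```python
# Illustrative sketch (not the authors' analysis code): scoring the two kinds of
# metacognitive accuracy described above for one hypothetical 20-item session.
from statistics import correlation  # requires Python 3.10+

predicted_total = 12  # pre-task estimate: "I think I'll get about 12 right"
item_confidence = [0.9, 0.6, 0.8, 0.7, 0.95, 0.5, 0.85, 0.6, 0.7, 0.9,
                   0.4, 0.8, 0.75, 0.6, 0.9, 0.55, 0.8, 0.7, 0.65, 0.85]
item_correct = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
                0, 1, 0, 0, 1, 0, 1, 0, 0, 1]  # 1 = answered correctly

actual_total = sum(item_correct)

# Absolute metacognitive accuracy: how far the overall estimate is from reality.
# Positive = overconfident (here, predicting 12 but getting 10 gives +2).
overconfidence = predicted_total - actual_total

# Relative metacognitive accuracy: is confidence higher on the items the
# respondent actually got right? Here, a simple confidence-accuracy correlation
# (a Pearson correlation with a binary outcome).
relative_accuracy = correlation(item_confidence, [float(c) for c in item_correct])

print(f"Got {actual_total}/20 right; overconfidence = {overconfidence:+d}")
print(f"Confidence-accuracy correlation = {relative_accuracy:.2f}")
```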
Anthony: That's awesome. And you're comparing those models against humans, I'm guessing?
Trent: Yes. So we compared each of the models to humans. We started with ChatGPT and Bard in Study 1, and then we kept adding models as we went on through the additional studies. So by the end, we had tested ChatGPT; Gemini, which used to be called Bard; Claude Sonnet; and Claude Haiku, because those are some of the most used models in the world. And then we compared each of those to humans.
Anthony: Yeah. And if I'm not mistaken, those models are all free to use.
Trent: They all have free models. And in our studies, we used the most advanced model that was available to a free user.
Anthony: Well, accessible science, we could do this right at home. So you described this difference between absolute and relative metacognitive accuracy. Could we break it down according to those? In terms of absolute metacognitive accuracy, how do these machines tend to perform relative to people? You kind of hinted at this earlier.
Trent: Yeah. So overall the LLMs tended to perform about the same as humans. You wanna think about how far off their confidence is from their actual performance on average, and that's what we typically call overconfidence. So if they think on average they're gonna get 12 right and they get 10 right, we would call them two overconfident, basically. And what we found is that the LLMs were quite similar to humans. In particular, we found that, just like humans, they tend to be overconfident. We see across the psychological literature as a whole that humans are overconfident on the vast majority of tasks, and LLMs are as well. So they're kind of replicating that human bias.
Daniel: It is worth noting that in the literature in decision science, there are cases where people are underconfident, and those tend to be cases where everybody is bad at the task and people think, I am worse at it than other people. So if I were to say, tell me about your knowledge of South American rodent species, unless you happen to be a South American rodent expert, you're gonna be like, I don't know anything about that, I'm gonna be terrible at that compared to other people. But of course, you're forgetting the fact that everybody else is also terrible at it. So those are some places where you actually will see underconfidence. What's perhaps a little bit interesting about the Gemini Pictionary study is that it was truly awful at it and had no clue whatsoever.
Daniel: But even that I wouldn't say is so different from people, because you have this unskilled-and-unaware-of-it phenomenon, where the people who are the least capable are often the least aware of how incapable they are. And that makes sense, right? If you ask people, how bad are you at grammar, and they say, oh, I'm terrible at grammar, then they would have a chance to improve their grammar. If they don't realize that they don't know any grammar, then they won't be able to improve it. And that's why they both are bad at it and aren't aware that they're bad at it. So it makes sense that people would have that, and GPT, well, actually Gemini in particular, showed that as well. But it was interesting that we didn't see the sorts of cases where, in humans, overconfidence isn't observed. That wasn't observed in GPT; GPT was always overconfident, I believe. Is that right, Trent?
Trent: I would have to look back at the data to say exactly, but I know that in the vast majority of cases the LLMs as a whole were overconfident, and I think GPT was worse than some of the others.
Daniel: Claude, I think, sometimes was underconfident. Yeah. But GPT and, I think, Gemini were always overconfident.
Anthony: That's really interesting. And then, so that's the absolute sort of metacognitive accuracy. What about the relative? Are they also similar to humans?
Trent: Yeah, so relative metacognitive accuracy, again, is that correlation between accuracy and confidence. And we do see that the LLMs, again, show pretty similar results to humans. I would say they outperformed humans ever so slightly, but I would not walk away from this saying that LLMs are absolutely awesome at relative metacognitive accuracy, that they get it right every single time. They were just maybe slightly better than humans. But I think it's worth pointing out that all of the samples were pretty bad at relative metacognitive accuracy in most of the studies here. Pictionary was actually the study where relative metacognitive accuracy was the highest, but overall we found pretty low levels of relative metacognitive accuracy. So it might be more honest to say that the LLMs were slightly less bad than humans at relative metacognitive accuracy.
Anthony: Are people usually this bad?
Trent: It very much depends on the task. Yeah.
Daniel: Yeah. I mean, if I were to give you a list of words to remember and then ask you how well you'll remember them, that's a different sort of task, and it's not surprising that people will sometimes have a pretty good sense of, this is a hard word and this is an easy word to remember. And similarly, if I were to ask whether you're pronouncing a word correctly and give you a crazy multisyllabic word that's very hard versus an easy one, people are gonna be pretty well calibrated on those. But with the sorts of questions we were asking, we weren't outside of the range of what we would normally see in a study like this. So the fact that people weren't great at this was not terribly novel. It's interesting that these...
Trent: ...are hard tasks...
Daniel: ...that LLMs were not better at.
Anthony: You're saying they're replicating this human bias on the average, right? What does this mean for us who are using these models?
Daniel: Going back to what I was talking about earlier, I have been surprised, when I ask an LLM to give me information, how often the information it gives me is either outdated or just simply wrong. I recently was submitting a paper, and I asked LLMs to tell me what the submission process was like so I could prepare properly. And it gave me information that might have been true a decade ago but was no longer true of the journal. That included giving me the wrong website to go to to try to submit it; it told me things I would need to have ready to submit that I didn't need, and it also didn't tell me about things that I did need. It has also occasionally suggested I email editors directly when there's a portal.
Daniel: I mean, it gives a lot of information that's just not a hundred percent right. And what's interesting is I have asked it, I've said, please double-check, please be sure, I don't want to have to do this later; send me the information, double-check it, make sure you are confident. And it will say, I have done that, I am confident, and then proceed to be wrong. It's not always wrong. I mean, AI is useful. It's useful most of the time. Most of the time it gets it right, but when it doesn't, it doesn't seem to have any clue. And that is, I guess, true of humans too. A lot of times we think we're right and we're very confident, and we're not. And maybe the lesson to be taken from this is that when it feels like you know something, it feels like you know something, and you're confident even though you're wrong. And LLMs have that same, I don't know if I can call it experience, because I don't know if LLMs have experience in the way that humans do, but they still have that same phenomenological output.
Trent: Jumping off of this, on a perhaps more consequential scale than submitting a paper...
Daniel: What could possibly be more consequential than submitting a paper?
Trent: For us academics, nothing; for everyone else, quite a lot. There was a study that came out, I wanna say last year, it might've been two years ago now, showing that LLMs will hallucinate legal precedent. So if lawyers use ChatGPT, for example, to help them write their legal briefs, it'll just make up cases that never actually existed. In the research world, we refer to these as hallucinations, where it just completely hallucinates something that doesn't exist. And that can create a lot of problems if you trust it, because it won't tell you that it's not certain about the information it's giving you. It won't say, wow, this was a hard task, I tried my best, but it might be wrong. It would be kind of helpful if it did that, to be completely honest.
Daniel: I once asked GPT to come up with a bio of me, just to see what it knew about me. And what was very interesting is it gave a largely accurate bio with some information that wasn't true but sure could have been. So for example, it said I went to Harvard for undergraduate. I didn't; I went to Rice for undergraduate. But people like me often do go to Harvard as undergraduates. And so if you were to just look at my CV and try to guess where I might have gone to undergraduate, it made a pretty darn good guess; it was a likely possibility. It just happened to be wrong. What was interesting is it didn't say, I don't know where he went to college, I'm guessing it might be Harvard. What it said was, he has an undergraduate degree from Harvard, which is not true.
Daniel: And so that, I think, is very characteristic of what LLMs do when they don't know information, at least as a default. There are things you can do with your prompt engineering to reduce that likelihood, but a lot of those involve you telling it, if you don't know, then do X. And what we found is that LLMs don't know when they don't know. So instructing it, if you don't know where this person went to college, tell me that, isn't necessarily going to get you more accurate information if the LLM can't figure out what it knows and what it doesn't. And that is also true of humans, but showing it will hopefully give people a little more pause. As Trent said earlier, LLMs present as though they are extremely confident, as though they know what's going on. And many humans do too. But with humans, at least, there is hesitation. There is prosody that suggests, I'm not so sure about this, or, I think it's true, but I'm not a hundred percent. A lot of times we will say things like, I think it's true, but I'm not a hundred percent. LLMs don't really do that spontaneously. And even if you ask them, they're not always great at helping.
Trent: And I think the fact that LLMs lack those abilities shifts the burden onto the human user to be cautious about what they're doing. And I think an easy heuristic that a user can take away is to just think about, is this something that I really think an LLM can do? Kind of in the same way that you wouldn't ask your calculator to plan your wedding, maybe you shouldn't ask ChatGPT how to submit a paper.
Daniel: And I'll say, on top of that, you shouldn't just ask ChatGPT. You say that; of course, I have tried this. And it's not always clear what it is that ChatGPT can and can't do, especially since it's updating constantly. Every new version can do new things, and oddly enough, sometimes new versions can't do things that the old versions used to be able to do. So knowing what ChatGPT is good at or bad at in any given week is something that an average human is not going to know unless you're an expert on this sort of system. And it's constantly changing and dynamic and frustrating. So it would be really helpful if you could ask it. Sometimes it will tell you, but a lot of times it really doesn't. What's interesting is that there are things that GPT can't do because it is not allowed to do them, not because it couldn't necessarily do them.
Daniel: So if you ask it to give what it considers immoral information: it's been hard-coded not to be able to spout racist rhetoric, it's been hard-coded not to be able to give violent imagery. I don't think you can ask for the recipe for a bomb. Well, you could ask, but I don't think it would give it to you. And so there are a lot of things that you can't do with GPT because it's been hard-coded that it's not allowed to tell you certain things. And what's interesting is sometimes it doesn't seem aware of the fact that it can't tell you something. You'll ask it a question, and it will start answering and then say, sorry, I can't do this. And then you say, well, what can you do? And it'll say, I should be able to do something slightly different. And then you ask it to do that different thing: oh, no, I can't do that either. So it's an interesting phenomenon that it doesn't seem to be aware of all of its limitations, not only what it can't do because it's not capable of it, but also its hard-coded limitations. I should note we haven't tested that empirically; that's just something I have come across many times when I'm working with it.
Trent: One thing we did experience while running this study is that even we had some trouble with ChatGPT, where one day it would accept our experimental paradigm, and then the next day we'd try to upload the spreadsheet that had the data and it would go, oh wait, I can't read spreadsheets. So it definitely changes what it can do from day to day, even.
Daniel: We had one instance, this was not for this particular study, but for other research we were doing in similar veins, where we were asking it to predict NCAA basketball tournament outcomes. And it refused to recognize the University of West Virginia. It just simply refused to recognize that it existed. We would tell it over and over, no, West Virginia exists, and it is playing whatever team it was playing that year, and we need to know who is going to win that game, and it would spit back, that's not a game that's happening, West Virginia doesn't exist. It was just a very strange blip, and it only happened for a particular day at a particular time in a particular chat. And it was very confident, and completely wrong. So there are some researchers who are exploring this, Sendhil Mullainathan in particular, who looks at this question of why people sometimes don't trust AI.
Daniel: Even though AI can often be very accurate in certain domains, one of the reasons is that when it makes mistakes, those mistakes are inexplicable to humans. When a human makes a mistake, we often can sort of understand why that mistake was made. But when an AI makes a mistake, sometimes the mistakes are just so strange that it's unfathomable how that mistake could be made, which makes us skeptical of the entirety of the AI enterprise, even though a lot of times AI can do things that humans can't. So it seems to me that in terms of this confidence issue, going back to that, AI has different cognition than humans do, and therefore it is hard for us to understand how it is doing it and how confident we should be in it. And it's not even clear that AI understands those things, which is sort of interesting.
Trent: And there is some research suggesting that these LLMs do have some internal metric of how confident they are in their responses. But one thing that we really try to focus on in this paper is getting that communicated to a human, because it's a very technical process by which the LLMs generate what are called token likelihoods, which are essentially how surprised they were by what they said. Trying to convert that into a confidence judgment that a human user can understand is a major challenge, but it really is what matters at the end of the day.
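As a rough illustration of the token-likelihood idea, here is a minimal sketch with made-up numbers. It is not the method used in the paper, which queried the chatbots for confidence directly; it simply assumes you already have the log-probabilities a model assigned to the tokens of its own answer, something several model APIs can expose.

```python
import math

# Hypothetical per-token log-probabilities for a short model answer.
answer_token_logprobs = [-0.05, -0.30, -1.20, -0.10]

# Convert each log-probability back to a probability, then take their
# geometric mean (a length-normalized likelihood for the whole answer).
token_probs = [math.exp(lp) for lp in answer_token_logprobs]
normalized_likelihood = math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))

print("Per-token probabilities:", [round(p, 2) for p in token_probs])
print(f"Length-normalized likelihood: {normalized_likelihood:.0%}")
# A low value means the model was "surprised" by its own wording; turning that
# into a well-calibrated, human-readable confidence judgment is the hard part.
```

Even a clean number like this measures surprise about wording rather than the probability that the answer is factually correct, which is exactly the communication gap Trent describes.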
Anthony: Yeah. So I feel like you're kind of getting at what my next question was, which is, what is next in this line of research?
Daniel: Well, I'm not confident that I know the answer to that. We could ask GPT what we should do.
Trent: Hey, GPT was confident this paper would get accepted, and it was right. So that's...
Daniel: Right. So there you go. So Trent and I do a lot of research in metacognition more generally, and it is worth noting that Trent has developed a really nice paradigm for testing how well people know what information they're incorporating into their judgments. It's called the KOW paradigm, the Knowledge of Weights. It's gonna take the world by storm very soon; it's just starting to get published. And it does a really good job of identifying whether people are aware of what they care about in their choices and judgments. We're just now adapting that to see if we can explore how well AI knows itself in terms of what information it's using when it makes its decisions, which is different from how confident it is that it's correct, but still an important metacognitive judgment. So that research is now ongoing, and it's exciting both because it explores a follow-up to how much LLMs have metacognition, and in particular have accurate metacognition, and because it's exploring what I would call super exciting advances in metacognition. Since I can say that Trent's research is super exciting without him looking arrogant, it really is, it's really exciting new work, which is allowing us to understand what people know about their own decision processes in a way that hasn't been possible before. And so these new advances in metacognition are coming out on humans, and the fact that they can simultaneously be applied to LLMs is exciting as well.
Trent: I will note early evidence suggests that at least ChatGPT-3.5, which was the last one we tested, is pretty darn bad at knowing why it makes the judgments that it does, way worse than humans. So this is looking like an area where LLMs might struggle metacognitively, but that is a breaking first study with a lot more data to come. Going back to the question of confidence, I will point out we are not software engineers here; we are psychologists by training. So our questions are really these questions about confidence, how confidence works, and what it means for communicating to human users. But there is a lot of exciting work happening in the AI literature where people are trying to build these metacognitive components into AI systems. There are some early promising results coming out where people are programming these skills into the AI, and we're really excited to see where that's going.
Anthony: Yeah, that is really cool. I'm on the edge of my seat. This sounds so thrilling and exciting. Do you have any last thoughts before we wrap up?
Daniel: There's another question that we have perhaps started looking at, but not as deeply as we should, which is the question of how using LLMs influences the accuracy of people's confidence judgments. So if you have access to a computer and you have access to AI, and now you're making estimates about your knowledge or how well you are doing on a task, how does using AI affect you? There's some preliminary evidence coming out on that. My former postdoc, Matt Fisher, is at the forefront of this, and what we're finding is that people who use LLMs believe that they know more than they really do; they conflate the knowledge that they are getting from the AI with their own knowledge. So you ask the AI, it answers, and you're like, yeah, I could totally have gotten that on my own. And so it may be that using AI is distorting human metacognition in interesting ways. And to the extent that AI metacognition is distorted and that using AI distorts our metacognition, that leads to some very interesting questions about how the metacognition of human-AI teams is going to be successful or unsuccessful.
Trent: On a related note, I think there are many interesting questions to open up here when we start thinking about beliefs instead of facts. There's a lot of really cool research coming out on how LLMs can change people's beliefs. And when we talk to LLMs about how confident they are in their beliefs, a lot of times users have this lay understanding that LLMs have beliefs. So we need to work through how those beliefs are communicated and how the confidence in those beliefs changes human behavior. I think that's a really sticky, messy area with lots of questions to be answered, because here we focused on facts, but those messy, messy beliefs are something that could be really cool to look into.
Anthony: Yeah. And I understand you're starting a postdoc soon. Is that the kind of stuff you're gonna be looking at?
Trent: Yes. I'll be interested in looking at anything ranging from AI to metacognition. Some questions that we're gonna be delving into are medical metacognition and how AI can help patients make better decisions and better understand how they're making the choices that they're making about their own health and wellbeing. So, kind of looking at that intersection of how AI can improve human metacognition.
Anthony: Thank you so much for being here, both of you.
Trent: Thank you for having us.
Anthony: You've been listening to All Things Cognition, a Psychonomic Society podcast. I've been your host, Anthony Cruz. We've been speaking with Trent N. Cash and Daniel M. Oppenheimer, two authors on the recent Memory & Cognition paper titled "Quantifying Uncert-AI-nty: Testing the Accuracy of LLMs' Confidence Judgments." If you would like to get in contact with Trent or learn more about his work, you can visit his website, trentncash.com. Thank you for listening and have a great day.