Can ChatGPT Label Your Text Data?
Plus, how entropy can help us learn about the world from quantitative AND qualitative sources.
Hello friends—Happy Wednesday! Welcome to Pulse of the Polis 20.
Apologies for suddenly vanishing for a few months. Life decided to throw a few curves my way. Some were like roller coasters: Fun, exciting, joyous. Others were like that coaster everyone put in their Rollercoaster Tycoon parks: Nauseating, too fast, and oh-shit-is-that-a-gap-in-the-tracks-am-i-about-to-die-should-i-have-rated-this-park-higher-to-avoid-this-cruel-twist-of-fate-i-knew-something-was-weird-when-i-couldn’t-exit-the-queue1.
I’ve been writing a bit, but in fits and starts that aren’t really conducive to the work I like to post here. I’m glad to be back in the saddle for PotP. I’ve updated the site with a new description (one commensurate with my posting expectations) and some brand-spanking-new images.
Let’s get into some social science!
I’ve got two projects for you today.
Not Causal or Descriptive But Some Secret, Other Thing: Entropy as a Criterion for Causal Learning | Conference Paper
Before discussing this paper, I think we need to first discuss what “entropy” is and what on Earth it has to do with social science.
Entropy is the universal tendency of systems, when closed and left to their own devices, to drift towards disorder. You see it in everything from the diffusion of heat, to the passage of time, to the permanent state of my toddler’s toys. This inevitable creep into disorder can be thought of equivalently as the tendency for systems to plod towards being maximally uninformative; “disorder” here not necessarily meaning “messy” but lacking predictable structure. (Why is Hoppity Bunny in the fridge? Who the hell knows! There’s no rhyme or reason.) We can measure entropy through the probabilities that the elements of a system are in particular configurations. When something is 100% assured to be in a particular state (such as the TV, which the toddler hasn’t managed to hoist from its usual spot yet), it exhibits zero entropy. When something is equally likely to be in any particular spot (Hoppity Bunny’s location), it exhibits maximal entropy. Most real systems fall somewhere in between these extremes. And we do not need to be limited to just understanding “physical systems” through this lens: interactions within the social world—between actors, concepts, instruments, and institutions—can all be thought of as demonstrating some form of probabilistic configuration. In this case, entropy provides a measure of how predictable or “surprising” particular outcomes are.
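If you want to see the arithmetic behind that intuition, here’s a tiny sketch of Shannon entropy in Python. (The four-location distributions for the TV and Hoppity Bunny are toy numbers of my own invention, not anything from the paper.)

```python
import numpy as np

def shannon_entropy(probs):
    """Entropy (in bits) of a discrete probability distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(probs * np.log2(probs)))

# The TV: guaranteed to be in one of four spots -> zero entropy.
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0

# Hoppity Bunny: equally likely to be in any of four spots -> maximal entropy.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits

# Most real systems: somewhere in between.
print(shannon_entropy([0.7, 0.2, 0.05, 0.05]))    # ~1.26 bits
```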
With this background in mind, we can better understand the arguments of this paper. In the social sciences, findings are typically considered either “descriptive” or “causal.” “Descriptive” usually implies that the focus of a study is less the generative processes behind a phenomenon and more its contemporaneous state/configuration. “Causal” implies work less interested in the contemporaneous state and, instead, more interested in articulating the interrelation of phenomena—specifically, interrelations where B is understood to transpire as some consequence of A2. Instead of treating these as a rigid dichotomy, Robert Kubinec proposes that we use the framework of entropy to place social science contributions on a descriptive-causal continuum. Explanations that reduce the entropy of our measure of how social phenomena relate to each other can be understood as contributing at least some information about these co-relations. Work that aims towards “causal identification” (e.g., experiments and contexts/models which try to make credible cause-effect claims) tends to reduce this uncertainty the most, but descriptive work can also provide entropy-reducing information about a system. Importantly, this information can be sourced from either quantitative or qualitative projects; it is less about the mode of inquiry and more about the information that it can fruitfully recover on the topic. “Descriptive” work, Kubinec argues, can thus “provide causal learning.”
This article had me thinking of little else for days after I initially read it. I’m still sorting out all of my thoughts on it and probably will be for a long time to come. I think that the entropy framework has the potential to be quite powerful.
One of the things that appeals to me about this piece is the parallels it has with one of my favorite philosophical paradoxes: the Raven paradox.
Let’s advance the hypothesis that “all ravens are black.” If someone were to find a black raven, most people generally accept that this weakly improves our faith in the hypothesis. Someone taking a huge, random sample of the global raven population is better, though not perfect. If someone scoured the earth for every raven alive and didn’t find any that weren’t black, we’d take this as very strong evidence. And, of course, if we found a white raven, we’d reject the hypothesis outright3.
The paradox arises, though, from the fact that the statement “all ravens are black” is logically equivalent to the statement “if something is not black, then it is not a raven.” And if we pulled out a green apple or a white shoe, the fact that they are neither black nor ravens appears to support the statement “if something is not black, then it is not a raven.” And because the two statements are logically equivalent, if you take a white shoe to provide empirical support for “if something is not black, then it is not a raven”, then a white shoe must also provide empirical support for “all ravens are black.”
There are lots of solutions to the paradox, approaching the quandary from many angles4; several, I think, make very fair points. The solutions I tend to gravitate towards most, though, argue that, yes, a green apple and a white shoe do provide evidence about the “all ravens are black” hypothesis—just an incredibly minuscule amount5. The strength of evidence derives from the rigor and scope of the testing procedure, but you can still gain information through non-ideal tests.
So Kubinec’s framework really strikes a chord with me because it provides a means of demonstrating that we can learn about the world even if we aren’t using the fanciest quantitative methods. Shoot, even if we aren’t using strictly “quantitative” methods at all! Entropy provides a means of demonstrating that we are gaining information about the realization of a phenomenon by diminishing prior uncertainties. Sometimes the investigation provides a large information gain (like a large random sample of ravens), sometimes it provides a far smaller one (like a white shoe or a green apple). But, at the end of the day, knowledge about this little slice of the world has been gained.
I need to think more deeply about the nature of this learning and how it relates to the typical aims of “causal” and “descriptive” research, but it is clear to me now that the framework offers a common “language” for various investigatory styles to be integrated. And that’s something I’m always going to appreciate.
There are three ways I’d love to see this work extended. First, I think that there’s a natural marriage that can arise with this work and Bayesian inference (I’m specifically thinking about using it to compare prior and posterior distributions to quantify our reduction in uncertainty). Second, relatedly, it seems like the degree of information gained is pretty dependent upon our level of prior knowledge—so extensions that toy with more informative priors would be useful. Finally, I think it would be amazing if someone puzzled out a way to fruitfully extend the framework to the research design process. That way, people can think more deeply and rigorously about how to structure their study (given their skills, interests, and resource constraints) to maximize the information gain of their work.
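To make that first extension concrete, here’s the kind of toy comparison I’m imagining (my sketch, nothing from the paper): posit a vague prior over some proportion of interest, update it with data, and read the “information gained” off the drop in entropy.

```python
from scipy.stats import beta

# Prior belief about some proportion of interest, e.g., the share of a
# population holding some attitude. Flat prior vs. the posterior after
# observing 60 "yes" responses out of 100 (standard Beta-Binomial updating).
prior_entropy = float(beta(1, 1).entropy())
posterior_entropy = float(beta(1 + 60, 1 + 40).entropy())

# Differential entropy is in nats; the drop is the information gained.
print(f"prior entropy:     {prior_entropy:.3f}")
print(f"posterior entropy: {posterior_entropy:.3f}")
print(f"entropy reduction: {prior_entropy - posterior_entropy:.3f}")
```

A more informative prior would shrink that reduction, which is exactly the kind of prior-dependence the second extension would need to grapple with.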
I hope that this won’t be the last that I write about this topic. Truly great and interesting stuff!
Chatbots Are Not Reliable Text Annotators | Preprint
One of the many, many, MANY potential capabilities of large language models (LLMs; such as ChatGPT) is text annotation: essentially, throw your chaotic text into the LLM, provide it a list of topics and definitions, and watch it return the topics that each of your texts contains6. However, since OpenAI will “openly” use your prompts and results as training data, and since we are but in the early days of this particular tech dystopia, there’s a need for papers investigating whether things like ChatGPT can actually do the job and whether they do it better than open-source models and other alternatives.
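For the uninitiated, the workflow looks roughly like the snippet below. (This is a minimal sketch assuming the current OpenAI Python client; the model name and prompt are placeholders of my own, not anything from the preprint.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def annotate(tweet_text: str) -> str:
    """Ask the model for a yes/no label on a single topic."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You label tweets. Answer only 'yes' or 'no'."},
            {"role": "user",
             "content": f"Does this tweet discuss politics?\n\n{tweet_text}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(annotate("Senate passes the spending bill after a late-night vote."))
```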
This preprint, written by Ross Kristensen-McLachlan, Miceal Canavan, Marton Kardos, Mia Jacobsen, and Lene Aarøe, compares a variety of LLMs to a set of customized supervised machine learning models on samples of US Tweets. These Tweets originated from the accounts of various US news agencies and had been labeled by human coders to identify whether each Tweet discussed “politics” and whether it contained an “exemplar” (a named person or entity used as foreground for a more general story). The study also looked at whether the LLMs’ effectiveness varied based upon how much information was provided in the prompts. Some prompts simply asked if the Tweet discussed politics/contained an exemplar, while others provided example Tweets paired with the human coders’ decisions to orient the LLM.
In all cases, the custom supervised learning models outperformed the LLMs. However, the relative difference depended on the annotation task at hand. ChatGPT, and many other LLMs, did almost as well as the trained models when determining if the Tweet contained political content, but their performance was substantially worse when flagging whether the Tweet contained an “exemplar.” There were large, not-always-intuitive differences in the effect of the prompt details, which suggests that there is a lot of variability (and, given the black-box nature of many LLMs, perhaps unquantifiable variability) in the outputs. The authors conclude that ChatGPT may not be the most viable solution available for text annotation tasks and that supervised learning methods remain a viable (if not the top-performing) option with regard to predictive performance—while also dodging concerns about security, openness, and replicability.
I really like this paper because there is a whole lot of hype surrounding generative AI right now. It’s not that I have a particular axe to grind against Gen-AI7, but I think that it’s important for us to separate the hype from reality. And there’s no better way to do that than to do some science!
By and large, I found myself mostly vigorously nodding along to this paper like a bobble-head keeping time with a metronome. (See also this X post from Mike Burnham showing, on “political stance classification from text,” that “GPT-4 is impressive but cannot scale.”) If there’s one thing that I disagree with, though, it’s less the paper’s set-up or analysis and more how its conclusions are framed. Yes, ChatGPT and other LLMs did worse in all contexts, but the preprint is written in a way that makes it sound like the LLMs got absolutely routed by the custom ML models. And, well, they really didn’t—in the case of identifying text as “political,” at least. As the preprint’s figures show, ChatGPT’s F1 scores (a measure of predictive performance; nothing to do with cars, unfortunately) appear more-or-less identical to the supervised-learning models’ scores, with very little difference based upon the descriptiveness of the prompt. Other LLMs performed similarly. It was when the models were tasked with identifying an exemplar that the supervised learning models showed substantially better, more reliable performance.
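If F1 is new to you, it’s the harmonic mean of precision and recall, and you can compute it in a couple of lines. (The labels below are made up for illustration, not drawn from the paper.)

```python
from sklearn.metrics import f1_score

# Hypothetical human-coded labels vs. model labels (1 = "political").
human = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
model = [1, 0, 0, 0, 1, 1, 1, 0, 1, 1]

# Precision: how many predicted 1s were right. Recall: how many true 1s
# were found. F1 is their harmonic mean.
print(f1_score(human, model))  # ~0.83 here
```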
So I’m actually coming out of this paper the same way I do with most LLM papers I read these days: impressed with what the tech can do, but increasingly aware that this impressive performance is limited to contexts featuring relatively mundane arrangements of text. “Politics” is very well-represented in ChatGPT’s training data; erudite concepts like “exemplars,” with little exposure outside of media studies, are less well-represented. Plus, supervised-learning models can be really expensive to create, so LLMs may actually be an OK solution when you’ve got a single project, you’re not super worried about reproducibility, you’re not a statistician/data scientist/ML engineer (and you can’t afford to hire a consultant), and you’re aiming to label text based upon simple concepts that are probably well represented and expounded upon in the training set. That sounds like a lot of “ands,” but that probably describes the majority of people who, broadly, research humans for a living! The trick, then, is recognizing whether your application is going to be well-represented or not. (Easier said than done in practice, though. The curse of knowledge tends to make experts overestimate how many people are aware of the things they’re experts in.)
One concern that I didn’t see emphasized in this paper (at least in proportion to the concerns I hear/read from ML/DS practitioners) was idempotency. Idempotency (or, strictly speaking, determinism) is the idea that, when you pop input X into a software function, you’ll get output Y every time8. If I have a dataset of tweets/comments/whatever and I have an algorithm that classifies observations into a handful of categories, I want those classifications to be the same whether I run it the first time or the fiftieth. That kind of reliability is imperative when you’re making production-level systems and when you’re trying to make replicable science. Right now, you can ask ChatGPT the same question twice and get different text back. Based off what? Who the hell actually knows! (Do androids react to electric vibes?)
Future work building off of this could explore the issue explicitly by making repeated calls to the API to classify the same records and then comparing the results. It’ll add cost to the experiment for sure, but the variation in F1 scores across runs could be a good signal for how reliable these models would be when put to repeat use.
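Something like the sketch below would do it, reusing the hypothetical annotate() helper from earlier (again, a rough outline of my own, not the authors’ setup):

```python
from collections import Counter

def stability_check(tweets, n_runs=5):
    """Classify the same tweets several times and report label agreement."""
    runs = [[annotate(t) for t in tweets] for _ in range(n_runs)]
    for i, tweet in enumerate(tweets):
        labels = Counter(run[i] for run in runs)
        agreement = labels.most_common(1)[0][1] / n_runs
        print(f"{tweet[:40]!r}: {dict(labels)} -> {agreement:.0%} agreement")

stability_check([
    "Senate passes the spending bill after a late-night vote.",
    "Local bakery unveils a 12-foot croissant.",
])
```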
But, all in all, I’m really glad that the authors decided to actually test whether LLMs can do one of the things that many a LinkedIn tech influencer swears can be done oh-so-easily9. I’m glad that this paper exists. I think that the evidence the authors gathered provides a splash of cool water on a really hot topic right now. But not enough to make me think that LLMs are necessarily substantively less viable than tailored supervised learning algorithms in all circumstances.
Though, to be fair, the authors weren’t talking about all circumstances! They were mostly talking about the context of scientific research and/or production-grade systems dealing with text data. And in such contexts, where we have professional obligations to make secure, maximally reproducible work, I think it’s good to demonstrate that we don’t have to chase after the shiny new tech thing. We have ChatGPT at home. Except, unlike “we have McDonalds at home,” the alternatives (whether open-source LLMs or supervised ML models) are actually as good if not better on the primary goal and better serve important auxiliary goals as well.
Nuggets:
I wanted to link this interesting thread on X that CNN’s Ariel Edwards-Levy wrote about a recent poll. The network asked respondents to identify the ways that Democrats and Republicans differ. It not only found that the vast majority of Americans are intuiting clear differences between the two parties (this was not always the case historically), but that there wasn’t a single, clear “most” important difference. This suggests that there are many salient cleavages observed among the lay public; a single issue doesn’t define the schism.
This interesting analysis from Ryan Cummings and Neale Mahoney shows that a lot of the difference between people’s economic perceptions and reality over time can be chalked up to partisanship. While this fact has been pretty well-established and occurs among both Republicans and Democrats, what’s interesting is that they find “that Republicans cheer louder when their party is in control and boo louder when their party is out of control.”
Ever wanted to see how different names were “birthed” from popular movies? Check out this post from Walter Hickey. (For example, Luna shot up massively after Harry Potter came out, as did Chandler (RIP) after Friends dropped.)
This post has been edited because I forgot to delete a sentence about posting nuggets throughout the email. I tried it. Looked and flowed bad. Nuggets stay at the end for now.
I’m sure that there’s a German word for this.
This statement is fun because it will elicit literally no reaction out of the vast majority of people but make the very, very few with epistemology as their personal bugbears “big mad.”
This of course assumes that we don’t define ravens in such a way that they must be black, the same way we define “bachelor” and “bachelorette” as individuals who must be unmarried.
Social scientists will be familiar with the critique of whether induction actually can provide evidence in favor of something, as many of us are trained to approach our work as trying to find evidence that disproves a competing contention rather than provides evidence for the contention of interest itself. Whether or not that’s what many of us actually do is a whole ‘nother story. But fruitful critiques have also focused on whether the two statements are actually equivalent in a real-world setting rather than when we just consider them as tokens to be shuffled about in a logical equation.
The way I like to think of it is that the world has a stupidly large total number of discrete “things” in it (and what we consider to be a discrete thing is, of course, incredibly prickly in and of itself), but that number is not infinite. So while it may be more effective and efficient to only focus on ravens and the hue of their plumage, you’d still get the “right” answer (I’m ignoring albino ravens at the moment) if you enumerated all of the non-black things in the world and never came across a single raven. Just because you take the polar route, it doesn’t mean you won’t eventually get to your destination.
Or, even more simply, “here’s my text. Tell me if it concerns [TOPIC], yes or no.”
I, for one, welcome the imminent arrival of our robot overlords.
There were a couple of tangential references to “reproducibility” but I think the issue merits explication. Hence why I’m doing it here!
Especially if you purchase their course on the topic.