How Not to Mess Up Your Survey Experiment
Plus how different translations of Mass Effect may have led to different qualitative experiences of the game.
Hello Friends—Happy Friday! Welcome to another edition of Pulse of the Polis.
Today we’re focusing on a great article by John Kane that looks at different ways your survey experiment could come back with insignificant results for reasons apart from your hypothesis or theory being wrong.
I loved reading this paper; it’s really jam-packed with practical information. The best way to prevent a false negative (that is, saying something didn’t have an effect when it actually does) is strong, thorough experimental design. There are loads of things to consider long before you’ve collected a single respondent, let alone opened R/Stata/SPSS/Python or whatever. And this article touches on a bunch of them.
There are some nuggets at the end with polls, research, and resources that caught my eye this week! Lots of great stuff. I hope you enjoy!
More than Meets the ITT: A Guide for Anticipating and Investigating Non-Significant Results in Survey Experiments | Journal of Experimental Political Science
Survey experiments are a popular means of testing causal theories and hypotheses in the social sciences and in industry. (In the latter, they are frequently called “A/B” tests and are used in everything from product design to strategy, messaging, marketing, and a lot more.) The promise is that a significant result tells us (more) definitively that a change in X causes a change in Y. But what about an insignificant result? Ideally, a well-designed and executed survey experiment should allow you to conclude that the evidence suggests the intervention is not causally linked to changes in Y[1]. But doing a “well-designed and executed” survey experiment is easier said than done! Non-significance can also be indicative of failures in how the experiment was set up rather than the invalidity of the tested hypotheses.
This article by John Kane identifies 7 alternative reasons why your experiment may return non-significant results that aren’t due to your underlying theory/hypothesis being “wrong.” The paper comes with suggestions on how to check for/prevent them, which I augment with some of my own thoughts throughout[2].
1. Respondent inattentiveness:
Most treatments in survey experiments require at least a bit of active participation on the respondents’ part. The interventions have to be watched, read, listened to—you know, experienced. You trust that people aren’t just browsing Reddit and clicking through your survey at random, not paying a lick of attention to your meticulously crafted intervention. Respondents not paying attention (either because the intervention is dull or because respondent quality is poor) means that you haven’t really engaged in a fair test of the underlying theory.
Some fixes: Ensure that treatments are not too dull/long (or that respondents are commensurately compensated for the engagement necessary to test your hypothesis), stratify your analyses by measures of attentiveness, and make sure your sample is generally high-quality prior to exposure to the intervention.
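To illustrate the stratification idea, here’s a minimal sketch of my own (not from the paper) that splits the treatment-effect estimate by a hypothetical attention-check variable. All column names are made up for the example:

```python
# A minimal sketch: estimate the treatment effect separately for respondents
# who passed vs. failed an attention check. All column names are hypothetical.
import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv("survey_responses.csv")  # hypothetical data file

for passed, group in df.groupby("passed_attention_check"):
    treated = group.loc[group["treated"] == 1, "outcome"]
    control = group.loc[group["treated"] == 0, "outcome"]
    diff = treated.mean() - control.mean()
    t_stat, p_val = ttest_ind(treated, control, equal_var=False)  # Welch's t-test
    print(f"Passed check = {passed}: diff = {diff:.3f}, p = {p_val:.3f}")
```

If the effect only shows up among the attentive, that’s a hint about engagement rather than about your theory.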
2. Intervention failure:
Most experimental manipulations—either explicitly or implicitly[3]—rely on the intervention affecting some mechanism of change for your dependent variable. For example, if I’m testing whether a video will drive purchasing behavior, I might be expecting the video to do so by changing my brand image. But if the manipulation does not shift this mediating force, you will not see your dependent variable change. This could be because your treatment is not strong enough to actually make this change, or it could be that it’s not operating on your anticipated causal path.
Some fixes: Ensure that your treatment is actually strong enough to reasonably make a difference in the mechanism of change[4] and test that it really is causing a shift in what you expect; conduct a manipulation check. Generally, only field instruments that you have good reason to believe will make the change you expect, whether through pre-testing, cognitive interviews, previous research, or deep engagement with the subject matter.
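A manipulation check can be as simple as comparing the hypothesized mediator across conditions before you ever look at the main outcome. A minimal sketch using the brand-image example above; the column names are my assumptions, not anything from the paper:

```python
# A minimal sketch of a manipulation check: did the treatment actually move the
# hypothesized mediator ("brand_image" here is a made-up rating item)?
import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv("survey_responses.csv")  # hypothetical data file

mediator_treated = df.loc[df["treated"] == 1, "brand_image"]
mediator_control = df.loc[df["treated"] == 0, "brand_image"]

t_stat, p_val = ttest_ind(mediator_treated, mediator_control, equal_var=False)
print(f"Mediator shift: {mediator_treated.mean() - mediator_control.mean():.2f} (p = {p_val:.3f})")
# If the mediator barely moved, a null on the outcome may say more about the
# strength of the treatment than about the theory itself.
```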
3. Pre-treated respondents:
Experimental manipulations generally assume that you’re working with an audience that has not already been treated by what you intend to do. In my experience (and as mentioned in the paper), this happened super frequently with the Covid vaccine: Everyone and their uncle was (understandably) trying to see if various vaccine endorsements would encourage people to get jabbed. What if Trump endorsed it? What if Biden? What if the Rock? The issue, though, is that social scientists and practitioners weren’t the only ones showing respondents endorsements (and detractors) of the vaccine! Respondents were seeing such messaging all the time IRL! They were, in a sense, “treated” long before they ever opened the survey. So your experiment may come back null not because your theory is wrong, but because any effect it suggests has already been wrought in the world outside.
Some fixes: Ensure that your respondents have not already received something analogous/contrary to what you are about to show them. Or, at the very least, that they are not so saturated that change is basically impossible. This may, unfortunately, mean not being able to research the hot-button issue of the day until the fervor dies down[5].
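One cheap diagnostic (my own sketch, not something from the paper) is a pre-treatment item asking about prior exposure to similar messaging, then checking how saturated the sample already is before interpreting a null:

```python
# A minimal sketch: gauge how "pre-treated" a sample already is using a
# hypothetical pre-treatment item on prior exposure to similar messaging.
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical data file
# "prior_exposure" is an assumed item with levels like "none", "some", "a lot"

exposure_shares = df["prior_exposure"].value_counts(normalize=True)
print(exposure_shares)

# If most respondents report heavy prior exposure, there may be little room
# left for your version of the message to move anyone.
if exposure_shares.get("a lot", 0) > 0.5:
    print("Warning: over half the sample reports heavy prior exposure.")
```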
4. Insufficient statistical power:
Statistical power refers to the ability of your experimental set-up to correctly differentiate an effective treatment from a “null treatment.” Power varies as a function of three things: your “significance threshold,” the potency/strength of your treatment (how much you reasonably expect to shift your outcome variable), and your sample size. Social science research (and adjacent fields such as marketing and tourism) generally suffers from underpowered studies. This leads to a lot of false negatives but also a lot of overoptimistic effect estimates.
Some fixes: In practice, you have two routes to ensuring adequate power: increasing the potency of your treatment condition such that it’s reasonably expected to have a larger effect on your outcome OR having a large enough sample size to reliably detect even small effects at your desired significance level. However, most interventions in the social world are inherently small! So, often, increasing your sample size is the most pragmatic solution.
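To make that concrete, here’s a quick power calculation using statsmodels (a sketch of my own, not from the paper), assuming a two-arm design, a 0.05 significance threshold, and a small effect of Cohen’s d = 0.2:

```python
# A minimal power sketch: sample size needed per arm to detect a small effect,
# and the power you'd actually have with a smaller sample.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many respondents per arm for 80% power to detect d = 0.2 at alpha = 0.05?
n_per_arm = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"Needed per arm: {n_per_arm:.0f}")  # roughly 394 per arm

# Flip it around: with only 200 per arm, what power do you have for d = 0.2?
power = analysis.solve_power(effect_size=0.2, nobs1=200, alpha=0.05)
print(f"Power with 200 per arm: {power:.2f}")  # roughly a coin flip
```

The point being: a “small” effect is not a small ask of your sample size.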
5. Poor measurement of the outcome variable:
There may be times when your outcome is pretty grounded: you may be interested in the kinds of things that are easy to capture without needing to climb far up the ladder of abstraction (such as click-through rate, dollars spent, time on the page, etc.). Others are much more nebulous, like “support for democracy.” What does that actually mean? The process? The promises? The vibes[6]? When you go up the ladder, you have to work harder to ensure that you’re actually tracking the thing you’re intending to measure. Your intervention may be working fine in theory: it’s your measurement that’s off.
Some fixes: Stand on the shoulders of giants. Look towards scales that have previously been published. If striking out on your own (and measuring a tough topic), try to use multi-item scales, as these are more stable measures of complex constructs. Pre-test your measure by ensuring that it actually correlates with what you’d expect it to. For example, a support-for-democracy measure should probably correlate well with measures of political tolerance and negatively correlate with support for authoritarian figures/policies.
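For a multi-item scale, two quick pre-test checks are internal consistency (Cronbach’s alpha) and whether the summed scale correlates the way you’d expect. A minimal sketch with made-up item names:

```python
# A minimal sketch: Cronbach's alpha for a hypothetical multi-item
# "support for democracy" scale, plus a convergent-validity check.
import pandas as pd

df = pd.read_csv("pretest_responses.csv")  # hypothetical pre-test data
items = df[["dem_item_1", "dem_item_2", "dem_item_3", "dem_item_4"]]  # assumed items

# Cronbach's alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)
k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha: {alpha:.2f}")

# The summed scale should correlate positively with a political tolerance measure
scale = items.sum(axis=1)
print(f"Correlation with tolerance: {scale.corr(df['political_tolerance']):.2f}")
```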
6. Ceiling and Floor Effects:
“Ceiling” and “floor” effects refer to cases where your dependent variable is fundamentally bounded—and where many of those you intend to treat lie near (or at) the ends of the measurement scale. For example, if you want to see if a new promotion will significantly improve brand perceptions among loyalty members, you may return with an insignificant result because their perception (again, these are loyalty members) is already near the max; your intervention can’t possibly move it much higher. In these cases, your result may not be insignificant because your theory is wrong or because your treatment doesn’t work well. Rather, it’s because there’s not much more room to push things overall.
Some Fixes: Don’t artificially limit the bounds of your scales if you suspect that there may be people who tend to cluster at the extremes. Multiple scale items will also work well here if you expect that top/bottom-coding will be prevalent on some articulations of your concept but not necessarily every articulation. Additionally, knowing the general distribution of your outcome variable will be important for estimating your statistical power. If small effects are all that’s possible, you should plan for small effects!
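A quick check along those lines (my own sketch, assuming a 1–7 outcome scale and made-up column names) is to look at how much of your pre-test sample is already parked at the endpoints before you field the experiment:

```python
# A minimal sketch: how much of the (pre-test) sample already sits at the top
# or bottom of a 1-7 outcome scale? Heavy clustering means little room to move.
import pandas as pd

baseline = pd.read_csv("pretest_responses.csv")  # hypothetical pre-test data

at_ceiling = (baseline["brand_perception"] == 7).mean()
at_floor = (baseline["brand_perception"] == 1).mean()
print(f"At ceiling: {at_ceiling:.1%}, at floor: {at_floor:.1%}")

# If a big chunk of respondents is already at the ceiling, size your power
# analysis around the small average shifts that remain possible.
```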
7. Countervailing Treatment Effects
Time for my favorite thing in all of experimental research: Subgroup heterogeneity!
There may be cases where an overall null effect masks numerous discernible effects among subgroups in the sample. For example, you may find a null effect of an abortion-related message on support for a Democratic candidate. This could be because the effect truly is null OR because subgroups (in this case, Republicans and Democrats) react in opposing ways that cancel out when viewed as a whole. There are significant effects for every subgroup of interest, but they balance each other out when pooled together such that the overall effect appears null[7].
Some Fixes: Identify (either theoretically or empirically[8]) subgroups within your analysis that may be affected by the treatment in different ways. A lot of this comes down to understanding the potential mechanism of change and questioning whether the assumptions you have about this mechanism will necessarily hold for everyone you’ve surveyed. You can, of course, learn more in the course of an experiment and thus gain a hunch as to who may be affected differently. But that will only be useful on the next study; it’s best to try and puzzle through that ahead of time.
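The workhorse way to look for this is an interaction term: let the treatment effect vary by the subgroup you suspect matters. A minimal sketch with hypothetical variable names (not the paper’s analysis):

```python
# A minimal sketch: a pooled model vs. a model where the treatment effect is
# allowed to differ by party ID. All variable names are hypothetical;
# "treated" is assumed to be a 0/1 indicator.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey_responses.csv")  # hypothetical data file

pooled = smf.ols("candidate_support ~ treated", data=df).fit()
by_party = smf.ols("candidate_support ~ treated * C(party_id)", data=df).fit()

print(pooled.params["treated"])  # may look like a null overall...
print(by_party.summary())        # ...while subgroups move in opposite directions
```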
And there you have it! Though this paper focuses on avoiding an outcome that can only be discerned near the end of an analysis (i.e., a statistically insignificant result for reasons apart from your underlying theory), it’s imperative to recognize that the best (and sometimes only) way to prevent these from happening is to think through them before your data hits the field. It’s just another one of the billion cases in this life of an ounce of prevention being worth more than a pound of cure.
Note: The main post has been edited to reflect that the article is not “conditionally accepted” at the Journal of Experimental Political Science but fully accepted! Which has got to be my favorite reason to make an edit ever.
Nuggets for the Week
This Knights of Columbus - Marist Poll of about 1,400 American adults fielded in early January 2024 finds that Americans, by and large, continue to support some limitations on abortion (66% of national adults support restrictions on when an abortion is allowed), but even those categorized as “Pro-life” overwhelmingly support exceptions for “rape, incest, or to save the life of the mother at any time during pregnancy” (70%, versus 27% for “no exceptions”). Respondents were also overwhelmingly supportive of pregnancy resource centers that “do not perform abortions but instead offer support to people during their pregnancy and after the baby is born” (83% support). Those surveyed are opposed (60% strongly opposed, 26% opposed; a total of 86%) to terminating a pregnancy due to the gender of the child, though many are more accepting of abortions performed if the child would be born with Down Syndrome (30% strongly oppose, 28% oppose, for a total of 58% opposed versus 40% in support).
Friends Don’t Let Friends is a GitHub repository by Chenxin Li that offers an opinionated guide to various data visualization practices. You know, the kinds that “friends don’t let friends” commit. It may be opinionated, but I agree with the lion’s share of the opinions. Some of them include “friends don’t let friends make bar charts for mean separations,” which shows why it’s important to show (as best as you can) the full distribution rather than just the mean. The only one I’m more meh on is the exhortation against pie charts. Don’t get me wrong, I’m not the biggest fan, but I actually do think they have a place depending upon your audience. Apart from that personal quibble, I find this to be a wonderful resource aimed at making our data visualizations more clear and understandable.
Floridians are facing a massive property insurance crisis (for reference, my home insurance literally doubled YOY from 2022 to 2023), and 58% of registered Florida voters, according to this poll of nearly 1,000 fielded in early January and sponsored by The Associated Industries of Florida Center for Political Strategy, do not think that state legislators are doing enough to lower insurance costs. Home insurance costs and inflation are top-of-mind issues in the nation’s 3rd largest state, which is facing an increasing home affordability crisis due to a combination of inflation and increased housing costs driven by lagging home construction and population influxes.
This paper, published in December of 2023, looks at the linguistic differences in the second-person pronouns (e.g., “you” in English) across different translations of the Mass Effect video games. Many languages have different versions of “you” to denote deference and respect: so-called “formal” and “informal” second-person pronouns. In French, “vous” is formal and “tu” is informal; in Spanish, “usted” is formal and “tú” is informal[9]. The version of “you” that you use can make a pretty big difference in the connotation of your interaction. Interestingly, the French localization of Mass Effect had many more instances of formal usage than the Spanish localization did—a pattern that is over and beyond any tendency for French/Spanish media to favor these forms generally. Meaning that some of the interactions French and Spanish players had with the same characters, and the same underlying original English, might’ve carried distinct meanings.
[1] At least not among the full population studied; heterogeneous treatment effects are always a possibility!
[2] When in doubt though, assume that the advice is John’s and that I am adding detail or otherwise building off of him.
[3] And we ought best to make it explicit when we can!
[4] Do we actually think that a single paragraph of text is enough to substantially change positions on deep-seated moral issues which have strong identitarian (and often even genetic) drivers?
[5] That said, it may actually be useful to do so even if you know that the audience has been “pre-treated” if you’re in the business of persuading folks. This is the case for interest groups especially: You don’t always get the benefit of a sterile discourse environment. And knowing if your message manages to work even when the audience is pre-treated is very useful. But, in that case, you’re going to want to condition upon level of prior exposure to get an accurate sense of movement—and your aims are decidedly more pragmatic than testing scientific theories.
[6] In practice, the answer is, unfortunately, “yes.”
[7] This would qualify as a case of Simpson’s paradox.
[8] Though if you’re doing so empirically, it’s best to have multiple independent samples to confirm your suspicions!
[9] English’s “you” was actually formerly the “formal” version, with “thou” being informal—but since you only see “thou” frequently in older writing and/or religious texts, “thou” tends to connote more gravitas today.