Use recursive partitioning and decision trees to guess someone's political opinions (using R with survey data)

Let’s say you meet someone at a dinner party, and you really want to know their opinion about some juicy political topic. If you’re me, you might just ask them straight up. But often, there are reasons why you can’t be so forward.

Thankfully, political opinions are not randomly distributed. So if you can glean a few facts about someone, you can often predict their other, more hidden opinions. The challenge is knowing which pieces of small-talk are the most predictive of the jucier questions you can’t quite ask them.

Maybe you only have one minute; which questions should you ask? In what order should you ask them? If there’s no time limit, how many questions should you ask before you’re wasting your time?

The ideal solution would be a short list of questions, in a particular order, like a tree: If they say yes to this, then ask them this, and so on. Ideally, this list would be as short and simple as necessary to make the best guess possible.

It turns out there is a statistical technique that matches this challenge perfectly. Recursive partitioning is a technique that produces decision trees of binary or “yes/no” questions. The resulting decision tree represents the statistically optimal path of questioning.

How to conduct recursive partitioning in R

Let’s say you want to predict whether or not someone supports the suppression of racist speakers. A bit akward to ask straight up at a dinner party. We can use a large public dataset such as the General Social Survey, run some recursive partitioning, and obtain the “tree” of mundane questions we can respectably ask someone at a dinner party.

First, identify the variable of interest. In the GSS, there is a variable that asks whether racist speakers should be allowed to speak or not. The name of the variable is spkrac. (In my processing, I renamed this variable spkrac.fac.) Then identify the variables representing mundane facts you can observe yourself or reasonably ask about within the norms of small talk. I identified the following mundane variables (for more information, search for the GSS codebook):

  • sex/gender = variable named sex
  • race = variable named race
  • left/right identification = variable named pol
  • family income = variable named realinc
  • college attendance = variable named college
  • word knowledge or verbal skill (proxy for IQ) = variable named wordsum

How to conduct recursive partitioning for a binary dependent variable in R

In this case, our variable of interest is a binary variable: one either allows or disallows racist speakers. The modeling routines are different for different variables of interest, but so long as your variable of interest is binary, you can more or less follow the following routine. The code is commented.

First we need to import the raw GSS data and process the variables we’ll be using.

# Import the necessary packages

gssFullPanel <- read_dta("/Users/justin/Dropbox/Public/political-data/GSS7218_R1.dta")

df <- subset(gssFullPanel,
                 select = c(sex, educ, realinc, 
             polviews, spkrac, race, ethnic, wordsum))

# Race
# Optional: Assign NA to other because it's ambiguous and wildly unstable over time
# factor() removes empty factor levels, which is useful here because IAP = 0
levels(df$race) <- c("White","Black","Other")
# levels(df$race)[levels(df$race)=='Other'] <- NA

# Sex
# as_factor() from haven package brings in the labels for factor levels
levels(df$sex) <- c("Male","Female")

# College
df$college <- ifelse(as.numeric(df$educ)<=12,
                     "No College",
                     "At Least Some College")
df$college <- factor(df$college,
                     levels=c("No College", 
                              "At Least Some College"))

df$spkrac.fac <- factor(df$spkrac -1) # Reduce to 0 and 1 and convert to factor
levels(df$spkrac.fac)<-c("Allowed", "Not Allowed") # Rename levels

df$pol <- as_factor(df$polviews)
df$pol <- dplyr::recode(df$pol,
                            "EXTREMELY LIBERAL" = "Extremely Liberal",
                            "liberal" = "Liberal",
                            "SLIGHTLY LIBERAL" = "Slightly Liberal",
                            "moderate" = "Moderate",
                            "SLGHTLY CONSERVATIVE" = "Slightly Conservative",
                            "conservative" = "Conservative",
                            "EXTRMLY CONSERVATIVE" = "Extremely Conservative",
                            .default = NA_character_)

Now we can conduct the partitioning. The code is pretty simple.

# Start by subsetting your main dataset ("df") to just the variables
# of interest. We'll also omit any rows with missing values.
sub<-df %>%
  select(c("spkrac.fac", "sex", "race", "pol",
           "realinc", "college", "wordsum")) %>%

# Next conduct the recursive partitioning using the rpart() function
# Variable of interest first, then a "~", as follows:
fit <- rpart(spkrac.fac ~ sex + race + pol + college + realinc + wordsum,
             # With small data, set 'minsplit'and 'cp' to low values
             control=rpart.control(minsplit=2, cp=0.001),
             method="class", # if you're predicting a class, rather than a number

# Now "prune the tree" with cross-validation, aka keep only the branches
# with the most predictive value
pfit<- prune(fit, cp=fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])

# Plot the decision tree
yesno=2, # add yes/no labels
box.palette="GnBu", # add a swankier color palette
tweak=1.1, # increase the font size a bit
main="Decision Tree Predicting If Someone Wants to Suppress Racist Speakers")


The interpretation is very intuitive. The top of the tree is the first question you should ask, the single variable that gives you the most leverage to predict your victim’s secret opinion. The top node represents your knowledge before asking them anything: If you meet a random person, there is a 38% chance they would vote to suppress racist speakers. This applies to the whole population of America (100%.)

According to the analysis, the first question with the most predictive power is: “Did you ever go to college?” If they say yes, the probability of them wanting to suppress drops to 29% and that’s all you need to ask them! Nothing else will improve your guess (at least from the variables we selected).

If they never went to college, the next question you have to ask yourself is whether they’re smart. You probably don’t want to give them a vocabulary test, but conversation is pretty naturally revealing. So if they are on the dumb side, consider them a “no” in this chart. If they are smart, then you infer with a 40% probablility that they would disallow the speaker (you infer that they would allow). If they are dumb, it’s now a coin flip (and 26% of the population is here).

Next, what is their race? This you can probably guess yourself. From here:

  • If they are white, then there’s a 48% chance they would disallow the speaker. If then they are male, there’s a 45% chance they would disallow and that’s your final guess (that they would allow). If they are female, you can see if their family is rich or not. If rich, they are slightly less likely than a coin flip to disallow the speaker (46%); if poor, they’re slightly more likely than a coin flip to disallow the speaker (54%).

  • If they are dumb and non-white,then there’s a 57% chance they would not allow the speaker. And that’s your best guess.


For a relatively small dataset, the results strike me as reasonably useful (and intuitively plausible). To gauge this, consider that a “yes” to our first question (whether they attended any college) gives us a pretty good bet that our interlocutor would allow racist speakers (71%), and that represents about 50% of people! With only a few more questions, two of which we can observe without asking, we can get to as much as a 57% guess that they would disallow.

Probably the major limitation of this short study is that we don’t get very impressive traction on those who would disallow racist speakers. The highest we get is a 57% probability, which is non-trivially better than if we know nothing about someone (in which case we’d guess a 38% probability). But it’s still not great. All this means is that we need better predictor variables, which we could identify through additional research. But as a first cut with a small handful of variables I gathered based only on my own intuitions, these results are pretty good, I think.

Perhaps the most interesting implication here is that ideological identification totally drops out! It appears that it’s not predictive. This is fascinating, given that many people today tend to think of speech suppression is a fashion on the educated Left! And it is, but that’s only a highly visible minority. Political scientists would not be surprised by this result: We’ve long known that leftists and educated people are always more supportive of free expression (you just don’t hear about those people in the media right now).

The next time you’re at a party, please try this. Let me know how it works!


comments powered by Disqus