A quick-start guide to the Statistical analysis of political attitudes and behaviors

Code and instructions to start a statistical research project using R.

Overview

The purpose of this document is to get you up and running, quickly, with (potentially) professional-level statistical analysis of political data. The starter script automatically imports the Eurobarometer Trend File (1970-2002), a large survey dataset that asks individuals across European countries a wide variety of questions in many different years. The document is intended to rapidly “bootstrap” individuals with zero programming knowledge into conducting fairly sophisticated analyses of a wide variety of political questions. It is aimed especially at undergraduate political science students with no background in computer programming who are interested in questions about political attitudes and behaviors and who wish to challenge themselves and build their skills and experience with quantitative data analysis. You can use this as a springboard to learning R independently, but this document is not intended to teach you that. It’s a purely practical guide to jumpstart your capacity to produce analyses; everything else is up to you.

If you find this fun/interesting/useful/exciting, then you might want to actually learn the language a bit more. For excellent, easy, free interactive tutorials try: https://www.datacamp.com/courses/free-introduction-to-r and/or http://tryr.codeschool.com/

Setup

  1. Download and install R (it’s free).
  2. Download and install RStudio (it’s free).
  3. Find and download the Eurobarometer codebook for your reference.
  4. Open RStudio

R is the programming language and RStudio is the application we will actually open, see, and do our work in.

  5. Load this starter script I have written for you.

After opening RStudio, go to File > New File > R Script and copy-paste the starter script into the new script file. Or download the starter script as a .R file, go to File > Open File, and select that .R file. Then select all of the text in the script and click “Run” (or press Cmd+Enter on a Mac, Ctrl+Enter on Windows). Your computer will do some basic data wrangling, then estimate and visualize the relationship between media exposure and the probability that individuals in the UK would vote in the 1994 European Parliament elections.

To do your own original analysis of something totally different, peruse the codebook of the Eurobarometer data and copy or tinker with the code in my script to apply it to your own variables of interest. You can also choose different countries and years, but in your first analyses, focus on one country in one year, because things get statistically complicated when you have multiple countries and multiple years.
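When you adapt the script to a new survey question, the recoding pattern it uses for each variable is always the same: turn the non-answers into missing values, drop the now-empty categories, and reverse the scale so that higher numbers mean "more". Here is a minimal sketch of that pattern on made-up answers. The helper name recode_scale and the toy data are illustrative assumptions, not part of the starter script or of any package.

```r
# Toy survey answers; the labels imitate the Eurobarometer style but are made up.
answers <- factor(c("VERY HIGH", "low", "dk", "high", "VERY LOW", "na", "high"),
                  levels = c("VERY HIGH", "high", "low", "VERY LOW", "dk", "na"))

# Hypothetical helper (an assumption for illustration): set non-answers to NA,
# drop the unused levels, and reverse-score the remaining categories.
recode_scale <- function(x, missing_labels) {
  x[x %in% missing_labels] <- NA   # non-answers become missing
  x <- factor(x)                   # drops levels that no longer occur
  nlevels(x) - as.numeric(x)       # first level gets the highest score, last gets 0
}

scores <- recode_scale(answers, c("dk", "na"))
print(scores)
```

One caveat with this approach: factor() also drops any real answer category that happens to have zero respondents in your subset, which would shift the scores. Check levels() on your variable before and after recoding.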

Run the script and voilà

### Starter script for analyzing the Eurobarometer Trend data (1970-2002).
### Outline:
### 1. Import the dataset straight from the web
### 2. Isolate variables of interest and clean them up
### 3. Estimate a regression model of what makes people turn out for EP elections
### 4. Simulate and plot the predicted probability of voting

### Install and load the packages we're going to use
### Remove the "#" in the following three lines if you have not already installed these
# install.packages("ggplot2")
# install.packages("foreign")
# install.packages("Zelig")
library(ggplot2)
library(foreign)
library(Zelig)

### Define the URL where the dataset is located
url<-"https://dl.dropboxusercontent.com/u/20498362/eurobarometer_trends/eurobarometer_trends.dta?raw=1"

### Read the dataset straight from the web, name the dataframe "data"
data <- read.dta(url)

### Variable "mediause" asks how much the respondent relies on the media.
summary(data$mediause) # Get a simple summary
## VERY HIGH      high       low  VERY LOW        dk        na      inap 
##    166824    128310     71049     19534      2514       445      5612 
##      NA's 
##    740096
data$mediause[data$mediause=="dk"]<-NA # Let's consider all the "dk" (don't know) answers as missing (NA)
data$mediause[data$mediause=="na"]<-NA # Let's consider all the "na" (no answer) as missing (NA)
data$mediause[data$mediause=="inap"]<-NA # Let's consider all the "inap" as missing (NA)
data$mediause<-factor(data$mediause) # this removes unused levels
levels(data$mediause)<-c("Very high", "High", "Low", "Very low") # Clean up the messy names of the levels
data$medianum<-as.numeric(data$mediause) # convert the categories to a numerical scale (cheating a bit!)
data$medianum<-(4 - data$medianum) # Subtract from 4 to make it more intuitive (higher number = more media use)

### Variable "particip" asks respondent how likely they are to vote in the EP elections.
summary(data$particip)
## CERTAINLY YES  PROBABLY YES  PROBABLY NOT CERTAINLY NOT       depends 
##         86318         37950         13719         16375          6192 
##         DK,NA          inap          NA's 
##         10030          7319        956481
data$particip[data$particip=="DK,NA"]<-NA
data$particip[data$particip=="depends"]<-NA
data$particip[data$particip=="inap"]<-NA
data$particip<-factor(data$particip)
levels(data$particip)<-c("Certainly yes", "Probably yes", "Probably not", "Certainly not")
data$participnum<-as.numeric(data$particip) # convert the categories to a numerical scale (cheating a bit!)
data$participnum<-(4 - data$participnum) # Subtract from 4 to make it more intuitive (higher number = more likely to vote)
data$particip <- ifelse(data$participnum>=2, 1, 0) # Collapse into a no/yes, 0/1, binary version

### Variable "income" asks the respondent's income level.
summary(data$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    5.00    8.00   29.28   96.00   99.00   52308
### Variable "polint" asks the respondent's political interest generally.
summary(data$polint)
##   A GREAT DEAL TO SOME EXTENT       NOT MUCH     NOT AT ALL          DK,NA 
##          12311          36657          35622          24101            968 
##           inap           NA's 
##           2583        1022142
data$polint[data$polint=="DK,NA"]<-NA
data$polint[data$polint=="inap"]<-NA
data$polint<-factor(data$polint)
levels(data$polint)<-c("A great deal", "To some extent", "Not much", "Not at all")
data$polintnum<-as.numeric(data$polint) # convert the categories to a numerical scale (cheating a bit!)
data$polintnum<-(4 - data$polintnum) # Subtract from 4 to make it more intuitive (higher number = more political interest)
summary(data$polintnum)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     1.0     1.0     1.3     2.0     3.0 1025693
### Variable "ecint3" asks the respondent's interest in EU politics (three answer categories).
summary(data$ecint3)
## VERY INTERESTED        A LITTLE      NOT AT ALL          DK, NA 
##           17438           39654           20800            2727 
##            inap            NA's 
##               0         1053765
data$ecint3[data$ecint3=="DK, NA"]<-NA
data$ecint3[data$ecint3=="inap"]<-NA
data$ecint3<-factor(data$ecint3)
levels(data$ecint3)<-c("Very interested", "A little", "Not at all")
data$ecint3num<-as.numeric(data$ecint3) # convert the categories to a numerical scale (cheating a bit!)
data$ecint3num<-(3 - data$ecint3num) # Subtract from 3 (only three categories here) so that higher number = more interest in EU politics
summary(data$ecint3num)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0       1       1       1       2 1056492
### Variable "ecint4" asks the respondent's interest in EU politics (four answer categories).
summary(data$ecint4)
##   A GREAT DEAL TO SOME EXTENT       NOT MUCH     NOT AT ALL         DK, NA 
##          14316          48580          49690          28588           2288 
##           inap           NA's 
##           4611         986311
data$ecint4[data$ecint4=="DK, NA"]<-NA
data$ecint4[data$ecint4=="inap"]<-NA
data$ecint4<-factor(data$ecint4)
levels(data$ecint4)<-c("A great deal", "To some extent", "Not much", "Not at all")
data$ecint4num<-as.numeric(data$ecint4) # convert the categories to a numerical scale (cheating a bit!)
data$ecint4num<-(4 - data$ecint4num) # Subtract from 4 to make it more intuitive (higher number = more interest in EU politics)
summary(data$ecint4num)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     1.0     1.0     1.3     2.0     3.0  993210
### Variable "nation1" captures the respondent's nation
summary(data$nation1)
##           france          belgium      netherlands     GERMANY-WEST 
##            91414            89906            89269            91498 
##            italy       luxembourg          denmark          ireland 
##            92163            38459            84566            84395 
##    GREAT BRITAIN NORTHERN IRELAND           greece            spain 
##            88792            25303            70266            60123 
##         portugal     GERMANY-EAST           norway          finland 
##            59947            49641            11989            37869 
##           sweden          austria      switzerland 
##            33671            35113                0
data$gb<-ifelse(data$nation1=="GREAT BRITAIN", 1, 0)

### Variable "year" captures the year of the survey
summary(data$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1970    1987    1994    1992    1999    2002
##########################################################################
### Zero in on the subset of the sample relevant to your research question
##########################################################################

# Subset the data to one country in one year because having multiple countries/years gets complicated statistically
dataUK1994<-subset(data, nation1=="GREAT BRITAIN" & year==1994, select=c("particip", "medianum", "polintnum", "income", "nation1"))

##########################################################################
### Always do some basic visual inspection / descriptive analysis of your key variables
##########################################################################

qplot(dataUK1994$particip) +
  labs(x="Likely to vote in the EU Parliament election?")

qplot(dataUK1994$medianum) +
  labs(x="Degree of media usage")

#######################################################
####### Regression analysis using the Zelig package
#######################################################

# DV is likely to vote or not likely to vote, i.e. a binary variable.
# So we need a "logit" model

model<-zelig(particip ~
               income + medianum,
             data=dataUK1994,
             model="logit")
## How to cite this model in Zelig:
##   R Core Team. 2007.
##   logit: Logistic Regression for Dichotomous Dependent Variables
##   in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
##   "Zelig: Everyone's Statistical Software," http://zeligproject.org/
summary(model)
## Model: 
## 
## Call:
## z5$zelig(formula = particip ~ income + medianum, data = dataUK1994)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7077  -1.4148   0.7351   0.8428   1.1639  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.234961   0.211419   1.111 0.266418
## income      -0.002073   0.001648  -1.257 0.208654
## medianum     0.320133   0.087705   3.650 0.000262
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1220.7  on 1000  degrees of freedom
## Residual deviance: 1205.5  on  998  degrees of freedom
##   (2189 observations deleted due to missingness)
## AIC: 1211.5
## 
## Number of Fisher Scoring iterations: 4
## 
## Next step: Use 'setx' method
x.out <- setx(model, medianum=seq(0,3,1)) # Set media to a range from its min to its max
s.out <- sim(model, x = x.out)

ci.plot(s.out,
        main="Media Usage and Voting in the European Election (UK in 1994)")
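
To see roughly where the plotted curve comes from, you can turn the coefficients reported in the summary(model) output above into predicted probabilities by hand. The sketch below uses only base R. The coefficients are copied from that output, but holding income at 8 is my assumption for illustration; the Zelig plot may hold income at a different value, so expect the numbers to differ somewhat from the figure.

```r
# Coefficients copied from the summary(model) output in the script above.
b_intercept <- 0.234961
b_income    <- -0.002073
b_medianum  <- 0.320133

# Assumption for illustration only: hold income fixed at 8.
income_value <- 8
medianum <- 0:3   # the full range of the recoded media-use scale

# Linear predictor on the log-odds scale, then the inverse logit via plogis().
log_odds  <- b_intercept + b_income * income_value + b_medianum * medianum
prob_vote <- plogis(log_odds)

print(data.frame(medianum = medianum, prob_vote = round(prob_vote, 3)))

# Odds ratio for a one-step increase in media use.
print(exp(b_medianum))
```

Under that assumption, the predicted probability of being likely to vote rises from about 0.55 at the lowest level of media use to about 0.76 at the highest. These are point estimates only; the simulation step in the script is what adds the confidence bands.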


Citation for this post:

Murphy, Justin. 2016. "A quick-start guide to the Statistical analysis of political attitudes and behaviors," http://jmrphy.net/blog/2016/12/06/a-quick-start-guide-to-the-statistical-analysis-of-political-attitudes-and-behaviors/ (August 13, 2017).