Analysis of #GazaUnderAttack tweets

During last month's Israeli-Palestinian conflict, I scraped about 270,000 tweets containing the hashtag #gazaunderattack from the Twitter API. Given the unprecedented degree to which this war was fought online, tweets of this sort could and should become important data for political scientists. I don't have time to say much here, but I want to share some basic descriptive analyses for anyone who might be interested. The dataset can be downloaded here, and the R script that produced these analyses appears at the bottom of this post.

Almost everything in this post I learned directly from the amazingly smart, open, and generous #Rstats community. The scraping and time-series plot follow a script by Michael Bommarito, and the rest follows scripts by Ben Marwick.

Analysis

There are 269,158 tweets, authored by 79,923 unique users. Of all the tweets, 62% are retweets.
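These figures are simple counts over the parsed data. As a quick sketch, assuming the data frame `x` and the retweet field `x$rt` constructed in the full script at the bottom of the post:

```r
# Summary statistics, assuming `x` and `x$rt` as built in the script below
nrow(x)                     # total tweets: 269,158
length(unique(x$username))  # unique authors: 79,923
mean(!is.na(x$rt))          # share of tweets that are retweets: ~0.62
```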

This first graph plots the frequency of #gazaunderattack tweets in 30-minute intervals between November 17th and November 21st. I believe this captures all of the tweets containing the hashtag within this period: the Twitter Search API is subject to odd filters and restrictions, but the technique used here pages through every tweet the API makes available within that window.
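The scraping itself was done in Python following Michael Bommarito's script. For readers curious about the paging idea, here is a minimal R sketch of the same max_id-walking approach against the old (since-retired) v1 Search API; the endpoint, parameters, and helper function here are my illustration, not the code actually used:

```r
library(httr)
library(rjson)

# Illustrative only: page backwards through search results by repeatedly
# requesting tweets older than the smallest id seen so far ("max_id paging").
# The unauthenticated v1 endpoint below has since been retired.
fetch_all <- function(query, max_pages = 100) {
  tweets <- list()
  max_id <- NULL
  for (i in seq_len(max_pages)) {
    params <- list(q = query, rpp = 100)
    if (!is.null(max_id)) params$max_id <- max_id
    resp <- GET("http://search.twitter.com/search.json", query = params)
    page <- fromJSON(content(resp, as = "text"))$results
    if (length(page) == 0) break  # no older tweets remain
    tweets <- c(tweets, page)
    # step past the oldest id seen; a real scraper would track id_str,
    # since tweet ids exceed R's double precision
    max_id <- min(sapply(page, `[[`, "id")) - 1
  }
  tweets
}

# gaza <- fetch_all("#gazaunderattack")
```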

[Figure: timeseries.png, frequency of #gazaunderattack tweets in 30-minute intervals]

Most frequent #gazaunderattack tweeters.

[Figure: tweet_counts.png, tweet counts for the most frequent authors]

The most retweeted tweeters. Interestingly, Anonymous seems to have had more reach, at least during this period, than the Twitter account of Hamas (@AlqassamBrigade).

[Figure: retweet_counts.png, retweet counts for the most retweeted accounts]

The most retweeted tweeters as a ratio of total tweets sent. Anonymous still seems to have had the most reach on the #gazaunderattack hashtag.

[Figure: retweet_ratios.png, retweets received as a ratio of total tweets sent]

The most frequently tweeted links.

[Figure: links.png, frequencies of the most-tweeted links]

The most popular link pointed to an image capturing an explosion from an Israeli airstrike in Gaza.

Again, the Python code I used to obtain the tweets and the R code I used to analyze them were lifted directly from scripts by the authors linked above.

``` r Analyze Tweets scraped with Python

x <- read.csv("tweets_#gazaunderattack.csv", header = FALSE, stringsAsFactors = FALSE)
x$username <- x$V2  # column 2 holds the author's screen name
x$text <- x$V5      # column 5 holds the tweet text

#########################################
#### Nice Time-Series Plot ####
#########################################
library(ggplot2)
x$date <- strptime(x$V4, "%a, %d %b %Y %H:%M:%S %z", tz = "EST")  # parse the raw timestamp string
x$date <- as.POSIXct(x$date, tz = "EST")
timeseries <- ggplot(data = x, aes(x = date)) +
  geom_bar(aes(fill = ..count..), binwidth = 60 * 30) +  # bin tweets into 30-minute intervals
  theme_bw() + ylab("# of Tweets") + xlab("Time")
timeseries
ggsave(file = "timeseries.png")

#########################################
#### Nice Plot of Frequent Tweeters ####
#########################################
library(ggplot2)
x$username[x$username == ""] <- NA
length(unique(x$username))  # see how many unique tweeter accounts are in the sample
counts <- table(x$username)
counts.sort <- sort(counts)
counts.sort.subset <- subset(counts.sort, counts.sort > 350)  # keep accounts that tweeted more than 350 times
counts.sort.subset.df <- data.frame(people = unlist(dimnames(counts.sort.subset)),
                                    count = unlist(counts.sort.subset))  # makes a funny sort of data frame...
counts.sort.subset.df <- data.frame(people = as.factor(counts.sort.subset.df$people),
                                    counts = as.numeric(counts.sort.subset.df$count))  # a cleaner data frame for ggplot
# note: opts()/theme_text() are pre-0.9.2 ggplot2 syntax; later versions use theme()/element_text()
ggplot(counts.sort.subset.df, aes(reorder(people, counts), counts)) +
  xlab("Author") + ylab("Number of messages") +
  geom_bar(stat = "identity") + coord_flip() + theme_bw() +
  opts(axis.title.x = theme_text(vjust = -0.5, size = 14)) +
  opts(axis.title.y = theme_text(size = 14, angle = 90))  # nicely ordered tweet counts per author
ggsave(file = "tweet_counts.pdf")  # export the plot to a PDF file
ggsave(file = "tweet_counts.png")

###########################################
#### Nice Plot of Frequent Re-Tweeters ####
###########################################
library(stringr)
x$text <- sapply(x$text, function(row) iconv(row, to = "UTF-8"))  # remove odd characters
trim <- function(x) sub("@", "", x)  # remove the @ symbol from user names
x$rt <- sapply(x$text, function(tweet) trim(str_match(tweet, "^RT (@[[:alnum:]_]*)")[2]))  # extract who has been RT'd
sum(!is.na(x$rt))  # see how many tweets are retweets
sum(!is.na(x$rt)) / length(x$rt)  # the ratio of retweets to tweets
countRT <- table(x$rt)
countRT <- sort(countRT)
countRT.subset <- subset(countRT, countRT > 1000)  # keep accounts RT'd more than 1000 times
countRT.subset.df <- data.frame(people = as.factor(unlist(dimnames(countRT.subset))),
                                RT_count = as.numeric(unlist(countRT.subset)))
ggplot(countRT.subset.df, aes(reorder(people, RT_count), RT_count)) +
  xlab("Author") + ylab("Number of messages retweeted by others") +
  geom_bar(stat = "identity") + coord_flip() + theme_bw() +
  opts(axis.title.x = theme_text(vjust = -0.5, size = 14)) +
  opts(axis.title.y = theme_text(size = 14, angle = 90))  # ordered counts for accounts with > 1000 retweets
ggsave(file = "retweet_counts.png")

###########################################
#### Nice Plot of RT-Tweet Ratios #########
###########################################
t <- as.data.frame(table(x$username))  # counts of tweets per person
rt <- as.data.frame(table(x$rt))       # counts of retweets per person
t.rt <- merge(t, rt, by = "Var1")      # combine tweet and retweet counts per person
t.rt["ratio"] <- t.rt$Freq.y / t.rt$Freq.x  # new column: retweets received per tweet sent
sort.t.rt <- t.rt[order(t.rt$ratio), ]  # sort names by ratio
sort.t.rt.subset <- subset(sort.t.rt, sort.t.rt$Freq.y > 1000)  # keep accounts retweeted more than 1000 times
sort.t.rt.subset <- droplevels(sort.t.rt.subset)  # drop unused factor levels that got in there somehow...
ggplot(sort.t.rt.subset, aes(reorder(Var1, ratio), ratio)) +
  xlab("Author") + ylab("Retweets as a ratio of total tweets") +
  geom_bar(stat = "identity") + coord_flip() + theme_bw() +
  opts(axis.title.x = theme_text(vjust = -0.5, size = 14)) +
  opts(axis.title.y = theme_text(size = 14, angle = 90))
ggsave(file = "retweet_ratios.png")

###########################################
#### Nice Plot of Most Popular Links ######
###########################################
x$link <- sapply(x$text, function(tweet) str_extract(tweet, "http[[:print:]]+"))  # extract the link contained in the tweet
x$link <- sapply(x$text, function(tweet) str_extract(tweet, "http[[:print:]]{16}"))  # keep just 16 characters after "http" to isolate the shortened link; all links here are shortened, so this works, but a proper URL regex would be more robust
countlink <- table(x$link)  # frequency of each link
countlink <- sort(countlink)
barplot(countlink)  # quick base-graphics plot of the frequencies
# or, with ggplot2:
countlink <- data.frame(table(na.omit(x$link)))
countlink <- subset(countlink, countlink$Freq > 300)  # keep links tweeted more than 300 times
ggplot(countlink, aes(reorder(Var1, Freq), Freq)) +
  geom_bar(stat = "identity") + coord_flip() + theme_bw() +
  xlab("Link") + ylab("Frequency") +
  opts(axis.title.x = theme_text(vjust = -0.5, size = 14)) +
  opts(axis.title.y = theme_text(size = 14, angle = 90))
ggsave(file = "links.png")

```
