Sentiment Analysis of Vaccination Related Tweets and Relationship to Social Bots

Social bots are increasingly used to manipulate public opinion on topics ranging from politics to healthcare. These bots are used to spread misinformation and fake news, create confusion about what is right and wrong, and spread fear and distrust of authorities.
In this project, I attempted to find out whether bots engage differently depending on the sentiment of vaccination-related tweets. The findings could help derive guidelines on how to avoid attracting bots when using Twitter.

Methodology

1. Fetching Tweets

The Twitter API was used to collect the most recent vaccination-related tweets, from 8th to 15th August 2021, using relevant hashtags. The CoVaxxy web dashboard developed by Verna, 2021 was used to loosely identify pro- and anti-vaccination tweets. #getvaccinated was used to fetch tweets leaning pro-vaccination. The hashtags #vaccineinjury, #vaccinesideeffects, #vaccineskill, #nomandatoryvaccine, #covidvaccinescam, #leaveourkidsalone, #notovaccinepassports and #mybodymychoice were used to fetch tweets leaning anti-vaccination. Around 2500 tweets were collected from each side of the spectrum.
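As a rough illustration of the hashtag-based search, the sketch below only builds the query strings. The hashtags come from the text above; the variable and function names are hypothetical, and the actual API client, credentials and date filters used in the project are not described here, so they are left out.

```python
# Hashtag lists copied from the text above; the grouping names are illustrative.
PRO_HASHTAGS = ["#getvaccinated"]
ANTI_HASHTAGS = [
    "#vaccineinjury", "#vaccinesideeffects", "#vaccineskill",
    "#nomandatoryvaccine", "#covidvaccinescam", "#leaveourkidsalone",
    "#notovaccinepassports", "#mybodymychoice",
]


def build_query(hashtags):
    """Join hashtags with OR so a tweet matching any one of them is returned."""
    return " OR ".join(hashtags)


pro_query = build_query(PRO_HASHTAGS)
anti_query = build_query(ANTI_HASHTAGS)
```

Each query string would then be passed to whichever search endpoint or client library is in use.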

2. Cleaning and Preparing Tweets for Clustering

The original tweets contained URLs, user mentions, hashtags, emojis, special characters and mixed-case letters. All of these were removed or normalized to reduce complexity, and each tweet was converted into tidy text. To further reduce complexity, short words and stop words were also removed. Finally, the text was converted into a group of words, called a bag of words, and further standardized by lemmatization and stemming.
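The cleaning steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stop-word list is a tiny placeholder, the minimum word length of 3 is an assumption, and the function drops only the `#` symbol while keeping the hashtag word (an assumption based on hashtag-derived stems such as "getvaccin" appearing among the topic words). Stemming is left as a pluggable `stem` argument, for example NLTK's `PorterStemmer().stem`.

```python
import re

# Tiny illustrative stop-word list; a real run would use a full list
# (for example from NLTK or scikit-learn).
STOP_WORDS = {"the", "and", "for", "you", "your", "are", "this", "that", "with", "have"}


def clean_tweet(text, stem=lambda word: word, min_len=3):
    """Return a list of normalized tokens (a bag of words) for one tweet."""
    text = text.lower()                          # normalize mixed-case letters
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"@\w+", " ", text)            # drop user mentions
    text = text.replace("#", " ")                # assumed: keep hashtag word, drop symbol
    text = re.sub(r"[^a-z\s]", " ", text)        # drop emojis, digits, punctuation
    tokens = [t for t in text.split() if len(t) >= min_len and t not in STOP_WORDS]
    return [stem(t) for t in tokens]             # identity unless a stemmer is passed


example = clean_tweet("Get vaccinated today! https://t.co/abc @WHO #GetVaccinated")
# example == ["get", "vaccinated", "today", "getvaccinated"]
```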

3. Clustering for Sentiment Analysis

Two clustering algorithms were used to divide tweets into different sentiments: Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA), with the help of the sklearn and gensim packages for Python. Tweet tokens were vectorized in order to apply clustering. Term frequency (TF) vectors were fed to the LDA algorithm, whereas term frequency-inverse document frequency (TF-IDF) vectors were fed to the NMF algorithm.

NMF Clustering

Topics were divided into pro, anti and neutral sentiments based on the words in a given tweet.

Words | Sentiment
vaccin covid covidvaccin getvaccin pfizer fulli vaccineswork get effect peopl | Neutral
mybodymychoic freedom choic bodi right want mandat say abort peopl | Anti Vaccination
getvaccin wearamask mask covid peopl children kid amp wear maskup | Pro Vaccination

LDA Clustering

Topics were divided into pro, anti and neutral sentiments based on the words in a given tweet.

Words | Sentiment
vaccin getvaccin covid amp peopl day year today like wearamask | Neutral
vaccin mybodymychoic covid peopl freedom right choic novaccin getvaccin mandat | Anti Vaccination
getvaccin covid wearamask vaccin mask school kid children peopl maskup | Pro Vaccination

Bot Assessment

I used the BotOrNot package developed by Mkearney to get a bot probability score for every user retweeting a given tweet. This exercise was done on the top 500 most popular tweets. Up to 100 users retweeting a particular tweet were considered when calculating the bot score. In the end, this gave the ratio of bots to humans who retweeted the given tweet.
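The per-tweet aggregation can be sketched as follows. This is a hypothetical helper, not the package's API: it assumes the bot probability scores have already been obtained, it assumes a 0.5 cut-off for calling an account a bot (the text does not state the threshold), and it reports the share of bots among scored retweeters rather than a bots-to-humans quotient.

```python
BOT_THRESHOLD = 0.5    # assumed cut-off; not stated in the text
MAX_RETWEETERS = 100   # per-tweet cap described above


def bot_ratio(retweeter_scores, threshold=BOT_THRESHOLD, cap=MAX_RETWEETERS):
    """Fraction of (up to `cap`) retweeters whose bot probability is >= threshold."""
    scores = list(retweeter_scores)[:cap]
    if not scores:
        return None  # no retweeters were scored for this tweet
    bots = sum(1 for s in scores if s >= threshold)
    return bots / len(scores)


# Made-up scores for one tweet's retweeters, for illustration only.
ratio = bot_ratio([0.9, 0.2, 0.6, 0.1])
# ratio == 0.5
```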

Validation

I randomly selected 100 tweets and manually classified them into three categories: pro, anti and neutral sentiment. The manual labels were compared with the predictions from NMF and LDA. The comparison is shown here. The neutral category was more evenly split for NMF than for LDA. NMF had higher accuracy and a lower false positive rate than LDA, so NMF was chosen as the classifier for analyzing bot behavior.
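The two validation metrics can be computed roughly as below. This is a sketch under assumptions: the false positive rate is taken one-vs-rest per sentiment class (the text does not say how it was defined for three classes), and the labels shown are made-up placeholders, not the project's validation data.

```python
def accuracy(actual, predicted):
    """Share of tweets whose predicted label matches the manual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def false_positive_rate(actual, predicted, label):
    """One-vs-rest FPR: share of tweets NOT in `label` that were predicted as `label`."""
    negatives = [p for a, p in zip(actual, predicted) if a != label]
    if not negatives:
        return 0.0
    return sum(p == label for p in negatives) / len(negatives)


# Placeholder labels for illustration only.
manual = ["pro", "anti", "neutral", "pro", "anti"]
model = ["pro", "anti", "pro", "neutral", "anti"]

acc = accuracy(manual, model)                       # 3 of 5 labels match
fpr_pro = false_positive_rate(manual, model, "pro")  # 1 of 3 non-pro tweets predicted pro
```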

Bot Behavior

The null hypothesis was that anti- and pro-vaccination tweets attract the same level of bot engagement. It could not be rejected, with a p-value > 0.05. This analysis therefore found no evidence that bots target and spread anti-vaccination tweets more than pro-vaccination tweets.
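The text does not name the statistical test that was used. As one possible way to compare the two groups, the sketch below runs a two-sided permutation test on the difference in mean bot ratio. The test choice, the function name and the numbers are all assumptions for illustration; the values are placeholders, not the project's data.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign tweets to the two groups
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Placeholder per-tweet bot ratios, for illustration only.
anti_ratios = [0.10, 0.20, 0.15, 0.30, 0.25]
pro_ratios = [0.12, 0.22, 0.18, 0.28, 0.25]

p_value = permutation_p_value(anti_ratios, pro_ratios)
reject_null = p_value <= 0.05
```

A p-value above 0.05 here would mean, as in the analysis above, that the difference between the groups is not distinguishable from chance at that significance level.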