The goal of this project was to use natural language processing techniques to build a topic model delineating the subtopics of discussion surrounding universal healthcare on Twitter. I worked with Twitter data gathered through twint, a Python package for collecting tweets, to identify a list of 7 subtopics that are the main categories of conversation surrounding universal healthcare. After refining and interpreting the topic model, I performed sentiment analysis on each of its topics to analyse which subtopics might be more contentious and should be approached with more caution, analysis, and thoughtfulness.
This project followed a generic NLP workflow. Once the data was gathered, preprocessed, and explored, it was vectorized and used for topic modeling, followed by sentiment analysis. The topic model has potential as a research outline for universal healthcare campaigns. Campaigns aiming to shift conversations surrounding universal healthcare can use this topic model as a starting point for understanding hesitations about universal healthcare, so that they can better craft rebuttals to them.
The dataset contains 26,457 tweets, all pulled using twint, a Python package for gathering Twitter data. The search queries were "universal healthcare", "affordable healthcare", and "healthcare", all of which were removed as stopwords before topic modeling.
- This Notebook contains code for pulling data using twint.
- This Notebook contains code for data cleaning.
- This Notebook contains code for Topic Modeling.
- This Notebook contains code for whole data sentiment analysis.
- This Notebook contains code for sentiment analysis by topic.
Data Preprocessing
- Cleaning out all URLs and emojis from tweets and formatting the text (lowercasing, etc.)
- Using spaCy to remove stopwords while tokenizing
- Lemmatizing tweets
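The cleaning step above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function name `clean_tweet`, the URL pattern, and the emoji ranges are all assumptions, and the emoji ranges are deliberately rough rather than exhaustive. The spaCy stopword removal and lemmatization steps are not shown here.

```python
import re

# Assumed pattern: anything starting with http(s):// or www. up to the next space.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Rough, non-exhaustive emoji coverage (an assumption, not the notebook's list).
EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often trails an emoji
    "]+"
)


def clean_tweet(text: str) -> str:
    """Strip URLs and emojis, lowercase, and collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


print(clean_tweet("Universal Healthcare NOW 🙌 https://t.co/abc123 #M4A"))
```

The output of a function like this would then be passed to spaCy for tokenizing, stopword removal, and lemmatizing.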
Vectorization & Topic Modeling
- Document-term matrix with TF-IDF vectorizer
- Fitting NMF model with 5 components
- Interpreting 5 topics

The topic modeling portion involved iteratively testing various numbers of components and settling on 5, based on how coherent the most frequent words in each topic were and on some subjective comparisons of interpretability.
Final Topics Yielded from Topic Model
- Political
- Human Rights
- Quality of Life
- Infrastructure
- Comparisons to other countries
- Public Services
- Accessibility
Sentiment Analysis
- Sentiment analysis on whole dataset
- Sentiment analysis by topic
Every topic had a higher percentage of tweets labelled as "positive". Though sentiment analysis is still a fairly flawed process, ranking the percentages within each topic might help to guide further exploration.
Most to least positive sentiment:
- Quality of Life
- Public Services/Accessibility
- Infrastructure
- Political/Human Rights
- Comparisons to other countries
Most to least negative sentiment:
- Political/Human Rights
- Comparisons to other countries
- Infrastructure
- Public Services/Accessibility
- Quality of Life
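Rankings like the ones above can be derived by labelling each tweet from its sentiment score and comparing label percentages per topic. The sketch below assumes each tweet already has a topic and a VADER-style compound score in [-1, 1]. The ±0.05 cutoffs are VADER's conventional thresholds, and it is an assumption that the project used them. The topic names and scores are placeholders, not project results.

```python
from collections import Counter, defaultdict

# Placeholder (topic, compound score) pairs; values are invented for illustration.
scored = [
    ("topic_a", 0.6), ("topic_a", 0.3), ("topic_a", 0.02), ("topic_a", 0.0),
    ("topic_b", 0.5), ("topic_b", -0.7), ("topic_b", -0.2), ("topic_b", -0.1),
    ("topic_c", 0.8), ("topic_c", 0.1), ("topic_c", 0.4), ("topic_c", -0.3),
]


def label(compound, threshold=0.05):
    """Map a compound score to a label using the assumed cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_shares(scored):
    """Return {topic: {label: percentage of that topic's tweets}}."""
    counts = defaultdict(Counter)
    for topic, compound in scored:
        counts[topic][label(compound)] += 1
    shares = {}
    for topic, c in counts.items():
        total = sum(c.values())
        shares[topic] = {
            lab: 100 * c[lab] / total
            for lab in ("positive", "neutral", "negative")
        }
    return shares


shares = sentiment_shares(scored)
most_positive = sorted(shares, key=lambda t: shares[t]["positive"], reverse=True)
most_negative = sorted(shares, key=lambda t: shares[t]["negative"], reverse=True)
print(most_positive)
print(most_negative)
```

Ranking by percentage rather than raw count keeps topics with many tweets from dominating the comparison.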
- pandas
- twint for data gathering
- spacy for text preprocessing
- sklearn for topic modeling
- VADER for sentiment analysis
Slides accompanying the notebooks outline the process and workflow of the entire project.