Universal Healthcare Discussions NLP

Abstract

The goal of this project was to use natural language processing techniques to build a topic model that delineates the subtopics of discussion surrounding universal healthcare on Twitter. I worked with Twitter data gathered through twint, a Python package for collecting tweets, to identify a list of 7 subtopics that make up the main categories of conversation surrounding universal healthcare. After refining and interpreting the topic model, I performed sentiment analysis on each of its topics to see which subtopics might be more contentious and should be approached with more caution, analysis, and thoughtfulness.

Design

This project was designed according to a generic NLP workflow. Once the data was gathered, preprocessed, and explored, it was vectorized and used for topic modeling, followed by sentiment analysis. The topic model has potential as a research outline for universal healthcare campaigns. Campaigns aiming to shift conversations surrounding universal healthcare can use this topic model as a starting point for understanding hesitations about universal healthcare, so that they can better craft rebuttals to those hesitations.

Data

The dataset contains 26,457 tweets. All the tweets were pulled using twint, a Python package for gathering Twitter data. The search queries were "universal healthcare", "affordable healthcare", and "healthcare", all of which were removed as stopwords before topic modeling.

Breakdown of Code Notebooks:

This Notebook contains code for pulling data using twint.
This Notebook contains code for data cleaning.
This Notebook contains code for Topic Modeling.
This Notebook contains code for whole data sentiment analysis.
This Notebook contains code for sentiment analysis by topic.

Algorithms

Data Preprocessing

Cleaning out all URLs and emojis from tweets and normalizing formatting (lowercasing, etc.)
Using spaCy to remove stopwords while tokenizing
Lemmatizing tweets
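
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the project used spaCy for stopword removal and lemmatization, which this sketch does not reproduce. The emoji ranges, the small `STOPWORDS` set, and the `clean_tweet` helper are all assumptions made for the example, including the choice to drop the individual words of the search queries.

```python
import re

# Matches http(s) links and bare www links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Rough emoji code point ranges (assumption: not an exhaustive list).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]+"
)

# Hypothetical stopword set: a few common words plus the words
# that make up the search queries.
STOPWORDS = {
    "the", "a", "an", "is", "for", "and", "of", "to",
    "universal", "affordable", "healthcare",
}


def clean_tweet(text):
    """Strip URLs and emojis, lowercase, tokenize, and drop stopwords."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.lower()
    tokens = re.findall(r"[a-z']+", text)
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = clean_tweet("Universal healthcare is a RIGHT 🙌 https://t.co/abc123")
print(tokens)
```

In a spaCy-based pipeline, the tokenizing and filtering step would typically be replaced by iterating over a processed document and keeping each token's lemma.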

Vectorization & Topic Modeling

Document term matrix with tfidf vectorizer
Fitting NMF model with 5 components
Interpreting 5 topics

The topic modeling portion involved iteratively testing various numbers of components and settling on 5, based on which setting gave the most coherent sets of most frequent words in each topic, along with some subjective comparisons of interpretability.

Final Topics Yielded from Topic Model

Political
Human Rights
Quality of Life
Infrastructure
Comparisons to other countries
Public Services
Accessibility

Sentiment Analysis

Sentiment analysis on whole dataset
Sentiment analysis by topic

Every topic had a higher percentage of tweets labelled as "positive". Although automated sentiment analysis is still an imperfect process, ranking the percentages within each topic might help to guide further exploration.

Topics ranked from most to least positive sentiment:

Quality of Life
Public Service/Accessibility
Infrastructure
Political/Human Rights
Comparisons to other countries

Topics ranked from most to least negative sentiment:

Political/Human Rights
Comparisons to other countries
Infrastructure
Public Service/Accessibility
Quality of Life
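
Rankings like these come from comparing label percentages within each topic. The sketch below shows one way that calculation might look, assuming each tweet already has a topic label and a VADER-style compound score between -1 and 1. The 0.05 cutoff is the conventional VADER threshold, but whether the notebooks used it is an assumption. The `rows` values and the `topic_a`/`topic_b` names are made-up toy data, not project results.

```python
from collections import Counter, defaultdict

LABELS = ("positive", "neutral", "negative")


def label(compound, threshold=0.05):
    """Map a compound score to a sentiment label (assumed cutoff)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_shares(rows):
    """Return {topic: {label: percentage}} from (topic, compound) pairs."""
    counts = defaultdict(Counter)
    for topic, compound in rows:
        counts[topic][label(compound)] += 1

    shares = {}
    for topic, counter in counts.items():
        total = sum(counter.values())
        shares[topic] = {
            name: 100.0 * counter[name] / total for name in LABELS
        }
    return shares


def rank_topics(shares, name):
    """Order topics from highest to lowest share of the given label."""
    return sorted(shares, key=lambda topic: shares[topic][name], reverse=True)


# Toy, made-up scores for two hypothetical topics.
rows = [
    ("topic_a", 0.8), ("topic_a", 0.4), ("topic_a", -0.3), ("topic_a", 0.6),
    ("topic_b", 0.5), ("topic_b", -0.7), ("topic_b", -0.2), ("topic_b", 0.0),
]

shares = sentiment_shares(rows)
print(rank_topics(shares, "positive"))
print(rank_topics(shares, "negative"))
```

Ranking by percentage rather than raw count keeps topics with very different numbers of tweets comparable.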

Tools

pandas
twint for data gathering
spaCy for text preprocessing
scikit-learn (sklearn) for topic modeling
VADER for sentiment analysis

Communication

There are slides accompanying the notebooks of code that outline the process/workflow of the entire project.
