little-mallet-wrapper

This is a little Python wrapper around the topic modeling functions of MALLET.

Installation

pip install little_mallet_wrapper

Requirements

Python 3.7
MALLET
pandas
numpy
seaborn (for plotting functions)

Hint: Make sure you have MALLET installed correctly (it relies on Java, so you'll also need a JDK). Installation instructions are available here.

Usage

See demo.ipynb for a demonstration of how to use the functions in little-mallet-wrapper.

To get started quickly, use the quick_train_topic_model() function with your MALLET path, an output directory (where you want the model to save everything), the number of topics, and a list of strings (your training data).

Documentation

`print_dataset_stats(training_data)`

Displays basic statistics about the training dataset.

Name	Type	Description
`training_data`	list of strings	Documents that will be used to train the topic model.

`process_string(text, lowercase=True, remove_short_words=True, remove_stop_words=True, remove_punctuation=True, numbers='replace', stop_words=STOPS)`

A simple string processor that prepares raw text for topic modeling. CAUTION: Depending on your data, you might need to write your own processing function. Do not rely on this function for non-English languages; both the stopword list and the punctuation removal assume English as input.

Name	Type	Description
`text`	string	Individual document to process.
`lowercase`	boolean	Whether or not to lowercase the text.
`remove_short_words`	boolean	Whether or not to remove words with fewer than 2 characters.
`remove_stop_words`	boolean	Whether or not to remove stopwords.
`remove_punctuation`	boolean	Whether or not to remove punctuation (not A-Za-z0-9)
`remove_numbers`	string	'replace' replaces all numbers with the normalized token NUM; 'remove' removes all numbers.
`stop_words`	list of strings	Custom list of words to remove.
RETURNS	string	Processed version of the input text.

`quick_train_topic_model(path_to_mallet, output_directory_path, num_topics, training_data)`

Imports training data, trains an LDA topic model using MALLET, and returns the topic keys and document distributions.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`output_directory_path`	string	Path to where the output files should be stored.
`num_topics`	integer	The number of topics to use for training.
`training_data`	list of strings	Processed documents for training the topic model.
RETURNS	list of lists of strings	The 20 most probable words for each topic.
RETURNS	list of lists of integers	Topic distribution (list of probabilities) for each document.

`import_data(path_to_mallet, path_to_training_data, path_to_formatted_training_data, training_data, training_ids=None, use_pipe_from=None)`

Imports the training data into MALLET formatted data that can be used for training.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`path_to_training_data`	string	Path to where the training data should be stored.
`path_to_formatted_training_data`	string	Path to where the MALLET formatted training data should be stored.
`training_data`	list of strings	Processed documents for training the topic model.
`training_ids`	list of strings	Unique identifiers for the training data.
`use_pipe_from`	string	If you want to import the documents using the same model as a previous set of documents, include the path to the previous MALLET formatted training data.

`train_topic_model(path_to_mallet, path_to_formatted_training_data, path_to_model, path_to_topic_key, path_to_topic_distributions, path_to_word_weights, path_to_diagnostics, num_topics)`

Trains an LDA topic model using MALLET.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`path_to_formatted_training_data`	string	Path to where the MALLET formatted training data is stored.
`path_to_model`	string	Path to where the model should be stored.
`path_to_topic_key`	string	Path to where the topic keys should be stored.
`path_to_topic_distributions`	string	Path to where the topic distributions should be stored.
`path_to_word_weights`	string	Path to where the word weights should be stored.
`path_to_diagnostics`	string	Path to where the XML diagnostics file should be stored.
`num_topics`	integer	The number of topics to use for training.

`load_topic_keys(topic_keys_path)`

Loads the most sets of most probable words for each topic after training a topic model.

Name	Type	Description
`topic_keys_path`	string	Path to where the topic keys are stored.
RETURNS	list of lists of strings	The 20 most probable words for each topic.

`load_topic_distributions(topic_distributions_path)`

Loads the topic distribution for each document after training a topic model.

Name	Type	Description
`topic_distributions_path`	string	Path to where the topic distributions are stored.
RETURNS	list of lists of integers	Topic distribution (list of probabilities) for each document.

`load_training_ids(topic_distributions_path)`

Loads the training IDs. This is either a list of sequential integers or the user-specified training IDs passed to import_data().

Name	Type	Description
`topic_distributions_path`	string	Path to where the topic distributions are stored.
RETURNS	list of lists of strings	List of training IDs in the same order as the topic distributions.

`load_topic_word_distributions(word_weight_path)`

Loads the topic word distributions. These are the probabilities for each word for each topic.

Name	Type	Description
`word_weight_path`	string	Path to where the word weights are stored.
RETURNS	defaultdict of defaultdict of float	Map of topics to words to probabilities.

`get_top_docs(training_data, topic_distributions, topic_index, n=5)`

Gets the documents with the highest probability for the target topic.

Name	Type	Description
`training_data`	list of strings	Processed documents that was used to train the topic model.
`topic_distributions`	list of lists of integers	Topic distribution (list of probabilities) for each document.
`topic_index`	integer	The index of the target topic.
`n`	integer	The number of documents to return.
RETURNS	list of tuples (float, string)	The topic probability and document text for the n documents with the highest probability for the target topic.

`plot_categories_by_topics_heatmap(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

If the dataset includes some time of categorical labels, creates a heatmap of the labels x topics.

Name	Type	Description
`labels`	list of strings	Document labels (e.g., authors of the documents, genres of the documents).
`topic_distributions`	list of lists of integers	Topic distribution (list of probabilities) for each document.
`topic_keys`	list of lists of strings	The 20 most probable words for each topic.
`output_path`	string	Path to where the resulting figure should be saved.
`target_labels`	list of strings	A subset of `labels` to use for plotting.
`dim`	tuple of integers	(x, y) dimensions for the resulting figure.

`plot_categories_by_topic_boxplots(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

If the dataset includes some time of categorical labels, creates a set of boxplots, one plot for each topic.

Name	Type	Description
`labels`	list of strings	Document labels (e.g., authors of the documents, genres of the documents).
`topic_distributions`	list of lists of integers	Topic distribution (list of probabilities) for each document.
`topic_keys`	list of lists of strings	The 20 most probable words for each topic.
`output_path`	string	Path to where the resulting figure should be saved.
`target_labels`	list of strings	A subset of `labels` to use for plotting.
`dim`	tuple of integers	(x, y) dimensions for the resulting figure.

`divide_training_data(documents, num_chunks=10)`

Given a dataset, divides each document into a set of equally sized chunks.

Name	Type	Description
`documents`	list of strings	Documents to split.
`num_chunks`	integer	How many times to split each document.
RETURNS	tuple (list of strings, list of integers, list of floats)	The divided documents, the indices of the input documents, and the positions within the documents (0-1.0).

`infer_topics(path_to_mallet, path_to_original_model, path_to_new_formatted_training_data, path_to_new_topic_distributions)`

Get topic distributions for a set of new documents using a model that has been trained on another set of documents.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`path_to_original_model`	string	Path to where the topic model was stored.
`path_to_new_formatted_training_data`	string	Path to where the MALLET formatted training data is stored.
`path_to_new_topic_distributions`	string	Path to where the topic distributions should be stored.

`plot_topics_over_time(topic_distributions, topic_keys, times, topic_index, output_path=None)`

Creates lineplots, one for each topic, showing the mean topic probability over document segments.

Name	Type	Description
`topic_distributions`	list of lists of integers	Topic distribution (list of probabilities) for each document.
`topic_keys`	list of lists of strings	The 20 most probable words for each topic.
`times`	list of floats	The division indices within the document.
`topic_index`	integer	The index of the target topic.
`output_path`	string	Path to where the resulting figure should be saved.

`get_js_divergence_documents(document_index_1, document_index_2, topic_distributions)`

Calculates the Jensen-Shannon divergence between the two target topic distributions.

Name	Type	Description
`document_index_1`	integer	Index of the first target document distribution.
`document_index_2`	integer	Index of the second target document distribution.
`topic_distributions`	list of lists of integers	Topic distribution (list of probabilities) for each document.
RETURNS	float	Jensen-Shannon divergence of the requested topic distributions.

`get_js_divergence_topics(topic_index_1, topic_index_2, topic_word_probability_dict)`

Calculates the Jensen-Shannon divergence between the two target topic distributions.

Name	Type	Description
`topic_index_1`	integer	Index of the first target topic distribution.
`topic_index_2`	integer	Index of the second target topic distribution.
`topic_word_probability_dict`	defaultdict of defaultdict of float	Map of topics to words to probabilities.
RETURNS	float	Jensen-Shannon divergence of the requested topic distributions.

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
little_mallet_wrapper		little_mallet_wrapper
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
demo.ipynb		demo.ipynb
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

little-mallet-wrapper

Installation

Requirements

Usage

Documentation

`print_dataset_stats(training_data)`

`process_string(text, lowercase=True, remove_short_words=True, remove_stop_words=True, remove_punctuation=True, numbers='replace', stop_words=STOPS)`

`quick_train_topic_model(path_to_mallet, output_directory_path, num_topics, training_data)`

`import_data(path_to_mallet, path_to_training_data, path_to_formatted_training_data, training_data, training_ids=None, use_pipe_from=None)`

`train_topic_model(path_to_mallet, path_to_formatted_training_data, path_to_model, path_to_topic_key, path_to_topic_distributions, path_to_word_weights, path_to_diagnostics, num_topics)`

`load_topic_keys(topic_keys_path)`

`load_topic_distributions(topic_distributions_path)`

`load_training_ids(topic_distributions_path)`

`load_topic_word_distributions(word_weight_path)`

`get_top_docs(training_data, topic_distributions, topic_index, n=5)`

`plot_categories_by_topics_heatmap(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

`plot_categories_by_topic_boxplots(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

`divide_training_data(documents, num_chunks=10)`

`infer_topics(path_to_mallet, path_to_original_model, path_to_new_formatted_training_data, path_to_new_topic_distributions)`

`plot_topics_over_time(topic_distributions, topic_keys, times, topic_index, output_path=None)`

`get_js_divergence_documents(document_index_1, document_index_2, topic_distributions)`

`get_js_divergence_topics(topic_index_1, topic_index_2, topic_word_probability_dict)`

About

Releases

Packages

Languages

License

rbbby/little-mallet-wrapper

Folders and files

Latest commit

History

Repository files navigation

little-mallet-wrapper

Installation

Requirements

Usage

Documentation

print_dataset_stats(training_data)

process_string(text, lowercase=True, remove_short_words=True, remove_stop_words=True, remove_punctuation=True, numbers='replace', stop_words=STOPS)

quick_train_topic_model(path_to_mallet, output_directory_path, num_topics, training_data)

import_data(path_to_mallet, path_to_training_data, path_to_formatted_training_data, training_data, training_ids=None, use_pipe_from=None)

train_topic_model(path_to_mallet, path_to_formatted_training_data, path_to_model, path_to_topic_key, path_to_topic_distributions, path_to_word_weights, path_to_diagnostics, num_topics)

load_topic_keys(topic_keys_path)

load_topic_distributions(topic_distributions_path)

load_training_ids(topic_distributions_path)

load_topic_word_distributions(word_weight_path)

get_top_docs(training_data, topic_distributions, topic_index, n=5)

plot_categories_by_topics_heatmap(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)

plot_categories_by_topic_boxplots(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)

divide_training_data(documents, num_chunks=10)

infer_topics(path_to_mallet, path_to_original_model, path_to_new_formatted_training_data, path_to_new_topic_distributions)

plot_topics_over_time(topic_distributions, topic_keys, times, topic_index, output_path=None)

get_js_divergence_documents(document_index_1, document_index_2, topic_distributions)

get_js_divergence_topics(topic_index_1, topic_index_2, topic_word_probability_dict)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

`print_dataset_stats(training_data)`

`process_string(text, lowercase=True, remove_short_words=True, remove_stop_words=True, remove_punctuation=True, numbers='replace', stop_words=STOPS)`

`quick_train_topic_model(path_to_mallet, output_directory_path, num_topics, training_data)`

`import_data(path_to_mallet, path_to_training_data, path_to_formatted_training_data, training_data, training_ids=None, use_pipe_from=None)`

`train_topic_model(path_to_mallet, path_to_formatted_training_data, path_to_model, path_to_topic_key, path_to_topic_distributions, path_to_word_weights, path_to_diagnostics, num_topics)`

`load_topic_keys(topic_keys_path)`

`load_topic_distributions(topic_distributions_path)`

`load_training_ids(topic_distributions_path)`

`load_topic_word_distributions(word_weight_path)`

`get_top_docs(training_data, topic_distributions, topic_index, n=5)`

`plot_categories_by_topics_heatmap(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

`plot_categories_by_topic_boxplots(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

`divide_training_data(documents, num_chunks=10)`

`infer_topics(path_to_mallet, path_to_original_model, path_to_new_formatted_training_data, path_to_new_topic_distributions)`

`plot_topics_over_time(topic_distributions, topic_keys, times, topic_index, output_path=None)`

`get_js_divergence_documents(document_index_1, document_index_2, topic_distributions)`

`get_js_divergence_topics(topic_index_1, topic_index_2, topic_word_probability_dict)`

Packages