Build process
We strongly recommend following the Docker process to make the entire ConceptNet build reproducible, with all its dependencies built in.
If you want to go it alone without Docker, here's what you need:
- A Unix system with command-line tools like sort and grep
- Python 3.4 or later, with development headers (python3-dev)
- Familiarity with managing a Python environment (for example, using virtualenv)
- PostgreSQL 9.5 or later and the ability to create databases
  - Set up the permissions so that you can run "createdb conceptnet5" as your current user, without sudo.
- Git
- 240 GB of free disk space
- At least 14 GB of available RAM
- The time and bandwidth to download 12 GB of raw data
- The numpy and scipy libraries
- The libhdf5-dev library for reading and writing HDF5 tables
- The mecab-dev library for tokenizing Japanese, and its dictionary, mecab-ipadic-utf8
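Some of these requirements can be checked before starting a long build. The following is a minimal sketch, not part of ConceptNet: the function name check_prerequisites and the choice of tools to look for are assumptions made for illustration. It checks the Python version, the presence of a few commands on the PATH, and free disk space. It does not check RAM, because the standard library has no portable way to do that.

```python
import shutil
import sys

# Thresholds taken from the requirements list above.
MIN_PYTHON = (3, 4)
MIN_FREE_DISK_GB = 240
# Hypothetical selection of commands to look for; adjust to your system.
REQUIRED_TOOLS = ["sort", "grep", "git", "createdb"]


def check_prerequisites(path=".", tools=REQUIRED_TOOLS):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d or later is required" % MIN_PYTHON)

    for tool in tools:
        if shutil.which(tool) is None:
            problems.append("command not found on PATH: %s" % tool)

    # Free space on the filesystem that holds `path`, in decimal gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(
            "only %.0f GB free at %r, need %d GB" % (free_gb, path, MIN_FREE_DISK_GB)
        )

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it from the directory where you plan to keep the data, since the disk check applies to the filesystem containing the given path.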
Check out the source code of ConceptNet from Git:
git clone git@github.com:commonsense/conceptnet5
cd conceptnet5
Make sure that the development libraries that ConceptNet needs are available. For example, on Ubuntu:
sudo apt install build-essential python3-pip python3-dev libhdf5-dev libmecab-dev mecab-ipadic-utf8
Install PostgreSQL 9.5 or later, and make sure your user account has the ability to create databases. (This is outside the scope of this tutorial. See How to install and use PostgreSQL on Ubuntu.)
You will need to be able to access PostgreSQL via a Python library, not just via the command line, so this probably involves setting up password authentication for your user.
Your PostgreSQL user account has to be able to access the database by connecting to a local address, not just using the "Unix domain socket" that the psql command uses. You'll either need to set a password on your PostgreSQL account and store that in the CONCEPTNET_DB_PASSWORD environment variable, or follow a guide such as this one to not require a password when connecting locally:
https://gist.github.com/p1nox/4953113
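To confirm that a Python library can reach the database over a local address, something like the following sketch may help. It is not ConceptNet's own connection code: the helper names are hypothetical, psycopg2 is assumed as the driver, and "localhost" is assumed as the address. Only the CONCEPTNET_DB_PASSWORD variable comes from the text above.

```python
import os


def connection_params(dbname="conceptnet5", environ=os.environ):
    """Build keyword arguments for a TCP connection to a local PostgreSQL server."""
    params = {
        "dbname": dbname,
        # Giving a hostname makes the client connect over TCP
        # instead of the Unix domain socket. "localhost" is an assumption.
        "host": "localhost",
    }
    password = environ.get("CONCEPTNET_DB_PASSWORD")
    if password:
        params["password"] = password
    return params


def check_connection():
    """Connect the way a Python library would and run a trivial query."""
    import psycopg2  # assumed driver; must be installed in your environment

    conn = psycopg2.connect(**connection_params())
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
            return cursor.fetchone() == (1,)
    finally:
        conn.close()
```

If check_connection() raises an authentication error while psql works from the command line, that usually points to the socket-versus-local-address difference described above.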
Create a PostgreSQL database named conceptnet5 that you have the ability to write to:
createdb conceptnet5
Create a data directory within conceptnet5 that will contain ConceptNet's data. If necessary, make it a symbolic link to a hard drive with more space on it.
mkdir data
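For the symbolic-link case, here is a minimal sketch of the same step in Python. The function make_data_dir and the storage path in the usage note are hypothetical, not part of ConceptNet. Without a storage location it behaves like the mkdir command above.

```python
import os


def make_data_dir(repo_dir, storage_dir=None):
    """Create repo_dir/data, optionally as a symlink to a directory on a larger drive."""
    data_path = os.path.join(repo_dir, "data")
    if os.path.lexists(data_path):
        return data_path  # leave an existing directory or link untouched

    if storage_dir is None:
        # Plain directory inside the checkout, same as `mkdir data`.
        os.mkdir(data_path)
    else:
        # Create the real directory elsewhere and link to it from the checkout.
        os.makedirs(storage_dir, exist_ok=True)
        os.symlink(os.path.abspath(storage_dir), data_path)
    return data_path
```

For example, make_data_dir(".", "/mnt/bigdisk/conceptnet-data") run from the conceptnet5 directory would link data to that location, where /mnt/bigdisk is a placeholder for wherever your larger drive is mounted.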
Install ConceptNet as a Python package in your environment, including the optional "vectors" dependencies:
pip install -e '.[vectors]'
Finally, run the Snakemake build:
./scripts/build.sh
Here are some useful outputs of the build process:
- The conceptnet5 PostgreSQL database, containing an index of all the edges
- assertions/assertions.csv: A CSV file of all the assertions in ConceptNet
- assertions/assertions.msgpack: The same data in the more efficient (and less readable) msgpack format
- edges/: The edges from individual sources that these assertions were built from.
- stats/: Some text files that count the distribution of different languages, relations, and datasets in the built data.
- assoc/reduced.csv: A tabular text file of just the concept-to-concept associations (plus additional 'negated concept' nodes that represent negative relations), filtered for concepts that are referred to frequently enough
- vectors/mini.h5: A vector space of high-quality word embeddings built from an ensemble of ConceptNet, word2vec, and GloVe, stored as a Pandas data frame in HDF5 format
- vectors/plain/conceptnet-numberbatch_uris_main.txt.gz: the complete word embedding data in a text format similar to word2vec's
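As an illustration of working with the last of these outputs, here is a minimal sketch of a reader for a gzipped word2vec-style text file. It assumes the usual word2vec text layout: a header line with the row count and the number of dimensions, then one line per entry with a label followed by space-separated numbers. The exact layout of the built file is not stated here, so check it before relying on this. The function name read_text_vectors is hypothetical.

```python
import gzip


def read_text_vectors(path, limit=None):
    """Yield (label, vector) pairs from a gzipped word2vec-style text file."""
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        # Assumed header: "<number of rows> <number of dimensions>"
        header = stream.readline().split()
        n_dims = int(header[1])
        count = 0
        for line_number, line in enumerate(stream, start=2):
            if limit is not None and count >= limit:
                break
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            label, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != n_dims:
                raise ValueError(
                    "line %d has %d values, expected %d"
                    % (line_number, len(values), n_dims)
                )
            count += 1
            yield label, values
```

Because the function is a generator, passing a small limit lets you inspect the first few entries without reading the whole file into memory.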