Build process

Rob Speer edited this page Mar 23, 2017 · 25 revisions

We strongly recommend following the Docker process to make the entire ConceptNet build reproducible, with all its dependencies built in.

If you want to go it alone without Docker, here's what you need:

  • A Unix system with command-line tools like sort and grep
  • Python 3.4 or later, with development headers (python3-dev)
  • Familiarity with managing a Python environment (for example, using virtualenv)
  • PostgreSQL 9.5 or later, and the ability to create databases
    • Set up the permissions so that you can run "createdb conceptnet5" as your current user, without sudo.
  • Git
  • 240 GB of free disk space
  • At least 14 GB of available RAM
  • The time and bandwidth to download 12 GB of raw data
  • The numpy and scipy libraries
  • The libhdf5-dev library for reading and writing HDF5 tables
  • The MeCab development library (libmecab-dev on Ubuntu) for tokenizing Japanese, and its dictionary, mecab-ipadic-utf8
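
Some of the requirements above can be checked automatically before starting a long build. The following is a minimal sketch, not part of ConceptNet: the function name check_prerequisites and the constants are hypothetical, and the thresholds are simply copied from the list above. It uses only the Python standard library, so it does not check RAM or the installed development libraries.

```python
import shutil
import sys

# Thresholds copied from the requirements list above.
MIN_PYTHON = (3, 4)
MIN_FREE_DISK_GB = 240
REQUIRED_TOOLS = ["sort", "grep", "git", "createdb"]


def check_prerequisites(path=".", tools=REQUIRED_TOOLS):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d or later is required" % MIN_PYTHON)

    # shutil.which returns None when a command is not on the PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append("missing command-line tool: %s" % tool)

    # Free space on the filesystem that contains `path`.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(
            "only %.0f GB free at %r; the build needs about %d GB"
            % (free_gb, path, MIN_FREE_DISK_GB)
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it from the directory (or drive) where you plan to keep the data, since the disk space check applies to the filesystem containing the given path.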

Step-by-step setup

Check out the source code of ConceptNet from Git:

git clone git@github.com:commonsense/conceptnet5
cd conceptnet5

Make sure that the development libraries that ConceptNet needs are available. For example, on Ubuntu:

sudo apt install build-essential python3-pip python3-dev libhdf5-dev libmecab-dev mecab-ipadic-utf8

Install PostgreSQL 9.5 or later, and make sure your user account has the ability to create databases. (This is outside the scope of this tutorial. See How to install and use PostgreSQL on Ubuntu.)

You will need to be able to access PostgreSQL via a Python library, not just via the command line, so this probably involves setting up password authentication for your user.

Your PostgreSQL user account has to be able to access the database by connecting to a local network address, not just through the "Unix domain socket" that the psql command uses. Either set a password on your PostgreSQL account and store it in the CONCEPTNET_DB_PASSWORD environment variable, or follow a guide such as this one to allow local connections without a password:

https://gist.github.com/p1nox/4953113
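
To confirm that a Python library can reach the database over a local address, a quick test along these lines may help. This is a sketch under assumptions: it uses psycopg2 as the client library (this page does not say which library the build uses), the database is assumed to be the conceptnet5 database created in the next step, and the helper names connection_params and check_connection are made up for illustration.

```python
import os


def connection_params(dbname="conceptnet5", host="localhost"):
    """Build keyword arguments for a TCP connection to the local PostgreSQL server."""
    # Giving a host makes the client connect over TCP instead of the
    # Unix domain socket that psql uses by default.
    params = {"dbname": dbname, "host": host}
    password = os.environ.get("CONCEPTNET_DB_PASSWORD")
    if password:
        params["password"] = password
    return params


def check_connection():
    """Connect, run a trivial query, and return True; raises an exception on failure."""
    import psycopg2  # assumption: installed separately, e.g. with pip

    conn = psycopg2.connect(**connection_params())
    try:
        cur = conn.cursor()
        cur.execute("SELECT 1")
        return cur.fetchone()[0] == 1
    finally:
        conn.close()
```

If check_connection() raises an authentication error while psql works from the command line, the account most likely still depends on socket-based authentication.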

Create a PostgreSQL database named conceptnet5 that you have the ability to write to:

createdb conceptnet5

Create a data directory within conceptnet5 that will contain ConceptNet's data. If necessary, make it a symbolic link to a hard drive with more space on it.

mkdir data
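
If the data needs to live on another drive, the symbolic link variant can be scripted. This is a small sketch using only the standard library; make_data_dir is a hypothetical helper, and the target path in the usage comment is a placeholder, not a location ConceptNet expects.

```python
import os


def make_data_dir(link_path="data", target=None):
    """Create `link_path` as a directory, or as a symlink to `target` if one is given."""
    if os.path.lexists(link_path):
        # Already set up (directory or link); leave it alone.
        return os.path.realpath(link_path)
    if target is None:
        os.mkdir(link_path)
    else:
        os.makedirs(target, exist_ok=True)
        os.symlink(os.path.abspath(target), link_path)
    return os.path.realpath(link_path)


# Example (placeholder path): make_data_dir("data", "/mnt/bigdisk/conceptnet-data")
```

The function returns the real location of the directory, which is the filesystem that needs the free disk space.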

Install ConceptNet as a Python package in your environment, including the optional "vectors" dependencies:

pip install -e '.[vectors]'

Finally, run the Snakemake build:

./scripts/build.sh

What you get

Here are some useful outputs of the build process:

  • The conceptnet5 PostgreSQL database, containing an index of all the edges
  • assertions/assertions.csv: A CSV file of all the assertions in ConceptNet
  • assertions/assertions.msgpack: The same data in the more efficient (and less readable) msgpack format
  • edges/: The edges from individual sources that these assertions were built from.
  • stats/: Some text files that count the distribution of different languages, relations, and datasets in the built data.
  • assoc/reduced.csv: A tabular text file of just the concept-to-concept associations (plus additional 'negated concept' nodes that represent negative relations), filtered for concepts that are referred to frequently enough
  • vectors/mini.h5: A vector space of high-quality word embeddings built from an ensemble of ConceptNet, word2vec, and GloVe, stored as a Pandas data frame in HDF5 format
  • vectors/plain/conceptnet-numberbatch_uris_main.txt.gz: The complete word embedding data in a text format similar to word2vec's
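
As an illustration of using the text-format embeddings, here is a sketch of a reader that needs only the standard library. It assumes the usual word2vec text layout: a header line with the number of rows and dimensions, then one line per term with the label followed by space-separated numbers. The function name read_text_vectors is hypothetical, and the exact layout of the built file should be checked before relying on this.

```python
import gzip


def read_text_vectors(path, limit=None):
    """Read a gzipped word2vec-style text file into a dict of label -> list of floats."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as infile:
        # Assumption: the first line is a header of the form "<rows> <dimensions>".
        n_rows, n_dims = (int(x) for x in infile.readline().split())
        for line in infile:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            label, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != n_dims:
                raise ValueError("unexpected row length for %r" % label)
            vectors[label] = values
            if limit is not None and len(vectors) >= limit:
                break
    return vectors
```

Passing a limit, such as read_text_vectors(path, limit=1000), is a cheap way to inspect the file without loading every vector into memory. Adjust the path to wherever the build wrote its outputs.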