Build process

Rob Speer edited this page Aug 21, 2015 · 25 revisions

This page is up-to-date for ConceptNet 5.4.

Requirements

To do a complete build of ConceptNet, you will need:

  • Python 3.3 or later
    • While the ConceptNet code will run on Python 2.7, we don't recommend using it to build ConceptNet. ConceptNet makes very heavy use of Unicode, which is both more optimized and more correct in Python 3. On Python 2.7, you'll get slightly different data, and the build will run more slowly.
  • Standard UNIX command-line tools, including curl, rsync, sed, sort, uniq, and cut
  • The Ninja build system. On Ubuntu, you can get it with apt-get install ninja-build.
  • 50 GB of free disk space
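
Before starting a long build, it can help to check these requirements from Python. The following is only a sketch, not part of ConceptNet: the function name check_requirements is made up for illustration, and it assumes that the Ninja executable is called ninja on your PATH.

```python
import shutil
import sys

# Tool names come from the requirements list above. "ninja" is an
# assumption about what the Ninja executable is called on your system.
REQUIRED_TOOLS = ["curl", "rsync", "sed", "sort", "uniq", "cut", "ninja"]
MIN_PYTHON = (3, 3)
MIN_FREE_GB = 50


def check_requirements(build_dir=".", tools=REQUIRED_TOOLS):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d or later is required" % MIN_PYTHON)
    for tool in tools:
        # shutil.which returns None when the command is not on the PATH.
        if shutil.which(tool) is None:
            problems.append("missing command-line tool: " + tool)
    # Free space on the disk that holds build_dir, in decimal gigabytes.
    free_gb = shutil.disk_usage(build_dir).free / 1e9
    if free_gb < MIN_FREE_GB:
        problems.append("only %.1f GB free in %s" % (free_gb, build_dir))
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Run it from the directory where you plan to build, so that the disk space check looks at the right disk.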

If you are going to run the complete build including the build_assoc step, you also need:

  • The numpy and scipy libraries
  • The Python package named assoc-space
  • At least 10 GB of available RAM

We build ConceptNet on the GNU versions of the UNIX command-line tools. Most of them have a separate BSD version for non-GNU systems, such as Mac OS. We don't frequently build on the BSD versions, but we're not trying to do anything non-portable, so let us know about any bugs or discrepancies.

We don't recommend trying to build ConceptNet on Windows, but if you accomplish it, let us know what you did.

Setting up your environment

Installing Ninja

ConceptNet uses the Ninja build system. Get it from https://martine.github.io/ninja/, or run this on Ubuntu:

sudo apt-get install ninja-build

Making a Python 3 environment

We recommend building ConceptNet in a Python virtualenv, so that its dependencies are easy to keep track of. Fortunately, this is easy in Python 3.4.

Make a virtualenv if you don't have one already:

pyvenv-3.4 conceptnet-env

On Ubuntu, pyvenv is broken by default, so let's fall back on installing virtualenv instead:

# alternate version that works on Ubuntu
sudo apt-get install python3-virtualenv
# -p python3 makes sure the environment uses Python 3, not the system default
virtualenv -p python3 conceptnet-env

Then activate the environment, which will make python refer to this Python 3.4 environment:

source conceptnet-env/bin/activate
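
To confirm that the activation worked, you can ask Python itself. This is a small sketch, not something ConceptNet provides; the helper name in_virtualenv is made up. It relies on sys.base_prefix, which pyvenv environments set, and on sys.real_prefix, which older virtualenv releases set.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter is running inside a virtual environment."""
    # Older virtualenv releases record the original interpreter in
    # sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # pyvenv (and newer virtualenv) make sys.prefix differ from
    # sys.base_prefix inside the environment.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("virtual environment active:", in_virtualenv())
```

If the version printed is 2.x, or no environment is active, repeat the activation step before installing anything.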

Installing the ConceptNet code

  • Get the ConceptNet Python code from PyPI or GitHub.
    • Remember to do this on a disk where you'll have 50 GB available.
  • Install it in "development mode" in your virtual environment. Assuming you've extracted it into a directory called conceptnet5:

cd conceptnet5
python setup.py develop

The build process

Setting up

Run python ninja.py to create a Ninja file that describes the process of building ConceptNet.

You usually won't need to re-run ninja.py, but there's one case where it may be convenient: if you have already downloaded the raw data into your data/raw directory, you can run ninja.py again to get a version of the file that assumes your downloaded data is correct and doesn't try to re-download it.

One step of the build process depends on a package that uses NumPy and SciPy, some powerful Python libraries which are often hard to install if you don't have them already. Here's what you'd need to run on Ubuntu:

sudo apt-get build-dep python-scipy
pip install assoc-space

See "Building the vector space" below for why you might or might not want this step. If you don't want to deal with this part, you can take it out of the build. Edit your build.ninja file and remove the lines that look like these:

# build vector space
build data/assoc/assoc-space-5.4/u.npy data/assoc/assoc-space-5.4/sigma.npy data/assoc/assoc-space-5.4/assoc.npy data/assoc/assoc-space-5.4/labels.txt: build_assoc data/assoc/reduced.csv

# compress data/dist/DATE/conceptnet5_vector_space_5.4.tar.bz2
build data/dist/DATE/conceptnet5_vector_space_5.4.tar.bz2: compress_tar data/assoc/assoc-space-5.4/u.npy data/assoc/assoc-space-5.4/sigma.npy data/assoc/assoc-space-5.4/assoc.npy data/assoc/assoc-space-5.4/labels.txt
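
If you'd rather not edit build.ninja by hand, the same removal can be sketched in Python. This is an illustration only, and it makes two assumptions: that the rules in your build.ninja are separated by blank lines, as in the excerpt above, and that the paths of the vector space outputs contain data/assoc/assoc-space-. The function name drop_vector_space_rules is made up.

```python
def drop_vector_space_rules(ninja_text, marker="data/assoc/assoc-space-"):
    """Remove blank-line-separated blocks whose build line mentions marker.

    Each block is a comment line plus a build statement, as in the
    excerpt above. Blocks without a matching build line are kept as-is.
    """
    kept = []
    for block in ninja_text.split("\n\n"):
        build_lines = [
            line for line in block.splitlines() if line.startswith("build ")
        ]
        if any(marker in line for line in build_lines):
            continue  # this block builds or compresses the vector space
        kept.append(block)
    return "\n\n".join(kept)


if __name__ == "__main__":
    # A shortened, made-up sample in the shape of the excerpt above.
    sample = (
        "# build vector space\n"
        "build data/assoc/assoc-space-5.4/u.npy: build_assoc data/assoc/reduced.csv\n"
        "\n"
        "# some other step (hypothetical)\n"
        "build data/stats/relations.txt: count data/assertions/part_00.csv\n"
    )
    print(drop_vector_space_rules(sample))
```

To apply it, read your build.ninja into a string, pass it through the function, and write the result back, keeping a copy of the original file in case another rule turns out to depend on the removed outputs.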

Running the build

To determine what needs to be built and run all the steps, type:

ninja -v

This will download the data if necessary, and run the build steps in parallel, using an appropriate number of cores for your machine. On my 4-core i5-2500K, ConceptNet takes a little under a day to build.

What you get

Here are some useful outputs of the build process:

  • assertions/*.jsons: JSON streams containing the data for all the assertions in ConceptNet.
  • assertions/*.msgpack: The same data as the JSON stream files, but in the more efficient (and less readable) msgpack format.
  • assertions/*.csv: The same assertions in tabular text format. (The extension '.csv' originally stood for Comma-Separated Values, but it is frequently used for tab-separated values as well, which is what these files contain.)
  • edges/: The individual edges that these assertions were built from.
  • stats/: Some text files that count the distribution of different languages, relations, and datasets in the built data.
  • sw_map/: Files in N-Triples format that connect ConceptNet URIs to equivalent Semantic Web resources.
  • assoc/reduced.csv: A tabular text file of just the concept-to-concept associations (plus additional 'negated concept' nodes that represent negative relations), filtered for concepts that are referred to frequently enough.
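
As a rough illustration of working with the JSON stream output, here is a sketch that counts how often each relation appears. It assumes that a .jsons file holds one JSON object per line, and that each object has a "rel" field; the sample records and the helper names are made up for this example, so check them against your own build output.

```python
import io
import json
from collections import Counter


def read_json_stream(lines):
    """Yield one decoded object per non-blank line of a JSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_values(records, key):
    """Count how often each value of `key` occurs across the records."""
    return Counter(record.get(key) for record in records)


# Made-up sample data standing in for a real assertions/*.jsons file.
SAMPLE = io.StringIO(
    '{"rel": "/r/IsA", "start": "/c/en/cat", "end": "/c/en/animal"}\n'
    '{"rel": "/r/IsA", "start": "/c/en/dog", "end": "/c/en/animal"}\n'
    '{"rel": "/r/UsedFor", "start": "/c/en/pen", "end": "/c/en/write"}\n'
)

counts = count_values(read_json_stream(SAMPLE), "rel")
print(counts.most_common())
```

For a real file, replace SAMPLE with an open file object, for example open(path, encoding="utf-8"). The generator reads one line at a time, so the whole file never has to fit in memory.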

Building the vector space

This step has the most software dependencies, and it's the one step that you might want to remove from the build.

One cool thing you can do with ConceptNet is discover generalized associations between sets of concepts, by representing concepts in a high-dimensional vector space. The Web API uses this, for example, and its documentation page has some examples.

The vector space is represented by a matrix. Its rows are labeled with a filtered subset of the concepts in ConceptNet, and its columns are 150 principal components of knowledge in ConceptNet. Therefore, each concept gets associated with a 150-dimensional vector. Concepts that are more strongly associated with one another will have vectors with a higher dot product.
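
To make the dot-product idea concrete, here is a toy sketch in plain Python. It does not use the assoc-space package or real ConceptNet data: the three-dimensional vectors are invented (the real ones have 150 dimensions), and the function names are made up.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def most_associated(vectors, concept, n=3):
    """Return up to n other concepts, ordered by decreasing dot product."""
    target = vectors[concept]
    scores = [
        (dot(target, vec), other)
        for other, vec in vectors.items()
        if other != concept
    ]
    scores.sort(reverse=True)
    return [other for _score, other in scores[:n]]


# Invented toy vectors; real rows of the matrix have 150 components.
TOY_VECTORS = {
    "/c/en/cat": [0.9, 0.1, 0.0],
    "/c/en/dog": [0.8, 0.2, 0.1],
    "/c/en/car": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(most_associated(TOY_VECTORS, "/c/en/cat"))
```

With these toy numbers, the cat vector has a much higher dot product with the dog vector than with the car vector, so dog is ranked first.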

There's a separate package called LuminosoInsight/assoc-space that builds this matrix from ConceptNet data.

This package relies on NumPy and SciPy; your Python environment has to already have them installed, or have the tools necessary to compile them (such as development libraries for Python, BLAS linear algebra libraries, and C and Fortran compilers). As long as NumPy and SciPy are working, you can install the package with pip install assoc-space, as shown under "Setting up" above, and the build_assoc step will run as part of the build.

The result will be a directory called assoc-space-5.4 that can be loaded by the assoc_space package or by the Web API.
