Build process

rspeer edited this page Sep 23, 2014 · 25 revisions

Note: This page is written for ConceptNet 5.2. In ConceptNet 5.3, the process is a bit different, and is outlined in "Running your own copy".

The main changes that haven't been worked into these directions are:

  • The Makefile is at the top level, not in the data/ directory
  • NLTK 3 is no longer in alpha
  • You need to build APSW before you build ConceptNet

Requirements

To do a complete build of ConceptNet, you will need:

  • Python 3.3 or later
    • While the ConceptNet code will run on Python 2.7, we don't recommend using it to build ConceptNet. ConceptNet makes very heavy use of Unicode, which is both more optimized and more correct in Python 3. On Python 2.7, you'll get slightly different data, and you'll get it slower.
  • NLTK 3.0a or later (see "Setting up your environment" below)
  • Standard UNIX command-line tools, including make, curl, rsync, sed, sort, uniq, and cut
  • 40 GB of free disk space
  • At least 10 GB of available RAM, if you are going to run the build_assoc step
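
Most of the requirements above can be checked before starting. The following is a minimal sketch, not part of ConceptNet: the function name `check_prerequisites` and the constants are assumptions made for illustration, and available RAM is not checked because the standard library has no portable way to measure it.

```python
import shutil
import sys

# Values taken from the requirements list above.
REQUIRED_TOOLS = ["make", "curl", "rsync", "sed", "sort", "uniq", "cut"]
MIN_PYTHON = (3, 3)
MIN_FREE_GB = 40


def check_prerequisites(path="."):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python %d.%d or later is required" % MIN_PYTHON)
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the tool is not on the PATH.
        if shutil.which(tool) is None:
            problems.append("missing command-line tool: " + tool)
    # Free space on the disk that holds `path`, in gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < MIN_FREE_GB:
        problems.append(
            "only %.1f GB free at %s; %d GB needed" % (free_gb, path, MIN_FREE_GB)
        )
    return problems
```

Calling `check_prerequisites("data")` from the top of the source tree would report on the disk where the build will actually run.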

We build ConceptNet on the GNU versions of the command-line tools. Most of them have a separate BSD version for non-GNU systems, such as Mac OS. We don't frequently build on the BSD versions, but we're not trying to do anything non-portable, so let us know about any bugs or discrepancies.

We don't recommend trying to build ConceptNet on Windows, but if you accomplish it, let us know what you did.

Setting up your environment

Making a Python 3 environment

We recommend building ConceptNet in a Python virtualenv, so that its dependencies are easy to keep track of. Fortunately, this is easy in Python 3.4.

  • Make a virtualenv if you don't have one already:
pyvenv-3.4 conceptnet-env
  • Activate the environment, which will make python refer to this Python 3.4 environment:
source conceptnet-env/bin/activate
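
If you are unsure whether the environment is active, you can ask Python itself. This is a small sketch with a hypothetical helper name, `in_virtualenv`; it only recognizes environments made by pyvenv or the venv module, which is what the steps above create.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the underlying Python installation.
    return sys.prefix != sys.base_prefix


print("virtualenv active:", in_virtualenv())
```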

Installing NLTK 3

NLTK 3 is still in alpha and hasn't been released on the Python Package Index, but it works well enough for building ConceptNet.

  • Download the NLTK 3 source tarball and extract it:
tar xvf nltk-3.0*.tar.gz
  • With your virtual environment activated, enter that directory and install NLTK:
cd nltk-3.0*
python setup.py install
cd ..

Installing the ConceptNet code

  • Get the ConceptNet Python code from PyPI or GitHub.
  • Install it in "development mode" in your virtual environment. Assuming you've extracted it into a directory called conceptnet5:
cd conceptnet5
python setup.py develop

The build process

Setting up

Everything from here on is going to happen inside the data/ directory.

cd data

If you want to be running this on some other disk, now is the time to set that up. For example, I do most of my development on an SSD, but building ConceptNet on an SSD would use up some expensive disk space and also put wear on the drive. So the first thing I do is move the entire data/ directory onto an external, traditional hard disk.

We're going to build ConceptNet by running commands from the Makefile in this directory.

Although Makefiles are traditionally used for C code, they don't require any particular programming language. This Makefile manages the steps of building ConceptNet, via Python code and shell scripts.

Make has two advantages here: it keeps track of dependencies, so you can skip rebuilding data files that don't need to be rebuilt, and it can run independent build steps in parallel.
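
To make the dependency tracking concrete, here is a sketch of the basic rule Make applies to each target: rebuild it if it doesn't exist, or if any file it depends on has been modified more recently. The function name `needs_rebuild` is made up for this illustration, and real Make also handles things this sketch ignores, such as chains of dependencies and phony targets.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any of its prerequisites."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A prerequisite modified after the target means the target is stale.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

This is why touching one raw input file causes only the outputs derived from it to be rebuilt.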

The first thing you will need is the raw data from the resources that ConceptNet is built from. This will be over 9 gigabytes of data in total.

make download

The main build step

Now you're ready to run the main build step. Simply run:

make -j8

The -j8 option tells it to run 8 processes in parallel. A higher or lower number might be better for your machine.
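
If you'd rather not guess the number, one common starting point is one process per CPU core. The sketch below builds the command that way; the helper name `make_command` is hypothetical, the one-per-core default is only an assumption about what suits your machine, and `os.cpu_count` requires Python 3.4.

```python
import os


def make_command(jobs=None):
    """Build the argument list for a parallel `make` run."""
    if jobs is None:
        # os.cpu_count() can return None when the core count is unknown.
        jobs = os.cpu_count() or 1
    return ["make", "-j%d" % jobs]


print(" ".join(make_command()))
```

You could pass the resulting list to `subprocess.check_call` from inside the data/ directory, or just read off the number and type the command yourself.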

This parallelism is really valuable. It's like "map-reduce" but it's been possible for decades, and it doesn't require networked servers. It cuts down the build time from about a day to a few hours.

What you get

Here are some useful outputs of the build process:

  • assertions/*.jsons: JSON streams containing the data for all the assertions in ConceptNet.
  • assertions/*.csv: The same assertions in tabular text format. (The extension '.csv' originally stood for Comma-Separated Values, but it is frequently used for tab-separated values as well, which is what these files contain.)
  • edges/: The individual edges that these assertions were built from.
  • stats/: Some text files that count the distribution of different languages, relations, and datasets in the built data.
  • sw_map/: Files in N-Triples format that connect ConceptNet URIs to equivalent Semantic Web resources.
  • assoc/all.csv: A tabular text file of just the concept-to-concept associations (plus additional 'negated concept' nodes that represent negative relations).
  • solr/*.json: All the assertions as documents that can be loaded into the Solr search engine, for quick lookup.
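
Both assertion formats can be read with the Python standard library. The sketch below assumes that a "JSON stream" file holds one JSON object per line, and it deliberately makes no assumption about which fields or columns the files contain; the function names are invented for this example.

```python
import csv
import json


def read_json_stream(path):
    """Yield one decoded JSON value per non-empty line (assumed layout of *.jsons)."""
    with open(path, encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line:
                yield json.loads(line)


def read_tab_separated(path):
    """Yield each row of a tab-separated file as a list of strings."""
    with open(path, encoding="utf-8", newline="") as stream:
        # QUOTE_NONE keeps literal quote characters in the data intact.
        for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
            yield row
```

Both functions are generators, so they process one line at a time rather than loading a multi-gigabyte file into memory.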

Optional: Building the assoc-space

This step is complex and computationally expensive, so it's optional.

One cool thing you can do with ConceptNet is discover generalized associations between sets of concepts, by representing concepts in a high-dimensional vector space. This is used in the API, for example, and the API documentation page has some examples.

The vector space is represented by a matrix. Its rows are labeled with a filtered subset of the concepts in ConceptNet, and its columns are 150 principal components of knowledge in ConceptNet. Therefore, each concept gets associated with a 150-dimensional vector. Concepts that are more strongly associated with one another will have vectors with a higher dot product.
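
Here is a toy illustration of the dot-product comparison. The three concepts and their 3-dimensional vectors are made up for this example (the real matrix has 150 columns and its values come from the build), and `most_associated` is a hypothetical helper, not part of assoc-space.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Made-up vectors standing in for rows of the real matrix.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "spreadsheet": [0.0, 0.1, 0.9],
}


def most_associated(term, vectors):
    """Return the other concept whose vector has the highest dot product with `term`."""
    query = vectors[term]
    scored = [(dot(query, vec), other) for other, vec in vectors.items() if other != term]
    return max(scored)[1]


print(most_associated("cat", vectors))  # prints "dog"
```

With these numbers, "cat" and "dog" have a dot product of 0.74, while "cat" and "spreadsheet" have 0.01, so "dog" is the stronger association.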

There's a separate package called LuminosoInsight/assoc-space that builds this matrix from ConceptNet data.

This package relies on Numpy and Scipy; your Python environment has to already have them installed, or have the tools necessary to compile them (such as development libraries for Python, BLAS linear algebra libraries, and C and Fortran compilers). As long as Numpy and Scipy are working, you can run this:

pip install assoc-space
make build_assoc

The result will be a directory called assoc-space-5.2 that can be loaded by the assoc_space package or by the Web API.
