Build process
Note: This page is written for ConceptNet 5.2. In ConceptNet 5.3, the process is a bit different, and is outlined in Running your own copy.
The main changes that haven't been worked into these directions are:
- The Makefile is at the top level, not in the `data/` directory
- NLTK 3 is no longer in alpha
- You need to build APSW before you build ConceptNet
To do a complete build of ConceptNet, you will need:
- Python 3.3 or later
  - While the ConceptNet code will run on Python 2.7, we don't recommend using it to build ConceptNet. ConceptNet makes very heavy use of Unicode, which is both more optimized and more correct in Python 3. On Python 2.7, you'll get slightly different data, and you'll get it slower.
- NLTK 3.0a or later (see "Setting up your environment" below)
- Standard UNIX command-line tools, including `make`, `curl`, `rsync`, `sed`, `sort`, `uniq`, and `cut`
- 40 GB of free disk space
- At least 10 GB of available RAM, if you are going to run the `build_assoc` step
We build ConceptNet on the GNU versions of the command-line tools. Most of them have a separate BSD version for non-GNU systems, such as Mac OS. We don't frequently build on the BSD versions, but we're not trying to do anything non-portable, so let us know about any bugs or discrepancies.
We don't recommend trying to build ConceptNet on Windows, but if you accomplish it, let us know what you did.
Setting up your environment
We recommend building ConceptNet in a Python virtualenv, so that its dependencies are easy to keep track of. Fortunately, this is easy in Python 3.4.
- Make a virtualenv if you don't have one already:
pyvenv-3.4 conceptnet-env
- Activate the environment, which will make `python` refer to this Python 3.4 environment:
source conceptnet-env/bin/activate
NLTK 3 is still in alpha and hasn't been released on the Python Package Index, but it works well enough for building ConceptNet.
- Download the latest .tar.gz from http://www.nltk.org/nltk3-alpha/.
- Extract it:
tar xvf nltk-3.0*.tar.gz
- With your virtual environment activated, enter that directory and install NLTK:
cd nltk-3.0*
python setup.py install
cd ..
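If you want to confirm that the alpha NLTK is the version your virtualenv now sees, a quick check (with the environment still activated) is:

```python
# Sanity check: the NLTK installed above should be importable from this
# environment and should report a 3.0 alpha version, not 2.x.
import nltk
print(nltk.__version__)
```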
- Get the ConceptNet Python code from PyPI or GitHub.
- Install it in "development mode" in your virtual environment. Assuming you've extracted it into a directory called `conceptnet5`:
cd conceptnet5
python setup.py develop
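A similar quick check confirms that the development install worked; the printed path should point into the checkout you just installed from:

```python
# Sanity check: conceptnet5 should import from the development checkout.
import conceptnet5
print(conceptnet5.__file__)
```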
Everything from here on is going to happen inside the `data/` directory.
cd data
If you want to be running this on some other disk, now is the time to set that up. For example, I do most of my development on an SSD, but building ConceptNet on an SSD would use up some expensive disk space and also put wear on the drive. So the first thing I do is move the entire `data/` directory onto an external, traditional hard disk.
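If you'd like a sketch of one way to do that, move the directory and leave a symlink behind so the relative `data/` paths keep working. Run something like this from the checkout directory (before `cd data`); the target path is only an example:

```python
# Move data/ onto a bigger disk and leave a symlink in its place, so the
# Makefile's relative paths still resolve. Run from the directory that
# contains data/. The target path below is just an example.
import os
import shutil

target = "/mnt/bigdisk/conceptnet-data"   # any directory on the disk you prefer
shutil.move("data", target)
os.symlink(target, "data")
```

An ordinary `mv` followed by `ln -s` accomplishes the same thing.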
We're going to build ConceptNet by running commands from the Makefile in this directory.
Although Makefiles are traditionally used for C code, they don't require any particular programming language. This Makefile manages the steps of building ConceptNet, via Python code and shell scripts.
The advantage of using Make is that it keeps track of dependencies, so that you can skip rebuilding data files that don't need to be rebuilt, and that it can run build steps in parallel.
The first thing you will need is the raw data from the resources that ConceptNet is built from. This will be over 9 gigabytes of data in total.
make download
Now you're ready to run the main build step. Simply run:
make -j8
The `-j8` option tells it to run 8 processes in parallel. A higher or lower number might be better for your machine.
This parallelism is really valuable. It's like "map-reduce" but it's been possible for decades, and it doesn't require networked servers. It cuts down the build time from about a day to a few hours.
Here are some useful outputs of the build process:
- `assertions/*.jsons`: JSON streams containing the data for all the assertions in ConceptNet.
- `assertions/*.csv`: The same assertions in tabular text format. (The extension '.csv' originally stands for Comma-Separated Values, but it is frequently used to refer to tab-separated values as well, which is what these files contain.)
- `edges/`: The individual edges that these assertions were built from.
- `stats/`: Some text files that count the distribution of different languages, relations, and datasets in the built data.
- `sw_map/`: Files in N-Triples format that connect ConceptNet URIs to equivalent Semantic Web resources.
- `assoc/all.csv`: A tabular text file of just the concept-to-concept associations (plus additional 'negated concept' nodes that represent negative relations).
- `solr/*.json`: All the assertions as documents that can be loaded into the Solr search engine, for quick lookup.
This next step, building the association space, is complex and computationally expensive, so it's optional.
One cool thing you can do with ConceptNet is discover generalized associations between sets of concepts, by representing concepts in a high-dimensional vector space. This is used in the Web API, for example, and the API documentation includes some examples.
The vector space is represented by a matrix. Its rows are labeled with a filtered subset of the concepts in ConceptNet, and its columns are 150 principal components of knowledge in ConceptNet. Therefore, each concept gets associated with a 150-dimensional vector. Concepts that are more strongly associated with one another will have vectors with a higher dot product.
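To make the dot-product idea concrete, here is a toy illustration with made-up concepts and vectors; it isn't part of the build, and the real matrix has 150 columns and far more rows:

```python
# Toy illustration of concept vectors and dot-product association.
# All of the concepts, dimensions, and numbers here are invented.
import numpy as np

concepts = ["/c/en/dog", "/c/en/puppy", "/c/en/calculus"]
vectors = np.array([
    [0.9, 0.1, 0.3],    # dog
    [0.8, 0.2, 0.4],    # puppy
    [0.1, 0.9, -0.5],   # calculus
])

# Normalize the rows so the dot products behave like cosine similarities.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def association(a, b):
    """Dot product between the vectors of two concepts."""
    return float(vectors[concepts.index(a)].dot(vectors[concepts.index(b)]))

print(association("/c/en/dog", "/c/en/puppy"))     # relatively high
print(association("/c/en/dog", "/c/en/calculus"))  # near zero
```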
There's a separate package called LuminosoInsight/assoc-space that builds this matrix from ConceptNet data.
This package relies on Numpy and Scipy; your Python environment has to already have them installed, or have the tools necessary to compile them (such as development libraries for Python, BLAS linear algebra libraries, and C and Fortran compilers). As long as Numpy and Scipy are working, you can run this:
pip install assoc-space
make build_assoc
The result will be a directory called `assoc-space-5.2` that can be loaded by the `assoc_space` package or by the Web API.