Changelog
ConceptNet 5.4 includes small updates to the source data, a significant simplification to how texts are represented as URIs, and a new build process.
- We've updated the data from nadya.jp, a game that collects relational knowledge in Japanese.
- Updated DBPedia to its 2014 release.
- Dropped some not-very-useful relations that snuck in from old experiments with ConceptNet, such as `/r/InheritsFrom`.
- We've simplified the way that natural language texts are represented as ConceptNet URIs. Instead of an English-specific, machine-learned tokenizer from NLTK, we use a simpler regex based on Unicode character classes to split the text into words. The most noticeable change is that hyphens are now token boundaries just like spaces, and get replaced by underscores: `/c/en/mother-in-law` is now `/c/en/mother_in_law`.
- Assertions now store the original texts of the terms that they relate in the `surfaceStart` and `surfaceEnd` fields. The assertion whose `surfaceText` is `[[Fire]] is [[hot]]` has properties including `{"surfaceStart": "Fire", "surfaceEnd": "hot"}`.
- The Makefile for ConceptNet was becoming unwieldy, so we've replaced it with a Ninja file. Ninja is a build system that's similar in spirit to Make, but deals better with parallel builds and with build steps that produce many outputs.
- Uses the `langcodes` module to parse language names and codes more robustly, especially those from Wiktionary.
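
The URI simplification and the surface-text fields described in the list above can be sketched roughly as follows. This is a minimal illustration, not ConceptNet's actual code: the regex, the lowercasing step, and the function names `text_to_uri` and `surface_terms` are assumptions made for this example, and the sketch further assumes that the first bracketed term is the start and the second is the end.

```python
import re

# Assumption: a word is a run of Unicode letters or digits. The exact pattern
# ConceptNet uses is not given in this changelog and may differ.
WORD_RE = re.compile(r"[^\W_]+")

# Matches the [[...]] markers used in surface texts such as "[[Fire]] is [[hot]]".
BRACKET_RE = re.compile(r"\[\[(.*?)\]\]")


def text_to_uri(text, lang="en"):
    """Split text into words and join them with underscores under /c/<lang>/.

    Hyphens and spaces both act as token boundaries because neither is matched
    by WORD_RE. Lowercasing is an assumption of this sketch.
    """
    tokens = WORD_RE.findall(text.lower())
    return "/c/{}/{}".format(lang, "_".join(tokens))


def surface_terms(surface_text):
    """Extract the two bracketed terms of a surface text (hypothetical helper)."""
    terms = BRACKET_RE.findall(surface_text)
    if len(terms) != 2:
        raise ValueError("expected exactly two [[...]] terms")
    return {"surfaceStart": terms[0], "surfaceEnd": terms[1]}


print(text_to_uri("mother-in-law"))
print(surface_terms("[[Fire]] is [[hot]]"))
```

Because the pattern relies on Unicode character classes rather than an English-specific tokenizer, the same function applies unchanged to text in other languages.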
ConceptNet 5.3 introduces changes in many areas:
- The search index is now implemented in pure Python, using SQLite. Solr is no longer a dependency.
- The API now uses this new search index. One effect of this is that it matches only complete path components, not any prefix of a URI. Searching for "/c/en/cat" will get "/c/en/cat" and "/c/en/cat/n/animal", but not "/c/en/catamaran".
- Exact matches are also possible. Searching for "/c/en/cat/." -- with a dot as the final component -- will find only "/c/en/cat".
- Because the search index no longer uses Solr, the "score" attribute no longer appears on edges. This attribute was an artifact of Solr that represented the product of the edge's "weight" with Solr's built-in search weight. If you were using "score", you should use "weight" instead.
- ConceptNet now imports data from Umbel, a Semantic Web-style view of OpenCyc.
- Indonesian (`id`) and Malay (`ms`) concepts have been unified into the Malay macrolanguage (also designated `ms`), similarly to the way we already unify Chinese, because of their highly overlapping vocabularies. In a later version, we may be able to make distinctions between languages within a macrolanguage when necessary.
- We've implemented a better Wiktionary reader using Grako, a framework for writing recursive parsers in Python. This parser is able to understand the structure of a Wiktionary entry, giving more results and fewer errors than what we did before.
- Wiktionary parsing now covers entries written in German as well as English. (As before, the entries are about words in hundreds of languages.)
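
The path-component matching rule that the new search index applies (described earlier in this list) can be expressed as a small predicate. This is only a sketch of the rule as stated in the changelog, not the SQLite-backed implementation; the function name `uri_matches` is hypothetical.

```python
def uri_matches(query, uri):
    """Return True if `uri` matches `query` on complete path components.

    A query ending in "/." requests an exact match on the part before the dot.
    Otherwise the query must equal the URI or be followed by a "/" in it, so
    that "/c/en/cat" does not match "/c/en/catamaran".
    """
    if query.endswith("/."):
        return uri == query[:-2]
    return uri == query or uri.startswith(query + "/")


for candidate in ["/c/en/cat", "/c/en/cat/n/animal", "/c/en/catamaran"]:
    print(candidate, uri_matches("/c/en/cat", candidate), uri_matches("/c/en/cat/.", candidate))
```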
The intermediate format for lists of ConceptNet edges is now msgpack instead of JSON. This format is compatible with JSON but saves disk space and parsing time.
The "assoc-space", a dimensionality-reduced vector space that represents which words are similar to other words, uses an updated version of the `assoc_space` package. It can now be built in shards that are combined to form the complete space, instead of having to be built all at once, making it possible to run using a reasonable amount of RAM.
Some of ConceptNet's data is available under the Creative Commons Attribution (CC-By) license, even though the dataset as a whole requires the Creative Commons Attribution-ShareAlike license (CC-By-SA). This information is marked on each edge, but in ConceptNet 5.2, there was no easy way to get the CC-By subset.
By now, there are enough CC-By-SA data sources that it doesn't make sense to attempt a complete build of ConceptNet without them. However, ConceptNet 5.3's downloads include a file containing only the CC-By edges, as individual edges that aren't grouped into assertions.
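
If you are working from the full edge data instead of the CC-By download, a filter along these lines could select the same kind of subset. This is a hedged sketch: it assumes each edge is a dictionary with a `license` key whose value for CC-By data is `/l/CC/By`, which this changelog does not spell out, and the function name `cc_by_edges` and the sample edges are made up for illustration.

```python
# Assumption: the license marker stored on CC-By edges.
CC_BY = "/l/CC/By"


def cc_by_edges(edges):
    """Yield only the edges whose license field marks them as CC-By."""
    for edge in edges:
        if edge.get("license") == CC_BY:
            yield edge


# Hypothetical sample edges, for illustration only.
sample = [
    {"start": "/c/en/fire", "end": "/c/en/hot", "license": "/l/CC/By"},
    {"start": "/c/en/cat", "end": "/c/en/animal", "license": "/l/CC/By-SA"},
]
print(list(cc_by_edges(sample)))
```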
ConceptNet 5.3's support code still runs on Python 2, but we would like to drop support for Python 2 in an upcoming version. As has been the case since version 5.2, the data cannot be built correctly on Python 2.
The data files described by `MANIFEST.in` are now also installed as `package_data` in `setup.py`, making them available when installed as a package. (This accomplishes what the last bullet point in 5.2.3 was supposed to be about.)
- Fix a typo in the Makefile that prevented it from downloading the initial raw data.
- Enforce the rate limit in the API.
- Merge in NLP code from `metanl`, instead of having it as an external dependency. The dependency is now on the simpler package `ftfy`.
- Add a `MANIFEST.in` so that the necessary data can still be found after a `pip install` or `setup.py install`.
- Fix the accidental omission of nadya.jp data.
5.2.1 is a significant revision to the code that builds ConceptNet, but it retains mostly the same representation and almost all of the same knowledge as 5.2.0. The cases where they differ are largely due to bugs that were discovered in the refactor.
- Reorganized much of the code for working with nodes and edges in ConceptNet.
- The code is now designed for Python 3 as its primary environment. A small amount of compatibility code makes sure that it will still run on Python 2.7 as well, but it will not necessarily get the same results from all Unicode operations.
- Removed a fair amount of dead code.
- Added test cases that cover most of the code; removed tests for 5.0 that clearly wouldn't work anymore.
- Combined assertions (such as what the 5.2 API returns) keep track of their full list of sources and their first-seen dataset, so they can be searched like edges in 5.1.
A change will be noticeable in the Web API, because for a while it was serving the union of ConceptNet 5.1 and 5.2 data structures, with both separate edges and combined assertions. Now it is only serving the combined assertions. The results should be similar, but with less duplication.
- The set of knowledge sources has changed. JMdict is in. ReVerb is out, because we couldn't filter it well enough.
- Some bugs in building from existing sources were fixed.
- ConceptNet can now be built from its raw data using a Makefile. (See Build process)
- The code comes with everything you need to build and query "assoc spaces" -- vector spaces representing semantic connections between concepts -- thanks to the open-source release of assoc_space by Luminoso.
- The API now returns one result per assertion, even if that assertion comes from multiple sources.
- Because of that, the representation of knowledge sources has changed. The sources used to be lists of reasons that an assertion got added, and each one implicitly represented a conjunction. The "sources" field in the API now always contains one element for each assertion, and that element contains the full AND-OR tree of sources.
Version 5.1 has a new, simpler representation of nodes and edges than ConceptNet 4.0 or 5.0, making it suitable to represent ConceptNet 5 with downloadable flat files and efficient search indexes.
- Made base URIs shorter. For example, `/concept/en/dog` becomes `/c/en/dog`.
- Changed the representation of assertions. Assertions are a bundle of edges (hyperedges, really) that connect two arguments and a relation. These edges are labeled with all the appropriate metadata.
- Created JSON and CSV flat-files.
- Created a Solr index and an accompanying API. The MongoDB database is deprecated.
ConceptNet 5.1.1 was an incremental update that maintains full API compatibility with 5.1.
- First API for ConceptNet 5.
- All assertions were reified as nodes, with edges for arguments. This turned out to be an ineffective representation.