URI hierarchy

Rob Speer edited this page Mar 27, 2014 · 26 revisions

Every object in ConceptNet has a URI that is structured like a path, giving you a standard place to look it up. For example, the concept "common sense" in English has the URI /c/en/common_sense.

Technicality: The identifiers in ConceptNet are actually "IRIs", not "URIs", because they may contain Unicode characters. However, I have never encountered the term "IRI" outside of DBPedia's documentation, so let's keep using the more familiar term "URI".

Most URIs are intended to be meaningful: if you look at a URI, you can tell what object it is, and if you look at an object, you can tell what its URI is. The exception is edges: an edge URI is a hash of all the information in the edge, and its only purpose is to ensure uniqueness.

The different kinds of objects are distinguished by the first element of the path.

  • /a/: assertions
  • /c/: concepts (words, disambiguated words, and phrases, in a particular language)
  • /d/: datasets (large sources of knowledge that can be downloaded as a unit)
  • /e/: unique, arbitrary IDs for edges. Edges that assert the same thing combine to form assertions.
  • /l/: license terms for redistributing the information in an edge. The two licenses in ConceptNet are /l/CC/By for Creative Commons Attribution, and /l/CC/By-SA for the more restrictive Attribution-ShareAlike license. See Copying and sharing ConceptNet.
  • /r/: language-independent relations, such as /r/IsA
  • /s/: knowledge sources, which can be human contributors, Web sites, or automated processes
  • /and and /or: conjunctions and disjunctions of sources
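As a rough illustration of how the first path element identifies the kind of object, here is a minimal sketch. The names `URI_KINDS` and `uri_kind` are made up for this example and are not part of the ConceptNet codebase. The table covers only the prefixes listed above, so any other prefix (such as the /ctx prefix that appears in the edge example further down) is rejected.

```python
# Hypothetical lookup table built from the list above; not ConceptNet's own code.
URI_KINDS = {
    "a": "assertion",
    "c": "concept",
    "d": "dataset",
    "e": "edge",
    "l": "license",
    "r": "relation",
    "s": "source",
    "and": "conjunction of sources",
    "or": "disjunction of sources",
}


def uri_kind(uri):
    """Return the kind of object a URI names, based on its first path element."""
    if not uri.startswith("/"):
        raise ValueError("expected a URI starting with '/': %r" % uri)
    # Take everything between the leading slash and the next slash.
    first = uri[1:].split("/", 1)[0]
    if first not in URI_KINDS:
        raise ValueError("unrecognized URI prefix: /%s" % first)
    return URI_KINDS[first]


print(uri_kind("/c/en/common_sense"))
print(uri_kind("/l/CC/By-SA"))
```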

Concept URIs

Concept URIs contain the text of the concept, reduced to a normal form using the language-specific lemmatizers in metanl and with spaces replaced by underscores. All non-ASCII text is in UTF-8.

Each concept has at least three components: the initial /c to make it a concept, a part that indicates its language (using the shortest ISO language code for that language), and a part with the concept text.

An optional fourth component gives the part of speech (as a single letter, following the convention of WordNet), and an optional fifth component is a phrase distinguishing a particular word sense from others.

  • /c/en/play_game is the English concept "play a game".
  • /c/en/read/v is the English word "read", in all its senses that are verbs.
  • /c/en/read/v/interpret_something_that_is_written_or_printed is a particular verb sense of "read".
  • /c/ja/紙 is the Japanese concept meaning "paper".
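The component layout described above can be sketched as follows. This is only an illustration: `make_concept_uri` and `split_concept_uri` are hypothetical helper names, and the sketch assumes the text has already been reduced to its normal form, since it does not attempt the lemmatization that metanl performs. It only replaces spaces with underscores and joins the pieces.

```python
def make_concept_uri(lang, text, pos=None, sense=None):
    """Join already-normalized pieces into a concept URI (no lemmatization here)."""
    if sense is not None and pos is None:
        # The sense is the fifth component, so it needs a part of speech before it.
        raise ValueError("a word sense requires a part of speech")
    pieces = ["c", lang, text.replace(" ", "_")]
    if pos is not None:
        pieces.append(pos)
    if sense is not None:
        pieces.append(sense.replace(" ", "_"))
    return "/" + "/".join(pieces)


def split_concept_uri(uri):
    """Split a concept URI into its three to five components."""
    pieces = uri.lstrip("/").split("/")
    if pieces[0] != "c" or not 3 <= len(pieces) <= 5:
        raise ValueError("not a concept URI: %r" % uri)
    return {
        "language": pieces[1],
        "text": pieces[2],
        "pos": pieces[3] if len(pieces) > 3 else None,
        "sense": pieces[4] if len(pieces) > 4 else None,
    }


print(make_concept_uri("en", "play game"))
print(make_concept_uri("ja", "紙"))
print(split_concept_uri("/c/en/read/v"))
```

Python 3 strings are Unicode, so the Japanese example needs no special handling until the URI is written out, at which point it would be encoded as UTF-8.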

Assertion URIs

Assertion URIs indicate the relation, start, and end of an edge (or bundle of edges; having more edges with different sources makes the assertion stronger).

The relation, start, and end are all represented in a bracketed list in the URI. The brackets allow assertion URIs to be nested within each other, in the case where you have assertions about assertions. These lists are surrounded by the components /[/ and /]/ and delimited by /,/. For example, the assertion "A dog is an animal" has the URI /a/[/r/IsA/,/c/en/dog/,/c/en/animal/].

The relation is either a language-independent /r/ relation or a language-specific /c/ concept. The start and end can be concepts or assertions. (They can even conceivably be relations, if we add an upper ontology describing how relations relate to each other.)
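A minimal sketch of the bracketed-list syntax, assuming each item is a URI without a trailing slash. The helper names `join_uri_list` and `make_assertion_uri` are invented for this example; the project's own helpers live in uri.py and may be named and structured differently. The nested assertion at the end is a made-up illustration of the syntax, not an assertion that exists in ConceptNet.

```python
def join_uri_list(head, items):
    """Wrap item URIs in the /[/ ... /]/ components, separated by /,/."""
    return head + "/[" + "/,".join(items) + "/]"


def make_assertion_uri(relation, start, end):
    """Build an assertion URI from the URIs of its relation, start, and end."""
    return join_uri_list("/a", [relation, start, end])


inner = make_assertion_uri("/r/UsedFor", "/c/en/book", "/c/en/learn")
print(inner)

# Because the start or end may itself be an assertion, URIs can nest.
# This particular outer assertion is hypothetical.
outer = make_assertion_uri("/r/IsA", inner, "/c/en/fact")
print(outer)
```

Since every item begins with a slash and the separators are themselves path components, the result still reads as a single slash-delimited path.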

Source URIs

A single source of knowledge has a URI that begins with /s. Sources are further broken down by type:

  • /s/contributor: a human contributor to a crowd-sourced knowledge base.
  • /s/activity: a knowledge-collection task that a computer presented to people in order to gather crowd-sourced knowledge.
  • /s/rule: an automatic rule for extracting knowledge from a different form.
  • /s/site: a knowledge base with the authority of some Web site behind it, such as Wiktionary or DBPedia.

Awkward detail: In ConceptNet 5.2, the source that should be /s/site/dbpedia/3.7 is actually just /s/dbpedia/3.7, making DBPedia just a completely separate kind of source. This is probably not desirable, but changing it would change all the edge IDs. We will probably fix this in ConceptNet 5.3.

The sources for an assertion are often conjunctions (/and) or disjunctions (/or) of these individual sources. For example, any edge with a contributor probably has an activity as well, and those would be combined into an /and source.

If the same assertion comes from multiple edges from unrelated sources, those are combined into an /or source.

The sources then appear in bracketed tree-structured URIs, such as:

/or/[/and/[/s/contributor/omcs/havasi/,/s/activity/omcs1/]/,/s/site/dbpedia/3.7/]

/and sources appear within /or sources, but never the other way around.
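To show how such a tree-structured URI can be read back, here is a small recursive parser. It is a sketch under the assumption that the components [, ] and , only ever appear as list syntax. `parse_uri_tree` and `leaf_sources` are hypothetical names, not functions from the ConceptNet codebase.

```python
def _parse_node(pieces, pos):
    """Parse one URI starting at pieces[pos]; return (node, next position)."""
    head = []
    while pos < len(pieces) and pieces[pos] not in ("[", "]", ","):
        head.append(pieces[pos])
        pos += 1
    head_uri = "/" + "/".join(head)
    if pos < len(pieces) and pieces[pos] == "[":
        pos += 1  # step past the opening bracket
        children = []
        while True:
            child, pos = _parse_node(pieces, pos)
            children.append(child)
            if pos >= len(pieces):
                raise ValueError("unbalanced brackets")
            if pieces[pos] == ",":
                pos += 1  # another item follows
            elif pieces[pos] == "]":
                pos += 1  # end of this list
                break
            else:
                raise ValueError("unexpected component %r" % pieces[pos])
        return (head_uri, children), pos
    return head_uri, pos


def parse_uri_tree(uri):
    """Return a plain URI string, or (head, children) for a bracketed list."""
    pieces = uri.strip("/").split("/")
    node, pos = _parse_node(pieces, 0)
    if pos != len(pieces):
        raise ValueError("trailing components in %r" % uri)
    return node


def leaf_sources(node):
    """Collect the individual source URIs at the leaves of a parsed tree."""
    if isinstance(node, str):
        return [node]
    _head, children = node
    found = []
    for child in children:
        found.extend(leaf_sources(child))
    return found


tree = parse_uri_tree(
    "/or/[/and/[/s/contributor/omcs/havasi/,/s/activity/omcs1/]/,/s/site/dbpedia/3.7/]"
)
print(tree)
print(leaf_sources(tree))
```

The parser does not check the nesting rule itself; a caller that cares could walk the result and reject any /or node found beneath an /and node.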

Edge URIs

Edges can differ in many ways, so unlike an assertion, an edge has no compact description that is meaningfully smaller than the data structure for the edge itself.

Instead, we hash all the data that makes an edge unique, creating a URI such as /e/d4983ab61dd4e4050b29377716fe37fa0704771b. These URIs are a more compact way of deduplicating edges, and you can also sort by them to get a pseudorandom selection of edges.

To make an edge URI, create a list containing the assertion URI, the context, and the conjoined sources that justify the edge, with the sources in alphabetical order. Join all of these URIs with spaces, giving a string like /a/[/r/UsedFor/,/c/en/book/,/c/en/learn/] /ctx/all /s/activity/omcs/omcs1,_possibly_free_text /s/contributor/omcs/annedog. Then take the SHA-1 hexadecimal digest of this string and prefix it with /e/.

See the make_edge function in edges.py for an implementation.
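The recipe above can be sketched with Python's standard hashlib module. This is not the make_edge function itself, and it may differ from it in detail: the names `edge_unique_string` and `make_edge_uri` are invented here, encoding the string as UTF-8 before hashing is an assumption, and `sorted` orders sources by code point, which is taken to be what "alphabetical" means.

```python
import hashlib


def edge_unique_string(assertion_uri, context, sources):
    """Join the assertion URI, the context, and the sorted sources with spaces."""
    return " ".join([assertion_uri, context] + sorted(sources))


def make_edge_uri(assertion_uri, context, sources):
    """Hash the identifying string with SHA-1 and prefix the hex digest with /e/."""
    unique = edge_unique_string(assertion_uri, context, sources)
    # Assumption: the string is encoded as UTF-8 before hashing.
    digest = hashlib.sha1(unique.encode("utf-8")).hexdigest()
    return "/e/" + digest


sources = [
    "/s/contributor/omcs/annedog",
    "/s/activity/omcs/omcs1,_possibly_free_text",
]
assertion = "/a/[/r/UsedFor/,/c/en/book/,/c/en/learn/]"
print(edge_unique_string(assertion, "/ctx/all", sources))
print(make_edge_uri(assertion, "/ctx/all", sources))
```

Sorting the sources first means the same edge gets the same URI no matter what order its sources were collected in, which is what makes the hash usable for deduplication.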

Source code

Code for working with URIs in general: https://github.com/commonsense/conceptnet5/blob/master/conceptnet5/uri.py

Code for working with concept names in particular languages: https://github.com/commonsense/conceptnet5/blob/master/conceptnet5/nodes.py
