URI hierarchy

rspeer edited this page Apr 30, 2012 · 26 revisions

Introduction

Every object in ConceptNet has a URI that is structured like a path, giving you a standard place to look it up. For example, the concept "common sense" in English has the URI /c/en/common_sense.

Most URIs are intended to be meaningful: if you look at a URI, you can tell what object it is, and if you look at an object, you can tell what its URI is. The exception is edges: an edge URI is a hash of all the information in the edge, and its only purpose is to ensure uniqueness.

The different kinds of objects are distinguished by the first element of the path. In version 5.1, these elements have mostly been reduced to a single character to avoid wasting disk space.

  • /a/: assertions
  • /c/: concepts (words, disambiguated words, and phrases, in a particular language)
  • /ctx/: language-independent contexts in which assertions can be true. Most assertions are in /ctx/all. A language-specific concept can also be used as a context.
  • /d/: datasets
  • /e/: unique, arbitrary IDs for edges. Edges that assert the same thing combine to form assertions.
  • /l/: license terms for redistributing the information in an edge. The two licenses in ConceptNet are /l/CC/By for Creative Commons Attribution, and /l/CC/By-SA for the more restrictive Attribution-ShareAlike license. See Copying and sharing ConceptNet.
  • /r/: language-independent relations, such as /r/IsA
  • /s/: knowledge sources, which can be human contributors, Web sites, or automated processes
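
As a rough illustration of how the first path element identifies the kind of object, here is a minimal Python sketch. The table is copied from the list above; the names URI_KINDS and uri_kind are hypothetical and not part of ConceptNet's code.

```python
# Prefix-to-kind table, copied from the list above.
URI_KINDS = {
    "a": "assertion",
    "c": "concept",
    "ctx": "context",
    "d": "dataset",
    "e": "edge",
    "l": "license",
    "r": "relation",
    "s": "source",
}


def uri_kind(uri):
    """Return the kind of object a URI names, using its first path element.

    Hypothetical helper for illustration only.
    """
    if not uri.startswith("/"):
        raise ValueError("URI must start with '/': %r" % uri)
    # "/c/en/dog".split("/") gives ["", "c", "en", "dog"], so index 1
    # is the first path element.
    first = uri.split("/")[1]
    if first not in URI_KINDS:
        raise ValueError("unknown URI prefix: %r" % uri)
    return URI_KINDS[first]
```

For example, uri_kind("/c/en/common_sense") gives "concept" and uri_kind("/ctx/all") gives "context".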

Concept URIs

Concept URIs contain the text of the concept, reduced to a normal form using the language-specific lemmatizers in metanl and with spaces replaced by underscores. All non-ASCII text is in UTF-8.

Each concept has at least three components: the initial /c to make it a concept, a part that indicates its language (using the shortest ISO language code for that language), and a part with the concept text.

An optional fourth component gives the part of speech (as a single letter, following the convention of WordNet), and an optional fifth component is a phrase distinguishing a particular word sense from others.

  • /c/en/play_game is the English concept "play a game".
  • /c/en/read/v is the English word "read", in all its senses that are verbs.
  • /c/en/read/v/interpret_something_that_is_written_or_printed is a particular verb sense of "read".
  • /c/ja/紙 is the Japanese concept meaning "paper".
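
The component layout described above can be sketched in Python as follows. This is only an illustration: make_concept_uri and split_concept_uri are hypothetical names, and the sketch assumes the text has already been reduced to its normal form (it does not call the metanl lemmatizers, so it will not turn "play a game" into "play game" on its own).

```python
def make_concept_uri(text, lang, pos=None, sense=None):
    """Build a concept URI from already-normalized text.

    Hypothetical helper: lemmatization is assumed to have happened
    elsewhere; this only replaces spaces and joins the components.
    """
    parts = ["c", lang, text.replace(" ", "_")]
    if pos is not None:
        parts.append(pos)
        if sense is not None:
            parts.append(sense.replace(" ", "_"))
    elif sense is not None:
        # The sense is the fifth component, so it needs the fourth (POS).
        raise ValueError("a word sense requires a part of speech")
    return "/" + "/".join(parts)


def split_concept_uri(uri):
    """Split a concept URI into (lang, text, pos, sense).

    pos and sense are None when the URI does not include them.
    """
    pieces = uri.strip("/").split("/")
    if pieces[0] != "c" or not 3 <= len(pieces) <= 5:
        raise ValueError("not a concept URI: %r" % uri)
    pieces += [None] * (5 - len(pieces))
    _, lang, text, pos, sense = pieces
    return lang, text, pos, sense
```

With these definitions, make_concept_uri("play game", "en") gives /c/en/play_game, and split_concept_uri("/c/en/read/v") gives the language "en", the text "read", the part of speech "v", and no sense.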

Assertion URIs

Assertion URIs indicate the relation, start, and end of an edge (or bundle of edges; having more edges with different sources makes the assertion stronger).

The relation, start, and end are all represented in a bracketed list in the URI. The brackets allow assertion URIs to be nested within each other, in the case where you have assertions about assertions. These lists are surrounded by the components /[/ and /]/ and delimited by /,/. For example, the assertion "A dog is an animal" has the URI /a/[/r/IsA/,/c/en/dog/,/c/en/animal/].

The relation is either a language-independent /r/ relation or a language-specific /c/ concept. The start and end can be concepts or assertions. (They can even conceivably be relations, if we add an upper ontology describing how relations relate to each other.)
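
To show why the bracket components make nesting unambiguous, here is a small Python sketch that builds an assertion URI and splits it back into its relation, start, and end. The function names are hypothetical, and the sketch assumes that "[", "]", and "," never occur as whole path components inside a concept's own text.

```python
def make_assertion_uri(relation, start, end):
    """Join three URIs into a bracketed assertion URI (hypothetical helper)."""
    return "/a/[" + "/,".join([relation, start, end]) + "/]"


def split_assertion_uri(uri):
    """Return the top-level arguments of an assertion URI as a list of URIs.

    Nested assertions are kept whole: a "," component only separates
    arguments when it appears outside any inner brackets.
    """
    pieces = uri.split("/")
    if pieces[:3] != ["", "a", "["] or pieces[-1] != "]":
        raise ValueError("not an assertion URI: %r" % uri)
    args, current, depth = [], [], 0
    for piece in pieces[3:-1]:
        if piece == "," and depth == 0:
            # A top-level delimiter ends the current argument.
            args.append("/" + "/".join(current))
            current = []
            continue
        if piece == "[":
            depth += 1
        elif piece == "]":
            depth -= 1
        current.append(piece)
    args.append("/" + "/".join(current))
    return args


inner = make_assertion_uri("/r/IsA", "/c/en/dog", "/c/en/animal")
# An assertion about an assertion; the outer arguments are chosen only
# to illustrate the format and are not taken from ConceptNet's data.
outer = make_assertion_uri("/r/IsA", inner, "/c/en/fact")
```

Splitting outer gives three arguments, the second of which is the complete inner assertion URI, because the delimiters inside its brackets are skipped.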

Edge URIs

Edges can differ in many ways, so unlike an assertion, an edge has no compact description that is meaningfully smaller than the data structure for the edge itself.

Instead, we hash all the data that makes an edge unique. Create a list containing the assertion URI, the context, and the list of conjoined sources that justify the edge, and separate all their URIs with |; then take the SHA-1 hexadecimal digest of this string. See the make_edge function in edges.py for an implementation.
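
The hashing step might look roughly like the Python sketch below. It is not the make_edge function itself: the ordering of the pieces, the UTF-8 encoding before hashing, and the placement of the digest directly under /e/ are assumptions made for illustration, so it should not be expected to reproduce real ConceptNet edge IDs. The name make_edge_uri and the source URI in the usage note are hypothetical.

```python
import hashlib


def make_edge_uri(assertion_uri, context, sources):
    """Hash the identifying data of an edge into an /e/ URI.

    Illustrative sketch only; see make_edge in edges.py for the
    actual behavior.
    """
    # Assumed order: assertion URI, then context, then each source URI.
    pieces = [assertion_uri, context] + list(sources)
    joined = "|".join(pieces)
    # Assumption: the joined string is encoded as UTF-8 before hashing.
    digest = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    return "/e/" + digest
```

Calling make_edge_uri("/a/[/r/IsA/,/c/en/dog/,/c/en/animal/]", "/ctx/all", ["/s/example_source"]) returns /e/ followed by 40 hexadecimal characters. The same inputs always give the same URI, and changing the context or any source gives a different one.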