URI hierarchy
== Introduction ==
Every object in ConceptNet has a URI that is structured like a path, giving you a standard place to look it up. For example, the concept "common sense" in English has the URI /c/en/common_sense.
Most URIs are intended to be meaningful: if you look at a URI, you can tell what object it is, and if you look at an object you can tell what its URI is. The exception is edges: an edge's URI is a hash of all the information in the edge, and its only purpose is to ensure uniqueness.
The different kinds of objects are distinguished by the first element of the path. In version 5.1, these elements have mostly been reduced to a single character to avoid wasting disk space.
- /a/: assertions
- /c/: concepts (words, disambiguated words, and phrases, in a particular language)
- /ctx/: contexts in which assertions can be true, if they're language independent. Most assertions are in /ctx/all. A language-specific concept can also be used as a context.
- /d/: datasets
- /e/: unique, arbitrary IDs for edges. Edges that assert the same thing combine to form assertions.
- /r/: language-independent relations, such as /r/IsA
- /s/: knowledge sources, which can be human contributors, Web sites, or automated processes
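As a minimal sketch of how the first path element identifies the kind of object, the lookup below maps each prefix from the list above to a label. The function name `uri_kind` and the label strings are illustrative choices, not part of ConceptNet's own code.

```python
# Prefix-to-kind table, taken from the list of top-level path elements.
URI_KINDS = {
    "a": "assertion",
    "c": "concept",
    "ctx": "context",
    "d": "dataset",
    "e": "edge",
    "r": "relation",
    "s": "source",
}


def uri_kind(uri):
    """Return the kind of object a URI names, based on its first path element.

    Hypothetical helper for illustration only.
    """
    parts = uri.split("/")
    # A well-formed URI starts with "/", so parts[0] is the empty string
    # and parts[1] is the element that distinguishes the kind of object.
    if len(parts) < 2 or parts[0] != "":
        raise ValueError("not a path-style URI: %r" % (uri,))
    try:
        return URI_KINDS[parts[1]]
    except KeyError:
        raise ValueError("unknown URI prefix in %r" % (uri,))


if __name__ == "__main__":
    for example in ("/c/en/common_sense", "/r/IsA", "/ctx/all"):
        print(example, "is a", uri_kind(example))
```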
== Concept URIs ==
Concept URIs contain the text of the concept, reduced to a normal form using the language-specific lemmatizers in metanl and with spaces replaced by underscores. All non-ASCII text is in UTF-8.
Each concept has at least three components: the initial /c to make it a concept, a part that indicates its language (using the shortest ISO language code for that language), and a part with the concept text.
An optional fourth component gives the part of speech (as a single letter, following the convention of WordNet), and an optional fifth component is a phrase distinguishing a particular word sense from others.
- /c/en/play_game is the English concept "play a game".
- /c/en/read/v is the English word "read", in all its senses that are verbs.
- /c/en/read/v/interpret_something_that_is_written_or_printed is a particular verb sense of "read".
- /c/ja/紙 is the Japanese concept meaning "paper".
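The three required and two optional components can be pulled apart with a simple split, as in the sketch below. The names `Concept` and `parse_concept_uri` are hypothetical. The sketch only splits an existing URI; it does not reproduce the metanl normalization that turns "play a game" into play_game.

```python
from collections import namedtuple

# Hypothetical container for the components of a concept URI.
Concept = namedtuple("Concept", ["language", "text", "pos", "sense"])


def parse_concept_uri(uri):
    """Split a concept URI into language, text, part of speech, and sense.

    The part of speech and sense are None when the URI omits them.
    """
    parts = uri.strip("/").split("/")
    # Expect "c", a language code, the concept text, and at most two
    # optional components after that.
    if not (3 <= len(parts) <= 5) or parts[0] != "c":
        raise ValueError("not a concept URI: %r" % (uri,))
    language, text = parts[1], parts[2]
    pos = parts[3] if len(parts) > 3 else None
    sense = parts[4] if len(parts) > 4 else None
    return Concept(language, text, pos, sense)


if __name__ == "__main__":
    print(parse_concept_uri("/c/en/read/v"))
    print(parse_concept_uri("/c/ja/紙"))
```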
== Assertion URIs ==
Assertion URIs indicate the relation, start, and end of an edge (or bundle of edges; having more edges with different sources makes the assertion stronger).
The relation, start, and end are all represented in a bracketed list in the URI. The brackets allow assertion URIs to be nested within each other, in the case where you have assertions about assertions. These lists are surrounded by the components /[/ and /]/ and delimited by /,/. For example, the assertion "A dog is an animal" has the URI /a/[/r/IsA/,/c/en/dog/,/c/en/animal/].
The relation is either a language-independent /r/ relation or a language-specific /c/ concept. The start and end can be concepts or assertions. (They can even conceivably be relations, if we add an upper ontology describing how relations relate to each other.)
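A rough sketch of how such a URI could be assembled follows, assuming the bracketed list holds the relation, start, and end in that order. The function name `make_assertion_uri` is hypothetical, and the relation /r/Causes in the nested usage is only a placeholder to show the shape of the result.

```python
def make_assertion_uri(relation, start, end):
    """Build an assertion URI from three component URIs.

    Assumes the list order is relation, start, end. Each argument is a
    URI without a trailing slash; start and end may themselves be
    assertion URIs, which yields a nested assertion.
    """
    components = [relation, start, end]
    # Components are opened by "/[/", separated by "/,/", and closed by
    # "/]". Each component already begins with "/", so only "/[" and
    # "/," need to be written out here.
    return "/a/[" + "/,".join(components) + "/]"


if __name__ == "__main__":
    dog = make_assertion_uri("/r/IsA", "/c/en/dog", "/c/en/animal")
    print(dog)
    # An assertion about an assertion: the inner URI is one component.
    # "/r/Causes" is a placeholder relation for illustration.
    print(make_assertion_uri("/r/Causes", dog, "/c/en/bark"))
```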
== Edge URIs ==
Edges can differ in many ways, so unlike assertions, there's no compact description of the edge that is significantly smaller than the data structure for the edge itself.
Instead, we hash all the data that makes an edge unique. Create a list containing the assertion URI, the context, and the list of conjoined sources that justify the edge, and separate all their URIs with |; then take the SHA-1 hexadecimal digest of this string. See the make_edge function in edges.py for an implementation.
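The hashing recipe above can be approximated as follows. This is a sketch under stated assumptions, not the make_edge function itself: the name `make_edge_uri`, the ordering of the sources, and the UTF-8 encoding step are assumptions, so the digests it produces may not match those in a real ConceptNet build.

```python
import hashlib


def make_edge_uri(assertion_uri, context, sources):
    """Derive an /e/ URI by hashing the data that makes an edge unique.

    Hypothetical helper. `sources` is assumed to be a list of source
    URIs, used in the order given.
    """
    # One flat list: assertion URI, context, then each conjoined source.
    parts = [assertion_uri, context] + list(sources)
    joined = "|".join(parts)
    # SHA-1 operates on bytes; encoding as UTF-8 is an assumption here.
    digest = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    return "/e/" + digest


if __name__ == "__main__":
    # The source URI below is a made-up example.
    print(make_edge_uri(
        "/a/[/r/IsA/,/c/en/dog/,/c/en/animal/]",
        "/ctx/all",
        ["/s/contributor/example_user"],
    ))
```

Because the digest depends on every element of the list, two edges that make the same assertion in the same context but cite different sources get different URIs.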