URI hierarchy
Every object in ConceptNet has a URI that is structured like a path, giving you a standard place to look it up. For example, the concept "common sense" in English has the URI /c/en/common_sense.
Most URIs are intended to be meaningful: if you look at a URI, you can tell what object it is, and if you look at an object you can tell what its URI is. The exception is edges, where the URI is a hash of all the information in the edge, whose only purpose is to ensure uniqueness.
The different kinds of objects are distinguished by the first element of the path.
- /a/: assertions
- /c/: concepts (words, disambiguated words, and phrases, in a particular language)
- /d/: datasets (large sources of knowledge that can be downloaded as a unit)
- /e/: unique, arbitrary IDs for edges. Edges that assert the same thing combine to form assertions.
- /l/: license terms for redistributing the information in an edge. The two licenses in ConceptNet are /l/CC/By for Creative Commons Attribution, and /l/CC/By-SA for the more restrictive Attribution-ShareAlike license. See Copying and sharing ConceptNet.
- /r/: language-independent relations, such as /r/IsA
- /s/: knowledge sources, which can be human contributors, Web sites, or automated processes
- /and and /or: conjunctions and disjunctions of sources
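As a rough illustration of the list above, the kind of object can be read off the first path element. This is a minimal sketch, not code from ConceptNet itself; the names `URI_KINDS` and `uri_kind` are made up for this example.

```python
# Mapping from the first path element to the kind of object, following the list above.
URI_KINDS = {
    "a": "assertion",
    "c": "concept",
    "d": "dataset",
    "e": "edge",
    "l": "license",
    "r": "relation",
    "s": "source",
    "and": "conjunction of sources",
    "or": "disjunction of sources",
}


def uri_kind(uri):
    """Return the kind of object a URI names, based on its first path element."""
    if not uri.startswith("/"):
        raise ValueError("expected a URI starting with '/': %r" % uri)
    # Everything between the first and second slash is the first path element.
    first = uri[1:].split("/", 1)[0]
    if first not in URI_KINDS:
        raise ValueError("unrecognized first path element in %r" % uri)
    return URI_KINDS[first]


print(uri_kind("/c/en/common_sense"))
print(uri_kind("/l/CC/By-SA"))
```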
Concept URIs contain the text of the concept, reduced to a normal form using the language-specific lemmatizers in metanl and with spaces replaced by underscores. All non-ASCII text is in UTF-8.
Each concept has at least three components: the initial /c to make it a concept, a part that indicates its language (using the shortest ISO language code for that language), and a part with the concept text.
An optional fourth component gives the part of speech (as a single letter, following the convention of WordNet), and an optional fifth component is a phrase distinguishing a particular word sense from others.
- /c/en/play_game is the English concept "play a game".
- /c/en/read/v is the English word "read", in all its senses that are verbs.
- /c/en/read/v/interpret_something_that_is_written_or_printed is a particular verb sense of "read".
- /c/ja/紙 is the Japanese concept meaning "paper".
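The component structure described above can be sketched in a few lines of Python. This is only an illustration: `make_concept_uri` and `split_concept_uri` are hypothetical names, and the sketch assumes the text has already been reduced to its normal form, since it does not call the metanl lemmatizers. It only replaces spaces with underscores.

```python
def make_concept_uri(text, lang, pos=None, sense=None):
    """Join the components of a concept URI.

    Assumption: `text` is already lemmatized; this sketch only handles spaces.
    """
    parts = ["c", lang, text.strip().replace(" ", "_")]
    if pos is not None:
        parts.append(pos)
        if sense is not None:
            parts.append(sense.strip().replace(" ", "_"))
    elif sense is not None:
        # The sense is the fifth component, so it needs a fourth one before it.
        raise ValueError("a word sense requires a part of speech")
    return "/" + "/".join(parts)


def split_concept_uri(uri):
    """Return (language, text, part of speech, sense); missing parts are None."""
    parts = uri.strip("/").split("/")
    if parts[0] != "c" or not 3 <= len(parts) <= 5:
        raise ValueError("not a concept URI: %r" % uri)
    lang, text = parts[1], parts[2]
    pos = parts[3] if len(parts) > 3 else None
    sense = parts[4] if len(parts) > 4 else None
    return lang, text, pos, sense


print(make_concept_uri("play game", "en"))
print(make_concept_uri("read", "en", pos="v"))
print(split_concept_uri("/c/en/read/v/interpret_something_that_is_written_or_printed"))
```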
Assertion URIs indicate the relation, start, and end of an edge (or bundle of edges; having more edges with different sources makes the assertion stronger).
The relation, start, and end are all represented in a bracketed list in the URI. The brackets allow assertion URIs to be nested within each other, in the case where you have assertions about assertions. These lists are surrounded by the components /[/ and /]/ and delimited by /,/. For example, the assertion "A dog is an animal" has the URI /a/[/r/IsA/,/c/en/dog/,/c/en/animal/].
The relation is either a language-independent /r/ relation or a language-specific /c/ concept. The start and end can be concepts or assertions. (They can even conceivably be relations, if we add an upper ontology describing how relations relate to each other.)
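To make the bracket syntax concrete, here is a small sketch that builds an assertion URI and parses one back into a nested structure. It is not ConceptNet's own code: the function names are invented for this example, and the parser assumes that no path component is itself exactly `[`, `,`, or `]`.

```python
DELIMITERS = ("[", ",", "]")


def make_compound_uri(head, items):
    """Wrap `items` in the bracketed list syntax, e.g. /a/[/x/,/y/]."""
    return head + "/[" + "/,".join(items) + "/]"


def make_assertion_uri(relation, start, end):
    """Build an assertion URI; start and end may themselves be assertion URIs."""
    return make_compound_uri("/a", [relation, start, end])


def parse_compound_uri(uri):
    """Parse a URI into a string, or a (head, items) tuple for bracketed lists."""
    tokens = uri.strip("/").split("/")
    parsed, pos = _parse_item(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected text after a complete item in %r" % uri)
    return parsed


def _parse_item(tokens, pos):
    # Collect plain path components up to the next delimiter.
    start = pos
    while pos < len(tokens) and tokens[pos] not in DELIMITERS:
        pos += 1
    head = "/" + "/".join(tokens[start:pos])
    if pos == len(tokens) or tokens[pos] != "[":
        return head, pos
    pos += 1  # skip the opening bracket
    items = []
    while True:
        item, pos = _parse_item(tokens, pos)
        items.append(item)
        if pos == len(tokens) or tokens[pos] not in (",", "]"):
            raise ValueError("malformed bracketed list")
        closing = tokens[pos] == "]"
        pos += 1
        if closing:
            return (head, items), pos


dog = make_assertion_uri("/r/IsA", "/c/en/dog", "/c/en/animal")
print(dog)
print(parse_compound_uri(dog))
```

Because the builder treats its arguments as opaque strings, passing an assertion URI as the start or end yields a nested URI, and the parser returns it as a nested tuple.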
A single source of knowledge has a URI that begins with /s. Sources are broken down into more types:
- /s/contributor: a human contributor to a crowd-sourced knowledge base.
- /s/activity: a knowledge-collection task that was being presented by a computer to collect crowd-sourced knowledge.
- /s/rule: an automatic rule for extracting knowledge from a different form.
- /s/site: a knowledge base with the authority of some Web site behind it, such as Wiktionary or DBPedia.
Awkward detail: In ConceptNet 5.2, the source that should be /s/site/dbpedia/3.7 is actually just /s/dbpedia/3.7, making DBPedia a completely separate kind of source. This is probably not desirable, but changing it would change all the edge IDs. We will probably fix this in ConceptNet 5.3.
The sources for an assertion are often conjunctions (/and) or disjunctions (/or) of these individual sources. For example, any edge with a contributor probably has an activity as well, and those would be combined into an /and source.
If the same assertion comes from multiple edges from unrelated sources, those are combined into an /or source.
The sources then appear in bracketed tree-structured URIs, such as:
/or/[/and/[/s/contributor/omcs/havasi/,/s/activity/omcs1/]/,/s/site/dbpedia/3.7/]
/and sources appear within /or sources, but never the other way around.
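The nesting rule can be illustrated with a short sketch that builds source trees from individual source URIs. The helper names here are hypothetical and the check on nesting is only an approximation of the rule stated above; ConceptNet's own code may structure this differently.

```python
def make_compound_uri(head, items):
    """Wrap `items` in the bracketed list syntax, e.g. /and/[/x/,/y/]."""
    return head + "/[" + "/,".join(items) + "/]"


def conjunction_uri(sources):
    """Combine sources that jointly justify one edge into an /and source."""
    for source in sources:
        # /and may appear inside /or, but never the other way around.
        if source.startswith("/or/"):
            raise ValueError("an /or source cannot appear inside /and: %r" % source)
    return make_compound_uri("/and", list(sources))


def disjunction_uri(sources):
    """Combine unrelated sources of the same assertion into an /or source."""
    return make_compound_uri("/or", list(sources))


# One edge came from a contributor working within an activity...
omcs = conjunction_uri(["/s/contributor/omcs/havasi", "/s/activity/omcs1"])
# ...and the same assertion also came, independently, from a site.
print(disjunction_uri([omcs, "/s/site/dbpedia/3.7"]))
```

With these inputs the result is the example URI shown above.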
Edges can differ in many ways, so unlike assertions, an edge has no compact description that is much smaller than the data structure for the edge itself.
Instead, we hash all the data that makes an edge unique, creating a URI such as /e/d4983ab61dd4e4050b29377716fe37fa0704771b. These URIs are a more compact way of deduplicating edges, and you can also sort by them to get a pseudorandom selection of edges.
To make an edge URI, create a list containing the assertion URI, the context, and the conjoined sources that justify the edge, with the sources in alphabetical order. Separate all their URIs with spaces, giving a string like:
/a/[/r/UsedFor/,/c/en/book/,/c/en/learn/] /ctx/all /s/activity/omcs/omcs1,_possibly_free_text /s/contributor/omcs/annedog
Then take the SHA-1 hexadecimal digest of this string.
See the make_edge function in edges.py for an implementation.
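For readers who want to see the recipe in one place, the following is a minimal sketch of the hashing step. It is not the make_edge function itself: `edge_string` and `make_edge_uri` are names invented here, and the sketch assumes the string is encoded as UTF-8 before hashing and that the digest is appended directly to /e/.

```python
import hashlib


def edge_string(assertion_uri, context, sources):
    """Join the assertion URI, the context, and the sorted sources with spaces."""
    return " ".join([assertion_uri, context] + sorted(sources))


def make_edge_uri(assertion_uri, context, sources):
    """Hash the identifying data of an edge into an /e/ URI.

    Assumption: the string is hashed as UTF-8 bytes.
    """
    data = edge_string(assertion_uri, context, sources).encode("utf-8")
    return "/e/" + hashlib.sha1(data).hexdigest()


assertion = "/a/[/r/UsedFor/,/c/en/book/,/c/en/learn/]"
sources = [
    "/s/contributor/omcs/annedog",
    "/s/activity/omcs/omcs1,_possibly_free_text",
]
print(edge_string(assertion, "/ctx/all", sources))
print(make_edge_uri(assertion, "/ctx/all", sources))
```

Sorting the sources before joining means the same edge gets the same URI regardless of the order in which its sources were listed.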