Some insights on database schema
Or how to prepare a Neo4j database to be compatible with the histograph API.
The nodes labelled resource represent any kind of document: pictures, news articles, book chapters, etc.
The main properties are listed below:
name | value | description |
---|---|---|
name | string | a generic name for the object |
slug | string UNIQUE | the slugified version of the name field |
type | string | one of the resource types described in your settings.js file |
languages | array | available languages for the fields title, caption and url (url is optional) |
title_<lang> | string | one title for each of the languages specified in the languages field, e.g. title_en and/or title_fr |
caption_<lang> | string | one caption for each of the languages specified in the languages field, see title above |
full_search | string/text | useful for lucene |
creation_date | ISO date | |
creation_time | UNIX time in milliseconds | |
start_time | UNIX time in milliseconds | the start date used in the corpus timeline, in ms since the Unix epoch |
end_time | UNIX time in milliseconds | the end date used in the corpus timeline, in ms since the Unix epoch |
Please note the coexistence of two UNIQUE properties: the uuid value, which is the identifier, and the slug, which can be used to request the resource through a human-readable index.
Both UNIQUE fields are enforced by a UNIQUE index.
Optional fields are:
name | value | description |
---|---|---|
url | url string | LOCAL url of the resource, cf. your settings.paths configuration |
url_<lang> | string | language specific representation of url, e.g. transcription of interviews for each of the languages specified in languages field |
ipr_<lang> | string | one copyright/property rights for each of the languages specified in languages field |
start_date | ISO date | the start date used in the corpus timeline, in ISO format |
end_date | ISO date | the end date used in the corpus timeline, in ISO format |
The Cypher query should look something like this:
MERGE (res:resource {uuid:{uuid}})
SET res.uuid = {uuid},
res.name = {name},
res.slug = {slug},
res.type = {type},
res.languages = {languages},
res.title_en = {title_en},
res.caption_en = {caption_en},
res.creation_date = {creation_date},
res.creation_time = {creation_time},
res.full_search = {full_search}
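The parameters for the query above could be prepared along these lines. This is only a sketch: slugify and buildResourceParams are hypothetical helpers, not part of histograph, and building full_search by concatenating the name, title and caption is an assumption (the table only says the field is useful for lucene).

```javascript
// Hypothetical helper: turn a name into a URL-friendly slug.
function slugify(name) {
  return name
    .toLowerCase()
    .normalize('NFD')                 // split accented letters into base + diacritic
    .replace(/[\u0300-\u036f]/g, '')  // drop the diacritics
    .replace(/[^a-z0-9]+/g, '-')      // collapse any other run of characters into a dash
    .replace(/^-+|-+$/g, '');         // trim leading and trailing dashes
}

// Hypothetical helper: build the parameter object expected by the MERGE query.
// `now` is injectable so that creation_date and creation_time stay consistent.
function buildResourceParams(input, now = new Date()) {
  return {
    uuid: input.uuid,
    name: input.name,
    slug: slugify(input.name),
    type: input.type,
    languages: input.languages,
    title_en: input.title_en,
    caption_en: input.caption_en,
    creation_date: now.toISOString(), // ISO date
    creation_time: now.getTime(),     // UNIX time in milliseconds
    // Assumption: a plain concatenation of the textual fields.
    full_search: [input.name, input.title_en, input.caption_en]
      .filter(Boolean)
      .join(' ')
  };
}
```

The resulting object can be passed as the parameter map of the query; the slug still has to be unique across resources, so a collision check may be needed before writing.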
Each resource can be linked via an appears_in relationship to nodes labelled entity and sublabelled with a more specific type, one of person, location, theme, institution or social_group:
(ent:entity:person)-[r:appears_in]->(res:resource)
The set of (ent:entity:person)-[r:appears_in]->(res:resource) relationships is used to calculate the similarity index between two resources or two entities. Every relationship must therefore contain at least the property frequency, an integer stating the number of occurrences of that entity in the (res:resource) document (i.e. at least 1).
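As an illustration, the relationship and its frequency property might be written like this. The query uses the same {param} style as the resource query on this page; matching the entity by uuid, the parameter names and the buildAppearsInParams helper are all assumptions made for the example.

```javascript
// Assumption: entities can be matched by a uuid property, like resources.
const LINK_QUERY = [
  'MATCH (ent:entity {uuid: {entity_uuid}}), (res:resource {uuid: {resource_uuid}})',
  'MERGE (ent)-[r:appears_in]->(res)',
  'SET r.frequency = {frequency}'
].join('\n');

// Hypothetical helper: validate the frequency before sending it to the database.
function buildAppearsInParams(entityUuid, resourceUuid, frequency) {
  // frequency counts occurrences, so it has to be an integer of at least 1
  if (!Number.isInteger(frequency) || frequency < 1) {
    throw new RangeError('frequency must be an integer >= 1');
  }
  return {
    entity_uuid: entityUuid,
    resource_uuid: resourceUuid,
    frequency: frequency
  };
}
```

Rejecting invalid frequencies before the write keeps the later tfidf computation from dealing with zero or fractional counts.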
The frequency is then used to calculate the tfidf value. This is normally done by the tfidf script, accessible via the command line:
$ cd histograph
$ node scripts/manage.js --task=entity.tfidf
The script calculates the tfidf value and enriches each entity node with a) the corresponding df value (document frequency), that is the number of documents where the entity appears, and b) the specificity value, which normalizes the df value to the total number of documents. The tf and the tfidf values are stored as properties of the appears_in relationship.
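To make the quantities concrete, here is a minimal in-memory sketch of the computation. It is not the code of the entity.tfidf task: the tf normalisation (frequency divided by the sum of frequencies in the resource) and the idf formula (natural log of total documents over df) are the textbook definitions and are assumed here; the actual script may differ. The computeTfidf function and the shape of its input are hypothetical.

```javascript
// links: array of { entity, resource, frequency }, one per appears_in relationship
function computeTfidf(links) {
  const totalDocs = new Set(links.map(l => l.resource)).size;

  const docsByEntity = new Map();    // entity -> set of resources it appears in
  const totalByResource = new Map(); // resource -> sum of all entity frequencies
  for (const l of links) {
    if (!docsByEntity.has(l.entity)) docsByEntity.set(l.entity, new Set());
    docsByEntity.get(l.entity).add(l.resource);
    totalByResource.set(l.resource, (totalByResource.get(l.resource) || 0) + l.frequency);
  }

  // Per-entity values: df and specificity (df normalised by the number of documents)
  const entities = {};
  for (const [entity, docs] of docsByEntity) {
    const df = docs.size;
    entities[entity] = { df: df, specificity: df / totalDocs };
  }

  // Per-relationship values: tf and tfidf (assumed textbook formulas)
  const relationships = links.map(l => {
    const tf = l.frequency / totalByResource.get(l.resource);
    const idf = Math.log(totalDocs / entities[l.entity].df);
    return Object.assign({}, l, { tf: tf, tfidf: tf * idf });
  });

  return { entities: entities, relationships: relationships };
}
```

With these formulas an entity that appears in every resource gets a specificity of 1 and a tfidf of 0 on all its relationships, which is why very common entities contribute little to similarity.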
Each resource has one or more annotation nodes (a subtype of version) where a yaml field contains the starting and ending positions of entities. (to be continued)