This repository has been archived by the owner on Dec 20, 2019. It is now read-only.

Some insights on database schema

Daniele Guido edited this page Aug 8, 2016 · 2 revisions

Or: how to prepare a Neo4j database to be compatible with the histograph API.

The :resource node

The nodes labelled resource represent any kind of document: pictures, news articles, book chapters, etc. The main properties are listed below:

| name | value | description |
| --- | --- | --- |
| name | string | a generic name for the object |
| slug | string | UNIQUE. The slugified version of the name field |
| type | string | one of the resource types described in your settings.js file |
| languages | array | available languages for the fields title, caption and url (url is optional) |
| title_<lang> | string | one title for each of the languages specified in the languages field, e.g. title_en and/or title_fr |
| caption_<lang> | string | one caption for each of the languages specified in the languages field, see title above |
| full_search | string/text | useful for lucene |
| creation_date | ISO date | |
| creation_time | UNIX time | in milliseconds |
| start_time | UNIX time | the date used in the corpus timeline, in ms from EPOCH |
| end_time | UNIX time | the date used in the corpus timeline, in ms from EPOCH |

Please note the coexistence of two UNIQUE properties: the uuid, which is the identifier, and the slug, which can be used to request the resource through a human-readable identifier. Both UNIQUE fields are enforced by a UNIQUE index.
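A minimal sketch of how the two UNIQUE properties could be enforced. It assumes the legacy `CREATE CONSTRAINT ON ... ASSERT ... IS UNIQUE` syntax of Neo4j 2.x/3.x; this page does not show the statement histograph actually uses, and the helper name `uniqueConstraint` is hypothetical:

```javascript
// Hypothetical helper: builds a legacy (Neo4j 2.x/3.x) uniqueness constraint statement.
function uniqueConstraint(label, property) {
  return 'CREATE CONSTRAINT ON (n:' + label + ') ASSERT n.' + property + ' IS UNIQUE';
}

// One statement per UNIQUE property of the resource node.
const statements = ['uuid', 'slug'].map(function (property) {
  return uniqueConstraint('resource', property);
});

console.log(statements.join(';\n'));
```

Each statement would be run once against the database before importing resources.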

Optional fields are:

| name | value | description |
| --- | --- | --- |
| url | url string | LOCAL url of the resource, see your settings.paths configuration |
| url_<lang> | string | language-specific representation of url (e.g. transcription of interviews), one for each of the languages specified in the languages field |
| ipr_<lang> | string | one copyright/property rights statement for each of the languages specified in the languages field |
| start_date | ISO date | the date used in the corpus timeline, ISO format |
| end_date | ISO date | the date used in the corpus timeline, ISO format |
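The *_date fields hold ISO strings, while the *_time fields hold the same instants as milliseconds from the EPOCH. A minimal sketch of deriving one from the other with the standard JavaScript Date object; the function name `toTimelineFields` and the sample dates are illustrative only:

```javascript
// Hypothetical helper: derives the *_time fields (ms from EPOCH) from ISO date strings.
function toTimelineFields(startIso, endIso) {
  const start = new Date(startIso);
  const end = new Date(endIso);
  if (isNaN(start.getTime()) || isNaN(end.getTime())) {
    throw new Error('invalid ISO date');
  }
  return {
    start_date: start.toISOString(),
    start_time: start.getTime(),
    end_date: end.toISOString(),
    end_time: end.getTime()
  };
}

const fields = toTimelineFields('1970-01-02T00:00:00.000Z', '1970-01-03T00:00:00.000Z');
console.log(fields.start_time); // 86400000 (one day after the EPOCH, in ms)
```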

The Cypher query should look something like this:

// Legacy {param} placeholders (Neo4j 2.x/3.x); Neo4j 4.0+ expects $param instead.
MERGE (res:resource {uuid:{uuid}})
  SET res.uuid = {uuid},
      res.name = {name},
      res.slug = {slug},
      res.type = {type},
      res.languages = {languages},
      res.title_en = {title_en},
      res.caption_en = {caption_en},
      res.creation_date = {creation_date},
      res.creation_time = {creation_time},
      res.full_search = {full_search}
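
The query above expects one parameter per property. A minimal sketch of building that parameter object in JavaScript follows. The names `slugify` and `buildResourceParams`, the input shape, the slug rule (lowercase, non-alphanumerics replaced by hyphens) and the way full_search is assembled are all assumptions made for illustration, not histograph's own code:

```javascript
// Hypothetical slug rule: lowercase, runs of non-alphanumerics become a single hyphen.
function slugify(name) {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Builds the parameters for the MERGE query; `doc` is an illustrative input shape.
function buildResourceParams(doc, now) {
  const params = {
    uuid: doc.uuid,
    name: doc.name,
    slug: slugify(doc.name),
    type: doc.type,
    languages: Object.keys(doc.titles),
    creation_date: now.toISOString(),
    creation_time: now.getTime()
  };
  const searchable = [doc.name];
  params.languages.forEach(function (lang) {
    params['title_' + lang] = doc.titles[lang];
    params['caption_' + lang] = (doc.captions || {})[lang] || '';
    searchable.push(params['title_' + lang], params['caption_' + lang]);
  });
  // Assumption: full_search is simply the searchable text joined together.
  params.full_search = searchable.filter(Boolean).join(' ');
  return params;
}

const params = buildResourceParams({
  uuid: 'example-uuid',
  name: 'Treaty of Rome, 1957',
  type: 'picture', // must match a resource type from your settings.js
  titles: { en: 'Treaty of Rome' },
  captions: { en: 'Signing ceremony' }
}, new Date(0));
console.log(params.slug); // treaty-of-rome-1957
```

The resulting object has exactly the keys used by the query (title_en and caption_en for a resource whose languages array is ['en']).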

1. The :entity node and the [:appears_in] relationship

Each resource can be linked via an appears_in relationship to nodes labelled entity and sub-labelled with a more specific type, among them person, location, theme, institution or social_group:

(ent:entity:person)-[r:appears_in]->(res:resource)
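
A minimal sketch of how such relationships might be prepared. It assumes that entities are identified by a uuid in the same way as resources (this page does not say so) and that each relationship carries the frequency property required by the Tfidf calculation. The query text and the `appearsInRows` helper are illustrative, not histograph's own code:

```javascript
// Illustrative Cypher, using the legacy {param} placeholders of the resource query.
const LINK_QUERY = [
  'MATCH (ent:entity {uuid: {entity_uuid}}), (res:resource {uuid: {resource_uuid}})',
  'MERGE (ent)-[r:appears_in]->(res)',
  'SET r.frequency = {frequency}'
].join('\n');

// Counts how often each entity is mentioned in one resource and
// returns one parameter object per (entity, resource) pair.
function appearsInRows(resourceUuid, mentionedEntityUuids) {
  const counts = {};
  mentionedEntityUuids.forEach(function (uuid) {
    counts[uuid] = (counts[uuid] || 0) + 1;
  });
  return Object.keys(counts).map(function (uuid) {
    return { entity_uuid: uuid, resource_uuid: resourceUuid, frequency: counts[uuid] };
  });
}

const rows = appearsInRows('res-1', ['ent-a', 'ent-b', 'ent-a']);
console.log(LINK_QUERY);
console.log(rows);
```

Each row would be passed as the parameters of one execution of the query, so that every relationship ends up with a frequency of at least 1.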

1.1 Tfidf calculation

The relationship set (ent:entity:person)-[r:appears_in]->(res:resource) is used to calculate a similarity index between two resources or two entities. Every relationship must therefore contain at least the property frequency, an integer giving the number of occurrences of that entity in the (res:resource) document (i.e. at least 1). The frequency is then used to calculate the tfidf value. This is normally done by the tfidf script, accessible via the command line:

$ cd histograph
$ node scripts/manage.js --task=entity.tfidf

The script calculates the tfidf value and enriches each entity node with a) the corresponding df value (document frequency), that is the number of documents in which the entity appears, and b) the specificity value, which normalizes the df value to the total number of documents. The tf and the tfidf values are stored as properties of the appears_in relationship.
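
The quantities above can be illustrated with a small worked example. This is a sketch of one common textbook variant (tf as the entity's share of all occurrences in a resource, idf as ln(N/df)). This page does not spell out the exact formulas used by scripts/manage.js, so the tf and idf formulas and the function name `tfidf` are assumptions:

```javascript
// rows: one item per appears_in relationship, { entity, resource, frequency }.
function tfidf(rows) {
  const resources = {}; // resource -> total frequency of all its entities
  const df = {};        // entity -> number of resources it appears in
  rows.forEach(function (row) {
    resources[row.resource] = (resources[row.resource] || 0) + row.frequency;
    df[row.entity] = (df[row.entity] || 0) + 1;
  });
  // Number of documents; here only resources with at least one entity are counted.
  const total = Object.keys(resources).length;
  const entities = {};
  Object.keys(df).forEach(function (entity) {
    // specificity: df normalized to the total number of documents
    entities[entity] = { df: df[entity], specificity: df[entity] / total };
  });
  const relationships = rows.map(function (row) {
    const tf = row.frequency / resources[row.resource]; // assumed tf definition
    const idf = Math.log(total / df[row.entity]);       // assumed textbook idf
    return { entity: row.entity, resource: row.resource, tf: tf, tfidf: tf * idf };
  });
  return { entities: entities, relationships: relationships };
}

const result = tfidf([
  { entity: 'a', resource: 'r1', frequency: 3 },
  { entity: 'b', resource: 'r1', frequency: 1 },
  { entity: 'a', resource: 'r2', frequency: 2 }
]);
console.log(result.entities.a); // { df: 2, specificity: 1 }
console.log(result.entities.b); // { df: 1, specificity: 0.5 }
```

In this example entity a appears in every document, so its idf (and therefore its tfidf) is 0, while the rarer entity b gets a positive tfidf on its single relationship.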

Each resource has one or more annotation nodes (a subtype of version) in which a yaml field contains the starting and ending positions of entities. (to be continued)