This repository has been archived by the owner on Dec 20, 2019. It is now read-only.

Importing text documents and configuring the annotation script

Daniele Guido edited this page Sep 3, 2015 · 8 revisions

Importing documents involves 5 steps:

  1. prepare a tab-separated csv file that contains the document metadata
  2. set the folder path that stores the documents in the settings.js file
  3. configure settings.js by adjusting the annotation parameters
  4. execute the import script
  5. execute the discover script

1. Create the csv file

An example csv file is available in the repo. The slug column is the unique identifier for your document and is required. Two documents with the same slug are merged during the import process. The special column headers title_*, caption_* and url_* let you localize some of the document metadata; the languages available for a document are listed in its languages field. For instance, if you have a bilingual document in English and French, make sure that:

  1. the related languages field contains the available languages for that document, separated by a comma: en,fr
  2. the fields title_en, title_fr, caption_en, caption_fr, url_en, url_fr are present

Histograph uses the column headers to check that the csv file is consistent with its internal data model: make sure that all the column headers specified in the example are present and that the title_*, caption_* and url_* columns are set for every language.
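
The header requirements above can be checked before running the import. The following is only a minimal sketch: findMissingColumns is a hypothetical helper, not part of Histograph, and it covers only the slug, languages and localized columns named on this page. The example csv in the repo may require further columns that are not listed here.

```javascript
// Sketch only: checks a tab-separated header row for the columns described
// above. findMissingColumns is a hypothetical name, not Histograph code.
const LOCALIZED_PREFIXES = ['title', 'caption', 'url'];

function findMissingColumns(headerLine, languages) {
  const headers = headerLine.split('\t').map((h) => h.trim());
  // slug and languages are the two non-localized columns named on this page.
  const required = ['slug', 'languages'];
  for (const lang of languages) {
    for (const prefix of LOCALIZED_PREFIXES) {
      required.push(`${prefix}_${lang}`);
    }
  }
  return required.filter((name) => !headers.includes(name));
}

// A header row that declares English fully but French only partially.
const header = ['slug', 'languages', 'title_en', 'caption_en', 'url_en', 'title_fr'].join('\t');
console.log(findMissingColumns(header, ['en', 'fr']));
// reports caption_fr and url_fr as missing
```

An empty result means every column this sketch knows about is present for each requested language.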

2. Adjusting settings.js: set the folder path.

Each url_* field must contain a relative path: Histograph looks for that file under the path specified in settings.js:

  paths: {
    media: '/path/to/media/items',
    txt:   '/path/to/txt/items',
    ...
  },

So, if the url_en value for first_document.txt is test/first_document.txt, the complete path would be:

/path/to
├── media/  
└── txt/
    └──test/first_document.txt
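
The lookup illustrated by the tree can be sketched as a plain path join. This assumes that Histograph simply appends the url_* value to the configured text folder, which this page implies but does not state. resolveDocumentPath is a hypothetical helper, and the check that rejects paths leaving the folder is an addition of this sketch.

```javascript
const path = require('path');

// Sketch only: joins a url_* value onto a configured text folder.
// resolveDocumentPath is a hypothetical helper, not Histograph's own code.
function resolveDocumentPath(txtRoot, relativeUrl) {
  // path.posix keeps forward slashes on every platform and normalizes '..'.
  const resolved = path.posix.join(txtRoot, relativeUrl);
  const root = path.posix.normalize(txtRoot).replace(/\/+$/, '');
  // Reject values such as '../other.txt' that would leave the folder.
  if (resolved !== root && !resolved.startsWith(root + '/')) {
    throw new Error(`"${relativeUrl}" points outside ${txtRoot}`);
  }
  return resolved;
}

console.log(resolveDocumentPath('/path/to/txt', 'test/first_document.txt'));
// prints /path/to/txt/test/first_document.txt
```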

3. Adjusting settings.js: configure annotation

Set up the yagoaida or textrazor endpoint:

  yagoaida: {
    endpoint: 'https://gate.d5.mpi-inf.mpg.de/aida/service/disambiguate' 
  },

Configure geonames and/or geocoding to refine the information about the extracted locations:

  geonames : {
    endpoint: 'http://api.geonames.org/searchJSON',
    username: 'your username'
  },

  geocoding: { // google geocoding api
    endpoint: 'https://maps.googleapis.com/maps/api/geocode/json',
    key: 'XXXXXXXXXXXXXX'
  },

Then complete the disambiguation section of the settings.js script accordingly. Configure the fields that will be used for the analysis, without their language suffix:

  disambiguation: {
    fields: [
      "title",
      "caption",
      "url"
    ],
  ...
  }

Then configure the services and geoservices to use, specifying the languages each one supports:

    services: {
      "yagoaida": ['en'],
      // "textrazor": ['en', 'fr', 'de']
    },
    geoservices: {
      'geonames': ['en', 'fr', 'de'],
      'geocoding': ['en', 'fr', 'de']
    }
  },
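
To see what a given configuration implies for one language, the settings can be inspected with a few lines of JavaScript. This is a sketch under assumptions: the object below copies the shape of the disambiguation block shown above, while servicesFor and columnsFor are hypothetical helpers and do not describe how Histograph reads these settings internally.

```javascript
// Sketch only: mirrors the shape of the disambiguation block above.
const disambiguation = {
  fields: ['title', 'caption', 'url'],
  services: {
    yagoaida: ['en'],
  },
  geoservices: {
    geonames: ['en', 'fr', 'de'],
    geocoding: ['en', 'fr', 'de'],
  },
};

// Hypothetical helper: names of the services whose language list includes lang.
function servicesFor(serviceMap, lang) {
  return Object.keys(serviceMap).filter((name) => serviceMap[name].includes(lang));
}

// Hypothetical helper: csv columns matching the configured fields for lang.
function columnsFor(fields, lang) {
  return fields.map((field) => `${field}_${lang}`);
}

console.log(columnsFor(disambiguation.fields, 'fr'));
console.log(servicesFor(disambiguation.services, 'fr'));
console.log(servicesFor(disambiguation.geoservices, 'fr'));
```

With this configuration no entry in services lists French, so enabling textrazor (or another service that lists 'fr') would be needed before French text has an annotation service assigned.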

4. Execute the import script

  > cd /path/to/histograph
  > node scripts/manage.js --task=import-resources --source=/path/to/my-resources.tsv

Wait until all of the documents have been processed.

5. Execute the discover script

Executing the discover script first saves the document metadata in a (resource) node, then extracts the named entities from each text listed in your disambiguation.fields. The annotated version of the resource is stored in Neo4j as a (version:annotation) node, linked to the resource by a (version:annotation)-[:describes]->(resource) relationship. The extracted entities are linked to the resource by a generic appears_in relationship, e.g. (entity:person)-[:appears_in]->(resource). Discover can be run on the whole corpus (you can specify --limit and --offset if you have a lot of documents):

  > cd /path/to/histograph
  > node scripts/manage.js --task=discover-resources --limit=10

or on a single document, by specifying the (resource) node id:

  > cd /path/to/histograph
  > node scripts/manage.js --task=discover-resource --id=567890
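
For a large corpus, the --limit and --offset options mentioned above can be stepped through in batches. The sketch below only builds the argument lists; it assumes that --offset is the number of documents to skip, which this page does not spell out. discoverBatches is a hypothetical helper and the corpus size of 25 is an arbitrary illustration.

```javascript
// Sketch only: builds one argument list per batch for the discover task.
// Assumes --offset means "number of documents to skip" (not confirmed here).
function discoverBatches(totalDocuments, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches = [];
  for (let offset = 0; offset < totalDocuments; offset += batchSize) {
    batches.push([
      'scripts/manage.js',
      '--task=discover-resources',
      `--limit=${batchSize}`,
      `--offset=${offset}`,
    ]);
  }
  return batches;
}

// Print the commands for a corpus of 25 documents, 10 at a time.
for (const args of discoverBatches(25, 10)) {
  console.log(['node', ...args].join(' '));
}
```

Each printed line can be run from /path/to/histograph in turn, waiting for one batch to finish before starting the next.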

That's all!