timeextractor

Time Extractor NLP project - locate dates and times in text documents

Introduction

The project was developed by Digamma.ai. The goal of the project is to develop a library to find and extract time/date information from textual documents.

The main goal is to identify text fragments that relate to a time, date, or period (exact date, time of day, day of the week, month, season, time interval, etc.) and convert them into a structured form. We tried to detect a variety of textual representations and to handle cases such as recurring times (e.g. "every Wednesday").

Installation

Clone the repository and build the .jar file with Maven:

git clone https://github.com/digamma-ai/timeextractor.git timeextractor
cd timeextractor
mvn clean install

The build places a jar with a name like timeextractor.jar in the target/ folder.

Dependencies

This library is built on:

  • joda-time Library for the Java date and time classes
  • opencsv Parser Library
  • JUnit Testing Framework
  • Log4j Logging Service
  • Gson Json Serialization/Deserialization library

Quickstart

DateTimeExtractor is the main entry point to Timeextractor. To use it, construct a DateTimeExtractor instance and invoke the extract() method on it. extract() is a convenience method that extracts date/time fragments from the input text.

| Method | Attributes | Description |
|---|---|---|
| extract() | String text | Extracts date/time fragments with default settings |
| extract() (overload) | String text, Settings settings | Extracts date/time fragments with custom settings |
| extractFromCsv() | String csvPath, String outputPath, String separator, Settings settings | Extracts date/time fragments from .csv file |
| extractJson() | String text, Settings settings | Extracts date/time fragments and saves output to JSON format |
| extractFromCsvToJson() | String csvPath, String outputPath, String separator, Settings settings | Extracts date/time fragments from .csv file to JSON format |

The TemporalExtraction class represents a single extracted date/time fragment.

Here is an example of how DateTimeExtractor and TemporalExtraction are used:

// requires java.util.TreeSet and the Timeextractor classes on the classpath

// input string
String inputText = "Reduced entrance fee after 16:30 except for Thursdays. Closed on Mondays.";

// extract date/time fragments
TreeSet<TemporalExtraction> result = DateTimeExtractor.extract(inputText);

// print extracted results
for (TemporalExtraction elem : result) {
    System.out.println(elem);
}

The output will be:

1 after 16:30, [Temporal[type=TIME_INTERVAL, group=TimeGroup, rule=timeIntervalRule, duration=null, durationInterval=null, set=null, startDate=TimeDate [time=Time [hours=16, minutes=30, seconds=0, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=null, weekOfMonth=null]], endDate=null]], 21, 32

2 Thursdays, [Temporal[type=DATE, group=DateGroup, rule=dayOfWeekRule, duration=null, durationInterval=null, set=null, startDate=TimeDate [time=Time [hours=18, minutes=59, seconds=43, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=TH, weekOfMonth=null]], endDate=TimeDate [time=Time [hours=18, minutes=59, seconds=43, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=TH, weekOfMonth=null]]]], 44, 54

3 Mondays, [Temporal[type=DATE, group=DateGroup, rule=dayOfWeekRule, duration=null, durationInterval=null, set=null, startDate=TimeDate [time=Time [hours=18, minutes=59, seconds=43, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=MO, weekOfMonth=null]], endDate=TimeDate [time=Time [hours=18, minutes=59, seconds=43, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=MO, weekOfMonth=null]]]], 65, 73
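Each printed line ends with two integers. Judging from this example alone, they appear to be the start and end character offsets of the match in the input string, although the README does not state this. The sketch below checks that reading using only the JDK; OffsetCheck and slice() are hypothetical names, not part of Timeextractor.

```java
public class OffsetCheck {
    // Returns the slice of text between two character offsets
    // (start inclusive, end exclusive), as String.substring does.
    static String slice(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String inputText = "Reduced entrance fee after 16:30 except for Thursdays. Closed on Mondays.";

        // Offsets copied from the sample output above.
        System.out.println(slice(inputText, 21, 32)); // after 16:30
        System.out.println(slice(inputText, 44, 54)); // Thursdays.
        System.out.println(slice(inputText, 65, 73)); // Mondays.
    }
}
```

For the two day-of-week matches the slice includes the trailing period, so under this assumption the end offset can reach one character past the printed expression.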

Output Description

The output of the extraction process is a TreeSet of TemporalExtraction objects. This class has the following attributes:

| Attribute | Description |
|---|---|
| String temporalExpression | The date/time fragment that was found |
| Temporal temporal | Details of the date/time fragment |

Temporal class attributes:

| Attribute | Description |
|---|---|
| String type | Type of the found date/time fragment (date, time, relative date, etc.) |
| String group | Group of rules used to extract the date/time fragment |
| String rule | Rule used to extract the date/time fragment |
| Duration duration | Duration of the extracted date/time fragment |
| DurationInterval durationInterval | Duration interval of the extracted date/time fragment |
| Set set | Frequency, interval, and days-of-repetition properties |
| TimeDate startDate | Start date of the extracted date/time fragment |
| TimeDate endDate | End date of the extracted date/time fragment |

Advanced settings

You can modify default extraction settings for some specific scenarios, like:

  • find closest day of week according to current date for relative date;
  • find closest date according to current date for relative date;
  • change found time expression according to specified date and timezone;
  • filter extraction rules;
  • find only dates that are current date or after current date.

A Settings object specifies these additional extraction options: the local user date/time, the time-zone offset, which extraction rules to apply, and whether to keep only the latest dates.

Use SettingsBuilder to construct a Settings instance when you need configuration options other than the defaults. Create the builder, invoke its configuration methods, and finally call build().

| Method | Attributes | Description |
|---|---|---|
| addRulesGroup() | String rulesGroup | Adds extraction rules from rulesGroup group for extracting date/time fragments |
| excludeRules() | String ruleToExclude | Excludes extraction rule ruleToExclude from extracting rules |
| addUserDate() | String userDate | Changes found time expression according to specified user date; correct format: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" |
| addTimeZoneOffset() | String timeZoneOffset | Changes found time expression according to specified user time-zone offset in minutes |
| includeOnlyLatestDates() | boolean includeOnlyLatest | Finds only dates that are current date or after current date |
| build() | | Creates a Settings instance based on the current configuration |

The following example shows how to use SettingsBuilder to construct a Settings instance:

Settings settings = new SettingsBuilder()
         .addRulesGroup("DateGroup")
         .excludeRules("holidaysRule")
         .addUserDate("2017-10-23T18:40:40.931Z")
         .addTimeZoneOffset("100")
         .includeOnlyLatestDates(true)
         .build();
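The string arguments to addUserDate() and addTimeZoneOffset() have to be produced somewhere. The sketch below shows one way to build them with java.time from the JDK. SettingsArgs and its two methods are hypothetical helpers, not part of Timeextractor, and whether the library treats the timestamp as UTC or as local time is not stated here.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class SettingsArgs {
    // Pattern quoted in the addUserDate() description; 'T' and 'Z' are literals.
    private static final DateTimeFormatter USER_DATE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");

    // Formats a date/time as a string in the documented userDate format.
    static String userDate(LocalDateTime dateTime) {
        return dateTime.format(USER_DATE);
    }

    // Expresses a UTC offset as a whole number of minutes, rendered as a string.
    static String offsetMinutes(ZoneOffset offset) {
        return String.valueOf(offset.getTotalSeconds() / 60);
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2017, 10, 23, 18, 40, 40, 931_000_000);
        System.out.println(userDate(t));                                     // 2017-10-23T18:40:40.931Z
        System.out.println(offsetMinutes(ZoneOffset.ofHoursMinutes(1, 40))); // 100
    }
}
```

The two printed values match the arguments used in the SettingsBuilder example above.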

Extraction rules

All extraction rules are divided into rules groups.

| Group / Rule | Description | Example |
|---|---|---|
| DateGroup | Contains rules associated with the date | |
| dayOfWeekRule | Extracts days of week fragments | Come along to celebrate on Saturday 16 |
| relativeDateRule | Extracts relative dates fragments | It was 1 week ago.<br>Went there today. |
| holidaysRule | Extracts holidays dates fragments | We will meet on Christmas day. |
| monthDayRule | Extracts month-day dates fragments | The Snowy Day and the Art of Ezra Jack Keats (through January 29). |
| monthYearDayRule | Extracts year/month/day dates fragments | January 13-19, 2014 Show Times. |
| monthYearRule | Extracts year/month dates fragments | In March 2008, the Golden Gate Bridge District board approved a resolution to implement congestion pricing. |
| yearRule | Extracts year dates fragments | 2013 is also the 850th anniversary of Notre-Dame. |
| DateIntervalGroup | Contains rules associated with the period between two dates | |
| dateIntervalRule | Extracts intervals between dates | $3 off general admission with your uberX receipt from 10/16/13 - 10/18/13!<br>Best time to visit is from Tuesday to Thursday.<br>In main season (May - Sep ) the boat leaves daily exc. |
| DurationGroup | Contains rules associated with the period of time | |
| intervalDurationRule | Extracts duration intervals | It's acceptable to include 10 - 15 years of experience. |
| durationRule | Extracts periods of time | Buy a combined ticket it lasts two days<br>Was told that the last 30min before closing is free. |
| RepeatedGroup | Contains rules associated with repeated events | |
| repeatedRule | Extracts repeated events | Free organ show every Sunday at 4.<br>Try San Francisco City Guides, who offer free weekly tours |
| SeasonGroup | Contains rules associated with seasons of the year | |
| seasonRule | Extracts seasons of the year | In summer months , the park is an anti-urban oasis along the riverfront.<br>Catch the post-impressionist exhibit in the fall! |
| TimeGroup | Contains rules associated with the time | |
| timeRule | Extracts the time | Go before 4pm PST and get there in time for the Tower.<br>The 'Long Walk' on route to the races at about 1.30pm |
| timeIntervalRule | Extracts time intervals | Happy hour from 19 till 20 !!<br>Best between 2:00 pm and 4:00 pm to enjoy the sun |
| timeZoneRule | Extracts time zones | Closed by 21:00CET.<br>Last entry 04:15 UTC |
| WeekendGroup | Contains rules associated with weekends | |
| weekendRule | Extracts weekends | Weekend happy hour 11am-7pm |
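Because addRulesGroup() takes a group name and excludeRules() takes a rule name, it can help to keep the mapping from the table above in one place. The sketch below is a hypothetical lookup helper (RuleCatalog is not part of Timeextractor) whose names are transcribed from that table.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RuleCatalog {
    // Group name -> rule names, transcribed from the extraction rules table.
    static final Map<String, List<String>> RULES_BY_GROUP = new LinkedHashMap<>();

    static {
        RULES_BY_GROUP.put("DateGroup", Arrays.asList(
                "dayOfWeekRule", "relativeDateRule", "holidaysRule", "monthDayRule",
                "monthYearDayRule", "monthYearRule", "yearRule"));
        RULES_BY_GROUP.put("DateIntervalGroup", Arrays.asList("dateIntervalRule"));
        RULES_BY_GROUP.put("DurationGroup", Arrays.asList("intervalDurationRule", "durationRule"));
        RULES_BY_GROUP.put("RepeatedGroup", Arrays.asList("repeatedRule"));
        RULES_BY_GROUP.put("SeasonGroup", Arrays.asList("seasonRule"));
        RULES_BY_GROUP.put("TimeGroup", Arrays.asList("timeRule", "timeIntervalRule", "timeZoneRule"));
        RULES_BY_GROUP.put("WeekendGroup", Arrays.asList("weekendRule"));
    }

    // Returns the group that contains the given rule, or null if the rule is not listed.
    static String groupOf(String rule) {
        for (Map.Entry<String, List<String>> entry : RULES_BY_GROUP.entrySet()) {
            if (entry.getValue().contains(rule)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(groupOf("holidaysRule"));     // DateGroup
        System.out.println(groupOf("timeIntervalRule")); // TimeGroup
    }
}
```

A lookup like this makes it easy to confirm that a rule passed to excludeRules() belongs to a group that was added with addRulesGroup().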