Time Extractor NLP project - locate dates and times in text documents
The project was developed by Digamma.ai. Its goal is to provide a library that finds and extracts time/date information from textual documents.
The main task is to identify text fragments related to a time, date, or period (exact date, time of day, day of the week, months, seasons, time intervals, etc.) and turn them into structured forms. We tried to detect a variety of textual representations and to handle things like recurring times (e.g. "every Wednesday").
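To make "turning a text fragment into a structured form" concrete, here is a minimal, self-contained sketch that finds 24-hour clock times with a regular expression and records the matched text, its character offsets, and the parsed hour and minute. It is an illustration only: the class `TimeOfDaySketch`, its nested `Match` type, and the regular expression are hypothetical and are not part of the library, which covers far more expression types than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: not the library's implementation or API.
final class TimeOfDaySketch {

    // Hypothetical structured form of one match: the matched text, its
    // character offsets in the input, and the parsed hour and minute.
    static final class Match {
        final String text;
        final int start;
        final int end;
        final int hours;
        final int minutes;

        Match(String text, int start, int end, int hours, int minutes) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.hours = hours;
            this.minutes = minutes;
        }

        @Override
        public String toString() {
            return String.format("%s [%d, %d) hours=%d minutes=%d",
                    text, start, end, hours, minutes);
        }
    }

    // Matches 24-hour times such as "9:05" or "16:30".
    private static final Pattern TIME =
            Pattern.compile("\\b([01]?\\d|2[0-3]):([0-5]\\d)\\b");

    // Scans the input left to right and collects every clock time found.
    static List<Match> find(String input) {
        List<Match> matches = new ArrayList<>();
        Matcher m = TIME.matcher(input);
        while (m.find()) {
            matches.add(new Match(m.group(), m.start(), m.end(),
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))));
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Reduced entrance fee after 16:30 except for Thursdays.";
        for (Match match : find(text)) {
            System.out.println(match);
        }
    }
}
```

A single pattern like this only handles one surface form; grouping many such rules is what lets an extractor cover dates, intervals, and recurring expressions as well.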
Clone the repository and build the `.jar` file with:
git clone https://github.com/digamma-ai/timeextractor.git timeextractor
cd timeextractor
mvn clean install
You will find a jar named like `timeextractor.jar` in the `target/` folder.
This library is built on:
- joda-time Library for the Java date and time classes
- opencsv Parser Library
- JUnit Testing Framework
- Log4j Logging Service
- Gson Json Serialization/Deserialization library
`DateTimeExtractor` is the main class for using Timeextractor. Its `extract()` method is a convenience method that extracts date/time fragments from input text.
Method | Parameters | Description |
---|---|---|
`extract()` | `String text` | Extracts date/time fragments with default settings |
`extract()` (overload) | `String text, Settings settings` | Extracts date/time fragments with custom settings |
`extractFromCsv()` | `String csvPath, String outputPath, String separator, Settings settings` | Extracts date/time fragments from a .csv file |
`extractJson()` | `String text, Settings settings` | Extracts date/time fragments and saves the output in JSON format |
`extractFromCsvToJson()` | `String csvPath, String outputPath, String separator, Settings settings` | Extracts date/time fragments from a .csv file and saves the output in JSON format |
The `TemporalExtraction` class represents a single extracted date/time fragment.
Here is an example of how `DateTimeExtractor` and `TemporalExtraction` are used:
// input string
String inputText = "Reduced entrance fee after 16:30 except for Thursdays. Closed on Mondays.";
// extract date/time fragments
TreeSet<TemporalExtraction> result = DateTimeExtractor.extract(inputText);
// print extracted results
for (TemporalExtraction elem : result) {
    System.out.println(elem);
}
The output will be:
1 after 16:30, [Temporal[type=TIME_INTERVAL, group=TimeGroup, rule=timeIntervalRule, duration=null, durationInterval=null, set=null, startDate=TimeDate [time=Time [hours=16, minutes=30, seconds=0, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=null, weekOfMonth=null]], endDate=null]], 21, 32
2 Thursdays, [Temporal[type=DATE, group=DateGroup, rule=dayOfWeekRule, duration=null, durationInterval=null, set=null, startDate=TimeDate [time=Time [hours=18, minutes=59, seconds=43, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=TH, weekOfMonth=null]], endDate=TimeDate [time=Time [hours=18, minutes=59, seconds=43, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=TH, weekOfMonth=null]]]], 44, 54
3 Mondays, [Temporal[type=DATE, group=DateGroup, rule=dayOfWeekRule, duration=null, durationInterval=null, set=null, startDate=TimeDate [time=Time [hours=18, minutes=59, seconds=43, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=MO, weekOfMonth=null]], endDate=TimeDate [time=Time [hours=18, minutes=59, seconds=43, timezoneOffset=0], date=Date [year=2017, month=10, day=24, dayOfWeek=MO, weekOfMonth=null]]]], 65, 73
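Each sample line ends with two numbers that look like character offsets into the input text. That reading is an inference from the sample, not something stated here, so treat the sketch below as a way to check the assumption rather than as documented behavior. The class `OffsetSketch` is hypothetical and uses only `String.substring`.

```java
// Sketch: reads the two trailing numbers of each sample output line as
// [start, end) character offsets into the input. This is an assumption
// drawn from the sample output, not documented behavior.
final class OffsetSketch {

    // Returns the slice of the input selected by a start/end offset pair.
    static String slice(String input, int start, int end) {
        return input.substring(start, end);
    }

    public static void main(String[] args) {
        String inputText =
                "Reduced entrance fee after 16:30 except for Thursdays. Closed on Mondays.";
        System.out.println(slice(inputText, 21, 32)); // prints "after 16:30"
        System.out.println(slice(inputText, 44, 54)); // prints "Thursdays."
        System.out.println(slice(inputText, 65, 73)); // prints "Mondays."
    }
}
```

Under this reading, the first pair selects exactly the reported expression, while the other two pairs also cover the trailing period.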
The output of the extraction process is a `TreeSet` of `TemporalExtraction` objects. This class has the following attributes:
Attributes | Description |
---|---|
`String temporalExpression` | The date/time fragment that was found |
`Temporal temporal` | Details of the date/time fragment |
`Temporal` class attributes:
Attributes | Description |
---|---|
`String type` | Type of the found date/time fragment (date, time, relative date, etc.) |
`String group` | Group of rules used to extract the date/time fragment |
`String rule` | Rule used to extract the date/time fragment |
`Duration duration` | Duration of the extracted date/time fragment |
`DurationInterval durationInterval` | Duration interval of the extracted date/time fragment |
`Set set` | Set of frequency, interval, and days-of-repetition properties |
`TimeDate startDate` | Start date of the extracted date/time fragment |
`TimeDate endDate` | End date of the extracted date/time fragment |
You can modify default extraction settings for some specific scenarios, like:
- find closest day of week according to current date for relative date;
- find closest date according to current date for relative date;
- change found time expression according to specified date and timezone;
- filter extraction rules;
- find only dates that are current date or after current date.
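As an illustration of the first and last scenarios, the sketch below resolves a bare day of the week such as "Thursday" to a concrete date relative to a reference date, once taking the nearest occurrence in either direction and once taking only the reference date or later. It uses `java.time` only; the class `RelativeDateSketch` and its method names are hypothetical, and how the library itself resolves such dates is not specified here.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;

// Illustrative sketch only: not the library's resolution logic.
final class RelativeDateSketch {

    // Nearest date that falls on `day`, looking both backward and forward.
    // A tie cannot occur: either both candidates are the reference date
    // itself, or their distances add up to 7 days.
    static LocalDate closest(LocalDate reference, DayOfWeek day) {
        LocalDate before = reference.with(TemporalAdjusters.previousOrSame(day));
        LocalDate after = reference.with(TemporalAdjusters.nextOrSame(day));
        long daysBack = ChronoUnit.DAYS.between(before, reference);
        long daysAhead = ChronoUnit.DAYS.between(reference, after);
        return daysBack < daysAhead ? before : after;
    }

    // Nearest date on `day` that is the reference date or later.
    static LocalDate closestOnOrAfter(LocalDate reference, DayOfWeek day) {
        return reference.with(TemporalAdjusters.nextOrSame(day));
    }

    public static void main(String[] args) {
        LocalDate reference = LocalDate.of(2017, 10, 23); // a Monday
        System.out.println(closest(reference, DayOfWeek.THURSDAY));          // 2017-10-26
        System.out.println(closest(reference, DayOfWeek.SATURDAY));          // 2017-10-21
        System.out.println(closestOnOrAfter(reference, DayOfWeek.SATURDAY)); // 2017-10-28
    }
}
```

The Saturday case shows why the two scenarios differ: the nearest Saturday is two days in the past, so restricting results to the current date or later moves the answer a week forward.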
A `Settings` instance specifies additional extraction options, such as the local user date/time, the time-zone offset, extraction-rule filters, and whether to find only the latest dates.
`SettingsBuilder` constructs a `Settings` instance when you need configuration options other than the defaults. To use it, create a builder, invoke its configuration methods, and finally call `build()`.
Method | Parameters | Description |
---|---|---|
`addRulesGroup()` | `String rulesGroup` | Adds the extraction rules of the `rulesGroup` group to the rules used for extraction |
`excludeRules()` | `String ruleToExclude` | Excludes the rule `ruleToExclude` from the rules used for extraction |
`addUserDate()` | `String userDate` | Adjusts found time expressions according to the specified user date; expected format: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" |
`addTimeZoneOffset()` | `String timeZoneOffset` | Adjusts found time expressions according to the specified user time-zone offset, in minutes |
`includeOnlyLatestDates()` | `boolean includeOnlyLatest` | Finds only dates that are the current date or later |
`build()` | | Creates a `Settings` instance based on the current configuration |
The following example shows how to use `SettingsBuilder` to construct a `Settings` instance:
Settings settings = new SettingsBuilder()
        .addRulesGroup("DateGroup")
        .excludeRules("holidaysRule")
        .addUserDate("2017-10-23T18:40:40.931Z")
        .addTimeZoneOffset("100")
        .includeOnlyLatestDates(true)
        .build();
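The two string arguments above have specific shapes: a timestamp in the format "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" and an offset expressed in minutes. A minimal sketch of producing both with `java.time` might look like the following. The class `SettingsInputSketch` and its methods are hypothetical helpers, and two points are assumptions: that the literal `Z` means the timestamp is in UTC, and that offsets east of UTC are positive.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative helpers only: not part of the library.
final class SettingsInputSketch {

    // Pattern quoted in the addUserDate() description. 'Z' is a literal in
    // this pattern, so the instant is rendered in UTC (an assumption).
    private static final DateTimeFormatter USER_DATE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
                    .withZone(ZoneOffset.UTC);

    // Formats an instant as a user-date string.
    static String userDate(Instant instant) {
        return USER_DATE.format(instant);
    }

    // Offset of `zone` from UTC at `instant`, in minutes, as a string.
    // Assumes east-of-UTC offsets are positive.
    static String offsetMinutes(Instant instant, ZoneId zone) {
        int seconds = zone.getRules().getOffset(instant).getTotalSeconds();
        return Integer.toString(seconds / 60);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2017-10-23T18:40:40.931Z");
        System.out.println(userDate(now));                                      // 2017-10-23T18:40:40.931Z
        System.out.println(offsetMinutes(now, ZoneOffset.ofHoursMinutes(1, 40))); // 100
    }
}
```

The resulting strings could then be passed to `addUserDate()` and `addTimeZoneOffset()` respectively.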
All extraction rules are divided into rules groups.
Group / Rule | Description | Example |
---|---|---|
DateGroup | Contains rules associated with the date | |
dayOfWeekRule | Extracts days of week fragments | Come along to celebrate on Saturday 16 |
relativeDateRule | Extracts relative dates fragments | It was 1 week ago. Went there today. |
holidaysRule | Extracts holidays dates fragments | We will meet on Christmas day. |
monthDayRule | Extracts month-day dates fragments | The Snowy Day and the Art of Ezra Jack Keats (through January 29). |
monthYearDayRule | Extracts year/month/day dates fragments | January 13-19, 2014 Show Times. |
monthYearRule | Extracts year/month dates fragments | In March 2008, the Golden Gate Bridge District board approved a resolution to implement congestion pricing. |
yearRule | Extracts year dates fragments | 2013 is also the 850th anniversary of Notre-Dame. |
DateIntervalGroup | Contains rules associated with the period between two dates | |
dateIntervalRule | Extracts intervals between dates | $3 off general admission with your uberX receipt from 10/16/13 - 10/18/13! Best time to visit is from Tuesday to Thursday. In main season (May - Sep ) the boat leaves daily exc. |
DurationGroup | Contains rules associated with the period of time. | |
intervalDurationRule | Extracts duration intervals | It's acceptable to include 10 - 15 years of experience. |
durationRule | Extracts periods of time | Buy a combined ticket it lasts two days. Was told that the last 30min before closing is free. |
RepeatedGroup | Contains rules associated with repeated events. | |
repeatedRule | Extracts repeated events | Free organ show every Sunday at 4. Try San Francisco City Guides, who offer free weekly tours |
SeasonGroup | Contains rules associated with seasons of the year. | |
seasonRule | Extracts seasons of the year | In summer months , the park is an anti-urban oasis along the riverfront. Catch the post-impressionist exhibit in the fall! |
TimeGroup | Contains rules associated with the time. | |
timeRule | Extracts the time | Go before 4pm PST and get there in time for the Tower. The 'Long Walk' on route to the races at about 1.30pm |
timeIntervalRule | Extracts time intervals | Happy hour from 19 till 20 !! Best between 2:00 pm and 4:00 pm to enjoy the sun |
timeZoneRule | Extracts time zones | Closed by 21:00CET. Last entry 04:15 UTC |
WeekendGroup | Contains rules associated with weekends | |
weekendRule | Extracts weekends | Weekend happy hour 11am-7pm |
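To show how group inclusion and rule exclusion can combine, here is a small sketch that selects rule names from the table above. It is a model of the idea only: the class `RuleSelectionSketch` is hypothetical, just two groups are copied in for brevity, and the semantics (no included groups means all groups, exclusions are applied last) are assumptions rather than the library's documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: not the library's rule-filtering code.
final class RuleSelectionSketch {

    // Group-to-rule mapping copied from the table (two groups shown).
    static final Map<String, List<String>> GROUPS = new LinkedHashMap<>();
    static {
        GROUPS.put("DateGroup", Arrays.asList(
                "dayOfWeekRule", "relativeDateRule", "holidaysRule", "monthDayRule",
                "monthYearDayRule", "monthYearRule", "yearRule"));
        GROUPS.put("TimeGroup", Arrays.asList(
                "timeRule", "timeIntervalRule", "timeZoneRule"));
    }

    // Assumed semantics: an empty set of included groups means "all groups";
    // excluded rules are removed after groups are chosen.
    static List<String> select(Set<String> includedGroups, Set<String> excludedRules) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : GROUPS.entrySet()) {
            boolean included = includedGroups.isEmpty()
                    || includedGroups.contains(group.getKey());
            if (!included) {
                continue;
            }
            for (String rule : group.getValue()) {
                if (!excludedRules.contains(rule)) {
                    selected.add(rule);
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Mirrors addRulesGroup("DateGroup") combined with excludeRules("holidaysRule").
        System.out.println(select(Collections.singleton("DateGroup"),
                Collections.singleton("holidaysRule")));
    }
}
```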