Update for stanford 3.4.1 #32

Open: wants to merge 2 commits into base: master
40 changes: 23 additions & 17 deletions README.md
@@ -1,13 +1,16 @@
[![Build Status](https://secure.travis-ci.org/louismullie/stanford-core-nlp.png)](http://travis-ci.org/louismullie/stanford-core-nlp)

**About**

This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools for tokenization, sentence segmentation, part-of-speech tagging, lemmatization, and parsing of English, French and German. The package also provides named entity recognition and coreference resolution for English.

This gem is compatible with Ruby 1.9.2 and 1.9.3 as well as JRuby 1.7.1. It is tested on both Java 6 and Java 7.
This gem is compatible with Ruby 1.9.2, 1.9.3, and 2.1.1 as well as JRuby 1.7.1. It is tested on both Java 6 and Java 7.
It is only compatible with Stanford's 3.4.1 release and above. Serious repackaging occurred between the 3.3 and 3.4 versions.

**Installing**

NOTE: Please see the instructions under "Using the latest version of the Stanford CoreNLP" below. The packaging of the Stanford release has changed.

First, install the gem: `gem install stanford-core-nlp`. Then, download the Stanford Core NLP JAR and model files. Two packages are available:

* A [minimal package](http://louismullie.com/treat/stanford-core-nlp-minimal.zip) with the default tagger and parser models for English, French and German.
@@ -71,7 +74,7 @@ text.get(:sentences).each do |sentence|
puts token.get(:named_entity_tag).to_s
# Coreference
puts token.get(:coref_cluster_id).to_s
# Also of interest: coref, coref_chain,
# Also of interest: coref, coref_chain,
# coref_cluster, coref_dest, coref_graph.
end
end
@@ -81,7 +84,7 @@ end

The Ruby symbol (e.g. `:named_entity_tag`) corresponding to a Java annotation class is the `snake_case` of the class name, with 'Annotation' at the end removed. For example, `NamedEntityTagAnnotation` translates to `:named_entity_tag`, `PartOfSpeechAnnotation` to `:part_of_speech`, etc.
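
The naming rule above can be sketched in a few lines of Ruby. `annotation_symbol` is a hypothetical helper written for illustration only; it is not part of the gem's API, and the gem's own conversion may treat edge cases (for example, runs of capital letters) differently.

```ruby
# Hypothetical helper (not part of the gem): derive the Ruby symbol for a
# Java annotation class name by dropping the trailing 'Annotation' and
# converting the remainder to snake_case.
def annotation_symbol(class_name)
  class_name
    .sub(/Annotation\z/, '')             # remove the 'Annotation' suffix
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')   # add '_' at each lower-to-upper boundary
    .downcase
    .to_sym
end

puts annotation_symbol('NamedEntityTagAnnotation').inspect
puts annotation_symbol('PartOfSpeechAnnotation').inspect
```

Running the conversion in the other direction is a quick way to guess which Javadoc class a symbol such as `:coref_cluster_id` refers to.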

A good reference for names of annotations are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CoreCorefAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the `config.rb` file inside the gem.
A good reference for names of annotations are the Stanford Javadocs for [CoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/CoreAnnotations.html), [CoreCorefAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/dcoref/CorefCoreAnnotations.html), and [TreeCoreAnnotations](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreeCoreAnnotations.html). For a full list of all possible annotations, see the `config.rb` file inside the gem.


**Loading specific classes**
@@ -90,12 +93,12 @@ You may want to load additional Java classes (including any class from the Stanf

```ruby
# Default base class is edu.stanford.nlp.pipeline.
StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
# => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>

# Here, we specify another base class.
StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
puts StanfordCoreNLP::MaxentTagger.inspect
# => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
```
@@ -147,24 +150,26 @@ To run the specs for each language (after copying the JARs into the `bin` folder

**Using the latest version of the Stanford CoreNLP**

Using the latest version of the Stanford CoreNLP (version 3.3.1 as of 6/1/2014) requires some additional manual steps:
Using the latest version of the Stanford CoreNLP (version 3.4.1 as of 8/27/2014) requires some additional manual steps:

* Download [Stanford CoreNLP version 3.3.1](http://nlp.stanford.edu/software/stanford-corenlp-full-2014-01-04.zip) from http://nlp.stanford.edu/.
* Download [Stanford CoreNLP version 3.4.1](http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip) from http://nlp.stanford.edu/.
* Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory location configured by setting StanfordCoreNLP.jar_path.
* Download [the full Stanford Tagger version 3.3.1](http://nlp.stanford.edu/software/stanford-postagger-full-2014-01-04.zip) from http://nlp.stanford.edu/.
* Extract the contents of the stanford-corenlp-3.4.1-models.jar file in the bin folder. The JAR and its exploded file structure are both accessed by the gem. Note that if you place the exploded Stanford model files outside the gem's bin folder, StanfordCoreNLP.model_path should be set to the root of that file structure.
* Download [the full Stanford Tagger version 3.4.1](http://nlp.stanford.edu/software/stanford-postagger-2014-08-27.zip) from http://nlp.stanford.edu/.
* Make a directory named 'taggers' inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory configured by setting StanfordCoreNLP.jar_path.
* Place the contents of the extracted archive inside the taggers directory.
* Download [the bridge.jar file](https://github.com/louismullie/stanford-core-nlp/blob/master/bin/bridge.jar?raw=true) from https://github.com/louismullie/stanford-core-nlp.
* Place the downloaded bridger.jar file inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/taggers/) or inside the directory configured by setting StanfordCoreNLP.jar_path.
* Place the downloaded bridge.jar file inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/taggers/) or inside the directory configured by setting StanfordCoreNLP.jar_path.
* Configure your setup (for English) as follows:
```ruby
StanfordCoreNLP.use :english
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.default_jars = [
'joda-time.jar',
'xom.jar',
'stanford-corenlp-3.3.1.jar',
'stanford-corenlp-3.3.1-models.jar',
'stanford-corenlp-3.4.1.jar',
'stanford-corenlp-3.4.1-models.jar',
'jollyday.jar',
'bridge.jar'
]
@@ -178,8 +183,8 @@ StanfordCoreNLP.set_model('pos.model', 'french.tagger')
StanfordCoreNLP.default_jars = [
'joda-time.jar',
'xom.jar',
'stanford-corenlp-3.3.1.jar',
'stanford-corenlp-3.3.1-models.jar',
'stanford-corenlp-3.4.1.jar',
'stanford-corenlp-3.4.1-models.jar',
'jollyday.jar',
'bridge.jar'
]
@@ -193,16 +198,17 @@ StanfordCoreNLP.set_model('pos.model', 'german-fast.tagger')
StanfordCoreNLP.default_jars = [
'joda-time.jar',
'xom.jar',
'stanford-corenlp-3.3.1.jar',
'stanford-corenlp-3.3.1-models.jar',
'stanford-corenlp-3.4.1.jar',
'stanford-corenlp-3.4.1-models.jar',
'jollyday.jar',
'bridge.jar'
]
end
```
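
Because the manual steps above involve copying several archives by hand, it can help to confirm that every JAR named in `default_jars` is actually present before loading a pipeline. The following is a minimal sketch under stated assumptions: `missing_jars` is a hypothetical helper, not something the gem provides, and the temporary directory merely stands in for the gem's bin folder (or whatever `StanfordCoreNLP.jar_path` points to).

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of the gem): return the JAR names that
# are not present in the given directory.
def missing_jars(jar_dir, jar_names)
  jar_names.reject { |name| File.exist?(File.join(jar_dir, name)) }
end

# Demonstration: a temporary directory stands in for the gem's bin folder.
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'joda-time.jar'))
  FileUtils.touch(File.join(dir, 'xom.jar'))
  wanted = ['joda-time.jar', 'xom.jar', 'stanford-corenlp-3.4.1.jar']
  puts missing_jars(dir, wanted).inspect
end
```

In a real setup one would pass the configured JAR directory and the same list assigned to `StanfordCoreNLP.default_jars`, and treat a non-empty result as a sign that a download or copy step was skipped.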

**Contributing**

Simple.

1. Fork the project.
2. Send me a pull request!
2. Send me a pull request!
46 changes: 23 additions & 23 deletions lib/stanford-core-nlp.rb
@@ -26,16 +26,16 @@ module StanfordCoreNLP
StanfordCoreNLP.log_file = nil

# Default JAR files to load.
# Note: the JARs must be version 3.4.1 or above.
StanfordCoreNLP.default_jars = [
'joda-time.jar',
'xom.jar',
'stanford-parser.jar',
'stanford-corenlp.jar',
'stanford-segmenter.jar',
'jollyday.jar',
'bridge.jar'
]


# Default classes to load.
StanfordCoreNLP.default_classes = [
['StanfordCoreNLP', 'edu.stanford.nlp.pipeline', 'CoreNLP'],
@@ -57,15 +57,15 @@ module StanfordCoreNLP

require 'stanford-core-nlp/bridge'
extend StanfordCoreNLP::Bridge

class << self
# The model file names for a given language.
attr_accessor :model_files
# The folder in which to look for models.
attr_accessor :model_path
# Store the language currently being used.
attr_accessor :language
#Custom properties
#Custom properties
attr_accessor :custom_properties
end

@@ -75,7 +75,7 @@ class << self
# with the individual models inside. By default, this
# is the same as the JAR path.
self.model_path = self.jar_path

# ########################### #
# Public configuration params #
# ########################### #
@@ -106,7 +106,7 @@ def self.use(language)

# Use english by default.
self.use :english

# Set a model file.
def self.set_model(name, file)
n = name.split('.')[0].intern
@@ -118,7 +118,7 @@ def self.set_model(name, file)
# ########################### #

def self.bind

# Take care of Windows users.
if self.running_on_windows?
self.jar_path.gsub!('/', '\\')
@@ -133,16 +133,16 @@ def self.bind
klass = const_get(info.first)
self.inject_get_method(klass)
end

end

# Load a StanfordCoreNLP pipeline with the
# specified JVM flags and StanfordCoreNLP
# properties.
def self.load(*annotators)

self.bind unless self.bound

# Prepend the JAR path to the model files.
properties = {}
self.model_files.each do |k,v|
@@ -160,7 +160,7 @@ def self.load(*annotators)
end
properties[k] = f
end

properties['annotators'] = annotators.map { |x| x.to_s }.join(', ')

unless self.language == :english
@@ -172,46 +172,46 @@ def self.load(*annotators)
# Otherwise throws java.lang.NullPointerException: null.
properties['parse.buildgraphs'] = 'false'
end

# Bug fix for NER system. Otherwise throws:
# Error initializing binder 1 at edu.stanford.
# nlp.time.Options.<init>(Options.java:88)
properties['sutime.binders'] = '0'

# Manually include SUTime models.
if annotators.include?(:ner)
properties['sutime.rules'] =
self.model_path + 'sutime/defs.sutime.txt, ' +
self.model_path + 'sutime/english.sutime.txt'
properties['sutime.rules'] =
self.model_path + './edu/stanford/nlp/models/sutime/defs.sutime.txt, ' +
self.model_path + './edu/stanford/nlp/models/sutime/english.sutime.txt'
end

props = get_properties(properties)

# Hack for Java7 compatibility.
bridge = const_get(:AnnotationBridge)
bridge.getPipelineWithProperties(props)

end

# Hack in order not to break backwards compatibility.
def self.const_missing(const)
if const == :Text
puts "WARNING: StanfordCoreNLP::Text has been deprecated. " +
"Please use StanfordCoreNLP::Annotation instead."
Annotation
else
else
super(const)
end
end

private

# Create a java.util.Properties object from a hash.
def self.get_properties(properties)
properties = properties.merge(self.custom_properties)
props = Properties.new
properties.each do |property, value|
props.set_property(property.to_s, value.to_s)
props.set_property(property, value)
end
props
end
12 changes: 6 additions & 6 deletions lib/stanford-core-nlp/config.rb
@@ -12,10 +12,10 @@ class Config

# Folders inside the JAR path for the models.
ModelFolders = {
:pos => 'taggers/',
:parse => 'grammar/',
:ner => 'classifiers/',
:dcoref => 'dcoref/'
:pos => 'edu/stanford/nlp/models/pos-tagger/english-left3words/',
:parse => '/edu/stanford/nlp/models/lexparser/',
:ner => '/edu/stanford/nlp/models/ner/',
:dcoref => '/edu/stanford/nlp/models/dcoref/'
}

# Tag sets used by Stanford for each language.
@@ -41,7 +41,7 @@ class Config
},

:ner => {
:english => 'all.3class.distsim.crf.ser.gz'
:english => 'english.all.3class.distsim.crf.ser.gz'
# :german => {} # Add this at some point.
},

@@ -351,7 +351,7 @@ class Config
'ConstraintAnnotation'
],

'nlp.trees.semgraph.SemanticGraphCoreAnnotations' => [
'nlp.semgraph.SemanticGraphCoreAnnotations' => [
'BasicDependenciesAnnotation',
'CollapsedCCProcessedDependenciesAnnotation',
'CollapsedDependenciesAnnotation'
6 changes: 3 additions & 3 deletions spec/english_spec.rb
@@ -9,8 +9,8 @@
StanfordCoreNLP.default_jars = [
'joda-time.jar',
'xom.jar',
'stanford-corenlp-3.3.1.jar',
'stanford-corenlp-3.3.1-models.jar',
'stanford-corenlp-3.4.1.jar',
'stanford-corenlp-3.4.1-models.jar',
'jollyday.jar',
'bridge.jar'
]
@@ -57,4 +57,4 @@
pipeline.annotate(annotation)
annotation.get(:sentences).size.should eql 2
end
end
end
6 changes: 3 additions & 3 deletions spec/french_spec.rb
@@ -10,8 +10,8 @@
StanfordCoreNLP.default_jars = [
'joda-time.jar',
'xom.jar',
'stanford-corenlp-3.3.1.jar',
'stanford-corenlp-3.3.1-models.jar',
'stanford-corenlp-3.4.1.jar',
'stanford-corenlp-3.4.1-models.jar',
'jollyday.jar',
'bridge.jar'
]
@@ -36,4 +36,4 @@
last_char.should eql [7, 8, 11, 16, 20, 23, 28, 35, 38, 46, 47, 50, 54, 56, 57, 58, 64, 67, 75, 76]
end
end
end
end
6 changes: 3 additions & 3 deletions spec/german_spec.rb
@@ -10,8 +10,8 @@
StanfordCoreNLP.default_jars = [
'joda-time.jar',
'xom.jar',
'stanford-corenlp-3.3.1.jar',
'stanford-corenlp-3.3.1-models.jar',
'stanford-corenlp-3.4.1.jar',
'stanford-corenlp-3.4.1-models.jar',
'jollyday.jar',
'bridge.jar'
]
@@ -33,4 +33,4 @@
last_char.should eql [2, 7, 14, 19, 25, 31, 36, 44, 45]
end
end
end
end
2 changes: 1 addition & 1 deletion spec/spec_helper.rb
@@ -27,4 +27,4 @@ def get_information(text, with_name_tag=false, with_coref=false)

[sentences, tokens, tags, lemmas, begin_char, last_char, name_tags, coref_ids]

end
end