Pattern is a Cascading framework and library for machine learning model scoring at scale.
Pattern can read PMML models as workflow specifications for generating Cascading flows which can run on Apache Hadoop.
Pattern is still under active development under the wip-1.0 branch. Thus all wip releases are made available
from the files.concurrentinc.com
domain. When Pattern hits 1.0 and beyond, final releases will be under
files.cascading.org
.
See the pattern-examples
subdirectory for sample apps.
For more information, visit: http://www.cascading.org/pattern/
Pattern currently supports the following PMML model types:
- General Regression
- Regression
- Clustering
- Tree
- Mining - ensembles of the above models like Random Forest
In progress are:
- Neural Network
- Support Vector Machine
Not all aspects of each of the above models are supported. To request support for a particular model or model parameter, report an issue.
These PMML model types translate or compose into:
- Random Forest in PMML 4.0+ exported from R/Rattle
- Linear Regression in PMML 1.1+
- K-Means Clustering in PMML 2.0+
- Logistic Regression in PMML 4.0.1+
- Multinomial Model in PMML 2.0+
Note: Hierarchical Clustering is also implemented. The unit test for that algorithm currently excludes two data points. In regression tests with the Iris data set, we've isolated edge cases where the classifiers in R and Pattern do not agree. Then again, Iris data gets used to illustrate model behaviors with such properties. This will take some digging into numerical operations inside R.
To use Pattern, there is no installation other than adding the necessary dependencies to Maven, Ivy, or Gradle.
To include the base core model libraries, use:
<dependency>
<groupId>cascading</groupId>
<artifactId>pattern-core</artifactId>
<version>x.y.z</version>
</dependency>
To include the PMML parsing libraries and the PMMLPlanner
, use:
<dependency>
<groupId>cascading</groupId>
<artifactId>pattern-pmml</artifactId>
<version>x.y.z</version>
</dependency>
Other sub-projects and artifacts are simply in place to faciliate testing on various platforms, the above dependencies have no dependencies on Cascading Hadoop or local modes, they are completely independent of the underying platforms.
The best way to report an issue is to add a new test to SimplePMMLPlatformTest
along with the expected result set
and submit a pull request on GitHub.
Failing that, feel free to open an issue on the Cascading/Pattern project site or mail the mailing list.
Running:
> gradle idea
from the root of the project will create all IntelliJ project and module files, and retrieve all dependencies.