Cascading.Multitool is a sed and grep command line tool for Apache Hadoop.
Welcome

This is the Cascading.Multitool (Multitool) application.

Multitool provides a simple command line interface for building data
processing jobs. Think of Multitool as 'grep' and 'sed' for Hadoop that
also supports joins between multiple data-sets.

For example, with "$HADOOP_HOME/bin/" in your PATH, the following command,

> hadoop jar multitool-<release-date>.jar source=input.txt select=Monday sink=outputDir

will start a Hadoop job that reads the source file "input.txt", greps all
lines containing the word "Monday", and writes the results into the
directory "outputDir".

Multitool inherits the underlying Hadoop configuration, so if the default
FileSystem is HDFS, all paths are relative to the cluster filesystem, not
the local one. Fully qualified URLs override the defaults (file://some/path
or s3n://bucket/file).

This application is built with Cascading. Cascading is a feature-rich API
for defining and executing complex, scale-free, and fault-tolerant data
processing workflows on a Hadoop cluster. It can be found at:

  http://www.cascading.org/

Installing

This step is not necessary if you wish to run Multitool directly from the
uncompressed distribution folder, or if Multitool was pre-installed with
your Hadoop distribution. Type,

> which multitool

to see if it has already been added to your PATH.

To install for all users into "/usr/local/bin":

> sudo ./bin/multitool install

or for the current user only into "~/.multitool":

> ./bin/multitool install

For detailed instructions:

> ./bin/multitool help install

Choose the method that best suits your environment.

If you are running Multitool on AWS Elastic MapReduce, follow the Elastic
MapReduce instructions on the AWS site, which typically expect the
multitool-<release-date>.jar to be uploaded to AWS S3.

Using

The environment variable HADOOP_HOME should always be set to use Multitool.

To run from the command line with the jar, Hadoop should be in the path:

> hadoop jar multitool-<release-date>.jar <args>

If no args are given, a comprehensive list of commands will be printed.

Or, if Multitool has been installed as above:

> multitool source=data/artist.100.txt cut=0 sink=output

This cuts the first field out of the file 'artist.100.txt' and saves the
results to 'output'.

For a more complex example:

> ./bin/multitool source=data/topic.100.txt cut=0 \
    "pgen=(\b[12][09][0-9]{2}\b)" group=0 count=0 group=1 \
    sink=output sink.replace=true sink.parts=1

This finds all years in the input file, counts them, and sorts them by
their counts.

Examples

copying:

  args = source=input.txt sink=outputDir

copying while removing the first header line, and overwriting the output:

  args = source=input.txt source.skipheader=true sink=outputDir sink.replace=true

filtering out data:

  args = source=input.txt "reject=some words" sink=outputDir

Building

To build Multitool, download the source code from GitHub:

  https://github.com/concurrentinc/cascading.multitool/tarball/master

or clone the repo:

  https://github.com/concurrentinc/cascading.multitool

This release pulls all dependencies from the relevant Maven repos,
including conjars.org.

To build a jar:

> ant retrieve jar

To test:

> ant test

License

See LICENSE.txt
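
More examples

The two invocations below are only sketches of how the options documented
above can be combined; the bucket name, output paths, and the "reject" term
are placeholders, so substitute values from your own environment.

Grep-style filter written to a fully qualified S3 location, overwriting any
previous output (assumes an s3n URL is reachable from the cluster and that
"my-bucket" exists):

> hadoop jar multitool-<release-date>.jar source=input.txt select=Monday \
    sink=s3n://my-bucket/monday-lines sink.replace=true

Skip a header line, keep only the first field, drop lines containing the
placeholder word "unknown", and write a single part file:

> multitool source=data/artist.100.txt source.skipheader=true cut=0 \
    "reject=unknown" sink=output sink.parts=1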