Design Philosophy
What rmr aims to be:
For map-reduce programmers:
- the easiest, most productive, most elegant way to write map reduce jobs: one to two orders of magnitude less code than Java, and code that is readable, reusable, and extensible (a minimal sketch follows these lists).
- general, not a crippled-language trap: any map reduce algorithm can be implemented with this package. In the words of Alan Kay, "Simple things should be simple, complex things should be possible."
For R programmers:
- a way to use the Map Reduce programming paradigm.
- a way to work on big data sets that feels natural and R-like.
- a way to access massive, fault-tolerant parallelism without having to master distributed programming.
- just a library to load and use: no separate run-time, no patches to R, nothing else.
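To make the "just a library" point concrete, here is a minimal sketch in the spirit of the package tutorial: a trivial job that squares a vector of integers. It assumes the package is installed (shipped as rmr, later as rmr2) and that a working Hadoop setup is configured through the usual environment variables; it is an illustration, not a tuned production example.

```r
library(rmr2)  # assumes the rmr2 incarnation of the package is installed

# Write a small vector of integers to the distributed file system
small.ints <- to.dfs(1:1000)

# A map-only job: for each input value, emit the value as key and its square as value
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2))

# Read the key-value results back into the current R session
from.dfs(result)
```

The same code runs unchanged on a laptop in local mode or on a cluster; only the backend configuration differs, which is the sense in which the parallelism comes "for free" once the algorithm is expressed as map and reduce steps.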
What rmr is not trying to be:
- rmr is not Hadoop Streaming. It uses Streaming in its implementation, but it does not aim to support every option that Streaming offers. Streaming is already accessible from R without additional packages, since R can execute external programs and R scripts can read from stdin and write to stdout; the point of rmr is to provide an abstraction over that. rmr is not syntactic sugar for Hadoop Streaming.
- Map reduce programs written in rmr are not going to be the most efficient. While the aim is to reduce that gap over time and extend its applicability, rmr is unlikely to ever be the most efficient way to implement massive production jobs. Nonetheless, it is used in production by companies large and small.
- rmr does not provide a mapreduce version of any of the thousands of packages available for R, and it does not solve the problem of parallel programming. You still have to design a parallel algorithm for every problem you need to solve, but you can focus on its interesting aspects. Some problems are believed not to be amenable to a parallel solution, and using the mapreduce paradigm or rmr does not make them an exception.