Two goals have guided the development of Incanter from the start: 1) to provide an R-like, interactive statistical and graphics environment for data analysis, and 2) to provide a collection of libraries that can be embedded in larger data analysis systems, taking advantage of both the power of the Clojure language and the rich set of JVM libraries for accessing and processing data.
Incanter has provided much of what you would expect from R by combining the power of the Clojure language with Java libraries such as Parallel Colt, JFreeChart, and Processing. The Clojure language itself is well suited to data processing thanks to its data structures, the sequence abstraction, its powerful sequence-processing functions, destructuring, data structure literals, and myriad other niceties.
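As a rough illustration of those language features, here is a minimal sketch in plain Clojure (no Incanter functions are used). The `flights` data and the helper names `mean` and `mean-delay-by-carrier` are hypothetical, made up for this example only. It uses data structure literals, `group-by` and `map` over sequences, and destructuring in a `for` binding.

```clojure
;; Hypothetical sample data, written with vector and map literals.
(def flights
  [{:carrier "AA" :delay 12}
   {:carrier "UA" :delay 45}
   {:carrier "AA" :delay 30}
   {:carrier "UA" :delay 5}])

;; Arithmetic mean of a non-empty sequence of numbers.
(defn mean [xs]
  (/ (reduce + xs) (count xs)))

;; Group rows by carrier, then average the delays within each group.
;; [carrier group] destructures each map entry produced by group-by.
(defn mean-delay-by-carrier [rows]
  (into {}
        (for [[carrier group] (group-by :carrier rows)]
          [carrier (mean (map :delay group))])))

(mean-delay-by-carrier flights)
;; => {"AA" 21, "UA" 25}
```

The same functions work unchanged on any sequence of maps, which is part of what makes the sequence abstraction convenient for tabular data.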
Now, Incanter’s goal of being embeddable in larger data analysis systems has taken a giant step forward with the merger of FlightCaster’s code. FlightCaster’s service for predicting airline arrival times is a Clojure- and Hadoop-based distributed statistical-learning system that now has Incanter at its core. Their setup demonstrates how Incanter can scale up to larger data sets by leveraging JVM-based projects like Hadoop and Akka, or other Clojure-based distributed computing systems. With the FlightCaster code, Incanter gains new functionality in classification and information theory, more probability functions, more I/O capabilities, a wide range of dependence and similarity measures, and a variety of data transformation functions.
And this marks just the beginning of FlightCaster’s participation in the development of Incanter. Together we are working on a number of algorithms for learning models, structure, ensembles, and more. Future work will also ensure that Incanter works seamlessly with FlightCaster’s recently open-sourced Crane framework for managing distributed processes on Hadoop and Amazon’s EC2.
Very cool. Can you give us a quick summary of the new features in Incanter from the merge?
I’m hoping Bradford and I will each blog more in the future on the new libraries. In general, the new functionality is based on FlightCaster’s low-level statistical-learning toolkit, which is used to develop their custom analytics. We plan on developing higher-level functions based on this foundation in the future.
The following are links to the relevant API documentation. We plan to improve all the docs and provide more example usage. These existing Incanter libraries have additional functions:
incanter.stats: a bunch of new functions, including many distance and similarity metrics
incanter.io: more I/O functions
and these are entirely new libraries:
incanter.transformations
incanter.probability
incanter.information_theory
incanter.incremental-stats
incanter.classification
incanter.chrono: a library based on Joda-Time
This is just the beginning; we plan on a lot more development in the coming months.
David
Great news… Now all we R hackers need your book ready, so we can see what the big deal is. It’s gonna be great!
Haha, my work on the book has been a bit delayed by the recent work merging FlightCaster’s code, but I hope to get back to it soon. I do think the recent work may change the emphasis a bit; we’ll see.
David
Sounds like a fair bit of overlap with Weka.
Would it be worthwhile to integrate Weka and Clojure? I think there’s an R-to-Weka package. Obviously Incanter can link directly to Weka more easily than R can, since Weka is 100% Java. Is it worth building Incanter wrappers for Weka, a bit like the ones for Colt etc.?
I have long planned to integrate Weka, and I would still like that to happen. I think the current FlightCaster-based approach has a different emphasis than Weka does. Right now we’re focusing on creating a lower-level toolkit for building custom statistical-learning algorithms that can be embedded in larger systems, including distributed systems. Weka’s approach is to provide a lot of great pre-built learning algorithms, but this approach provides less flexibility than systems like FlightCaster require. However, I think this different emphasis means there’s room for both approaches in Incanter, and I hope to include Weka integration in the future.
David
Forgot to say fantastic news though….thanks Flightcaster.
Awesome work, David… Glad to see that Bradford has jumped on the Incanter bandwagon too. I have put a few of your libraries to work in a production application (in the financial domain) and can’t say how happy and satisfied I am.
:)
Thanks Chetan, that is great to hear!
David
This is some very good stuff. I was already impressed by Incanter and this addition is exactly in line with what we’ll be doing soon (also a startup using machine learning technologies).
Thanks, I look forward to hearing more about your startup in the future!