Data Sorcery with Clojure

Dark theme for Incanter charts

February 6, 2010 · 2 Comments

JFreeChart has been a fantastic library; I’ve been able to include useful charting functionality in Incanter very quickly because of it, but I’m not a big fan of its default visual theme. Eventually I’d like to create some new themes, or better yet include themes created by others, but in the meantime I have created the set-theme function, which accepts a chart and either a keyword indicating a built-in theme or a JFreeChart ChartTheme object, and applies the theme to the chart.

At the moment, the only built-in themes are :default and :dark, but hopefully that will change in the future.

Here’s an example of using set-theme. First I’ll create a chart with the default theme,

(use '(incanter core charts datasets))

(with-data (get-dataset :iris)
  (view (scatter-plot :Sepal.Length :Sepal.Width :group-by :Species)))

and here’s the same scatter-plot with the dark theme.

(with-data (get-dataset :iris)
  (doto (scatter-plot :Sepal.Length :Sepal.Width :group-by :Species)
    (set-theme :dark)
    view))

The set-theme function is available in the latest version of Incanter on GitHub.

I have also added the incanter-pdf module discussed in the previous blog post, but it isn’t installed by default. To install it in your local Maven repository, run ‘mvn install’ from the incanter/modules/incanter-pdf directory.

→ 2 Comments · Categories: Clojure · Incanter · jFreeChart

Saving Incanter charts as PDF documents

February 5, 2010 · 4 Comments

Incanter charts can be saved as PNG files using the save function, but I had a request earlier today to add the ability to save them as PDF documents.

So I’ve created a new function called save-pdf in a new package called incanter.pdf. This package is an optional module in the Incanter distribution (and is also available as a gist). To install it, run ‘mvn install’ in the incanter/modules/incanter-pdf directory; this will install the library in your local Maven repository, where it can be included as a project dependency by including the following line in a project.clj file:
[org.incanter/incanter-pdf "1.0-master-SNAPSHOT"]

The save-pdf function uses the iText library to convert the chart graphic object into a PDF document. If you don’t install the incanter-pdf module with Maven, but use the code in this gist instead, you’ll need to include the following dependency in your Leiningen project.clj file: [com.lowagie/itext "1.4"]
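For the gist route, a project.clj might look like the sketch below. The project name is borrowed from the example in this post and the two dependency coordinates are the ones quoted in the text; whether they still resolve from your repositories is an assumption.

```clojure
;; Sketch of a project.clj for the gist route: incanter-pdf itself is not
;; listed, so iText has to be declared directly.
(defproject pdf-chart "1.0.0-SNAPSHOT"
  :description "An example of creating a PDF chart with the save-pdf gist."
  :dependencies [[org.incanter/incanter-app "1.0-master-SNAPSHOT"]
                 [com.lowagie/itext "1.4"]])
```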

If you did use Maven to install the module, your project.clj file will look something like this,

(defproject pdf-chart "1.0.0-SNAPSHOT"
  :description "An example of creating a PDF chart with incanter.pdf."
  :dependencies [[org.incanter/incanter-app "1.0-master-SNAPSHOT"]
                 [org.incanter/incanter-pdf "1.0-master-SNAPSHOT"]])

Here’s a basic example.

(use '(incanter core charts pdf))
(save-pdf (function-plot sin -4 4) "./pdf-chart.pdf")

Which outputs the following PDF file.

→ 4 Comments · Categories: Clojure · Incanter · jFreeChart

Working with R from Clojure and Incanter

February 5, 2010 · Leave a Comment

Joel Boehland has introduced Rincanter, which lets you use R from Clojure and Incanter. This is fantastically cool, as it opens up the vast number of R libraries to Clojure/Incanter, translating between R and Clojure data types, including Incanter datasets.

Check out Joel’s latest blog post, All your datasets R belong to us (I love that name), where he introduces Rincanter and demonstrates its use.

→ Leave a Comment · Categories: Clojure · Incanter · R · Rincanter · Statistics

New name for blog: Data Sorcery with Clojure

January 16, 2010 · 4 Comments

I just renamed the blog; previously it was called “Data Analysis and Visualization with Clojure,” and now it is simply called “Data Sorcery with Clojure.” I’ve also changed the default URL from http://incanter-blog.org to http://data-sorcery.org (the former URL now redirects to the new one).

Why rename it? Because “data sorcery” fits the theme of the name Incanter, it captures the perception people have of both statistics and lisp languages as dark arts, and it broadens the scope from just data analysis and visualization to include all the other machinations necessary when making sense of data.

→ 4 Comments · Categories: Incanter

Incanter Data Sorcery t-shirt

January 16, 2010 · Leave a Comment

I’ve created an Incanter Data Sorcery t-shirt for Clojure-loving data geeks. I’ve made the design available on a basic black shirt and a higher quality one (more expensive) through Zazzle (where you can also find Clojure t-shirts). Proceeds offset the cost of the coffee used to fuel development of Incanter.

Here’s a close up of the design.

Other shirt color options are also available at the Incanter Zazzle store. If you have other design or tag-line suggestions, let me know.

→ Leave a Comment · Categories: Incanter · t-shirt

Working with data sets in Clojure with Incanter and MongoDB

January 3, 2010 · 9 Comments

This post will cover some new dataset functionality I’ve recently added to Incanter, including the use of MongoDB as a back-end data store for datasets.

Basic dataset functionality

First, load the basic Incanter libraries

(use '(incanter core stats charts io))

Next load some CSV data using the incanter.io/read-dataset function, which takes a string representing either a filename or a URL to the data.

(def data
  (read-dataset
    "http://github.com/liebke/incanter/raw/master/data/cars.csv"
     :header true))

The default delimiter is a comma (the character literal \,), but a different one can be specified with the :delim option (e.g. \tab). The cars.csv file is a small sample data set that is included in the Incanter distribution, and therefore could have been loaded using get-dataset,

(incanter.datasets/get-dataset :cars)

See the documentation for get-dataset for more information on the included sample data.
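To make the :header and :delim options concrete, here is a plain-Clojure sketch of what such a reader does conceptually. The parse-delimited helper is hypothetical and is not how read-dataset is implemented; unlike a real reader it leaves every value as a string.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of Incanter: parse delimited text into row maps.
(defn parse-delimited
  "Splits text into lines and fields. With :header true the first line
  supplies the keys; otherwise generic col-N keys are used."
  [text & {:keys [delim header] :or {delim \, header false}}]
  (let [pattern (re-pattern (java.util.regex.Pattern/quote (str delim)))
        rows    (map #(str/split % pattern) (str/split-lines text))
        ks      (if header
                  (map keyword (first rows))
                  (map #(keyword (str "col-" %)) (range (count (first rows)))))
        body    (if header (rest rows) rows)]
    (mapv #(zipmap ks %) body)))

;; A tiny tab-delimited sample with a header line.
(def sample "speed\tdist\n4\t2\n7\t4")

(parse-delimited sample :delim \tab :header true)
```

Without :header true the same call would treat the first line as data and name the columns :col-0 and :col-1.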

We can get some information on the dataset, like the number of rows and columns, using either the dim function or the nrow and ncol functions, and we can view the column names with the col-names function.

user> (dim data)
[50 2]
user> (col-names data)
["speed" "dist"]

We can see that there are just 50 rows and two columns and that the column names are “speed” and “dist”. The data are 50 observations, from the 1920s, of automobile braking distances observed at different speeds.

I will use Incanter’s new with-data macro and $ column-selector function to access the dataset’s columns. Within the body of a with-data expression, columns of the bound dataset can be accessed by name or index, using the $ function, for instance ($ :colname) or ($ 0).

For example, the following code will create a scatter plot of the data (speed vs. dist), and then add a regression line using the fitted values returned from the incanter.stats/linear-model function.

(with-data data
  (def lm (linear-model ($ :dist) ($ :speed)))
  (doto (scatter-plot ($ :speed) ($ :dist))
    (add-lines ($ :speed) (:fitted lm))
    view))
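For readers unsure what the fitted values and residuals are, the following plain-Clojure sketch computes them for a one-predictor least-squares line. The simple-fit function and its :coefs key are illustrative names chosen here; this is not Incanter's linear-model, which handles the general case.

```clojure
;; Illustrative only: simple least squares for one predictor, y = a + b*x.
(defn simple-fit
  [xs ys]
  (let [n      (count xs)
        mean   (fn [vs] (/ (reduce + vs) n))
        mx     (mean xs)
        my     (mean ys)
        ;; sum of cross-deviations and sum of squared x-deviations
        sxy    (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx    (reduce + (map (fn [x] (* (- x mx) (- x mx))) xs))
        b      (/ sxy sxx)          ; slope
        a      (- my (* b mx))      ; intercept
        fitted (mapv (fn [x] (+ a (* b x))) xs)]
    {:coefs     [a b]
     :fitted    fitted
     :residuals (mapv - ys fitted)}))

;; Points lying exactly on y = 1 + 2x, so every residual is zero.
(simple-fit [1 2 3 4] [3 5 7 9])
```

The fitted values are the points on the line at each observed x, which is why plotting them against speed draws the regression line.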

Within the with-data expression, the dataset itself is bound to $data, which can be useful if you want to perform operations on it. For instance, the following code uses the conj-cols function to prepend an integer ID column to the dataset, and then displays it in a window.

(with-data (get-dataset :cars)
  (view (conj-cols (range (nrow $data)) $data)))

The conj-cols function returns a dataset by conjoining sequences together as the columns of the dataset, or by prepending/appending columns to an existing dataset, and the related conj-rows function conjoins rows.
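The idea behind the two functions can be sketched with plain vectors, treating a dataset as a vector of row vectors. This is only an analogy with made-up rows; Incanter datasets carry column names and other metadata that this ignores.

```clojure
;; Plain-Clojure analogy: each row is a vector of [speed dist].
(def rows [[4 2] [7 4] [8 16]])

;; Prepend an ID column, in the spirit of (conj-cols (range (nrow $data)) $data).
(def with-ids
  (mapv (fn [id row] (into [id] row)) (range (count rows)) rows))

;; Append another row, in the spirit of conj-rows.
(def more-rows (into rows [[9 10]]))
```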

We can create a new dataset that adds the fitted (or predicted values) to the original data using the conj-cols function.

(def results (conj-cols data (:fitted lm)))

You’ll notice that the column names are changed to generic ones (i.e. col-0, col-1, col-2); this is done to prevent naming conflicts when merging datasets. We can add more meaningful names with the col-names function.

(def results (col-names results [:speed :dist :predicted-dist]))

We could have used the -> (thread) macro to perform both steps, as well as add the residuals from the output of linear-model to the dataset:

(def results (-> (conj-cols data (:fitted lm) (:residuals lm))
                 (col-names [:speed :dist :predicted :residuals])))

Querying data sets with the $where function

Another new function, $where, lets you query an Incanter dataset using a syntax based on MongoDB and Somnium’s Congomongo Clojure library.

To perform a query, pass a query-map to the $where function. For instance, to get the rows from the results data set where the value of speed is 10, use

($where {:speed 10} results)

For the rows where the speed is between 10 and 20, use

($where {:speed {:$gt 10 :$lt 20}} results)

For rows where the speed is in the set #{4 7 24 25}, use

($where {:speed {:$in #{4 7 24 25}}} results)

Or not in that set,

($where {:speed {:$nin #{4 7 24 25}}} results)
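The semantics of these query-maps can be sketched in plain Clojure over a vector of row maps. The ops table, matches? and where below are hypothetical names for illustration, not Incanter's implementation, and only the four operators shown in this post are covered.

```clojure
;; Hypothetical sketch of query-map matching; not Incanter's implementation.
(def ops
  {:$gt  (fn [v arg] (> v arg))
   :$lt  (fn [v arg] (< v arg))
   :$in  (fn [v arg] (contains? arg v))
   :$nin (fn [v arg] (not (contains? arg v)))})

(defn matches?
  "True when every condition in query-map holds for the row map."
  [query-map row]
  (every? (fn [[k condition]]
            (let [v (get row k)]
              (if (map? condition)
                ;; operator map: all operators must hold
                (every? (fn [[op arg]] ((ops op) v arg)) condition)
                ;; bare value: plain equality
                (= v condition))))
          query-map))

(defn where [query-map rows]
  (filterv #(matches? query-map %) rows))

(def rows [{:speed 4} {:speed 10} {:speed 15} {:speed 24}])

(where {:speed 10} rows)
(where {:speed {:$gt 10 :$lt 20}} rows)
(where {:speed {:$in #{4 7 24 25}}} rows)
(where {:speed {:$nin #{4 7 24 25}}} rows)
```

Because :$gt and :$lt are strict comparisons, "between 10 and 20" excludes rows where the speed is exactly 10 or 20.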

Like the $ function, $where can be used within with-data, where the dataset is passed implicitly. For example, to get the mean speed of the observations that have residuals between -10 and 10 from the results dataset,

(with-data results
  (mean ($ :speed ($where {:residuals {:$gt -10 :$lt 10}}))))

which returns 14.32.

Query-maps don’t support ‘or’ directly, but we can use conj-rows to construct a dataset where speed is either less than 10 or greater than 20 as follows:

(with-data results
  (conj-rows ($where {:speed {:$lt 10}})
             ($where {:speed {:$gt 20}})))

An alternative to conjoining query results is to pass $where a predicate function that accepts a map containing the key/value pairs of a row and returns a boolean indicating whether the row should be included. For example, to perform the above query we could have done this,

(with-data results
  ($where (fn [row] (or (< (:speed row) 10) (> (:speed row) 20)))))
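One practical difference between the two approaches is row order, which the following plain-Clojure sketch illustrates with ordinary sequences and made-up rows. Whether conj-rows and $where behave exactly this way on Incanter datasets is an assumption worth checking at the REPL.

```clojure
(def rows [{:speed 4} {:speed 25} {:speed 15} {:speed 7}])

;; Union of two queries: all slow rows first, then all fast rows.
(def unioned
  (concat (filter #(< (:speed %) 10) rows)
          (filter #(> (:speed %) 20) rows)))

;; Single predicate: matching rows keep their original order.
(def filtered
  (filter #(or (< (:speed %) 10) (> (:speed %) 20)) rows))
```

Both contain the same rows here because the two conditions cannot overlap; with overlapping conditions the union would also contain duplicates.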

Storing and Retrieving Incanter datasets in MongoDB

The new incanter.mongodb library can be used with Somnium’s Congomongo to store and retrieve datasets in a MongoDB database.

MongoDB is a schema-less, document-oriented database that is well suited as a data store for Clojure data structures. Getting started with MongoDB is easy: just download and unpack it, and run the following commands (on Linux or Mac OS X),

$ mkdir -p /data/db
$ ./mongodb/bin/mongod &

For more information, see the MongoDB quick start guide.

Once the database server is running, load Incanter’s MongoDB library and Congomongo,

(use 'somnium.congomongo)
(use 'incanter.mongodb)

and use Congomongo’s mongo! function to connect to the “mydb” database on the server running on the localhost on the default port.

(mongo! :db "mydb")

If mydb doesn’t exist, it will be created. Now we can insert the results dataset into the database with the incanter.mongodb/insert-dataset function.

(insert-dataset :breaking-dists results)

The first argument, :breaking-dists, is the name the collection will have in the database. We can now retrieve the dataset with the incanter.mongodb/fetch-dataset function.

(def breaking-dists (fetch-dataset :breaking-dists))

Take a look at the column names of the retrieved dataset and you’ll notice that MongoDB added a couple, :_ns and :_id, in order to uniquely identify each row.

user> (col-names breaking-dists)
[:speed :_ns :_id :predicted :residuals :dist]
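If the bookkeeping columns get in the way, they can be dropped once the rows are in hand. The sketch below uses plain row maps with made-up values standing in for a fetched dataset; it does not use any Incanter or Congomongo function.

```clojure
;; Rows as plain maps, standing in for a fetched dataset (values are made up).
(def fetched
  [{:_id "a1" :_ns "breaking-dists" :speed 4 :dist 2}
   {:_id "a2" :_ns "breaking-dists" :speed 7 :dist 4}])

;; Remove the keys MongoDB added, keeping only the original columns.
(def cleaned (mapv #(dissoc % :_id :_ns) fetched))
```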

The fetch-dataset function (and the congomongo.fetch function that it’s based on) supports queries with the :where option. The following example retrieves only the rows from the :breaking-dists collection in the database where the :speed is between 10 and 20 mph, and then calculates the average braking distance of the resulting observations.

(with-data (fetch-dataset :breaking-dists
                          :where {:speed {:$gt 10 :$lt 20}})
  (mean ($ :dist)))

The syntax for Congomongo’s query-maps is nearly the same as that for the $where function, although :$in and :$nin take a Clojure vector instead of a Clojure set.
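If you want to reuse a $where query-map with fetch-dataset, one option is to convert any sets to vectors first. The sets->vectors helper below is a hypothetical convenience written for this sketch, not part of Incanter or Congomongo.

```clojure
(require '[clojure.walk :as walk])

;; Hypothetical helper: rewrite any sets in a query-map as vectors.
(defn sets->vectors [query-map]
  (walk/postwalk #(if (set? %) (vec %) %) query-map))

(sets->vectors {:speed {:$in #{4 7 24 25}}})
```

The order of elements in the resulting vector follows the set's iteration order, which should not matter for :$in and :$nin.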

For more information on the available functionality in Somnium’s Congomongo, visit its Github repository or read the documentation for incanter.mongodb:

(doc incanter.mongodb)

The complete code for this post can be found here.

→ 9 Comments · Categories: Clojure · Incanter · Statistics · congomongo · mongodb

Starting an Incanter Swank server with Leiningen

December 22, 2009 · 2 Comments

Kevin Nuckolls asked if the Swank server I set up using Maven in my previous post can be started using Leiningen instead. The answer is yes. It’s very simple, in fact.

You don’t need Maven, Git, or even to manually install Incanter. You just need to install Leiningen, as described in this post, and then create a project directory containing the following project.clj file:

(defproject incanter-swank "0.1.0"
  :description "A Swank Server for Incanter"
  :dependencies [[incanter "1.0-master-SNAPSHOT"]]
  :dev-dependencies [[leiningen/lein-swank "1.0.0-SNAPSHOT"]])

Next download Incanter and its dependencies with Leiningen,

$ lein deps

and start a Swank server.

$ lein swank

Now connect to it from Emacs using M-x slime-connect, as described in my previous post, and that’s it.

[NOTE: The first time you build this project, you may see an error message like java.util.zip.ZipException: duplicate entry. This is a problem I've recently been seeing with Leiningen builds, but the jar files that are produced are valid, and the message will not occur when starting the Swank server after the initial build. I am looking into this.]

→ 2 Comments · Categories: Incanter

Setting up Clojure, Incanter, Emacs, Slime, Swank, and Paredit

December 20, 2009 · 18 Comments

Emacs is the favored development environment for the majority of Clojure developers, and there are good reasons for that, but personally, I don’t think it should be the first choice of developers new to Clojure, unless they have used it previously; it’s just too much to learn at once.

I recommend people use an editor they’re comfortable with, combined with a command-line REPL. There is no reason to tackle the complexities of configuring and using Emacs, Slime, and Swank until you’ve got your head around the basics of Clojure and functional programming. Once you’ve got the basics down though, it’s worth venturing into the arcane world of Emacs. You may decide it’s not for you, and luckily there are alternatives, from your favorite editor combined with a REPL to plugins for popular IDEs like Netbeans (Enclojure), IntelliJ (La Clojure), and Eclipse (Counter-Clockwise).

But you’ll never know if it’s for you unless you give it a try. So, I’ll be demonstrating how to build and install Incanter (which includes Clojure and Clojure-contrib), and then set up a development environment with Emacs, Slime, Swank, and Paredit.

Setting up Clojure and Incanter

You’ll need Git and Maven to grab and build Incanter. First clone Incanter from its Github repository:

$ git clone git://github.com/liebke/incanter.git

This will create an incanter subdirectory

$ cd incanter

Use Maven to build, test, and install it:

$ mvn install

If this is the first time you’ve installed Incanter, Maven will download a lot of stuff. Next it will perform the build and run the tests. Once this process is complete, you can start a Clojure REPL with all of Incanter’s dependencies pre-configured on the CLASSPATH, either by using the clj script included in the bin/ directory,

$ bin/clj

or on Windows,

$ bin/clj.bat

or you can use Maven to start a REPL from the modules/incanter-bundle directory,

$ mvn clojure:repl

or you can start it directly with the java command:

$ java -jar modules/incanter-bundle/target/incanter-exec.jar

This will present you with the user=> prompt. As a simple example of using Incanter from the REPL, we’ll generate a line plot of the sine function over the range -4 to 4. First load the necessary Incanter libraries:

user=> (use '(incanter core charts))

and then use the function-plot function:

user=> (view (function-plot sin -4 4))

Now that we know Incanter and Clojure are installed correctly, let’s set up an Emacs development environment.

Setting up and using Emacs, Swank, Slime, and Paredit

I’m a long time vi/vim user and I typically use MacVim, but I have recently gone back to Emacs (the editor I used when I first learned Lisp) in order to take advantage of Slime, Swank, and Paredit. Doing most of my development on a Macbook, I like Aquamacs, which blends standard OS X and Emacs behaviors. Another nice option on the Mac is Carbon Emacs.

The procedure I’m going to use to set up the Emacs development environment is based on the instructions provided by Phil Hagelberg (a.k.a. Technomancy) in this blog post and in the README for his fork of swank-clojure.

The best way to install the necessary packages (clojure-mode, slime, slime-repl, swank-clojure) is by using the Emacs Lisp Package Archive, or ELPA.

To access ELPA, use the following command:

M-x package-list-packages

The meta-key on the Mac for most flavors of Emacs is the command key, but with Aquamacs it’s the alt/option key.

If the ‘package-list-packages’ command cannot be found, you’ll need to paste the following snippet of elisp in your *scratch* buffer and then evaluate it, (go here for more detailed instructions).

(let ((buffer (url-retrieve-synchronously
               "http://tromey.com/elpa/package-install.el")))
  (save-excursion
    (set-buffer buffer)
    (goto-char (point-min))
    (re-search-forward "^$" nil 'move)
    (eval-region (point) (point-max))
    (kill-buffer (current-buffer))))

In Aquamacs, you’ll evaluate it by placing your cursor right after the last parenthesis and entering:

C-x C-e

On most other versions of Emacs, including Carbon Emacs, you’ll enter

C-j

Once this has been done, you should be able to access ELPA with:

M-x package-list-packages

You’ll see a list of packages. Either scroll down or search with C-s to find the following packages:

  • clojure-mode
  • slime
  • slime-repl
  • swank-clojure
  • paredit

When your cursor is on the appropriate package, hit the i key to select it. Once all the packages are selected, hit x to begin their installation. When it’s complete, you might see some warnings, but don’t worry about them.

Slime is an Emacs-mode for editing Lisp/Clojure code and Swank is a back-end service that Slime connects to, letting you evaluate Clojure code from within Emacs. Paredit provides some additional Clojure/Lisp editing functionality, although, like Emacs, it requires some getting used to (see mudphone’s introduction to Paredit presentation and the Paredit cheat sheet).

Now it’s time to start up a Swank server that will let us run Clojure code from Emacs. We can use Maven to start one up that is pre-configured with all of Incanter’s dependencies with the bin/swank script, or by running the following Maven command from the modules/incanter-bundle directory:

$ mvn clojure:swank

This will generate some messages, ending with

Connection opened on local port  4005
#<ServerSocket ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=4005]>

Now we need to connect to the server from Emacs with the following command:

M-x slime-connect

It will prompt you for the IP address and port of the server; just use the defaults it offers. It may then show the following prompt:

Versions differ: nil (slime) vs. 2009-09-14 (swank). Continue? (y or n)

Just say ‘yes’. You will then get a message confirming you’re connected, and a window will open with a Clojure REPL and a ‘user>’ prompt. A cool feature of slime-connect is that you can connect to a swank server on a remote system; just provide the system’s IP address or host name, instead of the default 127.0.0.1, when prompted.

Now open or create a Clojure file, using ‘C-x C-f’ (or using ‘command-o’ or ‘command-n’ in Aquamacs). If you’re creating a new file, give it a *.clj suffix and Emacs will start clojure-mode automatically.

Now start up Paredit,

M-x paredit-mode

You’re now ready to edit Clojure code. Start by loading a few Incanter libraries with the following line:

(use '(incanter core stats charts))

You’ll notice that closing parens are automatically created when you create an opening paren; this is due to Paredit. You can evaluate this block of code by placing your cursor right after the last paren, and entering ‘C-x C-e’. You should see the return value, nil, in the Emacs message pane.

Now let’s generate a plot of the PDF of the Normal distribution, over the range -3 to 3, by entering and evaluating the following line:

(view (function-plot pdf-normal -3 3))
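For reference, the curve being plotted can be written out by hand. The sketch below assumes pdf-normal defaults to the standard normal distribution (mean 0, standard deviation 1); std-normal-pdf is a name made up here and is not an Incanter function.

```clojure
;; Standard normal density, written out by hand for illustration:
;; f(x) = exp(-x^2 / 2) / sqrt(2 * pi)
(defn std-normal-pdf [x]
  (/ (Math/exp (- (/ (* x x) 2.0)))
     (Math/sqrt (* 2.0 Math/PI))))

;; Sample the curve at the integers from -3 to 3 as [x density] pairs.
(def points
  (mapv (fn [x] [x (std-normal-pdf x)]) (range -3 4)))
```

The density peaks at x = 0 with a value of roughly 0.3989 and is symmetric around it, which is the bell shape the chart shows.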

That’s it, you’re all set up. Have fun!


→ 18 Comments · Categories: Clojure · Incanter · emacs · paredit · slime · swank

Funding Open Source Projects

December 16, 2009 · Leave a Comment

Rich Hickey, the creator of the Clojure language, made an interesting request yesterday: he needs help funding Clojure’s development. For the last few years, he has essentially been the sole financial backer for the project, and now there is a need for additional financial support, so he can continue developing Clojure full-time.

The request spurred an outpouring of support for Rich and Clojure, raising more than half of the target amount within a day, with more than 187 individuals and five companies providing support so far.

The Clojure language is one of the primary reasons Incanter has been such a joy to develop and use, and I have been a past and present financial contributor to Clojure, as well as to the other projects that provide the foundation that I built Incanter on, like Parallel Colt and JFreeChart. I hope you’ll join me in helping fund great open-source projects like these.

To help fund Clojure, visit its funding page. To help fund Piotr Wendykier’s Parallel Colt project, visit his donation page, and to help fund JFreeChart, purchase their developer’s guide.

Many other projects lie at the core of Incanter, and although they don’t all require funding, you can help in other ways by contributing your talent. To learn more about contributing to Processing, visit its contribute page, and to learn more about helping out with Incanter itself, visit its Google group.

David

→ Leave a Comment · Categories: Clojure · open source

FlightCaster merges its statistical-learning code into Incanter

December 3, 2009 · 12 Comments

Two goals have always been kept in mind during the development of Incanter: 1) to provide an R-like, interactive statistical and graphics environment for performing data analysis, and 2) to provide a collection of libraries that can be embedded in larger data analysis systems, taking advantage of both the power of the Clojure language and the rich set of libraries available on the JVM for accessing and processing data.

Incanter has provided much of what you would expect from R by combining the power of the Clojure language with Java libraries, like Parallel Colt, jFreeChart, and Processing. The Clojure language itself is well suited for data processing due to its data structures, the sequence abstraction, the powerful sequence processing functions, destructuring, data structure literals, and myriad other niceties.

Now, Incanter’s goal of being embeddable in larger data analysis systems has taken a giant step forward with the merger of Flightcaster’s code. Flightcaster’s service for predicting airline arrival times is a Clojure- and Hadoop-based distributed statistical-learning system that now has Incanter at the core. Their setup demonstrates how Incanter can scale up to larger data sets by leveraging JVM-based projects like Hadoop, AKKA, or other Clojure-based distributed computing systems. And with the Flightcaster code, Incanter has new functionality in classification, information theory, more probability, more io capabilities, a large range of dependence and similarity measures, and a variety of data transformation functions.

And this marks just the beginning of Flightcaster’s participation in the development of Incanter. Together we are working on a number of algorithms for learning models, structure, ensembles, and more. Future work will also ensure that Incanter can work seamlessly with FlightCaster’s recently open-sourced Crane framework for managing distributed processes on Hadoop and Amazon’s EC2.

→ 12 Comments · Categories: Clojure · FlightCaster · Incanter