Category Archives: Incanter

Starting an Incanter Swank server with Leiningen

Kevin Nuckolls asked if the Swank server I set up using Maven in my previous post can be started using Leiningen instead. The answer is yes. It’s very simple, in fact.

You don’t need Maven, Git, or even to manually install Incanter. You just need to install Leiningen, as described in this post, and then create a project directory containing the following project.clj file:

(defproject incanter-swank "0.1.0"
  :description "A Swank Server for Incanter"
  :dependencies [[incanter "1.0-master-SNAPSHOT"]]
  :dev-dependencies [[leiningen/lein-swank "1.0.0-SNAPSHOT"]])

Next download Incanter and its dependencies with Leiningen,

$ lein deps

and start a Swank server.

$ lein swank

Now connect to it from Emacs using M-x slime-connect, as described in my previous post and that’s it.

[NOTE: The first time you build this project, you may see a error message like, java.util.zip.ZipException: duplicate entry. This is a problem I’ve recently been seeing with Leiningen builds, but the jar files that are produced are valid, and the message will not occur when starting the Swank server after the initial build. I am looking into this.]

Setting up Clojure, Incanter, Emacs, Slime, Swank, and Paredit

Emacs is the favored development environment for the majority of Clojure developers, and there are good reasons for that, but personally, I don’t think it should be the first choice of developers new to Clojure, unless they have used it previously; it’s just too much to learn at once.

I recommend people use an editor they’re comfortable with, combined with a command-line REPL. There is no reason to tackle the complexities of configuring and using Emacs, Slime, and Swank until you’ve got your head around the basics of Clojure and functional programming. Once you’ve got the basics down though, it’s worth venturing into the arcane world of Emacs. You may decide it’s not for you, and luckily there are alternatives, from your favorite editor combined with a REPL to plugins for popular IDEs like Netbeans (Enclojure), IntelliJ (La Clojure), and Eclipse (Counter-Clockwise).

But you’ll never know if it’s for you unless you give it a try. So, I’ll be demonstrating how to build and install Incanter (which includes Clojure and Clojure-contrib), and then set up a development environment with Emacs, Slime, Swank, and Paredit.

Setting up Clojure and Incanter

Incanter is available on Clojars, so you can include it in your projects by adding it as a dependency in your project.clj file (see below). Alternatively, you can clone the full distribution from Github, which includes REPL and Swank start up scripts. Note: The repl scripts are necessary, since Leiningen’s repl task does not work correctly with Clojure 1.2.

The following examples will assume you cloned Incanter’s github repository (see below for instructions) and have the repl and swank scripts included with the distribution.

First, you’ll need Git and Leiningen to grab and build Incanter. First clone Incanter from its Github repository:

$ git clone git://github.com/liebke/incanter.git

This will create an incanter subdirectory

$ cd incanter

Use lein deps to downloads the necessary dependencies:

$ lein deps

Once this process is complete, you can start a Clojure REPL with all of Incanter’s dependencies pre-configured on the CLASSPATH by either using the repl scripts included in the script/ directory.

$ script/repl

or on Windows,

$ script/repl.bat

or you can start it directly with the java command:

$ java -cp 'lib/*' clojure.main

This will present you with the user=> prompt. As a simple example of using Incanter from the REPL, we’ll generate a line plot of the sine function over the range -4 to 4, first load the necessary Incanter libraries:

user=> (use '(incanter core charts))

and then use the function-plot function:

user=> (view (function-plot sin -4 4))

Now that we know Incanter and Clojure are installed correctly, let’s set up an Emacs development environment.

Setting up and using Emacs, Swank, Slime, and Paredit

I’m a long time vi/vim user and I typically use MacVim, but I have recently gone back to Emacs (the editor I used when I first learned Lisp) in order to take advantage of Slime, Swank, and Paredit. Doing most of my development on a Macbook, I like Aquamacs, which blends standard OS X and Emacs behaviors. Another nice option on the Mac is Carbon Emacs.

The procedure I’m going to use to setup the Emacs development environment is based on the instructions provided by Phil Hagelberg (a.k.a Technomancy) in this blog post and in the README for his fork of swank-clojure.

The best way to install the necessary packages (clojure-mode, slime, slime-repl, swank-clojure) is by using the Emacs Lisp Package Archive, or ELPA.

To access ELPA, use the following command:

M-x package-list-packages

The meta-key on the Mac for most flavors of Emacs is the command key, but with Aquamacs it’s the alt/option key.

If the ‘package-list-packages’ command cannot be found, you’ll need to paste the following snippet of elisp in your *scratch* buffer and then evaluate it, (go here for more detailed instructions).

 (let ((buffer (url-retrieve-synchronously
	       "http://tromey.com/elpa/package-install.el")))
  (save-excursion
    (set-buffer buffer)
    (goto-char (point-min))
    (re-search-forward "^$" nil 'move)
    (eval-region (point) (point-max))
    (kill-buffer (current-buffer))))

In Aquamacs, you’ll evaluate it by placing your cursor right after the last parentheses and entering:

C-x C-e

On most other version of Emacs, including Carbon Emacs, you’ll enter

C-j

Once this has been done, you should be able access ELPA with:

M-x package-list-packages

You’ll see a list of packages, either scroll down to find or search for, using C-s, the following packages:

  • clojure-mode
  • slime
  • slime-repl
  • swank-clojure
  • paredit

When you’re cursor is on the appropriate package, hit the i key to select it. Once all the packages are selected, hit x to begin their installation. When it’s complete, you might see some warnings, but don’t worry about them.

Slime is an Emacs-mode for editing Lisp/Clojure code and Swank is a back-end service that Slime connects to, letting you evaluate Clojure code from within Emacs. Paredit provides some additional Clojure/Lisp editing functionality, although, like Emacs, it requires some getting used to (see mudphone’s introduction to Paredit presentation and the Paredit cheat sheet).

Now it’s time to start up a Swank server that will let us run Clojure code from Emacs. We can use Leiningen to start one up that is pre-configured with all of Incanter’s dependencies with the script/swank script, or by running the following Leiningen command Incanter directory:

$ lein swank

This will generate some messages, ending with

Connection opened on local port  4005
#<ServerSocket ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=4005]>

Now we need to connect to the server from Emacs with the following command:

M-x slime-connect

It will prompt you for the IP address and port of the server, just use the defaults it offers. It may then show the following prompt:

Versions differ: nil (slime) vs. 2009-09-14 (swank). Continue? (y or n)

Just say ‘yes’. You will then get a message confirming you’re connected, and a window will open with a Clojure REPL and a ‘user>’ prompt. A cool feature of slime-connect is that you can connect to a swank server on a remote system, just provide the system’s IP address or host name, instead of the default 127.0.0.1, when prompted.

Now open or create a Clojure file, using ‘C-x C-f’ (or using ‘command-o’ or ‘command-n’ in Aquamacs). If you’re creating a new file, give it a *.clj suffix and Emacs will start clojure-mode automatically.

Now start up Paredit,

M-x paredit-mode

You’re now ready to edit Clojure code. Start by loading a few Incanter libraries with the following line:

(use '(incanter core stats charts))

You’ll notice that closing parens are automatically created when you create an opening paren, this is due to Paredit. You can evaluate this block of code by placing your cursor right after the last paren, and entering ‘C-x C-e’. You should see the return value, nil, in the Emacs message pane.

Now let’s generate a plot of the PDF of the Normal distribution, over the range -3 to 3, by entering and evaluating the following line:

(view (function-plot pdf-normal -3 3))

That’s it, you’re all set up. Have fun!

See also:

FlightCaster merges its statistical-learning code into Incanter

Two goals have always been kept in mind during the development of Incanter 1) to provide an R-like, interactive statistical and graphics environment for performing data analysis, and 2) to provide a collection of libraries that can be embedded in larger data analysis systems, taking advantage of both the power of the Clojure language and the rich set of libraries available on the JVM for accessing and processing data.

Incanter has provided much of what you would expect from R by combining the power of the Clojure language with Java libraries, like Parallel Colt, jFreeChart, and Processing. The Clojure language itself is well suited for data processing due to its data structures, the sequence abstraction, the powerful sequence processing functions, destructuring, data structure literals, and myriad other niceties.

Now, Incanter’s goal of being embed-able in larger data analysis systems has taken a giant step forward with the merger of Flightcaster‘s code. Flightcaster’s service for predicting airline arrival times is a Clojure- and Hadoop-based distributed statistical-learning system that now has Incanter at the core. Their setup demonstrates how Incanter can scale up to larger data sets by leveraging JVM-based projects like Hadoop, AKKA, or other Clojure-based distributed computing systems. And with the Flightcaster code, Incanter has new functionality in classification, information theory, more probability, more io capabilities, a large range of dependence and similarity measures, and a variety of data transformation functions.

And this marks just the beginning of Flightcaster’s participation in the development of Incanter. Together we are working on a number of algorithms for learning models, structure, ensembles, and more. Future work will also ensure that Incanter can work seemlessly with FlightCaster’s recently open-source Crane framework for managing distributed processes on Hadoop and Amazon’s EC2.

Building a Clojure Web application with Incanter, Compojure, and Leiningen

This post will demonstrate how to build a simple Clojure Web application, with Compojure, that will use Incanter to generate a sample from a normal distribution with parameters (size, mean, and standard deviation) provided by the user, and then return a PNG image of the histogram of the data.

This example is simple, but it demonstrates how to dynamically generate an Incanter chart and pass it directly to the Web client as a PNG image.

For more information on the Compojure Web framework, visit its Github repository, its Google group, its WikiBooks site, and this quick tutorial.

I will use the Leiningen build tool to download dependencies, compile the application, and package it up as jar file. For more information on installing and using Leiningen, see my previous post.

Let’s start off by creating a project directory, I’ll call it incanter-webapp, and give it a src subdirectory. Then create the following Leiningen project.clj file in the project directory.

(defproject incanter-webapp "0.1.0"
  :description "A simple Incanter web-app."
  :dependencies [[incanter "1.2.1-SNAPSHOT"]
                 [compojure "0.3.2"]]
  :main simple_web_app)

This project file includes two dependencies, Incanter and Compojure, and indicates that the -main function for the project is in the file simple_web_app.clj in the src subdirectory of the project.

Next create the file called simple_web_app.clj in the src subdirectory.

(ns simple_web_app
  (:gen-class)
  (:use [compojure]
        [compojure.http response]
        [incanter core stats charts])
  (:import (java.io ByteArrayOutputStream
                    ByteArrayInputStream)))

Note that the namespace for this file is simple_web_app with underscores like in the file name and in the :main line of the project.clj file.

In Compojure, you need to define routes, which in this case associate the function sample-form with “/”, and the function gen-samp-hist-png with “/sample-normal”.

(defroutes webservice
  (GET "/"
    sample-form)
  (GET "/sample-normal"
    (gen-samp-hist-png request
                       (params :size)
                       (params :mean)
                       (params :sd))))

The function gen-samp-hist-png takes four arguments, a request object, the sample size, the population mean, and the population standard deviation, and then updates the Compojure response object to include an Incanter histogram of a sample from a normal distribution with the given parameters.

Now define the gen-samp-hist-png function; start by converting the string-arguments, passed in from the params object, into numbers.

(defn gen-samp-hist-png
  [request size-str mean-str sd-str]
    (let [size (if (nil? size-str)
                 1000
                 (Integer/parseInt size-str))
          m (if (nil? mean-str)
              0
              (Double/parseDouble mean-str))
          s (if (nil? sd-str)
              1
              (Double/parseDouble sd-str))

Each argument has a default value, 1000 for size, 0 for mean, and 1 for the standard deviation (sd).

Next generate the sample from the normal distribution using Incanter’s sample-normal function with the given size, mean, and standard deviation, and then create a histogram of the resulting data.

          samp (sample-normal size
                              :mean m
                              :sd s)
          chart (histogram
                  samp
                  :title "Normal Sample"
                  :x-label (str "sample-size = " size
                                ", mean = " m
                                ", sd = " s))

I’ve used the x-label option of the histogram function to customize the x-axis label so that it includes the given sample size, mean, and standard deviation.

Next, we need to convert the histogram into a byte array, and feed that into a ByteArrayInputStream that can be passed to Compojure’s update-response function, along with a header that sets the Content-Type to “image/png”.

        out-stream (ByteArrayOutputStream.)
        in-stream (do
                    (save chart out-stream)
                    (ByteArrayInputStream.
                      (.toByteArray out-stream)))
        header {:status 200
                :headers {"Content-Type" "image/png"}}]
    (update-response request
                     header
                     in-stream)))

The above code is the key to sending a dynamically generated chart directly to the client. In addition to a filename, an OutputStream can be passed to Incanter’s save function; in this case, we use a ByteArrayOutputStream, which is then converted to a byte array that is used to initialize the ByteArrayInputStream that is passed to update-response.

We can create a Web form for submitting requests to /sample-normal. The following function is a helper function that creates a simple html page skeleton.

(defn html-doc
  [title & body]
  (html
    (doctype :html4)
    [:html
      [:head
        [:title title]]
      [:body
       [:div
	[:h2
	 [:a {:href "/"}
          "Generate a normal sample"]]]
        body]]))

Now define the sample-form function, which will generate a Web form using html-doc. This function was associated with “/” earlier using the defroutes macro.

(def sample-form
  (html-doc "sample-normal histogram"
    (form-to [:get "/sample-normal"]
      "sample size: " (text-field {:size 4} :size)
      "mean: " (text-field {:size 4} :mean)
      "sd: " (text-field {:size 4} :sd)
      (submit-button "view"))))

Now, we need to create a -main function that will be called when the incanter-webapp.jar is executed. This function will call Compojure’s run-server function, starting the built-in Jetty server on port 8080.

(defn -main [& args]
  (run-server {:port 8080}
    "/*" (servlet webservice)))

Now we can use Leiningen to download and install all the necessary dependencies, including Incanter and Compojure, and then build and package the app.

$ lein deps
$ lein compile
$ lein uberjar

Finally, to start the server, run

$ java -jar incanter-webapp-standalone.jar 

and visit http://localhost:8080 to submit a request, or go directly to the sample-normal app by constructing a URL like the following:

http://localhost:8080/sample-normal?size=500&mean=100&sd=10

Which returns a PNG image of a histogram like this:

The complete code for this example can be found here, and the Leiningen project file can be found here.

Building Incanter applications with Leiningen

[NOTE: Make sure to use Leiningen version 1.1.0 or greater. The version can be determined with ‘lein version’]

This is a quick guide to using Leiningen to build applications that use Incanter, based on the very useful post by Zef Hemel, “Building Clojure Projects with Leiningen“.

(Note: the following tutorial uses Unix shell commands (i.e. Mac OS X or Linux), but Roland Sadowski has created a Leiningen PowerShell script for Windows that works with this example: lein.ps1)

First get a copy of Leiningen’s self-installing script:

$ cd ~/bin
$ wget http://github.com/technomancy/leiningen/raw/stable/bin/lein
$ chmod +x lein

and run the self-installer.

$ lein self-install

Now create a project directory called ‘incanter-helloword’ and give it a ‘src’ subdirectory.

$ mkdir incanter-helloworld
$ mkdir incanter-helloworld/src
$ cd incanter-helloworld

Next create the following project.clj file, which includes Incanter in the dependencies section:

(defproject incanter-helloworld "0.1.0"
  :description "The Incanter version of hello world."
  :dependencies [[incanter "1.2.3-SNAPSHOT"]]
  :main hello)

The project will have a single source file called hello.clj, which will contain a -main function that will be called when the jar file is executed. Create it in the incanter-helloworld/src directory:

(ns hello
  (:gen-class)
  (:use (incanter core stats charts)))

(defn -main [& args]
  (view (histogram (sample-normal 1000))))

The program will display a histogram of sample data from a standard normal distribution.

Now download and install all the project’s dependencies, including Incanter, using Leiningen’s ‘deps’ command:

$ lein deps

Compile the project:

$ lein compile

And build an ‘uberjar’ that includes all the Incanter jar files:

$ lein uberjar

And finally, run incanter-helloworld,

$ java -jar incanter-helloworld-standalone.jar

which displays the following histogram.

You can also start a Clojure REPL with a CLASSPATH that includes all your project’s dependencies, including Incanter and its dependencies, using the following command:

$ java -cp 'lib/*' clojure.main

The lein repl command won’t work correctly, since it uses Clojure 1.1 and Incanter requires Clojure 1.2. You can also use the repl and swank scripts available in the scripts directory of the Incanter distribution on Github.

Incanter Cheat Sheet

I’ve created an Incanter cheat sheet (pdf or LaTex) based on Steve Tayon’s Clojure cheat sheet (pdf or html). Click on the image below to see the complete pdf.

Updated: Trimmed some redundancies, got the sheet down to a single page.

Statistical Learning in Clojure Part 1: LDA & QDA Classifiers

This will hopefully be the first of a series of posts based on a book that has substantially influenced me over the last several years, The Elements of Statistical Learning (EoSL) by Hastie, Tibshirani, and Friedman (I went and got a degree in statistics essentially for the purposes of better understanding this book). Best of all, the pdf version of EoSL is now available free of charge at the book’s website, along with data, code, errata, and more.

This post will demonstrate the use of Linear Discriminant Analysis and Quadratric Discriminant Analysis for classification, as described in chapter 4, “Linear Methods for Classification”, of EoSL.

I will implement the classifiers in Clojure and Incanter, and use the same data set as EoSL to train and test them. The data has 11 different classes, each representing a vowel sound, and 10 predictors, each representing processed audio information captured from eight male and seven female speakers. Details of the data are available here. The data is partitioned into a training set and a test set.

The task is to classify each of the N observations as one of the (K=11) vowel sounds using the (p=10) predictors. I will do this using a Bayes classifier built from both linear- and quadratic-discriminant functions. The Bayes classifier works by estimating the probability that an observation, x, belongs to each class, k, and then selects the class with the highest probability. Appropriately, Bayes rule is used to calculate the conditional probability , or posterior probability, for each class.

(eq 4.7) P(G=k|X=x) = \frac{P(X=x|G=k)\pi_k}{\sum_{l=1}^{K} P(X=x|G=l)\pi_l}  

Where, \pi_k is the prior probability of class k. If we model each class density as multivariate-normal, then P(X=x|G=k) = f_k(x) , where

(eq 4.8) f_{k}(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_k)}  

and the posterior probability becomes

(eq 4.7) P(G=k|X=x) = \frac{f_{k}(x)\pi_k}{\sum_{l=1}^{K}f_l(x)\pi_l}  

Since the denominator is the same for each class, we can ignore it, and if we take the log of the numerator and ignore the constant, (2\pi)^{p/2} , in the density function, we end up with the quadratic discriminant function:

(eq 4.12) \delta_{k}(x)=-\frac{1}{2}log|\Sigma_{k}|-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_k)+log\pi_k 

If we further assume that the covariance matrices, \Sigma_k are equal for each cluster, then more terms cancel out and we end up with the linear discriminant function,

(eq 4.10) \delta_{k}(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+log\pi_k  

Classification is then performed with these discriminant functions using the following decision rule:

(eq 4.10a) G(x)=argmax_k\delta_k(x) 

In other words, select the class, k, with the highest score for each observation.

Implementing the LDA Classifier

Start by loading the necessary Incanter libraries and the data sets from the EoSL website:

(use '(incanter core stats charts io))
(def training (to-matrix 
                (read-dataset "http://bit.ly/464h4h" 
                              :header true)))
(def testing (to-matrix 
               (read-dataset "http://bit.ly/1btCei" 
                             :header true)))

Define the fixed parameters:

(def K 11)
(def p 10)
(def N (nrow training))
(def group-counts (map nrow (group-by training 1)))

Since we don’t know the parameters of the multivariate-normal distributions used to model each cluster, we need to estimate them with the training data. First, estimate the prior probabilities for each cluster

(eq 4.10b) \hat\pi_k=N_k/N 

where N_k is the number of observations in class k, and N is the total number of observations.

(def prior-probs (div group-counts N))

Next, estimate the centroids for each cluster

(eq 4.10c) \hat\mu_k=\sum_{g_i=k}x_i/N_k  
(def cluster-centroids 
  (matrix 
    (for [x_k (group-by training 1 :cols (range 2 12))]
      (map mean (trans x_k)))))

Finally, estimate the covariance matrix to be used for all clusters,

(eq 4.10d) \hat\Sigma=\sum_{k=1}^K\sum_{g_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T/(N-K) 
(def cluster-cov-mat 
  (let [groups (group-by training 1 :cols (range 2 12))]
    (reduce plus
      (map (fn [group centroid n]
        (reduce plus 
                (map #(div 
                        (mmult (trans (minus % centroid))
                               (minus % centroid))
                        (- N K))
                     group)))
             groups cluster-centroids group-counts))))

and its inverse:

(def inv-cluster-cov-mat (solve cluster-cov-mat))

With the parameters estimated, now define the linear discriminant function based on equation 4.10 from EoSL:

(eq 4.10) \delta_{k}(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+log\pi_k  
(defn ldf [x Sigma-inv mu_k pi_k]
  (+ (mmult x Sigma-inv (trans mu_k))
     (- (mult 1/2 (mmult mu_k Sigma-inv (trans mu_k))))
     (log pi_k)))

Differences in the equation, as implemented in the code, and equation 4.10 are because most of the arguments to the ldf function above are row vectors, instead of the column vectors used in eq 4.10.

Next, define a function that returns a matrix of linear discriminate scores, where rows are observations and columns are classes,

(defn calculate-scores
  ([data inv-cov-mat centroids priors]
    (matrix 
      (pmap (fn [row]
             (pmap (partial ldf row inv-cov-mat) 
                   centroids 
                   priors))
           (sel data :cols (range 2 12))))))

and calculate the scores for the training data

(def training-lda-scores 
  (calculate-scores training
                    inv-cluster-cov-mat
                    cluster-centroids
                    prior-probs))

Now, create a function that returns the index (i.e. class) with the maximum score for a given row (i.e. observation) of the scoring matrix,

(defn max-index 
  ([x] 
    (let [max-x (reduce max x)
          n (length x)]
      (loop [i 0]
        (if (= (nth x i) max-x)
          i
          (recur (inc i)))))))

We can get a vector of the predicted class for every observation in the training set with the following line of code:

(map max-index training-lda-scores)

To determine the error rate of the classifier, create the following function,

(defn error-rate [data scores] 
  (/ (sum (map #(if (= %1 %2) 0 1) 
               (sel data :cols 1) 
               (plus 1 (map max-index scores))))
     (nrow data)))

and apply it to the score matrix for the training data,

(error-rate training training-lda-scores)

which returns an error rate of 0.316.

Now let’s see how the classifier performs on the test data. Start by calculating the linear discriminant score matrix,

(def testing-lda-scores 
  (calculate-scores testing
                    inv-cluster-cov-mat
                    cluster-centroids
                    prior-probs))

and then calculate the error-rate,

(error-rate testing testing-lda-scores)

which is 0.556. As expected the error rate is higher on the test data than the training data.

Implementing the QDA Classifier

In order to use QDA to classify the data, we need to estimate a unique covariance matrix for each of the K classes, and write the quadratic discriminant function.

First, estimate the covariance matrices,

(eq 4.10e) \hat\Sigma_k=\sum_{g_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T/(N_k-1) 
(def cluster-cov-mats 
  (let [groups (group-by training 1 :cols (range 2 12))]
    (map (fn [group centroid n]
      (reduce plus 
              (map #(div 
                      (mmult (trans (minus % centroid))
                             (minus % centroid))
                      (dec n))
                   group)))
           groups cluster-centroids group-counts)))

and their inverses.

(def inv-cluster-cov-mats (map solve cluster-cov-mats))

Next, create a function that implements the quadratic discriminant function:

(eq 4.12) \delta_{k}(x)=-\frac{1}{2}log|\Sigma_{k}|-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_k)+log\pi_k 
(defn qdf [x Sigma_k Sigma-inv_k mu_k pi_k]
  (+ (- (mult 1/2 (log (det Sigma_k))))
     (- (mult 1/2 (mmult (minus x mu_k) 
                         Sigma-inv_k 
                         (trans (minus x mu_k)))))
     (log pi_k)))

Then calculate the quadratic discriminant score matrix and error rates for the training and test data, which are 0.0113 and 0.528 respectively. Notice that the error rate is substantially lower on the training data than it was with LDA (0.316), but the error rate on the test data is not much better than that achieved by LDA (0.556).

Improving Performance

Hastie, Tibshirani, and Friedman suggested performance of the calculations can be improved by first performing an Eigenvalue decomposition of the covariance matrices.

\Sigma_k=U_kD_kU_k^T 

where U_k is the matrix of eigenvectors, and D_k is a diagonal matrix of positive eigenvalues d_{kl} .

(def Sigma-decomp 
  (map decomp-eigenvalue cluster-cov-matrices ))
(def D (map #(diag (:values %)) Sigma-decomp))
(def U (map #(:vectors %) Sigma-decomp))

Making the following substitutions to \delta_k(x) ,

(a) (x-\hat\mu_k)^T\hat\Sigma_k^{-1}(x-\hat\mu_k)=[U_k^T(x-\hat\mu_k)]^TD_k^{-1}[U_k^T(x-\hat\mu_k)] 
(b) log|\hat\Sigma_k|=\sum_l log(d_{kl}) 

results in the following quadratic discriminant function:

(eq 4.12) \delta_{k}(x)=-\frac{1}{2}\sum_l log(d_{kl})-\frac{1}{2}[U_k^T(x-\hat\mu_k)]^TD_k^{-1}[U_k^T(x-\hat\mu_k)]+log\pi_k 
(defn qdf [x D_k U_k mu_k pi_k]
  (+ (- (mult 1/2 (sum (map log (diag D_k)))))
     (- (mult 1/2 
              (mmult (trans (mmult (trans U_k) 
                                   (trans (minus x mu_k))))
                     (solve D_k) 
                     (mmult (trans U_k) 
                            (trans (minus x mu_k))))))
     (log pi_k)))

This version of the quadratic discriminant function is a bit faster than the previous version, while producing the exact same results. The performance improvement should be greater with larger data sets.

The complete code for this post can be found here.

Incanter Development Roadmap

I want to discuss plans for future Incanter development, and I’m looking for volunteers interested in contributing to any of the following projects, as well as suggestions for other improvements.

My list of priorities, in no particular order:

1. Create new functions based on the Java libraries already included in Incanter. For instance, I would like to improve incanter.optimize by a) including the nonlinear optimization routines available in Parallel Colt b) writing new routines in Clojure, and c) improving the existing routines.

2. Integrate Parallel Colt’s sparse matrix support, as suggested by Mark Reid.

3. Expose more of the chart customizability of JFreeChart in the incanter.chart library, e.g. enabling annotations of categorical charts, allowing users to set the scale on axes, customizing colors, etc..

4. Create an incanter.viz library, consisting of Processing-based data visualizations.

5. Integrate the Weka machine learning library.

6. Provide additional statistical methods.

7. Optimize existing functions; in general, I have favored ease-of-use and expressibility over performance, but there is A LOT of room to optimize without compromising usability.

Any other suggestions or feedback is welcome, and should be directed to the Incanter Goolge Group, where I have started an “Incanter Roadmap” thread. You can subscribe to the mailing list here.

David

New Google group for Incanter

The number of questions and suggestions I have been receiving about Incanter has been on the rise, so I decided to create a Google group (http://groups.google.com/group/incanter) as a central location for future discussion. To subscribe to the mailing list visit http://groups.google.com/group/incanter/subscribe. I welcome questions, answers, and suggestions for future development.

I would also like to thank everybody who has previously sent encouragement, patches, and improvements, it has made this project even more fun to work on.

David

Creating Processing Visualizations with Clojure and Incanter

The Processing language, created by Ben Fry and Casey Reas,
“is an open source programming language and environment for people who want to program images, animation, and interactions.”

Incanter now includes the incanter.processing library, a fork of Roland Sadowski‘s clj-processing library, making it possible to create Processing visualizations with Clojure and Incanter. Incanter.processing provides much more flexibility when creating customized data visualizations than incanter.charts — of course, it is also more complex to use.

Several nice examples of the kinds of visualizations that can be created in Processing can be found on Ben Fry’s website and blog, including the cool zipcode, human vs. chimps, baseball salary vs. performance examples.

The processing website has a set of tutorials, including one on getting started, and there are also three great books on Processing worth checking out:

The API documentation for incanter.processing is still a bit underdeveloped, but is perhaps adequate when combined with the excellent API reference on the Processing website.

Incanter.processing was forked from of Roland Sadowski’s clj-processing library in order to provide cleaner integration with Incanter. I have added a few functions, eliminated some, and changed the names of others. There were a few instances where I merged multiple functions (e.g. *-float, *-int) into a single one to more closely mimic the original Processing API; I incorporated a couple functions into Incanter’s existing multi-methods (e.g. save and view); I eliminated a few functions that duplicated existing Incanter functions and caused naming conflicts (e.g. sin, cos, tan, asin, acos, atan, etc); and I changed the function signatures of pretty much every function in clj-processing to require the explicit passing of the ‘sketch’ (PApplet) object, whereas the original library passes it implicitly by requiring that it is bound to a variable called *applet* in each method of the PApplet proxy.

These changes make it easier to use Processing within Incanter, but if you just want to write Processing applications in Clojure without all the dependencies of Incanter, then the original clj-processing library is the best choice.

A Simple Example

The following is sort of a “hello world” example that demonstrates the basics of creating an interactive Processing visualization (a.k.a sketch), including defining the sketch’s setup, draw, and mouseMoved methods and representing state in the sketch using closures and refs. This example is based on this one, found at John Resig‘s Processing-js website.


Click on the image above to see the live Processing-js version of the sketch.

Start by loading the incanter.core and incanter.processing libraries,

(use '(incanter core processing))

Now define some refs that will represent the state of the sketch object,

(let [radius (ref 50.0)
      X (ref nil)
      Y (ref nil)
      nX (ref nil)
      nY (ref nil)
      delay 16

The variable radius will provide the value of the circle’s radius; X and Y will indicate the location of the circle; and nX and nY will indicate the location of the mouse. We use refs for these values because their values are mutable and need to be available across multiple functions in the sketch object.

Now define a sketch object, which is just a proxied processing.core.PApplet, and its required setup method,

sktch (sketch
        (setup []
          (doto this
            (size 200 200)
            (stroke-weight 10)
            (framerate 15)
            smooth)
          (dosync
            (ref-set X (/ (width this) 2))
            (ref-set Y (/ (width this) 2))
            (ref-set nX @X)
            (ref-set nY @Y)))

The first part of the setup method sets the size of the sketch, the stroke weight to be used when drawing, the framerate of the animation, and indicates that anti-aliasing should be used. The next part of the method uses a dosync block and ref-set to set initial values for the refs. Note the @ syntax to dereference (access the values of) the refs X and Y.

Processing sketches that use animation require the definition of a draw method, which in this case will be invoked 15 times per second as specified by the framerate.

  (draw []
   (dosync
     (ref-set radius (+ @radius 
                        (sin (/ (frame-count this) 4))))
     (ref-set X (+ @X (/ (- @nX @X) delay)))
     (ref-set Y (+ @Y (/ (- @nY @Y) delay))))
   (doto this
     (background 125) ;; gray
     (fill 0 121 184)
     (stroke 255)
     (ellipse @X @Y @radius @radius)))

The first part of the draw method uses dosync and ref-set to set new values for the radius, X, and Y refs for each frame of the animation. The sin function is used to grow and shrink the radius over time. The location of the circle, as indicated by X and Y, is determined by the mouse location (nX and nY) with a delay factor.

The next part of the draw method draws the background (i.e. gray background) and the circle with the ellipse function.

Finally, we need to define the mouseMoved method in order to track the mouse location, using the mouse-x and mouse-y functions, and set the values of the nX and nY refs. All event functions in incanter.processing, including mouseMoved, require an event argument; this is due to limitations of Clojure’s proxy macro, and isn’t required when using the Processing’s Java API directly.

(mouseMoved [mouse-event]
  (dosync
    (ref-set nX (mouse-x mouse-event)) 
    (ref-set nY (mouse-y mouse-event)))))]

Now that the sketch is fully defined, use the view function to display it,

(view sktch :size [200 200]))

The complete code for this example can be found here.

In future posts I will walk through other examples of Processing visualizations, some of which can be found in the Incanter distribution under incanter/examples/processing.