Incanter Data Sorcery t-shirt

I’ve created an Incanter Data Sorcery t-shirt for Clojure-loving data geeks. I’ve made the design available on a basic black shirt and a higher quality one (more expensive) through Zazzle (where you can also find Clojure t-shirts). Proceeds offset the cost of the coffee used to fuel development of Incanter.

Here’s a close up of the design.

Other shirt color options are also available at the Incanter Zazzle store. If you have other design or tag-line suggestions, let me know.

Working with data sets in Clojure with Incanter and MongoDB

This post will cover some new dataset functionality I’ve recently added to Incanter, including the use of MongoDB as a back-end data store for datasets.

Basic dataset functionality

First, load the basic Incanter libraries

(use '(incanter core stats charts io))

Next load some CSV data using the incanter.io/read-dataset function, which takes a string representing either a filename or a URL to the data.

(def data
  (read-dataset
    "http://github.com/liebke/incanter/raw/master/data/cars.csv"
     :header true))

The default delimiter is \, but a different one can be specified with the :delim option (e.g. \tab). The cars.csv file is a small sample data set that is included in the Incanter distribution, and therefore could have been loaded using get-dataset,

(incanter.datasets/get-dataset :cars)

See the documentation for get-dataset for more information on the included sample data.

We can get some information on the dataset, like the number of rows and columns using either the dim function or the nrow and ncol functions, and we can view the columns names with the col-names function.

user> (dim data)
[50 2]
user> (col-names data)
["speed" "dist"]

We can see that there are just 50 rows and two columns and that the column names are “speed” and “dist”. The data are 50 observations, from the 1920s, of automobile breaking distances observed at different speeds.

I will use Incanter’s new with-data macro and $ column-selector function to access the dataset’s columns. Within the body of a with-data expression, columns of the bound dataset can be accessed by name or index, using the $ function, for instance ($ :colname) or ($ 0).

For example, the following code will create a scatter plot of the data (speed vs. dist), and then add a regression line using the fitted values returned from the incanter.stats/linear-model function.

(with-data data
  (def lm (linear-model ($ :dist) ($ :speed)))
  (doto (scatter-plot ($ :speed) ($ :dist))
    (add-lines ($ :speed) (:fitted lm))
    view))

Within the with-data expression, the dataset itself is bound to $data, which can be useful if you want to perform operations on it. For instance, the following code uses the conj-cols function to prepend an integer ID column to the dataset, and then displays it in a window.

(with-data (get-dataset :cars)
  (view (conj-cols (range (nrow $data)) $data)))

The conj-cols function returns a dataset by conjoining sequences together as the columns of the dataset, or by prepending/appending columns to an existing dataset, and the related conj-rows function conjoins rows.

We can create a new dataset that adds the fitted (or predicted values) to the original data using the conj-cols function.

(def results (conj-cols data (:fitted lm)))

You’ll notice that the column names are changed to generic ones (i.e. col-0, col-1, col-2), this is done to prevent naming conflicts when merging datasets. We can add more meaningful names with the col-names function.

(def results (col-names data [:speed :dist :predicted-dist]))

We could have used the -> (thread) macro to perform both steps, as well as add the residuals from the output of linear-model to the dataset

(def results (-> (conj-cols data (:fitted lm) (:residuals lm))
                 (col-names [:speed :dist :predicted :residuals])))

Querying data sets with the $where function

Another new function, $where, lets you query an Incanter dataset using a syntax based on MongoDB and Somnium’s Congomongo Clojure library.

To perform a query, pass a query-map to the $where function. For instance, to get the rows from the results data set where the value of speed is 10, use

($where {:speed 10} results)

For the rows where the speed is between 10 and 20, use

($where {:speed {:$gt 10 :$lt 20}} results)

For rows where the speed is in the set #{4 7 24 25}, use

($where {:speed {:$in #{4 7 24 25}}} results)

Or not in that set,

($where {:speed {:$nin #{4 7 24 25}}} results)

Like the $ function, $where can be used within with-data, where the dataset is passed implicitly. For example, to get the mean speed of the observations that have residuals between -10 and 10 from the results dataset,

(with-data results
  (mean ($ :speed ($where {:residuals {:$gt -10 :$lt 10}}))))

which returns 14.32.

Query-maps don’t support ‘or’ directly, but we can use conj-rows to construct a dataset where speed is either less than 10 or greater than 20 as follows:

(with-data results
  (conj-rows ($where {:speed {:$lt 10}})
             ($where {:speed {:$gt 20}})))

An alternative to conjoining query results is to pass $where a predicate function that accepts a map containing the key/value pairs of a row and returns a boolean indicating whether the row should be included. For example, to perform the above query we could have done this,

(with-data results
  ($where (fn [row] (or (< (:speed row) 10) (> (:speed row) 20)))))

Storing and Retrieving Incanter datasets in MongoDB

The new incanter.mongodb library can be used with Somnium’s Congomongo to store and retrieve datasets in a MongoDB database.

MongoDB is schema-less, document-oriented database that is well suited as a data store for Clojure data structures. Getting started with MongoDB is easy, just download and unpack it, and run the following commands (on Linux or Mac OS X),

$ mkdir -p /data/db
$ ./mongodb/bin/mongod &

For more information, see the MongoDB quick start guide.

Once the database server is running, load Incanter’s MongoDB library and Congomongo,

(use 'somnium.congomongo)
(use 'incanter.mongodb)

and use Congomongo’s mongo! function to connect to the “mydb” database on the server running on the localhost on the default port.

(mongo! :db "mydb")

If mydb doesn’t exist, it will be created. Now we can insert the results dataset into the database with the incanter.mongodb/insert-dataset function.

(insert-dataset :breaking-dists results)

The first argument, :breaking-dists, is the name the collection will have in the database. We can now retrieve the dataset with the incanter.mongodb/fetch-dataset function.

(def breaking-dists (fetch-dataset :breaking-dists))

Take a look at the column names of the retrieved dataset and you’ll notice that MongoDB added a couple, :_ns and :_id, in order to uniquely identify each row.

user> (col-names breaking-dists)
[:speed :_ns :_id :predicted :residuals :dist]

The fetch-dataset function (and the congomongo.fetch function that it’s based on) support queries with the :where option. The following example retrieves only the rows from the :breaking-dists collection in the database where the :speed is between 10 and 20 mph, and then calculates the average breaking distance of the resulting observations.

(with-data (fetch-dataset :breaking-dists
			  :where {:speed {:$gt 10 :$lt 20}})
  (mean ($ :dist)))

The syntax for Congomongo’s query-maps is nearly the same as that for the $where function, although :$in and :$nin take a Clojure vector instead of a Clojure set.

For more information on the available functionality in Somnium’s Congomongo, visit its Github repository or read the documentation for incanter.mongodb

(doc incanter.mongodb)

The complete code for this post can be found here.

Starting an Incanter Swank server with Leiningen

Kevin Nuckolls asked if the Swank server I set up using Maven in my previous post can be started using Leiningen instead. The answer is yes. It’s very simple, in fact.

You don’t need Maven, Git, or even to manually install Incanter. You just need to install Leiningen, as described in this post, and then create a project directory containing the following project.clj file:

(defproject incanter-swank "0.1.0"
  :description "A Swank Server for Incanter"
  :dependencies [[incanter "1.0-master-SNAPSHOT"]]
  :dev-dependencies [[leiningen/lein-swank "1.0.0-SNAPSHOT"]])

Next download Incanter and its dependencies with Leiningen,

$ lein deps

and start a Swank server.

$ lein swank

Now connect to it from Emacs using M-x slime-connect, as described in my previous post and that’s it.

[NOTE: The first time you build this project, you may see a error message like, java.util.zip.ZipException: duplicate entry. This is a problem I’ve recently been seeing with Leiningen builds, but the jar files that are produced are valid, and the message will not occur when starting the Swank server after the initial build. I am looking into this.]

Setting up Clojure, Incanter, Emacs, Slime, Swank, and Paredit

Emacs is the favored development environment for the majority of Clojure developers, and there are good reasons for that, but personally, I don’t think it should be the first choice of developers new to Clojure, unless they have used it previously; it’s just too much to learn at once.

I recommend people use an editor they’re comfortable with, combined with a command-line REPL. There is no reason to tackle the complexities of configuring and using Emacs, Slime, and Swank until you’ve got your head around the basics of Clojure and functional programming. Once you’ve got the basics down though, it’s worth venturing into the arcane world of Emacs. You may decide it’s not for you, and luckily there are alternatives, from your favorite editor combined with a REPL to plugins for popular IDEs like Netbeans (Enclojure), IntelliJ (La Clojure), and Eclipse (Counter-Clockwise).

But you’ll never know if it’s for you unless you give it a try. So, I’ll be demonstrating how to build and install Incanter (which includes Clojure and Clojure-contrib), and then set up a development environment with Emacs, Slime, Swank, and Paredit.

Setting up Clojure and Incanter

Incanter is available on Clojars, so you can include it in your projects by adding it as a dependency in your project.clj file (see below). Alternatively, you can clone the full distribution from Github, which includes REPL and Swank start up scripts. Note: The repl scripts are necessary, since Leiningen’s repl task does not work correctly with Clojure 1.2.

The following examples will assume you cloned Incanter’s github repository (see below for instructions) and have the repl and swank scripts included with the distribution.

First, you’ll need Git and Leiningen to grab and build Incanter. First clone Incanter from its Github repository:

$ git clone git://github.com/liebke/incanter.git

This will create an incanter subdirectory

$ cd incanter

Use lein deps to downloads the necessary dependencies:

$ lein deps

Once this process is complete, you can start a Clojure REPL with all of Incanter’s dependencies pre-configured on the CLASSPATH by either using the repl scripts included in the script/ directory.

$ script/repl

or on Windows,

$ script/repl.bat

or you can start it directly with the java command:

$ java -cp 'lib/*' clojure.main

This will present you with the user=> prompt. As a simple example of using Incanter from the REPL, we’ll generate a line plot of the sine function over the range -4 to 4, first load the necessary Incanter libraries:

user=> (use '(incanter core charts))

and then use the function-plot function:

user=> (view (function-plot sin -4 4))

Now that we know Incanter and Clojure are installed correctly, let’s set up an Emacs development environment.

Setting up and using Emacs, Swank, Slime, and Paredit

I’m a long time vi/vim user and I typically use MacVim, but I have recently gone back to Emacs (the editor I used when I first learned Lisp) in order to take advantage of Slime, Swank, and Paredit. Doing most of my development on a Macbook, I like Aquamacs, which blends standard OS X and Emacs behaviors. Another nice option on the Mac is Carbon Emacs.

The procedure I’m going to use to setup the Emacs development environment is based on the instructions provided by Phil Hagelberg (a.k.a Technomancy) in this blog post and in the README for his fork of swank-clojure.

The best way to install the necessary packages (clojure-mode, slime, slime-repl, swank-clojure) is by using the Emacs Lisp Package Archive, or ELPA.

To access ELPA, use the following command:

M-x package-list-packages

The meta-key on the Mac for most flavors of Emacs is the command key, but with Aquamacs it’s the alt/option key.

If the ‘package-list-packages’ command cannot be found, you’ll need to paste the following snippet of elisp in your *scratch* buffer and then evaluate it, (go here for more detailed instructions).

 (let ((buffer (url-retrieve-synchronously
	       "http://tromey.com/elpa/package-install.el")))
  (save-excursion
    (set-buffer buffer)
    (goto-char (point-min))
    (re-search-forward "^$" nil 'move)
    (eval-region (point) (point-max))
    (kill-buffer (current-buffer))))

In Aquamacs, you’ll evaluate it by placing your cursor right after the last parentheses and entering:

C-x C-e

On most other version of Emacs, including Carbon Emacs, you’ll enter

C-j

Once this has been done, you should be able access ELPA with:

M-x package-list-packages

You’ll see a list of packages, either scroll down to find or search for, using C-s, the following packages:

  • clojure-mode
  • slime
  • slime-repl
  • swank-clojure
  • paredit

When you’re cursor is on the appropriate package, hit the i key to select it. Once all the packages are selected, hit x to begin their installation. When it’s complete, you might see some warnings, but don’t worry about them.

Slime is an Emacs-mode for editing Lisp/Clojure code and Swank is a back-end service that Slime connects to, letting you evaluate Clojure code from within Emacs. Paredit provides some additional Clojure/Lisp editing functionality, although, like Emacs, it requires some getting used to (see mudphone’s introduction to Paredit presentation and the Paredit cheat sheet).

Now it’s time to start up a Swank server that will let us run Clojure code from Emacs. We can use Leiningen to start one up that is pre-configured with all of Incanter’s dependencies with the script/swank script, or by running the following Leiningen command Incanter directory:

$ lein swank

This will generate some messages, ending with

Connection opened on local port  4005
#<ServerSocket ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=4005]>

Now we need to connect to the server from Emacs with the following command:

M-x slime-connect

It will prompt you for the IP address and port of the server, just use the defaults it offers. It may then show the following prompt:

Versions differ: nil (slime) vs. 2009-09-14 (swank). Continue? (y or n)

Just say ‘yes’. You will then get a message confirming you’re connected, and a window will open with a Clojure REPL and a ‘user>’ prompt. A cool feature of slime-connect is that you can connect to a swank server on a remote system, just provide the system’s IP address or host name, instead of the default 127.0.0.1, when prompted.

Now open or create a Clojure file, using ‘C-x C-f’ (or using ‘command-o’ or ‘command-n’ in Aquamacs). If you’re creating a new file, give it a *.clj suffix and Emacs will start clojure-mode automatically.

Now start up Paredit,

M-x paredit-mode

You’re now ready to edit Clojure code. Start by loading a few Incanter libraries with the following line:

(use '(incanter core stats charts))

You’ll notice that closing parens are automatically created when you create an opening paren, this is due to Paredit. You can evaluate this block of code by placing your cursor right after the last paren, and entering ‘C-x C-e’. You should see the return value, nil, in the Emacs message pane.

Now let’s generate a plot of the PDF of the Normal distribution, over the range -3 to 3, by entering and evaluating the following line:

(view (function-plot pdf-normal -3 3))

That’s it, you’re all set up. Have fun!

See also:

Funding Open Source Projects

Rich Hickey, the creator of the Clojure language, made an interesting request yesterday, he needs help funding Clojure’s development. For the last few years, he has essentially been the sole financial backer for the project, and now there is a need for additional financial support, so he can continue developing Clojure full-time.

The request spurred an outpouring of support for Rich and Clojure, raising more than half of the target amount within a day, with more than 187 individuals and five companies providing support so far.

The Clojure language is one of the primary reason Incanter has been such a joy to develop and use, and I have been a past and present financial contributor to Clojure, as well as to the other projects that provide the foundation that I built Incanter on, like Parallel Colt and JFreeChart. I hope you’ll join me in helping fund great open-source projects like these.

To help fund Clojure, visit its funding page. To help fund Piotr Wendykier’s Parallel Colt project, visit his donation page, and to help fund JFreeChart, purchase their developer’s guide.

Many other projects lay at the core of Incanter, and although they don’t all require funding, you can help in other ways by contributing your talent. To learn more about contributing to Processing, visit its contribute page, and to learn more about helping out with Incanter itself, visit its Google group.

David

FlightCaster merges its statistical-learning code into Incanter

Two goals have always been kept in mind during the development of Incanter 1) to provide an R-like, interactive statistical and graphics environment for performing data analysis, and 2) to provide a collection of libraries that can be embedded in larger data analysis systems, taking advantage of both the power of the Clojure language and the rich set of libraries available on the JVM for accessing and processing data.

Incanter has provided much of what you would expect from R by combining the power of the Clojure language with Java libraries, like Parallel Colt, jFreeChart, and Processing. The Clojure language itself is well suited for data processing due to its data structures, the sequence abstraction, the powerful sequence processing functions, destructuring, data structure literals, and myriad other niceties.

Now, Incanter’s goal of being embed-able in larger data analysis systems has taken a giant step forward with the merger of Flightcaster‘s code. Flightcaster’s service for predicting airline arrival times is a Clojure- and Hadoop-based distributed statistical-learning system that now has Incanter at the core. Their setup demonstrates how Incanter can scale up to larger data sets by leveraging JVM-based projects like Hadoop, AKKA, or other Clojure-based distributed computing systems. And with the Flightcaster code, Incanter has new functionality in classification, information theory, more probability, more io capabilities, a large range of dependence and similarity measures, and a variety of data transformation functions.

And this marks just the beginning of Flightcaster’s participation in the development of Incanter. Together we are working on a number of algorithms for learning models, structure, ensembles, and more. Future work will also ensure that Incanter can work seemlessly with FlightCaster’s recently open-source Crane framework for managing distributed processes on Hadoop and Amazon’s EC2.

Building a Clojure Web application with Incanter, Compojure, and Leiningen

This post will demonstrate how to build a simple Clojure Web application, with Compojure, that will use Incanter to generate a sample from a normal distribution with parameters (size, mean, and standard deviation) provided by the user, and then return a PNG image of the histogram of the data.

This example is simple, but it demonstrates how to dynamically generate an Incanter chart and pass it directly to the Web client as a PNG image.

For more information on the Compojure Web framework, visit its Github repository, its Google group, its WikiBooks site, and this quick tutorial.

I will use the Leiningen build tool to download dependencies, compile the application, and package it up as jar file. For more information on installing and using Leiningen, see my previous post.

Let’s start off by creating a project directory, I’ll call it incanter-webapp, and give it a src subdirectory. Then create the following Leiningen project.clj file in the project directory.

(defproject incanter-webapp "0.1.0"
  :description "A simple Incanter web-app."
  :dependencies [[incanter "1.2.1-SNAPSHOT"]
                 [compojure "0.3.2"]]
  :main simple_web_app)

This project file includes two dependencies, Incanter and Compojure, and indicates that the -main function for the project is in the file simple_web_app.clj in the src subdirectory of the project.

Next create the file called simple_web_app.clj in the src subdirectory.

(ns simple_web_app
  (:gen-class)
  (:use [compojure]
        [compojure.http response]
        [incanter core stats charts])
  (:import (java.io ByteArrayOutputStream
                    ByteArrayInputStream)))

Note that the namespace for this file is simple_web_app with underscores like in the file name and in the :main line of the project.clj file.

In Compojure, you need to define routes, which in this case associate the function sample-form with “/”, and the function gen-samp-hist-png with “/sample-normal”.

(defroutes webservice
  (GET "/"
    sample-form)
  (GET "/sample-normal"
    (gen-samp-hist-png request
                       (params :size)
                       (params :mean)
                       (params :sd))))

The function gen-samp-hist-png takes four arguments, a request object, the sample size, the population mean, and the population standard deviation, and then updates the Compojure response object to include an Incanter histogram of a sample from a normal distribution with the given parameters.

Now define the gen-samp-hist-png function; start by converting the string-arguments, passed in from the params object, into numbers.

(defn gen-samp-hist-png
  [request size-str mean-str sd-str]
    (let [size (if (nil? size-str)
                 1000
                 (Integer/parseInt size-str))
          m (if (nil? mean-str)
              0
              (Double/parseDouble mean-str))
          s (if (nil? sd-str)
              1
              (Double/parseDouble sd-str))

Each argument has a default value, 1000 for size, 0 for mean, and 1 for the standard deviation (sd).

Next generate the sample from the normal distribution using Incanter’s sample-normal function with the given size, mean, and standard deviation, and then create a histogram of the resulting data.

          samp (sample-normal size
                              :mean m
                              :sd s)
          chart (histogram
                  samp
                  :title "Normal Sample"
                  :x-label (str "sample-size = " size
                                ", mean = " m
                                ", sd = " s))

I’ve used the x-label option of the histogram function to customize the x-axis label so that it includes the given sample size, mean, and standard deviation.

Next, we need to convert the histogram into a byte array, and feed that into a ByteArrayInputStream that can be passed to Compojure’s update-response function, along with a header that sets the Content-Type to “image/png”.

        out-stream (ByteArrayOutputStream.)
        in-stream (do
                    (save chart out-stream)
                    (ByteArrayInputStream.
                      (.toByteArray out-stream)))
        header {:status 200
                :headers {"Content-Type" "image/png"}}]
    (update-response request
                     header
                     in-stream)))

The above code is the key to sending a dynamically generated chart directly to the client. In addition to a filename, an OutputStream can be passed to Incanter’s save function; in this case, we use a ByteArrayOutputStream, which is then converted to a byte array that is used to initialize the ByteArrayInputStream that is passed to update-response.

We can create a Web form for submitting requests to /sample-normal. The following function is a helper function that creates a simple html page skeleton.

(defn html-doc
  [title & body]
  (html
    (doctype :html4)
    [:html
      [:head
        [:title title]]
      [:body
       [:div
	[:h2
	 [:a {:href "/"}
          "Generate a normal sample"]]]
        body]]))

Now define the sample-form function, which will generate a Web form using html-doc. This function was associated with “/” earlier using the defroutes macro.

(def sample-form
  (html-doc "sample-normal histogram"
    (form-to [:get "/sample-normal"]
      "sample size: " (text-field {:size 4} :size)
      "mean: " (text-field {:size 4} :mean)
      "sd: " (text-field {:size 4} :sd)
      (submit-button "view"))))

Now, we need to create a -main function that will be called when the incanter-webapp.jar is executed. This function will call Compojure’s run-server function, starting the built-in Jetty server on port 8080.

(defn -main [& args]
  (run-server {:port 8080}
    "/*" (servlet webservice)))

Now we can use Leiningen to download and install all the necessary dependencies, including Incanter and Compojure, and then build and package the app.

$ lein deps
$ lein compile
$ lein uberjar

Finally, to start the server, run

$ java -jar incanter-webapp-standalone.jar 

and visit http://localhost:8080 to submit a request, or go directly to the sample-normal app by constructing a URL like the following:

http://localhost:8080/sample-normal?size=500&mean=100&sd=10

Which returns a PNG image of a histogram like this:

The complete code for this example can be found here, and the Leiningen project file can be found here.

Building Incanter applications with Leiningen

[NOTE: Make sure to use Leiningen version 1.1.0 or greater. The version can be determined with ‘lein version’]

This is a quick guide to using Leiningen to build applications that use Incanter, based on the very useful post by Zef Hemel, “Building Clojure Projects with Leiningen“.

(Note: the following tutorial uses Unix shell commands (i.e. Mac OS X or Linux), but Roland Sadowski has created a Leiningen PowerShell script for Windows that works with this example: lein.ps1)

First get a copy of Leiningen’s self-installing script:

$ cd ~/bin
$ wget http://github.com/technomancy/leiningen/raw/stable/bin/lein
$ chmod +x lein

and run the self-installer.

$ lein self-install

Now create a project directory called ‘incanter-helloword’ and give it a ‘src’ subdirectory.

$ mkdir incanter-helloworld
$ mkdir incanter-helloworld/src
$ cd incanter-helloworld

Next create the following project.clj file, which includes Incanter in the dependencies section:

(defproject incanter-helloworld "0.1.0"
  :description "The Incanter version of hello world."
  :dependencies [[incanter "1.2.3-SNAPSHOT"]]
  :main hello)

The project will have a single source file called hello.clj, which will contain a -main function that will be called when the jar file is executed. Create it in the incanter-helloworld/src directory:

(ns hello
  (:gen-class)
  (:use (incanter core stats charts)))

(defn -main [& args]
  (view (histogram (sample-normal 1000))))

The program will display a histogram of sample data from a standard normal distribution.

Now download and install all the project’s dependencies, including Incanter, using Leiningen’s ‘deps’ command:

$ lein deps

Compile the project:

$ lein compile

And build an ‘uberjar’ that includes all the Incanter jar files:

$ lein uberjar

And finally, run incanter-helloworld,

$ java -jar incanter-helloworld-standalone.jar

which displays the following histogram.

You can also start a Clojure REPL with a CLASSPATH that includes all your project’s dependencies, including Incanter and its dependencies, using the following command:

$ java -cp 'lib/*' clojure.main

The lein repl command won’t work correctly, since it uses Clojure 1.1 and Incanter requires Clojure 1.2. You can also use the repl and swank scripts available in the scripts directory of the Incanter distribution on Github.

Incanter Cheat Sheet

I’ve created an Incanter cheat sheet (pdf or LaTex) based on Steve Tayon’s Clojure cheat sheet (pdf or html). Click on the image below to see the complete pdf.

Updated: Trimmed some redundancies, got the sheet down to a single page.

Statistical Learning in Clojure Part 1: LDA & QDA Classifiers

This will hopefully be the first of a series of posts based on a book that has substantially influenced me over the last several years, The Elements of Statistical Learning (EoSL) by Hastie, Tibshirani, and Friedman (I went and got a degree in statistics essentially for the purposes of better understanding this book). Best of all, the pdf version of EoSL is now available free of charge at the book’s website, along with data, code, errata, and more.

This post will demonstrate the use of Linear Discriminant Analysis and Quadratric Discriminant Analysis for classification, as described in chapter 4, “Linear Methods for Classification”, of EoSL.

I will implement the classifiers in Clojure and Incanter, and use the same data set as EoSL to train and test them. The data has 11 different classes, each representing a vowel sound, and 10 predictors, each representing processed audio information captured from eight male and seven female speakers. Details of the data are available here. The data is partitioned into a training set and a test set.

The task is to classify each of the N observations as one of the (K=11) vowel sounds using the (p=10) predictors. I will do this using a Bayes classifier built from both linear- and quadratic-discriminant functions. The Bayes classifier works by estimating the probability that an observation, x, belongs to each class, k, and then selects the class with the highest probability. Appropriately, Bayes rule is used to calculate the conditional probability , or posterior probability, for each class.

(eq 4.7) P(G=k|X=x) = \frac{P(X=x|G=k)\pi_k}{\sum_{l=1}^{K} P(X=x|G=l)\pi_l}  

Where, \pi_k is the prior probability of class k. If we model each class density as multivariate-normal, then P(X=x|G=k) = f_k(x) , where

(eq 4.8) f_{k}(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_k)}  

and the posterior probability becomes

(eq 4.7) P(G=k|X=x) = \frac{f_{k}(x)\pi_k}{\sum_{l=1}^{K}f_l(x)\pi_l}  

Since the denominator is the same for each class, we can ignore it, and if we take the log of the numerator and ignore the constant, (2\pi)^{p/2} , in the density function, we end up with the quadratic discriminant function:

(eq 4.12) \delta_{k}(x)=-\frac{1}{2}log|\Sigma_{k}|-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_k)+log\pi_k 

If we further assume that the covariance matrices, \Sigma_k are equal for each cluster, then more terms cancel out and we end up with the linear discriminant function,

(eq 4.10) \delta_{k}(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+log\pi_k  

Classification is then performed with these discriminant functions using the following decision rule:

(eq 4.10a) G(x)=argmax_k\delta_k(x) 

In other words, select the class, k, with the highest score for each observation.

Implementing the LDA Classifier

Start by loading the necessary Incanter libraries and the data sets from the EoSL website:

(use '(incanter core stats charts io))
(def training (to-matrix 
                (read-dataset "http://bit.ly/464h4h" 
                              :header true)))
(def testing (to-matrix 
               (read-dataset "http://bit.ly/1btCei" 
                             :header true)))

Define the fixed parameters:

(def K 11)
(def p 10)
(def N (nrow training))
(def group-counts (map nrow (group-by training 1)))

Since we don’t know the parameters of the multivariate-normal distributions used to model each cluster, we need to estimate them with the training data. First, estimate the prior probabilities for each cluster

(eq 4.10b) \hat\pi_k=N_k/N 

where N_k is the number of observations in class k, and N is the total number of observations.

(def prior-probs (div group-counts N))

Next, estimate the centroids for each cluster

(eq 4.10c) \hat\mu_k=\sum_{g_i=k}x_i/N_k  
(def cluster-centroids 
  (matrix 
    (for [x_k (group-by training 1 :cols (range 2 12))]
      (map mean (trans x_k)))))

Finally, estimate the covariance matrix to be used for all clusters,

(eq 4.10d) \hat\Sigma=\sum_{k=1}^K\sum_{g_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T/(N-K) 
(def cluster-cov-mat 
  (let [groups (group-by training 1 :cols (range 2 12))]
    (reduce plus
      (map (fn [group centroid n]
        (reduce plus 
                (map #(div 
                        (mmult (trans (minus % centroid))
                               (minus % centroid))
                        (- N K))
                     group)))
             groups cluster-centroids group-counts))))

and its inverse:

(def inv-cluster-cov-mat (solve cluster-cov-mat))

With the parameters estimated, now define the linear discriminant function based on equation 4.10 from EoSL:

(eq 4.10) \delta_{k}(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+log\pi_k  
(defn ldf [x Sigma-inv mu_k pi_k]
  (+ (mmult x Sigma-inv (trans mu_k))
     (- (mult 1/2 (mmult mu_k Sigma-inv (trans mu_k))))
     (log pi_k)))

Differences in the equation, as implemented in the code, and equation 4.10 are because most of the arguments to the ldf function above are row vectors, instead of the column vectors used in eq 4.10.

Next, define a function that returns a matrix of linear discriminate scores, where rows are observations and columns are classes,

(defn calculate-scores
  ([data inv-cov-mat centroids priors]
    (matrix 
      (pmap (fn [row]
             (pmap (partial ldf row inv-cov-mat) 
                   centroids 
                   priors))
           (sel data :cols (range 2 12))))))

and calculate the scores for the training data

(def training-lda-scores 
  (calculate-scores training
                    inv-cluster-cov-mat
                    cluster-centroids
                    prior-probs))

Now, create a function that returns the index (i.e. class) with the maximum score for a given row (i.e. observation) of the scoring matrix,

(defn max-index 
  ([x] 
    (let [max-x (reduce max x)
          n (length x)]
      (loop [i 0]
        (if (= (nth x i) max-x)
          i
          (recur (inc i)))))))

We can get a vector of the predicted class for every observation in the training set with the following line of code:

(map max-index training-lda-scores)

To determine the error rate of the classifier, create the following function,

(defn error-rate [data scores] 
  (/ (sum (map #(if (= %1 %2) 0 1) 
               (sel data :cols 1) 
               (plus 1 (map max-index scores))))
     (nrow data)))

and apply it to the score matrix for the training data,

(error-rate training training-lda-scores)

which returns an error rate of 0.316.

Now let’s see how the classifier performs on the test data. Start by calculating the linear discriminant score matrix,

(def testing-lda-scores 
  (calculate-scores testing
                    inv-cluster-cov-mat
                    cluster-centroids
                    prior-probs))

and then calculate the error-rate,

(error-rate testing testing-lda-scores)

which is 0.556. As expected the error rate is higher on the test data than the training data.

Implementing the QDA Classifier

In order to use QDA to classify the data, we need to estimate a unique covariance matrix for each of the K classes, and write the quadratic discriminant function.

First, estimate the covariance matrices,

(eq 4.10e) \hat\Sigma_k=\sum_{g_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T/(N_k-1) 
(def cluster-cov-mats 
  (let [groups (group-by training 1 :cols (range 2 12))]
    (map (fn [group centroid n]
      (reduce plus 
              (map #(div 
                      (mmult (trans (minus % centroid))
                             (minus % centroid))
                      (dec n))
                   group)))
           groups cluster-centroids group-counts)))

and their inverses.

(def inv-cluster-cov-mats (map solve cluster-cov-mats))

Next, create a function that implements the quadratic discriminant function:

(eq 4.12) \delta_{k}(x)=-\frac{1}{2}log|\Sigma_{k}|-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_k)+log\pi_k 
(defn qdf [x Sigma_k Sigma-inv_k mu_k pi_k]
  (+ (- (mult 1/2 (log (det Sigma_k))))
     (- (mult 1/2 (mmult (minus x mu_k) 
                         Sigma-inv_k 
                         (trans (minus x mu_k)))))
     (log pi_k)))

Then calculate the quadratic discriminant score matrix and error rates for the training and test data, which are 0.0113 and 0.528 respectively. Notice that the error rate is substantially lower on the training data than it was with LDA (0.316), but the error rate on the test data is not much better than that achieved by LDA (0.556).

Improving Performance

Hastie, Tibshirani, and Friedman suggested performance of the calculations can be improved by first performing an Eigenvalue decomposition of the covariance matrices.

\Sigma_k=U_kD_kU_k^T 

where U_k is the matrix of eigenvectors, and D_k is a diagonal matrix of positive eigenvalues d_{kl} .

(def Sigma-decomp 
  (map decomp-eigenvalue cluster-cov-matrices ))
(def D (map #(diag (:values %)) Sigma-decomp))
(def U (map #(:vectors %) Sigma-decomp))

Making the following substitutions to \delta_k(x) ,

(a) (x-\hat\mu_k)^T\hat\Sigma_k^{-1}(x-\hat\mu_k)=[U_k^T(x-\hat\mu_k)]^TD_k^{-1}[U_k^T(x-\hat\mu_k)] 
(b) log|\hat\Sigma_k|=\sum_l log(d_{kl}) 

results in the following quadratic discriminant function:

(eq 4.12) \delta_{k}(x)=-\frac{1}{2}\sum_l log(d_{kl})-\frac{1}{2}[U_k^T(x-\hat\mu_k)]^TD_k^{-1}[U_k^T(x-\hat\mu_k)]+log\pi_k 
(defn qdf [x D_k U_k mu_k pi_k]
  (+ (- (mult 1/2 (sum (map log (diag D_k)))))
     (- (mult 1/2 
              (mmult (trans (mmult (trans U_k) 
                                   (trans (minus x mu_k))))
                     (solve D_k) 
                     (mmult (trans U_k) 
                            (trans (minus x mu_k))))))
     (log pi_k)))

This version of the quadratic discriminant function is a bit faster than the previous version, while producing the exact same results. The performance improvement should be greater with larger data sets.

The complete code for this post can be found here.