Incanter Development Roadmap

I want to discuss plans for future Incanter development, and I’m looking for volunteers interested in contributing to any of the following projects, as well as suggestions for other improvements.

My list of priorities, in no particular order:

1. Create new functions based on the Java libraries already included in Incanter. For instance, I would like to improve incanter.optimize by a) including the nonlinear optimization routines available in Parallel Colt, b) writing new routines in Clojure, and c) improving the existing routines.

2. Integrate Parallel Colt’s sparse matrix support, as suggested by Mark Reid.

3. Expose more of the chart customizability of JFreeChart in the incanter.charts library, e.g. enabling annotations of categorical charts, allowing users to set the scale on axes, customizing colors, etc.

4. Create an incanter.viz library, consisting of Processing-based data visualizations.

5. Integrate the Weka machine learning library.

6. Provide additional statistical methods.

7. Optimize existing functions; in general, I have favored ease-of-use and expressibility over performance, but there is A LOT of room to optimize without compromising usability.

Any other suggestions or feedback are welcome, and should be directed to the Incanter Google Group, where I have started an “Incanter Roadmap” thread. You can subscribe to the mailing list here.

David

New Google group for Incanter

The number of questions and suggestions I have been receiving about Incanter has been on the rise, so I decided to create a Google group (http://groups.google.com/group/incanter) as a central location for future discussion. To subscribe to the mailing list visit http://groups.google.com/group/incanter/subscribe. I welcome questions, answers, and suggestions for future development.

I would also like to thank everybody who has previously sent encouragement, patches, and improvements; it has made this project even more fun to work on.

David

Creating Processing Visualizations with Clojure and Incanter

The Processing language, created by Ben Fry and Casey Reas,
“is an open source programming language and environment for people who want to program images, animation, and interactions.”

Incanter now includes the incanter.processing library, a fork of Roland Sadowski’s clj-processing library, making it possible to create Processing visualizations with Clojure and Incanter. Incanter.processing provides much more flexibility when creating customized data visualizations than incanter.charts, though it is also more complex to use.

Several nice examples of the kinds of visualizations that can be created in Processing can be found on Ben Fry’s website and blog, including the zipcode, humans vs. chimps, and baseball salary vs. performance examples.

The Processing website has a set of tutorials, including one on getting started, and there are also three great books on Processing worth checking out.

The API documentation for incanter.processing is still a bit underdeveloped, but is perhaps adequate when combined with the excellent API reference on the Processing website.

Incanter.processing was forked from Roland Sadowski’s clj-processing library in order to provide cleaner integration with Incanter. I have added a few functions, eliminated some, and changed the names of others. There were a few instances where I merged multiple functions (e.g. *-float, *-int) into a single one to more closely mimic the original Processing API; I incorporated a couple of functions into Incanter’s existing multimethods (e.g. save and view); I eliminated a few functions that duplicated existing Incanter functions and caused naming conflicts (e.g. sin, cos, tan, asin, acos, atan, etc.); and I changed the function signatures of pretty much every function in clj-processing to require the explicit passing of the ‘sketch’ (PApplet) object, whereas the original library passes it implicitly by requiring that it be bound to a variable called *applet* in each method of the PApplet proxy.

These changes make it easier to use Processing within Incanter, but if you just want to write Processing applications in Clojure without all the dependencies of Incanter, then the original clj-processing library is the best choice.

A Simple Example

The following is sort of a “hello world” example that demonstrates the basics of creating an interactive Processing visualization (a.k.a. sketch), including defining the sketch’s setup, draw, and mouseMoved methods and representing state in the sketch using closures and refs. This example is based on this one, found at John Resig’s Processing-js website.


Click on the image above to see the live Processing-js version of the sketch.

Start by loading the incanter.core and incanter.processing libraries,

(use '(incanter core processing))

Now define some refs that will represent the state of the sketch object,

(let [radius (ref 50.0)
      X (ref nil)
      Y (ref nil)
      nX (ref nil)
      nY (ref nil)
      delay 16

The variable radius will provide the value of the circle’s radius; X and Y will indicate the location of the circle; and nX and nY will indicate the location of the mouse. We use refs because these values are mutable and need to be available across multiple functions in the sketch object.

Now define a sketch object, which is just a proxied processing.core.PApplet, and its required setup method,

sktch (sketch
        (setup []
          (doto this
            (size 200 200)
            (stroke-weight 10)
            (framerate 15)
            smooth)
          (dosync
            (ref-set X (/ (width this) 2))
            (ref-set Y (/ (height this) 2))
            (ref-set nX @X)
            (ref-set nY @Y)))

The first part of the setup method sets the size of the sketch, the stroke weight used when drawing, and the framerate of the animation, and indicates that anti-aliasing should be used. The next part of the method uses a dosync block and ref-set to set initial values for the refs. Note the @ syntax used to dereference (access the values of) the refs X and Y.

Processing sketches that use animation require the definition of a draw method, which in this case will be invoked 15 times per second as specified by the framerate.

  (draw []
   (dosync
     (ref-set radius (+ @radius 
                        (sin (/ (frame-count this) 4))))
     (ref-set X (+ @X (/ (- @nX @X) delay)))
     (ref-set Y (+ @Y (/ (- @nY @Y) delay))))
   (doto this
     (background 125) ;; gray
     (fill 0 121 184)
     (stroke 255)
     (ellipse @X @Y @radius @radius)))

The first part of the draw method uses dosync and ref-set to set new values for the radius, X, and Y refs for each frame of the animation. The sin function is used to grow and shrink the radius over time. The location of the circle, as indicated by X and Y, is determined by the mouse location (nX and nY) with a delay factor.

The next part of the draw method fills the background with gray and then draws the circle with the ellipse function.

Finally, we need to define the mouseMoved method in order to track the mouse location, using the mouse-x and mouse-y functions, and set the values of the nX and nY refs. All event functions in incanter.processing, including mouseMoved, require an event argument; this is due to limitations of Clojure’s proxy macro, and isn’t required when using Processing’s Java API directly.

(mouseMoved [mouse-event]
  (dosync
    (ref-set nX (mouse-x mouse-event)) 
    (ref-set nY (mouse-y mouse-event)))))]

Now that the sketch is fully defined, use the view function to display it,

(view sktch :size [200 200]))

The complete code for this example can be found here.

In future posts I will walk through other examples of Processing visualizations, some of which can be found in the Incanter distribution under incanter/examples/processing.

Bootstrapping: estimating confidence intervals, standard errors, and bias

In addition to discovering Benford’s law (which I used in a previous post) 57 years before Frank Benford, Simon Newcomb made a series of measurements of the speed of light. In 1882 he measured the time a light signal took to travel the 7,400 meters from the US Naval Observatory, on the Potomac River, to a mirror at the base of the Washington Monument and back. Newcomb’s data are frequently used to demonstrate statistical techniques, as I will do here. The data are coded as deviations from 24,800 nanoseconds, so the first entry, corresponding to 24,828 nanoseconds, is 28.

Here’s a vector, x, containing the measurements.

(def x [28 -44 29 30 24 28 37 32 36  
        27 26 28 29 26 27 22 23 20  
        25 25 36 23 31 32 24 27 33  
        16 24 29 36 21 28 26 27 27  
        32 25 28 24 40 21 31 32 28  
        26 30 27 26 24 32 29 34 -2  
        25 19 36 29 30 22 28 33 39  
        25 16 23])

There are two obvious outliers in the data, so the median, not the mean, is the preferred estimate of location. Bootstrapping is a method often employed for estimating confidence intervals, standard errors, and estimator bias for medians.

Load the necessary Incanter libraries,

(use '(incanter core stats charts))

View a histogram of the data, and note the two outlier observations at -2 and -44.

(view (histogram x 
                 :nbins 20
                 :title "Newcomb's speed of light data"))

Now calculate the median of the sample data

(median x)

The result is 27, but how does this value relate to the speed of light? We can translate this value to a speed in meters/sec with the following function.

(defn to-speed 
  "Converts Newcomb's data into speed (meters/sec)"
  ([x] (div 7400 (div (plus 24800 x) 1E9))))

Now apply it to the median.

(to-speed (median x))

The result is 2.981E8 m/sec (the currently accepted value for the speed of light is 2.9979E8 m/sec), but how good an estimate of the median is this? How much sampling error is there? We can estimate confidence intervals, standard errors, and estimator bias using bootstrapping.

Bootstrapping is a technique where items are drawn from a sample, with replacement, until we have a new sample that is the same size as the original. Since each item is replaced before the next item is drawn, a single value may appear more than once, and other items may never appear at all. The desired statistic, in this case median, is calculated on the new sample and saved. This process is repeated until you have the desired number of sample statistics. Incanter’s bootstrap function can be used to perform this procedure.
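
For intuition, here is a minimal sketch of a single bootstrap replicate written in plain Clojure; bootstrap-median-once is just an illustrative name, not part of Incanter, and this is not necessarily how the bootstrap function is implemented internally.

;; draw (count x) items from x with replacement and take their median
(defn bootstrap-median-once [x]
  (median (repeatedly (count x) #(rand-nth x))))

;; repeating this 10,000 times approximates the sample drawn below:
;; (repeatedly 10000 #(bootstrap-median-once x))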

Draw 10,000 bootstrapped samples of the median.

(def t* (bootstrap x median :size 10000))

View a histogram of the bootstrap sample of the median

(view (histogram t* 
                 :density true 
                 :nbins 20
                 :title "Bootstrap sample of medians"
                 :x-label "median"))

The bootstrap distribution is very jagged because the median is discrete, so there are only a small number of values that it can take. Later, I will demonstrate how to smooth over the discreteness of the median with the :smooth option to the bootstrap function.

The mean of t* is the bootstrap estimate of the median,

(mean t*)

which is 27.301. An estimate of the 95% confidence interval (CI) for the median can be calculated with the quantile function.

(quantile t* :probs [0.025 0.975])

The CI is estimated as (26.0 28.5). The standard error of the median statistic is estimated by the standard deviation of the bootstrap sample.

(sd t*)

The standard error estimate is 0.681. The bias of the median statistic is estimated as the difference between the mean of the bootstrap sample and the original sample median.

(- (mean t*) (median x))

The estimated bias is 27.301 - 27 = 0.301.

Smoothing

As was apparent in the previous histogram, the bootstrap distribution is jagged because the median is discrete. In order to smooth over the discreteness of the median, we can add a small amount of random noise to each bootstrap sample. The :smooth option adds gaussian noise to each median in the sample. The mean of the noise equals zero and the standard deviation equals 1/sqrt(n), where n is the original sample size.
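
The effect of :smooth can be mimicked by adding the noise ourselves. A rough sketch, using sample-normal from incanter.stats and the t* sample drawn earlier (smoothed-by-hand is just an illustrative name, and this is not necessarily how bootstrap implements the option internally):

;; add N(0, 1/sqrt(n)) noise to each bootstrapped median,
;; where n is the original sample size
(def smoothed-by-hand 
  (plus t* (sample-normal (count t*) 
                          :mean 0 
                          :sd (/ 1 (sqrt (count x))))))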

Draw 10000 smoothed bootstrap samples of the median

(def smooth-t* (bootstrap x median 
                          :size 10000 
                          :smooth true))

View a histogram of the smoothed bootstrap samples of the median

(view (histogram smooth-t* 
                 :density true 
                 :nbins 20
                 :title "Smoothed bootstrap sample of medians"
                 :x-label "median"))

Calculate the estimate of the median.

(mean smooth-t*)

The smoothed estimate of the median is 27.304. Now, calculate a 95% CI.

(quantile smooth-t* :probs [0.025 0.975])

The smoothed result is (25.905 28.446).

The complete code for this example can be found here.

Bayesian inference of multinomial distribution parameters

The multinomial distribution is used to describe data where each observation is one of k possible outcomes. In his book, Bayesian Data Analysis (pg 83), Andrew Gelman demonstrates how to use Bayesian methods to make inferences about the parameters of a multinomial distribution. He used data from a sample survey by CBS news prior to the 1988 presidential election. In his book Bayesian Computation with R (pg. 60), Jim Albert used R to perform the same inferences on the same data as Gelman. I’ll now demonstrate how to perform these inferences in Clojure, with Incanter, using the sample-multinomial-params function. I will then describe the Bayesian model underlying sample-multinomial-params, and demonstrate how to do the same calculations directly with the sample-dirichlet function.

First, load the necessary libraries, including the incanter.bayes library.

(use '(incanter core stats bayes charts))

The data consist of 1447 survey responses; 727 respondents supported George Bush Sr., 583 supported Michael Dukakis, and 137 supported another candidate, or were undecided. Define a vector, y, that includes the counts for each of the outcomes.

(def y [727 583 137])

The proportion of supporters for the candidates, calculated by dividing y by the total responses,

(div y 1447)

are (0.502 0.403 0.095). If the survey data follow a multinomial distribution, these proportions are estimates of the parameters of that distribution. The question is, how good an estimate of the population parameters are these values?

Confidence intervals for the parameters can be calculated by applying the quantile function to a sample drawn from an appropriate distribution, which in the case of the parameters of a multinomial distribution is a Dirichlet distribution.

Samples can be drawn from the appropriate Dirichlet distribution with the sample-multinomial-params function.

(def  theta (sample-multinomial-params 1000 y))

This returns 1000 samples of the multinomial parameters, or probabilities of each of the outcomes in the data. Now, create variables representing the samples for each parameter,

(def theta1 (sel theta :cols 0))
(def theta2 (sel theta :cols 1))
(def theta3 (sel theta :cols 2))

and calculate the mean, standard deviation, 95% confidence interval, and display a histogram for each sample.

(mean theta1)
(sd theta1)
(quantile theta1 :probs [0.025 0.975]) 
(view (histogram theta1
                 :title "Bush, Sr. Support"))

The mean for theta1 is 0.502, the standard deviation is 0.013, and the 95% confidence interval is (0.476 0.528).

(mean theta2) 
(sd theta2)
(quantile theta2 :probs [0.025 0.975]) 
(view (histogram theta2
                 :title "Dukakis Support"))

The mean for theta2 is 0.403, the standard deviation is 0.013, and the 95% confidence interval is (0.376 0.427).

(mean theta3) 
(sd theta3)
(quantile theta3 :probs [0.025 0.975]) 
(view (histogram theta3
                 :title "Other/Undecided"))

The mean for theta3 is 0.095, the standard deviation is 0.008, and the 95% confidence interval is (0.082 0.111).

Using the sample distributions for the parameters, we can view the distribution of other statistics, like the difference in support between Bush, Sr. and Dukakis. View a histogram of the difference in support for the two candidates.

(view (histogram (minus theta1 theta2) 
                 :title "Bush, Sr. and Dukakis Diff"))

The sample-multinomial-params function returns values from a Dirichlet distribution. The Dirichlet is the conjugate prior distribution for the parameters of the multinomial distribution, such that the posterior Dirichlet’s parameters equal the sum of the count data, y, and the parameters from the prior Dirichlet, alpha.

 p(\theta | y, \alpha) = p(y | \theta) p(\theta | \alpha) / p(y)  

Where

 y = (y_1, y_2, ..., y_k)  

are the count data that follow a multinomial distribution,

 \theta = (\theta_1, \theta_2, ..., \theta_k)  

are the parameters of the multinomial distribution, and

 \alpha = (\alpha_1, \alpha_2, ..., \alpha_k)  

are the hyper-parameters of the prior Dirichlet distribution. The likelihood function, p(y | theta), is a multinomial conditional distribution that describes the likelihood of seeing the observed y values, given the values of the parameters, theta.
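
Written out, the multinomial likelihood and the Dirichlet prior have the same functional form in theta (a standard conjugacy calculation):

 p(y | \theta) \propto \theta_1^{y_1} \theta_2^{y_2} \cdots \theta_k^{y_k}  

 p(\theta | \alpha) \propto \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_k^{\alpha_k - 1}  

so their product, the posterior, is proportional to a Dirichlet density with parameters y + alpha.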

The prior, p(theta | alpha), is made uninformative by setting all the alpha values equal to one, producing a uniform distribution over the possible values of theta.

 \alpha = (1, 1, ..., 1) 

Then the posterior Dirichlet becomes

 p(\theta | y) = Dirichlet(\theta | y_1+1, y_2+1, ..., y_k+1)  

Sample the proportions directly from the above Dirichlet distribution using the sample-dirichlet function.

(def alpha [1 1 1])
(def  props (sample-dirichlet 1000 (plus y alpha)))

This returns data from the same distribution as the sample-multinomial-params function, so all the means, standard deviations, and confidence intervals can be calculated as above.
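
For example, the Bush, Sr. column of props can be summarized just as theta1 was above (assuming sample-dirichlet lays out its samples in the same column order as sample-multinomial-params):

(def props1 (sel props :cols 0))
(mean props1)
(sd props1)
(quantile props1 :probs [0.025 0.975])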

The complete code for this example is here.

Monty Hall in Monte Carlo

The Monty Hall problem, named after the host of the Let’s Make a Deal game show, first appeared in Marilyn vos Savant’s column in Parade magazine in 1990, although people still vigorously debate the answer, as discussed here and here.

The basic premise is that you must choose one of three doors, one of which contains a prize. Once you have selected a door, Monty Hall reveals the contents of one of the doors you did not select, usually containing something like a goat. He then asks if you would like to switch your choice to the remaining unopened door. The question is, will switching doors increase your chances of winning, decrease your chances, or have no effect at all?

One thing to keep in mind is that when Monty Hall shows you what’s behind one of the doors you didn’t select, he never reveals the door containing the prize. This fact should change your decision-making process.

We can answer the question of which strategy is best by performing a Monte Carlo simulation and calculating the conditional probabilities of winning given that you either switch or stay. The conditional probability of winning given that you decide to switch doors can be written as Prob(win | switch), and is equal to the probability that you switch and win divided by the probability that you switch,

Prob(win | switch) = Prob(switch & win)/Prob(switch)

and the conditional probability of winning given you decide to stay with your original door is equal to the probability you stay and win divided by the probability that you stay.

Prob(win | stay) = Prob(stay & win)/Prob(stay)
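
For comparison with the simulation results below, the theoretical values are straightforward to work out: the initial guess is wrong with probability 2/3, and the switch decision is made independently with probability 1/2, so

Prob(switch & win) = 1/2 * 2/3 = 1/3, Prob(switch) = 1/2, Prob(win | switch) = (1/3)/(1/2) = 2/3

and similarly

Prob(stay & win) = 1/2 * 1/3 = 1/6, Prob(stay) = 1/2, Prob(win | stay) = (1/6)/(1/2) = 1/3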

In this simulation, we will generate a bunch of initial guesses, a bunch of prize locations, and a bunch of switch decisions and then use them to calculate the probabilities. First, load the necessary libraries.

(use '(incanter core stats charts))

Set a simulation sample size, then generate samples of initial guesses, prize doors, and switch decisions with the sample function. The probability that the prize is behind any of the three doors is 1/3, as is the probability that you select a particular door as an initial guess, while the probability that you switch doors is 1/2.

(def n 10000)
(def initial-guesses (sample [1 2 3] :size n))
(def prize-doors (sample [1 2 3] :size n))
(def switches (sample [true false] :size n))

Define a function that returns one if a switch decision results in winning and zero otherwise.

(defn switch-win? [initial-guess switch prize-door] 
  (if (and switch (not= initial-guess prize-door)) 1 0))

Calculate the joint probability of winning and switching, by mapping the switch-win? function over the generated data, summing the occurrences of winning after switching, and then dividing this number by the total number of samples, n.

(def prob-switch-win (/ (sum (map switch-win? 
                              initial-guesses 
                              switches 
                              prize-doors)) 
                        n))

Calculate the probability of switching doors. Use the indicator and sum functions to calculate the frequency of switches, and then divide by the total number of samples.

(def prob-switch (/ (sum (indicator true? switches)) n))

Finally, use the formula above to calculate the conditional probability of winning given a switch.

(def prob-win-given-switch (/ prob-switch-win prob-switch))

Now repeat the process to calculate the conditional probability of winning given the decision to stay with your original choice. First, define a function that returns a one if a stay decision results in winning, zero otherwise.

(defn stay-win? [initial-guess switch prize-door] 
  (if (and (not switch) (= initial-guess prize-door)) 1 0))

Next, calculate the joint probability of winning and staying,

(def prob-stay-win (/ (sum (map stay-win? 
                            initial-guesses 
                            switches 
                            prize-doors)) 
                      n))

and, calculate the probability of staying with the original door.

(def prob-stay (/ (sum (indicator false? switches)) n))

Finally, calculate the conditional probability of winning given you stay with the original door

(def prob-win-given-stay (/ prob-stay-win prob-stay))

View the results with the pie-chart function.

(view (pie-chart ["Switch" "Stay"] 
                 [prob-win-given-switch prob-win-given-stay]))

The conditional probability of winning given you switch doors is approximately 0.66, while the probability of winning given you stay with your original choice is approximately 0.33.

To some this outcome may seem non-intuitive. Here’s why it’s the case: the probability that the prize is behind any of the three doors is 1/3, so once you have selected a door, the probability that the prize is behind a door you didn’t select is 2/3. Since Monty Hall knows which door has the prize, and never selects it as the door to reveal, all of the 2/3 probability belongs to the other unselected door. So switching will give you a 2/3 probability of winning the prize, and sticking with your original decision only gives you the original 1/3 chance.

The complete code for this example can be found here.

Chi-square goodness-of-fit, Benford’s law, and the Iranian election

Benford’s law has been discussed as a means of analyzing the results of the 2009 Iranian election by several analysts. One of the first attempts was in a paper (pdf) by Boudewijn Roukema, where he concluded that the data did not follow Benford’s law, and that there was evidence for vote tampering. Andrew Gelman responded; he was not convinced by the statistics of this particular analysis (his criticism was of the statistical analysis, and didn’t reflect his opinion of the possibility of vote tampering).

Stochastic Democracy and the Daily Kos also used Benford’s law to analyze the election data, here, but came to the opposite conclusion of Roukema; the data did follow Benford’s law. A third analysis was performed by John Graham-Cumming in this blog post. He came to the same conclusion as Stochastic Democracy and Daily Kos, that the data did follow Benford’s law. His analysis is based on this data, available from the Guardian’s Data Blog.

Again, like Gelman, I’m not convinced that Benford’s law can necessarily be used to draw conclusions on the presence or absence of vote tampering, but it is an interesting example of a Chi-square goodness-of-fit analysis, comparing some real-world data with a theoretical distribution, Benford’s law in this case. So I will demonstrate how to perform the analysis with Incanter using the data set made available by the Guardian. A CSV file with the data can be found here.

First load the necessary libraries,

(use '(incanter core stats charts io))

and read the data using the read-dataset function.

(def votes (read-dataset "data/iran_election_2009.csv" 
                         :header true))

Now, create variables representing the regions and the candidates.

(def regions (sel votes :cols "Region"))
(def ahmadinejad-votes (sel votes :cols "Ahmadinejad"))
(def mousavi-votes (sel votes :cols "Mousavi"))
(def rezai-votes (sel votes :cols "Rezai"))
(def karrubi-votes (sel votes :cols "Karrubi"))

We need to write a small function that will take a number and return only the first digit. We can do that by converting the number into a string with the str function, taking the first character of the string, and then converting that character into a number again.

(defn first-digit [x] 
  (Character/digit (first (str x)) 10)) 

Now, map the first-digit function over the vote counts in each of the 30 regions for each candidate.

(def ahmadinejad (map first-digit ahmadinejad-votes))
(def mousavi (map first-digit mousavi-votes))
(def rezai (map first-digit rezai-votes))
(def karrubi (map first-digit karrubi-votes))

Now we have a list of first-digits across the different regions for each candidate. Next, define a function that implements Benford’s law, create a variable containing the probabilities for each of the nine digits, and calculate the expected frequencies for the 30 regions.
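
For reference, Benford’s law gives the probability of a leading digit d as

 P(d) = \log_{10}(1 + 1/d), \quad d = 1, 2, ..., 9  

which is what the benford-law function below computes, with (div d) supplying the 1/d term.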

(defn benford-law [d] (log10 (plus 1 (div d))))
(def benford-probs (benford-law (range 1 10)))
(def benford-freq (mult benford-probs (count regions)))

We will then need a function that takes the list of first digits for each candidate and returns the counts for each digit. We can use the :counts field returned by the tabulate function to do most of the work.

(defn get-counts [digits] 
  (map #(get (:counts (tabulate digits)) % 0) 
       (range 1.0 10.0 1.0)))

It’s always a good idea to look at the data, so plot the predicted first-digit frequencies and the frequencies for Ahmadinejad and Mousavi.

(doto (xy-plot (range 1 10) (get-counts ahmadinejad)
               :legend true :series-label "Ahmadinejad"
               :y-label "First digit frequency"
               :x-label "First digit"
               :title "First digit frequency by candidate")
      (add-lines (range 1 10) (get-counts mousavi) 
                 :series-label "Mousavi")
      (add-lines (range 1 10) benford-freq 
                 :series-label "Predicted")
      clear-background
      view)

Based on that plot, the digit frequencies look like they follow Benford’s law. We can confirm that with a Chi-square goodness-of-fit test using the chisq-test function. Start with Ahmadinejad,

(def ahmadinejad-test 
  (chisq-test :table (get-counts ahmadinejad) 
              :probs benford-probs))
(:X-sq ahmadinejad-test)
(:p-value ahmadinejad-test)

The X-square test value is 5.439 with a p-value of 0.71. Based on these values we cannot reject the null hypothesis that the observed distribution of first-digits is the same as the expected distribution, which follows Benford’s law. Next test Mousavi’s vote counts,

(def mousavi-test 
  (chisq-test :table (get-counts mousavi) 
              :probs benford-probs))
(:X-sq mousavi-test) 
(:p-value mousavi-test)

The X-square test value for Mousavi is 5.775 with a p-value of 0.672. As with Ahmadinejad’s counts, we cannot reject the null hypothesis. We might as well test the distributions of first-digits for the remaining two candidates (which we didn’t plot), Rezai and Karrubi.

(def rezai-test 
  (chisq-test :table (get-counts rezai) 
              :probs benford-probs))
(:X-sq rezai-test) 
(:p-value rezai-test) 

The X-square test value for Rezai is 12.834 with a p-value of 0.118, so we can’t reject the null hypothesis.

(def karrubi-test 
  (chisq-test :table (get-counts karrubi) 
              :probs benford-probs))
(:X-sq karrubi-test) 
(:p-value karrubi-test) 

Karrubi’s X-square value is 8.8696 with a p-value of 0.353, so like the other candidates’ counts, his vote counts appear to follow Benford’s law.

The conclusion, based on the vote count summaries for the 30 Iranian regions, is that the votes follow Benford’s law.

Last-digit frequencies

Another approach to identifying vote tampering is to examine the last digit of the vote counts, instead of the first. This approach was discussed in this Washington Post article.

Unlike first-digits, last-digits don’t follow Benford’s law, but rather follow a uniform distribution. We can test the distribution of last-digits from our sample of 30 regions using the chisq-test function again. First, define a function to extract the last digit from a number,

(defn last-digit [x] 
  (Character/digit (last (str x)) 10))

and map the function over the vote counts for each candidate.

(def ahmadinejad-last (map last-digit 
                           ahmadinejad-votes))
(def mousavi-last (map last-digit 
                       mousavi-votes))
(def rezai-last (map last-digit 
                     rezai-votes))
(def karrubi-last (map last-digit 
                       karrubi-votes))

Now we have a list of last-digits for the 30 regions for each candidate. We need to tweak the get-counts function, from above, so that it includes the digit zero, which wasn’t necessary in the first-digit case.

(defn get-counts [digits] 
  (map #(get (:counts (tabulate digits)) % 0) 
       (range 0.0 10.0 1.0)))

Now plot the last-digit distribution for each of the candidates.

(doto (xy-plot (range 10) 
               (get-counts ahmadinejad-last)
               :legend true 
               :series-label "Ahmadinejad"
               :y-label "First digit frequency"
               :x-label "First digit"
               :title "Last digit frequency by candidate")
      (add-lines (range 10) (get-counts mousavi-last) 
                 :series-label "Mousavi")
      (add-lines (range 10) (get-counts rezai-last) 
                 :series-label "Rezai")
      (add-lines (range 10) (get-counts karrubi-last) 
                 :series-label "Karrubi")
      clear-background
      view)

The data certainly don’t look like they follow Benford’s law, but are they uniformly distributed? We can test that hypothesis with the chisq-test function; this time we don’t need to supply the :probs argument because a uniform distribution is the default.
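
Under the uniform null hypothesis each of the ten digits is equally likely, so with 30 regions the expected count for every digit is 30/10 = 3, and the test statistic is the usual

 X^2 = \sum_{d=0}^{9} (O_d - E_d)^2 / E_d, \quad E_d = 3  

where O_d is the observed count of last digit d.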

(def ahmadinejad-test 
  (chisq-test :table (get-counts ahmadinejad-last)))

(:X-sq ahmadinejad-test) ;; 4.667
(:p-value ahmadinejad-test) ;; 0.862

(def mousavi-test 
  (chisq-test :table (get-counts mousavi-last)))
(:X-sq mousavi-test) ;; 8.667
(:p-value mousavi-test) ;; 0.469

(def rezai-test 
  (chisq-test :table (get-counts rezai-last)))
(:X-sq rezai-test) ;; 15.333 
(:p-value rezai-test) ;; 0.0822

(def karrubi-test 
  (chisq-test :table (get-counts karrubi-last)))
(:X-sq karrubi-test) ;; 4.0
(:p-value karrubi-test) ;; 0.911

According to each of these tests, we cannot reject the hypothesis that the last-digits for each candidate follow a uniform distribution. The analysis written up in the Washington Post article came to the opposite conclusion, which is likely due to their use of more detailed data; the data used here were summary counts for the 30 regions, and more detailed vote counts could certainly lead to a different conclusion.

The complete code for this post can be found here.

Significance testing with randomization

We can use randomization (a.k.a. permutation tests) for significance testing by comparing a statistic calculated from a sample with the distribution of statistics calculated from samples where the null hypothesis is true.

In an earlier post, I demonstrated how to perform a t-test in Incanter. The p-value calculated for the test is based on the assumption that, when the null hypothesis (that the two sample means are identical) is true, the test statistic follows a t-distribution. Let’s see if that assumption appears correct by generating 1000 samples of data where the null hypothesis is true, calculating the t-statistic for each, and then examining the distribution of statistics.

We can then compare the original t-statistic with this distribution and determine the probability of seeing that particular value when the null hypothesis is true. If the probability is below some cutoff (typically, 0.05) we reject the null hypothesis.

Start by loading the necessary libraries:

(use '(incanter core stats datasets charts))

Load the plant-growth data set.

(def data (to-matrix (get-dataset :plant-growth)))

Break the first column of the data into groups based on treatment type (second column) using the group-by function,

(def groups (group-by data 1 :cols 0))

The first group is the control; the other two are treatments. View box-plots of the three groups, using the :group-by option of the box-plot function,

(view (box-plot (sel data :cols 0) 
                :group-by (sel data :cols 1)))

and then view the means of the three groups

(map mean groups)

This returns (5.032 4.661 5.526). So, the question is, are these means statistically different from each other?

Run the t-tests comparing the means of the first and second treatment groups with the control,

(def t1 (t-test (first groups) :y (second groups)))
(def t2 (t-test (first groups) :y (last groups)))

and view the test statistic and p-value for the first treatment,

(:t-stat t1)
(:p-value t1)

which are 1.191 and 0.2503, respectively, and the second treatment,

(:t-stat t2)
(:p-value t2)

which are -2.134 and 0.0479, respectively. These values provide evidence that the first treatment is not statistically significantly different from the control, but the second treatment is.

The above p-values are based on the assumption that the distribution of t-test statistics, when the null hypothesis is true, is a t-distribution. Let’s see if that assumption appears correct for these samples by generating 1000 samples where the null hypothesis is true. We can do that by permuting values between each of the treatments and the control using the sample-permutations function.
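
As a rough sketch of what a single permutation amounts to, in plain Clojure (permute-once is just an illustrative name, not part of Incanter, and not necessarily how sample-permutations is implemented): pool the two groups, shuffle the pooled values, and split them back into groups of the original sizes.

(defn permute-once [xs ys]
  (let [pooled (shuffle (concat xs ys))]
    [(take (count xs) pooled) 
     (drop (count xs) pooled)]))

The sample-permutations calls below do this 1000 times each, returning the permuted versions of both groups.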


(def perm-groups1 (sample-permutations 1000 
                                       (first groups) 
                                       (second groups)))

(def perm-groups2 (sample-permutations 1000 
                                       (first groups) 
                                       (last groups)))

Now create a function, based on t-test, that takes two sequences and returns just the :t-stat value.

(defn t-stat [x y] (:t-stat (t-test x :y y)))

Calculate the t-test statistics for each of the permuted versions of the two treatments and control with the map function.

(def t-stats1 (map t-stat 
                  (first perm-groups1) 
                  (second perm-groups1)))

(def t-stats2 (map t-stat 
                  (first perm-groups2) 
                  (second perm-groups2)))

Now plot the distribution of the t-test statistics, and overlay the pdf of a t-distribution for the first treatment.

(doto (histogram t-stats1 :density true :nbins 20)
      (add-function #(pdf-t % :df (:df t1)) -3 3)
      view)

The distribution looks like a Student’s t. Now inspect the mean, standard deviation, and 95% confidence interval for the distribution from the first treatment.

(mean t-stats1)

The mean is 0.023,

(sd t-stats1)

the standard deviation is 1.064,

(quantile t-stats1 :probs [0.025 0.975])

and the 95% confidence interval is (-2.116 2.002). Since the t-statistic for the original sample, 1.191, falls within the range of the confidence interval, we cannot reject the null hypothesis; the mean of the first treatment and the mean of the control are statistically indistinguishable.

Now view the mean, sd, and 95% confidence interval for the second treatment,

(mean t-stats2)

The mean is -0.014,

(sd t-stats2)

The standard deviation is 1.054,

(quantile t-stats2 :probs [0.025 0.975])

and the 95% confidence interval is (-2.075 2.122). In this case, the original t-statistic, -2.134, is outside of the confidence interval, so the null hypothesis should be rejected; the second treatment is statistically significantly different from the control.

In both cases the permutation tests agreed with the standard t-test p-values. However, permutation tests can also be used to test the significance of sample statistics that do not have well-known distributions like the t-distribution.

Comparing means without the t-statistic

Using a permutation test, we can compare means without using the t-statistic at all. First, define a function to calculate the difference in means between two sequences.


(defn means-diff [x y] 
  (minus (mean x) (mean y)))

Calculate the difference in sample means between the first treatment and the control.


(def samp-mean-diff (means-diff (first groups) 
                                (second groups))) 

This returns 0.371. Now calculate the difference in means for each of the 1000 permuted samples by mapping the means-diff function over the rows returned by the sample-permutations function we ran earlier.


(def perm-means-diffs1 (map means-diff 
                            (first perm-groups1) 
                            (second perm-groups1)))

This returns a sequence of differences in the means of the two groups.

Plot a histogram of perm-means-diffs1 using the :density option, instead of the default frequency, and then add a pdf-normal curve with the mean and sd of the perm-means-diffs1 data for a visual comparison.


(doto (histogram perm-means-diffs1 :density true)
      (add-lines (range -1 1 0.01) 
                 (pdf-normal (range -1 1 0.01) 
                             :mean (mean perm-means-diffs1) 
                             :sd (sd perm-means-diffs1)))
      view)

The permuted data look normal. Now calculate the proportion of values that are greater than the original sample means-difference. Use the indicator function to generate a sequence of ones and zeros, where a one indicates that a value of the perm-means-diffs1 sequence is greater than the original sample means-difference, and a zero indicates a value that is less.

Then take the mean of this sequence, which gives the proportion of times that a value from the permuted sequences is more extreme than the original sample means-difference (i.e. the p-value).

(mean (indicator #(> % samp-mean-diff) 
                 perm-means-diffs1))

This returns 0.088, so the calculated p-value is 0.088, meaning that 8.8% of the permuted sequences had a difference in means at least as large as the difference in the original sample means. Therefore, there is no evidence (at an alpha level of 0.05) that the means of the control and treatment-one are statistically significantly different.

We can also test the significance by computing a 95% confidence interval for the permuted differences. If the original sample means-difference is outside of this range, there is evidence that the two means are statistically significantly different.

(quantile perm-means-diffs1 :probs [0.025 0.975]) 

This returns (-0.606 0.595); since the original sample mean difference was 0.371, which is inside the interval, this is further evidence that treatment one’s mean is not statistically significantly different from the control.

Now, calculate the p-values for the difference in means between the control and treatment two.


(def perm-means-diffs2 (map means-diff 
                            (first perm-groups2) 
                            (second perm-groups2)))

(def samp-mean-diff (means-diff (first groups) 
                                (last groups)))

(mean (indicator #(< % samp-mean-diff) 
                 perm-means-diffs2))

(quantile perm-means-diffs2 
          :probs [0.025 0.975])

The calculated p-value is 0.022, meaning that only 2.2% of the permuted sequences had a difference in means at least as large as the difference in the original sample means. Therefore there is evidence (at an alpha level of 0.05) that the means of the control and treatment-two are statistically significantly different.

The original sample mean difference between the control and treatment two was -0.4939, which is outside the range (-0.478 0.466) calculated above. This is further evidence that treatment two’s mean is statistically significantly different from the control.

The complete code for this example is here.

Chi-Square Test of Independence

This example will demonstrate the use of the chisq-test function to perform tests of independence on a couple of sample data sets.

The first data set shows the number of male and female students with each of 16 different combinations of hair and eye color. We will use this data sample to test whether there is an association between hair and eye color. We will test males and females separately.

First load the necessary libraries.

(use '(incanter core stats charts datasets))

Now load the sample data with the get-dataset function, and then use group-by to break the data set into two groups based on gender (third column).

(def by-gender (group-by (get-dataset :hair-eye-color) 2))

Now take a look at the data for the males,

(def male-data (first by-gender))
(view male-data)

and the females.

(def female-data (second by-gender))
(view female-data)

Extract the hair color, eye color, and count values from the data set, defining three new variables.

(def m-hair (sel male-data :cols 0))
(def m-eye (sel male-data :cols 1))
(def m-count (sel male-data :cols 3))

Now use the bar-chart function to show the distribution of hair and eye color for males,

(view (bar-chart m-hair m-count 
                 :group-by m-eye 
                 :legend true 
                 :title "Male Hair and Eye Color"
                 :x-label "Hair Color"        
                 :y-label "Number of males"))

and for the females.

(def f-hair (sel female-data :cols 0))
(def f-eye (sel female-data :cols 1))
(def f-count (sel female-data :cols 3))
(view (bar-chart f-hair f-count 
                 :group-by f-eye 
                 :legend true 
                 :title "Female Hair and Eye Color"
                 :x-label "Hair Color"        
                 :y-label "Number of females"))

We can reshape the two vectors of 16 count values (male and female) into two 4×4 contingency tables using the matrix function.

(def m-table (matrix m-count 4))
(def f-table (matrix f-count 4))

Here are the two new contingency tables.

> m-table
[36.0000  9.0000  5.0000  2.0000
 66.0000 34.0000 29.0000 14.0000
 16.0000  7.0000  7.0000  7.0000
  4.0000 64.0000  5.0000  8.0000]

> f-table
[32.0000 11.0000 10.0000  3.0000
 53.0000 50.0000 25.0000 15.0000
 10.0000 10.0000  7.0000  7.0000
  3.0000 30.0000  5.0000  8.0000]

Now run the chisq-test function on the two tables,

(def m-test (chisq-test :table m-table))
(def f-test (chisq-test :table f-table))

and view the X-sq test statistics, the degrees of freedom, and the p-value for the test for males,

(:X-sq m-test) ;; 106.66
(:p-value m-test) ;; 7.01E-19
(:df m-test) ;; 9

and females.

(:X-sq f-test) ;; 41.28
(:p-value f-test) ;; 4.45E-6
(:df f-test) ;; 9

Both p-values are considerably below a 0.05 cutoff threshold, indicating that the null hypothesis, that there is no association between eye and hair color, should be rejected for both males and females.

In addition to passing contingency tables as arguments to chisq-test, you can pass raw data using the :x and :y arguments, which we will do in the following example.

This example will test whether there is any correlation between the pass/fail results of a high school mathematics proficiency test and a freshman college programming course. The math-prog data set includes three columns: student_id, high school math proficiency test pass/fail results, and freshman college programming course pass/fail results.

First load the data, and convert the non-numeric data into integers with the to-matrix function.

(def math-prog (to-matrix (get-dataset :math-prog)))

And then extract the math and programming results.

(def math (sel math-prog :cols 1))
(def prog (sel math-prog :cols 2))

And run the chisq-test function.

(def math-prog-test (chisq-test :x math :y prog))
(:X-sq math-prog-test) ;; 1.24
(:df math-prog-test) ;; 1
(:p-value math-prog-test) ;; 0.265 

In this case, we can’t reject the null hypothesis; there is no evidence of an association between the high school math proficiency exam results and the pass/fail rate of the freshman programming course.

The complete code for this example can be found here.

Further reading:
Chi-square testing at Wikipedia
Yates’ continuity correction at Wikipedia

Plotting with non-numeric data

Yesterday, I received a question about plotting with non-numeric data. Unfortunately, the only way to do it in Incanter was to convert the data to numeric values, usually using the to-matrix function. So I have added two new functions, bar-chart and line-chart, that accept non-numeric data. In order to reduce the likelihood of confusing the line-plot function with the line-chart function, I renamed line-plot to xy-plot (which better reflects the name of the underlying JFreeChart class). The following are some examples of using these functions (for other plotting examples, see the sample plots and probability distributions pages of the Incanter wiki).

First, load the necessary libraries.

(use '(incanter core charts datasets))

Now plot a simple bar-chart. The first argument is a vector of categories, the second is a vector of values associated with each category.

(view (bar-chart ["a" "b" "c" "d" "e"] 
                 [10 20 30 25 20]))

Use the :group-by option to associate multiple bars with each category.

(view (bar-chart ["a" "a" "b" "b" "c" "c" ] 
                 [10 20 30 10 40 20]
                 :legend true 
                 :group-by ["I" "II" "I" "II" "I" "II"]))

Line-charts behave just like bar-charts.

(view (line-chart ["a" "b" "c" "d" "e"] 
                  [20 10 30 25 40]))

The following examples use data on the number of airline passengers from 1949 through 1960. First load the sample data using the get-dataset function.

(def data (get-dataset :airline-passengers))

Now group the data by year, and plot just the last year (1960),

(def by-year (group-by data 0))
(view (bar-chart (sel (last by-year) :cols 2) 
                 (sel (last by-year) :cols 1)
                 :title "Airline Travel in 1960"
                 :y-label "Passengers"
                 :x-label "Month"))

and view the same data with a line-chart.

(view (line-chart (sel (last by-year) :cols 2) 
                 (sel (last by-year) :cols 1)
                 :title "Airline Travel in 1960"
                 :y-label "Passengers"
                 :x-label "Month"))

Now use the :group-by option in line-chart to view the data for each year,

(view (line-chart (sel data :cols 2) 
                 (sel data :cols 1)
                 :group-by (sel data :cols 0)
                 :title "Airline Travel in 1949-1960"
                 :legend true
                 :y-label "Passengers"
                 :x-label "Month"))

and do the same with a bar-chart.

(view (bar-chart (sel data :cols 2) 
                 (sel data :cols 1)
                 :group-by (sel data :cols 0)
                 :title "Airline Travel in 1949-1960"
                 :legend true
                 :y-label "Passengers"
                 :x-label "Year"))

Instead of grouping by year, we can group-by month.

(view (bar-chart (sel data :cols 0) 
                 (sel data :cols 1)
                 :group-by (sel data :cols 2)
                 :title "Airline Travel in 1949-1960"
                 :legend true
                 :y-label "Passengers"
                 :x-label "Year")
       :width 525)

The complete code for these examples can be found here.