chi-square test | Data Sorcery with Clojure

Benford’s law has been discussed as a means of analyzing the results of the 2009 Iranian election by several analysts. One of the first attempts was in a paper (pdf) by Boudewijn Roukema, where he concluded that the data did not follow Benford’s law, and that there was evidence for vote tampering. Andrew Gelman responded; he was not convinced by the statistics of this particular analysis (his criticism was of the statistical analysis, and didn’t reflect his opinion of the possibility of vote tampering).

Stochastic Democracy and the Daily Kos also used Benford’s law to analyze the election data, here, but came to the opposite conclusion of Roukema; the data did follow Benford’s law. A third analysis was performed by John Graham-Cumming in this blog post. He came to the same conclusion as Stochastic Democracy and Daily Kos, that the data did follow Benford’s law. His analysis is based on this data, available from the Guardian’s Data Blog.

Again, like Gelman, I’m not convinced that Benford’s law can necessarily be used to draw conclusions on the presence or absence of vote tampering, but it is an interesting example of a Chi-square goodness-of-fit analysis, comparing some real-world data with a theoretical distribution, Benford’s law in this case. So I will demonstrate how to perform the analysis with Incanter using the data set made available by the Guardian. A CSV file with the data can be found here.

First load the necessary libraries,

(use '(incanter core stats charts io))

and read the data using the read-dataset function.

(def votes (read-dataset "data/iran_election_2009.csv" 
                         :header true))

Now, create variables representing the regions and the candidates.

(def regions (sel votes :cols "Region"))
(def ahmadinejad-votes (sel votes :cols "Ahmadinejad"))
(def mousavi-votes (sel votes :cols "Mousavi"))
(def rezai-votes (sel votes :cols "Rezai"))
(def karrubi-votes (sel votes :cols "Karrubi"))

We need to write a small function that will take a number and return only the first digit. We can do that by converting the number into a string with the str function, taking the first character of the string, and then converting that character into a number again.

(defn first-digit [x] 
  (Character/digit (first (str x)) 10))

Now, map the first-digit function over the vote counts in each of the 30 regions for each candidate.

(def ahmadinejad (map first-digit ahmadinejad-votes))
(def mousavi (map first-digit mousavi-votes))
(def rezai (map first-digit rezai-votes))
(def karrubi (map first-digit karrubi-votes))

Now we have a list of first-digits across the different regions, for each candidate. Next, define function that implements Benford’s Law, create a variable containing the probabilities for each of the nine digits, and calculate the expected frequencies for the 30 regions.

(defn benford-law [d] (log10 (plus 1 (div d))))
(def benford-probs (benford-law (range 1 11)))
(def benford-freq (mult benford-probs (count regions)))

We will then need a function that takes the list of first digits for each candidate and returns the counts for each digit. We can use the :counts field returned by the tabulate function to do most of the work.

(defn get-counts [digits] 
  (map #(get (:counts (tabulate digits)) % 0) 
       (range 1.0 10.0 1.0)))

It’s always a good idea to look at the data, so plot the predicted first-digit frequencies and the frequencies for Ahmadinejad and Mousavi.

(doto (xy-plot (range 1 10) (get-counts ahmadinejad)
               :legend true :series-label "Ahmadinejad"
               :y-label "First digit frequency"
               :x-label "First digit"
               :title "First digit frequency by candidate")
      (add-lines (range 1 10) (get-counts mousavi) 
                 :series-label "Mousavi")
      (add-lines (range 1 10) benford-freq 
                 :series-label "Predicted")
      clear-background
      view)

Based on that plot, the digit frequencies look like they follow Benford’s law. We can confirm that with a Chi-square goodness-of-fit test using the chisq-test function. Start with Ahmadinejad,

(def ahmadinejad-test 
  (chisq-test :table (get-counts ahmadinejad) 
              :probs benford-probs))
(:X-sq ahmadinejad-test)
(:p-value ahmadinejad-test)

The X-square test value is 5.439 with a p-value of 0.71. Based on these values we cannot reject the null hypothesis that the observed distribution of first-digits is the same as the expected distribution, which follows Benford’s law. Next test Mousavi’s vote counts,

(def mousavi-test 
  (chisq-test :table (get-counts mousavi) 
              :probs benford-probs))
(:X-sq mousavi-test) 
(:p-value mousavi-test)

The X-square test value for Mousavi is 5.775 with a p-value of 0.672. Like Ahmadinejad, we cannot reject the null hypothesis. We might as well test the distributions of first-digits for the remaining two candidates (which we didn’t plot), Rezai and Karrubi.

(def rezai-test 
  (chisq-test :table (get-counts rezai) 
              :probs benford-probs))
(:X-sq rezai-test) 
(:p-value rezai-test)

The X-square test value for Rezai is 12.834 with a p-value of 0.118, so we can’t reject the null hypothesis.

(def karrubi-test 
  (chisq-test :table (get-counts karrubi) 
              :probs benford-probs))
(:X-sq karrubi-test) 
(:p-value karrubi-test)

Karrubi’s X-square value is 8.8696 with a p-value of 0.353, and like the other candidates appears to follow Benford’s law.

The conclusion based on the vote count summaries for the 30 Iranian regions, is that the votes follow Benford’s law.

Last-digit frequencies

Another approach to identifying vote tampering is to examine the last digit of the vote counts, instead of the first. This approach was discussed in this Washington Post article.

Unlike first-digits, last-digits don’t follow Benford’s law, but rather follow a uniform distribution. We can test the distribution of last-digits from our sample of 30 regions using the chisq-test function again. First, define a function to extract the last digit from a number,

(defn last-digit [x] 
  (Character/digit (last (str x)) 10))

and map the function over the vote counts for each candidate.

(def ahmadinejad-last (map last-digit 
                           ahmadinejad-votes))
(def mousavi-last (map last-digit 
                       mousavi-votes))
(def rezai-last (map last-digit 
                     rezai-votes))
(def karrubi-last (map last-digit 
                       karrubi-votes))

Now we have a list of last-digits for the 30 regions for each candidate. We need to tweak the get-counts function, from above, so that it includes the digit zero, which wasn’t necessary in the first-digit case.

(defn get-counts [digits] 
  (map #(get (:counts (tabulate digits)) % 0) 
       (range 0.0 10.0 1.0)))

Now plot the last-digit distribution for each of hte candidates.

(doto (xy-plot (range 10) 
               (get-counts ahmadinejad-last)
               :legend true 
               :series-label "Ahmadinejad"
               :y-label "First digit frequency"
               :x-label "First digit"
               :title "Last digit frequency by candidate")
      (add-lines (range 10) (get-counts mousavi-last) 
                 :series-label "Mousavi")
      (add-lines (range 10) (get-counts rezai-last) 
                 :series-label "Rezai")
      (add-lines (range 10) (get-counts karrubi-last) 
                 :series-label "Karrubi")
      clear-background
      view)

The data certainly doesn’t look like it’s following Benford’s law, but is it uniformly distributed? We can test that hypothesis with the chisq-test function, this time we won’t need to supply the :probs argument because a uniform distribution is the default.

(def ahmadinejad-test 
  (chisq-test :table (get-counts ahmadinejad-last)))

(:X-sq ahmadinejad-test) ;; 4.667
(:p-value ahmadinejad-test) ;; 0.862

(def mousavi-test 
  (chisq-test :table (get-counts mousavi-last)))
(:X-sq mousavi-test) ;; 8.667
(:p-value mousavi-test) ;; 0.469

(def rezai-test 
  (chisq-test :table (get-counts rezai-last)))
(:X-sq rezai-test) ;; 15.333 
(:p-value rezai-test) ;; 0.0822

(def karrubi-test 
  (chisq-test :table (get-counts karrubi-last)))
(:X-sq karrubi-test) ;; 4.0
(:p-value karrubi-test) ;; 0.911

According to each of these tests, we cannot reject the hypothesis that the last-digits for each candidate follows a uniform distribution. The analysis written up in the Washington Post article came to the opposite conclusion, which is likely due to their use of more detailed data. The data used here was summary information for 30 regions. More detailed vote counts could certainly provide a different conclusion.

The complete code for this post can be found here.

This example will demonstrate the use of the chisq-test function to perform tests of independence on a couple sample data sets.

The first data set shows the number of male and female students with each of 16 different combinations of hair and eye color. We will use this data sample to test whether there is an association between hair and eye color. We will test males and females separately.

First load the necessary libraries.

(use '(incanter core stats charts datasets))

Now load the sample data with the get-dataset function, and then use group-by to break the data set into two groups based on gender (third column).

(def by-gender (group-by (get-dataset :hair-eye-color) 2))

Now take a look at the data for the males,

(def male-data (first by-gender))
(view male-data)

and the females.

(def female-data (second by-gender))
(view female-data)

Extract the hair color, eye color, and count values from the data set, defining three new variables.

(def m-hair (sel male-data :cols 0))
(def m-eye (sel male-data :cols 1))
(def m-count (sel male-data :cols 3))

Now use the bar-chart function to show the distribution of hair and eye color for males,

(view (bar-chart m-hair m-count 
                 :group-by m-eye 
                 :legend true 
                 :title "Male Hair and Eye Color"
                 :x-label "Hair Color"        
                 :y-label "Number of males"))

and for the females.

(def f-hair (sel female-data :cols 0))
(def f-eye (sel female-data :cols 1))
(def f-count (sel female-data :cols 3))

(view (bar-chart f-hair f-count 
                 :group-by f-eye 
                 :legend true 
                 :title "Female Hair and Eye Color"
                 :x-label "Hair Color"        
                 :y-label "Number of females"))

We can reshape the two vectors of 16 count values (male and female) into two 4×4 contingency tables using the matrix function.

(def m-table (matrix m-count 4))
(def f-table (matrix f-count 4))

Here are the two new contingency tables.

> m-table
[36.0000  9.0000  5.0000  2.0000
 66.0000 34.0000 29.0000 14.0000
 16.0000  7.0000  7.0000  7.0000
 4.0000 64.0000  5.0000  8.0000]

> f-table
[32.0000 11.0000 10.0000  3.0000
 53.0000 50.0000 25.0000 15.0000
 10.0000 10.0000  7.0000  7.0000
 3.0000 30.0000  5.0000  8.0000]

Now run the chisq-test function on the two tables,

(def m-test (chisq-test :table m-table))
(def f-test (chisq-test :table f-table))

and view the X-sq test statistics, the degrees of freedom, and the p-value for the test for males,

(:X-sq m-test) ;; 106.66
(:p-value m-test) ;; 7.01E-19
(:df m-test) ;; 9

and females.

(:X-sq f-test) ;; 41.28
(:p-value f-test) ;; 4.45E-6
(:df f-test) ;; 9

Both p-values are considerable below a 0.05 cut off threshold, indicating that the null hypothesis, that there is no association between eye and hair color, should be rejected for both males and females.

In addition to passing contingency tables as arguments to chisq-test, you can pass raw data using the :x and :y arguments, which we will do in the following example.

This example will test whether there is any correlation between the pass/fail results of a high school mathematics proficiency test and a freshman college programming course. The math-prog data set includes three columns: student_id, high school math proficiency test pass/fail results, and freshman college programming course pass/fail results.

First load the data, and convert the non-numeric data into integers with the to-matrix function.

(def math-prog (to-matrix (get-dataset :math-prog)))

And then extract the math and programming results.

(def math (sel math-prog :cols 1))
(def prog (sel math-prog :cols 2))

And run the chisq-test function.

(def math-prog-test (chisq-test :x math :y prog))

(:X-sq math-prog-test) ;; 1.24
(:df math-prog-test) ;; 1
(:p-value math-prog-test) ;; 0.265

In this case, we can’t reject null hypothesis, there is no association between the high school math proficiency exam and pass/fail rate of the freshman programming course.

The complete code for this example can be found here.

Further reading:
Chi-square testing at Wikipedia
Yates’ continuity correction at Wikipedia

Category Archives: chi-square test

Chi-square goodness-of-fit, Benford’s law, and the Iranian election

Chi-Square Test of Independence

Search

Incanter

Clojure

Statistics

programming

visualization

Recent Comments

@liebke on Twitter

Recent Posts

Del.icio.us/liebke

Topics