This tutorial is the second in a series of four. This part shows you how to apply and interpret the tests for categorical variables. This link will get you back to the first part of the series.
A categorical variable's values are just names that indicate no ordering. An example is fruit: you've got apples and oranges, and there is no order to these. A special case is the binomial variable, which can only assume one of two values: true or false, heads or tails and the like. Churn and prospect/customer variables are more specific examples of binomial variables.
Descriptive
Sometimes you just want to describe one variable. Although these types of descriptions don't need statistical tests, I'll describe them here, since they should be part of interpreting the statistical test results. Statistical tests tell you whether something changed; descriptions of the distributions tell you in what direction it changed.
Proportion
There are many ways to calculate proportions but, for no particular reason, I've chosen to do it with the dplyr framework:
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(n = n()) %>%
  mutate(proportion = n / sum(n))
This shows the proportion of each species of iris (33% for each):
# A tibble: 3 x 3
Species n proportion
<fctr> <int> <dbl>
1 setosa 50 0.3333333
2 versicolor 50 0.3333333
3 virginica 50 0.3333333
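If you'd rather avoid the dplyr dependency, base R gives the same proportions in a single line with table and prop.table:
prop.table(table(iris$Species))

    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333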
Mode
There is no standard function to calculate the mode that I know of (if you know one, please leave a comment), so I created one called calc_mode:
calc_mode <- function(x){
  # Frequency of each unique value
  table_freq <- table(x)
  max_freq <- max(table_freq)
  if (all(table_freq == max_freq)) {
    # All values occur equally often: there is no mode
    mod <- NA
  } else if (is.numeric(x)) {
    # Convert the names of the frequency table back to numbers
    mod <- as.numeric(names(table_freq)[table_freq == max_freq])
  } else {
    mod <- names(table_freq)[table_freq == max_freq]
  }
  return(mod)
}
In the case where there is one value with the highest frequency, that value will be returned:
> table(diamonds$cut)
Fair Good Very Good Premium Ideal
1610 4906 12082 13791 21551
> calc_mode(diamonds$cut)
[1] "Ideal"
In case all values have the same frequency, NA will be reported as the mode:
> table(iris$Species)
setosa versicolor virginica
50 50 50
> calc_mode(iris$Species)
[1] NA
When there are multiple values with the same frequency, all of those will be returned:
> table(mtcars$mpg)
10.4 13.3 14.3 14.7 15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.7 21 21.4 21.5 22.8 24.4 26 27.3 30.4 32.4 33.9
2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 2 1 2 1 1 1 2 1 1
> calc_mode(mtcars$mpg)
[1] 10.4 15.2 19.2 21.0 21.4 22.8 30.4
One sample
One sample tests are done when you want to find out whether your measurements differ from some kind of theoretical distribution. For example: you might want to find out whether a die gives you the random results you'd expect from a fair die. In that case you'd expect each of the numbers 1 to 6 to come up about 1/6th of the time.
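If you don't want to throw a real die 100 times, you can simulate the experiment; this snippet is just an illustration using base R's sample function and plays no role in the test itself:
# Simulate 100 throws of a fair six-sided die and count how often each face came up
set.seed(1)
table(sample(1:6, size = 100, replace = TRUE))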
Chi-Square Goodness of Fit test
With the Chi-Square Goodness of Fit Test you test whether your data fits a hypothesized distribution. Let's take the example of the die. First you have a data set you've collected by throwing a die 100 times, recording the number of times each of the faces 1 to 6 came up:
throws <- c(18, 16, 15, 17, 16, 18)
Then we fill a vector with the expected probability of each side being thrown:
exp_throws <- c(rep(1/6, 6))
Then we use the function chisq.test to test whether this die isn't tampered with, or needs to be replaced:
chisq.test(throws, p = exp_throws)
Resulting in:
Chi-squared test for given probabilities
data: throws
X-squared = 0.44, df = 5, p-value = 0.9942
The p-value of .9942 tells us we can trust this die when we want to play a fair game. If you store the result in a variable, you can access its p.value property; this might come in handy if you want to use the result of the test later on in your script.
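A minimal sketch of that (the variable name res_dice is mine):
# Store the test result, then extract the p-value for later use
res_dice <- chisq.test(throws, p = exp_throws)
res_dice$p.value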
Binomial test
With the binomial test you test whether one value occurs more or less often than expected.
The binomial test is used when there are only two outcomes: success or failure. This doesn't mean the variable can only have two values, but only one of the values is considered a success. If several values are considered a success, I would recommend creating a new variable in which you recode the values into a logical of successes and failures, as sketched below.
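As a sketch of such a recoding, using a made-up satisfaction variable where two of the four answers count as a success:
# Hypothetical example: collapse several values into logical successes/failures
satisfaction <- c("very satisfied", "satisfied", "neutral", "dissatisfied")
is_success <- satisfaction %in% c("very satisfied", "satisfied")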
The test in R is done with the function binom.test. Suppose we think about 75% of our customers are male (the things we think about…). We have no data on it, so we start collecting it by randomly calling 100 of them and registering their gender. We find that 70 of the customers were male. Do we now reject our initial hypothesis of 75%? Let's find peace of mind, and answer this pressing question:
binom.test(70, 100, p = .75, alternative = "two.sided")
The output:
Exact binomial test
data: 70 and 100
number of successes = 70, number of trials = 100, p-value = 0.2491
alternative hypothesis: true probability of success is not equal to 0.75
95 percent confidence interval:
0.6001853 0.7875936
sample estimates:
probability of success
0.7
The miserable p-value of .2491 tells us we cannot reject our hypothesis that 75% of our customers are male. As with the Chi-squared test, you can access the p.value property if you store the binom.test function's result in a variable.
Two unrelated samples
Unrelated sample tests can be used for analyzing marketing tests; you apply some kind of marketing voodoo to two different groups of prospects/customers and you want to know which method worked best.
Two sample Chi-Square test
The two sample Chi-square test can be used to compare two groups on a categorical variable. A typical marketing application would be A/B testing. But because I want to give an example, I'll take an R data set about hair color. I'm very, very interested in whether the sexes differ in hair color. For this I use the HairEyeColor data set; since it's not specified whether the reported hair color is dyed or natural, I'll assume it's natural. I must prepare it so the frequencies of the sexes are in two columns. Then the data frame is converted to a matrix, and the hair color column is removed and used to name the rows instead. This is done so the chisq.test function is able to process it.
library(tidyr)  # for spread()

tbl_hair_sex <- as.data.frame(HairEyeColor) %>%
  group_by(Hair, Sex) %>%
  summarise(qty = sum(Freq)) %>%
  ungroup() %>%
  spread(key = Sex, value = qty)

mat_hair_sex <- as.matrix(tbl_hair_sex[-1])
rownames(mat_hair_sex) <- levels(tbl_hair_sex$Hair)
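Printing mat_hair_sex shows the shape chisq.test expects: hair colors in the rows, the sexes in the columns, and frequencies in the cells (the counts are HairEyeColor summed over the eye colors):
mat_hair_sex

      Male Female
Black   56     52
Brown  143    143
Red     34     37
Blond   46     81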
Then the function is applied:
chisq.test(mat_hair_sex)
The output:
Pearson's Chi-squared test
data: mat_hair_sex
X-squared = 7.9942, df = 3, p-value = 0.04613
The p-value is below 0.05 and tells us that there is a difference in hair color between the men and women in the sample. When you look at the data, you see that this is mostly caused by the female students having proportionally more blond hair.
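You can check this by converting the counts to proportions within each sex; base R's prop.table with margin = 2 works column-wise:
# Proportion of each hair color within each sex
round(prop.table(mat_hair_sex, margin = 2), 2)

      Male Female
Black 0.20   0.17
Brown 0.51   0.46
Red   0.12   0.12
Blond 0.16   0.26

About 16% of the men are blond against 26% of the women, which drives the significant test result.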
Visualizing the results
Since I'm a very visually oriented person, I'd like to see the results of the test represented in a graph. Sadly, the regular chisq.test returns its data in a way that is hard to turn into something we can use with ggplot2. To overcome this problem I created a wrapper function around the chisq.test function, called chisq_test:
chisq_test <- function(x, ...) {
  res_chisq <- chisq.test(x, ...)
  # Number of grouping dimensions in the contingency table
  groups <- length(dimnames(as.table(res_chisq$observed)))
  # Start from the observed frequencies in long format
  dataset <- as.data.frame(as.table(res_chisq$observed))
  names(dataset)[groups + 1] <- "observed"
  # Add the expected values, residuals and standardized residuals as columns
  dataset <- cbind(dataset,
                   expected = as.data.frame(as.table(res_chisq$expected))[[groups + 1]],
                   residuals = as.data.frame(as.table(res_chisq$residuals))[[groups + 1]],
                   stdres = as.data.frame(as.table(res_chisq$stdres))[[groups + 1]])
  result <- list(statistic = res_chisq$statistic,
                 parameter = res_chisq$parameter,
                 p.value = res_chisq$p.value,
                 method = res_chisq$method,
                 data.name = res_chisq$data.name,
                 dataset = dataset)
  attr(result, "class") <- attr(res_chisq, "class")
  return(result)
}
I won't go into detail about how the function works, but the end result is a variable similar to the one returned by chisq.test, except that all the data elements, like observed values, expected values, residuals and standardized residuals, are in one data frame, ready for use in a ggplot. First we'll apply the function and store its result:
res_chisq <- chisq_test(mat_hair_sex)
To visualize this result I’ve used this ggplot call:
library(ggplot2)

ggplot(res_chisq$dataset, aes(x = Var1, y = observed, group = Var2)) +
  geom_col(aes(fill = Var2, alpha = abs(residuals)), position = "dodge") +
  geom_text(aes(label = observed), position = position_dodge(width = 1), vjust = 1) +
  labs(x = "", y = "", fill = "") +
  guides(alpha = FALSE)
Two related samples
The related samples tests are used to determine whether there are differences before and after some kind of treatment. They are also useful when verifying the predictions of machine learning algorithms for categorical variables.
McNemar’s test
McNemar's test is used to see whether observations change value between two paired measurements. It's useful for comparing the results of questionnaires for the same person across a period of time.
In his classic book The Decline of Good Taste, Dr. Edward McAuliffe lamented the ascent of the frivolous bow-tie at the expense of the refined cravat. Being the reputed scientist he was, he didn't go by feeling, but relentlessly carried out two questionnaires on the same 2,000 men with two years in between. The men were asked whether they would rather wear a bow-tie or a cravat. Here I recreate the results:
tbl_cravat <- data.frame(first_result = c(rep("Cravat", 1500),
                                          rep("Bow-tie", 500)),
                         second_result = c(rep("Cravat", 1100),   # kept the cravat
                                           rep("Bow-tie", 400),   # switched to the bow-tie
                                           rep("Cravat", 200),    # switched to the cravat
                                           rep("Bow-tie", 300)))  # kept the bow-tie

tbl_data <- tbl_cravat %>%
  group_by(first_result, second_result) %>%
  summarise(qty = n()) %>%
  ungroup() %>%
  spread(key = second_result, value = qty) %>%
  select(`Bow-tie`, Cravat)
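Before running the test it helps to look at the resulting table; McNemar's test only uses the two discordant cells, the 400 men who switched to the bow-tie and the 200 who switched to the cravat:
as.matrix(tbl_data)
# Row 1 holds the men whose first answer was Bow-tie, row 2 those who answered Cravat:
#      Bow-tie Cravat
# [1,]     300    200
# [2,]     400   1100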
To test whether their opinions changed, we apply the data frame, converted to a matrix, to the mcnemar.test function.
mcnemar.test(as.matrix(tbl_data))
Output:
McNemar's Chi-squared test with continuity correction
data: as.matrix(tbl_data)
McNemar's chi-squared = 66.002, df = 1, p-value = 4.505e-16
Seeing that the p-value is so low, we can assume the general sentiment towards the cravat changed. And seeing the direction of this fashion evolution, we will mourn together with Dr. McAuliffe over this great loss to gentlemanly traditions…
Association between 2 variables
Tests of association determine the strength of the relationship between two variables. They can be used if you want to know, for example, whether there is any relation between the amount a customer spends and the number of orders the customer has already placed.
Contingency coefficients or Cramer's V
Several measures can be used to estimate the extent of the relationship between two variables, or to show the strength of a relationship:
- Phi Coefficient - This can only be calculated on 2-by-2 tables.
- Contingency coefficient - The contingency coefficient is critiqued for not being able to reach its upper limit of 1.
- Cramer's V - This measure has none of the drawbacks of the Phi and Contingency coefficients. Its values fall between 0 and 1.
All three can be calculated using the assocstats function from the vcd library.
We'll use the same HairEyeColor matrix we used for the Chi-square test. Again: I assume the data set reports natural hair color, like Steven Seagal does.
library(vcd)
assocstats(mat_hair_sex)
Output:
X^2 df P(> X^2)
Likelihood Ratio 8.0928 3 0.044131
Pearson 7.9942 3 0.046131
Phi-Coefficient : NA
Contingency Coeff.: 0.115
Cramer's V : 0.116
The association seems to be weak.