Basic Statistical Tests Explained with R

This chapter covers some of the most commonly used statistical tests and how to run them in R.

1. Shapiro-Wilk Test: Testing for normality

Why is it used?

To test if a sample follows a Normal distribution.

shapiro.test(myVec)  # does myVec follow a normal distribution?

# Example: test a sample drawn from a normal distribution
normaly_disb <- rnorm(100, mean = 5, sd = 1)  # generate normally distributed data
shapiro.test(normaly_disb)                    # run the test

	Shapiro-Wilk normality test

data:  normaly_disb
W = 0.9936, p-value = 0.919

# Example: test a sample drawn from a uniform distribution
not_normaly_disb <- runif(100)    # generate uniformly distributed data
shapiro.test(not_normaly_disb)    # run the test

	Shapiro-Wilk normality test

data:  not_normaly_disb
W = 0.9563, p-value = 0.002195

How to interpret?

If the p-value is less than the significance level (0.05), the null hypothesis that the sample is normally distributed can be rejected. In the first example above (p = 0.919) normality cannot be rejected; in the second (p = 0.002195) it can.

2. One Sample t-Test: Testing the mean of a sample from a normal distribution

Why is it used?

To test if the mean of a sample could reasonably be a specific value.

x <- rnorm(50, mean = 10, sd = 0.5)
t.test(x, mu = 10)  # testing if the mean of x could be 10

	One Sample t-test

data:  x
t = -0.8547, df = 49, p-value = 0.3969
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
  9.797195 10.081767
sample estimates:
mean of x 

How to interpret?

In the above case, the p-value is not less than the significance level of 0.05, so the null hypothesis that the mean is 10 cannot be rejected. Also note that the 95% confidence interval includes the value 10 within its range. So it is reasonable to say the mean of ‘x’ is 10, especially since ‘x’ is assumed to be normally distributed. If a normal distribution cannot be assumed, use the Wilcoxon signed rank test shown in the next section.
Note: Use the conf.level argument to adjust the confidence level, as shown below.
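A minimal sketch of widening the interval to 99%, reusing the x generated above:

t.test(x, mu = 10, conf.level = 0.99)  # 99% confidence interval instead of the default 95%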

3. Wilcoxon Signed Rank Test: Testing the mean of a sample when normal distribution is not assumed

Why / When is it used?

The Wilcoxon signed rank test can be an alternative to the t-test when the data sample is not assumed to follow a normal distribution. It is a non-parametric method used to test if an estimate is different from its true value.

wilcox.test(input.vector, mu = m, conf.int = TRUE)

How to interpret?

If the p-value is < 0.05, reject the null hypothesis and accept the alternative hypothesis stated in your R code’s output. Type example(wilcox.test) in the R console for an illustration, or see the short sketch below.
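A minimal sketch with simulated skewed data (the seed, the rexp() sample, and the hypothesised location of 2 are assumptions made purely for illustration):

set.seed(100)                                 # for reproducibility
skewed <- rexp(30, rate = 0.5)                # exponential sample, clearly not normal
wilcox.test(skewed, mu = 2, conf.int = TRUE)  # is the location different from 2?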

4. Two Sample t-Test and Wilcoxon Rank Sum Test: Comparing mean of two samples

Both the t-test and the Wilcoxon rank sum test can be used to compare the means of two samples.

How to implement in R?

Pass the two numeric vectors to t.test() when the samples are assumed to be normally distributed, and to wilcox.test() when a normal distribution is not assumed.

x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)
y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
wilcox.test(x, y, alternative = "g") # greater

Wilcoxon rank sum test

data:  x and y
W = 35, p-value = 0.1272
alternative hypothesis: true location shift is greater than 0

With a p-value of 0.1272, we cannot reject the null hypothesis that x and y have the same mean.

t.test(1:10, y = c(7:20)) # P = .00001855

Welch Two Sample t-test

data:  1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -11.052802  -4.947198
sample estimates:
  mean of x mean of y 
5.5      13.5 

With a p-value < 0.05, we can safely reject the null hypothesis that there is no difference in means. What if we want a 1-to-1 comparison of the paired observations in x and y?
# Use paired = TRUE for a 1-to-1 comparison of observations.
# Note: paired tests require x and y to have the same number of observations.
t.test(x, y, paired = TRUE)       # paired t-test on the differences
wilcox.test(x, y, paired = TRUE)  # paired Wilcoxon; the differences are assumed to be symmetric
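A minimal sketch with hypothetical before/after measurements (the numbers are made up for illustration):

before <- c(200, 195, 210, 190, 205, 198, 202, 207)
after  <- c(193, 192, 204, 189, 200, 196, 198, 199)
t.test(before, after, paired = TRUE)        # paired t-test on the differences
wilcox.test(before, after, paired = TRUE)   # Wilcoxon signed rank test on the pairs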

When can I conclude that the means are different?

Conventionally, if the p-value is less than the significance level (typically 0.05), reject the null hypothesis that both means are equal.

5. Kolmogorov-Smirnov Test: Test if two samples have the same distribution

Kolmogorov-Smirnov test is used to check whether 2 samples follow the same distribution. 

ks.test(x, y)  # x and y are two numeric vectors

# from different distributions
x <- rnorm(50)
y <- runif(50)
ks.test(x, y) # perform ks test

Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.48, p-value = 1.387e-05
alternative hypothesis: two-sided

# both from normal distribution
x <- rnorm(50)
y <- rnorm(50)
ks.test(x, y) # perform ks test

Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.18, p-value = 0.3959
alternative hypothesis: two-sided

How to tell if they are from the same distribution?

If the p-value < 0.05 (significance level), we reject the null hypothesis that they are drawn from the same distribution. In other words, p < 0.05 implies that x and y come from different distributions.

6. Fisher’s F-Test: Test if two samples have the same variance

Fisher’s F test can be used to compare the variances of two samples.
var.test(x, y)  # Do x and y have the same variance?

Alternatively, fligner.test() and bartlett.test() can be used for the same purpose, as in the sketch below.
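A minimal sketch with simulated samples (the seed, sample sizes, and standard deviations are assumptions for illustration):

set.seed(10)
x <- rnorm(50, mean = 0, sd = 1)   # sample with sd = 1
y <- rnorm(50, mean = 0, sd = 2)   # sample with sd = 2 (larger variance)
var.test(x, y)                     # F test; a small p-value suggests unequal variances
bartlett.test(list(x, y))          # Bartlett's test on the same two samples
fligner.test(list(x, y))           # Fligner-Killeen test, more robust to non-normality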

7. Chi Squared Test: Test the independence of two variables in a contingency table

The chi-squared test can be used to test the independence of two categorical variables. Example: you may want to figure out whether big-budget films become box-office hits. Here we have two categorical variables (film budget and success status), each with two levels (big/low budget and hit/flop), which form a 2 x 2 contingency table.

chisq.test(contingencyMatrix, correct = FALSE)  # Yates’ continuity correction not applied
summary(table(x, y))                            # also performs a chi-squared test of independence

Pearson's Chi-squared test
data:  M
X-squared = 30.0701, df = 2, p-value = 2.954e-07


How to tell if x, y are independent?

There are two ways to tell if they are independent: (1) by looking at the p-value, and (2) by comparing the chi-squared statistic with the critical value.

p-value: If the p-value is less than 0.05, we reject the null hypothesis that x and y are independent. So for the example output above (p-value = 2.954e-07), we reject the null hypothesis and conclude that x and y are not independent.

Chi-squared value: For a 2 x 2 contingency table (1 degree of freedom), if the calculated chi-squared statistic is greater than 3.841 (the critical value), we reject the null hypothesis that the variables are independent. For larger tables, the degrees of freedom are (rows - 1) x (columns - 1), and the critical value can be found with qchisq(0.95, df).
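A hedged sketch of the film-budget example (the variable name and the counts are invented purely for illustration):

filmTable <- matrix(c(35, 15,
                      20, 30),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Budget = c("Big", "Low"), Status = c("Hit", "Flop")))
chisq.test(filmTable)                   # 2 x 2 table, so 1 degree of freedom
chisq.test(filmTable, correct = FALSE)  # same test without Yates’ continuity correction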

8. Correlation: Test the linear relationship of two variables

The cor.test() function tests whether the correlation between two variables is significant.
cor.test(x, y) # where x and y are numeric vectors.
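A minimal sketch using R’s built-in cars dataset:

cor.test(cars$speed, cars$dist)                       # Pearson correlation (default)
cor.test(cars$speed, cars$dist, method = "spearman")  # rank-based alternative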

9. More Commonly Used Tests

fisher.test(contingencyMatrix, alternative = "greater")  # Fisher's exact test for independence of rows and columns in a contingency table
friedman.test()  # Friedman's rank sum non-parametric test
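A minimal sketch with made-up data (the contingency counts and the blocks-by-treatments score matrix are assumptions for illustration):

contingencyMatrix <- matrix(c(8, 2,
                              1, 5), nrow = 2, byrow = TRUE)
fisher.test(contingencyMatrix, alternative = "greater")  # exact test of independence

scores <- matrix(c(1, 2, 3,
                   2, 3, 1,
                   1, 3, 2,
                   1, 2, 3), nrow = 4, byrow = TRUE)      # 4 blocks x 3 treatments
friedman.test(scores)                                     # Friedman's rank sum test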

There are more useful tests available in various other packages.
The ‘lawstat’ package has a good collection, and the ‘outliers’ package has a number of tests for detecting outliers.

