This chapter covers some of the most commonly used statistical tests and how to run them in R.

**1. Shapiro Test: Testing for normality**

**Why is it used?**

To test if a sample follows a Normal distribution.

`shapiro.test(myVec)`

*# Does myVec follow a normal distribution?*

*# Example: Test a normal distribution*

normaly_disb <- rnorm(100, mean=5, sd=1) *# generate a sample from a normal distribution*

shapiro.test(normaly_disb)

Shapiro-Wilk normality test

data: normaly_disb

W = 0.9936, p-value = 0.919

*# Example: Test a uniform distribution*

not_normaly_disb <- runif(100)

shapiro.test(not_normaly_disb)

Shapiro-Wilk normality test

data: not_normaly_disb

W = 0.9563, p-value = 0.002195

**How to interpret?**

If the p-Value is less than the significance level (0.05), the null hypothesis that the sample is normally distributed can be rejected.

**2. One Sample t-Test: Testing the mean of a sample from a normal distribution**

**Why is it used?**

To test if the mean of a sample could reasonably be a specific value.

`x <- rnorm(50, mean = 10, sd = 0.5)`

`t.test(x, mu = 10) # testing if the mean of x could be 10`

One Sample t-test

data: x

t = -0.8547, df = 49, p-value = 0.3969

alternative hypothesis: true mean is not equal to 10

95 percent confidence interval: 9.797195 10.081767

sample estimates: mean of x 9.939481

**How to interpret?**

In the above case, the p-Value is not less than the significance level (0.05), therefore the null hypothesis that the mean = 10 cannot be rejected. Also note that the 95% confidence interval includes the value 10 within its range. So, it is ok to say the mean of *‘x’* is 10, especially since *‘x’* is assumed to be normally distributed. In case a normal distribution is not assumed, use the *Wilcoxon signed rank test* shown in the next section.

Note: Use the *conf.level* argument to adjust the confidence level.
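For instance, a minimal sketch reusing the *x* above, with a 99% confidence interval instead of the default 95%:

`t.test(x, mu = 10, conf.level = 0.99) # request a 99% confidence interval`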

**3. Wilcoxon Signed Rank Test: Testing the mean of a sample when normal distribution is not assumed**

**Why / When is it used?**

The Wilcoxon signed rank test can be an alternative to the t-Test when the data sample is not assumed to follow a normal distribution. It is a non-parametric method used to test if an estimate differs from its true value.

`wilcox.test(input.vector, mu = m, conf.int = TRUE)`

**How to interpret?**

If the p-Value < 0.05, reject the null hypothesis and accept the alternative hypothesis mentioned in your R code’s output. Type *example(wilcox.test)* in the R console for an illustration.
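A minimal sketch with a made-up sample (the values below are illustrative, not from a real dataset):

`x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30) # hypothetical sample`

`wilcox.test(x, mu = 1, conf.int = TRUE) # is the location different from 1?`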

**4. Two Sample t-Test and Wilcoxon Rank Sum Test: Comparing mean of two samples**

Both the t-Test and the Wilcoxon rank sum test can be used to compare the means of 2 samples.

**How to implement in R?**

Pass the two numeric vector samples to t.test() when the samples are normally distributed, and to wilcox.test() when they aren’t assumed to follow a normal distribution.

`x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)`

y <- c(1.15, 0.88, 0.90, 0.74, 1.21)

wilcox.test(x, y, alternative = "g") # greater

Wilcoxon rank sum test

data: x and y

W = 35, p-value = 0.1272

alternative hypothesis: true location shift is greater than 0

With a p-Value of 0.1272, we cannot reject the null hypothesis that x and y have the same mean.

`t.test(1:10, y = c(7:20)) `

*# P = .00001855*

Welch Two Sample t-test

data: 1:10 and c(7:20)

t = -5.4349, df = 21.982, p-value = 1.855e-05

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval: -11.052802 -4.947198

sample estimates: mean of x mean of y 5.5 13.5

With a p-Value < 0.05, we can safely reject the null hypothesis that there is no difference in means.
**What if we want to do a 1-to-1 comparison of means for values of x and y?**

*# Use paired = TRUE for a 1-to-1 comparison of observations. Note that both vectors must have the same length, so the x and y defined above (10 and 5 observations) cannot be paired as-is.*

t.test(x, y, paired = TRUE) # when observations are paired, use the 'paired' argument

wilcox.test(x, y, paired = TRUE) # both x and y are assumed to have similar shapes
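A minimal runnable sketch with hypothetical before/after measurements, paired by subject (the numbers are made up):

`before <- c(12.1, 11.4, 13.2, 11.9, 12.8) # hypothetical pre-treatment values`

`after <- c(11.2, 10.9, 12.5, 11.1, 12.0) # hypothetical post-treatment values`

`t.test(before, after, paired = TRUE) # equal lengths, compared observation by observation`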

**When can I conclude that the means are different?**

Conventionally, if the p-Value is less than the significance level (typically 0.05), reject the null hypothesis that both means are equal.

**5. Kolmogorov-Smirnov Test: Test if two samples have the same distribution**

The Kolmogorov-Smirnov test is used to check whether 2 samples follow the same distribution.

`ks.test(x, y) # x and y are two numeric vectors`

*# from different distributions*

x <- rnorm(50)

y <- runif(50)

ks.test(x, y) *# perform ks test*

Two-sample Kolmogorov-Smirnov test

data: x and y

D = 0.48, p-value = 1.387e-05

alternative hypothesis: two-sided

*# both from normal distribution*

x <- rnorm(50)

y <- rnorm(50)

ks.test(x, y) *# perform ks test*

Two-sample Kolmogorov-Smirnov test

data: x and y

D = 0.18, p-value = 0.3959

alternative hypothesis: two-sided

**How to tell if they are from the same distribution?**

If the p-Value < 0.05 (significance level), we reject the null hypothesis that they are drawn from the same distribution. In other words, p < 0.05 implies that x and y come from different distributions.

**6. Fisher’s F-Test: Test if two samples have same variance**

Fisher’s F test can be used to compare variances of 2 samples.

`var.test(x, y) # Do x and y have the same variance?`

Alternatively, *fligner.test()* and *bartlett.test()* can be used for the same purpose.
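A minimal sketch with simulated samples (the parameters are illustrative):

`x <- rnorm(50, mean = 0, sd = 1)`

`y <- rnorm(50, mean = 0, sd = 2) # deliberately larger variance`

`var.test(x, y) # a small p-Value suggests the variances differ`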

**7. Chi Squared Test: Test the independence of two variables in a contingency table**

The Chi-squared test can be used to test the independence of two categorical variables. Example: you may want to figure out if big-budget films become box-office hits. We have 2 categorical variables (film budget, success status), each with 2 levels (big/low budget and hit/flop), which form a 2 x 2 contingency table.

`chisq.test(matrix, correct = FALSE) # Yates continuity correction not applied`

or

summary(table(x, y)) # performs a chi-squared test.

Pearson's Chi-squared test

data: M

X-squared = 30.0701, df = 2, p-value = 2.954e-07
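The *M* in the output above comes from a dataset not shown here (its df = 2 implies a table larger than 2 x 2). A minimal sketch of the film example, using made-up counts:

*# hypothetical 2 x 2 table: film budget vs box-office status*

`films <- matrix(c(45, 15, 20, 40), nrow = 2, dimnames = list(budget = c("big", "low"), status = c("hit", "flop")))`

`chisq.test(films, correct = FALSE) # are budget and success independent?`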

**How to tell if x, y are independent?**

There are two ways to tell if they are independent: (1) by looking at the p-Value, or (2) from the Chi-squared value.

**p-Value:** If the p-Value is less than 0.05, we reject the null hypothesis that x and y are independent. So for the example output above (p-Value = 2.954e-07), we reject the null hypothesis and conclude that x and y are not independent.

**Chi-sq Value:** For a 2 x 2 contingency table, which has 1 degree of freedom (d.o.f), if the calculated Chi-Squared statistic is greater than **3.841 (critical value)**, we reject the null hypothesis that the variables are independent. To find the critical value for larger contingency tables, use *qchisq(0.95, df)*, where df = (number of rows - 1) x (number of columns - 1).
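For instance, the 3.841 cutoff can be recovered directly:

`qchisq(0.95, df = 1) # 3.841459, critical value for a 2 x 2 table (1 d.o.f)`

`qchisq(0.95, df = 2) # 5.991465, critical value for 2 d.o.f`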

**8. Correlation: Test the linear relationship of two variables**

The *cor.test()* function tests if the correlation between two variables is significant.

`cor.test(x, y) # where x and y are numeric vectors.`
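A minimal sketch with simulated data (the linear relationship is built in by construction):

`x <- rnorm(50)`

`y <- 2 * x + rnorm(50) # y is linearly related to x`

`cor.test(x, y) # a small p-Value implies a significant linear correlation`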

**9. More Commonly Used Tests**

`fisher.test(contingencyMatrix, alternative = "greater") # Fisher's exact test to test independence of rows and columns in contingency table`

friedman.test() # Friedman's rank sum non-parametric test
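As a minimal sketch, Fisher's exact test on a hypothetical 2 x 2 table (the counts are made up):

`contingencyMatrix <- matrix(c(8, 2, 1, 5), nrow = 2) # hypothetical counts`

`fisher.test(contingencyMatrix, alternative = "greater") # one-sided test for positive association`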

There are more useful tests available in various other packages.

The package *‘lawstat’* has a good collection, and the *outliers* package has a number of tests for detecting the presence of outliers.