<Permutation Test>
Permutation Test is a non-parametric method.
<Comparison to Bootstrap Test>
Even though Permutation Test and Bootstrap Test are both non-parametric method, they have a difference, which is 'replacement.'
Bootstrap test is the test method accomplished by allowing 'with replacement,' and the Permutation test is the test 'without replacement.'
<Comparison to T-test>
The two-sample t-test starts from the assumption that the sampling distribution is normally distributed. This is the case when both groups have the normal distribution for their sample mean, and it is approximately true for large samples from non-normal distribution because of Central limit theorem. In contrast, the permutation test does not require the normality assumption. As it does not need the central limit theorem to use the test, it robustly works for small or large samples.
<Limitation of Permutation Test>
The permutation test requires that the observations are exchangeable under the null hypothesis. And the exchangeability assumption makes the both groups have an equal variance. So we can say that the permutation test have the same weakness as the classical Student's t-test which sets var.equal=TRUE.
<Show how Permutation Test works using R>
Object: Show that the samples are from the same distribution. That is, the individual two groups are same.
H0: M0=M1
H1: M0!=M1
1. Assume that we have individual two groups.
2. collect random variables from two groups. (The random variables are not changeable)
3. We permute the samples and re-calculate the test statistics .
4. Repeat this permutation test, and re-calculate the test statistics .
5. After sufficient permutations, we set the approximate test statistic distribution.
6. Show that the observed difference between two sample means are large enough to reject H0, at some significance level.
or, show that the observed difference between two sample means are large enough to accept H0 and reject H1.
---->Compare p-value and significance value. (If p-value is less than or equal to significance level, we reject null hypothesis. )
<Assumption>
#We want to test about difference in mean weight between men and females
#H0: M(A)=M(B) H1: M(A)!=M(B)
#sample sizes: n(A),n(B)
#significance level: 5%
#goal: to determine whether the observed difference between the two sample means is large enough to reject the null hypothesis at a significance level 5% : p-value is smaller than significance level.
> set.seed(3)
> gender=c(rep("male",20),rep("female",20))
> weight=c(runif(20,50,100),runif(20,50,90))
> d<-as.data.frame(cbind(gender,weight))
> d$weight<-as.numeric(d$weight)
> d
gender weight
1 male 58.40208
2 male 90.37582
3 male 69.24712
4 male 66.38672
5 male 80.10503
6 male 80.21970
7 male 56.23167
8 male 64.73005
9 male 78.88050
10 male 81.54896
11 male 75.60079
12 male 75.25120
13 male 76.70177
14 male 77.86247
15 male 93.39597
16 male 91.48543
17 male 55.57246
18 male 85.18442
19 male 94.87441
20 male 63.98663
21 female 59.12808
22 female 50.61320
23 female 55.15926
24 female 53.73528
25 female 59.47540
26 female 81.64590
27 female 73.98926
28 female 86.40591
29 female 72.41698
30 female 80.22819
31 female 65.16688
32 female 64.93124
33 female 56.81163
34 female 68.13229
35 female 60.33656
36 female 63.45064
37 female 85.58332
38 female 58.07785
39 female 73.16744
40 female 58.30528
> of_male=which(d$gender=="male")
> of_female=which(d$gender=="female")
> of_male
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
[20] 20
> of_female
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
[20] 40
> T.obs=mean(d$weight[of_male])-mean(d$weight[of_female])
> T.obs
[1] 9.464131
> d$gender
[1] "male" "male" "male" "male" "male" "male"
[7] "male" "male" "male" "male" "male" "male"
[13] "male" "male" "male" "male" "male" "male"
[19] "male" "male" "female" "female" "female" "female"
[25] "female" "female" "female" "female" "female" "female"
[31] "female" "female" "female" "female" "female" "female"
[37] "female" "female" "female" "female"
Permute the genders
<1>
> gender=sample(d$gender)
> gender
[1] "female" "male" "female" "male" "female" "female"
[7] "female" "female" "male" "male" "male" "male"
[13] "male" "female" "male" "male" "female" "male"
[19] "female" "female" "female" "female" "male" "male"
[25] "female" "male" "male" "male" "female" "male"
[31] "male" "female" "male" "female" "female" "male"
[37] "male" "female" "female" "female"
> which(gender=="male")
[1] 2 4 9 10 11 12 13 15 16 18 23 24 26 27 28 30 31 33 36
[20] 37
<2>
> gender=sample(d$gender)
> gender
[1] "female" "male" "male" "female" "male" "male"
[7] "male" "female" "female" "female" "female" "male"
[13] "male" "female" "male" "female" "female" "male"
[19] "male" "female" "male" "female" "male" "female"
[25] "female" "male" "male" "female" "female" "female"
[31] "male" "male" "female" "male" "female" "male"
[37] "female" "male" "female" "male"
> which(gender=="male")
[1] 2 3 5 6 7 12 13 15 18 19 21 23 26 27 31 32 34 36 38
[20] 40
<3>
> gender=sample(d$gender)
> gender
[1] "male" "male" "female" "female" "female" "female"
[7] "female" "female" "female" "male" "female" "female"
[13] "female" "male" "male" "female" "male" "male"
[19] "male" "male" "male" "female" "female" "female"
[25] "male" "male" "female" "female" "male" "female"
[31] "male" "female" "male" "male" "female" "male"
[37] "female" "male" "male" "male"
> which(gender=="male")
[1] 1 2 10 14 15 17 18 19 20 21 25 26 29 31 33 34 36 38 39
[20] 40
Let's repeat this permutation process enough time.
>T.null=numeric(10000)
>for (i in 1:10000){
+ gender.label=sample(d$gender)
+ of_male=which(gender.label=="male")
+ of_female=which(gender.label=="female")
+ T.null[i]=mean(d$weight[of_male])-mean(d$weight[of_female])
+ }
> hist(T.null,xlab="Mean Difference",ylab="Frequency",main="Histogram of permuted mean differences")
> abline(v=T.obs,col="red")
>p.value=sum(abs(T.null)>=abs(T.obs))/10000
It is suggested to add 1 to both denominator and nominator while using > sign instead of >= .
>p.value=sum(abs(T.null)>abs(T.obs)+1)/10001
> p.value
[1] 0.0128 ---------> It is smaller than the significance level we set before the test, which is 0.05.
So reject the null hypothesis and accept that M(A)!=M(B), which is alternative hypothesis.
>conf_int_95=quantlile(T.null+T.obs,c(0.025,0.975)
'R' 카테고리의 다른 글
(R)Wilkinson dot plot (+boxplot) (0) | 2020.10.29 |
---|---|
(R) Two sample t-test (Student's t, Welch's t ) (0) | 2020.10.27 |
(R) Two sample Bootstrap Method (0) | 2020.10.23 |
(R) One-sample Bootstrap Method (0) | 2020.10.17 |
(R)How to add several line graphs to a graph / rearrange the order of colour label/ How to add colours to each graph (0) | 2020.10.09 |