<Two sample Bootstrap Method>
I explained why Bootstrap method is important and how to use it in previous post. The difference between the previous one and this post will be the number of samples we are interested in. We want to assume the relationship ,for example the difference between more than one populations, by using bootstrap method. Let's assume that we want to show that weights of male and women are 0, that is weights of men = weights of women. We set it as null hypothesis.
H0: M0=M1
H1: M0!=M1
Bootstrap Method is simply the way to make inference about population using smaller samples with replacement. The most important point here is 'with replacement.'
Firstly, we assume that there are two Independent groups A and B. Because the number of people is very large, we collect random variables from two groups. In the example below, the sample sizes for both group is 30. And as sample cannot be changed, we must set 'set.seed(3)' firstly.
Secondly, we devise the way to distinguish between male and female information
Next, make a numeric vector, length of 10000, and fill the vector in mean difference of 20 samples' sample with replacement between men and women weight.
1.
> set.seed(3)
> gender=c(rep("male",20),rep("female",20))
> weight=c(runif(20,50,110),runif(20,50,90))
> d<-as.data.frame(cbind(gender,weight))
> d
gender weight
1 male 58.4020763169974
2 male 90.3758199536242
3 male 69.2471175687388
4 male 66.3867158582434
5 male 80.105033738073
6 male 80.2197027020156
7 male 56.2316722120158
8 male 64.7300462122075
9 male 78.8804959505796
10 male 81.5489637199789
11 male 75.6007948773913
12 male 75.2511957078241
13 male 76.7017676727846
14 male 77.862471784465
15 male 93.3959743822925
16 male 91.4854346658103
17 male 55.5724576697685
18 male 85.1844179444015
19 male 94.8744132183492
20 male 63.9866276877001
21 female 59.1280752513558
22 female 50.6131957005709
23 female 55.1592623442411
24 female 53.7352771405131
25 female 59.4754002988338
26 female 81.6458963789046
27 female 73.9892626274377
28 female 86.4059084374458
29 female 72.4169821571559
30 female 80.2281907666475
31 female 65.1668756455183
32 female 64.9312390759587
33 female 56.8116257153451
34 female 68.1322929449379
35 female 60.3365584183484
36 female 63.4506380930543
37 female 85.5833213869482
38 female 58.0778519529849
39 female 73.1674417108297
40 female 58.3052811957896
> d$weight<-as.numeric(d$weight)
> d
gender weight
1 male 58.40208
2 male 90.37582
3 male 69.24712
4 male 66.38672
5 male 80.10503
6 male 80.21970
7 male 56.23167
8 male 64.73005
9 male 78.88050
10 male 81.54896
11 male 75.60079
12 male 75.25120
13 male 76.70177
14 male 77.86247
15 male 93.39597
16 male 91.48543
17 male 55.57246
18 male 85.18442
19 male 94.87441
20 male 63.98663
21 female 59.12808
22 female 50.61320
23 female 55.15926
24 female 53.73528
25 female 59.47540
26 female 81.64590
27 female 73.98926
28 female 86.40591
29 female 72.41698
30 female 80.22819
31 female 65.16688
32 female 64.93124
33 female 56.81163
34 female 68.13229
35 female 60.33656
36 female 63.45064
37 female 85.58332
38 female 58.07785
39 female 73.16744
40 female 58.30528
2.
> of_male=which(d$gender=="male")
> of_female=which(d$gender=="female")
3.
> bs.means.difference=numeric(10000)
> for (i in 1:10000){
+ bs.means.difference[i]=mean(sample(d$weight[of_male],replace=TRUE))-mean(sample(d$weight[of_female],replace=TRUE))
+ }
4.
T.obs is the value we can observe from the samples from the population, that is mean(samples from men)-mean(samples from women)
> T.obs=mean(d$weight[of_male])-mean(d$weight[of_female])
> T.obs
[1] 9.464131
> p.val = sum(abs(bs.means.difference-mean(bs.means.difference)) >= abs(T.obs))/10000
> p.val
[1] 0.0071
It is suggested to add 1 to both denominator and nominator while using > sign instead of >=
> p.val = (sum(abs(bs.means.difference-mean(bs.means.difference)) > abs(T.obs))+1)/(10000+1)
> p.val
[1] 0.00719928
<Histogram 1>
> hist(bs.means.difference,main="Histogram of Bootstrap Mean Differences",xlab="Mean Difference",ylab="Frequency")
>
> quantile(bs.means.difference,c(0.025,0.975))
2.5% 97.5%
2.455102 16.480545
> abline(v=quantile(bs.means.difference,c(0.025,0.975)),col="blue")
<Histogram 2 - to make 0 a center of the histogram - subtract mean(bs.means.difference from bs.means.difference)>
>hist(bs.means.difference-T.obs,main="Histogram of Bootstrap Mean Differences",xlab="Mean Difference",ylab="Frequency")
>abline(v=T.obs,col="red")
The area right side of the red line is about 0.00719 which is p-value .
'R' 카테고리의 다른 글
(R) Two sample t-test (Student's t, Welch's t ) (0) | 2020.10.27 |
---|---|
(R) Permutation Test - Non parametric method (0) | 2020.10.24 |
(R) One-sample Bootstrap Method (0) | 2020.10.17 |
(R)How to add several line graphs to a graph / rearrange the order of colour label/ How to add colours to each graph (0) | 2020.10.09 |
(R) Central limit theorem (0) | 2020.09.27 |