본문 바로가기
R

(R) Two sample Bootstrap Method

by jangpiano 2020. 10. 23.
반응형

<Two sample Bootstrap Method> 

I explained why Bootstrap method is important and how to use it in previous post. The difference between the previous one and this post will be the number of samples we are interested in. We want to assume the relationship ,for example the difference between more than one populations, by using bootstrap method. Let's assume that we want to show that weights of male and women are 0, that is weights of men = weights of women. We set it as null hypothesis. 

H0: M0=M1

H1: M0!=M1


Bootstrap Method is simply the way to make inference about population using smaller samples with replacement. The most important point here is 'with replacement.'

Firstly, we assume that there are two Independent groups A and B. Because the number of people is very large, we collect random variables from two groups. In the example below, the sample sizes for both group is 30. And as sample cannot be changed, we must set 'set.seed(3)' firstly. 

Secondly, we devise the way to distinguish between male and female information 

Next, make a numeric vector, length of 10000, and fill the vector in mean difference of 20 samples' sample with replacement between men and women weight. 

 


 

1.

> set.seed(3)

> gender=c(rep("male",20),rep("female",20))

> weight=c(runif(20,50,110),runif(20,50,90))


> d<-as.data.frame(cbind(gender,weight))

> d

   gender           weight

1    male 58.4020763169974

2    male 90.3758199536242

3    male 69.2471175687388

4    male 66.3867158582434

5    male  80.105033738073

6    male 80.2197027020156

7    male 56.2316722120158

8    male 64.7300462122075

9    male 78.8804959505796

10   male 81.5489637199789

11   male 75.6007948773913

12   male 75.2511957078241

13   male 76.7017676727846

14   male  77.862471784465

15   male 93.3959743822925

16   male 91.4854346658103

17   male 55.5724576697685

18   male 85.1844179444015

19   male 94.8744132183492

20   male 63.9866276877001

21 female 59.1280752513558

22 female 50.6131957005709

23 female 55.1592623442411

24 female 53.7352771405131

25 female 59.4754002988338

26 female 81.6458963789046

27 female 73.9892626274377

28 female 86.4059084374458

29 female 72.4169821571559

30 female 80.2281907666475

31 female 65.1668756455183

32 female 64.9312390759587

33 female 56.8116257153451

34 female 68.1322929449379

35 female 60.3365584183484

36 female 63.4506380930543

37 female 85.5833213869482

38 female 58.0778519529849

39 female 73.1674417108297

40 female 58.3052811957896



> d$weight<-as.numeric(d$weight)

> d

   gender   weight

1    male 58.40208

2    male 90.37582

3    male 69.24712

4    male 66.38672

5    male 80.10503

6    male 80.21970

7    male 56.23167

8    male 64.73005

9    male 78.88050

10   male 81.54896

11   male 75.60079

12   male 75.25120

13   male 76.70177

14   male 77.86247

15   male 93.39597

16   male 91.48543

17   male 55.57246

18   male 85.18442

19   male 94.87441

20   male 63.98663

21 female 59.12808

22 female 50.61320

23 female 55.15926

24 female 53.73528

25 female 59.47540

26 female 81.64590

27 female 73.98926

28 female 86.40591

29 female 72.41698

30 female 80.22819

31 female 65.16688

32 female 64.93124

33 female 56.81163

34 female 68.13229

35 female 60.33656

36 female 63.45064

37 female 85.58332

38 female 58.07785

39 female 73.16744

40 female 58.30528


2.

> of_male=which(d$gender=="male")

> of_female=which(d$gender=="female")


3.

> bs.means.difference=numeric(10000)


> for (i in 1:10000){

+   bs.means.difference[i]=mean(sample(d$weight[of_male],replace=TRUE))-mean(sample(d$weight[of_female],replace=TRUE))

+ }


4.


T.obs is the value we can observe from the samples from the population, that is mean(samples from men)-mean(samples from women)


> T.obs=mean(d$weight[of_male])-mean(d$weight[of_female])

> T.obs

[1] 9.464131


> p.val = sum(abs(bs.means.difference-mean(bs.means.difference)) >= abs(T.obs))/10000

> p.val

[1] 0.0071


It is suggested to add 1 to both denominator and nominator while using > sign instead of >= 


> p.val = (sum(abs(bs.means.difference-mean(bs.means.difference)) > abs(T.obs))+1)/(10000+1)

> p.val

[1] 0.00719928


<Histogram 1>


> hist(bs.means.difference,main="Histogram of Bootstrap Mean Differences",xlab="Mean Difference",ylab="Frequency")

> quantile(bs.means.difference,c(0.025,0.975))

     2.5%     97.5% 

 2.455102 16.480545 


> abline(v=quantile(bs.means.difference,c(0.025,0.975)),col="blue")

<Histogram 2 - to make 0 a center of the histogram - subtract mean(bs.means.difference from bs.means.difference)> 


>hist(bs.means.difference-T.obs,main="Histogram of Bootstrap Mean Differences",xlab="Mean Difference",ylab="Frequency")


>abline(v=T.obs,col="red")


The area right side of the red line is about 0.00719 which is p-value . 

반응형