(R) One-sample Bootstrap Method

by jangpiano 2020. 10. 17.

Let's assume that we want information about the weights of women aged 20-30 worldwide. However, there are about 3 billion women, and we cannot survey every woman in the world because of time and cost constraints.

So we assume that we have surveyed the weights of only 30 women.

Predicting trends in a population of 3 billion from only 30 samples seems like nonsense.

Surprisingly, however, the bootstrap method lets us make inferences about the weights of those 3 billion women, such as a confidence interval for their mean weight, provided the 30 women are a representative random sample.

So what is the 'Bootstrap Method'?

The bootstrap method is a way to make inferences about a population by repeatedly resampling, with replacement, from one small observed sample. The most important point here is 'with replacement.'

First, let's assume we have surveyed 30 people. Then we draw a new sample of 30 values from those 30 results with replacement, and repeat this at least about 10000 times. This gives 10000 resamples, most of which differ from one another. The means of those 10000 resamples approximate the sampling distribution of the sample mean, which tells us how far our estimate is likely to be from the true population mean.
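To see why 'with replacement' matters, here is a minimal sketch on toy data. The vector `obs` and the seed are made-up values for illustration only, not part of the survey example.

```r
# Toy data (hypothetical values, for illustration only)
obs <- c(50, 55, 60, 65, 70)

set.seed(1)  # arbitrary seed so the run is reproducible

# One bootstrap resample: same size as the data, drawn WITH replacement,
# so some values may appear more than once and others not at all
resample <- sample(obs, size = length(obs), replace = TRUE)

# Every resampled value comes from the original data
all_from_obs <- all(resample %in% obs)

# WITHOUT replacement we only get a reshuffle of the same values,
# so the mean can never change and there is nothing to learn from it
perm <- sample(obs, replace = FALSE)
same_mean <- isTRUE(all.equal(mean(perm), mean(obs)))

print(resample)
print(all_from_obs)
print(same_mean)
```

Because a resample without replacement always has the same mean as the data, only resampling with replacement produces the spread of means that the bootstrap relies on.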

I will show that this really works with an example.


1. We want information about the weights of women aged 20-30 worldwide. (population = about 3 billion)

2. We have data on 30 women.

3. Draw resamples with replacement from the 30 original observations, about 10000 times.

4. Compute the mean of each of the 10000 resamples.

5. The distribution of the 10000 means approximates the sampling distribution of the sample mean, so we can use it to make inferences about a population parameter. That is, the bootstrap method gives a confidence interval for the population mean that is similar to the one from a one-sample t-test.
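The five steps above can be sketched as one small function. This is only a compact illustration, not the exact code used in the walkthrough below: the name `bootstrap_mean_ci` and its arguments `n_boot` and `level` are hypothetical, and the data are simulated the same way as in step 2.

```r
# Hypothetical helper: percentile bootstrap interval for the mean
bootstrap_mean_ci <- function(x, n_boot = 10000, level = 0.95) {
  # Steps 3 and 4: resample with replacement and take the mean, n_boot times
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  # Step 5: cut off (1 - level) / 2 of the bootstrap means in each tail
  alpha <- 1 - level
  ci <- quantile(boot_means, c(alpha / 2, 1 - alpha / 2))
  list(boot_means = boot_means, ci = ci)
}

set.seed(3)               # fix the simulated "survey"
x <- runif(30, 88, 220)   # step 2: stand-in for the 30 surveyed weights
res <- bootstrap_mean_ci(x)
print(res$ci)
```

`replicate()` does the same job as an explicit `for` loop over a preallocated vector; which one to use is a matter of taste.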


2.

> set.seed(3)

# This step stands in for the survey: fixing the seed makes the 30 simulated sample values the same on every run.


> x=runif(30,88,220)   # draw 30 values from the uniform distribution on [88, 220], i.e. Uni(a=88, b=220)



> x

 [1] 110.18148 194.59216 138.81239 131.26093 167.47729

 [6] 167.78002 104.45161 126.88732 164.24451 171.28926

[11] 155.58610 154.66316 158.49267 161.55693 202.56537

[16] 197.52155 102.71129 180.88686 206.46845 124.92470

[21] 118.12265  90.02355 105.02557 100.32641 119.26882

[26] 192.43146 167.16457 208.13950 161.97604 187.75303


> x

 [1] 110.18148 194.59216 138.81239 131.26093 167.47729

 [6] 167.78002 104.45161 126.88732 164.24451 171.28926

[11] 155.58610 154.66316 158.49267 161.55693 202.56537

[16] 197.52155 102.71129 180.88686 206.46845 124.92470

[21] 118.12265  90.02355 105.02557 100.32641 119.26882

[26] 192.43146 167.16457 208.13950 161.97604 187.75303


> x

 [1] 110.18148 194.59216 138.81239 131.26093 167.47729

 [6] 167.78002 104.45161 126.88732 164.24451 171.28926

[11] 155.58610 154.66316 158.49267 161.55693 202.56537

[16] 197.52155 102.71129 180.88686 206.46845 124.92470

[21] 118.12265  90.02355 105.02557 100.32641 119.26882

[26] 192.43146 167.16457 208.13950 161.97604 187.75303



> mean(x)

[1] 152.4195

> mean(x)

[1] 152.4195

> mean(x)

[1] 152.4195


# Always the same: x is a stored vector, and set.seed(3) makes the runif draw itself reproducible.
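A small check of what `set.seed` does, as a sketch: the names `a`, `b` and `unseeded` are made up for this illustration. Resetting the seed reproduces the same draw, while a further call without resetting it continues the random stream and gives different values.

```r
set.seed(3)
a <- runif(30, 88, 220)

set.seed(3)                     # same seed again
b <- runif(30, 88, 220)         # reproduces exactly the same 30 values

unseeded <- runif(30, 88, 220)  # no reseeding: the stream moves on

print(identical(a, b))
print(identical(a, unseeded))
```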


3.

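# Resample 30 values with replacement from x (the 30 values between 88 and 220 that stand in for the surveyed women) -- the result is different every time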

> sample(x,replace=TRUE)

 [1] 180.88686 197.52155 119.26882 154.66316 126.88732

 [6]  90.02355 208.13950 206.46845 167.78002 126.88732

[11] 202.56537 171.28926 155.58610 161.97604 105.02557

[16] 138.81239 187.75303 167.47729 105.02557  90.02355

[21] 126.88732 171.28926 131.26093 110.18148 206.46845

[26] 167.78002 124.92470 119.26882 192.43146 158.49267

> sample(x,replace=TRUE)

 [1] 104.4516 192.4315 110.1815 164.2445 102.7113

 [6] 164.2445 167.4773 180.8869 167.7800 161.9760

[11] 126.8873 155.5861 171.2893 131.2609 161.5569

[16] 164.2445 138.8124 131.2609 154.6632 180.8869

[21] 102.7113 197.5215 131.2609 167.7800 110.1815

[26] 158.4927 194.5922 118.1226 194.5922 100.3264

> sample(x,replace=TRUE)

 [1] 105.0256 104.4516 171.2893 167.7800 161.5569

 [6] 206.4685 202.5654 110.1815 171.2893 194.5922

[11] 118.1226 202.5654 155.5861 192.4315 187.7530

[16] 131.2609 119.2688 197.5215 118.1226 167.7800

[21] 105.0256 104.4516 197.5215 138.8124 100.3264

[26] 154.6632 131.2609 167.1646 161.9760 197.5215


> mean(sample(x,replace=TRUE))

[1] 159.0381

> mean(sample(x,replace=TRUE))

[1] 155.6042

> mean(sample(x,replace=TRUE))

[1] 147.8545


4.

Now we repeat this 10000 times (as many times as is practical): resample the original sample with replacement and compute the mean of each resample.


> bs.means=numeric(10000)

> for (i in 1:10000){

+   bs.means[i]=mean(sample(x,replace=TRUE))

+ }

> mean(bs.means)

[1] 152.3677

> var(bs.means)

[1] 40.45533

> hist(bs.means, xlab = "Bootstrap Mean", ylab = "Frequency", main="Histogram of Bootstrap Means")


> abline(v=mean(bs.means),col="red")                    #v: the x-value for the vertical line

> quantile(bs.means,c(0.025,0.975))    

# the 95% bootstrap percentile confidence interval for the population mean (nonparametric: no t-distribution is assumed).

           

    2.5%    97.5% 

139.8644 164.8106
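The value of var(bs.means) can be checked against theory. Given the sample, the variance of a bootstrap mean is (n-1)/n times var(x)/n, so sd(bs.means) should be close to the usual standard error sd(x)/sqrt(n), only slightly smaller. The sketch below rebuilds `x` and `bs.means` so that it runs on its own; the names `boot_se`, `formula_se` and `ideal_se` are introduced here for illustration.

```r
set.seed(3)
x <- runif(30, 88, 220)
n <- length(x)

# 10000 bootstrap means, as in step 4
bs.means <- replicate(10000, mean(sample(x, replace = TRUE)))

boot_se    <- sd(bs.means)                 # bootstrap estimate of the standard error
formula_se <- sd(x) / sqrt(n)              # textbook standard error of the mean
ideal_se   <- sqrt((n - 1) / n) * formula_se  # what boot_se converges to as the number of resamples grows

print(c(boot_se = boot_se, ideal_se = ideal_se, formula_se = formula_se))
```

The three numbers should agree to within a few percent; the remaining gap between `boot_se` and `ideal_se` is simulation noise from using a finite number of resamples.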


<Comparison with the t-based confidence interval for the mean> 


> t.val = qt(0.975, length(x) -1)  #97.5% quantile of the t-distribution with length(x)-1 = 29 degrees of freedom

> se = sd(x)/sqrt(length(x))   #sd(x): standard deviation of the sample (S); se: standard error of the mean

> ci.low = mean(x) - t.val * se     #mean(x): the sample mean

> ci.upp = mean(x) + t.val * se

> c(ci.low, ci.upp)

[1] 139.0564 165.7827  #similar to quantile(bs.means,c(0.025,0.975))
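As a cross-check, base R's `t.test` reports the same interval in its `conf.int` component. The sketch below rebuilds `x` so that it runs on its own; `ci_builtin` and `ci_manual` are names chosen here for illustration.

```r
set.seed(3)
x <- runif(30, 88, 220)

# Built-in one-sample t interval for the mean
ci_builtin <- t.test(x, conf.level = 0.95)$conf.int

# Manual version, same formula as above
t.val <- qt(0.975, length(x) - 1)
se <- sd(x) / sqrt(length(x))
ci_manual <- c(mean(x) - t.val * se, mean(x) + t.val * se)

print(as.numeric(ci_builtin))
print(ci_manual)
```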
