본문 바로가기
R

(R) tapply function/ comparision to aggregate function /permutation test with tapply function

by jangpiano 2020. 11. 17.
반응형

<tapply function>


the tapply function applies a function to each group of values given by a unique combination of the levels of certain factors. 


>?tapply


the apply function needs three main arguments which are x, INDEX, FUN.

x : typically vector which we want to apply functions to. 

INDEX: a list of one or more factors, each of same length as x. 

fun: a function to be applied


Depending on the number of factors in the INDEX argument, and the number of values returned by the FUN argument, the class and/or the mode of the object that is returned by the tapply function differ. Each case will be explored below.


<Using one factor with a function returning one value> 


> str(airquality)

'data.frame': 153 obs. of  7 variables:

 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...

 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...

 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...

 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...

 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...

 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...


> Temp_month_mean=tapply(airquality$Temp,airquality$Month,mean)

> Temp_month_mean

       5        6        7        8        9 

65.54839 79.10000 83.90323 83.96774 76.90000 


> Temp_month_sd=tapply(airquality$Temp,airquality$Month,sd)

> Temp_month_sd

       5        6        7        8        9 

6.854870 6.598589 4.315513 6.585256 8.355671 


> class(Temp_month)

[1] "array"

> mode(Temp_month)

[1] "numeric"


> mean.sd=function(x)c(Mean=mean(x),SD=sd(x))

> Temp_month_mean_sd=tapply(airquality$Temp,airquality$Month,mean.sd)

> Temp_month_sd

$`5`

    Mean       SD 

65.54839  6.85487 


$`6`

     Mean        SD 

79.100000  6.598589 


$`7`

     Mean        SD 

83.903226  4.315513 


$`8`

     Mean        SD 

83.967742  6.585256 


$`9`

     Mean        SD 

76.900000  8.355671


[access a component of a list]

As tapply is the function making a list depending on function and vector, we can reach the element of result of tapply function with the same method of reaching elements in list like this


> list=list(x=rnorm(3,0,1),y=runif(5,5,10))

> list

$x

[1]  0.1008402 -2.1531764 -1.0358789


$y

[1] 7.242432 5.188416 6.961774 8.765925 8.328372


> list[1]

$x

[1]  0.1008402 -2.1531764 -1.0358789


> list[[1]]

[1]  0.1008402 -2.1531764 -1.0358789



> Temp_month_mean_sd[1]

$`5`

    Mean       SD 

65.54839  6.85487 


> Temp_month_mean_sd['5']

$`5`

    Mean       SD 

65.54839  6.85487  

> Temp_month_mean_sd[[1]]

    Mean       SD 

65.54839  6.85487 

> Temp_month_mean_sd[['5']]

    Mean       SD 

65.54839  6.85487  



<Using more than one factor with a function returning one value> 

In the previous example, I was interested in the mean of temperature depending on a level 'month.'

For this case, let's assume that we want to find the mean of temperature for each unique combination of the levels of 'month' and ' solar.'

To make two factors, I will make a new variable which is 'Solar' using if else function. 


> summary(airquality$Solar.R)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 

    7.0   115.8   205.0   185.9   258.8   334.0       7 


> airquality$SOLAR<-ifelse(airquality$Solar.R<=115,"weak",ifelse(airquality$Solar.R<=205,"med","strong"))


> str(airquality)

'data.frame': 153 obs. of  7 variables:

 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...

 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...

 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...

 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...

 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...

 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

 $ SOLAR  : chr  "med" "med" "med" "strong" ... 


After making a new variable 'SOLAR,' we can find the mean of temperature for combination of two factors 'Month' and 'SOLAR.'

Here, means are computed along each column, which represents levels of Month and row which represents the levels of SOLAR.


> Temp_month_mean_2=tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),mean)

> Temp_month_mean_2

   med   strong   weak

5 70.5 68.84615 60.200

6 78.7 81.35714 74.500

7 84.4 84.42857 81.200

8 87.0 85.30769 78.375

9 83.0 73.75000 74.000


> Temp_month_sd_2=tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),sd)

> Temp_month_sd_2

       med   strong     weak

5 3.109126 6.706292 2.973961

6 3.772709 7.850597 5.167204

7 3.435113 4.376887 4.604346

8 6.855655 7.016464 3.292307

9 8.615232 5.658863 7.982123


> mean.sd=function(x)c(Mean=mean(x),SD=sd(x))

> Temp_month_mean_sd_2=tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),mean.sd)

> Temp_month_mean_sd_2

  med       strong    weak     

5 Numeric,2 Numeric,2 Numeric,2

6 Numeric,2 Numeric,2 Numeric,2

7 Numeric,2 Numeric,2 Numeric,2

8 Numeric,2 Numeric,2 Numeric,2

9 Numeric,2 Numeric,2 Numeric,2


First time when I look at this result, I think my code was wrong. But right after I realized that I can reach each element from correctly.

You can reach each elements of lists like this. 

As the first column first row represents the temperature of May with med solar energy, the way to reach the first columns of first row is the same with the way to reach rows for 'MAY' and columns for 'med'


> Temp_month_mean_sd_2[1,1]

[[1]]

     Mean        SD 

70.500000  3.109126 


> Temp_month_mean_sd_2[[1,1]]

     Mean        SD 

70.500000  3.109126 

> Temp_month_mean_sd_2["5","med"]

[[1]]

     Mean        SD 

70.500000  3.109126 


> Temp_month_mean_sd_2[["5","med"]]

     Mean        SD 

70.500000  3.109126 


<tapply function VS aggregate function> 

 tapply function

 aggregate function

 the apply function needs three main arguments which are x, INDEX, FUN.


x : typically vector which we want to apply functions to. 

INDEX: a list of one or more factors, each of same length as x. 

fun: a function to be applied


the aggregate function needs three main arguments which are x, by, FUN. 


x: vector, matrix, data frame

by: a list of grouping elements 

FUN: function to compute the summary statistics which can be applied to all data subsets.  

> tapply(airquality$Tem,airquality$Month,mean)

       5        6        7        8        9 

65.54839 79.10000 83.90323 83.96774 76.90000 

 > aggregate(airquality$Temp,list(airquality$Month),mean)

  Group.1        x

1       5 65.54839

2       6 79.10000

3       7 83.90323

4       8 83.96774

5       9 76.90000

Able to have more than one factor. 


> tapply(airquality$Temp, list(airquality$Month,airquality$SOLAR),mean)

   med   strong   weak

5 70.5 68.84615 60.200

6 78.7 81.35714 74.500

7 84.4 84.42857 81.200

8 87.0 85.30769 78.375

9 83.0 73.75000 74.000


Able to have more than one value. 

> aggregate(airquality[,c("Temp","Solar.R","Wind")],list(airquality$Month),mean)

  Group.1     Temp  Solar.R      Wind

      1       5 65.54839       NA         11.622581

          2       6 79.10000   190.1667     10.266667

     3       7 83.90323     216.4839  8.94193

   4       8 83.96774       NA      8.793548

      5       9 76.90000 167.4333 10.180000


   med       strong    weak     

5 Numeric,2 Numeric,2 Numeric,2

6 Numeric,2 Numeric,2 Numeric,2

7 Numeric,2 Numeric,2 Numeric,2

8 Numeric,2 Numeric,2 Numeric,2

9 Numeric,2 Numeric,2 Numeric,2

 > aggregate(airquality[,c("Temp","Solar.R","Wind")],list(airquality$SOLAR),mean.sd)

  Group.1 Temp.Mean   Temp.SD Solar.R.Mean Solar.R.SD Wind.Mean   Wind.SD

1     med 81.388889  7.545208    164.36111   27.02819  8.775000  2.957448

2  strong 79.465753  8.768761    261.42466   32.14201 10.067123  3.454428

3    weak 72.270270  9.170147     57.97297   32.25978 11.072973  3.811288


<Permutation test with tapply function


> set.seed(3)

> major=c(rep("statistics",35),rep("math",35))

> score=c(runif(35,50,100),runif(35,70,99))

> data<-as.data.frame(cbind(major,score))

> data

        major    score

1  statistics 58.40208

2  statistics 90.37582

3  statistics 69.24712

4  statistics 66.38672

5  statistics 80.10503

6  statistics 80.21970

7  statistics 56.23167

8  statistics 64.73005

9  statistics 78.88050

10 statistics 81.54896

11 statistics 75.60079

12 statistics 75.25120

13 statistics 76.70177

14 statistics 77.86247

15 statistics 93.39597

16 statistics 91.48543

17 statistics 55.57246

18 statistics 85.18442

19 statistics 94.87441

20 statistics 63.98663

21 statistics 61.41009

22 statistics 50.76649

23 statistics 56.44908

24 statistics 54.66910

25 statistics 61.84425

26 statistics 89.55737

27 statistics 79.98658

28 statistics 95.50739

29 statistics 78.02123

30 statistics 87.78524

31 statistics 68.95859

32 statistics 68.66405

33 statistics 58.51453

34 statistics 72.66537

35 statistics 62.92070

36       math 79.75171

37       math 95.79791

38       math 75.85644

39       math 86.79640

40       math 76.02133

41       math 78.16259

42       math 92.80215

43       math 75.01756

44       math 86.55168

45       math 82.15921

46       math 77.76104

47       math 71.38647

48       math 73.00130

49       math 79.10691

50       math 93.21859

51       math 76.65042

52       math 76.17695

53       math 95.43593

54       math 98.80344

55       math 94.48316

56       math 96.40266

57       math 83.66682

58       math 76.50813

59       math 73.70663

60       math 78.11082

61       math 93.66708

62       math 71.67078

63       math 93.28205

64       math 73.02725

65       math 92.23157

66       math 78.83951

67       math 92.30934

68       math 85.67902

69       math 80.50875

70       math 72.68415

> data$score<-as.numeric(data$score)

> means=tapply(data$score,data$major,mean)

> means

      math statistics 

  83.06388   73.25038 


> T.obs=means[2]-means[1]                          #mean score of statistics - mean score of math


> T.null=numeric()

> for (i in 1:10000){

+   major.label=sample(data$major)

+   means=tapply(data$score,major.label,mean)

+   T.null[i]=means[2]-means[1]

+ }

> T.obs

statistics 

   -9.8135 

> hist(T.null, xlab="Mean difference",ylab="Frequency")

> abline(v=T.obs, col="red")


> p.val=(sum(abs(T.null)>abs(T.obs))+1)/(10000+1)

> p.val

[1] 0.00039996           #small enough --> reject null 


반응형