<tapply function>
the tapply function applies a function to each group of values given by a unique combination of the levels of certain factors.
>?tapply
the apply function needs three main arguments which are x, INDEX, FUN.
x : typically vector which we want to apply functions to.
INDEX: a list of one or more factors, each of same length as x.
fun: a function to be applied
Depending on the number of factors in the INDEX argument, and the number of values returned by the FUN argument, the class and/or the mode of the object that is returned by the tapply function differ. Each case will be explored below.
<Using one factor with a function returning one value>
> str(airquality)
'data.frame': 153 obs. of 7 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
> Temp_month_mean=tapply(airquality$Temp,airquality$Month,mean)
> Temp_month_mean
5 6 7 8 9
65.54839 79.10000 83.90323 83.96774 76.90000
> Temp_month_sd=tapply(airquality$Temp,airquality$Month,sd)
> Temp_month_sd
5 6 7 8 9
6.854870 6.598589 4.315513 6.585256 8.355671
> class(Temp_month)
[1] "array"
> mode(Temp_month)
[1] "numeric"
> mean.sd=function(x)c(Mean=mean(x),SD=sd(x))
> Temp_month_mean_sd=tapply(airquality$Temp,airquality$Month,mean.sd)
> Temp_month_sd
$`5`
Mean SD
65.54839 6.85487
$`6`
Mean SD
79.100000 6.598589
$`7`
Mean SD
83.903226 4.315513
$`8`
Mean SD
83.967742 6.585256
$`9`
Mean SD
76.900000 8.355671
[access a component of a list]
As tapply is the function making a list depending on function and vector, we can reach the element of result of tapply function with the same method of reaching elements in list like this.
> list=list(x=rnorm(3,0,1),y=runif(5,5,10))
> list
$x
[1] 0.1008402 -2.1531764 -1.0358789
$y
[1] 7.242432 5.188416 6.961774 8.765925 8.328372
> list[1]
$x
[1] 0.1008402 -2.1531764 -1.0358789
> list[[1]]
[1] 0.1008402 -2.1531764 -1.0358789
> Temp_month_mean_sd[1] $`5` Mean SD 65.54839 6.85487 > Temp_month_mean_sd['5'] $`5` Mean SD 65.54839 6.85487 |
> Temp_month_mean_sd[[1]] Mean SD 65.54839 6.85487 > Temp_month_mean_sd[['5']] Mean SD 65.54839 6.85487 |
<Using more than one factor with a function returning one value>
In the previous example, I was interested in the mean of temperature depending on a level 'month.'
For this case, let's assume that we want to find the mean of temperature for each unique combination of the levels of 'month' and ' solar.'
To make two factors, I will make a new variable which is 'Solar' using if else function.
> summary(airquality$Solar.R)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
7.0 115.8 205.0 185.9 258.8 334.0 7
> airquality$SOLAR<-ifelse(airquality$Solar.R<=115,"weak",ifelse(airquality$Solar.R<=205,"med","strong"))
> str(airquality)
'data.frame': 153 obs. of 7 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
$ SOLAR : chr "med" "med" "med" "strong" ...
After making a new variable 'SOLAR,' we can find the mean of temperature for combination of two factors 'Month' and 'SOLAR.'
Here, means are computed along each column, which represents levels of Month and row which represents the levels of SOLAR.
> Temp_month_mean_2=tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),mean)
> Temp_month_mean_2
med strong weak
5 70.5 68.84615 60.200
6 78.7 81.35714 74.500
7 84.4 84.42857 81.200
8 87.0 85.30769 78.375
9 83.0 73.75000 74.000
> Temp_month_sd_2=tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),sd)
> Temp_month_sd_2
med strong weak
5 3.109126 6.706292 2.973961
6 3.772709 7.850597 5.167204
7 3.435113 4.376887 4.604346
8 6.855655 7.016464 3.292307
9 8.615232 5.658863 7.982123
> mean.sd=function(x)c(Mean=mean(x),SD=sd(x))
> Temp_month_mean_sd_2=tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),mean.sd)
> Temp_month_mean_sd_2
med strong weak
5 Numeric,2 Numeric,2 Numeric,2
6 Numeric,2 Numeric,2 Numeric,2
7 Numeric,2 Numeric,2 Numeric,2
8 Numeric,2 Numeric,2 Numeric,2
9 Numeric,2 Numeric,2 Numeric,2
First time when I look at this result, I think my code was wrong. But right after I realized that I can reach each element from correctly.
You can reach each elements of lists like this.
As the first column first row represents the temperature of May with med solar energy, the way to reach the first columns of first row is the same with the way to reach rows for 'MAY' and columns for 'med'
> Temp_month_mean_sd_2[1,1]
[[1]]
Mean SD
70.500000 3.109126
> Temp_month_mean_sd_2[[1,1]]
Mean SD
70.500000 3.109126
> Temp_month_mean_sd_2["5","med"]
[[1]]
Mean SD
70.500000 3.109126
> Temp_month_mean_sd_2[["5","med"]]
Mean SD
70.500000 3.109126
<tapply function VS aggregate function>
tapply function |
aggregate function |
the apply function needs three main arguments which are x, INDEX, FUN. x : typically vector which we want to apply functions to. INDEX: a list of one or more factors, each of same length as x. fun: a function to be applied |
the aggregate function needs three main arguments which are x, by, FUN. x: vector, matrix, data frame by: a list of grouping elements FUN: function to compute the summary statistics which can be applied to all data subsets. |
> tapply(airquality$Tem,airquality$Month,mean) 5 6 7 8 9 65.54839 79.10000 83.90323 83.96774 76.90000 |
> aggregate(airquality$Temp,list(airquality$Month),mean) Group.1 x 1 5 65.54839 2 6 79.10000 3 7 83.90323 4 8 83.96774 5 9 76.90000 |
Able to have more than one factor. > tapply(airquality$Temp, list(airquality$Month,airquality$SOLAR),mean) med strong weak 5 70.5 68.84615 60.200 6 78.7 81.35714 74.500 7 84.4 84.42857 81.200 8 87.0 85.30769 78.375 9 83.0 73.75000 74.000 |
Able to have more than one value. > aggregate(airquality[,c("Temp","Solar.R","Wind")],list(airquality$Month),mean) Group.1 Temp Solar.R Wind 1 5 65.54839 NA 11.622581 2 6 79.10000 190.1667 10.266667 3 7 83.90323 216.4839 8.94193 4 8 83.96774 NA 8.793548 5 9 76.90000 167.4333 10.180000 |
med strong weak 5 Numeric,2 Numeric,2 Numeric,2 6 Numeric,2 Numeric,2 Numeric,2 7 Numeric,2 Numeric,2 Numeric,2 8 Numeric,2 Numeric,2 Numeric,2 9 Numeric,2 Numeric,2 Numeric,2 |
> aggregate(airquality[,c("Temp","Solar.R","Wind")],list(airquality$SOLAR),mean.sd) Group.1 Temp.Mean Temp.SD Solar.R.Mean Solar.R.SD Wind.Mean Wind.SD 1 med 81.388889 7.545208 164.36111 27.02819 8.775000 2.957448 2 strong 79.465753 8.768761 261.42466 32.14201 10.067123 3.454428 3 weak 72.270270 9.170147 57.97297 32.25978 11.072973 3.811288 |
<Permutation test with tapply function>
> set.seed(3)
> major=c(rep("statistics",35),rep("math",35))
> score=c(runif(35,50,100),runif(35,70,99))
> data<-as.data.frame(cbind(major,score))
> data
major score
1 statistics 58.40208
2 statistics 90.37582
3 statistics 69.24712
4 statistics 66.38672
5 statistics 80.10503
6 statistics 80.21970
7 statistics 56.23167
8 statistics 64.73005
9 statistics 78.88050
10 statistics 81.54896
11 statistics 75.60079
12 statistics 75.25120
13 statistics 76.70177
14 statistics 77.86247
15 statistics 93.39597
16 statistics 91.48543
17 statistics 55.57246
18 statistics 85.18442
19 statistics 94.87441
20 statistics 63.98663
21 statistics 61.41009
22 statistics 50.76649
23 statistics 56.44908
24 statistics 54.66910
25 statistics 61.84425
26 statistics 89.55737
27 statistics 79.98658
28 statistics 95.50739
29 statistics 78.02123
30 statistics 87.78524
31 statistics 68.95859
32 statistics 68.66405
33 statistics 58.51453
34 statistics 72.66537
35 statistics 62.92070
36 math 79.75171
37 math 95.79791
38 math 75.85644
39 math 86.79640
40 math 76.02133
41 math 78.16259
42 math 92.80215
43 math 75.01756
44 math 86.55168
45 math 82.15921
46 math 77.76104
47 math 71.38647
48 math 73.00130
49 math 79.10691
50 math 93.21859
51 math 76.65042
52 math 76.17695
53 math 95.43593
54 math 98.80344
55 math 94.48316
56 math 96.40266
57 math 83.66682
58 math 76.50813
59 math 73.70663
60 math 78.11082
61 math 93.66708
62 math 71.67078
63 math 93.28205
64 math 73.02725
65 math 92.23157
66 math 78.83951
67 math 92.30934
68 math 85.67902
69 math 80.50875
70 math 72.68415
> data$score<-as.numeric(data$score)
> means=tapply(data$score,data$major,mean)
> means
math statistics
83.06388 73.25038
> T.obs=means[2]-means[1] #mean score of statistics - mean score of math
> T.null=numeric()
> for (i in 1:10000){
+ major.label=sample(data$major)
+ means=tapply(data$score,major.label,mean)
+ T.null[i]=means[2]-means[1]
+ }
> T.obs
statistics
-9.8135
> hist(T.null, xlab="Mean difference",ylab="Frequency")
> abline(v=T.obs, col="red")
> p.val=(sum(abs(T.null)>abs(T.obs))+1)/(10000+1)
> p.val
[1] 0.00039996 #small enough --> reject null
'R' 카테고리의 다른 글
(R) ANOVA in Simple Linear Regression / 단순 선형 회귀분석 - 분산분석표 (0) | 2021.01.11 |
---|---|
(R) split function/ sapply--> tapply (0) | 2020.11.17 |
(R) lapply, sapply, mapply/ two-sample t-test using mapply function (0) | 2020.11.17 |
(R) apply function on matrix and data frame (0) | 2020.11.15 |
(R) functions for tables/ table(), ftable(),addmargins(), prop.table(), margin.table() (0) | 2020.11.08 |