(R) tapply function/ comparision to aggregate function /permutation test with tapply function

<tapply function>

the tapply function applies a function to each group of values given by a unique combination of the levels of certain factors.

>?tapply

the apply function needs three main arguments which are x, INDEX, FUN.

x : typically vector which we want to apply functions to.

INDEX: a list of one or more factors, each of same length as x.

fun: a function to be applied

Depending on the number of factors in the INDEX argument, and the number of values returned by the FUN argument, the class and/or the mode of the object that is returned by the tapply function differ. Each case will be explored below.

<Using one factor with a function returning one value>

> str(airquality)

'data.frame': 153 obs. of 7 variables:

$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...

$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...

$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...

$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...

$ Month : int 5 5 5 5 5 5 5 5 5 5 ...

$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

> Temp_month_mean=tapply(airquality$Temp,airquality$Month,mean)

> Temp_month_mean

5 6 7 8 9

65.54839 79.10000 83.90323 83.96774 76.90000

> Temp_month_sd=tapply(airquality$Temp,airquality$Month,sd)

> Temp_month_sd

5 6 7 8 9

6.854870 6.598589 4.315513 6.585256 8.355671

> class(Temp_month)

[1] "array"

> mode(Temp_month)

[1] "numeric"

> mean.sd=function(x)c(Mean=mean(x),SD=sd(x))

> Temp_month_mean_sd=tapply(airquality$Temp,airquality$Month,mean.sd)

> Temp_month_sd

$`5`

Mean SD

65.54839 6.85487

$`6`

Mean SD

79.100000 6.598589

$`7`

Mean SD

83.903226 4.315513

$`8`

Mean SD

83.967742 6.585256

$`9`

Mean SD

76.900000 8.355671

[access a component of a list]

As tapply is the function making a list depending on function and vector, we can reach the element of result of tapply function with the same method of reaching elements in list like this.

> list=list(x=rnorm(3,0,1),y=runif(5,5,10))

> list

[1] 0.1008402 -2.1531764 -1.0358789

[1] 7.242432 5.188416 6.961774 8.765925 8.328372

> list[1]

[1] 0.1008402 -2.1531764 -1.0358789

> list[[1]]

[1] 0.1008402 -2.1531764 -1.0358789

> Temp_month_mean_sd[1]

$`5`

Mean SD

65.54839 6.85487

> Temp_month_mean_sd['5']

$`5`

Mean SD

65.54839 6.85487

> Temp_month_mean_sd[[1]]

Mean SD

65.54839 6.85487

> Temp_month_mean_sd[['5']]

Mean SD

65.54839 6.85487

<Using more than one factor with a function returning one value>

In the previous example, I was interested in the mean of temperature depending on a level 'month.'

For this case, let's assume that we want to find the mean of temperature for each unique combination of the levels of 'month' and ' solar.'

To make two factors, I will make a new variable which is 'Solar' using if else function.

> summary(airquality$Solar.R)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

7.0 115.8 205.0 185.9 258.8 334.0 7

> airquality$SOLAR<-ifelse(airquality$Solar.R<=115,"weak",ifelse(airquality$Solar.R<=205,"med","strong"))

> str(airquality)

'data.frame': 153 obs. of 7 variables:

$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...

$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...

$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...

$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...

$ Month : int 5 5 5 5 5 5 5 5 5 5 ...

$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

$ SOLAR : chr "med" "med" "med" "strong" ...

After making a new variable 'SOLAR,' we can find the mean of temperature for combination of two factors 'Month' and 'SOLAR.'

Here, means are computed along each column, which represents levels of Month and row which represents the levels of SOLAR.

> Temp_month_mean_2=tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),mean)

> Temp_month_mean_2

med strong weak

5 70.5 68.84615 60.200

6 78.7 81.35714 74.500

7 84.4 84.42857 81.200

8 87.0 85.30769 78.375

9 83.0 73.75000 74.000

> Temp_month_sd_2=tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),sd)

> Temp_month_sd_2

med strong weak

5 3.109126 6.706292 2.973961

6 3.772709 7.850597 5.167204

7 3.435113 4.376887 4.604346

8 6.855655 7.016464 3.292307

9 8.615232 5.658863 7.982123

> mean.sd=function(x)c(Mean=mean(x),SD=sd(x))

> Temp_month_mean_sd_2=tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),mean.sd)

> Temp_month_mean_sd_2

med strong weak

5 Numeric,2 Numeric,2 Numeric,2

6 Numeric,2 Numeric,2 Numeric,2

7 Numeric,2 Numeric,2 Numeric,2

8 Numeric,2 Numeric,2 Numeric,2

9 Numeric,2 Numeric,2 Numeric,2

First time when I look at this result, I think my code was wrong. But right after I realized that I can reach each element from correctly.

You can reach each elements of lists like this.

As the first column first row represents the temperature of May with med solar energy, the way to reach the first columns of first row is the same with the way to reach rows for 'MAY' and columns for 'med'

> Temp_month_mean_sd_2[1,1]

[[1]]

Mean SD

70.500000 3.109126

> Temp_month_mean_sd_2[[1,1]]

Mean SD

70.500000 3.109126

> Temp_month_mean_sd_2["5","med"]

[[1]]

Mean SD

70.500000 3.109126

> Temp_month_mean_sd_2[["5","med"]]

Mean SD

70.500000 3.109126

<tapply function VS aggregate function>

tapply function

aggregate function

the apply function needs three main arguments which are x, INDEX, FUN.

x : typically vector which we want to apply functions to.

INDEX: a list of one or more factors, each of same length as x.

fun: a function to be applied

the aggregate function needs three main arguments which are x, by, FUN.

x: vector, matrix, data frame

by: a list of grouping elements

FUN: function to compute the summary statistics which can be applied to all data subsets.

> tapply(airquality$Tem,airquality$Month,mean)

5 6 7 8 9

65.54839 79.10000 83.90323 83.96774 76.90000

> aggregate(airquality$Temp,list(airquality$Month),mean)

Group.1 x

1 5 65.54839

2 6 79.10000

3 7 83.90323

4 8 83.96774

5 9 76.90000

Able to have more than one factor.

> tapply(airquality$Temp, list(airquality$Month,airquality$SOLAR),mean)

med strong weak

5 70.5 68.84615 60.200

6 78.7 81.35714 74.500

7 84.4 84.42857 81.200

8 87.0 85.30769 78.375

9 83.0 73.75000 74.000

Able to have more than one value.

> aggregate(airquality[,c("Temp","Solar.R","Wind")],list(airquality$Month),mean)

Group.1 Temp Solar.R Wind

1 5 65.54839 NA 11.622581

2 6 79.10000 190.1667 10.266667

3 7 83.90323 216.4839 8.94193

4 8 83.96774 NA 8.793548

5 9 76.90000 167.4333 10.180000

med strong weak

5 Numeric,2 Numeric,2 Numeric,2

6 Numeric,2 Numeric,2 Numeric,2

7 Numeric,2 Numeric,2 Numeric,2

8 Numeric,2 Numeric,2 Numeric,2

9 Numeric,2 Numeric,2 Numeric,2

> aggregate(airquality[,c("Temp","Solar.R","Wind")],list(airquality$SOLAR),mean.sd)

Group.1 Temp.Mean Temp.SD Solar.R.Mean Solar.R.SD Wind.Mean Wind.SD

1 med 81.388889 7.545208 164.36111 27.02819 8.775000 2.957448

2 strong 79.465753 8.768761 261.42466 32.14201 10.067123 3.454428

3 weak 72.270270 9.170147 57.97297 32.25978 11.072973 3.811288

<Permutation test with tapply function>

> set.seed(3)

> major=c(rep("statistics",35),rep("math",35))

> score=c(runif(35,50,100),runif(35,70,99))

> data<-as.data.frame(cbind(major,score))

> data

major score

1 statistics 58.40208

2 statistics 90.37582

3 statistics 69.24712

4 statistics 66.38672

5 statistics 80.10503

6 statistics 80.21970

7 statistics 56.23167

8 statistics 64.73005

9 statistics 78.88050

10 statistics 81.54896

11 statistics 75.60079

12 statistics 75.25120

13 statistics 76.70177

14 statistics 77.86247

15 statistics 93.39597

16 statistics 91.48543

17 statistics 55.57246

18 statistics 85.18442

19 statistics 94.87441

20 statistics 63.98663

21 statistics 61.41009

22 statistics 50.76649

23 statistics 56.44908

24 statistics 54.66910

25 statistics 61.84425

26 statistics 89.55737

27 statistics 79.98658

28 statistics 95.50739

29 statistics 78.02123

30 statistics 87.78524

31 statistics 68.95859

32 statistics 68.66405

33 statistics 58.51453

34 statistics 72.66537

35 statistics 62.92070

36 math 79.75171

37 math 95.79791

38 math 75.85644

39 math 86.79640

40 math 76.02133

41 math 78.16259

42 math 92.80215

43 math 75.01756

44 math 86.55168

45 math 82.15921

46 math 77.76104

47 math 71.38647

48 math 73.00130

49 math 79.10691

50 math 93.21859

51 math 76.65042

52 math 76.17695

53 math 95.43593

54 math 98.80344

55 math 94.48316

56 math 96.40266

57 math 83.66682

58 math 76.50813

59 math 73.70663

60 math 78.11082

61 math 93.66708

62 math 71.67078

63 math 93.28205

64 math 73.02725

65 math 92.23157

66 math 78.83951

67 math 92.30934

68 math 85.67902

69 math 80.50875

70 math 72.68415

> data$score<-as.numeric(data$score)

> means=tapply(data$score,data$major,mean)

> means

math statistics

83.06388 73.25038

> T.obs=means[2]-means[1] #mean score of statistics - mean score of math

> T.null=numeric()

> for (i in 1:10000){

+ major.label=sample(data$major)

+ means=tapply(data$score,major.label,mean)

+ T.null[i]=means[2]-means[1]

+ }

> T.obs

statistics

-9.8135

> hist(T.null, xlab="Mean difference",ylab="Frequency")

> abline(v=T.obs, col="red")

> p.val=(sum(abs(T.null)>abs(T.obs))+1)/(10000+1)

> p.val

[1] 0.00039996 #small enough --> reject null

저작자표시 (새창열림)

'R' 카테고리의 다른 글

(R) ANOVA in Simple Linear Regression / 단순 선형 회귀분석 - 분산분석표 (0)	2021.01.11
(R) split function/ sapply--> tapply (0)	2020.11.17
(R) lapply, sapply, mapply/ two-sample t-test using mapply function (0)	2020.11.17
(R) apply function on matrix and data frame (0)	2020.11.15
(R) functions for tables/ table(), ftable(),addmargins(), prop.table(), margin.table() (0)	2020.11.08

Jangpiano Science

(R) tapply function/ comparision to aggregate function /permutation test with tapply function

'R' 카테고리의 다른 글

티스토리툴바

(R) tapply function/ comparision to aggregate function /permutation test with tapply function

'R' 카테고리의 다른 글

관련글

티스토리툴바