(R) split function/ sapply--> tapply

by jangpiano 2020. 11. 17.

<split>

The split function divides the data in x into the groups defined by the factor f.

The split function has two main arguments.

x: a vector or a data frame 

f: a factor or a list of factors 


> ?split
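
To make the two arguments concrete, here is a minimal sketch on toy data. The names scores and group are made up for illustration and are not part of the examples in this post. split returns a list with one element per level of f.

```r
# Hypothetical toy data: six values and a grouping factor of the same length
scores <- c(10, 20, 30, 40, 50, 60)
group  <- factor(c("a", "b", "a", "b", "a", "b"))

# split(x, f): x is the data, f defines the groups
by_group <- split(scores, group)

by_group$a  # values whose group is "a"
by_group$b  # values whose group is "b"
```

Each element of the resulting list can then be passed to a function one group at a time, which is where sapply comes in.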


<EXAMPLE 1>

> list_1=list(x=rnorm(5,0,1),y=runif(3,5,10))

> list_1

$x

[1] -0.1205420  0.6790684 -1.0631887  0.7876966  0.3130599


$y

[1] 8.521317 5.693054 5.061110


> sapply(list_1,mean)

        x         y 

0.1192188 6.4251602 


<EXAMPLE 2>

> str(airquality)

'data.frame': 153 obs. of  7 variables:

 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...

 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...

 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...

 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...

 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...

 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

 $ SOLAR  : chr  "med" "med" "med" "strong" ...


> split(airquality$Temp,airquality$Month)

$`5`

 [1] 67 72 74 62 56 66 65 59 61 69 74 69 66 68 58 64 66 57 68 62 59 73 61 61 57

[26] 58 57 67 81 79 76


$`6`

 [1] 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79 77 72 65 73 76 77 76 76 76 75

[26] 78 73 80 77 83


$`7`

 [1] 84 85 81 84 83 83 88 92 92 89 82 73 81 91 80 81 82 84 87 85 74 81 82 86 85

[26] 82 86 88 86 83 81


$`8`

 [1] 81 81 82 86 85 87 89 90 90 92 86 86 82 80 79 77 79 76 78 78 77 72 75 79 81

[26] 86 88 97 94 96 94


$`9`

 [1] 91 92 93 93 87 84 80 78 75 73 81 76 77 71 71 78 67 76 68 82 64 71 81 69 63

[26] 70 77 75 76 68



<sapply vs tapply>

sapply: two main arguments (X, FUN)

tapply: three main arguments (X, INDEX, FUN)

 > sapply(airquality,mean)

    Ozone   Solar.R      Wind      Temp     Month       Day 

       NA        NA  9.957516 77.882353  6.993464 15.803922 

> tapply(airquality$Temp,airquality$Month,mean)

       5        6        7        8        9 

65.54839 79.10000 83.90323 83.96774 76.90000  
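
The NA results for Ozone and Solar.R in the sapply output come from missing observations in those columns. One possible way around this, assuming it is acceptable to simply drop the missing values, is to pass na.rm = TRUE through sapply to mean. The names numeric_cols and col_means are only illustrative.

```r
# Keep the six numeric columns of the built-in airquality data set
numeric_cols <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month", "Day")]

# Extra arguments after FUN are passed on to FUN, here mean(x, na.rm = TRUE)
col_means <- sapply(numeric_cols, mean, na.rm = TRUE)
col_means
```

The same trick works with tapply, since it also forwards extra arguments to the function.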

 

By combining split with sapply, we can apply a function to each group of values given by a unique combination of the levels of certain factors, which is exactly what the tapply function does in a single call.


[one factor, one function]

split + sapply vs. tapply

> airquality_2=split(airquality$Temp,airquality$Month) 

> airquality_2

$`5`

 [1] 67 72 74 62 56 66 65 59 61 69 74 69 66 68 58 64 66 57 68 62 59 73 61 61

[25] 57 58 57 67 81 79 76


$`6`

 [1] 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79 77 72 65 73 76 77 76 76 76

[25] 75 78 73 80 77 83


$`7`

 [1] 84 85 81 84 83 83 88 92 92 89 82 73 81 91 80 81 82 84 87 85 74 81 82 86

[25] 85 82 86 88 86 83 81


$`8`

 [1] 81 81 82 86 85 87 89 90 90 92 86 86 82 80 79 77 79 76 78 78 77 72 75 79

[25] 81 86 88 97 94 96 94


$`9`

 [1] 91 92 93 93 87 84 80 78 75 73 81 76 77 71 71 78 67 76 68 82 64 71 81 69

[25] 63 70 77 75 76 68


> Temp_month_mean_2=sapply(airquality_2,mean)

> Temp_month_mean_2   

       5        6        7        8        9 

65.54839 79.10000 83.90323 83.96774 76.90000 

> tapply(airquality$Temp,airquality$Month,mean)

       5        6        7        8        9 

65.54839 79.10000 83.90323 83.96774 76.90000  
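
The two printed results look the same, and this can also be checked in code. The sketch below is one way to do it; the variable names are illustrative. The values and the group labels are compared separately because sapply returns a named vector while tapply returns a one-dimensional array, so the attributes differ.

```r
temp  <- airquality$Temp
month <- airquality$Month

via_split  <- sapply(split(temp, month), mean)  # named numeric vector
via_tapply <- tapply(temp, month, mean)         # 1-d array with dimnames

# Compare the numbers and the labels separately, ignoring other attributes
same_values <- isTRUE(all.equal(as.vector(via_tapply), unname(via_split)))
same_labels <- identical(names(via_tapply), names(via_split))
same_values && same_labels
```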


[one factor, more than one function]

split + sapply vs. tapply

> mean.sd=function(x)c(Mean=mean(x),SD=sd(x))


> airquality_2=split(airquality$Temp,airquality$Month)


> sapply(airquality_2,mean.sd)

            5         6         7         8         9

Mean 65.54839 79.100000 83.903226 83.967742 76.900000

SD    6.85487  6.598589  4.315513  6.585256  8.355671


> sapply(airquality_2,mean.sd,simplify=F)

$`5`

    Mean       SD 

65.54839  6.85487 


$`6`

     Mean        SD 

79.100000  6.598589 


$`7`

     Mean        SD 

83.903226  4.315513 


$`8`

     Mean        SD 

83.967742  6.585256 


$`9`

     Mean        SD 

76.900000  8.355671 

> tapply(airquality$Temp,airquality$Month,mean.sd)

$`5`

    Mean       SD 

65.54839  6.85487 


$`6`

     Mean        SD 

79.100000  6.598589 


$`7`

     Mean        SD 

83.903226  4.315513 


$`8`

     Mean        SD 

83.967742  6.585256 


$`9`

     Mean        SD 

76.900000  8.355671  
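
As shown above, sapply simplifies the result to a matrix while tapply keeps a list. If a matrix is what we want, vapply is a stricter alternative to sapply that is worth knowing. The sketch below is not from the original post; it assumes every group should yield exactly two numbers, and the name stats_by_month is illustrative.

```r
mean.sd <- function(x) c(Mean = mean(x), SD = sd(x))
temp_by_month <- split(airquality$Temp, airquality$Month)

# FUN.VALUE declares the expected shape of each result: two named numbers.
# vapply stops with an error if any group returns something else.
stats_by_month <- vapply(temp_by_month, mean.sd, c(Mean = 0, SD = 0))

stats_by_month["Mean", "5"]  # mean temperature for month 5
t(stats_by_month)            # one row per month, if that layout is preferred
```

The design choice here is safety: sapply silently falls back to a list when results have different lengths, whereas vapply fails loudly.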


[More than one factor, one function]


> airquality$SOLAR<-ifelse(airquality$Solar.R<=115,"weak",ifelse(airquality$Solar.R<=205,"med","strong"))

> table(airquality$SOLAR)


   med strong   weak 

    36     73     37 

sapply vs. tapply

> airquality_3=split(airquality$Temp,list(airquality$Month,airquality$SOLAR))


> airquality_3

$`5.med`

[1] 67 72 74 69


$`6.med`

 [1] 84 82 82 77 73 76 77 75 78 83


$`7.med`

[1] 83 89 82 81 87


$`8.med`

[1] 86 86 80 78 88 97 94


$`9.med`

 [1] 91 92 93 93 82 81 70 77 75 76


$`5.strong`

 [1] 62 65 69 66 68 64 66 68 73 58 81 79 76


$`6.strong`

 [1] 78 74 67 85 79 87 90 87 93 92 80 79 72 76


$`7.strong`

 [1] 84 85 81 83 88 92 92 73 91 81 82 84 85 81 82 86 85 88 86 83 81


$`8.strong`

 [1] 89 90 90 92 82 78 77 75 79 81 86 94 96


$`9.strong`

 [1] 80 78 75 73 81 76 77 78 67 68 64 68


$`5.weak`

 [1] 59 61 58 57 62 59 61 61 57 67


$`6.weak`

[1] 65 76 76 73 80 77


$`7.weak`

[1] 84 80 74 82 86


$`8.weak`

[1] 81 81 82 79 77 79 76 72


$`9.weak`

[1] 87 84 71 71 76 71 69 63
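
The split above only builds the groups. To finish the split + sapply side of the comparison, a function still has to be applied to each group. A minimal sketch, working on a copy named aq (an illustrative name) so the original data frame is left alone:

```r
aq <- airquality  # work on a copy of the built-in data set
aq$SOLAR <- ifelse(aq$Solar.R <= 115, "weak",
                   ifelse(aq$Solar.R <= 205, "med", "strong"))

# One group per Month/SOLAR combination; rows with a missing SOLAR are dropped
groups <- split(aq$Temp, list(aq$Month, aq$SOLAR))

# Names of the result look like "5.med", "6.med", ...
group_means <- sapply(groups, mean)
group_means["5.med"]
```

The result is a flat named vector, unlike tapply with two factors, which arranges the groups in a Month by SOLAR table.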




 > tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),mean.sd)

  med       strong    weak     

5 Numeric,2 Numeric,2 Numeric,2

6 Numeric,2 Numeric,2 Numeric,2

7 Numeric,2 Numeric,2 Numeric,2

8 Numeric,2 Numeric,2 Numeric,2

9 Numeric,2 Numeric,2 Numeric,2
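
The call above uses mean.sd, which returns two values per cell, so every cell prints as "Numeric,2". With a single function such as mean, tapply should instead return an ordinary numeric matrix. A minimal sketch, again on an illustrative copy named aq:

```r
aq <- airquality  # work on a copy of the built-in data set
aq$SOLAR <- ifelse(aq$Solar.R <= 115, "weak",
                   ifelse(aq$Solar.R <= 205, "med", "strong"))

# Rows are months, columns are SOLAR levels, each cell is one mean
mean_table <- tapply(aq$Temp, list(aq$Month, aq$SOLAR), mean)
mean_table["5", "med"]  # mean temperature for month 5 with "med" solar radiation
```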


[More than one factor, more than one function]

sapply vs. tapply


 > mean.sd=function(x)c(Mean=mean(x),SD=sd(x))

> tapply(airquality$Temp,list(airquality$Month,airquality$SOLAR),mean.sd)

  med       strong    weak     

5 Numeric,2 Numeric,2 Numeric,2

6 Numeric,2 Numeric,2 Numeric,2

7 Numeric,2 Numeric,2 Numeric,2

8 Numeric,2 Numeric,2 Numeric,2

9 Numeric,2 Numeric,2 Numeric,2
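
Each "Numeric,2" cell is a numeric vector of length 2 (Mean and SD) stored inside a list with matrix dimensions. The sketch below shows one way to get the numbers back out. The names aq, stats, and mean_only are illustrative, and rebuilding a plain matrix with vapply is just one option.

```r
mean.sd <- function(x) c(Mean = mean(x), SD = sd(x))

aq <- airquality  # work on a copy of the built-in data set
aq$SOLAR <- ifelse(aq$Solar.R <= 115, "weak",
                   ifelse(aq$Solar.R <= 205, "med", "strong"))

stats <- tapply(aq$Temp, list(aq$Month, aq$SOLAR), mean.sd)

# Double brackets pull a single cell out of the list matrix
stats[["5", "med"]]            # named vector with Mean and SD
stats[["5", "med"]][["Mean"]]  # just the mean for that cell

# Collect the Mean of every cell into an ordinary numeric matrix
mean_only <- matrix(vapply(stats, function(cell) cell[["Mean"]], numeric(1)),
                    nrow = nrow(stats), dimnames = dimnames(stats))
mean_only["5", "med"]
```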

