
(R) apply function on matrix and data frame

by jangpiano 2020. 11. 15.


<Some functions provided by R for simple computation - colSums, rowSums, colMeans, rowMeans>


R provides built-in functions that compute sums and means over the rows or columns of a matrix: colSums, rowSums, colMeans, and rowMeans.


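> x = matrix(runif(20,min=2,max=35), 3)   #20 values do not fill 3 rows evenly, so R builds a 3x7 matrix, reuses the first value for the last cell, and gives a warning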


> x

         [,1]     [,2]      [,3]      [,4]      [,5]      [,6]

[1,] 25.49332 29.93656  8.329258  9.427876  6.314251 11.469574

[2,] 32.90328 25.86397 25.692699 23.865729 32.419499 33.890851

[3,] 17.61071 20.22315 33.823737 18.290469  4.744376  6.378427

          [,7]

[1,] 22.232995

[2,]  2.672031

[3,] 25.493322


> rowSums(x) #sum of each row. x has 3 rows, so rowSums(x) returns 3 numbers.

[1] 113.2038 177.3081 126.5642

> rowMeans(x) #mean of each row. x has 3 rows, so rowMeans(x) returns 3 numbers.

[1] 16.17198 25.32972 18.08060

> colSums(x) #sum of each column. x has 7 columns, so colSums(x) returns 7 numbers.

[1] 76.00732 76.02368 67.84569 51.58407 43.47813 51.73885 50.39835

> colMeans(x) #mean of each column. x has 7 columns, so colMeans(x) returns 7 numbers.

[1] 25.33577 25.34123 22.61523 17.19469 14.49271 17.24628 16.79945
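Because x above is random, its numbers change on every run. As a small sketch on a fixed matrix (the name m is only for illustration), the four functions can be checked by hand:

```r
# Deterministic 3 x 4 matrix, filled column by column:
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12
m <- matrix(1:12, nrow = 3)

row_totals <- rowSums(m)    # one value per row
col_totals <- colSums(m)    # one value per column
row_avgs   <- rowMeans(m)   # one value per row
col_avgs   <- colMeans(m)   # one value per column

print(row_totals)  # 22 26 30
print(col_totals)  # 6 15 24 33
print(row_avgs)    # 5.5 6.5 7.5
print(col_avgs)    # 2 5 8 11
```

The length of each result matches the number of rows or columns of m.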



<apply>


The apply function returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix.

For a matrix, the margins are its rows and columns.

apply takes three main arguments: X, MARGIN, and FUN. X is the matrix or data frame (a data frame is converted to a matrix first). MARGIN is a vector giving the subscripts the function is applied over; you will usually set it to 1 or 2. With MARGIN = 1 the function is applied to each row, and with MARGIN = 2 it is applied to each column. With MARGIN = c(1,2) the function is applied over both rows and columns, that is, to each individual element.

The last argument, FUN, is the function to be applied. You can write the function yourself or use R's built-in functions such as mean, sd, max, and min.
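A minimal sketch of the three MARGIN settings on a small fixed matrix (m and the result names are made up for this illustration):

```r
# 2 x 3 matrix:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

by_row  <- apply(m, 1, sum)                     # MARGIN = 1: one result per row
by_col  <- apply(m, 2, sum)                     # MARGIN = 2: one result per column
by_cell <- apply(m, c(1, 2), function(v) v^2)   # MARGIN = c(1, 2): FUN sees one element at a time

print(by_row)    # 9 12
print(by_col)    # 3 7 11
print(by_cell)   # 2 x 3 matrix holding the square of every element
```

With MARGIN = c(1, 2) the result keeps the shape of m, because the function is called once per cell.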


?apply


> apply(x,1,mean)                          #same result as rowMeans(x)

[1] 16.17198 25.32972 18.08060

> apply(x,2,mean)                         #same result as colMeans(x)

[1] 25.33577 25.34123 22.61523 17.19469 14.49271 17.24628 16.79945

> apply(x,1,sd)                               #standard deviation of each row

[1]  9.481259 10.776890 10.185332

> apply(x,2,sd)                               #standard deviation of each column

[1]  7.647502  4.877755 13.022873  7.281033 15.544886 14.637662

[7] 12.342828


> apply(x[2:3,],1,sd)                      #standard deviation of rows 2 and 3 only

[1] 10.77689 10.18533


> apply(x[,3:6],2,sd)                     #standard deviation of columns 3 to 6 only

[1] 13.022873  7.281033 15.544886 14.637662
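The functions used so far return a single number per row or column. As a hedged aside, when FUN returns several values (range is used here only as an example), apply stacks each result as a column, so a row-wise call comes back transposed:

```r
# 3 x 4 matrix, filled column by column with 1..12
m <- matrix(1:12, nrow = 3)

# range() returns c(min, max), so each row of m yields two values
row_ranges <- apply(m, 1, range)

print(dim(row_ranges))   # 2 3: one column per row of m
print(row_ranges)        # first column is c(1, 10), the min and max of row 1

# t() puts the result back in "one row per row of m" orientation
print(t(row_ranges))
```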



[How to ignore missing values when using the apply function]


> x[3,4]=NA                                      #NA marks a missing value. This sets the element in the 3rd row and 4th column to NA.

> x

         [,1]     [,2]      [,3]      [,4]      [,5]      [,6]

[1,] 25.49332 29.93656  8.329258  9.427876  6.314251 11.469574

[2,] 32.90328 25.86397 25.692699 23.865729 32.419499 33.890851

[3,] 17.61071 20.22315 33.823737        NA  4.744376  6.378427

          [,7]

[1,] 22.232995

[2,]  2.672031

[3,] 25.493322


 #Because x now has an NA in the 3rd row and 4th column, any result computed from that row or column is also NA.


> apply(x,1,mean)              

[1] 16.17198 25.32972  NA


> apply(x,1,mean,na.rm=TRUE)       #Arguments after FUN are passed on to the function, so na.rm=TRUE makes mean ignore NA

[1] 16.17198 25.32972 18.04562
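The same behaviour can be reproduced on a tiny fixed matrix. This is only a sketch with made-up values; it also shows that rowMeans accepts na.rm directly:

```r
# Rows are (1, 3, NA) and (2, 4, 6)
m <- matrix(c(1, 2, 3, 4, NA, 6), nrow = 2)

with_na    <- apply(m, 1, mean)                  # NA propagates into row 1
without_na <- apply(m, 1, mean, na.rm = TRUE)    # na.rm is forwarded to mean()
built_in   <- rowMeans(m, na.rm = TRUE)          # built-in equivalent

print(with_na)      # NA 4
print(without_na)   # 2 4
print(built_in)     # 2 4
```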


[Comparing the apply function to a for loop]

Comparing apply with an equivalent for loop shows how much more convenient apply is.


> z=numeric()

> for (i in 1:ncol(x)){

+   z[i]=mean(x[,i],na.rm=TRUE)

+ }

> z

[1] 25.33577 25.34123 22.61523 16.64680 14.49271 17.24628 16.79945

> apply(x,2,mean,na.rm=TRUE)

[1] 25.33577 25.34123 22.61523 16.64680 14.49271 17.24628 16.79945
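A self-contained sketch of the same comparison on fixed data. The loop here preallocates its result and uses seq_len, which is a common style choice rather than something the example above requires:

```r
# Columns are (1, 2), (3, 4) and (NA, 6)
m <- matrix(c(1, 2, 3, 4, NA, 6), nrow = 2)

# Loop version: preallocate the result, then fill one column at a time
loop_means <- numeric(ncol(m))
for (i in seq_len(ncol(m))) {
  loop_means[i] <- mean(m[, i], na.rm = TRUE)
}

# apply version: one line, no index bookkeeping
apply_means <- apply(m, 2, mean, na.rm = TRUE)

print(loop_means)                           # 1.5 3.5 6
print(all.equal(loop_means, apply_means))   # TRUE
```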


[expand.grid ---> apply with a custom function]

> a=c(2,9,1,1:3)

> b=c(1,2,3)

> ab=expand.grid(a,b)

> ab

   Var1 Var2

1     2    1

2     9    1

3     1    1

4     1    1

5     2    1

6     3    1

7     2    2

8     9    2

9     1    2

10    1    2

11    2    2

12    3    2

13    2    3

14    9    3

15    1    3

16    1    3

17    2    3

18    3    3


> class(a)

[1] "numeric"

> class(b)

[1] "numeric"

> class(ab) 

[1] "data.frame"         #expand.grid combines the separate vectors into one data frame, with a row for every combination of their values.


> apply(ab,1,function(x)(mean(x)^5+1))            #define our own function, mean(x)^5+1, and apply it to each row.

 [1]    8.59375 3126.00000    2.00000    2.00000    8.59375

 [6]   33.00000   33.00000 5033.84375    8.59375    8.59375

[11]   33.00000   98.65625   98.65625 7777.00000   33.00000

[16]   33.00000   98.65625  244.00000
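Row-wise apply works on ab because every column is numeric. apply converts a data frame to a matrix first, so a data frame that mixes text and numbers is seen as text. A small sketch with a made-up data frame df:

```r
# Hypothetical data frame with one text column and one numeric column
df <- data.frame(id = c("a", "b"), score = c(1.5, 2.5))

# Each row arrives in FUN as a character vector, because as.matrix(df) is character
row_classes <- apply(df, 1, class)
print(row_classes)   # "character" "character"

# Selecting only the numeric columns keeps the values numeric
num_only <- df[, "score", drop = FALSE]
doubled  <- apply(num_only, 1, function(r) r * 2)
print(doubled)       # 3 5
```

For mixed data frames it is usually safer to pass only the numeric columns to apply, as the [Application] example does with d[,c(2,3)].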


[Application]

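> d<-data.frame(class=c("A",rep("B",5),rep("A",3),"B","B","A",rep("A",3)),math=rnorm(15,70,3),english=runif(15,70,100))   #15 students: class label, math score drawn from a normal distribution (mean 70, sd 3), english score drawn uniformly between 70 and 100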

> d

   class     math  english

1      A 70.62153 96.27295

2      B 71.10101 91.65343

3      B 64.83370 75.35453

4      B 72.69631 84.40178

5      B 69.23331 85.45932

6      B 71.44392 99.12645

7      A 69.16528 84.24816

8      A 72.13993 77.57584

9      A 70.11286 72.15659

10     B 68.79835 89.70399

11     B 66.58327 72.88395

12     A 74.49583 81.67268

13     A 65.05467 87.56123

14     A 70.58212 89.32241

15     A 68.00336 90.64574

> A.mean=apply(d[,c(2,3)],2,function(x)mean(x[which(d$class=="A")]))   #mean math and english score of class A

> A.mean

    math  english 

70.02195 84.93195 

> B.mean=apply(d[,c(2,3)],2,function(x)mean(x[which(d$class=="B")]))   #mean math and english score of class B

> B.mean

    math  english 

69.24141 85.51192 
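As an alternative sketch, both group means can be computed in one call by combining sapply with tapply. The data frame scores and its values are made up so the result can be checked by hand:

```r
# Hypothetical scores for four students in two classes
scores <- data.frame(
  class   = c("A", "B", "A", "B"),
  math    = c(70, 60, 80, 64),
  english = c(90, 85, 70, 95)
)

# For each subject column, tapply() computes the mean within each class
group_means <- sapply(scores[, c("math", "english")],
                      function(x) tapply(x, scores$class, mean))

print(group_means)
#   math english
# A   75      80
# B   62      90
```

The rows of the result are the classes and the columns are the subjects.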

> T_test_on_means=apply(d[,c(2,3)],2,function(x)t.test(x~d$class))   #Welch two-sample t-test comparing class A and class B, one test per subject

> T_test_on_means

$math


Welch Two Sample t-test


data:  x by d$class

t = 0.53943, df = 12.733, p-value = 0.5989

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

 -2.352102  3.913176

sample estimates:

mean in group A mean in group B 

       70.02195        69.24141 



$english


Welch Two Sample t-test


data:  x by d$class

t = -0.13154, df = 11.835, p-value = 0.8976

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

 -10.201083   9.041143

sample estimates:

mean in group A mean in group B 

       84.93195        85.51192 



> lower.bound=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)$conf.int[1])   #lower bound of the 95% confidence interval for the difference in means (A minus B)

> lower.bound

      math    english 

 -2.352102 -10.201083 

> upper.bound=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)$conf.int[2])   #upper bound of the same confidence interval

> upper.bound

    math  english 

3.913176 9.041143 

> p.value=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)$p.value)   #p-value of each test

> p.value

     math   english 

0.5988940 0.8975542
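The calls above run t.test three separate times per subject. One possible variation, sketched here under assumptions, runs each test once and collects the three numbers together. The helper summarise_test, the data frame scores, and the seed are hypothetical and not part of the example above:

```r
set.seed(1)   # arbitrary seed, only so this sketch is reproducible

# Hypothetical data with the same layout as d: 8 students in A, 7 in B
scores <- data.frame(
  class   = rep(c("A", "B"), times = c(8, 7)),
  math    = rnorm(15, 70, 3),
  english = runif(15, 70, 100)
)

# Run the Welch t-test once and return the values of interest
summarise_test <- function(x, group) {
  tt <- t.test(x ~ group)
  c(lower.bound = tt$conf.int[1],
    upper.bound = tt$conf.int[2],
    p.value     = tt$p.value)
}

# sapply() walks over the two score columns; group is forwarded to the helper
results <- sapply(scores[, c("math", "english")], summarise_test,
                  group = scores$class)

print(results)   # 3 x 2 matrix: rows are the statistics, columns are the subjects
```

sapply is used instead of apply here because the input is a data frame of columns, so no conversion to a matrix is needed.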


