(R) apply function on matrix and data frame

<Functions provided by R for simple computation - colSums, rowSums, colMeans, rowMeans>

R provides functions that calculate sums and means over the rows or columns of a matrix: colSums, rowSums, colMeans, and rowMeans.

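> x = matrix(runif(20,min=2,max=35), 3) #20 values do not fill 3 rows evenly, so R gives a warning and reuses the first value to fill the last cell [3,7]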

> x

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 25.49332 29.93656 8.329258 9.427876 6.314251 11.469574

[2,] 32.90328 25.86397 25.692699 23.865729 32.419499 33.890851

[3,] 17.61071 20.22315 33.823737 18.290469 4.744376 6.378427

[,7]

[1,] 22.232995

[2,] 2.672031

[3,] 25.493322

> rowSums(x) #sum of each row. x has 3 rows, so rowSums(x) prints 3 numbers.

[1] 113.2038 177.3081 126.5642

> rowMeans(x) #mean of each row. x has 3 rows, so rowMeans(x) prints 3 numbers.

[1] 16.17198 25.32972 18.08060

> colSums(x) #sum of each column. x has 7 columns, so colSums(x) prints 7 numbers.

[1] 76.00732 76.02368 67.84569 51.58407 43.47813 51.73885 50.39835

> colMeans(x) #mean of each column. x has 7 columns, so colMeans(x) prints 7 numbers.

[1] 25.33577 25.34123 22.61523 17.19469 14.49271 17.24628 16.79945
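Because x above is filled with random numbers, its results change on every run. As a small sketch that can be checked by hand, the same four functions are applied below to a fixed matrix. The name m and its values are made up for this illustration only.

```r
# m is a hypothetical 2 x 3 matrix, filled column by column with 1..6
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

row_sums  <- rowSums(m)   # one value per row:    9 12
row_means <- rowMeans(m)  # one value per row:    3 4
col_sums  <- colSums(m)   # one value per column: 3 7 11
col_means <- colMeans(m)  # one value per column: 1.5 3.5 5.5

print(row_sums)
print(row_means)
print(col_sums)
print(col_means)
```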

<apply>

The apply function returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix.

Here, the margins usually mean rows and columns.

apply needs three arguments. X is a matrix or data frame. MARGIN is a vector giving the subscripts that the function will be applied over; for a matrix or data frame you usually set MARGIN to 1 or 2. Setting MARGIN to 1 applies the function to each row, and setting it to 2 applies the function to each column. Setting MARGIN to c(1,2) applies the function over both rows and columns, that is, to every single element.

The last argument, FUN, is the function to be applied. You can write the function yourself, or use one of R's built-in functions such as mean, sd, max, and min.
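To make the three MARGIN settings concrete, here is a minimal sketch on a small fixed matrix. The names m, by_row, by_col and by_cell are invented for this illustration.

```r
# m is a hypothetical 2 x 3 matrix used only for this illustration
m <- matrix(1:6, nrow = 2)

by_row <- apply(m, 1, max)   # MARGIN = 1: one result per row    -> 5 6
by_col <- apply(m, 2, max)   # MARGIN = 2: one result per column -> 2 4 6

# MARGIN = c(1, 2): FUN is called once for every single cell,
# so the result keeps the 2 x 3 shape of m
by_cell <- apply(m, c(1, 2), function(v) v * 10)

print(by_row)
print(by_col)
print(by_cell)
```

For an element-wise calculation like the last one, plain arithmetic (m * 10) gives the same result more directly; MARGIN = c(1,2) is mostly useful when the function cannot be written as simple vectorized arithmetic.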

> ?apply #opens the help page for apply

> apply(x,1,mean) #same result as rowMeans(x)

[1] 16.17198 25.32972 18.08060

> apply(x,2,mean) #same result as colMeans(x)

[1] 25.33577 25.34123 22.61523 17.19469 14.49271 17.24628 16.79945

> apply(x,1,sd) #standard deviations for each row

[1] 9.481259 10.776890 10.185332

> apply(x,2,sd) #standard deviation for each column

[1] 7.647502 4.877755 13.022873 7.281033 15.544886 14.637662

[7] 12.342828

> apply(x[2:3,],1,sd) #standard deviations of the 2nd and 3rd rows

[1] 10.77689 10.18533

> apply(x[,3:6],2,sd) #standard deviations of the 3rd to 6th columns

[1] 13.022873 7.281033 15.544886 14.637662
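One caveat when subsetting before apply, offered as a side note rather than something the examples above depend on: selecting a single row or column with [ ] drops the result to a plain vector, and apply then stops because a vector has no dimensions. Passing drop=FALSE keeps the matrix shape. The matrix m and the other names below are made up for this sketch.

```r
# m is a hypothetical 2 x 3 matrix used only for this illustration
m <- matrix(1:6, nrow = 2)

one_row_vec <- m[2, ]                 # plain vector: dim() is NULL
one_row_mat <- m[2, , drop = FALSE]   # still a 1 x 3 matrix

# apply() on the plain vector fails, so the error is caught with try()
failed <- inherits(try(apply(one_row_vec, 1, sd), silent = TRUE), "try-error")

# with drop = FALSE apply() works and returns one value for the single row
row_sd <- apply(one_row_mat, 1, sd)   # sd of 2, 4, 6 is 2

print(failed)
print(row_sd)
```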

[How to ignore missing values when using the apply function]

> x[3,4]=NA #NA means a missing value. This sets the element in the 3rd row and 4th column to NA.

> x

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 25.49332 29.93656 8.329258 9.427876 6.314251 11.469574

[2,] 32.90328 25.86397 25.692699 23.865729 32.419499 33.890851

[3,] 17.61071 20.22315 33.823737 NA 4.744376 6.378427

[,7]

[1,] 22.232995

[2,] 2.672031

[3,] 25.493322

#Because x now has an NA in the third row, functions such as mean return NA for that row (and for the fourth column).

> apply(x,1,mean)

[1] 16.17198 25.32972 NA

> apply(x,1,mean,na.rm=TRUE) #We can ignore NA when applying a function by setting na.rm=TRUE

[1] 16.17198 25.32972 18.04562
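na.rm is not an argument of apply itself: apply hands any extra arguments on to FUN, so na.rm=TRUE ends up in mean. A minimal sketch with a small fixed matrix follows; m and the result names are made up for this illustration.

```r
# m is a hypothetical matrix with one missing value
m <- matrix(c(1, 2, 3, NA, 5, 6), nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2   NA    6

with_na    <- apply(m, 1, mean)                # row 2 contains NA -> NA
without_na <- apply(m, 1, mean, na.rm = TRUE)  # na.rm is handed on to mean()

# the same call written with an explicit anonymous function
explicit   <- apply(m, 1, function(r) mean(r, na.rm = TRUE))

print(with_na)      # 3 NA
print(without_na)   # 3 4
print(explicit)     # 3 4
```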

[Comparing the apply function to a for loop]

Comparing apply with a for loop that does the same job shows how convenient apply is.

> z=numeric()

> for (i in 1:ncol(x)){

+ z[i]=mean(x[,i],na.rm=TRUE)

+ }

> z #prints the same values as apply(x,2,mean,na.rm=TRUE) below

> apply(x,2,mean,na.rm=TRUE)

[1] 25.33577 25.34123 22.61523 16.64680 14.49271 17.24628 16.79945
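The loop above grows z one element at a time. A slightly more careful sketch allocates the result first and loops over seq_len(ncol(m)), which also behaves correctly for a matrix with zero columns. The matrix m and the names col_means_loop and col_means_apply are made up for this illustration.

```r
# m is a hypothetical matrix with one missing value
m <- matrix(c(1, 2, 3, NA, 5, 6), nrow = 2)

# loop version: allocate the result first, then fill it column by column
col_means_loop <- numeric(ncol(m))
for (i in seq_len(ncol(m))) {
  col_means_loop[i] <- mean(m[, i], na.rm = TRUE)
}

# apply version: a single line, no index bookkeeping
col_means_apply <- apply(m, 2, mean, na.rm = TRUE)

print(col_means_loop)    # 1.5 3.0 5.5
print(col_means_apply)   # 1.5 3.0 5.5
```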

[expand.grid ---> apply with a user-defined function]

> a=c(2,9,1,1:3)

> b=c(1,2,3)

> ab=expand.grid(a,b)

> ab

Var1 Var2

1 2 1

2 9 1

3 1 1

4 1 1

5 2 1

6 3 1

7 2 2

8 9 2

9 1 2

10 1 2

11 2 2

12 3 2

13 2 3

14 9 3

15 1 3

16 1 3

17 2 3

18 3 3

> class(a)

[1] "numeric"

> class(b)

[1] "numeric"

> class(ab)

[1] "data.frame" #expand.grid combines the separate vectors into one data frame that holds every combination of their values.

> apply(ab,1,function(x)(mean(x)^5+1)) #define our own function, mean^5+1, and apply it to each row.

[1] 8.59375 3126.00000 2.00000 2.00000 8.59375

[6] 33.00000 33.00000 5033.84375 8.59375 8.59375

[11] 33.00000 98.65625 98.65625 7777.00000 33.00000

[16] 33.00000 98.65625 244.00000
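When the function should treat the two columns differently, each row can be indexed by column name inside the function. The sketch below is only an illustration: grid, base, expo and power are names invented here, not part of the example above.

```r
# every combination of two small vectors (all names are hypothetical)
grid <- expand.grid(base = c(2, 3), expo = c(1, 2, 3))

# apply() hands each row to the function as a named numeric vector,
# so the two values can be picked out by column name
power <- apply(grid, 1, function(r) r[["base"]]^r[["expo"]])

# store the result next to the inputs that produced it
grid$power <- power
print(grid)
```

In expand.grid the first variable varies fastest, so the rows run through base = 2, 3 for expo = 1, then for expo = 2, and so on.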

[Application]

> d<-data.frame(class=c("A",rep("B",5),rep("A",3),"B","B","A",rep("A",3)),math=rnorm(15,70,3),english=runif(15,70,100))

> d

class math english

1 A 72.65783 95.79730

2 B 70.05633 96.20539

3 B 71.18744 81.20444

4 B 70.38728 91.35513

5 B 72.67096 72.70461

6 B 67.64913 89.04535

7 A 69.16103 75.18307

8 A 71.22940 84.42090

9 A 68.06340 79.68066

10 B 70.90637 77.98655

11 B 69.93301 72.80735

12 A 71.27574 86.89547

13 A 63.91645 90.62090

14 A 71.28327 77.58761

15 A 69.33419 88.87152

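> A.mean=apply(d[,c(2,3)],2,function(x)mean(x[which(d$class=="A")])) #mean of each subject (math, english) for class A

> B.mean=apply(d[,c(2,3)],2,function(x)mean(x[which(d$class=="B")])) #mean of each subject for class B

> T_test_on_means=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)) #two-sample t-test of class A against class B for each subject

> lower.bound=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)$conf.int[1]) #lower bound of the 95% confidence interval

> upper.bound=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)$conf.int[2]) #upper bound of the 95% confidence interval

> p.value=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)$p.value) #p-value of each t-test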
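Note that the calls above pass only columns 2 and 3 to apply. apply converts a data frame to a matrix first, and a matrix can hold only one type, so including the character column class would turn every value into text. A small sketch with a made-up data frame follows; scores and its values are hypothetical.

```r
# scores is a hypothetical data frame with one character and two numeric columns
scores <- data.frame(class   = c("A", "B", "A"),
                     math    = c(70, 65, 80),
                     english = c(90, 85, 75))

# with the character column included, every cell becomes a string
all_cols <- apply(scores, 2, class)
print(all_cols)    # "character" for all three columns

# with only the numeric columns, the values stay numeric
num_cols <- apply(scores[, c("math", "english")], 2, mean)
print(num_cols)
```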

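> d<-data.frame(class=c("A",rep("B",5),rep("A",3),"B","B","A",rep("A",3)),math=rnorm(15,70,3),english=runif(15,70,100)) #d is generated again here; rnorm and runif draw new random values, so the data differ from the first run above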

> d

class math english

1 A 70.62153 96.27295

2 B 71.10101 91.65343

3 B 64.83370 75.35453

4 B 72.69631 84.40178

5 B 69.23331 85.45932

6 B 71.44392 99.12645

7 A 69.16528 84.24816

8 A 72.13993 77.57584

9 A 70.11286 72.15659

10 B 68.79835 89.70399

11 B 66.58327 72.88395

12 A 74.49583 81.67268

13 A 65.05467 87.56123

14 A 70.58212 89.32241

15 A 68.00336 90.64574

> A.mean=apply(d[,c(2,3)],2,function(x)mean(x[which(d$class=="A")]))

> A.mean

math english

70.02195 84.93195

> B.mean=apply(d[,c(2,3)],2,function(x)mean(x[which(d$class=="B")]))

> B.mean

math english

69.24141 85.51192

> T_test_on_means=apply(d[,c(2,3)],2,function(x)t.test(x~d$class))

> T_test_on_means

$math

Welch Two Sample t-test

data: x by d$class

t = 0.53943, df = 12.733, p-value = 0.5989

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-2.352102 3.913176

sample estimates:

mean in group A mean in group B

70.02195 69.24141

$english

Welch Two Sample t-test

data: x by d$class

t = -0.13154, df = 11.835, p-value = 0.8976

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-10.201083 9.041143

sample estimates:

mean in group A mean in group B

84.93195 85.51192

> lower.bound=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)$conf.int[1])

> lower.bound

math english

-2.352102 -10.201083

> upper.bound=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)$conf.int[2])

> upper.bound

math english

3.913176 9.041143

> p.value=apply(d[,c(2,3)],2,function(x)t.test(x~d$class)$p.value)

> p.value

math english

0.5988940 0.8975542
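The calls above run t.test several times per column to pull out the lower bound, the upper bound and the p-value. As a sketch of one way to tidy this up (not the original code of this post), the test can be run once per column and the pieces collected into a single table. set.seed, the helper summarise_column and the layout of the result are assumptions added here so that the example is reproducible.

```r
set.seed(1)  # assumption: fixed seed so the random data can be reproduced
d <- data.frame(class   = c("A", rep("B", 5), rep("A", 3), "B", "B", "A", rep("A", 3)),
                math    = rnorm(15, 70, 3),
                english = runif(15, 70, 100))

# hypothetical helper: one Welch t-test per column, results as a named vector
summarise_column <- function(x, group) {
  tt <- t.test(x ~ group)
  c(A.mean      = mean(x[group == "A"]),
    B.mean      = mean(x[group == "B"]),
    lower.bound = tt$conf.int[1],
    upper.bound = tt$conf.int[2],
    p.value     = tt$p.value)
}

# apply() returns one column per subject; t() turns subjects into rows.
# group = d$class is passed through apply() to summarise_column()
result <- t(apply(d[, c("math", "english")], 2, summarise_column, group = d$class))
print(result)
```

Each row of result describes one subject, so the means, the confidence bounds and the p-value can be read side by side.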

Jangpiano Science