Apply a function to groups within each column of a data frame in R

https://stackoverflow.com/questions/16407758

14-04-2022

Question

I want to calculate the mean and standard deviation, by group, for each column in a subset of a large data frame.

I'm trying to understand why some of the answers to similar questions aren't working for me; I'm still pretty new at R and I'm sure there are a lot of subtleties (and not-so-subtle things!) I'm completely missing.

I have a large data frame similar to this one:

mydata <- data.frame(Experiment = rep(c("E1", "E2", "E3", "E4"), each = 9), 
                     Treatment = c(rep(c("A", "B", "C"), each = 3), rep(c("A", "C", "D"), each = 3), rep(c("A", "D", "E"), each = 3), rep(c("A", "B", "D"), each = 3)), 
                     Day1 = sample(1:100, 36), 
                     Day2 = sample(1:100, 36),
                     Day3 = sample(1:150, 36),
                     Day4 = sample(50:150, 36))

I need to subset the data by Experiment and by Treatment, for example:

testB <- mydata[(mydata[, "Experiment"] %in% c("E1", "E4")) 
            & mydata[, "Treatment"] %in% c("A", "B"), 
            c("Treatment", "Day1", "Day2", "Day4")]

Then, for each column in testB, I want to calculate the mean and standard deviation for each Treatment group.

I started by trying to use tapply (over just one column to begin with), but get back "NA" for Treatment groups that shouldn't be in testB, which isn't a big problem with this small dataset, but is pretty irksome with my real data:

>tapply(testB$Day1, testB$Treatment, mean)
   A        B        C        D        E 
70.66667 61.00000       NA       NA       NA

I tried implementing solutions from the question "Compute mean and standard deviation by group for multiple variables in a data.frame". Using aggregate worked:

ag <- aggregate(. ~ Treatment, testB, function(x) c(mean = mean(x), sd = sd(x)))

But I can't get the data.table solutions to work.

library(data.table)
testB[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by = Treatment]
testB[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = Treatment]

Both gave me the error message:

Error in `[.data.frame`(testB, , c(mean = lapply(.SD, mean), sd = lapply(.SD,  : 
unused argument(s) (by = Treatment)

What am I doing wrong?

Thanks in advance for helping a clueless beginner!

Solution

Your Treatment column is a factor. Although you've dropped the rows that have the treatments "C", "D", and "E" in your subset testB, those levels still exist. Use levels(testB$Treatment) to see them. Wrap the subset in the droplevels function when defining testB, and tapply will return means for A and B without NAs for the empty factor levels. (From R 4.0.0 on, data.frame no longer converts strings to factors by default, so this only arises when the column is created as a factor.)

testB <- droplevels(mydata[(mydata[, "Experiment"] %in% c("E1", "E4")) 
        & mydata[, "Treatment"] %in% c("A", "B"), 
        c("Treatment", "Day1", "Day2", "Day4")])
tapply(testB$Day1,testB$Treatment,mean)
   A        B 
59.16667 66.00000
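
To see the effect of the unused levels directly, here is a minimal sketch using a cut-down version of the question's data. The seed, the `keep` helper, and the `means_before`/`means_after` names are assumptions added for illustration, and `stringsAsFactors = TRUE` is set explicitly so the behaviour reproduces on current R versions.

```r
# Recreate part of the question's data. stringsAsFactors = TRUE is explicit
# because R >= 4.0.0 no longer converts strings to factors by default.
set.seed(42)  # arbitrary seed, only for reproducibility
mydata <- data.frame(
  Experiment = rep(c("E1", "E2", "E3", "E4"), each = 9),
  Treatment = c(rep(c("A", "B", "C"), each = 3), rep(c("A", "C", "D"), each = 3),
                rep(c("A", "D", "E"), each = 3), rep(c("A", "B", "D"), each = 3)),
  Day1 = sample(1:100, 36),
  stringsAsFactors = TRUE
)

# Hypothetical helper vector: rows in experiments E1/E4 with treatments A/B
keep <- mydata$Experiment %in% c("E1", "E4") & mydata$Treatment %in% c("A", "B")
testB <- mydata[keep, c("Treatment", "Day1")]

levels(testB$Treatment)              # still lists all five levels
levels(droplevels(testB)$Treatment)  # only the levels present in the subset

# tapply() returns one entry per factor level, so empty levels come back as NA
means_before <- tapply(testB$Day1, testB$Treatment, mean)
means_after  <- tapply(testB$Day1, droplevels(testB)$Treatment, mean)
```

Printing `means_before` and `means_after` side by side should show the NA entries disappearing once the empty levels are dropped.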

Hope this helps!

Ron

Other tips

You could also use plyr and reshape2 to tackle this problem. I generally prefer these libraries because the abstractions they introduce apply to more problems and lead to cleaner code.

How I would solve it:

library(plyr)
library(reshape2)
# testB from your code above

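# make a "long" version of testB with columns Treatment, variable, value
longTestB <- melt(testB, id.vars="Treatment")
# then use ddply to calculate the metrics per Treatment and per original column;
# grouping by Treatment alone would pool Day1, Day2 and Day4 together
ddply(longTestB, .(Treatment, variable), summarize, mean=mean(value), stdev=sd(value))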
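
As for the data.table attempts in the question: the "unused argument(s) (by = Treatment)" error comes from `[.data.frame`, which means testB was still a plain data.frame when the call ran, and data.frame subsetting has no `by` argument. A minimal sketch of one way around this, assuming the data.table package is installed, is to convert first with as.data.table. The seed and the names `testDT` and `res` are made up for illustration.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
mydata <- data.frame(
  Experiment = rep(c("E1", "E2", "E3", "E4"), each = 9),
  Treatment = c(rep(c("A", "B", "C"), each = 3), rep(c("A", "C", "D"), each = 3),
                rep(c("A", "D", "E"), each = 3), rep(c("A", "B", "D"), each = 3)),
  Day1 = sample(1:100, 36),
  Day2 = sample(1:100, 36),
  Day3 = sample(1:150, 36),
  Day4 = sample(50:150, 36)
)
testB <- mydata[mydata$Experiment %in% c("E1", "E4") &
                  mydata$Treatment %in% c("A", "B"),
                c("Treatment", "Day1", "Day2", "Day4")]

# Convert the data.frame so that `[` dispatches to data.table's method,
# which understands `by` and `.SD`
testDT <- as.data.table(testB)

# One row per Treatment; columns are named mean.Day1, ..., sd.Day4
res <- testDT[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = Treatment]
res
```

Because data.table only creates groups that actually occur in the data, this version returns rows for A and B only, whether or not Treatment carries unused factor levels.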

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow