Problem

I'm trying to collapse a data frame by removing all but one row from each group of rows that have identical values in a particular column. In other words, I want to keep only the last row from each group.

For example, I'd like to convert this

> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
> d
  x  y  z
1 1 10 20
2 1 11 19
3 2 12 18
4 4 13 17

Into this:

  x  y  z
1 1 11 19
2 2 12 18
3 4 13 17

I'm using aggregate to do this currently, but the performance is unacceptable with more data:

> d.ordered = d[order(-d$y),]
> aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})

I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.

Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
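For what it's worth, here is roughly what that rle idea could look like (a sketch on my part, assuming rows with equal x are already contiguous, as in the example):

r      <- rle(d$x)
ends   <- cumsum(r$lengths)        # index of the last row in each run
starts <- ends - r$lengths + 1L    # index of the first row in each run
d[ends, ]                          # pluck out the last row of each group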


Solution

Maybe duplicated() can help:

R> d[ !duplicated(d$x), ]
  x  y  z
1 1 10 20
3 2 12 18
4 4 13 17
R> 

Edit: Shucks, never mind. This picks the first row in each block of repetitions, and you wanted the last. So here is another attempt, this time using plyr:

R> library(plyr)
R> ddply(d, "x", function(z) tail(z,1))
  x  y  z
1 1 11 19
2 2 12 18
3 4 13 17
R> 

Here plyr does the hard work of finding the unique subsets, looping over them, and applying the supplied function, which simply returns the last row of each block z via tail(z, 1).

Other tips

Just to add a little to what Dirk provided... duplicated() has a fromLast argument that you can use to keep the last row of each group instead:

d[ !duplicated(d$x,fromLast=TRUE), ]
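For the example d above, this keeps the last row of each x group (row names are carried over from the original data frame):

d[ !duplicated(d$x,fromLast=TRUE), ]
  x  y  z
2 1 11 19
3 2 12 18
4 4 13 17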

Here is a data.table solution, which will be time- and memory-efficient for large data sets:

library(data.table)
DT <- as.data.table(d)           # convert to data.table
setkey(DT, x)                    # set key to allow binary search using `J()`
DT[J(unique(x)), mult = 'last']   # subset out the last row for each x
DT[J(unique(x)), mult = 'first']  # if you wanted the first row for each x
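As a side note (not part of the original answer, and assuming a reasonably recent data.table version), you can get the same result without the keyed join by taking the first or last row of .SD within each group:

DT[, .SD[.N], by = x]    # last row within each x group
DT[, .SD[1L], by = x]    # first row within each x group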

There are a couple of options using dplyr (df here is the input data frame, i.e. d in the example above):

library(dplyr)
df %>% distinct(x, .keep_all = TRUE)
df %>% group_by(x) %>% filter(row_number() == 1)
df %>% group_by(x) %>% slice(1)

You can use more than one column with both distinct() and group_by():

df %>% distinct(x, y, .keep_all = TRUE)
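A corresponding group_by() version (my addition, under the same assumptions about df) would look like:

df %>% group_by(x, y) %>% slice(1) %>% ungroup()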

The group_by() and filter() approach can be useful if there is a date or some other sequential field and you want to ensure the most recent observation is kept; adding slice() avoids returning tied rows:

df %>% group_by(x) %>% filter(date == max(date)) %>% slice(1)
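Applied to the example d from the question (a sketch assuming a recent dplyr, with y standing in for the sequential field), the same pattern keeps the row with the largest y in each x group:

library(dplyr)

d %>%
  group_by(x) %>%
  filter(y == max(y)) %>%   # keep the row(s) with the largest y per group
  slice(1) %>%              # break any remaining ties, keeping one row
  ungroup()
# keeps the rows with y = 11, 12 and 13, matching the desired output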