Ranking variables with conditions

https://stackoverflow.com/questions/18153214

24-06-2022
|

Вопрос

Say I have the following data frame:

df <- data.frame(store       = LETTERS[1:8],
                 sales       = c(  9, 128,  54,  66,  23, 132,  89,  70),
                 successRate = c(.80, .25, .54, .92, .85, .35, .54, .46))

I want to rank the stores according to successRate, with ties going to the store with more sales, so first I do this (just to make visualization easier):

df <- df[order(-df$successRate, -df$sales), ]

In order to actually create a ranking variable, I do the following:

df$rank <- ave(df$successRate, FUN = function(x) rank(-x, ties.method='first'))

So df looks like this:

  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85    2
1     A     9        0.80    3
7     G    89        0.54    4
3     C    54        0.54    5
8     H    70        0.46    6
6     F   132        0.35    7
2     B   128        0.25    8

The problem is I don't want small stores to be part of the ranking. Specifically, I want stores with less than 50 sales not to be ranked. So this is how I define df$rank instead:

df$rank <- ifelse(df$sales < 50, NA,
                  ave(df$successRate, FUN = function(x) rank(-x, ties.method='first')))

The problem is that even though this correctly removes stores E and A, it doesn't reassign the rankings they were occupying. df looks like this now:

  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85   NA
1     A     9        0.80   NA
7     G    89        0.54    4
3     C    54        0.54    5
8     H    70        0.46    6
6     F   132        0.35    7
2     B   128        0.25    8

I've experimented with conditions inside and outside ave(), but I can'r get R to do what I want! How can I get it to rank the stores like this?

  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85   NA
1     A     9        0.80   NA
7     G    89        0.54    2
3     C    54        0.54    3
8     H    70        0.46    4
6     F   132        0.35    5
2     B   128        0.25    6

Решение

Super easy to do with data.table:

library(data.table)
dt = data.table(df)

# do the ordering you like (note, could also use setkey to do this faster)
dt = dt[order(-successRate, -sales)]

dt[sales >= 50, rank := .I]
dt
#   store sales successRate rank
#1:     D    66        0.92    1
#2:     E    23        0.85   NA
#3:     A     9        0.80   NA
#4:     G    89        0.54    2
#5:     C    54        0.54    3
#6:     H    70        0.46    4
#7:     F   132        0.35    5
#8:     B   128        0.25    6

If you must do it in data.frame, then after your preferred order, run:

df$rank <- NA
df$rank[df$sales >= 50] <- seq_len(sum(df$sales >= 50))

Лицензировано под: CC-BY-SA с атрибуция

Не связан с StackOverflow