How do I do a conditional sum which only looks between certain date criteria

https://stackoverflow.com//questions/24020207

21-12-2019
|

Question

Say I have data that looks like

date, user, items_bought, event_number
2013-01-01, x, 2, 1
2013-01-02, x, 1, 2
2013-01-03, x, 0, 3
2013-01-04, x, 0, 4
2013-01-04, x, 1, 5
2013-01-04, x, 2, 6
2013-01-05, x, 3, 7
2013-01-06, x, 1, 8
2013-01-01, y, 1, 1
2013-01-02, y, 1, 2
2013-01-03, y, 0, 3
2013-01-04, y, 5, 4
2013-01-05, y, 6, 5
2013-01-06, y, 1, 6

to get the cumulative sum per user per data point I was doing

data.frame(cum_items_bought=unlist(tapply(as.numeric(data$items_bought), data$user, FUN = cumsum)))

output from this looks like

date, user, items_bought
2013-01-01, x, 2
2013-01-02, x, 3
2013-01-03, x, 3
2013-01-04, x, 3
2013-01-04, x, 4
2013-01-04, x, 6
2013-01-05, x, 9
2013-01-06, x, 10
2013-01-01, y, 1
2013-01-02, y, 2
2013-01-03, y, 2
2013-01-04, y, 7
2013-01-05, y, 13
2013-01-06, y, 14

However I want to restrict my sum to only add up those that happened within 3 days of each row (relative to the user). i.e. the output needs to look like this:

date, user, cum_items_bought_3_days
2013-01-01, x, 2
2013-01-02, x, 3
2013-01-03, x, 3
2013-01-04, x, 1
2013-01-04, x, 2
2013-01-04, x, 4
2013-01-05, x, 6
2013-01-06, x, 7
2013-01-01, y, 1
2013-01-02, y, 2
2013-01-03, y, 2
2013-01-04, y, 6
2013-01-05, y, 11
2013-01-06, y, 12

Solution

Here's a dplyr solution which will produce the desired result (14 rows) as specified in the question. Note that it takes care of duplicate date entries, for example, 2013-01-04 for user x.

# define a custom function to be used in the dplyr chain
myfunc <- function(x){
  with(x, sapply(event_number, function(y) 
    sum(items_bought[event_number <= event_number[y] & date[y] - date <= 2])))
}

require(dplyr)                 #install and load into your library

df %>%
  mutate(date = as.Date(as.character(date))) %>%
  group_by(user) %>%
  do(data.frame(., cum_items_bought_3_days = myfunc(.))) %>%
  select(-c(items_bought, event_number))

#         date user cum_items_bought_3_days
#1  2013-01-01    x                       2
#2  2013-01-02    x                       3
#3  2013-01-03    x                       3
#4  2013-01-04    x                       1
#5  2013-01-04    x                       2
#6  2013-01-04    x                       4
#7  2013-01-05    x                       6
#8  2013-01-06    x                       7
#9  2013-01-01    y                       1
#10 2013-01-02    y                       2
#11 2013-01-03    y                       2
#12 2013-01-04    y                       6
#13 2013-01-05    y                      11
#14 2013-01-06    y                      12

In my answer I use a custom function myfunc inside a dplyr chain. This is done using the do operator from dplyr. The custom function is passed the subsetted df by user groups. It then uses sapply to pass each event_number and calculate the sums of items_bought. The last line of the dplyr chain deselects the undesired columns.

Let me know if you'd like a more detailed explanation.

Edit after comment by OP:

If you need more flexibility to also conditionally sum up other columns, you can adjust the code as follows. I assume here, that the other columns should be summed up the same way as items_bought. If that is not correct, please specify how you want to sum up the other columns.

I first create two additional columns with random numbers in the data (I'll post a dput of the data at the bottom of my answer):

set.seed(99)   # for reproducibility only

df$newCol1 <- sample(0:10, 14, replace=T)
df$newCol2 <- runif(14)

df
#         date user items_bought event_number newCol1     newCol2
#1  2013-01-01    x            2            1       6 0.687800094
#2  2013-01-02    x            1            2       1 0.640190769
#3  2013-01-03    x            0            3       7 0.357885360
#4  2013-01-04    x            0            4      10 0.102584999
#5  2013-01-04    x            1            5       5 0.097790922
#6  2013-01-04    x            2            6      10 0.182886256
#7  2013-01-05    x            3            7       7 0.227903474
#8  2013-01-06    x            1            8       3 0.080524150
#9  2013-01-01    y            1            1       3 0.821618422
#10 2013-01-02    y            1            2       1 0.591113977
#11 2013-01-03    y            0            3       6 0.773389019
#12 2013-01-04    y            5            4       5 0.350085977
#13 2013-01-05    y            6            5       2 0.006061323
#14 2013-01-06    y            1            6       7 0.814506223

Next, you can modify myfunc to take 2 arguments, instead of 1. The first argument will remain the subsetted data.frame as before (represented by . inside the dplyr chain and x in the function definition of myfunc), while the second argument to myfunc will specify the column to sum up (colname).

myfunc <- function(x, colname){
  with(x, sapply(event_number, function(y) 
    sum(x[event_number <= event_number[y] & date[y] - date <= 2, colname])))
}

Then, you can use myfunc several times if you want to conditionally sum up several columns:

df %>%
  mutate(date = as.Date(as.character(date))) %>%
  group_by(user) %>%
  do(data.frame(., cum_items_bought_3_days = myfunc(., "items_bought"),
                   newCol1Sums = myfunc(., "newCol1"),            
                   newCol2Sums = myfunc(., "newCol2"))) %>%
select(-c(items_bought, event_number, newCol1, newCol2))

#         date user cum_items_bought_3_days newCol1Sums newCol2Sums
#1  2013-01-01    x                       2           6   0.6878001
#2  2013-01-02    x                       3           7   1.3279909
#3  2013-01-03    x                       3          14   1.6858762
#4  2013-01-04    x                       1          18   1.1006611
#5  2013-01-04    x                       2          23   1.1984520
#6  2013-01-04    x                       4          33   1.3813383
#7  2013-01-05    x                       6          39   0.9690510
#8  2013-01-06    x                       7          35   0.6916898
#9  2013-01-01    y                       1           3   0.8216184
#10 2013-01-02    y                       2           4   1.4127324
#11 2013-01-03    y                       2          10   2.1861214
#12 2013-01-04    y                       6          12   1.7145890
#13 2013-01-05    y                      11          13   1.1295363
#14 2013-01-06    y                      12          14   1.1706535

Now you created conditional sums of the columns items_bought, newCol1 and newCol2. You can also leave out any of the sums in the dplyr chain or add more columns to sum up.

Edit #2 after comment by OP:

To calculate the cumulative sum of distinct (unique) items bought per user, you could define a second custom function myfunc2 and use it inside the dplyr chain. This function is also flexible as myfunc so that you can define the columns to which you want to apply the function.

The code would then be:

myfunc <- function(x, colname){
  with(x, sapply(event_number, function(y) 
    sum(x[event_number <= event_number[y] & date[y] - date <= 2, colname])))
}

myfunc2 <- function(x, colname){
  cumsum(sapply(seq_along(x[[colname]]), function(y) 
    ifelse(!y == 1 & x[y, colname] %in% x[1:(y-1), colname], 0, 1)))
}

require(dplyr)                 #install and load into your library

dd %>%
  mutate(date = as.Date(as.character(date))) %>%
  group_by(user) %>%
  do(data.frame(., cum_items_bought_3_days = myfunc(., "items_bought"),
                   newCol1Sums = myfunc(., "newCol1"),
                   newCol2Sums = myfunc(., "newCol2"),
                   distinct_items_bought = myfunc2(., "items_bought"))) %>%   
  select(-c(items_bought, event_number, newCol1, newCol2))

Here is the data I used:

dput(df)
structure(list(date = structure(c(1L, 2L, 3L, 4L, 4L, 4L, 5L, 
6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("2013-01-01", "2013-01-02", 
"2013-01-03", "2013-01-04", "2013-01-05", "2013-01-06"), class = "factor"), 
user = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L), .Label = c(" x", " y"), class = "factor"), 
items_bought = c(2L, 1L, 0L, 0L, 1L, 2L, 3L, 1L, 1L, 1L, 
0L, 5L, 6L, 1L), event_number = c(1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L), newCol1 = c(6L, 1L, 7L, 
10L, 5L, 10L, 7L, 3L, 3L, 1L, 6L, 5L, 2L, 7L), newCol2 = c(0.687800094485283, 
0.640190769452602, 0.357885359786451, 0.10258499882184, 0.0977909218054265, 
0.182886255905032, 0.227903473889455, 0.0805241498164833, 
0.821618422167376, 0.591113976901397, 0.773389018839225, 
0.350085976999253, 0.00606132275424898, 0.814506222726777
)), .Names = c("date", "user", "items_bought", "event_number", 
"newCol1", "newCol2"), row.names = c(NA, -14L), class = "data.frame")

OTHER TIPS

I'd like to propose an additional data.table approach combined with zoo package rollapplyr function

First, we will aggregate items_bought column per user per unique date (as you pointed out that there could be more than one unique date per user)

library(data.table)
data <- setDT(data)[, lapply(.SD, sum), by = c("user", "date"), .SDcols = "items_bought"]

Next, we will compute rollapplyr combined with sum and partial = TRUE in order to cover up for margins (thanks for the advice @G. Grothendieck) in 3 days intervals

library(zoo)
data[, cum_items_bought_3_days := lapply(.SD, rollapplyr, 3, sum, partial = TRUE), .SDcols = "items_bought", by = user]

#     user       date items_bought cum_items_bought_3_days
#  1:    x 2013-01-01            2                       2
#  2:    x 2013-01-02            1                       3
#  3:    x 2013-01-03            0                       3
#  4:    x 2013-01-04            0                       1
#  5:    x 2013-01-05            3                       3
#  6:    x 2013-01-06            1                       4
#  7:    y 2013-01-01            1                       1
#  8:    y 2013-01-02            1                       2
#  9:    y 2013-01-03            0                       2
# 10:    y 2013-01-04            5                       6
# 11:    y 2013-01-05            6                      11
# 12:    y 2013-01-06            1                      12

This is the data set I've used

data <- structure(list(date = structure(c(15706, 15707, 15708, 15709, 15710, 15711, 15706, 15707, 15708, 15709, 15710, 15711), class = "Date"), user = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c(" x", " y"), class = "factor"), items_bought = c(2L, 1L, 0L, 0L, 3L, 1L, 1L, 1L, 0L, 5L, 6L, 1L)), .Names = c("date", "user", "items_bought"), row.names = c(NA, -12L), class = "data.frame")

Here is a fairly simple method:

# replicate your data, shifting the days ahead by your required window,
# and rbind into a single data frame
d <- do.call(rbind,lapply(0:2, function(x) transform(data,date=date+x)))

# use aggregate to add it together, subsetting out "future" days
aggregate(items_bought~date+user,subset(d,date<=max(data$date)),sum)
         date user items_bought
1  2013-01-01    x            2
2  2013-01-02    x            3
3  2013-01-03    x            3
4  2013-01-04    x            1
5  2013-01-05    x            3
6  2013-01-06    x            4
7  2013-01-01    y            1
8  2013-01-02    y            2
9  2013-01-03    y            2
10 2013-01-04    y            6
11 2013-01-05    y           11
12 2013-01-06    y           12

The following looks valid:

unlist(lapply(split(data, data$user), 
              function(x) {
                 ave(x$items_bought, 
                 cumsum(c(0, diff(x$date)) >= 3), FUN = cumsum) 
              }))   
#x1  x2  x3  x4  y1  y2  y3  y4 
# 2   3   3   4   1   6   6   7

Where data:

data = structure(list(date = structure(c(15706, 15707, 15710, 15711, 
15706, 15707, 15710, 15711), class = "Date"), user = structure(c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c(" x", " y"), class = "factor"), 
    items_bought = c(2L, 1L, 3L, 1L, 1L, 5L, 6L, 1L)), .Names = c("date", 
"user", "items_bought"), row.names = c(NA, -8L), class = "data.frame")

Here is an approach that doesn't use cumsum but a nested lapply instead. The first one goes over the users and then for each user the second lapply constructs the desired data frame by summing all items bought from within the last 2 days of each date. Note that if data$date were not sorted, it would have to be sorted in ascending order first.

data <- structure(list(
    date = structure(c(15706, 15707, 15708, 15709, 15710, 15711, 
        15706, 15707, 15708, 15709, 15710, 15711), class = "Date"), 
    user = c("x", "x", "x", "x", "x", "x", "y", "y", "y", "y", "y", "y"),
    items_bought = c(2L, 1L, 0L, 0L, 3L, 1L, 1L, 1L, 0L, 5L, 6L, 1L)),
    .Names = c("date", "user", "items_bought"),
    row.names = c(NA, -12L),
    class = "data.frame")

do.call(rbind, lapply(unique(data$user),
   function(u) {
       subd <- subset(data, user == u)
       do.call(rbind, lapply(subd$date, 
           function(x) data.frame(date = x, 
               user = u, items_bought = 
               sum(subd[subd$date %in% (x - 2):x, "items_bought"]))))
}))

Edit

To deal with the issue of having several timestamps for each day (more than 1 row per date) I would first aggregate by summing all items bought during at each time in the same day. You can do that e.g. using the built-in function aggregate but if your data is too large you can also use data.table for speed. I'll call your original data frame (with more than 1 row per date) predata and the aggregated one (1 row per date) data. So by calling

predt <- data.table(predata)
setkey(predt, date, user)
data <- predt[, list(items_bought = sum(items_bought)), by = key(predt)]

you get a data frame containing one row per date and columns date, user, items_bought. Now, I think the following way will be faster than the nested lapply above, but I am not sure since I cannot test it on your data. I am using data.table because it is meant to be fast (if used the right way, which I am not sure this is). The inner loop will be replaced by a function f. I do not know if there is a neater way, avoiding this function and replacing the double loop with only one call to data.table, or how to write a data.table call that would execute faster.

library(data.table)
dt <- data.table(data)
setkey(dt, user)
f <- function(d, u) {
    do.call(rbind, lapply(d$date, function(x) data.frame(date = x,
        items_bought = d[date %in% (x - 2):x, sum(items_bought)])))
}
data <- dt[, f(.SD, user), by = user]

Another way, which doesn't use data.table, assuming that you have enough RAM (again, I don't know the size of your data), is to store items bought 1 day before in a vector, then items bought 2 days before in another vector, etc, and to sum them up in the end. Something like

sumlist <- vector("list", 2) # this will hold one vector, which contains items 
    # bought 1 or 2 days ago
for (i in 1:2) {
    # tmpstr will be used to find the items that a given user bought i days ago
    tmpstr <- paste(data$date - i, data$user, sep = "|")
    tmpv <- data$items_bought[
        match(tmpstr, paste(data$date, data$user, sep = "|"))]
    # if a date is not in the original data, assume no purchases
    tmpv[is.na(tmpv)] <- 0
    sumlist[[i]] <- tmpv
}
# finally, add up items bought in the past as well as the present day
data$cum_items_bought_3_days <- 
    rowSums(as.data.frame(sumlist)) + data$items_bought

A final thing I would try would be to parallelize the lapply calls, e.g. by using the function mclapply instead, or by re-writing the code using the parallel functionality of foreach or plyr. Depending on the strength of your PC and the size of the task, this may outperform the data.table single-core performance...

It seems like packages xts and zoo contain functions that do what you want, although you may have the same problems with the size of your actual dataset as with @alexis_laz answer. Using the functions from the xts answer to this question seem to do the trick.

First I took the code from the answer I link to above and made sure it worked for just one user. I include the apply.daily function because I believe from your edits/comments that you have multiple observations for some days for some users - I added an extra line to the toy dataset to reflect this.

# Make dataset with two observations for one date for "y" user
dat <- structure(list(
    date = structure(c(15706, 15707, 15708, 15709, 15710, 15711, 
        15706, 15707, 15708, 15709, 15710, 15711, 15711), class = "Date"), 
    user = c("x", "x", "x", "x", "x", "x", "y", "y", "y", "y", "y", "y", "y"),
    items_bought = c(2L, 1L, 0L, 0L, 3L, 1L, 1L, 1L, 0L, 5L, 6L, 1L, 0L)),
    .Names = c("date", "user", "items_bought"),
    row.names = c(NA, -13L),
    class = "data.frame")

# Load xts package (also loads zoo)
require(xts)

# See if this works for one user
dat1 = subset(dat, user == "y")
# Create "xts" object for use with apply.daily()
dat1.1 = xts(dat1$items_bought, dat1$date)
dat2 = apply.daily(dat1.1, sum)
# Now use rollapply with a 3-day window
# The "partial" argument appears to only work with zoo objects, not xts
sum.itemsbought = rollapply(zoo(dat2), 3, sum, align = "right", partial = TRUE)

I thought the output could look nicer (more like example output from your question). I haven't worked with zoo objects much, but the answer to this question gave me some pointers for putting the info into a data.frame.

data.frame(Date=time(sum.itemsbought), sum.itemsbought, row.names=NULL)

Once I had this worked out for one user, it was straightforward to expand this to the entire toy dataset. This is where speed could become an issue. I use lapply and do.call for this step.

allusers = lapply(unique(dat$user), function(x) {
    dat1 = dat[dat$user == x,]
    dat1.1 = xts(dat1$items_bought, dat1$date)
    dat2 = apply.daily(dat1.1, sum)
    sum.itemsbought = rollapply(zoo(dat2), 3, sum, align = "right", partial = TRUE)
    data.frame(Date=time(sum.itemsbought), user = x, sum.itemsbought, row.names=NULL)
} )
do.call(rbind, allusers)

I like James' answer better, but here's an alternative:

with(data,{
  sapply(split(data,user),function(x){
    sapply(x$date,function(y) sum(x$items_bought[x$date %in% c(y,y-1,y-2)]))
  })
})

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow