1

I have a list of data frames that each have year, month, day, and temperature columns along with some other stuff. I have already figured out (with stack overflow help thanks guys) how to subset the data frames so that they only contain the data that has the temperature above a certain threshold. However, now I need to take those subsets and find another subset of that of the days that are consecutive (basically finding all the rows where the temperature was above the threshold for 3+ days).

example data:

set.seed(1234)

A <- data.frame("D" = c(sample(1:5, size = 1000, replace = TRUE)), "Temp" = c(sample(0:35, size = 1000, replace = TRUE)))
B <- data.frame("D" = c(sample(1:5, size = 1000, replace = TRUE)), "Temp" = c(sample(-10:22, size = 1000, replace = TRUE)))
C <- data.frame("D" = c(sample(1:5, size = 1000, replace = TRUE)), "Temp" = c(sample(3:42, size = 1000, replace = TRUE)))

climate <- list("Alist" = A, "Blist" = B, "Clist" = C)

# courtesy of ThomasIsCoding
tm <- lapply(
  climate,
  function(x) {
    subset(
      x,
      Temp > quantile(Temp, probs = 0.90)
    )
  }
)
tm

So far I've tried using lapply, diff, and rle to find the instances where days were consecutive. I took my list of dataframes, used lapply to select the days column, then used lapply again to find the difference between each element in the days column, and then used lapply again to do rle.

result <- lapply(lapply(lapply(
  tm, '[[',1), # this selects the days column in all the dataframes in the list tmax95
  diff), # this finds the difference between each element in the days column to help determine if they are consecutive
  rle) # i don't really understand rle fully but to my understanding this takes all the differences and counts how many there are in order or something like that
result

Then I use another lapply to find the instances where there are more than 2 differences of 1 day (consecutive)

fin <- lapply(result, function(x) x$lengths>=2 & x$values==1)
fin

However, this isn't quite what I want as an end result as this returns a list of lists of Trues and Falses, when what I'd like is the rows of data associated with the Trues that are being returned. How would I be able to get my output to be the data with consecutive days instead of the Trues?

2
  • You have to go back to the original data that contains the dates. The data you provided doesn't contain any dates. Commented Aug 9, 2024 at 13:36
  • @Edward In this case, just being able to reference back to the A,B, and C data frames with the D column should be good enough for me to figure out with the original data. I didn't use the original data as the sample since it has 20 data frames each with around 10,000 rows and 15 columns. Would it be more helpful if I put in the original data in here? Commented Aug 9, 2024 at 13:46

2 Answers 2

1

You can try the following code, which first applies the Temp threshold to each climate using subset (as you did), and then subsets these using diff to find the rows that have 3 consecutive days.

result <- lapply(climate, \(x) {
  y <- subset(x,
              Temp > quantile(Temp, probs = 0.90)
  )
  
  y[unique(
    sort(
      unlist(
        lapply(
          which(
            c(diff(y[,'D'])==1, FALSE) & c(diff(y[,'D'], diff=2)==0, FALSE, FALSE)), 
          \(x) x + 0:2)
      ))),]
}
)

lapply(result, head)
    $Alist
    D Temp
317 2   35
326 3   33
341 4   34
579 3   34
582 4   33
584 5   35

$Blist
    D Temp
357 2   21
358 3   22
361 4   20

$Clist
    D Temp
257 2   40
258 3   40
269 4   41
350 1   41
351 2   42
364 3   41

Explanation:

diff() takes differences of each element. Setting this to == 1 finds successive days. diff(..., diff=2)==0 finds successive differences that are equal. Together, these two conditions find the first of 3 successive days. Adding FALSE ensures the vector has the correct length since differencing reduces the length by 1 each time.

which finds the indices of the first successive 3 days. +0:2 gives the indices for the 3 consecutive days.

lapply does the loop for each day and we need to unlist that to get a vector, sort it, and then remove duplicates with unique.


Data (climate) provided in question with set.seed(1234).

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you! This worked with a slight modification, but I was wondering if you might be able to explain the different functions you used and what they do in this context? Even after reading the R Help documentation, I'm still a bit confused
Yes - see my edit for an explanation.
0

See if this works for you, using dplyr. It's a bit unclear what consecutive means in the context of randomly distributed days, but I take the data as is for now.

library(dplyr)

threshTemp = 20
threshDay = 3

lapply(tm, \(x)
  x %>%
  mutate(cond = Temp > threshTemp, consid = consecutive_id(cond)) %>%
  filter(n() >= threshDay & cond, .by = c(D, consid)) %>%
  select(-cond, -consid))

2 Comments

The randomly distributed days is just to create a sample dataset. In my original data it's based off of actual observations and then I pick out the days that are above a certain temperature threshold which leaves some days being consecutive and most not, if that makes sense
Should be fine, we could also sort, but since we're grouping it doesn't matter much except maybe for printing and display reasons.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.