I have a dataset with over 10,000 columns and 10,000 rows. Within each row, I am trying to sum the values of columns that share the same base name.

The dataset looks something like this

library(tibble)

data <- tibble(date = c('1/1/2018','2/1/2018','3/1/2018'),
              x1 = c(1, 11, 111),
              x2 = c(2, 22, 222),
              x1_1 = c(3, 333, 333),
              x2_1 = c(4, 44, 44),
              x1_2 = c(5, 55, 555),
              x2_2 = c(6, 66, 666))

I am trying to create a new table which includes the date column, an x1 column and an x2 column where the value of x1 for row 1 = 1+3+5, value of x2 for row 2 = 22+44+66, etc.

Any help would be much appreciated.

  • I should have mentioned that column names in my actual dataset are not limited to x1 and x2. It goes up to x2000 so naming columns individually would not be possible. Commented May 11, 2022 at 2:28

3 Answers

Here's a for loop approach. I use stringr but we could just as easily use base regex functions to keep it dependency-free.

library(stringr)
name_stems = unique(str_replace(names(data)[-1], "_.*", ""))
result = data[, "date", drop = FALSE]
for(i in seq_along(name_stems)) {
  result[[name_stems[i]]] = 
    rowSums(data[
      str_detect(
        names(data),
        pattern = paste0(name_stems[i], "_")
      )
    ])
}

result
# # A tibble: 3 × 3
#   date        x1    x2
#   <chr>    <dbl> <dbl>
# 1 1/1/2018     9    12
# 2 2/1/2018   399   132
# 3 3/1/2018   999   932

3 Comments

While this works fine for the example above, I run into an issue with my actual dataset. It seems that for x1 for example, the rowSums includes not only x1, x1_1 and x1_2 but also any column that starts with x1 (e.g. x10, x10_1, x11, x100, etc.). Any thoughts on how to resolve this?
Sure, we can include the _ in the pattern. Editing now.
Thanks Gregor. I've added a line of code after line 3 to append a "_" to the name of any column (except date) that doesn't already contain a "_", as the above code does not include x1 in the sum, just x1_1, x1_2, etc.: data <- rename_with(data, .fn = ~paste0(., "_"), .cols = c(-contains("_"),-contains("date")))
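
Both issues raised in these comments (x10 being folded into x1, and the bare x1 column being left out) can probably be handled by a single anchored pattern rather than by renaming columns. The following is a minimal base R sketch, not the answerer's code. It assumes every column other than date is named either stem or stem_suffix and that stems contain no regex metacharacters. The x10 column is hypothetical, added to the question's example only to show that it stays separate.

```r
# Question's example data plus a hypothetical x10 column
data <- data.frame(
  date = c("1/1/2018", "2/1/2018", "3/1/2018"),
  x1   = c(1, 11, 111),
  x2   = c(2, 22, 222),
  x1_1 = c(3, 333, 333),
  x2_1 = c(4, 44, 44),
  x1_2 = c(5, 55, 555),
  x2_2 = c(6, 66, 666),
  x10  = c(7, 77, 777)   # hypothetical, not in the original post
)

# Stem = column name with everything from the first "_" removed
stems <- unique(sub("_.*$", "", names(data)[-1]))

result <- data["date"]
for (stem in stems) {
  # "^stem(_|$)" matches the bare stem and stem_*, but not longer stems
  keep <- grepl(paste0("^", stem, "(_|$)"), names(data))
  result[[stem]] <- rowSums(data[keep])
}

result
```

Because the pattern is anchored at both ends of the stem, x1 collects x1, x1_1 and x1_2 only, and x10 ends up in its own column.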

Using data.table:

library(data.table)

baseCols <- paste0('x', 1:2)
result <- setDT(data) |> melt(measure.vars = patterns(baseCols), value.name = baseCols)
result[, lapply(.SD, sum), by=.(date), .SDcols=baseCols]
##        date  x1  x2
## 1: 1/1/2018   9  12
## 2: 2/1/2018 399 132
## 3: 3/1/2018 999 932


Your data is in wide format. One way to achieve your goal is to transform the data into long format, group the rows by index (x1 and x2), compute the sum for each group on each date, and finally transform the result back to wide format so that each index becomes a column.

library(tidyverse)

data |> 
    pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
    mutate(xgroup = substr(name, 1,2)) |> 
    group_by(date,xgroup) |>
    summarise(xsums = sum(x.values)) |> 
    pivot_wider(values_from = xsums, names_from = xgroup )

#  date        x1    x2
#  <chr>    <dbl> <dbl>
#1 1/1/2018     9    12
#2 2/1/2018   399   132
#3 3/1/2018   999   932

Updates

To include only x1 and the columns whose names start with x1_, and to exclude any other column that merely starts with x1 (such as x10), the following regular expression can be used: "x1$|(x1_).*". A similar pattern selects only x2 and the columns whose names start with x2_. For example:

s <- c("x100_1", "x10", "x1", "x1_1", "x1_2", "x2", "x2_1", "x2_2", "x20", "x20_1")
s
#[1] "x100_1" "x10"    "x1"     "x1_1"   "x1_2"   "x2"     "x2_1"   "x2_2"   "x20"   
#[10] "x20_1" 

s |> str_extract("x1$|(x1_).*")
#[1] NA     NA     "x1"   "x1_1" "x1_2" NA     NA     NA     NA     NA

s |> str_extract("x2$|(x2_).*")
#[1] NA     NA     NA     NA     NA     "x2"   "x2_1" "x2_2" NA     NA   

This pattern can then be used to create a group that consists of x1 and x1_ columns only and another group that consists of x2 and x2_ columns only.

Here is the full code:

data |> 
    pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
    mutate(xgroup = case_when(str_detect(name, "x1$|(x1_).*")~"x1",
                              str_detect(name, "x2$|(x2_).*")~"x2")) |>
    group_by(date,xgroup) |>
    summarise(xsums = sum(x.values)) |> 
    pivot_wider(values_from = xsums, names_from = xgroup )
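
With stems running up to x2000, as the question's comment notes, one case_when branch per stem is impractical. A possible variation, offered as a sketch rather than as the answerer's method, derives the group from the column name by stripping everything from the first underscore onward. It assumes every column other than date is named stem or stem_suffix. The test data is the example given in the comments on this answer.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Example from the comments: x100 and x201_1 must not be merged into x1 / x2
data <- tibble(date   = c("a", "b", "c"),
               x1     = c(0.1, 0.2, 0.3),
               x2     = c(1, 2, 3),
               x1_1   = c(0.1, 0.2, 0.3),
               x2_1   = c(1, 2, 3),
               x100   = c(10, 20, 30),
               x201_1 = c(100, 200, 300))

result <- data |>
  pivot_longer(cols = -date, values_to = "x.values") |>
  # x1_1 -> x1, x201_1 -> x201; names without "_" are left unchanged
  mutate(xgroup = str_remove(name, "_.*$")) |>
  group_by(date, xgroup) |>
  summarise(xsums = sum(x.values), .groups = "drop") |>
  pivot_wider(names_from = xgroup, values_from = xsums)

result
```

Here x100 and x201 each get their own column, because the group is the exact stem rather than a two-character prefix.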

7 Comments

This one works well too. However, I have the same issue with this as with Gregor's solution. i.e. While this works fine for the example above, I run into an issue with my actual dataset. It seems that for x1 for example, the sum includes not only x1, x1_1 and x1_2 but also any column that starts with x1 (e.g. x10, x10_1, x11, x100, etc.). Any thoughts on how to resolve this?
OK I see. What about x2 group? Does it also include every column that starts with x2 too?
That's correct, if I run the code for the following df 'data <- tibble(date = c('a','b','c'), x1 = c(0.1, 0.2, 0.3), x2 = c(1, 2, 3), x1_1 = c(0.1, 0.2, 0.3), x2_1 = c(1, 2, 3), x100 = c(10, 20, 30), x201_1 = c(100, 200, 300))' The result table has two columns x1 (which is the sum of x1,x1_1 and x100) and x2 (which is the sum of x2, x2_1 and x201_1)
I have updated my answer. Please check if the updated code can work accurately as you expected.
Pivoting was my first instinct as well, but with such wide data (10k columns is a lot of columns!) this could be very inefficient memory-wise. It would be my first choice method if the data wasn't so wide.
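
For data this wide, one option that avoids the long intermediate table is base R's rowsum(), which sums the rows of a matrix by group. The sketch below is an alternative to the pivoting approach, not part of the answer. It transposes the value columns so that each original column becomes a row, sums those rows by stem, and transposes back. It assumes all non-date columns are numeric and are named stem or stem_suffix.

```r
# Example from the comments above
data <- data.frame(
  date   = c("a", "b", "c"),
  x1     = c(0.1, 0.2, 0.3),
  x2     = c(1, 2, 3),
  x1_1   = c(0.1, 0.2, 0.3),
  x2_1   = c(1, 2, 3),
  x100   = c(10, 20, 30),
  x201_1 = c(100, 200, 300)
)

value_cols <- setdiff(names(data), "date")
stems <- sub("_.*$", "", value_cols)   # x1_1 -> x1, x201_1 -> x201

m <- as.matrix(data[value_cols])

# Each row of t(m) is one original column; rowsum() adds rows sharing a stem.
# reorder = FALSE keeps the stems in order of first appearance.
sums <- t(rowsum(t(m), group = stems, reorder = FALSE))

result <- data.frame(date = data$date, sums, check.names = FALSE)
result
```

The grouping uses the exact stem, so x100 and x201 are kept apart from x1 and x2, and the only large objects created are the numeric matrix and its transpose.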