R sum row values based on column name

Question

I have a dataset with over 10,000 columns and 10,000 rows. I am trying to sum values within each row across columns, grouping the columns by name.

The dataset looks something like this:

library(tibble)

data <- tibble(date = c('1/1/2018','2/1/2018','3/1/2018'),
              x1 = c(1, 11, 111),
              x2 = c(2, 22, 222),
              x1_1 = c(3, 333, 333),
              x2_1 = c(4, 44, 44),
              x1_2 = c(5, 55, 555),
              x2_2 = c(6, 66, 666))

I am trying to create a new table which includes the date column, an x1 column and an x2 column where the value of x1 for row 1 = 1+3+5, value of x2 for row 2 = 22+44+66, etc.

Any help would be much appreciated.

I should have mentioned that column names in my actual dataset are not limited to x1 and x2. They go up to x2000, so naming columns individually would not be possible.
– R.Ha, Commented May 11, 2022 at 2:28

Gregor Thomas · Accepted Answer · 2022-05-11 13:32:21Z

1

Here's a for loop approach. I use stringr but we could just as easily use base regex functions to keep it dependency-free.

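library(stringr)

# Unique column stems: drop everything from the first "_" on (x1_2 -> x1)
name_stems = unique(str_replace(names(data)[-1], "_.*", ""))
result = data[, "date", drop = FALSE]
for(i in seq_along(name_stems)) {
  # Match the stem alone or followed by "_", so that x1 picks up
  # x1, x1_1 and x1_2 but not x10 or x100_1
  result[[name_stems[i]]] = 
    rowSums(data[
      str_detect(
        names(data),
        pattern = paste0("^", name_stems[i], "(_|$)")
      )
    ])
}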

result
# # A tibble: 3 × 3
#   date        x1    x2
#   <chr>    <dbl> <dbl>
# 1 1/1/2018     9    12
# 2 2/1/2018   399   132
# 3 3/1/2018   999   932
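
For reference, the same loop can be sketched with base regex functions only. This is a minimal sketch, not the answerer's code: it assumes the toy data from the question stored as a plain data.frame, and it assumes every column other than date is named as a stem optionally followed by "_" and a suffix.

```r
# Toy data from the question, as a plain data.frame (no packages needed)
data <- data.frame(
  date = c("1/1/2018", "2/1/2018", "3/1/2018"),
  x1 = c(1, 11, 111), x2 = c(2, 22, 222),
  x1_1 = c(3, 333, 333), x2_1 = c(4, 44, 44),
  x1_2 = c(5, 55, 555), x2_2 = c(6, 66, 666)
)

# Stems: strip everything from the first "_" onward ("x1_2" -> "x1")
name_stems <- unique(sub("_.*", "", names(data)[-1]))

result <- data["date"]
for (stem in name_stems) {
  # Assumed pattern: the stem alone or followed by "_", so "x1" does not
  # also pick up "x10" or "x100_1"
  cols <- grepl(paste0("^", stem, "(_|$)"), names(data))
  result[[stem]] <- rowSums(data[, cols, drop = FALSE])
}
result
```

Here sub() and grepl() stand in for str_replace() and str_detect(); the rest of the logic is unchanged.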

edited May 11, 2022 at 13:32

answered May 11, 2022 at 2:04

Gregor Thomas

3 Comments

R.Ha Over a year ago

While this works fine for the example above, I run into an issue with my actual dataset. It seems that for x1 for example, the rowSums includes not only x1, x1_1 and x1_2 but also any column that starts with x1 (e.g. x10, x10_1, x11, x100, etc.). Any thoughts on how to resolve this?

Gregor Thomas Over a year ago

Sure, we can include the _ in the pattern. Editing now.

R.Ha Over a year ago

Thanks Gregor. I've added a line of code after line 3 to append a "_" to the name of any column (except date) that doesn't already contain a "_", as the above code does not include x1 in the sum, just x1_1, x1_2, etc.: data <- rename_with(data, .fn = ~paste0(., "_"), .cols = c(-contains("_"),-contains("date")))

jlhoward · Accepted Answer · 2022-05-11 02:32:50Z

0

Using data.table:

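library(data.table)

baseCols <- paste0('x', 1:2)
# Anchor each stem so that x1 matches x1, x1_1, x1_2 but not x10 or x100_1
colPatterns <- paste0('^', baseCols, '(_|$)')
# Melt each group of columns into one value column named after its stem
result <- setDT(data) |> melt(measure.vars = patterns(colPatterns), value.name = baseCols)
# Sum each stem's values within each date
result[, lapply(.SD, sum), by=.(date), .SDcols=baseCols]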
##        date  x1  x2
## 1: 1/1/2018   9  12
## 2: 2/1/2018 399 132
## 3: 3/1/2018 999 932

answered May 11, 2022 at 2:32

jlhoward

Abdur Rohman · Accepted Answer · 2022-05-11 06:49:10Z

0

Your data is in wide format. One way to achieve your goal is to transform the data into long format, group the rows by index (x1 and x2), compute the sum for each group and each date, and finally transform the result back to wide format so that each index becomes a column.

library(tidyverse)

data |> 
    pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
    mutate(xgroup = substr(name, 1,2)) |> 
    group_by(date,xgroup) |>
    summarise(xsums = sum(x.values)) |> 
    pivot_wider(values_from = xsums, names_from = xgroup )

#  date        x1    x2
#  <chr>    <dbl> <dbl>
#1 1/1/2018     9    12
#2 2/1/2018   399   132
#3 3/1/2018   999   932

Updates

To include only column x1 and the columns starting with x1_, and to exclude any other column that merely starts with x1 (such as x10 or x100_1), the following regular expression pattern can be used: "x1$|(x1_).*". A similar pattern can be used to include only x2 and the columns starting with x2_. For example:

s <- c("x100_1", "x10", "x1", "x1_1", "x1_2", "x2", "x2_1", "x2_2", "x20", "x20_1")
s
#[1] "x100_1" "x10"    "x1"     "x1_1"   "x1_2"   "x2"     "x2_1"   "x2_2"   "x20"   
#[10] "x20_1" 

s |> str_extract("x1$|(x1_).*")
#[1] NA     NA     "x1"   "x1_1" "x1_2" NA     NA     NA     NA     NA

s |> str_extract("x2$|(x2_).*")
#[1] NA     NA     NA     NA     NA     "x2"   "x2_1" "x2_2" NA     NA

This pattern can then be used to create a group that consists of x1 and x1_ columns only and another group that consists of x2 and x2_ columns only.
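
With stems running up to x2000, the pattern could also be built from the stem rather than typed out. A minimal sketch in base R, where stem_pattern() is a hypothetical helper (not part of this answer) and the leading "^" anchor is an added assumption that every column name begins with its stem:

```r
# Hypothetical helper: regex matching a stem alone or followed by "_"
stem_pattern <- function(stem) paste0("^", stem, "(_|$)")

s <- c("x100_1", "x10", "x1", "x1_1", "x1_2", "x2", "x2_1", "x2_2", "x20", "x20_1")

# Keep only the names that belong to each stem
x1_cols <- s[grepl(stem_pattern("x1"), s)]
x2_cols <- s[grepl(stem_pattern("x2"), s)]

x1_cols
#[1] "x1"   "x1_1" "x1_2"
x2_cols
#[1] "x2"   "x2_1" "x2_2"
```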

Here is the full code:

data |> 
    pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
    mutate(xgroup = case_when(str_detect(name, "x1$|(x1_).*")~"x1",
                              str_detect(name, "x2$|(x2_).*")~"x2")) |>
    group_by(date,xgroup) |>
    summarise(xsums = sum(x.values)) |> 
    pivot_wider(values_from = xsums, names_from = xgroup )
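
Since the question mentions stems up to x2000, one case_when() branch per stem would not scale. A possible variation, sketched under the assumption that every column other than date is named as a stem optionally followed by "_" and a suffix, derives the group by deleting everything from the first "_" onward. The extra columns x10 and x100_1 are made-up test data added here to show that they are not merged into x1. Pivoting data this wide may be memory-hungry, so treat this as an illustration rather than a tuned solution.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy data from the question plus two made-up columns (x10, x100_1)
data <- tibble(date = c('1/1/2018','2/1/2018','3/1/2018'),
               x1 = c(1, 11, 111),
               x2 = c(2, 22, 222),
               x1_1 = c(3, 333, 333),
               x2_1 = c(4, 44, 44),
               x1_2 = c(5, 55, 555),
               x2_2 = c(6, 66, 666),
               x10 = c(7, 77, 777),
               x100_1 = c(8, 88, 888))

result <- data |>
    pivot_longer(cols = -date, values_to = "x.values") |>
    # Group label is the exact stem: "x1_2" -> "x1", "x100_1" -> "x100"
    mutate(xgroup = str_remove(name, "_.*")) |>
    group_by(date, xgroup) |>
    summarise(xsums = sum(x.values), .groups = "drop") |>
    pivot_wider(values_from = xsums, names_from = xgroup)

result
```

Because the group label is the exact stem rather than a prefix match, x10 and x100_1 each end up in their own column.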

edited May 11, 2022 at 6:49

answered May 11, 2022 at 2:13

Abdur Rohman

7 Comments

R.Ha Over a year ago

This one works well too. However, I have the same issue with this as with Gregor's solution. i.e. While this works fine for the example above, I run into an issue with my actual dataset. It seems that for x1 for example, the sum includes not only x1, x1_1 and x1_2 but also any column that starts with x1 (e.g. x10, x10_1, x11, x100, etc.). Any thoughts on how to resolve this?

Abdur Rohman Over a year ago

OK I see. What about x2 group? Does it also include every column that starts with x2 too?

R.Ha Over a year ago

That's correct, if I run the code for the following df 'data <- tibble(date = c('a','b','c'), x1 = c(0.1, 0.2, 0.3), x2 = c(1, 2, 3), x1_1 = c(0.1, 0.2, 0.3), x2_1 = c(1, 2, 3), x100 = c(10, 20, 30), x201_1 = c(100, 200, 300))' The result table has two columns x1 (which is the sum of x1,x1_1 and x100) and x2 (which is the sum of x2, x2_1 and x201_1)

Abdur Rohman Over a year ago

I have updated my answer. Please check if the updated code can work accurately as you expected.

Gregor Thomas Over a year ago

Pivoting was my first instinct as well, but with such wide data (10k columns is a lot of columns!) this could be very inefficient memory-wise. It would be my first choice method if the data wasn't so wide.
