How to count distinct values for multiple columns in R for a grouped data frame

Question

I have a data frame like this:

v_1 <- c("1a", "1b","1c", "2a", "2b", "2c", "3a", "3b","3c", "4a", "4b", "4c")
v_2 <- c(1,1,1,2,2,2,3,3,3,4,4,4)
v_3 <- c("dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "cat", "cat", "cat", "cat")
v_4 <- c(1:12)

df <- data.frame(v_1, v_2, v_3, v_4)
df
   v_1 v_2 v_3 v_4
1   1a   1 dog   1
2   1b   1 dog   2
3   1c   1 dog   3
4   2a   2 dog   4
5   2b   2 dog   5
6   2c   2 dog   6
7   3a   3 cat   7
8   3b   3 cat   8
9   3c   3 cat   9
10  4a   4 cat  10
11  4b   4 cat  11
12  4c   4 cat  12

I want to group this data frame and count the distinct values for v_1 and v_2. If I am just interested in the count in v_1 it is quite easy. I just do:

library(dplyr)

df_grouped <- df %>% 
  group_by(v_3) %>% 
  summarise(v_4_sum = sum(v_4), 
            v_1_count = n())

  v_3   v_4_sum v_1_count
  <chr>   <int>     <int>
1 cat        57         6
2 dog        21         6

If I want to also see te count of v_2 it seems like I have to use group_by two times like this:

df_grouped_v2 <- df %>% 
  group_by(v_2, v_3) %>% 
  summarise(v_4_sum = sum(v_4), 
            v_1_count = n())

df_grouped_v22 <- df_grouped_v2 %>% 
  group_by(v_3) %>% 
  summarise(v_4_sum = sum(v_4_sum), 
            v_1_count = sum(v_1_count), 
            v_2_count = n())

df_grouped_v22

     v_3   v_4_sum v_1_count v_2_count
      <chr>   <int>     <int>     <int>
    1 cat        57         6         2
    2 dog        21         6         2

This is the result I want but it seems not straight forward. Especially if I have a huge data frame the group_by operation is time intensive and I would prefer to use it just one time.

iroha · Accepted Answer · 2023-03-17 15:13:12Z

2

For distinct values, you can use n_distinct() rather than n():

library(dplyr)

df |>
  summarise(v_4_sum = sum(v_4),
            across(c(v_1, v_2), n_distinct, .names = "{.col}_count"),
            .by = v_3)

  v_3 v_4_sum v_1_count v_2_count
1 dog      21         6         2
2 cat      57         6         2

answered Mar 17, 2023 at 15:13

iroha

35.3k4 gold badges55 silver badges66 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

langtang · Accepted Answer · 2023-03-17 15:17:05Z

1

If you have a large table, probably shouldn't be using dplyr, but either way, you don't only need to group by v_3.

library(data.table)
setDT(df)
df[, .(v_4_sum = sum(v_4), v_1_count = uniqueN(v_1), v_2_count = uniqueN(v_2)), v_3]

Output:

      v_3 v_4_sum v_1_count v_2_count
   <char>   <int>     <int>     <int>
1:    dog      21         6         2
2:    cat      57         6         2

answered Mar 17, 2023 at 15:17

langtang

25.3k1 gold badge14 silver badges32 bronze badges

Collectives™ on Stack Overflow

How to count distinct values for multiple columns in R for a grouped data frame

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related