1

I have a data frame like this:

v_1 <- c("1a", "1b","1c", "2a", "2b", "2c", "3a", "3b","3c", "4a", "4b", "4c")
v_2 <- c(1,1,1,2,2,2,3,3,3,4,4,4)
v_3 <- c("dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "cat", "cat", "cat", "cat")
v_4 <- c(1:12)

df <- data.frame(v_1, v_2, v_3, v_4)
df
   v_1 v_2 v_3 v_4
1   1a   1 dog   1
2   1b   1 dog   2
3   1c   1 dog   3
4   2a   2 dog   4
5   2b   2 dog   5
6   2c   2 dog   6
7   3a   3 cat   7
8   3b   3 cat   8
9   3c   3 cat   9
10  4a   4 cat  10
11  4b   4 cat  11
12  4c   4 cat  12

I want to group this data frame and count the distinct values for v_1 and v_2. If I am just interested in the count in v_1 it is quite easy. I just do:

library(dplyr)

df_grouped <- df %>% 
  group_by(v_3) %>% 
  summarise(v_4_sum = sum(v_4), 
            v_1_count = n())

  v_3   v_4_sum v_1_count
  <chr>   <int>     <int>
1 cat        57         6
2 dog        21         6

If I want to also see te count of v_2 it seems like I have to use group_by two times like this:

df_grouped_v2 <- df %>% 
  group_by(v_2, v_3) %>% 
  summarise(v_4_sum = sum(v_4), 
            v_1_count = n())

df_grouped_v22 <- df_grouped_v2 %>% 
  group_by(v_3) %>% 
  summarise(v_4_sum = sum(v_4_sum), 
            v_1_count = sum(v_1_count), 
            v_2_count = n())

df_grouped_v22

     v_3   v_4_sum v_1_count v_2_count
      <chr>   <int>     <int>     <int>
    1 cat        57         6         2
    2 dog        21         6         2

This is the result I want but it seems not straight forward. Especially if I have a huge data frame the group_by operation is time intensive and I would prefer to use it just one time.

2 Answers 2

2

For distinct values, you can use n_distinct() rather than n():

library(dplyr)

df |>
  summarise(v_4_sum = sum(v_4),
            across(c(v_1, v_2), n_distinct, .names = "{.col}_count"),
            .by = v_3)

  v_3 v_4_sum v_1_count v_2_count
1 dog      21         6         2
2 cat      57         6         2
Sign up to request clarification or add additional context in comments.

Comments

1

If you have a large table, probably shouldn't be using dplyr, but either way, you don't only need to group by v_3.

library(data.table)
setDT(df)
df[, .(v_4_sum = sum(v_4), v_1_count = uniqueN(v_1), v_2_count = uniqueN(v_2)), v_3]

Output:

      v_3 v_4_sum v_1_count v_2_count
   <char>   <int>     <int>     <int>
1:    dog      21         6         2
2:    cat      57         6         2

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.