39

I have data like this, where some "name" occurs more than three times:

df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

  name x
1    a 1
2    a 2
3    a 3
4    b 4
5    b 5
6    c 6
7    c 7
8    c 8
9    c 9

I wish to subset (filter) the data based on number of rows (observations) within each level of the name variable. If a certain level of name occurs more than say 3 times, I want to remove all rows belonging to that level. So in this example, we would drop observations where name == c, since there are > 3 rows in that group:

  name x
1    a 1
2    a 2
3    a 3
4    b 4
5    b 5

I wrote this code, but can't get it to work.

as.data.frame(table(unique(df)$name))
subset(df, name > 3)
0

4 Answers 4

61

First, two base alternatives. One relies on table, and the other on ave and length. Then, two data.table ways.


1. table

tt <- table(df$name)

df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

If you want to walk it through step by step:

# count each 'name', assign result to an object 'tt'
tt <- table(df$name)

# which 'name' in 'tt' occur more than three times?
# Result is a logical vector that can be used to subset the table 'tt'
tt < 3

# from the table, select 'name' that occur < 3 times
tt[tt < 3]

# ...their names
names(tt[tt < 3])

# rows of 'name' in the data frame that matches "the < 3 names"
# the result is a logical vector that can be used to subset the data frame 'df'
df$name %in% names(tt[tt < 3])

# subset data frame by a logical vector
# 'TRUE' rows are kept, 'FALSE' rows are removed.
# assign the result to a data frame with a new name
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

2. ave and length

As suggested by @flodel:

df[ave(df$x, df$name, FUN = length) < 3, ]

3. data.table: .N and .SD:

library(data.table)
setDT(df)[, if (.N < 3) .SD, by = name]

4. data.table: .N and .I:

setDT(df)
df[df[, .I[.N < 3], name]$V1] 

See also the related Q&A Count number of observations/rows per group and add result to data frame.

Sign up to request clarification or add additional context in comments.

Comments

47

Using the dplyr package:

df %>%
  group_by(name) %>%
  filter(n() < 4)

# A tibble: 5 x 2
# Groups:   name [2]
  name      x
  <fct> <int>
1 a         1
2 a         2
3 a         3
4 b         4
5 b         5

n() returns the number of observations in the current group, so we can group_by name, and then keep only those rows which are part of a group where the number of rows in that group is less than 4.

Comments

4

Yet another way using the dpylr package is using the count function and then doing a semi join on the original data frame:

library(dplyr)

df %>% 
  count(name) %>%
  filter(n <= 3) %>%
  semi_join(df, ., by = "name")

Comments

1

Package "inops" has some useful infix operators. For this particular case the operator %in#% can select elements based on how many times they occur.

library(inops)

df[df$name %in#% 1:3,]

Which returns:

  name x
1    a 1
2    a 2
3    a 3
4    b 4
5    b 5

Here df$name %in#% 1:3 returns TRUE only for elements that occur 1, 2, or 3 times. If instead we wanted to select elements that occur 4 times we would do:

df[df$name %in#% 4,]

With the following result:

  name x
6    c 6
7    c 7
8    c 8
9    c 9

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.