463

I have an R data frame with 6 columns, and I want to create a new data frame that only has three of the columns.

Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:

data.frame(df$A,df$B,df$E)

Is there a more compact way of doing this?

1
  • 8
    select(df, c('A','B','C')) Commented Jul 26, 2022 at 9:02

12 Answers

534
Answer recommended by R Language Collective

You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.

# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]

Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because the comma form drops to a vector when only one column is selected: df[,"A"] returns a vector, not a data frame. But df["A"] will always return a data frame.

str(df["A"])
## 'data.frame':    1 obs. of  1 variable:
## $ A: int 1
str(df[,"A"])  # vector
##  int 1

Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).

# subset (original solution--not recommended)
df[,c("A","B","E")]  # returns a data.frame
df[,"A"]             # returns a vector
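Since the main argument for character-vector subsetting is that it works well inside functions, here is a minimal sketch of that use. The helper name select_cols and the data frame name dat are made up for illustration and are not part of the original answer; the check for unknown names is an added assumption about what you would want.

```r
# Hypothetical helper (not from the original answer): keep only the
# requested columns and fail early if any name is not present.
select_cols <- function(data, cols) {
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  data[cols]  # no comma, so a data frame is returned even for one column
}

dat <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
wanted <- c("A", "B", "E")   # column names held in an ordinary variable
res <- select_cols(dat, wanted)
names(res)
## [1] "A" "B" "E"
```

Because the column names are plain strings, they can be built, passed around, and validated like any other vector.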

14 Comments

That gives the error object of type 'closure' is not subsettable.
@ArenCambre: then your data.frame isn't really named df. df is also a function in the stats package.
@Cina: Because -"A" is an error (the unary minus operator cannot be applied to a character string). And ?Extract says, "i, j, ... can also be negative integers, indicating elements/slices to leave out of the selection."
There is an issue with this syntax: if we extract only one column, R returns a vector instead of a data frame, and this could be unwanted: > df[,c("A")] [1] 1. Using subset doesn't have this disadvantage.
262

Using the dplyr package, if your data.frame is called df1:

library(dplyr)

df1 %>%
  select(A, B, E)

This can also be written without the %>% pipe as:

select(df1, A, B, E)
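If the column names are stored in a character vector rather than typed as bare names, one possible sketch is to wrap the vector in all_of(), which dplyr re-exports from tidyselect (this assumes dplyr 1.0.0 or later). The example data and the variable name cols are assumptions for illustration.

```r
library(dplyr)

# Example data (assumed, not from the answer)
df1 <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8, E = 9:10)

# Column names held as strings
cols <- c("A", "B", "E")

# all_of() tells select() to treat `cols` as a vector of names
# and to raise an error if any of them is missing
out <- df1 %>% select(all_of(cols))
names(out)
## [1] "A" "B" "E"
```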

8 Comments

Given the considerable evolution of the Tidyverse since posting my question, I've switched the answer to you.
Given the furious rate of change in the tidyverse, I would caution against using this pattern. This is in addition to my strong preference against treating column names as if they are object names when writing code for functions, packages, or applications.
It has been over four years since this answer was submitted, and the pattern hasn't changed. Piped expressions can be quite intuitive, which is why they are appealing.
You'd chain together a pipeline like: df1 %>% select(A, B, E) %>% rowMeans(.). See the documentation for the %>% pipe by typing ?magrittr::`%>%`
This is a useful solution, but for the example given in the question, Josh's answer is more readable, faster, and dependency free. I hope new users learn square bracket subsetting before diving into the tidyverse :)!
117

This is the role of the subset() function:

> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> subset(dat, select=c("A", "B"))
  A B
1 1 3
2 2 4
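A small sketch of how subset() behaves when only one column is selected, relying on its documented drop argument (default FALSE). The data and the variable names one_col and as_vec are illustrative assumptions.

```r
dat <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6))

# Default drop = FALSE: a single selected column stays a data frame
one_col <- subset(dat, select = "A")
is.data.frame(one_col)
## [1] TRUE

# Asking for drop = TRUE gives the underlying vector instead
as_vec <- subset(dat, select = "A", drop = TRUE)
is.data.frame(as_vec)
## [1] FALSE
```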

4 Comments

When I try this, with my data, I get the error: " Error in x[j] : invalid subscript type 'list' " But if c("A", "B") isn't a list, what is it?
@Rafael_Espericueta Hard to guess without viewing your code... But c("A", "B") is a vector, not a list.
It converts the data frame to a list.
subset() works with bare variable names too: subset(dat, select = c(A, B)). Here A and B are treated as numeric indices, similar to what tidy selection does.
92

There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or

df[,c(1,2,5)]

as in

> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> df
  A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
  A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
  A B E
1 1 3 8
2 2 4 8
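Hard-coded positions such as c(1,2,5) stop pointing at the right columns if the column order changes. As a hedged sketch, the positions can be looked up from the names with match(); the names dat and idx are assumptions for illustration.

```r
dat <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6),
                  D = c(7, 7), E = c(8, 8), F = c(9, 9))

# Look up the positions of the wanted columns instead of typing them
idx <- match(c("A", "B", "E"), names(dat))
idx
## [1] 1 2 5

by_pos <- dat[, idx]
names(by_pos)
## [1] "A" "B" "E"
```

Note that match() returns NA for a name that is not found, so it is worth checking the result before using it as an index.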


23

Where df1 is your original data frame:

df2 <- subset(df1, select = c(1, 2, 5))

1 Comment

This doesn't use dplyr. It uses base::subset, and is identical to Stephane Laurent's answer except that you use column numbers instead of column names.
22

For some reason only

df[, (names(df) %in% c("A","B","E"))]

worked for me. All of the above syntaxes yielded "undefined columns selected".
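One possible explanation, offered as an assumption since the original data is not shown: at least one requested name is not actually a column (a typo, different case, or stray whitespace). Selecting by name then fails, while the %in% form silently skips the missing name. A sketch with made-up data:

```r
# Made-up data in which column E does not exist
dat <- data.frame(A = 1:2, B = 3:4, D = 5:6)
wanted <- c("A", "B", "E")

# Which requested names are not columns of dat?
setdiff(wanted, names(dat))
## [1] "E"

# The logical-vector form keeps whatever matches and ignores the rest
kept <- dat[, names(dat) %in% wanted]
names(kept)
## [1] "A" "B"

# dat[, wanted] would stop with "undefined columns selected"
```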


15

You can also use the sqldf package, which runs SQL select statements on R data frames:

library(sqldf)
df1 <- sqldf("select A, B, E from df")

This gives as output a data frame df1 with columns A, B, and E.


5

You can use with():

with(df, data.frame(A, B, E))


1

Sometimes it is easier to remove the columns you do not want than to select the ones you do. In base R this can be done with the - operator for indexes, with setdiff or subset for names, or with ! for logical vectors:

# Column index
df[-c(3, 4)]

# Column name
subset(df, select = -c(C, D))
df[setdiff(names(df), c("C", "D"))]

# Logical vector
df[!names(df) %in% c("C", "D")]
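To check that the four forms above keep the same columns, here is a short sketch on a made-up five-column data frame (the names dat and results are assumptions for illustration):

```r
dat <- data.frame(A = 1, B = 2, C = 3, D = 4, E = 5)

# Apply each removal style and collect the results
results <- list(
  by_index   = dat[-c(3, 4)],
  by_subset  = subset(dat, select = -c(C, D)),
  by_setdiff = dat[setdiff(names(dat), c("C", "D"))],
  by_logical = dat[!names(dat) %in% c("C", "D")]
)

# Every variant should keep columns A, B and E
sapply(results, function(x) paste(names(x), collapse = ","))
```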


0

[ and subset() are not interchangeable: unlike subset(), [ used with a comma returns a vector if only one column is selected.

df = data.frame(a="a",b="b")    

identical(
  df[,c("a")], 
  subset(df,select="a")
) 

identical(
  df[,c("a","b")],  
  subset(df,select=c("a","b"))
)
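For completeness, a sketch of two ways to keep the single-column result as a data frame with [, using the standard drop argument or the no-comma form. The variable names are assumptions for illustration.

```r
dat <- data.frame(a = "a", b = "b")

v  <- dat[, "a"]                # comma form: dropped to a vector
d1 <- dat[, "a", drop = FALSE]  # comma form with drop = FALSE
d2 <- dat["a"]                  # list-style form, no comma

is.data.frame(v)
## [1] FALSE
is.data.frame(d1)
## [1] TRUE
is.data.frame(d2)
## [1] TRUE
```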

1 Comment

Not if you set drop=FALSE. Example: df[,c("a"),drop=F]
0

TL;DR

If you are using a tibble (commonly used in the tidyverse) you can safely do any of the following to select columns and you will get a tibble back:

library(tibble)
tb <- tibble(A = 1:2, B = 3:4)

# By index
tb[1]
tb[, 1]

tb[1:2]
tb[, 1:2]


# By name
tb["A"]
tb[, "A"]

tb[c("A", "B")]
tb[, c("A", "B")]

This is in addition to the answer given by @Sam Firke which uses the popular select() verb for column selection.

You can use any of these selection operators on base R data frames, but know there are some cases where you should specify drop = FALSE.


There is already some discussion about tidyverse versus base R in other answers, but hopefully this adds something.

You can see from the documentation ?`[.data.frame` (and the answer from @Joshua Ulrich) that data frame columns can be selected in several ways, and that the result depends on the drop argument:

If TRUE the result is coerced to the lowest possible dimension. The default is to drop if only one column is left, but not to drop if only one row is left.

If a single vector is given, then columns are indexed and selection behaves like list selection (the drop argument of [ is ignored). In this case, a data frame is always returned:

df <- data.frame(A = 1:2, B = 3:4)

str(df[1])
# 'data.frame': 2 obs. of  1 variable:
#  $ A: int  1 2

str(df[1:2])
# 'data.frame': 2 obs. of  2 variables:
#  $ A: int  1 2
#  $ B: int  3 4

str(df[c("A", "B")])
# 'data.frame': 2 obs. of  2 variables:
#  $ A: int  1 2
#  $ B: int  3 4

However, if two indices are given ([row, column]) then selection behaves more like matrix selection. In this case the default argument of [ is drop = TRUE, so the result is coerced to the lowest possible dimension, but only if a single column is left:

str(df[1, ]) # single row selection (does not reduce dimension)
# 'data.frame': 1 obs. of  2 variables:
#  $ A: int 1
#  $ B: int 3

str(df[, 1]) # single column selection (does reduce dimension)
# int [1:2] 1 2

Of course you can always change the default behavior by setting drop = FALSE:

str(df[, 1, drop = FALSE])
# 'data.frame': 2 obs. of  1 variable:
#  $ A: int  1 2

In the tidyverse, tibbles are preferred. They are like data frames, but have a few significant differences -- one being column selection. Column selection with [ on a tibble never reduces dimensionality, as shown below:

library(tibble)

tb <- as_tibble(df)
class(tb)
# [1] "tbl_df"     "tbl"        "data.frame"

str(tb[, 1])
# tibble [2 × 1] (S3: tbl_df/tbl/data.frame)
#  $ A: int [1:2] 1 2

str(tb[1])
# tibble [2 × 1] (S3: tbl_df/tbl/data.frame)
#  $ A: int [1:2] 1 2

All the other tibble column selection works as you would expect (above only shows by index, but you can select by name too).


-2
df <- dplyr::select(df, A, B, C)

You can also assign the result to a new name instead of overwriting df:

data <- dplyr::select(df, A, B, C)

1 Comment

This was already in the accepted answer
