Extracting specific columns from a data frame

Question

I have an R data frame with 6 columns, and I want to create a new data frame that only has three of the columns.

Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:

data.frame(df$A,df$B,df$E)

Is there a more compact way of doing this?

select(df, c('A','B','C'))

– user2110417, Commented Jul 26, 2022 at 9:02

Joshua Ulrich · Accepted Answer · 2020-06-30 14:20:29Z

534

Answer recommended by R Language Collective

You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.

# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]

Note that there is no comma (i.e. it is not df[,c("A","B","E")]). With a comma, selecting a single column such as df[,"A"] returns a vector rather than a data frame, because drop = TRUE is the default. df["A"], by contrast, always returns a data frame.

str(df["A"])
## 'data.frame':    1 obs. of  1 variable:
## $ A: int 1
str(df[,"A"])  # vector
##  int 1

Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).

# subset (original solution--not recommended)
df[,c("A","B","E")]  # returns a data.frame
df[,"A"]             # returns a vector
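The preference for a character vector of names pays off when the selection happens inside a function. A minimal sketch of that idea follows; keep_cols() is a hypothetical helper written for illustration, not something from base R or from the answer:

```r
# Hypothetical helper (not part of base R): keep only the requested columns
# and fail early, with a readable message, if any name is misspelled.
keep_cols <- function(data, cols) {
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  data[cols]  # list-style indexing: always returns a data frame
}

df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
wanted <- c("A", "B", "E")   # names can come from a variable or an argument
res <- keep_cols(df, wanted)
names(res)
```

Because the names are ordinary strings, they can be passed around, validated, and built programmatically without any non-standard evaluation.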

edited Jun 30, 2020 at 14:20

answered Apr 10, 2012 at 2:44

Joshua Ulrich


14 Comments

Aren Cambre Over a year ago

That gives the error object of type 'closure' is not subsettable.

Joshua Ulrich Over a year ago

@ArenCambre: then your data.frame isn't really named df. df is also a function in the stats package.

tumultous_rooster Over a year ago

@ArenCambre: 2.bp.blogspot.com/-XU9PduVhq-I/Um-Y6e19jZI/AAAAAAAADfI/…

Joshua Ulrich Over a year ago

@Cina: Because -"A" is a syntax error. And ?Extract says, "i, j, ... can also be negative integers, indicating elements/slices to leave out of the selection."

David Dorchies Over a year ago

There is an issue with this syntax: if we extract only one column, R returns a vector instead of a data frame, which may be unwanted: > df[,c("A")] [1] 1. Using subset() doesn't have this disadvantage.


Sam Firke · Accepted Answer · 2015-04-19 21:19:17Z

262

Using the dplyr package, if your data.frame is called df1:

library(dplyr)

df1 %>%
  select(A, B, E)

This can also be written without the %>% pipe as:

select(df1, A, B, E)
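If the column names are held in a character vector rather than typed as bare names, select() can still be used. A minimal sketch, assuming dplyr 1.0.0 or later (which re-exports all_of() from tidyselect) and a made-up df1 and cols for illustration:

```r
library(dplyr)

# Made-up example data; the answer does not show the contents of df1
df1 <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8, E = 9:10)

# Column names stored as strings, e.g. passed in as a function argument
cols <- c("A", "B", "E")

# all_of() tells select() to treat `cols` as a vector of column names
# and to raise an error if any of them is missing from df1
res <- df1 %>% select(all_of(cols))
names(res)
```

This keeps the pipeline style while avoiding any ambiguity between a column called cols and a variable called cols.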

answered Apr 19, 2015 at 21:19

Sam Firke


8 Comments

Aren Cambre Over a year ago

Given the considerable evolution of the Tidyverse since posting my question, I've switched the answer to you.

Joshua Ulrich Over a year ago

Given the furious rate of change in the tidyverse, I would caution against using this pattern. This is in addition to my strong preference against treating column names as if they are object names when writing code for functions, packages, or applications.

Aren Cambre Over a year ago

It has been over four years since this answer was submitted, and the pattern hasn't changed. Piped expressions can be quite intuitive, which is why they are appealing.

Sam Firke Over a year ago

You'd chain together a pipeline like: df1 %>% select(A, B, E) %>% rowMeans(.). See the documentation for the %>% pipe by typing ?magrittr::`%>%`

moodymudskipper Over a year ago

This is a useful solution, but for the example given in the question, Josh's answer is more readable, faster, and dependency-free. I hope new users learn square bracket subsetting before diving into the tidyverse :)!


Uli Köhler · Accepted Answer · 2014-01-15 00:24:29Z

117

This is the role of the subset() function:

> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> subset(dat, select=c("A", "B"))
  A B
1 1 3
2 2 4
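For comparison, subset() also accepts unquoted names and name ranges in select, because it evaluates the expression with each column name standing for its position. A short sketch reusing the dat example above (the variable names by_string, by_name, and by_range are only for illustration):

```r
dat <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6),
                  D = c(7, 7), E = c(8, 8), F = c(9, 9))

# Quoted names, as in the answer
by_string <- subset(dat, select = c("A", "B", "E"))

# Unquoted names: A, B, E are evaluated as column positions 1, 2, 5
by_name <- subset(dat, select = c(A, B, E))

# Because names map to positions, a range of adjacent columns also works
by_range <- subset(dat, select = A:C)

names(by_string)
names(by_name)
names(by_range)
```

The unquoted forms are convenient interactively; inside functions the quoted form is usually easier to reason about.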

edited Jan 15, 2014 at 0:24

Uli Köhler


answered Apr 10, 2012 at 9:50

Stéphane Laurent


4 Comments

Rafael_Espericueta Over a year ago

When I try this, with my data, I get the error: " Error in x[j] : invalid subscript type 'list' " But if c("A", "B") isn't a list, what is it?

Stéphane Laurent Over a year ago

@Rafael_Espericueta Hard to guess without viewing your code... But c("A", "B") is a vector, not a list.

Suat Atan PhD Over a year ago

It converts the data frame to a list.

moodymudskipper Over a year ago

subset() works with bare variable names too: subset(dat, select = c(A, B)). A and B are treated here as numeric indices, similar to what tidy selection does.

Henry · Accepted Answer · 2012-04-10 06:49:54Z

92

There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or

df[,c(1,2,5)]

as in

> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> df
  A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
  A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
  A B E
1 1 3 8
2 2 4 8

answered Apr 10, 2012 at 6:49

Henry


Arthur Yip · Accepted Answer · 2019-03-07 04:24:23Z

23

Where df1 is your original data frame:

df2 <- subset(df1, select = c(1, 2, 5))

edited Mar 7, 2019 at 4:24

Arthur Yip


answered Jun 10, 2016 at 11:34

Richard Ball


1 Comment

Gregor Thomas Over a year ago

This doesn't use dplyr. It uses base::subset, and is identical to Stephane Laurent's answer except that you use column numbers instead of column names.

so860 · Accepted Answer · 2017-10-12 18:12:23Z

22

For some reason only

df[, (names(df) %in% c("A","B","E"))]

worked for me. All of the above syntaxes yielded "undefined columns selected".
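The answer does not show its data, so the cause is unknown, but one common way to get "undefined columns selected" is that a requested name does not exactly match any column (different case, stray whitespace, and so on). The %in% form never raises that error because it silently skips names that are absent. A sketch with a deliberately mismatched, made-up data frame:

```r
# Made-up data: the third column is lowercase "e", not "E"
df <- data.frame(A = 1:2, B = 3:4, e = 5:6)
wanted <- c("A", "B", "E")

# Which requested names are not actually columns?
not_found <- setdiff(wanted, names(df))
not_found

# Indexing by name fails because "E" does not exist
msg <- tryCatch(df[wanted], error = function(cond) conditionMessage(cond))
msg

# The logical-vector form succeeds, but quietly returns only A and B
res <- df[, names(df) %in% wanted]
names(res)
```

Checking setdiff(wanted, names(df)) first is usually a better fix than switching to %in%, since it reveals the mismatch instead of hiding it.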

answered Oct 12, 2017 at 18:12

so860


Gilad Green · Accepted Answer · 2018-04-20 16:57:16Z

15

You can also use the sqldf package, which runs SQL SELECT statements on R data frames:

library(sqldf)
df1 <- sqldf("select A, B, E from df")

The output is a data frame df1 with columns A, B, and E.

edited Apr 20, 2018 at 16:57

Gilad Green


answered Nov 30, 2016 at 8:00

Aman Burman


moodymudskipper · Accepted Answer · 2019-05-22 09:49:02Z

5

You can use with():

with(df, data.frame(A, B, E))

answered May 22, 2019 at 9:49

moodymudskipper


LMc · Accepted Answer · 2024-02-08 17:18:07Z

1

Sometimes it is easier to remove the columns you do not want than to select the ones you do. In base R this can be done with the - operator for indices, with setdiff() or subset() for names, or with ! for logical vectors:

# Column index
df[-c(3, 4)]

# Column name
subset(df, select = -c(C, D))
df[setdiff(names(df), c("C", "D"))]

# Logical vector
df[!names(df) %in% c("C", "D")]

answered Feb 8, 2024 at 17:18

LMc


fxi · Accepted Answer · 2016-11-09 15:32:24Z

0

[ and subset() are not interchangeable: with the [row, column] form, [ returns a vector when only one column is selected, whereas subset() returns a data frame.

df = data.frame(a="a",b="b")    

identical(
  df[,c("a")], 
  subset(df,select="a")
) 

identical(
  df[,c("a","b")],  
  subset(df,select=c("a","b"))
)

answered Nov 9, 2016 at 15:32

fxi


1 Comment

untill Over a year ago

Not if you set drop=FALSE. Example: df[,c("a"),drop=F]

LMc · Accepted Answer · 2024-02-07 23:19:40Z

TL;DR

If you are using a tibble (commonly used in the tidyverse) you can safely do any of the following to select columns and you will get a tibble back:

library(tibble)
tb <- tibble(A = 1:2, B = 3:4)

# By index
tb[1]
tb[, 1]

tb[1:2]
tb[, 1:2]


# By name
tb["A"]
tb[, "A"]

tb[c("A", "B")]
tb[, c("A", "B")]

This is in addition to the answer given by @Sam Firke which uses the popular select() verb for column selection.

You can use any of these selection operators on base R data frames as well, but with the [row, column] form you need to specify drop = FALSE to keep a single selected column as a data frame.

There is already some discussion about tidyverse versus base R in other answers, but hopefully this adds something.

You can see from the documentation ?`[.data.frame` (and the answer from @Joshua Ulrich) that data frame columns can be selected in several ways, and that the type of the result depends on the drop argument, which is documented as:

If TRUE the result is coerced to the lowest possible dimension. The default is to drop if only one column is left, but not to drop if only one row is left.

If a single vector is given, then columns are indexed and selection behaves like list selection (the drop argument of [ is ignored). In this case, a data frame is always returned:

df <- data.frame(A = 1:2, B = 3:4)

str(df[1])
# 'data.frame': 2 obs. of  1 variable:
#  $ A: int  1 2

str(df[1:2])
# 'data.frame': 2 obs. of  2 variables:
#  $ A: int  1 2
#  $ B: int  3 4

str(df[c("A", "B")])
# 'data.frame': 2 obs. of  2 variables:
#  $ A: int  1 2
#  $ B: int  3 4

However, if two indices are given ([row, column]), selection behaves more like matrix selection. In this case the default is drop = TRUE, so the result is coerced to the lowest possible dimension, but only when a single column is left:

str(df[1, ]) # single row selection (does not reduce dimension)
# 'data.frame': 1 obs. of  2 variables:
#  $ A: int 1
#  $ B: int 3

str(df[, 1]) # single column selection (does reduce dimension)
# int [1:2] 1 2

Of course you can always change the default behavior by setting drop = FALSE:

str(df[, 1, drop = FALSE])
# 'data.frame': 2 obs. of  1 variable:
#  $ A: int  1 2

In the tidyverse, tibbles are preferred. They are like data frames, but have a few significant differences, one being column selection. Selecting columns from a tibble with [ never reduces dimensionality, even in the [row, column] form:

library(tibble)

tb <- as_tibble(df)
class(tb)
# [1] "tbl_df"     "tbl"        "data.frame"

str(tb[, 1])
# tibble [2 × 1] (S3: tbl_df/tbl/data.frame)
#  $ A: int [1:2] 1 2

str(tb[1])
# tibble [2 × 1] (S3: tbl_df/tbl/data.frame)
#  $ A: int [1:2] 1 2

All the other tibble column selection works as you would expect (above only shows by index, but you can select by name too).
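Since [ on a tibble always returns a tibble, getting a plain vector out of one takes [[ or $ instead. A brief sketch, assuming the tibble package is installed (the variable names one_col, vec1, and vec2 are only for illustration):

```r
library(tibble)

tb <- tibble(A = 1:2, B = 3:4)

# Single-bracket selection keeps the tibble structure
one_col <- tb["A"]
is.data.frame(one_col)

# Double brackets and $ extract the column itself as a vector
vec1 <- tb[["A"]]
vec2 <- tb$A
is.integer(vec1)
identical(vec1, vec2)
```

The same [[ and $ extraction works on base R data frames, so code written this way behaves the same for both.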

Mohamed Rahouma · Accepted Answer · 2019-10-15 19:54:27Z

-2

df <- dplyr::select(df, A, B, C)

You can also assign the result to a different name:

data <- dplyr::select(df, A, B, C)

answered Oct 15, 2019 at 19:54

Mohamed Rahouma


1 Comment

camille Over a year ago

This was already in the accepted answer
