merge dataframes based on common columns but keeping all rows from x [duplicate]

Question

I need to merge two dataframes x and y which have about 50 columns in common and some unique columns, and I need to keep all the rows from x.

It works if I run:

 NewDataframe <- merge(x, y, by=c("ColumnA", "ColumnB", "ColumnC"),all.x=TRUE)

The issue is that there are more than 50 common columns, and I would rather avoid typing the names of all the common columns.

I have tried with:

 NewDataframe <- merge(x, y, all.x=TRUE)

But the following error appears:

 Error in merge.data.table(x, y, all.x = TRUE) :
 Elements listed in `by` must be valid column names in x and y

Is there any way of using by with the common columns without typing all of them, but keeping all the rows from x?

Thank you.

Have you tried dplyr::left_join? That should be able to do everything you want in one command. — Andrew Brēza
– Andrew Brēza, Commented Apr 19, 2017 at 15:25

Community · Accepted Answer · 2017-05-23 10:31:02Z

7

You want to merge based on all common columns. So first you need to find out which column names are common between the two dataframes.

common_col_names <- intersect(names(x), names(y))

Then you use this character vector as your by parameters in the merge function.

merge(x, y, by=common_col_names, all.x=TRUE)

Edit: after reading @Andrew Gustar's answer, I double checked the documentation for the merge function, and this is exactly the default by parameter:

## S3 method for class 'data.frame'
merge(x, y, by = intersect(names(x), names(y)), # <-- Look here
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"),
      incomparables = NULL, ...)

edited May 23, 2017 at 10:31

CommunityBot

11 silver badge

answered Apr 19, 2017 at 15:22

zelite

1,53517 silver badges41 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

PM0087 Over a year ago

With all.y=T, I correctly get the values that are in dataframe Y, but it also adds all columns to the final merged dataframe, not just the common columns. All columns are passed.

Andrew Gustar · Accepted Answer · 2017-04-19 15:25:00Z

6

I think you have data tables rather than simple dataframes, and merge works slightly differently between the two. You could try forcing it to use the dataframe method by using NewDataframe <- merge.data.frame(x, y, all.x=TRUE) which should by default merge on all shared column names.

answered Apr 19, 2017 at 15:25

Andrew Gustar

18.6k1 gold badge26 silver badges36 bronze badges

Comments

Andrew Brēza · Accepted Answer · 2017-04-19 15:32:40Z

2

The left_join function from dplyr should give you exactly what you want.

First create some dummy data...

shared <- tibble(a=rnorm(327), b=rnorm(327),c=rnorm(327),d=rnorm(327),e=rnorm(327),f=rnorm(327),g=rnorm(327))
x <- cbind(shared, tibble(v =rnorm(327),w=rnorm(327),x=rnorm(327),y=rnorm(327),z=rnorm(327)))
y <- cbind(shared, tibble(h =rnorm(327),i=rnorm(327),j=rnorm(327),k=rnorm(327),l=rnorm(327)))
rm(shared)

Then just run the join...

dplyr::left_join(x,y)
# Joining, by = c("a", "b", "c", "d", "e", "f", "g")

As you can see, the function figures out which columns are shared between the two and automatically uses them in the join.

answered Apr 19, 2017 at 15:32

Andrew Brēza

8,4673 gold badges37 silver badges42 bronze badges

Collectives™ on Stack Overflow

merge dataframes based on common columns but keeping all rows from x [duplicate]

3 Answers 3

1 Comment

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

Comments

Linked

Related