I cannot for the life of me figure out where the simple error is in my for loop. I want to run the same analysis over multiple data frames and save each iteration's result as a new data frame, named after the input data frame plus an extra string that identifies it.

Here is my code:

john and jane are 2 data frames among many I am hoping to loop over and compare to bcm to find duplicate results in rows.

x <- list(john,jane)

for (i in x) {
  test <- rbind(bcm,i)
  test$dups <- duplicated(test$Full.Name,fromLast=T)
  test$dups2 <- duplicated(test$Full.Name)
  test <- test[which(test$dups==T | test$dups2==T),]
  newname <- paste("dupl",i,sep=".")
  assign(newname, test)
}

Thus far, I can get either the naming to work (but without the data from x) or the loop to complete (but without the new data frames being named correctly).

Intended Result: I am hoping to create new data frames dupl.john and dupl.jane to show which rows are duplicated in comparison to bcm.

I understand that lapply() might be better to use and am very open to that form of solution. I could not figure out how to use it to solve my problem, so I turned to the more familiar for loop.


EDIT:

Sorry if I'm not being more clear. I have about 13 data frames in total that I want to run the same analysis over to find the duplicate rows in $Full.Name. I could do the first 4 lines of my loop and then dupl.john <- test 13 times (for each data frame), but I am purposely trying to write a for loop or lapply() to gain more knowledge in R and because I'm sure it is more efficient.

  • Try test <- rbind(bcm, get(i)), with x holding the data frame names as strings, e.g. x <- c("john", "jane"). Indeed, lapply() may be more convenient. Commented Jul 13, 2016 at 15:40
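To make the suggestion in the comment above concrete: in the original loop, i is the data frame itself rather than its name, so paste() cannot build "dupl.john" from it. A minimal sketch of the name-based version follows. The sample data is hypothetical (the real john, jane and bcm are not shown in the question), and it assumes bcm has no repeated Full.Name values of its own.

```r
# Hypothetical sample data; the real data frames are not shown in the question.
bcm  <- data.frame(Full.Name = c("Ann Lee", "Bob Ray", "Cy Dunn"),
                   Phone.Number = c("111", "222", "333"),
                   stringsAsFactors = FALSE)
john <- data.frame(Full.Name = c("Ann Lee", "Dee Fox"),
                   Phone.Number = c("111", "444"),
                   stringsAsFactors = FALSE)
jane <- data.frame(Full.Name = c("Bob Ray", "Cy Dunn", "Eve Hill"),
                   Phone.Number = c("999", "333", "555"),
                   stringsAsFactors = FALSE)

# Loop over the *names* of the data frames, not the data frames themselves.
x <- c("john", "jane")

for (i in x) {
  # get() looks the data frame up by its name.
  test <- rbind(bcm, get(i))
  # Flag every row whose Full.Name occurs more than once, in either direction.
  dup <- duplicated(test$Full.Name) |
    duplicated(test$Full.Name, fromLast = TRUE)
  # i is a string here, so paste() yields "dupl.john" and "dupl.jane".
  assign(paste("dupl", i, sep = "."), test[dup, ])
}
```

After the loop, dupl.john and dupl.jane exist in the workspace, each holding the matching rows from both bcm and the compared data frame.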

1 Answer

If I understand your intended result correctly, match_df() from the plyr package could be an option.

library(plyr)
dupl.john <- match_df(john, bcm)
dupl.jane <- match_df(jane, bcm)

dupl.john and dupl.jane will both be data frames, containing the rows of john and jane, respectively, that also appear in bcm. Is this what you are trying to achieve?

EDITED after the first comment

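library(plyr)
l <- list(john, jane)
# Keep the rows of each data frame whose Full.Name also appears in bcm
res <- lapply(l, function(x) match_df(x, bcm, on = "Full.Name"))
# [[ ]] extracts the data frame itself; [ ] would return a one-element list
dupl.john <- res[[1]]
dupl.jane <- res[[2]]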

Now, res is a list of data frames, one per input, holding the rows that match bcm on the column "Full.Name".
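With 13 data frames, pulling out res[1], res[2], and so on by position gets tedious. One possible variation, sketched below, is to use a named list so that the result names can be derived from the input names. This sketch uses base R's %in% as a stand-in for match_df(x, bcm, on = "Full.Name") so that it runs without plyr; the sample data is hypothetical.

```r
# Hypothetical sample data; the real data frames are not shown in the question.
bcm  <- data.frame(Full.Name = c("Ann Lee", "Bob Ray", "Cy Dunn"),
                   Phone.Number = c("111", "222", "333"),
                   stringsAsFactors = FALSE)
john <- data.frame(Full.Name = c("Ann Lee", "Dee Fox"),
                   Phone.Number = c("111", "444"),
                   stringsAsFactors = FALSE)
jane <- data.frame(Full.Name = c("Bob Ray", "Cy Dunn", "Eve Hill"),
                   Phone.Number = c("999", "333", "555"),
                   stringsAsFactors = FALSE)

# A named list keeps each data frame's name attached to it.
l <- list(john = john, jane = jane)

# Base R stand-in for match_df(): keep rows whose Full.Name appears in bcm.
res <- lapply(l, function(x) {
  x[x$Full.Name %in% bcm$Full.Name, , drop = FALSE]
})

# Derive the result names from the input names: dupl.john, dupl.jane.
names(res) <- paste("dupl", names(l), sep = ".")

# Access one result by name.
res[["dupl.john"]]
```

Keeping the results in a list is usually easier to work with than 13 separate objects, but list2env(res, envir = globalenv()) would create dupl.john, dupl.jane, and so on as individual data frames if that is what you need.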


5 Comments

Sorry if I'm not being more clear. I have about 13 data frames in total that I want to run the same analysis over to find the duplicate rows in $Full.Name. I could do the first 4 lines of my loop and then dupl.john <- test 13 times (for each data frame), but I am purposely trying to write a for loop or lapply() to gain more knowledge in R and because I'm sure it is more efficient.
Thanks for the edit. This is more along the lines of what I am attempting to do. However, the one last thing I am trying to work through is a problem that I probably caused with my explanation (or lack thereof). I would like to retain all row matches by $Full.Name, so I am currently experimenting with join() rather than match_df(). If you have time to look into that, it's much appreciated. Otherwise, I can accept your edit as the answer, given the wording of my original question.
Sure, but I'm afraid I don't fully understand the last problem you mentioned. Full.Name is a common column to all the data frames, right? Do all the data frames have the same columns? AFAIU, you want to find the rows in each of your 13 data frames whose Full.Name equals the Full.Name of any row in bcm. Is that correct? If so, do you want a "join" operation over all the columns of the matched rows, based on this column?
I'll provide an example that might make more sense: jane has a Contact's Full.Name and Phone.Number. bcm has a Contact's Full.Name and Phone.Number. I basically want to rbind() and have both rows (one from each data frame) show up whenever there is a match on Full.Name so I can clean these manually and easily eyeball differences in Phone.Number we have stored. There probably is an easier way to do this using R and finding mismatches in columns by Full.Name, but this is how I am currently approaching it.
Ok, what about this: res <- lapply(l, function(x) {join(x, bcm, by = "Full.Name", type = "inner")} ) With this approach, you'll end up having both phone numbers on the same row (instead of on different rows), but you will also be able to tell the difference between both numbers.
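For comparing the two phone numbers side by side, here is a minimal sketch of the same inner-join idea using base R's merge() instead of plyr's join(), so that it runs without extra packages. The sample data and the mismatch column are hypothetical illustrations, not part of the original thread.

```r
# Hypothetical sample data; the real data frames are not shown in the question.
bcm  <- data.frame(Full.Name = c("Ann Lee", "Bob Ray", "Cy Dunn"),
                   Phone.Number = c("111", "222", "333"),
                   stringsAsFactors = FALSE)
jane <- data.frame(Full.Name = c("Bob Ray", "Cy Dunn", "Eve Hill"),
                   Phone.Number = c("999", "333", "555"),
                   stringsAsFactors = FALSE)

# Inner join on Full.Name (merge() defaults to all = FALSE).
# The suffixes label which data frame each Phone.Number came from.
both <- merge(jane, bcm, by = "Full.Name", suffixes = c(".jane", ".bcm"))

# Flag the rows where the two stored phone numbers disagree.
both$mismatch <- both$Phone.Number.jane != both$Phone.Number.bcm

# Only the contacts that need manual cleaning.
both[both$mismatch, ]
```

Wrapping the merge() call in lapply() over a named list of data frames would extend this to all 13, at the cost of a fixed suffix such as ".x" instead of one per data frame.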
