Comparing two columns in a data frame across many rows using R language

Question

I have a data frame in which I'd like to compare the contents of PathwayName with the contents of ExpressionData. The comparison has to run across many rows (10 million+) of the data frame. Here are the first few rows of my data frame; the contents of each cell are separated only by spaces:

>View(df)

    PathwayName                                      ExpressionData 
1   41bbPathway BLACK   215538_at   210671_x_at...   215538_at  na  28.566616...
2   ace2Pathway BLACK   214533_at   215184_at...     215538_at  na  28.566616...    
3   acetPathway BLACK   215184_at   01502_s_at...    215184_at  na  4.2084746...
4   achPathway  BLACK   211570_s_at 215184_at...     215184_at  na  4.2084746...
5   hoPathway   BLACK   201968_at   214578_s_at...   201968_at  na  472.4969...

As a final product, I want to compare the two columns, keep only the matching IDs, and save the result to a new file. The output should look like this:

>View(df)

    PathwayName               ExpressionData 
1   41bbPathway 215538_at     215538_at         
2   acetPathway 215184_at     215184_at 
3   achPathway  215184_at     215184_at 
4   hoPathway   201968_at     201968_at

This is what I have tried:

sub("BLACK.*", "", df)

I know that this doesn't work, so I hope someone can help. I have looked at many Q&As about comparing two columns in a data frame, but I cannot follow them because I need to compare the contents within each row and find the items that appear in both cells (in this case the ones ending in ..._at), not just compare the columns as a whole.

I hope someone knows how to do this. Thank you.

This certainly looks like a merge operation, although I'm guessing you don't just want 3 columns in the output but rather want to drag along some of the other information in the matching rows. You should post dput(head(df)).
– IRTFM, Commented Jan 26, 2016 at 8:15
@42- I think two columns are desired; the first column just has two terms separated by whitespace.
– steveb, Commented Jan 26, 2016 at 8:18
Yes, @42, you are right. I don't want 3 columns in the output. I want to keep the two columns as they are, but with only two terms separated by whitespace in the 1st column and one term in the 2nd column. It should still be two columns after all, as @steveb said. I will add the output of dput(head(df)) once it has finished running.
– rafidah muhamad, Commented Jan 26, 2016 at 8:36
You could duplicate the PathwayName column, use gsub to strip everything after the pathway name in one copy, use gsub(".* BLACK +([0-9]{6}_at) .*","\\1",df$PathwayName) on the second copy, and then select the rows where the gene names are the same.
– JeremyS, Commented Jan 26, 2016 at 9:54

JeremyS · Accepted Answer · 2016-01-27 06:03:41Z

0

This is not a simple task: the order of the _at, _x_at and _s_at genes is inconsistent, and I am guessing the gene lists have differing lengths. The other assumption I make is that ExpressionData lists only a single gene per line; if that is violated, this will not work properly. I would use a list rather than a data.frame, as it makes the comparison a bit simpler. Since we only have a snippet of the data to go by, I am using only that.

# first, build the sample data
PathwayName <- c("41bbPathway BLACK   215538_at   210671_x_at...",
                 "ace2Pathway BLACK   214533_at   215184_at...",
                 "acetPathway BLACK   215184_at   01502_s_at...",
                 "achPathway  BLACK   211570_s_at 215184_at...",
                 "hoPathway   BLACK   201968_at   214578_s_at...")
# you shouldn't need this; it only strips the "..." from the partial data I copied and pasted
PathwayName <- gsub("\\.\\.\\.", "", PathwayName)

ExpressionData <- c("215538_at  na  28.566616...",
                    "215538_at  na  28.566616...",
                    "215184_at  na  4.2084746...",
                    "215184_at  na  4.2084746...",
                    "201968_at  na  472.4969...")
# you shouldn't need this; it only strips the "..." from the partial data I copied and pasted
ExpressionData <- gsub("\\.\\.\\.", "", ExpressionData)

# to compare
# strsplit() is vectorised and returns one character vector per input line
PNlist <- strsplit(PathwayName, split = " +")
# keep only the gene IDs; "_at" also matches the _x_at and _s_at suffixes
PNlist <- lapply(PNlist, function(x) x[grepl("_at", x, fixed = TRUE)])
EDlist <- strsplit(ExpressionData, split = " +")
EDlist <- lapply(EDlist, function(x) x[grepl("_at", x, fixed = TRUE)])

# " +BLACK.*" also removes runs of spaces before BLACK, so no trailing space is left in the name
Result <- data.frame(PathwayName = gsub(" +BLACK.*", "", PathwayName),
                     PathwayGene = as.character(lapply(seq_along(PNlist), function(i) PNlist[[i]][PNlist[[i]] %in% EDlist[[i]]])),
                     ExpressionData = gsub(" .*", "", ExpressionData),
                     stringsAsFactors = FALSE)
# PathwayGene holds the string "character(0)" when PathwayName has no gene matching ExpressionData, so the next line drops those rows
Result <- Result[Result$PathwayGene == Result$ExpressionData, ]
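
The same comparison can also be wrapped in a function, which makes it easier to run on columns taken from a real data frame. This is only a sketch under the same assumptions as above (one probe ID at the start of each ExpressionData string, gene IDs containing "_at"); the name match_pathway_genes and its arguments are made up for illustration.

```r
# Hypothetical helper: keep the rows whose pathway gene list contains the
# probe ID that starts the expression string.
match_pathway_genes <- function(pathway, expression) {
  stopifnot(length(pathway) == length(expression))
  # Split each pathway string on runs of spaces and keep tokens containing "_at"
  tokens <- strsplit(trimws(pathway), " +")
  genes  <- lapply(tokens, function(x) x[grepl("_at", x, fixed = TRUE)])
  # Assumption: the first token of each expression string is the probe ID
  probe  <- sub(" .*", "", trimws(expression))
  # TRUE where the probe ID occurs among that row's pathway genes
  hit <- vapply(seq_along(genes),
                function(i) probe[i] %in% genes[[i]],
                logical(1))
  data.frame(PathwayName      = sub(" +BLACK.*", "", pathway)[hit],
             PathwayGene      = probe[hit],
             ExpressionData   = probe[hit],
             stringsAsFactors = FALSE)
}

# Sample rows from the question, with the trailing "..." removed
pathway <- c("41bbPathway BLACK   215538_at   210671_x_at",
             "ace2Pathway BLACK   214533_at   215184_at",
             "acetPathway BLACK   215184_at   01502_s_at",
             "achPathway  BLACK   211570_s_at 215184_at",
             "hoPathway   BLACK   201968_at   214578_s_at")
expression <- c("215538_at  na  28.566616",
                "215538_at  na  28.566616",
                "215184_at  na  4.2084746",
                "215184_at  na  4.2084746",
                "201968_at  na  472.4969")

res <- match_pathway_genes(pathway, expression)
print(res)
```

With a data frame, the call would presumably be match_pathway_genes(df$PathwayName, df$ExpressionData). The length check at the top stops early if the two inputs do not line up row by row.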

edited Jan 27, 2016 at 6:03

answered Jan 27, 2016 at 2:35

8 Comments

rafidah muhamad Over a year ago

Thank you very much @JeremyS. As you said, I may also need the _x_at and _s_at genes besides the _at ones, so how do I add them to PNlist and EDlist?
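
For what it's worth, a quick check suggests the "_at" pattern used in the answer already picks up the _x_at and _s_at IDs, because both contain "_at". The sketch below uses tokens copied from the sample rows; the stricter pattern is only an assumption about what the IDs look like (digits, an optional letter infix, then _at) and may not cover every ID in the real data.

```r
# Tokens as they appear after splitting the sample rows on spaces
tokens <- c("215538_at", "210671_x_at", "211570_s_at",
            "BLACK", "na", "28.566616")

# Loose match, as in the answer: anything containing "_at"
loose <- grepl("_at", tokens, fixed = TRUE)

# Stricter match (assumed ID format): digits, optional _x / _s style infix, then _at
strict <- grepl("^[0-9]+(_[a-z]+)?_at$", tokens)

print(tokens[loose])
print(tokens[strict])
```

On these tokens both patterns select the same three IDs, so PNlist and EDlist should not need changing for the suffix variants.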

rafidah muhamad Over a year ago

I ran this:

Result <- data.frame("PathwayName"=gsub(" BLACK.*", "",x$PathwayName),
                     "PathwayGene"=as.character(lapply(1:length(PNlist), function(x) PNlist[[x]][PNlist[[x]] %in% EDlist[[x]]])),
                     "ExpressionData"=gsub(" .*","",x$ExpressionData), stringAsFactors=F)

and I got this: Error in gsub(" BLACK.*", "", x$PathwayName) : object 'x' not found. Why is that?

JeremyS Over a year ago

Oh right, remove x$ from both calls. I updated the answer.

rafidah muhamad Over a year ago

I removed it and got this instead:

Error in data.frame(PathwayName = gsub(" BLACK.*", "", PathwayName), PathwayGene = as.character(lapply(1:length(PNlist),  :    arguments imply differing number of rows: 481, 22284, 1

JeremyS Over a year ago

That says your objects are of differing lengths. How those numbers are possible if you have a data.frame of over 10 million rows, I don't know.
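
A minimal sketch of how this kind of error can arise, using made-up vectors with the lengths from the error message (the real objects are not shown in the thread, so this is an assumption, not a diagnosis). One detail that is consistent with the trailing 1: data.frame() treats a misspelled argument such as stringAsFactors as an extra column of length one.

```r
# Hypothetical stand-ins with the lengths reported in the error message
PathwayName    <- rep("x", 481)
ExpressionData <- rep("y", 22284)

# stringAsFactors (missing "s") is not an argument of data.frame(),
# so it is treated as a one-element column
msg <- tryCatch(
  data.frame(PathwayName = PathwayName,
             ExpressionData = ExpressionData,
             stringAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(msg)

# Checking the lengths up front shows which input is out of step
lens <- c(PathwayName = length(PathwayName),
          ExpressionData = length(ExpressionData))
print(lens)
same_length <- length(unique(lens)) == 1
print(same_length)
```

If the inputs come from the same data frame, their lengths should be equal; differing lengths suggest they were read or built separately.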
