
I have a data frame in which I'd like to compare PathwayName with ExpressionData, row by row. The comparison has to run across many rows (10 million+). Here are the first few lines of my data frame; the contents of each cell are separated only by spaces:

>View(df)

    PathwayName                                      ExpressionData 
1   41bbPathway BLACK   215538_at   210671_x_at...   215538_at  na  28.566616...
2   ace2Pathway BLACK   214533_at   215184_at...     215538_at  na  28.566616...    
3   acetPathway BLACK   215184_at   01502_s_at...    215184_at  na  4.2084746...
4   achPathway  BLACK   211570_s_at 215184_at...     215184_at  na  4.2084746...
5   hoPathway   BLACK   201968_at   214578_s_at...   201968_at  na  472.4969...

As a final product, I want to compare the two columns, keep only the matching IDs, and save the result to a new file. The output should look like this:

>View(df)

    PathwayName               ExpressionData 
1   41bbPathway 215538_at     215538_at         
2   acetPathway 215184_at     215184_at 
3   achPathway  215184_at     215184_at 
4   hoPathway   201968_at     201968_at  

This is what I have tried:

sub("BLACK.*", "", df)

I know that this doesn't work, so I hope someone can help. I have looked at many Q&As about comparing two columns in a data frame, but I cannot follow them because I need to compare the contents within each row and find the shared items (in this case the ones ending in ..._at), not just compare whole columns.
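A small sketch of why the attempt above misbehaves, using a two-row toy data frame that is an assumption based on the snippet (with the "..." truncations dropped): `sub()` coerces a non-character argument with `as.character()`, so a whole data frame is turned into one deparsed string per column instead of being processed row by row. Applying it to a single column works element by element.

```r
# Toy data (assumed from the question's snippet)
df <- data.frame(
  PathwayName = c("41bbPathway BLACK   215538_at   210671_x_at",
                  "hoPathway   BLACK   201968_at   214578_s_at"),
  stringsAsFactors = FALSE
)

# Whole data frame: coerced to one string per column, not per row
whole <- sub("BLACK.*", "", df)
length(whole)  # number of columns, not number of rows

# Single column: one result per row; trimws() drops the trailing spaces
per_row <- trimws(sub("BLACK.*", "", df$PathwayName))
per_row
```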

I hope someone knows how to do this. Thank you.

  • This certainly looks like a merge operation, although I'm guessing you don't just want 3 columns in the output but rather want to drag along some of the other information in the matching rows. You should post dput(head(df)) Commented Jan 26, 2016 at 8:15
  • @42- I think two columns are desired, the first column just has two terms separated by whitespace. Commented Jan 26, 2016 at 8:18
  • The OP needs to respond to both questions. Commented Jan 26, 2016 at 8:20
  • Yes, @42, you are right. I don't want 3 columns in the output. I want the two columns to stay as they are, but with only two terms separated by whitespace in the 1st column and one term in the 2nd column. It should still be two columns after all, as @steveb said. I will add the output of dput(head(df)) once the code has finished running. Commented Jan 26, 2016 at 8:36
  • You could duplicate the PathwayName column, gsub away everything after the pathway name in one copy, and in the second copy use gsub(".* BLACK +([0-9]{6}_at) .*","\\1",df$PathwayName), then select the rows where the gene names are the same. Commented Jan 26, 2016 at 9:54

1 Answer

This is not a simple task: the order of the _at, _x_at and _s_at genes is inconsistent, and I am guessing the gene lists have differing lengths. The other assumption I make is that ExpressionData lists only a single gene per line; if that is violated, this will not work properly. I would use a list rather than a data.frame, as it makes the comparison simpler. Since we only have a snippet of the data to go by, I am using only that.

# first, recreate the data from the question
PathwayName <-
c("41bbPathway BLACK   215538_at   210671_x_at...",
"ace2Pathway BLACK   214533_at   215184_at...",
"acetPathway BLACK   215184_at   01502_s_at...",
"achPathway  BLACK   211570_s_at 215184_at...",
"hoPathway   BLACK   201968_at   214578_s_at...")
# you shouldn't need this; it only strips the "..." from the truncated data I copied and pasted
PathwayName <- gsub("\\.\\.\\.", "", PathwayName)

ExpressionData <-
c("215538_at  na  28.566616...",
"215538_at  na  28.566616...",
"215184_at  na  4.2084746...",
"215184_at  na  4.2084746...",
"201968_at  na  472.4969...")
# you shouldn't need this either; same reason as above
ExpressionData <- gsub("\\.\\.\\.", "", ExpressionData)

# to compare
PNlist <- strsplit(PathwayName, split = " +") # make a list of tokens from each line, splitting on runs of spaces
PNlist <- lapply(PNlist, function(x) x[grepl("_at", x)]) # select genes; "_at" also matches _x_at and _s_at
EDlist <- strsplit(ExpressionData, split = " +")
EDlist <- lapply(EDlist, function(x) x[grepl("_at", x)])

# " +BLACK.*" removes all spaces before BLACK, so no trailing space is left on the pathway name
Result <- data.frame("PathwayName" = gsub(" +BLACK.*", "", PathwayName),
                     "PathwayGene" = as.character(lapply(seq_along(PNlist), function(i) PNlist[[i]][PNlist[[i]] %in% EDlist[[i]]])),
                     "ExpressionData" = gsub(" .*", "", ExpressionData), stringsAsFactors = FALSE)
# PathwayGene is 'character(0)' when PathwayName has no gene matching ExpressionData, so the next line drops those rows
Result <- Result[Result$PathwayGene == Result$ExpressionData, ]
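The same row-wise comparison can also be sketched directly on data frame columns. This is only an illustration under assumptions: the toy `df` is rebuilt from the question's snippet with the "..." removed, `get_probes` is a hypothetical helper name, and probe IDs are assumed to end in "_at" (a pattern that also covers `_x_at` and `_s_at`). It uses `intersect()` per row, so rows with no shared ID are dropped without relying on the 'character(0)' string.

```r
# Toy data mirroring the question's snippet (an assumption, truncations removed)
df <- data.frame(
  PathwayName = c("41bbPathway BLACK   215538_at   210671_x_at",
                  "ace2Pathway BLACK   214533_at   215184_at",
                  "acetPathway BLACK   215184_at   01502_s_at",
                  "achPathway  BLACK   211570_s_at 215184_at",
                  "hoPathway   BLACK   201968_at   214578_s_at"),
  ExpressionData = c("215538_at  na  28.566616",
                     "215538_at  na  28.566616",
                     "215184_at  na  4.2084746",
                     "215184_at  na  4.2084746",
                     "201968_at  na  472.4969"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on runs of whitespace, keep tokens ending in "_at"
get_probes <- function(x) {
  lapply(strsplit(trimws(x), "[[:space:]]+"),
         function(tok) tok[grepl("_at$", tok)])
}

pn <- get_probes(df$PathwayName)
ed <- get_probes(df$ExpressionData)

# Per row: pathway probes that also appear in that row's expression data
hits <- Map(intersect, pn, ed)
keep <- lengths(hits) > 0

result <- data.frame(
  PathwayName    = sub("[[:space:]]+BLACK.*", "", df$PathwayName[keep]),
  PathwayGene    = vapply(hits[keep], paste, character(1), collapse = " "),
  ExpressionData = vapply(ed[keep], paste, character(1), collapse = " "),
  stringsAsFactors = FALSE
)
result
```

On the five sample rows this keeps four of them; the ace2Pathway row is dropped because none of its probes appears in its ExpressionData entry. Whether this scales comfortably to 10 million+ rows has not been checked here.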

8 Comments

Thank you very much, @JeremyS. As you said, I may also need the _x_at and _s_at genes besides the plain _at ones, so how do I add them to PNlist and EDlist?
I ran this: Result <- data.frame("PathwayName"=gsub(" BLACK.*", "",x$PathwayName), "PathwayGene"=as.character(lapply(1:length(PNlist), function(x) PNlist[[x]][PNlist[[x]] %in% EDlist[[x]]])), "ExpressionData"=gsub(" .*","",x$ExpressionData), stringAsFactors=F) and I got this: Error in gsub(" BLACK.*", "", x$PathwayName) : object 'x' not found. Why is that?
Oh right, remove x$ from both calls. I updated the answer.
I removed it and got this instead: Error in data.frame(PathwayName = gsub(" BLACK.*", "", PathwayName), PathwayGene = as.character(lapply(1:length(PNlist), : arguments imply differing number of rows: 481, 22284, 1.
That says your objects are of differing lengths. How those numbers are possible if you have a data.frame of over 10 million rows, I don't know.
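A hedged sketch of how the length mismatch could be checked before calling `data.frame()`. One possible reading of the error, which the thread does not confirm, is that the pathway lines and the expression lines come from two separate tables of different sizes (the trailing 1 would also be consistent with the misspelled `stringAsFactors` argument being treated as an extra one-element column). Under that assumption, a row-by-row comparison does not apply, and each pathway's probes would instead be looked up in the full set of expression probe IDs. The toy vectors and the names `lens`, `probe_ids` and `matched` are illustrative only.

```r
# Toy vectors of deliberately different lengths (assumed, not the OP's data)
PathwayName <- c("41bbPathway BLACK 215538_at 210671_x_at",
                 "hoPathway BLACK 201968_at 214578_s_at")
ExpressionData <- c("215538_at na 28.566616",
                    "215184_at na 4.2084746",
                    "201968_at na 472.4969")

# Compare the lengths of everything that is meant to become a column
lens <- c(PathwayName = length(PathwayName),
          ExpressionData = length(ExpressionData))
print(lens)
same_length <- length(unique(lens)) == 1

# If the lengths differ, look each pathway's tokens up in all probe IDs
probe_ids <- sub("[[:space:]].*", "", ExpressionData)  # first token per line
tokens <- strsplit(PathwayName, "[[:space:]]+")
matched <- lapply(tokens, function(tok) tok[tok %in% probe_ids])
names(matched) <- vapply(tokens, `[`, character(1), 1)  # pathway name is token 1
matched
```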