I have a data frame in which I'd like to compare the value in PathwayName with the value in ExpressionData. The comparison has to run across many rows (10 million+). Here are the first few rows of my data frame; the contents of each cell are separated only by spaces:
>View(df)
PathwayName ExpressionData
1 41bbPathway BLACK 215538_at 210671_x_at... 215538_at na 28.566616...
2 ace2Pathway BLACK 214533_at 215184_at... 215538_at na 28.566616...
3 acetPathway BLACK 215184_at 01502_s_at... 215184_at na 4.2084746...
4 achPathway BLACK 211570_s_at 215184_at... 215184_at na 4.2084746...
5 hoPathway BLACK 201968_at 214578_s_at... 201968_at na 472.4969...
As the final product, I want to compare the two columns, copy the matching IDs, and save the result to a new file. The output should look like this:
>View(df)
PathwayName ExpressionData
1 41bbPathway 215538_at 215538_at
2 acetPathway 215184_at 215184_at
3 achPathway 215184_at 215184_at
4 hoPathway 201968_at 201968_at
This is what I tried:
sub("BLACK.*", "", df)
I know that this doesn't work, so I hope someone can help.
I have looked at many Q&As about comparing two columns in a data frame, but I cannot follow them because I need to compare the contents within each row and find any shared entries (in this case the ones ending in _at), not just compare whole columns.
I hope someone knows how to do this. Thank you.
merge operation, although I'm guessing you don't just want 3 columns in the output but rather want to drag along some of the other information in the matching rows.
You should post dput(head(df)).
gsub(".* BLACK +([0-9]{6}_at) .*","\\1",df$PathwayName), then select rows where the gene names are the same.
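The row-wise comparison described in the question could be sketched roughly as follows. This is only an illustration under assumptions, not a tested solution for the real data: it assumes the first token of ExpressionData is the probe ID, that everything after BLACK in PathwayName is a space-separated list of probe IDs, and it uses a small toy data frame built from the rows shown above with the truncated parts left out. Unlike the gsub pattern in the comment, it checks every ID after BLACK, not only the first, and it does not require exactly six digits. The names probe, tokens, members, keep and result are made up for the example.

```r
# Toy data modelled on the first rows in the question (elided values omitted).
df <- data.frame(
  PathwayName = c("41bbPathway BLACK 215538_at 210671_x_at",
                  "ace2Pathway BLACK 214533_at 215184_at",
                  "acetPathway BLACK 215184_at 01502_s_at"),
  ExpressionData = c("215538_at na 28.566616",
                     "215538_at na 28.566616",
                     "215184_at na 4.2084746"),
  stringsAsFactors = FALSE
)

# Assumption: the first space-delimited token of ExpressionData is the probe ID.
probe <- sub(" .*", "", df$ExpressionData)

# Assumption: PathwayName is "<pathway> BLACK <id> <id> ...".
tokens  <- strsplit(df$PathwayName, " +")
pathway <- vapply(tokens, `[`, character(1), 1)   # first token = pathway name
members <- lapply(tokens, function(x) x[-(1:2)])  # drop pathway name and "BLACK"

# Keep a row only if its probe ID occurs among that pathway's IDs.
keep <- mapply(function(p, m) p %in% m, probe, members, USE.NAMES = FALSE)

result <- data.frame(
  PathwayName    = paste(pathway[keep], probe[keep]),
  ExpressionData = probe[keep],
  stringsAsFactors = FALSE
)
print(result)
```

The result could then be saved with write.table(result, "matched.txt", quote = FALSE, sep = "\t"), where the file name is a placeholder. How this performs on 10 million+ rows has not been measured here.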