1

I have two data frames (df1 and df2) with the same number of rows. For each row, I want to check whether the value in each column of df2 occurs as a substring of column S in df1, and then count the matches for each column of df2. Any help is appreciated.

    df1 = read.table(text="R    S
    GG  AACCTT
    CC  AAGGTT
    CC  AAGGTT
    GG  AACCTT
    GG  AACCTT
    CC  AAGGTT
    GG  AACCTT
    GG  AACCTT
    GG  AACCTT
    TT  AACCGG
    AA  CCGGTT
    CC  AAGGTT
    TT  AACCGG
    AA  CCGGTT
    AA  CCGGTT", header = TRUE, stringsAsFactors = FALSE)

    df2 = read.table(text="M1   M2  M3  M4  M5  M6  M7  M8
    GG  GG  GG  GG  GG  GG  GG  GG
    CC  CC  CC  CC  CC  CC  CC  CC
    CC  TT  TT  TT  TT  TT  TT  TT
    HH  TT  TT  TT  TT  HH  TT  TT
    GG  AA  GG  GG  GG  --  GG  HH
    CC  CC  CC  CC  CC  TT  CC  CC
    GG  GG  GG  GG  GG  --  GG  GG
    GG  GG  GG  GG  GG  --  GG  GG
    --  --  HH  AA  HH  AA  HH  AA
    TT  --  HH  CC  HH  CC  HH  CC
    AA  --  AA  AA  AA  --  AA  AA
    CC  CC  CC  CC  CC  CC  CC  CC
    --  HH  CC  CC  CC  HH  HH  CC
    AA  HH  GG  GG  GG  HH  HH  GG
    AA  HH  GG  GG  GG  HH  GG  GG", header = TRUE, stringsAsFactors = FALSE)

After checking, the intermediate result should be:

    M1  M2  M3  M4  M5  M6  M7  M8
    FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
    FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
    FALSE   TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
    FALSE   TRUE    TRUE    TRUE    TRUE    FALSE   TRUE    TRUE
    FALSE   TRUE    FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
    FALSE   FALSE   FALSE   FALSE   FALSE   TRUE    FALSE   FALSE
    FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
    FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
    FALSE   FALSE   FALSE   TRUE    FALSE   TRUE    FALSE   TRUE
    FALSE   FALSE   FALSE   TRUE    FALSE   TRUE    FALSE   TRUE
    FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
    FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
    FALSE   FALSE   TRUE    TRUE    TRUE    FALSE   FALSE   TRUE
    FALSE   FALSE   TRUE    TRUE    TRUE    FALSE   FALSE   TRUE
    FALSE   FALSE   TRUE    TRUE    TRUE    FALSE   TRUE    TRUE

The expected final result is the per-column total given by colSums():

    M1  M2  M3  M4  M5  M6  M7  M8
    0   3   5   7   5   4   3   7

2 Answers

4

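We can use stri_count_fixed() from library(stringi) to achieve this. The function is vectorized over both arguments, so df1$S is recycled against every value of df2 once unlist() has flattened df2 column by column; matrix() then restores one column per column of df2: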

    library(stringi)
    colSums(matrix(stri_count_fixed(df1$S, unlist(df2)), nrow = nrow(df1)))
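One thing to keep in mind: stri_count_fixed() counts occurrences, so a value that appears twice in the same string would add 2 to the total. If you want a strict yes/no per cell (and the logical intermediate table from the question), stri_detect_fixed() recycles its arguments the same way. Here is a minimal sketch on a small made-up data set; `S`, `pat`, `hits` and `counts` are names I chose for illustration, not objects from the question:

```r
library(stringi)

# Toy data (hypothetical, much smaller than the data in the question)
S <- c("AACCTT", "AAGGTT", "CCGGTT")
pat <- data.frame(M1 = c("GG", "TT", "AA"),
                  M2 = c("AA", "GG", "--"),
                  stringsAsFactors = FALSE)

# unlist() flattens pat column by column; S is recycled once per column.
# matrix() restores the row-by-column layout, dimnames keeps the column names.
hits <- matrix(stri_detect_fixed(S, unlist(pat)),
               nrow = length(S),
               dimnames = list(NULL, names(pat)))

# TRUE counts as 1, so the column sums are the number of matching rows.
counts <- colSums(hits)
```

With the data from the question you would pass df1$S and df2 in place of `S` and `pat`.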

Edit: a benchmark with @shaun_m's dplyr/mapply approach:

    library(microbenchmark)
    microbenchmark(
      stri_count_fixed = {colSums(matrix(stri_count_fixed(df1$S, unlist(df2)), nrow = nrow(df1)))},
      dplyr = {colSums(dplyr::bind_cols(lapply(df2, \(x) mapply(grepl, x, df1$S))))}
    )

    Unit: microseconds
                 expr    min      lq     mean  median     uq    max neval
     stri_count_fixed   58.8   63.95   78.428   74.10   81.7  265.9   100
                dplyr 1372.5 1408.85 1549.779 1446.95 1520.6 3298.0   100

0

You can use grepl() inside mapply() to check each cell of a column in df2 against the corresponding cell of df1$S. Wrap that in lapply() to process all the columns of df2, then pass the result to dplyr::bind_cols() to get a tibble, which is your intermediate result.

Then colSums() will give you the final result.

    dplyr::bind_cols(lapply(df2, \(x) mapply(grepl, x, df1$S)))

    # A tibble: 15 × 8
    #   M1    M2    M3    M4    M5    M6    M7    M8
    #   <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
    # 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # 3 FALSE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    # 4 FALSE TRUE  TRUE  TRUE  TRUE  FALSE TRUE  TRUE
    # 5 FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE
    # 6 FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE
    # 7 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # 8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # 9 FALSE FALSE FALSE TRUE  FALSE TRUE  FALSE TRUE
    #10 FALSE FALSE FALSE TRUE  FALSE TRUE  FALSE TRUE
    #11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    #12 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    #13 FALSE FALSE TRUE  TRUE  TRUE  FALSE FALSE TRUE
    #14 FALSE FALSE TRUE  TRUE  TRUE  FALSE FALSE TRUE
    #15 FALSE FALSE TRUE  TRUE  TRUE  FALSE TRUE  TRUE
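Note that grepl() treats its pattern as a regular expression by default. That happens to be harmless for values like "GG" or "--", but a value containing ".", "+" or "[" would not be matched literally. If you would rather avoid that (and the dplyr dependency), a base R sketch along the same lines could look like this; the toy vectors `S` and `pat` and the result names are my own, assumed only for illustration:

```r
# Toy data (hypothetical, much smaller than the data in the question)
S <- c("AACCTT", "AAGGTT", "CCGGTT")
pat <- data.frame(M1 = c("GG", "TT", "AA"),
                  M2 = c("AA", "GG", "--"),
                  stringsAsFactors = FALSE)

# For every column, compare row i of the column with S[i].
# fixed = TRUE makes grepl() match each value as a literal string.
hits <- sapply(pat, function(col) {
  unname(mapply(grepl, col, S, MoreArgs = list(fixed = TRUE)))
})

# hits is a logical matrix with one column per column of pat.
counts <- colSums(hits)
```

Here sapply() simplifies the per-column results straight into a logical matrix, so colSums() can be applied without building a tibble first.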
