4

I have a reference and query sequences:

ref_seq <- "ATTT"
df <- data.frame(V1=c("AATT", "TTTT", "GGTT"))

I would like to return the mismatch positions in the sequence for each query compared to reference:

seqdiff <- function(seq1, seq2) {
  seq <- strsplit(c(seq1, seq2), split= '')
  mismatches <- which(seq[[1]] != seq[[2]])
  return(mismatches)
}
    
apply(X=df, MARGIN=2, function(x) seqdiff(x, ref_seq))

#      V1
# [1,]  1
# [2,]  2

Expected Output:

#      V1
# [1,]  2
# [2,]  1
# [3,]  1 2
3
  • Does every sequence have equal length? Commented Apr 11 at 13:51
  • 2
    apply returns a matrix(-like) structure. A matrix is quadratic. You cannot create irregularly-shaped data with it. Commented Apr 11 at 13:54
  • for this case yes, but you can post also post a solution for different length , that would be useful Commented Apr 11 at 13:55

5 Answers 5

6

Given these are likely nucleotide sequences, you might consider the adist function. This can be used in other cases to determine the minimal weighted number of insertions, deletions, and substitutions needed to transform one string into another. This allows for counts to be computed in transformation, as well as the transformation sequence in the "trafos" attribute (M = match, I = insertion, D = deletion, S = Substitution).

df$trafos <- attr(adist(df$V1, ref_seq, counts = TRUE), "trafos")
df$substitution <- gregexpr("S", df$trafos)
df

    V1 trafos substitution
1 AATT   MSMM            2
2 TTTT   SMMM            1
3 GGTT   SSMM         1, 2

As noted in the comments by @AndreWildberg, different contexts may require different strategies. Other packages such as through Bioconductor or bedtoolsr may be more appropriate depending on need.

If using adist, you can also specify the "cost" argument, to apply costs for insertions, deletions, or substitutions when computing string distance.

In the case of ref_seq <- "GAAA" for example, you could apply costs for insertions and deletions, which would give a different (and probably more desirable) answer in this case.

df$trafos <- attr(adist(df$V1, 
                        ref_seq, 
                        counts = TRUE, 
                        costs = list(ins = 1, del = 1, sub = 0)), 
                  "trafos")
df$substitution <- gregexpr("S", df$trafos)
df

    V1 trafos substitution
1 AATT   SMSS      1, 3, 4
2 TTTT   SSSS   1, 2, 3, 4
3 GGTT   MSSS      2, 3, 4

In case there is interest in Bioconductor (Biostrings) to look at alignment and mismatches between two sequences, I'm adding a quick example.

library(Biostrings)
library(pwalign)

alg <- pairwiseAlignment("AATT", "GAAA")
mismatchTable(alg)

  PatternId PatternStart PatternEnd PatternSubstring PatternQuality SubjectStart SubjectEnd SubjectSubstring SubjectQuality
1         1            1          1                A              7            1          1                G              7
2         1            3          3                T              7            3          3                A              7
3         1            4          4                T              7            4          4                A              7
7
  • 1
    You hit one of my reflex upvote criteria, gregexpr.
    – IRTFM
    Commented Apr 12 at 3:17
  • 2
    Nice answer, but it doesn't work for all combinations, since sometimes adist decides it wants an Insertion/Deletion instead of a simple Substitution. E.g. try with ref_seq <- "GAAA". Commented Apr 13 at 10:00
  • @AndreWildberg - I understand and appreciate the comment. My thinking was this could represent a broader approach to dealing with genetic sequences, considering not just mismatch but insertions/deletions as well. In the comments above the OP mentioned possible sequences of different lengths (which would complicate things further). I found another example here with adist. One may want partial matches contained within a long sequence and add a "cost" to minimize mismatches.
    – Ben
    Commented Apr 13 at 14:05
  • Also, more options for string matching and identifying mismatches can be done with Biostrings, part of Bioconductor.
    – Ben
    Commented Apr 13 at 14:15
  • Got it. The problem is that adist is not made for that so it might behave unexpected, especially on larger similar problems. There are a lot of solutions for matching biological sequences, e.g. blast or bedtools which hold less surprises while using. Commented Apr 13 at 14:18
1

1) This produces a data frame whose mismatch columnn is a list of vectors.

library(dplyr)
out <- df %>%
  rowwise %>%
  mutate(mismatch = list(seqdiff(V1, ref_seq))) %>%
  ungroup

str(out)
## tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
##  $ V1      : chr [1:3] "AATT" "TTTT" "GGTT"
##  $ mismatch:List of 3
##   ..$ : int 2
##   ..$ : int 1
##   ..$ : int [1:2] 1 2

str(out$mismatch)
## List of 3
##  $ : int 2
##  $ : int 1
##  $ : int [1:2] 1 2

2) This one is the same except the mismatch column is a character vector.

library(dplyr)

out <- df %>%
  rowwise %>%
  mutate(mismatch = toString(seqdiff(V1, ref_seq))) %>%
  ungroup
out
## # A tibble: 3 × 2
##   V1    mismatch
##   <chr> <chr>   
## 1 AATT  2       
## 2 TTTT  1       
## 3 GGTT  1, 2    

3) This unnests (1) igiving a long form data frame where the output for each input row can span multple output rows.

library(dplyr)
library(tidyr)

out <- df %>%
  rowwise %>%
  mutate(mismatch = list(seqdiff(V1, ref_seq))) %>%
  ungroup %>%
  unnest(mismatch)
out
## A tibble: 4 × 2
##   V1    mismatch
##   <chr>    <int>
## 1 AATT         2
## 2 TTTT         1
## 3 GGTT         1
## 4 GGTT         2

library(igraph)

set.seed(123)
g <- graph_from_data_frame(out)
plot(g)

(contnued after image) screenshot

4) To use apply the funtion should use list or toString

out <- apply(df, 1, \(x) list(seqdiff(x, ref_seq)))
str(out)
List of 3
 $ :List of 1
  ..$ : int 2
 $ :List of 1
  ..$ : int 1
 $ :List of 1
  ..$ : int [1:2] 1 2

out <- apply(df, 1, \(x) toString(seqdiff(x, ref_seq)))
out
## [1] "2"    "1"    "1, 2"
1

An approach using substr

sapply(df$V1, \(seq) 
  which(sapply(seq_len(nchar(ref_seq)), \(x) 
    substr(ref_seq, x, x) != substr(seq, x, x))))
$AATT
[1] 2

$TTTT
[1] 1

$GGTT
[1] 1 2

If you want to just match specific positions use spaces in ref_seq, e.g.

ref_seq <- " GTA"

sapply(df$V1, \(seq) 
  which(sapply(seq_len(nchar(ref_seq)), \(x) 
    substr(ref_seq, x, x) != substr(seq, x, x) & 
    (substr(ref_seq,x ,x) != " "))))
$AATT
[1] 2 4

$TTTT
[1] 2 4

$GGTT
[1] 4

Getting the result in a data frame format

as.matrix(lapply(df$V1, \(seq)
  which(sapply(seq_len(nchar(ref_seq)), \(x)
    substr(ref_seq, x, x) != substr(seq, x, x))))) |>
  `colnames<-`(ref_seq) |>
  data.frame()
  ATTT
1    2
2    1
3 1, 2
0

You could use != in mapply and parse TRUE positions toString.

> s <- \(x) {strsplit(x, '')}
> apply(mapply(`!=`, s(ref_seq), s(df$V1)), 2, \(x) toString(which(x)))
[1] "2"    "1"    "1, 2"

Note: Avoid row-wise operations on data.frames, which are extremely inefficient. Remember that a data.frame is actually a list of vectors with adjacent positions appearing as rows—and consider using mapply() or Map(). apply() is designed for matrices (as applied in this solution) and should only rarely (if at all) be used on data.frames.

0

If nchar(ref_seq) (4) equals nchar of all elements of V1 (4), we can try vapply() + utf8ToInt() with which(.., arr.ind=TRUE).

M = vapply(V1, utf8ToInt, numeric(4)) 
i = which(M != utf8ToInt(ref_seq), arr.ind=TRUE)
V2 = with(data.frame(i), split(row, col)) |> sapply(toString) # improve
> data.frame(V1, V2)
    V1   V2
1 AATT    2
2 TTTT    1
3 GGTT 1, 2

Note

# input data
ref_seq = "ATTT"
V1 = c("AATT", "TTTT","GGTT")

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.