apply a function with two arguments, rowwise on a data frame

Question

I have a reference and query sequences:

ref_seq <- "ATTT"
df <- data.frame(V1=c("AATT", "TTTT", "GGTT"))

I would like to return the mismatch positions in the sequence for each query compared to reference:

seqdiff <- function(seq1, seq2) {
  seq <- strsplit(c(seq1, seq2), split= '')
  mismatches <- which(seq[[1]] != seq[[2]])
  return(mismatches)
}
    
apply(X=df, MARGIN=2, function(x) seqdiff(x, ref_seq))

#      V1
# [1,]  1
# [2,]  2

Expected Output:

#      V1
# [1,]  2
# [2,]  1
# [3,]  1 2

apply returns a matrix(-like) structure. A matrix is quadratic. You cannot create irregularly-shaped data with it. — Konrad Rudolph, Commented Apr 11 at 13:54
for this case yes, but you can post also post a solution for different length , that would be useful — Paolo Lorenzini, Commented Apr 11 at 13:55

Ben · Accepted Answer · 2025-04-15 19:07:27Z

Given these are likely nucleotide sequences, you might consider the adist function. This can be used in other cases to determine the minimal weighted number of insertions, deletions, and substitutions needed to transform one string into another. This allows for counts to be computed in transformation, as well as the transformation sequence in the "trafos" attribute (M = match, I = insertion, D = deletion, S = Substitution).

df$trafos <- attr(adist(df$V1, ref_seq, counts = TRUE), "trafos")
df$substitution <- gregexpr("S", df$trafos)
df

    V1 trafos substitution
1 AATT   MSMM            2
2 TTTT   SMMM            1
3 GGTT   SSMM         1, 2

As noted in the comments by @AndreWildberg, different contexts may require different strategies. Other packages such as through Bioconductor or bedtoolsr may be more appropriate depending on need.

If using adist, you can also specify the "cost" argument, to apply costs for insertions, deletions, or substitutions when computing string distance.

In the case of ref_seq <- "GAAA" for example, you could apply costs for insertions and deletions, which would give a different (and probably more desirable) answer in this case.

df$trafos <- attr(adist(df$V1, 
                        ref_seq, 
                        counts = TRUE, 
                        costs = list(ins = 1, del = 1, sub = 0)), 
                  "trafos")
df$substitution <- gregexpr("S", df$trafos)
df

    V1 trafos substitution
1 AATT   SMSS      1, 3, 4
2 TTTT   SSSS   1, 2, 3, 4
3 GGTT   MSSS      2, 3, 4

In case there is interest in Bioconductor (Biostrings) to look at alignment and mismatches between two sequences, I'm adding a quick example.

library(Biostrings)
library(pwalign)

alg <- pairwiseAlignment("AATT", "GAAA")
mismatchTable(alg)

  PatternId PatternStart PatternEnd PatternSubstring PatternQuality SubjectStart SubjectEnd SubjectSubstring SubjectQuality
1         1            1          1                A              7            1          1                G              7
2         1            3          3                T              7            3          3                A              7
3         1            4          4                T              7            4          4                A              7

Nice answer, but it doesn't work for all combinations, since sometimes adist decides it wants an Insertion/Deletion instead of a simple Substitution. E.g. try with ref_seq <- "GAAA". — Andre Wildberg, Commented Apr 13 at 10:00
@AndreWildberg - I understand and appreciate the comment. My thinking was this could represent a broader approach to dealing with genetic sequences, considering not just mismatch but insertions/deletions as well. In the comments above the OP mentioned possible sequences of different lengths (which would complicate things further). I found another example here with adist. One may want partial matches contained within a long sequence and add a "cost" to minimize mismatches. — Ben, Commented Apr 13 at 14:05
Also, more options for string matching and identifying mismatches can be done with Biostrings, part of Bioconductor. — Ben, Commented Apr 13 at 14:15
Got it. The problem is that adist is not made for that so it might behave unexpected, especially on larger similar problems. There are a lot of solutions for matching biological sequences, e.g. blast or bedtools which hold less surprises while using. — Andre Wildberg, Commented Apr 13 at 14:18

G. Grothendieck · Accepted Answer · 2025-04-11 14:39:32Z

1) This produces a data frame whose mismatch columnn is a list of vectors.

library(dplyr)
out <- df %>%
  rowwise %>%
  mutate(mismatch = list(seqdiff(V1, ref_seq))) %>%
  ungroup

str(out)
## tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
##  $ V1      : chr [1:3] "AATT" "TTTT" "GGTT"
##  $ mismatch:List of 3
##   ..$ : int 2
##   ..$ : int 1
##   ..$ : int [1:2] 1 2

str(out$mismatch)
## List of 3
##  $ : int 2
##  $ : int 1
##  $ : int [1:2] 1 2

2) This one is the same except the mismatch column is a character vector.

library(dplyr)

out <- df %>%
  rowwise %>%
  mutate(mismatch = toString(seqdiff(V1, ref_seq))) %>%
  ungroup
out
## # A tibble: 3 × 2
##   V1    mismatch
##   <chr> <chr>   
## 1 AATT  2       
## 2 TTTT  1       
## 3 GGTT  1, 2

3) This unnests (1) igiving a long form data frame where the output for each input row can span multple output rows.

library(dplyr)
library(tidyr)

out <- df %>%
  rowwise %>%
  mutate(mismatch = list(seqdiff(V1, ref_seq))) %>%
  ungroup %>%
  unnest(mismatch)
out
## A tibble: 4 × 2
##   V1    mismatch
##   <chr>    <int>
## 1 AATT         2
## 2 TTTT         1
## 3 GGTT         1
## 4 GGTT         2

library(igraph)

set.seed(123)
g <- graph_from_data_frame(out)
plot(g)

(contnued after image)

4) To use apply the funtion should use list or toString

out <- apply(df, 1, \(x) list(seqdiff(x, ref_seq)))
str(out)
List of 3
 $ :List of 1
  ..$ : int 2
 $ :List of 1
  ..$ : int 1
 $ :List of 1
  ..$ : int [1:2] 1 2

out <- apply(df, 1, \(x) toString(seqdiff(x, ref_seq)))
out
## [1] "2"    "1"    "1, 2"

Andre Wildberg · Accepted Answer · 2025-04-11 17:56:04Z

An approach using substr

sapply(df$V1, \(seq) 
  which(sapply(seq_len(nchar(ref_seq)), \(x) 
    substr(ref_seq, x, x) != substr(seq, x, x))))
$AATT
[1] 2

$TTTT
[1] 1

$GGTT
[1] 1 2

If you want to just match specific positions use spaces in ref_seq, e.g.

ref_seq <- " GTA"

sapply(df$V1, \(seq) 
  which(sapply(seq_len(nchar(ref_seq)), \(x) 
    substr(ref_seq, x, x) != substr(seq, x, x) & 
    (substr(ref_seq,x ,x) != " "))))
$AATT
[1] 2 4

$TTTT
[1] 2 4

$GGTT
[1] 4

Getting the result in a data frame format

as.matrix(lapply(df$V1, \(seq)
  which(sapply(seq_len(nchar(ref_seq)), \(x)
    substr(ref_seq, x, x) != substr(seq, x, x))))) |>
  `colnames<-`(ref_seq) |>
  data.frame()
  ATTT
1    2
2    1
3 1, 2

jay.sf · Accepted Answer · 2025-04-12 04:50:42Z

0

You could use != in mapply and parse TRUE positions toString.

> s <- \(x) {strsplit(x, '')}
> apply(mapply(`!=`, s(ref_seq), s(df$V1)), 2, \(x) toString(which(x)))
[1] "2"    "1"    "1, 2"

Note: Avoid row-wise operations on data.frames, which are extremely inefficient. Remember that a data.frame is actually a list of vectors with adjacent positions appearing as rows—and consider using mapply() or Map(). apply() is designed for matrices (as applied in this solution) and should only rarely (if at all) be used on data.frames.

edited Apr 12 at 4:50

answered Apr 12 at 4:36

jay.sf

75.1k8 gold badges64 silver badges126 bronze badges

Add a comment |

Friede · Accepted Answer · 2025-04-13 14:00:52Z

0

If nchar(ref_seq) (4) equals nchar of all elements of V1 (4), we can try vapply() + utf8ToInt() with which(.., arr.ind=TRUE).

M = vapply(V1, utf8ToInt, numeric(4)) 
i = which(M != utf8ToInt(ref_seq), arr.ind=TRUE)
V2 = with(data.frame(i), split(row, col)) |> sapply(toString) # improve

> data.frame(V1, V2)
    V1   V2
1 AATT    2
2 TTTT    1
3 GGTT 1, 2

Note

# input data
ref_seq = "ATTT"
V1 = c("AATT", "TTTT","GGTT")

edited Apr 13 at 14:00

answered Apr 11 at 16:39

Friede

9,5802 gold badges11 silver badges30 bronze badges

Add a comment |

Collectives™ on Stack Overflow

apply a function with two arguments, rowwise on a data frame

5 Answers 5

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Linked

Related