I have a column of values with unique identifiers that look like this:

df$1 <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh")

Some of them also appear in a column of a separate data frame with 71 columns, but there they are often clustered together, separated by "|", like this:

df2$1 <- c(""identifier:ab134:4sfh|identifier:gh164:9sgh", "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh",...)

I need to find all rows in the second data frame that contain any of these identifiers. I could strsplit the column, but I want to keep the rest of the second data frame as it is.

I have tried using this code both ways (i.e. df1 %in% df2 and df2 %in% df1) but obviously it's not giving me all the matches because it's trying to match whole strings rather than substrings:

new_subset <- subset(df$1, trimws(1) %in% trimws(df2$1))
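A minimal illustration of the whole-string behaviour described above, using two hypothetical vectors `ids` and `clustered` in place of the real columns (a sketch, not the actual data):

```r
# Hypothetical stand-ins for the two columns described in the question.
ids <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh")
clustered <- c("identifier:ab134:4sfh|identifier:gh164:9sgh",
               "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh")

# %in% compares complete strings, so an identifier never equals a whole cluster.
whole_match <- ids %in% clustered

# A fixed (non-regex) substring search does find the identifier inside the clusters.
substring_match <- grepl(ids[2], clustered, fixed = TRUE)

print(whole_match)
print(substring_match)
```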

Any suggestions? Thanks in advance for your help!

  • I’m really not sure what I can add. I need a match for every row and I’ve used the code above (which doesn’t work).
    – OliverL
    Commented Dec 4, 2019 at 14:06
  • If you can provide the expected output for the vectors you showed, it would help. Try lapply(v1, function(x) unlist(lapply(strsplit(v2, "|", fixed = TRUE), function(y) match(x, y)))). Also try grep(df2$1, df$1)
    – akrun
    Commented Dec 4, 2019 at 14:07
  • So I tried this and I got a very long list that looks like this: List of 8806 $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ... $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ... $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ... $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ...
    – OliverL
    Commented Dec 4, 2019 at 16:38
  • I want an output that looks like this: df$2 <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh") but only includes matches from df$1
    – OliverL
    Commented Dec 4, 2019 at 16:39
  • You've got mismatched quotation marks in your code
    – camille
    Commented Dec 4, 2019 at 18:29

1 Answer

Maybe you can use grep to find matching strings.

new_subset <- df[grep(paste0("^(",paste(df2$z, collapse = "|"),")$"), df$z),]
new_subset
#[1] identifier:ab134:4sfh identifier:gh164:9sgh
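To make the one-liner easier to follow, here is a sketch that builds the pattern in a separate step. It assumes the same example data as the answer, with `stringsAsFactors = FALSE` added so the columns are plain character vectors; the names `pattern` and `hits` are introduced only for illustration. Because each cell of `df2$z` is already a "|"-separated list, pasting the cells together with "|" yields one long regex alternation.

```r
# Example data as in the answer, kept as character columns.
df <- data.frame(z = c("identifier:ab134:4sfh", "identifier:gh164:9sgh",
                       "identifier:3h1v4:kk9gh"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(z = c("identifier:ab134:4sfh|identifier:gh164:9sgh",
                        "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh"),
                  stringsAsFactors = FALSE)

# Join all clusters into one alternation and anchor it so only whole
# identifiers match: ^(id1|id2|...)$
pattern <- paste0("^(", paste(df2$z, collapse = "|"), ")$")

# Positions in df$z that equal one of the alternatives.
hits <- grep(pattern, df$z)

print(pattern)
print(df$z[hits])
```

This relies on the identifiers containing no regex metacharacters; if they might, a fixed-string approach would be safer.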

Data:

df <- data.frame(z=c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh"))
df2 <- data.frame(z=c("identifier:ab134:4sfh|identifier:gh164:9sgh", "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh"))
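
The question also asks for the rows of the second data frame that contain any identifier while leaving its other columns untouched. A possible sketch of that direction follows. The third row of `df2` and the `other` column are hypothetical additions, included only so the filter has something to drop and something to preserve.

```r
df <- data.frame(z = c("identifier:ab134:4sfh", "identifier:gh164:9sgh",
                       "identifier:3h1v4:kk9gh"),
                 stringsAsFactors = FALSE)

# Third row and `other` column are hypothetical, for illustration only.
df2 <- data.frame(z = c("identifier:ab134:4sfh|identifier:gh164:9sgh",
                        "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh",
                        "identifier:zz000:0000"),
                  other = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Split each cluster on a literal "|" without modifying df2 itself.
parts <- strsplit(df2$z, "|", fixed = TRUE)

# TRUE for rows where at least one piece is a known identifier.
keep <- vapply(parts, function(p) any(p %in% df$z), logical(1))

# Subset whole rows; drop = FALSE keeps the result a data frame.
matched_rows <- df2[keep, , drop = FALSE]
print(matched_rows)
```

The split result lives in a separate list, so every column of `df2` is carried through unchanged.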
  • I get an error message in response: paste(df2, , collapse = "|") : argument is missing, with no default
    – OliverL
    Commented Dec 4, 2019 at 15:05
  • @OliverL I wrote paste(df2, collapse = "|") and not paste(df2, , collapse = "|")
    – GKi
    Commented Dec 4, 2019 at 15:09
  • @OliverL I have updated the data-set in the question from vector to data.frame. Hope this solves the problem.
    – GKi
    Commented Dec 4, 2019 at 15:42
  • Thanks so much for your help. So now it's saying "Error in grep(paste0("^(", paste(df1, collapse = "|", : invalid regular expression '^("identifier:ab134:4sfh|identifier:gh164:9sgh"..)
    – OliverL
    Commented Dec 4, 2019 at 15:58
  • Could it be something to do with collapse = "|" - do I need to change it somehow to make it comply with regex rules?
    – OliverL
    Commented Dec 4, 2019 at 16:00
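
On the regex concern raised in the last comment, one way to sidestep regular-expression rules altogether is to split the clusters on a literal "|" and compare whole identifiers with `%in%`. This is a sketch on hypothetical vectors `ids` and `clustered` standing in for the two columns, not a diagnosis of the error quoted above.

```r
# Hypothetical stand-ins for the two columns.
ids <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh")
clustered <- c("identifier:ab134:4sfh|identifier:gh164:9sgh",
               "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh")

# fixed = TRUE treats "|" as a plain character rather than regex alternation.
all_ids <- unique(trimws(unlist(strsplit(clustered, "|", fixed = TRUE))))

# Keep the identifiers that occur anywhere in the clustered column.
matched <- ids[trimws(ids) %in% all_ids]
print(matched)
```

Since no pattern is compiled, special characters in the identifiers and the overall length of the combined string do not matter here.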