3

I have a dataframe with Name1 (10 observations), and Name2, with 3 observations. I have the following toy example:

   Name1                            Name2         
Acadian Hospitals                 Wellington      
Bridgewater Trust Associates      Zeus        
Concordia Consulting              Acadian
Wellington Corporation LLC          .
Wellington Wealth Management        .
Prime Acadian Charity

If Name1 is able to match a part of its string in Name2, I want the output in column3 to be TRUE. Currently, my code only works the other way around, using pmatch

My final output should look like this:

   Name1                            Name2           Is_Matched
Acadian Hospitals                 Wellington           TRUE
Bridgewater Trust Associates      Zeus                 FALSE
Concordia Consulting              Acadian              FALSE
Wellington Corporation LLC          .                  TRUE
Wellington Wealth Management        .                  TRUE
Prime Acadian Charity               .                  TRUE
4
  • Are the periods intentionally TRUE, do they denote missing data, or are accidentally true in Is_Matched?
    – Andrew
    Commented Mar 2, 2019 at 1:11
  • Also, why is Acadian FALSE in the Is_Matched column?
    – Andrew
    Commented Mar 2, 2019 at 1:31
  • The true and false points to Name1 being present in Name2, not the other way around. The periods are missing values and have been encoded as such in my dataframe Commented Mar 2, 2019 at 1:36
  • 1
    I see! I posted an edit below but it looks like you already figure it out. Well done!!
    – Andrew
    Commented Mar 2, 2019 at 3:00

3 Answers 3

4

It sounds like Name2 is really just a set of lookup values. In that case you could build a lookup by pasting all the values together and then do one simple grepl search on all of df$Name2:

df$Is_Matched <- grepl(paste(df$Name2[df$Name2 == "."], collapse = "|"), df$Name1)
#                         Name1      Name2 Is_Matched
#1            Acadian Hospitals Wellington       TRUE
#2 Bridgewater Trust Associates       Zeus      FALSE
#3         Concordia Consulting    Acadian      FALSE
#4   Wellington Corporation LLC          .       TRUE
#5 Wellington Wealth Management          .       TRUE
#6        Prime Acadian Charity          .       TRUE

Note this assumes that missing values in Name2 are coded as "." rather than NA. It would be easy enough to change to any other coding of missing values.

2
  • your clue on grepl with paste definitely helped, but I figured a way without the "." placement holder. Commented Mar 2, 2019 at 2:45
  • is there a way to employ this for very large datasets? Commented Mar 3, 2019 at 0:42
2

With assistance from Mike H. :

Name1 = c("Bridgewater Trust Associates", "Acadian Wealth Management", "Wellington Wealth Trust", "Concordia University", "Southern Zeus College", "Parametric Modeling", "Wellington City Corporation", "Hotel Zanzibar") 
Name2 = c("Acadian", "Wellington", "Zeus")

max.len = max(length(Name1), length(Name2))
Name1 = c(Name1, rep(NA, max.len - length(Name1)))
Name2 = c(Name2, rep(NA, max.len - length(Name2)))
column3 <- grepl(paste(Name2, collapse = "|"), Name1)

df <- data.frame(Name1, Name2, column3, stringsAsFactors = FALSE)
2

You could use sapply. Without an example I think something like this should work. I'll check on an example in a sec.

df$Is_Matched <- sapply(df$Name2, function(x) any(grepl(x, df$Name1))

EDIT:

Creating an example dataframe helped. sapply was exporting a matrix with each word in Name2 having its own column. So, you can test to see if any row contains a true using rowSums (true = 1, false = 0). Let me know if you have any issues with it.

> df <- data.frame(
+   Name1 = c("Acadian Hospitals", "Bridgewater Trust Associates",
+             "Concordia Consulting", "Wellington Corporation LLC",
+             "Wellington Wealth Management", "Prime Acadian Charity"),
+   Name2 = c("Wellington", "Zeus", "Acadian", NA, NA, NA),
+   stringsAsFactors = FALSE
+ )
> 
> match_me <- na.omit(df$Name2)
> df$Is_Matched <- rowSums(sapply(match_me, function(x) grepl(x, df$Name1))) > 0
> df
                         Name1      Name2 Is_Matched
1            Acadian Hospitals Wellington       TRUE
2 Bridgewater Trust Associates       Zeus      FALSE
3         Concordia Consulting    Acadian      FALSE
4   Wellington Corporation LLC       <NA>       TRUE
5 Wellington Wealth Management       <NA>       TRUE
6        Prime Acadian Charity       <NA>       TRUE

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.