If the only obstacle is punctuation, the problem is trivial: Just discard non-word characters and compare the remaining lists of words.
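import re
s1 = 'Title - Subtitle'
toks1 = re.split(r"\W+", s1) # split on runs of non-word characters, keeping just the words
toks1 = [ w.lower() for w in toks1 if w ] # lowercase, and drop empty strings left by leading/trailing punctuation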
I threw in lowercasing since that could differ too. Apply the same to each input and compare the lists.
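Wrapped up as a function, that might look like the sketch below. The names `tokenize` and `same_words` are made up for illustration, and I use `re.findall` on `\w+` rather than splitting, which avoids empty strings when a title starts or ends with punctuation.

```python
import re

def tokenize(title):
    """Return the lowercased words of a title, ignoring punctuation."""
    return [w.lower() for w in re.findall(r"\w+", title)]

def same_words(a, b):
    """True if both titles contain the same words in the same order."""
    return tokenize(a) == tokenize(b)

print(same_words("Title - Subtitle", "title: subtitle!"))
print(same_words("Title - Subtitle", "Subtitle - Title"))
```

The second call reports a mismatch because this comparison is still sensitive to word order.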
But as you point out, there can be other differences. If your data really consists of titles (books, movies, scientific articles), you can start by removing articles and common connectives (so-called "stopwords"), like libraries do. E.g., "The title of the article" gets stripped down to ["title", "article"]. To deal with possible differences in word order, you could use the so-called "bag of words" approach, common in information retrieval. Convert the list of tokens to a set, or to a dictionary of word counts for cases where some words occur multiple times. Here's an example, using word counts and nltk's "stopword" list as a filter.
import nltk
from collections import Counter
# The stopword list has to be downloaded once: nltk.download("stopwords")
stopwords = set(nltk.corpus.stopwords.words("english"))
toks1 = [ t for t in toks1 if t not in stopwords ]
cnt1 = Counter(toks1)
cnt2 = Counter(toks2) # toks2: another title string, processed the same way as toks1
if cnt1 == cnt2:
    print("The two strings have exactly the same content words")
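If you'd rather try the idea without installing nltk, here is a self-contained sketch. The tiny `STOPWORDS` set is a hand-picked stand-in for a real stopword list, and `content_words` is a hypothetical helper name, so treat both as assumptions.

```python
import re
from collections import Counter

# Hand-picked stand-in for a real stopword list (an assumption, far from complete)
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "to", "for"}

def content_words(title):
    """Bag of words: count each lowercased word that is not a stopword."""
    words = re.findall(r"\w+", title.lower())
    return Counter(w for w in words if w not in STOPWORDS)

a = content_words("The Title: The Subtitle")
b = content_words("Title, The: Subtitle, The")
print(a == b)            # word order and stopwords no longer matter
print(set(a) == set(b))  # comparing as sets also ignores repeat counts
```

Comparing the `Counter` objects is the stricter test; comparing sets only asks whether the same distinct words appear.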
If there's still more variation, the sky is the limit. Approximate text matching is a topic of active research with applications in information retrieval, plagiarism detection, genetics, etc. You could check if one title is a subset of the other (maybe someone left out the subtitle). You could try matching by "edit distance" (e.g. the "Levenshtein distance" mentioned by a couple of other answers), applying it either to letters or to whole words. You could try information retrieval algorithms like TF-IDF score. These are just a few of the things you could try, so look for the simplest solution that will do the job for you. Google is your friend.
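To make two of those options concrete, here is a rough sketch of the subset check and of a textbook dynamic-programming edit distance. The function names are my own. `edit_distance` accepts any two sequences, so it works on strings (letters) or on token lists (whole words).

```python
from collections import Counter

def is_subset(cnt_a, cnt_b):
    """True if every word in cnt_a occurs at least as often in cnt_b."""
    return not (cnt_a - cnt_b)  # Counter subtraction drops non-positive counts

def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute, or match for free
        prev = curr
    return prev[-1]

short = Counter(["title"])
full = Counter(["title", "subtitle"])
print(is_subset(short, full))
print(edit_distance("kitten", "sitting"))
print(edit_distance(["title", "subtitle"], ["title", "new", "subtitle"]))
```

What counts as "close enough" is up to you; a threshold relative to the length of the longer title is one common starting point.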