13

Let's say I have 2 strings

AAABBBCCCCC

and

AAAABBBBCCCC

to make these strings as similar as possible, given that I can only remove characters I should

  • delete the last C from the first string
  • delete the last A and the last B from the second string,

so that they become

AAABBBCCCC

What would be an efficient algorithm to find out which characters to remove from each string?

I'm currently crushing my brain cells thinking about a sollution involving substrings of the strings, looking for them i,n the other string.

7
  • Does the order of the characters to be removed matter? For example, do you have to know that it's the 4th A and last C that are to be removed, or do you just need know that there's one A and one C to be removed? Commented May 6, 2012 at 10:59
  • If the order of characters to be removed don't matter, wouldn't sorting both strings, and subtracting the smaller one from the bigger work? Commented May 6, 2012 at 11:00
  • the order doesn't matter within groups of the same group of the same characters, for example in the string ÀABBAA removing the first character would be the same as removing the second, but removing the first character is not the same as removing the last one. Commented May 6, 2012 at 11:02
  • What is the expected end result, positions of characters to be removed(0th, 3rd etc.), or the extra characters themselves without order (A,A,C,A,C..)? Commented May 6, 2012 at 11:05
  • 3
    For a general solution, you will have to find the longest common subsequence, which is a nice problem to be solved with dynamic programming :) Commented May 6, 2012 at 12:37

4 Answers 4

15

How about using difflib?

import difflib

s1 = 'AAABBBCCCCC'
s2 = 'AAAABBBBCCCC'

for difference in difflib.ndiff(s1, s2):
    print difference,
    if difference[0] == '+':
        print 'remove this char from s2'
    elif difference[0] == '-':
        print 'remove this char from s1'
    else:
        print 'no change here'

This will print out the differences between the two strings that you can then use to remove the differences. Here is the output:

  A no change here
  A no change here
  A no change here
+ A remove this char from s2
+ B remove this char from s2
  B no change here
  B no change here
  B no change here
  C no change here
  C no change here
  C no change here
  C no change here
- C remove this char from s1
Sign up to request clarification or add additional context in comments.

2 Comments

So using this the solution would be something like this ''.join(s[2] for s in difflib.ndiff(s1,s2) if s[0] == ' ')
+1 for concise reformulation. The longer version might be useful if the removal of the extra chars should trigger other actions as well.
14

Levenshtein distance can calculate how many changes you need to convert one string into another. A small change to the source, and you may get not only distance, but the conversions needed.

5 Comments

Levenshtein also allows modifications of single characters, which isn't allowed by OP's description.
using levehshtein as base, you may introduce or remove any character modifications you like or don't like.
Yes, but then it might be that you don't find the optimal number of removals.
levenstein deals with 1)char removed 2)char added 3)char replaced, first one means "remove character from the first string", second one means "remove characted from the second string", third one means "remove character from both strings", please, open a separate question if you need further clarification.
I just realized that the algorithm can be configured to assign different costs to the different operations. In our case we'd have to assign costs 2 to "char replaced".
1

Don't know if it's the fastest, but as code goes, it is at least short:

import difflib
''.join([c[-1] for c in difflib.Differ().compare('AAABBBCCCCC','AAAABBBBCCCC') if c[0] == ' '])

Comments

0

I think regular expression can do this.It's a string distance problem. I mean. Let's have two string:

str1 = 'abc'
str2 = 'aabbcc'

first, I choose the short, and construct a regular expression like is:

regex = '(\w*)'+'(\w*)'.join(list(str1))+'(\w*)'

Then, we can search:

matches = re.search(regex,str2)

I use round brackets to group the section I am interested. these groups of matches.group() is the distance of two strings.Next, I can figure out what characters should be removed. It's my idea, I hope it can help you.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.