making two strings into one

Question

Let's say I have 2 strings

AAABBBCCCCC

and

AAAABBBBCCCC

to make these strings as similar as possible, given that I can only remove characters I should

delete the last C from the first string
delete the last A and the last B from the second string,

so that they become

AAABBBCCCC

What would be an efficient algorithm to find out which characters to remove from each string?

I'm currently crushing my brain cells thinking about a sollution involving substrings of the strings, looking for them i,n the other string.

Does the order of the characters to be removed matter? For example, do you have to know that it's the 4th A and last C that are to be removed, or do you just need know that there's one A and one C to be removed? — Nadh
– Nadh, Commented May 6, 2012 at 10:59
If the order of characters to be removed don't matter, wouldn't sorting both strings, and subtracting the smaller one from the bigger work? — Nadh
– Nadh, Commented May 6, 2012 at 11:00
the order doesn't matter within groups of the same group of the same characters, for example in the string ÀABBAA removing the first character would be the same as removing the second, but removing the first character is not the same as removing the last one. — bigblind
– bigblind, Commented May 6, 2012 at 11:02
What is the expected end result, positions of characters to be removed(0th, 3rd etc.), or the extra characters themselves without order (A,A,C,A,C..)? — Nadh
– Nadh, Commented May 6, 2012 at 11:05
For a general solution, you will have to find the longest common subsequence, which is a nice problem to be solved with dynamic programming :) — Niklas B.
– Niklas B., Commented May 6, 2012 at 12:37

daedalus · Accepted Answer · 2012-05-06 11:06:51Z

15

How about using difflib?

import difflib

s1 = 'AAABBBCCCCC'
s2 = 'AAAABBBBCCCC'

for difference in difflib.ndiff(s1, s2):
    print difference,
    if difference[0] == '+':
        print 'remove this char from s2'
    elif difference[0] == '-':
        print 'remove this char from s1'
    else:
        print 'no change here'

This will print out the differences between the two strings that you can then use to remove the differences. Here is the output:

  A no change here
  A no change here
  A no change here
+ A remove this char from s2
+ B remove this char from s2
  B no change here
  B no change here
  B no change here
  C no change here
  C no change here
  C no change here
  C no change here
- C remove this char from s1

answered May 6, 2012 at 11:06

daedalus

10.9k5 gold badges52 silver badges71 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

jamylak Over a year ago

So using this the solution would be something like this ''.join(s[2] for s in difflib.ndiff(s1,s2) if s[0] == ' ')

daedalus Over a year ago

+1 for concise reformulation. The longer version might be useful if the removal of the extra chars should trigger other actions as well.

lenik · Accepted Answer · 2012-05-06 11:05:09Z

14

Levenshtein distance can calculate how many changes you need to convert one string into another. A small change to the source, and you may get not only distance, but the conversions needed.

answered May 6, 2012 at 11:05

lenik

23.6k4 gold badges38 silver badges44 bronze badges

5 Comments

Niklas B. Over a year ago

Levenshtein also allows modifications of single characters, which isn't allowed by OP's description.

lenik Over a year ago

using levehshtein as base, you may introduce or remove any character modifications you like or don't like.

Niklas B. Over a year ago

Yes, but then it might be that you don't find the optimal number of removals.

lenik Over a year ago

levenstein deals with 1)char removed 2)char added 3)char replaced, first one means "remove character from the first string", second one means "remove characted from the second string", third one means "remove character from both strings", please, open a separate question if you need further clarification.

Niklas B. Over a year ago

I just realized that the algorithm can be configured to assign different costs to the different operations. In our case we'd have to assign costs 2 to "char replaced".

deinonychusaur · Accepted Answer · 2012-05-06 11:06:13Z

1

Don't know if it's the fastest, but as code goes, it is at least short:

import difflib
''.join([c[-1] for c in difflib.Differ().compare('AAABBBCCCCC','AAAABBBBCCCC') if c[0] == ' '])

answered May 6, 2012 at 11:06

deinonychusaur

7,3543 gold badges32 silver badges43 bronze badges

Comments

hakunami · Accepted Answer · 2012-05-06 11:08:12Z

I think regular expression can do this.It's a string distance problem. I mean. Let's have two string:

str1 = 'abc'
str2 = 'aabbcc'

first, I choose the short, and construct a regular expression like is:

regex = '(\w*)'+'(\w*)'.join(list(str1))+'(\w*)'

Then, we can search:

matches = re.search(regex,str2)

I use round brackets to group the section I am interested. these groups of matches.group() is the distance of two strings.Next, I can figure out what characters should be removed. It's my idea, I hope it can help you.

Collectives™ on Stack Overflow

making two strings into one

4 Answers 4

2 Comments

5 Comments

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

2 Comments

5 Comments

Comments

Comments

Linked

Related