Find the missing and duplicated numbers from shuffled range

Question

I'm solving "Unknown amount of duplicates. One missing number.":

In this kata, we have an unsorted sequence of consecutive numbers from a to b, such that a < b always (remember a, is the minimum, and b the maximum value). They were introduced an unknown amount of duplicates in this sequence and we know that there is an only missing value such that all the duplicate values and the missing value are between a and b, but never coincide with them. Find the missing number with the duplicate numbers (duplicates should be output in a sorted array)

I'm using this code:

    def find_dups_miss(arr):

        m = max(arr)
        n = min(arr)

        a = list(range(n, m+1))

        d = list(set(a).difference(arr))  # creates the list with the missing element
        z = sorted(set([x for x in arr if arr.count(x) >= 2])) # creates the list with duplicate elements
    
        return d + [z]

but it timeouts since it exceeds the 12000 ms time limit.

What can I do to make it run faster?

an unknown amount of duplicates len(sequence) - a + b (-1 because b supposedly inclusive +1 because one missing) — greybeard
– greybeard, Commented Sep 12, 2022 at 23:24

greybeard · Accepted Answer · 2022-09-20 04:15:09Z

I found the explanation hard to understand. I've paraphrased what I believe is required (and implemented):

A sequence of consecutive numbers, a…b (inclusive) is shuffled, and modified by removing one number and duplicating zero or more of the numbers. Given this modified sequence, determine which number was removed and which duplicates were added.

It's not surprising that this code is slow. We traverse the array multiple times, notably here:

    z = sorted(set([x for x in arr if arr.count(x) >= 2])) # creates the list with duplicate elements

This scales very poorly, because each arr.count() has to traverse the whole array, and we do that for each x in the array, meaning it scales as O(n²), where n is the array length.

If we use a collections.Counter, we only have to traverse the array once; we can then filter the elements (and we can find the min and max values more efficiently from the counter). Or we can .subtract() the "range" a, which will reveal the removed element (having a count of -1, so revealed using unary -) and the duplicates (having positive count, and returned as the counter's .elements()).

Did you notice the bug? Because we make a set from the duplicated elements, we will get the wrong answer if any number is present three times or more (we'll report it only once in the list of duplicates).

we'll report [any number present three times or more] only once in the list of duplicates that's what post_lupy's solution reports. codewars.com just mentions duplicates. — greybeard
– greybeard, Commented Sep 20, 2022 at 3:27
I guess it's not clear what the requirement is (from the question - I don't have a codewars account). I understood it to mean "report which duplicates have been added", but there's an argument for "report which numbers have been duplicated". BTW, thanks for the edit - I forgot that "subtract" could be interpreted as - operator there. — Toby Speight
– Toby Speight, Commented Sep 20, 2022 at 6:52

Reinderien · Accepted Answer · 2022-09-20 12:50:47Z

1

As Toby suggests, you traverse the array multiple times; you need to cut that down. However, my proposal is going to be fairly different. Don't use counters and sets. Sort the input array first, then do a single pass over it. Both gaps and dupes will be obvious and can be detected within the O(n) loop. Then push any dupes to the tail of a list which itself is O(1). No post sortation will be needed.

answered Sep 20, 2022 at 12:50

Reinderien

71.2k5 gold badges76 silver badges257 bronze badges

\$\begingroup\$ (sortation? Franco-Canadian?) \$\endgroup\$

greybeard
– greybeard

2022-09-20 12:58:45 +00:00
Commented Sep 20, 2022 at 12:58
\$\begingroup\$ Hmm. Merriam-Webster seems to agree that it's an English word.. \$\endgroup\$

Reinderien
– Reinderien

2022-09-20 22:54:15 +00:00
Commented Sep 20, 2022 at 22:54
\$\begingroup\$ I see. Oxford Reference does not include it, but shows a sortation noun for a New Oxford American Dictionary (3 ed.) (2015). My spelling checker (British English) flags it… \$\endgroup\$

greybeard
– greybeard

2022-09-21 05:10:47 +00:00
Commented Sep 21, 2022 at 5:10

Add a comment |

janos · Accepted Answer · 2022-09-20 21:22:49Z

Use counting sort

Since we know that all values of the array are within a range [a, b] with one missing value, this is a good opportunity for Counting sort:

find the minimum
find the maximum
initialize the list of counts to 0 for maximum - minimum + 1 values
loop over the input, and add counts, applying minimum as an offset for the list index into counts
to find the missing value and the duplicates, loop over counts, and:
- if count is 0, that's the missing value
- if count is > 1, add to list of dups

Beware: a caller may mistakenly call the function with an input like [1, sys.maxint]. So before we allocate a list of sys.maxint zeros, it's important to check that the input looks legit.

Time and space complexity is \$O(n)\$.

Validate the input

The description says that there is one missing value. Can we take that for granted? What if the input has more? What if the input has none?

Unless it's specified somewhere that the input is guaranteed to have these properties, I think it's worth raising an error for these cases.

Make the API more clear

The function returns a list. It's not clear which is the missing value. I think it would be better to return a tuple: (the_missing_value, sorted_list_of_duplicates)

Embrace the spirit of kata

The description says this is a kata. A programming kata often involves applying test driven development, implementing failing tests first before the code works. You end up with many tests for the various corner cases. I'm surprised to not see any tests included with the posted code.

Putting it together

Here's how the above ideas could be implemented. I used doctests because it's more compact to demonstrate, I recommend to replace these with proper unit tests in real life.

The minimum and maximum values can be computed in a single pass, I used two passes with min(...) and max(...) for the sake of simplicity.

def find_missing_and_dups(arr: List[int]) -> Tuple[int, List[int]]
    """
    Find missing number and sorted list of duplicates from a list of numbers from a to b

    :param arr: consecutive numbers from a to b, with one missing, and some duplicates
    :return: (the_missing_value, sorted_list_of_duplicates)

    >>> find_missing_and_dups([10, 9, 8, 9, 6, 4, 3, 2, 5, 5, 3])
    (7, [3, 5, 9])

    >>> find_missing_and_dups([3, 1])
    (2, [])

    >>> find_missing_and_dups([3, 10])
    Traceback (most recent call last):
      ...
    ValueError: Invalid input: there are more than 1 missing values!

    >>> find_missing_and_dups([3, 5, 7, 7, 7])
    Traceback (most recent call last):
      ...
    ValueError: Invalid input: there are more than 1 missing values!

    >>> find_missing_and_dups([3, 2])
    Traceback (most recent call last):
      ...
    ValueError: Invalid input: there is no missing value!
    """
    a = min(arr)
    b = max(arr)
    if b - a > len(arr):
        raise ValueError("Invalid input: there are more than 1 missing values!")

    counts = [0] * (b - a + 1)

    for x in arr:
        counts[x - a] += 1

    missing = None
    dups = []
    for i, count in enumerate(counts):
        if count == 0:
            if missing is not None:
                raise ValueError("Invalid input: there are more than 1 missing values!")
            missing = i + a
        elif count > 1:
            dups.append(i + a)

    if missing is None:
        raise ValueError("Invalid input: there is no missing value!")

    return missing, dups


if __name__ == '__main__':
    import doctest

    doctest.testmod()

Jason Leaver · Accepted Answer · 2022-09-20 23:19:05Z

a more performant solution

original solution

as many have mentioned the traversing of the list is the leading loss in performance.

arr = [10, 9, 8, 9, 6, 1, 2, 4, 3, 2, 5, 5, 3]

# original solution
def find_dups_miss(arr):
    m = max(arr)
    n = min(arr)

    a = list(range(n, m + 1))

    d = list(set(a).difference(arr))  # creates the list with the missing element
    z = sorted(
        set([x for x in arr if arr.count(x) >= 2])
    )  # creates the list with duplicate elements

    return d + [z]

performant solution

creating a list the length of the max value in this case is more performant. If this list was [1,1000] this would not be the case.


def filter_and_generate(arr: list[int]) -> tuple[int, list[int]]:
    container = list(range(0, max(arr) + 1))

    def generate():
        for i in arr:
            curr = container[i]
            if curr == -1:
                yield i
            elif curr:
                container[i] = -1
            else:
                container[i] = None

    dupes = sorted(generate())
    (seven,) = filter(lambda x: x > 0, container)
    return [seven, dupes]


%time find_dups_miss(arr)
%time filter_and_generate(arr)

CPU times: user 16 µs, sys: 0 ns, total: 16 µs
Wall time: 17.6 µs
CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 10 µs
[7, [2, 3, 5, 9]]

I want to apologize for any formatting mistakes, I am on mobile waiting on a flight.

You might consider using a generator function.

arr = [10,9,8,9,6,1,2,4,3,2,5,5,3]
def f(a:list[int]):
    c = []
    for i in a:
        if not i in c:
            c.append(i)
            continue
        yield i

    
# sorted(set(f(arr)))
# [2, 3, 5, 9]
def getseven(a:list[int]) -> int | None:
    for i in range(1,max(arr)):
        if i not in arr:
            return i

[getseven(arr),sorted(set(f(arr)))]
# [7, [2, 3, 5, 9]]

I don't find the missing 7 in the output, let alone its beginning. — greybeard
– greybeard, Commented Sep 19, 2022 at 13:20
a = sorted(set(f(arr))); for i in range(max(a)): if i not in a: return i — Jason Leaver
– Jason Leaver, Commented Sep 20, 2022 at 1:47
(The minimum number is not specified to be 1.) What is the advantage of using generator function f() over post_lupy's use of arr.count()? With c (bad naming, btw.) a list and few duplicates, does the number of list elements to check differ significantly? — greybeard
– greybeard, Commented Sep 20, 2022 at 3:41

Stack Exchange Network

Find the missing and duplicated numbers from shuffled range

4 Answers 4

Use counting sort

Validate the input

Make the API more clear

Embrace the spirit of kata

Putting it together

a more performant solution

original solution

performant solution

You must log in to answer this question.

Hot Network Questions

Find the missing and duplicated numbers from shuffled range

4 Answers 4

Use counting sort

Validate the input

Make the API more clear

Embrace the spirit of kata

Putting it together

a more performant solution

original solution

performant solution

You must log in to answer this question.

Related

Hot Network Questions