Update 2
No, that is not a bug, that is exactly the intended behavior.
Like I already wrote, I want to split the overlapping ranges into discrete non-overlapping ranges. If two ranges overlap, then one range is completely inside the other.
Again, say the two ranges are A and B, with B coming after A; in every such case B is a sub-range of A. If A and B carry different data, they need to be split into discrete ranges, meaning the portion of A covered by B is deleted so that the numbers inside B only get the data from B.
I did write that they are number ranges, right? The rule is very simple: each range assigns its attribute to all the numbers it covers, and, like Python assignment, a later assignment always overwrites whatever was assigned before. This means the sub-range always wins, and each number ends up with exactly one value.
Here is a simple example to show what I mean.
Data:
[(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D')]
What the data actually means:
[
{0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A', 7: 'A', 8: 'A', 9: 'A', 10: 'A'},
{0: 'B', 1: 'B'},
{2: 'C', 3: 'C', 4: 'C', 5: 'C'},
{3: 'C', 4: 'C'},
{6: 'C', 7: 'C'},
{8: 'D'}
]
My intended process is equivalent to simply processing the items in order and updating the dictionary with each one.
Processed data:
{0: 'B', 1: 'B', 2: 'C', 3: 'C', 4: 'C', 5: 'C', 6: 'C', 7: 'C', 8: 'D', 9: 'A', 10: 'A'}
Output:
[(0, 1, 'B'), (2, 7, 'C'), (8, 8, 'D'), (9, 10, 'A')]
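Just to pin down the semantics, here is a minimal dict-based sketch of the whole transformation (the name split_ranges_naive is made up for illustration):

def split_ranges_naive(ranges):
    # Brute-force reference: expand every range into a dict so that
    # later assignments overwrite earlier ones, then compress the
    # dict back into non-overlapping ranges.
    values = {}
    for start, end, data in ranges:          # ranges are inclusive on both ends
        for n in range(start, end + 1):
            values[n] = data                 # later ranges overwrite, like assignment

    result = []
    for n in sorted(values):
        if result and n == result[-1][1] + 1 and values[n] == result[-1][2]:
            result[-1] = (result[-1][0], n, values[n])   # extend the current range
        else:
            result.append((n, n, values[n]))             # start a new range
    return result

data = [(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D')]
print(split_ranges_naive(data))
# [(0, 1, 'B'), (2, 7, 'C'), (8, 8, 'D'), (9, 10, 'A')]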
And I know I can just expand all the ranges and merge them one by one, essentially what the sketch above does; that is fool-proof and guaranteed to work, but it is also incredibly stupid and terribly inefficient. As you can see, the numbers in the data are huge, and with millions of rows I really have no idea how long it would take.
I wrote a completely working first version, the first smart approach I came up with; it is included in the question I linked.
But it was still inefficient. This is my second attempt at a smart solution: it does everything in one for loop, in a single pass, and is therefore extremely efficient. But it isn't completely correct, and it is ugly.
However, the output of this algorithm on every subset of the gigantic dataset that I have tried so far has been correct, though I have always started from the first item.
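For concreteness, below is a minimal sketch of one way such a single-pass, stack-based split could look. This is not my actual second version, just an illustration of the idea; it assumes the input is sorted by start and that any overlap is strict containment, as described above (split_ranges_one_pass is a made-up name):

def split_ranges_one_pass(ranges):
    # One pass over the input with a stack of currently open ranges.
    # Assumes: sorted by start; a later overlapping range is always
    # nested inside an earlier one.
    output = []

    def emit(start, end, data):
        # Append a finished piece, merging with the previous piece
        # when they are adjacent and carry the same data.
        if output and output[-1][1] + 1 == start and output[-1][2] == data:
            output[-1] = (output[-1][0], end, data)
        else:
            output.append((start, end, data))

    stack = []  # (next_unemitted, end, data), nested bottom to top
    for s, e, d in ranges:
        # Close every open range that ends before this one starts.
        while stack and stack[-1][1] < s:
            cur, end, data = stack.pop()
            if cur <= end:
                emit(cur, end, data)
        if stack:
            cur, end, data = stack[-1]
            if cur < s:                      # parent's part left of the child is final
                emit(cur, s - 1, data)
            stack[-1] = (e + 1, end, data)   # parent resumes after the child
        stack.append((s, e, d))
    while stack:                             # flush whatever is still open
        cur, end, data = stack.pop()
        if cur <= end:
            emit(cur, end, data)
    return output

On the example above this returns [(0, 1, 'B'), (2, 7, 'C'), (8, 8, 'D'), (9, 10, 'A')], and it never materializes individual numbers, so the work stays proportional to the number of ranges rather than their widths.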