Update 2
No, that is not a bug, that is exactly the intended behavior.
Like I already wrote, I want to split the overlapping ranges into discrete non-overlapping ranges. If two ranges overlap, then one range is completely inside the other.
Again, say the two ranges are A and B, with B coming after A; in every such case B is a sub-range of A. If A and B carry different data, they need to be split into discrete ranges, meaning the portion of A covered by B is deleted so that the numbers inside B only get the data from B.
I did write that they are number ranges, right? The rule is very simple: each range assigns its attribute to all the numbers it covers, and, like Python assignment, a later assignment always overwrites whatever was assigned before. This means the sub-range always wins, and each number ends up with exactly one value.
Here is a simple example to show what I mean.
Data:
[(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D')]
What the data actually means:
[
{0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A', 7: 'A', 8: 'A', 9: 'A', 10: 'A'},
{0: 'B', 1: 'B'},
{2: 'C', 3: 'C', 4: 'C', 5: 'C'},
{3: 'C', 4: 'C'},
{6: 'C', 7: 'C'},
{8: 'D'}
]
My intended process is equivalent to simply processing the items in order and updating the dictionary with each one.
Processed data:
{0: 'B', 1: 'B', 2: 'C', 3: 'C', 4: 'C', 5: 'C', 6: 'C', 7: 'C', 8: 'D', 9: 'A', 10: 'A'}
Output:
[(0, 1, 'B'), (2, 7, 'C'), (8, 8, 'D'), (9, 10, 'A')]
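Just to pin down the semantics, here is a minimal dict-based sketch of the whole transformation (the name split_ranges_naive is made up for illustration):

def split_ranges_naive(ranges):
    # Brute-force reference: expand every range into a dict so that
    # later assignments overwrite earlier ones, then compress the
    # dict back into non-overlapping ranges.
    values = {}
    for start, end, data in ranges:          # ranges are inclusive on both ends
        for n in range(start, end + 1):
            values[n] = data                 # later ranges overwrite, like assignment

    result = []
    for n in sorted(values):
        if result and n == result[-1][1] + 1 and values[n] == result[-1][2]:
            result[-1] = (result[-1][0], n, values[n])   # extend the current range
        else:
            result.append((n, n, values[n]))             # start a new range
    return result

data = [(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D')]
print(split_ranges_naive(data))
# [(0, 1, 'B'), (2, 7, 'C'), (8, 8, 'D'), (9, 10, 'A')]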
And I know I can just expand all the ranges and merge them one by one, essentially what the sketch above does; that is fool-proof and guaranteed to work, but it is also incredibly stupid and terribly inefficient. As you can see, the numbers in the data are huge, and with millions of rows I really have no idea how long it would take.
I wrote a completely working first version, the first smart approach I came up with; it is included in the question I linked.
But it was still inefficient. This is my second attempt at a smart solution: it does everything in one for loop, in a single pass, and is therefore extremely efficient. But it isn't completely correct, and it is ugly.
However, the output of this algorithm on every subset of the gigantic dataset that I have tried so far has been correct, though I have always started from the first item.
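For concreteness, below is a minimal sketch of one way such a single-pass, stack-based split could look. This is not my actual second version, just an illustration of the idea; it assumes the input is sorted by start and that any overlap is strict containment, as described above (split_ranges_one_pass is a made-up name):

def split_ranges_one_pass(ranges):
    # One pass over the input with a stack of currently open ranges.
    # Assumes: sorted by start; a later overlapping range is always
    # nested inside an earlier one.
    output = []

    def emit(start, end, data):
        # Append a finished piece, merging with the previous piece
        # when they are adjacent and carry the same data.
        if output and output[-1][1] + 1 == start and output[-1][2] == data:
            output[-1] = (output[-1][0], end, data)
        else:
            output.append((start, end, data))

    stack = []  # (next_unemitted, end, data), nested bottom to top
    for s, e, d in ranges:
        # Close every open range that ends before this one starts.
        while stack and stack[-1][1] < s:
            cur, end, data = stack.pop()
            if cur <= end:
                emit(cur, end, data)
        if stack:
            cur, end, data = stack[-1]
            if cur < s:                      # parent's part left of the child is final
                emit(cur, s - 1, data)
            stack[-1] = (e + 1, end, data)   # parent resumes after the child
        stack.append((s, e, d))
    while stack:                             # flush whatever is still open
        cur, end, data = stack.pop()
        if cur <= end:
            emit(cur, end, data)
    return output

On the example above this returns [(0, 1, 'B'), (2, 7, 'C'), (8, 8, 'D'), (9, 10, 'A')], and it never materializes individual numbers, so the work stays proportional to the number of ranges rather than their widths.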