This is related to my previous question, this is the core logic of that script that I wanted you to review.
Basically, I have two very large lists collectively containing literally over one million items, each item can be simplified to a triplet, in which the first and second element are integers, and the third element is some arbitrary data.
In each triplet the second integer is always no less than the first integer, the two integers represent integer ranges that includes both ends. Basically I have two extremely big lists containing integer ranges with some data associated with each range.
And here is the problem I wanted to address: the ranges often overlap, and sub-ranges can also have sub-ranges. So I intended to split the overlapping ranges into discrete ranges.
And there are two kinds of problematic situations:
1, the sub-ranges can have the same data as their parent ranges, this causes unnecessary nesting so I wanted the sub-ranges to be ignored and only the parent ranges need to be processed in this situation.
2, if the start of a range is the end of the previous range plus one, they sometimes have the same data, in this situation they need to be merged.
I intend to split the ranges into discrete ranges, so that any gap is filled with data from the immediate parent range, and no adjacent ranges share the same data.
My previous post didn't get reviewed, the code was too long and more importantly the solution was very inefficient.
I had spent many hours today trying to find a better solution, and I have found it, I have achieved doing everything I mentioned in one for loop, so that the code is much more efficient. But the code is ugly so I wanted it to be reviewed.
Code
class Merging_List:
def __init__(self):
self.rows = []
self.last = [False]*3
def append(self, row):
if row[0] != self.last[1] + 1 or row[2] != self.last[2]:
self.rows.append(row)
self.last = row
else:
self.last[1] = row[1]
def process(rows):
last = rows[0]
pos = last[0]
stack = [last]
processed = Merging_List()
for row in rows[1:]:
if stack and row[2] == stack[-1][2]:
continue
if last[0] > pos and stack:
processed.append([pos, last[0] - 1, stack[-1][2]])
if row[0] > last[1]:
processed.append(last)
pos = last[1] + 1
elif last not in stack:
stack.append(last)
if stack and row[0] > stack[-1][1]:
if pos < stack[-1][1]:
processed.append([pos, stack[-1][1], stack[-1][2]])
pos = stack[-1][1] + 1
stack.pop(-1)
if row[0] >= last[1] or row[2] != last[2]:
last = row
if stack and last[2] != stack[-1][2]:
if last[0] > pos:
processed.append([pos, last[0] - 1, stack[-1][2]])
processed.append(last)
pos = last[1] + 1
while stack:
row = stack.pop(-1)
if pos < row[1]:
processed.append([pos, row[1], row[2]])
pos = row[1] + 1
return processed.rows
I haven't tested it on the full dataset, so I am not sure if it is completely working, but from what I have tested so far the code seems to be working perfectly.
The ranges are always sorted in such a way that ranges with lower indexes have smaller starts, and if two adjacent ranges share the same start, the range with larger end is ordered first. And I have verified that if two ranges overlap, then the smaller range is always completely contained within the larger range, with no exceptions, in other words if the start of range A is no greater than range B, the end of range A will never be greater than range B.
So for two adjacent ranges A and B, B comes after A, if the start of B is larger than the end of A, it is guaranteed that they don't overlap, else B must be a sub-range of A.
Test data
rows = [
[16777216, 33554431, 0], [16777216, 16777471, 1], [16777472, 16777727, 2], [16777728, 16778239, 2],
[16778240, 16779263, 3], [16778496, 16778751, 3], [16779264, 16781311, 2], [16781312, 16785407, 4],
[16781312, 16781567, 5], [16785408, 16793599, 2], [16785408, 16785663, 6], [16793600, 16809983, 7],
[16809984, 16842751, 8], [16809984, 16826367, 8], [16809984, 16818175, 8], [16809984, 16810239, 8],
[16810240, 16810495, 8], [16810496, 16811007, 8], [16811008, 16811263, 8], [16811264, 16811519, 8],
[16812032, 16812287, 8], [16812288, 16812543, 8], [16812544, 16812799, 8], [16812800, 16813055, 8],
[16813312, 16813567, 8], [16814080, 16818175, 8], [16818176, 16826367, 8], [16818176, 16819199, 8],
[16819200, 16819455, 8], [16819456, 16819711, 8], [16819712, 16819967, 8], [16819968, 16820223, 8],
[16820224, 16820479, 8], [16820480, 16820735, 8], [16820736, 16820991, 8], [16820992, 16821247, 8],
[16821248, 16822271, 8], [16822272, 16822527, 8], [16822528, 16822783, 8], [16822784, 16823039, 8],
[16823040, 16823295, 8], [16823808, 16824063, 8], [16825600, 16825855, 8], [16825856, 16826111, 8],
[16826112, 16826367, 8], [16826368, 16842751, 8], [16826368, 16834559, 8], [16826368, 16826623, 8],
[16826624, 16826879, 8], [16826880, 16827135, 8], [16827136, 16827391, 8], [16827392, 16827647, 8],
[16827648, 16827903, 8], [16827904, 16828159, 8], [16828160, 16828415, 8], [16828416, 16828671, 8],
[16828672, 16828927, 8], [16828928, 16829439, 8], [16829440, 16829695, 8], [16829696, 16829951, 8],
[16829952, 16830207, 8], [16830208, 16830463, 8], [16830464, 16830975, 8], [16830976, 16831487, 8]
]
Test result
[[16777216, 16777471, 1],
[16777472, 16778239, 2],
[16778240, 16779263, 3],
[16779264, 16781311, 2],
[16781312, 16781567, 5],
[16781568, 16785407, 4],
[16785408, 16785663, 6],
[16785664, 16793599, 2],
[16793600, 16809983, 7],
[16809984, 16842751, 8],
[16842752, 33554431, 0]]
I have verified the result to be indeed correct.
I know my code is ugly and this is exactly why I want it to be reviewed, it is just prototype code. The function is way too long and not elegant, but it does get the job done. I want the function to be split into smaller functions, but I don't know where to start, I need everything to be done in exactly one loop, because I literally have millions of rows to process.
So what is a better method to achieve exactly the same thing I have done, but with smaller function size, more concise and elegant code?