I want to sort a 20GB binary file that contains 30-byte keys and 20-byte values stored contiguously. Everything is on a single line. I would like to specify both the record size and the key length the sort must use for comparison, so that when a key is moved, the value associated with it is moved too.
Ideally, I would not like to modify the file in any way (i.e., add separators between key and value). The file looks like KVKVKVKVKVKV: a single-line binary file.
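To make the requirement concrete, here is a minimal Python sketch of the behaviour I am after. It is only an illustration under assumptions: the names `sort_records`, `read_records` and `chunk_records` are made up for this example, keys are compared as raw unsigned bytes, and the sorted result is written to a separate file so the input stays untouched. It sorts chunks in memory and then merges the sorted runs from temporary files.

```python
import heapq
import tempfile

KEY_LEN = 30      # bytes used for comparison
RECORD_LEN = 50   # 30-byte key + 20-byte value, always moved together


def read_records(f, record_len=RECORD_LEN):
    """Yield consecutive fixed-size records from an open binary file."""
    while True:
        rec = f.read(record_len)
        if not rec:
            return
        if len(rec) != record_len:
            raise ValueError("file size is not a multiple of the record size")
        yield rec


def sort_records(src, dst, key_len=KEY_LEN, record_len=RECORD_LEN,
                 chunk_records=1_000_000):
    """Sort fixed-size records by their leading key bytes (external merge sort)."""
    def key(rec):
        return rec[:key_len]   # bytes compare lexicographically, byte by byte

    runs = []
    try:
        with open(src, "rb") as f:
            while True:
                data = f.read(record_len * chunk_records)
                if not data:
                    break
                if len(data) % record_len:
                    raise ValueError("file size is not a multiple of the record size")
                chunk = [data[i:i + record_len]
                         for i in range(0, len(data), record_len)]
                chunk.sort(key=key)
                run = tempfile.TemporaryFile()   # one sorted run per chunk
                run.write(b"".join(chunk))
                run.seek(0)
                runs.append(run)
        with open(dst, "wb") as out:
            merged = heapq.merge(*(read_records(r, record_len) for r in runs),
                                 key=key)
            for rec in merged:
                out.write(rec)
    finally:
        for run in runs:
            run.close()
```

A call would look like `sort_records("20gbUnsorted", "20gbSorted")`. Each chunk becomes one open temporary file, so `chunk_records` has to be large enough to keep the number of runs below the open-file limit, and the temporary directory needs roughly as much free space as the input.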
Hexdump of first 200B of the 20GB file:
# hexdump -n 200 -C 20gbUnsorted
00000000 54 65 73 74 69 6E 67 31 32 33 65 08 00 60 83 6b |Testing123e..`.k|
00000010 39 2c d5 8b 8f 5e 55 96 18 55 e7 9b 87 f0 22 83 |9,...^U..U....".|
00000020 a4 66 b6 aa b1 f9 e0 ca cf 1e 26 b3 29 2a fd 10 |.f........&.)*..|
00000030 64 bb 18 b5 6a c0 7d 6f 65 6b 1d 2f 43 0d 57 bd |d...j.}oek./C.W.|
00000040 e7 e4 7d 81 f3 6a 6d d2 67 94 8b bc 23 97 bf e2 |..}..jm.g...#...|
00000050 8c 33 4e 4a d8 2b 8e 70 16 62 93 cf aa 01 16 bf |.3NJ.+.p.b......|
00000060 da 3b b1 ab 95 e0 e4 82 62 b3 ed fe 04 47 b5 7f |.;......b....G..|
00000070 77 b1 3a 35 87 fb e7 90 42 e3 c4 06 d6 8e 9f d2 |w.:5....B.......|
00000080 c7 f3 f6 39 0d 9d 0d ce 13 fb 83 42 e1 52 81 2e |...9.......B.R..|
00000090 99 4b 4b 40 3a 16 7a 2a 7c 93 c3 84 1d e1 93 0a |.KK@:.z*|.......|
000000a0 0d b2 07 f4 eb 9e 04 b5 9e d8 77 d9 a1 a0 67 a1 |..........w...g.|
000000b0 01 fa 8d 8d 4c 04 5b ee a3 00 6f b4 20 50 a4 e6 |....L.[...o. P..|
000000c0 5b b3 cc 40 83 eb b2 ad |[..@....|
000000c8
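As a sanity check on the layout, the first 100 bytes of the dump above can be split into two 50-byte records. This is just a sketch in Python; the variable names are mine, and the bytes are copied from the hexdump.

```python
KEY_LEN, VALUE_LEN = 30, 20
RECORD_LEN = KEY_LEN + VALUE_LEN

# First 100 bytes of the file, copied from the hexdump (offsets 0x00-0x63).
head = bytes.fromhex(
    "54 65 73 74 69 6e 67 31 32 33 65 08 00 60 83 6b "
    "39 2c d5 8b 8f 5e 55 96 18 55 e7 9b 87 f0 22 83 "
    "a4 66 b6 aa b1 f9 e0 ca cf 1e 26 b3 29 2a fd 10 "
    "64 bb 18 b5 6a c0 7d 6f 65 6b 1d 2f 43 0d 57 bd "
    "e7 e4 7d 81 f3 6a 6d d2 67 94 8b bc 23 97 bf e2 "
    "8c 33 4e 4a d8 2b 8e 70 16 62 93 cf aa 01 16 bf "
    "da 3b b1 ab"
)

# Cut the byte string into fixed-size records, then into key and value.
records = [head[i:i + RECORD_LEN] for i in range(0, len(head), RECORD_LEN)]
for rec in records:
    key, value = rec[:KEY_LEN], rec[KEY_LEN:]
    print(f"key {key[0]:02x}..{key[-1]:02x}  value {value[0]:02x}..{value[-1]:02x}")

# Sorting on the key only: whole records move, values stay attached.
in_order = sorted(records, key=lambda rec: rec[:KEY_LEN])
```

The first key starts with 0x54 and the second with 0x18, so with a plain byte-wise comparison the second record (key and value together) ends up before the first.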
I am using Linux.
sort is not the right tool to use here. I am just exploring whether one can use it to sort this kind of file.
54 to F0, the first value from 22 to BB; the second value from 18 to E2 and it needs to be sorted before the first one (along with its value field), because 0x18 < 0xE2?
hexdump you're going to get a little over a 3.5x increase in size. That's a 70GB stream of data to pipe into sort.
--parallel option but not sure if I can use that with the solutions suggested below.