
I have a csv file with 12 million lines in the following format:

mcu_i,INIT,200,iFlash,  11593925,     88347,,0x00092684,r,0x4606b570,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  11593931,     88348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,

I want to remove lines based on the value of the 6th column with the following logic: if (value >= X AND value <= Y ) => remove lines

I found a solution using gawk:

gawk -i inplace -F ',' -v s="$start_marker" -v e="$end_marker" '!($6 <= e && $6 >= s)' myfile.csv

but it takes way too long, I would like another solution with better performance.

Thank you


5 Answers


TL;DR

Redirecting your gawk command's stdout to /dev/null, or piping it to cat, will reduce its runtime significantly.

gawk -i inplace [...] myfile.csv >/dev/null

Or:

gawk -i inplace [...] myfile.csv | cat

Diving down

Though @RomeoNinov's answer would indeed run faster than your original command, I would like to explain why it's faster, and show that my solution will run just as fast even with -i inplace.

If you look at the Interactive Versus Noninteractive Buffering section in gawk info pages, you'll see that:

Interactive programs generally line buffer their output (i.e., they write out every line). Noninteractive programs wait until they have a full buffer, which may be many lines of output.

This turns out to apply not only when gawk prints the result to stdout, but also when it writes the result back to the file with -i inplace.

Example

I have a file with 10 lines.

$ cat somefile
1
2
3
4
5
6
7
8
9
10

By default (without making any changes to the file, just printing back all the lines as-is), strace shows that gawk makes 10 write system calls: one for each line of the original file.

$ strace -e trace=write -c gawk -i inplace 1 somefile 
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000098           9        10           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.000098           9        10           total

That's because it's an interactive run, and the result is line buffered (gawk prints each line as soon as it finishes with it, even though the result is being written to a file and not to stdout).

Now, if I redirect stdout to /dev/null (or pipe the command to cat) to make the run noninteractive, strace shows that gawk makes only a single write system call. That's because it doesn't print every line immediately, but rather flushes the result only once the buffer is full.

$ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000020          20         1           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.000020          20         1           total

This adds up, of course: the bigger your input file, the larger the difference between interactive and noninteractive runs will be.

Summary

Your command is slow because, in interactive mode, gawk writes every line to the file as soon as it finishes processing it. For a 12-million-line file, that means millions of write system calls.

@RomeoNinov's solution is faster than your original command because instead of using inplace, it redirects the output to a temporary file. It therefore runs in non-interactive mode, which buffers the output and makes gawk perform far fewer write operations on the file.

However, you can keep the command from your question as-is: just redirect its stdout to /dev/null (it's empty anyway) or pipe it to cat, and it will run just as fast.

Security implications of using gawk with inplace

While I don't fully agree with @RomeoNinov's comment that inplace operations might lead to unpredictable results, please see @OlivierDulac's comment, which links a useful answer explaining why using -i inplace is usually considered a security vulnerability and how to work around it safely.

  • I tried seq 1000000 > file1m then tried printing all lines and "modifying" the input file with the output. time { awk -i inplace '1' file1m; } output real 0m11.431s user 0m0.437s sys 0m10.828s while time { awk '1' file1m > tmp && mv tmp file1m; } was an order of magnitude faster at real 0m0.758s user 0m0.264s sys 0m0.263s, i.e. less than 1 second with output to a temp file vs 11 seconds with -i inplace.
    – Ed Morton
    Commented Feb 28, 2024 at 18:26
  • FWIW I didn't see a time difference like that with sed -i -n 'p' file1m vs sed -n 'p' file1m > tmp && mv tmp file1m;
    – Ed Morton
    Commented Feb 28, 2024 at 18:33
  • I contacted the gawk providers about this, see lists.gnu.org/archive/html/bug-gawk/2024-02/msg00037.html.
    – Ed Morton
    Commented Feb 29, 2024 at 13:10

One possible way (via rewriting your command) is:

gawk  -F, -v s="$start_marker" -v e="$end_marker" '$6 > e || $6 < s'  myfile.csv >/tmp/newfile

Using in-place operations in awk is not recommended practice: it has security implications. Moreover, you can mess up the source file(s) before being 100% sure the script is correct.
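A complete sketch of that approach, following the safe temp-file pattern suggested in the comments (the sample rows and marker values are taken from the question; any POSIX awk will do):

```shell
# Build a two-row sample like the question's input (trimmed to the first
# 11 columns for brevity; leading spaces in field 6 are kept).
printf '%s\n' \
  'mcu_i,INIT,200,iFlash,  11593925,     88347,,0x00092684,r,0x4606b570,   ok' \
  'mcu_i,INIT,200,iFlash,  11593931,     88348,,0x00092678,r,0x28003801,   ok' > myfile.csv

s=88347 e=88347                      # remove rows whose 6th field is in [s, e]
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT             # clean up the temp file on any exit
awk -F ',' -v s="$s" -v e="$e" '!($6 <= e && $6 >= s)' myfile.csv > "$tmp" &&
mv -- "$tmp" myfile.csv              # only replace the original on success
```

The comparison still works despite the leading spaces because awk treats fields that look numeric as numbers. After the run, only the 88348 row remains.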

  • It's better if you show a complete solution, ie using some random file name instead of /tmp/newfile (to avoid overriding it if it exists) for instance maybe myfile.csv.$$, and renaming it at the end to reach the same result as inplace, for instance by adding && mv myfile.csv.$$ myfile.csv.
    – aviro
    Commented Feb 28, 2024 at 15:37
  • @aviro, using a "random" filename may fill the target directory if the program stops unexpectedly. A static filename saves disk space and simplifies investigation. So every approach has its own pros and cons :) Commented Feb 28, 2024 at 15:48
  • && mv myfile.csv.$$ myfile.csv || rm myfile.csv.$$
    – aviro
    Commented Feb 28, 2024 at 15:52
  • tmp=$(mktemp) && trap 'rm -f "$tmp"; exit' EXIT && awk '...' file > "$tmp" && mv -- "$tmp" file.
    – Ed Morton
    Commented Feb 29, 2024 at 13:17
  • @EdMorton, updated. And you are absolutely right about the negative logic. Commented Feb 29, 2024 at 13:31

If it does not have to be awk, then you might try Perl instead:

#!/usr/bin/perl
use 5.18.2;
use warnings;
use strict;

my ($X, $Y) = (88347, 88347);
while (<>) {
    next
        if (/(?:^[^,]*,){5}\s*([^,]+)/ && $1 >= $X && $1 <= $Y);
    print;
}

The regular expression skips the first five comma-separated fields of a line, then skips any leading whitespace, capturing the rest of the 6th field into $1. If the condition matches, the line is skipped; otherwise it's printed.

For the example, it would output the line with value 88348.

Use it as perl your_script input_file(s) > output_file. Obviously, the input and output file names must be different!
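For reference, the same filter can be squeezed into a one-liner sketch: -a with -F, splits each line on commas into @F instead of using the regex, and the X/Y bounds below are the 88347 values hardcoded in the script above.

```shell
# Sample input mirroring the question's rows (trimmed for brevity).
printf '%s\n' \
  'mcu_i,INIT,200,iFlash,  11593925,     88347,,0x00092684,r,0x4606b570,   ok' \
  'mcu_i,INIT,200,iFlash,  11593931,     88348,,0x00092678,r,0x28003801,   ok' > input.csv

# Skip lines whose trimmed 6th field lies in [88347, 88347]; print the rest.
perl -F, -ane 'my $v = $F[5]; $v =~ s/^\s+//;
               print unless $v >= 88347 && $v <= 88347;' input.csv > output.csv
```

As with the script, this writes only the 88348 line to output.csv.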


If you can use tools other than awk, duckdb is very quick and convenient.

If your input is

mcu_i,INIT,200,iFlash,  11593925,     88347,,0x00092684,r,0x4606b570,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  11593931,     88348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  10593931,     88348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  21593931,     98348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  31593931,     108348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,

you can run

duckdb --csv -c "SELECT * from read_csv_auto('input.csv',HEADER = false,all_varchar=true) 
where NOT
TRY_CAST(TRIM(column05) AS INTEGER) >88348 AND
TRY_CAST(TRIM(column05) AS INTEGER) < 108348;" | tail -n +2 | sed 's/"//g' > output.txt

to get

mcu_i,INIT,200,iFlash,  11593925,     88347,,0x00092684,r,0x4606b570,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  11593931,     88348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  10593931,     88348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,

I have constructed the command in such a way as to leave the strange spaces in your input.

  • duckdb looks awesome, thanks. I'm definitely going to try it.
    – aviro
    Commented Feb 29, 2024 at 23:11
  • Have you tried to compare its performance to awk, at least in this case?
    – aviro
    Commented Feb 29, 2024 at 23:12

Using Raku (formerly known as Perl_6)

~$ raku -MText::CSV -e 'my @rows; my $csv = Text::CSV.new( sep => ",");  \
                        while ($csv.getline($*IN)) -> $row { @rows.push: $row.map(*.trim) if 88000 < $row.[5] < 98000; };  \
                        .join(",").put for @rows;'  <  ~/raphui_771255.csv

Raku is a programming language in the Perl-family. It features high-level support for Unicode as well as a powerful Regex engine.

Above answer uses Raku's Text::CSV module. The Perl(5) module Text::CSV_XS is well-regarded, and a longtime author/maintainer of that module has gone on to develop Raku's Text::CSV module (H. Merijn Brand, personal communication).

Sample Input (thanks to @aborruso!):

mcu_i,INIT,200,iFlash,  11593925,     88347,,0x00092684,r,0x4606b570,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  11593931,     88348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  10593931,     88348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  21593931,     98348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  31593931,     108348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,

Sample Output:

mcu_i,INIT,200,iFlash,11593925,88347,,0x00092684,r,0x4606b570,ok,,32,single,op-c,0,,0,0,0,
mcu_i,INIT,200,iFlash,11593931,88348,,0x00092678,r,0x28003801,ok,,32,single,op-c,0,,0,0,0,
mcu_i,INIT,200,iFlash,10593931,88348,,0x00092678,r,0x28003801,ok,,32,single,op-c,0,,0,0,0,

Note: Raku allows "chained" inequalities. Also, instead of hardcoding the values, Raku has a special associative array %*ENV that can be used to access shell variables. The command below takes the shell variables startMarker and stopMarker from the environment (i.e. the shell). It uses the high-level csv( …, out => $*OUT) function for output, so whitespace-containing strings are automatically quoted (in your case, drop the .trim call as well):

~$ env startMarker="88000" stopMarker="89000"     \
   raku -MText::CSV -e 'my $start = %*ENV<startMarker>; my $stop =  %*ENV<stopMarker>;     \
                        my @rows; my $csv = Text::CSV.new( sep => ",");    \
                        while ($csv.getline($*IN)) -> $row { @rows.push: $row if $start < $row.[5] < $stop; };    \
                        csv(in => @rows, out => $*OUT);'  <  ~/raphui_771255.csv
mcu_i,INIT,200,iFlash,"  11593925","     88347",,0x00092684,r,0x4606b570,"   ok",,"         32",single,op-c,0,,"         0","         0","         0",
mcu_i,INIT,200,iFlash,"  11593931","     88348",,0x00092678,r,0x28003801,"   ok",,"         32",single,op-c,0,,"         0","         0","         0",
mcu_i,INIT,200,iFlash,"  10593931","     88348",,0x00092678,r,0x28003801,"   ok",,"         32",single,op-c,0,,"         0","         0","         0",

https://raku.land/zef:Tux/Text::CSV
https://github.com/Tux/CSV/blob/master/doc/Text-CSV.md
https://docs.raku.org
https://raku.org
