4
\$\begingroup\$

I have just written a small application to aggregate the lines of a text file, for example to group the lines according to the frequency of IP addresses in a log file.

Would the code be sufficiently commented (and self-explanatory) or is something still missing?

I would appreciate any comments from you.

Example call

javac Main.java and java Main input.txt ".*? (\d+) .*"

with an input.txt file:

g 23 foo
a 234 bar
b 234 baz
c 123 qux
d 32 quux
e 234 corge
f 32 grault

will print

0003 - 234 - a 234 bar
0002 - 32 - d 32 quux
0001 - 23 - g 23 foo
0001 - 123 - c 123 qux

Code

import java.io.BufferedReader;
import java.io.FileReader;
import java.nio.charset.Charset;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
  /**
   * Main method to aggregate and count the occurrences of a regex pattern in a text file.
   *
   * @param args input text file, aggregate regex, and optional ignore regex.
   * @throws Exception if the input file is not found or cannot be read.
   */
  public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.out.println(
          "Usage: java -jar <jar file> <input text file> <aggregate regex> (<ignore regex>)");
      System.out.println(
          "Example: java -jar <jar file> \"input.txt\" \".*? (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) .*\"");
      System.out.println(" to aggregate IP addresses from the input file.");
      throw new IllegalArgumentException("Invalid number of arguments");
    }
    Pattern aggregatePattern = Pattern.compile(args[1]);
    Pattern ignorePattern = args.length > 2 ? Pattern.compile(args[2]) : Pattern.compile("(?!x)x");

    LinkedHashMap<String, Map.Entry<ArrayList<String>, Integer>> map = new LinkedHashMap<>();
    try (BufferedReader reader =
        new BufferedReader(new FileReader(args[0], Charset.defaultCharset()))) {
      String line;
      while ((line = reader.readLine()) != null) {
        Matcher aggregateMatcher = aggregatePattern.matcher(line);
        if (aggregateMatcher.find() && !ignorePattern.matcher(line).find()) {
          String key = aggregateMatcher.group(1);
          map.computeIfAbsent(key, k -> new AbstractMap.SimpleEntry<>(new ArrayList<>(), 0));
          Map.Entry<ArrayList<String>, Integer> entry = map.get(key);
          entry.getKey().add(line);
          entry.setValue(entry.getValue() + 1);
        }
      }
    }

    ArrayList<Map.Entry<String, Map.Entry<ArrayList<String>, Integer>>> list =
        new ArrayList<>(map.entrySet());
    // Sort by count in descending order, if counts are equal, sort by input order (earlier first).
    list.sort((o1, o2) -> o2.getValue().getValue().compareTo(o1.getValue().getValue()));
    for (Map.Entry<String, Map.Entry<ArrayList<String>, Integer>> entry : list) {
      System.out.printf(
          "%04d - %s - %s%n",
          entry.getValue().getValue(), entry.getKey(), entry.getValue().getKey().get(0));
    }
  }
}
\$\endgroup\$
7
  • \$\begingroup\$ Funny you should ask about commentation when I find exactly one comment. \$\endgroup\$
    – greybeard
    Commented Mar 2 at 18:08
  • \$\begingroup\$ I count the usage example as a comment. \$\endgroup\$ Commented Mar 2 at 18:43
  • \$\begingroup\$ (Was tempted to give you that. Am tempted to see the text of "execution trace" statements as executable comments - at times.) \$\endgroup\$
    – greybeard
    Commented Mar 2 at 18:54
  • 1
    \$\begingroup\$ Sorry, @TobiasGrothe , I had a little bit of a hard time reading that. In part because we're using the very generic name main(). It would be helpful to crack argv and then call some appropriately named helper function. Perhaps one that has a /** javadoc */ comment. What I'm looking for is, "what is the contract?", and there's room for the OP code to do a better job of explaining that. There's no Answer posted, yet, so there's still time to edit and improve the Question. \$\endgroup\$
    – J_H
    Commented Mar 2 at 18:57
  • \$\begingroup\$ Thanks for the comments, I have edited the question. \$\endgroup\$ Commented Mar 3 at 4:59

2 Answers 2

7
\$\begingroup\$

The code is understandable but only because you spoiled us with the example in the question. That example should have been in the actual code documentation. I'm trying to read the code as if that example did not exist. The shortness of the code helps a lot, if the program was longer it would be very hard to follow.

LinkedHashMap<String, Map.Entry<ArrayList<String>, Integer>> map = new LinkedHashMap<>();

This is the literal reason why Java does not have a generic Tuple or Pair class. A generic pair does not have a semantic meaning. You need to define your own class that has a name that conveys it's purpose. That you also used the most abstract name for the map doesn't make this data structure any easier to understand. To make this data structure more understandable you can extract the operations that manipulate it into functions and give the parameters names that tell the reader what is being done (and why).

You haven't documented that the regex must have a capturing group and you are not doing error checking on the input parameter to make sure it exists.

Keeping in mind the purpose of the code (simple tool to get the work done, no desire for reusability), I would make this more readable by moving the patterns and maps to be static class fields and split the main method into several sub-methods (parameter parsing, reading, printing result, printing help, etc.).

\$\endgroup\$
4
  • \$\begingroup\$ "You haven't documented that the regex must have a capturing group" - can I check that programmatically, or have I introduce to split the input arguments? (... Apart from the point that this should be documented.) \$\endgroup\$ Commented Mar 3 at 6:34
  • \$\begingroup\$ What immediately comes to my mind is aggregatePattern.match("").groupCount() but that tickles me from exactly the wrong places. Maybe there's a stackoverflow.com question about it... \$\endgroup\$ Commented Mar 3 at 9:10
  • \$\begingroup\$ @TorbenPutkonen I think one should be able to test if the pattern contains a capturing group with the regex /([^?].*)/ ? \$\endgroup\$
    – Cecilia
    Commented Mar 3 at 11:47
  • \$\begingroup\$ As mentioned, Matcher has such a method, but Pattern does not. This means that you would have to check this in the loop for each line, "on the fly". \$\endgroup\$ Commented Mar 3 at 15:43
2
\$\begingroup\$

As @TorbenPutkonen says, use a proper type instead of abusing Map.Entry

I would further make use of the Stream APIs. Don't do a get on your map after calling computeIfAbsent .. you already have the entry.

This is slightly cleaner...

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
  static class LineCounts {
    final ArrayList<String> lines = new ArrayList<>();
    int count;
  }

  /**
   * Main method to aggregate and count the occurrences of a regex pattern in a text file.
   *
   * @param args input text file, aggregate regex, and optional ignore regex.
   * @throws Exception if the input file is not found or cannot be read.
   */
  public static void main(String[] args) throws Exception {

    if (args.length < 2) {
      System.out.println("""
          Usage: java -jar <jar file> <input text file> <aggregate regex> (<ignore regex>)
          Example: java -jar <jar file> "input.txt" ".*? (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) .*"
           to aggregate IP addresses from the input file.""");
      throw new IllegalArgumentException("Invalid number of arguments");
    }
    Pattern aggregatePattern = Pattern.compile(args[1]);
    Pattern ignorePattern = Pattern.compile(args.length > 2 ? args[2] : "(?!x)x");

    LinkedHashMap<String, LineCounts> map = new LinkedHashMap<>();
    Files.lines(Paths.get(args[0]), Charset.defaultCharset()).forEach(line -> {
      Matcher aggregateMatcher = aggregatePattern.matcher(line);
      if (aggregateMatcher.find() && !ignorePattern.matcher(line).find()) {
        String key = aggregateMatcher.group(1);
        LineCounts entry = map.computeIfAbsent(key, k -> new LineCounts());
        entry.lines.add(line);
        entry.count++;
      }
    });

    map.entrySet().stream()
      // Sort by count in descending order, if counts are equal, sort by input order (earlier first).
      .sorted((o1, o2) -> o2.getValue().count - o1.getValue().count)
      .forEach(entry -> {
        System.out.printf("%04d - %s - %s%n", entry.getValue().count, entry.getKey(), entry.getValue().lines.get(0));
      });
  }
}
\$\endgroup\$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.