Funny time with DNA: a \$k\$-mer index data structure in Java, Take II

Question

(See the previous and initial iteration.)

Intro

This time, I decided to pack the genomic data such that 4 nucleotide bases are encoded into a single byte. In other words, A is mapped to binary 00, C to 01, G to 10, and T to 11. For example, the genomic string CATG will be encoded as 10110001: the first base occupies the two least significant bits, so the byte reads from right to left.
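To make the layout concrete, here is a minimal sketch of the packing of a single byte. The class and method names (PackingSketch, code, pack, toBinary) are illustrative assumptions and are not part of the library below.

```java
final class PackingSketch {

    /** Maps a base to its 2-bit code: A=00, C=01, G=10, T=11. */
    static int code(char base) {
        return switch (base) {
            case 'A' -> 0b00;
            case 'C' -> 0b01;
            case 'G' -> 0b10;
            case 'T' -> 0b11;
            default -> throw new IllegalArgumentException(
                    "Invalid nucleotide base: %c".formatted(base));
        };
    }

    /** Packs up to four bases into one byte; base i lands in bits 2i and 2i+1. */
    static byte pack(String bases) {
        if (bases.length() > 4) {
            throw new IllegalArgumentException("At most 4 bases fit in a byte");
        }
        int packed = 0;
        for (int i = 0; i < bases.length(); ++i) {
            packed |= code(bases.charAt(i)) << (2 * i);
        }
        return (byte) packed;
    }

    /** Renders a byte as 8 binary digits, most significant bit first. */
    static String toBinary(byte b) {
        String bits = Integer.toBinaryString(b & 0xFF);
        return "0".repeat(Byte.SIZE - bits.length()) + bits;
    }

    public static void main(String[] args) {
        System.out.println(toBinary(pack("CATG"))); // prints 10110001
    }
}
```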

Code

io.github.coderodde.dna.kmerindex.GenomicSequence.java:

package io.github.coderodde.dna.kmerindex;

import java.util.Arrays;
import java.util.Objects;

/**
 * This class implements a tightly packed genomic sequence over the nucleotide
 * base alphabet <code>A, C, G, T</code>. Each byte contains 4 nucleotide
 * bases.
 * 
 * @version 2.0.0 (Aug 18, 2025)
 * @since 1.0.0 (Aug 17, 2025)
 */
public final class GenomicSequence {
    
    private static final int NUCLEOTIDES_PER_BYTE = 4;
    private static final int BITS_PER_CODE = 2;
    
    private final byte[] data;
    private final int length;
    
    public GenomicSequence(String sequence) {
        Objects.requireNonNull(sequence, "The input sequence is null");
        data = new byte[sequence.length() / NUCLEOTIDES_PER_BYTE +
                       (sequence.length() % NUCLEOTIDES_PER_BYTE != 0 ? 1 : 0)];
        
        length = sequence.length();
        
        for (int i = 0; i < sequence.length(); ++i) {
            char nucleotideBase = sequence.charAt(i);
            checkNucleotideBase(nucleotideBase);
            
            writeToData(nucleotideBase,
                        i % NUCLEOTIDES_PER_BYTE,
                        i / NUCLEOTIDES_PER_BYTE);
        }
    }
    
    public char get(int nucleotideIndex) {
        if (nucleotideIndex < 0) {
            throw new IndexOutOfBoundsException(
                    String.format(
                            "nucleotideIndex(%d) < 0",
                            nucleotideIndex));
        }
        
        if (nucleotideIndex >= length) {
            throw new IndexOutOfBoundsException(
                    String.format(
                            "nucleotideIndex(%d) >= length(%d)\n", 
                            nucleotideIndex, 
                            length));
        }
        
        int byteIndex = nucleotideIndex / NUCLEOTIDES_PER_BYTE;
        nucleotideIndex %= NUCLEOTIDES_PER_BYTE;
        
        byte datum = data[byteIndex];
        datum >>>= nucleotideIndex * BITS_PER_CODE;
        datum &= 0b11;
        
        switch (datum) {
            case 0b00 -> {
                return 'A';
            }
                
            case 0b01 -> {
                return 'C';
            }
                
            case 0b10 -> {
                return 'G';
            }
                
            case 0b11 -> {
                return 'T';    
            }
        }
        
        throw new IllegalStateException("Should not get here");
    }
    
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        
        for (int i = 0; i < length; ++i) {
            sb.append(get(i));
        }
        
        return sb.toString();
    }
    
    /**
     * Extracts a kmer of length {@code k} starting from index 
     * {@code startIndex}.
     * 
     * @param k          the length of the requested <code>k</code>-mer.
     * @param startIndex the starting index of the requested <code>k</code>-mer.
     * @return the requested <code>k</code>-mer.
     */
    public GenomicSequence kmer(int k, int startIndex) {
        checkKmerParams(k, startIndex);
        
        StringBuilder sb = new StringBuilder(k);
        
        for (int i = 0; i < k; ++i) {
            sb.append(get(startIndex + i));
        }
        
        return new GenomicSequence(sb.toString());
    }
    
    @Override
    public int hashCode() {
        return Arrays.hashCode(data);
    }
    
    @Override
    public boolean equals(Object o) {
        if (o == null) {
            return false;
        }
        
        if (o == this) {
            return true;
        }
        
        if (!getClass().equals(o.getClass())) {
            return false;
        }
        
        GenomicSequence other = (GenomicSequence) o;
        
        if (other.length != length) {
            return false;
        }
        
        return Arrays.equals(data, other.data);
    }
    
    public int length() {
        return length;
    }
    
    private void checkKmerParams(int k, int startIndex) {
        if (k < 1) {
            String exceptionMessage = 
                String.format(
                    "The k-parameter is too small (%d). Must be at least 1", 
                    k);
            
            throw new IllegalArgumentException(exceptionMessage);
        }
        
        if (k > length) {
            String exceptionMessage = String.format("k(%d) > length(%d)", 
                                                    k,
                                                    length);
            
            throw new IllegalArgumentException(exceptionMessage);
        }
        
        if (startIndex + k > length) {
            String exceptionMessage = 
                    String.format("startIndex(%d) + k(%d) = %d > length(%d)",
                                  startIndex,
                                  k,
                                  k + startIndex,
                                  length);
            
            throw new IllegalArgumentException(exceptionMessage);
        }
    }
    
    private static void checkNucleotideBase(char candidate) {
        switch (candidate) {
            case 'A', 'C', 'G', 'T' -> {
                return;
            }
        }
        
        String exceptionMessage = 
                String.format(
                        "Invalid nucleotide base: %c\n", 
                        candidate);
        
        throw new IllegalArgumentException(exceptionMessage);
    }
    
    private void writeToData(char nucleotideBase,
                             int nucleotideIndex,
                             int byteIndex) {
        
        int index = nucleotideIndex % NUCLEOTIDES_PER_BYTE;
        byte mask = encodeNucleotideName(nucleotideBase);
        mask <<= index * BITS_PER_CODE;
        data[byteIndex] |= mask;
    }
    
    private static byte encodeNucleotideName(char nucleotide) {
        switch (nucleotide) {
            case 'A' -> {
                return 0b00;
            }
                
            case 'C' -> {
                return 0b01;
            }
                
            case 'G' -> {
                return 0b10;
            }
                
            case 'T' -> {
                return 0b11;
            }
        }
        
        throw new IllegalStateException("Should not get here");
    }
}

io.github.coderodde.dna.kmerindex.DnaKmerIndex.java:

package io.github.coderodde.dna.kmerindex;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/**
 * This class implements the <code>k</code>-mer index data structures that maps
 * each <code>k</code>-mer to the list of indices at which that very 
 * <code>k</code>-mer appears. The alphabet is restricted to the set of 
 * nucleotide bases <code>A, C, G, T</pre>.
 * 
 * @version 2.0.0 (Aug 18, 2025)
 * @since 1.0.0 (Aug 17, 2025)
 */
public final class DnaKmerIndex {
    
    // Package-private since DnaKmerIndexToStringConverter uses this:
    final Map<GenomicSequence, List<Integer>> index = new HashMap<>();
    
    public DnaKmerIndex(GenomicSequence sequence, int k) {
        checkArguments(sequence, k);
        
        for (int i = 0; i < sequence.length() - k + 1; ++i) {
            GenomicSequence kmer = sequence.kmer(k, i);
            
            if (!index.containsKey(kmer)) {
                index.put(kmer, new ArrayList<>());
            }
            
            index.get(kmer).add(i);
        }
    }
    
    public List<Integer> getListOfStartingIndices(GenomicSequence kmer) {
        return !index.containsKey(kmer) ? 
                List.of() : 
                Collections.unmodifiableList(index.get(kmer));
    }
    
    private void checkArguments(GenomicSequence sequence, int k) {
        Objects.requireNonNull(sequence, "The input GenomicSequence is null");
        
        if (sequence.length() == 0) {
            throw new IllegalArgumentException("Empty sequence");
        }
        
        if (k < 1) {
            String exceptionMessage = 
                String.format(
                    "The k-parameter is too small (%d). Must be at least 1", 
                    k);
            
            throw new IllegalArgumentException(exceptionMessage);
        }
        
        if (k > sequence.length()) {
            String exceptionMessage = 
                    String.format("k(%d) > sequence.length(%d)", 
                                  k,
                                  sequence.length());
            
            throw new IllegalArgumentException(exceptionMessage);
        }
    }
}

io.github.coderodde.dna.kmerindex.DnaKmerIndexToStringConverter.java:

package io.github.coderodde.dna.kmerindex;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/**
 * This class is responsible for converting a 
 * {@link io.github.coderodde.dna.kmerindex.DnaKmerIndex} to a neat string that
 * may be arbitrary or sorted.
 * 
 * @version 1.1.0 (Aug 18, 2025)
 * @since 1.0.0 (Aug 17, 2025)
 */
public final class DnaKmerIndexToStringConverter {
    
    private boolean sorted = true;
    private final DnaKmerIndex index;
    
    public DnaKmerIndexToStringConverter(DnaKmerIndex index, boolean sorted) {
        this.index = Objects.requireNonNull(index, "The input index is null");
        setSorted(sorted);
    }
    
    public DnaKmerIndexToStringConverter(DnaKmerIndex index) {
        this(index, true);
    }
    
    public void setSorted(boolean sorted) {
        this.sorted = sorted;
    }
    
    public boolean isSorted() {
        return sorted;
    }
    
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
       
        for (Map.Entry<GenomicSequence, List<Integer>> e :
                index.index.entrySet()) {
            
            sb.append(e.getKey())
              .append(" -> ")
              .append(e.getValue())
              .append('\n');
        }
        
        if (sb.length() > 0) {
            sb.setLength(sb.length() - 1);
        }
        
        if (sorted) {
            String[] lines = sb.toString().split("\n");
            Arrays.sort(lines);
            return String.join("\n", lines);
        }
        
        return sb.toString();
    }
}

io.github.coderodde.dna.kmerindex.RandomGenomicSequenceProvider.java:

package io.github.coderodde.dna.kmerindex;

import java.util.Random;

/**
 * This class provides a facility for computing a random genetic string.
 * 
 * @version 1.0.0 (Aug 16, 2025)
 * @since 1.0.0 (Aug 16, 2025)
 */
public class RandomGenomicSequenceProvider {
    
    private static final char[] NUCLEOTIDE_BASES = { 'A', 'C', 'G', 'T' };
    
    /**
     * Generates a random genomic string.
     * 
     * @param length the length of the genomic string.
     * @param random the random number generator.
     * @return a random genomic string.
     */
    public static String generate(int length, Random random) {
        StringBuilder sb = new StringBuilder(length);
        
        for (int i = 0; i < length; ++i) {
            int nucleotideIndex = random.nextInt(NUCLEOTIDE_BASES.length);
            sb.append(NUCLEOTIDE_BASES[nucleotideIndex]);
        }
        
        return sb.toString();
    }
    
    /**
     * Generates a random genomic string.
     * 
     * @param length the length of the genomic string.
     * @return a random genomic string.
     */
    public static String generate(int length) {
        return generate(length, new Random());
    }
    
    /**
     * Generates a random genomic string.
     * 
     * @param length the length of the genomic string.
     * @param seed   the seed for the random number generator.
     * @return a random genomic string.
     */
    public static String generate(int length, long seed) {
        return generate(length, new Random(seed));
    }
}

io.github.coderodde.dna.kmerindex.demo.Demo.java:

package io.github.coderodde.dna.kmerindex.demo;

import io.github.coderodde.dna.kmerindex.DnaKmerIndex;
import io.github.coderodde.dna.kmerindex.DnaKmerIndexToStringConverter;
import io.github.coderodde.dna.kmerindex.GenomicSequence;
import io.github.coderodde.dna.kmerindex.RandomGenomicSequenceProvider;

/**
 * This class provides the demonstration program for the {@code k}-mer index.
 * 
 * @version 1.0.0 (Aug 16, 2025)
 * @since 1.0.0 (Aug 16, 2025)
 */
public final class Demo {
    
    private static final int LENGTH = 30;
    private static final int K = 2;
    
    public static void main(String[] args) {
        long seed = System.currentTimeMillis();
        
        System.out.println("seed = " + seed);
        
        String genomicString = RandomGenomicSequenceProvider.generate(LENGTH,
                                                                      seed);
        
        System.out.println("Genomic string: " + genomicString);
        
        DnaKmerIndex kmerIndex = 
                new DnaKmerIndex(new GenomicSequence(genomicString), K);
        
        System.out.println(new DnaKmerIndexToStringConverter(kmerIndex, false));
    }
}

io.github.coderodde.dna.kmerindex.DnaKmerIndexTest.java:

package io.github.coderodde.dna.kmerindex;

import java.util.Arrays;
import java.util.List;
import static org.junit.Assert.*;
import org.junit.Test;

public class DnaKmerIndexTest {
    
    @Test
    public void getListOfStartingIndices() {
        GenomicSequence gs = new GenomicSequence("ACGTAC");
        DnaKmerIndex index = new DnaKmerIndex(gs, 2);
        
        List<Integer> list = index.getListOfStartingIndices(gs.kmer(2, 0));
        
        assertEquals(Arrays.asList(0, 4), list);
        
        list = index.getListOfStartingIndices(gs.kmer(2, 1));
        
        assertEquals(Arrays.asList(1), list);
        
        list = index.getListOfStartingIndices(gs.kmer(2, 2));
        
        assertEquals(Arrays.asList(2), list);
        
        list = index.getListOfStartingIndices(new GenomicSequence("CC"));
        
        assertEquals(Arrays.asList(), list);
    }
}

io.github.coderodde.dna.kmerindex.GenomicSequenceTest.java:

package io.github.coderodde.dna.kmerindex;

import java.util.Random;
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class GenomicSequenceTest {
    
    @Test
    public void stressTestToString() {
        Random random = new Random(100L);
        
        for (int length = 1; length < 20; ++length) {
            for (int repeat = 0; repeat < 20; ++repeat) {
                String sequence = 
                        RandomGenomicSequenceProvider.generate(length,
                                                               random);
                
                GenomicSequence gs = new GenomicSequence(sequence);
                assertEquals(sequence, gs.toString());
            }
        }
    }
    
    @Test
    public void stressTestKmer() {
        Random random = new Random(101L);
        
        for (int length = 1; length < 20; ++length) {
            
            String sequenceString =
                    RandomGenomicSequenceProvider.generate(length,
                                                           random);
            
            GenomicSequence sequence = new GenomicSequence(sequenceString);
            
            for (int k = 1; k <= length; ++k) {
                
                for (int startIndex = 0; 
                         startIndex < length - k + 1; 
                         startIndex++) {
                    String kmerString = getKmerString(sequenceString,
                                                      k,
                                                      startIndex);
                    
                    GenomicSequence kmerSequence = sequence.kmer(k, startIndex);
                    assertEquals(kmerString, kmerSequence.toString());
                }
            }
        }
    }
    
    private static String getKmerString(String sequence, 
                                        int k, 
                                        int startIndex) {
        
        StringBuilder sb = new StringBuilder(k);
        
        for (int i = startIndex; i < startIndex + k; ++i) {
            sb.append(sequence.charAt(i));
        }
        
        return sb.toString();
    }
}

Typical demo output

seed = 1755516016804
Genomic string: TCGCTTCTCTCCCAACGTCCGGCGGGCAGG
CA -> [12, 26]
AC -> [14]
CC -> [10, 11, 18]
GC -> [2, 21, 25]
TC -> [0, 5, 7, 9, 17]
AG -> [27]
CG -> [1, 15, 19, 22]
GG -> [20, 23, 24, 28]
CT -> [3, 6, 8]
GT -> [16]
TT -> [4]
AA -> [13]

Critique request

Please tell me anything that comes to mind about how to improve my tiny library.

Chris · Accepted Answer · 2025-12-31 04:21:35Z

Switch expressions

You're using the arrow-style switch, but as a statement and in a very imperative way, with explicit returns in each case.

    public char get(int nucleotideIndex) {
        if (nucleotideIndex < 0) {
            throw new IndexOutOfBoundsException(
                    String.format(
                            "nucleotideIndex(%d) < 0",
                            nucleotideIndex));
        }
        
        if (nucleotideIndex >= length) {
            throw new IndexOutOfBoundsException(
                    String.format(
                            "nucleotideIndex(%d) >= length(%d)\n", 
                            nucleotideIndex, 
                            length));
        }
        
        int byteIndex = nucleotideIndex / NUCLEOTIDES_PER_BYTE;
        nucleotideIndex %= NUCLEOTIDES_PER_BYTE;
        
        byte datum = data[byteIndex];
        datum >>>= nucleotideIndex * BITS_PER_CODE;
        datum &= 0b11;
        
        switch (datum) {
            case 0b00 -> {
                return 'A';
            }
                
            case 0b01 -> {
                return 'C';
            }
                
            case 0b10 -> {
                return 'G';
            }
                
            case 0b11 -> {
                return 'T';    
            }
        }
        
        throw new IllegalStateException("Should not get here");
    }

Instead let the switch expression evaluate to the value you want to return and then return that. You'll need to use default to throw your exception so the switch expression is exhaustive.

    public char get(int nucleotideIndex) {
        if (nucleotideIndex < 0) {
            throw new IndexOutOfBoundsException(
                    String.format(
                            "nucleotideIndex(%d) < 0",
                            nucleotideIndex));
        }
        
        if (nucleotideIndex >= length) {
            throw new IndexOutOfBoundsException(
                    String.format(
                            "nucleotideIndex(%d) >= length(%d)\n", 
                            nucleotideIndex, 
                            length));
        }
        
        int byteIndex = nucleotideIndex / NUCLEOTIDES_PER_BYTE;
        nucleotideIndex %= NUCLEOTIDES_PER_BYTE;
        
        byte datum = data[byteIndex];
        datum >>>= nucleotideIndex * BITS_PER_CODE;
        datum &= 0b11;
        
        return switch (datum) {
            case 0b00 -> 'A';                
            case 0b01 -> 'C';
            case 0b10 -> 'G';
            case 0b11 -> 'T';    
            default -> 
                throw new IllegalStateException("Should not get here");
        };
    }

Reinderien · Accepted Answer · 2025-12-31 01:52:51Z

Nucleotide representation

checkNucleotideBase should not exist. It's reinventing the concept of an enum. Use an enum instead, and in your other classes pass around enum instances rather than characters. The enum could look like

package io.github.coderodde.dna.kmerindex;

public enum NucleotideBase {
    A((byte)0),
    C((byte)1),
    G((byte)2),
    T((byte)3);

    public final byte index;
    private static final NucleotideBase[] values = NucleotideBase.values();

    NucleotideBase(byte index) {
        this.index = index;
    }

    public static NucleotideBase fromIndex(byte index) {
        return values[index];
    }

    public static NucleotideBase valueOf(char c) {
        return NucleotideBase.valueOf(Character.toString(c));
    }
}

This:

    private static final int NUCLEOTIDES_PER_BYTE = 4;
    private static final int BITS_PER_CODE = 2;

should be

    private static final int BITS_PER_CODE = 2;
    private static final int CODES_PER_BYTE = 8/BITS_PER_CODE;

Don't use this magic constant:

datum &= 0b11;

Instead, define a mask as (1 << BITS_PER_CODE) - 1.

This:

byte mask = encodeNucleotideName(nucleotideBase);

is wrong; that isn't a mask. It's a datum, a name you've used elsewhere. And anyway, encodeNucleotideName should be replaced with a method on the enum.
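A minimal sketch of how the derived constants and the mask could fit together when reading and writing a single code. CodeMaskSketch, readCode and writeCode are hypothetical names for illustration, not methods from the original class; unlike the original |=, this writeCode clears the old bits first, so it can also overwrite.

```java
final class CodeMaskSketch {

    static final int BITS_PER_CODE = 2;
    static final int CODES_PER_BYTE = Byte.SIZE / BITS_PER_CODE;
    static final int CODE_MASK = (1 << BITS_PER_CODE) - 1; // 0b11 for 2 bits

    /** Reads the 2-bit code stored at position codeIndex. */
    static int readCode(byte[] data, int codeIndex) {
        int shift = (codeIndex % CODES_PER_BYTE) * BITS_PER_CODE;
        return (data[codeIndex / CODES_PER_BYTE] >>> shift) & CODE_MASK;
    }

    /** Overwrites the 2-bit code stored at position codeIndex. */
    static void writeCode(byte[] data, int codeIndex, int code) {
        int shift = (codeIndex % CODES_PER_BYTE) * BITS_PER_CODE;
        int byteIndex = codeIndex / CODES_PER_BYTE;
        // Clear the two target bits, then set them to the new code.
        int cleared = data[byteIndex] & ~(CODE_MASK << shift);
        data[byteIndex] = (byte) (cleared | ((code & CODE_MASK) << shift));
    }

    public static void main(String[] args) {
        byte[] data = new byte[1];
        writeCode(data, 2, 0b11);
        System.out.println(readCode(data, 2)); // prints 3
    }
}
```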

JUnit Tests

It's odd that Demo gets its own sub-package but the tests don't. I would expect the tests to get their own sub-package.

Your test framework and API usage are (extremely) out of date: org.junit.Assert and org.junit.Test are the JUnit 4 API. In Gradle, an up-to-date dependency block would look like

dependencies {
    testImplementation platform('org.junit:junit-bom:6.0.1')
    testImplementation 'org.junit.jupiter:junit-jupiter:6.0.1'
    testRuntimeOnly 'org.junit.platform:junit-platform-launcher:6.0.1'
}

then your imports will change to

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

Don't splat-import, as in import static org.junit.Assert.*; import only the assertions you actually use.

Versioning and Doxygen

This:

/**
 * @version 1.0.0 (Aug 16, 2025)
 * @since 1.0.0 (Aug 16, 2025)

needs to stop. This is guaranteed to go out-of-date, and will always be less maintainable and detailed than git. Use version control. For numbered releases, use tags. That reduces the boilerplate in your demo comment block from

/**
 * This class provides the demonstration program for the {@code k}-mer index.
 * 
 * @version 1.0.0 (Aug 16, 2025)
 * @since 1.0.0 (Aug 16, 2025)
 */

to, simply,

/// Demonstration program for the k-mer index

Yes, this is still Doxygen-compatible; yes, it still gets picked up by IntelliJ. No, we aren't being paid by the character. Also, I seem to be in the minority on this opinion, but I find writing block comments with a leading * on each line to be awful noise. Yes, Doxygen succeeds when these are removed.

As is so often the case with boilerplate-documented code, the code is actually missing crucial documentation that has been skipped over, for instance:

private final int length;  // number of codes (not bytes)

Likewise, DnaKmerIndex(GenomicSequence sequence, int k) is not documented, especially k.

For the doc block on DnaKmerIndex:

/**
 * This class implements the <code>k</code>-mer index data structures that maps
 * each <code>k</code>-mer to the list of indices at which that very 
 * <code>k</code>-mer appears. The alphabet is restricted to the set of 
 * nucleotide bases <code>A, C, G, T</pre>.
 * 
 * @version 2.0.0 (Aug 18, 2025)
 * @since 1.0.0 (Aug 17, 2025)
 */

the pre and code are not only noise, the tags are mismatched. This raises a broader question of "who you're writing this for". If you're writing this for programmers, programmers most often read the code rather than some generated API documentation in HTML or PDF. You should establish a strong bias for legibility of the code over formatting of a doc tool's output. Reduce this to something like

/**
 This class implements the k-mer index data structures that map
 each k-mer to the list of indices at which that k-mer appears.
 The alphabet is restricted to the set of nucleotide bases A, C, G, T.
*/

Random generation

RandomGenomicSequenceProvider is mis-expressed. It relies on state but is unable to refer to state due to being static. It can be made both more useful and more simple if it is non-static and holds onto an instance of Random. If you care about the seed for the purposes of subsequent reproducibility, it can be the responsibility of the caller to choose a seed when constructing the Random to be passed in.

Note well that, whereas for your purposes using currentTimeMillis for a seed is fine, the JRE does a better job:

    public Random() {
        this(seedUniquifier() ^ System.nanoTime());
    }

So at the least, use nanoTime rather than milliseconds; and if you don't need to record the seed for reproducibility, let Random construct itself instead.

RandomGenomicSequenceProvider makes an inappropriate leap: it shouldn't generate strings directly. Instead, it should generate a packed GenomicSequence, which is the business logic representation rather than the presentation format; and it should be the responsibility of DnaKmerIndexToStringConverter to format it. Among other reasons, this will directly produce a memory-efficient representation with no memory-sparse intermediate. The kmer method also has this problem: it enters and then leaves the string domain.
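As a rough sketch of both points, a non-static generator that holds onto a caller-supplied Random and writes 2-bit codes straight into a packed array might look like this. PackedRandomSketch and generatePacked are hypothetical names, and returning a raw byte[] instead of a GenomicSequence is a simplification to keep the example self-contained.

```java
import java.util.Objects;
import java.util.Random;

final class PackedRandomSketch {

    private static final int BITS_PER_CODE = 2;
    private static final int CODES_PER_BYTE = Byte.SIZE / BITS_PER_CODE;

    // The caller chooses the Random, and therefore the seed.
    private final Random random;

    PackedRandomSketch(Random random) {
        this.random = Objects.requireNonNull(random, "random is null");
    }

    /** Returns length random 2-bit codes, packed four per byte. */
    byte[] generatePacked(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length(%d) < 0".formatted(length));
        }
        byte[] data = new byte[(length + CODES_PER_BYTE - 1) / CODES_PER_BYTE];
        for (int i = 0; i < length; ++i) {
            int code = random.nextInt(1 << BITS_PER_CODE); // 0..3
            int shift = (i % CODES_PER_BYTE) * BITS_PER_CODE;
            data[i / CODES_PER_BYTE] |= (byte) (code << shift);
        }
        return data;
    }

    public static void main(String[] args) {
        long seed = System.nanoTime();
        byte[] packed = new PackedRandomSketch(new Random(seed)).generatePacked(30);
        System.out.printf("seed = %d, bytes = %d%n", seed, packed.length);
    }
}
```

No intermediate String is created, and two generators built from Random instances with the same seed produce the same bytes.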

String rendering

For exceptionMessage etc., prefer instance .formatted() method over static String.format.
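For instance, two of the messages could be built like this (FormattedSketch and its method names are made up for illustration). Note that a char argument needs %c; passing a char to %d throws IllegalFormatConversionException.

```java
final class FormattedSketch {

    /** Message for a k that exceeds the sequence length. */
    static String kTooLargeMessage(int k, int length) {
        return "k(%d) > length(%d)".formatted(k, length);
    }

    /** Message for a character outside A, C, G, T; %c formats the char. */
    static String invalidBaseMessage(char candidate) {
        return "Invalid nucleotide base: %c".formatted(candidate);
    }

    public static void main(String[] args) {
        System.out.println(kTooLargeMessage(5, 3));  // prints k(5) > length(3)
        System.out.println(invalidBaseMessage('X')); // prints Invalid nucleotide base: X
    }
}
```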

Mandating that DnaKmerIndexToStringConverter be instantiated from the outside is inconvenient. Why not just implement a toString() on DnaKmerIndex? Several problems betray the fact that this entire class is ill-designed: it relies on package-private (essentially 'friend') access to the class it's formatting, and it holds sort state that shouldn't be state at all.

setSorted/isSorted is a classic anti-pattern baked into the Java culture. If there's no business logic necessary when the methods are called and there's no access protection, get rid of these, replacing them with a simple field if access is really needed from the outside. In this particular case, sorted seems like both a misleading name and something that shouldn't be state. It seems like it should be a boolean sort or boolean shouldSort parameter.

The acrobatic routine where you unconditionally .append('\n');, then conditionally truncate, then conditionally split is all pretty non-ideal. Instead, just use a stream, conditionally sort it, then unconditionally join it:

    public String toString(boolean sort) {
        Stream<String> lines = index.entrySet().stream()
            .map(
                entry -> "%s -> %s".formatted(entry.getKey(), entry.getValue())
            );
        if (sort)
            lines = lines.sorted();
        return lines.collect(Collectors.joining("\n"));
    }

Misc

Relying on implicit long-to-string casting and concatenation as in

System.out.println("seed = " + seed);

is nasty. This isn't JavaScript, and it's a small tragedy that Java allows it. Use printf instead.

For getListOfStartingIndices avoid including type names in the method name; write simply getStartingIndices.

Reverse this ternary:

        return !index.containsKey(kmer) ?
            List.of() :
            Collections.unmodifiableList(index.get(kmer));

so that the positive case appears first. However, it requires a double lookup. Don't do that; instead:

        List<Integer> value = index.get(kmer);
        if (value == null)
            return List.of();
        return Collections.unmodifiableList(value);

because Map.get returns null if missing.
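The same single-lookup idea applies when the index is built, where containsKey, put and get can collapse into one computeIfAbsent call. The sketch below is an assumption-laden illustration: it uses plain String keys as a stand-in for GenomicSequence so that it is self-contained, and SingleLookupSketch, buildIndex and startingIndices are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SingleLookupSketch {

    /** Maps every k-mer of sequence to the list of its starting indices. */
    static Map<String, List<Integer>> buildIndex(String sequence, int k) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i + k <= sequence.length(); ++i) {
            // One map operation: create the list on first sight, then append.
            index.computeIfAbsent(sequence.substring(i, i + k),
                                  key -> new ArrayList<>())
                 .add(i);
        }
        return index;
    }

    /** One lookup; callers get a read-only view or an empty list. */
    static List<Integer> startingIndices(Map<String, List<Integer>> index,
                                         String kmer) {
        List<Integer> value = index.get(kmer);
        return value == null ? List.of() : Collections.unmodifiableList(value);
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = buildIndex("ACGTAC", 2);
        System.out.println(startingIndices(index, "AC")); // prints [0, 4]
        System.out.println(startingIndices(index, "CC")); // prints []
    }
}
```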

In the trivial construction cases like Random random = new Random(100L);, use var for the type.

Rather than a division, modulus and condition, use Math.ceilDiv (available since Java 18):

        data = new byte[Math.ceilDiv(length, CODES_PER_BYTE)];

GenomicSequence seems like it should implement List for it to be usable in standard ways. This is entirely feasible if somewhat annoying due to the impositions of the Java API. My proposed code is long so I'm not including it here, but it does work just fine. Among other changes you'll want to do in the API usage, call size() instead of length(), call isEmpty() rather than length() == 0, call set() rather than your write method, and use a for-each when applicable. Also, implement the marker interface RandomAccess for informational and performance reasons.
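To give a flavour of the shape without the full code, here is a heavily reduced sketch built on AbstractList. It is not the code from the PR: PackedBaseList, SketchBase and the of factory are hypothetical names, the enum is cut down to bare constants encoded by ordinal, and a fresh list is filled with A because its bytes start as zero.

```java
import java.util.AbstractList;
import java.util.Objects;
import java.util.RandomAccess;

enum SketchBase { A, C, G, T }

final class PackedBaseList extends AbstractList<SketchBase>
        implements RandomAccess {

    private static final SketchBase[] BASES = SketchBase.values();
    private static final int BITS_PER_CODE = 2;
    private static final int CODES_PER_BYTE = Byte.SIZE / BITS_PER_CODE;
    private static final int CODE_MASK = (1 << BITS_PER_CODE) - 1;

    private final byte[] data;
    private final int size; // number of codes (not bytes)

    PackedBaseList(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size(%d) < 0".formatted(size));
        }
        this.size = size;
        this.data = new byte[(size + CODES_PER_BYTE - 1) / CODES_PER_BYTE];
    }

    /** Parses text such as "ACGT"; any other character is rejected. */
    static PackedBaseList of(String text) {
        PackedBaseList list = new PackedBaseList(text.length());
        for (int i = 0; i < text.length(); ++i) {
            list.set(i, SketchBase.valueOf(String.valueOf(text.charAt(i))));
        }
        return list;
    }

    @Override
    public int size() {
        return size;
    }

    @Override
    public SketchBase get(int index) {
        Objects.checkIndex(index, size);
        int shift = (index % CODES_PER_BYTE) * BITS_PER_CODE;
        return BASES[(data[index / CODES_PER_BYTE] >>> shift) & CODE_MASK];
    }

    @Override
    public SketchBase set(int index, SketchBase base) {
        SketchBase previous = get(index); // also validates the index
        int shift = (index % CODES_PER_BYTE) * BITS_PER_CODE;
        int byteIndex = index / CODES_PER_BYTE;
        int cleared = data[byteIndex] & ~(CODE_MASK << shift);
        data[byteIndex] = (byte) (cleared | (base.ordinal() << shift));
        return previous;
    }
}
```

With only size(), get() and set() overridden, AbstractList supplies iteration, equals, hashCode, toString and subList, so subList(startIndex, startIndex + k) yields a k-mer view without a round trip through a string.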

There's (a lot) more, and generally this is a lot of code, so I think I'll stop here; though I've sent a placeholder PR to https://github.com/coderodde/DnaKmerIndex.java via https://github.com/coderodde/DnaKmerIndex.java/pull/1 .

I don't use Gradle. If you are interested in contributing to the repository, please fall back to Maven.
– coderodde, Commented Dec 31, 2025 at 5:56
@coderodde (a) since you didn't show any Maven in your question, it's perfectly reasonable to show Gradle here and presume that you could use it elsewhere; (b) that would be better suited as a comment on the PR; and (c) ...why? Presumably you're here to learn, and since you are, I strongly recommend learning Gradle. It's more modern, powerful, and tends to be more terse.
– Reinderien, Commented Dec 31, 2025 at 13:48
