5
\$\begingroup\$

Given an input string contains letters from 'a' to 'g'

Count how many substrings are there in the input string such that frequency of any character inside the substring is not more than the number of distinct characters present in the substring.

Example productSeq="abaa" Output = 8

Explanation: valid combinations are:

a 
b
a
a
ab
ba
aba
baa

constraints:

1 <= input string length <= 10^5
characters in input string are in range a,b,c,d,e,g

Here is my code:

import java.util.HashMap;

public class Main {
    public static int solve(String str) {
        int n = str.length();
        int validSubstrings = 0;
        // Iterate over all possible starting points
        for (int start = 0; start < n; start++) {
            HashMap<Character, Integer> charCount = new HashMap<>();
            int distinctChars = 0;
            // Extend substring from 'start' to 'end'
            for (int end = start; end < n; end++) {
                char c = str.charAt(end);
                charCount.put(c, charCount.getOrDefault(c, 0) + 1);
                // If this is the first occurrence of 'c', increase distinct count
                if (charCount.get(c) == 1) {
                    distinctChars++;
                }
                // Check if the substring is valid
                if (isValid(charCount, distinctChars)) {
                    validSubstrings++;
                } 
            }
        }
        return validSubstrings;
    }

    // Function to check if a substring is valid
    private static boolean isValid(HashMap<Character, Integer> charCount, int distinctChars) {
        for (int count : charCount.values()) {
            if (count > distinctChars) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(solve("abaa"));  // Output: 8 
        System.out.println(solve("abab"));  // Output: 10
        System.out.println(solve("aacd"));  // Output: 9 
    }
}

My code works fine when input string length is small. I am looking for better approach to solve this in less time complexity.

\$\endgroup\$
6
  • 1
    \$\begingroup\$ Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. \$\endgroup\$
    – Mast
    Commented Mar 13 at 18:18
  • \$\begingroup\$ @Mast, I have not updated the original code which I posted initially, I added another approach which I tried that runs in less time complexity , but missing something in it that is causing wrong output. So is it allowed to add more details to my post with what I have tried and failed to achieve required results? \$\endgroup\$ Commented Mar 13 at 18:21
  • \$\begingroup\$ Code producing wrong output is not reviewable, what would be your goal in showing it? \$\endgroup\$
    – Mast
    Commented Mar 13 at 18:23
  • \$\begingroup\$ If the new code would be in a completely different style, I can understand why you'd say "I tried this, but couldn't get it to work this way." If the time complexity is the goal though and the output wrong, the new approach may well lead to higher time complicity once the output is fixed sufficiently again. \$\endgroup\$
    – Mast
    Commented Mar 13 at 18:24
  • 1
    \$\begingroup\$ It appears revision 4 was contributed by user Sid. Is that another account of yours? If it is and you'd like them merged you can use the contact and request the accounts be merged. \$\endgroup\$ Commented Mar 13 at 18:55

4 Answers 4

6
\$\begingroup\$

In order to effectively talk about your algorithm complexity, let's define a few things:

  • n = input string length
  • k = maximum number of valid characters (7 for a..g)

Your current algorithm is \$O(n^2k)\$, because you have 3 nested loops:

  • for (int start = 0; start < n; start++)
    • for (int end = start; end < n; end++), and
      • for (int count : charCount.values()) (inside isValid())

Loop Elimination

Your inner most loop (the one in isValid()) is actually unnecessary.

Instead, like distinctChars, keep track of the maxCharCount. You are only ever incrementing one element of charCount, so you only need to check if the maximum increases using that one value:

        for (int start = 0; start < n; start++) {
            HashMap<Character, Integer> charCount = new HashMap<>();

            int distinctChars = 0;
            int maxCharCount = 0;
            
            // Extend substring from 'start' to 'end'
            for (int end = start; end < n; end++) {
                
                // Next character in substring
                char c = str.charAt(end);
                
                // Increment count of this character
                int count = charCount.getOrDefault(c, 0) + 1;
                charCount.put(c, count);

                // Keep track largest character count
                maxCharCount = Integer.max(count, maxCharCount);
                
                // If this is the first occurrence of 'c', increase distinct count
                if (count == 1)
                    distinctChars++;

                // Check if the substring is valid
                if (maxCharCount <= distinctChars)
                    validSubstrings++;

By eliminating the isValid() loop, we've reduced the time complexity from \$O(n^2k)\$ to \$O(n^2)\$. Since the input character set is limited to 7 characters, at best this produces a 7x speed improvement. Not as dramatic as the 1000-fold speed up vnp has suggested, but used together you could see a 7000-fold speed up. eg)

    private final static int K = 7;

...

            // Extend substring from 'start' by at most K² characters
            int end_limit = Integer.min(start + K*K, n);
            for (int end = start; end < end_limit; end++) {

Now the time complexity is only \$O(nk^2)\$

Actually, tracking maxCharCount gives another way of getting vnp’s optimization. Since maxCharCount only increases, if it ever exceeds 7 we can break to the outer loop. Instead of limiting substrings to a maximum of 49 characters, longest possible valid substring, we stop when a character occurrence exceeds the limit which can happen well short of 49 characters.

            for (int end = start; end < n; end++) {

...

                // If this is the first occurrence of 'c', increase distinct count
                if (count == 1)
                    distinctChars++;

                // Check if we've gone too far
                else if (count > K)
                    break;

Now the inner loop will stop when the first character count exceeds K, instead of looping through all K² possible substring characters. With a reasonable character distribution, the time complexity might be closer to \$O(nk)\$, but with an input like "abcdefg"*1428, it will still be \$O(nk^2)\$.


Allocate once

You allocate HashMap<Character, Integer> charCount once for each iteration of the outer loop. For a 10,000 characters string, this is 10,000 allocations.

Instead, if you moved the allocation out of the loop, it would only be allocated once. Of course, you would need to charCount.clear() the storage each time through the loop, but that should still be faster than allocating the structure each time.

However, instead of charCount.clear(), consider zero’ing the map at the start of the outer loop instead:

for (char ch = 'a'; ch <= 'g'; ch++)
    charCount.put(ch, 0);

Now, all Map structures are only allocated once. Successive passes through the outer loop merely store a new value (0) in preallocated key-value structures.

Bonus: By prefilling the charCount storage structure with 0’s, you can just use charCount.get(c) instead of charCount.getOrDefault(c, 0).


Storage Structure

HashMap<Character, Integer> charCount may be the wrong structure to store character counts in. As originally written, it requires frequent boxing of char c into Character(c), as well as boxing and unboxing of int values to and from Integer.

Instead, consider using int[] charCount = new int[7], indexed using charCount[c - 'a']. No boxing or unboxing is necessary. This is obviously less flexible; it won’t work without modification if the character range changes. Still, if the character range is fixed in stone, it will be more efficient.

\$\endgroup\$
4
  • \$\begingroup\$ Using HashMap in my code is not a concern, also storing 10,000 items in map is not an issue. My main issue is about time complexity, in my code I have 2 nested major loops for (int start = 0; start < n; start++) { for (int end = start; end < n; end++) {` this is causing O(n^2) time complexity which I wanted to reduce this time complexity. \$\endgroup\$ Commented Mar 13 at 18:38
  • 1
    \$\begingroup\$ You have 3 nested loops: for(int start…), for(int end…), and for(int count…), making your algorithm O(n²k), where k is the number of unique letters. My first point shows how to eliminate that third loop, reducing the algorithm to O(n²), or combined with vnp’s improvement, down to O(nk²) and leaning towards O(nk). The second point speaks to improving the efficiency of using the HashMap, which is a valid point, and reviewers are free to comment on any aspect of the code. The third point was couched in “may” and “consider”. It makes the code faster but less flexible; feel free to ignore \$\endgroup\$
    – AJNeufeld
    Commented Mar 13 at 19:00
  • \$\begingroup\$ combined with vnp’s improvement, down to O(nk²) and leaning towards O(nk) I am missing this part on how to implement, can you please provide more details on how it can be done, if possible with code or Pseudocode. \$\endgroup\$ Commented Mar 13 at 19:22
  • 1
    \$\begingroup\$ @CodeCrusader, it is about limiting the number of iterations of the for(int end...) loop. In your present code, that loop performs O(n) iterations, but (i) at most the first k² (49) can contribute anything to the substring count, and (ii) for most inputs, it could be terminated even sooner than that, once the count of any one character exceeds k (7), and you can test that for O(1) cost. \$\endgroup\$ Commented Mar 13 at 21:52
6
\$\begingroup\$

characters in input string are in range a,b,c,d,e,g

I don't know what happened to f.

Anyway, assuming that f is there, and the input alphabet is of 7 characters, it makes no sense to test substrings longer than 49. In such substring some character is guaranteed to appear more than 7 times, and a number of distinct characters is at most 7. This alone gives you a 1000-fold speedup.

Other optimizations may stem from the facts that all single-character substrings are good, and if the substring starts with two distinct characters, the three-character substring is good no matter what the third character is. I don't know how good these optimizations are, but they are worthy an investigation.


There is no need to count distinctChars. You may just call charCount.size() instead.

\$\endgroup\$
0
0
\$\begingroup\$

An improvement would be to avoid using data structure of wrappers the way HashMap<Character, Integer> is. Just parsing the given string and count char occurrences between dynamically computed boundaries of substrings it is an improvement even with the bellow convoluted implementation with the side note that strings abaa and aacd are of the same pattern therefore the output should be the same for both.

public class Main {
    public static int solve(String string) {

        int count = 0;
        int length = string.length();
        for (int start = 0; start < length; start++) {
            for (int size = 1; size < length; size++) {

                boolean countable = false;
                int boundry = start + size;

                if (boundry > length) {
                    continue;
                }

                for (int k = start; k < boundry; k++) {

                    int frequency = 0;
                    for (int l = start; l < boundry; l++) {

                        if (string.charAt(k) == string.charAt(l)) {
                            frequency++;
                        }
                    }

                    countable = frequency < size;
                }

                if (countable) {
                    count++;
                }
            }
        }

        return count + length;
    }

    public static void main(String[] args) {
        System.out.println(solve("abaa"));  // Output: 8 
        System.out.println(solve("abab"));  // Output: 9
        System.out.println(solve("aacd"));  // Output: 8 
    }   
}
\$\endgroup\$
2
  • \$\begingroup\$ Using HashMap is not an issue for my problem statement, main issue is with nested for loops that makes time complexity as O(n^2) in my code, I want an approach to reduce this time complexity \$\endgroup\$ Commented Mar 13 at 18:18
  • \$\begingroup\$ This code makes no sense. countable is computed each time through the for(int k…) loop, but only the value from the last iteration is used. It could be “optimized” by just doing the last iteration, which means most of the character frequencies are unused, so the implementation wont be producing the correct output. \$\endgroup\$
    – AJNeufeld
    Commented Mar 14 at 4:47
-3
\$\begingroup\$

This implementation is linear, and obviously so. Think about how you'd do it in your head: first, determine the # of distinct characters (Java: new Set<Character>(s.toChars()).size()). Now maintain two inclusive indices for a current allowable substring. Advance the second one up to but not including a character that will put you over the limit. You can easily do this with a Java HashMap<Character,Integer. Every substring of this string meets the requirement. Now, to add the character after last, you need to advance first until you've gotten rid of an instance of that character. Every character is added to the counter only once. Every index is incremented, and never decremented. Operations on HashSet and HashMap are linear.

input = 'aaba'

def do_the_thing(s) -> int:
    if not s:
        return 1
    # Let's allow a length of 1, as a edge case.
    from collections import defaultdict
    # A character that hasn't appeared will have a count of 0.
    counter = defaultdict(int)
    # Count the # of distinct characters.
    limit = len(set(s))
    result = 0
    # The *inclusive* end of the current substring.
    start = 0
    end = 0

    def advance():
        nonlocal start
        nonlocal end
        nonlocal counter
        nonlocal limit
        for c in s[start:]:
            if counter[c] == limit:
                break
            counter[c] += 1
            end += 1

    advance()
    # The substring s[start:end] satisfies the requirement, 
    # as do any of its substrings.
    result += end - start + 1
    while end < len(s):
        # The character s[end] was the one that exceeded the
        # the limit. We need to get rid of the first instance
        # of it in the current substring, i.e. we have to 
        # delete the prefix up to and including it.
        while counter[s[end]] == limit:
            counter[s[start]] -= 1
            start += 1
        # Now start adding characters again.
        advance()
        # See above.
        result += end - start

    return result

output = do_the_thing(input)
print(output)
\$\endgroup\$
3
  • 1
    \$\begingroup\$ Welcome! This looks like Python code. If you intentionally posted Python on a Java question, please edit the answer to make a note of that. Otherwise, it might be confusing to some readers. \$\endgroup\$
    – toolic
    Commented Mar 13 at 17:11
  • \$\begingroup\$ The OP's problem is "how many substrings are there in the input string such that frequency of any character inside the substring is not more than the number of distinct characters present in the substring". This approach seems to solve a different problem, as if the highlighted condition were instead "in the input string". \$\endgroup\$ Commented Mar 13 at 21:14
  • 1
    \$\begingroup\$ Consider the input "aab". The approach presented here erroneously accepts substring "aa". And I don't think this approach can be adapted to the actual problem, for with that input, all of "a", a second "a", and "aab" should be accepted (among others). That is, inclusive bounds [0, 0], [1, 1], and [0, 2]. You can't cover that if the start and end indexes are both advance-only. \$\endgroup\$ Commented Mar 13 at 21:29

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.