5

I have a byte array in Java. The array contains a '%' symbol somewhere in it, and I want to find the position of that symbol. Is there any way to do this?

[EDIT] I tried the code below and it worked fine.

    byte[] b = {55,37,66};
    String s = new String(b);
    System.out.println(s.indexOf("%"));

I have a doubt. Does every character take exactly one byte in Java?

12
  • 1
    Is there any reason this is a byte array and not a CharSequence / String? Commented Feb 17, 2012 at 20:32
  • I have updated the question with what I have tried! Commented Feb 17, 2012 at 20:47
  • @Powerlord the reason is that I'm reading a raw resource file in Android and it returns only an InputStream. Commented Feb 17, 2012 at 20:51
  • 3
    The number of bytes a character has depends on its encoding. Character encodings are provided by the Charset class. Commented Feb 17, 2012 at 21:00
  • 1
    What the heck, 13 years and no correct answer. Commented Sep 9 at 11:42

5 Answers 5

3
+50

Here is a solution that gives the byte position of a character in a byte array, given a Charset. It's based on CharsetDecoder:

import java.nio.charset.*;
import java.nio.*;

class Test
{
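    // Returns the byte offset at which the first occurrence of c starts, or -1.
    // Note: the one-char output buffer cannot hold a surrogate pair, so this
    // assumes the input contains only characters from the Basic Multilingual Plane.
    public static int find(byte[] arr, Charset charset, char c) throws CharacterCodingException
    {
        CharsetDecoder decoder = charset.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(arr);
        CharBuffer out = CharBuffer.allocate(1);
        while(in.hasRemaining())
        {
            out.clear();
            int pos = in.position(); // byte offset of the character about to be decoded
            // endOfInput=true turns a truncated trailing sequence into a malformed-input error
            CoderResult result = decoder.decode(in, out, true);
            if(result.isError())
                result.throwException();
            if(out.get(0)==c)
                return pos;
        }
        return -1;
    }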

    public static void test(String str, Charset charset, char c) throws CharacterCodingException
    {
        byte[] arr = str.getBytes(charset);
        int pos = find(arr, charset, c);
        System.out.printf("With %s, char is at position %d\n", charset.name(), pos);
    }

    public static void main(String[] args) throws CharacterCodingException
    {
        String str = "José has 10 €";
        char c = '€';

        test(str, Charset.forName("windows-1252"), c);
        test(str, Charset.forName("UTF-8"), c);
        test(str, Charset.forName("UTF-16LE"), c);
    }
}

The test converts a string to a byte array for a given encoding and looks for a character in it.

Output:

With windows-1252, char is at position 12
With UTF-8, char is at position 13
With UTF-16LE, char is at position 24

5 Comments

I'm concerned about an infinite loop. Could you truncate one byte from the end of the test array to ensure that incomplete characters are not causing a problem?
@Basilevs Incomplete characters will throw a java.nio.charset.MalformedInputException (because of the result.throwException();).
isError() returns false for UNDERFLOW condition. github.com/openjdk/jdk20/blob/…
I see now, the third decoder argument transforms underflow to error: `if (cr.isUnderflow()) { if (endOfInput && in.hasRemaining()) { cr = CoderResult.malformedForLength(in.remaining()); /* fall through to malformed-input case */ } else { return cr; } }`
2

A correct and more direct Guava solution:

Bytes.indexOf(byteArray, (byte) '%');

5 Comments

I looked for indexOf and didn't see it. I guess I'm blind :-) You get the vote
It's possible that the problem is, indeed, an encoding issue -- which would imply that you really do need to use a String somehow.
Why can't % byte be a part of another symbol? Will this work at all in UTF-16?
I'm not aware of any solution better than an explicit loop for that scenario.
Explicit loop is kind of complicated too. Need to play with StreamDecoder. Unclear how to split the string for speed, etc.
2

The accepted answer works when the character you are searching for has a single-byte representation.

However:

Does every character take exactly one byte in Java?

It depends on the character in question, and on the encoding that you are using.

When you convert a byte array to a Java string using the String(byte[]) constructor, you will be converting the character data represented by the bytes to a sequence of Unicode codepoints. The bytes-to-codepoints conversion is done using the JVM's default Charset.

In some encodings, every character is encoded as one byte (for example, the character sets defined by ISO/IEC 8859, such as Latin-1). In others (for example UTF-16 and UTF-32), every character is encoded as two or more bytes. In yet others (for example UTF-8, JIS, etc.), a variable number of bytes is needed to represent a character.

Now, in most commonly used encodings, the % character is encoded by a single byte. But it isn't universally true, and this certainly doesn't apply to all Unicode codepoints. (For a start, some encoding schemes don't support all codepoints!)
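As a rough illustration of the point above, here is a minimal sketch that prints how many bytes a few characters take in different charsets. The class and method names (EncodedLengths, encodedLength) are made up for this example. Note that String.getBytes(Charset) silently substitutes a replacement byte when the charset cannot represent a character, which is why '€' shows up as a single byte in ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Number of bytes produced when the given text is encoded with the given charset.
    // (Hypothetical helper, not part of any library.)
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // '%' (U+0025), 'é' (U+00E9) and '€' (U+20AC), written as escapes so the
        // source file encoding does not matter.
        String[] samples = {"%", "\u00e9", "\u20ac"};
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16LE
        };
        for (String s : samples) {
            for (Charset cs : charsets) {
                System.out.printf("U+%04X in %s: %d byte(s)%n",
                        s.codePointAt(0), cs.name(), encodedLength(s, cs));
            }
        }
    }
}
```

With these three charsets, '%' takes 1, 1 and 2 bytes respectively, so even for '%' the "one byte per character" assumption only holds for some encodings.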

If you wanted to match an arbitrary Unicode codepoint in a byte array, you could do something like this (using Guava Bytes).

import com.google.common.primitives.Bytes;

...

Charset charset = ...    // the encoding used for the byte array
byte[] bytes = ...       // the byte array to search
int codepoint = ...      // the codepoint you are searching for
int[] codepoints = new int[]{codepoint};
String s = new String(codepoints, 0, 1);
byte[] b = s.getBytes(charset);

int pos = Bytes.indexOf(bytes, b);

Note that it is no longer valid to equate a "character" with a Java char. Some characters (for example emojis) have Unicode codepoints that cannot be represented by a single Java char.
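To make the char versus codepoint distinction concrete, here is a small sketch using only the standard library. The class and method names (CodePointDemo, fromCodePoint) are made up for illustration, and U+1F600 is just one example of a codepoint outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    // Builds the String for a single codepoint. Supplementary codepoints
    // (above U+FFFF) come out as a surrogate pair, i.e. two Java chars.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int percent = 0x25;   // '%'
        int emoji = 0x1F600;  // an emoji outside the BMP

        System.out.println(fromCodePoint(percent).length()); // 1
        System.out.println(fromCodePoint(emoji).length());   // 2
        System.out.println(Character.charCount(emoji));      // 2

        // The same single codepoint needs four bytes in UTF-8.
        System.out.println(fromCodePoint(emoji).getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

This is why the search pattern is built from an int codepoint rather than from a char.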


If you are searching for a string in a byte array:

Charset charset = ...    // the encoding used for the byte array
byte[] bytes = ...       // the byte array to search
String s = ...           // the string you are searching for
byte[] b = s.getBytes(charset);

int pos = Bytes.indexOf(bytes, b);

Note that there is a difficult issue at the root of this problem. Depending on the encoding scheme, you could potentially get a false match where the byte sequence in b matches the end of one encoded character and the beginning of another.

This doesn't happen with UTF-8 because the first byte of a UTF-8 encoded character is always distinguishable from a continuation byte; see this table. However, this can occur with UTF-16 and other encodings.

  • For UTF-16, you can address this by searching for matches at an even byte offset.
  • For other variable-byte encodings that are not "bytewise self-synchronizing" (see this table), a more complicated encoding-specific algorithm will be needed.
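Here is a minimal sketch of the UTF-16 case, written in plain Java instead of Guava so that the alignment can be controlled. The class and method names (Utf16Search, indexOf with a step argument) are made up for this example, and the test data was chosen so that the bytes of '%' straddle two adjacent characters.

```java
import java.nio.charset.StandardCharsets;

class Utf16Search {
    // Returns the first offset at which pattern occurs in data, trying only
    // offsets that are multiples of step, or -1 if there is no such match.
    // (Hypothetical helper; step = 1 is a naive search, step = 2 is aligned for UTF-16.)
    static int indexOf(byte[] data, byte[] pattern, int step) {
        if (pattern.length == 0) {
            return 0;
        }
        for (int i = 0; i + pattern.length <= data.length; i += step) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // In UTF-16BE: U+4100 U+2541 '%' encodes as 41 00 | 25 41 | 00 25,
        // and '%' alone encodes as 00 25.
        byte[] data = "\u4100\u2541%".getBytes(StandardCharsets.UTF_16BE);
        byte[] pattern = "%".getBytes(StandardCharsets.UTF_16BE);

        System.out.println(indexOf(data, pattern, 1)); // 1: false match across two characters
        System.out.println(indexOf(data, pattern, 2)); // 4: the real '%'
    }
}
```

The sketch assumes the array starts on a code unit boundary and carries no byte order mark; if either is not true, the offsets to try would have to be adjusted.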

6 Comments

Bytes.indexOf(bytes, b) probably will not work, because in some encodings, a target codepoint may be constructed from a tail and a head of two different adjacent ones. This answer would be good if there was a proof that no popular encodings have such problem.
I forgot to mention that. But it depends on the encoding. For example in UTF-8 the start byte of an encoded codepoint is distinguishable from the continuation bytes.
Excuse me, I'm just curious about the point of getting the byte position. In a real-world application, what will be the use of this idea?
I imagine that the point is to search for a character or string in (say) a file without the runtime overhead of decoding all of the bytes read from the file into characters.
Stephen is correct. I have a text file of a few gigabytes in size. The file contains variable length entries delimited with a certain delimiter. I need to process the file quickly, but need to find a good position to split it into chunks for parallel processing without damaging an entry. To do so I need an efficient way to find symbols in a file. Finding a symbol in a byte array is part of the solution. Another part is to ensure that a byte array starts at the character or codepoint beginning.
Critically, the split happens in byte space because file seek operations and memory mapping work only with bytes.
0

Using Google Guava:

com.google.common.primitives.Bytes.asList(byteArray).indexOf(Byte.valueOf((byte) '%'))

3 Comments

This will not actually work, I think. '%' will be wrapped in a Character object, and Character.valueOf('%').equals(someByteObject) will return false.
I had an inkling I had a type conversion issue; I think the edit should fix it.
Does not work for multibyte encodings.
-1

I come from the future with some streaming and lambda stuff. If it's just a matter of finding a byte in a byte[]:

Input:

byte[] bytes = {55,37,66};
byte findByte = '%';

With streaming and lambda stuff:

OptionalInt firstMatch = IntStream.range(0, bytes.length).filter(i -> bytes[i] == findByte).findFirst();
int index = firstMatch.isPresent() ? firstMatch.getAsInt() : -1;

Which is pretty much the same as the plain loop below. Actually, I think I still prefer the loop (e.g. put it in some utility class):

int index = -1;
for (int i = 0 ; i < bytes.length ; i++) 
  if (bytes[i] == findByte) 
  {
    index = i;
    break;
  }

#EDIT

Your question is actually more about finding a character rather than finding a byte.

What could be improved in your solution:

String s = new String(bytes); // uses the platform default charset, so the result is not always the same

// the charset is an invisible 2nd argument; better to pass it explicitly
String s2 = new String(bytes, charset);

So, your program may behave differently on different platforms. Some charsets use 1 byte per character, others use 2, 3, ... or are irregular. So, the size of your string may vary from platform to platform.

Secondly, some byte sequences cannot be represented as strings at all, i.e. when the charset does not have a character for the given byte values.
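As a hedged illustration of that second point: the String constructor does not fail on such bytes, it quietly substitutes the replacement character U+FFFD, whereas a CharsetDecoder can be asked to report the problem. The class and method names below (StrictDecode, decodeStrict) are made up for this sketch, and the truncated UTF-8 sequence is just one example of malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Decodes UTF-8 strictly: malformed or unmappable input throws
    // instead of being replaced. (Hypothetical helper.)
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // 0xC3 starts a two-byte UTF-8 sequence, but the second byte is missing.
        byte[] bad = {55, (byte) 0xC3};

        String lenient = new String(bad, StandardCharsets.UTF_8);
        System.out.println(lenient.charAt(1) == '\uFFFD'); // true: silently replaced

        try {
            decodeStrict(bad);
        } catch (CharacterCodingException e) {
            System.out.println("strict decoding failed: " + e);
        }
    }
}
```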

So, how could you improve it:

If you know that your byte array will always contain plain old ASCII values, you could use this:

byte[] b = {55,37,66};
String s = new String(b, StandardCharsets.US_ASCII);
System.out.println(s.indexOf("%"));

On the other hand, if you know that your content contains UTF-8 characters, use:

byte[] b = {55,37,66};
String s = new String(b, StandardCharsets.UTF_8);
System.out.println(s.indexOf("%"));

etc ...
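Note that indexOf on the decoded string gives a position counted in chars. If a byte position is needed, one possible approach (a sketch, not the only way) is to re-encode the part of the string before the match and count its bytes. The names ByteOffset and byteIndexOf are made up here, and the sketch assumes the bytes are valid for the charset and that the charset does not emit a byte order mark when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOffset {
    // Byte offset of the first occurrence of target in bytes, or -1 if absent.
    // Decodes the whole array, so it trades speed for simplicity.
    // (Hypothetical helper; assumes valid input and no BOM on encoding.)
    static int byteIndexOf(byte[] bytes, Charset charset, String target) {
        String s = new String(bytes, charset);
        int charIndex = s.indexOf(target);
        if (charIndex < 0) {
            return -1;
        }
        // The encoded length of the prefix is the byte offset of the match.
        return s.substring(0, charIndex).getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "Jos\u00e9 has 10 \u20ac"; // "José has 10 €"
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(text.indexOf("\u20ac"));                            // 12 (char position)
        System.out.println(byteIndexOf(utf8, StandardCharsets.UTF_8, "\u20ac")); // 13 (byte position)
    }
}
```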

1 Comment

This returns char position, not byte position
