  • Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." I could, if all else fails, do something like what you suggest, but I'm trying to avoid it: all the search criteria are already in `\xXX` UTF-16 format, so I would have to convert them as well, and then re-convert the result back into UTF-16. So a more direct (probably Perl) way is preferred, and I'd also just like to learn how to do it without re-encoding. Commented Jun 9, 2012 at 13:34
  • I think you may be borrowing trouble. If you provide grep a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to how grep sees the data. Keep in mind that internally, grep is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works. Commented Jun 9, 2012 at 14:50
  • Just as the codepoint for @ is 0x0040, the codepoint for ऊ is 0x090A (U+090A). My patterns are flipped into little-endian order, \x0A\x09, which is how they are stored. This works fine for most patterns, but produces unexpected results when the UTF-16 representation of the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data, especially with the \x0A\x09 combination, which I do encounter. Commented Jun 9, 2012 at 16:38
  • Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data. (I'm currently diving into Perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep raw byte data, but I'm not sure yet. Commented Jun 9, 2012 at 16:39
  • I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform on the data to get it into the form we need to process it. This is what computers are best at: input, process, output. Commented Jun 9, 2012 at 20:56
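
To make the two points in this thread concrete, here is a minimal Python sketch. It is not the Perl or grep solution discussed above; the sample text and the helper name `utf16le_lines_containing` are made up for illustration. It shows that U+090A is stored as the bytes 0A 09 in UTF-16LE, that a byte-oriented tool treating 0x0A as a newline will therefore split a line in the wrong place, and that decoding in memory avoids this without changing anything on disk.

```python
# Illustrative sketch only: the sample data and helper name are assumptions,
# not something taken from the question or the answer.

def utf16le_lines_containing(data: bytes, needle: str) -> list[str]:
    """Decode UTF-16LE bytes in memory and return the lines containing needle."""
    text = data.decode("utf-16-le")  # on-the-fly transform; the file is untouched
    return [line for line in text.splitlines() if needle in line]

# Two lines of text; the second contains U+090A.
sample = "first @ line\nsecond \u090a line\n".encode("utf-16-le")

# In little-endian order, U+090A is stored as the bytes 0A 09.
stored = "\u090a".encode("utf-16-le")

# Splitting the raw bytes on 0x0A also cuts inside the 0A 09 pair, so the
# two-line input falls apart into more pieces than it has lines.
raw_pieces = sample.split(b"\n")

# Matching on decoded text finds the character on its proper line.
matches = utf16le_lines_containing(sample, "\u090a")

print(stored.hex(), len(raw_pieces), matches)
```

If the result has to be UTF-16 again, each matching line can be passed back through `line.encode("utf-16-le")` on output.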