I've exported my email archive covering 10 years, and it is very large.
I want to search all of the text for any string that is 64 characters long, in the hope of finding a bitcoin private key.
How can I find strings of a given length in characters?
If you mean to search for a 256-bit number in hexadecimal form (64 chars from the range 0-9 and A-F, one of the formats in which a bitcoin private key could appear), this should do:
egrep -aro '\<[A-F0-9]{64}\>' files and dirs ...
Add the -i option, or also include the a-f range, if some of the keys are in lowercase.
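To try the command on something small before pointing it at the real archive, a sketch along these lines may help. The function name find_hex_keys, the sample file and the -h flag (which hides file names so the output is just the key) are my own additions for illustration, not part of the command above.

```shell
# Hypothetical helper: print every standalone run of 64 hex digits
# found under the given files or directories.
find_hex_keys() {
    # -a: treat binary files as text   -r: recurse into directories
    # -h: omit file names (drop it to see which file each hit came from)
    # -o: print only the match         -i: accept lowercase digits too
    LC_ALL=C grep -E -arhoi '\<[A-F0-9]{64}\>' "$@"
}

# Throwaway sample: one line with a 64-digit hex string, one without.
sample_dir=$(mktemp -d)
key=0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
printf 'my key is %s, keep it safe\n' "$key" > "$sample_dir/mail.txt"
printf 'too short: abcdef0123\n' >> "$sample_dir/mail.txt"

find_hex_keys "$sample_dir"   # prints the 64-digit string only
rm -rf "$sample_dir"
```

A run of 65 or more hex digits is not reported, because the \< and \> anchors require the 64 digits to form a whole word.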
For the general problem of finding runs of characters from the same class having a specified length, you are better off with PCRE regexps, which GNU grep supports through the -P option. For instance, to find runs of uppercase letters from any charset, with a minimum length of 2 and a maximum length of 4, delimited by chars which are not uppercase letters:
echo ÁRVÍZtűrő tükörFÚRÓgép |
LC_CTYPE=en_US.UTF-8 grep -Po '(?<!\p{Lu})\p{Lu}{2,4}(?!\p{Lu})'
FÚRÓ
Replace \p{Lu} with \p{Ll} for lowercase letters, \S for non-spaces, etc. See here and here for the full list.
(?<!...) and (?!...) are negative lookbehind and lookahead zero-width assertions; e.g. (?<!<)\w(?!>) will match a "word" character that is neither preceded by < nor followed by >. The \< zero-width assertion from vi could be implemented by (?<!\w)(?=\w).
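The two assertions can be checked on ASCII input, which avoids any dependence on the locale. This is only a sketch assuming GNU grep built with -P support; the function names unbracketed and word_prefix3 are made up for the example.

```shell
# Word characters that are neither preceded by '<' nor followed by '>'.
unbracketed() { grep -Po '(?<!<)\w(?!>)'; }

# First three characters of each word, with the vi-style \< anchor
# emulated by the (?<!\w)(?=\w) pair of lookarounds.
word_prefix3() { grep -Po '(?<!\w)(?=\w)\w{3}'; }

printf 'a <b> <c d>\n' | unbracketed    # prints: a
printf 'foobar baz\n'  | word_prefix3   # prints: foo, then baz
```

In the first call, c is rejected by the lookbehind alone and d by the lookahead alone: each assertion is tested independently, so a character does not have to be enclosed on both sides to be excluded.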
If you want to find all words of length 64 from /path/to/file, you can use
tr -c '[:alnum:]' '\n' < /path/to/file | grep '^.\{64\}$'
This replaces all non-alphanumeric characters by newlines, so each word is on its own line. Then it filters this result to include only the words of length 64.
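The same pipeline can be tried with a shorter length on a line that mixes several kinds of punctuation. This is a sketch: the function name words_of_length is hypothetical, grep -x is used as an equivalent of anchoring with ^ and $, and LC_ALL=C is added so that only ASCII letters and digits count as alphanumeric.

```shell
# Hypothetical helper: usage  words_of_length N < input
words_of_length() {
    # Turn every non-alphanumeric byte into a newline, then keep
    # only the lines (words) that are exactly N characters long.
    LC_ALL=C tr -c '[:alnum:]' '\n' | LC_ALL=C grep -x ".\{$1\}"
}

printf 'key=abcd1234, next:efgh5678; short\n' | words_of_length 8
# prints: abcd1234, then efgh5678
```

Because tr -c complements the alphanumeric class, the equals sign, comma, colon and semicolon are all converted to newlines along with the spaces.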
What about the dot (.), comma (,), colon (:), semicolon (;) and many other usual punctuation characters? Shouldn't those also be converted to a newline?
If you have GNU grep (default on Linux), you can do:
grep -Po '(^|\s)\S{64}(\s|$)' file
The -P enables Perl Compatible Regular Expressions, which give us \b (word boundaries), \s (whitespace) and \S (non-whitespace); {N} means "exactly N repetitions", and -o means "print only the matching part of the line". So we look for stretches of non-whitespace that are exactly 64 characters long, that start either at the beginning of the line (^) or after whitespace (\s), and that end either at the end of the line ($) or with another whitespace character.
Note that the result will include any whitespace characters at the beginning and end of the string, so if you want to parse this further, you might want to use this instead:
grep -Po '(^|\s)\K\S{64}(?=\s|$)'
That will look for a whitespace character or the beginning of the line (^|\s), then discard it (\K), and then look for 64 non-whitespace characters followed by either a whitespace character or the end of the line. The (?=foo) construct is called a "lookahead" and is not included in the match.
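The difference between the two commands is easier to see with a short length and the output wrapped in brackets. This sketch assumes GNU grep with -P support; the function names with_ws and without_ws, and the length passed as an argument, are only there for the demonstration.

```shell
# Same two patterns as above, with the length passed as $1.
with_ws()    { grep -Po "(^|\s)\S{$1}(\s|\$)"; }
without_ws() { grep -Po "(^|\s)\K\S{$1}(?=\s|\$)"; }

# sed wraps each output line in brackets to make whitespace visible.
printf 'ab abcd abcde\n' | with_ws 4    | sed 's/.*/[&]/'   # prints: [ abcd ]
printf 'ab abcd abcde\n' | without_ws 4 | sed 's/.*/[&]/'   # prints: [abcd]
```

Since the first pattern consumes the trailing whitespace as part of the match, two tokens separated by a single space cannot both be matched by it; that whitespace is no longer available as the leading delimiter of the second one.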
grep -Po '(?<!\S)\S{64}(?!\S)' is enough to find runs of 64 non-spaces, but please read my answer for why that's probably not what's intended.
I find the \K approach much more readable and elegant.
It seems that grep is the correct tool to "search" for a string. What is left to do is to define such a string with a regex. The first issue is to define the limits of a word. It is not as simple as "a space": in "a book, a lamp" the comma (,) also acts as a word delimiter and, in the same way, many other characters, or even the start or end of a line, can act as word delimiters. GNU grep provides some word delimiters:
\< word start.
\> word end.
\b word boundary.
All of them assume that a word is a sequence of [a-zA-Z0-9_] characters. If that is OK for you, this regex could work:
grep -o '\<.\{64\}\>' file
If you can use extended regex, the number of backslashes (\) can be reduced:
grep -oE '\<.{64}\>' file
That selects from a "word start" (\<), 64 ({64}) characters (.), to a "word end" (\>) and prints only the matching (-o) parts.
However, the dot (.) will match any character, including spaces and punctuation, so a match may span several words; that may be too much.
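A short length makes the effect of the dot visible. In this sketch the function names any_chars and hex_only are hypothetical, and the length is passed as an argument instead of being fixed at 64.

```shell
# '.' version: any 5 characters between a word start and a word end.
any_chars() { grep -oE "\<.{$1}\>"; }

# Stricter version: only ASCII hex digits may appear in the run.
hex_only()  { LC_ALL=C grep -oE "\<[0-9a-fA-F]{$1}\>"; }

printf 'ab cd abcde\n' | any_chars 5   # prints: ab cd, then abcde
printf 'ab cd abcde\n' | hex_only 5    # prints: abcde
```

The first command reports "ab cd" because the dot also matches the space in the middle; restricting the character class avoids that.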
If you want to be more strict on the selection (hex digits), use:
grep -oE '\<[0-9a-fA-F]{64}\>' file
This allows hex digits in lowercase or uppercase. But if you really want to be strict, since in some locales the ranges might also match non-ASCII characters, use:
LC_ALL=C grep -oE '\<[0-9a-fA-F]{64}\>' file
Some regex flavors (such as the PCRE used by grep -P) do not have "start of word" or "end of word" assertions (\< and \>) but do have a "word boundary" (\b):
grep -oP '\b[0-9a-fA-F]{64}\b' file
Some regex flavors accept the POSIX-style word boundaries [[:<:]] and [[:>:]]. Perl does not, and PCRE does only from version 8.34.
txt or something I can cat and parse with a pipe
.txt for easy parsing