2

I have a file with a name list as shown below:

Ishmael
Mark
Anton
Rajesh
Pete

I am trying to print something like this:

Iae 3
a   1
Ao  2
ae  2
ee  2

I developed this code:

cat names.txt | grep -Eo '[AEIOUaeiou]' | wc -l 

but the output was the total number of vowels in the whole file. What is missing from my code to display the output as expected above? Thank you for your help.

6
  • 2
    This question is similar to: How to count the number of a specific character in each line?. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. Commented Mar 10 at 19:21
  • 2
    Does your text have any chance of having diacritics (accents, etc.)? Like André or Jürgen or Świętopełk?
    – jcaron
    Commented Mar 11 at 11:24
  • 2
    Is “y” a vowel? In many languages, it is.
    – bindiff
    Commented Mar 11 at 22:28
  • 1
    @bindiff, including English words (rhythm) and names (Evelyn, for example). But how about w, from Welsh. That's a vowel in some loanwords (cwm being the most common, probably) and even some names (Glyndŵr unambiguously, debatably in many others). But both Y and W are also consonants in this context (Yusuf, Wendy). So it's a problem more of linguistics than text-parsing
    – Chris H
    Commented Mar 12 at 10:12

5 Answers

7

What you want to do is basically to remove everything except the vowels and then count the number of characters in each line. Your attempt failed because grep -o prints each match on a separate line, so each vowel ended up on a line by itself and wc -l counted all of those lines together. You were looking for something like this:

$ tr -dc 'AEIOUaeiou\n' < names.txt | awk '$0=$0" "length' 
Iae 3
a 1
Ao 2
ae 2
ee 2
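To see why the original pipeline printed a single number, here is a small sketch. It recreates the sample list in a temporary file (the `names` variable and the use of `mktemp` are scaffolding for this illustration, not part of the question) and shows that `grep -o` emits one vowel per output line, so `wc -l` can only report a total for the whole file:

```shell
# Recreate the sample list in a temporary file (illustrative scaffolding).
names=$(mktemp)
printf '%s\n' Ishmael Mark Anton Rajesh Pete > "$names"

# grep -o prints every match on its own line, so the line structure
# of the input is already lost before wc sees anything.
grep -Eo '[AEIOUaeiou]' "$names" | head -n 3

# Counting those output lines therefore gives one total for the file.
# (tr -d ' ' strips the padding that some wc implementations add.)
total=$(grep -Eo '[AEIOUaeiou]' "$names" | wc -l | tr -d ' ')
echo "total vowels: $total"

rm -f "$names"
```

On the sample list the total is 10, which is the sum of the per-line counts 3, 1, 2, 2 and 2.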

The tr command can translate between sets of characters. With the -d flag it deletes them, and the -c makes it take the complement of what you give it. So tr -dc x < file will print out the contents of file after deleting all characters except an x. Here, tr -dc 'AEIOUaeiou\n' will delete everything that isn't a vowel or newline:

$ tr -dc 'AEIOUaeiou\n' < names.txt 
Iae
a
Ao
ae
ee

So we just need the count, and I used awk for that.

$0 is the $ operator applied to the 0 number which results in the full current input record (records being lines by default). For simplicity, you can think of $0 as a special variable that holds the current line.

length is a function that returns the length (in characters for scalars and in number of elements for arrays) of what you give it, and when you give it nothing, it operates on $0. So that gives us the number of characters.

$0=$0" "length, then, means "add a space and then the result of length to the end of the line". Finally, in awk, the default action when a pattern evaluates to true is to print the current value of $0. Because of the string concatenation in $0=$0" "length, the result of that assignment is a string, and a string is considered true in awk as long as it is non-empty, which is always the case here, so the resulting $0 is always printed.
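The terse pattern-only program can also be spelled out with an explicit action, which may be easier to read and adapt. The following is a sketch of that equivalent form; the `count_vowels` function name is hypothetical, chosen only for this illustration:

```shell
# count_vowels: hypothetical helper. Reads names on standard input and
# prints each line's vowels followed by how many there are.
count_vowels() {
    tr -dc 'AEIOUaeiou\n' |
        awk '{ print $0 " " length($0) }'   # explicit form of: $0=$0" "length
}

printf '%s\n' Ishmael Mark Anton Rajesh Pete | count_vowels
```

With an explicit `{ print ... }` action the printing is unconditional, so it no longer relies on the assigned string being treated as true.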

I am just reformulating doneal24's approach here using a slightly different combination of tools.

2
  • I am using Ubuntu/Linux to write the script you shared with me but this error message comes up: awk: function lenght never defined. What is that for? Commented Mar 21 at 21:52
  • @IsmaelSanchez did you perhaps write lenght instead of length?
    – terdon
    Commented Mar 21 at 23:54
6

How about

awk '{gsub (/[^AEIOUaeiou]/, ""); print $0, length}' names.txt

or (somewhat shorter, but assuming the GNU implementation of awk)

gawk -v IGNORECASE=1 '{gsub (/[^AEIOU]/, ""); print $0, length}' names.txt

You also have a useless invocation of cat in your test code.

[EDIT] I did not see the link Chris Davies provided. My answer is indeed a duplicate of previous questions.

2
  • I ran this query of yours but I realized the result is the names without vowels. It's very important though. But I am trying to get the vowels only and the count. Thanks anyway for your support. Very appreciated Commented Mar 10 at 19:51
  • @IsmaelSanchez please try again, there was a typo.
    – terdon
    Commented Mar 10 at 20:43
3

With perl, assuming ASCII input, and only considering aeiouAEIOU without diacritics¹ as being vowels:

perl -ne '
  @vowels = /[aeiou]/gi;
  printf "%-6s %d\n", join("", @vowels), scalar@vowels' < input

To also count those aeiouAEIOU when they have diacritics (like the é in my first name) or are found in characters such as ℡ ℻ Ⅷ ⅲ ⒜ ㉐ ㎉ ffi 🄐 (like in Effie), you could look for them as grapheme clusters (matched by \X in perl regexps) in their compatibility decomposition normalisation form (NFKD) with something like:

perl -C -MUnicode::Normalize=NFKD,NFKC -lne '
  @vowels = map {NFKC($_)} (NFKD($_) =~ /(?=[aeiou])\X/gi);
  printf "%-6s %d\n", join("", @vowels), scalar@vowels'

Which on Henry Ⅷ, Stéphane, Effie and Louisiana gives:

eIII   4
éae    3
Eie    3
ouiiaa 6

(not that many would consider the Is in Ⅷ as vowels, but then again the same may go for the I or Y in Iota / Yeti, which some consider as not producing vowel sounds).

Note the dot in i is not considered as a diacritic in that instance, and the dotless i (ı) would not be counted even though both I and İ would be counted (regardless of whether you use aeiou or AEIOU in the regexp).


¹ unless they're expressed with combining characters (note there are no such combining diacritic characters in ASCII) applied to an aeiouAEIOU base, in which case only the base will be reported.

2

If you want to do this in pure bash without launching other programs:

while IFS= read -r line; do
  vowels="${line//[^aAeEiIoOuU]/}" # Remove all but vowels
  echo "$vowels ${#vowels}"        # Print vowels and length
done < input.txt

Note that while read ... over a file is discouraged because it’s insecure, slow, and not very beautiful.
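One practical wrinkle with this loop: read returns a non-zero status when the final line lacks a trailing newline, although it still stores that text in line, so the last name would be skipped. A common guard is to also test whether line is non-empty. A sketch, assuming bash; the function name `vowels_per_line` is hypothetical:

```shell
# vowels_per_line: hypothetical wrapper around the loop above (bash only,
# because of the ${var//pattern/} expansion). Reads names on standard input.
vowels_per_line() {
  local line vowels
  # "|| [[ -n $line ]]" keeps a final line that has no trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    vowels="${line//[^aAeEiIoOuU]/}"          # Remove all but vowels
    printf '%s %s\n' "$vowels" "${#vowels}"   # Print vowels and length
  done
}

# No newline after "Pete": the guard still processes it.
printf 'Ishmael\nPete' | vowels_per_line
```

Reading from standard input rather than a fixed file name also lets the same function be used at the end of a pipeline.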

1

Using Raku (formerly known as Perl_6)

~$ raku -ne '.comb(/ :i <[aeiou]> /).join andthen put( $_ => $_.chars );'  file

#OR:

~$ raku -ne 'my $vowels = .comb(/ :i <[aeiou]> /).join; put $vowels ~"\t"~ $vowels.chars;'  file

Sample Input:

Ishmael
Mark
Anton
Rajesh
Pete

Sample Output:

Iae 3
a   1
Ao  2
ae  2
ee  2

Above we see Raku's comb routine in action, which might be thought of as the converse of split. Particular patterns are sought, and non-matching patterns eliminated.

This leads us to a nice feature regarding Raku, which is that Raku is Unicode-ready. Except for filenames, codepoints by default undergo NFC normalization. So even though an input string might denote the á character in the following two different ways, Raku will still only count it as one (1) character:

~$ raku -e 'put "\c[LATIN SMALL LETTER A WITH ACUTE]";'
á
~$ raku -e 'put "\c[LATIN SMALL LETTER A]\c[COMBINING ACUTE ACCENT]";'
á
~$ raku -e 'put "\c[LATIN SMALL LETTER A WITH ACUTE]";' | raku -ne '.chars.put;'
1
~$ raku -e 'put "\c[LATIN SMALL LETTER A]\c[COMBINING ACUTE ACCENT]";' | raku -ne '.chars.put;'
1
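For contrast, a byte-oriented shell tool sees those two spellings as different inputs. A sketch using only printf and wc -c, with the UTF-8 bytes written as octal escapes (the variable names and the octal spelling are my own, not taken from the answer above):

```shell
# Precomposed U+00E1 (LATIN SMALL LETTER A WITH ACUTE): UTF-8 bytes C3 A1.
precomposed=$(printf '\303\241' | wc -c | tr -d ' ')

# Decomposed: "a" followed by U+0301 (COMBINING ACUTE ACCENT): bytes 61 CC 81.
decomposed=$(printf 'a\314\201' | wc -c | tr -d ' ')

echo "precomposed: $precomposed bytes, decomposed: $decomposed bytes"
```

Only the decomposed form contains a plain "a" byte, which is why a pattern such as [aeiou] in a tool that does not normalize its input can match one spelling and miss the other.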

EDIT: Raku has functions for NFKC and NFKD normalization, but since the input is already NFC normalized, perhaps the easiest thing to do is just to add target vowel letters by hex range (see the link below for "Enumerated character classes and ranges"):

~$ raku -e  'put  "a e i o u\n",   "\x00C0".."\x00C6";' | tee ltrs.txt
a e i o u
À Á Â Ã Ä Å Æ
~$ raku -ne '.comb(/ :i <[aeiou] + [\x00C0 .. \x00C6]> /).join andthen  \
             put $_ => $_.chars;' < ltrs.txt
aeiou   5
ÀÁÂÃÄÅÆ 7

https://docs.raku.org/language/unicode
https://docs.raku.org/language/regexes#Enumerated_character_classes_and_ranges
https://raku.org
