Histogram of numeric data

Question

I wrote this simple code in 2003, and I haven't modified it since because it just works the way I need it to.

It reads in a list of numeric values and displays an ASCII histogram of the data. If the numbers are in a file, the file should be a simple list of numbers, with one number per line.

Here is an example of usage, where I use Perl to generate a stream of random integers and pipe it into the hist program:

perl -E 'say int rand 99 for 1..666' | hist

    Number of samples in population: 666
    Value range: 0 - 98
    Mean value: 49
    Median value: 51
    10% value: 10
    90% value: 89

                 < 0:      0
               0 - 9:     54
              9 - 18:     56
             18 - 27:     59
             27 - 36:     60
             36 - 45:     66
             45 - 54:     60
             54 - 63:     65
             63 - 72:     70
             72 - 81:     60
             81 - 90:     51
               >= 90:     65

To see the usage, run:

hist -h

Here is the hist code:

#!/usr/bin/env perl

# Create a histogram of numeric values.
#
# Input is a file (or a pipe) consisting of a single column of values.
# Output is STDOUT.
#
# Usage:    hist file
# Example:  hist data.txt

use warnings;
use strict;
use List::Util qw(sum);

my $nbins = 10; # Number of bins

my @sorted;
my @freq;
my $nsamp;
my $maxval;
my $minval;
my $upper_lim;
my $lower_lim;
my $user_upper_lim;
my $user_lower_lim;
my $binsize;

parse_args();
read_input();
create_histogram();
print_histogram();

# Check for any unexpected error conditions:
if ($!) {
    warn "Error message = $!\n\n";
    exit 1;
}
else {
    exit 0;
}


########################################################################
########################################################################


sub read_input {
    my @raw;

    while (<>) {
        s/\s+//g;   # Remove all whitespace
        unless (check_if_numeric($_)) {
            die "Error: Non-numeric data '$_' found.  " .
                "Can't calculate histogram.";
        }
        push @raw, $_;
    }
    unless (@raw) {
        die "Error: No data.  Can't calculate histogram.\n";
    }
    @sorted = sort {$a <=> $b} @raw;

    $nsamp  = scalar @sorted;   # Number of elements in array.
    $maxval = $sorted[-1];      # Last element in sorted array
                                #   is the maximum value.
    $minval = $sorted[0];       # First element is the minimum.

    $lower_lim = int( (defined $user_lower_lim) ? $user_lower_lim : $minval );
    $upper_lim = int( (defined $user_upper_lim) ? $user_upper_lim : $maxval );
    if ($upper_lim < ($lower_lim + 10)) {
        $upper_lim = $lower_lim + 10;
    }
    if ($lower_lim > $maxval) {
        die "Error: Lower limit must be less than $maxval.  " .
            "Can't calculate histogram.";
    }

    $binsize = int(($upper_lim-$lower_lim)/$nbins);
}


sub create_histogram {
    my $absmin  = -9e99;
    my $absmax  =  9e99;
    my $lo      = $absmin;
    my $hi      = $lower_lim;
    my $cnt     = 0;
    for (@sorted) {
        until ( ($lo <= $_) && ($_ < $hi) ) {
            push @freq, $cnt;
            $cnt = 0;
            $lo  = $hi;
            if ($hi < $upper_lim) {
                $hi += $binsize;
            }
            else {
                $hi = $absmax;
            }
        }
        $cnt++;
    }
    push @freq, $cnt;
}


sub check_if_numeric {
    # Check to see if a value is numeric.
    # This nasty regular expression belongs in its own sub.
    # From Perl Cookbook, sec. 2.1
    my $value = shift;
    if ($value =~ /^-?(?:\d+(?:\.\d*)?|\.\d+)$/) {
        return 1;   # value is numeric
    }
    else {
        return 0;   # value contains non-numeric characters
    }
}


sub print_histogram {
    my $lower = $lower_lim;
    my $nfreq = scalar @freq;   # Number of elements in array.
    my $mid   = int($nsamp/2) - 1;
    my $pct10 = int($nsamp * 0.1) - 1;
    my $pct90 = int($nsamp * 0.9) - 1;
    my $median;
    if (($nsamp % 2) == 0) {    # Even number of samples
        $median = int( ($sorted[$mid] + $sorted[$mid+1])/2 );
    }
    else {                      # Odd number of samples
        $median = int($sorted[$mid+1]);
    }
    print "\n";
    print "\tNumber of samples in population: $nsamp\n";
    print "\tValue range: $minval - $maxval\n";
    print "\tMean value: ", int(sum(@sorted)/$nsamp), "\n";
    print "\tMedian value: $median\n";
    print "\t10% value: $sorted[$pct10]\n";
    print "\t90% value: $sorted[$pct90]\n";
    print "\n";
    my $range = sprintf ' < %d', $lower;
    printf "%20s: %6d\n", $range, $freq[0];
    for (my $i=1; $i<($nfreq-1); $i++) {
        $range = sprintf '%d - %d', $lower, ($lower + $binsize);
        printf "%20s: %6d\n", $range, $freq[$i];
        $lower = $lower + $binsize;
    }
    $range = sprintf ' >= %d', $lower;
    printf "%20s: %6d\n", $range, $freq[$nfreq-1];
    print "\n";
    if (sum(@freq) != $nsamp) {
        die "Error: Histogram not calculated properly.  " .
            "Number of samples ($nsamp) should be equal to " .
            "sum of frequencies (". sum(@freq) . ").\n";
    }
}


sub parse_args {
    use Getopt::Std;

    use vars qw($opt_h $opt_l $opt_u);
    unless (getopts('hl:u:')) {
        print_usage();
        die "Error: Unsupported command option.";
    }

    if ($opt_h) {
        print_usage();
        exit 1;
    }

    # The "defined" check is necessary since the value "0" is
    # a valid value.  Perl treats "0" as a special value.
    if (defined $opt_l) {
        $user_lower_lim = $opt_l;
        unless (check_if_numeric($user_lower_lim)) {
            print_usage();
            die "Error: Lower limit must not contain non-numerics.";
        }
    }

    if (defined $opt_u) {
        $user_upper_lim = $opt_u;
        unless (check_if_numeric($user_upper_lim)) {
            print_usage();
            die "Error: Upper limit must not contain non-numerics.";
        }
    }
}


sub print_usage {
    warn <<"EOF";

USAGE
  $0 [-h] [-u upper_lim] [-l lower_lim] [file ...]

DESCRIPTION
  Create a histogram from a column of numeric data values.
  The histogram is printed to STDOUT.  The input data must be formatted as
  a single column of numeric data.  By default, the histogram is auto-scaled
  based on the minimum and maximum values of the input data.  The histogram
  can be rescaled by the user.

OPTIONS
  -h            Print this help message
  -u upper_lim  User-defined upper limit
  -l lower_lim  User-defined lower limit (lower-case letter L)

OPERANDS
  file          A path name of a file containing numerical data.
                If  no file operands are specified, the standard input
                will be used.

EXAMPLES
  $0 data.txt
  awk '{print \$4}' file.txt | $0
  $0 -l 0.0 -u 300 data.txt
  $0 -h

EXIT STATUS
  0     Successful completion
  >0    An error occurred

NOTES
  This program performs some rounding off to integer values
  to simplify printout.

EOF
}

If I were to re-write this in 2026, I would use a different style in a number of places. I just thought it would be fun to revisit some code from a different era. Feel free to offer any type of feedback.

I guess that would be mostly cosmetic changes. It's all AI nowadays; perhaps try a few models and see if anything meaningful comes out. 🙂
– mpapec, Commented yesterday
Cosmetic as in: scope of variables, input parsing method, our, //, maybe the ?: operator, and the numeric check.
– mpapec, Commented yesterday
Nice to see that Historical Friday is a (slowly) developing trend...!
– Toby Speight, Commented yesterday

Toby Speight · Accepted Answer · 2026-02-27 13:50:54Z

Looks good, generally. Just minor nit-picks really.

When I ask for the usage information, the program exits with a failure status, despite successfully producing what I asked for (albeit to the error stream rather than the standard output stream as I would expect).

Perhaps it makes sense for the number of bins to be a user option, rather than hardcoded to 10?

Input format conversion could be improved. At present, I need to convert scientific-notation values to fixed-point representation before feeding them in. On the other hand, I get no complaints about lines with multiple integer values: because read_input removes all whitespace from each line, a line such as "1 2" is silently read as 12.
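As a minimal sketch of one way to address both points, the core Scalar::Util module provides looks_like_number, which accepts scientific notation. The helper name parse_value is hypothetical, and rejecting Inf and NaN is an assumption about what a histogram should accept, not something the original program does.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: return the numeric value of one input line,
# or undef if the line is not a single finite number.
sub parse_value {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;                    # trim leading/trailing whitespace only
    return if $line =~ /\s/;                    # interior whitespace: more than one field
    return unless looks_like_number($line);     # accepts forms such as 1.5e3
    return if $line =~ /inf|nan/i;              # assumption: Inf and NaN are not wanted
    return $line + 0;
}
```

In read_input, the loop body could then call parse_value on each line and die when it returns undef, instead of stripping all whitespace first.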

The rounding to integer is problematic in some use cases. For example, I fed in a set of inputs between 0 and 1 and got this result:

    Number of samples in population: 999
    Value range: 0.001 - 0.999
    Mean value: 0
    Median value: 0
    10% value: 0.099
    90% value: 0.899

                 < 0:      0
                >= 0:    999

That's not much of a histogram, really.
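A minimal sketch of binning without the int() rounding, assuming the bin width may be fractional. The helper name bin_counts and the choice to place the maximum value in the last bin are illustrative assumptions, not the original program's behaviour.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: count values into $nbins equal-width bins spanning
# the data range, keeping the bin width as a floating-point number.
sub bin_counts {
    my ($values, $nbins) = @_;
    my $lower  = min @$values;
    my $upper  = max @$values;
    my $width  = ($upper - $lower) / $nbins;
    my @counts = (0) x $nbins;
    for my $x (@$values) {
        # All-equal data gives zero width; put everything in the first bin.
        my $i = $width > 0 ? int( ($x - $lower) / $width ) : 0;
        $i = $nbins - 1 if $i >= $nbins;    # the maximum belongs to the last bin
        $counts[$i]++;
    }
    return ($lower, $width, @counts);
}

# Same kind of input as above: 999 values between 0 and 1.
my @data = map { $_ / 1000 } 1 .. 999;
my ($lower, $width, @counts) = bin_counts(\@data, 10);
for my $i (0 .. $#counts) {
    printf "%8.3f - %-8.3f: %6d\n",
        $lower + $i * $width, $lower + ($i + 1) * $width, $counts[$i];
}
```

Printing the bin edges with a fixed number of decimals is itself a design choice; a real version would probably derive the precision from the bin width.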

We give a useful error to the user if they specify a lower limit that's too high for the range, but if their upper limit is too low, we simply use it and allow all bins to be empty. We should be consistent one way or another.
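One consistent policy would be to reject any limit that leaves the data range entirely outside the histogram. The following is only a sketch of that policy; the helper name check_limits and the exact messages are hypothetical, and silently widening the range instead would be an equally consistent choice.

```perl
use strict;
use warnings;

# Hypothetical helper: die unless the requested limits are ordered
# and overlap the range of the data.
sub check_limits {
    my ($lower, $upper, $minval, $maxval) = @_;
    die "Error: Lower limit ($lower) must be less than upper limit ($upper).\n"
        unless $lower < $upper;
    die "Error: Lower limit ($lower) must not exceed the maximum value ($maxval).\n"
        if $lower > $maxval;
    die "Error: Upper limit ($upper) must not be below the minimum value ($minval).\n"
        if $upper < $minval;
    return 1;
}
```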

The $absmin and $absmax values in create_histogram seem arbitrary. We should probably be using ±"inf" instead.
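A minimal sketch of infinite sentinels. It uses the common 9**9**9 idiom, which overflows to floating-point infinity, rather than relying on the string "inf" being converted to a number, since that conversion may depend on the Perl version and platform.

```perl
use strict;
use warnings;

my $inf = 9**9**9;                          # overflows to floating-point infinity
my ($absmin, $absmax) = (-$inf, $inf);

# Every finite value falls strictly between the two sentinels,
# including values far beyond the original +/-9e99 limits.
for my $x (-1e300, 0, 1e300) {
    print "$x is in range\n" if $absmin < $x and $x < $absmax;
}
```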

Agreed about the int. I guess I never tested it on numbers between 0 and 1; most of my use cases were integers. Thank you.
– toolic, Commented yesterday
Dealing nicely with both large and tiny numbers isn't a trivial undertaking, so I'm not surprised you didn't tackle that, especially since you've never needed it! I mentioned it mainly for completeness.
– Toby Speight, Commented yesterday

chux · Accepted Answer · 2026-02-27 21:18:11Z

Asymmetric output

Given the value range 0 - 98 and the fact that the output has a < 0 line, I'd expect a matching > 98 line too, rather than having the top of the range lumped into the >= 90 line.
That would better demonstrate the absence of off-by-one errors.

toolic · Accepted Answer · 2026-02-27 19:39:02Z

In addition to the feedback already received, here are changes I would make in 2026.

Modules

All use lines belong at the top of the file, not buried inside any sub, as they are in parse_args.

Getopt

Use Getopt::Long instead of Getopt::Std because Long allows for more flexibility.

Explicitly import only needed functions, like GetOptions:

use Getopt::Long qw(GetOptions);

I did not do this with Std, potentially polluting the namespace.

Instead of declaring individual option scalar variables:

use vars qw($opt_h $opt_l $opt_u);

declare an option hash variable:

my %opt;

Don't use the vars pragma since it is now discouraged.
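Putting these points together, a sketch of what parse_args might become. The long option names and the bins option are hypothetical (the original only has -h, -l and -u), and GetOptionsFromArray is used here only so that the parser can be called with an explicit argument list.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical replacement for parse_args: options land in one lexical hash.
sub parse_options {
    my @args = @_;
    my %opt  = (bins => 10);            # defaults live in the hash itself
    GetOptionsFromArray(
        \@args, \%opt,
        'help|h',
        'lower|l=f',                    # =f: real number, validated by Getopt::Long
        'upper|u=f',
        'bins|b=i',                     # =i: integer
    ) or die "Error: Unsupported command option.\n";
    return (\%opt, \@args);             # remaining @args are the file operands
}

my ($opt, $files) = parse_options(qw(-l 0 -u 300 data.txt));
# The "defined" check is still needed, since 0 is a valid limit.
print "lower limit given\n" if defined $opt->{lower};
```

Because the =f specifier makes Getopt::Long reject non-numeric arguments, the separate check_if_numeric calls on the limits would no longer be needed in this version.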

Documentation

Replace the print_usage sub and its heredoc with plain old documentation (POD) and the corresponding Pod::Usage module.
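A trimmed-down sketch of that arrangement; the POD text is abbreviated from the existing usage message and the option handling is reduced to -h for brevity. With Pod::Usage, an exit value below 2 sends the message to STDOUT by default, so a help request can succeed with status 0 while option errors still go to STDERR with a non-zero status.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptions);
use Pod::Usage qw(pod2usage);

=head1 NAME

hist - create a histogram from a column of numeric data values

=head1 SYNOPSIS

hist [-h] [-u upper_lim] [-l lower_lim] [file ...]

=head1 OPTIONS

=over 4

=item B<-h>

Print this help message.

=back

=cut

my %opt;

# Bad option: print the SYNOPSIS to STDERR and exit with status 2.
GetOptions(\%opt, 'help|h') or pod2usage(-exitval => 2);

# Help requested: print SYNOPSIS and OPTIONS to STDOUT and exit with status 0.
pod2usage(-exitval => 0, -verbose => 1) if $opt{help};
```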

Operator

In this line, I now prefer to use the and operator instead of && because it reads better:

until ( ($lo <= $_) && ($_ < $hi) ) {

Warnings

I prefer the stricter "Fatal":

use warnings FATAL => 'all';
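A small sketch of the effect, using a hypothetical helper named to_number: with fatal warnings, the "isn't numeric" warning becomes an exception that can be caught with eval.

```perl
use strict;
use warnings FATAL => 'all';

# Hypothetical helper: numify a string. With FATAL warnings, a non-numeric
# argument raises an exception instead of merely printing a warning.
sub to_number {
    my ($text) = @_;
    return $text + 0;
}

my $n = eval { to_number('abc') };
print "Rejected: $@" if !defined $n;
```

It is worth weighing this against the caution in the warnings documentation that making all warnings fatal can cause a program to die when a later Perl release adds new warnings; promoting only selected categories, for example FATAL => qw(numeric uninitialized), is a narrower alternative.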

Errors

This check is dubious:

# Check for any unexpected error conditions:
if ($!) {

I moved away from blindly checking for system errors.
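The reason is that $! is only meaningful immediately after a call that reports failure; successful operations may leave a stale value behind. A minimal sketch of the more targeted style, with a hypothetical helper named read_lines:

```perl
use strict;
use warnings;

# Hypothetical helper: read all lines of a file, consulting $! only
# directly after an operation that has reported failure.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Error: Can't open '$path': $!\n";
    chomp(my @lines = <$fh>);
    close $fh
        or die "Error: Can't close '$path': $!\n";
    return @lines;
}
```

With checks like these at each point of failure, the blanket test of $! at the end of the program can be dropped and the script can simply fall off the end with a zero exit status.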
