package Data::Frame::Examples;
$Data::Frame::Examples::VERSION = '0.006004';
# ABSTRACT: Example data sets
use Data::Frame::Setup;
use File::ShareDir qw(dist_dir);
use Module::Runtime qw(module_notional_filename);
use Path::Tiny;
use Data::Frame;
use Data::Frame::Util qw(factor);
use parent qw(Exporter::Tiny);
my %data_setup = (
airquality => {},
diamonds => {
postprocess => sub {
my ($df) = @_;
return _factorize(
$df,
cut => [ 'Fair', 'Good', 'Very Good', 'Premium', 'Ideal' ],
color => [ 'D' .. 'J' ],
clarity => [qw(I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF)]
);
}
},
economics => { params => { dtype => { date => 'datetime' } } },
economics_long => { params => { dtype => { date => 'datetime' } } },
faithfuld => {},
iris => { params => { dtype => { Species => 'factor' } } },
mpg => {},
mtcars => {},
txhousing => {},
);
my @data_names = sort keys %data_setup;
our @EXPORT_OK = ( @data_names, 'dataset_names' );
our %EXPORT_TAGS = (
datasets => \@data_names,
all => \@EXPORT_OK,
);
my $data_raw_dir;
#TODO: Change this dist name when merging this to Data::Frame.
try { $data_raw_dir = dist_dir('Data-Frame'); }
catch ($e) {
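# Development fallback: the distribution is not installed, so locate the
# checkout root from this file's path (one level up per namespace
# component, plus one more for lib/) and use its data-raw directory.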
my $path = path( $INC{ module_notional_filename(__PACKAGE__) } );
$data_raw_dir =
path( $path->parent( ( () = __PACKAGE__ =~ /(::)/g ) + 2 ), 'data-raw' )
. '';
}
for my $name (@data_names) {
no strict 'refs';
*{$name} = _make_data( $name, $data_setup{$name} );
}
sub dataset_names { @data_names; }
sub _factorize {
my ($df, %var_levels ) = @_;
for my $var (sort keys %var_levels) {
my $levels = $var_levels{$var};
$df->set(
$var,
factor(
$df->at($var),
levels => $levels,
ordered => true
)
);
}
return $df;
}
#TODO: switch from csv to some other format for speed
sub _make_data {
my ( $name, $setup ) = @_;
return sub {
state $df;
unless ( defined $df ) {
$df = Data::Frame->from_csv(
"$data_raw_dir/$name.csv",
header => true,
%{ $setup->{params} }
);
if (my $postprocess = $setup->{postprocess}) {
$df = $postprocess->($df);
}
}
return $df;
};
}
1;
__END__
=pod
=encoding UTF-8
=head1 NAME
Data::Frame::Examples - Example data sets
=head1 VERSION
version 0.006004
=head1 SYNOPSIS
    use Data::Frame::Examples qw(:datasets dataset_names);
    my @datasets = dataset_names();    # names of all example datasets
    my $mtcars   = mtcars();
=head1 DESCRIPTION
Example datasets as L<Data::Frame> objects.
Call C<Data::Frame::Examples::dataset_names()> for the list of
example datasets provided by this module.
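
The dataset functions in this module keep the loaded data frame in a
C<state> variable, so repeated calls return the same object rather than
re-reading the CSV file. The following minimal sketch checks this with
C<refaddr> from the core L<Scalar::Util> module. The practical consequence,
that in-place changes to a returned data frame would be seen by later
callers, is an inference from the code and not a documented guarantee.

    use strict;
    use warnings;
    use Scalar::Util qw(refaddr);
    use Data::Frame::Examples qw(mtcars);

    my $first  = mtcars();    # reads mtcars.csv on the first call
    my $second = mtcars();    # returns the cached object

    print "same object\n" if refaddr($first) == refaddr($second);
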
=head1 FUNCTIONS
=head2 dataset_names
Returns a list of the names of the datasets in this module, sorted
alphabetically. In scalar context it returns the number of datasets.
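
As an illustration, the names can be used to look up each dataset function
at run time. This is only a sketch: it assumes the dataset functions can be
found through C<can> (they are installed into this package's symbol table),
and that L<Data::Frame> provides the C<nrow> and C<ncol> accessors. The
C<%dims> hash and C<$loader> variable are names chosen for this example.

    use strict;
    use warnings;
    use Data::Frame::Examples qw(dataset_names);

    # List context is needed to get the names themselves.
    my @names = dataset_names();

    my %dims;
    for my $name (@names) {
        # Each dataset is a function in the Data::Frame::Examples package.
        my $loader = Data::Frame::Examples->can($name)
          or die "no such dataset: $name";
        my $df = $loader->();    # reads the CSV on the first call
        $dims{$name} = [ $df->nrow, $df->ncol ];
        printf "%-15s %6d rows x %2d columns\n", $name, @{ $dims{$name} };
    }

Loading every dataset this way also reads the larger CSV files, such as
C<diamonds>, so it can take noticeably longer than loading a single one.
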
=head1 DATASETS
=head2 airquality
A dataset with 153 observations on 6 variables:
daily air quality readings from May 1, 1973 to
September 30, 1973.
The variables are,
=over 4
=item *
Ozone
numeric Ozone (ppb)
=item *
Solar_R
numeric Solar R (lang)
=item *
Wind
numeric Wind (mph)
=item *
Temp
numeric Temperature (degrees F)
=item *
Month
numeric Month (1-12)
=item *
Day
numeric Day of month (1-31)
=back
=head2 diamonds
A dataset containing the prices and other attributes of almost 54,000
diamonds (53,940 rows and 10 variables).
The variables are,
=over 4
=item *
price
price in US dollars
=item *
carat
weight of the diamond
=item *
cut
quality of the cut (Fair, Good, Very Good, Premium, Ideal)
=item *
color
diamond colour, from J (worst) to D (best)
=item *
clarity
a measurement of how clear the diamond is
(I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
=item *
x
length in mm
=item *
y
width in mm
=item *
z
depth in mm
=item *
depth
total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
=item *
table
width of top of diamond relative to widest point
=back
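
The C<depth> formula above can be worked through in plain Perl. This is a
sketch under two assumptions: the factor of 100 is inferred from the stated
43 to 79 range (the formula as written yields a fraction), and the
dimensions used are hypothetical values, not a row from the dataset.
C<depth_percentage> is a helper name invented for this example.

    use strict;
    use warnings;

    # Total depth percentage: depth z relative to the mean of length x
    # and width y, scaled to a percentage (the scaling is an assumption).
    sub depth_percentage {
        my ( $x, $y, $z ) = @_;
        return 100 * 2 * $z / ( $x + $y );
    }

    # Hypothetical dimensions in mm, not taken from the dataset.
    printf "%.1f\n", depth_percentage( 4.0, 4.0, 2.48 );    # prints 62.0
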
=head2 economics
A dataset with 574 rows and 6 variables,
produced from US economic time series data available from
L<http://research.stlouisfed.org/fred2>.
The variables are,
=over 4
=item *
date
Month of data collection
=item *
psavert
personal saving rate
=item *
pce
personal consumption expenditures, in billions of dollars
=item *
unemploy
number of unemployed in thousands
=item *
uempmed
median duration of unemployment, in weeks
=item *
pop
total population, in thousands
=back
=head2 economics_long
A dataset with 2870 rows and 4 variables.
It comes from the same data source as C<economics>, but while C<economics>
is in "wide" format, C<economics_long> is in "long" format.
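
The relation between the two layouts can be sketched in plain Perl. The
example below melts one "wide" record into "long" records. The values are
hypothetical, and the C<variable> and C<value> column names are assumptions
made for illustration. The real C<economics_long> has 4 variables, so it
carries one more column than this sketch produces.

    use strict;
    use warnings;

    # One hypothetical "wide" row: a date plus five measurement columns.
    my %wide = (
        date     => '2000-01-01',
        pce      => 1,
        pop      => 2,
        psavert  => 3,
        uempmed  => 4,
        unemploy => 5,
    );

    # Melt it into "long" rows: one record per measurement column.
    my @long =
      map { +{ date => $wide{date}, variable => $_, value => $wide{$_} } }
      grep { $_ ne 'date' } sort keys %wide;

    printf "%d wide row -> %d long rows\n",  1,   scalar @long;
    printf "%d wide rows -> %d long rows\n", 574, 574 * @long;

Each of the 574 wide rows contributes one long row per measurement column,
which matches the 2870 rows stated above (574 x 5).
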
=head2 faithfuld
A 2D density estimate of the C<waiting> and C<eruptions> variables of the
C<faithful> data. 5,625 observations and 3 variables.
=head2 iris
A dataset with 150 cases and 5 variables, for 50 flowers from each of 3
species of iris.
The variables are,
=over 4
=item *
Sepal_Length
=item *
Sepal_Width
=item *
Petal_Length
=item *
Petal_Width
=item *
Species
The species are I<setosa>, I<versicolor>, and I<virginica>.
=back
=head2 mpg
A subset of the fuel economy data that the EPA makes available on
L<http://fueleconomy.gov>. 234 rows and 11 variables.
The variables are,
=over 4
=item *
manufacturer
=item *
model
model name
=item *
displ
Engine displacement, in litres
=item *
year
year of manufacture
=item *
cyl
number of cylinders
=item *
trans
type of transmission
=item *
drv
f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
=item *
cty
city miles per gallon
=item *
hwy
highway miles per gallon
=item *
fl
fuel type
=item *
class
"type" of car
=back
=head2 mtcars
Data extracted from the 1974 I<Motor Trend US> magazine, for 32 automobiles
(1973-74 models). 32 observations on 11 variables.
The variables are,
=over 4
=item *
mpg
Miles/(US) gallon
=item *
cyl
Number of cylinders
=item *
disp
Displacement (cu.in.)
=item *
hp
Gross horsepower
=item *
drat
Rear axle ratio
=item *
wt
Weight (1000 lbs)
=item *
qsec
1/4 mile time
=item *
vs
Engine shape (0 = V-shaped, 1 = straight)
=item *
am
Transmission (0 = automatic, 1 = manual)
=item *
gear
Number of forward gears
=item *
carb
Number of carburetors
=back
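
A short sketch of working with these columns follows. It assumes that
C<at> returns numeric columns as PDL ndarrays, so that the standard PDL
methods C<where> and C<avg> apply. The variable names are chosen for this
example only.

    use strict;
    use warnings;
    use PDL;
    use Data::Frame::Examples qw(mtcars);

    my $df  = mtcars();
    my $mpg = $df->at('mpg');    # column as a PDL ndarray (assumed)
    my $am  = $df->at('am');     # 0 = automatic, 1 = manual

    # Boolean masks select the rows for each transmission type.
    my $mean_manual    = $mpg->where( $am == 1 )->avg;
    my $mean_automatic = $mpg->where( $am == 0 )->avg;

    printf "manual:    %.2f mpg\n", $mean_manual;
    printf "automatic: %.2f mpg\n", $mean_automatic;
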
=head2 txhousing
Information about the housing market in Texas provided by the TAMU real
estate center, L<http://recenter.tamu.edu/>.
8602 observations and 9 variables.
The variables are,
=over 4
=item *
city
Name of MLS area
=item *
year, month, date
=item *
sales
Number of sales
=item *
volume
Total value of sales
=item *
median
Median sale price
=item *
listings
Total active listings
=item *
inventory
"Months inventory": amount of time it would take to sell all current
listings at current pace of sales.
=back
=head1 SEE ALSO
L<Data::Frame>
=head1 AUTHORS
=over 4
=item *
Zakariyya Mughal <zmughal@cpan.org>
=item *
Stephan Loyd <sloyd@cpan.org>
=back
=head1 COPYRIGHT AND LICENSE
This software is copyright (c) 2014, 2019-2022 by Zakariyya Mughal, Stephan Loyd.
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.
=cut