I have written this function (and others similar to that one) But I am not sure I am using references on their full power.
My currently concerns is if I make a huge use of memory. The subroutine recieve a reference to two files, the subroutines, like these one return a hash (except &log_time, which returns a scalar). The subroutines expect a reference, thats why I use my %current_seq = ($id_name[0] => $seqs[$j]); my %freq_seq = &hexamer_freq(\%current_seq); on the subroutine, I don't think this is a good way to do it, but I can't imagine a better way to do it.
sub comparer{
# Compare a sequence with the reference frequency of hexamers
# First argument the file of to analyse, second argument the file of reference
my %score;
#Reading arguments
my $seq = shift;
my $ref_seq = shift;
# Calculating the reference log2
my %ref_seq = &read_fasta($$ref_seq);
my %freq_ref = &hexamer_freq(\%ref_seq);
# Counting hexamers and frequencies for each sequence
my %seqs = &read_fasta($seq);
while( my ($id, $sequen) = each %seqs){
my @id_seq = split(/\s+/, $id);
my @id_name = split(/\./, $id_seq[0]);
my $max = 0;
my $min = 999;
for (my $i = 0; $i < 3; $i++){
last if length $sequen <= $i;
my $sequ = substr($sequen, $i);
next unless (defined $sequ);
my $rev_sequ = reverse($sequ);
my @seqs = ($sequ, $rev_sequ);
for (my $j = 0; $j < scalar(@seqs); $j++){
my %current_seq = ($id_name[0] => $seqs[$j]);
my %freq_seq = &hexamer_freq(\%current_seq);
# Handle the sequences that are too short to contain an hexamer
if (scalar keys %freq_seq == 0){
print STDERR &log_time(), "Unable to calculate the Hexamer score of $id_name[0]\n";
next;
};
# Calculate the hexamer score
my $score = 0;
my $n_hexamers = scalar keys %freq_seq;
foreach my $hex (keys %freq_seq){
if (defined $freq_ref{$hex}){
$score += log2($freq_seq{$hex}/$freq_ref{$hex});
}
};
# Store the two possible candidates of "best score"
if ($score/$n_hexamers > $max){
$max = $score/$n_hexamers;
};
unless ($score/$n_hexamers > $min) {
$min = $score/$n_hexamers
}
# Store the data for each sequence
my $key = $id_name[0] . " frame: $i";
$key .= " FWD" if $j == 0; # The fwd + or - have the same hexamers
$key .= " REV" if $j == 1;
$score{$key} = [$max, $min];
};
};
}
return %score;
};
Besides, I would like to improve the way the $min is calculated. Now the 999 is an arbitrary number, I expect it won't be reached, because the $freq_ are numbers between 0 and 1, and it is unlikely to have so big numbers (but it may happen).