Find the number of matching two characters in a string in Perl

https://stackoverflow.com/questions/8184385

04-03-2021

Question

Is there a method in Perl (not BioPerl) to count the occurrences of each pair of consecutive letters?

I.e., number of AA, AC, AG, AT, CC, CA, ... in a sequence like this:

$sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

PS: We can do it manually with a regular expression, i.e., $GC=($sequence=~s/GC/GC/g), which returns the number of GC in the sequence.

I need an automated and generic way.

Solution

You had me confused for a while, but I take it you want to count the dinucleotides in a given string.

Code:

my @dinucs = qw(AA AC AG CC CA CG);
my %count;
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

for my $dinuc (@dinucs) {
    $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
}

Output from Data::Dumper (s///g returns an empty string rather than 0 when nothing matches):

$VAR1 = {
          "AC" => 5,
          "CC" => "",
          "AG" => "",
          "AA" => 1,
          "CG" => 3,
          "CA" => ""
        };
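
The @dinucs list above is typed by hand and covers only some of the pairs. As a minimal sketch of a more generic variant, assuming a four-letter A/C/G/T alphabet (the @bases array and the look-ahead pattern are illustrative choices, not part of the answer above), the pairs can be generated and counted like this:

```perl
use strict;
use warnings;

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @bases    = qw(A C G T);    # assumed alphabet

# Build every ordered pair of bases: AA, AC, ..., TT (16 in total).
my @dinucs;
for my $first (@bases) {
    for my $second (@bases) {
        push @dinucs, $first . $second;
    }
}

# Start every count at 0 so pairs that never occur are reported as 0, not ''.
my %count = map { $_ => 0 } @dinucs;

# Walk the sequence with a capturing look-ahead. The look-ahead is zero-width,
# so the match position advances one character at a time and overlapping
# pairs are all visited.
while ($sequence =~ /(?=([ACGT]{2}))/g) {
    $count{$1}++;
}

printf "%s => %d\n", $_, $count{$_} for @dinucs;
```

Initialising the hash first means absent pairs such as CC show up as 0 instead of an empty string.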

Other tips

Close to TLP's answer, but without substitution:

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;

for my $dinuc (@dinucs) {
    while($sequence=~/$dinuc/g) {
        $count{$dinuc}++;
    }
}

Benchmark:

use Benchmark qw(cmpthese);

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;

my $count = -3;
my $r = cmpthese($count, {
        'match' => sub {
            for my $dinuc (@dinucs) {
               while($sequence=~/$dinuc/g) {
                    $count{$dinuc}++;
               }
            }
        },
        'substitute' => sub {
            for my $dinuc (@dinucs) {
                $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
            }
         }
});

Output:

              Rate Substitute      Match
Substitute 13897/s         --       -11%
Match      15622/s        12%         --

Regex works if you're careful, but there's a simple solution using substr that will be faster and more flexible.

(As of this posting, the regex solution marked as accepted will fail to correctly count dinucleotides in repeated regions like 'AAAA...', of which there are many in naturally occurring sequences.

Once 'AA' matches, the regex search resumes at the third character, so the 'AA' that starts at the second character is skipped. Only the same-letter dinucleotides (AA, CC, GG, TT) are affected: a mixed pair such as 'AC' cannot start again at the very next base. The particular sequence given in the question does not suffer from this problem, since no base appears three times in a row.)
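
A quick way to see the skipped overlap, as a minimal sketch (the test string 'AAAA' and the variable names are illustrative, not taken from the answers above):

```perl
use strict;
use warnings;

my $seq = 'AAAA';    # contains three overlapping 'AA' dinucleotides

# Plain global match: each match consumes both characters, so the search
# resumes after the pair and the middle 'AA' is never seen.
my $plain = () = $seq =~ /AA/g;

# Zero-width look-ahead: the match consumes nothing, so the regex engine
# moves forward one character at a time and sees every overlapping pair.
my $overlapping = () = $seq =~ /(?=AA)/g;

print "plain: $plain, overlapping: $overlapping\n";    # plain: 2, overlapping: 3
```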

The method I suggest is more flexible in that it can count words of any length; extending the regex method to longer words is complicated since you have to do even more gymnastics with your regex to get an accurate count.

sub substrWise {
    my ($seq, $wordLength) = @_;

    my $cnt = {};

    my $w;
    for my $i (0 .. length($seq) - $wordLength) {
        $w = substr($seq, $i, $wordLength);
        $cnt->{$w}++;
    }

    return $cnt;
}

sub regexWise {
    my ($seq, $dinucs) = @_;

    my $cnt = {};
    for my $d (@$dinucs) {
        if (substr($d, 0,1) eq substr($d, 1,1) ) {
            my $n = substr($d, 0,1);
            $cnt->{$d} = ($seq =~ s/$n(?=$n)/$n/g); # use look-ahead
        } else {
            $cnt->{$d} = ($seq =~ s/$d/$d/g);
        }
    }

    return $cnt;
}


my @dinucs = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

use Test::More tests => 1;
my $rWise = regexWise($sequence, \@dinucs);
my $sWise = substrWise($sequence, 2);
$sWise->{$_} //= '' for @dinucs; # substrWise does not create keys for words it never finds.
# That seems like desirable behavior IMO,
# but I'm adding '' here to show that the counts match.
is_deeply($rWise, $sWise, 'verify equivalence');

use Benchmark qw(:all);
cmpthese(100000, {
    'regex' => sub {
        regexWise($sequence, \@dinucs);
    },
    'substr' => sub {
        substrWise($sequence, 2);
    }
});

Output:

1..1
ok 1 - verify equivalence
          Rate  regex substr
regex  11834/s     --   -85%
substr 76923/s   550%     --

For longer sequences (10-100 kbase), the advantage is less pronounced, but substr still wins by about 70%.
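
To illustrate the point about word length, here is a minimal self-contained sketch of the same sliding-window idea applied to three-letter words. The function name count_words is hypothetical; it mirrors substrWise above rather than replacing it.

```perl
use strict;
use warnings;

# Count every overlapping word of length $k in $seq using a sliding window.
sub count_words {
    my ($seq, $k) = @_;
    my %count;
    for my $i (0 .. length($seq) - $k) {
        $count{ substr($seq, $i, $k) }++;
    }
    return \%count;
}

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my $tri      = count_words($sequence, 3);    # trinucleotides
my $di       = count_words('AAAA', 2);       # overlapping repeats are all counted

print "$_ => $tri->{$_}\n" for sort keys %$tri;
print "AA in 'AAAA' => $di->{AA}\n";
```

Only the second argument changes between word lengths, whereas a regex-based count needs a new pattern (and overlap handling) for each length.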

License: CC-BY-SA with attribution

Not affiliated with StackOverflow