Question

I'm new to this website. Here's a problem that has troubled me for more than two hours. I have a string (a phylogenetic tree in Newick format), which looks like:

((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35);

The tree may have multiple levels, indicated by parentheses. Now I want to add a number, say 10, to the top-level numbers (branch lengths). Here there are only three top-level numbers: 22, 76, 35. After the conversion the string should look like:

((A:14,B:43):32,C:86,(D:54,(E:87,F:28):17):45);

I tried my best to work out a proper regex, but finally had to admit my limits. How can this be done?

Solution

s/(?:^\(|(\((?:(?>[^()]*)|(?1))*\)))\K|:\K([0-9]+)/defined $2 ? $2+10 : ""/ge

Match either things you want to skip or digits preceded by a :.

Things you want to skip are either the leading ( or any balanced set of parentheses (balanced parentheses regex taken almost literally from perlre).

In the substitution, add ten if digits to be modified were matched; otherwise substitute the empty string.

But you are better off not being clever and instead going to the work to parse, modify, and reserialize your tree.
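As a rough illustration of the non-regex route, here is a minimal sketch that does not build a full tree but simply tracks nesting depth while scanning tokens. The helper name add_to_top_level is hypothetical, and the sketch assumes branch lengths are non-negative integers, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): add $delta to every
# branch length that sits at nesting depth 1. Assumes integer branch lengths.
sub add_to_top_level {
    my ($newick, $delta) = @_;
    my $depth = 0;
    my $out   = '';
    # Tokens: "(", ")", ":digits", a lone ":", or any run of other characters.
    for my $tok ($newick =~ /\(|\)|:[0-9]+|[^():]+|:/g) {
        if ($tok eq '(') {
            $depth++;
        }
        elsif ($tok eq ')') {
            $depth--;
        }
        elsif ($depth == 1 && $tok =~ /^:([0-9]+)$/) {
            # A branch length directly inside the outermost parentheses.
            $out .= ':' . ($1 + $delta);
            next;
        }
        $out .= $tok;
    }
    return $out;
}

print add_to_top_level('((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35);', 10), "\n";
```

Because the depth counter is explicit, the same loop could be pointed at a different level by changing the comparison, which is harder to do with the recursive regex.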

Other tips

Although I would opt for parsing the whole tree, the problem can be solved using only regexes:

use strict; use warnings; use feature qw(say);
my $string = "((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35)";
$string =~ s/^\(//;
$string =~ s/\)$//;
$string =~ s{
    \G ((?&PRELEM)) : (\d+) (,|$)
    (?(DEFINE)
        (?<SUBLIST> [(] (?&ELEM)(?:,(?&ELEM))* [)] )
        (?<ELEM> (?&PRELEM) : \d+ )
        (?<PRELEM> (?:[A-Z]|(?&SUBLIST)) )
    )
}{"$1:".($2+10).$3}gex;
say "($string)";

Prints ((A:14,B:43):32,C:86,(D:54,(E:87,F:28):17):45).

I define a small grammar for top-down recursive parsing; please adapt as needed. On the top level, we have uninteresting pre-elements, which we store in $1. They can be a single letter or a tree enclosed in parentheses. After a : comes the number which we want to increment, stored in $2. It is followed by the end of the string or a comma. We match iteratively, starting where the last match left off (indicated by the /g option and the \G assertion). The addition happens when we build the substitution string (we are using the /e option).
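To make the role of \G concrete, here is a small sketch on a flat list rather than a tree. The sub name pairs_anchored and the sample strings are made up for illustration. With \G, each match must start exactly where the previous one ended, so scanning stops at the first element that does not fit instead of skipping ahead.

```perl
use strict;
use warnings;

# Hypothetical demo: collect "X:N" pairs, but only while they are contiguous
# from the start of the string.
sub pairs_anchored {
    my ($list) = @_;
    my @seen;
    # In scalar context, /g remembers where the last match ended (pos);
    # \G anchors the next attempt at exactly that position.
    while ($list =~ /\G([A-Z]):(\d+)(?:,|$)/g) {
        push @seen, "$1=$2";
    }
    return join ' ', @seen;
}

print pairs_anchored('A:1,B:2,C:3'), "\n";   # every element fits the pattern
print pairs_anchored('A:1,b:2,C:3'), "\n";   # stops at the lowercase "b"
```

Without \G, the second call would skip over "b:2" and still pick up "C:3". In the substitution above, that kind of skipping would let the regex reach into nested subtrees.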

This needs a recursive regular expression to match the nested parentheses.

First define a 'key', which is either a string of capital letters or any number of key:value pairs between parentheses.

Then find all keys followed by a colon and a decimal number and do the arithmetic on the number.

use strict;
use warnings;

my $str = '((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35)';

my $key = qr/ (?<key> [A-Z]+ | \( (?&key) : \d+ (?: , (?&key) : \d+ )* \)  ) /x;

$str =~ s/$key : \K ( \d+ ) /$2 + 10/xge;

print $str;

output

((A:14,B:43):32,C:86,(D:54,(E:87,F:28):17):45)

First, I'd like to thank ysth for his very interesting post in this thread. From it, I learned how and why to apply the \K ("keep") escape.

I added another \K (to the first subexpression) and used the newer possessive quantifier ++ in place of the atomic group:

my $r = qr{
  (?:
     (?: ^ \(\K )
     |
     (
       \( (?: [^()]++ | (?1) )* \)
     )\K
  )
  |
  :\K (\d+)
}x;

The output string now matches exactly the input string - except for the incremented values:

$t =~ s/$r/defined $2 ? $2+10 : ''/ge;

input:  ((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35)
output: ((A:14,B:43):32,C:86,(D:54,(E:87,F:28):17):45)
License: CC-BY-SA with attribution
Not affiliated with StackOverflow