Question

Why do we need to append "$" to the original string when we implement a suffix tree?

Était-ce utile?

La solution

There can be special reasons for appending one (or even more) special characters to the end of the string when specific construction algorithms are used – both in the case of suffix trees and suffix arrays.

However, the most fundamental underlying reason in the case of suffix trees is a combination of two properties of suffix trees:

  1. Suffix trees are PATRICIA trees, i.e. the edge labels are, unlike the edge labels of tries, strings consisting of one or more characters
  2. Internal nodes exist only at branching points

This means you can potentially have a situation where one edge label is a prefix of another:

enter image description here

The idea here is that the black node on the right is a leaf node, i.e. a suffix ends here. But if the text has a suffix aa, then the single character a must also be a suffix. But there is no way for us to store the information that a suffix ends after the first a, because aa forms one continuous edge of the tree (property 1 above). We would have to introduce an intermediate node in which we could store the information, like this:

enter image description here

But this would be illegal because of property 2: No inner node must exist unless there is a branching point.

The problem is solved if we can guarantee that the last character of the text is a character that occurs nowhere else in the entire string. The dollar sign is normally used as a symbol for that.

Clearly, if the last character occurs nowhere else, there can't possible be any repetition (such as aa, or even a more complex one like abcabc) at the end of the string, hence the problem of non-branching inner nodes does not occur. In the example above, the effect of putting $ at the end of the string is:

enter image description here

There are three suffixes now: aa$, a$ and $, but none is a prefix of another. Obviously, this means we need to introduce an inner node after all, and there are a total of three leaves now. So, to be sure, the advantage of this is not that we save space or anything becomes more efficient. It's just a way to guarantee the two properties above. These properties are important when we prove certain useful characteristics of suffix trees, including the fact that its number of inner nodes is linear in the length of the string (you could not prove this if non-branching inner nodes were allowed).

This also means that in practice, you might use different ways of dealing with suffixes that are prefixes of other suffixes, and with non-branching inner nodes. For example, if you use the well-known Ukkonen algorithm to construct the tree, you can do that without appending a unique character to the end; you just have to make sure that at the end, after the final iteration, you put non-branching inner nodes to the end of every implicit suffix (i.e. every suffix that ends in the middle of an edge).

Again, there can be further, and very specific reasons for putting $ at the end of text before constructing a suffix tree or array. For example, in construction algorithms for suffix arrays that are based on the DC (difference cover) principle, you must put two $ signs to the end of the string to ensure that even the last character of the string is part of a complete character trigram, as the algorithm is based on trigram sorting. Furthermore, there are specific situations when the unique $ character must be interpreted in a special way. For the Ukkonen construction algorithm, it is sufficient for $ to be unique; for the DC suffix array algorithms it is necessary, in addition to uniqueness, that $ is lexicographically smaller than all other characters, and in the suffix-tree based circular string cutting algorithm (mentioned recently here) it is actually necessary to interpret $ as the lexicographically largest character.

Autres conseils

I suspect that it for traversing purposes. When you are generating something from the suffix tree you need to know if you at a node where the string finishes or not, if not then you know that you have to keep going. Looking at the longest common substring to which a suffix tree provides a linear time solution, you need the $ sentinels to determine that you've arrived at a node where the string terminates. You can't finish after A-NA.

from Wikipedia

enter image description here

1. NOT A PATRICIA

Suffix tree is NOT a Patricia Tree, which is Radix 2. Suffix Tree node may have 2 or MORE children.

2. NO ANY VALID REASON TODAY

There is no any reasons to add a special character other than

  • requirement to have 2 or more childrens
  • requirement to have exactly n leaves for string of n characters

Suffix tree can be implemented the same way as compressed trie (or radix tree as one of the kind), without any special symbols and there is no any functional disadvantages in this case.

3. OLD TRAILS

If you'll look into old book from 1973, you'll see very similar to trie structure, which is named "uncompressed suffix tree", but with values and termination symbols. Then they compact it.

4. BUT WHAT'S DIFFERENT?

Prefix and suffix trees both have metadata in nodes, right? Which is implemented as value of the node.

But with suffix tree we've got one interesting requirement - we need to have an index of a suffix. So, in the last node we have to keep TWO metadata fields, TWO values. And you need to keep nodes of the same size, byte-to-byte. SO THEY DID IT THROUGH ADDITIONAL NODE, END NODE

In the modern world, you can keep as many fields as you want, you are not going to save each and every byte spent, so you don't need this trick.

5. SO, DO WE HAVE A REASONS FOR END SYMBOL?

Yes, potentially we have a non-functional reason - save few bytes in each non-leaf node.

6. STILL ... ANY FUNCTIONAL REASON FOR END SYMBOL?

Yes, we may have one case where end symbolS are useful - GENERALIZED suffix tree, not just a suffix tree.

Generalized suffix tree will require different end markers, in a collection on the node or as a separate end symbols. Again, you can implement with or without special symbols.

7. BOTTOMLINE

  • These requirement seems to be a legacy for old systems
  • Feel free to implement suffix tree in a same way as a compressed prefix tree, there is no caveats except few bytes wasted in each node for unused end index flag.
  • Generalized suffix tree is a structure where end symbolS may be useful (but you still can build it without them)

I hope this will make situation clearer.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top