Question

I am trying to find all the possible longest common subsequence from the same position of multiple fixed length strings (there are 700 strings in total, each string have 25 alphabets ). The longest common subsequence must contain at least 3 alphabets and belong to at least 3 strings. So if I have:

String test1 = "abcdeug";
String test2 = "abxdopq";
String test3 = "abydnpq";
String test4 = "hzsdwpq";

I need the answer to be:

String[] Answer = ["abd", "dpq"];

My one problem is this needs to be as fast as possible. I am trying to find the answer with suffix tree, but the solution of suffix tree method is ["ab","pq"].Suffix tree can only find continuous substring from multiple strings.The common longest common subsequence algorithm cannot solve this problem. Does anyone have any idea on how to solve this with low time cost? Thanks

Était-ce utile?

La solution

I suggest you cast this into a well known computational problem before you try to use any algorithm that sounds like it might do what you want.

Here is my suggestion: Convert this into a graph problem. For each position in the string you create a set of nodes (one for each unique letter at that position amongst all the strings in your collection... so 700 nodes if all 700 strings differ in the same position). Once you have created all the nodes for each position in the string you go through your set of strings looking at how often two positions share more than 3 equal connections. In your example we would look first at position 1 and 2 and see that three strings contain "a" in position 1 and "b" in position 2, so we add a directed edge between the node "a" in the first set of nodes of the graph and "b" in the second group of nodes (continue doing this for all pairs of positions and all combinations of letters in those two positions). You do this for each combination of positions until you have added all necessary links.

Once you have your final graph, you must look for the longest path; I recommend looking at the wikipedia article here: Longest Path. In our case we will have a directed acyclic graph and you can solve it in linear time! The preprocessing should be quadratic in the number of string positions since I imagine your alphabet is of fixed size.

P.S: You sent me an email about the biclustering algorithm I am working on; it is not yet published but will be available sometime this year (fingers crossed). Thanks for your interest though :)

Autres conseils

You may try to use hashing.
Each string has at most 25 characters. It means that it has 2^25 subsequences. You take each string, calculate all 2^25 hashes. Then you join all the hashes for all strings and calculate which of them are contained at least 3 times. In order to get the lengths of those subsequences, you need to store not only hashes, but pairs <hash, subsequence_pointer> where subsequence_pointer determines the subsequence of that hash (the easiest way is to enumerate all hashes of all strings and store the hash number).
Based on the algo, the program in the worst case (700 strings, 25 characters each) will run for a few minutes.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top