algorithm for making optimise string from sub strings

Question

Suppose I have a collection of substrings, for example:

string a = {"cat","sensitive","ate","energy","tense"}

Then the output for this should be as follows:

catensesensitivenergy

How can I do this?

In what language? Is this just a theoretical question (e.g. the kind you'd find in schoolwork) for which the language doesn't matter much? — doppelgreener
– doppelgreener, Commented Dec 15, 2014 at 13:19
What are the rules for constructing the string? Maximum overlap of substrings? — AndyG
– AndyG, Commented Dec 15, 2014 at 14:06
we can use native language c. The reason behind asking for algorithm is there should be not or minimum use of function, we can use stdin.h and stdio.h function — Raj Damani
– Raj Damani, Commented Dec 15, 2014 at 14:29

kraskevich · Accepted Answer · 2014-12-15 16:59:42Z

5

This problem is known as the shortest common superstring problem and it is NP-hard, so if you need an exact solution you cannot do much better then trying all possibilities and choosing the best one.

One possible exponential solution is to generate all permutations of the input strings, find the shortest common superstring greedily for each permutation(a permutation specifies the order of strings and it is possible to prove that for a fixed order greedy algorithm always works correctly) and choose the best one.

edited Dec 15, 2014 at 16:59

answered Dec 15, 2014 at 15:54

kraskevich

18.6k4 gold badges37 silver badges46 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

kraskevich Over a year ago

@AndyG supersequence and superstring are different things, actually.

user2566092 Over a year ago

"So you cannot do much better than trying all possibilities and choosing the best one." This is an over-simplification of what NP-hard means. Often totally exhaustive solutions can be beaten quite well, albeit not in provable polynomial time, e.g. for factoring integers which may very well be an NP-hard problem yet there are known solutions that are provably so much faster than trial division.

Rerito Over a year ago

To be more specific, the problem at hand is NP-hard because it reduces to finding the shortest hamiltonian path in a directed graph

George Houpis · Accepted Answer · 2014-12-15 16:57:07Z

Using user2040251 suggestion:

#include <string>
#include <iostream>
#include <algorithm>

std::string merge_strings( const std::vector< std::string > & pool )
{
    std::string retval;
    for( auto s : pool )
        if( retval.empty() )
            retval.append( s );
        else if( std::search( retval.begin(), retval.end(), s.begin(), s.end() ) == retval.end() )
        {
            size_t len = std::min( retval.size(), s.size() ); 
            for( ; len; --len )
                if( retval.substr( retval.size() - len ) == s.substr( 0, len ) )
                {
                    retval.append( s.substr( len ) );
                    break;
                }
            if( !len )
                retval.append( s );
        }
    return retval;           
}

std::string shortest_common_supersequence( std::vector< std::string > & pool )
{
    std::sort( pool.begin(), pool.end() );

    std::string buffer;
    std::string best_reduction = merge_strings( pool );
    while( std::next_permutation( pool.begin(), pool.end() ) )
    {
        buffer = merge_strings( pool );
        if( buffer.size() < best_reduction.size() )
            best_reduction = buffer;
    }
    return best_reduction;
}


int main( int argc, char ** argv )
{
    std::vector< std::string > a{"cat","sensitive","ate","energy","tense"};
    std::vector< std::string > b{"cat","sensitive","ate","energy","tense","sit"};
    std::vector< std::string > c{"personal","ate","energy","tense","gyroscope"};
    std::cout << "best a --> \"" << shortest_common_supersequence( a ) << "\"\n";
    std::cout << "best b --> \"" << shortest_common_supersequence( b ) << "\"\n";
    std::cout << "best c --> \"" << shortest_common_supersequence( c ) << "\"\n";

    return 0;
}

Output:

best a --> "catensensitivenergy"
best b --> "catensensitivenergy"
best c --> "atensenergyroscopersonal"

By checking every permutation arent you suboptimal ? Most of the time the underlying graph of the words won't be complete making our life a bit easier :)

Rerito · Accepted Answer · 2014-12-23 08:30:21Z

Break the problem down and see what we got. Starting with only two strings. We must check which suffix of one string is the longest prefix of the other. This gives us the order for the best concatenation.

Now, with a set of n word, how do we do ? We start by building a trie containing every word (a key for each word). If a word is a duplicate of an other we can easily flag it as such while building the prefix tree.

I made a quick implementation of a regular Trie. You can find it here.

We have the tools to build a directed graph linking the different words wether a suffix of the first is a prefix of the second. The weight of the edge is the length of the suffix.

To do so, for each word w of the input set, we must see which words we can reach with a suffix of w :

We walk down the trie using the suffix. We will end up in a node (or not).
From this node, provided it exists, we scan the remaining subtree to see which words are available.
- If a given suffix of length l yields a match with a prefix of word w', then we add an edge w → w', with weight length(w') - l.
- If such an edge already exists, we just update the weight to keep the lowest.

From there, the graph is set and we must find the shortest path that runs through every vertex (eg. word) only once. If the graph is complete, this is the Traveling Salesman Problem. Most of the times, the graph won't be complete.

Still, it remains a NP-hard problem. In more "technical" terms, the problem at hand is to find the shortest hamiltonian path of a digraph.

Note : Given an Hamiltonian path (if it exists) with its cost C, and its starting vertex (word) W, the supestring length is given by :

L_super = L_W + C

Note : If two words have no suffix linking them to another word, then the graph is not connected and there is no hamiltonian path.

Collectives™ on Stack Overflow

algorithm for making optimise string from sub strings

3 Answers 3

3 Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

3 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related