How to get unique substrings from list of strings with python

Question

Given data similar to the following:

['blah_12_1_bbc_services_cbbc',
 'blah_12_1_high-profile_and_a',
 'blah_12_1_iplayer,_known',
 'blah_12_1_sport,_as_co-branded',
 'er_ds_such_it',
 'er_ds_websites_bbc_video',
 'er_ds_bbc',
 'er_ds_service._sport,',
 'th_ss_13_a',
 "th_ss_13_iplayer,_large_bbc's",
 'th_ss_13_the_a_co-branded',
 "th_ss_13_the_bbc's_bbc's"]

I'd like to create a list as:

['blah_12_1_',
 'blah_12_1_',
 'blah_12_1_',
 'blah_12_1_',
 'er_ds_',
 'er_ds_',
 'er_ds_',
 'er_ds_',
 'th_ss_13_',
 'th_ss_13_',
 'th_ss_13_',
 'th_ss_13_']

Given that the substrings to extract have differing lengths and structures I'm not sure how to go about this.

Doesn't look like there is any pattern. If there isn't any pattern then your problem falls into Computational Intelligence field. If there is any specific pattern like part of the string before a certain character or punctuation mark then that's fairly easy. — Abdur Rakib
– Abdur Rakib, Commented Dec 2, 2021 at 11:39
Well you could define the problem as getting the longest common prefix within each group, but you would have to have some kind of rule about when the string is sufficiently different to the previous one to start a new group -- and that rule would need to be defined in the question and not left for people to guess at in their answers. — alani
– alani, Commented Dec 2, 2021 at 11:44
First, pick the strings as random and then push and mark task as complete. Someone will complain for sure and tell you the right pattern. — Florin C.
– Florin C., Commented Dec 2, 2021 at 11:48
The second list in your example does not contain the unique substrings. You can split your strings by the regexp if they always have the prefix with the two underscores and then take the second part that seems the unique in your example: 'bbc_services_cbbc' and so on. — IgorZ
– IgorZ, Commented Dec 2, 2021 at 11:50
@IgorZ sorry - the second list are the ID's which have been extracted from the first, I'm thinking this might not be possible programmatically though, and that it just takes some human labelling etc — baxx
– baxx, Commented Dec 2, 2021 at 12:13

Ajax1234 · Accepted Answer · 2021-12-02 17:36:21Z

You can use recursion:

from collections import defaultdict
def paths(p, s = '', c = None):
   d = defaultdict(list)
   for a, *b in p:
      d[a].extend(b if not b else [b])
   if c is None or len(d) == 1:
      yield from [j for a, b in d.items() for j in paths(b, s=s+a+'_', c=c if c is not None else len(b))]
   else:
      yield from [s]*c

data = ['blah_12_1_bbc_services_cbbc', 'blah_12_1_high-profile_and_a', 'blah_12_1_iplayer,_known', 'blah_12_1_sport,_as_co-branded', 'er_ds_such_it', 'er_ds_websites_bbc_video', 'er_ds_bbc', 'er_ds_service._sport,', 'th_ss_13_a', "th_ss_13_iplayer,_large_bbc's", 'th_ss_13_the_a_co-branded', "th_ss_13_the_bbc's_bbc's"]
r = list(paths([i.split('_') for i in data]))

Output:

['blah_12_1_', 
 'blah_12_1_', 
 'blah_12_1_', 
 'blah_12_1_', 
 'er_ds_', 
 'er_ds_', 
 'er_ds_',  
 'er_ds_', 
 'th_ss_13_', 
 'th_ss_13_', 
 'th_ss_13_', 
 'th_ss_13_']

yht · Accepted Answer · 2021-12-02 12:56:26Z

This should meet your requirements.

# Imports
import os
import numpy as np

list_str = ['blah_12_1_bbc_services_cbbc',
 'blah_12_1_high-profile_and_a',
 'blah_12_1_iplayer,_known',
 'blah_12_1_sport,_as_co-branded',
 'er_ds_such_it',
 'er_ds_websites_bbc_video',
 'er_ds_bbc',
 'er_ds_service._sport,',
 'th_ss_13_a',
 "th_ss_13_iplayer,_large_bbc's",
 'th_ss_13_the_a_co-branded',
 "th_ss_13_the_bbc's_bbc's"]

def get_common_substring(list_str):
    # List sorted
    list_sorted = list(np.sort(list_str))
    
    # Necessary to check if the new common string already exists in a more contracted form 
    # (i.e for ["th_ss_13_iplayer,_large_bbc's", 'th_ss_13_the_a_co-branded', "th_ss_13_the_bbc's_bbc's"]
    # we want "th_ss_13_" not "th_ss_13_the" )
    last_common_substring = None
    
    # Association between items and common substrings
    common_substrings = {}
    
    # result as a list as expected 
    results = []
    
    for i in range(len(list_sorted) - 1):
        # Iteration two by two items
        pair_items = list_sorted[i:i+2]
    
        # Search for common substring
        common_substring = os.path.commonprefix(pair_items)
        
        if len(common_substring) > 0:
            if last_common_substring is not None and last_common_substring in common_substring:
                # Necessary to check if the new common string already exists in a more contracted form 
                common_substring = last_common_substring
            
            # We add the association for the pair of items  
            common_substrings[pair_items[0]] = common_substring
            common_substrings[pair_items[1]] = common_substring

            # We keep trace of the last common substring
            last_common_substring = common_substring
        else:
            if pair_items[0] not in common_substrings:
                # This is the case when there are no common substrings
                common_substrings[pair_items[0]] = None

    # To have the result as a list as expected            
    for item in common_substrings:
        results.append(common_substrings[item])
        
    return results

get_common_substring(list_str)

Collectives™ on Stack Overflow

How to get unique substrings from list of strings with python

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related