1

Given data similar to the following:

['blah_12_1_bbc_services_cbbc',
 'blah_12_1_high-profile_and_a',
 'blah_12_1_iplayer,_known',
 'blah_12_1_sport,_as_co-branded',
 'er_ds_such_it',
 'er_ds_websites_bbc_video',
 'er_ds_bbc',
 'er_ds_service._sport,',
 'th_ss_13_a',
 "th_ss_13_iplayer,_large_bbc's",
 'th_ss_13_the_a_co-branded',
 "th_ss_13_the_bbc's_bbc's"]

I'd like to create a list as:

['blah_12_1_',
 'blah_12_1_',
 'blah_12_1_',
 'blah_12_1_',
 'er_ds_',
 'er_ds_',
 'er_ds_',
 'er_ds_',
 'th_ss_13_',
 'th_ss_13_',
 'th_ss_13_',
 'th_ss_13_']

Given that the substrings to extract have differing lengths and structures I'm not sure how to go about this.

5
  • Doesn't look like there is any pattern. If there isn't any pattern then your problem falls into Computational Intelligence field. If there is any specific pattern like part of the string before a certain character or punctuation mark then that's fairly easy. Commented Dec 2, 2021 at 11:39
  • Well you could define the problem as getting the longest common prefix within each group, but you would have to have some kind of rule about when the string is sufficiently different to the previous one to start a new group -- and that rule would need to be defined in the question and not left for people to guess at in their answers. Commented Dec 2, 2021 at 11:44
  • 1
    First, pick the strings as random and then push and mark task as complete. Someone will complain for sure and tell you the right pattern. Commented Dec 2, 2021 at 11:48
  • The second list in your example does not contain the unique substrings. You can split your strings by the regexp if they always have the prefix with the two underscores and then take the second part that seems the unique in your example: 'bbc_services_cbbc' and so on. Commented Dec 2, 2021 at 11:50
  • @IgorZ sorry - the second list are the ID's which have been extracted from the first, I'm thinking this might not be possible programmatically though, and that it just takes some human labelling etc Commented Dec 2, 2021 at 12:13

2 Answers 2

1

You can use recursion:

from collections import defaultdict
def paths(p, s = '', c = None):
   d = defaultdict(list)
   for a, *b in p:
      d[a].extend(b if not b else [b])
   if c is None or len(d) == 1:
      yield from [j for a, b in d.items() for j in paths(b, s=s+a+'_', c=c if c is not None else len(b))]
   else:
      yield from [s]*c

data = ['blah_12_1_bbc_services_cbbc', 'blah_12_1_high-profile_and_a', 'blah_12_1_iplayer,_known', 'blah_12_1_sport,_as_co-branded', 'er_ds_such_it', 'er_ds_websites_bbc_video', 'er_ds_bbc', 'er_ds_service._sport,', 'th_ss_13_a', "th_ss_13_iplayer,_large_bbc's", 'th_ss_13_the_a_co-branded', "th_ss_13_the_bbc's_bbc's"]
r = list(paths([i.split('_') for i in data]))

Output:

['blah_12_1_', 
 'blah_12_1_', 
 'blah_12_1_', 
 'blah_12_1_', 
 'er_ds_', 
 'er_ds_', 
 'er_ds_',  
 'er_ds_', 
 'th_ss_13_', 
 'th_ss_13_', 
 'th_ss_13_', 
 'th_ss_13_']
Sign up to request clarification or add additional context in comments.

Comments

0

This should meet your requirements.

# Imports
import os
import numpy as np

list_str = ['blah_12_1_bbc_services_cbbc',
 'blah_12_1_high-profile_and_a',
 'blah_12_1_iplayer,_known',
 'blah_12_1_sport,_as_co-branded',
 'er_ds_such_it',
 'er_ds_websites_bbc_video',
 'er_ds_bbc',
 'er_ds_service._sport,',
 'th_ss_13_a',
 "th_ss_13_iplayer,_large_bbc's",
 'th_ss_13_the_a_co-branded',
 "th_ss_13_the_bbc's_bbc's"]

def get_common_substring(list_str):
    # List sorted
    list_sorted = list(np.sort(list_str))
    
    # Necessary to check if the new common string already exists in a more contracted form 
    # (i.e for ["th_ss_13_iplayer,_large_bbc's", 'th_ss_13_the_a_co-branded', "th_ss_13_the_bbc's_bbc's"]
    # we want "th_ss_13_" not "th_ss_13_the" )
    last_common_substring = None
    
    # Association between items and common substrings
    common_substrings = {}
    
    # result as a list as expected 
    results = []
    
    for i in range(len(list_sorted) - 1):
        # Iteration two by two items
        pair_items = list_sorted[i:i+2]
    
        # Search for common substring
        common_substring = os.path.commonprefix(pair_items)
        
        if len(common_substring) > 0:
            if last_common_substring is not None and last_common_substring in common_substring:
                # Necessary to check if the new common string already exists in a more contracted form 
                common_substring = last_common_substring
            
            # We add the association for the pair of items  
            common_substrings[pair_items[0]] = common_substring
            common_substrings[pair_items[1]] = common_substring

            # We keep trace of the last common substring
            last_common_substring = common_substring
        else:
            if pair_items[0] not in common_substrings:
                # This is the case when there are no common substrings
                common_substrings[pair_items[0]] = None

    # To have the result as a list as expected            
    for item in common_substrings:
        results.append(common_substrings[item])
        
    return results

get_common_substring(list_str)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.