2

How to remove or replace familiar words in string, by list of familiar stubs of that words?

List contain ~40 stubs (substrings). With all of this I expect substitute hundreds matching words.

I'm interested in formula solution because I already know how to do this, coding VBA.


I play around TEXTJOIN with FILTERXML, but its not possible to use it for big stub list:

Formula 1. FILTERXML (one by one entered stub)

=IFERROR(TEXTJOIN(" ";;FILTERXML("<A><B>"&SUBSTITUTE(A2;" ";"</B><B>")&"</B></A>";"//B[not(contains(., 'auti') or contains(., 'AuTi') or contains(., 'tion') or contains(., 'mpl') or contains(., 'Mpl')or contains(., 'etc'))]"));"")

Result is good, but unfortunately is case sensitive, so one by one entered substrings for all capitalization version (Propercase, UPERCASE, lowercase, MixEdCase) generate a very big formula impossible to use because of length limitation (255 characters).

Formula 2. FILTERXML (with case translation)

=IFERROR(TEXTJOIN(" ";;FILTERXML("<T><S>"&SUBSTITUTE(A2;" ";"</S><S>")&"</S></T>";"//S[not(contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),'auti')or contains(., 'tion')or contains(., 'mpl')or contains(., 'etc'))]"));"")

Result is good apparently, but there is some strange missed capitalization matches (check cell D14 in attached image). To avoid this, I can use translation for each stub translate(., 'AUTI', 'auti'),'auti') but length limitation will not leave me to use all list.

Formula 3. FILTERXML (with LOWER-ing all string)

=IFERROR(TEXTJOIN(" ";;FILTERXML("<A><B>"&SUBSTITUTE(LOWER(A2);" ";"</B><B>")&"</B></A>";"//B[not(contains(., 'auti')or contains(., 'tion') or contains(., 'mpl') or contains(., 'etc'))]"));"")

Result is good, but output string is totally lower. And again length limitation is still there.

example image

I know FILTERXML is a good way for this task, but I don't know how to work with. Unfortunately Xpath 1.0 doesn't allow things like matches()

1 Answer 1

2

Very nice question, but you are stretching the capabilities of a formula made through Excel-2019 to it's very limits. Here is what I came up with considering the limitations of xpath 1.0 (no matches() nor lists/array handling:

enter image description here

Formula in B1:

=TEXTJOIN(" ",,IF(ROW(A$1:INDEX(A:A,LEN(A2)))=TRANSPOSE(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,IFERROR(ROW(A$1:INDEX(A:A,COUNTA(FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s"))))&"<s>"&FILTERXML("<t><s>"&SUBSTITUTE(LOWER(A2)," ","</s><s>")&"</s></t>","//s"),""))&"</s></t></x>","//t[.//*[not(contains(., '"&TEXTJOIN("') or contains (., '",,{"auti","tion","mpl","etc"})&"'))]]")),FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s"),""))

I have tested this against an array of 40 items of say 3-4 characters long and it did not exceed the limit of both TEXTJOIN()'s parameters nor the character limit it can handle.

It's an array-entered formula (CSE).

To summarize what this does:

  • Split your input on space (keep case-specifications intact);
  • Create a new input string using TEXTJOIN() to use another level of xpath childs to give each word an index;
  • With FILTERXML() we return each index that did not contain any of the array-elements you specified;
  • Use these numbers to retrieve all nodes from the 1st step and concat those back together;
  • Note that this will impact punctuation.

Note that this will become significantly easier with ms365's lambda's helper function REDUCE():

=REDUCE(A1,{"auti","tion","mpl","etc"},LAMBDA(a,b,TEXTJOIN(" ",,FILTERXML("<t><s>"&SUBSTITUTE(a," ","</s><s>")&"</s></t>","//s[not(contains(translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'), '"&b&"'))]"))))
Sign up to request clarification or add additional context in comments.

3 Comments

I dont know how you do this. But every time your answer is working, no meter what level of complexity it is. Thank you for patience to build also 365 version of formula every time. Soon i will upgrade to excel 365, and i will use also that version. Big thanks!
Just a small note for beginners like me: Be careful with formula in row builder section here ...(ROW(A$1:INDEX(A:A,LEN(A1)).... to exclude some confusion is better to build row index in another cell address, like this: ...(ROW(X$1:INDEX(X:X,LEN(A1))...
Also I suggest to put entire formula in IFERROR() function to exclude wrong returns of blank or trimmed cells.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.