2

I need to replace all instances of a character (period in my case) in 1+ portions/segments/ranges of a string. I'm using Bash on Linux. Ideally the solution is in Bash, but if it's either not possible or terribly complex I can call any app commonly found on Linux (sed, Python, etc).

Example:

Starting String: "<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark> ..." .

Needed transformation: Replace all periods "." between <mark> and </mark> with the string "<wbr />" .

Desired Result: "<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>" .

EDITS:

The starting string will never contain <mark> or </mark> within a set of them (ie. the range markers are never nested).

I'm asking for help with some built-in Bash capability to perform this. The obvious mechanism is to find <mark> and </mark>, and then perform the substitution in the content between them. I know Bash can do offset finding (in an indirect way), and substitution. But can the substitution be performed on a subset of the string?

For the comments regarding parsing this as XML: I did not say this is XML so you should not assume it. Ultimately it's irrelevant to my question; the range markers can be anything.

Here's something I got working. It's not pure Bash, but it's simple.

while $(echo "${my_str}" | grep -E '<mark>[^.]*\.[^<]*</mark>' >/dev/null 2>&1) ; do
    my_str=$(echo "${my_str}" | sed -E -e 's,(<mark>[^.]*)\.([^<]*</mark>),\1<wbr />\2,g')
done
  • 2
    Do not parse XML with regex. Use an XML parser. Commented Mar 24 at 18:26
  • 2
    Post valid XML in your question. Commented Mar 24 at 18:34
  • 1
    This quick hack (which absolutely will not work for general XML strings) may help to get you started on a pure Bash solution: tmp=$string; newstr=; while [[ $tmp == *'<mark>'*'</mark>'* ]]; do tmp2=${tmp#*<mark>*</mark>}; tmp3=${tmp%"$tmp2"}; tmp=$tmp2; tmp4=${tmp3%%<mark>*</mark>}; tmp5=${tmp3#"$tmp4"}; tmp5=${tmp5//./'<wbr />'}; tmp5="<begin>${tmp5#<mark>}"; tmp5="${tmp5%</mark>}</end>"; newstr+=$tmp4$tmp5; done; newstr+=$tmp; printf '%s\n' "$newstr" Commented Mar 24 at 19:25
  • @Shawn - good observation! I changed the tags mid-edit and missed some. I've corrected the Desired Result. Commented Mar 24 at 20:33
  • 1
    you could start by replacing the while $(echo ... | grep ...); do with while grep -q -E '<mark>[^.]*\.[^<]*</mark>' <<< "${my_str}"; do to eliminate two subshell calls on each pass through the loop; the $(echo ... | sed ...) could be replaced with $(sed ... <<< "${my_str}") to eliminate another subshell, while this last subshell could be replaced with some creative parameter substitutions; though I'd look into how to compare ${my_str} to a regex and how that populates the BASH_REMATCH[] array, then the BASH_REMATCH[] results can be used to formulate the parameter substitution Commented Mar 24 at 21:33
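The first suggestion in the comment above can be sketched roughly as follows. This is only an illustration of the here-string variant, assuming grep and sed with -E support are available; the loop logic and both regular expressions are unchanged from the question.

```bash
my_str='<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark>'

# grep -q reports a match through its exit status alone, so no $(...) or
# output redirection is needed; the here-string replaces "echo ... |".
while grep -q -E '<mark>[^.]*\.[^<]*</mark>' <<< "${my_str}"; do
    # Each pass replaces the first remaining period in every marked range.
    my_str=$(sed -E -e 's,(<mark>[^.]*)\.([^<]*</mark>),\1<wbr />\2,g' <<< "${my_str}")
done

printf '%s\n' "${my_str}"
# prints: <mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>
```

The sed call still runs in a command substitution, so this trims the per-iteration overhead rather than removing it.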

5 Answers 5

4

Setup:

string='<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark>'

One bash solution:

regex='(<mark>[^<]*</mark>)'           # assumes no "<" between "<mark>" and "</mark>" tags
unset prev_string                      # used to test for a change to 'string'

# while we have a match and a change has been made to 'string' ...

while [[ "${string}" =~ ${regex} && "${prev_string}" != "${string}" ]]
do
    # typeset -p BASH_REMATCH          # uncomment to see contents of the BASH_REMATCH[] array

    prev_string="${string}"

    # use nested parameter substitutions to make replacement

    string="${string/${BASH_REMATCH[1]}/${BASH_REMATCH[1]//\./<wbr \/>}}"
done

NOTE: "${prev_string}" != "${string}" was added as a quick hack to ensure we don't go into an infinite loop in the case where no modifications are made to string (e.g., no periods between the tags).

A variation on the above which adds a few cpu cycles while making the parameter substitutions easier to read and understand:

regex='(<mark>[^<]*</mark>)'
unset prev_string

while [[ "${string}" =~ ${regex} && "${prev_string}" != "${string}" ]]
do
    old="${BASH_REMATCH[1]}"           # copy the match; makes follow-on commands a bit cleaner
    new="${old//\./<wbr \/>}"          # replace all periods with "<wbr />"

    prev_string="${string}"
    string="${string/${old}/${new}}"   # update "string" by replacing "${old}" with "${new}"
done

These both generate:

$ typeset -p string
declare -- string="<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>"

1 Comment

Test with string='<mark>&.*</mark><mark>A.B</mark>'. (The & causes a problem if patsub_replacement is enabled with Bash version 5.2 or later.)
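One way to sidestep the & issue, sketched here under my own assumptions rather than as a fix to the answer's code, is to avoid using the matched text as a pattern at all and instead consume the string piece by piece with prefix and suffix removal. The function name replace_in_marks is hypothetical. As a side effect, this approach also keeps processing later <mark> ranges when an earlier one contains no periods, a case in which the prev_string guard above ends the loop early.

```bash
# Hypothetical helper: replace every "." inside <mark>...</mark> with "<wbr />".
# Assumes the markers are never nested, as stated in the question.
replace_in_marks() {
    local rest=$1 out= inner
    while [[ $rest == *'<mark>'*'</mark>'* ]]; do
        out+=${rest%%'<mark>'*}'<mark>'      # text before the first <mark>, plus the tag
        rest=${rest#*'<mark>'}               # drop everything through that <mark>
        inner=${rest%%'</mark>'*}            # content up to the first </mark>
        rest=${rest#*'</mark>'}              # drop the content and its closing tag
        out+=${inner//./'<wbr />'}'</mark>'  # "." is a literal in a glob pattern
    done
    printf '%s\n' "$out$rest"
}

replace_in_marks '<mark>&.*</mark><mark>A.B</mark>'
# prints: <mark>&<wbr />*</mark><mark>A<wbr />B</mark>
```

Because the matched text is never reused as a pattern or as a replacement string, characters such as & and * in the input are left untouched.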
2

Feed Perl from stdin or append a file name:

perl -pe 's%(<mark>.*?</mark>)% $1 =~ s|\.|<wbr />|gr %eg'

Output:

<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>

Source: https://unix.stackexchange.com/a/152623/74329

3 Comments

Perl has native libxml support. It is counterproductive to parse XML with pcre.
@LéaGris: codesniffer has now further specified the question.
Impressive find, thanks @Cyrus ! While it's not pure Bash, I like that this solution does not require a separate script file.
2

This is probably very slow, but it uses only a single regex to search and replace, so no loop is needed. I am no expert in shell scripts, so I will not provide one, but this should work inside a Perl call.

Try matching:

([^.]+|\G)\.(?=(?:(?!<mark>).)+<\/mark>)

and replacing with:

$1<wbr />

See: regex101


Explanation

MATCH:

  1. Match all .:
  • ( ... ): Capture to group 1 either
    • [^.]+: anything but a dot
    • |\G: or the end of the last match
  • \.: then match a dot
  2. Ensure the dot is inside <mark> ... </mark> tags:
  • (?= ... ): Look ahead and assert
    • (?: ... )+: that you match anything
      • (?!<mark>).: but it cannot be <mark>.
    • <\/mark>: Find </mark>, ensuring that you must be inside the tag

REPLACE:

  • $1: Keep the first group (everything before a dot, but inside tag)
  • <wbr />: and replace the dots with <wbr />

6 Comments

Quite powerful regex! Is it supposed to match <mark>.</mark> as well?
Thanks @Philippe. I thought it would be quite a puzzle to do in one regex, so I tried it for fun^^ With regard to your question, I do not think so; see in the Question: "I need to replace all instances of a character (period in my case) in 1+ portions/segments/ranges of a string"
All the other answers can change <mark>.</mark>.<mark>.</mark> to <mark><wbr /></mark>.<mark><wbr /></mark>
@Philippe you could actually omit the first part of my regex to end up with \.(?=(?:(?!<mark>).?)+<\/mark>) which would just match any "." within mark tags.
That worked great, thank you! One last question though: in \.(?=(?:(?!<mark>).?)+<\/mark>), the <mark> is after the dot (.). How does it work?
2

Using any awk in any shell on all Unix boxes:

$ awk '
BEGIN {
    FS = OFS = "</mark>"
}
{
    for (i = 1; i <= NF; i++) {
        if ( match($i, /<mark>.*/) ) {
            tgt = substr($i, RSTART, RLENGTH)
            gsub(/\./, "<wbr />", tgt)
            $i = substr($i, 1, RSTART - 1) tgt
        }
    }
    print
}
' file
<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>

2

This Shellcheck-clean pure Bash code updates the value of the variable my_str:

tmp=$my_str
my_str=
while [[ $tmp =~ ^(.*)(\<mark\>.*\</mark\>)(.*)$ ]]; do
    tmp=${BASH_REMATCH[1]}
    my_str=${BASH_REMATCH[2]//./<wbr />}${BASH_REMATCH[3]}${my_str}
done
my_str=${tmp}${my_str}
