-1

I have this text string that has certain HTML tags like \r, <a href= ............, and I want to remove them to leave the text completely clean, without tags, but I cannot remove them.

I have this query

DECLARE @text_original NVARCHAR(MAX);
DECLARE @clean_text NVARCHAR(MAX);

SET @text_original = ' several pop gems in their wake.\r\n\nThe groups original linitarist/vocal drummer/vocalist Debbi Peterson responded to an advertisement that guitarist/vocalist <a href=\"spotify:artist:2Sc4ukCRllIu02LZfHF0RL\">Susanna Hoffs</a> had placed in a local Los Angeles paper, The Recycler. Taking the name the Bangs, the trio released a single, \"Getting Out of Hand\"/\"Call on Me,\" on their own label, Downkiddie. They had to change their name early the following year to the Bangles, since there was already a New York-based group called the Bangs. After an appearance on a Rodney on the ROQ compilation and a series of local concerts which featured new bassist Annette Zilinskas, Miles Copeland signed the Bangles to the <a href=\"spotify:search:label%3A%22IRS%22\">IRS</a> subsidiary <a href=\"spotify:search:label%3A%22Faulty+Products%22\">Faulty himself, and when it came time to find a producer for the Bangles fifth record, <a href=\"spotify:artist:2Sc4ukCRllIu02LZfHF0RL\">Hoffs</a> didnt have far to look. The resulting sunny, very California-sounding Sweetheart of the Sun was released in September 2011. The following years saw the band regularly touring while playing some big shows like 2012s Rewind Festival, the 50th anniversary celebration for the famous L.A. nightclub the Whisky a Go Go, and 2013s Paisley Aboveground, which featured the Bangles playing alongside reunited paisley underground bands <a href=\"spotify:artist:3iayA3QkB3jfkUgcXqsL7C\">the Three OClock</a> and <a href=\"spotify:artist:1e9cOgMYTLJ0KFOjmTx13F\">Rain Parade</a>. In 2014, the band released Ladies and Gentlemen... The Bangles!, a compilation of early singles, EP tracks, and rarities. In 2019 the band appeared alongside <a href=\"spotify:artist:1e9cOgMYTLJ0KFOjmTx13F\">Rain Parade</a>, <a href=\"spotify:artist:1l0eT7EY5r7U1gMVm2SyoS\">the Dream Syndicate</a>, and <a href=\"spotify:artist:3iayA3QkB3jfkUgcXqsL7C\">the Three OClock</a> on the Paisley Underground compilation 3X4, which saw each band covering a song by the other three groups. ~ Stephen Thomas Erlewine & Andrew Leahey, Rovi'
SET @clean_text = REPLACE(
                    REPLACE(
                        REPLACE(
                            REPLACE(
                                REPLACE(
                                    REPLACE(
                                        REPLACE(
                                            REPLACE(
                                                REPLACE(
                                                    REPLACE(
                                                        @text_original,
                                                        '</a>', ''
                                                    ),
                                                    '<a[^>]*>', ''
                                                ),
                                                'href="spotify:[^"]*"', ''
                                            ),
                                            'href="[^"]*"', ''
                                        ),
                                        '\r\n\n', ' '
                                    ),
                                    '\r\n', ' '
                                ),
                                CHAR(13), ' '
                            ),
                            CHAR(10), ' '
                        ),
                        '>', ''
                    ),
                    '<', ''
                );
SELECT @clean_text AS clean_text;

But the output still showing the tags, any idea?

4
  • 2
    \r, are not html tags... and I assume you mean the newline etc chars, not that actual char? if so look into the CHAR() function. Commented Dec 22, 2023 at 1:55
  • 7
    You seem to think REPLACE accepts regular expressions (such as <a[^>]*>'). It doesn't Commented Dec 22, 2023 at 1:57
  • 6
    I would strongly suggest not doing this in TSQL - use a language with better string processing capabilities. (maybe CLR if for some reason this must be done in the DB) Commented Dec 22, 2023 at 2:11
  • Looks like HTML encoded into JSON or something. I suggest you first confirm what format this is, and then try find a suitable language and library to process it in. Commented Dec 22, 2023 at 11:46

1 Answer 1

0

Here is one way to solve the puzzle.

Though it is not a generic sustainable solution.

I had to apply few REPLACE(...) calls to clean up the input string and cast is as a legitimate XML.

SQL

DECLARE @text_original NVARCHAR(MAX) = ' several pop gems in their wake.\r\n\nThe groups original linitarist/vocal drummer/vocalist Debbi Peterson responded to an advertisement that guitarist/vocalist <a href=\"spotify:artist:2Sc4ukCRllIu02LZfHF0RL\">Susanna Hoffs</a> had placed in a local Los Angeles paper, The Recycler. Taking the name the Bangs, the trio released a single, \"Getting Out of Hand\"/\"Call on Me,\" on their own label, Downkiddie. They had to change their name early the following year to the Bangles, since there was already a New York-based group called the Bangs. After an appearance on a Rodney on the ROQ compilation and a series of local concerts which featured new bassist Annette Zilinskas, Miles Copeland signed the Bangles to the <a href=\"spotify:search:label%3A%22IRS%22\">IRS</a> subsidiary <a href=\"spotify:search:label%3A%22Faulty+Products%22\">Faulty himself, and when it came time to find a producer for the Bangles fifth record, <a href=\"spotify:artist:2Sc4ukCRllIu02LZfHF0RL\">Hoffs</a> didnt have far to look. The resulting sunny, very California-sounding Sweetheart of the Sun was released in September 2011. The following years saw the band regularly touring while playing some big shows like 2012s Rewind Festival, the 50th anniversary celebration for the famous L.A. nightclub the Whisky a Go Go, and 2013s Paisley Aboveground, which featured the Bangles playing alongside reunited paisley underground bands <a href=\"spotify:artist:3iayA3QkB3jfkUgcXqsL7C\">the Three OClock</a> and <a href=\"spotify:artist:1e9cOgMYTLJ0KFOjmTx13F\">Rain Parade</a>. In 2014, the band released Ladies and Gentlemen... The Bangles!, a compilation of early singles, EP tracks, and rarities. In 2019 the band appeared alongside <a href=\"spotify:artist:1e9cOgMYTLJ0KFOjmTx13F\">Rain Parade</a>, <a href=\"spotify:artist:1l0eT7EY5r7U1gMVm2SyoS\">the Dream Syndicate</a>, and <a href=\"spotify:artist:3iayA3QkB3jfkUgcXqsL7C\">the Three OClock</a> on the Paisley Underground compilation 3X4, which saw each band covering a song by the other three groups. ~ Stephen Thomas Erlewine & Andrew Leahey, Rovi';
DECLARE @clean_text NVARCHAR(MAX);

DECLARE @xml XML = CAST('<root>' + 
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@text_original, '\"', '"')
        , '&', '&amp;')
        , 'Bangles fifth record', 'Bangles fifth record</a>')
        , '\r', '')
        , '\n', '')
    + '</root>' AS XML);

SET @clean_text = @xml.value('data(.)', 'NVARCHAR(MAX)');

-- test
SELECT @clean_text;
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.