Suppose we have a table with the following columns:

     URL                             Repeated 
    www.b.com/aa/aa/aa                X
    www.b.com/aa/                     X
    www.xy.com                        X
    .
    .
    .

The Repeated column currently just holds the default value 'X'. I want it to indicate whether the URL ends with a repeated pattern. For instance,

    www.b.com/aa/aa/aa   
    www.xyz.com/bc/bc

both contain repeated patterns, delimited by '/'.

Is there a way to check this in SQL, ideally with a built-in function? Also, can the idea be extended so that we count how many repetitions occur?

Any reference or help would be greatly appreciated. Thanks.

  1. This is Trino SQL.
  2. To clarify: a repeated pattern means consecutive exact matches when the URL is split on '/'.

www.c.com/aa/bb/aa and www.aaa.com/aaa don't count as repeated patterns, but

www.c.com/aa/aa, www.c.com/aa/bb/bb, and www.aaa.com/aaa/aaa do.

4 Comments

  • Please specify which database system and which version you are using. Also, add more examples of what is not considered a repeated pattern even though it still contains some repetition; for example, www.c.com/aa/bb/aa or www.aaa.com/aaa. Commented Oct 26 at 6:00
  • Should www.xyz.com/a/b/b/c count? Commented Oct 26 at 8:31
  • One of the Trino functions split or split_to_multimap seems to be a good choice for this. Commented Oct 26 at 8:45
  • For the path "www.c.com/aa/bb/bb/aa", is "/bb" repeated? For the path "www.c.com/aa/bb/aa/bb", is "/aa/bb" repeated? Commented Oct 26 at 10:57

3 Answers

First, concerning the main task to check whether it has a repeated pattern at the end:

I successfully tested two queries. There seems to be no free online fiddle site for Trino to demonstrate them, but both worked fine on my local system.

Simple variant (may return NULL)

WITH urls(url) AS (
  VALUES
    ('www.b.com/aa/aa/aa'),
    ('www.b.com/aa/'),
    ('www.xy.com'),
    ('www.c.com/aa/bb/aa'),
    ('www.xyz.com/bc/bc'),
    ('www.aaa.com/aaa/aaa')
)
SELECT
  url,
  element_at(filter(split(url, '/'), x -> x <> ''), -1)
  =
  element_at(filter(split(url, '/'), x -> x <> ''), -2) AS has_repeated_pattern
FROM urls;
  • How it works:

    1. split(url, '/') divides each URL into segments.

    2. filter(..., x -> x <> '') removes empty segments.

    3. element_at(..., -1) and element_at(..., -2) retrieve the last and second-to-last segments.

    4. The comparison returns true if they are equal.

  • Limitation:
    URLs with fewer than two segments produce NULL instead of false, because element_at(..., -2) is NULL.


Robust variant (always returns true or false)

WITH urls(url) AS (
  VALUES
    ('www.b.com/aa/aa/aa'),
    ('www.b.com/aa/'),
    ('www.xy.com'),
    ('www.c.com/aa/bb/aa'),
    ('www.xyz.com/bc/bc'),
    ('www.aaa.com/aaa/aaa')
)
SELECT
  url,
  COALESCE(element_at(filter(split(url, '/'), x -> x <> ''), -1)
  =
  element_at(filter(split(url, '/'), x -> x <> ''), -2), false) AS has_repeated_pattern
FROM urls;

Here COALESCE turns the NULL result into false.

Output of the second query:

    url                    has_repeated_pattern
    www.b.com/aa/aa/aa     true
    www.b.com/aa/          false
    www.xy.com             false
    www.c.com/aa/bb/aa     false
    www.xyz.com/bc/bc      true
    www.aaa.com/aaa/aaa    true
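
Since no online fiddle was available, the split / filter / compare-last-two logic can also be cross-checked outside Trino. The following is a minimal Python sketch, not Trino code; the function name `has_repeated_pattern` and the `urls` list are assumptions made for illustration, and the sketch only mirrors the behaviour of the robust variant on these sample values.

```python
def has_repeated_pattern(url: str) -> bool:
    """Mirror of the robust SQL variant: True if the last two non-empty
    '/'-separated segments are equal, otherwise False (never None)."""
    # Equivalent of split(url, '/') followed by filter(..., x -> x <> '')
    segs = [s for s in url.split("/") if s != ""]
    # With fewer than two segments the SQL comparison is NULL,
    # and COALESCE(..., false) maps that to false.
    if len(segs) < 2:
        return False
    # Equivalent of element_at(..., -1) = element_at(..., -2)
    return segs[-1] == segs[-2]


# Same sample values as the VALUES clause in the queries above
urls = [
    "www.b.com/aa/aa/aa",
    "www.b.com/aa/",
    "www.xy.com",
    "www.c.com/aa/bb/aa",
    "www.xyz.com/bc/bc",
    "www.aaa.com/aaa/aaa",
]

for u in urls:
    print(u, has_repeated_pattern(u))
```

Note that the host name (for example www.b.com) is treated as an ordinary segment, exactly as in the SQL, which is why www.b.com/aa/ comes out false rather than raising an error.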

Now, concerning the additional task of counting the repetitions of repeated patterns:

I extended the previous idea to the following query, which counts every position where a segment equals the one before it:

WITH urls(url) AS (
  VALUES
    ('www.b.com/aa/aa/aa'),
    ('www.b.com/aa/'),
    ('www.xy.com'),
    ('www.c.com/aa/bb/aa'),
    ('www.xyz.com/bc/bc'),
    ('www.aaa.com/aaa/aaa')
)
SELECT
  url,
  cardinality(
    filter(
      CASE
        WHEN cardinality(segs) >= 2 THEN sequence(2, cardinality(segs))
        ELSE array[]
      END,
      i -> segs[i] = segs[i-1]
    )
  ) AS repeated_pattern_repetitions
FROM (
  SELECT
    url,
    filter(split(url, '/'), x -> x <> '') AS segs
  FROM urls
) t;

How it works:

  1. split(url, '/') divides each URL into segments.

  2. filter(..., x -> x <> '') removes empty segments (e.g., from trailing slashes).

  3. cardinality(segs) >= 2 ensures we only compare arrays with at least two segments.

  4. sequence(2, cardinality(segs)) generates positions starting from the second segment.

  5. filter(..., i -> segs[i] = segs[i-1]) selects positions where a segment is equal to the previous one.

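  6. cardinality(...) counts these matches, giving the total number of adjacent equal pairs. Note that this counts pairs anywhere in the URL, not only at the end.

  7. For URLs with fewer than two segments, we return an empty array to avoid out-of-bounds errors, which results in a count of zero. Without this guard, sequence(2, 1) would count downwards and produce [2, 1], and segs[2] would then fail on a one-element array.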

Output of this query:

    url                    repeated_pattern_repetitions
    www.b.com/aa/aa/aa     2
    www.b.com/aa/          0
    www.xy.com             0
    www.c.com/aa/bb/aa     0
    www.xyz.com/bc/bc      1
    www.aaa.com/aaa/aaa    1
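
The counting logic can be checked the same way. The sketch below is Python, not Trino, and all three function names are hypothetical. `count_adjacent_repeats` mirrors the query above, which counts equal neighbours anywhere in the path. `count_trailing_repeats` is a variation that is not part of the answer: it assumes you only want the run of repeats at the very end of the URL.

```python
def segments(url: str) -> list:
    """Non-empty '/'-separated segments, like filter(split(url, '/'), x -> x <> '')."""
    return [s for s in url.split("/") if s != ""]


def count_adjacent_repeats(url: str) -> int:
    """Number of positions whose segment equals the previous one (anywhere)."""
    segs = segments(url)
    # range(1, len(segs)) is empty for fewer than two segments,
    # so no explicit guard is needed here.
    return sum(1 for i in range(1, len(segs)) if segs[i] == segs[i - 1])


def count_trailing_repeats(url: str) -> int:
    """Variation (assumption): count only the run of repeats at the end."""
    segs = segments(url)
    count = 0
    i = len(segs) - 1
    # Walk backwards while each segment equals its predecessor
    while i >= 1 and segs[i] == segs[i - 1]:
        count += 1
        i -= 1
    return count


for u in ["www.b.com/aa/aa/aa", "www.xyz.com/bc/bc", "www.c.com/aa/aa/bb"]:
    print(u, count_adjacent_repeats(u), count_trailing_repeats(u))
```

The two functions agree on all six sample URLs and differ only for inputs such as www.c.com/aa/aa/bb, where the repetition is not at the end.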

2 Comments

Just wrap the equality expression in COALESCE(x=y, false)?
True, edited. Thanks.

Trino SQL supports regular expressions. If backreferences are supported, you may try this pattern: '\/(\w+)\/\1$'. If they are not, you can compare the substring matched by '\/(\w+)$' with the substring of the same length immediately before it in the URL.
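
To illustrate both ideas, here is a small sketch using Python's `re` module, which does support backreferences; whether a given Trino deployment's regex engine does is not confirmed here. The function names are hypothetical. Keep in mind that `\w+` only matches letters, digits and underscores, so segments containing hyphens or dots would not be detected, and that a trailing slash prevents a match because the pattern is anchored at the end.

```python
import re

# Pattern from the answer; '/' needs no escaping in Python's regex syntax.
TAIL_REPEAT = re.compile(r"/(\w+)/\1$")


def ends_with_repeat(url: str) -> bool:
    """Backreference variant: the last segment repeats the one before it."""
    return TAIL_REPEAT.search(url) is not None


def ends_with_repeat_no_backref(url: str) -> bool:
    """Fallback without backreferences: compare the final '/segment'
    with the substring of the same length immediately before it."""
    m = re.search(r"/(\w+)$", url)
    if m is None:
        return False
    tail = m.group(0)            # includes the leading '/', e.g. '/aa'
    start = m.start()
    if start < len(tail):        # not enough characters before the tail
        return False
    return url[start - len(tail):start] == tail


for u in ["www.b.com/aa/aa/aa", "www.c.com/aa/bb/aa", "www.aaa.com/aaa"]:
    print(u, ends_with_repeat(u), ends_with_repeat_no_backref(u))
```

Including the leading '/' in the compared substring is what keeps a URL such as www.c.com/baa/aa from being reported as a repeat.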

Interesting challenge. The following should do it. I don't use Trino myself, so this is not tested.

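SELECT t.ID
FROM (
    SELECT ID,
           reverse(split(URL, '/')) AS Elements  -- string literals take single quotes
    FROM urls  -- table name assumed; the original query omitted the FROM clause
) AS t
WHERE cardinality(t.Elements) >= 2
  -- arrays are 1-indexed; element_at returns NULL instead of failing when out of range
  AND element_at(t.Elements, 1) = element_at(t.Elements, 2);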

2 Comments

Missing explanation, missing formatting, ....
Looks like you have not yet taken the tour. I recommend doing so. Also, if you have not already done so, you should visit the help center. I think you should also learn how to use markdown.
