3

i'm using Flink SQL and the following scheme shows my source data (belonging to some Twitter data):

CREATE TABLE `twitter_raw` (
    `entities` ROW(
        `hashtags` ROW(
            `text` STRING,
            `indices` INT ARRAY
        ) ARRAY,
        `urls` ROW(
            `indices` INT ARRAY,
            `url` STRING,
            `display_url` STRING,
            `expanded_url` STRING
        ) ARRAY,
        `user_mentions` ROW(
            `screen_name` STRING,
            `name` STRING,
            `id` BIGINT
        ) ARRAY
    )
)
WITH (...);

I want to get only the hashtags in a collection. Therefore i have to map the collection of constructed objects (ROW) to an array of STRING.

Like this scheme:

CREATE TABLE `twitter_raw` (
    `entities` ROW(
        `hashtags` STRING ARRAY,
        `urls` STRING ARRAY,
        `user_mentions` STRING ARRAY
    )
)
WITH (...);

How can i achieve this with Flink-SQL? Maybe built-in functions (JSON-functions?) or own UDF or do i have to write a DataStream Job?

Thanks in advance.

2 Answers 2

4

The SQL command UNNEST helps in this case. It is like EXPLODE in Spark.

You can solve it by creating a new row for each hashtag in the hashtags array:

SELECT hashtag, index
FROM twitter_raw 
CROSS JOIN UNNEST(hashtags) AS t (hashtag, index)
Sign up to request clarification or add additional context in comments.

1 Comment

FYI, Unnest is useful for Map data type as well
1

You can either define a computed row, or a VIEW, and then extracting the hashtags field using the dot notation. e.g.:

CREATE VIEW hashtags_raw (hashtags) AS 
  SELECT entities.hashtags AS hashtags FROM twitter_raw

1 Comment

This view does not fix the problem of accessing the individual fields of the array.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.