2

I'm trying to do some analytics in Postgres, where I have two tables: predictionstate and pageviews.

The predictionstate table:

This table holds the outcomes of our algorithm, with the following structure:

  • id ({company_identifier}:{user_identifier})
  • model (reference string value)
  • prediction (float number between 0.0 and 1.0)

The pageviews table:

This table contains user information, using the following structure:

  • company_identifier
  • user_identifier
  • pageview_current_url_type
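
For reference, the two tables could be declared roughly as follows. This is only a sketch: the column types and the CHECK constraint are assumptions, since only the column names and the 0.0 to 1.0 range of prediction are given above.

```sql
-- Sketch of the two tables; all column types are assumptions.
CREATE TABLE predictionstate (
  id         text,              -- '{company_identifier}:{user_identifier}'
  model      text,              -- reference string value
  prediction double precision   -- float between 0.0 and 1.0
    CHECK (prediction BETWEEN 0.0 AND 1.0)
);

CREATE TABLE pageviews (
  company_identifier        text,
  user_identifier           text,
  pageview_current_url_type text   -- e.g. 'BUYSUCCESS'
);
```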

Question

I want to analyze how accurate our best model is. To do that, I need to bucket the predictions into segments and count how many users fall into each one. The following query does that:

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  SPLIT_PART(p.id, ':', 1) as company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users"
FROM
  ranges r
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;

What I don't know how to do is the next step: for each (company, model, segment), I need to measure how accurate the prediction was by querying the pageviews table for rows where pageview_current_url_type = 'BUYSUCCESS'.

What I tried, but didn't work:

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  SPLIT_PART(p.id, ':', 1) as company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users",
  b.n as "converted_users"
FROM
  ranges r,
  (
    SELECT COUNT(DISTINCT(pvs.user_identifier)) as n
    FROM pageviews pvs
    INNER JOIN (
        SELECT
            SPLIT_PART(id, ':', 1) as company_identifier,
            SPLIT_PART(id, ':', 2) as user_identifier
        FROM predictionstate ps
        WHERE prediction BETWEEN r.r_min AND r.r_max ) users
        ON (
            pvs.user_identifier = users.user_identifier AND
            pvs.company_identifier= users.company_identifier) 
        WHERE pageview_current_url_type = 'BUYSUCCESS'

  ) b
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;

TL;DR: I need to count the rows of a JOIN, restricted to the users returned by the main query.

EDIT:

I added an SQL Fiddle https://www.db-fiddle.com/f/5sQiZD6mHwdnwvVfvL9MAh/0 .

What I want to know is, for those segment_users, how many of them have pageview_current_url_type = 'BUYSUCCESS', adding one more column to the result: segmented_really_bought.

EDIT 2: One more attempt that does not work (ERROR: column "p.id" must appear in the GROUP BY clause or be used in an aggregate function):

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  SPLIT_PART(p.id, ':', 1) as company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users",
  COUNT(b.*) as "converted_users"
FROM
  ranges r
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
INNER JOIN (
  SELECT users.company_identifier, COUNT(users.user_identifier) AS n
  FROM pageviews
  INNER JOIN (
    SELECT SPLIT_PART(ps.id, ':', 2) AS user_identifier,
           SPLIT_PART(ps.id, ':', 1) AS company_identifier
    FROM predictionstate ps
    WHERE provider_id=47 AND
          prediction > 0.7
   ) users ON (
      pageviews.user_identifier=users.user_identifier AND
      pageviews.company_identifier=users.company_identifier
    )
  WHERE pageview_current_url_type='BUYSUCCESS'
  GROUP BY users.company_identifier
) AS b
ON (
  b.company_identifier = company_identifier
)
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;

EDIT 3: Added the desired output

Generated using this code: https://gist.github.com/brunoalano/479265b934a67dc02092fb54a846fe1e

company, model, segment, segment_users, really_bought
company_a, model_a, 0.3-0.4, 1, 3
company_a, model_a, 0.5-0.6, 1, 1
company_a, model_b, 0.2-0.3, 1, 3
company_a, model_c, 0.2-0.3, 1, 1
company_a, model_c, 0.7-0.8, 1, 3
company_b, model_a, 0.3-0.4, 3, 2
company_b, model_b, 0.5-0.6, 2, 1
company_b, model_b, 0.6-0.7, 1, 1
company_b, model_c, 0.5-0.6, 1, 0
company_b, model_c, 0.8-0.9, 1, 1

Comments

  • 1. Why is your ID a concatenated string? Your code would be much simpler with two columns as the primary key. 2. This seems quite complex. Could you please add a sample table and expected output? Commented Oct 2, 2018 at 6:20
  • @S-Man I created it here: db-fiddle.com/f/5sQiZD6mHwdnwvVfvL9MAh/0 Commented Oct 3, 2018 at 12:17
  • What's your expected result for the sample you've posted? Add it to your question please. Commented Oct 3, 2018 at 13:36
  • @KamilGosciminski I added the desired output and the code that I used to generate it. Sorry for that. Commented Oct 3, 2018 at 14:35
  • My answer seems to be what you're looking for, though I don't know why there are fewer segments in your output than the data generates. Commented Oct 3, 2018 at 14:49

2 Answers

1

It's hard to tell without sample output what you need, but I think what you're looking for is:

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  p.company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(p.user_identifier)) as "segment_users",
  COUNT(CASE WHEN pv.pageview_current_url_type = 'BUYSUCCESS' THEN 1 END) AS segmented_really_bought
FROM
  ranges r
INNER JOIN (
  SELECT
    SPLIT_PART(id, ':', 1) as company_identifier,
    SPLIT_PART(id, ':', 2) as user_identifier,
    model,
    prediction
  FROM
    predictionstate
  ) p ON p.prediction BETWEEN r.r_min AND r.r_max
LEFT JOIN pageviews pv ON 
  p.company_identifier = pv.company_identifier
  AND p.user_identifier = pv.user_identifier
GROUP BY p.company_identifier, p.model, r.segment
ORDER BY p.company_identifier, p.model, r.segment;

Changes to your fiddle query:

  • replaced predictionstate with a subquery that we join to, where the SPLIT_PART logic extracts the company and user identifiers as separate columns
  • used those identifiers to LEFT JOIN to pageviews
  • added segmented_really_bought column with a CASEd COUNT

1

demo: db<>fiddle

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
), pstate AS (                                         -- A
  SELECT 
    SPLIT_PART(ps.id, ':', 1) AS company_identifier,
    SPLIT_PART(ps.id, ':', 2) AS user_identifier,
    model,
    prediction
  FROM predictionstate ps
)
SELECT 
  company_identifier, model, segment,
  COUNT(DISTINCT user_identifier) as segment_users,    -- B
  -- C: 
  COUNT(user_identifier) FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS') as really_bought
FROM pstate ps
LEFT JOIN ranges r 
ON prediction BETWEEN r_min AND r_max
LEFT JOIN pageviews pv 
USING (company_identifier, user_identifier)
GROUP BY company_identifier, model, segment
ORDER BY company_identifier, model, segment

A: I really recommend splitting your id column into two columns. That would save you the time spent splitting the string (both when writing queries and when executing them), and it is more readable. That's why I added the second CTE.
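
If you can change the table, a one-time migration along these lines would store the two halves of id as real columns. Treat it as a sketch: the new column names mirror the ones in pageviews, the index name is made up, and any code that inserts rows would have to fill the new columns as well.

```sql
-- One-time migration sketch: store the two halves of id as real columns.
-- The column and index names below are hypothetical.
ALTER TABLE predictionstate
  ADD COLUMN company_identifier text,
  ADD COLUMN user_identifier    text;

UPDATE predictionstate
SET company_identifier = split_part(id, ':', 1),
    user_identifier    = split_part(id, ':', 2);

-- Optional: index the new columns so the join to pageviews can use them.
CREATE INDEX predictionstate_company_user_idx
  ON predictionstate (company_identifier, user_identifier);
```

With those columns in place the pstate CTE is no longer needed and the query can join predictionstate to pageviews directly.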

B: COUNT(DISTINCT) counts the distinct users in the group

C: counts all joined rows (not distinct users), keeping only the rows with the expected status before counting.
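
The difference between counting rows and counting distinct users matters as soon as one user has several BUYSUCCESS pageviews. The toy query below (made-up data, not taken from the fiddle) shows both variants side by side. Which one you want depends on whether really_bought should mean purchases or purchasing users.

```sql
-- Toy data (made up): user u1 has three BUYSUCCESS pageviews, u2 has none.
WITH pv (user_identifier, pageview_current_url_type) AS (
  VALUES ('u1', 'BUYSUCCESS'),
         ('u1', 'BUYSUCCESS'),
         ('u1', 'BUYSUCCESS'),
         ('u2', 'HOME')
)
SELECT
  COUNT(DISTINCT user_identifier) AS segment_users,          -- 2
  COUNT(user_identifier)
    FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS')
    AS buy_pageviews,                                        -- 3
  COUNT(DISTINCT user_identifier)
    FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS')
    AS converted_users                                       -- 1
FROM pv;
```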


I was wondering: what if a prediction is exactly on a threshold, for example 0.3? With the BETWEEN clause this row would be joined to both range 0.2-0.3 and range 0.3-0.4 (because BETWEEN is equivalent to r_min <= x <= r_max). It would be better to define the ranges as r_min <= x < r_max or r_min < x <= r_max. I kept the join as in your example, but I would prefer to change it; I just don't know which direction fits your data.
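
A half-open join avoids the double match on thresholds. The sketch below uses the r_min <= x < r_max variant on a few made-up predictions. The extra condition that keeps a prediction of exactly 1.0 in the top bucket is my own assumption about what you would want; without it that row would match no range at all.

```sql
WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange AS r_min, myrange + 0.1 AS r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
), p (prediction) AS (
  VALUES (0.25), (0.3), (1.0)   -- made-up sample predictions
)
SELECT p.prediction, r.segment
FROM p
JOIN ranges r
  ON p.prediction >= r.r_min
 AND (p.prediction < r.r_max
      OR (r.r_max = 1.0 AND p.prediction = 1.0))  -- top bucket keeps 1.0
ORDER BY p.prediction;
-- 0.25 -> 0.2-0.3
-- 0.3  -> 0.3-0.4   (matched once, not twice)
-- 1.0  -> 0.9-1.0
```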
