I am doing some SQL aggregation transformations on a dataset, and there is a certain conditional calculation I would like to add, but I am not sure how.

Here is a basic code block:

le_test = spark.sql("""
SELECT Country,
       ROUND(MIN(Life_Expectancy)) AS Min_LE,
       ROUND(AVG(Life_Expectancy)) AS Avg_LE,
       ROUND(MAX(Life_Expectancy)) AS Max_LE,
       ROUND(MAX(Life_Expectancy) - MIN(Life_Expectancy)) AS LE_range
FROM le_cleaned
GROUP BY Country
""")

This dataset has each country split up by year (over the course of about two decades), hence the aggregation. What I want is an additional column based on the difference in values between the first-year row and the last-year row for each country, which is not the same as the min and max values. So basically, the line should resemble something like this:

ROUND((Life_Expectancy WHERE "YEAR" = 2019) - (Life_Expectancy WHERE "YEAR" = 2000)) AS LE_difference

There is a long way to do this, but I assume there must be a short way, since you can already do a similar calculation with the min and max values in a GROUP BY.

6 Replies

Please show both sample input data and expected result as markdown tables in your question.

I'm just asking how to make a certain kind of conditional SQL statement. All the context needed is already provided.

That is, we have to guess (assume) that the table contains the columns (Country, Year, Life_Expectancy) and that the years 2000 and 2019 should be hardcoded. Or is it (Country, Date, Life_Expectancy)?

the line should resemble something like this:

ROUND((Life_Expectancy WHERE "YEAR" = 2019) - (Life_Expectancy WHERE "YEAR" = 2000)) AS LE_dif

It looks like you are searching for MIN_BY/MAX_BY:

Assumption: the table contains a column named YEAR with a single entry per year and country:

SELECT Country,
       ROUND(MIN(Life_Expectancy)) AS Min_LE,
       ROUND(AVG(Life_Expectancy)) AS Avg_LE,
       ROUND(MAX(Life_Expectancy)) AS Max_LE,
       MIN_BY(Life_Expectancy, YEAR) AS First_LE,  -- value in the earliest year
       MAX_BY(Life_Expectancy, YEAR) AS Last_LE,   -- value in the latest year
       ROUND(MAX_BY(Life_Expectancy, YEAR) - MIN_BY(Life_Expectancy, YEAR)) AS LE_diff
FROM le_cleaned
GROUP BY Country
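As a sanity check on the MIN_BY/MAX_BY semantics, here is a pure-Python emulation on hypothetical sample rows (the country names and values are invented for illustration). Note the order: "last year minus first year" is MAX_BY(...) - MIN_BY(...), since MAX_BY picks the value from the row with the largest YEAR.

```python
# Pure-Python emulation of SQL MIN_BY / MAX_BY on invented sample rows.
# Each row is (country, year, life_expectancy).
rows = [
    ("A", 2000, 70.0), ("A", 2010, 72.5), ("A", 2019, 75.0),
    ("B", 2000, 60.0), ("B", 2010, 64.0), ("B", 2019, 68.0),
]

def min_by(group, value, order):
    # value(row) for the row where order(row) is smallest.
    return value(min(group, key=order))

def max_by(group, value, order):
    # value(row) for the row where order(row) is largest.
    return value(max(group, key=order))

diffs = {}
for country in sorted({r[0] for r in rows}):
    group = [r for r in rows if r[0] == country]
    first = min_by(group, value=lambda r: r[2], order=lambda r: r[1])
    last = max_by(group, value=lambda r: r[2], order=lambda r: r[1])
    diffs[country] = last - first  # latest-year value minus earliest-year value

print(diffs)  # {'A': 5.0, 'B': 8.0}
```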

I appreciate your frustration, especially in the deleted comment. That's why SO has relied on [mre] for its entire existence. Please read: Why should I provide a Minimal Reproducible Example, even for a very simple SQL query?

@Lakasz: Those MIN_BY/MAX_BY functions appear to be exactly what I am looking for... but they don't seem to be supported in PySpark 2.3 (which is what ships with HDP 2.6.5). That's unfortunate.
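For Spark 2.3, one fallback is plain conditional aggregation (CASE WHEN inside MAX), which only needs the hardcoded endpoint years from the question. A minimal sketch, assuming the (Country, Year, Life_Expectancy) schema guessed above and invented sample values; it uses Python's built-in sqlite3 purely so the example is self-contained, since the SQL pattern itself should be portable to Spark SQL:

```python
import sqlite3

# Hedged sketch: the CASE WHEN pattern below is plain conditional aggregation
# and should also run on Spark 2.3's SQL engine; sqlite3 is used here only to
# make the example runnable without a Spark cluster. Sample values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE le_cleaned (Country TEXT, Year INTEGER, Life_Expectancy REAL)")
conn.executemany(
    "INSERT INTO le_cleaned VALUES (?, ?, ?)",
    [("A", 2000, 70.0), ("A", 2019, 75.0),
     ("B", 2000, 60.0), ("B", 2019, 68.0)],
)

result = conn.execute("""
    SELECT Country,
           ROUND(MAX(CASE WHEN Year = 2019 THEN Life_Expectancy END)
               - MAX(CASE WHEN Year = 2000 THEN Life_Expectancy END)) AS LE_difference
    FROM le_cleaned
    GROUP BY Country
    ORDER BY Country
""").fetchall()

print(result)  # [('A', 5.0), ('B', 8.0)]
```

Window functions (FIRST_VALUE/LAST_VALUE over PARTITION BY Country ORDER BY Year) are another route that Spark 2.3 supports, at the cost of a DISTINCT or an extra subquery.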
