I am doing some SQL aggregation transformations on a dataset, and there is a certain conditional calculation I would like to add, but I am not sure how.

Here is a basic code block:

le_test = spark.sql("""
SELECT Country,
       ROUND(MIN(Life_Expectancy)) AS Min_LE,
       ROUND(AVG(Life_Expectancy)) AS Avg_LE,
       ROUND(MAX(Life_Expectancy)) AS Max_LE,
       ROUND(MAX(Life_Expectancy) - MIN(Life_Expectancy)) AS LE_range
FROM le_cleaned
GROUP BY Country
""")

This dataset has each country split up by year (over the course of about two decades), hence the aggregation. What I want is an additional column based on the difference in values between the first-year row and the last-year row for each country, which is not the same as the min and max values. So basically, the line should resemble something like this:

ROUND((Life_Expectancy WHERE "YEAR" = 2019) - (Life_Expectancy WHERE "YEAR" = 2000)) AS LE_difference

There is a long way to do this, but I assume there must be a short way, since you can already do a similar calculation with the min and max values in a GROUP BY.

6 Replies

Please show both sample input data and expected result as markdown tables in your question.

I'm just asking how to make a certain kind of conditional SQL statement. All the context needed is already provided.

That is, we have to guess (assume) that the table contains the columns (Country, Year, Life_Expectancy) and that the years 2000 and 2019 should be hardcoded. Or is it (Country, Date, Life_Expectancy)?

the line should resemble something like this:

ROUND((Life_Expectancy WHERE "YEAR" = 2019) - (Life_Expectancy WHERE "YEAR" = 2000)) AS LE_dif

It looks like you are searching for MIN_BY/MAX_BY:

Assumption: the table contains a column named YEAR with a single entry per year and country:

SELECT Country,
       ROUND(MIN(Life_Expectancy)) AS Min_LE,
       ROUND(AVG(Life_Expectancy)) AS Avg_LE,
       ROUND(MAX(Life_Expectancy)) AS Max_LE,
       MIN_BY(Life_Expectancy, YEAR) AS First_LE,  -- value in the earliest year
       MAX_BY(Life_Expectancy, YEAR) AS Last_LE,   -- value in the latest year
       ROUND(MAX_BY(Life_Expectancy, YEAR) - MIN_BY(Life_Expectancy, YEAR)) AS LE_diff
FROM le_cleaned
GROUP BY Country
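As a sanity check on the MIN_BY/MAX_BY semantics, here is a pure-Python emulation on hypothetical sample rows (the country names and values are invented for illustration). Note the order: "last year minus first year" is MAX_BY(...) - MIN_BY(...), since MAX_BY picks the value from the row with the largest YEAR.

```python
# Pure-Python emulation of SQL MIN_BY / MAX_BY on invented sample rows.
# Each row is (country, year, life_expectancy).
rows = [
    ("A", 2000, 70.0), ("A", 2010, 72.5), ("A", 2019, 75.0),
    ("B", 2000, 60.0), ("B", 2010, 64.0), ("B", 2019, 68.0),
]

def min_by(group, value, order):
    # value(row) for the row where order(row) is smallest.
    return value(min(group, key=order))

def max_by(group, value, order):
    # value(row) for the row where order(row) is largest.
    return value(max(group, key=order))

diffs = {}
for country in sorted({r[0] for r in rows}):
    group = [r for r in rows if r[0] == country]
    first = min_by(group, value=lambda r: r[2], order=lambda r: r[1])
    last = max_by(group, value=lambda r: r[2], order=lambda r: r[1])
    diffs[country] = last - first  # latest-year value minus earliest-year value

print(diffs)  # {'A': 5.0, 'B': 8.0}
```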

I appreciate your frustration, especially in the deleted comment. That's why SO has relied on [mre] for its entire existence. Please read: Why should I provide a Minimal Reproducible Example, even for a very simple SQL query?

@Lakasz: Those MIN_BY/MAX_BY functions appear to be exactly what I am looking for... but they don't seem to be supported in PySpark 2.3 (which is what ships with HDP 2.6.5). That's unfortunate.
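For Spark 2.3, one fallback is plain conditional aggregation (CASE WHEN inside MAX), which only needs the hardcoded endpoint years from the question. A minimal sketch, assuming the (Country, Year, Life_Expectancy) schema guessed above and invented sample values; it uses Python's built-in sqlite3 purely so the example is self-contained, since the SQL pattern itself should be portable to Spark SQL:

```python
import sqlite3

# Hedged sketch: the CASE WHEN pattern below is plain conditional aggregation
# and should also run on Spark 2.3's SQL engine; sqlite3 is used here only to
# make the example runnable without a Spark cluster. Sample values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE le_cleaned (Country TEXT, Year INTEGER, Life_Expectancy REAL)")
conn.executemany(
    "INSERT INTO le_cleaned VALUES (?, ?, ?)",
    [("A", 2000, 70.0), ("A", 2019, 75.0),
     ("B", 2000, 60.0), ("B", 2019, 68.0)],
)

result = conn.execute("""
    SELECT Country,
           ROUND(MAX(CASE WHEN Year = 2019 THEN Life_Expectancy END)
               - MAX(CASE WHEN Year = 2000 THEN Life_Expectancy END)) AS LE_difference
    FROM le_cleaned
    GROUP BY Country
    ORDER BY Country
""").fetchall()

print(result)  # [('A', 5.0), ('B', 8.0)]
```

Window functions (FIRST_VALUE/LAST_VALUE over PARTITION BY Country ORDER BY Year) are another route that Spark 2.3 supports, at the cost of a DISTINCT or an extra subquery.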
