I am running some SQL aggregations on a dataset in Spark, and there is one conditional calculation I would like to add but am not sure how to write.
Here is a basic code block:
le_test = spark.sql("""
SELECT Country,
ROUND(MIN(Life_Expectancy)) AS Min_LE,
ROUND(AVG(Life_Expectancy)) AS Avg_LE,
ROUND(MAX(Life_Expectancy)) AS Max_LE,
ROUND(MAX(Life_Expectancy) - MIN(Life_Expectancy)) AS LE_range
FROM le_cleaned
GROUP BY Country
""")
This dataset has one row per country per year (covering about two decades), hence the aggregation. What I want to add is a column holding the difference between each country's value in the first year and its value in the last year, which is not the same as the difference between the min and max values. So basically, the line should resemble something like this:
ROUND((Life_Expectancy WHERE "YEAR" = 2019) - (Life_Expectancy WHERE "YEAR" = 2000)) AS LE_difference
There is a long way to do this, but I assume there must be a shorter way, since the same kind of calculation already works with the min and max values in a GROUP BY.
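For illustration, one common way to express the line above is conditional aggregation: wrap a CASE expression in an aggregate so that only the row for the wanted year contributes. The sketch below is a minimal example under several assumptions: the year column is named Year, the first and last years are 2000 and 2019 as in the pseudo-line, and the sample rows are made up. It uses an in-memory SQLite table only so that it runs without a Spark session; the query itself is plain SQL and should carry over to spark.sql(...) against le_cleaned.

```python
import sqlite3

# Hypothetical sample rows: (Country, Year, Life_Expectancy).
# The column name Year and all values here are assumptions for illustration.
rows = [
    ("A", 2000, 60.0),
    ("A", 2010, 58.0),
    ("A", 2019, 65.0),
    ("B", 2000, 70.0),
    ("B", 2010, 75.0),
    ("B", 2019, 72.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE le_cleaned (Country TEXT, Year INTEGER, Life_Expectancy REAL)"
)
conn.executemany("INSERT INTO le_cleaned VALUES (?, ?, ?)", rows)

# CASE returns NULL for every row except the wanted year, and MAX ignores
# NULLs, so each MAX(CASE ...) picks out that single year's value per country.
query = """
SELECT Country,
       ROUND(MAX(Life_Expectancy) - MIN(Life_Expectancy)) AS LE_range,
       ROUND(MAX(CASE WHEN Year = 2019 THEN Life_Expectancy END)
           - MAX(CASE WHEN Year = 2000 THEN Life_Expectancy END)) AS LE_difference
FROM le_cleaned
GROUP BY Country
ORDER BY Country
"""
result = conn.execute(query).fetchall()

for country, le_range, le_difference in result:
    print(country, le_range, le_difference)
```

If a country has no row for one of the two years, LE_difference comes out as NULL for that country. On Spark 3.0 or later, max_by(Life_Expectancy, Year) - min_by(Life_Expectancy, Year) may be an alternative that avoids hard-coding the years.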