I am currently using the Google Address Validation API in a PySpark (Databricks) pipeline to validate addresses from a table. Each row contains an address in a column called 'Address', and I send a request to the API for validation.
However, when I test the process with just a single record, the API receives two requests instead of one, which doubles our usage costs. I have verified that only one transformation is triggered, and there is no loop or retry logic implemented on our end.
```python
import requests
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a user-defined function (UDF) to validate an address
def validate_address_udf(street, city, state, postal_code):
    addr = f"{street}, {city}, {state} {postal_code}"
    api_key = "API KEY"
    url = f"https://addressvalidation.googleapis.com/v1:validateAddress?key={api_key}"
    payload = {
        "address": {
            "addressLines": [addr]
        }
    }
    headers = {"Content-Type": "application/json"}
    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        result_data = response.json().get("result", {})
        return result_data.get("address", {}).get("formattedAddress", "")
    return f"Error: {response.status_code}, {response.text}"

# Register the UDF
validate_address = udf(validate_address_udf, StringType())

# Apply the UDF to the DataFrame to get validated addresses
result_df = distinct_df.withColumn("ValidatedAddress", validate_address(*required_columns))
```
I've checked that:
* Only one row is being processed.
* Our code makes no explicit retries or duplicate calls.
* Disabling eager evaluation or caching does not change the outcome.
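One workaround I'm considering, independent of the root cause, is memoizing the request per address string so that if Spark evaluates the UDF expression a second time for the same row, the result comes from an in-process cache instead of hitting the API again. A minimal sketch of the idea (the `call_count` counter and the uppercase placeholder stand in for the real `requests.post` call, just to show the caching behavior):

```python
from functools import lru_cache

call_count = 0  # tracks how many times the "API" is actually hit

@lru_cache(maxsize=10_000)
def validate_address_cached(addr: str) -> str:
    """Memoized stand-in for the Address Validation request.

    In the real UDF this body would issue the requests.post call;
    lru_cache ensures repeated invocations with the same address
    string reach the network only once per executor process.
    """
    global call_count
    call_count += 1
    return addr.upper()  # placeholder for the returned formattedAddress

# Even if the column expression is evaluated twice for the same row,
# the second invocation is served from the cache:
first = validate_address_cached("1600 Amphitheatre Pkwy, Mountain View, CA 94043")
second = validate_address_cached("1600 Amphitheatre Pkwy, Mountain View, CA 94043")
assert first == second
assert call_count == 1  # only one "request" went out
```

This only deduplicates within a single executor process, though, so it wouldn't explain or fully fix duplicate calls that come from Spark re-running the whole stage.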
Has anyone else experienced this issue, or is there a known cause (e.g., Databricks execution behavior or UDF behavior) that could explain the double API calls?
Any guidance on how to prevent the duplicate request would be appreciated.