I'm seeing an array index error I can't explain:
Index 1 out of bounds for length 1
It's puzzling because no arrays are referenced anywhere in the PySpark SQL query I'm running on AWS MWAA + EMR Serverless.
Here's the query I'm running:
INSERT INTO TABLE dev.aggregates
PARTITION (p_date='2024-03-03')
BY NAME
SELECT '13x1' AS dimensions
FROM dev.other_aggregates
LIMIT 3
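For reference, the statement is executed via a plain spark.sql() call from a job runner script (see the stack trace further down). Here's a minimal standalone version of what runs, assuming a Hive-enabled session; the session setup is my approximation, not the exact EMR Serverless entrypoint, and the Glue catalog configuration that EMR Serverless injects is not included:

from pyspark.sql import SparkSession

# Approximation of the environment; EMR Serverless provides the real session
# (including the AWS Glue Data Catalog configuration, omitted here).
spark = (
    SparkSession.builder
    .appName("aggregates-insert-repro")
    .enableHiveSupport()
    .getOrCreate()
)

# This is the statement that fails with the IndexOutOfBoundsException.
spark.sql("""
    INSERT INTO TABLE dev.aggregates
    PARTITION (p_date='2024-03-03')
    BY NAME
    SELECT '13x1' AS dimensions
    FROM dev.other_aggregates
    LIMIT 3
""")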
Interestingly, if I drop the FROM clause, the query executes successfully:
INSERT INTO TABLE dev.aggregates
PARTITION (p_date='2024-03-03')
BY NAME
SELECT '13x1' AS dimensions
Here's the definition of the target table, from SHOW CREATE TABLE dev.aggregates:
CREATE EXTERNAL TABLE `dev.aggregates`(
`dimensions` string COMMENT '')
PARTITIONED BY (
`p_date` varchar(11) COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://airflow-dev/tables/aggregates/parquet'
TBLPROPERTIES (
  'parquet.compression'='SNAPPY',
  'projection.enabled'='true',
  'projection.p_date.format'='yyyy-MM-dd',
  'projection.p_date.interval'='1',
  'projection.p_date.interval.unit'='DAYS',
  'projection.p_date.range'='2023-12-01,NOW',
  'projection.p_date.type'='date',
  'storage.location.template'='s3://airflow-dev/tables/aggregates/parquet/p_date=${p_date}')
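If it would help with diagnosis, I can run checks against the table from the same session. A sketch of what I have in mind (standard Spark SQL statements; I haven't captured their output yet):

# Diagnostics I can run in the same Spark session on request.
# SHOW PARTITIONS may exercise the same Hive-client partition-listing
# path that appears in the stack trace below.
spark.sql("SHOW PARTITIONS dev.aggregates").show(truncate=False)
spark.sql("DESCRIBE TABLE EXTENDED dev.aggregates").show(truncate=False)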
Here's a bit more of the stack trace:
File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 66, in <module>
run_spark_commands(spark, sql_commands)
File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 41, in run_spark_commands
spark_runner.sql(command_string)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o101.sql.
: java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
at java.base/java.util.Objects.checkIndex(Objects.java:361)
at java.base/java.util.ArrayList.get(ArrayList.java:427)
at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:881)
at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:873)
at org.apache.hadoop.hive.ql.metadata.Partition.getSpec(Partition.java:416)
at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHivePartition(HiveClientImpl.scala:1245)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitions$4(HiveClientImpl.scala:832)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.Iterator.foreach(Iterator.scala:943)
...
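For completeness, the runner the trace passes through is essentially a loop over SQL strings. This is my reconstruction from the frame names above (run_spark_commands, command_string), not the actual file contents:

def run_spark_commands(spark_runner, sql_commands):
    # Reconstructed from the stack trace: each statement is passed
    # verbatim to SparkSession.sql(); the failing INSERT is one of them.
    for command_string in sql_commands:
        spark_runner.sql(command_string)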
Thanks in advance for any clues you might offer.