
I'm seeing an inexplicable array index error: Index 1 out of bounds for length 1

... which I can't explain, because I don't see any arrays being referenced anywhere in a PySpark SQL query running on AWS MWAA + EMR Serverless.

Here's the query I'm running:

INSERT INTO TABLE dev.aggregates
  PARTITION (p_date='2024-03-03')
  BY NAME
  SELECT '13x1' AS dimensions
  FROM dev.other_aggregates
  LIMIT 3

Interestingly, if I modify the query to get rid of the FROM portion, it executes successfully:

INSERT INTO TABLE dev.aggregates
  PARTITION (p_date='2024-03-03')
  BY NAME
  SELECT '13x1' AS dimensions

Here's my target insert table's definition, from SHOW CREATE TABLE dev.aggregates:

CREATE EXTERNAL TABLE `dev.aggregates`(
  `dimensions` string COMMENT '')
PARTITIONED BY ( 
  `p_date` varchar(11) COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://airflow-dev/tables/aggregates/parquet'
TBLPROPERTIES (
  'parquet.compression'='SNAPPY', 
  'projection.enabled'='true', 
  'projection.p_date.format'='yyyy-MM-dd', 
  'projection.p_date.interval'='1', 
  'projection.p_date.interval.unit'='DAYS', 
  'projection.p_date.range'='2023-12-01,NOW', 
  'projection.p_date.type'='date', 
  'storage.location.template'='s3://airflow-dev/tables/aggregates/parquet/p_date=${p_date}')
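For context, the projection.p_date.* properties above tell Athena to compute the partition values at query time from the configured range, rather than read them from the metastore. As a rough, purely illustrative sketch (not Athena's actual implementation), the format/interval/range settings imply an enumeration like:

```python
from datetime import date, timedelta

def enumerate_projected_partitions(start, end, fmt="%Y-%m-%d", interval_days=1):
    """Illustrative only: enumerate p_date values the way
    'projection.p_date.range'='2023-12-01,NOW' with interval 1 DAYS
    implies. Athena does this at query time; nothing is stored in
    the metastore, which is why engines that list partitions via
    the Hive/Glue catalog may not see them."""
    out = []
    d = start
    while d <= end:
        out.append(d.strftime(fmt))
        d += timedelta(days=interval_days)
    return out

# First few projected partitions starting at the range's lower bound
parts = enumerate_projected_partitions(date(2023, 12, 1), date(2023, 12, 4))
```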

Here's a bit more of the stacktrace:

  File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 66, in <module>
    run_spark_commands(spark, sql_commands)
  File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 41, in run_spark_commands
    spark_runner.sql(command_string)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o101.sql.
: java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
    at java.base/java.util.Objects.checkIndex(Objects.java:361)
    at java.base/java.util.ArrayList.get(ArrayList.java:427)
    at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:881)
    at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:873)
    at org.apache.hadoop.hive.ql.metadata.Partition.getSpec(Partition.java:416)
    at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHivePartition(HiveClientImpl.scala:1245)
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitions$4(HiveClientImpl.scala:832)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
...

Thanks in advance for any clues you might offer.

1 Answer

We may have had multiple issues, but the final piece of the puzzle had to do with the WHERE clause (which I neglected to include in the example above):

INSERT INTO table1
  SELECT ...
  FROM ...
  WHERE ...
    AND id1 IN (SELECT id2 FROM table2 WHERE dt = '2024-03-08')
  ...

For whatever reason, Spark wasn't seeing table2's partition for dt = '2024-03-08', even though we could query it through Athena from DataGrip (perhaps something to do with the table using partition projection). So we explicitly re-added the partition before the INSERT query in the Spark SQL, and it worked:

ALTER TABLE table2 ADD IF NOT EXISTS
  PARTITION (dt='2024-03-08')
  LOCATION 's3://airflow-dev/tables/table2/dt=2024-03-08';
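If the job touches more than one date, the same workaround can be scripted ahead of the INSERT. A minimal sketch, assuming a hypothetical helper and the S3 layout shown above (each generated statement would then be executed with spark.sql() in the Airflow/EMR job):

```python
from datetime import date, timedelta

def add_partition_statements(table, location_template, start, end):
    """Build ALTER TABLE ... ADD IF NOT EXISTS PARTITION statements for
    every date in [start, end]. 'table' and 'location_template' are
    hypothetical parameters, not part of the original job."""
    stmts = []
    d = start
    while d <= end:
        dt = d.isoformat()
        stmts.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (dt='{dt}') "
            f"LOCATION '{location_template.format(dt=dt)}'"
        )
        d += timedelta(days=1)
    return stmts

stmts = add_partition_statements(
    "table2",
    "s3://airflow-dev/tables/table2/dt={dt}",
    date(2024, 3, 8),
    date(2024, 3, 8),
)
# for stmt in stmts:
#     spark.sql(stmt)   # run before the INSERT
```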
