I am trying to insert data from HBase into Teradata using PySpark. The data is read into a Spark DataFrame, and the insert succeeds when I limit the DataFrame to 3000–5000 rows, like this:

df = df.limit(3000)

However, when I try to insert without the limit and instead set the JDBC batchsize option to 1000, I get this error:

java.sql.BatchUpdateException: [Teradata JDBC Driver] [TeraJDBC 17.10.00.27] 
[Error 1338] [SQLState HY000] A failure occurred while executing a PreparedStatement batch request. 
The parameter set was not executed and should be resubmitted individually using the PreparedStatement executeUpdate method.

Complete code snippet:

df = df.toDF(*renamed_columns)
df = df.withColumnRenamed("data_ROW", "ROW")
df.show(1, False)
df = df.limit(5)

teraDataIp = configFile["teraDataDevIP"]
teraDataBaseName = configFile["teraDataDbName"]
jdbc_url = "jdbc:teradata://{}/DATABASE={},tmode=ANSI,charSet=UTF8,SSLMODE=DISABLE".format(teraDataIp, teraDataBaseName)

# teradata_table = configFile["giskITDTableName"]

if "1" in hbaseTableStr:
    teradata_table = "____"
elif "2" in hbaseTableStr:
    teradata_table = "____"
elif "3" in hbaseTableStr:
    teradata_table = "____"
elif "4" in hbaseTableStr:
    teradata_table = "____"

print("Teradata Table Name is: ", teradata_table)
df = df.repartition(10)

df.write.format("jdbc") \
    .mode("append") \
    .option("driver", "com.teradata.jdbc.TeraDriver") \
    .option("url", jdbc_url) \
    .option("user", ____) \
    .option("password", ____) \
    .option("dbtable", teradata_table) \
    .option("batchsize", 1000).save()

Execution command:

spark-submit --conf "spark.driver.extraClassPath=[REDACTED]/hbase/lib/*" \
--jars [REDACTED]/hbase_connectors/hbase-spark-protocol-shaded.jar,\
[REDACTED]/lib/terajdbc4.jar,\
[REDACTED]/tdgssconfig.jar \
[REDACTED]/sparkHbase_v2.py

Full error stack trace:

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

Caused by: com.teradata.jdbc.jdbc_4.util.JDBCException
[Teradata Database] [TeraJDBC 17.10.00.27] [Error 1338] [SQLState HY000]
A failure occurred while executing a PreparedStatement batch request.
Details of the failure can be found in the exception chain which is accessible with getNextException().
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:198)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.batchUpdateRowCount(StatementReceiveState.java:1406)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.batchUpdateRowCount(StatementReceiveState.java:1389)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.batchUpdateRowCount(StatementReceiveState.java:1371)

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1 (TID 1) failed for unknown reasons
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)

Caused by: java.sql.BatchUpdateException: Batch entry 0 insert into table_name values (?, ?, ?) was aborted.
Call getNextException() to see other errors in the batch.
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.run(DataSourceRDD.scala:166)

What I tried:

  • Inserting smaller datasets works fine
  • Checked Teradata DB permissions — no issues
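
One way to surface the nested errors is to catch the py4j exception on the driver and walk both the cause chain and the SQLException chain. This is only a sketch: it assumes the SQLException chain survives deserialization to the driver (not guaranteed for executor-side failures), and it condenses the same write options shown above.

from py4j.protocol import Py4JError, Py4JJavaError

try:
    # Same write as in the snippet above, options condensed for brevity
    df.write.format("jdbc").options(
        driver="com.teradata.jdbc.TeraDriver", url=jdbc_url,
        user=____, password=____, dbtable=teradata_table,
        batchsize="1000").mode("append").save()
except Py4JJavaError as e:
    exc = e.java_exception
    while exc is not None:
        print(exc.getClass().getName(), "->", exc.getMessage())
        try:
            # SQLExceptions chain the underlying failure via getNextException()
            nested = exc.getNextException()
            while nested is not None:
                print("  nested:", nested.getMessage())
                nested = nested.getNextException()
        except Py4JError:
            pass  # not a SQLException, so there is no getNextException()
        exc = exc.getCause()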

Question: Why does the insert work for small limits but fail when using batchsize for larger data, and how can I properly insert large DataFrames from PySpark to Teradata without hitting this BatchUpdateException?

Comments:

  • This exception says one or more of the rows in the batch was not inserted successfully. The reason or reasons for that would be in the exception chain. Try setting .option("flatten","on") to see the nested exceptions, then edit this question or ask another. Commented Aug 11 at 16:06
  • Just an observation: TeraJDBC 17.10.00.27 is quite old at this point. That's probably not the issue here, but you should consider upgrading. Commented Aug 11 at 16:08
  • Hi, I have tried .option("flatten","on") as well, but I get the same error with it too. Commented Aug 12 at 13:58
  • Once you include flatten, the actual error should appear further down in the stack trace. Commented Aug 13 at 16:02
