I am evaluating how Spark 4's try_variant_get function handles VARIANT-typed data. First, here are the SQL statements I used as an example.
CREATE TABLE family (
id INT,
data VARIANT
);
INSERT INTO family (id, data)
VALUES
(1, PARSE_JSON('{"name":"Alice","age":30}')),
(2, PARSE_JSON('[1,2,3,4,5]')),
(3, PARSE_JSON('42'));
This SQL executes without errors. Below is the SELECT statement that uses try_variant_get:
SELECT
id,
try_variant_get(data, '$.name', 'STRING') AS name,
try_variant_get(data, '$.age', 'INT') AS age
FROM
family
WHERE
try_variant_get(data, '$.name', 'STRING') IS NOT NULL;
The SQL output is returned successfully. I then translated these SQL statements into Java API code.
import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

SparkSession spark = SparkSession.builder()
    .master("local[*]")
    .appName("VariantExample")
    .getOrCreate();
StructType schema = new StructType()
.add("id", DataTypes.IntegerType)
.add("data", DataTypes.VariantType);
Dataset<Row> df = spark.createDataFrame(
Arrays.asList(
RowFactory.create(1, "{\"name\":\"Alice\",\"age\":30}"),
RowFactory.create(2, "[1,2,3,4,5]"),
RowFactory.create(3, "42")
),
schema
);
Dataset<Row> df_sel = df.select(
col("id"),
try_variant_get(col("data"), "$.name", "String").alias("name"),
try_variant_get(col("data"), "$.age", "Integer").alias("age")
).where("name IS NOT NULL");
df_sel.printSchema();
df_sel.show();
But this Java code prints the schema and then throws the following exception:
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
Exception in thread "main" java.lang.ClassCastException: class java.lang.String cannot be cast to class org.apache.spark.unsafe.types.VariantVal (java.lang.String is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.VariantVal is in unnamed module of loader 'app')
at org.apache.spark.sql.catalyst.expressions.variant.VariantGet.nullSafeEval(variantExpressions.scala:282)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:692)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:159)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:89)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$48.$anonfun$applyOrElse$83(Optimizer.scala:2231)
at scala.collection.immutable.List.map(List.scala:247)
at scala.collection.immutable.List.map(List.scala:79).....
I suspect the "String" type argument of try_variant_get is the problem, but I have no idea what is actually wrong with this Java code. Please let me know how to fix this error.
Regarding the createDataFrame() part: have you verified the data is inserted as expected, the same as when you inserted it manually in SQL? Does a plain select still work? If so, then we can isolate the problem to the df.select() part.
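A minimal sketch of the suspected fix, assuming Spark 4.0+ on the classpath (the class name VariantFix and method fixedSelect are just for illustration). The exception suggests the raw JSON strings were stored in a column declared as VariantType, so the evaluator received java.lang.String where it expected VariantVal; building the rows as a plain STRING column and converting with functions.parse_json — the Java counterpart of the SQL PARSE_JSON call — should avoid the cast:

```java
import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

public class VariantFix {
    // Returns the same projection as the failing df_sel, but built on a
    // variant column produced by parse_json instead of raw strings.
    public static Dataset<Row> fixedSelect(SparkSession spark) {
        // Declare the JSON column as a plain STRING, not VariantType.
        StructType rawSchema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("json", DataTypes.StringType);

        Dataset<Row> raw = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, "{\"name\":\"Alice\",\"age\":30}"),
                RowFactory.create(2, "[1,2,3,4,5]"),
                RowFactory.create(3, "42")
        ), rawSchema);

        // parse_json turns the JSON text into a true VARIANT value,
        // mirroring what PARSE_JSON did in the SQL version.
        Dataset<Row> df = raw.select(col("id"), parse_json(col("json")).alias("data"));

        return df.select(
                col("id"),
                try_variant_get(col("data"), "$.name", "STRING").alias("name"),
                try_variant_get(col("data"), "$.age", "INT").alias("age")
        ).where("name IS NOT NULL");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("VariantFix").getOrCreate();
        fixedSelect(spark).show();
        spark.stop();
    }
}
```

With the sample rows above, only id 1 has a $.name field, so the filtered result should contain the single row for Alice; the array and scalar rows yield NULL from try_variant_get and are dropped by the WHERE clause.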