934 questions
1 vote · 0 answers · 126 views
IllegalThreadStateException in Spark 4.0 when Adaptive Query Execution (AQE) is active and when running many queries in the same Spark instance
Upon upgrading to Spark 4, we get (deterministically) an IllegalThreadStateException in a long series of queries involving spark.ml or Delta Lake (e.g. in estimator.fit()) in the same long-running Spark ...
0 votes · 0 answers · 15 views
How to reproduce the results of an ML model in Spark? [duplicate]
I am creating a machine learning model (random forest) in Spark (PySpark) with cross-validation and grid search. I have two dataframes: one for training and one for testing, both stored in Parquet.
...
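For context, a minimal sketch of the kind of setup described, with explicit seeds everywhere randomness enters (the fold split as well as the trees); the column names, grid values, data path, and the active SparkSession named spark are assumptions:

# Hypothetical reproducibility sketch: fix seeds on the estimator and the cross-validator.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train_df = spark.read.parquet("train.parquet")  # hypothetical path

rf = RandomForestClassifier(featuresCol="features", labelCol="label", seed=42)
grid = ParamGridBuilder().addGrid(rf.numTrees, [50, 100]).build()
cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
    numFolds=3,
    seed=42,  # the train/validation fold assignment is also randomized
)
cv_model = cv.fit(train_df)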
0 votes · 0 answers · 20 views
Spark map() operation fails with NotSerializableException
I am encountering a NotSerializableException while using a map() transformation in Spark, despite the object used in the transformation being serializable. The issue arises when I try to apply a ...
0 votes · 1 answer · 299 views
How to Perform Inference in Spark with an XGBoost Model Trained in scikit-learn?
I have an xgboost model that was trained with scikit-learn in plain Python.
How can I use that model to run inference on a new dataset in PySpark?
Apart from using UDFs, what other options do I have?...
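As a point of reference, a sketch of the UDF baseline the question mentions, assuming the scikit-learn/XGBoost model can be loaded on the driver and broadcast; the file path, feature columns, dataframe name, and the active SparkSession named spark are all hypothetical:

import joblib
import pandas as pd
from pyspark.sql.functions import pandas_udf

model = joblib.load("xgb_model.joblib")          # hypothetical path
bc_model = spark.sparkContext.broadcast(model)   # ship the model once per executor

@pandas_udf("double")
def predict_udf(f1: pd.Series, f2: pd.Series) -> pd.Series:
    # Rebuild the feature matrix the sklearn model expects and score in batches.
    X = pd.concat([f1, f2], axis=1).values
    return pd.Series(bc_model.value.predict(X))

scored = df.withColumn("prediction", predict_udf("feature_1", "feature_2"))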
1 vote · 0 answers · 105 views
How to create my own custom imputer to impute constant values seamlessly in pyspark.ml pipelines
I would like to optimize the imputation of missing values on my dataset through a CV search. This is trivial to do in sklearn, with which I am familiar -- however, I am for the first time working with ...
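For context, a minimal sketch (not the asker's code) of a constant-value imputer that can sit inside a pyspark.ml Pipeline; the class name, params, and column names are hypothetical:

# Hypothetical constant-value imputer as a pyspark.ml Transformer.
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class ConstantImputer(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    inputCol = Param(Params._dummy(), "inputCol", "column to impute")
    fillValue = Param(Params._dummy(), "fillValue", "constant used for missing values")

    @keyword_only
    def __init__(self, inputCol=None, fillValue=0.0):
        super().__init__()
        self._setDefault(fillValue=0.0)
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def _transform(self, df):
        col = self.getOrDefault(self.inputCol)
        value = self.getOrDefault(self.fillValue)
        # Replace nulls in the input column with the configured constant.
        return df.withColumn(col, F.coalesce(F.col(col), F.lit(value)))

It can then be dropped into a Pipeline like any other stage, e.g. Pipeline(stages=[ConstantImputer(inputCol="age", fillValue=-1.0), assembler, model]).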
0 votes · 1 answer · 940 views
{Py4JJavaError} An error occurred while calling o339.save
I trained a logistic regression model in PySpark but couldn't save the model.
Model = LogisticRegression(featuresCol='TF-IDF', labelCol='labels', maxIter=10)
lr_model = Model.fit(train_data)
type(...
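For reference, the save/load calls for a fitted pyspark.ml model usually look like the sketch below; the path is hypothetical, and the Py4JJavaError itself may well come from the environment (e.g. the output filesystem) rather than the call:

lr_model.write().overwrite().save("/tmp/lr_model")   # or lr_model.save(path)

# Loading it back later:
from pyspark.ml.classification import LogisticRegressionModel
loaded = LogisticRegressionModel.load("/tmp/lr_model")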
0 votes · 0 answers · 56 views
Understanding the constraint in Spark's StringIndexer: why must inputCols and outputCols be different?
I'm in the process of developing a data preprocessing pipeline utilizing Apache Spark, and I've encountered an intriguing behavior with the StringIndexer transformer. In my pipeline, I rely on the ...
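For context, a short sketch of the constraint being asked about: StringIndexer's output column names must differ from its input column names, so a common pattern is to add a suffix. The column names and dataframe are hypothetical:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(
    inputCols=["color", "size"],
    outputCols=["color_idx", "size_idx"],  # must not collide with inputCols
)
indexed = indexer.fit(df).transform(df)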
1 vote · 0 answers · 81 views
CountVectorizer error: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x
A LogisticRegression model throws an exception when predicting on a new dataset:
java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x.
...
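For reference, this dimension mismatch often appears when the featurizer is re-fit on the new data; a sketch of keeping the fitted stages together so the CountVectorizer vocabulary (and hence the feature width) stays fixed. Column and dataframe names are hypothetical:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="tokens", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[cv, lr]).fit(train_df)

# The new dataset goes through the same fitted stages, so the vocabulary size
# matches what the LogisticRegression coefficients expect.
predictions = model.transform(new_df)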
4 votes · 0 answers · 346 views
How to implement undersampling techniques like NearMiss, TomekLinks, ClusterCentroids, ENN using PySpark?
I'm trying to work on a fraud detection dataset from Kaggle:
Credit Card Transactions Fraud Detection Dataset
I'm working in PySpark and wish to apply undersampling techniques using PySpark. However, I ...
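For context, NearMiss, TomekLinks, ClusterCentroids, and ENN are not part of pyspark.ml; purely as a baseline, the sketch below shows plain random undersampling of the majority class with sampleBy. The label column and the 0/1 convention are assumptions:

majority_count = df.filter("is_fraud = 0").count()
minority_count = df.filter("is_fraud = 1").count()
fraction = minority_count / majority_count

# Keep all fraud rows, sample the non-fraud rows down to roughly the same size.
balanced = df.sampleBy("is_fraud", fractions={0: fraction, 1: 1.0}, seed=42)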
1 vote · 0 answers · 25 views
Spark AUC and PR-AUC not stable
When I use Spark's BinaryClassificationEvaluator, I find that with the same data and the same raw prediction and label columns, the evaluation result changes across multiple runs. This will happen ...
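For reference, a minimal sketch of the evaluator in question; the column names and the predictions dataframe are hypothetical:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC"
)
auc = evaluator.evaluate(predictions)
pr_auc = evaluator.setMetricName("areaUnderPR").evaluate(predictions)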
1 vote · 0 answers · 74 views
What is the point of VectorIndexer in pyspark?
VectorIndexer has the following purpose, as I understand it:
in VectorUDT-typed columns it converts the values it deems categorical to numerical mappings.
However, it operates only on VectorUDT types ...
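For context, a short sketch of VectorIndexer as described above: it scans a vector column and re-encodes features with at most maxCategories distinct values as category indices, leaving the rest continuous. Column names and the dataframe are hypothetical:

from pyspark.ml.feature import VectorIndexer

indexer = VectorIndexer(
    inputCol="features", outputCol="indexed_features", maxCategories=10
)
indexed = indexer.fit(df).transform(df)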
1 vote · 0 answers · 124 views
Backward compatibility issues with SparkML Model migration from scala 2.11 to scala 2.12
We're migrating our ML pipeline from Spark 2.4 (Scala 2.11.11) to Spark 3.3.0 (Scala 2.12.17), and we were not able to read the existing ML model with Spark 3.
This is because Scala doesn't support backward compatibility across major ...
0 votes · 1 answer · 328 views
Linear regression with SGD using pyspark.ml.regression.LinearRegression
I'm using the LinearRegression model in Spark ML for prediction.
from pyspark.ml.regression import LinearRegression
featureassembler = VectorAssembler(inputCols=['Year', 'Present_Price',
...
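For context, a minimal sketch of the assembler-plus-regressor pattern the excerpt is building toward; the feature columns are taken loosely from the excerpt, while the label column name and dataframe are assumptions:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(
    inputCols=["Year", "Present_Price"], outputCol="features"
)
train = assembler.transform(df).select("features", "Selling_Price")
lr = LinearRegression(featuresCol="features", labelCol="Selling_Price")
lr_model = lr.fit(train)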
1 vote · 1 answer · 3k views
ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Slave lost Driver stacktrace:
I'm trying to run an ALS model on my PySpark dataframe and I keep running into the same error:
Here's my Spark config:
spark_config["spark.executor.memory"] = "32G"
spark_config["...
0 votes · 1 answer · 221 views
How to overcome "ValueError: Resolve param in estimatorParamMaps failed" PySpark error?
I am trying to save a grid-searched PySpark TrainValidationSplitModel object, and while tuning the regularization of the logistic regression I'm getting the following strange error:
-------------------...
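For context, a minimal sketch of the grid-searched setup the question describes, ending with the save call that triggers the error; the path, column names, and dataframe are hypothetical:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
tvs = TrainValidationSplit(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    trainRatio=0.8,
)
model = tvs.fit(train_df)
model.write().overwrite().save("/tmp/tvs_lr_model")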
0 votes · 3 answers · 62 views
How to assign a value to a val using an if statement?
I am trying to convert a var assignment to a val assignment. Currently my code is
// Numerical vectorizing for normalization
var normNumericalColNameArray: Array[String] = Array()
if (!...
0 votes · 1 answer · 233 views
How to get the Spark scala correlation output as a dataframe?
I am trying to calculate correlation for all columns in a Spark dataframe using the below code.
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import ...
1 vote · 2 answers · 224 views
Retrieve categories from Spark ML OneHotEncoder?
I've used a OneHotEncoder in a Spark ML pipeline:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
schema = StructType(
[StructField("PassengerId", DoubleType()...
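For reference, a sketch of one place the original categories can be recovered: the fitted StringIndexerModel that feeds the OneHotEncoder keeps its labels, in the index order the encoder uses. Stage indices, column names, and the dataframe are hypothetical:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

indexer = StringIndexer(inputCol="Embarked", outputCol="EmbarkedIdx")
encoder = OneHotEncoder(inputCols=["EmbarkedIdx"], outputCols=["EmbarkedVec"])
model = Pipeline(stages=[indexer, encoder]).fit(df)

indexer_model = model.stages[0]
print(indexer_model.labels)  # categories in the index order used by the encoder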
1 vote · 1 answer · 1k views
Convert string into vector in Spark
I have two PySpark dataframes of the following structure. I would like to perform a cross join and calculate cosine similarity. The qry_emb is a string column with comma-separated values.
How to convert ...
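For context, one way to do this conversion, sketched under the assumption of Spark 3.1+ (for pyspark.ml.functions.array_to_vector); qry_emb comes from the question, the rest is assumed:

from pyspark.ml.functions import array_to_vector
from pyspark.sql import functions as F

# Split the comma-separated string, cast to array<double>, then wrap as a vector.
df2 = df.withColumn(
    "qry_emb_vec",
    array_to_vector(F.split("qry_emb", ",").cast("array<double>")),
)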
0 votes · 0 answers · 687 views
Pyspark AttributeError: 'NoneType' object has no attribute 'split'
I am working in PySpark using the flatMap function, and I am using split within the function. But I am getting an error that says:
AttributeError: 'NoneType' object has no attribute 'split'
I ...
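For reference, a tiny sketch of the usual guard for this error: skip None records before calling split inside flatMap. The sample RDD is hypothetical and an active SparkSession named spark is assumed:

rdd = spark.sparkContext.parallelize(["a b", None, "c"])
words = rdd.flatMap(lambda line: line.split(" ") if line is not None else [])
print(words.collect())  # ['a', 'b', 'c']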
1 vote · 1 answer · 477 views
How does the PySpark implementation of ALS handle multiple ratings per user-item combination?
I observed that the input data to ALS need not have a unique rating per user-item combination.
Here is a reproducible example.
# Sample Dataframe
df = spark.createDataFrame([(0, 0, 4.0),(0, 1, 2.0),
(1,...
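This is not a statement of what ALS does internally; purely as a sketch, one common pre-processing choice is to collapse duplicate (user, item) pairs to a single rating before fitting. Column and dataframe names are assumptions:

from pyspark.ml.recommendation import ALS
from pyspark.sql import functions as F

deduped = df.groupBy("user", "item").agg(F.avg("rating").alias("rating"))
als = ALS(userCol="user", itemCol="item", ratingCol="rating", seed=42)
model = als.fit(deduped)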
0 votes · 1 answer · 658 views
preparing product purchase data for pyspark ALS implicit recommendations
I'm trying to build a product recommender. I'm using a pyspark.ml recommendation ALS matrix-factorization model. I have data like the example data below, where I have customer and product IDs and the ...
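For context, a minimal sketch of the implicit-feedback ALS setup this question is heading toward, with purchase counts used as the implicit signal; the column and dataframe names are hypothetical:

from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="customer_id",
    itemCol="product_id",
    ratingCol="purchase_count",
    implicitPrefs=True,         # treat counts as implicit feedback, not ratings
    coldStartStrategy="drop",
    seed=42,
)
model = als.fit(purchases_df)
top_items = model.recommendForAllUsers(10)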
0 votes · 1 answer · 208 views
matrix factorization model returning much smaller dataframe after predicting ratings in pyspark
I'm trying to create a product recommender with the code below. I'm using matrix factorization from spark ml. I have data that has a customer_id, product_id, and a numeric rating value that has been ...
0 votes · 1 answer · 166 views
How do I extract feature_importances from my model in SparklyR?
I would like to extract feature_importances from my model in SparklyR. So far I have the following reproducible code that is working:
library(sparklyr)
library(dplyr)
sc <- spark_connect(method = "...
1 vote · 0 answers · 413 views
pyspark PCA in EMR Jupyter Notebook returning "error payload: {"msg":"requirement failed: Session isn't active."}" error
I'm reading something like 10,000 images (3x100x100 pixels) into a PySpark dataframe, which then undergoes StandardScaler scaling and then PCA reduction to 10 dimensions.
The standard scaling works fine but ...
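For reference, a minimal sketch of the scaling-plus-PCA step described above; the column names and dataframe are hypothetical, and the "Session isn't active" error itself is environment-specific rather than part of this code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
pca = PCA(k=10, inputCol="scaled", outputCol="pca_features")
reduced = Pipeline(stages=[scaler, pca]).fit(df).transform(df)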