934 questions
1 vote · 0 answers · 126 views
IllegalThreadStateException in Spark 4.0 when Adaptive Query Execution (AQE) is active and when running many queries in the same Spark instance
Upon upgrading to Spark 4, we get (deterministically) an IllegalThreadStateException in a long series of queries involving spark.ml or Delta Lake (e.g. in estimator.fit()) in the same long-running Spark ...
0 votes · 0 answers · 15 views
How to reproduce the results of an ML model in Spark? [duplicate]
I am creating a machine learning model (random forest) in Spark (PySpark) with cross-validation and grid search. I have two dataframes: one for training and one for testing, both stored in Parquet.
...
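For context, a minimal sketch of the kind of setup described, with explicit seeds everywhere randomness enters (the fold split as well as the trees); the column names, grid values, data path, and the active SparkSession named spark are assumptions:

# Hypothetical reproducibility sketch: fix seeds on the estimator and the cross-validator.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train_df = spark.read.parquet("train.parquet")  # hypothetical path

rf = RandomForestClassifier(featuresCol="features", labelCol="label", seed=42)
grid = ParamGridBuilder().addGrid(rf.numTrees, [50, 100]).build()
cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
    numFolds=3,
    seed=42,  # the train/validation fold assignment is also randomized
)
cv_model = cv.fit(train_df)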
0 votes · 0 answers · 20 views
Spark map() operation fails with NotSerializableException
I am encountering a NotSerializableException while using a map() transformation in Spark, despite the object used in the transformation being serializable. The issue arises when I try to apply a ...
0 votes · 1 answer · 299 views
How to Perform Inference in Spark with an XGBoost Model Trained in scikit-learn?
I have an xgboost model that was trained with scikit-learn in plain Python.
How can I use that model to run inference on a new dataset in PySpark?
Apart from using UDFs, what other options do I have?...
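As a point of reference, a sketch of the UDF baseline the question mentions, assuming the scikit-learn/XGBoost model can be loaded on the driver and broadcast; the file path, feature columns, dataframe name, and the active SparkSession named spark are all hypothetical:

import joblib
import pandas as pd
from pyspark.sql.functions import pandas_udf

model = joblib.load("xgb_model.joblib")          # hypothetical path
bc_model = spark.sparkContext.broadcast(model)   # ship the model once per executor

@pandas_udf("double")
def predict_udf(f1: pd.Series, f2: pd.Series) -> pd.Series:
    # Rebuild the feature matrix the sklearn model expects and score in batches.
    X = pd.concat([f1, f2], axis=1).values
    return pd.Series(bc_model.value.predict(X))

scored = df.withColumn("prediction", predict_udf("feature_1", "feature_2"))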
1 vote · 0 answers · 105 views
How to create my own custom imputer to impute constant values seamlessly in pyspark.ml pipelines
I would like to optimize the imputation of missing values on my dataset through a CV search. This is trivial to do in sklearn, with which I am familiar -- however, I am for the first time working with ...
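For context, a minimal sketch (not the asker's code) of a constant-value imputer that can sit inside a pyspark.ml Pipeline; the class name, params, and column names are hypothetical:

# Hypothetical constant-value imputer as a pyspark.ml Transformer.
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class ConstantImputer(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    inputCol = Param(Params._dummy(), "inputCol", "column to impute")
    fillValue = Param(Params._dummy(), "fillValue", "constant used for missing values")

    @keyword_only
    def __init__(self, inputCol=None, fillValue=0.0):
        super().__init__()
        self._setDefault(fillValue=0.0)
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def _transform(self, df):
        col = self.getOrDefault(self.inputCol)
        value = self.getOrDefault(self.fillValue)
        # Replace nulls in the input column with the configured constant.
        return df.withColumn(col, F.coalesce(F.col(col), F.lit(value)))

It can then be dropped into a Pipeline like any other stage, e.g. Pipeline(stages=[ConstantImputer(inputCol="age", fillValue=-1.0), assembler, model]).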
0 votes · 1 answer · 940 views
{Py4JJavaError} An error occurred while calling o339.save
I trained a logistic regression model in PySpark but couldn't save the model.
Model = LogisticRegression(featuresCol='TF-IDF', labelCol='labels', maxIter=10)
lr_model = Model.fit(train_data)
type(...
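For reference, the save/load calls for a fitted pyspark.ml model usually look like the sketch below; the path is hypothetical, and the Py4JJavaError itself may well come from the environment (e.g. the output filesystem) rather than the call:

lr_model.write().overwrite().save("/tmp/lr_model")   # or lr_model.save(path)

# Loading it back later:
from pyspark.ml.classification import LogisticRegressionModel
loaded = LogisticRegressionModel.load("/tmp/lr_model")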
0 votes · 0 answers · 56 views
Understanding the constraint in Spark's StringIndexer: why must inputCols and outputCols be different?
I'm in the process of developing a data preprocessing pipeline utilizing Apache Spark, and I've encountered an intriguing behavior with the StringIndexer transformer. In my pipeline, I rely on the ...
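For context, a short sketch of the constraint being asked about: StringIndexer's output column names must differ from its input column names, so a common pattern is to add a suffix. The column names and dataframe are hypothetical:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(
    inputCols=["color", "size"],
    outputCols=["color_idx", "size_idx"],  # must not collide with inputCols
)
indexed = indexer.fit(df).transform(df)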
1 vote · 0 answers · 81 views
CountVectorizer error: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x
A LogisticRegression model throws an exception when predicting on a new dataset:
java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x.
...
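For reference, this dimension mismatch often appears when the featurizer is re-fit on the new data; a sketch of keeping the fitted stages together so the CountVectorizer vocabulary (and hence the feature width) stays fixed. Column and dataframe names are hypothetical:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="tokens", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[cv, lr]).fit(train_df)

# The new dataset goes through the same fitted stages, so the vocabulary size
# matches what the LogisticRegression coefficients expect.
predictions = model.transform(new_df)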
4 votes · 0 answers · 346 views
How to implement undersampling techniques like NearMiss, TomekLinks, ClusterCentroids, ENN using PySpark?
I'm trying to work on a fraud detection dataset from Kaggle:
Credit Card Transactions Fraud Detection Dataset
I'm working in PySpark and wish to apply undersampling techniques using PySpark. However, I ...
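For context, NearMiss, TomekLinks, ClusterCentroids, and ENN are not part of pyspark.ml; purely as a baseline, the sketch below shows plain random undersampling of the majority class with sampleBy. The label column and the 0/1 convention are assumptions:

majority_count = df.filter("is_fraud = 0").count()
minority_count = df.filter("is_fraud = 1").count()
fraction = minority_count / majority_count

# Keep all fraud rows, sample the non-fraud rows down to roughly the same size.
balanced = df.sampleBy("is_fraud", fractions={0: fraction, 1: 1.0}, seed=42)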
1 vote · 0 answers · 25 views
Spark AUC and PR-AUC not stable
When I use Spark's BinaryClassificationEvaluator, I find that with the same data and the same raw prediction and label columns, the evaluation result changes across multiple runs. This will happen ...
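For reference, a minimal sketch of the evaluator in question; the column names and the predictions dataframe are hypothetical:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC"
)
auc = evaluator.evaluate(predictions)
pr_auc = evaluator.setMetricName("areaUnderPR").evaluate(predictions)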
1 vote · 0 answers · 74 views
What is the point of VectorIndexer in pyspark?
VectorIndexer has the following purpose, as I understand it:
in VectorUDT-typed columns it converts the values it deems categorical to numerical mappings.
However, it operates only on VectorUDT types ...
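For context, a short sketch of VectorIndexer as described above: it scans a vector column and re-encodes features with at most maxCategories distinct values as category indices, leaving the rest continuous. Column names and the dataframe are hypothetical:

from pyspark.ml.feature import VectorIndexer

indexer = VectorIndexer(
    inputCol="features", outputCol="indexed_features", maxCategories=10
)
indexed = indexer.fit(df).transform(df)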
1 vote · 0 answers · 124 views
Backward compatibility issues with SparkML Model migration from scala 2.11 to scala 2.12
We're migrating our ML pipeline from Spark 2.4 (Scala 2.11.11) to Spark 3.3.0 (Scala 2.12.17), and we were not able to read the existing ML model with Spark 3.
This is because Scala doesn't support backward compatibility across major ...
0 votes · 1 answer · 328 views
Linear regression with SGD using pyspark.ml.regression.LinearRegression
I'm using the LinearRegression model in Spark ML for prediction.
from pyspark.ml.regression import LinearRegression
featureassembler = VectorAssembler(inputCols=['Year', 'Present_Price',
...
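For context, a minimal sketch of the assembler-plus-regressor pattern the excerpt is building toward; the feature columns are taken loosely from the excerpt, while the label column name and dataframe are assumptions:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(
    inputCols=["Year", "Present_Price"], outputCol="features"
)
train = assembler.transform(df).select("features", "Selling_Price")
lr = LinearRegression(featuresCol="features", labelCol="Selling_Price")
lr_model = lr.fit(train)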
1 vote · 1 answer · 3k views
ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Slave lost Driver stacktrace:
I'm trying to run an ALS model on my PySpark dataframe and I keep running into the same error:
Here's my Spark config:
spark_config["spark.executor.memory"] = "32G"
spark_config["...
0 votes · 1 answer · 221 views
How to overcome "ValueError: Resolve param in estimatorParamMaps failed" PySpark error?
I am trying to save a grid-searched PySpark TrainValidationSplitModel object, and while tuning the regularization of the logistic regression I'm getting the following strange error:
-------------------...
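For context, a minimal sketch of the grid-searched setup the question describes, ending with the save call that triggers the error; the path, column names, and dataframe are hypothetical:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
tvs = TrainValidationSplit(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    trainRatio=0.8,
)
model = tvs.fit(train_df)
model.write().overwrite().save("/tmp/tvs_lr_model")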
0 votes · 3 answers · 62 views
How to assign a value to a val using an if statement?
I am trying to convert a var assignment to a val assignment. Currently my code is
// Numerical vectorizing for normalization
var normNumericalColNameArray: Array[String] = Array()
if (!...
0 votes · 1 answer · 233 views
How to get the Spark scala correlation output as a dataframe?
I am trying to calculate correlation for all columns in a Spark dataframe using the below code.
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import ...
1 vote · 2 answers · 224 views
Retrieve categories from Spark ML OneHotEncoder?
I've used a OneHotEncoder in a Spark ML pipeline:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
schema = StructType(
[StructField("PassengerId", DoubleType()...
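For reference, a sketch of one place the original categories can be recovered: the fitted StringIndexerModel that feeds the OneHotEncoder keeps its labels, in the index order the encoder uses. Stage indices, column names, and the dataframe are hypothetical:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

indexer = StringIndexer(inputCol="Embarked", outputCol="EmbarkedIdx")
encoder = OneHotEncoder(inputCols=["EmbarkedIdx"], outputCols=["EmbarkedVec"])
model = Pipeline(stages=[indexer, encoder]).fit(df)

indexer_model = model.stages[0]
print(indexer_model.labels)  # categories in the index order used by the encoder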
1 vote · 1 answer · 1k views
Convert string into vector in Spark
I have two PySpark dataframes of the following structure. I would like to perform a cross join and calculate cosine similarity. The qry_emb is a string column with comma-separated values.
How to convert ...
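For context, one way to do this conversion, sketched under the assumption of Spark 3.1+ (for pyspark.ml.functions.array_to_vector); qry_emb comes from the question, the rest is assumed:

from pyspark.ml.functions import array_to_vector
from pyspark.sql import functions as F

# Split the comma-separated string, cast to array<double>, then wrap as a vector.
df2 = df.withColumn(
    "qry_emb_vec",
    array_to_vector(F.split("qry_emb", ",").cast("array<double>")),
)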
0 votes · 0 answers · 687 views
Pyspark AttributeError: 'NoneType' object has no attribute 'split'
I am working in PySpark using the flatMap function, and I am using split within the function. But I am getting an error that says:
AttributeError: 'NoneType' object has no attribute 'split'
I ...
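For reference, a tiny sketch of the usual guard for this error: skip None records before calling split inside flatMap. The sample RDD is hypothetical and an active SparkSession named spark is assumed:

rdd = spark.sparkContext.parallelize(["a b", None, "c"])
words = rdd.flatMap(lambda line: line.split(" ") if line is not None else [])
print(words.collect())  # ['a', 'b', 'c']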
1 vote · 1 answer · 477 views
How does the PySpark implementation of ALS handle multiple ratings per user-item combination?
I observed that the input data to ALS need not have a unique rating per user-item combination.
Here is a reproducible example.
# Sample Dataframe
df = spark.createDataFrame([(0, 0, 4.0),(0, 1, 2.0),
(1,...
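This is not a statement of what ALS does internally; purely as a sketch, one common pre-processing choice is to collapse duplicate (user, item) pairs to a single rating before fitting. Column and dataframe names are assumptions:

from pyspark.ml.recommendation import ALS
from pyspark.sql import functions as F

deduped = df.groupBy("user", "item").agg(F.avg("rating").alias("rating"))
als = ALS(userCol="user", itemCol="item", ratingCol="rating", seed=42)
model = als.fit(deduped)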
0 votes · 1 answer · 658 views
preparing product purchase data for pyspark ALS implicit recommendations
I'm trying to build a product recommender. I'm using a pyspark.ml recommendation ALS matrix-factorization model. I have data like the example data below, where I have customer and product IDs and the ...
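For context, a minimal sketch of the implicit-feedback ALS setup this question is heading toward, with purchase counts used as the implicit signal; the column and dataframe names are hypothetical:

from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="customer_id",
    itemCol="product_id",
    ratingCol="purchase_count",
    implicitPrefs=True,         # treat counts as implicit feedback, not ratings
    coldStartStrategy="drop",
    seed=42,
)
model = als.fit(purchases_df)
top_items = model.recommendForAllUsers(10)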
0 votes · 1 answer · 208 views
matrix factorization model returning much smaller dataframe after predicting ratings in pyspark
I'm trying to create a product recommender with the code below. I'm using matrix factorization from spark ml. I have data that has a customer_id, product_id, and a numeric rating value that has been ...
0 votes · 1 answer · 166 views
How do I extract feature_importances from my model in SparklyR?
I would like to extract feature_importances from my model in SparklyR. So far I have the following reproducible code that is working:
library(sparklyr)
library(dplyr)
sc <- spark_connect(method = "...
1 vote · 0 answers · 413 views
pyspark PCA in EMR Jupyter Notebook returning "error payload: {"msg":"requirement failed: Session isn't active."}" error
I'm reading something like 10,000 images (3x100x100 pixels) into a PySpark dataframe, which then undergoes StandardScaler scaling and then PCA reduction to 10 dimensions.
The standard scaling works fine but ...
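For reference, a minimal sketch of the scaling-plus-PCA step described above; the column names and dataframe are hypothetical, and the "Session isn't active" error itself is environment-specific rather than part of this code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
pca = PCA(k=10, inputCol="scaled", outputCol="pca_features")
reduced = Pipeline(stages=[scaler, pca]).fit(df).transform(df)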