1 vote
0 answers
126 views

Upon upgrading to Spark 4, we deterministically get an IllegalThreadStateException in a long series of queries involving spark.ml or Delta Lake (e.g. in estimator.fit()) in the same long-running Spark ...
asked by Ghislain Fourny
0 votes
0 answers
15 views

I am creating a machine learning model (random forest) in Spark (PySpark) with cross-validation and grid search. I have two DataFrames: one for training and one for testing, both stored in Parquet. ...
asked by cyber-cavalera
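For reference, a minimal sketch of this setup, assuming hypothetical train_df/test_df DataFrames that already contain "features" and "label" columns:

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # train_df / test_df are assumed to be the two Parquet-backed splits.
    rf = RandomForestClassifier(featuresCol="features", labelCol="label")

    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [50, 100])
            .addGrid(rf.maxDepth, [5, 10])
            .build())

    cv = CrossValidator(estimator=rf,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(labelCol="label"),
                        numFolds=3)

    cv_model = cv.fit(train_df)                 # grid search with 3-fold CV
    predictions = cv_model.transform(test_df)   # score the held-out split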
0 votes
0 answers
20 views

I am encountering a NotSerializableException while using a map() transformation in Spark, despite the object used in the transformation being serializable. The issue arises when I try to apply a ...
asked by Sanjit Jha
0 votes
1 answer
299 views

I found an XGBoost model that was trained with sklearn in native Python. How can I use that model to run inference on a new dataset in PySpark? Apart from using UDFs, what other options do I have? ...
asked by wholesale_error
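One UDF-free alternative is DataFrame.mapInPandas, which streams pandas batches through a broadcast copy of the model. A sketch, assuming the sklearn-trained booster was pickled to a hypothetical model.pkl and that df's columns match the model's features exactly:

    import joblib
    import pandas as pd
    from pyspark.sql.types import DoubleType, StructField, StructType

    # Hypothetical path; the pickled object is the sklearn-trained XGBoost model.
    model_bc = spark.sparkContext.broadcast(joblib.load("model.pkl"))

    def predict_batches(batches):
        model = model_bc.value
        for pdf in batches:  # each batch arrives as a pandas DataFrame
            yield pd.DataFrame({"prediction": model.predict(pdf).astype("float64")})

    schema = StructType([StructField("prediction", DoubleType())])
    scored = df.mapInPandas(predict_batches, schema=schema)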
1 vote
0 answers
105 views

I would like to optimize the imputation of missing values on my dataset through a CV search. This is trivial to do in sklearn, with which I am familiar -- however, I am for the first time working with ...
asked by GaloisFan (121)
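One way to put the imputation itself under CV search is to make the Imputer a pipeline stage and grid over its strategy param. A sketch with hypothetical column names:

    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import Imputer, VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    num_cols = ["x1", "x2"]  # hypothetical numeric columns with missing values
    imputer = Imputer(inputCols=num_cols, outputCols=[c + "_imp" for c in num_cols])
    assembler = VectorAssembler(inputCols=imputer.getOutputCols(), outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="label")

    # The imputation strategy becomes just another hyperparameter.
    grid = ParamGridBuilder().addGrid(imputer.strategy, ["mean", "median"]).build()

    cv = CrossValidator(estimator=Pipeline(stages=[imputer, assembler, lr]),
                        estimatorParamMaps=grid,
                        evaluator=RegressionEvaluator(labelCol="label"),
                        numFolds=3)
    cv_model = cv.fit(df)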
0 votes
1 answer
940 views

I trained a logistic regression model in PySpark but couldn't save it. Model = LogisticRegression(featuresCol='TF-IDF', labelCol='labels', maxIter=10); lr_model = Model.fit(train_data); type(...
asked by Mohammed Thoufeeq
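Fitted spark.ml models persist through their writer and reload through the matching Model class. A minimal sketch using the question's columns and a hypothetical path:

    from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

    lr = LogisticRegression(featuresCol="TF-IDF", labelCol="labels", maxIter=10)
    lr_model = lr.fit(train_data)

    # Persist to any Hadoop-compatible path (local, HDFS, S3, ...).
    lr_model.write().overwrite().save("/tmp/lr_model")

    # Reload with the Model class, not the estimator.
    reloaded = LogisticRegressionModel.load("/tmp/lr_model")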
0 votes
0 answers
56 views

I'm in the process of developing a data preprocessing pipeline utilizing Apache Spark, and I've encountered an intriguing behavior with the StringIndexer transformer. In my pipeline, I rely on the ...
asked by jsn (81)
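The behavior people most often trip over with StringIndexer is how it treats labels unseen at fit time. A sketch of the relevant knobs, with hypothetical column names:

    from pyspark.ml.feature import StringIndexer

    # handleInvalid="keep" gives unseen labels an extra index at transform time
    # instead of raising; stringOrderType controls how indices are assigned.
    indexer = StringIndexer(inputCol="category", outputCol="category_idx",
                            handleInvalid="keep",
                            stringOrderType="frequencyDesc")
    model = indexer.fit(train_df)
    indexed = model.transform(new_df)  # unseen categories get index = numLabels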
1 vote
0 answers
81 views

My LogisticRegression model throws an exception when predicting on a new dataset: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. ...
asked by Surya (21)
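That error typically means the feature vectors at scoring time have a different length than at training time, usually because the feature transformers were refit on the new data. A sketch of the usual fix, fitting one Pipeline and reusing it, with hypothetical columns:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    stages = [
        StringIndexer(inputCol="city", outputCol="city_idx", handleInvalid="keep"),
        OneHotEncoder(inputCols=["city_idx"], outputCols=["city_vec"]),
        VectorAssembler(inputCols=["city_vec", "age"], outputCol="features"),
        LogisticRegression(featuresCol="features", labelCol="label"),
    ]
    model = Pipeline(stages=stages).fit(train_df)

    # Reusing the fitted pipeline guarantees identical vector sizes at scoring time.
    preds = model.transform(new_df)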
4 votes
0 answers
346 views

I'm working on the Credit Card Transactions Fraud Detection dataset from Kaggle. I'm using PySpark and wish to apply undersampling techniques. However, I ...
asked by Sumit (51)
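A common way to undersample in PySpark is sampleBy, keeping the minority class intact and downsampling the majority to match. A sketch assuming a hypothetical binary is_fraud label with 1 as the minority class:

    from pyspark.sql import functions as F

    counts = {row["is_fraud"]: row["count"]
              for row in df.groupBy("is_fraud").count().collect()}
    ratio = counts[1] / counts[0]  # minority / majority

    # Keep all fraud rows, downsample legit rows to roughly the same size.
    balanced = df.sampleBy("is_fraud", fractions={0: ratio, 1: 1.0}, seed=42)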
1 vote
0 answers
25 views

When I try to use Spark's BinaryClassificationEvaluator, I find that with the same data and the same raw prediction and label columns, the evaluation result changes across multiple runs. This happens ...
asked by G_cy (1,055)
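One common cause is a non-deterministic upstream lineage (e.g. an unseeded randomSplit or rand()) being recomputed for each action; the evaluator itself is deterministic on fixed data. A sketch that pins the scored rows before evaluating:

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Cache so the evaluator doesn't recompute a non-deterministic lineage.
    scored = model.transform(test_df).select("rawPrediction", "label").cache()
    scored.count()  # materialize

    evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                              labelCol="label",
                                              metricName="areaUnderROC")
    print(evaluator.evaluate(scored))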
1 vote
0 answers
74 views

VectorIndexer has the following purpose as I understand it: in VectorUDT-typed columns, it converts the values it deems categorical to numerical mappings. However, it operates only on VectorUDT types ...
asked by figs_and_nuts
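Because VectorIndexer only accepts vector columns, scalar columns have to be assembled first; any slot with at most maxCategories distinct values is then re-encoded as category indices. A sketch with hypothetical columns:

    from pyspark.ml.feature import VectorAssembler, VectorIndexer

    assembler = VectorAssembler(inputCols=["age", "gender_idx", "income"],
                                outputCol="raw_features")
    assembled = assembler.transform(df)

    # Slots with <= 10 distinct values are treated as categorical.
    indexer = VectorIndexer(inputCol="raw_features", outputCol="features",
                            maxCategories=10)
    indexed = indexer.fit(assembled).transform(assembled)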
1 vote
0 answers
124 views

We're migrating our ML pipeline from Spark 2.4 (Scala 2.11.11) to Spark 3.3.0 (Scala 2.12.17), and we were not able to read the existing ML model with Spark 3. This is because Scala doesn't support binary compatibility across major ...
asked by SivaSingh
0 votes
1 answer
328 views

I'm using the LinearRegression model in Spark ML for prediction. import pyspark.ml.regression.LinearRegression featureassembler = VectorAssembler(inputCols=['Year', 'Present_Price', ...
asked by isabella
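Note that LinearRegression must be brought in with a from-import; `import pyspark.ml.regression.LinearRegression` is not valid Python. A sketch using the question's feature columns and a hypothetical Selling_Price label:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    assembler = VectorAssembler(inputCols=["Year", "Present_Price"],
                                outputCol="features")
    train = assembler.transform(df)

    lr = LinearRegression(featuresCol="features", labelCol="Selling_Price")
    lr_model = lr.fit(train)
    preds = lr_model.transform(train)  # adds a "prediction" column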
1 vote
1 answer
3k views

I'm trying to run an ALS model on my PySpark dataframe and I'm always running into the same error. Here's my Spark config: spark_config["spark.executor.memory"] = "32G" spark_config["...
asked by Chris_007 (933)
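The error text is cut off, but memory and stack failures in ALS are often tamed by checkpointing the iteration lineage. A sketch with hypothetical column names and paths:

    from pyspark.ml.recommendation import ALS

    # Checkpointing truncates ALS's long lineage between iterations.
    spark.sparkContext.setCheckpointDir("/tmp/als_checkpoints")

    als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating",
              rank=10, maxIter=10, regParam=0.1,
              checkpointInterval=2,         # checkpoint every 2 iterations
              coldStartStrategy="drop")     # no NaN predictions for unseen ids
    model = als.fit(ratings_df)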
0 votes
1 answer
221 views

I am trying to save a grid-searched PySpark TrainValidationSplitModel object, and while tuning the regularization of the logistic regression I'm getting the following strange error: -------------------...
asked by rjpost20
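Without the full traceback this is a guess, but a common workaround when the tuned wrapper won't serialize is persisting just the winning model. A sketch with hypothetical paths:

    from pyspark.ml.tuning import TrainValidationSplitModel

    # Saving the whole tuned wrapper...
    tvs_model.write().overwrite().save("/tmp/tvs_model")
    reloaded = TrainValidationSplitModel.load("/tmp/tvs_model")

    # ...or, if that fails, persist only the best model found by the search.
    tvs_model.bestModel.write().overwrite().save("/tmp/best_model")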
0 votes
3 answers
62 views

I am trying to convert a var assignment to a val assignment. Currently my code is // Numerical vectorizing for normalization var normNumericalColNameArray: Array[String] = Array() if (!...
asked by Vinoth Manamala
0 votes
1 answer
233 views

I am trying to calculate correlation for all columns in a Spark dataframe using the below code. import org.apache.spark.ml.linalg.{Matrix, Vectors} import org.apache.spark.ml.stat.Correlation import ...
asked by Vinoth Manamala
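The question's snippet is Scala; to keep the examples here in one language, this is the equivalent pattern in PySpark: assemble all numeric columns into one vector, then compute the matrix in a single pass:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    num_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("integer", "long", "float", "double")]
    vec_df = VectorAssembler(inputCols=num_cols,
                             outputCol="features").transform(df)

    corr = Correlation.corr(vec_df, "features", "pearson").head()[0]
    print(corr.toArray())  # dense matrix, rows/columns ordered like num_cols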
1 vote
2 answers
224 views

I've used a OneHotEncoder in a Spark ML pipeline: from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler schema = StructType( [StructField("PassengerId", DoubleType()...
asked by itscarlayall
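For context, a minimal version of such a pipeline on the question's Titanic-style schema (Sex/Age/Fare are assumed columns):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    indexer = StringIndexer(inputCol="Sex", outputCol="Sex_idx",
                            handleInvalid="keep")
    encoder = OneHotEncoder(inputCols=["Sex_idx"], outputCols=["Sex_vec"])
    assembler = VectorAssembler(inputCols=["Sex_vec", "Age", "Fare"],
                                outputCol="features")

    pipeline_model = Pipeline(stages=[indexer, encoder, assembler]).fit(train_df)
    features_df = pipeline_model.transform(train_df)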
1 vote
1 answer
1k views

I have two PySpark dataframes of the following structure. I would like to perform cross join and calculate cosine similarity. The qry_emb is a string column with comma separated values. How to convert ...
asked by Sree Aurovindh
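One way to do this without UDFs: cast the comma-separated strings to array<double>, cross join, and express the cosine with Spark's higher-order array functions. A sketch assuming a second hypothetical embedding column doc_emb in df2:

    from pyspark.sql import functions as F

    to_vec = lambda c: F.transform(F.split(c, ","), lambda x: x.cast("double"))
    pairs = (df1.withColumn("qry_vec", to_vec("qry_emb"))
                .crossJoin(df2.withColumn("doc_vec", to_vec("doc_emb"))))

    # dot(a, b) / (||a|| * ||b||) via zip_with + aggregate on array columns.
    dot = F.expr("aggregate(zip_with(qry_vec, doc_vec, (x, y) -> x * y), "
                 "0D, (acc, v) -> acc + v)")
    norm = lambda c: F.sqrt(F.expr(
        f"aggregate(transform({c}, x -> x * x), 0D, (acc, v) -> acc + v)"))

    result = pairs.withColumn("cosine", dot / (norm("qry_vec") * norm("doc_vec")))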
0 votes
0 answers
687 views

I am working in PySpark, using the flatMap function with split inside it. But I am getting an error which says: AttributeError: 'NoneType' object has no attribute 'split'. I ...
asked by olasammy (7,526)
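The error means some records are None before split is called; filtering them out first avoids it. A sketch:

    # Drop None records before splitting inside flatMap.
    tokens = rdd.filter(lambda line: line is not None) \
                .flatMap(lambda line: line.split(" "))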
1 vote
1 answer
477 views

I observed that the input data to ALS need not have a unique rating per user-item combination. Here is a reproducible example. # Sample Dataframe df = spark.createDataFrame([(0, 0, 4.0),(0, 1, 2.0), (1,...
asked by Aditya Kansal
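ALS does not deduplicate, so collapsing each user-item pair to a single rating before fitting keeps the target unambiguous. A sketch extending the question's sample with a hypothetical duplicate pair:

    from pyspark.sql import functions as F
    from pyspark.ml.recommendation import ALS

    df = spark.createDataFrame(
        [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (0, 0, 1.0)],  # (0, 0) twice
        ["user", "item", "rating"])

    # One averaged rating per user-item pair.
    dedup = df.groupBy("user", "item").agg(F.avg("rating").alias("rating"))

    model = ALS(userCol="user", itemCol="item", ratingCol="rating").fit(dedup)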
0 votes
1 answer
658 views

I'm trying to build a product recommender. I'm using a PySpark ML recommendation ALS matrix factorization model. I have data like the example data below, where I have customer and product id and the ...
asked by user3476463 (4,615)
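A minimal end-to-end sketch of such a recommender, assuming integer customer/product ids (ALS requires numeric ids) and a hypothetical ratings_df with the three columns described:

    from pyspark.ml.recommendation import ALS

    als = ALS(userCol="customer_id", itemCol="product_id", ratingCol="rating",
              coldStartStrategy="drop")  # skip users/items unseen at fit time
    model = als.fit(ratings_df)

    # Top-10 product recommendations per customer.
    user_recs = model.recommendForAllUsers(10)
    user_recs.show(truncate=False)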
0 votes
1 answer
208 views

I'm trying to create a product recommender with the code below. I'm using matrix factorization from spark ml. I have data that has a customer_id, product_id, and a numeric rating value that has been ...
asked by user3476463 (4,615)
0 votes
1 answer
166 views

I would like to extract feature_importances from my model in SparklyR. So far I have the following reproducible code that is working: library(sparklyr) library(dplyr) sc <- spark_connect(method = "...
asked by piper180 (379)
1 vote
0 answers
413 views

I'm reading something like 10,000 images (3x100x100 pixels) into a PySpark DataFrame, which then undergoes standard scaling and PCA reduction to 10 dimensions. The StandardScaler step works fine but ...
asked by NicolasRx
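A sketch of that scaling-then-PCA stage, assuming the flattened images sit in a vector column named "features"; with 30,000-dimensional inputs the PCA covariance computation is driver-heavy, so the driver needs generous memory:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import PCA, StandardScaler

    scaler = StandardScaler(inputCol="features", outputCol="scaled",
                            withMean=True, withStd=True)
    pca = PCA(k=10, inputCol="scaled", outputCol="pca_features")

    model = Pipeline(stages=[scaler, pca]).fit(images_df)
    reduced = model.transform(images_df).select("pca_features")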
