
All Questions

1 vote · 0 answers · 72 views

What is the point of VectorIndexer in pyspark?

VectorIndexer has the following purpose as I understand it: in VectorUDT-typed columns it converts the values it deems categorical to numerical mappings. However, it operates only on VectorUDT types ...
asked by figs_and_nuts
0 votes · 1 answer · 719 views

One-Hot Encoding a list feature in Pyspark

I would like to prepare my dataset to be used by machine learning algorithms. I have a feature composed of the list of tags associated with every TV series (my records). Is it possible to apply the ...
asked by Lorenzo Maggio
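The question above asks whether one-hot (really multi-hot) encoding can be applied to a list-of-tags feature. A plain-Python sketch of the idea — in PySpark itself, `CountVectorizer` with `binary=True` on the array column achieves the same thing; the `multi_hot` helper below is illustrative, not a Spark API:

```python
def multi_hot(records, vocabulary=None):
    """Turn each record's tag list into a 0/1 vector over the vocabulary."""
    if vocabulary is None:
        # Build the vocabulary from the data, sorted for a stable column order.
        vocabulary = sorted({tag for tags in records for tag in tags})
    index = {tag: i for i, tag in enumerate(vocabulary)}
    vectors = []
    for tags in records:
        row = [0] * len(vocabulary)
        for tag in tags:
            if tag in index:
                row[index[tag]] = 1  # multi-hot: a record may set several positions
        vectors.append(row)
    return vocabulary, vectors

vocab, vecs = multi_hot([["drama", "crime"], ["comedy"], ["crime"]])
```

Each TV series ends up as a fixed-length binary vector, which is what most ML estimators expect.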
0 votes · 1 answer · 232 views

Spark ALS model.transform(test) drops rows from test. What could be the reason?

test (a table with columns: user_id, item_id, rating, with 6.2M rows)
als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating", ...
asked by Anmol Deep
0 votes · 1 answer · 1k views

Apply Vectors.dense() to an array&lt;float&gt; column in pyspark 3.2.1

In order to apply PCA from pyspark.ml.feature, I need to convert an org.apache.spark.sql.types.ArrayType:array&lt;float&gt; to org.apache.spark.ml.linalg.VectorUDT. Say I have the following dataframe: ...
asked by W.314
1 vote · 0 answers · 278 views

Trying to submit a Spark application in local mode, getting the error "Cannot load main class from JAR"

I am trying to submit a Spark application locally and I am getting the error below. Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file: ...
asked by Rahul
0 votes · 0 answers · 499 views

Pyspark - specifying actual size for train test split instead of ratio?

Is it possible to split a dataframe into training and testing sets by specifying the actual sizes I want instead of using a ratio? I see most examples use randomSplit: 463715 samples for training, 51630 ...
asked by James Omnipotent
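`randomSplit` only honors its weights approximately, so an exact count takes a shuffle-and-slice instead. A plain-Python sketch of that approach — `exact_split` is an illustrative helper, not a Spark API; on a Spark DataFrame the analogous trick is to add a random or row-number column, sort by it, and filter on the rank:

```python
import random

def exact_split(rows, n_train, seed=42):
    """Shuffle once, then slice, to get an exact-size train/test split."""
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    shuffled = rows[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

train, test = exact_split(list(range(100)), n_train=90)
```

Unlike a ratio-based split, the sizes here are guaranteed, not merely expected.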
0 votes · 1 answer · 3k views

Perform NGram on Spark DataFrame

I'm using Spark 2.3.1, and I have a Spark DataFrame like this:
+----------+
|    values|
+----------+
|embodiment|
|   present|
| invention|
|   include|
|   pairing|
|       two|
|  wireless|
|    device|
...
asked by Achyut Vyas
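`pyspark.ml.feature.NGram` turns an array-of-tokens column into space-joined sliding windows. The same operation in plain Python — `ngrams` below is an illustrative helper, not the Spark class:

```python
def ngrams(tokens, n):
    """Return the list of n-grams as space-joined strings,
    mirroring what NGram produces from an array column."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["embodiment", "present", "invention", "include"], 2)
# ["embodiment present", "present invention", "invention include"]
```

Note that an input shorter than `n` simply yields an empty list, which matches the sliding-window definition.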
1 vote · 1 answer · 3k views

Failed to execute user defined function RegexTokenizer in Pyspark

I am trying to perform text classification using a text feature in the data with Pyspark. Below is my code for text preprocessing, and it fails with "failed to execute user defined function" ...
asked by user11619814
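This error commonly surfaces when the input column contains nulls or non-string values. A plain-Python sketch of a tokenizer that guards against that — `tokenize` is illustrative, and the assumed Spark-side fix is typically `df.na.drop(subset=["text"])` or `df.na.fill("")` before running RegexTokenizer:

```python
import re

def tokenize(text, pattern=r"\W+"):
    """Lowercase and split on non-word characters, guarding against
    the null values that typically make a tokenizer UDF fail."""
    if text is None:
        return []  # a null row produces no tokens instead of an exception
    return [t for t in re.split(pattern, text.lower()) if t]

tokens = tokenize("Failed to execute user defined function!")
empty = tokenize(None)
```

The `None` guard is the whole point: without it, `text.lower()` raises on the first null row.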
0 votes · 1 answer · 91 views

Pyspark NLTK save output

I'm using Spark 2.3.1 and I'm running NLTK on thousands of input files. From the input files I'm extracting unigram, bigram, and trigram words and saving them in different dataframes. Now I want to save ...
asked by Achyut Vyas
1 vote · 0 answers · 305 views

PySpark Row with Label and Features vs LabeledPoint

I saw a PySpark Spark SQL example where this syntax is used to do something similar to what creating a LabeledPoint in Spark MLlib does: from pyspark.sql import Row from pyspark.mllib.linalg ...
asked by Odisseo
2 votes · 0 answers · 471 views

Getting null when trying to change datatype in pyspark

I have a dataset C1.txt that has one column named features. All the rows are strings and represent x and y, the coordinates of a two-dimensional point. I want to change the type to double, but when I'm ...
asked by MH.AI.eAgLe
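Casting a whole "x y" string to double yields null, because the string is not itself a number — each coordinate has to be split out and cast separately. A plain-Python sketch of the per-row parse; `parse_point` is an illustrative helper, and the Spark-side equivalent would be an assumed `split`-then-`cast` on the column:

```python
def parse_point(s):
    """Parse an 'x y' string row into a pair of floats.
    Casting the whole string at once cannot work; each
    coordinate is converted on its own."""
    x, y = s.split()
    return float(x), float(y)

point = parse_point("1.5 2.5")
```

If any row fails to split into exactly two fields, this raises immediately, which is usually preferable to silent nulls.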
1 vote · 1 answer · 291 views

Pyspark: Extract Multiclass Classification results as different columns

I'm using the RandomForestClassifier object for a multiclass classification problem. The output dataframe of the prediction presents the 'probability' column as a vector: df.select('probability')....
asked by paolof89
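A common route in Spark 3 is `pyspark.ml.functions.vector_to_array` followed by a `getItem(i)` per class; the plain-Python sketch below shows just the reshaping step — `expand_probability` and the `class_i_prob` column names are illustrative, not Spark APIs:

```python
def expand_probability(prob_rows, n_classes):
    """Turn each list-valued 'probability' row into one named
    column per class, so downstream code can select them directly."""
    return [
        {f"class_{i}_prob": row[i] for i in range(n_classes)}
        for row in prob_rows
    ]

cols = expand_probability([[0.1, 0.7, 0.2]], 3)
```

The number of classes must be known up front, since a vector column carries no per-position names of its own.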
2 votes · 1 answer · 2k views

pyspark ml model map id column after prediction

I have trained a classification model using pyspark.ml.classification.RandomForestClassifier and applied it on a new dataset for prediction. I am removing the customer_id column before feeding the ...
asked by Mrinal
0 votes · 1 answer · 2k views

In pyspark, how to define the schema for a list of lists with a datatype

I want col4 and col5 to come out as ArrayType, but they are coming as StringType. This is in pyspark. I want to know how we can do this. col4 --array (nullable = true) |-- element: IntegerType() (...
asked by sonu
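If the array arrives serialized as a string like "[1, 2, 3]", the usual Spark-side fix is `from_json` with an `ArrayType(IntegerType())` schema. A plain-Python sketch of the same parse; `parse_int_array` is an illustrative helper, not a Spark function:

```python
import json

def parse_int_array(s):
    """Parse a string such as "[1, 2, 3]" into a real list of ints,
    the plain-Python analogue of from_json with ArrayType(IntegerType())."""
    return [int(x) for x in json.loads(s)]

values = parse_int_array("[1, 2, 3]")
```

Defining the column as ArrayType in the schema up front avoids the round-trip through strings entirely when the source data allows it.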
0 votes · 0 answers · 860 views

How to fix NULL when fitting train_data in linear regression model?

I am using spark.ml to run a linear regression model. But whenever I fit my training data to the model it gives me an error of scala.MatchError: [null,1.0,[136.0,21.0,25.0]] (of class org.apache.spark....
asked by Japneet Singh
