All Questions
96 questions
1 vote · 0 answers · 72 views
What is the point of VectorIndexer in pyspark?
VectorIndexer has the following purpose, as I understand it:
In VectorUDT-typed columns it converts the values it deems categorical to numerical mappings.
However, it operates only on VectorUDT types ...
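The rule VectorIndexer applies is: for each slot of the vector, count the distinct values seen during fitting; a slot with at most `maxCategories` distinct values is treated as categorical and its values are re-mapped to indices, while all other slots pass through unchanged. A plain-Python sketch of that core idea (a simplification of `pyspark.ml.feature.VectorIndexer`; function and variable names here are illustrative, not Spark API):

```python
# Sketch of VectorIndexer's core rule: a vector slot whose number of
# distinct values is <= max_categories is treated as categorical and
# remapped to indices 0..k-1; other slots are left as continuous.
def index_vectors(vectors, max_categories=4):
    n_slots = len(vectors[0])
    maps = {}
    for i in range(n_slots):
        distinct = sorted({v[i] for v in vectors})
        if len(distinct) <= max_categories:
            maps[i] = {val: idx for idx, val in enumerate(distinct)}
    indexed = [
        [maps[i][v] if i in maps else v for i, v in enumerate(vec)]
        for vec in vectors
    ]
    return indexed, maps

vecs = [[1.0, -1.0], [2.0, 3.5], [1.0, 7.2]]
indexed, maps = index_vectors(vecs, max_categories=2)
# slot 0 has 2 distinct values -> categorical, remapped via {1.0: 0, 2.0: 1}
# slot 1 has 3 distinct values -> left continuous
```

The point of the transformer, then, is to let downstream algorithms (e.g. tree learners) know which vector slots are categorical via column metadata, without the user hand-labelling them.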
0 votes · 1 answer · 719 views
One-Hot Encoding to a list feature. Pyspark
I would like to prepare my dataset to be used by machine learning algorithms. I have a feature composed of a list of the tags associated with every TV series (my records).
Is it possible to apply the ...
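Since the feature is a list of tags per record (not a single category), the usual Spark approach is `CountVectorizer` with `binary=True` rather than `OneHotEncoder`, producing one 0/1 slot per tag in the vocabulary. The underlying multi-hot encoding, sketched in plain Python (tag values are made up for illustration):

```python
# Multi-hot encode a list-of-tags feature: one 0/1 slot per vocabulary
# tag, which is what CountVectorizer(binary=True) produces in Spark.
def multi_hot(tag_lists):
    vocab = sorted({t for tags in tag_lists for t in tags})
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(vocab)
        for t in tags:
            row[index[t]] = 1
        rows.append(row)
    return vocab, rows

vocab, rows = multi_hot([["drama", "crime"], ["comedy"], ["crime"]])
```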
0 votes · 1 answer · 232 views
Spark ALS model.transform(test) drops rows from test. What could be the reason?
test (a table with columns: user_id, item_id, rating, with 6.2M rows)
als = ALS(userCol="user_id",
          itemCol="item_id",
          ratingCol="rating",
          ...
0 votes · 1 answer · 1k views
Apply vectors.Dense() to an array float column in pyspark 3.2.1
In order to apply PCA from pyspark.ml.feature, I need to convert an org.apache.spark.sql.types.ArrayType:array<float> to org.apache.spark.ml.linalg.VectorUDT.
Say I have the following dataframe:
...
1 vote · 0 answers · 278 views
Trying to submit Spark application in local mode, getting the error "Cannot load main class from JAR"
I am trying to submit a Spark application locally and I am getting the error below.
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file: ...
0 votes · 0 answers · 499 views
Pyspark - specifying actual size for train test split instead of ratio?
Is it possible to split a dataframe into training and testing sets by specifying the actual sizes I want instead of using a ratio? I see most examples use randomSplit...
463715 samples for training
51630 ...
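`randomSplit` only takes weights, so the resulting split sizes are approximate. One common workaround is to impose a random order once and then slice by exact counts (in Spark this is typically done by sorting on a random column, taking `limit(n)` for the train set, and anti-joining for the test set). The logic, sketched in plain Python:

```python
import random

# Split a dataset into train/test by exact counts instead of a ratio:
# shuffle once with a fixed seed, then slice at the requested size.
def split_by_size(rows, n_train, seed=42):
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

data = list(range(100))
train, test = split_by_size(data, 80)
```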
0 votes · 1 answer · 3k views
Perform NGram on Spark DataFrame
I'm using Spark 2.3.1 and I have a Spark DataFrame like this:
+----------+
| values|
+----------+
|embodiment|
| present|
| invention|
| include|
| pairing|
| two|
| wireless|
| device|
| ...
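Note that the DataFrame above has one word per row, while `pyspark.ml.feature.NGram` expects an array-of-tokens column per row, so the words must first be collected into arrays. The n-gram computation itself joins each run of n consecutive tokens with a single space, which can be sketched in plain Python:

```python
# Mirror of what pyspark.ml.feature.NGram computes per row: every group
# of n consecutive tokens joined by a single space.
def ngrams(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["embodiment", "present", "invention", "include"]
bigrams = ngrams(tokens, 2)
```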
1 vote · 1 answer · 3k views
Failed to execute user defined function RegexTokenizer in Pyspark
I am trying to perform text classification using a text feature in the data with Pyspark. Below is my code for text preprocessing, and it is giving "failed to execute user defined function" ...
0 votes · 1 answer · 91 views
Pyspark NLTK save output
I'm using Spark 2.3.1 and running NLTK on thousands of input files.
From the input files I'm extracting unigram, bigram and trigram words and saving them in different dataframes.
Now I want to save ...
1 vote · 0 answers · 305 views
PySpark Row with Label and Features vs LabeledPoint
I saw a PySpark SQL example where the following syntax is used to do something similar to creating a LabeledPoint in Spark MLlib:
from pyspark.sql import Row
from pyspark.mllib.linalg ...
2 votes · 0 answers · 471 views
Getting null when trying to change datatype in pyspark
I have a dataset C1.txt that has one column named features. All the rows are strings representing x and y, the coordinates of a two-dimensional point. I want to change the type to double, but when I'm ...
1 vote · 1 answer · 291 views
Pyspark: Extract Multiclass Classification results as different columns
I'm using the RandomForestClassifier object for a multiclass classification problem.
The output dataframe of the prediction presents the 'probability' column as a vector:
df.select('probability')....
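The `probability` column holds one DenseVector of per-class probabilities per row, so extracting class i is just indexing into that vector (in PySpark this is usually done with `pyspark.ml.functions.vector_to_array` followed by `getItem(i)`, or a UDF). The unpacking logic in plain Python (the `prob_class_i` column names are illustrative):

```python
# Unpack a per-row probability vector into one column per class, which
# is what a vector_to_array + getItem(i) pipeline produces in Spark.
def unpack_probabilities(prob_rows, n_classes):
    return {
        f"prob_class_{i}": [row[i] for row in prob_rows]
        for i in range(n_classes)
    }

probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
cols = unpack_probabilities(probs, 3)
```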
2 votes · 1 answer · 2k views
pyspark ml model map id column after prediction
I have trained a classification model using pyspark.ml.classification.RandomForestClassifier and applied it on a new dataset for prediction.
I am removing the customer_id column before feeding the ...
0 votes · 1 answer · 2k views
In pyspark how to define the schema for list of list with datatype
I want col4 and col5 to come through as ArrayType, but they are coming as StringType. This is in pyspark.
I want to know how we can do this.
col4 -- array (nullable = true)
 |-- element: IntegerType() (...
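A common cause of this is that the arrays arrive serialized as strings (e.g. "[1, 2, 3]"); in Spark you would then parse them with `from_json` and an explicit `ArrayType(IntegerType())` schema (or supply the schema when reading). The per-row parsing idea, sketched in plain Python:

```python
import json

# Per-row equivalent of casting a StringType column holding "[1, 2, 3]"
# to ArrayType(IntegerType()) via from_json in Spark.
def parse_int_array(s):
    return [int(x) for x in json.loads(s)]

parsed = parse_int_array("[1, 2, 3]")
```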
0 votes · 0 answers · 860 views
How to fix NULL when fitting train_data in a linear regression model?
I am using spark.ml to run a linear regression model. But whenever I fit my train data to the model, it gives me an error of scala.MatchError: [null,1.0,[136.0,21.0,25.0]] (of class org.apache.spark....