All Questions
96 questions
1 vote · 0 answers · 72 views
What is the point of VectorIndexer in pyspark?
VectorIndexer has the following purpose, as I understand it:
In VectorUDT-typed columns it converts the values it deems categorical to numerical mappings.
However, it operates only on VectorUDT types ...
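The rule VectorIndexer applies is: for each slot of the vector, count the distinct values seen during fitting; a slot with at most `maxCategories` distinct values is treated as categorical and its values are re-mapped to indices, while all other slots pass through unchanged. A plain-Python sketch of that core idea (a simplification of `pyspark.ml.feature.VectorIndexer`; function and variable names here are illustrative, not Spark API):

```python
# Sketch of VectorIndexer's core rule: a vector slot whose number of
# distinct values is <= max_categories is treated as categorical and
# remapped to indices 0..k-1; other slots are left as continuous.
def index_vectors(vectors, max_categories=4):
    n_slots = len(vectors[0])
    maps = {}
    for i in range(n_slots):
        distinct = sorted({v[i] for v in vectors})
        if len(distinct) <= max_categories:
            maps[i] = {val: idx for idx, val in enumerate(distinct)}
    indexed = [
        [maps[i][v] if i in maps else v for i, v in enumerate(vec)]
        for vec in vectors
    ]
    return indexed, maps

vecs = [[1.0, -1.0], [2.0, 3.5], [1.0, 7.2]]
indexed, maps = index_vectors(vecs, max_categories=2)
# slot 0 has 2 distinct values -> categorical, remapped via {1.0: 0, 2.0: 1}
# slot 1 has 3 distinct values -> left continuous
```

The point of the transformer, then, is to let downstream algorithms (e.g. tree learners) know which vector slots are categorical via column metadata, without the user hand-labelling them.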
0 votes · 1 answer · 719 views
One-Hot Encoding to a list feature. Pyspark
I would like to prepare my dataset to be used by machine learning algorithms. I have a feature composed of a list of the tags associated with every TV series (my records).
Is it possible to apply the ...
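Since the feature is a list of tags per record (not a single category), the usual Spark approach is `CountVectorizer` with `binary=True` rather than `OneHotEncoder`, producing one 0/1 slot per tag in the vocabulary. The underlying multi-hot encoding, sketched in plain Python (tag values are made up for illustration):

```python
# Multi-hot encode a list-of-tags feature: one 0/1 slot per vocabulary
# tag, which is what CountVectorizer(binary=True) produces in Spark.
def multi_hot(tag_lists):
    vocab = sorted({t for tags in tag_lists for t in tags})
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(vocab)
        for t in tags:
            row[index[t]] = 1
        rows.append(row)
    return vocab, rows

vocab, rows = multi_hot([["drama", "crime"], ["comedy"], ["crime"]])
```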
0 votes · 1 answer · 232 views
Spark ALS model.transform(test) drops rows from test. What could be the reason?
test (a table with columns: user_id, item_id, rating, with 6.2M rows)
als = ALS(userCol="user_id",
          itemCol="item_id",
          ratingCol="rating",
          ...
0 votes · 1 answer · 1k views
Apply vectors.Dense() to an array float column in pyspark 3.2.1
In order to apply PCA from pyspark.ml.feature, I need to convert an org.apache.spark.sql.types.ArrayType:array<float> to org.apache.spark.ml.linalg.VectorUDT.
Say I have the following dataframe:
...
1 vote · 0 answers · 278 views
Trying to submit Spark application in local mode, getting the error "Cannot load main class from JAR"
I am trying to submit a Spark application locally and I am getting the error below.
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file: ...
0 votes · 0 answers · 499 views
Pyspark - specifying actual size for train test split instead of ratio?
Is it possible to split a dataframe into training and testing sets by specifying the actual sizes I want instead of using a ratio? I see most examples use randomSplit...
463715 samples for training
51630 ...
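`randomSplit` only takes weights, so the resulting split sizes are approximate. One common workaround is to impose a random order once and then slice by exact counts (in Spark this is typically done by sorting on a random column, taking `limit(n)` for the train set, and anti-joining for the test set). The logic, sketched in plain Python:

```python
import random

# Split a dataset into train/test by exact counts instead of a ratio:
# shuffle once with a fixed seed, then slice at the requested size.
def split_by_size(rows, n_train, seed=42):
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

data = list(range(100))
train, test = split_by_size(data, 80)
```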
0 votes · 1 answer · 3k views
Perform NGram on Spark DataFrame
I'm using Spark 2.3.1 and I have a Spark DataFrame like this:
+----------+
| values|
+----------+
|embodiment|
| present|
| invention|
| include|
| pairing|
| two|
| wireless|
| device|
| ...
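Note that the DataFrame above has one word per row, while `pyspark.ml.feature.NGram` expects an array-of-tokens column per row, so the words must first be collected into arrays. The n-gram computation itself joins each run of n consecutive tokens with a single space, which can be sketched in plain Python:

```python
# Mirror of what pyspark.ml.feature.NGram computes per row: every group
# of n consecutive tokens joined by a single space.
def ngrams(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["embodiment", "present", "invention", "include"]
bigrams = ngrams(tokens, 2)
```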
1 vote · 1 answer · 3k views
Failed to execute user defined function RegexTokenizer in Pyspark
I am trying to perform text classification using a text feature in the data with Pyspark. Below is my code for text preprocessing, and it is giving "failed to execute user defined function" ...
0 votes · 1 answer · 91 views
Pyspark NLTK save output
I'm using Spark 2.3.1 and running NLTK on thousands of input files.
From the input files I'm extracting unigram, bigram and trigram words and saving them in different dataframes.
Now I want to save ...
1 vote · 0 answers · 305 views
PySpark Row with Label and Features vs LabeledPoint
I saw a PySpark SQL example where the following syntax is used to do something similar to creating a LabeledPoint in Spark MLlib:
from pyspark.sql import Row
from pyspark.mllib.linalg ...
2 votes · 0 answers · 471 views
Getting null when trying to change datatype in pyspark
I have a dataset C1.txt that has one column named features. All the rows are strings representing x and y, the coordinates of a two-dimensional point. I want to change the type to double, but when I'm ...
1 vote · 1 answer · 291 views
Pyspark: Extract Multiclass Classification results as different columns
I'm using the RandomForestClassifier object for a multiclass classification problem.
The output dataframe of the prediction presents the 'probability' column as a vector:
df.select('probability')....
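The `probability` column holds one DenseVector of per-class probabilities per row, so extracting class i is just indexing into that vector (in PySpark this is usually done with `pyspark.ml.functions.vector_to_array` followed by `getItem(i)`, or a UDF). The unpacking logic in plain Python (the `prob_class_i` column names are illustrative):

```python
# Unpack a per-row probability vector into one column per class, which
# is what a vector_to_array + getItem(i) pipeline produces in Spark.
def unpack_probabilities(prob_rows, n_classes):
    return {
        f"prob_class_{i}": [row[i] for row in prob_rows]
        for i in range(n_classes)
    }

probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
cols = unpack_probabilities(probs, 3)
```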
2 votes · 1 answer · 2k views
pyspark ml model map id column after prediction
I have trained a classification model using pyspark.ml.classification.RandomForestClassifier and applied it on a new dataset for prediction.
I am removing the customer_id column before feeding the ...
0 votes · 1 answer · 2k views
In pyspark how to define the schema for list of list with datatype
I want col4 and col5 to come through as ArrayType, but they are coming as StringType. This is in pyspark.
I want to know how we can do this.
col4 -- array (nullable = true)
 |-- element: IntegerType() (...
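A common cause of this is that the arrays arrive serialized as strings (e.g. "[1, 2, 3]"); in Spark you would then parse them with `from_json` and an explicit `ArrayType(IntegerType())` schema (or supply the schema when reading). The per-row parsing idea, sketched in plain Python:

```python
import json

# Per-row equivalent of casting a StringType column holding "[1, 2, 3]"
# to ArrayType(IntegerType()) via from_json in Spark.
def parse_int_array(s):
    return [int(x) for x in json.loads(s)]

parsed = parse_int_array("[1, 2, 3]")
```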
0 votes · 0 answers · 860 views
How to fix NULL when fitting train_data in a linear regression model?
I am using spark.ml to run a linear regression model. But whenever I fit my train data to the model, it gives me an error of scala.MatchError: [null,1.0,[136.0,21.0,25.0]] (of class org.apache.spark....