After I trained a LogisticRegressionModel, I transformed the test data DF with it and got the prediction DF. When I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know what label and features mean, but how should I understand rawPrediction, probability, and prediction?
3 Answers
Note: please also see the answer below by desertnaut https://stackoverflow.com/a/52947815/1056563
rawPrediction is typically the classifier's direct (unnormalized) confidence score for each class. From the Spark docs:
Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
The prediction is obtained by taking the argmax of the rawPrediction vector, i.e. the index of its largest entry:
protected def raw2prediction(rawPrediction: Vector): Double =
rawPrediction.argmax
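In plain Python terms, the same argmax step looks like this (a minimal NumPy sketch with a made-up rawPrediction vector, not Spark's own code):
import numpy as np
raw_prediction = np.array([0.99, -0.99])       # hypothetical rawPrediction vector
prediction = float(np.argmax(raw_prediction))  # index of the largest entry
# prediction == 0.0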
The Probability is the conditional probability for each class. Here is the scaladoc:
Estimate the probability of each class given the raw prediction,
doing the computation in-place. These predictions are also called class conditional probabilities.
The actual calculation depends on which Classifier you are using.
DecisionTree
Normalize a vector of raw predictions to be a multinomial probability vector, in place.
It simply sums by class across the instances and then divides by the total instance count.
class_k probability = Count_k/Count_Total
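As an illustration of that normalization (a toy NumPy sketch with made-up per-class leaf counts, not Spark's actual implementation; the same Count_k/Count_Total normalization applies to Random Forest below):
import numpy as np
counts = np.array([30.0, 10.0, 60.0])  # hypothetical per-class instance counts in a leaf
probability = counts / counts.sum()    # Count_k / Count_Total
# array([0.3, 0.1, 0.6])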
LogisticRegression
It uses the logistic formula
class_k probability: 1/(1 + exp(-rawPrediction_k))
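A quick NumPy check of that formula, using an arbitrary rawPrediction value (the same transformation is verified on real model output further down):
import numpy as np
raw_k = 0.9894
prob_k = 1.0 / (1.0 + np.exp(-raw_k))  # logistic function applied to rawPrediction_k
# prob_k is roughly 0.729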
Naive Bayes
class_k probability = exp(rawPrediction_k - max(rawPrediction)), then normalized so the class probabilities sum to 1
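A toy NumPy sketch of that shift-and-normalize step (made-up log-posterior values, not taken from Spark):
import numpy as np
raw = np.array([-12.3, -10.1, -15.7])  # hypothetical log-posteriors (rawPrediction)
shifted = np.exp(raw - raw.max())      # subtract the max for numerical stability
probability = shifted / shifted.sum()  # normalize to a probability vector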
Random Forest
class_k probability = Count_k/Count_Total
5 Comments
How prediction and probability differ from (i.e., are derived from) the rawPrediction is shown in my answer, taken directly from the source code. So I've answered this. Which part do you want more details about?
In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation:
The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
It is not there in the later versions, but you can still find it in the Scala source code.
Anyway, any unfortunate wording aside, rawPrediction in Spark ML, for the logistic regression case, is what the rest of the world calls logits, i.e. the raw output of a logistic regression classifier, which is subsequently transformed into a probability score using the logistic function exp(x)/(1+exp(x)).
Here is an example with toy data in Pyspark:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
(0.0, Vectors.dense(0.0, 1.0)),
(1.0, Vectors.dense(1.0, 0.0))],
["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# | 0.0|[0.0,1.0]|
# | 1.0|[1.0,0.0]|
# +-----+---------+
lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show(truncate=False)
Here is the result:
+---------+----------------------------------------+----------------------------------------+----------+
|features | rawPrediction | probability |prediction|
+---------+----------------------------------------+----------------------------------------+----------+
|[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]| 0.0 |
|[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613] | 1.0 |
+---------+----------------------------------------+----------------------------------------+----------+
Let's now confirm that the logistic function of rawPrediction gives the probability column:
import numpy as np
x1 = np.array([0.9894187891647654,-0.9894187891647654])
np.exp(x1)/(1+np.exp(x1))
# array([ 0.72897311, 0.27102689])
x2 = np.array([-0.9894187891647683,0.9894187891647683])
np.exp(x2)/(1+np.exp(x2))
# array([ 0.27102689, 0.72897311])
i.e. this is indeed the case.
So, to summarize regarding all three (3) output columns:
rawPrediction is the raw output of the logistic regression classifier (an array with length equal to the number of classes)
probability is the result of applying the logistic function to rawPrediction (an array of the same length as rawPrediction)
prediction is the argument at which the array probability takes its maximum value, i.e. the most probable label (a single number)
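As a final sanity check on the toy output above, the prediction column can be reproduced by taking the argmax of each probability row (plain NumPy, using the numbers already shown):
import numpy as np
prob_row1 = np.array([0.7289731070426124, 0.27102689295738763])
prob_row2 = np.array([0.2710268929573871, 0.728973107042613])
print(float(np.argmax(prob_row1)))  # 0.0 -> matches the prediction column
print(float(np.argmax(prob_row2)))  # 1.0 -> matches the prediction column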