
I am using Spark MLlib with the DataFrame API. Given the following sample code:

import org.apache.spark.ml.classification.DecisionTreeClassifier

val dtc = new DecisionTreeClassifier()
val testResults = dtc.fit(training).transform(test)

Can I calculate the model quality metrics over testResults using the DataFrame API?

If not, how do I correctly transform testResults (containing "label", "features", "rawPrediction", "probability", "prediction") so that I can use BinaryClassificationMetrics (RDD API)?

NOTE: I am interested in the "byThreshold" metrics as well

1 Answer


If you look at the constructor of BinaryClassificationMetrics, it takes an RDD[(Double, Double)] of (score, label) pairs. You can convert the DataFrame to the right format like this:

import org.apache.spark.ml.linalg.Vector

val scoreAndLabels = testResults.select("label", "probability")
    .rdd
    .map(row =>
        // score = P(label = 1), the second element of the probability vector
        (row.getAs[Vector]("probability")(1), row.getAs[Double]("label"))
    )
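
From there you can build the metrics object and pull out the "byThreshold" metrics the question asks about. A minimal sketch, assuming scoreAndLabels is the RDD built above:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(scoreAndLabels)

// Aggregate metrics
val auROC = metrics.areaUnderROC()
val auPR  = metrics.areaUnderPR()

// "byThreshold" metrics: each is an RDD[(threshold, metric value)]
val precisionByThreshold = metrics.precisionByThreshold()
val recallByThreshold    = metrics.recallByThreshold()
val f1ByThreshold        = metrics.fMeasureByThreshold()

precisionByThreshold.collect().foreach { case (threshold, precision) =>
    println(s"Threshold: $threshold, Precision: $precision")
}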

EDIT:

The probability is stored in a Vector whose length equals the number of classes you'd like to predict. In binary classification the first element corresponds to label = 0 and the second to label = 1; you should pick the column that is your positive label (normally label = 1).
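
To make the layout concrete, here is a hypothetical probability vector for a single row (the numbers are made up, not taken from the question's data):

import org.apache.spark.ml.linalg.Vectors

// index 0 -> P(label = 0), index 1 -> P(label = 1)
val probability = Vectors.dense(0.3, 0.7)

// Score for the positive class, as used in scoreAndLabels above
val score = probability(1)  // 0.7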


Comments

I was thinking something like that. However, "probability" is a 2-element Vector and I'm not sure which one to pick. If you look at how BinaryClassificationEvaluator works, it takes rawPrediction at index 1, which confuses me for 2 reasons: why use rawPrediction instead of probability, and why always take index 1? I am tempted to use "probability(1)" for the BinaryClassificationMetrics object.
Could you elaborate on why "... pick the column that is your positive label"? I am confused about why BinaryClassificationEvaluator always picks index 1, regardless of what prediction has been generated.
BinaryClassificationEvaluator assumes that label = 0 is the negative label and label = 1 is the positive one. In some ML libraries (e.g. scikit-learn) you have an option to pick which column is your positive/negative label (scikit-learn.org/stable/modules/generated/…); in MLlib it appears to be hardcoded. (A minimal BinaryClassificationEvaluator sketch follows these comments.)
If you always take probability(1), then the prediction is more likely to be correct for values very close to 1 and very close to 0 (where we are predicting negative with high probability). Is that the expected outcome?
Yes, probability(1) represents P(label = 1): a high value means the example is more likely to be label 1 and a low value means label 0. If you take probability(0), then a high value is more likely to be label 0 and a low value more likely to be label 1 (as probability(0) + probability(1) = 1).
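
For completeness, the DataFrame-based counterpart discussed in the comments is BinaryClassificationEvaluator. A minimal sketch (note that it only returns aggregate metrics such as areaUnderROC, not the "byThreshold" ones):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("rawPrediction")  // "probability" also works here
    .setMetricName("areaUnderROC")         // or "areaUnderPR"

val auROC = evaluator.evaluate(testResults)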
