
I am using Spark MLlib with the DataFrame API. Given the following sample code:

import org.apache.spark.ml.classification.DecisionTreeClassifier

val dtc = new DecisionTreeClassifier()
val testResults = dtc.fit(training).transform(test)

Can I calculate the model quality metrics over testResults using the DataFrame API?

If not, how do I correctly transform testResults (containing "label", "features", "rawPrediction", "probability", "prediction") so that I can use BinaryClassificationMetrics (RDD API)?

NOTE: I am interested in the "byThreshold" metrics as well

1 Answer


If you look at the constructor of BinaryClassificationMetrics, it takes an RDD[(Double, Double)] of (score, label) pairs. You can convert the DataFrame to the right format like this:

import org.apache.spark.ml.linalg.Vector

val scoreAndLabels = testResults.select("label", "probability")
    .rdd
    .map(row =>
        // score = P(label = 1), the second element of the probability vector
        (row.getAs[Vector]("probability")(1), row.getAs[Double]("label"))
    )
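
From there you can build the metrics object and pull out the "byThreshold" metrics the question asks about. A minimal sketch, assuming scoreAndLabels is the RDD built above:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(scoreAndLabels)

// Aggregate metrics
val auROC = metrics.areaUnderROC()
val auPR  = metrics.areaUnderPR()

// "byThreshold" metrics: each is an RDD[(threshold, metric value)]
val precisionByThreshold = metrics.precisionByThreshold()
val recallByThreshold    = metrics.recallByThreshold()
val f1ByThreshold        = metrics.fMeasureByThreshold()

precisionByThreshold.collect().foreach { case (threshold, precision) =>
    println(s"Threshold: $threshold, Precision: $precision")
}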

EDIT:

The probability is stored in a Vector whose length equals the number of classes you'd like to predict. In binary classification the first element corresponds to label = 0 and the second to label = 1; you should pick the column that is your positive label (normally label = 1).
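
To make the layout concrete, here is a hypothetical probability vector for a single row (the numbers are made up, not taken from the question's data):

import org.apache.spark.ml.linalg.Vectors

// index 0 -> P(label = 0), index 1 -> P(label = 1)
val probability = Vectors.dense(0.3, 0.7)

// Score for the positive class, as used in scoreAndLabels above
val score = probability(1)  // 0.7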


Comments

I was thinking something like that. However, "probability" is a 2-element Vector and I'm not sure which one to pick. If you look at how BinaryClassificationEvaluator works, it takes rawPrediction at index 1, which confuses me for 2 reasons: why use rawPrediction instead of probability, and why always take index 1? I am tempted to use "probability(1)" for the BinaryClassificationMetrics object.
Could you elaborate on why "... pick the column that is your positive label"? I am confused about why BinaryClassificationEvaluator always picks index 1, regardless of what prediction has been generated.
BinaryClassificationEvaluator assumes that label = 0 is the negative label and label = 1 is the positive one. In some ML libraries (e.g. scikit-learn) you have an option to pick which column is your positive/negative label (scikit-learn.org/stable/modules/generated/…); in MLlib it appears to be hardcoded. (A minimal BinaryClassificationEvaluator sketch follows these comments.)
If you always take probability(1), then the prediction is more likely to be correct for values very close to 1 and very close to 0 (where we are predicting negative with high probability). Is that the expected outcome?
Yes, probability(1) represents P(label = 1): a high value means the example is more likely to be label 1 and a low value means label 0. If you take probability(0), then a high value is more likely to be label 0 and a low value more likely to be label 1 (as probability(0) + probability(1) = 1).
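
For completeness, the DataFrame-based counterpart discussed in the comments is BinaryClassificationEvaluator. A minimal sketch (note that it only returns aggregate metrics such as areaUnderROC, not the "byThreshold" ones):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("rawPrediction")  // "probability" also works here
    .setMetricName("areaUnderROC")         // or "areaUnderPR"

val auROC = evaluator.evaluate(testResults)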
