I am implementing the Random Ferns Algorithm for classification. For simplicity, imagine a single decision tree with only a single node. As input we have one feature value and the label for each sample in the dataset.
The function should work for any number of classes (i.e. `len(set(labels))`). The output is the feature threshold that leads to the best split. I plan to implement further impurity measures, such as misclassification rate or entropy, later on.
For those interested in the topic, here is a link to a short introductory presentation (PDF): classification trees and node splitting.
My current implementation works fine, yet I am sure there is plenty of room for improvement. If you have questions regarding functionality, please ask. I have added comments explaining what remains to be done.
Example input:
test_features = [1,2,3,3,4]
test_labels = [0,0,1,1,1]
Example output:
3
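The output can be checked by hand: at threshold 3 both children are pure, so the weighted Gini impurity is zero, while every other threshold leaves at least one mixed child. A quick sketch of that arithmetic for two candidate thresholds:

```python
# Hand-check of the example: weighted Gini impurity for two candidate thresholds.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

labels = [0, 0, 1, 1, 1]

# threshold 3: left = labels of values < 3, right = labels of values >= 3
left, right = [0, 0], [1, 1, 1]
weighted_3 = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
# both children are pure, so weighted_3 == 0.0

# threshold 2: left = [0], right = [0, 1, 1, 1]
left, right = [0], [0, 1, 1, 1]
weighted_2 = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
# weighted_2 == 0.3, worse than the pure split at threshold 3
```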
Code as follows:
```python
from collections import Counter

import numpy as np


def best_split(feature_values, labels):
    # training: for each node/feature, determine the threshold of the best split
    impurity = []
    # the only relevant candidates for a threshold are the feature values themselves
    possible_thresholds = sorted(set(feature_values))
    for threshold in possible_thresholds:
        # split the node's samples based on the threshold;
        # left is empty when threshold == min(feature_values), right is never empty
        right = [label for value, label in zip(feature_values, labels) if value >= threshold]
        left = [label for value, label in zip(feature_values, labels) if value < threshold]
        # label distribution of each split; Counter handles unsorted labels,
        # whereas itertools.groupby only groups *consecutive* equal elements
        right_distribution = list(Counter(right).values())
        left_distribution = list(Counter(left).values())
        # Gini impurity of each split based on its label distribution;
        # an empty split is defined to have impurity 0
        gini_right = 1 - np.sum((np.array(right_distribution) / len(right)) ** 2)
        gini_left = 1 - np.sum((np.array(left_distribution) / len(left)) ** 2) if left else 0
        # weighted total impurity of the split
        gini_split = (len(right) * gini_right + len(left) * gini_left) / len(labels)
        impurity.append(gini_split)
    # return the threshold with the lowest weighted impurity --> best split threshold
    return possible_thresholds[impurity.index(min(impurity))]
```
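Since I mention entropy as a planned impurity measure, here is a hedged sketch of what that variant could look like; `best_split_entropy` is a hypothetical name, and the threshold search is the same as in the Gini version, only the scoring function changes:

```python
from collections import Counter
import math


def entropy(labels):
    # Shannon entropy of a label multiset; an empty split is defined as 0.0
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split_entropy(feature_values, labels):
    # same candidate search as the Gini version, scored by weighted entropy
    best_threshold, best_score = None, float("inf")
    for threshold in sorted(set(feature_values)):
        right = [l for v, l in zip(feature_values, labels) if v >= threshold]
        left = [l for v, l in zip(feature_values, labels) if v < threshold]
        score = (len(right) * entropy(right) + len(left) * entropy(left)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold
```

On the example data this returns the same threshold, 3, since both impurity measures are minimized by pure children.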
This function is used in the training of the Random Ferns class as follows:

```python
def train(self, patches, labels):
    self.classes = list(set(labels))
    # a uniform distribution for each leaf is assumed:
    # for each fern, for each feature combination (leaf), for each class
    # there is a posterior probability; these are stored in a list of lists
    # of lists named 'posterior'. List comprehensions (rather than replication
    # with *) ensure every leaf gets its own independent distribution object.
    initial_distribution = [1 / len(self.classes)] * len(self.classes)
    self.posterior = [[list(initial_distribution) for _ in range(2 ** self.fernsize)]
                      for _ in range(self.number_of_ferns)]
    # determine the best threshold for each feature using best_split
    all_thresholds = []
    for fern in self.ferns:
        fern_thresholds = []
        for feature_params in fern:
            # feature() extracts the values of one specific feature
            # (determined by feature_params) from each patch in patches
            feature_values = feature(patches, feature_params)
            fern_thresholds.append(best_split(feature_values, labels))
        all_thresholds.append(fern_thresholds)
    self.threshold = all_thresholds
```
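One subtlety worth noting when initializing a nested structure like `posterior`: replicating a list with `*` copies references to the same inner object, so updating one leaf's distribution would silently update all of them; independent inner lists require a comprehension. A minimal demonstration of the difference:

```python
# Hypothetical shapes: 2 leaves, 2 classes, uniform initial distribution.
shared = [[0.5, 0.5]] * 2  # both rows are the SAME list object
shared[0][0] = 1.0
# the change shows up in every row: shared == [[1.0, 0.5], [1.0, 0.5]]

independent = [[0.5, 0.5] for _ in range(2)]  # each row is a fresh list
independent[0][0] = 1.0
# only the first row changes: independent == [[1.0, 0.5], [0.5, 0.5]]
```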