Plot a histogram with y-axis as percentage (using FuncFormatter?)

Question

I have a list of data in which the numbers are between 1000 and 20 000.

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

When I plot a histogram using the hist() function, the y-axis represents the number of occurrences of the values within a bin. Instead of the number of occurrences, I would like to have the percentage of occurrences.

Code for the above plot:

f, ax = plt.subplots(1, 1, figsize=(10,5))
ax.hist(data, bins = len(list(set(data))))

I've been looking at this post which describes an example using FuncFormatter but I can't figure out how to adapt it to my problem. Some help and guidance would be welcome :)

EDIT: The main issue is with the to_percent(y, position) function used by the FuncFormatter. I guess y corresponds to one given value on the y-axis. I need to divide this value by the total number of elements, which I apparently can't pass to the function...

EDIT 2: Current solution I dislike because of the use of a global variable:

def to_percent(y, position):
    # Ignore the passed in position. This has the effect of scaling the default
    # tick locations.
    global n

    s = str(round(100 * y / n, 3))
    print (y)

    # The percent symbol needs escaping in latex
    if matplotlib.rcParams['text.usetex'] is True:
        return s + r'$\%$'
    else:
        return s + '%'

def plotting_hist(folder, output):
    global n

    data = list()
    # Do stuff to create data from folder

    n = len(data)
    f, ax = plt.subplots(1, 1, figsize=(10,5))
    ax.hist(data, bins = len(list(set(data))), rwidth = 1)

    formatter = FuncFormatter(to_percent)
    plt.gca().yaxis.set_major_formatter(formatter)

    plt.savefig("{}.png".format(output), dpi=500)

EDIT 3: Method with density = True

Actual desired output (method with global variable):

ImportanceOfBeingErnest · Accepted Answer · 2018-07-23 10:59:51Z

170

Other answers seem utterly complicated. A histogram which shows the proportion instead of the absolute amount can easily be produced by weighting the data with 1/n, where n is the number of datapoints.

Then a PercentFormatter can be used to show the proportion (e.g. 0.45) as percentage (45%).

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

plt.hist(data, weights=np.ones(len(data)) / len(data))

plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.show()

Here we see that three of the 7 values are in the first bin, i.e. 3/7=43%.
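The 3/7 figure can be checked without drawing anything. This is a minimal sketch, assuming np.histogram with its default of 10 equal-width bins gives the same binning as the plt.hist call above; the variable names are illustrative.

```python
import numpy as np

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

# Same weighting as in the answer: every point contributes 1/n
weights = np.ones(len(data)) / len(data)

# Default binning: 10 equal-width bins between min(data) and max(data)
heights, edges = np.histogram(data, weights=weights)

print(edges[:2])      # edges of the first bin: 1000 and 2500
print(heights[0])     # share of points in the first bin, about 0.43
print(heights.sum())  # all shares add up to 1 (up to rounding)
```

The values 1000, 1000 and 2000 fall into the first bin, which is where the three out of seven comes from.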

edited Jul 23, 2018 at 10:59

answered Jul 23, 2018 at 10:50

ImportanceOfBeingErnest


7 Comments

Outcast Over a year ago

Hi, this looks good. However, the bar plots are not finishing exactly on the x-axis ticks but they are going a bit to the right each time. How can I make these be aligned?

ImportanceOfBeingErnest Over a year ago

@PoeteMaudit You don't align bars of a histogram. They are precisely at the bin edges. If you want to change the bin edges, use histogram's bins argument.

Outcast Over a year ago

Thank you for your response but visually the bin edges are not aligned to the tick marks of the x-axis. Therefore, it gets even difficult to interpret what are the values related to each bin.

ImportanceOfBeingErnest Over a year ago

You fix this by choosing the bin edges, such that they are at nice numbers and set the ticks to those numbers, not the inverse.
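A sketch of that advice: pick the bin edges first, then put the ticks on them. The edges used here (every 2000 units from 0 to 18000) are an illustrative assumption, not something from the thread, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit to use your default
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import PercentFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

# Assumed "nice" bin edges: 0, 2000, ..., 18000
edges = np.arange(0, 18001, 2000)

fig, ax = plt.subplots(figsize=(10, 5))
heights, used_edges, _ = ax.hist(
    data, bins=edges, weights=np.ones(len(data)) / len(data)
)

# Ticks go exactly where the bars start and end
ax.set_xticks(edges)
ax.yaxis.set_major_formatter(PercentFormatter(1))
```

With explicit edges, every bar boundary coincides with a labelled tick, so each bin's range can be read off the x-axis.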

Petr Vepřek Over a year ago

To remove dependency on numpy, one can replace weights=np.ones(len(data)) / len(data) with weights = [1/len(data)] * len(data).
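Spelled out, the numpy-free variant could look like the sketch below. Note that matplotlib itself still depends on numpy, so this only removes the explicit import from your own script. The Agg backend line is an assumption made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit to use your default
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

# Plain Python list of weights: each point counts as 1/n
n = len(data)
weights = [1 / n] * n

fig, ax = plt.subplots()
heights, edges, _ = ax.hist(data, weights=weights)
ax.yaxis.set_major_formatter(PercentFormatter(1))
```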


Dharman · Accepted Answer · 2025-01-02 13:15:40Z

14

Simply set density=True; the weights will be implicitly normalized.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

plt.hist(data, density=True)

plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.show()

edited Jan 2 at 13:15

Dharman♦


answered Mar 28, 2021 at 18:58

Jiro


5 Comments

PatrickT Over a year ago

density=True does not give "the percentage of occurrences," as requested by the OP. Most users will not be looking for density=True (Google it to see why).

tturbo Over a year ago

"Google it to see why" -> or just give the answer: It represents the percentage of the area under the curve (the sum of the areas of every bars is 1.0)

SomethingSomething Jan 2 at 8:53

This is a very misleading wrong answer. Should be removed.

Dharman Jan 2 at 13:17

@SomethingSomething We don't remove wrong answers, we downvote them instead. How else will other people know that they should avoid this approach? Please don't add your comments into the answer either. Comments should be in the comment section.

SomethingSomething Jan 2 at 19:25

@Dharman Sure, but there are still 16 upvotes, which seems like a reliable answer... what a simple solution, just add density=True, everybody likes it and upvotes blindly. And, BTW, if you scroll down, you'll see a similar answer that has been removed. I think this answer is amongst the exceptions where downvotes are not enough.
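To make the objection in these comments concrete, here is a small numeric check, assuming np.histogram(density=True) normalizes the same way as plt.hist(density=True). It uses 10 equal-width bins, which is the default in the answer's code.

```python
import numpy as np

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

density, edges = np.histogram(data, bins=10, density=True)
widths = np.diff(edges)  # every bin is 1500 wide here

print(density[0])                # about 0.000286, not a share of the data
print((density * widths).sum())  # the bar areas add up to 1
print(density[0] * widths[0])    # density times bin width gives the share, about 0.43
```

So with density=True the y-axis values depend on the bin width, and passing them through PercentFormatter(1) labels them as percentages they are not.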

Mansour Zayer · Accepted Answer · 2021-06-27 13:31:39Z

9

I think the simplest way is to use seaborn, which is a layer on top of matplotlib. Note that you can still use plt.subplots(), figsize, ax, and fig to customize your plot.

import seaborn as sns

And using the following code:

sns.displot(data, stat='probability')

Also, sns.displot has many parameters that make it easy to build complex and informative graphs. They can be found here: displot Documentation

answered Jun 27, 2021 at 13:31

Mansour Zayer


DavidG · Accepted Answer · 2018-07-23 09:04:40Z

You can calculate the percentages yourself, then plot them as a bar chart. This requires you to use numpy.histogram (which matplotlib uses "under the hood" anyway). You can then adjust the y tick labels:

import matplotlib.pyplot as plt
import numpy as np

f, ax = plt.subplots(1, 1, figsize=(10,5))
data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

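# One bin per distinct value, as in the question
heights, bins = np.histogram(data, bins=len(set(data)))

# Convert the counts to percentages of all data points
percent = heights / heights.sum() * 100

# Draw each bar from its left bin edge, using the actual bin widths
ax.bar(bins[:-1], percent, width=np.diff(bins), align="edge")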
vals = ax.get_yticks()
ax.set_yticklabels(['%1.2f%%' %i for i in vals])

plt.show()
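As a variation on the manual tick labels, the already-computed percentages can be handed to PercentFormatter with xmax=100. This is a sketch of that swap, not part of the original answer; the Agg backend line is an assumption so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit to use your default
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import PercentFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

heights, bins = np.histogram(data, bins=len(set(data)))
percent = heights / heights.sum() * 100

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(bins[:-1], percent, width=np.diff(bins), align="edge")

# xmax=100 tells the formatter that the values are already percentages
formatter = PercentFormatter(xmax=100, decimals=2)
ax.yaxis.set_major_formatter(formatter)
```

Because the formatter is attached to the axis, the labels stay correct if the tick positions change later, for example after zooming.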

tturbo · Accepted Answer · 2022-10-11 11:11:46Z

I found yet another way to do this. As you can see in other answers, density=True alone doesn't solve the problem, because it scales the bars so that their total area, not the sum of their heights, is 1. But that is easy to convert: just multiply the density by the width of the bars.

import matplotlib.pyplot as plt

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
bins=10

plt.hist(data, bins=bins, density=True)

bar_width = (max(data)-min(data))/bins # calculate width of a bar
ticks = plt.yticks()[0] # get ticks
tick_labels = ticks * bar_width # density * bar width = proportion of points
tick_labels = [f"{p:.1%}" for p in tick_labels] # format proportion as percentage
plt.yticks(ticks=ticks, labels=tick_labels) # set new labels

plt.show()

However, the weights=np.ones(len(data)) / len(data) solution may be shorter and cleaner. This is just another way, and it does not need numpy.

ImportanceOfBeingErnest · Accepted Answer · 2018-07-23 11:44:24Z

You can use functools.partial to avoid using globals in your example.

Just add n to function parameters:

def to_percent(y, position, n):
    s = str(round(100 * y / n, 3))

    if matplotlib.rcParams['text.usetex']:
        return s + r'$\%$'

    return s + '%'

and then create a partial function of two arguments that you can pass to FuncFormatter:

percent_formatter = partial(to_percent,
                            n=len(data))
formatter = FuncFormatter(percent_formatter)

Full code:

from functools import partial

import matplotlib  # needed for matplotlib.rcParams below
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]


def to_percent(y, position, n):
    s = str(round(100 * y / n, 3))

    if matplotlib.rcParams['text.usetex']:
        return s + r'$\%$'

    return s + '%'


def plotting_hist(data):    
    f, ax = plt.subplots(figsize=(10, 5))
    ax.hist(data, 
            bins=len(set(data)), 
            rwidth=1)

    percent_formatter = partial(to_percent,
                                n=len(data))
    formatter = FuncFormatter(percent_formatter)
    plt.gca().yaxis.set_major_formatter(formatter)

    plt.show()


plotting_hist(data)

gives:

@ImportanceOfBeingErnest Could you explain why this output is incorrect and the one from DavidG is correct? I really don't see the difference. They also don't have 43% in the first bin.
Sorry, it seems correct. But I don't think it's useful to have arbitrarily complicated numbers on the axes, like 42.857 instead of 40.
Both of yours are correct, but the one from @ImportanceOfBeingErnest is simpler.
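A closure is another way to get n into the formatter without a global. The sketch below is an alternative to functools.partial, not the answer's code: make_percent_formatter is a hypothetical helper name, and the LaTeX escaping from the original to_percent is left out for brevity.

```python
from matplotlib.ticker import FuncFormatter


def make_percent_formatter(n):
    """Return a tick formatter that shows a count y as a percentage of n."""
    def to_percent(y, position):
        # n is captured from the enclosing call, so no global is needed
        return f"{100 * y / n:.1f}%"
    return FuncFormatter(to_percent)


formatter = make_percent_formatter(7)
print(formatter(3, 0))  # prints "42.9%"
```

In the plotting function it would be used as ax.yaxis.set_major_formatter(make_percent_formatter(len(data))).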
