83

I have a list of data in which the numbers are between 1000 and 20 000.

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

When I plot a histogram using the hist() function, the y-axis represents the number of occurrences of the values within a bin. Instead of the number of occurrences, I would like to have the percentage of occurrences.

Histogram of the data list above

Code for the above plot:

f, ax = plt.subplots(1, 1, figsize=(10,5))
ax.hist(data, bins = len(list(set(data))))

I've been looking at this post which describes an example using FuncFormatter but I can't figure out how to adapt it to my problem. Some help and guidance would be welcome :)

EDIT: Main issue with the to_percent(y, position) function used by the FuncFormatter. The y corresponds to one given value on the y-axis I guess. I need to divide this value by the total number of elements which I apparently can' t pass to the function...

EDIT 2: Current solution I dislike because of the use of a global variable:

def to_percent(y, position):
    # Ignore the passed in position. This has the effect of scaling the default
    # tick locations.
    global n

    s = str(round(100 * y / n, 3))
    print (y)

    # The percent symbol needs escaping in latex
    if matplotlib.rcParams['text.usetex'] is True:
        return s + r'$\%$'
    else:
        return s + '%'

def plotting_hist(folder, output):
    global n

    data = list()
    # Do stuff to create data from folder

    n = len(data)
    f, ax = plt.subplots(1, 1, figsize=(10,5))
    ax.hist(data, bins = len(list(set(data))), rwidth = 1)

    formatter = FuncFormatter(to_percent)
    plt.gca().yaxis.set_major_formatter(formatter)

    plt.savefig("{}.png".format(output), dpi=500)

EDIT 3: Method with density = True

enter image description here

Actual desired output (method with global variable):

enter image description here

0

6 Answers 6

170

Other answers seem utterly complicated. A histogram which shows the proportion instead of the absolute amount can easily produced by weighting the data with 1/n, where n is the number of datapoints.

Then a PercentFormatter can be used to show the proportion (e.g. 0.45) as percentage (45%).

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

plt.hist(data, weights=np.ones(len(data)) / len(data))

plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.show()

enter image description here

Here we see that three of the 7 values are in the first bin, i.e. 3/7=43%.

Sign up to request clarification or add additional context in comments.

7 Comments

Hi, this looks good. However, the bar plots are not finishing exactly on the x-axis ticks but they are going a bit to the right each time. How can I make these be aligned?
@PoeteMaudit You don't align bars of a histogram. They are precisely at the bin edges. If you want to change the bin edges, use histogram's bins argument.
Thank you for your response but visually the bin edges are not aligned to the tick marks of the x-axis. Therefore, it gets even difficult to interpret what are the values related to each bin.
You fix this by choosing the bin edges, such that they are at nice numbers and set the ticks to those numbers, not the inverse.
To remove dependency on numpy, one can replace weights=np.ones(len(data)) / len(data) with weights = [1/len(data)] * len(data).
|
14

Simply set density to true, the weights will be implicitly normalized.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

plt.hist(data, density=True)

plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.show()

5 Comments

density=True does not give "the percentage of occurrences," as requested by the OP. Most users will not be looking for density=True( Google it to see why).
"Google it to see why" -> or just give the answer: It represents the percentage of the area under the curve (the sum of the areas of every bars is 1.0)
This is a very misleading wrong answer. Should be removed.
@SomethingSomething We don't remove wrong answers, we downvote them instead. How else will other people know that they should avoid this approach? Please don't add your comments into the answer either. Comments should be in the comment section.
@Dharman Sure, but there are still 16 upvotes, which seems like a reliable answer... what a simple solution, just add density=True, everybody likes it and upvotes blindly. And, BTW, if you scroll down, you'll see a similar answer that has been removed. I think this answer is amongst the exceptions where downvotes are not enough.
9

I think the simplest way is to use seaborn which is a layer on matplotlib. Note that you can still use plt.subplots(), figsize(), ax, and fig to customize your plot.

import seaborn as sns

And using the following code:

sns.displot(data, stat='probability'))

Also, sns.displot has so many parameters that allow for very complex and informative graphs very easily. They can be found here: displot Documentation

Comments

8

You can calculate the percentages yourself, then plot them as a bar chart. This requires you to use numpy.histogram (which matplotlib uses "under the hood" anyway). You can then adjust the y tick labels:

import matplotlib.pyplot as plt
import numpy as np

f, ax = plt.subplots(1, 1, figsize=(10,5))
data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

heights, bins = np.histogram(data, bins = len(list(set(data))))

percent = [i/sum(heights)*100 for i in heights]

ax.bar(bins[:-1], percent, width=2500, align="edge")
vals = ax.get_yticks()
ax.set_yticklabels(['%1.2f%%' %i for i in vals])

plt.show()

enter image description here

Comments

2

I found yet an other way to do so. As you can see in other answers, density=True alone doesn't solve the problem, as it calculates the area under the curve in percentage. But that can easily be converted, just divide it by the width of the bars.

import matplotlib.pyplot as plt

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
bins=10

plt.hist(data, bins=bins, density=True)

bar_width = (max(data)-min(data))/bins # calculate width of a bar
ticks = plt.yticks()[0] # get ticks
tick_labels = ticks * bar_width # calculate labels for ticks
tick_labels = map(lambda f: f"{f:0.2}%",tick_labels) # format float to string
plt.yticks(ticks=ticks, labels=tick_labels) # set new labels

plt.show()

enter image description here

However, the solution weights=np.ones(len(data)) / len(data) may be a shorther and cleaner. This is just an other way and without numpy

Comments

1

You can use functools.partial to avoid using globals in your example.

Just add n to function parameters:

def to_percent(y, position, n):
    s = str(round(100 * y / n, 3))

    if matplotlib.rcParams['text.usetex']:
        return s + r'$\%$'

    return s + '%'

and then create a partial function of two arguments that you can pass to FuncFormatter:

percent_formatter = partial(to_percent,
                            n=len(data))
formatter = FuncFormatter(percent_formatter)

Full code:

from functools import partial

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]


def to_percent(y, position, n):
    s = str(round(100 * y / n, 3))

    if matplotlib.rcParams['text.usetex']:
        return s + r'$\%$'

    return s + '%'


def plotting_hist(data):    
    f, ax = plt.subplots(figsize=(10, 5))
    ax.hist(data, 
            bins=len(set(data)), 
            rwidth=1)

    percent_formatter = partial(to_percent,
                                n=len(data))
    formatter = FuncFormatter(percent_formatter)
    plt.gca().yaxis.set_major_formatter(formatter)

    plt.show()


plotting_hist(data)

gives:

enter image description here

3 Comments

@ImportanceOfBeingErnest Could you explain why this output is incorrect and the one from DavidG is correct? I really don't see the difference. They also don't have 43% in the first bin.
Sorry, it seems correct. But I don't think it's useful to have arbitrarily complicated numbers on the axes, like 42.857 instead of 40.
Both of yours are correct, but the one from @ImportanceOfBeingErnest is simpler.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.