
I am VERY new to the world of Python/pandas/matplotlib, but I have been using it recently to create box and whisker plots. I was curious how to create a box and whisker plot for each sheet using a specific column of data, i.e. I have 17 sheets, and I have columns called HMB and DV on each sheet. I want to plot the 17 data sets on one box and whisker plot for HMB and the same 17 data sets on a second plot for DV. Below is what I have so far.

I can open the file and get all the sheets into list_dfs, but then I don't know where to go from there. I was going to try to slice each set manually (as I started below before coming here for help), but when I have more data in the future, I don't want to have to do that by hand. Any help would be greatly appreciated!

import pandas as pd
import numpy as np
import xlrd
import matplotlib.pyplot as plt
%matplotlib inline
from pandas import ExcelWriter
from pandas import ExcelFile
from pandas import DataFrame

excel_file =  'Project File Merger.xlsm'

list_dfs = []

xls = xlrd.open_workbook(excel_file,on_demand=True)
for sheet_name in xls.sheet_names():
    df = pd.read_excel(excel_file,sheet_name)
    list_dfs.append(df) 

d_psppm = {}
for i, sheet_name in enumerate(xls.sheet_names()):
    df = pd.read_excel(excel_file,sheet_name)
    d_psppm["PSPPM" + str(i)] = df.loc[:,['PSPPM']]

values_list = list(d_psppm.values())
print(values_list[:])

A sample of the output is shown below. There are 17 list entries, each with a different number of rows.

                              PSPPM
0                             0.246769
1                             0.599589
2                             0.082420
3                             0.250000
4                             0.205140
5                             0.850000,
                              PSPPM
0                             0.500887
1                             0.475255
2                             0.472711
3                             0.412953
4                             0.415883
5                             0.703716,...

The next thing I want to do is create a box and whisker plot, 1 plot with 17 box and whiskers. I am not sure how to get the dictionary to plot with the values and indices as the name. I have tried to dig, and figure out how to convert the dictionary to a list and then plot each element in the list, but have had no luck.

Thanks for the help!

  • Matplotlib can be difficult to get to grips with. You can iterate over a dictionary, so no need to turn it into a list. I would create a figure using fig, ax = plt.subplots() and then iterate with multiple ax.boxplot() calls for each box. Personally, I would avoid boxplots (definitely make them notched if you decide to use them), and placing the data on as a jittered scatter is almost always better.
    – Andrew
    Commented Nov 26, 2018 at 15:25
  • I haven't heard of a jittered scatter, but upon looking it up, I really like the way it presents data. I think it will be perfect for what I need. Any advice on how to iterate over the dictionary to create the plot? Sorry, I have been using Python for about a week now so I am still learning the whole process. Commented Nov 26, 2018 at 15:43
  • Maybe consider using seaborn.boxplot, as you can group the boxes on a categorical variable. I would keep the data in a pd.DataFrame (change the dict into a DataFrame?), then use pd.melt to create long-form data and plot with sns.boxplot(x="variable", y="value", data=df).
    – Alex
    Commented Nov 26, 2018 at 15:54

1 Answer


I agree with @Alex that forming your columns into a new DataFrame and then plotting from that would be a good approach. However, if you're going to use the dict, then it should look something like the code below. Dictionaries are only guaranteed to keep insertion order from Python 3.7 onwards, so if the ordering on the plot is important to you, you might want to create a list of dictionary keys in the order you want and iterate over that instead.

import matplotlib.pyplot as plt
import numpy as np

# colours = []  # list of colours here, if you want
# markers = []  # list of markers here, if you want
fig, ax = plt.subplots()
for idx, k in enumerate(d_psppm, 1):
    data = d_psppm[k]
    # Spread the points horizontally around x = idx so they don't overlap
    jitter = np.random.normal(0, 0.1, data.shape[0]) + idx
    ax.scatter(jitter,
               data,
               s=25,  # size of the marker
               c="r",  # colour, could be from colours
               alpha=0.35,  # opacity, 1 being solid
               marker="^",  # or ref. to markers, e.g. markers[idx - 1]
               edgecolors="none"  # removes black border
              )
# Label each cluster of points with its dictionary key
ax.set_xticks(range(1, len(d_psppm) + 1))
ax.set_xticklabels(list(d_psppm))
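
If you do want the boxes themselves, the same dictionary can be passed to a single ax.boxplot call. This is only a rough sketch: the d_psppm built here is made-up random data standing in for your sheets, the key names PSPPM0, PSPPM1, ... just follow the pattern from the question, and notch=True reflects the suggestion in the comments rather than anything required.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for d_psppm: one single-column DataFrame per sheet,
# each with a different number of rows (random values, not real data).
rng = np.random.default_rng(0)
d_psppm = {
    f"PSPPM{i}": pd.DataFrame({"PSPPM": rng.uniform(0, 1, size=5 + i)})
    for i in range(4)
}

# Spell out the plotting order instead of relying on dictionary order.
# (A plain sorted() would put "PSPPM10" before "PSPPM2".)
keys = [f"PSPPM{i}" for i in range(len(d_psppm))]

# One 1-D array per box; drop NaN first because boxplot does not skip it.
samples = [d_psppm[k]["PSPPM"].dropna().to_numpy() for k in keys]

fig, ax = plt.subplots()
result = ax.boxplot(samples, notch=True)  # boxes are drawn at x = 1, 2, ...

# Label each box with its dictionary key.
ax.set_xticks(range(1, len(keys) + 1))
ax.set_xticklabels(keys, rotation=45)
ax.set_ylabel("PSPPM")
```

In a notebook with %matplotlib inline the figure appears on its own; in a plain script you would add plt.show() at the end. With very few rows per sheet the notches can look odd, so it may be worth comparing against notch=False.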

As per Alex's suggestion, you could also use the data to create a seaborn boxplot and overlay a swarmplot to show the individual points. Whether this is practical depends on how many rows each sheet has.
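
A minimal sketch of that seaborn route, under a few assumptions: the frames dict below is made-up random data in place of your sheets, to_long_form and plot_box_swarm are hypothetical helper names, and the reshaping uses pd.concat rather than the pd.melt step Alex mentions (both end up with the same long-form layout, one row per measurement with a column saying which sheet it came from).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def to_long_form(frames, value_col="PSPPM"):
    """Stack a dict of per-sheet DataFrames into one long-form DataFrame.

    The result has two columns: "sheet" (the dict key) and value_col.
    """
    stacked = pd.concat(
        {name: df[[value_col]] for name, df in frames.items()},
        names=["sheet", "row"],  # names for the two index levels
    )
    return stacked.reset_index().drop(columns="row").dropna()


def plot_box_swarm(long_df, value_col="PSPPM"):
    """Draw one box per sheet and overlay the individual points."""
    # Imported here so the reshaping above works without seaborn installed.
    import seaborn as sns

    fig, ax = plt.subplots()
    sns.boxplot(data=long_df, x="sheet", y=value_col, color="lightgrey", ax=ax)
    sns.swarmplot(data=long_df, x="sheet", y=value_col, color="k", size=3, ax=ax)
    return ax


# Made-up stand-in for the per-sheet data (random values, not real data).
rng = np.random.default_rng(0)
frames = {
    f"PSPPM{i}": pd.DataFrame({"PSPPM": rng.uniform(0, 1, size=5 + i)})
    for i in range(3)
}
long_df = to_long_form(frames)
```

With your own data you would call plot_box_swarm(to_long_form(d_psppm)). If a sheet has a lot of rows, the swarm gets crowded and seaborn warns that some points cannot be placed; sns.stripplot with jitter=True is a common fallback in that case.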

  • Thank you, both Alex and Andrew! I was able to get things working and now my data is looking pretty good :)! Commented Nov 27, 2018 at 16:38
