2

I'm trying to download and manipulate an xls file using urllib and xlrd.

The data is coming from url http://profiles.doe.mass.edu/search/search_export.aspx?orgCode=&orgType=5,12&runOrgSearch=Y&searchType=ORG&leftNavId=11238&showEmail=N

I'm using Python 2.7, xlrd 0.9.4, urllib 1.17, and I'm on a Mac.

I'm able to successfully download the file using this code.

saveLocation = home_dir+"/test/"
fileName = "data.xls"
page = <the url given above>
urllib.urlretrieve(page, saveLocation+fileName)

I then try to open the file using xlrd

wb = xlrd.open_workbook(saveLocation+fileName)

But get the error

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\r\n\r\n<htm' 

This tells me that the file is not downloading as a true xls file. I can open the file in Excel and get no popup warnings or compatibility errors. Oddly enough, if I then save the file (in Excel) as Excel 97-2004, the xlrd error goes away. So it appears that Excel "fixes" whatever was wrong with the file.

So my question is, how do I "fix" the file in python or download the data in an appropriate format that xlrd will recognize?

I've also tried downloading the file as an xlsx file and using openpyxl but get a similar error. openpyxl says its not a valid zip file. I've also tried downloading the data using different methods such as requests.

Thanks.

EDIT: Using the information provided by @DSM, I was able to download and use the Excel file. Here's the code I used.

dfs = pd.read_html(fileLocation+fileName, index_col = 7, header=0)[0]
writer = pd.ExcelWriter(fileLocation+fileName)
dfs.to_excel(writer,"Sheet1")
writer.save()

I was then able to access the file as a true Excel file

ws = pd.read_excel(fileLocation+fileName, 0) 
6
  • On Windows, Excel alerts me thusly: imgur.com/YkcuAWu IME this is common for files that are created by a non-Excel application (e.g., writing a buffer to a text file and then naming it with an xls extension, etc). The problem is usually fixed by opening the file in Excel application, save/saveAs, and closing the file. Commented Nov 2, 2015 at 21:00
  • Regarding openpyxl, if it is a .xls format, it is not a ZIP file (xlsm, xlsx, xlsb, etc., are all ZIP archives -- the older .xls extension is not). Commented Nov 2, 2015 at 21:01
  • I do not see any such warning on Mac. Yes, as I mentioned in my description, if I open the file in Excel and re-save it, the problem goes away. So the question still applies, what do I need to do to fix this programmatically. No I did not try an xls file with openpyxl. As you'll see above, I said I tried downloading it as an xlsx file, which is a zip format. So openpyxl complained that it was not a valid zip format.
    – Adrian
    Commented Nov 2, 2015 at 23:47
  • Oh I believe you're not getting the warning on the Mac, and my comment was not meant to be condescending, only an observation tht there is something funky about the origin of this file which is at fault here (coupled with the fact that Apple has neutered a lot of Excel and especially VBA functionality). Commented Nov 3, 2015 at 0:18
  • what do I need to do to fix this programmatically Not sure if either xlrd or openpyxl have any way of "fixing" the file -- these are both lightweight modules not intended to mimic Excel in full functionality, instead they are primarily intended for reading data from Excel files without the overhead of the Excel application. Commented Nov 3, 2015 at 0:20

1 Answer 1

2

As the <htm bit should hint, this is really the data presented in xml fashion, despite the name .xls. (It's almost always worth manually looking at the data header in your favourite editor to check to see what something actually is when it turns out to be hard to read.) Sometimes this can be a real nuisance to deal with, but fortunately here we can read it simply by using read_html:

>>> url="http://profiles.doe.mass.edu/search/search_export.aspx?orgCode=&orgType=5,12&runOrgSearch=Y&searchType=ORG&leftNavId=11238&showEmail=N"
>>> dfs = pd.read_html(url)
>>> len(dfs)
1
>>> dfs[0].iloc[:5,:5]
                                                   0         1  \
0                                           Org Name  Org Code   
1       Abby Kelley Foster Charter Public (District)  04450000   
2                                           Abington  00010000   
3  Academy Of the Pacific Rim Charter Public (Dis...  04120000   
4                                     Acton (non-op)  00020000   

                        2                      3              4  
0                Org Type               Function   Contact Name  
1        Charter District  Charter School Leader     Brian Haas  
2  Public School District         Superintendent  Peter Schafer  
3        Charter District  Charter School Leader  Chris Collins  
4  Public School District         Superintendent    Glenn Brand  

Looking more closely, we see that we can use the 0th row for the headers, and so:

>>> df = pd.read_html(url, header=0)[0]
>>> df.iloc[:5, :5]
                                            Org Name  Org Code  \
0       Abby Kelley Foster Charter Public (District)   4450000   
1                                           Abington     10000   
2  Academy Of the Pacific Rim Charter Public (Dis...   4120000   
3                                     Acton (non-op)     20000   
4                                   Acton-Boxborough   6000000   

                 Org Type               Function   Contact Name  
0        Charter District  Charter School Leader     Brian Haas  
1  Public School District         Superintendent  Peter Schafer  
2        Charter District  Charter School Leader  Chris Collins  
3  Public School District         Superintendent    Glenn Brand  
4  Public School District         Superintendent    Glenn Brand  

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.