I'm trying to download and manipulate an xls file using urllib and xlrd.
The data comes from this URL: http://profiles.doe.mass.edu/search/search_export.aspx?orgCode=&orgType=5,12&runOrgSearch=Y&searchType=ORG&leftNavId=11238&showEmail=N
I'm using Python 2.7, xlrd 0.9.4, urllib 1.17, and I'm on a Mac.
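I can download the file successfully with this code:
import urllib
saveLocation = home_dir + "/test/"  # home_dir is defined elsewhere
fileName = "data.xls"
page = "http://profiles.doe.mass.edu/search/search_export.aspx?orgCode=&orgType=5,12&runOrgSearch=Y&searchType=ORG&leftNavId=11238&showEmail=N"
urllib.urlretrieve(page, saveLocation + fileName)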
I then try to open the file with xlrd:
import xlrd
wb = xlrd.open_workbook(saveLocation + fileName)
but I get this error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\r\n\r\n<htm'
This tells me that the file is not downloading as a true xls file. I can open the file in Excel and get no popup warnings or compatibility errors. Oddly enough, if I then save the file (in Excel) as Excel 97-2004, the xlrd error goes away. So it appears that Excel "fixes" whatever was wrong with the file.
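One way to check what was actually downloaded is to look at the first bytes of the file. The sketch below is only an illustration: `sniff_format` and `sniff_file` are hypothetical helper names, not part of xlrd or urllib. It relies on the fact that a legacy .xls file starts with the OLE2 compound-file signature and an .xlsx file starts with a zip signature, while the error above shows the download starting with `<htm`.

```python
# Signatures found at the very start of genuine Excel files.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx (zip archive)


def sniff_format(head):
    """Guess the file format from its leading bytes (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    lowered = head.lower()
    if b"<htm" in lowered or b"<table" in lowered:
        return "html"
    return "unknown"


def sniff_file(path, size=512):
    """Read the first `size` bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(size))
```

Calling `sniff_file(saveLocation + fileName)` on the downloaded file would then report which kind of reader to use, assuming the file starts the way the error message indicates.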
So my question is: how do I "fix" the file in Python, or download the data in a format that xlrd will recognize?
I've also tried downloading the file as an xlsx file and using openpyxl, but I get a similar error: openpyxl says it's not a valid zip file. I've also tried downloading the data with other methods, such as requests.
Thanks.
EDIT: Using the information provided by @DSM, I was able to download and use the Excel file. Here's the code I used:
import pandas as pd
# The downloaded file is parsed as HTML; take the first table
dfs = pd.read_html(saveLocation + fileName, index_col=7, header=0)[0]
# Overwrite the download with a real Excel workbook
writer = pd.ExcelWriter(saveLocation + fileName)
dfs.to_excel(writer, "Sheet1")
writer.save()
I was then able to access the file as a true Excel file:
ws = pd.read_excel(saveLocation + fileName, 0)
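For anyone without pandas, the same idea (treat the download as an HTML table rather than a workbook) can be sketched with only the standard library. This is not the code I used, just a rough alternative: `TableRows` and `html_table_rows` are hypothetical names, and the sample markup is made up for illustration rather than taken from the real export.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_rows(markup):
    """Return all table rows found in `markup` as lists of strings."""
    parser = TableRows()
    parser.feed(markup)
    parser.close()
    return parser.rows


# Made-up sample markup, standing in for the downloaded file's contents.
sample = ("<html><table><tr><th>Org Name</th><th>Org Code</th></tr>"
          "<tr><td>Example School</td><td>0001</td></tr></table></html>")
rows = html_table_rows(sample)
```

The resulting list of rows could then be written out with the csv module or any Excel writer; nested tables and malformed markup are not handled in this sketch.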