
I am trying to read Excel files into pandas from the following URLs:

url1 = 'https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls'

url2 = 'https://cib.societegenerale.com/fileadmin/indices_feeds/STTI_Historical.xls'

using the code:

pd.read_excel(url1)

However, it doesn't work and I get the error:

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '2000/01/'

After searching on Google, it seems that .xls files offered through URLs are sometimes actually stored in a different format behind the scenes, such as HTML or XML.

When I manually download the file and open it in Excel, I get this error message: "The file format and extension don't match. The file could be corrupted or unsafe. Unless you trust its source, don't open it."

When I do open it, it looks just like a normal Excel file.

I came across a post online that suggested opening the file in a text editor to see whether it reveals anything about the real file format, but I don't see any such information when I open it in Notepad++.

Could someone please help me read this "xls" file into a pandas DataFrame properly?

2 Answers


It seems you can use read_csv:

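import pandas as pd

# The ".xls" file is really tab-separated text, so read it as delimited text.
df = pd.read_csv('https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls',
                 sep='\t',
                 parse_dates=[0],                   # first column holds dates
                 names=['a','b','c','d','e','f'])   # placeholder column names
print(df)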

Then I check whether the last column, f, contains anything other than NaN:

print(df[df.f.notnull()])

Empty DataFrame
Columns: [a, b, c, d, e, f]
Index: []

Column f contains only NaN, so you can drop it while reading by listing the columns to keep in the usecols parameter:

import pandas as pd

df = pd.read_csv('https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls',
                 sep='\t',
                 parse_dates=[0],
                 names=['a','b','c','d','e','f'],
                 usecols=['a','b','c','d','e'])     # skip the all-NaN column f
print(df)
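
One possible reason column f is entirely NaN is that each line in the file ends with a trailing tab, which a tab-delimited parser reads as one extra empty field. That is an assumption about this particular file, not something confirmed above. The sketch below uses only the standard library and a made-up two-line sample (the values are invented purely for illustration) to show the effect:

```python
import csv
import io

# Invented sample: five values per line, each line ending with a trailing tab.
sample = (
    "2000/01/03\t1.0\t2.0\t3.0\t4.0\t\n"
    "2000/01/04\t1.1\t2.1\t3.1\t4.1\t\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# The trailing tab produces a sixth field in every row, and it is always empty.
field_counts = {len(row) for row in rows}
last_fields = {row[-1] for row in rows}
print(field_counts, last_fields)
```

If the real file behaves the same way, that would explain why six names are needed and why the sixth column carries no data.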
  • ah that's brilliant thanks! That worked perfectly! - did you just know read_csv would work or was there some way to tell?
    – s666
    Commented May 15, 2016 at 20:12
  • First, Excel showed a warning when I opened the file from the URL. Then I checked the file in Notepad++ and it looked like CSV, so I used read_csv instead and it works very nicely. Good luck!
    – jezrael
    Commented May 15, 2016 at 20:20
  • Thanks for the info - I opened it using notepad++ too to try to look but where did you see the additional information that it was csv? I just saw the text data contained within.
    – s666
    Commented May 15, 2016 at 20:22
  • Sorry, it is plain text, not CSV. But read_csv very often reads well-structured text files nicely. Thank you for accepting.
    – jezrael
    Commented May 15, 2016 at 20:24
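
For later readers wondering how to tell without eyeballing the file in an editor: a genuine legacy .xls file is an OLE2 compound document and starts with the bytes D0 CF 11 E0 A1 B1 1A E1, while an .xlsx file is a ZIP archive and starts with PK\x03\x04. Anything else behind an .xls extension is something other than a real workbook. Below is a minimal sketch of that check. The function names sniff_format and sniff_url and the returned labels are my own, and the text heuristics (looking for a tab or a comma) are rough assumptions, not a reliable detector.

```python
from urllib.request import urlopen

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls container
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx is a ZIP archive


def sniff_format(head: bytes) -> str:
    """Guess the real format from the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    stripped = head.lstrip().lower()
    if stripped.startswith((b"<html", b"<!doctype", b"<?xml", b"<table")):
        return "markup"
    # Rough heuristics for delimited text; tabs are checked before commas.
    if b"\t" in head:
        return "tab-delimited text"
    if b"," in head:
        return "comma-delimited text"
    return "unknown"


def sniff_url(url: str, nbytes: int = 512) -> str:
    """Download only the first few bytes of a URL and classify them."""
    with urlopen(url) as response:
        return sniff_format(response.read(nbytes))
```

Calling sniff_url(url1) on a file like the one in the question would then suggest whether to reach for read_excel, read_html, or read_csv with an explicit sep.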

In case this helps someone: you can read a Google Drive file directly by URL into pandas without any login. I tried it in Google Colab and it worked.

  • Upload an Excel file to Google Drive, or use one that is already uploaded
  • Share the file with "Anyone with the link" (I don't know whether view-only access works; I tried with full access)
  • Copy the link

You will get something like this.

share url: https://drive.google.com/file/d/---some--long--string/view?usp=sharing

Get the download URL by attempting to download the file (copy the URL from there).

It will look something like this (it contains the same Google file ID as above):

download url: https://drive.google.com/u/0/uc?id=---some--long--string&export=download

Now go to Google Colab and paste the following code:

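import pandas as pd

# Share link (kept for reference only; it is not used below)
fileurl   = r'https://drive.google.com/file/d/---some--long--string/view?usp=sharing'
# Direct-download link with the same file ID; this is the one pandas can read
filedlurl = r'https://drive.google.com/u/0/uc?id=---some--long--string&export=download'

df = pd.read_excel(filedlurl)
df

That's it; the file is now in df.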
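
Since the share URL and the download URL carry the same file ID, the download URL can also be built from the share URL instead of being copied by hand. A minimal sketch, assuming the two URL shapes shown above stay as they are; the helper name drive_download_url and the placeholder ID are hypothetical:

```python
import re


def drive_download_url(share_url: str) -> str:
    """Build a direct-download URL from a Drive '/file/d/<id>/...' share link."""
    match = re.search(r"/file/d/([^/?]+)", share_url)
    if match is None:
        raise ValueError("no Google Drive file ID found in URL")
    file_id = match.group(1)
    # Same pattern as the download URL shown in this answer.
    return f"https://drive.google.com/u/0/uc?id={file_id}&export=download"


# Placeholder ID for illustration only.
share = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
download = drive_download_url(share)
print(download)
```

The result can be passed to pd.read_excel in place of filedlurl.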
