
Outline:

This code uses the Split function to extract specific information from the following website: https://www.webscraper.io/test-sites/tables.

The required information is the four tables visible on the page, with headers "#", "First Name", "Last Name" and "Username". I am extracting the information within these into four dataframes.


Example table:

(screenshot of table 1 from the page)


Description:

I use the requests library to make the GET request, and split the response text on "table table-bordered" to generate my individual table chunks.
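Splitting on a marker string leaves everything before its first occurrence in element 0, which is why the first chunk (the page preamble) gets discarded. A minimal illustration with an invented HTML fragment:

```python
# Invented stand-in for the real page source
html = ('<head></head>'
        '<table class="table table-bordered">A</table>'
        '<table class="table table-bordered">B</table>')

chunks = html.split('table table-bordered')
# chunks[0] is the preamble; chunks[1:] hold one chunk per table
print(len(chunks))
```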

There is a fair amount of annoyingly fiddly indexing to get just the info I want. The tutorial I am following requires the use of the split function, rather than something far more logical, to my mind, such as Beautiful Soup, where I could simply apply CSS selectors and grab what I want. That approach would also be less fragile.
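For comparison, the Beautiful Soup approach mentioned above might look like the sketch below. It is not part of the tutorial, assumes bs4 is installed, and uses an invented fragment in place of the real page:

```python
from bs4 import BeautifulSoup

# Invented fragment standing in for the real page markup
html = '''<table class="table table-bordered">
<tbody><tr><td>1</td><td>Mark</td><td>Otto</td><td>@mdo</td></tr></tbody>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
# CSS selectors pick out exactly the cells wanted, no manual indexing
rows = [[td.text for td in tr.select('td')]
        for tr in soup.select('table.table-bordered tbody tr')]
print(rows)
```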

I have written a function, GetTable, to parse the required information from each chunk and return a dataframe. Note that the split delimiter differs between table 1 and tables 2-4.
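The dataframe step boils down to reshaping the flat list of cell values into rows; a minimal illustration with made-up values:

```python
import numpy as np
import pandas as pd

headers = ["#", "First Name", "Last Name", "User Name"]
values = ["1", "Mark", "Otto", "@mdo",
          "2", "Jacob", "Thornton", "@fat"]  # invented sample cells

rows = len(values) // len(headers)  # integer division avoids the int(...) cast
df = pd.DataFrame(np.array(values).reshape(rows, len(headers)), columns=headers)
print(df.shape)
```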

There isn't an awful lot of code but I would appreciate any pointers on improving the code I have written.

I am running this from Spyder 3.2.8 with Python 3.6.


Code:

def GetTable(tableChunk):
    # Keep only the markup between the opening tbody and the next <table
    split1 = tableChunk.split('tbody')[1]
    split2 = split1.split('<table')[0]
    values = []

    # Tables 2-4: cells are separated by '>\n\t\t\t\t<'
    aList = split2.split('>\n\t\t\t\t<')
    if len(aList) != 1:
        for item in aList[1:]:
            values.append(item.split('</')[0].split('d>'[1])[1])  # 'd>'[1] is just '>'
    else:
        # Table 1: fall back to splitting on the closing cell tag
        aList = split2.split('</td')
        for item in aList[:-1]:
            values.append(item.split('td>')[1])

    headers = ["#", "First Name", "Last Name", "User Name"]
    numberOfColumns = len(headers)
    numberOfRows = int(len(values) / numberOfColumns)

    df = pd.DataFrame(np.array(values).reshape(numberOfRows, numberOfColumns), columns=headers)
    return df

import requests as req
import pandas as pd
import numpy as np

url = "http://webscraper.io/test-sites/tables"
response = req.get(url)
htmlText = response.text

tableChunks = htmlText.split('table table-bordered')

for tableChunk in tableChunks[1:]:
    print(GetTable(tableChunk))
    print('\n')
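The split chain can be sanity-checked without hitting the network by feeding it a synthetic chunk. The fragment below is invented to mimic the markup shape of tables 2-4, and `'>'` is written directly in place of the equivalent `'d>'[1]`:

```python
# Synthetic fragment mimicking the markup shape of tables 2-4
chunk = ('tbody>\n\t\t\t\t<td>1</td>\n\t\t\t\t<td>Mark</td>'
         '\n\t\t\t\t<td>Otto</td>\n\t\t\t\t<td>@mdo</td>\n\t\t\t</tbody><table')

split2 = chunk.split('tbody')[1].split('<table')[0]
values = []
for item in split2.split('>\n\t\t\t\t<')[1:]:
    # e.g. 'td>1</td' -> 'td>1' -> '1'
    values.append(item.split('</')[0].split('>')[1])
print(values)
```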

