For the last few days I have been studying web scraping, and as a further challenge I set out to turn the script I wrote into a class that receives input from a user.
The main reason I'm posting is that I would like advice and insights on how I should approach and manipulate the instance variables of a given object.
I had quite a hard time figuring out how to start, or how to initiate the flow of the program. I will briefly describe what the scraping script does so you can judge whether what I did in the later, class-based script makes sense.
Here's the scraper.
It scrapes all the job vacancies for a given state or city.
The program fetches the first results page for a given state/city URL and works out three main pieces of information: the total number of jobs, how many jobs each page has, and, from those two, the last page.
It then uses this information to build a loop that keeps scraping until it reaches the last page. I put the loop inside the CSV.open block so that the data is saved to a file as it is scraped.
require 'nokogiri'
require 'rest-client'
require 'csv'

url = 'https://www.infojobs.com.br/empregos-em-santa-catarina.aspx'

begin
  html = RestClient.get(url)
rescue => e
  # Without the first page nothing else can be computed, so stop here.
  abort "ERROR: #{e.message}"
end

parsed_html = Nokogiri::HTML(html)
# The counter uses dots as thousands separators, so remove them.
total = parsed_html.css('.js_xiticounter').text.gsub('.', '')
page = 1
per_page = parsed_html.css('.element-vaga').count
last_page = (total.to_f / per_page).round

CSV.open('data_infojobsSC.csv', 'w') do |csv|
  csv << ['Title', 'Company', 'City', 'Area']
  until page >= last_page
    current_url = "https://www.infojobs.com.br/empregos-em-santa-catarina.aspx?Page=#{page}"
    begin
      current_html = RestClient.get(current_url)
    rescue => e
      puts "ERROR: #{current_url}"
      puts "Exception Message: #{e.message}"
      page += 1 # move on, otherwise the same failing page is retried forever
      next
    end
    parsed_current_html = Nokogiri::HTML(current_html)
    jobs = parsed_current_html.css('.element-vaga')
    jobs.each do |job|
      title = job.css('div.vaga > a > h2').text.strip
      company = job.css('div.vaga-company > a').text.strip
      company = company.empty? ? "Confidential Company" : company
      city = job.css('p.location2 > span > span > span').text
      area = job.css('p.area > span').text
      csv << [title, company, city, area]
    end
    puts "Records from page #{page} saved successfully."
    page += 1
  end
end
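The page-count step in the script above can be tried in isolation, without any network access. This is only a minimal sketch: the helper name `last_page_for` and the sample numbers are made up for illustration, and it uses `ceil` instead of the script's `round`, on the assumption that a partially filled final page should still be counted.

```ruby
# Hypothetical helper (not part of the script above): computes how many
# result pages exist from the counter text and the number of jobs per page.
def last_page_for(total_text, per_page)
  total = total_text.delete('.').to_i # "1.205" -> 1205 (dots are thousands separators)
  return 0 if per_page.zero?          # avoid dividing by zero on an empty page

  (total.to_f / per_page).ceil        # round up so a partial last page is included
end

puts last_page_for('1.205', 20)   # 60 full pages plus one page with 5 leftover jobs
puts((1205.to_f / 20).round)      # rounding to nearest gives one page fewer here
```

If the bound is computed this way, a loop would presumably need to run while `page <= last_page` in order to visit the final page as well.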
Here's my OOP code. This is the third version of the first working code; each time I tried to make it more modular. This is my first object-oriented Ruby code. It was really hard for me to grasp how the class itself would work, because all the classes I had written before were as simple as the Dog class with a bark method from beginner videos...
require 'nokogiri'
require 'rest-client'
require 'csv'

class InfoJobs # this is the name of the website
  attr_accessor :url, :parsed_html, :total, :per_page, :last_page, :list
  attr_reader :city, :state

  def city=(city)
    @city = city.chomp.downcase.gsub(' ', '-')
  end

  def state=(state)
    @state = state.chomp.downcase
  end

  def build_url
    @url = 'https://www.infojobs.com.br/empregos-em-' + @city + ',-' + @state + '.aspx'
  end

  # Since I need to parse many URLs, I decided to make a method for it.
  def parsing(url)
    begin
      html = RestClient.get(url)
    rescue => e
      puts "ERROR on #{url}"
      puts "Exception Class: #{e.class.name}"
      puts "Exception Message: #{e.message}"
    end
    @parsed_html = Nokogiri::HTML(html)
  end

  # Gets the initial values needed to know how many pages to scrape;
  # these values are then used to build the loop.
  def get_page_values
    self.parsing(@url)
    @total = @parsed_html.css('.js_xiticounter').text.gsub('.', '')
    @per_page = @parsed_html.css('.element-vaga').count
    @last_page = (@total.to_f / @per_page).round
  end

  def scraping
    @list = []
    page = 1
    @url = @url + "?Page="
    until page >= @last_page
      # At this point the instance variable @url is no longer the first-page URL
      # that get_page_values used. Is it OK to reuse the same instance variable
      # this way, or is it better practice to create another one called url_pagination?
      jobs = self.parsing(@url + page.to_s).css('.element-vaga')
      jobs.each do |job|
        company = job.css('div.vaga-company > a').text.strip
        company = company.empty? ? "Confidential Company" : company
        data = {
          title: job.css('div.vaga > a > h2').text.strip,
          company: company,
          city: job.css('p.location2 > span > span > span').text,
          area: job.css('p.area > span').text
        }
        @list << data
      end
      puts "The page #{page} was successfully saved."
      page += 1
    end
  end

  def writing
    # headers: lets each hash be appended directly, in the order of its keys.
    CSV.open("list_jobs.csv", "wb", headers: list.first.keys) do |csv|
      csv << ['Title', 'Company', 'City', 'Area']
      list.each do |hash|
        csv << hash
      end
    end
  end

  def run
    self.build_url
    self.parsing(@url)
    self.get_page_values
    self.scraping
    self.writing
  end
end

test = InfoJobs.new
puts "[ Type a city ] "
test.city = gets
puts "[ Type the state ]"
test.state = gets
puts "Processing..."
test.run
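The writing step of the class can also be tried on its own. The sketch below is an illustration under assumptions rather than the class's actual method: it uses `CSV.generate` to build a string instead of a file, the sample rows are invented, and it relies on the `headers:` and `write_headers:` options so that the header row comes from the hash keys.

```ruby
require 'csv'

# Sample rows shaped like the hashes collected in @list (values are invented).
list = [
  { title: 'Developer', company: 'Confidential Company', city: 'Joinville', area: 'IT' },
  { title: 'Analyst',   company: 'Acme',                 city: 'Blumenau',  area: 'Finance' }
]

# headers: fixes the column order; write_headers: true emits the header row,
# so each hash can be appended directly and its values follow that order.
output = CSV.generate(headers: list.first.keys, write_headers: true) do |csv|
  list.each { |row| csv << row }
end

puts output
```

Here the header row uses the raw hash keys; a separate array of display names could be written first if capitalised titles are preferred.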
The user input sets a state and a city, and the URL is then built from these values.
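As a rough illustration of that step, the input clean-up and URL building can be expressed with a constructor instead of separate setter calls. This is a sketch only: the class name `JobSearch` is hypothetical, and the URL pattern is simply the one the class above assembles.

```ruby
# Hypothetical class, shown only to illustrate setting instance variables
# once in initialize rather than through separate setter calls.
class JobSearch
  BASE = 'https://www.infojobs.com.br/empregos-em-'

  attr_reader :city, :state

  def initialize(city, state)
    @city  = city.strip.downcase.tr(' ', '-') # "Sao Jose\n" -> "sao-jose"
    @state = state.strip.downcase             # "SC\n" -> "sc"
  end

  # Derived from city and state on demand, so it cannot go stale.
  def url
    "#{BASE}#{city},-#{state}.aspx"
  end
end

search = JobSearch.new("Sao Jose\n", "SC\n")
puts search.url
```

Because `url` is computed rather than stored, a paginated URL could be built from it in a local variable without overwriting any instance state.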
Just a note: before this version, I called each method on the object myself, like this (below). In the code above I added a 'run' method so that the object calls its own methods in order. I really don't know whether this is a correct approach or not...
test = InfoJobs.new
puts "[ Type a city ] "
test.city = gets
puts "[ Type the state ]"
test.state = gets
puts "Processing..."
test.build_url
test.get_page_values
test.scraping
test.writing
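The two styles described above, calling each step from outside versus exposing a single 'run' method, can be compared on a toy class. This is a sketch under assumptions: `Pipeline` is a hypothetical stand-in, and its steps only record that they ran instead of doing any scraping.

```ruby
# Hypothetical stand-in for the scraper: each step just logs its name.
class Pipeline
  attr_reader :log

  def initialize
    @log = []
  end

  # The only public entry point; callers cannot run the steps out of order.
  def run
    build_url
    get_page_values
    scraping
    writing
    self
  end

  private

  def build_url
    @log << :build_url
  end

  def get_page_values
    @log << :get_page_values
  end

  def scraping
    @log << :scraping
  end

  def writing
    @log << :writing
  end
end

pipeline = Pipeline.new.run
p pipeline.log
```

Making the steps private is one design choice among several; it trades the flexibility of calling steps individually for a guaranteed call order.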
NOTE: Each version of the code runs perfectly fine.