Tentative to transform this scrape script into a class object program

Question

From the last few days I have been studying scraping webpages and for further development I challenge myself to transform the script I have created into a class which will receive inputs from a user.

The main reason I'm posting this is because I would like advice and insights on how should I approach and manipulate instance variables from given object.

I had quite the hard time to figure it out how should I start or initiate the flow of the program. I will give you briefly what the scrape script does and see for yourself if I make sense I have done in the later script.

Here's the scraper.

It's to scrape all the jobs vacancies from given State or city.

The program receive the first page of a given url from a State/city and extract 3 main information. The last page, the total of jobs and how many jobs it has each page.

Then it use this information to build a loop to keep scraping until it reaches the last page. I put the loop inside of CSV function in order to save the data into a file at the same time.

require 'nokogiri'
require 'rest-client'
require 'csv'

url = 'https://www.infojobs.com.br/empregos-em-santa-catarina.aspx'
begin
    html = RestClient.get(url)
rescue => e
    puts "ERROR:#{e.message}"
    next
parsed_html = Nokogiri::HTML(html)
total = parsed_html.css('.js_xiticounter').text.gsub('.', '')
page = 1
per_pagina = parsed_html.css('.element-vaga').count
last_page = (total.to_f / per_page).round
CSV.open('data_infojobsSC.csv', 'w') do |csv|
    csv << ['Title', 'Company', 'City', 'Área']
    until page >= last_page
        current_url = "https://www.infojobs.com.br/empregos-em-santa-catarina.aspx?Page=#{page}"
        begin
            current_html = RestClient.get(current_url)
        rescue => e
            puts "ERROR: #{current_url}"
            puts "Exception Message:#{e.message}"
            next
        end
        parsed_current_html = Nokogiri::HTML(current_html)
        jobs = parsed_current_html.css('.element-vaga')
        jobs.each do |job|
            title = job.css('div.vaga > a > h2').text.strip()
            company = job.css('div.vaga-company > a').text.strip()
            company = job.empty? ? "Confidential Company" : company
            city = job.css('p.location2 > span > span > span').text
            area = job.css('p.area > span').text
            csv << [title, company, city, area]
        end
        puts "Registrados da página #{page} salvos com sucesso."
        page+=1
    end
end

Here's my POO code. This is the third version of the first functional code. Each time I try to make more modular. This is my first POO ruby code. It was really hard to me grasp how the class itself would work because all the previous classes I had written were simple as a dog class and bark function from beginners videos...

require 'nokogiri'
require 'rest-client'
require 'csv'

class InfoJobs #this is the name of the website

    attr_accessor :url, :parsed_html, :total, :per_page, :last_page, :list
    attr_reader :city, :state
    
    def city=(city)
        @city = city.chomp.downcase.gsub(' ', '-')
    end
    
    def state=(state)
        @state = state.chomp.downcase
    end

    def build_url
        @url = 'https://www.infojobs.com.br/empregos-em-' + @city + ',-' + @state + '.aspx'
    end

    def parsing(url) #since I need to parse many urls I decided to make a function
        begin
            html = RestClient.get(url)
        rescue => e
            puts "ERROR on #{url}"
            puts "Exception Class:#{e.class.name}"
            puts "Exception Message:#{e.message}"
        end
        @parsed_html = Nokogiri::HTML(html)
    end

    def get_page_values #to the initial values to know how many pages to scrape and use these values to build a loop
        self.parsing(@url)
        @total = @parsed_html.css('.js_xiticounter').text.gsub('.', '')
        @per_page = @parsed_html.css('.element-vaga').count
        @last_page = (@total.to_f / @per_page).round
    end

    def scraping
        @list = []
        page = 1
        @url = @url + "?Page=" 
        until page >= @last_page
            @url + page.to_s
            jobs = self.parsing(@url).css('.element-vaga') #in this part of program, the instance variable @url is no longer that previosly page we sent to get_page_values method. Is ok to use in this way the same instance variable ? or the best practice is to create another instance variable called url_pagination?
            jobs.each do |job|
                company = job.css('div.vaga-company > a').text.strip()
                company = company.empty? ? "Confidential Company" : company
                data = {
                    title: job.css('div.vaga > a > h2').text.strip(),
                    company: company,
                    city: job.css('p.location2 > span > span > span').text,
                    area: job.css('p.area > span').text
                }
                @list << data
                puts "The page #{page} was successfully saved."
                page+=1
            end
        end
    end

    def writing
        CSV.open("list_jobs.csv", "wb", {headers: list.first.keys} ) do |csv|
            csv << ['Title', 'Company', 'City', 'Area']
            list.each do |hash|
              csv << hash
            end
        end
    end

    def run
        self.build_url
        self.parsing(@url)
        self.get_page_values
        self.scraping
        self.writing
    end

end

test = InfoJobs.new
puts "[ Type a city ] "
test.city = gets
puts "[ Type the state ]"
teste.state = gets
puts "Processing..."
test.run

The user input is to set a State and a city. And then build the url with these values.

Just a note...Previously to call each class method I had something like this (below). In the code above I made a method 'run' to keep the object calling the methods inside the class. I really don't know if this is a correct approach or not...

teste = InfoJobs.new
test = InfoJobs.new
puts "[ Type a city ] "
test.city = gets
puts "[ Type the state ]"
teste.state = gets
puts "Processing..."
teste.build_url
teste.get_page_values
teste.scraping
teste.writing

NOTE: Each code runs perfectly fine.

Alter Lagos · Accepted Answer · 2020-02-27 20:29:03Z

In general, your OOP approach looks good to me.
Please check the comments to know more about the changes/details I found:

require 'nokogiri'
require 'rest-client'
require 'csv'

# Just as a general reference, I changed all the code's indentation from 4 to 2
# spaces as it's the most used way, also suggested by the ruby style guide
# https://rubystyle.guide/#spaces-indentation , but it's up to you to use 4 if
# you feel more comfortable with it.
class InfoJobs
  # If you're not really reading/writing object attributes when you create an
  # instance, like:
  #
  #   test = InfoJobs.new(city, state)
  #   test.total = 12345
  #   puts test.parsed_html
  #
  # there's no need to expose them for the object's instance. That's why I
  # remove all `attr_accessor`s. I have the feeling `city` and `state` are the
  # only attributes should be accessible, but only read access. Another option
  # would be to declare attributes as private `attr_reader`s, but I leave that
  # to you.
  attr_reader :city, :state

  # city/state attributes should be initialized in the constructor. Also is more
  # like an OOP approach.
  def initialize(city, state)
    @city = city.chomp.downcase.tr(' ', '-')
    @state = state.chomp.downcase
  end

  def run
    # Removed:
    # - `build_url`, as `base_url` will be called from every method when be
    #   required.
    # - `parsing(@url)` as it's later called in `initial_page_values`, where it
    #   really do something with the received html.
    # - All `self.` definitions, in general `self.` is required only when you're
    #   modifying a value of the instance, but it's not needed to read values or
    #   call methods.
    initial_page_values
    scraping
  ensure
    # With `ensure`, in case you ctrl+c the process or fail for some reason, you
    # "ensure" the gathered info so far is saved.
    writing
  end

  # The only accessible method for the class instance is `.run`, so the rest of
  # them should be kept as private.
  private

  # Changed method `build_url` to memoized method `base_url`. More info in
  # https://www.rubydoc.info/gems/rubocop/RuboCop/Cop/Naming/MemoizedInstanceVariableName
  def base_url
    # Changes:
    # - Using ||= to set a value only when this is empty,
    # - Using already defined city/state attr_readers instead of accessing
    #   directly to the global variables @city and @state
    # - Instead of using + to concatenate, I'm interpolating, which is the most
    #   prefered way. https://rubystyle.guide/#string-interpolation
    @base_url ||= "https://www.infojobs.com.br/empregos-em-#{city},-#{state}.aspx"
  end

  def parsing(url)
    # No need of `begin` when `rescue` contemplates the whole method
    html = RestClient.get(url)
    # No sense to have html parsing out of begin/rescue block. For `rescue`
    # case, the parsing would fail, so moving html parsing before rescue.
    # Also the result of this operation should be returned, not kept in a global
    # variable `@parsed_html`.
    Nokogiri::HTML(html)
  rescue StandardError => e
    puts "ERROR on #{url}"
    puts "Exception Class:#{e.class.name}"
    puts "Exception Message:#{e.message}"
  end

  # Minor change, but `get_` and `set_` method names are discouraged in ruby.
  # More info in https://rubystyle.guide/#accessor_mutator_method_names
  def initial_page_values
    # Following what was described in the previous method, you're going to need
    # the result of the parsing method only for the scope of this method, so
    # there's no need to save it as a global variable.
    parsed_html = parsing(base_url)
    # You're only using @last page out of this method, so there shouldn't be
    # necessary to have @total and @per_page as global variables.
    total = parsed_html.css('.js_xiticounter').text.delete('.')
    per_page = parsed_html.css('.element-vaga').count
    @last_page = (total.to_f / per_page).round
    puts "#{@last_page} pages found"
  end

  def scraping
    @list = []
    page = 1
    # Removed
    #   @url += '?Page='
    # as it should be better to handle it directly when the parsing is invoked.
    until page >= @last_page
      # This is wrong:
      #   @url + page.to_s
      # as you're adding the page to @url in a void context, after this line,
      # @url will keep having the same value as before this line.
      # This should be the url to parse:
      jobs = parsing("#{base_url}?Page=#{page}").css('.element-vaga')
      # Moving all the job parsing to its own method. Also to decrease ABC
      # metric https://en.wikipedia.org/wiki/ABC_Software_Metric and Cyclomatic
      # complexity https://en.wikipedia.org/wiki/Cyclomatic_complexity for this
      # `scraping` method.
      jobs.each { |job| @list.push(job_data(job)) }
      # Page increasing should be out of the jobs loop.
      puts "The page #{page} was successfully saved"
      page += 1
    end
  end

  def job_data(job)
    company = job.css('div.vaga-company > a').text.strip
    company = company.empty? ? 'Confidential Company' : company
    {
      title: job.css('div.vaga > a > h2').text.strip,
      company: company,
      city: job.css('p.location2 > span > span > span').text,
      area: job.css('p.area > span').text
    }#.tap { |jd| puts jd } # Uncomment if you need to debug values.
  end

  def writing
    # If you don't have values to save, simply return
    return unless @list && @list.any?

    puts "saving #{@list.count} jobs."
    # Just a minor change to write to a different file every time you run this,
    # but feel free to remove it.
    filename = "list_jobs_#{Time.now.to_i}.csv"
    CSV.open(filename, 'wb', headers: @list.first.keys) do |csv|
      csv << %w[Title Company City Area]
      @list.each do |hash|
        csv << hash
      end
    end
  end
end

puts '[ Type a city ] '
city = gets
puts '[ Type the state ]'
state = gets
puts 'Processing...'
test = InfoJobs.new(city, state)
test.run

When I talk about void context, check the following example:

# First definition of variable with a value
@url = 'www.stackoverflow.com' # => "www.stackoverflow.com"
# Then I'm adding a text, which is returning a value in a void context, in other
# words, there's no context, like a variable or a method, which is receiving the
# resulting value.
@url + '/tags' # => "www.stackoverflow.com/tags"
# If I access to the variable again, it has the same initial value
@url # => "www.stackoverflow.com"
# A different situation is when there's a context, in this case the same @url
# variable, which it will receive the resulting value of the addition.
@url = @url +  '/tags' # => "www.stackoverflow.com/tags"
@url # => "www.stackoverflow.com/tags"
# Or also using the shorter += version
@url += '?tab=name' # => "www.stackoverflow.com/tags?tab=name"
@url # => "www.stackoverflow.com/tags?tab=name"

I'm far to be an OOP expert (I've forgotten most of theory learned in Uni.��), but the idea is try to keep only the essential information of your object as attributes.
For your case, I think city and state are essential to make all the object's functioning works as expected.
Also, could be good to keep as attributes information to be accessed multiple times and that takes a while to get it. For your case, the jobs list was saved to be written later to a file (actually, rethinking the problem, maybe it wasn't necessary at all to have @list if after writing to a file isn't going to be used anymore).
Finally, you must to think that any created instance variable isn't garbage collected and will keep living in memory as long as your object's instance is alive. For your particular case, maybe this isn't a problem, but in the eventual case where you have thousands of those objects created, they should be as lightweight as possible or you could run out of memory.

You have my sincere thanks. I really appreciate the break down you have done in my code. I could grasp most of all you wrote, but I couldn't understand what do you mean by "void context". Another thing, instance variables are the DNA of each object, right? So you don't keep changing their values like a did in the previously code right? like I was setting different values to my '@url' and '@parsed_html' variables each time the page changes. — Greg M
– Greg M, Commented Feb 27, 2020 at 1:00

Stack Exchange Network

Tentative to transform this scrape script into a class object program

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Tentative to transform this scrape script into a class object program

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions