This repository was archived by the owner on Dec 10, 2025. It is now read-only.

TFIDF content matching should check inter-scrape #97

@PaulMcInnis

Description

Currently we remove duplicates everywhere, but we only remove duplicates by description (TFIDF) between the masterlist and the scrape data. Rows within the masterlist are never compared with each other.

We should allow the masterlist to perform a content match against itself.

Steps to Reproduce

  1. scrape some jobs to .pkl
  2. copy-paste a row a few times, only changing the key_id
  3. run again with --no-scrape

Expected behavior

We should be running TFIDF within the scrape data (inter-scrape) and within the master CSV (inter-master), not only between the two.
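
A minimal sketch of what a self-match could look like, assuming jobs are held as a dict of key_id to description text. The function names, the 0.75 threshold, and the pure-Python TF-IDF are illustrative assumptions and not the project's actual implementation. The point is only that the same set is compared against itself, so rows that differ only in key_id are caught:

```python
import math
import re
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF, so terms present in every document keep a non-zero weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_self_duplicates(jobs, threshold=0.75):
    """Map each duplicate key_id to the earlier key_id it matches.

    `jobs` is a hypothetical dict of key_id -> description; `threshold`
    is an assumed similarity cut-off.
    """
    keys = list(jobs)
    vectors = tfidf_vectors([jobs[key] for key in keys])
    duplicates = {}
    for i, j in combinations(range(len(keys)), 2):
        # Skip rows that were already flagged as duplicates.
        if keys[i] in duplicates or keys[j] in duplicates:
            continue
        if cosine(vectors[i], vectors[j]) >= threshold:
            duplicates[keys[j]] = keys[i]
    return duplicates
```

Running the same function once over the scrape data and once over the masterlist would cover both the inter-scrape and the inter-master case, for example `find_self_duplicates(masterlist_descriptions)`, where `masterlist_descriptions` is a hypothetical dict built from the master CSV.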

Actual behavior

Only duplicates in the incoming dict are identified, based on the master CSV.

Environment

  • Build: 3.0.0
