0

I am trying to extract all the elements from webpage which contains given set of words. E.g. Given i have an array of random words

words = ["sky", "element", "dry", "smooth", "java", "running", "illness", "lake", "soothing", "cardio", "gymnastic"]

and suppose all of them present in DOM. It means I need to check each possble tag for their presence. Below are the tags in which I am searching.

// 14 essential tags 
items = ["p", "li", "h2", "h3", "h4", "h5", "h6", "b", "small", "td", "span", "strong", "blockquote", "div"] 

also I need to ensure the whitespace boundary of word. So for each word I need to define 3 checks in jquery contains() selector like below.

tagName:contains('sky '), tagName:contains(' sky'), tagName:contains(' sky ')`

So in total the final query for above scenario makes words.length * items.length * 3 14*11*3 = 462 selectors. Which is itself a large query and as the words array start growing my query looks like below.

$(p:icontains('ACCESSIBLE '), p:icontains(' ACCESSIBLE'), p:icontains(' ACCESSIBLE '), li:icontains('ACCESSIBLE '), li:icontains(' ACCESSIBLE'), li:icontains(' ACCESSIBLE '), h3:icontains('ACCESSIBLE '), h3:icontains(' ACCESSIBLE'), h3:icontains(' ACCESSIBLE '), h4:icontains('ACCESSIBLE '), h4:icontains(' ACCESSIBLE'), h4:icontains(' ACCESSIBLE '), h5:icontains('ACCESSIBLE '), h5:icontains(' ACCESSIBLE'), h5:icontains(' ACCESSIBLE '), b:icontains('ACCESSIBLE '), b:icontains(' ACCESSIBLE'), b:icontains(' ACCESSIBLE '), td:icontains('ACCESSIBLE '), td:icontains(' ACCESSIBLE'), td:icontains(' ACCESSIBLE '), span:icontains('ACCESSIBLE '), span:icontains(' ACCESSIBLE'), span:icontains(' ACCESSIBLE '), strong:icontains('ACCESSIBLE '), strong:icontains(' ACCESSIBLE'), strong:icontains(' ACCESSIBLE '), div:icontains('ACCESSIBLE '), div:icontains(' ACCESSIBLE'), div:icontains(' ACCESSIBLE ')p:icontains('ADDRESS '), p:icontains(' ADDRESS'), p:icontains(' ADDRESS '), li:icontains('ADDRESS '), li:icontains(' ADDRESS'), li:icontains(' ADDRESS '), h3:icontains('ADDRESS '), h3:icontains(' ADDRESS'), h3:icontains(' ADDRESS '), h4:icontains('ADDRESS '), h4:icontains(' ADDRESS'), h4:icontains(' ADDRESS '), h5:icontains('ADDRESS '), h5:icontains(' ADDRESS'), h5:icontains(' ADDRESS '), b:icontains('ADDRESS '), b:icontains(' ADDRESS'), b:icontains(' ADDRESS '), td:icontains('ADDRESS '), td:icontains(' ADDRESS'), td:icontains(' ADDRESS '), span:icontains('ADDRESS '), span:icontains(' ADDRESS'), span:icontains(' ADDRESS '), strong:icontains('ADDRESS '), strong:icontains(' ADDRESS'), strong:icontains(' ADDRESS '), div:icontains('ADDRESS '), div:icontains(' ADDRESS'), div:icontains(' ADDRESS ')p:icontains('ANTIEMETICS '), p:icontains(' ANTIEMETICS'), p:icontains(' ANTIEMETICS '), li:icontains('ANTIEMETICS '), li:icontains(' ANTIEMETICS'), li:icontains(' ANTIEMETICS '), h3:icontains('ANTIEMETICS '), h3:icontains(' ANTIEMETICS'), h3:icontains(' ANTIEMETICS '), h4:icontains('ANTIEMETICS '), h4:icontains(' ANTIEMETICS'), h4:icontains(' ANTIEMETICS '), h5:icontains('ANTIEMETICS '), h5:icontains(' ANTIEMETICS'), h5:icontains(' ANTIEMETICS '), b:icontains('ANTIEMETICS '), b:icontains(' ANTIEMETICS'), b:icontains(' ANTIEMETICS '), td:icontains('ANTIEMETICS '), td:icontains(' ANTIEMETICS'), td:icontains(' ANTIEMETICS '), span:icontains('ANTIEMETICS '), span:icontains(' ANTIEMETICS'), span:icontains(' ANTIEMETICS '), strong:icontains('ANTIEMETICS '), strong:icontains(' ANTIEMETICS'), strong:icontains(' ANTIEMETICS '), div:icontains('ANTIEMETICS '), div:icontains(' ANTIEMETICS'), div:icontains(' ANTIEMETICS ')p:icontains('BRAND '), p:icontains(' BRAND'), p:icontains(' BRAND '), li:icontains('BRAND '), li:icontains(' BRAND'), li:icontains(' BRAND '), h3:icontains('BRAND '), h3:icontains(' BRAND'), h3:icontains(' BRAND '), h4:icontains('BRAND '), h4:icontains(' BRAND'), h4:icontains(' BRAND '), h5:icontains('BRAND '), h5:icontains(' BRAND'), h5:icontains(' BRAND '), b:icontains('BRAND '), b:icontains(' BRAND'), b:icontains(' BRAND '), td:icontains('BRAND '), td:icontains(' BRAND'), td:icontains(' BRAND '), span:icontains('BRAND '), span:icontains(' BRAND'), span:icontains(' BRAND '), strong:icontains('BRAND '), strong:icontains(' BRAND'), strong:icontains(' BRAND '), div:icontains('BRAND '), div:icontains(' BRAND'), div:icontains(' BRAND ')p:icontains('CAPABLE '), p:icontains(' CAPABLE'), p:icontains(' CAPABLE '), li:icontains('CAPABLE '), li:icontains(' CAPABLE'), li:icontains(' CAPABLE '), h3:icontains('CAPABLE '), h3:icontains(' CAPABLE'), h3:icontains(' CAPABLE '), h4:icontains('CAPABLE '), h4:icontains(' CAPABLE'), h4:icontains(' CAPABLE '), h5:icontains('CAPABLE '), h5:icontains(' CAPABLE'), h5:icontains(' CAPABLE '), b:icontains('CAPABLE '), b:icontains(' CAPABLE'), b:icontains(' CAPABLE '), td:icontains('CAPABLE '), td:icontains(' CAPABLE'), td:icontains(' CAPABLE '), span:icontains('CAPABLE '), span:icontains(' CAPABLE'), span:icontains(' CAPABLE '), strong:icontains('CAPABLE '), strong:icontains(' CAPABLE'), strong:icontains(' CAPABLE '), div:icontains('CAPABLE '), div:icontains(' CAPABLE'), div:icontains(' CAPABLE ')p:icontains('COMMAND '), p:icontains(' COMMAND'), p:icontains(' COMMAND '), li:icontains('COMMAND '), li:icontains(' COMMAND'), li:icontains(' COMMAND '), h3:icontains('COMMAND '), h3:icontains(' COMMAND'), h3:icontains(' COMMAND '), h4:icontains('COMMAND '), h4:icontains(' COMMAND'), h4:icontains(' COMMAND '), h5:icontains('COMMAND '), h5:icontains(' COMMAND'), h5:icontains(' COMMAND '), b:icontains('COMMAND '), b:icontains(' COMMAND'), b:icontains(' COMMAND '), td:icontains('COMMAND '), td:icontains(' COMMAND'), td:icontains(' COMMAND '), span:icontains('COMMAND '), span:icontains(' COMMAND'), span:icontains(' COMMAND '), strong:icontains('COMMAND '), strong:icontains(' COMMAND'), strong:icontains(' COMMAND '), div:icontains('COMMAND '), div:icontains(' COMMAND'), div:icontains(' COMMAND ')p:icontains('CONFLUENCE ') ....

enter image description here

A query with 87Kb of selectors and it takes nearly 30-40 seconds to execute on a page. Is there any way to optimize it. I am already eliminating the tags which are not present in the page before making the query.

EDIT: The reason I can't do whole page text parsing with regex is that I need to replace those matched words with some HTML after parsing, its similar to highlighting those words on page by injecting some HTML in their place. If i do regex I'll lost control over HTML element and their position. Also regex replacement doesnot gaurantee the text nodes. It also replaces the words written inside title attributes and other attributes if present.

4
  • 1
    I believe this question is more appropriate for codereview.stackexchange.com
    – Henfs
    Commented Mar 9, 2020 at 19:32
  • 2
    Most likely the better question is why you need something like this. To me it seems like a design error in an earlier stage.
    – Sirko
    Commented Mar 9, 2020 at 19:46
  • @Sirko help me with that. I didn;t get any better idea. I need to replce those matched words with some HTML after parsing, its similar to highlighting those words on page by injecting some HTML in their place. Suggest me a better workaround if you know.. Commented Mar 10, 2020 at 3:43
  • @Sirko please check edits for more details Commented Mar 10, 2020 at 3:59

1 Answer 1

1

Instead, why not just use the entire page html, and use regex's against the word array?

Test HTML:

<div id='container'>
<p>Chess</p>
<span>Chess and Checkers</span>
<h1> Chess and Checkers</h1>
</div>

Javascript:

$("document").ready(function() {
    var html = $("#container").html();
    var words = ["Chess", "checkers"];

    $.each(words, function() {
      var phrase = this;    
      html = html.replace(new RegExp(phrase, "gim"), "foo");
    });
    $("#container").html(html);
});

Output:

foo

foo and foo

foo and foo

2
  • That won;t work in my case. As I need to replace those matched words with some HTML after parsing, its similar to highlighting those words on page by injecting some HTML in their place. By Doing regex i could not get the element in their exact place. Regex has lot of issues as it also replaces the word written inside title attriubtes which i don't want. Commented Mar 10, 2020 at 3:48
  • The reason i don't need words written inside attribute to be replaced is that, i need to replace with HTML . Putting HTML element in place of a word written within title attribute breaks the whole element. Eg. <div title="this is WORD"> This is word</div> which becomes <div title="this is <span>replacement</span>">This is <span> replacement</span> </div> that doesn't make sense. Commented Mar 10, 2020 at 4:02

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.