All Questions
20 questions
1 vote, 2 answers, 3k views
How can I scrape a page that literally contains "\x2d", but save that character as "-" in my item?
I need to scrape some text from within a script on a page, and save that text within a scrapy item, presumably as a UTF-8 string. However the actual literal text I'm scraping from has special ...
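One way to approach the question above, as a minimal sketch: assuming the scraped text contains the four literal characters `\x2d` (a backslash, an "x", and two hex digits), a regex substitution can replace each such sequence with the character it names. The helper name `unescape_hex` and the sample string are made up for illustration and do not come from the question.

```python
import re

# Matches a literal backslash, "x", and exactly two hex digits, e.g. \x2d
HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")

def unescape_hex(text):
    """Replace each literal backslash-x-NN sequence with the character it encodes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

sample = r"pre\x2dorder"     # raw string: contains a real backslash
print(unescape_hex(sample))  # pre-order
```

Text without any such sequences passes through unchanged, so the helper can be applied to every scraped value before it is stored in the item.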
-3 votes, 1 answer, 5k views
Using Scrapy sitemap spider, show me how to crawl for article titles
I am trying to crawl the Washington Post sitemap for articles whose titles contain the word "trump". I did my research here https://scrapy.readthedocs.io/en/latest/topics/spiders.html#sitemapspider, but I ...
1 vote, 2 answers, 93 views
python regex to find name of fielder
I am trying to crawl a website and parse a cricket scoreboard using Scrapy. I have been able to do most of it except for the fielder who caught the ball. There can be several ways in which the text can be ...
3 votes, 1 answer, 234 views
Working with Scrapy 'regex definition'
I have been trying to generate a script to scrape data from the website https://services.aamc.org/msar/home#null. I generated a python scrapy 2.7 script to get a piece of text from the website (I am ...
1 vote, 2 answers, 49 views
How can I identify a js array by its keys?
My spider returns javascript code as a string. From this code I need to retrieve an array which I can identify by its keys.
That means, I already have the keys but how do I get the complete array? ...
2 votes, 2 answers, 2k views
remove special character in scrapy python
I am trying to remove the special characters from the following text:
sample_sample_sample_2.18.14
I tried the following patterns to remove those special characters:
item['xxxx'] = item['aaaa'].replace('...
0 votes, 1 answer, 373 views
scrapy regex cannot find long dash
I'm using Scrapy XPath + re to extract data from web pages. Characters are Unicode (Russian) and all strings to be extracted contain long dashes (Python code '\u2014').
The problem is my regex cannot ...
0 votes, 2 answers, 43 views
Unable to get text between characters?
I am trying to get the text "XXXXX" between the characters, like / XXXXX .doc, from the URL.
I am trying "item['xxxxx'] = re.search(r'/(.*?)/.doc', item['url']).group(1)"
But I am unable to get the text ...
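A possible reason the pattern above fails is that `/.doc` demands a slash directly before the extension, and the unescaped dot matches any character. A minimal sketch of an alternative, assuming the wanted text is the last path segment before a `.doc` suffix (the URL below is hypothetical, since the real one is not shown):

```python
import re

# Hypothetical URL; the real one is not shown in the question.
url = "http://example.com/files/XXXXX.doc"

# [^/]+ keeps the match inside the last path segment; \. matches a literal dot.
match = re.search(r"/([^/]+)\.doc$", url)
name = match.group(1) if match else None
print(name)  # XXXXX
```

Checking `match` before calling `.group(1)` avoids an `AttributeError` on URLs that do not end in `.doc`.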
0 votes, 1 answer, 2k views
Regular expression with Scrapy/Python
With Scrapy I want to extract some data from websites. This is my section for the parsing:
item['title'] = sel.xpath('//div[@class="box"]/h3/text()').extract()
item['date'] = sel.xpath('//div[@class="...
0 votes, 1 answer, 43 views
get sgml allow regex for "example.com/page/200/"
I'm trying to get the regular expression for "example.com/page/200/".
Here's what I've done so far:
rules = (Rule (SgmlLinkExtractor(
allow=("//page/\d+",),
restrict_xpaths=('xxxxx',)),
...
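A likely culprit in the rule above is the doubled slash in `//page/\d+`, since a URL such as example.com/page/200/ has only one slash before "page". This can be checked with the plain `re` module, as a sketch; it assumes the link extractor searches its `allow` patterns against the full URL, and the `http://` scheme is added here for illustration.

```python
import re

url = "http://example.com/page/200/"  # scheme added for illustration

original = r"//page/\d+"  # pattern from the question: two slashes before "page"
single = r"/page/\d+/"    # one slash, as in the actual URL path

print(bool(re.search(original, url)))  # False
print(bool(re.search(single, url)))    # True
```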
9 votes, 1 answer, 12k views
Scrapy Extract number from page text with regex
I have been looking for a few hours on how to search all text on a page and if it matches a regex then extract it. I have my spider set up as follows:
def parse(self, response):
title = ...
3 votes, 1 answer, 54 views
Can't get additional items from url
I'm scraping a few items from this site, but it grabs items only from the first product and doesn't loop further. I know I'm making a simple, stupid mistake, but if you can just point out where I got this ...
-1 votes, 3 answers, 90 views
Why this regular expression is not working
I am using Python 2.7 with Scrapy 0.20
I have this test
0552121152, +97143321090
I want to get the value before the comma and the value after it.
...
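For input shaped like the sample above, a regex may not be needed at all. As a minimal sketch that swaps in plain string splitting in place of the regular expression the question asks about, and assumes exactly one comma separates the two values:

```python
text = "0552121152, +97143321090"

# Split once on the comma and strip surrounding whitespace from each part.
before, after = (part.strip() for part in text.split(",", 1))
print(before)  # 0552121152
print(after)   # +97143321090
```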
-2 votes, 2 answers, 65 views
why this regular expression returns empty [closed]
I have these strings:
Phone: 3396222
Phone: +33333388
I want to extract the numbers.
I tried this regular expression:
Phone:\s*(\d+\.\d+)
But I got an empty result.
I am using scrapy so my code is ...
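The pattern above requires a literal dot between two groups of digits (`\d+\.\d+`), and neither sample string contains one, which would explain the empty result. A minimal sketch of a pattern that fits the two samples, assuming the number is a run of digits with an optional leading plus sign:

```python
import re

samples = ["Phone: 3396222", "Phone: +33333388"]

# \+? allows an optional leading plus sign; \d+ then takes the digits.
pattern = re.compile(r"Phone:\s*(\+?\d+)")

numbers = [pattern.search(s).group(1) for s in samples]
print(numbers)  # ['3396222', '+33333388']
```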
1 vote, 2 answers, 180 views
regular expression to get string from text()
I have this html:
<p class="marB0">Phone:+97143396222<br>
Email:[email protected]</p>
And I want to get the phone number
I get the text like this:
normalize-...