All Questions

1 vote
2 answers
3k views

How can I scrape a page that literally contains "\x2d", but save that character as "-" in my item?

I need to scrape some text from within a script on a page, and save that text within a scrapy item, presumably as a UTF-8 string. However the actual literal text I'm scraping from has special ...
Chris's user avatar
  • 301
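
One possible approach to the question above, offered as a minimal sketch rather than the accepted answer: if the scraped string holds the four literal characters `\`, `x`, `2`, `d`, a regex substitution can turn each `\xNN` sequence into the character it names. The helper name `decode_hex_escapes` and the sample value are assumptions made for illustration.

```python
import re


def decode_hex_escapes(text: str) -> str:
    """Replace each literal \\xNN sequence with the character it encodes."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",              # backslash, 'x', two hex digits
        lambda m: chr(int(m.group(1), 16)),  # hex digits -> code point -> char
        text,
    )


# Hypothetical scraped value: a raw string, so the backslash is literal.
raw = r"sku\x2d1234"
print(decode_hex_escapes(raw))
```

Substituting only the `\xNN` sequences leaves any non-ASCII text in the string untouched, which a blanket `unicode_escape` decode would not guarantee.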
-3 votes
1 answer
5k views

Using Scrapy sitemap spider, show me how to crawl for article titles

I am trying to crawl the Washington Post sitemap for articles whose titles contain the word "trump." I did my research here https://scrapy.readthedocs.io/en/latest/topics/spiders.html#sitemapspider, but I ...
Rahmi Pruitt's user avatar
1 vote
2 answers
93 views

python regex to find name of fielder

I am trying to crawl a website and parse a cricket scoreboard using scrapy. I have been able to do most of it except for the fielder who caught the ball. There can be several ways in which the text can be ...
Neel's user avatar
  • 625
3 votes
1 answer
234 views

Working with Scrapy 'regex definition'

I have been trying to generate a script to scrape data from the website https://services.aamc.org/msar/home#null. I generated a python scrapy 2.7 script to get a piece of text from the website (I am ...
mg520's user avatar
  • 33
1 vote
2 answers
49 views

How can I identify a js array by its keys?

My spider returns javascript code as a string. From this code I need to retrieve an array which I can identify by its keys. That means, I already have the keys but how do I get the complete array? ...
steph's user avatar
  • 565
2 votes
2 answers
2k views

remove special character in scrapy python

I am trying to remove the special characters from the following text: sample_sample_sample_2.18.14 I tried the following patterns to remove those special characters: item['xxxx'] = item['aaaa'].replace('...
Karthick's user avatar
0 votes
1 answer
373 views

scrapy regex cannot find long dash

I'm using scrapy xpath + re to extract data from web pages. Characters are Unicode (Russian) and all strings to be extracted contain long dashes (Python code '\u2014'). The problem is my regex cannot ...
thepolina's user avatar
  • 1,274
0 votes
2 answers
43 views

Unable to get text between characters?

I am trying to get the text "XXXXX" between the characters, like / XXXXX .doc, from the URL link. I am trying "item['xxxxx'] = re.search(r'/(.*?)/.doc', item['url']).group(1)". Here I am unable to get the text ...
Karthick's user avatar
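
A hedged sketch of one likely fix for the pattern quoted above: `/.doc` asks for a second slash, then any one character, then `doc`, so a URL ending in `/XXXXX.doc` would not match and `.group(1)` would fail on `None`. The URL below and the helper name `doc_name` are assumptions, since the real link is not shown in the excerpt.

```python
import re
from typing import Optional

# Hypothetical URL; the real one is not shown in the excerpt.
url = "http://example.com/files/XXXXX.doc"


def doc_name(url: str) -> Optional[str]:
    """Return the last path segment before a trailing '.doc', or None."""
    # [^/]+ keeps the capture inside the final path segment;
    # \. matches a literal dot instead of "any character".
    match = re.search(r"/([^/]+)\.doc$", url)
    return match.group(1) if match else None


print(doc_name(url))
```

Checking the match object before calling `.group(1)` avoids an `AttributeError` on URLs that do not end in `.doc`.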
0 votes
1 answer
2k views

Regular expression with Scrapy/Python

With Scrapy, I want to extract some data from websites. This is my section for the parsing: item['title'] = sel.xpath('//div[@class="box"]/h3/text()').extract() item['date'] = sel.xpath('//div[@class="...
ChristopherB's user avatar
0 votes
1 answer
43 views

get sgml allow regex for "example.com/page/200/"

I'm trying to get the regular expression for "example.com/page/200/". Here's what I've done so far: rules = (Rule (SgmlLinkExtractor( allow=("//page/\d+",), restrict_xpaths=('xxxxx',)), ...
Suresh's user avatar
  • 123
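
A minimal sketch of why the quoted `allow` pattern probably fails, using only the `re` module and assuming the pattern is searched against the absolute URL: `//page/\d+` needs two consecutive slashes before `page`, which `example.com/page/200/` does not contain. The candidate pattern and the sample URLs are assumptions for illustration.

```python
import re

# Candidate pattern: one slash, "page", one slash, digits, optional trailing slash.
allow = r"/page/\d+/?$"

# Hypothetical URLs to test against.
urls = [
    "http://example.com/page/200/",
    "http://example.com/page/abc/",
    "http://example.com/about/",
]

matching = [u for u in urls if re.search(allow, u)]
print(matching)

# The pattern from the question matches none of them.
print([u for u in urls if re.search(r"//page/\d+", u)])
```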
9 votes
1 answer
12k views

Scrapy Extract number from page text with regex

I have been looking for a few hours on how to search all text on a page and if it matches a regex then extract it. I have my spider set up as follows: def parse(self, response): title = ...
Xaxum's user avatar
  • 3,675
3 votes
1 answer
54 views

Can't get additional items from url

I'm scraping a few items from this site, but it grabs items only from the first product and doesn't loop further. I know I'm making a simple, stupid mistake, but if you can just point out where I got this ...
user3404005's user avatar
-1 votes
3 answers
90 views

Why is this regular expression not working?

I am using Python 2.7 with Scrapy 0.20. I have this test string: 0552121152, +97143321090. I want to get the value before the comma and the value after it. ...
Marco Dinatsoli's user avatar
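
For the string quoted above, a regular expression may not be needed at all. The following is a sketch under the assumption that the two values are always separated by a single comma; the helper name `split_pair` is hypothetical.

```python
def split_pair(text: str):
    """Split on the first comma and trim whitespace around both halves."""
    before, after = (part.strip() for part in text.split(",", 1))
    return before, after


before, after = split_pair("0552121152, +97143321090")
print(before)
print(after)
```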
-2 votes
2 answers
65 views

Why does this regular expression return an empty result? [closed]

I have these strings: "Phone: 3396222" and "Phone: +33333388". I want to extract the numbers. I tried this regular expression: Phone:\s*(\d+\.\d+) but I got an empty result. I am using scrapy so my code is ...
user2226785's user avatar
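
A short sketch of why the quoted pattern comes back empty, and one possible replacement: `\d+\.\d+` requires a literal dot between two runs of digits, and neither sample string contains one. The replacement pattern and the helper name `extract_phone` are assumptions, not the accepted answer.

```python
import re
from typing import Optional

PHONE = re.compile(r"Phone:\s*(\+?\d+)")  # optional leading '+', then digits


def extract_phone(line: str) -> Optional[str]:
    """Return the number following 'Phone:', or None when absent."""
    match = PHONE.search(line)
    return match.group(1) if match else None


for line in ["Phone: 3396222", "Phone: +33333388"]:
    print(extract_phone(line))

# The original pattern finds nothing because there is no '.' in the input.
print(re.search(r"Phone:\s*(\d+\.\d+)", "Phone: 3396222"))
```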
1 vote
2 answers
180 views

regular expression to get string from text()

I have this HTML: <p class="marB0">Phone:+97143396222<br> Email:[email protected]</p> I want to get the phone number. I get the text like this: normalize-...
Marco Dinatsoli's user avatar