
I have tried every possible (to my poor knowledge) combination of grep and sed commands, but I fail to extract URLs matching the following pattern (Google Alert e-mails in plain text):

"url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt\u0026url=3Dhtt=
p://abcnews.go.com/US/wireStory/judge-orders-forfeiture-cartel-money-launde=
ring-case-44765120\u0026ct=3Dga\u0026cd=3DCAEYACoTNzAxNDE5ODc4MzMzMTc5OTA4O=
TIaYjdkMGIxMjNmMjc0YWM4ODpjb206ZW46VVM\u0026usg=3DAFQjCNHKeTb3brU2sr0qOpXXJ=
fuW9Nfntg"

Obviously, what I want to extract is:

http://abcnews.go.com/US/wireStory/judge-orders-forfeiture-cartel-money-laundering-case-44765120

So I need to extract what is between "url=3D" and "\".

I have tried all kinds of grep and sed variations, but nothing works.

I would be very grateful if someone could help me figure this out.

PS: I know that once the URLs are extracted I'll have to deal with the = characters, but one problem at a time :)

  • Maybe you should deal with "=\n" first, to join it into a single line, and then you can sed it (or do that joining in the sed program). Commented Feb 7, 2017 at 21:54
  • thanks, the problem is that when running it on a file with many such patterns, it doesn't work (it only works if I put this pattern alone in a file or pass it as echo input)
    – serge
    Commented Feb 8, 2017 at 19:46
  • Hmm, apparently your sed doesn't have -z, then. If it had, you wouldn't end up with any lines ending in '='. You may need to highlight that in your question. Check man sed and sed --version Commented Feb 8, 2017 at 22:04
  • sed version is sed (GNU sed) 4.2.2 and I do have -z (just checked it with man)
    – serge
    Commented Feb 8, 2017 at 22:13
  • Due to the complexity of the file formatting, I think what I should do is: print all lines that contain url=3D BUT print only what comes AFTER url=3D, AND keep printing UNTIL the line that contains u0026ct BUT print only what comes BEFORE u0026ct. This way I might escape the problem caused by the formatting, which makes sed and grep stop at each end of line. But I have no idea how to do it. As of now, I managed to print everything up to the line containing u0026ct: awk '{print} /u0026ct/ {exit}' INBOX > output.txt ...
    – serge
    Commented Feb 8, 2017 at 22:23

4 Answers


You can use a command line like the following for the processing:

cat INBOX | sed -z -e 's/=\n//g' | \
   sed -e 's/.*u0026url=3D//;t a;d;:a' -e 's/\\u0026ct=3D.*//'

The first sed step joins lines that end with "=" to their succeeding lines, in particular turning the interesting lines into one-liners.

The second sed step first strips the head part (everything up to u0026url=3D) from matching lines and deletes any lines without that head; it then removes the tail part (from \u0026ct=3D onward) of the target lines.
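As a sanity check, here is the pipeline run on a shortened, hypothetical version of the sample (the URL is made up for brevity; GNU sed is assumed for -z):

```shell
# Shortened, hypothetical quoted-printable sample (made-up URL):
printf '%s\n' \
  '"url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt\u0026url=3Dhtt=' \
  'p://example.com/story-44765120\u0026ct=3Dga\u0026usg=3DAFQjCN"' |
# Step 1 (GNU sed): join lines ending in "=" with their successors.
sed -z -e 's/=\n//g' |
# Step 2: keep only the text between "u0026url=3D" and "\u0026ct=3D".
sed -e 's/.*u0026url=3D//;t a;d;:a' -e 's/\\u0026ct=3D.*//'
# prints: http://example.com/story-44765120
```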

  • did you remember the -z on the first sed? Commented Feb 8, 2017 at 21:35
  • yes, I did exactly how you put it.
    – serge
    Commented Feb 8, 2017 at 21:54
  • Peculiar; so running just the first sed still results in lines ending with =? I.e. it doesn't join? Maybe your line ends are \r then (are you using a Mac?). Try replacing \n with \r... Or are you using Windows? Then replace \n with \r\n... Commented Feb 8, 2017 at 22:18
  • I'm using Linux
    – serge
    Commented Feb 8, 2017 at 22:26
  • Yes, but the file might have, say, Windows line endings. You might replace \n with \r\?\n so as to treat both Windows and Linux line endings as such. Commented Feb 8, 2017 at 22:29

Can you try this command:

awk -F"3D" '{print $4}' input.txt | sed "s/\\\u.*//"

I'm not certain how you are getting the alerts, but I will provide an example of how to do it if the alerts were in a simple text file. I would deal with the "=" first using tr, then use Perl lookarounds with grep as follows ...

cat input.txt | tr --delete '=\n' | grep -oP '(?<=url3D).*?(?=\\u0026)'

The output using your sample is

http://abcnews.go.com/US/wireStory/judge-orders-forfeiture-cartel-money-laundering-case-44765120
  • I perform the extractions on a single INBOX file (Thunderbird). Thanks for your suggestion; it works flawlessly on a file containing ONLY this sample, however when run on the INBOX file (with many such URLs) it doesn't work at all (the output is the same as the input). I really don't know what's wrong. As I have already mentioned, the pattern is always the same (URLs are located between u0026url=3D and \u0026ct=)
    – serge
    Commented Feb 8, 2017 at 19:52
  • the weird thing is that until it encounters another instance of "url": , the command you suggest works as intended and ignores all non-relevant strings (lot of html code and useless text), but whenever it encounters the next "url":, it breaks (stops extracting and just puts the rest of the INBOX file content after the first extracted URL)
    – serge
    Commented Feb 8, 2017 at 21:13
  • Well we are part of the way there then. :) If you would like to edit your question and provide a more complete example (with 2+ urls) I'd be happy to take a look
    – rcjohnson
    Commented Feb 9, 2017 at 8:41

Question solved using the suggestion of Ralph Rönnquist:

cat INBOX | sed -z -e 's/=\r\?\n//g' | \
   sed -e 's/.*u0026url=3D//;t a;d;:a' -e 's/\\u0026ct=3D.*//' > output.txt

It takes a long time to compute, but it does extract the URLs correctly.
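If speed matters, one hedged alternative is to replace the branchy second sed with a single grep extraction after the join. This is only a sketch, assuming GNU sed (for -z) and GNU grep with PCRE support (-P); it relies on the URL containing no backslash, so the match stops at the \ of \u0026ct:

```shell
# Sketch: join quoted-printable soft line breaks ("=\n", optionally "=\r\n"),
# then grab everything between "u0026url=3D" and the next backslash.
sed -z 's/=\r\?\n//g' INBOX |
grep -oP '(?<=u0026url=3D)[^\\]*' > output.txt
```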

Thanks a lot everybody for your assistance!

