
I have tried every possible (to my poor knowledge) combination of grep and sed commands, but I fail to extract URLs matching the following pattern (Google Alert e-mails in plain text):

"url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt\u0026url=3Dhtt=
p://abcnews.go.com/US/wireStory/judge-orders-forfeiture-cartel-money-launde=
ring-case-44765120\u0026ct=3Dga\u0026cd=3DCAEYACoTNzAxNDE5ODc4MzMzMTc5OTA4O=
TIaYjdkMGIxMjNmMjc0YWM4ODpjb206ZW46VVM\u0026usg=3DAFQjCNHKeTb3brU2sr0qOpXXJ=
fuW9Nfntg"

Obviously, what I want to extract is:

http://abcnews.go.com/US/wireStory/judge-orders-forfeiture-cartel-money-laundering-case-44765120

So I need to extract what is between "url=3D" and "\".

I have tried all kinds of grep and sed variations, but nothing works.

I would be very grateful if someone could help me figure this out.

PS: I know that once the URLs are extracted I'll have to deal with the = characters, but one problem at a time :)

  • Maybe you should deal with "=\n" first, to join it into a single line, and then you can sed it (or do that joining in the sed program). Commented Feb 7, 2017 at 21:54
  • thanks, the problem is that when running it on a file with many such patterns, it doesn't work (it only works if I put this pattern alone in a file or pass it as echo input)
    – serge
    Commented Feb 8, 2017 at 19:46
  • Hmm, apparently your sed doesn't have -z, then. If it had, you wouldn't end up with any lines ending in '='. You may need to highlight that in your question. Check man sed and sed --version Commented Feb 8, 2017 at 22:04
  • sed version is sed (GNU sed) 4.2.2 and I do have -z (just checked it with man)
    – serge
    Commented Feb 8, 2017 at 22:13
  • Due to the complexity of the file formatting, I think what I should do is: print all lines that contain url=3D BUT print only what comes AFTER url=3D, AND keep printing UNTIL the line that contains u0026ct BUT print only what comes BEFORE u0026ct. This way I might escape the problem caused by the formatting, which makes sed and grep stop at each end of line. But I have no idea how to do it. As of now, I managed to print everything up to the line containing u0026ct: awk '{print} /u0026ct/ {exit}' INBOX > output.txt ...
    – serge
    Commented Feb 8, 2017 at 22:23

4 Answers


You can use a command line like the following for the processing:

cat INBOX | sed -z -e 's/=\n//g' | \
   sed -e 's/.*u0026url=3D//;t a;d;:a' -e 's/\\u0026ct=3D.*//'

The first sed step joins lines that end with "=" to their succeeding lines, in particular turning the interesting lines into one-liners.

The second sed step first strips the head part (everything up to u0026url=3D) from matching lines and deletes any lines without that head; it then removes the tail part (from \u0026ct=3D onward) of the target lines.
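As a sanity check, here is the pipeline run on a shortened, hypothetical version of the sample (the URL is made up for brevity; GNU sed is assumed for -z):

```shell
# Shortened, hypothetical quoted-printable sample (made-up URL):
printf '%s\n' \
  '"url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt\u0026url=3Dhtt=' \
  'p://example.com/story-44765120\u0026ct=3Dga\u0026usg=3DAFQjCN"' |
# Step 1 (GNU sed): join lines ending in "=" with their successors.
sed -z -e 's/=\n//g' |
# Step 2: keep only the text between "u0026url=3D" and "\u0026ct=3D".
sed -e 's/.*u0026url=3D//;t a;d;:a' -e 's/\\u0026ct=3D.*//'
# prints: http://example.com/story-44765120
```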

  • did you remember the -z on the first sed? Commented Feb 8, 2017 at 21:35
  • yes, I did exactly how you put it.
    – serge
    Commented Feb 8, 2017 at 21:54
  • Peculiar; so running just the first sed still results in lines ending with =? I.e. it doesn't join? Maybe your line ends are \r then (are you using a Mac?). Try replacing \n with \r... Or are you using Windows? Then replace \n with \r\n... Commented Feb 8, 2017 at 22:18
  • I'm using Linux
    – serge
    Commented Feb 8, 2017 at 22:26
  • Yes, but the file might have, say, Windows line endings. You might replace \n with \r\?\n so as to treat both Windows and Linux line endings as such. Commented Feb 8, 2017 at 22:29

Can you try this command:

awk -F"3D" '{print $4}' input.txt | sed "s/\\\u.*//"

I'm not certain how you are getting the alerts, but I will provide an example of how to do it if the alerts were in a simple text file. I would deal with the "=" first using tr, then use Perl lookarounds with grep as follows ...

cat input.txt | tr --delete '=\n' | grep -oP '(?<=url3D).*?(?=\\u0026)'

The output using your sample is

http://abcnews.go.com/US/wireStory/judge-orders-forfeiture-cartel-money-laundering-case-44765120
  • I perform the extractions on a single INBOX file (Thunderbird). Thanks for your suggestion; it works flawlessly on a file containing ONLY this sample, however when run on the INBOX file (with many such URLs) it doesn't work at all (the output is the same as the input). I really don't know what's wrong. As I have already mentioned, the pattern is always the same (URLs are located between u0026url=3D and \u0026ct=)
    – serge
    Commented Feb 8, 2017 at 19:52
  • the weird thing is that until it encounters another instance of "url": , the command you suggest works as intended and ignores all non-relevant strings (lot of html code and useless text), but whenever it encounters the next "url":, it breaks (stops extracting and just puts the rest of the INBOX file content after the first extracted URL)
    – serge
    Commented Feb 8, 2017 at 21:13
  • Well we are part of the way there then. :) If you would like to edit your question and provide a more complete example (with 2+ urls) I'd be happy to take a look
    – rcjohnson
    Commented Feb 9, 2017 at 8:41

Question solved using the suggestion of Ralph Rönnquist:

cat INBOX | sed -z -e 's/=\r\?\n//g' | \
   sed -e 's/.*u0026url=3D//;t a;d;:a' -e 's/\\u0026ct=3D.*//' > output.txt

It takes a long time to compute, but it does extract the URLs correctly.
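If speed matters, one hedged alternative is to replace the branchy second sed with a single grep extraction after the join. This is only a sketch, assuming GNU sed (for -z) and GNU grep with PCRE support (-P); it relies on the URL containing no backslash, so the match stops at the \ of \u0026ct:

```shell
# Sketch: join quoted-printable soft line breaks ("=\n", optionally "=\r\n"),
# then grab everything between "u0026url=3D" and the next backslash.
sed -z 's/=\r\?\n//g' INBOX |
grep -oP '(?<=u0026url=3D)[^\\]*' > output.txt
```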

Thanks a lot everybody for your assistance!

