I have tried every combination of grep and sed commands I can think of, but I cannot extract URLs matching the following pattern (from Google Alert e-mails in plain text):
"url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt\u0026url=3Dhtt=
p://abcnews.go.com/US/wireStory/judge-orders-forfeiture-cartel-money-launde=
ring-case-44765120\u0026ct=3Dga\u0026cd=3DCAEYACoTNzAxNDE5ODc4MzMzMTc5OTA4O=
TIaYjdkMGIxMjNmMjc0YWM4ODpjb206ZW46VVM\u0026usg=3DAFQjCNHKeTb3brU2sr0qOpXXJ=
fuW9Nfntg"
Obviously, what I want to extract is:
http://abcnews.go.com/US/wireStory/judge-orders-forfeiture-cartel-money-laundering-case-44765120
So I need to extract what is between "url=3D" and "\".
I have tried all kinds of grep and sed variations, but nothing works.
I would be very grateful if someone could help me figure this out.
PS: I know that once the URLs are extracted I'll have to deal with the = characters, but one problem at a time :)
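One possible way to approach this, as a rough sketch rather than a definitive answer: the trailing "=" on each line looks like a quoted-printable soft line break, so the lines have to be joined before any pattern can match across them. The function name extract_alert_urls below is made up for illustration, and the sketch assumes a grep that supports -o (GNU and BSD grep do, but it is not required by POSIX).

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not from the question):
# reads quoted-printable text on stdin, prints one extracted URL per line.
extract_alert_urls() {
    # Step 1: join soft line breaks. A line ending in "=" continues on the
    # next line, so drop the "=" and print without a newline.
    awk '{ if (sub(/=$/, "")) printf "%s", $0; else print }' |
        # Step 2: keep only "url=3D" followed by everything up to the
        # next backslash (the start of "\u0026").
        grep -o 'url=3D[^\\]*' |
        # Step 3: strip the "url=3D" prefix, leaving the bare URL.
        sed 's/^url=3D//'
}
```

It could then be run as `extract_alert_urls < INBOX > urls.txt`. Any remaining "=XX" sequences inside an extracted URL are left encoded; decoding them is a separate step.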
sed
it (or do that joining in the sed program).
sed doesn't have -z, then. If it had, you wouldn't end up with any lines ending in '='. You may need to highlight that in your question. Check man sed and sed --version.
awk '{print} /u0026ct/ {exit}' INBOX > output.txt
...