Decoding and Encoding in Python

Question

I have some text that I am trying to decode and encode in Python:

import html.parser

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm 
                 apple.DisplayIsAwesome, sooo happppppy 🙂 
                 http://www.apple.com"
tweet = original_tweet.decode("utf8").encode('ascii', 'ignore')

I have entered the original tweet on one line in Spyder (Python 3.6).

I get the following message:

AttributeError: 'str' object has no attribute 'decode'

Is there an alternative way to rewrite this code for Python 3.6?

You seem to be confused about what a string in Python represents and what encoding and decoding do. Encoding turns a string into bytes; decoding does the opposite. In that light, your call doesn't make sense, which is why it fails.
– Ulrich Eckhardt, Commented Mar 10, 2018 at 9:39
This is the website I am following, and I am unable to understand what is going on: analyticsvidhya.com/blog/2014/11/…
– cordelia, Commented Mar 10, 2018 at 9:41
You cannot use str.encode() and bytes.decode() to handle the HTML entities &lt; and &amp;, if that's what you're trying to do. Look into libraries like lxml for that (see "Parsing HTML with lxml"), given that you are importing an HTML parser. However, your string original_tweet isn't proper HTML, so you may consider fudging that first…
– Jens, Commented Mar 10, 2018 at 9:43
@cordelia That website's code does not make any sense. If your original_tweet value is a character string already, there's no need to encode or decode it. If it's a byte string (i.e. a bytes object), decode it once to get a character string.
– phihag, Commented Mar 10, 2018 at 9:44
I believe that the code on that website was written for Python 2. There, a regular string (without the u prefix) is a byte sequence, which can be decoded.
– Ulrich Eckhardt, Commented Mar 10, 2018 at 9:47
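
To illustrate the distinction the comments draw, here is a minimal sketch of how str and bytes relate in Python 3. The sample text is made up for illustration and is not taken from the question.

```python
# In Python 3, str holds Unicode code points and bytes holds raw bytes.
text = "caf\u00e9"                 # a str containing one non-ASCII character

encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(type(encoded).__name__, encoded)
print(type(decoded).__name__, decoded)

# str has no decode() method in Python 3, which is why the call in the
# question raises AttributeError.
print(hasattr(text, "decode"))
```

In Python 2, a plain string literal was a byte string, so calling decode() on it was valid there.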

Jens · Accepted Answer · 2018-03-10 22:19:03Z

In Python 3, your original_tweet is a str, i.e. a sequence of Unicode code points, and it contains a Unicode emoji. Because Unicode defines well over 100,000 characters and ASCII covers only 128 of them, you cannot convert an arbitrary Unicode string into an ASCII string without losing information.

However, if you can live with some data loss (i.e. dropping the emoji and other non-ASCII characters), then you can try the following:

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm ..."

# Encode the string (a sequence of Unicode code points) into UTF-8 bytes.
original_tweet_bytes = original_tweet.encode("utf-8")

# Decode those bytes into a string containing only ASCII characters;
# errors="ignore" silently drops every byte that is not valid ASCII.
# Pass errors="strict" to find the failing characters, and it is also
# worth reading up on the option errors="replace".
original_tweet_ascii = original_tweet_bytes.decode("ascii", errors="ignore")

Or as a simple one-liner:

tweet = original_tweet.encode("utf-8").decode("ascii", errors="ignore")
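
The error handlers mentioned in the comments above behave quite differently, so here is a small sketch comparing them. The sample string is invented for illustration; the equivalent form that encodes directly to ASCII is an alternative spelling, not something the code above uses.

```python
# Sample text with an emoji (U+1F642) and a curly apostrophe (U+2019).
sample = "awsm \U0001F642 you\u2019re"
data = sample.encode("utf-8")

# errors="ignore": non-ASCII bytes are silently dropped.
ignored = data.decode("ascii", errors="ignore")

# errors="replace": non-ASCII bytes become U+FFFD replacement characters.
replaced = data.decode("ascii", errors="replace")

# errors="strict" (the default): the first non-ASCII byte raises an error
# that reports the offending byte position.
try:
    data.decode("ascii", errors="strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Encoding straight to ASCII gives the same result as the round trip above.
direct = sample.encode("ascii", errors="ignore").decode("ascii")

print(repr(ignored))
print(repr(replaced))
print(strict_error)
```

Note that errors="ignore" also removes the apostrophe from "you’re", because the curly apostrophe is not an ASCII character.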

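Note that the encode/decode round trip does not convert the HTML entities &lt; and &amp;, which you may have to address separately. You can do that using a proper HTML parser (e.g. lxml), or use a simple string replacement (replace &amp; last, so that a literal &amp;lt; is not unescaped twice):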

tweet = tweet.replace("&lt;", "<").replace("&amp;", "&")

Or, as of Python 3.4, you can use html.unescape() from the standard library's html module like so:

tweet = html.unescape(tweet)

See also this question on how to handle HTML entities in strings.
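
As a rough sketch of the entity handling described above, the following compares html.unescape() with chained str.replace() calls. The input strings are shortened, made-up variants of the tweet.

```python
import html

raw = "I luv my &lt;3 iphone &amp; you"

# html.unescape() converts all named and numeric character references.
unescaped = html.unescape(raw)
print(unescaped)

# Manual replacement gives the same result for these two entities,
# provided that &amp; is replaced last.
manual = raw.replace("&lt;", "<").replace("&amp;", "&")
print(manual)

# Why order matters: "&amp;lt;" is the escaped form of the literal text "&lt;".
tricky = "&amp;lt;"
right = html.unescape(tricky)
wrong = tricky.replace("&amp;", "&").replace("&lt;", "<")
print(right, wrong)
```

Replacing &amp; first turns the escaped text into a new entity that the second replacement then converts again, which is why html.unescape() is usually the safer choice.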

Addendum. The Unidecode package for Python seems to provide useful functionality for this, too, although in its current version it does not handle emojis.

Thank you so much for helping me with this. That truly resolves my query.
How do I avoid losing the 're for the you're? Apologies for bugging you with this but I just noticed it.
@cordelia, the ’ character is Unicode character U+2019 and has no direct equivalent in ASCII. What you can do, however, is use str.replace() to replace all ‘ and ’ with the ASCII ' and the double quotation marks “ and ” with the ASCII ". See also this question: Replacing unicode punctuation with ASCII approximations.
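
One possible way to do the replacement described in the last comment in a single pass is str.translate() with a mapping table, used here instead of chained str.replace() calls. The names PUNCTUATION_MAP and ascii_quotes are hypothetical, and the table covers only the four quotation marks mentioned above.

```python
# Map curly quotation marks to their closest ASCII equivalents.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def ascii_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(PUNCTUATION_MAP)


tweet = "you\u2019re \u201cawsm\u201d \U0001F642"
cleaned = ascii_quotes(tweet)

# Drop whatever non-ASCII characters remain (here, the emoji).
ascii_only = cleaned.encode("ascii", errors="ignore").decode("ascii")
print(ascii_only)
```

Applying the mapping before the lossy ASCII step keeps the apostrophe in "you're" instead of dropping it.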
