I have some text that I am trying to decode and encode in Python:

import html.parser

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm 
                 apple.DisplayIsAwesome, sooo happppppy 🙂 
                 http://www.apple.com"
tweet = original_tweet.decode("utf8").encode('ascii', 'ignore')

I entered the original tweet on one line in Spyder (Python 3.6).

I get the following message:

AttributeError: 'str' object has no attribute 'decode'

Is there an alternative way to rewrite this code for Python 3.6?

  • You seem to be confused about what a string in Python represents and what encoding and decoding do. Encoding turns a string into bytes; decoding does the opposite. In that light, your call doesn't make sense, and hence it fails. Commented Mar 10, 2018 at 9:39
  • This is the website I am following and am unable to understand what is going on: analyticsvidhya.com/blog/2014/11/… Commented Mar 10, 2018 at 9:41
  • You cannot use str.encode() and bytes.decode() to handle the HTML entities &lt; and &amp; if that’s what you’re trying to do. Look into libs such as lxml (see "Parsing HTML with lxml") for that, based on you importing an HTML parser. However, your string original_tweet isn’t proper HTML, so you may consider fudging that first… Commented Mar 10, 2018 at 9:43
  • @cordelia That website's code does not make any sense. If your original_tweet value is a character string already, there's no need to encode or decode it. If it's a byte string (i.e. a bytes object), decode it once to get a character string. Commented Mar 10, 2018 at 9:44
  • I believe that the code on that website was written for Python 2. There, a regular string (without u prefix) is a byte sequence, which can be decoded. Commented Mar 10, 2018 at 9:47

1 Answer

In Python 3, your original_tweet is a Unicode string (a sequence of Unicode code points, not bytes in any particular encoding) that contains a Unicode emoji. Because Unicode is a far larger superset of the 128 ASCII characters, you cannot simply convert a Unicode string into an ASCII string.

However, if you can live with some data loss (i.e. drop the emoji) then you can try the following (see this or this related question):

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm ..."

# Encode the Unicode string into a sequence of UTF-8 bytes.
original_tweet_bytes = original_tweet.encode("utf-8")

# Decode those bytes as ASCII, dropping every byte that is not valid ASCII;
# pass errors="strict" to find the failing characters, and also read up on
# the option errors="replace".
original_tweet_ascii = original_tweet_bytes.decode("ascii", errors="ignore")

Or as a simple one-liner:

tweet = original_tweet.encode("utf-8").decode("ascii", errors="ignore")
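To make the errors= options mentioned above more concrete, here is a minimal sketch that compares a few of the standard error handlers. Note that it encodes straight to ASCII with str.encode() instead of going through UTF-8 bytes as the answer does; with errors="ignore" the result is the same, but with errors="replace" this variant yields a plain "?" per character, whereas decoding UTF-8 bytes as ASCII would insert U+FFFD for every offending byte. The sample text is made up for illustration.

```python
# Hypothetical sample text: a curly apostrophe (U+2019) and an emoji (U+1F642).
sample = "you\u2019re happy \U0001F642"

# errors="ignore" silently drops every character ASCII cannot represent.
ignored = sample.encode("ascii", errors="ignore").decode("ascii")

# errors="replace" substitutes "?" for each character that cannot be encoded.
replaced = sample.encode("ascii", errors="replace").decode("ascii")

# errors="backslashreplace" keeps the information as escape sequences.
escaped = sample.encode("ascii", errors="backslashreplace").decode("ascii")

# errors="strict" (the default) raises instead of losing data.
try:
    sample.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

print(repr(ignored))
print(repr(replaced))
print(repr(escaped))
print(strict_error)
```

Which handler fits depends on whether silently losing characters is acceptable for the later processing steps.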

Note that this does not convert the HTML entities &lt; and &amp;, which you may have to address separately. You can do that with a proper HTML parser (e.g. lxml) or with a simple string replacement:

tweet = tweet.replace("&lt;", "<").replace("&amp;", "&")

Or, as of Python 3.4, you can use html.unescape() like so:

import html

tweet = html.unescape(tweet)

See also this question on how to handle HTML entities in strings.
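As a rough illustration of the difference between the two approaches, the sketch below runs html.unescape() on a made-up string that also contains a numeric entity, which the two-call str.replace() chain would leave untouched. The input string and variable names are assumptions for the example only.

```python
import html

# Hypothetical input: two named entities plus a numeric one (&#8217; is U+2019).
escaped_tweet = "I luv my &lt;3 iphone &amp; you&#8217;re awsm"

# html.unescape() resolves named and numeric character references alike.
unescaped_tweet = html.unescape(escaped_tweet)

# The simple replacement only knows the two entities it was told about.
replaced_tweet = escaped_tweet.replace("&lt;", "<").replace("&amp;", "&")

print(unescaped_tweet)
print(replaced_tweet)
```

If you also strip non-ASCII characters, it is probably safer to unescape first, so that entities which expand to non-ASCII characters are handled in the same pass.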

Addendum. The Unidecode package for Python seems to provide useful functionality for this, too, although in its current version it does not handle emojis.

4 Comments

Thank you so much for helping me with this. That truly resolves my query.
How do I avoid losing the 're for the you're? Apologies for bugging you with this but I just noticed it.
@cordelia, the ’ character is Unicode character U+2019 and has no direct equivalent in ASCII. What you can do, however, is use str.replace() to replace all ‘ and ’ with ASCII ' and the double quotation marks “ and ” with ASCII ". See also this question: Replacing unicode punctuation with ASCII approximations.
Thanks for this @Jens
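The replacement suggested in the comment about U+2019 could be sketched as follows. This version uses str.translate() with a small mapping instead of chained str.replace() calls, which is a substitution on my part and not what the comment literally proposes; the mapping and the helper name to_ascii are hypothetical and cover only the four curly quotes.

```python
# Assumed mapping: curly quotes to their closest ASCII look-alikes.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def to_ascii(text):
    """Map curly quotes to ASCII, then drop whatever is still non-ASCII."""
    normalized = text.translate(PUNCTUATION_MAP)
    return normalized.encode("ascii", errors="ignore").decode("ascii")


result = to_ascii("you\u2019re \u201cawsm\u201d \U0001F642")
print(repr(result))
```

The apostrophe now survives because it is converted before the lossy ASCII step; the emoji is still dropped.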
