How to show non-ASCII characters in Python?
Question
I'm using the Python Shell in this way:
>>> s = 'Ã'
>>> s
'\xc3'
How can I print the variable s so that it shows the character Ã? That is the first and easiest question. What I'm actually doing is fetching the content of a web page that contains non-ASCII characters like the one above, and others with accents or tildes such as á, é, í, ñ, etc. I'm also trying to run a regex that has these characters in its pattern against the content of the web page.
How can I solve this problem?
This is an example of one regex:
u'<td[^>]*>\s*Definición\s*</td><td class="value"[^>]*>\s*(?P<data>[\w ,-:\.\(\)]+)\s*</td>'
If I use the Expresso application, it works fine.
EDIT[05/26/2009 16:38]: Sorry about my explanation. I'll try to explain better.
I have to get some text from a page. I have the URL of that page and the regex to extract that text. My first thought was that the regex was wrong, but I checked it with Expresso and it works fine: I got the text I wanted. My second thought was to print the content of the page, and that was when I saw that the content was not what I see in the source code of the web page. The differences are the non-ASCII characters like á, é, í, etc. Now I don't know what to do, or whether the problem is in the encoding of the page content or in the pattern text of the regex. One of the regexes I've defined is the one above.
The question would be: is there any problem with using a regex whose pattern text contains non-ASCII characters?
Solution
Suppose you want to print it as UTF-8. Before Python 3, the best approach is to encode it explicitly:
print u'Ã'.encode('utf-8')
If you get the text from an external source, you have to decode it explicitly with decode('utf-8'), for example:
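f = open(my_file)  # my_file: path to a UTF-8 encoded text file
a = f.next().decode('utf-8')  # a now holds the first line as a unicode object
print a.encode('utf-8')  # encode again before writing to the terminal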
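The decode-on-input, encode-on-output pattern above can also be sketched for Python 3, where io.open decodes while reading. This is only an illustration under assumptions: the helper read_first_line and the temporary sample file are hypothetical names introduced here, and the file is assumed to be UTF-8.

```python
import io
import os
import tempfile


def read_first_line(path, encoding="utf-8"):
    """Return the first line of a text file as decoded (Unicode) text."""
    # io.open decodes while reading, so no manual .decode() call is needed.
    with io.open(path, "r", encoding=encoding) as f:
        return f.readline().rstrip("\n")


# Hypothetical sample file, created only so the sketch runs on its own.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write("Definición: Ã\n")

line = read_first_line(sample_path)   # Unicode text
encoded = line.encode("utf-8")        # bytes, ready for a UTF-8 terminal or file
os.remove(sample_path)
```

Keeping text as Unicode inside the program and converting only at the edges avoids mixing byte strings with different encodings.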
OTHER TIPS
How can I print the variable s so that it shows the character Ã?
Use print:
>>> s = 'Ã'
>>> s
'\xc3'
>>> print s
Ã
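The difference in the session above is that typing a name at the prompt shows its repr (escapes for non-ASCII bytes), while print writes the value itself to the terminal. A rough Python 3 sketch of the same idea follows; it assumes the shell encoded the input as Latin-1, which is what the single byte '\xc3' suggests.

```python
raw = b"\xc3"                   # the byte the shell echoed as '\xc3'
text = raw.decode("latin-1")    # assumption: the terminal used Latin-1
echo_form = repr(raw)           # what the prompt shows: "b'\xc3'" in Python 3
utf8_form = text.encode("utf-8")  # the same character takes two bytes in UTF-8
```

Here text is the single character Ã, so print(text) would display it on a terminal that supports it.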
I would use ord() to find out if a character is ASCII/special:
if ord(c) > 127:
    # special character
This probably won't work with multibyte encodings such as UTF-8. In this case, I would convert to Unicode before testing.
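A minimal Python 3 sketch of converting to Unicode before testing is shown here. The function name non_ascii_chars is hypothetical, and UTF-8 is only an assumed default for the input encoding.

```python
def non_ascii_chars(data, encoding="utf-8"):
    """Return the non-ASCII characters found in data (bytes or text)."""
    # Decode first so each item is a whole character, not one byte of a
    # multibyte sequence.
    text = data.decode(encoding) if isinstance(data, bytes) else data
    return [c for c in text if ord(c) > 127]


found = non_ascii_chars("Definición ñ".encode("utf-8"))
```

With the sample input, found holds the two characters ó and ñ, even though each of them occupies two bytes in UTF-8.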
If you get special characters from a web page, you need to know the page's encoding. Then decode the content with that encoding; see the Unicode HOWTO.
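As a sketch of that advice in Python 3, the page bytes are decoded first and a pattern containing non-ASCII characters is then applied to the decoded text. The page fragment below is made up for illustration, and "utf-8" is an assumption; a real page declares its charset in the HTTP headers or a meta tag. The pattern is the one from the question, written as a raw string.

```python
import re

# Hypothetical page fragment, encoded the way a server might send it.
page_bytes = (
    '<td>Definición</td>'
    '<td class="value">Señal analógica (ejemplo)</td>'
).encode("utf-8")

# Decode before matching; the encoding here is assumed, not detected.
html = page_bytes.decode("utf-8")

# re.UNICODE makes \w match accented letters (already the default for
# text patterns in Python 3; it matters on Python 2).
pattern = re.compile(
    r'<td[^>]*>\s*Definición\s*</td>'
    r'<td class="value"[^>]*>\s*(?P<data>[\w ,-:\.\(\)]+)\s*</td>',
    re.UNICODE,
)

match = pattern.search(html)
data = match.group("data") if match else None
```

Once both the pattern and the page content are Unicode text, non-ASCII characters in the pattern behave like any other characters.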
Edit: I'm definitely not sure what this question is about... It may be a good idea to clarify it.