Question

I'm using the Python Shell in this way:

>>> s = 'Ã'
>>> s
'\xc3'

How can I print the variable s so that it shows the character Ã? This is the first and easiest question. What I am really doing is fetching the content of a web page that contains non-ASCII characters such as the one above, plus accented characters like á, é, í, ñ, etc. I am also trying to run a regex whose pattern contains these characters against the content of the web page.

How can I solve this problem?

This is an example of one regex:

u'<td[^>]*>\s*Definición\s*</td><td class="value"[^>]*>\s*(?P<data>[\w ,-:\.\(\)]+)\s*</td>'

If I use the Expresso application, it works fine.

EDIT[05/26/2009 16:38]: Sorry about my explanation. I'll try to explain better.

I have to extract some text from a page. I have the URL of that page and a regex to get the text. My first thought was that the regex was wrong, but I checked it with Expresso and it works fine: I got the text I wanted. My second thought was to print the content of the page, and that is when I saw that the content was not what I see in the source of the web page. The differences are the non-ASCII characters like á, é, í, etc. Now I don't know what to do, or whether the problem is in the encoding of the page content or in the pattern text of the regex. One of the regexes I've defined is the one above.

The question would be: is there any problem with using a regex whose pattern text contains non-ASCII characters?


Solution

Suppose you want to print it as UTF-8. Before Python 3, the best approach is to encode it explicitly:

print u'Ã'.encode('utf-8')

If you get the text from an external source, you have to decode it explicitly with decode('utf-8'), for example:

f = open(my_file)             # my_file: path to a UTF-8 encoded file
a = f.next().decode('utf-8')  # you have a unicode line in a
print a.encode('utf-8')       # encode back to bytes for output
f.close()
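
The same "decode on input, encode on output" pattern can be sketched in Python 3 syntax, where str is already Unicode. This is only an illustration: read_first_line and the sample bytes are hypothetical stand-ins for a real file, and UTF-8 is assumed as the encoding.

```python
import io


def read_first_line(raw_bytes, encoding="utf-8"):
    """Decode the first line of a byte stream into text (hypothetical helper)."""
    stream = io.BytesIO(raw_bytes)
    return stream.readline().decode(encoding)


# Stand-in for the contents of a UTF-8 encoded file.
raw = "Definición\nsecond line\n".encode("utf-8")

line = read_first_line(raw)      # text (Unicode) inside the program
encoded = line.encode("utf-8")   # bytes again, ready to be written out
print(line, end="")
```

Inside the program the data stays Unicode text; bytes only appear at the boundaries.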

OTHER TIPS

How can I print s variable to show the character Ã???
Use print. The shell echoes repr(s), which escapes non-ASCII bytes, while print writes the string itself:

>>> s = 'Ã'
>>> s
'\xc3'
>>> print s
Ã
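
The difference between the echoed representation and the printed string can be reproduced in Python 3 as a rough sketch. Assumptions: Python 3, where ascii() escapes non-ASCII characters much like the Python 2 shell echo does, and Latin-1 as the encoding in which Ã is the single byte 0xC3.

```python
s = "Ã"

escaped = ascii(s)                 # escaped form, similar to what the Python 2 shell echoes
latin1_bytes = s.encode("latin-1") # the single byte behind '\xc3'

print(escaped)  # the escaped representation
print(s)        # the character itself
```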

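I would use ord() to find out whether a character is ASCII or special:

if ord(c) > 127:
    pass  # c is a special (non-ASCII) character

On a byte string this tests single bytes, so with a multibyte encoding such as UTF-8 it flags each byte of a character rather than the character itself. In this case, I would convert to Unicode before testing.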
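
A small sketch of why decoding first matters, assuming Python 3 and UTF-8 input. The helper name non_ascii_chars and the sample word are made up for illustration.

```python
def non_ascii_chars(text):
    """Return the characters of a decoded (Unicode) string with code point > 127."""
    return [c for c in text if ord(c) > 127]


# Bytes as they might arrive from a file or socket (sample data).
raw = "señal".encode("utf-8")

# Byte-level test: 'ñ' is two bytes in UTF-8, so two values are flagged.
high_bytes = [b for b in raw if b > 127]

# Character-level test after decoding: exactly one character is flagged.
chars = non_ascii_chars(raw.decode("utf-8"))

print(len(high_bytes), chars)
```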

If you get special characters from a web page, you need to know the page's encoding. Then decode the content with it; see the Unicode HOWTO.
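
As a hedged sketch of how this applies to the regex from the question: once the page bytes are decoded with the right encoding, a pattern containing non-ASCII characters matches normally. Assumptions: Python 3, a page that is actually UTF-8 (in practice the charset would come from the HTTP headers or the page's meta tag), and a made-up HTML fragment standing in for the downloaded content.

```python
import re

# The pattern from the question, as a raw string.
PATTERN = re.compile(
    r'<td[^>]*>\s*Definición\s*</td><td class="value"[^>]*>'
    r'\s*(?P<data>[\w ,-:\.\(\)]+)\s*</td>'
)

# Made-up fragment standing in for the downloaded page (bytes, UTF-8).
html = '<td>Definición</td><td class="value">Texto de ejemplo, con á y ñ.</td>'
page_bytes = html.encode("utf-8")

# Decoding with the wrong charset gives mojibake such as "DefiniciÃ³n",
# so the pattern no longer matches.
wrong_text = page_bytes.decode("latin-1")
print(PATTERN.search(wrong_text))

# Decoding with the page's real charset makes the pattern match.
right_text = page_bytes.decode("utf-8")
match = PATTERN.search(right_text)
print(match.group("data"))
```

The stray Ã seen in output like this is typically the first byte of a UTF-8 sequence interpreted as Latin-1, which points to an encoding mismatch rather than a faulty regex.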

Edit: I'm definitely not sure what this question is about... It may be a good idea to clarify it.

Licensed under: CC-BY-SA with attribution