Perl - File Encoding and Word Comparison

I have a file with one phrase or term per line, which I read into Perl from STDIN. I have a list of stopwords (like "á", "são", "é") and I want to compare each of them with each term and remove the term if they are equal. The problem is that I'm not certain of the file's encoding.

I get this from the file command:

words.txt: Non-ISO extended-ASCII English text

My Linux terminal uses UTF-8, and it shows the right content for some words but not for others. Here is the output for some of them:

condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos

You can see that the 3rd and 5th lines show accented and special characters correctly, while the others do not. The correct output for the other lines would be: condiã, conteúdos and moçambique.
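The output above suggests the file mixes encodings: some lines are valid UTF-8, while bytes such as E3, FA and E7 match ã, ú and ç in ISO-8859-1. A minimal sketch of one way to handle that is to decode each line as strict UTF-8 and fall back to ISO-8859-1 when that fails, then compare against the stopwords as decoded characters. The helper names `decode_mixed` and `strip_stopwords` are hypothetical, the ISO-8859-1 fallback is an assumption (the file could also be Windows-1252), and a Latin-1 line that happens to be valid UTF-8 would be misread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one line of raw bytes, trying strict UTF-8
# first and falling back to ISO-8859-1 (an assumed encoding) on failure.
sub decode_mixed {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# Stopwords written as code-point escapes so the example does not depend
# on the encoding of this source file: á, são, é.
my %stopword = map { $_ => 1 } ("\x{E1}", "s\x{E3}o", "\x{E9}");

# Hypothetical helper: drop every whitespace-separated word that is a stopword.
sub strip_stopwords {
    my ($line) = @_;
    return join ' ', grep { !$stopword{ lc $_ } } split ' ', $line;
}

# Demo with in-memory byte strings: one ISO-8859-1 line, one UTF-8 line.
my @raw_lines = (
    "mo\xE7ambique",
    "pedag\xC3\xB3gico s\xC3\xA3o fen\xC3\xB3menos",
);

binmode(STDOUT, ':encoding(UTF-8)');    # encode characters once, on output
print strip_stopwords(decode_mixed($_)), "\n" for @raw_lines;
```

For the real input, the same two calls would go inside a `while (my $raw = <STDIN>) { chomp $raw; ... }` loop, with STDIN left as raw bytes so that the decoding happens only in `decode_mixed`.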

If I use binmode(STDOUT, ":utf8"), the previously "incorrect" lines are printed correctly, but the other ones no longer are. For example, the 3rd line becomes:

ajuda, mas nÃ£o resolve
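The binmode effect can be reproduced in isolation. The sketch below assumes the input was never decoded, so Perl holds raw bytes and a UTF-8 output layer encodes each byte as if it were a character. `encode('UTF-8', ...)` stands in for that output layer here, and the two sample strings are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "não" as raw UTF-8 bytes, read from the file and never decoded.
my $utf8_bytes = "n\xC3\xA3o";
# "condiã" as raw ISO-8859-1 bytes (assumed encoding of the other lines).
my $latin1_bytes = "condi\xE3";

# A UTF-8 output layer treats each byte as a code point and encodes it;
# encode('UTF-8', ...) performs the same transformation explicitly.
my $out_utf8_line   = encode('UTF-8', $utf8_bytes);
my $out_latin1_line = encode('UTF-8', $latin1_bytes);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# The UTF-8 line is encoded a second time: C3 A3 becomes C3 83 C2 A3.
print hex_dump($out_utf8_line), "\n";      # 6E C3 83 C2 A3 6F
# The ISO-8859-1 line becomes valid UTF-8: E3 becomes C3 A3.
print hex_dump($out_latin1_line), "\n";    # 63 6F 6E 64 69 C3 A3
```

This is why the two groups of lines swap places: the output layer repairs the ISO-8859-1 lines and double-encodes the lines that were already UTF-8.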

What should I do, guys?

asked by tchrist, 4 April 2015 at 19:09