I've been hacking on some Perl code that extracts data that came from web users around the world and has been stored in MySQL (with no real encoding information, of course). My goal is to generate well-formed, valid XML that can be read by another tool.
Now I'll be the first to admit that I never really took the time to like, understand, or pay much attention to all the changes in Perl's character and byte handling over the years. I'm one of those developers that, I suspect, is representative of the majority (at least in this self-centered country). I think it's all stupid and complicated and should Just Work... somehow.
But at the same time I know it's not.
Anyway, after importing lots of data I came across my first bug. Well, okay... not my first bug. My first bug related to this encoding stuff. The XML parser on the receiving end raised hell about some weird characters coming in.
Oh, crap. That's right. This is the big bad Internet and I forgot to do anything to scrub the data so that it'd look like the sort of thing you can cram into XML and expect to maybe work.
A little searching around managed to jog my memory and I updated my code to include something like this:
use Encode;
...
my $data = Encode::decode('utf8', $row->{'Stuff'});
And all was well for quite some time. I got a lot farther with that until this weekend when Perl itself began to blow up on me, throwing fatal exceptions like this:
Malformed UTF-8 character (fatal) ...
My first reaction, like yours probably, was WTF?!?! How on god's green earth is this a FATAL error?
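For what it's worth, one way to keep a single bad row from killing the whole run is to trap the failure at decode time. Here's a minimal sketch, not what my code actually did: it assumes the column is supposed to hold UTF-8 bytes, and the helper name decode_untrusted is made up for illustration. It tries a strict decode first and, if that croaks, falls back to Encode's default behavior of substituting U+FFFD for whatever it can't make sense of. I can't promise this covers every path that leads to the fatal error above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the original code): decode bytes that are
# *supposed* to be UTF-8 without letting one bad value abort the program.
sub decode_untrusted {
    my ($bytes) = @_;

    # Strict attempt: FB_CROAK makes decode() die on malformed input.
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;

    # Fallback: with no CHECK argument, malformed sequences are replaced
    # with U+FFFD (the Unicode replacement character) instead of dying.
    return Encode::decode('UTF-8', $bytes);
}
```

You'd call it where the original decode was, something like my $data = decode_untrusted($row->{'Stuff'}); and the replacement characters at least show you which rows had garbage in them.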
After much swearing, a Twitter plea, and some reading (thanks Twitter world!), I came across a section of the Encode manual page from Perl.
I'm going to quote from it a fair amount here because I know you're as lazy as I am and won't go read it if I just link here. The relevant section is at the very end (just before SEE ALSO) and titled UTF-8 vs. utf8.
....We now view strings not as sequences of bytes, but as
sequences of numbers in the range 0 .. 2**32-1 (or in the case of
64-bit computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.
That has been the perl