In two of my previous posts I describe how I’ve been able to improve the Unicode text-processing that computes a citation’s unique identifier.
A recent bug that cropped up in computing the hash key occurs when a character in the key is written using 3 bytes, each of which ends up as its own code point, such as:
'\u00e2\u0080\u009d', # RIGHT DOUBLE QUOTATION MARK
'\u00e2\u0080\u009c', # LEFT DOUBLE QUOTATION MARK
'\u00e2\u0080\u0098', # LEFT SINGLE QUOTATION MARK
'\u00e2\u0080\u0099', # RIGHT SINGLE QUOTATION MARK
I’m already screening out a list of problematic Unicode code points, but this bug is an entire class of characters rather than something that can be handled with a list like Drupal’s blacklist of common characters, so I decided to write a regular expression to catch the whole class.
Here’s the regular expression that catches these 3-byte characters (it also matches 2-byte and 4-byte sequences of the same shape):
import re
# A UTF-8 lead byte (0xC2-0xF4) followed by one or more continuation bytes (0x80-0xBF)
hexaPattern = re.compile(r'[\xc2-\xf4][\x80-\xbf]+')
str_return = re.sub(hexaPattern, '', str_return)
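To see what the pattern does, here is a minimal sketch. The function name `strip_mojibake` and the sample strings are my own illustrations, not part of the original code, and it assumes Python 3 with a `str` input in which each UTF-8 byte was decoded as its own character.

```python
import re

# Lead "byte" in the range C2-F4 followed by one or more continuation "bytes" 80-BF.
HEXA_PATTERN = re.compile(r'[\xc2-\xf4][\x80-\xbf]+')

def strip_mojibake(text):
    """Remove runs that look like UTF-8 bytes decoded one byte per character."""
    return HEXA_PATTERN.sub('', text)

# A mis-decoded pair of curly double quotes around the word "hello".
sample = 'He said \u00e2\u0080\u009chello\u00e2\u0080\u009d'
cleaned = strip_mojibake(sample)
print(cleaned)
```

Note that a legitimate accented character such as the "é" in "café" is left alone, because it is not followed by a character in the U+0080 to U+00BF range.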
It seems as though this bug may occur when the source document’s encoding is not guessed correctly, so that these characters fail to convert properly to UTF-8.
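A small sketch of how that kind of wrong guess could produce the sequences listed earlier. This is an assumption on my part: it supposes the bytes were valid UTF-8 but were decoded as Latin-1 (ISO-8859-1), which is consistent with the code points shown above but not confirmed.

```python
# U+201C and U+201D are the left and right double quotation marks.
original = '\u201chello\u201d'

# Encode correctly as UTF-8, then decode with the wrong codec (Latin-1).
utf8_bytes = original.encode('utf-8')
garbled = utf8_bytes.decode('latin-1')

# Each quotation mark is 3 bytes in UTF-8, so it becomes 3 separate characters.
print(len(original), len(utf8_bytes), len(garbled))  # 7 11 11
print(ascii(garbled))
```

Each curly quote turns into `\u00e2` followed by two characters in the U+0080 to U+00BF range, which is exactly the shape the regular expression looks for.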
There’s still a chance that I don’t fully understand this bug, or that I might instead want to use the occurrence of one of these errors as an opportunity to revisit the original document’s encoding.
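If revisiting the encoding turns out to be the better route, one option is to try to repair the text rather than strip it. The following is only a sketch under the same assumption that the damage came from decoding UTF-8 bytes as Latin-1; `repair_mojibake` is a hypothetical helper, not something in my current code.

```python
def repair_mojibake(text):
    """Try to undo a Latin-1 mis-decoding of UTF-8; return text unchanged on failure."""
    try:
        # Turn each character back into the byte it came from, then decode as UTF-8.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are not valid UTF-8.
        return text

garbled = 'It\u00e2\u0080\u0099s'
repaired = repair_mojibake(garbled)
print(ascii(repaired))
```

The round trip recovers the RIGHT SINGLE QUOTATION MARK (U+2019) instead of deleting it, and text that was never garbled falls through the `except` branch untouched. It would not help if the wrong guess was some other codec.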