hashkey

In two of my previous posts I describe how I’ve been able to improve the Unicode text-processing that computes a citation’s unique identifier.

A recent bug that cropped up in computing the hash key occurs when a character in the hash key is written using 3 bytes, such as:

'\u00e2\u0080\u009d',   # RIGHT DOUBLE QUOTATION MARK
'\u00e2\u0080\u009c',   # LEFT DOUBLE QUOTATION MARK
'\u00e2\u0080\u0098',   # LEFT SINGLE QUOTATION MARK
'\u00e2\u0080\u0099',   # RIGHT SINGLE QUOTATION MARK

I’m already screening out a list of problematic Unicode code points, but this bug is more of an entire class of characters than something that can be handled like Drupal’s blacklist of common characters, so I decided to write a regular expression to catch this whole class of bug.

Here’s the regular expression that catches these 3-byte characters:

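  import re
  # A UTF-8 lead byte value (0xC2-0xF4) followed by one or more continuation byte values (0x80-0xBF)
  hexaPattern = re.compile(r'[\xc2-\xf4][\x80-\xbf]+')
  str_return = re.sub(hexaPattern, '', str_return)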
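To see what the pattern does in practice, here is a minimal sketch that applies it to a sample string. The helper name `strip_misdecoded` and the sample strings are my own illustrations, not part of the CiteIt code. The second example shows a caveat: the pattern can also match legitimate text, such as an accented letter followed by a closing guillemet.

```python
import re

# Same pattern as in the post: a character in the UTF-8 lead-byte range
# followed by one or more characters in the continuation-byte range.
HEXA_PATTERN = re.compile(r'[\xc2-\xf4][\x80-\xbf]+')


def strip_misdecoded(text):
    """Remove every run that matches the pattern (hypothetical helper name)."""
    return HEXA_PATTERN.sub('', text)


# Mis-decoded curly quotes around a word.
sample = 'He said \u00e2\u0080\u009chello\u00e2\u0080\u009d'
cleaned = strip_misdecoded(sample)
print(repr(cleaned))

# Caveat: 'é' (U+00E9) followed by '»' (U+00BB) is valid text but also matches.
french = 'caf\u00e9\u00bb'
print(repr(strip_misdecoded(french)))
```

Because of that caveat, the regular expression is probably best treated as a last-resort cleanup rather than a general-purpose filter.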

It seems as though this bug may occur when the source document’s encoding is not guessed correctly, so these characters fail to convert properly to UTF-8.
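Here is a small sketch of how that kind of wrong guess could produce the sequences listed above. It assumes the document was really UTF-8 but was read as Latin-1 (ISO-8859-1); that specific pairing is my assumption, and the function names are hypothetical. Under that assumption the damage is reversible, so the text could be repaired instead of deleted.

```python
def misdecode(text):
    """Simulate the suspected failure: UTF-8 bytes read as Latin-1."""
    return text.encode('utf-8').decode('latin-1')


def try_repair(text):
    """Reverse the mis-decoding; return the input unchanged if that fails."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1-encodable, or the bytes are not valid UTF-8:
        # assume the text was fine to begin with.
        return text


right_quote = '\u201d'               # RIGHT DOUBLE QUOTATION MARK
broken = misdecode(right_quote)      # three characters, one per UTF-8 byte
repaired = try_repair(broken)

print(repr(broken))
print(repr(repaired))
```

If the wrong guess was a different encoding, such as Windows-1252, the round trip would need that codec instead, and some bytes might not survive it.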

There’s still a chance that I don’t fully understand this bug, or alternatively that I might want to use the occurrence of one of these errors as an opportunity to revisit the original document’s encoding.

Unicode block 0180 Latin Extended-B

I’ve known for a while that a Unicode bug was preventing the CiteIt.net JavaScript hash function from matching the Python hash function.

I did some research into this recently and discovered that JavaScript uses UTF-16, while my Python code uses UTF-8.
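A quick way to see why this matters: the same text produces different bytes under the two encodings, so any hash computed over those bytes will differ. The sketch below uses SHA-256 purely for illustration; the hash algorithm is my assumption here and may not match what the CiteIt code actually uses.

```python
import hashlib

text = 'Br\u00fcgger'  # "Brügger", a name containing a non-ASCII character

utf8_bytes = text.encode('utf-8')       # 'ü' becomes two bytes: C3 BC
utf16_bytes = text.encode('utf-16-le')  # every character here becomes two bytes

# SHA-256 is an assumption for illustration only.
utf8_digest = hashlib.sha256(utf8_bytes).hexdigest()
utf16_digest = hashlib.sha256(utf16_bytes).hexdigest()

print(len(utf8_bytes), len(utf16_bytes))
print(utf8_digest == utf16_digest)
```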

I changed this in the WordPress client code by converting the JavaScript string to UTF-8:

Code

Here’s the code:

//** Javascript uses utf-16. Convert to utf-8 **
var hash_key = quoteHashKey(
    citing_quote, 
    citing_url, 
    cited_url
);
hash_key = encode_utf8(hash_key); 

// *** Convert string to UTF-8 ***
function encode_utf8( s ) {
  return unescape( encodeURIComponent( s ) );
}
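For comparison, here is what I understand the `unescape( encodeURIComponent( s ) )` idiom to return, mirrored in Python: a string in which each character carries one UTF-8 byte of the input. The function name below is hypothetical and is not part of the Python side of CiteIt.

```python
def encode_utf8_like_js(s):
    """Return a string whose characters each carry one UTF-8 byte of s."""
    return s.encode('utf-8').decode('latin-1')


hash_input = encode_utf8_like_js('\u00fc')   # 'ü'

# Each character code equals the corresponding UTF-8 byte value.
print([hex(ord(c)) for c in hash_input])
```

A hash function that walks the string one character code at a time then sees the same values as a Python function that walks the UTF-8 bytes.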

Example:

Here’s an example of a special character that now displays after the fix: ü

Two recently published books—one by Ian Milligan (2019) and one edited by Niels Brügger and Ralph Schroeder (2017)—provide essential guides to help answer the question of what web archives are by describing concrete, nonhypothetical examples of how social science and humanities researchers are using web archives today. For those who have participated in web archiving activity and pondered how the records would get used, and for those who are looking to get involved in web archiving but are not sure what it takes, these two books are essential reading.

Webservice Result

Here’s the JSON file that results from calling the webservice.
