In a previous blog post, I asked, “What Bugs Would Bill Gates Find in my Spec?”
If Gates were looking for problems, the first thing he might do is look for Unicode bugs that prevent a computer from matching a citing quote with a source quote.
So this weekend I put together a new version of the web service that I hope reduces the number of missed quotes.
The key change involves the hashing mechanism, which previously excluded characters on an ad hoc basis.
The new system screens every incoming quote and URL against a fixed list of Unicode code points and removes any character on that list before hashing.
The same screening is implemented in both the client-side JavaScript and the server-side Python code.
# Remove the following Unicode code points before hashing
TEXT_ESCAPE_CODE_POINTS = set([
    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 39,
    96, 160, 173, 699, 700, 701, 702, 703, 712, 713, 714, 715, 716,
    717, 718, 719, 732, 733, 750, 757, 8211, 8212, 8213, 8216, 8217,
    8219, 8220, 8221, 8226, 8203, 8204, 8205, 65279, 8232, 8233, 133,
    5760, 6158, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200,
    8201, 8202, 8239, 8287, 8288, 12288,
])
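def escape_text(text):
    """Return text with every character in TEXT_ESCAPE_CODE_POINTS removed."""
    # Assumes text is a Unicode string, so ord() yields a code point.
    # The parameter is named text so it does not shadow the built-in str.
    return ''.join(
        char for char in text
        if ord(char) not in settings.TEXT_ESCAPE_CODE_POINTS
    )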
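
To see why stripping these characters helps matching, here is a rough, self-contained sketch rather than the actual service code. It uses only a small subset of the list above, and the names SAMPLE_ESCAPE_CODE_POINTS, escape_text_sample, and quote_hash are hypothetical. SHA-256 is an assumption made for illustration, not necessarily the hash the service uses.

```python
import hashlib

# Illustrative subset of the full list: space, apostrophe, non-breaking
# space, en dash, em dash, and curly single quotes.
SAMPLE_ESCAPE_CODE_POINTS = {32, 39, 160, 8211, 8212, 8216, 8217}


def escape_text_sample(text, code_points=SAMPLE_ESCAPE_CODE_POINTS):
    """Drop every character whose code point is in code_points."""
    return ''.join(ch for ch in text if ord(ch) not in code_points)


def quote_hash(text):
    """Hash the escaped text. SHA-256 is an assumption for this sketch."""
    escaped = escape_text_sample(text)
    return hashlib.sha256(escaped.encode('utf-8')).hexdigest()


# The citing quote has a curly apostrophe and an em dash; the source
# has a straight apostrophe, a non-breaking space, and a spaced en dash.
citing = "It\u2019s one small step\u2014really"
source = "It's one\u00a0small step \u2013 really"

print(escape_text_sample(citing))
print(quote_hash(citing) == quote_hash(source))
```

Both strings reduce to the same sequence of letters, so their hashes agree even though the raw text differs in punctuation and whitespace.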