CiteIt.net Releases Version 0.4 of Webservice

In a previous blog post, I asked “What Bugs Would Bill Gates Find in my Spec?”

If Gates were looking for problems, the first thing he might do is look for Unicode bugs that prevent a computer from matching a citing quote with a source quote.

So this weekend I put together a new version of the webservice that I hope reduces the number of missed quotes.

The key change involves the hashing mechanism, which previously excluded characters on an ad hoc basis.

The new system systematically screens all incoming quotes and URLs against a list of Unicode code points, removing any character on the list before hashing.

The implementation is found in the client JavaScript and server Python code.

# Remove the following Unicode code points from Hash
TEXT_ESCAPE_CODE_POINTS = set([
    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
    , 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 39
    , 96, 160, 173, 699, 700, 701, 702, 703, 712, 713, 714, 715, 716
    , 717, 718, 719, 732, 733, 750, 757, 8211, 8212, 8213, 8216, 8217
    , 8219, 8220, 8221, 8226, 8203, 8204, 8205, 65279, 8232, 8233, 133
    , 5760, 6158, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200
    , 8201, 8202, 8239, 8287, 8288, 12288
])

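def escape_text(text):
    """Return text with every character in TEXT_ESCAPE_CODE_POINTS removed."""
    # settings is the module that defines TEXT_ESCAPE_CODE_POINTS (listed above).
    # join() avoids rebuilding the string once per character.
    return ''.join(
        char for char in text
        if ord(char) not in settings.TEXT_ESCAPE_CODE_POINTS
    )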
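
To see why this helps with matching, here is a minimal sketch of two quotes that differ only in typography ending up with the same hash. It is an illustration under assumptions, not the webservice's actual code: the `quote_hash` helper, the `|` separator, the choice of SHA-256, and the example URL are all hypothetical, and `ESCAPE_CODE_POINTS` is an abbreviated subset of the list above.

```python
import hashlib

# Abbreviated subset of the code points listed above (illustration only).
ESCAPE_CODE_POINTS = {
    9, 10, 13, 32,   # tab, line feed, carriage return, space
    39, 96,          # apostrophe, grave accent
    160, 173,        # no-break space, soft hyphen
    8203,            # zero-width space
    8216, 8217,      # curly single quotes
    8220, 8221,      # curly double quotes
}


def escape_text(text):
    """Drop every character whose code point is in ESCAPE_CODE_POINTS."""
    return ''.join(ch for ch in text if ord(ch) not in ESCAPE_CODE_POINTS)


def quote_hash(quote, url):
    """Hypothetical helper: hash the escaped quote and URL together.

    The '|' separator and SHA-256 are assumptions made for this sketch.
    """
    key = escape_text(quote) + '|' + escape_text(url)
    return hashlib.sha256(key.encode('utf-8')).hexdigest()


url = 'https://example.com/article'               # placeholder URL
source_quote = 'It\u2019s a small\u00a0world'     # curly apostrophe, no-break space
citing_quote = "It's a small world"               # straight apostrophe, plain spaces

print(escape_text(source_quote))
print(quote_hash(source_quote, url) == quote_hash(citing_quote, url))
```

Both quotes reduce to the same escaped string, so their hashes match even though the raw text differs. Characters that are not on the list are left alone, so two quotes that differ in one of those will still produce different hashes.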