Incorporating Google Link Text Fragments into CiteIt
Several weeks ago, a colleague sent me a link to an Ars Technica article about a project that complements CiteIt.net by allowing writers to link to a specific portion of an HTML document and, if necessary, automatically scroll down the page and highlight the selected text.
This feature is already built into Google Chrome and is working its way through the standards process. There is even a Chrome plugin (code) which lets users select a section of text and create a link to it.
I think this would be a great addition to CiteIt, providing authors a way to quickly direct their readers to their quote.
The question is how to incorporate the feature, which is written in javascript, with the web service, which is written in python.
- One option would be to set up a javascript (node?) web service which would take as input:
  - the quoted text
  - the URL of the cited document
- Another option would be to re-write the javascript functionality in Python
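For the second option, the simplest case may not need much code. Here is a minimal sketch, assuming the link only needs a single `text=` directive built from the quoted text and the URL. The function name `build_text_fragment_url` is hypothetical, and the sketch drops any existing `#fragment` on the URL instead of preserving it. It also does not check that the quote is unique on the page, which is the harder part of what the javascript does.

```python
from urllib.parse import quote, urldefrag


def build_text_fragment_url(url, quoted_text):
    """Return url with a #:~:text= directive that targets quoted_text."""
    # Simplification: discard any fragment already present on the URL.
    base, _old_fragment = urldefrag(url)
    # Percent-encode everything. quote() never encodes "-", but the text
    # directive syntax uses "-" to mark prefix/suffix context, so encode
    # it by hand.
    encoded = quote(quoted_text, safe="").replace("-", "%2D")
    return f"{base}#:~:text={encoded}"


print(build_text_fragment_url("https://example.com/page", "hello world"))
```

A browser that supports text fragments should scroll to and highlight the first match of the quoted text when it opens the resulting link.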
This feature is not my highest priority right now, so I would be happy to see a volunteer pursue it.
CiteIt.net Releases Version 0.4 of Webservice
In a previous blog post, I asked “What Bugs Would Bill Gates Find in my Spec?”
If Gates were looking for problems, the first thing he might do is look for Unicode bugs that prevent a computer from matching a citing quote with a source quote.
So this weekend, I put together a new version of the webservice that I hope reduces the number of missed quotes.
The key change involves the hashing mechanism, which previously excluded characters on an ad hoc basis.
The new system systematically screens all incoming quotes and URLs against a list of Unicode code points and strips any character on that list before hashing.
The implementation of this is found in the client javascript and server python code.
# Remove the following Unicode code points from the hash
TEXT_ESCAPE_CODE_POINTS = set([
    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 39,
    96, 160, 173, 699, 700, 701, 702, 703, 712, 713, 714, 715, 716,
    717, 718, 719, 732, 733, 750, 757, 8211, 8212, 8213, 8216, 8217,
    8219, 8220, 8221, 8226, 8203, 8204, 8205, 65279, 8232, 8233, 133,
    5760, 6158, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200,
    8201, 8202, 8239, 8287, 8288, 12288,
])


# `settings` is the project module in which the set above is defined.
def escape_text(text):
    """Return text with every code point in TEXT_ESCAPE_CODE_POINTS removed."""
    return ''.join(
        char for char in text
        if ord(char) not in settings.TEXT_ESCAPE_CODE_POINTS
    )
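To show why the screening matters, here is a small self-contained sketch. It uses only a handful of the code points listed above, and it assumes the hash is a SHA-1 of the escaped quote text alone; the real service may combine the quote with the URLs in some other way. The names `SAMPLE_ESCAPE_CODE_POINTS`, `escape_sample`, and `quote_hash` are hypothetical.

```python
import hashlib

# Illustrative subset of the full list: space, straight apostrophe,
# no-break space, and curly single and double quotes.
SAMPLE_ESCAPE_CODE_POINTS = {32, 39, 160, 8216, 8217, 8220, 8221}


def escape_sample(text):
    """Drop every character whose code point is in the sample set."""
    return "".join(c for c in text if ord(c) not in SAMPLE_ESCAPE_CODE_POINTS)


def quote_hash(text):
    """Assumed hashing step: SHA-1 over the escaped, UTF-8 encoded text."""
    return hashlib.sha1(escape_sample(text).encode("utf-8")).hexdigest()


citing = "Don\u2019t\u00a0panic"  # curly apostrophe, no-break space
source = "Don't panic"            # straight apostrophe, plain space

# The raw strings differ, but they escape and hash to the same value.
assert citing != source
assert escape_sample(citing) == escape_sample(source) == "Dontpanic"
assert quote_hash(citing) == quote_hash(source)
```

Without the escaping step, the two versions of the quote would produce different hashes and the citing page would miss its source.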
CiteIt API Setup for first time!
This morning I got the CiteIt API working for the first time!
Ever since my brother helped me create a new version of the code that was designed to work with Amazon Lambda, I’ve felt like I needed to clean up my Git repository to clear out my old code.
At the Central PA Open Source Conference in September, a participant encouraged me to use Docker instead of Lambda for tasks like web crawling, because with Lambda you end up paying for all the time you spend waiting for a response.
So I moved all the latest code out of Lambda and into a fresh Docker instance.
The server is hosted at Digital Ocean and mapped to the subdomain api.citeit.net.
New Version Fixes Some Bugs
I released an updated version of the WordPress plugin tonight. This version handles the filename hashing a little bit differently: it escapes a list of characters, such as apostrophes and quotation marks, that can be problematic.
The basic functionality is now working pretty well, but the performance needs to be improved for pages that include multiple citations on a single page. The web service also has a memory leak that needs to be patched before it is really ready for prime time.
Progress Report: Hash Algorithm Issues
Here’s a progress report on where things stand.
I’ve been working on the webservice code to try to fix some problems with how the quote hash values are created. Right now the hash is computed in two places – by the client, using javascript, and by the python web service.
Most of the time both sets of code produce the same result, but there are some cases when they do not, which prevents the client from finding the generated json file.
Before I go too much further, I’d also like to switch the hash algorithm from sha1 to something like sha256. This should be a fairly simple switch because I had planned for the hash algorithm to be swappable, without breaking backwards compatibility.
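One way a swappable hash could look in the python web service is sketched below, assuming the algorithm is chosen by name through the standard `hashlib.new` call. The function names and the file layout in `hash_path` are hypothetical, not the service's actual code.

```python
import hashlib


def hash_quote(escaped_text, algorithm="sha1"):
    """Hash already-escaped text with a named hashlib algorithm."""
    digest = hashlib.new(algorithm)
    digest.update(escaped_text.encode("utf-8"))
    return digest.hexdigest()


def hash_path(escaped_text, algorithm="sha1"):
    # Hypothetical layout: keeping the algorithm name in the path lets
    # files hashed with the old algorithm stay reachable after a switch.
    return f"{algorithm}/{hash_quote(escaped_text, algorithm)}.json"


print(hash_path("Dontpanic"))
print(hash_path("Dontpanic", algorithm="sha256"))
```

The javascript client would need the matching change, since both sides have to agree on the algorithm to arrive at the same filename.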
First Feature Requests and Bug Reports
I got my first user feedback on the WordPress plugin from my friend Daniel Miller. To keep a record of it, here is what Daniel sent, along with my notes:
Feature: Add neotext to the Preview/Publish system so that the user doesn’t have to visit api.CiteIt.net and submit a new url. I think this is possible, I just have to research the WordPress API. One optimization I’d also like to do is detect whether a “cite” attribute has been used so the webservice doesn’t run unnecessarily.
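The “cite” check could be sketched like this in Python, using the standard library's `html.parser`. This is only an illustration of the idea: the plugin itself would have to do the equivalent inside WordPress, and the names `CiteFinder` and `find_cited_urls` are hypothetical.

```python
from html.parser import HTMLParser


class CiteFinder(HTMLParser):
    """Collect cite attribute values from <blockquote> and <q> tags."""

    def __init__(self):
        super().__init__()
        self.cited_urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("blockquote", "q"):
            cite = dict(attrs).get("cite")
            if cite:  # skip missing or empty cite attributes
                self.cited_urls.append(cite)


def find_cited_urls(post_html):
    """Return the cited URLs in a post; an empty list means nothing to do."""
    finder = CiteFinder()
    finder.feed(post_html)
    return finder.cited_urls


post = '<p>Intro</p><blockquote cite="https://example.com/a">A quote</blockquote>'
print(find_cited_urls(post))
```

If the list comes back empty, the webservice call can be skipped for that post.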
Bug: Fix the extra non html code that gets included in the text:
The expandable before-context of the quotation I pulled from neotext.net starts with “.. /analytics.js’,’ga’); ga(‘create’, ‘UA-65403609-1’, ‘auto’); ga(‘send’, ‘pageview’);
Feature: enable the editing (and possible formatting) of the quote
Question: Say I neotextify a page and then later I decide I don’t want the expanding quote contexts in that article. Is there an easy non-technical way I can turn off the neotext plugin for just one page in WordPress?
Answer: You can remove the neotext by removing the “cite” attribute from the html. Perhaps there could be a way to do this with the GUI.
Bug: The headings in the context for my quote are run together with no space between the surrounding text and the heading.
Possible solution: This appears to be the result of line wraps that don’t contain a space at the end of them. This will require a different way of generating the text-version of the html
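A rough sketch of a different text conversion, assuming the standard library's `html.parser` is acceptable: it inserts a space around block-level tags so headings do not run into the surrounding text, and it skips the contents of script and style tags, which would also address the stray analytics code in the earlier bug. The tag lists and the name `html_to_text` are assumptions, not the current implementation.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Gather visible text, separating block-level elements with spaces."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered trailing text
    # Collapse runs of whitespace, including line wraps, to single spaces.
    return " ".join("".join(extractor.parts).split())


print(html_to_text("<h2>Heading</h2><p>Body text.</p>"))
```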
Feature: Is there supposed to be a link to the original document in the top expandable section? Or maybe at the bottom of the bottom expandable section? Seems like that might be a nice feature, although maybe some people wouldn’t want that, so maybe simpler is better? (thinking out loud here)
Feature: Some of the javascript files included by the plugin are not minified. Would be nice if they were.
More Feedback:
If you have further suggestions, send them my way. My gmail account is timlangeman. If I like them, I’ll have you add them to the GitHub issues pages.
Django backend now uploads to Amazon S3
I haven’t blogged about my recent work on neotext, so here’s another post to get you up to speed:
I modified the django backend so that, in addition to saving a copy of the json file locally, it also uploads a copy to Amazon S3.
I then modified the jQuery function to look for neotext json files on:
- http://read.neotext.net/quote/sha1/0.01/
This subdomain is currently served by Amazon S3.
Having all read requests served by Amazon S3 makes the service much more scalable.