Incorporating Google Link Text Fragments into CiteIt
Several weeks ago, a colleague of mine sent me a link to an Ars Technica article about a project which complements CiteIt.net by allowing writers to link to a specific portion of an Html document and, if necessary, automatically scrolling down the page and highlighting the selected text.
This feature is already built into Google Chrome and is working its way through the standards process. There is even a Chrome plugin (code) which allows users to select a section of text and create a link to it.
I think this would be a great addition to CiteIt, providing authors a way to quickly direct their readers to their quote.
The question is how to incorporate the feature, which is written in javascript, with the web service, which is written in python.
- One option would be to set up a javascript (node?) web service which would take as input:
- the quoted text
- the URL of the cited document
- Another option would be to re-write the javascript functionality in Python
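For the second option, the simplest case may not need much porting: a text fragment link is the cited URL followed by a `#:~:text=` directive containing the percent-encoded quote. Here is a minimal Python sketch, assuming only exact-match text (no prefix, suffix, or start/end ranges). The function names are hypothetical and not part of CiteIt or the Chrome plugin.

```python
from urllib.parse import quote


def encode_fragment_text(text):
    """Percent-encode text for use inside a #:~:text= directive."""
    # quote() with safe="" already encodes ',' and '&'. It leaves '-' alone,
    # so '-' is encoded by hand because the directive syntax gives it meaning.
    return quote(text, safe="").replace("-", "%2D")


def text_fragment_url(cited_url, quoted_text):
    """Return cited_url with a text fragment pointing at quoted_text."""
    base = cited_url.split("#", 1)[0]  # drop any existing fragment
    return f"{base}#:~:text={encode_fragment_text(quoted_text)}"


print(text_fragment_url("https://example.com/page", "Well-behaved women"))
# https://example.com/page#:~:text=Well%2Dbehaved%20women
```

Long quotes would probably need the start/end form of the directive, which this sketch does not attempt.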
This feature is not my highest priority right now, so I would be happy to see a volunteer pursue it.
Github projects created under MIT license
Hello Dr. Giles,
I’ve developed an app for creating contextual citations for the web and, given your work with CiteSeer, I’m interested in gauging whether there would be interest within academia for my project.
What are Contextual Citations?
Contextual citations, as I’ve implemented them, allow authors to demonstrate the context of their citations by expanding the 500 characters of the context surrounding a quote. (See Demo below)
Writers can create contextual citations by marking up their citations using a “cite” attribute and calling a web service that looks up their citation and extracts the surrounding context into a JSON file.
Nothing is more certainly written in the book of fate than that these people are to be free.
Readers’ browsers then retrieve the JSON data containing the surrounding context and display it when the reader clicks on the blue arrows above and below a quote or, for shorter inline quotes, clicks on a styled link to view a popup.
Demo: Two examples of contextual citations:
1. Blockquote: longer quotes

The Jefferson Memorial quotes Thomas Jefferson’s autobiography:
Nothing is more certainly written in the book of fate than that these people are to be free.
But if you click the blue down arrow and read the next few lines, you can quickly see that Jefferson’s original meaning was distorted by the removal of the quote’s context.
2. Inline quotation: shorter quotes
But catching cherry-picked citations isn’t the only advantage of being able to preview the surrounding context. Often the context provides the reader with a fuller understanding of the meaning. Take the following quote about “Well-behaved Women ..”.
Harvard Professor Laurel Thatcher Ulrich’s recently popularized quip that “Well-behaved women seldom make history” appears on T-shirts, but do you know the original context? (click the above link)
Additional Examples:
To personalize things, I decided to cite one of your publications on the subject of RefSeer:
RefSeer:
Here is an example of a citation describing RefSeer:
a citation recommendation system which automatically suggests candidate citations based on input queries. RefSeer has applications for both researchers and reviewers. While authoring a paper, researchers can use our citation recommendation system to find prior works related to the problem they seek to investigate. In turn, reviewers can use RefSeer to check whether a paper cites all relevant papers
When I published this blog post, it looked up the PDF for your publication and extracted the 500 characters of surrounding text into a JSON file that my web service saved to a CDN.
I have also created a WordPress demo that you can use to test-drive what a web application that implements contextual citation is like in real practice.
(more: Test-drive WordPress Demo with video)
How Contextual Citations Work:
- When an author publishes a citation (in Html), they add a “cite” attribute to their citation tag: <blockquote cite="cited-url">quote</blockquote>
- The author or CMS makes a call to the api.citeit.net web service when the publication takes place.
- The web service receives the URL of the citing publication and retrieves a copy, looking for any blockquote or q tags that contain a cite attribute with a valid URL.
- The web service retrieves each cited document and validates whether the quote matches the source.
- If the quote matches, it extracts the 500 characters of the surrounding context and saves it to a JSON file, which is then saved to a CDN or public webserver.
- The reader’s browser uses javascript to look up this JSON file when browsing the citing document.
- If the reader clicks on the arrows or text link, the javascript renders the JSON file that it got from the web service and displays it to the user in a div above the quote or in a popup.
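The server-side steps above can be sketched roughly as follows. This is an illustration using only the Python standard library, not the actual CiteIt code: the class and function names, the JSON field names, and the reading of "500 characters" as 500 on each side of the quote are all assumptions. It also skips the download step and does not handle a blockquote nested inside another blockquote.

```python
from html.parser import HTMLParser

CONTEXT_LENGTH = 500  # assumed: characters of context kept on each side


class CitationFinder(HTMLParser):
    """Collect (cited_url, quote) pairs from blockquote and q tags."""

    def __init__(self):
        super().__init__()
        self.citations = []  # finished (cited_url, quote) pairs
        self._url = None     # cite URL of the tag currently being read
        self._tag = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        cite = dict(attrs).get("cite")
        is_url = bool(cite) and cite.startswith(("http://", "https://"))
        if tag in ("blockquote", "q") and is_url and self._url is None:
            self._url, self._tag, self._parts = cite, tag, []

    def handle_data(self, data):
        if self._url is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._url is not None and tag == self._tag:
            quote = " ".join("".join(self._parts).split())
            self.citations.append((self._url, quote))
            self._url = self._tag = None


def extract_context(source_text, quote, length=CONTEXT_LENGTH):
    """Return the text around quote, or None if the quote is not found."""
    start = source_text.find(quote)
    if start == -1:
        return None  # quote does not match the source
    end = start + len(quote)
    return {
        "quote": quote,
        "context_before": source_text[max(0, start - length):start],
        "context_after": source_text[end:end + length],
    }
```

A caller would feed the citing page to `CitationFinder`, fetch each cited URL, and pass the cited document's text to `extract_context` before serializing the result with `json.dumps`.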
(more: Developer Information with video)
Questions for Dr. Giles:
- Do you think a free open-source contextual citation system would be useful to academics?
- Many academic papers are published in PDF. Do you think a contextual citation system needs to be able to display an expanding/popup user interface from within PDF to be useful, or would the ability to cite PDFs from within Html be enough?
- How significant a barrier do you think access restrictions are to the implementation of contextual citations in academia?
- Do you have any colleagues who might be interested in learning more about CiteIt.net’s Contextual Citation?

James Burke (Historian) “Connections” Television Program
Citing YouTube transcripts:
Although I haven’t heard of academic publications citing video transcripts, I have also been working on citation of audio and video. As an example, consider the following quote from James Burke’s “Connections” series about a “crafty way” the Library of Alexandria assembled their collection:
Now they got these scrolls either because the local scholars wrote them or because they had a rather crafty law you see if you came to Alexandria on a boat and you owned a book you had to lend it to the library to be copied and sometimes the copies were so good the owners went off with the fakes and the library kept the original.
About me:
I was born in Winnipeg, Manitoba, but grew up in Lancaster County, Pennsylvania. I majored in history at the University of Waterloo in Ontario, Canada. I took some electives in computer science while I was there. After I graduated in 2001, I got a job as a computer programmer and I’ve worked as a programmer ever since.
The CiteIt.net app arose out of an interest in writing a review of hypertext pioneer Ted Nelson that was more in the spirit of his original vision for hypertext. I’ve been working on this citation project since 2015 as a hobby.
Long-term Goals:
If everything works according to plan, I’d like to work on this full-time for an organization like Substack or the Internet Archive.
If you have any suggestions for how I should proceed, send me an email.
CiteIt.net Releases Version 0.4 of Webservice
In a previous blog post, I asked “What Bugs Would Bill Gates Find in my Spec?“
If Gates were looking for problems, the first thing he might do is look for Unicode bugs that prevent a computer from matching a citing quote with a source quote.
So this weekend I put together a new version of the webservice that I hope reduces the number of missed quotes.
The key change involves the hashing mechanism, which previously excluded characters on an ad hoc basis.
The new system systematically screens all incoming quotes and URLs against a list of Unicode code points.
The implementation of this is found in the client javascript and server python code.
# Remove the following Unicode code points from Hash
TEXT_ESCAPE_CODE_POINTS = set([
    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 39,
    96, 160, 173, 699, 700, 701, 702, 703, 712, 713, 714, 715, 716,
    717, 718, 719, 732, 733, 750, 757, 8211, 8212, 8213, 8216, 8217,
    8219, 8220, 8221, 8226, 8203, 8204, 8205, 65279, 8232, 8233, 133,
    5760, 6158, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200,
    8201, 8202, 8239, 8287, 8288, 12288,
])


def escape_text(text):
    """Remove characters from the string"""
    return ''.join(
        char for char in text
        if ord(char) not in settings.TEXT_ESCAPE_CODE_POINTS
    )
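To illustrate why this screening helps, here is a small self-contained sketch. It uses a hand-picked subset of the code points listed above rather than the full set, and the helper name is hypothetical. It shows the same phrase, typed with a straight apostrophe and pasted with a curly apostrophe and a non-breaking space, reducing to an identical string before hashing.

```python
# Illustrative subset of the code points listed above (not the full set):
# space, apostrophe, non-breaking space, and curly single quotes.
SAMPLE_CODE_POINTS = {32, 39, 160, 8216, 8217}


def escape_text_sample(text, code_points=SAMPLE_CODE_POINTS):
    """Drop every character whose code point is in code_points."""
    return "".join(ch for ch in text if ord(ch) not in code_points)


typed = "Jefferson's words"              # apostrophe (39) and space (32)
pasted = "Jefferson\u2019s\u00a0words"   # curly quote (8217), no-break space (160)

print(escape_text_sample(typed))                                # Jeffersonswords
print(escape_text_sample(typed) == escape_text_sample(pasted))  # True
```

Because both variants collapse to the same string, the client and the server would compute the same hash even when the quote was typed one way and the source was encoded another.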
CiteIt API Setup for first time!
This morning I got the CiteIt API working for the first time!
Ever since my brother helped me create a new version of the code that was designed to work with Amazon Lambda, I’ve felt like I needed to clean up my Git repository to clear out my old code.
At the Central PA Open Source Conference in September, a participant encouraged me to use Docker instead of Lambda for tasks like web crawling, because with Lambda you end up paying for all the time you spend waiting for a response.
So I moved all the latest code out of Lambda and into a fresh Docker instance.
The server is hosted at DigitalOcean and mapped to the subdomain api.citeit.net.
New Version Fixes Some Bugs
I released an updated version of the WordPress plugin tonight. This version handles the filename hashing a little bit differently: it escapes a list of characters, such as apostrophes and quotation marks, that can be problematic.
The basic functionality is now working pretty well, but the performance needs to be improved for pages that include multiple citations. The web service also has a memory leak that needs to be patched before it is really ready for prime time.
Progress Report: Hash Algorithm Issues
Here’s a progress report on where things stand.
I’ve been working on the webservice code to try to fix some problems with how the quote hash values are created. Right now the hash is computed in two places – by the client, using javascript, and by the python web service.
Most of the time both sets of code produce the same result, but there are some cases when they do not, which prevents the client from finding the generated json file.
Before I go too much further, I’d also like to switch the hash algorithm from sha1 to something like sha256. This should be a fairly simple switch because I had planned for the hash algorithm to be swappable, without breaking backwards compatibility.
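One way a swappable hash might look is sketched below, using `hashlib.new` so the algorithm is selected by name. The key format (quote and URL joined with a `|`) and both function names are assumptions for illustration only, not the format the web service actually uses. Putting the algorithm name in the file path is what would keep existing sha1 files reachable after a switch.

```python
import hashlib


def quote_hash(quote, cited_url, algorithm="sha1"):
    """Hash a quote and its cited URL with the named algorithm."""
    key = f"{quote}|{cited_url}"  # assumed key format, for illustration only
    return hashlib.new(algorithm, key.encode("utf-8")).hexdigest()


def json_path(quote, cited_url, algorithm="sha1"):
    """Build a relative path that records which algorithm produced the hash."""
    return f"{algorithm}/{quote_hash(quote, cited_url, algorithm)}.json"


old_path = json_path("these people are to be free", "https://example.com/source")
new_path = json_path("these people are to be free", "https://example.com/source",
                     algorithm="sha256")
print(old_path.split("/")[0], new_path.split("/")[0])  # sha1 sha256
```

Since the client and the server both have to agree on the algorithm, the same name would need to be passed to the javascript side as well.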
First Feature Requests and Bug Reports
I got my first user feedback on the WordPress plugin from my friend Daniel Miller. In the interest of documenting it, here is Daniel’s feedback:
Feature: Add neotext to the Preview/Publish system so that the user doesn’t have to visit api.CiteIt.net and submit a new url. I think this is possible; I just have to research the WordPress API. One optimization I’d also like to do is detect whether a “cite” attribute has been used so the webservice doesn’t run unnecessarily.
Bug: Fix the extra non-html code that gets included in the text:
The expandable before-context of the quotation I pulled from neotext.net starts with “.. /analytics.js’,’ga’); ga(‘create’, ‘UA-65403609-1’, ‘auto’); ga(‘send’, ‘pageview’);
Feature: enable the editing (and possible formatting) of the quote
Question: Say I neotextify a page and then later I decide I don’t want the expanding quote contexts in that article. Is there an easy non-technical way I can turn off the neotext plugin for just one page in WordPress?
Answer: You can remove the neotext by removing the “cite” attribute from the html. Perhaps there could be a way to do this with the GUI.
Bug: The headings in the context for my quote are run together with no space between the surrounding text and the heading.
Possible solution: This appears to be the result of line wraps that don’t contain a space at the end of them. This will require a different way of generating the text version of the html.
Feature: Is there supposed to be a link to the original document in the top expandable section? Or maybe at the bottom of the bottom expandable section? Seems like that might be a nice feature, although maybe some people wouldn’t want that, so maybe simpler is better? (thinking out loud here)
Feature: Some of the javascript files included by the plugin are not minified. Would be nice if they were.
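The two text-extraction bugs reported above (analytics script code leaking into the context, and headings running into the surrounding text) could both be approached with a different html-to-text step. The following is one possible sketch using Python's standard `html.parser`, not the plugin's current method; the tag lists and names are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}  # content that should never reach the text
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, adding a space at block-level boundaries."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace left over from the added spaces.
        return " ".join("".join(self._parts).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<h2>Heading</h2><p>First line</p>"
                   "<script>ga('send', 'pageview');</script>"))
# Heading First line
```

Any change to how the text is generated would have to be mirrored wherever the quote is matched against the source, otherwise existing quotes could stop matching.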
More Feedback:
If you have further suggestions, send them my way. My gmail account is timlangeman. If I like them, I’ll have you add them to the Github issues pages.
Django backend now uploads to Amazon S3
I haven’t blogged about my recent work on neotext, so here’s another post to get you up to speed:
I modified the django backend so that, in addition to saving a copy of the JSON file locally, it also uploads a copy to Amazon S3.
I then modified the jQuery function to look for neotext JSON files on:
- http://read.neotext.net/quote/sha1/0.01/
This subdomain is currently served by Amazon S3.
Having all read requests served by Amazon S3 makes the service much more scalable.