Joe Germuska responded to my question:
How many of the sources journalists quote are not on the web anywhere to be linked?
— Joe Germuska (@JoeGermuska) August 3, 2021
Non-web and non-HTML sources are a challenge for CiteIt, but there are several ways of dealing with them.
Granted, it may not always be possible or cost-effective to digitize and publish context for every source. Sometimes it is more important to be timely than transparent. If a story is breaking, additional information can be provided later. A story may also be worth publishing in basic form but not worth the additional investment of contextualizing its quoted sources.
But if the quote is worth contextualizing, here are some of the options:
Disclaimer: I’ve worked as a newspaper carrier but never as a writer or editor, so my thoughts about publishing are uninformed by experience. That said, here are my thoughts on the options:
1) Minimal Quoting
The quickest, easiest, and least transparent approach to quoting from a non-web source, such as an interview you’ve done or a town council meeting you’ve recorded, is to transcribe only the 500 characters before and after the quote and upload this text as a mini web document. You can then quote from the document and CiteIt will be able to pull the immediate context. Not ideal, but better than nothing.
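As a rough illustration of what "500 characters before and after" means, here is a minimal Python sketch. The function name `extract_context` and its return format are made up for this example and are not CiteIt's actual code; it also assumes the quote appears verbatim in the transcribed text.

```python
from typing import Optional


def extract_context(document: str, quote: str, window: int = 500) -> Optional[dict]:
    """Return up to `window` characters on each side of `quote`.

    Hypothetical helper for illustration only; assumes an exact match.
    """
    start = document.find(quote)
    if start == -1:
        # The quote does not appear verbatim in the document.
        return None
    end = start + len(quote)
    return {
        # max(0, ...) keeps the slice from wrapping around when the
        # quote sits closer than `window` characters to the beginning.
        "before": document[max(0, start - window):start],
        "quote": quote,
        "after": document[end:end + window],
    }
```

If the quote sits near the start or end of the mini document, the function simply returns whatever text is available on that side.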
2) Paper -> PDF -> text
If the source is on paper (and typed), you can scan it and upload it. I’ve worked on adding PDF support (code link) to the web service, but it’s not mature, and the OCR step requires more server resources than I can currently provide at scale. In the past, I digitized a copy of War & Peace and it took 16 seconds per page on my average-powered Mac.
It should be possible to add PDF support to the CiteIt webservice if there is enough interest. (Solid PDF digitization is on the long-term timeline.)
3) Handwritten Sources
I don’t know how well OCR handles handwriting. In these cases, the author may have to fall back on the “minimal quoting” method and link to a scan.
4) Citing YouTube Video Transcripts
Many (but not all) YouTube videos have computer-generated transcripts. The CiteIt webservice currently supports pulling the surrounding context from YouTube videos that have a transcript. (Here’s a sample Malcolm Gladwell YouTube quote that the webservice automatically downloads a transcript for.) The webservice matches the author’s quote against the transcript and retrieves the surrounding text.
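To give a feel for the matching step, here is a small Python sketch. It is not the webservice's actual algorithm; it assumes the transcript differs from the author's quote only in capitalization and punctuation, and the function names `to_words` and `find_quote_in_transcript` are invented for this example.

```python
import re
from typing import List, Optional, Tuple


def to_words(text: str) -> List[str]:
    """Lowercase the text and keep only word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_quote_in_transcript(
    transcript: str, quote: str, context_words: int = 50
) -> Optional[Tuple[str, str, str]]:
    """Locate `quote` in `transcript`, ignoring case and punctuation.

    Returns (before, matched, after) as space-joined words, or None.
    Hypothetical sketch: exact word-sequence match, no fuzzy matching.
    """
    t_words = to_words(transcript)
    q_words = to_words(quote)
    if not q_words:
        return None
    n = len(q_words)
    # Slide a window over the transcript words and compare sequences.
    for i in range(len(t_words) - n + 1):
        if t_words[i:i + n] == q_words:
            before = t_words[max(0, i - context_words):i]
            after = t_words[i + n:i + n + context_words]
            return " ".join(before), " ".join(q_words), " ".join(after)
    return None
```

A real transcript will also contain mis-heard words, so a production version would probably need some tolerance for near matches; this sketch only handles the simple case.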
5) Self-hosted Audio and Video
Using YouTube to transcribe audio and video is better than not publishing a source at all, but I assume newspapers will want to upload sources to their own site rather than YouTube.
If the newspaper’s media player supports linking to a timestamp, authors can link directly to the beginning of the quote and let the “reader” choose which parts of the audio they want to listen to.
It would be quite acceptable for the author to transcribe the minimal 500 characters before and after the quote and then let the “reader” examine the remainder of the untranscribed recording.
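What a timestamp link looks like depends entirely on the media player. As one possibility, the sketch below assumes the player honors the W3C Media Fragments convention of appending `#t=<seconds>` to the media URL; the function name `timestamp_link` and the example URL are hypothetical.

```python
def timestamp_link(media_url: str, start_seconds: int) -> str:
    """Build a link that starts playback at `start_seconds`.

    Assumes the player understands the `#t=` media fragment;
    other players may use a query parameter instead.
    """
    if start_seconds < 0:
        raise ValueError("start_seconds must not be negative")
    # Drop any existing fragment so we don't end up with two '#' parts.
    base = media_url.split("#", 1)[0]
    return f"{base}#t={start_seconds}"


# Example: link to 12 minutes 34 seconds into a recording.
link = timestamp_link("https://example.com/audio/council-meeting.mp3", 12 * 60 + 34)
```

The author would then publish this link next to the transcribed excerpt, so the quote's context can be checked against the recording itself.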
6) What Types of Quotes Are Most Likely to Be Contextualized First
If CiteIt is used, I imagine it will be featured first and most commonly on the articles that are least time-sensitive. This means that CiteIt would likely not be used on breaking news but would more likely be used on analysis pieces. Posting the source to YouTube could expedite the transcription of audio and video, but in the long term, software and workflows would have to be developed to streamline the posting of documents and other media on the newspaper’s site.
I’d be interested in hearing more about this issue from others with experience in the industry.