Building trust in media

Wikipedia Examples

Mark asked me to put together a more complete collection of Wikipedia example articles.

These are a few of the articles I’ve marked up, with the goal of automating the process of converting existing quotes to Contextual Citations:

Ruth Bader Ginsburg: Example Wikipedia article


  1. Ruth Bader Ginsburg
  2. Hillary Clinton **
  3. Donald Trump **


  1. Pride and Prejudice
  2. Hamlet


  1. Inauguration of John F. Kennedy
  2. 2000 MI6 attack


  1. Manned Orbiting Laboratory
  2. Syphilis


** These articles have been completely marked up, including books that are not available online.


Converting to Contextual Citations

Here’s how I imagine CiteIt’s Contextual Citations could be integrated with Wikipedia:

Phase 1: Manual Editing

In Phase 1, citations from sample articles would be marked up manually. This trial run would most likely start with lower-profile pages. If it were successful, Wikipedia editors could be given the ability to mark up citations through the editor, and citations could be indexed either manually or automatically upon publication. Since most people would not know that the Wikipedia editor could create contextual citations, this phase would also be fairly low-profile.

Improving the accuracy of the returned CiteIt Context is a precondition for greater adoption.

  1. The context returned by the web service needs to be more accurate (I’ve cataloged some of the bugs below).
  2. The number of “misses” where the web service incorrectly fails to find the context needs to be reduced.
  3. The speed and scalability of the web service need to be improved.
    • Right now it takes hours to process the Donald Trump article. The Donald Trump and Hillary Clinton articles were chosen as sample articles because they have a high number of web citations.

Phase 2: Automatic Conversion

My friend Bryan said that it should be possible to go back and automatically convert all of Wikipedia’s quotes to use CiteIt’s Contextual Citations. This assumes the Pareto principle: roughly 80% of the citations could be converted by a script following some straightforward rules, and the remaining citations would have to be processed manually.

With that in mind, I set out over the past month to do a fairly thorough analysis of the types of issues that we could encounter if we chose to automate the conversion. As part of the process, I would hope to create a database of citations to be “upgraded” and an interface for reporting and fixing bugs.

From this database, we could both:

  • Analyze the accuracy of the program that converts quotes to Contextual Citations
  • Capture the human corrections to the program’s output
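As a rough sketch of what one record in such a database might hold, here is a minimal Python example. The field names and the two helper functions are hypothetical, chosen only for illustration; they are not CiteIt’s actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CitationRecord:
    """One quote from an article, plus the outcome of the automated conversion.

    All field names here are assumptions made for this sketch.
    """
    article: str                      # e.g. "Hillary Clinton"
    quote: str                        # quoted text as it appears in the article
    source_url: str                   # URL from the footnote (may be empty)
    error_classes: List[str] = field(default_factory=list)  # e.g. ["citeit-later-match"]
    human_correction: Optional[str] = None  # text supplied by an editor, if any


def accuracy(records):
    """Fraction of records that were converted with no error class attached."""
    if not records:
        return 0.0
    clean = sum(1 for record in records if not record.error_classes)
    return clean / len(records)


def error_counts(records):
    """Count how often each error class was assigned across all records."""
    return Counter(cls for record in records for cls in record.error_classes)
```

Keeping the human correction next to the program’s output would make it possible to compare the two later.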

Machine Learning Option

If someone wanted to automate this process further, it might be a good candidate for machine learning, but the Wikipedia community’s philosophy towards automation and error-handling would determine how the technology is developed.


Analyzing Errors:

I did a more thorough job of analyzing the Hillary Clinton and Donald Trump articles, going through every quotation in the article (excluding the references at the end of the article) and marking each quote up with a q-tag and CSS classes, indicating the reason why the citation couldn’t be properly matched.
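Once quotes are marked up this way, the classes can be tallied with a short script. The sketch below uses only Python’s standard library and assumes the article HTML uses ordinary straight quotes around attribute values; the class and function names are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class QTagClassCounter(HTMLParser):
    """Counts the CSS classes found on <q> tags in a marked-up article."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "q":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        class_value = dict(attrs).get("class") or ""
        for css_class in class_value.split():
            self.counts[css_class] += 1


def count_error_classes(html):
    """Return a Counter mapping each class on a <q> tag to its frequency."""
    parser = QTagClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Calling `count_error_classes(article_html).most_common()` would list the classes from most to least frequent, which is one way to see which problems are worth fixing first.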

You can see highlights of all the errors if you click on the blue “Show Citation Errors” button in the upper-right corner of the yellow header:

Screenshot: Show Citation Errors


Error Codes:

Below is a list of the CSS error classes created when marking up sample Wikipedia articles.

Category Class Description Example Color
Error citeit-automation-error An automated bot would likely pull an inappropriate match. Should we return multiple matches and show all results to editors or readers?


Trump favors neutral or positive balances of trade over negative balances of trade, also known as a “trade deficit”. Trump adopted his current skeptical views 


“Share of total U.S. merchandise trade deficit by country”

Explanation: CiteIt returns the first result from the first footnote’s source, but this is a legend title rather than the actual article body. It would be preferable to use a later match. Perhaps a UI could be built that displays all matched results and gives Wikipedia editors the ability to choose the preferred instance.
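A minimal sketch of the “show all matches” idea, assuming the source has already been reduced to plain text. The function name and the returned dictionary keys are hypothetical, and treating the 500 characters as applying to each side of the quote is also an assumption.

```python
def find_all_matches(source_text, quote, context_chars=500):
    """Return every exact occurrence of quote in source_text with its context.

    Each result carries a 1-based index so an editor could pick, say,
    the 2nd match instead of the 1st.
    """
    if not quote:
        return []
    matches = []
    start = source_text.find(quote)
    while start != -1:
        end = start + len(quote)
        matches.append({
            "index": len(matches) + 1,
            "offset": start,
            "before": source_text[max(0, start - context_chars):start],
            "after": source_text[end:end + context_chars],
        })
        # Continue searching after the end of the current match.
        start = source_text.find(quote, end)
    return matches
```

An editing UI could list these results side by side and store the chosen index with the citation.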

Error citeit-error-context Context is Returned, but Incorrect Example:  convention’s “veiled” racist messages #ff3399
Error citeit-error-quote-returned Error in Returned Quote


Clinton asserted President Trump’s 2018 budget proposal was “a con” for underfunding domestic programs

Returned: Clinton called Mr. Trump’s 2018 budget proposal ” a cong “ which she said would underfund public education

Error citeit-error-quote-context-edges Slight character errors in Surrounding Context


“adding that her mother Dorothy “made sure I learned [these] words from our Methodist faith” Quote: “And she made sure I learned [these] ds from our Methodist faith”

Error citeit-error-unknown Error of Unknown Type


Emoluments Clause as phony.

HTML: Emoluments Clause as <q cite=”” class=”citeit-error-unknown”>“phony”</q>

Error citeit-error-404 Source URL returns 404 error


Clinton called for a constitutional amendment to limit “unaccountable money”

PDF citeit-pdf-scanned Source Document is a PDF image that needs to be scanned with OCR


memorandum saying “the data indicates that the President remains healthy.”

HTML: memorandum saying “<q cite=”” class=”citeit-pdf-scanned”>the data indicates that the President remains healthy.</q>”

Match citeit-footnote-interspersed There is a footnote interspersed in the middle of the quote that could throw off the match


Correct and consistent use of latex condoms can reduce the risk of syphilis only when the infected area or site of potential exposure is protected.[41] However, a syphilis sore outside of the area
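One way a script might handle this case is to strip Wikipedia-style footnote markers before trying to match. This is only a sketch: the pattern below is an assumption about what the markers look like (numbers such as [41] and single-letter notes such as [f]), and it deliberately leaves editorial brackets such as “[T]he” or “[these]” alone.

```python
import re

# Numeric markers like [41], or single-letter note markers like [f]
# that are not immediately followed by a letter (so "[a]lthough" is kept).
FOOTNOTE_MARKER = re.compile(r"\[\d+\]|\[[a-z]\](?![A-Za-z])")


def strip_footnote_markers(text):
    """Remove footnote markers and collapse any doubled whitespace left behind."""
    cleaned = FOOTNOTE_MARKER.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```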

Match citeit-footnote-shortname Footnote is a short form of citation. Need to cross-reference the short name with the longer reference in the references cited section.

The footnote is listed in the short form: Lastname, page


Troy 2006, pp. 176–77

References cited: Troy, Gil (2006). Hillary Rodham Clinton: Polarizing First Lady. Lawrence, Kansas: University Press of Kansas. ISBN 978-0-7006-1488-2.
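The cross-reference could be sketched roughly as follows. The regular expression and function names are hypothetical, and this version only handles the simple single-author “Lastname Year, pp. Pages” form shown above; footnotes with several authors would need more rules.

```python
import re

# Matches short footnotes such as "Troy 2006, pp. 176–77" or "Troy 2006, p. 12".
SHORT_FOOTNOTE = re.compile(
    r"^(?P<name>[^\d,]+?)\s+(?P<year>\d{4})[a-z]?,\s*pp?\.\s*(?P<pages>.+)$"
)


def parse_short_footnote(footnote):
    """Return (name, year, pages), or None if the footnote is not in short form."""
    match = SHORT_FOOTNOTE.match(footnote.strip())
    if match is None:
        return None
    return match.group("name"), match.group("year"), match.group("pages")


def resolve_short_footnote(footnote, references):
    """Find the full reference whose author and year match the short footnote."""
    parsed = parse_short_footnote(footnote)
    if parsed is None:
        return None
    name, year, _pages = parsed
    for reference in references:
        if reference.startswith(name + ",") and "(" + year + ")" in reference:
            return reference
    return None
```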

Wiki citeit-footnote-later The match is not found in the first footnote, but in a second footnote.


“In July 2016, she “committed” to introducing a U.S. constitutional amendment” 1st Source: not found 1st Source URL: Source: Hillary Clinton committed Saturday 2nd Source URL:

YouTube youtube-video The source is a YouTube URL Example:  “The Hillary Shimmy Song”. September 28, 2016. Retrieved September 16, 2017 – via YouTube. #3366ff
Wiki wiki-legend The Quote is found in the Legend of a Wikipedia Image and the footnote may be before the quote


“Fact-checkers from The Washington Post,[839] the Toronto Star,[840] and CNN[841] compiled data on “false or misleading claims” (orange background), and “false claims” (violet foreground), respectively.”

Wiki wiki-note Internal Wikipedia Note


Clinton into “imaginary discussions” with the also-politically active Eleanor Roosevelt.[f]



f. The Eleanor Roosevelt “discussions” were first reported in 1996 by The Washington Post writer Bob Woodward; they had begun from the start of Hillary Clinton’s time as first lady.[154]

Wiki wiki-multiple-source Wikipedia Source Citation Record Contains Multiple Sources


Calabresi, Massimo (November 7, 2011). “Hillary Clinton and the Rise of Smart Power”. Time. pp. 26–31. See also “TIME magazine editor explains Hillary Clinton’s ‘smart power'”. CNN. October 28, 2011.

Wikipedia article: Hillary Clinton

Match citeit-numbers-written-out The match is not found because Quoted text writes number out as text rather than numbers


to a willingness “to remold society by redefining what it means to be a human being in the twentieth century, moving into a new millennium”

Source: “Let us be willing,” she urged in conclusion, “to remold society by redefining what it means to be a human being in the 20th century, moving into a new millennium.”
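A script could try normalizing written-out numbers on both sides before comparing. The sketch below is deliberately tiny: the lookup table is a hypothetical stand-in that covers only a few ordinals, and a real converter would need a much fuller list.

```python
import re

# Hypothetical lookup table, for illustration only.
ORDINAL_WORDS = {
    "eighteenth": "18th",
    "nineteenth": "19th",
    "twentieth": "20th",
    "twenty-first": "21st",
}

# Longest words first so "twenty-first" is tried before any shorter entry.
_ORDINAL_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(word) for word in sorted(ORDINAL_WORDS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalize_ordinals(text):
    """Replace written-out ordinals with their numeric form."""
    return _ORDINAL_PATTERN.sub(lambda m: ORDINAL_WORDS[m.group(1).lower()], text)


def matches_after_normalizing(quote, source):
    """True if the quote appears in the source once both are normalized."""
    return normalize_ordinals(quote) in normalize_ordinals(source)
```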

Match citeit-text-from-source Quote text needed to be replaced from the source

Replaced Quote

“can’t .. miss” Wikipedia: (Outdated): “can’t miss”


“can’t afford to miss”

Match, Feature citeit-later-match A later match (2nd or 3rd) would be preferable


“genocidal taunts”

Source: Quote found in the title, but better context is found in 2nd match in the article body.

Match, Feature citeit-feature-added-word A word is added to the quote using brackets:


[although] “we did not find clear evidence that Secretary Clinton or her colleagues intended to violate laws”

Match citeit-formatting-mismatch Looks like it matches, but doesn’t because of a formatting mismatch.


“contends that they are not shapes of constellations but of what might be called <i>counter constellations</i>, the irregular-shaped dark patches within the twinkling expanse of the <a href=””Milky Way“>Milky Way</a>”

Match citeit-change-case Matches except for changing from upper to lower case or vice versa


[T]he mudslides and heavy rains did not appear to have caused any significant damage to the Nazca Lines

Match citeit-hyphen-change Matches except for a change in hyphenation Nazca Lines  (TODO: find an example) #ff99ff
Match citeit-punctuation-change Matches except for punctuation changes: Example: quoted text ends in a comma, but Wikipedia quote uses a period

Live Example:

“Her articles were important, not because they were radically new but because they helped formulate something that had been inchoate.”[63]


Her articles were important, not because they were radically new but because they helped formulate something that had been inchoate, Professor Fox said
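The case, hyphen, and punctuation mismatches above could probably share one normalization step that is applied to both the quote and the source before comparing. This is a sketch under that assumption, not how CiteIt currently matches; the function names are hypothetical.

```python
import re
import string

# Characters trimmed from both ends: ASCII punctuation, whitespace, curly quotes.
_EDGE_CHARS = string.punctuation + string.whitespace + "“”‘’"


def normalize_for_matching(text):
    """Loosen a string so that small editorial differences do not block a match."""
    text = re.sub(r"\[([A-Za-z])\]", r"\1", text)  # "[T]he" -> "The"
    text = text.casefold()                         # ignore upper/lower case
    text = re.sub(r"[-\u2010\u2011]", " ", text)   # hyphens become spaces
    text = re.sub(r"\s+", " ", text)               # collapse whitespace
    return text.strip(_EDGE_CHARS)                 # drop leading/trailing punctuation


def loosely_contains(quote, source):
    """True if the normalized quote appears inside the normalized source."""
    needle = normalize_for_matching(quote)
    return bool(needle) and needle in normalize_for_matching(source)
```

Only punctuation at the ends of the quote is dropped, so a comma in the middle of the quote still has to match the source.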

Match citeit-omit-text-from-source The original source includes text that is not in the citing quote


“Let me repeat what I have repeated for many months now, I never received nor sent any material that was marked classified.”


“Let me just repeat what I have repeated for many months now,” she said in the interview on “Meet the Press.” “I never received nor sent any material that was marked classified”

Match, Feature citeit-feature-ellipses The quote is interrupted by ellipses and then later continued


“There has never been a better time in history to be born a woman … this data shows just how far we still have to go.”
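One possible approach is to split the quote at each ellipsis and require the pieces to appear in the source in the same order. The sketch below assumes the ellipsis is written either as the single “…” character or as three periods; the function names are hypothetical.

```python
import re

# An ellipsis character or three periods, with any surrounding whitespace.
ELLIPSIS = re.compile(r"\s*(?:…|\.{3})\s*")


def split_on_ellipses(quote):
    """Split a quote into the segments that sit between ellipses."""
    return [part for part in ELLIPSIS.split(quote) if part]


def find_segments_in_order(quote, source):
    """Return the start offset of each segment, or None if any is missing.

    Each segment must start after the end of the previous one.
    """
    offsets = []
    position = 0
    for segment in split_on_ellipses(quote):
        found = source.find(segment, position)
        if found == -1:
            return None
        offsets.append(found)
        position = found + len(segment)
    return offsets
```

The context could then be taken from before the first segment and after the last one.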

Match citeit-non-quote Although Quotation Marks are Used, the Quote is a Title or Term, not a quote.


“filegate”, “Hillary Doctrine”


TODO: remove CiteIt link from quote so normal link is visible

Offline citeit-offline-no-isbn A publication without an ISBN that is not available online


“<q cite=”” class=”citeit-non-quote citeit-offline-no-isbn”>Children’s Rights: A Legal Perspective</q>” in 1979

Offline citeit-offline-isbn A Book that is not available Online but it has an ISBN


“<q cite=”″ class=”citeit-offline-isbn”>pivot to Asia</q>”

Private citeit-paywall The document requires a subscription


“<q cite=”″ class=”citeit-paywall citeit-non-quote citeit-later-match”>Children’s Policies: Abandonment and Neglect</q>”

Google Books ** google-books Google Books Nazca Lines #6600cc
Private citeit-edu An academic document requires a subscription


“<q cite=”” class=”citeit-paywall citeit-edu”>Children Under the Law</q>”

Twitter citeit-twitter Twitter generates its HTML using JavaScript, which I hope a future version of CiteIt can handle


<q cite=”” class=”citeit-twitter”>primary form of exercise</q>

#3366ff Borrow citeit-archive-org-borrow The source is available to borrow electronically through


<q cite=”” class=”citeit-archive-org-borrow citeit-later-match”>Hillarycare</q>

Wayback Machine citeit-archive-org The source is available through the Wayback Machine or another archive that doesn’t require checking out


<q cite=”” class=”citeit-archive-org-borrow citeit-later-match”>Hillarycare</q>

Results citeit-no-context A Match is found, but no Context is Returned


“<q cite=”” class=”citeit-no-context”>I’m not going to rule out a military option</q>”

Match citeit-no-match The Quote was not found in the Cited Source


<q cite=”,9171,2097973,00.html” class=”citeit-no-context”>convening power</q>

Editing citeit-better-link An Alternative Source is used Instead because it provides Better Context. Requires creating a new Footnote


<q cite=”” class=”citeit-better-link”>words from our Methodist faith</q>

Best-Practices citeit-naked-quote

The Citation is a Quote of a Quote, without the original Context

When CiteIt pulls in the context, the context is from the secondary source rather than the quoted source.

Suppose Wikipedia quotes an article in the New York Times, and the New York Times quotes the President. How much of CiteIt’s 500 characters of context is from the New York Times and how much is from the President?

A naked quote would not have any context from the primary source (the President). All of the context CiteIt finds would be from the secondary source.
