whitespace

Joel Spolsky (best known for cofounding Stack Overflow, FogBugz, Glitch, and Trello) tells the story of his first Bill Gates Review while working as a program manager on the Excel team.

Spolsky recounts that:

Bill doesn’t really want to review your spec, he just wants to make sure you’ve got it under control. His standard M.O. is to ask harder and harder questions until you admit that you don’t know, and then he can yell at you for being unprepared.

Obscure Details:

For Spolsky’s review, Gates chose to focus on the details of Excel’s date system and its compatibility with Lotus 123. As Spolsky discovered, Basic and Excel use different starting points for the epoch:

Excel’s epoch starts at: January 1, 1900
Basic’s epoch starts at: December 31, 1899

... but Spolsky was puzzled that the two programs still yield the same serial number for today’s date.

I went to find an Excel developer who was old enough to remember why. Ed Fries seemed to know the answer.

“Oh,” he told me. “Check out February 28th, 1900.”

“It’s 59,” I said.

“Now try March 1st.”

“It’s 61!”

“What happened to 60?” Ed asked.

“February 29th. 1900 was a leap year! It’s divisible by 4!”

“Good guess, but no cigar,” Ed said, and left me wondering for a while.

Oops. I did some research. Years that are divisible by 100 are not leap years, unless they’re also divisible by 400.

1900 wasn’t a leap year.

“It’s a bug in Excel!” I exclaimed.

“Well, not really,” said Ed. “We had to do it that way because we need to be able to import Lotus 123 worksheets.”

“So, it’s a bug in Lotus 123?”

“Yeah, but probably an intentional one. Lotus had to fit in 640K. That’s not a lot of memory. If you ignore 1900, you can figure out if a given year is a leap year just by looking to see if the rightmost two bits are zero. That’s really fast and easy. The Lotus guys probably figured it didn’t matter to be wrong for those two months way in the past. It looks like the Basic guys wanted to be anal about those two months, so they moved the epoch one day back.”
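Ed’s explanation can be sketched in a few lines of Python. The function names below are made up for illustration, and the serial-number arithmetic is reconstructed from the story (Excel counting January 1, 1900 as day 1 and keeping a phantom February 29, 1900; Basic counting December 31, 1899 as day 1). It is not taken from either product’s source code:

```python
from datetime import date


def is_leap_gregorian(year: int) -> bool:
    """Full rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_leap_shortcut(year: int) -> bool:
    """The shortcut Ed describes: the rightmost two bits are zero,
    which is the same as being divisible by 4. Wrong for 1900."""
    return year & 0b11 == 0


def basic_serial(d: date) -> int:
    """Assumed Basic numbering: December 31, 1899 is day 1."""
    return (d - date(1899, 12, 30)).days


def excel_serial(d: date) -> int:
    """Assumed Excel/Lotus numbering: January 1, 1900 is day 1, and
    serial 60 is reserved for the nonexistent February 29, 1900."""
    n = (d - date(1899, 12, 31)).days
    # Skip over the phantom leap day for everything after February 1900.
    return n + 1 if d >= date(1900, 3, 1) else n


if __name__ == "__main__":
    for d in (date(1900, 2, 28), date(1900, 3, 1), date(2024, 1, 1)):
        print(d, "excel:", excel_serial(d), "basic:", basic_serial(d))
```

Under these assumptions, February 28, 1900 is 59 in Excel and March 1, 1900 is 61, matching the dialogue. The two numbering schemes differ by one day in January and February 1900 and agree from March 1, 1900 onward, which is why today’s date comes out the same in both.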

Gates asks his “Gotcha Question”

Bill Gates, Source: Flickr

When Spolsky sat down with Gates for the review, he noticed that Gates had written notes in the margins of every page of the spec, which suggested he had read the whole thing. That impressed Spolsky, because Gates had received the spec only about 24 hours earlier. Gates then asked a series of progressively harder and more detailed questions:

Finally the killer question.

“I don’t know, you guys,” Bill said, “Is anyone really looking into all the details of how to do this? Like, all those date and time functions. Excel has so many date and time functions. Is Basic going to have the same functions? Will they all work the same way?”

“Yes,” I said, “except for January and February, 1900.”

Silence.

The f*** counter and my boss exchanged astonished glances. How did I know that? January and February WHAT?

“OK. Well, good work,” said Bill. He took his marked up copy of the spec

Unicode Support

Unicode, Source: Wikipedia

If there is an equivalent gotcha issue for CiteIt, it concerns the details of Unicode escape characters in JavaScript and Python.

The problem centers on how the quote identifier is constructed: in the browser with JavaScript and on the server with Python. The Python code converts HTML into text, a step that exposes spacing and encoding discrepancies once the text is hashed. I discovered these issues through trial and error, but I’ve been wanting to rework my code from scratch.

Yesterday I decided to get serious about improving the escape function that normalizes the text. The current set of functions was put together in an ad hoc way, so this weekend I did some more thorough research into the Unicode Standard.

Research Findings:

Here are some of the things that I found:

Different browsers handle whitespace characters differently. This means that a regular expression that removes whitespace may not produce the same results on all browsers. (That suggests I should avoid using regular expressions this way.)
There are some characters that are used as separators but have no width:
- Sometimes these characters mark a point at which a line break may be inserted if the line needs to wrap.
- Some languages also use separators that don’t have a width.
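
The zero-width point is easy to check on the Python side. The following is a small sketch, with a hand-picked sample of characters (not a complete list) and a hypothetical helper named `describe`, that reports each character’s Unicode general category and whether Python’s `\s` pattern matches it:

```python
import re
import unicodedata

# A hand-picked sample for illustration; not an exhaustive list.
SAMPLES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "ZERO WIDTH SPACE": "\u200b",
    "ZERO WIDTH NON-JOINER": "\u200c",
    "WORD JOINER": "\u2060",
}


def describe(ch: str) -> tuple[str, bool]:
    """Return (general category, whether Python's \\s matches ch)."""
    return unicodedata.category(ch), bool(re.fullmatch(r"\s", ch))


if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        category, is_ws = describe(ch)
        print(f"U+{ord(ch):04X} {name}: category={category}, matched by \\s: {is_ws}")
```

In Python, the zero-width characters in this sample fall in the format category (`Cf`) rather than a separator category, so a whitespace-stripping regular expression leaves them in place. This sketch only covers Python; it does not show how any particular browser behaves.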

It is possible that some non-visible characters could appear in the JavaScript hash key but not carry over to the Python text version when HTML is converted to text, resulting in mismatched hash keys. To avoid that, I plan to remove any characters that may not carry over well when converting from HTML to text.
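
A minimal sketch of that plan in Python, under loud assumptions: the set of code points below is a short illustrative sample rather than the actual escape list, the names `normalize_quote` and `quote_hash` are hypothetical, and SHA-256 stands in for whatever hash the real code uses. The idea is to delete an explicit list of characters instead of relying on a regular expression’s notion of whitespace:

```python
import hashlib

# Illustrative sample only; the real list would be longer.
STRIP_CODE_POINTS = [
    0x0009,  # character tabulation
    0x000A,  # line feed
    0x000D,  # carriage return
    0x0020,  # space
    0x00A0,  # no-break space
    0x00AD,  # soft hyphen
    0x200B,  # zero width space
    0x200C,  # zero width non-joiner
    0x200D,  # zero width joiner
    0x2060,  # word joiner
    0xFEFF,  # zero width no-break space (byte order mark)
]

# str.translate deletes every code point that maps to None.
_DELETE_TABLE = dict.fromkeys(STRIP_CODE_POINTS)


def normalize_quote(text: str) -> str:
    """Remove every listed character, leaving all others untouched."""
    return text.translate(_DELETE_TABLE)


def quote_hash(text: str) -> str:
    """Hash the normalized text (SHA-256 is an assumption here)."""
    return hashlib.sha256(normalize_quote(text).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    browser_text = "To be,\u200b or not\u00a0to be"
    server_text = "To be, or not to be\n"
    print(quote_hash(browser_text) == quote_hash(server_text))
```

Because the same explicit list can be written out in JavaScript, both sides would strip exactly the same characters before hashing, regardless of how each environment defines whitespace.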

Escape Characters

My new “spec” revision uses a list of escape characters that is 90 characters long.