14 February 2008

It's the little things: Lisp + Unicode

This little snippet of code makes me giddy out of proportion to what it accomplishes:

;; A non-final sigma cannot occur at the end of the string.
(when (char= (elt uni (1- (length uni))) #\σ)
  (setf (elt uni (1- (length uni))) #\ς))
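
The same fix can be wrapped in a function that also tolerates an empty string, which the snippet above does not (it would ask for element -1). This is only a sketch: the name fix-final-sigma is hypothetical and not taken from the conversion library, and it assumes the Lisp reads the source file in an encoding that can represent the Greek character literals.

```lisp
;; Hypothetical helper, not part of the library itself.
;; Destructively replaces a trailing medial sigma with a final sigma
;; and returns the (modified) string.
(defun fix-final-sigma (uni)
  (let ((last-index (1- (length uni))))
    ;; Guard against the empty string, where LAST-INDEX would be -1.
    (when (and (>= last-index 0)
               (char= (char uni last-index) #\σ))
      (setf (char uni last-index) #\ς)))
  uni)
```

Because the function modifies its argument, it should be handed a fresh string rather than a literal, for example (fix-final-sigma (copy-seq "λογοσ")).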

That this is so easy is a lucky accident of Common Lisp's history. The "common" in Common Lisp is there because the language was meant to unite several popular (hey, it was the 80s) but incompatible Lisp variants. Many more kinds of computers were in wide use in the 80s, with many different encoding schemes for text. Thus, Common Lisp strings were always arrays of characters, and a character was never assumed to be a byte, or even ASCII. EBCDIC, anyone?
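
To see what "arrays of characters" means in practice, here is a small sketch. It assumes an implementation whose character codes are Unicode code points (true of SBCL and several others, but not something the standard requires); the names code-points->string and *logos* are made up for the illustration.

```lisp
;; Assumption: CODE-CHAR maps Unicode code points to characters.
;; The standard leaves the character repertoire to the implementation.
(defun code-points->string (codes)
  "Build a string from a list of character codes."
  (map 'string #'code-char codes))

;; lambda, omicron, gamma, omicron, final sigma
(defparameter *logos* (code-points->string '(955 959 947 959 962)))

(length *logos*)              ; counts characters, not bytes
(char *logos* 4)              ; a character object, not a number
(char-code (char *logos* 4))  ; the code of that character
```

Nothing here mentions an encoding: the string is indexed by character, and bytes only come into play when the string is written to or read from a stream.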

Well, the Beta Code to Unicode conversion library works. Next up: indexing Perseus' exotic XML texts.
