#+STARTUP: hidestars showall

* utf8
Byte vector backed, utf8 strings for Clojure.

To use this in a Leiningen project:

: [pjstadig/utf8 "0.1.0"]

Or in a Maven project:

: <dependency>
:   <groupId>pjstadig</groupId>
:   <artifactId>utf8</artifactId>
:   <version>0.1.0</version>
: </dependency>

** Creating utf8 strings
utf8 strings can be created using the utf8-str function.

: pjstadig.utf8> (utf8-str "foo")
: #<Utf8String foo>

Pass more than one argument to utf8-str, and it will concatenate all the arguments into one utf8 string.

: pjstadig.utf8> (utf8-str "foo" " " "bar")
: #<Utf8String foo bar>

You can get the empty version of a utf8 string (after all it's a persistent collection) by calling utf8-str with no arguments.

: pjstadig.utf8> (utf8-str)
: #<Utf8String >

utf8 strings are backed by Clojure's byte vectors, and can be appended to efficiently at the end.

: pjstadig.utf8> (conj (utf8-str "foo") \b)
: #<Utf8String foob>
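
To make the storage model more concrete, here is a rough sketch of a utf8 encoded byte vector built with nothing but Clojure core and the JDK. It assumes that "byte vector" means a primitive vector created with vector-of. The helpers utf8-bytes, append-char, and decode are made up for this illustration and are not part of the library.

#+BEGIN_SRC clojure
(defn utf8-bytes
  "Encodes s as UTF-8 into a persistent vector of primitive bytes."
  [^String s]
  (into (vector-of :byte) (.getBytes s "UTF-8")))

(defn append-char
  "Appends the UTF-8 encoding of c to the byte vector bv.
  Only meaningful for chars that are not half of a surrogate pair."
  [bv c]
  (into bv (.getBytes (str c) "UTF-8")))

(defn decode
  "Decodes a byte vector back into a java.lang.String."
  [bv]
  (String. ^bytes (byte-array bv) "UTF-8"))

(utf8-bytes "foo")                             ;; => [102 111 111]
(decode (append-char (utf8-bytes "foo") \b))   ;; => "foob"
(count (utf8-bytes "\u20ac"))                  ;; => 3 (the euro sign takes three bytes)
#+END_SRC

Appending only touches the tail of the vector, which is why conj at the end stays cheap.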

You can get a String back out of a utf8 string using str:

: pjstadig.utf8> (str (utf8-str "foo"))
: "foo"

It also handles surrogate pairs:

: pjstadig.utf8> (into (utf8-str) "\ud852\udf62¢€")
: #<Utf8String 𤭢¢€>

** utf8 strings as persistent collections
You can use into:

: pjstadig.utf8> (into (utf8-str "foo") (utf8-str "bar"))
: #<Utf8String foobar>

A utf8 string is a CharSequence and calling seq on it gets you a lazy seq of chars (duh!).

: pjstadig.utf8> (seq (utf8-str "foo"))
: (\f \o \o)
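
Because a CharSequence deals in UTF-16 chars, a code point outside the Basic Multilingual Plane (such as the one in the surrogate pair example earlier) shows up as two chars in a seq, even though it is a single four-byte sequence in utf8. The sketch below uses only java.lang.String to show the three different counts. The counts function is a hypothetical helper and does not call the library.

#+BEGIN_SRC clojure
(defn counts
  "Returns the UTF-16 char, code point, and UTF-8 byte counts of s."
  [^String s]
  {:chars       (.length s)
   :code-points (.codePointCount s 0 (.length s))
   :utf8-bytes  (alength (.getBytes s "UTF-8"))})

;; U+24B62 (a surrogate pair), then the cent sign, then the euro sign
(counts "\ud852\udf62\u00a2\u20ac")
;; => {:chars 4, :code-points 3, :utf8-bytes 9}
#+END_SRC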

You can use sequence operations on a utf8 string:

: pjstadig.utf8> (first (utf8-str "foo"))
: \f
: pjstadig.utf8> (rest (utf8-str "foo"))
: (\o \o)
: pjstadig.utf8> (map int (utf8-str "foo"))
: (102 111 111)

** utf8 strings as CharSequences
Since utf8 strings are CharSequences and deal with chars (though the chars are stored in utf8), you can mix utf8 strings and ordinary Strings.

: pjstadig.utf8> (into (utf8-str "foo") "bar")
: #<Utf8String foobar>
: pjstadig.utf8> (clojure.string/join " " [(utf8-str "foo") (utf8-str "bar")])
: "foo bar"

Though, as you can see in that last case, the result is a String, so you have to manually convert it back to a utf8 string.

: pjstadig.utf8> (utf8-str (clojure.string/join " " [(utf8-str "foo") (utf8-str "bar")]))
: #<Utf8String foo bar>
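
The round trip through String comes from clojure.string/join, which always returns a String no matter what kind of CharSequence goes in. A small sketch with a stand-in CharSequence shows the same behavior. The char-seq helper is hypothetical and merely wraps a String; it says nothing about how the library is implemented.

#+BEGIN_SRC clojure
(require '[clojure.string :as string])

(defn char-seq
  "Wraps s in an anonymous CharSequence that is not a String."
  [^String s]
  (reify CharSequence
    (length [_] (.length s))
    (charAt [_ i] (.charAt s i))
    (subSequence [_ start end] (.subSequence s start end))
    (toString [_] s)))

(def joined (string/join " " [(char-seq "foo") (char-seq "bar")]))

(string? (char-seq "foo"))  ;; => false
joined                      ;; => "foo bar"
(string? joined)            ;; => true
#+END_SRC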

You can even match regular expressions against them (surprise!)

: pjstadig.utf8> (re-matches #"foo" (utf8-str "foo"))
: "foo"
: pjstadig.utf8> (re-matches #"bar" (utf8-str "foo"))
: nil

** What about a java.io.Writer implementation?
Glad you asked. You can get a Writer by calling utf8-writer, and it will directly encode and store as utf8 every character you write to it. So you can stream characters into a utf8 string without having to use two bytes of storage for each ASCII character.

You just write a bunch of stuff to it, then call utf8-str on it when you're done.
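
For comparison, the JDK can stream characters into utf8 bytes in a similar way. The sketch below is only an analogy built from java.io classes (it does not use utf8-writer, and write-utf8 is a made-up helper); it shows the storage saving for ASCII text.

#+BEGIN_SRC clojure
(import '(java.io ByteArrayOutputStream OutputStreamWriter))

(defn write-utf8
  "Streams the chars of s through a Writer that encodes them as UTF-8,
  and returns the resulting byte array."
  ^bytes [^String s]
  (let [out (ByteArrayOutputStream.)]
    ;; closing the writer flushes any buffered bytes into out
    (with-open [w (OutputStreamWriter. out "UTF-8")]
      (.write w s))
    (.toByteArray out)))

(alength (write-utf8 "127.0.0.1 localhost"))  ;; => 19, one byte per ASCII char
(* 2 (count "127.0.0.1 localhost"))           ;; => 38, two bytes per UTF-16 char
#+END_SRC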

: pjstadig.utf8> (let [w (utf8-writer)]
:                  (with-open [f (clojure.java.io/reader "/etc/hosts")]
:                    (clojure.java.io/copy f w))
:                  (utf8-str w))
: #<Utf8String 127.0.0.1 localhost
: 127.0.1.1 jane
:
: # The following lines are desirable for IPv6 capable hosts
: ::1 ip6-localhost ip6-loopback
: fe00::0 ip6-localnet
: ff00::0 ip6-mcastprefix
: ff02::1 ip6-allnodes
: ff02::2 ip6-allrouters
: >

** Other stuff
utf8 strings define the normal equals and hashCode methods, so you can compare them and stuff them in hash maps:

: pjstadig.utf8> (= (utf8-str "foo") (utf8-str "foo"))
: true
: pjstadig.utf8> (map hash [(utf8-str "foo") (utf8-str "foo")])
: (101574 101574)
: pjstadig.utf8> (get {(utf8-str "foo") :foo} (utf8-str "foo"))
: :foo

The equals comparison is done character by character, not byte by byte. The usual Unicode normalization caveats apply. (However, since utf8 strings implement CharSequence, you can use java.text.Normalizer! :))
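
As a sketch of that caveat, the following uses plain Strings (which are also CharSequences) rather than utf8 strings. The letter é written as a single code point and as an e followed by a combining accent compare as different until both are normalized. Normalizer/normalize accepts any CharSequence and returns a String; nfc is a hypothetical helper name.

#+BEGIN_SRC clojure
(import '(java.text Normalizer Normalizer$Form))

(def composed   "\u00e9")   ; e with acute accent as a single code point
(def decomposed "e\u0301")  ; e followed by COMBINING ACUTE ACCENT

(defn nfc
  "Normalizes any CharSequence to NFC, returning a String."
  [^CharSequence cs]
  (Normalizer/normalize cs Normalizer$Form/NFC))

(= composed decomposed)              ;; => false
(= (nfc composed) (nfc decomposed))  ;; => true
#+END_SRC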

utf8 strings will only compare as equal to their own kind, so:

: pjstadig.utf8> (= (utf8-str "foo") "foo")
: false

If this makes you sad, then realize that this will always be false:

: pjstadig.utf8> (= "foo" (utf8-str "foo"))
: false

So unless/until Clojure's = is defined as an open protocol, we can't make Strings equal to utf8 strings.
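
The JDK behaves the same way with its own CharSequence types. In this sketch a StringBuilder stands in for a utf8 string (an assumption made purely for illustration; the library is not used): String's equals only accepts other Strings, so the comparison fails, while contentEquals does compare against any CharSequence.

#+BEGIN_SRC clojure
(def sb (StringBuilder. "foo"))

(= "foo" sb)                             ;; => false
(.equals "foo" sb)                       ;; => false
(.contentEquals "foo" ^CharSequence sb)  ;; => true
#+END_SRC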

As a consolation, you can compare sequences of characters:

: pjstadig.utf8> (= (seq "foo") (seq (utf8-str "foo")))
: true
: pjstadig.utf8> (= (seq (utf8-str "foo")) (seq "foo"))
: true

** What are the downsides?
Well, as with any use of utf8, counting and indexing characters is O(n). It may be possible to store a count so that counting can be constant time, but we'll see.

** What's next?
You tell me. I was thinking maybe a reader literal. What else would be useful?

** License
: Copyright © 2013 Paul Stadig. All rights reserved.
:
: This Source Code Form is subject to the terms of the Mozilla Public License,
: v. 2.0. If a copy of the MPL was not distributed with this file, You can
: obtain one at http://mozilla.org/MPL/2.0/.
:
: This Source Code Form is "Incompatible With Secondary Licenses", as defined
: by the Mozilla Public License, v. 2.0.