#+STARTUP: hidestars showall
* utf8
Byte vector backed, utf8 strings for Clojure.
To use this for a Leiningen project
: [pjstadig/utf8 "0.1.0"]
Or for a Maven project
: <dependency>
:   <groupId>pjstadig</groupId>
:   <artifactId>utf8</artifactId>
:   <version>0.1.0</version>
: </dependency>
** Creating utf8 strings
utf8 strings can be created using the utf8-str function.
: pjstadig.utf8> (utf8-str "foo")
: #<Utf8String foo>
Pass more than one argument to utf8-str, and it will concatenate all the
arguments into one utf8 string.
: pjstadig.utf8> (utf8-str "foo" " " "bar")
: #<Utf8String foo bar>
You can get the empty version of a utf8 string (after all it's a persistent
collection) by calling utf8-str with no arguments.
: pjstadig.utf8> (utf8-str)
: #<Utf8String >
utf8 strings are backed by Clojure's byte vectors, so appending at the end is
efficient.
: pjstadig.utf8> (conj (utf8-str "foo") \b)
: #<Utf8String foob>
You can get a String back out of a utf8 string using str.
: pjstadig.utf8> (str (utf8-str "foo"))
: "foo"
It also handles surrogate pairs
: pjstadig.utf8> (into (utf8-str) "\ud852\udf62¢€")
: #<Utf8String 𤭢¢€>
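For a rough picture of what a surrogate pair costs in each representation, here
is a small sketch that uses only plain JDK strings (no utf8 strings involved).
The sizes helper is a made-up name for this example and is not part of the
library.

#+BEGIN_SRC clojure
;; Illustrative only: `sizes` is a hypothetical helper, not part of pjstadig.utf8.
(defn sizes
  "Returns the number of Java chars, Unicode code points, and UTF-8 bytes in s."
  [^String s]
  {:chars       (.length s)
   :code-points (.codePointCount s 0 (.length s))
   :utf8-bytes  (count (.getBytes s "UTF-8"))})

(sizes "\ud852\udf62") ; => {:chars 2, :code-points 1, :utf8-bytes 4}
(sizes "\u00a2")       ; cent sign => {:chars 1, :code-points 1, :utf8-bytes 2}
(sizes "\u20ac")       ; euro sign => {:chars 1, :code-points 1, :utf8-bytes 3}
#+END_SRC

So the surrogate pair is two chars as far as a CharSequence is concerned, but
one code point and four bytes once encoded as utf8.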
** utf8 strings as persistent collections
You can use into to pour one utf8 string into another.
: pjstadig.utf8> (into (utf8-str "foo") (utf8-str "bar"))
: #<Utf8String foobar>
A utf8 string is a CharSequence and calling seq on it gets you a lazy seq
of chars (duh!).
: pjstadig.utf8> (seq (utf8-str "foo"))
: (\f \o \o)
You can use sequence operations on a utf8 string
: pjstadig.utf8> (first (utf8-str "foo"))
: \f
: pjstadig.utf8> (rest (utf8-str "foo"))
: (\o \o)
: pjstadig.utf8> (map int (utf8-str "foo"))
: (102 111 111)
** utf8 strings as CharSequences
Since utf8 strings are CharSequences and deal with chars (though the chars
are stored in utf8), you can mix utf8 strings and String strings.
: pjstadig.utf8> (into (utf8-str "foo") "bar")
: #<Utf8String foobar>
: pjstadig.utf8> (clojure.string/join " " [(utf8-str "foo") (utf8-str "bar")])
: "foo bar"
Though, as you can see in that last case, the result is a String, so you have to manually convert it back to a utf8 string.
: pjstadig.utf8> (utf8-str (clojure.string/join " " [(utf8-str "foo") (utf8-str "bar")]))
: #<Utf8String foo bar>
You can even match regular expressions against them (surprise!)
: pjstadig.utf8> (re-matches #"foo" (utf8-str "foo"))
: "foo"
: pjstadig.utf8> (re-matches #"bar" (utf8-str "foo"))
: nil
** What about a java.io.Writer implementation?
Glad you asked. You can get a Writer by calling utf8-writer, and it will
directly encode and store as utf8 every character you write to it. So you
can stream characters into a utf8 string without having to use two bytes of
storage for each ASCII character.
You just write a bunch of stuff to it, then call utf8-str on it when you're
done.
: pjstadig.utf8> (let [w (utf8-writer)]
:                  (with-open [f (clojure.java.io/reader "/etc/hosts")]
:                    (clojure.java.io/copy f w))
:                  (utf8-str w))
: #<Utf8String 127.0.0.1 localhost
: 127.0.1.1 jane
:
: # The following lines are desirable for IPv6 capable hosts
: ::1 ip6-localhost ip6-loopback
: fe00::0 ip6-localnet
: ff00::0 ip6-mcastprefix
: ff02::1 ip6-allnodes
: ff02::2 ip6-allrouters
: >
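To put a number on the storage claim, here is a sketch that compares the
encoded size of the first line of that hosts file in utf8 and in UTF-16 (which
is how a char-per-character representation stores it). It only uses
String.getBytes from the JDK; encoded-size is a hypothetical helper name and
not part of this library.

#+BEGIN_SRC clojure
;; Illustrative only: `encoded-size` is a hypothetical helper, not library API.
(defn encoded-size
  "Number of bytes s occupies in the named charset."
  [^String s ^String charset]
  (count (.getBytes s charset)))

(let [line "127.0.0.1 localhost"]
  {:utf8  (encoded-size line "UTF-8")      ; one byte per ASCII character
   :utf16 (encoded-size line "UTF-16BE")}) ; two bytes per char, no byte order mark
;; => {:utf8 19, :utf16 38}
#+END_SRC

For pure ASCII input the utf8 encoding is half the size. Text that is mostly
non-ASCII will see a smaller saving, or none.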
** Other stuff
utf8 strings define the normal equals and hashCode methods, so you can
compare them and stuff them in hash maps
: pjstadig.utf8> (= (utf8-str "foo") (utf8-str "foo"))
: true
: pjstadig.utf8> (map hash [(utf8-str "foo") (utf8-str "foo")])
: (101574 101574)
: pjstadig.utf8> (get {(utf8-str "foo") :foo} (utf8-str "foo"))
: :foo
The equals comparison is done character by character, not byte by byte. The
usual Unicode normalization caveats apply. (However, since utf8 strings
implement CharSequence you can use java.text.Normalizer! :))
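As a sketch of the normalization caveat, the example below compares a
precomposed "é" with its decomposed form using plain Strings. Normalizer's
normalize method accepts any CharSequence, so the same nfc helper (a
hypothetical name, not part of this library) should also accept a utf8 string,
though that is an assumption and not shown here.

#+BEGIN_SRC clojure
(import '(java.text Normalizer Normalizer$Form))

;; Illustrative only: `nfc` is a hypothetical helper, not part of pjstadig.utf8.
(defn nfc
  "Returns cs as a String in Unicode Normalization Form C."
  [^CharSequence cs]
  (Normalizer/normalize cs Normalizer$Form/NFC))

(def composed   "\u00e9")   ; é as a single code point
(def decomposed "e\u0301")  ; e followed by a combining acute accent

(= (seq composed) (seq decomposed)) ; => false, the chars differ
(= (nfc composed) (nfc decomposed)) ; => true, equal after normalization
#+END_SRC

Note that normalize returns a String, so you would have to convert the result
back with utf8-str if you want a utf8 string again.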
utf8 strings will only compare as equal to their own kind. So
: pjstadig.utf8> (= (utf8-str "foo") "foo")
: false
If this makes you sad, then realize that this will always be false
: pjstadig.utf8> (= "foo" (utf8-str "foo"))
: false
So unless/until Clojure's = is defined as an open protocol, we can't make
Strings equal to utf8 strings.
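The same thing happens with other CharSequence types from the JDK, which makes
for a library-free illustration. The sketch below uses a StringBuilder as a
stand-in and shows String's contentEquals method as one way to compare contents
explicitly. The content= helper is a hypothetical name, and whether it behaves
the same with a utf8 string depends on that type's length and charAt
implementations, which is assumed here, not verified.

#+BEGIN_SRC clojure
;; Illustrative only: `content=` is a hypothetical helper, not part of pjstadig.utf8.
(defn content=
  "True when cs holds exactly the same chars as the String s."
  [^String s ^CharSequence cs]
  (.contentEquals s cs))

(def sb (StringBuilder. "foo")) ; stand-in for any non-String CharSequence

(= "foo" sb)        ; => false, String.equals only accepts Strings
(= sb "foo")        ; => false, StringBuilder uses identity equality
(content= "foo" sb) ; => true
(content= "bar" sb) ; => false
#+END_SRC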
As a consolation you can compare sequences of characters
: pjstadig.utf8> (= (seq "foo") (seq (utf8-str "foo")))
: true
: pjstadig.utf8> (= (seq (utf8-str "foo")) (seq "foo"))
: true
** What are the downsides?
Well like any use of utf8, counting and indexing characters is O(n). It may be
possible to store a count so that counting can be constant time, but we'll see.
** What's next?
You tell me. I was thinking maybe a reader literal. What else would be useful?
** License
: Copyright © 2013 Paul Stadig. All rights reserved.
:
: This Source Code Form is subject to the terms of the Mozilla Public License,
: v. 2.0. If a copy of the MPL was not distributed with this file, You can
: obtain one at http://mozilla.org/MPL/2.0/.
:
: This Source Code Form is "Incompatible With Secondary Licenses", as defined
: by the Mozilla Public License, v. 2.0.