Unicode (UTF-8) capable std::string
BSD-3-CLAUSE License
Tiny-utf8 is a library for extremely easy integration of Unicode into an arbitrary C++11 project.
The library consists solely of the class utf8_string, which acts as a drop-in replacement for std::string.
Its implementation strikes a balance between a small memory footprint and fast access. All functionality of std::string is therefore replaced by the corresponding codepoint-based UTF-32 version, translating every access to UTF-8 under the hood.
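To make "codepoint-based access on top of UTF-8 bytes" concrete, here is a minimal sketch that uses only the standard library. It is not tiny-utf8's implementation: the helper names sequence_length, decode_at and codepoint_at are made up for this illustration, and the sketch assumes well-formed input, an in-range index and the standard 1-4 byte forms only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of tiny-utf8): number of bytes in the UTF-8
// sequence introduced by lead byte `b`. Standard 1-4 byte forms only.
inline std::size_t sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                          // unexpected byte: step over it
}

// Decode the codepoint that starts at byte offset `pos`.
// Assumes the sequence is well-formed and fully contained in `s`.
inline char32_t decode_at(const std::string& s, std::size_t pos) {
    unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = sequence_length(lead);
    if (len == 1) return lead;
    char32_t cp = lead & (0xFF >> (len + 1));  // payload bits of the lead byte
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
    return cp;
}

// Codepoint-based indexing: skip `index` codepoints, then decode the next one.
// Assumes `index` is smaller than the number of codepoints in `s`.
inline char32_t codepoint_at(const std::string& s, std::size_t index) {
    std::size_t pos = 0;
    while (index-- > 0)
        pos += sequence_length(static_cast<unsigned char>(s[pos]));
    return decode_at(s, pos);
}
```

With the bytes of "héllo" (where "é" is encoded as 0xC3 0xA9), codepoint_at( s , 1 ) yields 0xE9, whereas the byte at index 1 is only the first half of that character. This linear walk is the naive approach; it is only meant to show what a codepoint-based interface has to do on every access.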
(c)(r)begin/end now return codepoint-based iterators, while raw_(c)(r)begin/end now return byte-based iterators.
str.erase( std::remove( str.begin() , str.end() , U'W' ) , str.end() ) will work, but str.erase( std::remove( str.raw_begin() , str.raw_end() , U'W' ) , str.raw_end() ) will not (at least not always). The reason is: after the call to std::remove, the size of the string data might have changed and the second call to str.raw_end() might have yielded a now-invalidated iterator.
std::string
sizeof(utf8_string)! That is, including the trailing \0
size() returns the size of the data in bytes, length() returns the number of codepoints contained.
0x0 - 0xFFFFFFFF, i.e. 1-7 Code Units/Bytes per Codepoint (Note: This is more than specified by UTF8, but until now otherwise considered out of scope)
const char*/const char32_t* also have an overload for const char (&)[N]/const char32_t (&)[N], allowing correct interpretation of string literals with embedded zeros)
shrink_to_fit()
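Two of the points above are easy to show with the standard library alone: the difference between a size in bytes and a length in codepoints, and why an overload taking an array reference handles literals with embedded zeros correctly. The following is a sketch, not tiny-utf8 code; count_codepoints, from_literal and from_pointer are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the library's implementation): counts codepoints
// by counting every byte that is not a UTF-8 continuation byte (10xxxxxx).
// Assumes well-formed UTF-8.
inline std::size_t count_codepoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80) ++n;
    return n;
}

// Array-reference version: N is known at compile time, so embedded zeros
// are kept and only the implicit trailing '\0' is dropped.
template <std::size_t N>
std::string from_literal(const char (&lit)[N]) {
    return std::string(lit, N - 1);
}

// Pointer version: the length is found by scanning for the first '\0',
// so everything after an embedded zero is lost.
inline std::string from_pointer(const char* ptr) {
    return std::string(ptr);
}
```

For the three-byte euro sign followed by "5", the byte size is 4 while count_codepoints reports 2. For the literal "ab\0cd", from_literal keeps all 5 characters, whereas from_pointer stops after "ab".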
Back when I decided to write a UTF8 solution for C++, I knew I wanted a drop-in replacement for std::string. At the time mostly because I found it neat to have one and felt C++ always lacked accessible support for UTF8. Since then, several years have passed and the situation has not improved much. That said, things currently look like they are about to improve - but that doesn't say much, eh?
The opinion shared by many "experienced Unicode programmers" (e.g. published on UTF-8 Everywhere) is that "non-experienced" programmers both under- and overestimate the need for Unicode- and encoding-specific treatment: This need is...
Unicode is not rocket science but nonetheless hard to get right. Tiny-utf8 does not intend to be an enterprise solution like ICU for C++. The goal of tiny-utf8 is to
std::string with a custom class which means to
Tiny-utf8 aims to be the simple-and-dependable groundwork which you build Unicode infrastructure upon. And, if 1) C++2xyz should happen to make your Unicode life easier than tiny-utf8 or 2) you decide to go enterprise, you have not wasted much time replacing std::string with tiny_utf8::string either. That's what makes tiny-utf8 so agreeable.
'ch' vs. 'c'+'h')
Note: ANSI support was dropped in Version 2.0 in favor of execution speed.
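#include <iostream>
#include <algorithm>
#include <tinyutf8/tinyutf8.h>
using namespace std;

int main()
{
    tiny_utf8::string str = u8"! olleH";
    // Visit the codepoints (not the bytes) in reverse order
    for_each( str.rbegin() , str.rend() , []( char32_t codepoint ){
        // Note: before C++20, streaming a char32_t prints its numeric value
        cout << codepoint;
    } );
    return 0;
}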
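A char32_t streamed to std::cout is not printed as a character: depending on the language standard it shows up as a number or is rejected by the compiler. If you want readable output on a UTF-8 terminal, each codepoint has to be encoded first. The helper below is a minimal standard-library sketch of that step. encode_utf8 is a hypothetical name, not a tiny-utf8 function, and it only covers the standard 1-4 byte forms for codepoints up to 0x10FFFF.

```cpp
#include <string>

// Hypothetical helper (not part of tiny-utf8): encodes one codepoint as
// standard UTF-8. Assumes cp <= 0x10FFFF.
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Inside the lambda of the example above, cout << encode_utf8( codepoint ) would then write the character's bytes instead of its numeric value.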
__cpp_exceptions.
noexcept anyway, #define the macro TINY_UTF8_NOEXCEPT.
#define the macro TINY_UTF8_THROW( location , failing_predicate ). For using assertions, you would write #define TINY_UTF8_THROW( _ , pred ) assert( pred ).
TINY_UTF8_THROW( ... ) is automatically defined as void(). This works well, because all uses of TINY_UTF8_THROW are immediately followed by a ; as well as a proper return statement with a fallback value. That also means, TINY_UTF8_THROW can safely be a NO-OP.
tiny_utf8::basic_utf8_string has been renamed to basic_string, which better resembles its drop-in-capabilities for std::string.
tinyutf8.h has been moved into the folder include/tinyutf8/ in order to mimic the structuring of many other C++-based open source projects.
utf8_string is now defined inside namespace tiny_utf8. If you want the old declaration in the global namespace, #define TINY_UTF8_GLOBAL_NAMESPACE
tiny_utf8::u8string, which uses char8_t as underlying data type (instead of char)
utf8_string defined in the global namespace, #define the macro TINY_UTF8_GLOBAL_NAMESPACE.
If you encounter any bugs, please file a bug report through the "Issues" tab. I'll try to answer it soon!
Thanks for taking your time to improve tiny-utf8.
Cheers, Jakob