Characters and strings

Overview

The Balau library has been designed primarily to work with external character data encoded in UTF-8 and internal character data encoded in UTF-8 and UTF-32. UTF-8 is used for persisted strings, data transfer, and in-memory strings. UTF-32 is used in code point processing algorithms that require a fixed size code point type. This allows a normally compact representation in memory, in transit, and in storage, but provides a fixed width character type for processing when required.

The C++ language char character type is used for UTF-8 data and the char32_t character type is used for UTF-32 data. As the size of the C++ language wchar_t character type is not defined in the specification, the wchar_t character type and associated std::wstring string type are not used in Balau.
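To make the relationship between the two encodings concrete, the following is a minimal sketch of converting UTF-8 `char` data into UTF-32 `char32_t` data. The function name `decodeUtf8` is hypothetical and is not part of Balau; the sketch assumes well-formed input and performs no validation.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not part of Balau: decodes well-formed UTF-8 into UTF-32.
// No validation is performed, so malformed input yields unspecified code points.
inline std::u32string decodeUtf8(std::string_view utf8) {
	std::u32string result;
	std::size_t i = 0;

	while (i < utf8.size()) {
		const auto lead = static_cast<unsigned char>(utf8[i]);
		// The lead byte determines how many bytes the code point occupies.
		const std::size_t length = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
		// Keep only the payload bits of the lead byte.
		char32_t c = length == 1 ? lead : lead & (0xFF >> (length + 1));

		// Each continuation byte contributes six payload bits.
		for (std::size_t k = 1; k < length && i + k < utf8.size(); ++k) {
			c = (c << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
		}

		result.push_back(c);
		i += length;
	}

	return result;
}
```

The UTF-8 input occupies between one and four bytes per code point, whereas every element of the resulting `std::u32string` is exactly one code point, which is what fixed-width processing algorithms rely on.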

The char16_t character type is not used in Balau components. A set of universal to-string and from-string function overloads is however included for UTF-16 string generation and conversion. These functions provide to-string and from-string conversions when another library works with UTF-16 strings or when application code requires UTF-16 encoded strings.

Balau uses the ICU library for Unicode support. Unlike ICU, Balau uses the standard char32_t primitive type to represent UTF-32 characters. Values of this type are implicitly converted to and from ICU's UChar32 (which is a signed int) within the Character functions.

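The primary character and string related functionality that Balau provides consists of character utilities, universal to-string functions, universal from-string functions, string utilities, and resource classes.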

The character utilities, universal to-string and universal from-string functions are discussed in this section. The string utilities section discusses the string utilities. The resources section discusses the various resource classes.

String types

The Balau C++ library uses the following character and string types:

| Char type | String type | Usage |
|---|---|---|
| char | std::string | UTF-8 string or undefined array of bytes |
| char16_t | std::u16string | UTF-16 string |
| char32_t | std::u32string | UTF-32 string |

Character utilities

#include <Balau/Type/Character.hpp>

Character utility functions are provided for the following themes: classification, iteration, and mutation.

Many of the character utility functions are proxies to corresponding ICU functions.

Classification

The classification functions each accept a char32_t character. Most of the classification functions act as predicates.

The following predicate classification functions are available.

| Function name | Description |
|---|---|
| isLower | Does the specified code point have the general category Ll (lowercase letter). |
| isUpper | Does the specified code point have the general category Lu (uppercase letter). |
| isDigit | Does the specified code point have the general category Nd (decimal digit numbers). |
| isHexDigit | Does the specified code point have the general category Nd (decimal digit numbers) or is one of the ASCII Latin letters a-f or A-F. |
| isOctalDigit | Is the specified code point one of the ASCII characters 0-7. |
| isBinaryDigit | Is the specified code point one of the ASCII characters 0-1. |
| isAlpha | Does the specified code point have the general category L (letters). |
| isAlphaOrDecimal | Does the specified code point have the general category L (letters) or Nd (decimal digit numbers). |
| isControlCharacter | Is the specified code point a control character. |
| isSpace | Is the specified code point a space character (excluding CR / LF). |
| isWhitespace | Is the specified code point a whitespace character. |
| isBlank | Is the specified code point a character that visibly separates words on a line. |
| isPrintable | Is the specified code point a printable character. |
| isPunctuation | Does the specified code point have the general category P (punctuation). |
| isIdStart | Does the specified code point have the general category L (letters) or Nl (letter numbers). |
| isIdPart | Is the specified code point valid as part of an Id. |
| isBreakableCharacter | Is the specified code point a breakable character for line endings. |
| isInclusiveBreakableCharacter | Is the specified code point a breakable character for line endings that should be printed. |

The following non-predicate classification functions are available.

| Function name | Description |
|---|---|
| utf8ByteCount | Returns the number of bytes that the character occupies when UTF-8 encoded. |
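As an illustration of what the ASCII-only predicates and the byte count function compute, here is a minimal sketch using standalone functions. The names ending in `Sketch` are hypothetical stand-ins and are not Balau's implementation, which may differ in return type and in its use of ICU.

```cpp
// Hypothetical stand-ins for illustration only, not Balau's implementation.

// True for the ASCII characters 0-7.
constexpr bool isOctalDigitSketch(char32_t c) {
	return c >= U'0' && c <= U'7';
}

// True for the ASCII characters 0-1.
constexpr bool isBinaryDigitSketch(char32_t c) {
	return c == U'0' || c == U'1';
}

// Number of bytes needed to encode the code point in UTF-8.
// Assumes a valid code point (at most U+10FFFF).
constexpr int utf8ByteCountSketch(char32_t c) {
	return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}
```

The byte count follows directly from the UTF-8 encoding ranges: code points below U+0080 take one byte, below U+0800 two bytes, below U+10000 three bytes, and everything else four bytes.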

Iteration

Iteration functions are defined for UTF-8 string views. These functions advance or retreat an integer offset to the next or previous UTF-8 character. Two of the functions also return the resulting character.

The following iteration functions are currently available.

| Function name | Description |
|---|---|
| getNextUtf8 | Get the next code point from the UTF-8 string view. |
| getPreviousUtf8 | Get the previous code point from the UTF-8 string view. |
| advanceUtf8 | Advance the supplied offset from one code point boundary to the next one. |
| retreatUtf8 | Retreat the supplied offset from one code point boundary to the previous one. |
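The idea of moving an integer offset between code point boundaries can be sketched as follows. The functions `advanceOffset` and `retreatOffset` are hypothetical and their signatures are assumptions; they are not the Balau functions listed above. The sketch assumes well-formed UTF-8 and relies on the fact that continuation bytes always have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string_view>

// True for bytes of the form 10xxxxxx, which never start a code point.
constexpr bool isContinuationByte(char byte) {
	return (static_cast<unsigned char>(byte) & 0xC0) == 0x80;
}

// Hypothetical sketch: move offset forward to the next code point boundary.
inline void advanceOffset(std::string_view text, int & offset) {
	const int size = static_cast<int>(text.size());

	if (offset < size) {
		++offset;
	}

	// Skip any continuation bytes belonging to the code point just passed.
	while (offset < size && isContinuationByte(text[static_cast<std::size_t>(offset)])) {
		++offset;
	}
}

// Hypothetical sketch: move offset back to the previous code point boundary.
inline void retreatOffset(std::string_view text, int & offset) {
	if (offset > 0) {
		--offset;
	}

	// Step back over continuation bytes until a lead byte is reached.
	while (offset > 0 && isContinuationByte(text[static_cast<std::size_t>(offset)])) {
		--offset;
	}
}
```

Both functions clamp at the ends of the view, so calling them repeatedly at offset zero or at the end of the text leaves the offset unchanged.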

Mutation

Mutation functions are available for char32_t characters and for UTF-8 char characters at offsets inside std::string strings.

The following mutating functions are currently available.

| Function name | Description |
|---|---|
| toUpper(char32_t) | Convert the supplied code point to uppercase. |
| toLower(char32_t) | Convert the supplied code point to lowercase. |
| setUtf8AndAdvanceOffset(std::string & destination, int & offset, char32_t c) | Write a code point into the supplied UTF-8 string. |
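Writing a code point into a UTF-8 string at an offset can be sketched as below. The function `writeUtf8AndAdvance` is hypothetical; its behaviour of overwriting bytes at the offset and growing the string when required is an assumption made for the example and may not match Balau's `setUtf8AndAdvanceOffset`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch, not Balau's implementation: encodes c as UTF-8, writes the
// bytes into destination starting at offset, then advances offset past them.
// Assumes offset is non-negative and c is a valid code point.
inline void writeUtf8AndAdvance(std::string & destination, int & offset, char32_t c) {
	char bytes[4];
	int count;

	if (c < 0x80) {
		bytes[0] = static_cast<char>(c);
		count = 1;
	} else if (c < 0x800) {
		bytes[0] = static_cast<char>(0xC0 | (c >> 6));
		bytes[1] = static_cast<char>(0x80 | (c & 0x3F));
		count = 2;
	} else if (c < 0x10000) {
		bytes[0] = static_cast<char>(0xE0 | (c >> 12));
		bytes[1] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
		bytes[2] = static_cast<char>(0x80 | (c & 0x3F));
		count = 3;
	} else {
		bytes[0] = static_cast<char>(0xF0 | (c >> 18));
		bytes[1] = static_cast<char>(0x80 | ((c >> 12) & 0x3F));
		bytes[2] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
		bytes[3] = static_cast<char>(0x80 | (c & 0x3F));
		count = 4;
	}

	// Grow the string if the encoded bytes would run past its end.
	const auto end = static_cast<std::size_t>(offset + count);

	if (destination.size() < end) {
		destination.resize(end);
	}

	for (int i = 0; i < count; ++i) {
		destination[static_cast<std::size_t>(offset + i)] = bytes[i];
	}

	offset += count;
}
```

Because the offset is advanced by the encoded length, successive calls with the same offset variable write consecutive code points.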

Universal to-string

#include <Balau/Type/ToString.hpp>

This section outlines a development approach and supporting code in the Balau library for a universal to-string function for each of the supported Unicode encoding string types. These functions are used throughout the Balau library and will propagate to application code through the Balau header files. The implementation allows application developers to define additional to-string function overloads for their own types and for any other types that require custom to-string function implementations.

Overview

The C++ standard library provides a to_string function for several primitive types, defined within the std namespace. Whilst the C++ specification forbids adding function overloads to the std namespace, a to_string function can be defined in the namespace of a user defined class, and the compiler will resolve it via argument-dependent lookup on the argument type of the function call.
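The following sketch illustrates this resolution using only the standard library. The `shapes` namespace, the `Circle` type, and the `describe` function are hypothetical names introduced for the example.

```cpp
#include <string>

namespace shapes {

struct Circle {
	int radius;
};

// Found by argument-dependent lookup because Circle is declared in shapes.
inline std::string to_string(const Circle & circle) {
	return "Circle(" + std::to_string(circle.radius) + ")";
}

} // namespace shapes

// Generic code makes the std overloads visible, then calls to_string unqualified
// so that overloads in the argument's namespace are also considered.
template <typename T> std::string describe(const T & value) {
	using std::to_string;
	return to_string(value);
}
```

Calling `describe(42)` selects `std::to_string(int)`, whereas `describe(shapes::Circle{3})` selects the overload in the `shapes` namespace. Note that generic code needs the using declaration for the primitive types, which is one of the inconveniences that a single universal function avoids.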

The Boost library also provides a boost::lexical_cast<std::string> function which relies on user defined operator << functions to perform the to-string conversion.

The output of each of the above to-string implementations may differ.

In addition,

Unlike other programming languages that have a standard to-string function or method, such as the toString method in Java, C++ does not have a unified standard for a to-string function, nor can it have a standard to-string method as there is no common base class to declare one in. Due to this and the complications described above, the Balau library standardises on the use of a single to-string function for each of the Unicode character encodings.

One possible solution to this requirement was to promote the primitive type to_string functions to the global namespace. This solution was decided against, in order to avoid using a token defined in the std namespace. In addition, three to-string functions (one per Unicode encoding) are required. Consequently, the toString, toString16, and toString32 tokens were chosen instead.

Users of the Balau library may define toString, toString16, and toString32 function overloads for their own custom types. Wrappers for the primitive type std::to_string functions defined in the standard library <string> header are also provided in the <ToString.hpp> header file. Additional overloads for common primitive types and standard containers are also supplied in <ToString.hpp>.

Signatures

The signatures of the universal UTF-8, UTF-16, and UTF-32 to-string functions are:

			std::string toString(const T & value);
			std::u16string toString16(const T & value);
			std::u32string toString32(const T & value);
		

where T is the parameter type.

Usage

To use any of the Balau universal to-string functions, include the <ToString.hpp> header file in your code. As this header is already included by the <BalauException.hpp> header, which is in turn included by the <Logger.hpp> header, use of the logger or of features that throw Balau exceptions will automatically include the <ToString.hpp> header file.

In order to provide universal to-string function overloads to Balau classes and functions for your custom types, it is sufficient to define a toString, toString16, or toString32 function overload in the same namespace as your custom type. C++ argument-dependent lookup will resolve the function overload via the parameter type in the call.
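A minimal sketch of such an overload is shown below. To keep the example self-contained it does not include any Balau headers; the `geometry` namespace, the `Point` type, and the `quoted` function template (a stand-in for a library component that calls toString) are all hypothetical.

```cpp
#include <string>

namespace geometry {

struct Point {
	int x;
	int y;
};

// UTF-8 to-string overload, declared in the same namespace as Point so that
// argument-dependent lookup can find it.
inline std::string toString(const Point & p) {
	return "(" + std::to_string(p.x) + ", " + std::to_string(p.y) + ")";
}

} // namespace geometry

// Stand-in for a library component: it knows nothing about Point, yet the
// unqualified call resolves to geometry::toString at instantiation time.
template <typename T> std::string quoted(const T & value) {
	return "'" + toString(value) + "'";
}
```

With these definitions, `quoted(geometry::Point{1, 2})` produces `'(1, 2)'` without the generic code naming the `geometry` namespace.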

Note that toString, toString16, or toString32 function overloads should not be defined for type aliases, as this prevents the compiler from resolving the correct overload for a particular type. Instead, use the original type within its namespace.
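The reason is that an alias does not introduce a new type, as the following sketch shows. The names `Metres`, `units`, and `Distance` are hypothetical; the suggestion of wrapping the value in a distinct struct is one possible approach rather than a Balau requirement.

```cpp
#include <string>
#include <type_traits>

using Metres = double; // An alias only: Metres and double are the same type.

static_assert(std::is_same_v<Metres, double>);

// Declaring toString(Metres) would therefore declare toString(double), which would
// clash with, or silently replace, the conversion expected for plain doubles.
// A distinct type in its own namespace gets an unambiguous overload instead.
namespace units {

struct Distance {
	double metres;
};

inline std::string toString(const Distance & d) {
	return std::to_string(d.metres) + "m";
}

} // namespace units
```

Here `toString(units::Distance{1.5})` resolves to the `units` overload via argument-dependent lookup and leaves any to-string overload for `double` untouched.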

When calling the toString, toString16, and toString32 functions from a namespace that contains to-string function definitions, either directly or in an intermediate parent namespace, the local definitions hide the global ones during unqualified lookup. It may therefore be necessary to import the functions from the global namespace via a using declaration, in order to ensure the correct overload is picked up from the local context.

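			// Example of using the toString function with a using declaration.

			#include <Balau/Type/ToString.hpp>

			#include <iostream>
			#include <string>

			struct G {};

			std::string toString(G) {
				return "G";
			}

			namespace N {

			class L {};

			// This definition hides the global toString overloads inside N.
			std::string toString(L) {
				return "L";
			}

			void foo() {
				// Make the global overloads visible again in this scope.
				using ::toString;

				std::cout << toString(L())     // Found via ADL in namespace N.
				          << toString(G())     // Found via ADL in the global namespace.
				          << toString(2)       // Requires using declaration.
				          << toString("hello") // Requires using declaration.
				          << "\n";
			}

			} // namespace N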
		

Container to-string

The ToString.hpp header contains a template function toStringHelper. This helper function takes its container type as a template template parameter with a parameter pack of element and policy types, providing a convenient to-string implementation that can be selectively used for container types.

The declaration of the UTF-8 version of the helper function is as follows.

			///
			/// Helper for container to UTF-8 string functions.
			///
			/// This helper function can be used for custom container types if required.
			///
			template <typename ... T, template <typename ...> class C>
			inline std::string toStringHelper(const C<T ...> & c);
		

In order to use these container to-string helper functions, it is sufficient to define a new to-string function that calls the helper function. This can be done manually or via the BALAU_CONTAINERx_TO_STRINGy macros, where x is the number of template parameters that the container accepts (1-5), and y is the Unicode encoding (none for UTF-8, "16" for UTF-16, and "32" for UTF-32). These macros are also provided in the ToString.hpp header.

In order to use these macros, the container must implement begin() const and end() const iterator methods.

The ToString.hpp header contains a set of such functions for each of the standard library containers.
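The manual approach can be sketched as follows. The function `containerToString` is a hypothetical stand-in for toStringHelper with the same template parameter shape, and the `{ a, b }` output format is an assumption made for the example; Balau's actual output format may differ. The macros are not used here because their expansion is not shown in this section.

```cpp
#include <string>
#include <vector>

// Stand-in for the universal to-string overload for int.
inline std::string toString(int value) {
	return std::to_string(value);
}

// Hypothetical stand-in for toStringHelper: converts each element with toString
// and joins the results. Requires begin() const and end() const on the container.
template <typename ... T, template <typename ...> class C>
inline std::string containerToString(const C<T ...> & c) {
	std::string result = "{";
	std::string prefix = " ";

	for (const auto & element : c) {
		result += prefix + toString(element);
		prefix = ", ";
	}

	return result + " }";
}

// Opt in a specific container by defining an overload that calls the helper.
template <typename T, typename A>
inline std::string toString(const std::vector<T, A> & c) {
	return containerToString(c);
}
```

Defining one forwarding overload per container, rather than a single catch-all template, keeps the helper from being selected for types that merely happen to be class templates.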

Parameter pack to-string

These versions of the to-string functions allow two or more input arguments to be converted to strings and concatenated together in a single function call. They are templated parameter pack versions of the universal to-string functions. They each contain a fold expression in order to concatenate string versions of the input arguments.

As with the other predefined universal to-string functions, there are UTF-8, UTF-16, and UTF-32 versions of the parameter pack to-string function.

The complete UTF-8 version of the parameter pack universal to-string function is as follows.

			///
			/// Calls toString on each input argument and concatenates them together to
			/// form a single UTF-8 string.
			///
			template <typename P1, typename P2, typename ... P>
			inline std::string toString(const P1 & p1, const P2 & p2, const P & ... p) {
				return toString(p1) + toString(p2) + (std::string() + ... + toString(p));
			}
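To show how the parameter pack version composes with the single argument overloads, here is a self-contained sketch. The three single argument overloads are simplified stand-ins for those supplied by ToString.hpp; the parameter pack function repeats the definition given above.

```cpp
#include <string>

// Simplified stand-ins for the single argument overloads.
inline std::string toString(int value) {
	return std::to_string(value);
}

inline std::string toString(const char * value) {
	return value;
}

inline std::string toString(const std::string & value) {
	return value;
}

// Parameter pack version: requires at least two arguments, so it never competes
// with the single argument overloads it calls.
template <typename P1, typename P2, typename ... P>
inline std::string toString(const P1 & p1, const P2 & p2, const P & ... p) {
	return toString(p1) + toString(p2) + (std::string() + ... + toString(p));
}
```

A call such as `toString("x = ", 1, ", y = ", 2)` then builds the whole message in one expression. The empty `std::string()` in the binary fold ensures the expression remains valid when the trailing pack is empty.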
		

To-string template class

The ToString.hpp header also contains an additional template class based version of the universal to-string functions. This version consists of a class template declaration plus three specialisations, one for each character type. These classes are useful when the universal to-string function needs to be called from within a class or function template and when the string type is provided by a template argument.

The three specialisations are proxies to the previous sets of toString, toString16, and toString32 function overloads.

The complete UTF-8 version of the templated universal to-string class is as follows.

			///
			/// Convert the supplied object to a std::string by calling toString.
			///
			template <> struct ToString<char> {
				template <typename T> std::string operator () (const T & object) const {
					return toString(object);
				}
			};
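A sketch of the situation these classes address is shown below: a function template that is parameterised on the character type and therefore cannot name toString or toString32 directly. The names `ToStringSketch` and `twice` are hypothetical, and the two single argument overloads are simplified stand-ins for those in ToString.hpp.

```cpp
#include <string>

// Simplified stand-ins for the universal to-string overloads for int.
inline std::string toString(int value) {
	return std::to_string(value);
}

inline std::u32string toString32(int value) {
	const std::string narrow = std::to_string(value);
	// The digits and the minus sign are ASCII, so widening each char is sufficient.
	return std::u32string(narrow.begin(), narrow.end());
}

// Hypothetical equivalent of the to-string class template and two specialisations.
template <typename CharT> struct ToStringSketch;

template <> struct ToStringSketch<char> {
	template <typename T> std::string operator () (const T & object) const {
		return toString(object);
	}
};

template <> struct ToStringSketch<char32_t> {
	template <typename T> std::u32string operator () (const T & object) const {
		return toString32(object);
	}
};

// A component templated on the character type selects the matching conversion.
template <typename CharT> std::basic_string<CharT> twice(int value) {
	const auto text = ToStringSketch<CharT>()(value);
	return text + text;
}
```

Here `twice<char>(12)` yields a std::string and `twice<char32_t>(12)` yields a std::u32string, with the choice of to-string function made entirely by the template argument.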
		

Custom allocation

In addition to the std::basic_string<CharT> based to-string functions, the ToString.hpp header file includes a parallel set of templated to-string functions that accept a custom allocator. The goal of these alternative functions is to provide suitable to-string function implementations for components that use a std::basic_string<CharT, std::char_traits<CharT>, AllocatorT> string type.

The notable usage of these templated to-string functions is in the logging system. When optionally enabled, the logging system uses a custom allocator that allocates from statically allocated thread-local buffers. In this configuration, the logging system uses the to-string functions that accept an allocator template argument.
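The general shape of an allocator-aware to-string function can be sketched with the standard library alone. Everything here is an assumption made for illustration: the names `U8StringSketch`, `toStringWithAllocator`, and `exampleLength` are hypothetical, the signature is not Balau's, and a stack buffer managed by `std::pmr::monotonic_buffer_resource` is swapped in for the logging system's thread-local buffers.

```cpp
#include <array>
#include <charconv>
#include <cstddef>
#include <memory_resource>
#include <string>

// Hypothetical alias for a UTF-8 string type that uses a custom allocator.
template <typename AllocatorT>
using U8StringSketch = std::basic_string<char, std::char_traits<char>, AllocatorT>;

// Hypothetical allocator-aware to-string function for int. The digits are formatted
// into a local array, then copied into a string constructed with the allocator.
template <typename AllocatorT>
inline U8StringSketch<AllocatorT> toStringWithAllocator(const AllocatorT & allocator, int value) {
	char buffer[16];
	const auto result = std::to_chars(buffer, buffer + sizeof(buffer), value);
	return U8StringSketch<AllocatorT>(buffer, result.ptr, allocator);
}

// Usage: any allocation made by the string is served from storage first.
inline std::size_t exampleLength() {
	std::array<std::byte, 256> storage;
	std::pmr::monotonic_buffer_resource resource(storage.data(), storage.size());
	const std::pmr::polymorphic_allocator<char> allocator(&resource);
	const auto text = toStringWithAllocator(allocator, 12345);
	return text.size();
}
```

The allocator is part of the string's type, which is why a parallel set of function templates is needed rather than a conversion from std::string.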

Universal from-string

#include <Balau/Type/FromString.hpp>

This section outlines a development approach and supporting code in the Balau library for a universal from-string function for each of the supported Unicode encoding string types. The universal from-string functions provide a standard way to define string to object type conversions that can be used within library components. Similarly to the universal to-string functions, these functions are used throughout the Balau library, will propagate to application code, and application developers can define additional from-string function overloads for their own types.

Signatures

The signatures of the universal UTF-8, UTF-16, and UTF-32 from-string functions are:

			void fromString(T & destination, std::string_view value);
			void fromString16(T & destination, std::u16string_view value);
			void fromString32(T & destination, std::u32string_view value);
		

where T is the parameter type.

The type T must be copy assignable, move assignable, have public mutable fields, or must provide suitable setter methods in order to be used in a from-string function overload. If this is not the case, the type is unsuitable to have a universal from-string function overload and a custom named from-string function (with a different signature) should be created instead.

Usage

To use any of the Balau universal from-string functions, include the <FromString.hpp> header file in your code.

In order to provide universal from-string function overloads to Balau classes and functions for your custom types, it is sufficient to define a fromString, fromString16, or fromString32 function overload in the same namespace as your custom type. C++ argument-dependent lookup will resolve the function overload via the parameter type in the call.
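A minimal sketch of a custom from-string overload follows. It is self-contained and does not include Balau headers; the `geometry` namespace and the `Width` type are hypothetical, and `std::invalid_argument` is used as a stand-in for whichever exception type the application prefers on conversion failure.

```cpp
#include <charconv>
#include <stdexcept>
#include <string_view>
#include <system_error>

namespace geometry {

struct Width {
	int pixels = 0;
};

// UTF-8 from-string overload, declared in the same namespace as Width so that
// argument-dependent lookup can find it. The destination is only modified when
// the whole input parses as an integer.
inline void fromString(Width & destination, std::string_view value) {
	int parsed = 0;
	const char * end = value.data() + value.size();
	const auto result = std::from_chars(value.data(), end, parsed);

	if (result.ec != std::errc() || result.ptr != end) {
		throw std::invalid_argument("not an integer width");
	}

	destination.pixels = parsed;
}

} // namespace geometry
```

Because the destination is passed by reference, the overload is selected by the type of the first argument, so a call such as `fromString(width, "640")` needs no explicit template argument or namespace qualification.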

When calling the fromString, fromString16, and fromString32 functions from a namespace that contains from-string function definitions, either directly or in an intermediate parent namespace, the local definitions hide the global ones during unqualified lookup. It may therefore be necessary to import the functions from the global namespace via a using declaration, in order to ensure the correct overload is picked up from the local context.

From-string template class

The FromString.hpp header also contains an additional template class based version of the universal from-string functions. This version consists of a class template declaration plus three specialisations, one for each character type. These classes are useful when the universal from-string function needs to be called from within a class or function template and when the string type is provided by a template argument.

The three specialisations are proxies to the previous sets of fromString, fromString16, and fromString32 function overloads.

The complete UTF-8 version of the templated universal from-string class is as follows.

			///
			/// UTF-8 specialisation of FromString<T>.
			///
			/// Converts the supplied std::string to an object of type T by calling fromString.
			///
			template <> struct FromString<char> {
				///
				/// @param destination the destination value that is set via assignment
				/// @param value the string input
				/// @throw ConversionException when the conversion fails
				///
				template <typename T>
				void operator () (T & destination, const std::string & value) const {
					fromString(destination, value);
				}
			};
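The following sketch shows how such a class might be used from a function template whose string type is a template argument. The names `FromStringSketch` and `parse` are hypothetical, the single overload for int is a simplified stand-in for the one in FromString.hpp, and error handling is deliberately omitted: on a failed parse the destination simply keeps its initial value.

```cpp
#include <charconv>
#include <string>
#include <string_view>

// Simplified stand-in for the universal from-string overload for int.
// std::from_chars leaves destination unmodified when parsing fails.
inline void fromString(int & destination, std::string_view value) {
	std::from_chars(value.data(), value.data() + value.size(), destination);
}

// Hypothetical equivalent of the from-string class template and its UTF-8
// specialisation.
template <typename CharT> struct FromStringSketch;

template <> struct FromStringSketch<char> {
	template <typename T>
	void operator () (T & destination, const std::string & value) const {
		fromString(destination, value);
	}
};

// A component templated on the character type selects the matching conversion.
// T must be default constructible for this particular sketch.
template <typename T, typename CharT>
T parse(const std::basic_string<CharT> & text) {
	T value {};
	FromStringSketch<CharT>()(value, text);
	return value;
}
```

A call such as `parse<int>(std::string("42"))` deduces the character type from the string argument and dispatches to the UTF-8 specialisation.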