utf8proc Reference




Options for the mapping functions


NULLTERM
Indicates that string arguments are terminated with a 0x00 byte, and that the passed length parameter should be ignored. (This flag is only used in the C library.)

STABLE
Respects Unicode's versioning stability policy. You have to use this option if you want to do normalization according to Unicode Standard Annex #15.

COMPAT
Decomposes compatibility characters during normalization. For example, the ffi ligature (U+FB03) is converted into the three characters f, f, i.

COMPOSE
Selects NFC or NFKC normalization (depending on the COMPAT flag). Characters are preferably encoded in their composed form. (This is what you should use for the web.)

DECOMPOSE
Selects NFD or NFKD normalization (depending on the COMPAT flag). All composed characters will be transformed into their decomposed sequences.

IGNORE
Strip "default ignorable" characters, e.g. soft-hyphens or variation selectors.

REJECTNA
Reject any input which contains unassigned Unicode codepoints.

NLF2LS
Converts New Line Function (NLF) characters (LF, CR, CRLF, NEL) into Unicode line separators (LS, U+2028).

NLF2PS
Converts New Line Function (NLF) characters (LF, CR, CRLF, NEL) into Unicode paragraph separators (PS, U+2029).

NLF2LF
Converts any New Line Function (NLF) character (LF, CR, CRLF, NEL) into a line feed (LF) character.

STRIPCC
Control characters (general Unicode category 'Cc') are stripped or replaced by "Space" (depending on the character). The NLF2xx options take priority, though.

CASEFOLD
Returns a case-folded version of the string. A string that was normalized before may no longer be normalized after case folding, unless one of the COMPOSE or DECOMPOSE options is set too. The case-folded version can be understood as a "hash": you should not use it for output, but you can use it for comparisons, to see whether strings are equal in a case-insensitive way.

CHARBOUND
Any complete character ("grapheme cluster", which can consist of multiple code points, each of which can consist of multiple bytes) will be preceded by a 0xFF byte. If the first character is incomplete, the string will not start with a 0xFF byte.
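
Because the byte 0xFF never occurs in well-formed UTF-8, the markers can be told apart from the text itself. The following is a minimal sketch of how the marked characters in such output might be counted, using only the C standard library. The helper name 'count_marked_chars' is hypothetical and not part of utf8proc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf8proc): counts the 0xFF boundary
 * markers in a byte buffer that was produced with the CHARBOUND option.
 * Each marker precedes one complete grapheme cluster, so the result is
 * the number of complete characters in the buffer. */
size_t count_marked_chars(const uint8_t *s, size_t len) {
    size_t i, count = 0;
    for (i = 0; i < len; i++) {
        if (s[i] == 0xFF) {
            count++;
        }
    }
    return count;
}
```

The same loop can be adapted to split the buffer: the bytes between one 0xFF marker and the next form one grapheme cluster.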

LUMP
Certain characters are mapped, so that you can compare strings in a reasonable way. For example, "Hyphen" U+2010 and "Minus" U+2212 are both mapped to ASCII "Hyphen-Minus" U+002D. See "lump.txt" for details of the mappings. If this option is used together with NLF2LF, paragraph and line separators will be mapped to line feed (LF).

STRIPMARK
This option can only be used together with COMPOSE or DECOMPOSE. It strips all character marks (for example accents or diaereses) from the string.

C-Library Functions


const char *utf8proc_errmsg(ssize_t errcode)
Returns a static error string for the given error code.

ssize_t utf8proc_iterate(uint8_t *str, ssize_t strlen, int32_t *dst)
Reads a single char from the UTF-8 sequence pointed to by 'str'. The maximum number of bytes read is 'strlen', unless 'strlen' is negative. If a valid Unicode char could be read, it is stored in the variable pointed to by 'dst'; otherwise that variable is set to -1. In case of success the number of bytes read is returned, otherwise a negative error code is returned.
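
To make the return convention concrete, here is a minimal, self-contained sketch of a decoder that follows the contract described above. It is not the library's implementation: the name 'sketch_iterate', the error value 'SKETCH_ERR_INVALID' and the use of 'long' in place of 'ssize_t' are assumptions made for illustration, and exactly which inputs the real function rejects may differ.

```c
#include <stdint.h>

#define SKETCH_ERR_INVALID (-1) /* hypothetical; the library has its own codes */

/* Hypothetical stand-in for the contract described above: decodes one code
 * point from 'str', reading at most 'len' bytes (no limit if 'len' is
 * negative). Stores the code point in '*dst' and returns the number of
 * bytes read, or stores -1 and returns a negative value on failure. */
long sketch_iterate(const uint8_t *str, long len, int32_t *dst) {
    /* smallest code point that legitimately needs 1, 2, 3 or 4 bytes */
    static const int32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    int32_t uc;
    long need, i;

    *dst = -1;
    if (len == 0) return SKETCH_ERR_INVALID;

    /* The lead byte determines the sequence length. */
    if (str[0] < 0x80)                { uc = str[0];        need = 1; }
    else if ((str[0] & 0xE0) == 0xC0) { uc = str[0] & 0x1F; need = 2; }
    else if ((str[0] & 0xF0) == 0xE0) { uc = str[0] & 0x0F; need = 3; }
    else if ((str[0] & 0xF8) == 0xF0) { uc = str[0] & 0x07; need = 4; }
    else return SKETCH_ERR_INVALID;

    if (len > 0 && need > len) return SKETCH_ERR_INVALID;

    /* Each continuation byte has the form 10xxxxxx and adds six bits. */
    for (i = 1; i < need; i++) {
        if ((str[i] & 0xC0) != 0x80) return SKETCH_ERR_INVALID;
        uc = (uc << 6) | (str[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates and values above 0x10FFFF. */
    if (uc < min_for_len[need] || uc > 0x10FFFF ||
        (uc >= 0xD800 && uc <= 0xDFFF)) {
        return SKETCH_ERR_INVALID;
    }

    *dst = uc;
    return need;
}
```

A caller would typically loop, advancing the input pointer by the returned byte count until the input is exhausted or a negative value is returned.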

ssize_t utf8proc_encode_char(int32_t uc, uint8_t *dst)
Encodes the Unicode char with the code point 'uc' as a UTF-8 string in the byte array pointed to by 'dst'. This array has to be at least 4 bytes long. In case of success the number of bytes written is returned, otherwise 0. This function does not check if 'uc' is a valid Unicode code point.
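
The 4-byte requirement follows from the UTF-8 encoding itself: code points up to 0x10FFFF need at most four bytes. The following minimal sketch shows the bit layout. The function 'sketch_encode_char' is a hypothetical stand-in written for this illustration, not the library's code, and it uses 'size_t' for the return value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in (not the library's implementation): encodes the
 * code point 'uc' as UTF-8 into 'dst', which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if 'uc' is outside the range
 * 0x0000..0x10FFFF. Like the function described above, it does not check
 * whether 'uc' is otherwise a valid code point (e.g. surrogates pass). */
size_t sketch_encode_char(int32_t uc, uint8_t *dst) {
    if (uc < 0) {
        return 0;
    } else if (uc < 0x80) {             /* 0xxxxxxx */
        dst[0] = (uint8_t)uc;
        return 1;
    } else if (uc < 0x800) {            /* 110xxxxx 10xxxxxx */
        dst[0] = (uint8_t)(0xC0 | (uc >> 6));
        dst[1] = (uint8_t)(0x80 | (uc & 0x3F));
        return 2;
    } else if (uc < 0x10000) {          /* 1110xxxx 10xxxxxx 10xxxxxx */
        dst[0] = (uint8_t)(0xE0 | (uc >> 12));
        dst[1] = (uint8_t)(0x80 | ((uc >> 6) & 0x3F));
        dst[2] = (uint8_t)(0x80 | (uc & 0x3F));
        return 3;
    } else if (uc < 0x110000) {         /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        dst[0] = (uint8_t)(0xF0 | (uc >> 18));
        dst[1] = (uint8_t)(0x80 | ((uc >> 12) & 0x3F));
        dst[2] = (uint8_t)(0x80 | ((uc >> 6) & 0x3F));
        dst[3] = (uint8_t)(0x80 | (uc & 0x3F));
        return 4;
    }
    return 0;
}
```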

const utf8proc_property_t *utf8proc_get_property(int32_t uc)
Returns a pointer to a (constant) struct containing information about the Unicode char with the given code point 'uc'. If the character does not exist, a pointer to a special struct is returned, where 'category' is a NULL pointer.
WARNING: The parameter 'uc' has to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash!

ssize_t utf8proc_decompose_char(int32_t uc, int32_t *dst, ssize_t bufsize, int options, int *last_boundclass)
Writes a decomposition of the Unicode char 'uc' into the array pointed to by 'dst'. If the CHARBOUND option is used, the pointer 'last_boundclass' has to point to an integer variable which stores the last character boundary class. In case of success the number of chars written is returned; in case of an error, a negative error code is returned. If the number of written chars would be bigger than 'bufsize', the buffer (up to 'bufsize') contains unpredictable data, and the needed buffer size is returned.
WARNING: The parameter 'uc' has to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash!

ssize_t utf8proc_decompose(uint8_t *str, ssize_t strlen, int32_t *buffer, ssize_t bufsize, int options)
Does the same as 'utf8proc_decompose_char', but acts on a whole UTF-8 string, and orders the decomposed sequences correctly. If the NULLTERM flag in 'options' is set, processing stops when a NULL byte is encountered; otherwise 'strlen' bytes are processed. The result, in the form of Unicode code points, is written into the buffer pointed to by 'buffer', which has a length of 'bufsize' entries. In case of success the number of chars written is returned; in case of an error, a negative error code is returned. If the number of written chars would be bigger than 'bufsize', the buffer (up to 'bufsize') contains unpredictable data, and the needed buffer size is returned.
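
Because the needed size is returned even when the buffer is too small, a caller can ask for the size first and allocate afterwards. The following minimal sketch shows that two-pass pattern. To stay self-contained it does not call utf8proc: 'toy_decompose' is a hypothetical stand-in that only knows one decomposition and merely mimics the return convention described above, and 'decompose_alloc' is likewise a name made up for this example. 'long' is used in place of 'ssize_t'.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a decompose-style function. It knows a single
 * mapping (U+00E9 -> U+0065 U+0301), writes at most 'bufsize' entries,
 * and always returns the number of entries the full result needs. */
long toy_decompose(const int32_t *src, long n, int32_t *buffer, long bufsize) {
    long i, out = 0;
    for (i = 0; i < n; i++) {
        if (src[i] == 0x00E9) {
            if (out + 2 <= bufsize) {
                buffer[out] = 0x0065;      /* e */
                buffer[out + 1] = 0x0301;  /* combining acute accent */
            }
            out += 2;
        } else {
            if (out + 1 <= bufsize) buffer[out] = src[i];
            out += 1;
        }
    }
    return out;
}

/* Two-pass pattern: query the size with an empty buffer, allocate, then
 * run again. Returns a malloc'ed array (caller frees) or NULL on failure. */
int32_t *decompose_alloc(const int32_t *src, long n, long *out_len) {
    int32_t *buf;
    long written;
    long needed = toy_decompose(src, n, NULL, 0);   /* pass 1: size only */
    if (needed < 0) return NULL;

    buf = malloc((size_t)(needed > 0 ? needed : 1) * sizeof *buf);
    if (buf == NULL) return NULL;

    written = toy_decompose(src, n, buf, needed);   /* pass 2: real output */
    if (written < 0 || written > needed) {
        free(buf);
        return NULL;
    }
    *out_len = written;
    return buf;
}
```

The check after the second pass matters: a return value larger than the buffer size means the buffer contents cannot be trusted.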

ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options)
Reencodes the sequence of Unicode characters given by the pointer 'buffer' and 'length' as UTF-8. The result is stored in the same memory area from which the data is read. The following flags in the 'options' field are regarded: NLF2LS, NLF2PS, NLF2LF, STRIPCC, COMPOSE and STABLE. In case of success the length of the resulting UTF-8 string is returned, otherwise a negative error code is returned.
WARNING: The amount of free space pointed to by 'buffer' has to exceed the amount of the input data by one byte, and the entries of the array pointed to by 'buffer' have to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash!

ssize_t utf8proc_map(uint8_t *str, ssize_t strlen, uint8_t **dstptr, int options)
This is probably the function of most interest to you.
Maps the given UTF-8 string pointed to by 'str' to a new UTF-8 string, which is allocated dynamically; afterwards the pointer that 'dstptr' points to is set to the new string. If the NULLTERM flag in the 'options' field is set, the length is determined by a NULL terminator; otherwise the parameter 'strlen' is evaluated to determine the string length. In either case the result will be NULL terminated (though it might contain NULL characters before the terminator). Other flags in the 'options' field are passed to the functions defined above, and regarded as described. In case of success the length of the new string is returned, otherwise a negative error code is returned.
NOTICE: The memory of the new UTF-8 string will have been allocated with 'malloc', and therefore has to be freed with 'free'.

uint8_t *utf8proc_NFD(uint8_t *str)
uint8_t *utf8proc_NFC(uint8_t *str)
uint8_t *utf8proc_NFKD(uint8_t *str)
uint8_t *utf8proc_NFKC(uint8_t *str)
Returns a pointer to newly allocated memory containing an NFD, NFC, NFKD or NFKC normalized version of the null-terminated string 'str'.

Ruby-Library


Utf8Proc::SpecialChars
A hash containing some strings representing special control characters: HT, LF, VT, FF, CR, FS, GS, RS, US, LS, PS.

String#utf8map(*option_array)
Returns a string which is transformed in a way determined by the given options. The options are symbols like :compose or :ignore, named as described at the beginning of this document, but in lower case.

String#utf8map!(*option_array)
Same as utf8map, but modifies the current string, instead of creating a new one.

String#utf8nfd
String#utf8nfc
String#utf8nfkd
String#utf8nfkc
Applies NFD, NFC, NFKD or NFKC normalization.

String#utf8nfd!
String#utf8nfc!
String#utf8nfkd!
String#utf8nfkc!
Same as above, but replaces the current string instead of creating a new one.

String#utf8chars
Returns an array consisting of all "grapheme clusters" (according to Unicode Standard Annex #29). Grapheme clusters are "real characters", which can consist of multiple code points, which themselves consist of multiple bytes.

String#char_ary
Same as String#utf8chars, but deprecated.

Integer#utf8
Transforms an integer into a UTF-8 string representing the Unicode code point given by the integer.
