regexps.com
This chapter describes the foundation of support for the Unicode character set in the Hackerlab C library.
This chapter is not a tutorial introduction to Unicode. We presume that readers are already somewhat familiar with Unicode. A very brief introduction can be found in An Absurdly Brief Introduction to Unicode.
enum uni_encoding_schemes;
Values of the enumerated type uni_encoding_schemes are used in interfaces throughout the Hackerlab C library to identify encoding schemes for strings or streams of Unicode characters. (See An Absurdly Brief Introduction to Unicode.)
enum uni_encoding_schemes
{
  uni_iso8859_1,
  uni_utf8,
  uni_utf16,
  uni_utf16be,
  uni_utf16le,
};
uni_iso8859_1 refers to a degenerate encoding scheme. Each character is stored in one byte. Only characters in the range U+0000 .. U+00FF can be represented.
uni_utf8 refers to the UTF-8 encoding scheme.
uni_utf16 refers to UTF-16 in the native byte order of the machine.
uni_utf16be refers to UTF-16, explicitly in big-endian order.
uni_utf16le refers to UTF-16, explicitly in little-endian order.
Some low-level functions in the Hackerlab C library work with any of these five encodings. Higher-level functions work only with uni_iso8859_1, uni_utf8, and uni_utf16.
Code units in a uni_utf8 string are of type t_uchar (unsigned 8-bit integer). Code units in a uni_utf16 string are of type t_unichar (unsigned 16-bit integer). Unicode code points are of type t_unicode. (See Machine-Specific Definitions.)
The Hackerlab C Library is designed to operate correctly for programs which internally use any combination of the encodings iso8859-1, utf-8, and utf-16. (Future releases are likely to add support for utf-32.)
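As an illustration of how the encoding scheme determines the code unit type, here is a minimal sketch. It declares local stand-ins for t_uchar and t_unichar (the fixed-width typedefs are an assumption; the library defines its own in Machine-Specific Definitions), and example_code_unit_size is a hypothetical helper, not a library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles without the Hackerlab headers. */
typedef uint8_t  t_uchar;    /* assumed: unsigned 8-bit code unit  */
typedef uint16_t t_unichar;  /* assumed: unsigned 16-bit code unit */

enum uni_encoding_schemes
{
  uni_iso8859_1,
  uni_utf8,
  uni_utf16,
  uni_utf16be,
  uni_utf16le
};

/* Hypothetical helper: size in bytes of one code unit in ENCODING. */
size_t
example_code_unit_size (enum uni_encoding_schemes encoding)
{
  switch (encoding)
    {
    case uni_iso8859_1:
    case uni_utf8:
      return sizeof (t_uchar);
    case uni_utf16:
    case uni_utf16be:
    case uni_utf16le:
      return sizeof (t_unichar);
    }
  assert (0 && "unknown encoding scheme");
  return 0;
}
```

A caller could use such a helper to convert a length in code units into a length in bytes.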
typedef struct uni__undefined_struct * uni_string;
The type uni_string is a pointer to a value of unknown size. It is used to represent the address of a Unicode string or an address within a Unicode string.
Any two uni_string pointers may be compared for equality. uni_string pointers within a single string may be compared using any relational operator (<, >, etc.).
uni_string pointers are created from UTF-8 pointers (t_uchar *) and from UTF-16 pointers (t_unichar *) by means of a cast:
uni_string s = (uni_string)utf_8_string;
uni_string t = (uni_string)utf_16_string;
By convention, all functions that operate on Unicode strings accept two parameters for each string, an encoding scheme and a string pointer, as in this function declaration:
void uni_fn (enum uni_encoding_schemes encoding,
             uni_string s);
By convention, the length of a Unicode string is always measured in code units, no matter what the size of those code units. Integer string indexes are also measured in code units.
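The two conventions above (an encoding parameter paired with a uni_string, and lengths counted in code units) can be sketched as follows. This is only an illustration: the typedefs are local stand-ins, example_length_in_code_units is a hypothetical helper, and zero-termination of the sample strings is an assumption made for the example, not something the library requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  t_uchar;    /* assumed stand-in */
typedef uint16_t t_unichar;  /* assumed stand-in */
typedef struct uni__undefined_struct * uni_string;

enum uni_encoding_schemes
{
  uni_iso8859_1, uni_utf8, uni_utf16, uni_utf16be, uni_utf16le
};

/* Hypothetical helper: number of code units before the terminating
   zero code unit.  A zero code unit is zero in either byte order, so
   the three UTF-16 schemes can share one loop. */
size_t
example_length_in_code_units (enum uni_encoding_schemes encoding,
                              uni_string s)
{
  size_t n = 0;

  assert (s != 0);
  if (encoding == uni_iso8859_1 || encoding == uni_utf8)
    {
      const t_uchar * p = (const t_uchar *)s;
      while (p[n] != 0)
        ++n;
    }
  else
    {
      const t_unichar * p = (const t_unichar *)s;
      while (p[n] != 0)
        ++n;
    }
  return n;
}

/* Sample data: U+1F600 followed by U+0041, zero-terminated. */
t_uchar   example_utf8[]  = { 0xF0, 0x9F, 0x98, 0x80, 0x41, 0 };
t_unichar example_utf16[] = { 0xD83D, 0xDE00, 0x0041, 0 };
```

Both sample strings hold the same two code points, yet the UTF-8 form has five code units and the UTF-16 form has three, which is why lengths and indexes only make sense together with the encoding.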
These functions were not ready for the current release of the Hackerlab C Library. They will be included in future releases.
The functions and macros in this chapter present programs with an interface to various properties extracted from the Unicode Character Database as published by the Unicode Consortium.
For information about the version of the database used and the implications of using these functions on program size, see Data Sheet for the Hackerlab Unicode Database.
Function
unidata_is_assigned_code_point
int unidata_is_assigned_code_point (t_unicode c);
Return 1 if c is an assigned code point, 0 otherwise.
A code point is assigned if it has an entry in unidata.txt or is part of a range of characters whose end-points are defined in unidata.txt.
Type
enum uni_general_category
enum uni_general_category;
The General Category of a Unicode character is represented by an enumerated value of this type.
The primary category values are:
uni_general_category_Lu Letter, uppercase
uni_general_category_Ll Letter, lowercase
uni_general_category_Lt Letter, titlecase
uni_general_category_Lm Letter, modifier
uni_general_category_Lo Letter, other
uni_general_category_Mn Mark, nonspacing
uni_general_category_Mc Mark, spacing combining
uni_general_category_Me Mark, enclosing
uni_general_category_Nd Number, decimal digit
uni_general_category_Nl Number, letter
uni_general_category_No Number, other
uni_general_category_Zs Separator, space
uni_general_category_Zl Separator, line
uni_general_category_Zp Separator, paragraph
uni_general_category_Cc Other, control
uni_general_category_Cf Other, format
uni_general_category_Cs Other, surrogate
uni_general_category_Co Other, private use
uni_general_category_Cn Other, not assigned
uni_general_category_Pc Punctuation, connector
uni_general_category_Pd Punctuation, dash
uni_general_category_Ps Punctuation, open
uni_general_category_Pe Punctuation, close
uni_general_category_Pi Punctuation, initial quote
uni_general_category_Pf Punctuation, final quote
uni_general_category_Po Punctuation, other
uni_general_category_Sm Symbol, math
uni_general_category_Sc Symbol, currency
uni_general_category_Sk Symbol, modifier
uni_general_category_So Symbol, other
Seven additional synthetic categories are defined. These are:
uni_general_category_L Letter
uni_general_category_M Mark
uni_general_category_N Number
uni_general_category_Z Separator
uni_general_category_C Other
uni_general_category_P Punctuation
uni_general_category_S Symbol
No character is given a synthetic category as its general category. Rather, the synthetic categories are used in some interfaces to refer to all characters having a general category within one of the synthetic categories.
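One way to picture the relationship between primary and synthetic categories is a mapping from each primary value to the synthetic value that contains it. The sketch below assumes a local copy of the enumeration (the enumerator names come from the lists above, but their order and numeric values here are assumptions), and both example_ functions are hypothetical helpers rather than library interfaces.

```c
#include <assert.h>

/* Local stand-in: names from the text, values assumed. */
enum uni_general_category
{
  uni_general_category_Lu, uni_general_category_Ll, uni_general_category_Lt,
  uni_general_category_Lm, uni_general_category_Lo,
  uni_general_category_Mn, uni_general_category_Mc, uni_general_category_Me,
  uni_general_category_Nd, uni_general_category_Nl, uni_general_category_No,
  uni_general_category_Zs, uni_general_category_Zl, uni_general_category_Zp,
  uni_general_category_Cc, uni_general_category_Cf, uni_general_category_Cs,
  uni_general_category_Co, uni_general_category_Cn,
  uni_general_category_Pc, uni_general_category_Pd, uni_general_category_Ps,
  uni_general_category_Pe, uni_general_category_Pi, uni_general_category_Pf,
  uni_general_category_Po,
  uni_general_category_Sm, uni_general_category_Sc, uni_general_category_Sk,
  uni_general_category_So,
  /* synthetic categories */
  uni_general_category_L, uni_general_category_M, uni_general_category_N,
  uni_general_category_Z, uni_general_category_C, uni_general_category_P,
  uni_general_category_S
};

/* Hypothetical helper: the synthetic category containing CAT.
   A synthetic category maps to itself. */
enum uni_general_category
example_synthetic_category (enum uni_general_category cat)
{
  switch (cat)
    {
    case uni_general_category_Lu: case uni_general_category_Ll:
    case uni_general_category_Lt: case uni_general_category_Lm:
    case uni_general_category_Lo:
      return uni_general_category_L;
    case uni_general_category_Mn: case uni_general_category_Mc:
    case uni_general_category_Me:
      return uni_general_category_M;
    case uni_general_category_Nd: case uni_general_category_Nl:
    case uni_general_category_No:
      return uni_general_category_N;
    case uni_general_category_Zs: case uni_general_category_Zl:
    case uni_general_category_Zp:
      return uni_general_category_Z;
    case uni_general_category_Cc: case uni_general_category_Cf:
    case uni_general_category_Cs: case uni_general_category_Co:
    case uni_general_category_Cn:
      return uni_general_category_C;
    case uni_general_category_Pc: case uni_general_category_Pd:
    case uni_general_category_Ps: case uni_general_category_Pe:
    case uni_general_category_Pi: case uni_general_category_Pf:
    case uni_general_category_Po:
      return uni_general_category_P;
    case uni_general_category_Sm: case uni_general_category_Sc:
    case uni_general_category_Sk: case uni_general_category_So:
      return uni_general_category_S;
    default:
      return cat;
    }
}

/* Hypothetical helper: does a character whose (primary) general
   category is CAT match WANTED, where WANTED may be primary or
   synthetic? */
int
example_category_matches (enum uni_general_category cat,
                          enum uni_general_category wanted)
{
  /* A character's own category is never synthetic. */
  assert (example_synthetic_category (cat) != cat);
  return cat == wanted || example_synthetic_category (cat) == wanted;
}
```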
Function
unidata_general_category
enum uni_general_category unidata_general_category (t_unicode c);
Return the general category of c.
The category returned for unassigned code points is uni_general_category_Cn (Other, Not Assigned).
Function
unidata_decimal_digit_value
int unidata_decimal_digit_value (t_unicode c);
If c is a decimal digit (regardless of script), return its digit value. Otherwise, return -1.
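A script-independent digit value makes it possible to parse numbers written with any decimal digits. The sketch below is an assumption-laden illustration: example_decimal_digit_value is a small stand-in that follows the same contract as unidata_decimal_digit_value but knows only three digit ranges (ASCII, Arabic-Indic, Devanagari), t_unicode is assumed to be a 32-bit unsigned integer, and example_parse_decimal is a hypothetical helper that does not guard against overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t t_unicode;  /* assumed stand-in */

/* Stand-in with the same contract as unidata_decimal_digit_value,
   limited to three digit ranges for the sake of the example. */
int
example_decimal_digit_value (t_unicode c)
{
  /* Code point of digit zero in: ASCII, Arabic-Indic, Devanagari. */
  static const t_unicode zero_digits[] = { 0x0030, 0x0660, 0x0966 };
  size_t i;

  for (i = 0; i < sizeof zero_digits / sizeof zero_digits[0]; ++i)
    if (c >= zero_digits[i] && c <= zero_digits[i] + 9)
      return (int)(c - zero_digits[i]);
  return -1;
}

/* Hypothetical helper: value of the leading run of decimal digits in
   S (LEN code points), or -1 if S does not start with a digit. */
long
example_parse_decimal (const t_unicode * s, size_t len)
{
  long value = 0;
  size_t i;

  assert (s != 0 || len == 0);
  for (i = 0; i < len; ++i)
    {
      int d = example_decimal_digit_value (s[i]);
      if (d < 0)
        break;
      value = value * 10 + d;
    }
  return (i == 0) ? -1 : value;
}

/* Sample data: Arabic-Indic one, Arabic-Indic two, ASCII '3', 'A'. */
const t_unicode example_mixed_digits[] = { 0x0661, 0x0662, 0x0033, 0x0041 };
```

With the library available, the stand-in would be replaced by a call to unidata_decimal_digit_value.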
Type
enum uni_bidi_category
enum uni_bidi_category;
The Bidirectional Category of a Unicode character is represented by an enumerated value of this type.
The bidi category values are:
uni_bidi_L Left-to-Right
uni_bidi_LRE Left-to-Right Embedding
uni_bidi_LRO Left-to-Right Override
uni_bidi_R Right-to-Left
uni_bidi_AL Right-to-Left Arabic
uni_bidi_RLE Right-to-Left Embedding
uni_bidi_RLO Right-to-Left Override
uni_bidi_PDF Pop Directional Format
uni_bidi_EN European Number
uni_bidi_ES European Number Separator
uni_bidi_ET European Number Terminator
uni_bidi_AN Arabic Number
uni_bidi_CS Common Number Separator
uni_bidi_NSM Non-Spacing Mark
uni_bidi_BN Boundary Neutral
uni_bidi_B Paragraph Separator
uni_bidi_S Segment Separator
uni_bidi_WS Whitespace
uni_bidi_ON Other Neutrals
Function
unidata_bidi_category
enum uni_bidi_category unidata_bidi_category (t_unicode c);
Return the bidirectional category of c.
The category returned for unassigned code points is uni_bidi_ON (other neutrals).
int unidata_is_mirrored (t_unicode c);
Return 1 if c is mirrored in bidirectional text, 0 otherwise.
Macro
unidata_canonical_combining_class
#define unidata_canonical_combining_class(C)
Return the canonical combining class of a Unicode character.
Combining classes are represented as unsigned 8-bit integers.
These functions use the case mappings in unidata.txt.
t_unicode unidata_to_upper (t_unicode c);
If c has a default uppercase mapping, return that mapping. Otherwise, return c.
t_unicode unidata_to_lower (t_unicode c);
If c has a default lowercase mapping, return that mapping. Otherwise, return c.
t_unicode unidata_to_title (t_unicode c);
If c has a default titlecase mapping, return that mapping. Otherwise, return c.
Type
enum uni_decomposition_type
enum uni_decomposition_type;
The decomposition mapping of a character is described by values of this enumerated type:
uni_decomposition_none
uni_decomposition_canonical
uni_decomposition_font
uni_decomposition_noBreak
uni_decomposition_initial
uni_decomposition_medial
uni_decomposition_final
uni_decomposition_isolated
uni_decomposition_circle
uni_decomposition_super
uni_decomposition_sub
uni_decomposition_vertical
uni_decomposition_wide
uni_decomposition_narrow
uni_decomposition_small
uni_decomposition_square
uni_decomposition_fraction
uni_decomposition_compat
The value uni_decomposition_none indicates that a character has no decomposition mapping.
Type
struct uni_decomposition_mapping
struct uni_decomposition_mapping;
A character's decomposition mapping is described by this structure. It has the fields:
enum uni_decomposition_type type;
t_unicode * decomposition;
type is the type of decomposition.
If type is not uni_decomposition_none, then decomposition is a 0-terminated array of code points which are the decomposition of the character.
Macro
unidata_character_decomposition_mapping
#define unidata_character_decomposition_mapping(C)
Return the decomposition mapping of C. This macro returns a pointer to a struct uni_decomposition_mapping.
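To show how a 0-terminated decomposition array might be consumed, here is a minimal sketch. The structure is redeclared locally with the two fields listed above, the enumeration is abbreviated to three values whose numeric values are assumptions, and example_decomposition_length is a hypothetical helper. The sample mapping is built by hand instead of being obtained from unidata_character_decomposition_mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t t_unicode;  /* assumed stand-in */

/* Abbreviated local copy; the full list of types is given above. */
enum uni_decomposition_type
{
  uni_decomposition_none,
  uni_decomposition_canonical,
  uni_decomposition_compat
};

struct uni_decomposition_mapping
{
  enum uni_decomposition_type type;
  t_unicode * decomposition;
};

/* Hypothetical helper: number of code points in the decomposition,
   or 0 if the character has no decomposition mapping. */
size_t
example_decomposition_length (const struct uni_decomposition_mapping * m)
{
  size_t n = 0;

  assert (m != 0);
  if (m->type == uni_decomposition_none)
    return 0;   /* the decomposition field is not meaningful here */
  while (m->decomposition[n] != 0)
    ++n;
  return n;
}

/* Hand-built sample: U+00C5 decomposes canonically to U+0041 U+030A. */
t_unicode example_a_ring_points[] = { 0x0041, 0x030A, 0 };
struct uni_decomposition_mapping example_a_ring =
  { uni_decomposition_canonical, example_a_ring_points };
struct uni_decomposition_mapping example_no_mapping =
  { uni_decomposition_none, 0 };
```

Checking type before touching decomposition matters, because the text only promises a valid array when type is not uni_decomposition_none.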
struct uni_block;
Each structure of this type describes one of the standard blocks of Unicode characters ("Basic Latin", "Latin-1 Supplement", etc.).
struct uni_block
{
  t_uchar * name;    /* name of the block */
  t_unichar start;   /* first character in the block */
  t_unichar end;     /* last character in the block */
};
extern struct uni_block uni_blocks[];
The names of the standard Unicode blocks. This array is sorted in code-point order, from least to greatest.
n_uni_blocks is the number of blocks in uni_blocks. The array ends with a sentinel entry whose name is 0:
uni_blocks[n_uni_blocks].name == 0
extern const struct uni_block uni_blocks[];
extern const int n_uni_blocks;
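Because the array is sorted and ends with an entry whose name is 0, finding the block that contains a code point can be a simple scan. The sketch below works on a small hand-written table (example_blocks, three real blocks plus the terminator) so that it compiles without the library; the typedefs are assumed stand-ins and example_find_block is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  t_uchar;    /* assumed stand-in */
typedef uint16_t t_unichar;  /* assumed stand-in */
typedef uint32_t t_unicode;  /* assumed stand-in */

struct uni_block
{
  t_uchar * name;    /* name of the block */
  t_unichar start;   /* first character in the block */
  t_unichar end;     /* last character in the block */
};

/* Hand-written sample in the same layout as uni_blocks: sorted by
   code point and terminated by an entry whose name is 0. */
const struct uni_block example_blocks[] =
{
  { (t_uchar *)"Basic Latin",        0x0000, 0x007F },
  { (t_uchar *)"Latin-1 Supplement", 0x0080, 0x00FF },
  { (t_uchar *)"Latin Extended-A",   0x0100, 0x017F },
  { 0, 0, 0 }
};

/* Hypothetical helper: the block containing C, or 0 if none does. */
const struct uni_block *
example_find_block (const struct uni_block * blocks, t_unicode c)
{
  const struct uni_block * b;

  assert (blocks != 0);
  for (b = blocks; b->name != 0; ++b)
    {
      if (c < b->start)
        break;          /* sorted: no later block can contain c */
      if (c <= b->end)
        return b;
    }
  return 0;
}
```

With the real table, a binary search over the first n_uni_blocks entries would be the natural refinement, since the sort order is guaranteed.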
bits uni_universal_bitset (void);
Return the set of all assigned code points which are not surrogate code points and are not private use code points. The set is represented as a shared bitset tree. (See Shared Bitset Trees.)
The shared bitset tree returned by this function uses the tree structure defined by uni_bits_tree_rule. (See Unicode Character Bitsets.)
Programs should not attempt to modify the set returned by this function.
Function
uni_general_category_bitset
bits uni_general_category_bitset (enum uni_general_category c);
Return the set of all assigned code points having the indicated general category or synthetic general category. The set is represented as a shared bitset tree. (See Shared Bitset Trees.)
c indicates which category to return. It may be a Unicode general category or a synthetic general category. (See General Category.)
The shared bitset tree returned by this function uses the tree structure defined by uni_bits_tree_rule. (See Unicode Character Bitsets.)
Programs should not attempt to modify the set returned by this function.