Scripts and Languages: Pango Reference Manual

Scripts and Languages

Scripts and Languages — Identifying writing systems and languages

Functions

PangoScript	pango_script_for_unichar ()
PangoLanguage *	pango_script_get_sample_language ()
PangoScriptIter *	pango_script_iter_new ()
void	pango_script_iter_get_range ()
gboolean	pango_script_iter_next ()
void	pango_script_iter_free ()
PangoLanguage *	pango_language_from_string ()
const char *	pango_language_to_string ()
gboolean	pango_language_matches ()
gboolean	pango_language_includes_script ()
const PangoScript *	pango_language_get_scripts ()
PangoLanguage *	pango_language_get_default ()
const char *	pango_language_get_sample_string ()

Types and Values

enum	PangoScript
#define	PANGO_TYPE_SCRIPT
	PangoScriptIter
	PangoLanguage
#define	PANGO_TYPE_LANGUAGE

Object Hierarchy

    GBoxed
    ╰── PangoLanguage
    GEnum
    ╰── PangoScript

Description

The functions in this section are used to identify the writing system, or script, of individual characters and of ranges within a larger text string.

Functions

pango_script_for_unichar ()

PangoScript
pango_script_for_unichar (gunichar ch);

Looks up the PangoScript for a particular character (as defined by Unicode Standard Annex #24). No check is made for ch being a valid Unicode character; if you pass in an invalid character, the result is undefined.

As of Pango 1.18, this function simply returns the return value of g_unichar_get_script().

Parameters

ch

a Unicode character

Returns

the PangoScript for the character.

Since: 1.4

pango_script_get_sample_language ()

PangoLanguage *
pango_script_get_sample_language (PangoScript script);

Given a script, finds a language tag that is reasonably representative of that script. This will usually be the most widely spoken or used language written in that script: for instance, the sample language for PANGO_SCRIPT_CYRILLIC is ru (Russian), the sample language for PANGO_SCRIPT_ARABIC is ar.

For some scripts, no sample language will be returned because there is no language that is sufficiently representative. The best example of this is PANGO_SCRIPT_HAN, where various different variants of written Chinese, Japanese, and Korean all use significantly different sets of Han characters and forms of shared characters. No sample language can be provided for many historical scripts as well.

As of 1.18, this function first checks the environment variables PANGO_LANGUAGE and LANGUAGE (in that order). If one of them is set, it is parsed as a list of language tags separated by colons or other separators. This function returns the first language in the parsed list that Pango believes may use script for writing, a predicate tested with pango_language_includes_script(). This can be used to control Pango's font selection for non-primary languages. For example, a PANGO_LANGUAGE environment variable set to "en:fa" makes Pango choose fonts suitable for Persian (fa) instead of Arabic (ar) when a segment of Arabic text is found in an otherwise non-Arabic text. The same trick can be used to choose a default language for PANGO_SCRIPT_HAN when setting the context language is not feasible.
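As a rough illustration of the lookup order described above, the sketch below walks a colon-separated list such as "en:fa" and returns the first tag accepted by a predicate. It does not call Pango: first_matching_tag, TagPredicate and may_use_arabic are hypothetical names, the predicate merely stands in for pango_language_includes_script(), and only ':' is treated as a separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate type: returns nonzero if the tag may be
 * written in the script of interest. It stands in for
 * pango_language_includes_script(). */
typedef int (*TagPredicate) (const char *tag);

/* Copies the first tag in the colon-separated `list` that satisfies
 * `pred` into `out` (capacity `out_size`). Returns 1 on success; returns
 * 0 and sets `out` to "" if no tag matches. Tags that do not fit in
 * `out` are skipped. */
static int
first_matching_tag (const char *list, TagPredicate pred,
                    char *out, size_t out_size)
{
  assert (list != NULL && pred != NULL && out != NULL && out_size > 0);

  const char *p = list;
  while (*p != '\0')
    {
      size_t len = strcspn (p, ":");       /* length of the current tag */

      if (len > 0 && len < out_size)
        {
          memcpy (out, p, len);
          out[len] = '\0';
          if (pred (out))
            return 1;
        }

      p += len;
      if (*p == ':')
        p++;                               /* skip the separator */
    }

  out[0] = '\0';
  return 0;
}

/* Toy predicate for the example only: pretend that just "fa" and "ar"
 * may be written in the Arabic script. */
static int
may_use_arabic (const char *tag)
{
  return strcmp (tag, "fa") == 0 || strcmp (tag, "ar") == 0;
}
```

With the list "en:fa" and this toy predicate, the first accepted tag is "fa", which mirrors the Persian example in the text.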

Parameters

script

a PangoScript

Returns

a PangoLanguage that is representative of the script, or NULL if no such language exists.

[nullable]

Since: 1.4

pango_script_iter_new ()

PangoScriptIter *
pango_script_iter_new (const char *text,
                       int length);

Creates a new PangoScriptIter, used to break a string of Unicode text into runs by Unicode script. No copy is made of text, so the caller needs to make sure it remains valid until the iterator is freed with pango_script_iter_free().

Parameters

text	a UTF-8 string
length	length of `text`, or -1 if `text` is nul-terminated.

Returns

the new script iterator, initialized to point at the first range in the text, which should be freed with pango_script_iter_free(). If the string is empty, it will point at an empty range.

Since: 1.4

pango_script_iter_get_range ()

void
pango_script_iter_get_range (PangoScriptIter *iter,
                             const char **start,
                             const char **end,
                             PangoScript *script);

Gets information about the range to which iter currently points. The range is the set of locations p where *start <= p < *end. (That is, it doesn't include the character stored at *end.)
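Because the range is half-open, its length in bytes is the pointer difference between the two ends. The sketch below copies such a range into a nul-terminated buffer. copy_range is a hypothetical helper, not a Pango function, and both pointers are assumed to point into the same string, as the ones reported by pango_script_iter_get_range() do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the bytes in the half-open range [start, end) into `out` and
 * nul-terminates it. Returns the number of bytes copied. If the range
 * does not fit in `out`, nothing is copied, `out` is set to "" and 0 is
 * returned. */
static size_t
copy_range (const char *start, const char *end, char *out, size_t out_size)
{
  assert (start != NULL && end != NULL && start <= end);
  assert (out != NULL && out_size > 0);

  size_t len = (size_t) (end - start);   /* the byte at *end is excluded */

  if (len >= out_size)
    {
      out[0] = '\0';
      return 0;
    }

  memcpy (out, start, len);
  out[len] = '\0';
  return len;
}
```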

Parameters

iter	a PangoScriptIter
start	location to store start position of the range, or `NULL`.	[out][allow-none]
end	location to store end position of the range, or `NULL`.	[out][allow-none]
script	location to store script for range, or `NULL`.	[out][allow-none]

Since: 1.4

pango_script_iter_next ()

gboolean
pango_script_iter_next (PangoScriptIter *iter);

Advances a PangoScriptIter to the next range. If iter is already at the end, it is left unchanged and FALSE is returned.

Parameters

iter

a PangoScriptIter

Returns

TRUE if iter was successfully advanced.

Since: 1.4

pango_script_iter_free ()

void
pango_script_iter_free (PangoScriptIter *iter);

Frees a PangoScriptIter created with pango_script_iter_new().

Parameters

iter

a PangoScriptIter

Since: 1.4

pango_language_from_string ()

PangoLanguage *
pango_language_from_string (const char *language);

Takes an RFC-3066 format language tag as a string and converts it to a PangoLanguage pointer that can be efficiently copied (copy the pointer) and compared with other language tags (compare the pointer).

This function first canonicalizes the string by converting it to lowercase, mapping '_' to '-', and stripping all characters other than letters and '-'.
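The three canonicalization steps can be sketched in plain C as follows. This is an illustration of the rule as worded above, not Pango's source: canonicalize_tag is a hypothetical name, and the real implementation may treat some characters differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Writes a canonical form of `tag` into `out` (capacity `out_size`),
 * following the steps described in the text: lowercase letters, map '_'
 * to '-', and drop everything that is not an ASCII letter or '-'.
 * The output is truncated if it does not fit. */
static void
canonicalize_tag (const char *tag, char *out, size_t out_size)
{
  assert (tag != NULL && out != NULL && out_size > 0);

  size_t n = 0;
  for (const char *p = tag; *p != '\0' && n + 1 < out_size; p++)
    {
      unsigned char c = (unsigned char) *p;

      if (c == '_')
        c = '-';

      if (c < 128 && isalpha (c))
        out[n++] = (char) tolower (c);
      else if (c == '-')
        out[n++] = '-';
      /* any other character is dropped */
    }
  out[n] = '\0';
}
```

Under these assumptions, the locale-style string "pt_BR" becomes "pt-br".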

Use pango_language_get_default() if you want to get the PangoLanguage for the current locale of the process.

Parameters

language

a string representing a language tag, or NULL.

[allow-none]

Returns

an opaque pointer to a PangoLanguage structure, or NULL if language was NULL. The returned pointer will be valid forever after, and should not be freed.

[transfer none][nullable]

pango_language_to_string ()

const char *
pango_language_to_string (PangoLanguage *language);

Gets the RFC-3066 format string representing the given language tag.

Parameters

language

a language tag.

Returns

a string representing the language tag. This is owned by Pango and should not be freed.

pango_language_matches ()

gboolean
pango_language_matches (PangoLanguage *language,
                        const char *range_list);

Checks if a language tag matches one of the elements in a list of language ranges. A language tag is considered to match a range in the list if the range is '*', the range is exactly the tag, or the range is a prefix of the tag, and the character after it in the tag is '-'.

Parameters

language	a language tag (see `pango_language_from_string()`), `NULL` is allowed and matches nothing but '*'.	[nullable]
range_list	a list of language ranges, separated by ';', ':', ',', or space characters. Each element must either be '*', or an RFC 3066 language range canonicalized as by `pango_language_from_string()`.

Returns

TRUE if a match was found.
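The matching rule for pango_language_matches() can be sketched without Pango as shown below. The function names tag_matches_range and tag_matches_range_list are hypothetical, tags are plain canonicalized strings rather than PangoLanguage pointers, and the code only follows the rule as documented here; it is not taken from Pango's source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `tag` matches the single range [range, range + len). */
static int
tag_matches_range (const char *tag, const char *range, size_t len)
{
  if (len == 1 && range[0] == '*')
    return 1;                             /* wildcard matches anything */
  if (tag == NULL)
    return 0;                             /* NULL matches nothing but '*' */
  if (strncmp (tag, range, len) != 0)
    return 0;                             /* range is not a prefix of tag */

  /* Either an exact match, or a prefix followed by '-' in the tag. */
  return tag[len] == '\0' || tag[len] == '-';
}

/* Returns 1 if `tag` matches any range in `range_list`, whose elements
 * are separated by ';', ':', ',' or space characters. */
static int
tag_matches_range_list (const char *tag, const char *range_list)
{
  static const char separators[] = ";:, ";

  assert (range_list != NULL);

  const char *p = range_list;
  while (*p != '\0')
    {
      p += strspn (p, separators);        /* skip leading separators */
      size_t len = strcspn (p, separators);
      if (len > 0 && tag_matches_range (tag, p, len))
        return 1;
      p += len;
    }
  return 0;
}
```

In this sketch "fr-ca" matches the list "en;fr" because "fr" is a prefix followed by '-', whereas "fra" does not match "fr".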

pango_language_includes_script ()

gboolean
pango_language_includes_script (PangoLanguage *language,
                                PangoScript script);

Determines if script is one of the scripts used to write language. The returned value is conservative; if nothing is known about the language tag language, TRUE will be returned, since, as far as Pango knows, script might be used to write language.

This routine is used in Pango's itemization process when determining if a supplied language tag is relevant to a particular section of text. It probably is not useful for applications in most circumstances.

This function uses pango_language_get_scripts() internally.
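One way to picture the conservative behavior is as a thin wrapper over a script list lookup. The sketch below is an assumption about the shape of that logic, not Pango's code: includes_script is a hypothetical name, and scripts are plain int values so that the example needs no Pango headers.

```c
#include <assert.h>
#include <stddef.h>

/* `scripts` plays the role of the array that a lookup such as
 * pango_language_get_scripts() would return (NULL when nothing is
 * known about the language); `n` is its length. */
static int
includes_script (const int *scripts, int n, int script)
{
  assert (n >= 0);

  if (scripts == NULL)
    return 1;              /* nothing known: conservatively answer TRUE */

  for (int i = 0; i < n; i++)
    if (scripts[i] == script)
      return 1;

  return 0;
}
```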

Parameters

language	a PangoLanguage, or `NULL`.	[nullable]
script	a PangoScript

Returns

TRUE if script is one of the scripts used to write language or if nothing is known about language (including the case that language is NULL), FALSE otherwise.

Since: 1.4

pango_language_get_scripts ()

const PangoScript *
pango_language_get_scripts (PangoLanguage *language,
                            int *num_scripts);

Determines the scripts used to write language. If nothing is known about the language tag language, or if language is NULL, then NULL is returned. The list of scripts returned starts with the script that the language uses most and continues to the one it uses least.

The value num_scripts points at will be set to the number of scripts in the returned array (or zero if NULL is returned).

Most languages use only one script for writing, but there are some that use two (Latin and Cyrillic for example), and a few use three (Japanese for example). Applications should not make any assumptions about the maximum number of scripts returned, though, except that it is positive if the return value is not NULL, and that it is a small number.

The pango_language_includes_script() function uses this function internally.

Parameters

language	a PangoLanguage, or `NULL`.	[allow-none]
num_scripts	location to return number of scripts, or `NULL`.	[out caller-allocates][allow-none]

Returns

An array of PangoScript values, with the number of entries in the array stored in num_scripts, or NULL if Pango does not have any information about this particular language tag (also the case if language is NULL). The returned array is owned by Pango and should not be modified or freed.

[array length=num_scripts][nullable]

Since: 1.22

pango_language_get_default ()

PangoLanguage *
pango_language_get_default (void);

Returns the PangoLanguage for the current locale of the process. Note that this can change over the life of an application.

On Unix systems, the return value is derived from setlocale(LC_CTYPE, NULL), and the user can affect this through the environment variables LC_ALL, LC_CTYPE or LANG (checked in that order). The locale string typically is in the form lang_COUNTRY, where lang is an ISO-639 language code, and COUNTRY is an ISO-3166 country code. For instance, sv_FI for Swedish as written in Finland or pt_BR for Portuguese as written in Brazil.

On Windows, the C library does not use any such environment variables, and setting them won't affect the behavior of functions like ctime(). The user sets the locale through the Regional Options in the Control Panel. The C library (in the setlocale() function) does not use country and language codes, but country and language names spelled out in English. However, this function does check the above environment variables, and does return a Unix-style locale string based on either said environment variables or the thread's current locale.

Your application should call setlocale(LC_ALL, ""); for the user settings to take effect. Gtk+ does this in its initialization functions automatically (by calling gtk_set_locale()). See man setlocale for more details.

Returns

the default language as a PangoLanguage. It must not be freed.

[transfer none]

Since: 1.16

pango_language_get_sample_string ()

const char *
pango_language_get_sample_string (PangoLanguage *language);

Get a string that is representative of the characters needed to render a particular language.

The sample text may be a pangram, but is not necessarily. It is chosen to be demonstrative of normal text in the language, as well as exposing font feature requirements unique to the language. It is suitable for use as sample text in a font selection dialog.

If language is NULL, the default language as found by pango_language_get_default() is used.

If Pango does not have a sample string for language, the classic "The quick brown fox..." is returned. This can be detected by comparing the returned pointer value to that returned for (non-existent) language code "xx". That is, compare to:

pango_language_get_sample_string (pango_language_from_string ("xx"))

Parameters

language

a PangoLanguage, or NULL.

[nullable]

Returns

the sample string. This value is owned by Pango and should not be freed.

Types and Values

enum PangoScript

The PangoScript enumeration identifies different writing systems. The values correspond to the names as defined in the Unicode standard. Note that new types may be added in the future. Applications should be ready to handle unknown values. This enumeration is interchangeable with GUnicodeScript. See Unicode Standard Annex #24: Script names.

Members

PANGO_SCRIPT_INVALID_CODE	a value never returned from `pango_script_for_unichar()`
PANGO_SCRIPT_COMMON	a character used by multiple different scripts
PANGO_SCRIPT_INHERITED	a mark glyph that takes its script from the base glyph to which it is attached
PANGO_SCRIPT_ARABIC	Arabic
PANGO_SCRIPT_ARMENIAN	Armenian
PANGO_SCRIPT_BENGALI	Bengali
PANGO_SCRIPT_BOPOMOFO	Bopomofo
PANGO_SCRIPT_CHEROKEE	Cherokee
PANGO_SCRIPT_COPTIC	Coptic
PANGO_SCRIPT_CYRILLIC	Cyrillic
PANGO_SCRIPT_DESERET	Deseret
PANGO_SCRIPT_DEVANAGARI	Devanagari
PANGO_SCRIPT_ETHIOPIC	Ethiopic
PANGO_SCRIPT_GEORGIAN	Georgian
PANGO_SCRIPT_GOTHIC	Gothic
PANGO_SCRIPT_GREEK	Greek
PANGO_SCRIPT_GUJARATI	Gujarati
PANGO_SCRIPT_GURMUKHI	Gurmukhi
PANGO_SCRIPT_HAN	Han
PANGO_SCRIPT_HANGUL	Hangul
PANGO_SCRIPT_HEBREW	Hebrew
PANGO_SCRIPT_HIRAGANA	Hiragana
PANGO_SCRIPT_KANNADA	Kannada
PANGO_SCRIPT_KATAKANA	Katakana
PANGO_SCRIPT_KHMER	Khmer
PANGO_SCRIPT_LAO	Lao
PANGO_SCRIPT_LATIN	Latin
PANGO_SCRIPT_MALAYALAM	Malayalam
PANGO_SCRIPT_MONGOLIAN	Mongolian
PANGO_SCRIPT_MYANMAR	Myanmar
PANGO_SCRIPT_OGHAM	Ogham
PANGO_SCRIPT_OLD_ITALIC	Old Italic
PANGO_SCRIPT_ORIYA	Oriya
PANGO_SCRIPT_RUNIC	Runic
PANGO_SCRIPT_SINHALA	Sinhala
PANGO_SCRIPT_SYRIAC	Syriac
PANGO_SCRIPT_TAMIL	Tamil
PANGO_SCRIPT_TELUGU	Telugu
PANGO_SCRIPT_THAANA	Thaana
PANGO_SCRIPT_THAI	Thai
PANGO_SCRIPT_TIBETAN	Tibetan
PANGO_SCRIPT_CANADIAN_ABORIGINAL	Canadian Aboriginal
PANGO_SCRIPT_YI	Yi
PANGO_SCRIPT_TAGALOG	Tagalog
PANGO_SCRIPT_HANUNOO	Hanunoo
PANGO_SCRIPT_BUHID	Buhid
PANGO_SCRIPT_TAGBANWA	Tagbanwa
PANGO_SCRIPT_BRAILLE	Braille
PANGO_SCRIPT_CYPRIOT	Cypriot
PANGO_SCRIPT_LIMBU	Limbu
PANGO_SCRIPT_OSMANYA	Osmanya
PANGO_SCRIPT_SHAVIAN	Shavian
PANGO_SCRIPT_LINEAR_B	Linear B
PANGO_SCRIPT_TAI_LE	Tai Le
PANGO_SCRIPT_UGARITIC	Ugaritic
PANGO_SCRIPT_NEW_TAI_LUE	New Tai Lue. Since: 1.10
PANGO_SCRIPT_BUGINESE	Buginese. Since: 1.10
PANGO_SCRIPT_GLAGOLITIC	Glagolitic. Since: 1.10
PANGO_SCRIPT_TIFINAGH	Tifinagh. Since: 1.10
PANGO_SCRIPT_SYLOTI_NAGRI	Syloti Nagri. Since: 1.10
PANGO_SCRIPT_OLD_PERSIAN	Old Persian. Since: 1.10
PANGO_SCRIPT_KHAROSHTHI	Kharoshthi. Since: 1.10
PANGO_SCRIPT_UNKNOWN	an unassigned code point. Since: 1.14
PANGO_SCRIPT_BALINESE	Balinese. Since: 1.14
PANGO_SCRIPT_CUNEIFORM	Cuneiform. Since: 1.14
PANGO_SCRIPT_PHOENICIAN	Phoenician. Since: 1.14
PANGO_SCRIPT_PHAGS_PA	Phags-pa. Since: 1.14
PANGO_SCRIPT_NKO	N'Ko. Since: 1.14
PANGO_SCRIPT_KAYAH_LI	Kayah Li. Since: 1.20.1
PANGO_SCRIPT_LEPCHA	Lepcha. Since: 1.20.1
PANGO_SCRIPT_REJANG	Rejang. Since: 1.20.1
PANGO_SCRIPT_SUNDANESE	Sundanese. Since: 1.20.1
PANGO_SCRIPT_SAURASHTRA	Saurashtra. Since: 1.20.1
PANGO_SCRIPT_CHAM	Cham. Since: 1.20.1
PANGO_SCRIPT_OL_CHIKI	Ol Chiki. Since: 1.20.1
PANGO_SCRIPT_VAI	Vai. Since: 1.20.1
PANGO_SCRIPT_CARIAN	Carian. Since: 1.20.1
PANGO_SCRIPT_LYCIAN	Lycian. Since: 1.20.1
PANGO_SCRIPT_LYDIAN	Lydian. Since: 1.20.1
PANGO_SCRIPT_BATAK	Batak. Since: 1.32
PANGO_SCRIPT_BRAHMI	Brahmi. Since: 1.32
PANGO_SCRIPT_MANDAIC	Mandaic. Since: 1.32
PANGO_SCRIPT_CHAKMA	Chakma. Since: 1.32
PANGO_SCRIPT_MEROITIC_CURSIVE	Meroitic Cursive. Since: 1.32
PANGO_SCRIPT_MEROITIC_HIEROGLYPHS	Meroitic Hieroglyphs. Since: 1.32
PANGO_SCRIPT_MIAO	Miao. Since: 1.32
PANGO_SCRIPT_SHARADA	Sharada. Since: 1.32
PANGO_SCRIPT_SORA_SOMPENG	Sora Sompeng. Since: 1.32
PANGO_SCRIPT_TAKRI	Takri. Since: 1.32
PANGO_SCRIPT_BASSA_VAH	Bassa. Since: 1.40
PANGO_SCRIPT_CAUCASIAN_ALBANIAN	Caucasian Albanian. Since: 1.40
PANGO_SCRIPT_DUPLOYAN	Duployan. Since: 1.40
PANGO_SCRIPT_ELBASAN	Elbasan. Since: 1.40
PANGO_SCRIPT_GRANTHA	Grantha. Since: 1.40
PANGO_SCRIPT_KHOJKI	Khojki. Since: 1.40
PANGO_SCRIPT_KHUDAWADI	Khudawadi, Sindhi. Since: 1.40
PANGO_SCRIPT_LINEAR_A	Linear A. Since: 1.40
PANGO_SCRIPT_MAHAJANI	Mahajani. Since: 1.40
PANGO_SCRIPT_MANICHAEAN	Manichaean. Since: 1.40
PANGO_SCRIPT_MENDE_KIKAKUI	Mende Kikakui. Since: 1.40
PANGO_SCRIPT_MODI	Modi. Since: 1.40
PANGO_SCRIPT_MRO	Mro. Since: 1.40
PANGO_SCRIPT_NABATAEAN	Nabataean. Since: 1.40
PANGO_SCRIPT_OLD_NORTH_ARABIAN	Old North Arabian. Since: 1.40
PANGO_SCRIPT_OLD_PERMIC	Old Permic. Since: 1.40
PANGO_SCRIPT_PAHAWH_HMONG	Pahawh Hmong. Since: 1.40
PANGO_SCRIPT_PALMYRENE	Palmyrene. Since: 1.40
PANGO_SCRIPT_PAU_CIN_HAU	Pau Cin Hau. Since: 1.40
PANGO_SCRIPT_PSALTER_PAHLAVI	Psalter Pahlavi. Since: 1.40
PANGO_SCRIPT_SIDDHAM	Siddham. Since: 1.40
PANGO_SCRIPT_TIRHUTA	Tirhuta. Since: 1.40
PANGO_SCRIPT_WARANG_CITI	Warang Citi. Since: 1.40
PANGO_SCRIPT_AHOM	Ahom. Since: 1.40
PANGO_SCRIPT_ANATOLIAN_HIEROGLYPHS	Anatolian Hieroglyphs. Since: 1.40
PANGO_SCRIPT_HATRAN	Hatran. Since: 1.40
PANGO_SCRIPT_MULTANI	Multani. Since: 1.40
PANGO_SCRIPT_OLD_HUNGARIAN	Old Hungarian. Since: 1.40
PANGO_SCRIPT_SIGNWRITING	Signwriting. Since: 1.40

PANGO_TYPE_SCRIPT

#define PANGO_TYPE_SCRIPT (pango_script_get_type ())

PangoScriptIter

typedef struct _PangoScriptIter PangoScriptIter;

A PangoScriptIter is used to iterate through a string and identify ranges in different scripts.

PangoLanguage

typedef struct _PangoLanguage PangoLanguage;

The PangoLanguage structure is used to represent a language.

PangoLanguage pointers can be efficiently copied and compared with each other.

PANGO_TYPE_LANGUAGE

#define PANGO_TYPE_LANGUAGE (pango_language_get_type ())

The GObject type for PangoLanguage.