Unicode API

Introduction

This API provides access to detailed information for all characters, blocks and planes in version 15.1.0 of the Unicode Standard (released Sep 12, 2023). In an attempt to adhere to the tenets of REST, the API is organized around the following principles:

  • URLs are predictable and resource-oriented.
  • Uses standard HTTP verbs and response codes.
  • Returns JSON-encoded responses.

Project Resources/Contact Info

Pagination

The top-level API resources for Unicode Characters and Unicode Blocks have support for retrieving all character/block objects via "list" API methods. These API methods (/v1/characters and /v1/blocks) share a common structure, taking at least these three parameters: limit, starting_after, and ending_before.

For your initial request, you should only provide a value for limit (if the default value of limit=10 is ok, you do not need to provide values for any parameter in your initial request). The response of a list API method contains a data parameter that represents a single page of results, and a hasMore parameter that indicates whether the list contains more results after this set.

The starting_after parameter acts as a cursor to navigate between paginated responses. However, the value used for this parameter is different for each endpoint: for Unicode Characters it is the codepoint property, while for Unicode Blocks it is the id property.

For example, if you request 10 items and the response contains hasMore=true, there are more search results beyond the first 10. If the 10th search result has codepoint=U+0346, you can retrieve the next set of results by sending starting_after=U+0346 in a subsequent request.

The ending_before parameter also acts as a cursor to navigate between pages, but instead of requesting the next set of results it allows you to access previous pages in the list.

For example, if you previously requested 10 items beyond the first page of results, and the first search result of the current page has codepoint=U+0357, you can retrieve the previous set of results by sending ending_before=U+0357 in a subsequent request.

⚠️ IMPORTANT: Only one of starting_after or ending_before may be used in a request. Sending a value for both parameters will produce a response with status 400 Bad Request.
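The cursor flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fetch_page` is a hypothetical callable that performs the GET request to /v1/characters with the given query parameters and returns the decoded JSON body; it is not part of this API.

```python
from typing import Callable, Iterator

# Hypothetical: a callable that sends the query parameters to /v1/characters
# and returns the decoded JSON response as a dict.
FetchPage = Callable[[dict], dict]


def iter_characters(fetch_page: FetchPage, limit: int = 10) -> Iterator[dict]:
    """Yield every character by following the starting_after cursor."""
    params: dict = {"limit": limit}  # the initial request only needs limit
    while True:
        page = fetch_page(params)
        items = page["data"]
        yield from items
        if not page.get("hasMore") or not items:
            break
        # For /v1/characters the cursor is the last item's codepoint;
        # for /v1/blocks it would be the last item's id instead.
        params = {"limit": limit, "starting_after": items[-1]["codepoint"]}
```

Because only starting_after is ever sent, the sketch never risks the 400 response triggered by combining both cursors.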

The top-level API resources for Unicode Characters and Unicode Blocks also have support for retrieval via "search" API methods. These API methods (/v1/characters/search and /v1/blocks/search) share an identical structure, taking the same four parameters: name, min_score, per_page, and page.

The name parameter is the search term and is used to retrieve a character/block using the official name defined in the UCD. Since a fuzzy search algorithm is used for this process, the value of name does not need to be an exact match with a character/block name.

The response will contain a results parameter that represents the characters/blocks that matched your query. Each object in this list has a score property which is a number ranging from 0-100 that describes how similar the character/block name is to the name value provided by the user (A value of 100 means that the name provided by the user is an exact match with a character/block name). The list contains all results where score >= min_score, sorted by score (the first element in the list being the most similar).

The default value for min_score is 80. If your request returns zero results, you can lower this value to surface lower-quality matches. Keep in mind that the lowest permitted value for min_score is 70, since the relevance of results drops off quickly around a score of 72, often producing hundreds of results with no relevance to the search term.

The per_page parameter controls how many results are included in a single response. The response will include a hasMore parameter that indicates whether there are more search results beyond the current page, as well as currentPage and totalResults parameters. If hasMore=true, the response will also contain a nextPage parameter.

For example, if you receive a response to a search request with hasMore=true and nextPage=2, you can update your request to include page=2 to fetch the next page of results. If the next response includes hasMore=true and nextPage=3, update your request to include page=3, etc. Rinse and repeat until you receive a response with hasMore=false, indicating that you have received the final set of search results.
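A minimal sketch of that loop is shown below. `fetch_page` is a hypothetical callable that sends the query parameters to /v1/characters/search (or /v1/blocks/search) and returns the decoded JSON body, and the per_page default of 10 is an assumption made for the example, not a documented value.

```python
from typing import Callable, List


def collect_search_results(
    fetch_page: Callable[[dict], dict],  # hypothetical HTTP helper
    name: str,
    min_score: int = 80,
    per_page: int = 10,  # assumed default, for illustration only
) -> List[dict]:
    """Follow nextPage until the response reports hasMore=false."""
    if not 70 <= min_score <= 100:
        raise ValueError("min_score must be between 70 and 100")
    results: List[dict] = []
    page = 1
    while True:
        body = fetch_page(
            {"name": name, "min_score": min_score, "per_page": per_page, "page": page}
        )
        results.extend(body["results"])
        if not body.get("hasMore"):
            return results
        page = body["nextPage"]  # only present when hasMore is true
```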

Loose Matching

Unicode specifies a set of rules to be used when comparing symbolic values, such as block names, known as loose matching rule UAX44-LM3. The algorithm for UAX44-LM3 is simple: ignore case, whitespace, underscores ('_'), hyphens, and any initial prefix string "is".

This rule applies to many of the parameters that are included with API requests, which avoids returning a 400 response when a parameter name, for example, is sent as 'script', but the expected value is 'Script'. Under UAX44-LM3, both values are equivalent.

For another example, under this rule the block name "Supplemental Arrows-A" is equivalent to "supplemental_arrows__a" and "SUPPLEMENTALARROWSA" since all three of these strings would be reduced to "supplementalarrowsa" after applying UAX44-LM3. For any query or path parameter that expects the name of a Unicode block, any of these three values could be provided and would be understood to refer to block U+27F0..U+27FF SUPPLEMENTAL ARROWS-A.

Whenever the loose-matching rule applies to a parameter, it will be called out in the documentation for each individual API endpoint below.
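As an illustration, the rule as summarized above can be written as a small Python helper. The function name `loose_key` is made up for this sketch, and the code is not taken from the API's implementation.

```python
import re


def loose_key(value: str) -> str:
    """Reduce a symbolic value to its UAX44-LM3 comparison key."""
    # Drop whitespace, underscores and hyphens, then ignore case.
    key = re.sub(r"[\s_\-]+", "", value).lower()
    # Ignore an initial "is" prefix. Both sides of a comparison must be
    # passed through this function so they are reduced the same way.
    return key[2:] if key.startswith("is") else key


names = ["Supplemental Arrows-A", "supplemental_arrows__a", "SUPPLEMENTALARROWSA"]
keys = {loose_key(n) for n in names}  # a single key: "supplementalarrowsa"
```

Two values are loosely equal when their keys are equal, for example `loose_key("script") == loose_key("Script")`.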

Core Resources

Unicode Characters

API Endpoints

GET /v1/characters/-/{string}
Retrieve one or more character(s)*
GET /v1/characters
List all characters*
GET /v1/characters/filter
List characters that match filter settings†
GET /v1/characters/search
Search characters†
*Supports requests for all codepoints in the Unicode space (i.e., assigned, reserved, noncharacter, surrogate, and private-use codepoints).
†Supports ONLY assigned codepoints.

The UnicodeCharacter Object

The UnicodeCharacter object represents a single character/codepoint in the Unicode Character Database (UCD). It contains a rich set of properties that document the purpose and intended representation of the character.

UnicodeCharacter Property Groups

If each response contained every character property, it would be massively inefficient. To ensure that the API remains responsive and performant while also allowing clients to access the full set of character properties, each property is assigned to a property group.

Since they are designed to return lists of characters, responses from the /v1/characters or /v1/characters/search endpoints will only include properties from the Minimum property group:

Minimum
character
A unit of information used for the organization, control, or representation of textual data.
name
A unique string used to identify each character encoded in the Unicode standard.
description
(CJK Characters ONLY)

An English definition for this character. Definitions are for modern written Chinese and are usually (but not always) the same as the definition in other Chinese dialects or non-Chinese languages.

More info: http://www.unicode.org/reports/tr38/#kDefinition

codepoint

In character encoding terminology, a codepoint is a numerical value that maps to a specific character. Code points usually represent a single grapheme (usually a letter, digit, punctuation mark, or whitespace) but sometimes represent symbols, control characters, or formatting. The set of all possible code points within a given encoding/character set makes up that encoding's codespace.

For example, the character encoding scheme ASCII comprises 128 code points in the range 00-7F, Extended ASCII comprises 256 code points in the range 00-FF, and Unicode comprises 1,114,112 code points in the range 0000-10FFFF. The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
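The arithmetic above can be checked with a short Python sketch. The helper names `parse_codepoint` and `plane_of` are illustrative only and do not correspond to anything exposed by the API.

```python
PLANE_SIZE = 2 ** 16                       # 65,536 code points per plane
PLANE_COUNT = 17                           # the BMP plus 16 supplementary planes
CODESPACE_SIZE = PLANE_COUNT * PLANE_SIZE  # 1,114,112


def parse_codepoint(label: str) -> int:
    """Convert a 'U+XXXX' label into its integer code point."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a codepoint label: {label!r}")
    value = int(label[2:], 16)
    if not 0 <= value < CODESPACE_SIZE:
        raise ValueError(f"outside the Unicode codespace: {label!r}")
    return value


def plane_of(codepoint: int) -> int:
    """Plane number = first code point of the plane divided by 65,536."""
    return codepoint // PLANE_SIZE
```

For instance, `plane_of(parse_codepoint("U+2211"))` is 0 (the BMP), while U+1F1FA falls in plane 1.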

uriEncoded

The character as a URI encoded string. A URI is a string that identifies an abstract or physical resource on the internet (The specification for the URI format is defined in RFC 3986).

A URI string must contain only a defined subset of characters from the standard 128 ASCII character set. Any other character must be replaced by an escape sequence representing the UTF-8 encoding of the character.

For example, ∑ (U+2211 N-ARY SUMMATION) in UTF-8 encoding is 0xE2 0x88 0x91. To include this character in a URI, each UTF-8 byte is prefixed with the % character to produce the URI-encoded string: %E2%88%91.
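The same percent-encoding can be reproduced with Python's standard library, which may be handy when building request paths by hand. The wrapper name `uri_encode` is illustrative; whether the API escapes every reserved ASCII character in exactly the same way is an assumption.

```python
from urllib.parse import quote, unquote


def uri_encode(character: str) -> str:
    """Percent-encode the UTF-8 bytes of a character for use in a URI."""
    # safe="" also escapes "/" so the result fits in a single path segment.
    return quote(character, safe="")


encoded = uri_encode("\u2211")  # N-ARY SUMMATION
decoded = unquote(encoded)      # round-trips back to the original character
```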


⚠️ NOTE: Specifying show_props=Minimum in any request is redundant since the Minimum property group is included in all responses.

If you wish to explore the properties of one or more specific characters, the /v1/characters/-/{string} and /v1/characters/filter endpoints accept one or more show_props parameters that allow you to specify additional property groups to include in the response.

For example, to view the properties from groups UTF-8, Numeric, and Script for the character Ⱒ (U+2C22 GLAGOLITIC CAPITAL LETTER SPIDERY HA), which is 0xE2 0xB0 0xA2 in UTF-8 encoding, submit the following request: /v1/characters/-/%E2%B0%A2?show_props=UTF8&show_props=Numeric&show_props=Script.
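A request path with repeated show_props parameters can be assembled with the standard library, as in the sketch below. The function name `character_request_path` is hypothetical, and the host name is deliberately left out because it is not given in this section.

```python
from typing import Iterable
from urllib.parse import quote, urlencode


def character_request_path(string: str, prop_groups: Iterable[str]) -> str:
    """Build the path and query for GET /v1/characters/-/{string}."""
    path = "/v1/characters/-/" + quote(string, safe="")
    # A list of pairs lets the same key appear once per property group.
    query = urlencode([("show_props", group) for group in prop_groups])
    return f"{path}?{query}" if query else path


# U+2C22 GLAGOLITIC CAPITAL LETTER SPIDERY HA
path = character_request_path("\u2C22", ["UTF8", "Numeric", "Script"])
```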

Verbosity

The values of many character properties are only meaningful for specific blocks or a small subset of codepoints (e.g., the hangul_syllable_type property will have the value NA (Not Applicable) for all codepoints except those in the four blocks that contain characters from the Hangul writing system).

By default, the hangul_syllable_type property will NOT be included with the response for any character that has this default value even if the user has submitted a request containing show_props=hangul or show_props=all. For actual Hangul characters, the property will be included in the response.

These properties are removed to make the size of each response as small as possible. Knowing that the 🇺 (U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U) character has the value hangul_syllable_type=NA provides no real information about this character.

However, if you wish to see every property value, include verbose=true with your request to the /v1/characters/-/{string} or /v1/characters/filter endpoints.

Basic
block
Name of the block to which the character belongs. Each block is a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points, and starting at a location that is a multiple of 16. A block may contain unassigned code points, which are reserved.
plane
A range of 65,536 (0x10000) contiguous Unicode code points, where the first code point is an integer multiple of 65,536 (0x10000). Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000...U+FFFF, Plane 1 is U+10000...U+1FFFF, ..., and Plane 16 (0x10) is U+100000...U+10FFFF.
The vast majority of commonly used characters are located in Plane 0, which is called the Basic Multilingual Plane (BMP). Planes 1-16 are collectively referred to as supplementary planes.
age
The version of Unicode in which the character was assigned to a codepoint, such as "1.1" or "4.0".
generalCategory
The General Category that this character belongs to (e.g., letters, numbers, punctuation, symbols, etc.). The full list of values which are valid for this property is defined in Unicode Standard Annex #44.
combiningClass
Specifies, with a numeric code, which sequences of combining marks are to be considered canonically equivalent and which are not. This is used in the Canonical Ordering Algorithm and in normalization. For more info, please see Unicode Standard Section 4.3.
htmlEntities
A string beginning with an ampersand (&) character and ending with a semicolon (;). Entities are used to display reserved characters (e.g., '<' in an HTML document) or invisible characters (e.g., non-breaking spaces). For more info, please see the MDN entry for HTML Entities.
ideoFrequency
(CJK Characters ONLY)
A rough frequency measurement for the character based on analysis of traditional Chinese USENET postings; characters with a kFrequency of 1 are the most common, those with a kFrequency of 2 are less common, and so on, through a kFrequency of 5.
ideoGradeLevel
(CJK Characters ONLY)
The primary grade in the Hong Kong school system by which a student is expected to know the character; this data is derived from 朗文初級中文詞典, Hong Kong: Longman, 2001.
rsCountUnicode
(CJK Characters ONLY)

The standard radical-stroke count for this character in the form “radical.additional strokes”. The radical is indicated by a number in the range (1..214) inclusive. An apostrophe (') after the radical indicates a simplified version of the given radical. The “additional strokes” value is the residual stroke-count, the count of all strokes remaining after eliminating all strokes associated with the radical.

This field is also used for additional radical-stroke indices where either a character may be reasonably classified under more than one radical, or alternate stroke count algorithms may provide different stroke counts.

The residual stroke count may be negative. This is because some characters (for example, U+225A9, U+29C0A) are constructed by removing strokes from a standard radical.

rsCountKangxi
(CJK Characters ONLY)
The Kangxi radical-stroke count for this character consistent with the value of the character in the《康熙字典》Kangxi Dictionary in the form “radical.additional strokes”.
totalStrokes
(CJK Characters ONLY)
The total number of strokes in the character (including the radical). When there are two values, the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both.

UTF-8
UTF-8 is a method of encoding the Unicode character set where each code unit is equal to 8-bits. UTF-8 is backwards-compatible with ASCII and all codepoints in range 0-127 are represented as a single byte. Codepoints greater than 127 are represented as a sequence of 2-4 bytes.
utf8
The UTF-8 encoded value for the character as a hex string.
utf8HexBytes
The byte sequence for the UTF-8 encoded value for the character. This property returns a list of strings, hex values (base-16) in range 00-FF.
utf8DecBytes
The byte sequence for the UTF-8 encoded value for the character. This property returns a list of integers, decimal values (base-10) in range 0-255.
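To make the two byte-list properties concrete, the sketch below derives equivalent values locally with Python's built-in encoder. The dictionary keys and the two-digit uppercase hex formatting are assumptions for illustration; the exact string format the API returns is not specified here.

```python
def utf8_views(character: str) -> dict:
    """Return the UTF-8 bytes of a character as hex strings and as integers."""
    encoded = character.encode("utf-8")
    return {
        "hex_bytes": [f"{byte:02X}" for byte in encoded],  # base-16, 00-FF
        "dec_bytes": list(encoded),                        # base-10, 0-255
    }


ascii_view = utf8_views("A")       # one byte, same value as in ASCII
sigma_view = utf8_views("\u2211")  # N-ARY SUMMATION, three bytes
```

Note that every byte of a multi-byte sequence is greater than 127, which is what keeps UTF-8 unambiguous alongside ASCII.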

UTF-16
UTF-16 is a method of encoding the Unicode character set where each code unit is equal to 16-bits. All codepoints in the BMP (Plane 0) can be represented as a single 16-bit code unit (2 bytes). Code points in the supplementary planes (Planes 1-16) are represented as pairs of 16-bit code units (4 bytes).
utf16
The UTF-16 encoded value for the character as a hex string.
utf16HexBytes
The byte sequence for the UTF-16 encoded value for the character. This property returns a list of strings, hex values (base-16) in range 0000-FFFF.
utf16DecBytes
The byte sequence for the UTF-16 encoded value for the character. This property returns a list of integers, decimal values (base-10) in range 0-65,535.
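The split into one or two 16-bit code units follows the standard surrogate-pair arithmetic, sketched below. The function name `utf16_code_units` is illustrative and is not part of the API.

```python
from typing import List


def utf16_code_units(codepoint: int) -> List[int]:
    """Encode a code point as one or two 16-bit UTF-16 code units."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if codepoint < 0x10000:
        return [codepoint]             # BMP: a single code unit
    offset = codepoint - 0x10000       # 20-bit value for planes 1-16
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


# U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U lives in Plane 1.
units = [f"{unit:04X}" for unit in utf16_code_units(0x1F1FA)]
```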

UTF-32
UTF-32 is a method of encoding the Unicode character set where each code unit is equal to 32-bits. UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit code unit. Because of this, UTF-32 has a one-to-one relationship between encoded character and code unit; it is a fixed-width character encoding form.
utf32
The UTF-32 encoded value for the character as a hex string.
utf32HexBytes
The byte sequence for the UTF-32 encoded value for the character. This property returns a list of strings, hex values (base-16) in range 00000000-0010FFFF.
utf32DecBytes
The byte sequence for the UTF-32 encoded value for the character. This property returns a list of integers, decimal values (base-10) in range 0-1,114,111.

Bidirectionality
bidirectionalClass
A value assigned to each Unicode character based on the appropriate directional formatting style. For the property values, see Bidirectional Class Values.
bidirectionalIsMirrored
A normative property of characters such as parentheses, whose images are mirrored horizontally in text that is laid out from right to left. For example, U+0028 LEFT PARENTHESIS is interpreted as opening parenthesis; in a left-to-right context it will appear as “(”, while in a right-to-left context it will appear as the mirrored glyph “)”. This requirement is necessary to render the character properly in a bidirectional context.
bidirectionalMirroringGlyph
A character that can be used to supply a mirrored glyph for the requested character. For example, "(" (U+0028 LEFT PARENTHESIS) mirrors ")" (U+0029 RIGHT PARENTHESIS) and vice versa.
bidirectionalControl

Boolean value that indicates whether the character is one of 12 format control characters which have specific functions in the Unicode Bidirectional Algorithm:

  • U+200E LEFT-TO-RIGHT MARK
  • U+200F RIGHT-TO-LEFT MARK
  • U+202A LEFT-TO-RIGHT EMBEDDING
  • U+202B RIGHT-TO-LEFT EMBEDDING
  • U+202C POP DIRECTIONAL FORMATTING
  • U+202D LEFT-TO-RIGHT OVERRIDE
  • U+202E RIGHT-TO-LEFT OVERRIDE
  • U+2066 LEFT-TO-RIGHT ISOLATE
  • U+2067 RIGHT-TO-LEFT ISOLATE
  • U+2068 FIRST STRONG ISOLATE
  • U+2069 POP DIRECTIONAL ISOLATE
  • U+061C ARABIC LETTER MARK
pairedBracketType
Type of a paired bracket, either opening, closing or none (the default value). This property is used in the implementation of parenthesis matching.
pairedBracketProperty
For an opening bracket, the code point of the matching closing bracket. For a closing bracket, the code point of the matching opening bracket.

Decomposition
decompositionType

The type of the decomposition (canonical or compatibility). The possible values are listed below:

  • none None
  • can  Canonical
  • com  Otherwise Unspecified Compatibility Character
  • enc  Encircled Form
  • fin  Final Presentation Form (Arabic)
  • font Font Variant
  • fra  Vulgar Fraction Form
  • init Initial Presentation Form (Arabic)
  • iso  Isolated Presentation Form (Arabic)
  • med  Medial Presentation Form (Arabic)
  • nar  Narrow (or Hankaku) Compatibility Character
  • nb   No-break Version Of A Space Or Hyphen
  • sml  Small Variant Form (CNS Compatibility)
  • sqr  CJK Squared Font Variant
  • sub  Subscript Form
  • sup  Superscript Form
  • vert Vertical Layout Presentation Form
  • wide Wide (or Zenkaku) Compatibility Character

Quick Check

Unicode, being a unifying character set, contains characters that allow similar results to be expressed in different ways. Given that similar text can be written in different ways, we have a problem. How can we determine if two strings are equal? How can we find a substring in a string?

The answer is to convert the string to a well-known form, a process known as normalization. Unicode normalization is a set of rules based on tables and algorithms. It defines two kinds of normalization equivalence: canonical and compatible.

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, "Å" (U+212B ANGSTROM SIGN) is canonically equivalent to BOTH "Å" (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE) and "A" (U+0041 LATIN CAPITAL LETTER A) + "◌̊" (U+030A COMBINING RING ABOVE).

Code point sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. An example of this could be representations of the decimal digit 6: "Ⅵ" (U+2165 ROMAN NUMERAL SIX) and "⑥" (U+2465 CIRCLED DIGIT SIX). In one particular sense they are the same, but there are many other qualities that are different between them.

Compatible equivalence is a superset of canonical equivalence. In other words each canonical mapping is also a compatible one, but not the other way around.

Composition is the process of combining marks with base letters (multiple code points are replaced by single points whenever possible). Decomposition is the process of taking already composed characters apart (single code points are split into multiple ones). Both processes are recursive.

An additional difficulty is that the normalized ordering of multiple consecutive combining marks must be defined. This is done using a concept called the Canonical Combining Class or CCC, a Unicode character property (available as the combiningClass property in the Basic property group).

When you take all of these concepts into consideration, four normalization forms are defined:

  • NFD  Canonical decomposition and ordering
  • NFC  Composition after canonical decomposition and ordering
  • NFKD Compatible decomposition and ordering
  • NFKC Composition after compatible decomposition and ordering

In an effort to make the process of normalizing/determining if a string is already normalized less tedious and complex, four “quick check” properties exist for each character (NFD_QC, NFC_QC, NFKD_QC, and NFKC_QC, one for each normalization form).

These properties allow implementations to quickly determine whether a string is in a particular Normalization Form. This is, in general, many times faster than normalizing and then comparing.

NFD_QC
NFD_QC stands for Normalization Form D Quick Check. This property is used to quickly check if a character is already in NFD form, and thus does not need to be further normalized.
NFC_QC
NFC_QC stands for Normalization Form C Quick Check. This property is used to quickly check if a character is already in NFC form, and thus does not need to be further normalized.
NFKD_QC
NFKD_QC stands for Normalization Form KD Quick Check. This property is used to quickly check if a character is already in NFKD form, and thus does not need to be further normalized.
NFKC_QC
NFKC_QC stands for Normalization Form KC Quick Check. This property is used to quickly check if a character is already in NFKC form, and thus does not need to be further normalized.
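The normalization forms and the idea of a quick check can be tried out locally with Python's standard unicodedata module (is_normalized requires Python 3.8 or later). This is a sketch using the interpreter's own Unicode tables, which may correspond to a different Unicode version than the one served by this API.

```python
import unicodedata

ANGSTROM = "\u212B"      # ANGSTROM SIGN
A_RING = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
A_PLUS_RING = "A\u030A"  # "A" followed by COMBINING RING ABOVE


def canonically_equivalent(first: str, second: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", first) == unicodedata.normalize("NFD", second)


# Checking whether a string is already normalized, without building a copy.
angstrom_is_nfc = unicodedata.is_normalized("NFC", ANGSTROM)  # False
a_ring_is_nfc = unicodedata.is_normalized("NFC", A_RING)      # True

# Compatibility forms fold ROMAN NUMERAL SIX and CIRCLED DIGIT SIX to plain ASCII.
roman_six = unicodedata.normalize("NFKC", "\u2165")
circled_six = unicodedata.normalize("NFKC", "\u2465")
```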

Numeric
numericType

If a character is normally used as a number, it will be assigned a value other than None, which is the default value used for all non-number characters:

  • None None
  • De   Decimal
  • Di   Digit
  • Nu   Numeric
numericValue

If the character has the property value numericType=Decimal, then the numericValue of that digit is represented with an integer value (limited to the range 0..9).

If the character has the property value numericType=Digit, then the numericValue of that digit is represented with an integer value (limited to the range 0..9). This covers digits that need special handling, such as the compatibility superscript digits. Starting with Unicode 6.3.0, no newly encoded numeric characters will be given numericType=Digit, nor will existing characters with numericType=Numeric be changed to numericType=Digit. The distinction between those two types is not considered useful.

If the character has the property value numericType=Numeric, then the numericValue of that character is represented with a positive or negative integer or rational number. This includes fractions such as, for example, "1/5" for ⅕ (U+2155 VULGAR FRACTION ONE FIFTH).

numericValueParsed
This is NOT a property from the Unicode Standard. This is a floating point version of the numericValue property (which is a string value). For example, 0.2 for ⅕ (U+2155 VULGAR FRACTION ONE FIFTH).
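A conversion of this kind can be sketched with Python's fractions module. This is one plausible way to parse the string form and is not necessarily how the API computes numericValueParsed; the function name `parse_numeric_value` is made up for the example.

```python
import unicodedata
from fractions import Fraction


def parse_numeric_value(numeric_value: str) -> float:
    """Convert a numericValue string such as "1/5", "7" or "-1/2" to a float."""
    return float(Fraction(numeric_value))


one_fifth = parse_numeric_value("1/5")
# Python's own Unicode tables report the same value for U+2155.
builtin_one_fifth = unicodedata.numeric("\u2155")
```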

Joining
joiningType

Each Arabic letter must be depicted by one of a number of possible contextual glyph forms. The appropriate form is determined on the basis of the cursive joining behavior of that character as it interacts with the cursive joining behavior of adjacent characters. In the Unicode Standard, such cursive joining behavior is formally described in terms of values of a character property called joiningType. Each Arabic character falls into one of the types listed below:

  • R Right Joining
  • L Left Joining
  • D Dual Joining
  • C Join Causing
  • U Non Joining
  • T Transparent

Note that for cursive joining scripts which are typically rendered top-to-bottom, rather than right-to-left, joiningType=L conventionally refers to bottom joining, and joiningType=R conventionally refers to top joining.

joiningGroup
The group of characters that the character belongs to in cursive joining behavior. For Arabic and Syriac characters.
joiningControl
Boolean value that indicates whether the character has specific functions for control of cursive joining and ligation.

Linebreak
lineBreak

Line-breaking class of the character. Affects whether a line break must, may, or must not appear before or after the character. The possible values are listed below:

  • AL  Ordinary Alphabetic And Symbol
  • AI  Ambiguous (Alphabetic Or Ideographic)
  • BA  Break Opportunity After
  • B2  Break Opportunity Before And After
  • BK  Mandatory Break
  • BB  Break Opportunity Before
  • CL  Closing Punctuation
  • CB  Contingent Break Opportunity
  • CR  Carriage Return
  • CM  Attached Characters And Combining Marks
  • GL  Non-breaking ("Glue")
  • EX  Exclamation/Interrogation
  • H3  Hangul LVT Syllable
  • H2  Hangul LV Syllable
  • ID  Ideographic
  • HY  Hyphen
  • IS  Infix Separator
  • IN  Inseparable
  • JT  Hangul T Jamo
  • JL  Hangul L Jamo
  • LF  Line Feed
  • JV  Hangul V Jamo
  • NS  Non Starter
  • NL  Next Line
  • OP  Opening Punctuation
  • NU  Numeric
  • PR  Prefix (Numeric)
  • PO  Postfix (Numeric)
  • SA  Complex Context (South East Asian)
  • QU  Ambiguous Quotation
  • SP  Space
  • SG  Surrogates
  • WJ  Word Joiner
  • SY  Symbols Allowing Breaks
  • ZW  Zero Width Space
  • XX  Unknown

East Asian Width
eastAsianWidth

The width of the character, in terms of East Asian writing systems that distinguish between full width, half width, and narrow. The possible values are listed in Unicode Standard Annex #11:

  • A  East Asian Ambiguous
  • F  East Asian Fullwidth
  • H  East Asian Halfwidth
  • N  Neutral (Not East Asian)
  • Na East Asian Narrow
  • W  East Asian Wide
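The same property values can be looked up locally with Python's unicodedata module. The column-counting helper below is a common convention (two columns for F and W, one for everything else) and is an assumption of this sketch, not something defined by the API; it also ignores combining marks and treats Ambiguous characters as narrow.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Estimate terminal columns: 2 for Fullwidth/Wide characters, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in text
    )


narrow = unicodedata.east_asian_width("A")          # "Na"
fullwidth = unicodedata.east_asian_width("\uFF21")  # FULLWIDTH LATIN CAPITAL LETTER A
wide = unicodedata.east_asian_width("\u4E2D")       # a CJK unified ideograph
halfwidth = unicodedata.east_asian_width("\uFF71")  # HALFWIDTH KATAKANA LETTER A
```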

Case
uppercase
Boolean value that indicates whether the character is an uppercase letter.
lowercase
Boolean value that indicates whether the character is a lowercase letter.
simpleUppercaseMapping
The uppercase form of the character, if expressible as a single character.
simpleLowercaseMapping
The lowercase form of the character, if expressible as a single character.
simpleTitlecaseMapping
The titlecase form of the character, if expressible as a single character.
simpleCaseFolding
The case-folded (lowercase) form of the character when applying simple folding, which does not change the length of a string (and may thus fail to fold some characters correctly).

Script
script
The script (writing system) to which the character primarily belongs, such as "Latin," "Greek," or "Common," which indicates a character that is used in different scripts.
scriptExtensions

Further refines the script category of a character by providing additional information about the character's usage and context. This property allows for more specific categorization of characters that may have multiple uses or are used in multiple scripts.

The script extensions property can also be used to indicate characters that are used in multiple scripts, such as characters that are used in both Latin and Cyrillic scripts.


Hangul
hangulSyllableType

Type of syllable, for characters that are Hangul (Korean) syllabic characters. The possible values are listed below:

  • NA  Not Applicable
  • L   Leading Jamo
  • V   Vowel Jamo
  • T   Trailing Jamo
  • LV  LV Syllable
  • LVT LVT Syllable

Indic
indicSyllabicCategory
Used to identify the type of syllable that a character belongs to, such as a vowel, consonant, or a combination of both.
indicMatraCategory
Used to identify the type of matra (vowel sign) associated with a character, such as a short or long vowel sign.
indicPositionalCategory
Used to identify the position of a character in a syllable, such as the initial, medial, or final position.

CJK Variants

Although Unicode encodes characters and not glyphs, the line between the two can sometimes be hard to draw, particularly in East Asia. There, thousands of years worth of writing have produced thousands of pairs which can be used more-or-less interchangeably.

To deal with this situation, the Unicode Standard has adopted a three-dimensional model for determining the relationship between ideographs, and has formal rules for when two forms may be unified. Both are described in some detail in the Unicode Standard. Briefly, however, the three-dimensional model uses the x-axis to represent meaning, and the y-axis to represent abstract shape. The z-axis is used for stylistic variations.

The traditionalVariant and simplifiedVariant fields are used in character-by-character conversions between simplified and traditional Chinese (SC and TC, respectively).

Two variation fields, semanticVariant and specializedSemanticVariant, are used to mark cases where two characters have identical and overlapping meanings, respectively.

The spoofingVariant field is used to denote a special class of variant, a spoofing variant. Spoofing variants are potentially used in bad faith to direct users to unexpected URLs, evade email filters, or otherwise deceive end-users.

For more information on CJK variants, please see UAX #38, Section 3.7.

traditionalVariant
The Unicode value(s) for the traditional Chinese variant(s) for this character.
simplifiedVariant
The Unicode value(s) for the simplified Chinese variant(s) for this character.
zVariant
The z-variants for the character, if any. Z-variants are instances where the same abstract shape has been encoded multiple times, either in error or because of source separation. Z-variant pairs also have identical semantics.
compatibilityVariant
The canonical Decomposition_Mapping value for the ideograph.
semanticVariant
The Unicode value for a semantic variant for this character. A semantic variant is an x- or y-variant with similar or identical meaning which can generally be used in place of the indicated character.
specializedSemanticVariant
The Unicode value for a specialized semantic variant for this character. The syntax is the same as for the kSemanticVariant field. A specialized semantic variant is an x- or y-variant with similar or identical meaning only in certain contexts.
spoofingVariant
The spoofing variants for the character, if any. Spoofing variants include character pairs which look similar, particularly at small point sizes, which are not already z-variants or compatibility variants.

CJK Numeric

There are three fields, accountingNumeric, otherNumeric, and primaryNumeric to indicate the numerical values an ideograph may have. Traditionally, ideographs were used both for numbers and words, and so many ideographs have (or can have) numeric values. The various kinds of numeric values are specified by these three fields.

The three numeric-value fields should have no overlap; that is, characters with an accountingNumeric value should not have an otherNumeric or primaryNumeric value as well.

accountingNumeric

The value of the character when used as an accounting numeral to prevent fraud. A numeral such as 十 (ten) is easily transformed into 千 (thousand) by adding a single stroke, so monetary documents often use an accounting form of the numeral, such as 拾 (ten), instead of the more common—and simpler—form.

Characters with this property will have a single, well-defined value, which a native reader can reasonably be expected to understand.

primaryNumeric

The value of the character when used as a numeral. Characters which have this property have numeric values that are common, and always convey the same numeric value.

For example, 千 always means “thousand.” A native reader is expected to understand the numeric value for these characters.

otherNumeric

One or more values of the character when used as a numeral. Characters with this property are rarely used for writing numbers, or have non-standard or multiple values depending on the region.

For example, 㠪 is a rare character whose meaning, “five,” would not be recognized by most native readers. An English-language equivalent is “gross,” whose numeric value, “one hundred forty-four,” is not universally understood by native readers.
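
As a rough cross-check outside this API, Python's standard `unicodedata` module can report a numeric value for the example characters above. Note the assumptions: `unicodedata` exposes a single merged numeric value and does not distinguish between accountingNumeric, primaryNumeric, and otherNumeric, and the helper name `numeric_value` is only illustrative. A minimal sketch:

```python
import unicodedata


def numeric_value(char: str):
    """Return the numeric value Python's bundled Unicode data assigns to
    a single character, or None when the character has no numeric value."""
    return unicodedata.numeric(char, None)


# The common numerals for ten and thousand, and the accounting form of ten.
for char in "十千拾":
    print(char, numeric_value(char))
```

The values come from the Unicode data bundled with the Python interpreter, which may be a different Unicode version than the one served by this API.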


CJK Readings

The properties in this group include the pronunciations for a given character in Mandarin, Cantonese, Japanese, Sino-Japanese, Korean, and Vietnamese.

Any attempt at providing a reading or set of readings for a character is bound to be fraught with difficulty, because the readings will vary over time and from place to place, even within a language. Mandarin is the official language of both the PRC and Taiwan (with some differences between the two) and is the primary language over much of northern and central China, with vast differences from place to place. Even Cantonese, the modern language covered by the Unihan database with the least geographical range, is spoken throughout Guangdong Province and in much of neighboring Guangxi Zhuang Autonomous Region, and covers four large urban centers (Guangzhou, Shenzhen, Macao, and Hong Kong). There are therefore distinct regional variations in pronunciation and vocabulary.

Indeed, even the same speaker will pronounce the same word differently depending on the listener or even the social context. This is particularly true for languages such as Cantonese, where there has been comparatively little government effort to standardize the language.

Add to this the fact that in none of these languages—the various forms of Chinese, Japanese, Korean, Vietnamese—is the syllable the fundamental unit of the language. As in the West, it’s the word, and the pronunciation of a character is tied to the word of which it is a part. In Chinese (followed by Vietnamese and Korean), the rule is one ideograph/one syllable, with most words written using multiple ideographs. In most cases, an ideograph has only one reading (or only one important reading), but there are numerous exceptions.

In Japanese, the situation is enormously more complex. Japanese has two pronunciation systems, one derived from Chinese (the on pronunciation, or Sino-Japanese), and the other from Japanese (the kun pronunciation).

The on readings derive from Chinese loan-words. They depend on factors such as when (and from which part of China) the loan-word was borrowed, and changes to Japanese since then. On readings can therefore have little obvious relationship to modern Chinese readings, and the same Chinese reading for a given kanji can be reflected in multiple on readings in Japanese. Contrary to Chinese practice, on readings may be polysyllabic.

Kun readings, on the other hand, derive from native Japanese words for which either existing kanji were adopted or new kanji coined.

The net result is that multiple readings are the rule for Japanese kanji. These multiple readings may bear no relationship to one another and are highly context-sensitive. Even a native Japanese reader may not know the correct pronunciation of a proper noun if it is written only in kanji.

mandarin
The most customary pīnyīn reading for this character. When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both.
cantonese
The most customary jyutping (Cantonese) reading for this character.
japaneseKun
The Japanese pronunciation(s) of this character in the Hepburn romanization.
japaneseOn
The Sino-Japanese pronunciation(s) of this character.
hangul
The modern Korean pronunciation(s) for this character in Hangul.
vietnamese
The character's pronunciation(s) in Quốc ngữ.

Function and Graphic
dash
Boolean value that indicates whether the character is classified as a dash. This includes characters explicitly designated as dashes and their compatibility equivalents.
hyphen
Boolean value that indicates whether the character is regarded as a hyphen. This refers to those dashes that are used to mark connections between parts of a word and to the Katakana middle dot.
quotationMark
Boolean value that indicates whether the character is used as a quotation mark in some language(s).
terminalPunctuation
Boolean value that indicates whether the character is a punctuation mark that generally marks the end of a textual unit.
sentenceTerminal
Boolean value that indicates whether the character is used to terminate a sentence.
diacritic
Boolean value that indicates whether the character is a diacritic, i.e., whether it linguistically modifies another character to which it applies. A diacritic is usually, but not necessarily, a combining character.
extender
Boolean value that indicates whether the principal function of the character is to extend the value or shape of a preceding alphabetic character.
softDotted
Boolean value that indicates whether the character contains a dot that disappears when a diacritic is placed above the character (e.g., "i" and "j" are soft dotted).
alphabetic
Boolean value that indicates whether the character is alphabetic, i.e., a letter or comparable to a letter in usage. True for characters with a generalCategory value of Lu, Ll, Lt, Lm, Lo, or Nl, and additionally for characters with the otherAlphabetic property.
math
Boolean value that indicates whether the character is mathematical. This includes characters with Sm (Symbol, math) as the General Category value, and some other characters.
hexDigit
Boolean value that indicates whether the character is used in hexadecimal numbers. This is true for ASCII hexadecimal digits and their fullwidth versions.
asciiHexDigit
Boolean value that indicates whether the character is an ASCII character used to represent hexadecimal numbers (i.e., letters A-F, a-f and digits 0-9).
defaultIgnorableCodePoint
Boolean value that indicates whether the code point should be ignored in automatic processing by default.
logicalOrderException
Boolean value that indicates whether the character belongs to the small set of characters that do not use logical order and hence require special handling in most processing.
prependedConcatenationMark
Boolean value that indicates whether the character belongs to a small class of visible format controls, which precede and then span a sequence of other characters, usually digits. These have also been known as "subtending marks", because most of them take a form which visually extends underneath the sequence of following digits.
whiteSpace
Boolean value that indicates whether the character should be treated by programming languages as a whitespace character when parsing elements. This concept does not match the more restricted whitespace concept in many programming languages, but it is a generalization of that concept to the "Unicode world."
verticalOrientation
A property used to establish a default for the correct orientation of characters when used in vertical text layout, as described in Unicode Standard Annex #50, "Unicode Vertical Text Layout".
regionalIndicator

The regional indicator symbols are a set of 26 alphabetic Unicode characters (A–Z) intended to be used to encode ISO 3166-1 alpha-2 two-letter country codes in a way that allows optional special treatment.

They are encoded in the range 🇦 (U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A) to 🇿 (U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z), within the Enclosed Alphanumeric Supplement block in the Supplementary Multilingual Plane.

These were defined as an alternative to encoding separate characters for each country flag. Although they can be displayed as Roman letters, it is intended that implementations may choose to display them in other ways, such as by using national flags.

For example, since the ISO 3166-1 alpha-2 country code for Ukraine is UA, when the characters 🇺 (U+1F1FA) and 🇦 (U+1F1E6) are placed next to each other the Ukrainian flag should be rendered: 🇺🇦.
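
The mapping from a two-letter country code to its pair of regional indicator symbols is a fixed offset from U+1F1E6. The following is a minimal sketch of that arithmetic; the function name `regional_indicators` is illustrative and not part of this API:

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A


def regional_indicators(country_code: str) -> str:
    """Map a two-letter ASCII code such as "UA" to the matching pair of
    regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected exactly two ASCII letters")
    # Each letter sits at the same offset from A as its regional indicator.
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


pair = regional_indicators("UA")
print([f"U+{ord(c):04X}" for c in pair])  # ['U+1F1FA', 'U+1F1E6']
```

Whether the pair is displayed as a flag or as two letters depends on the font and platform rendering the text.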


Emoji
emoji
Boolean value that indicates whether the character is recommended for use as emoji.
emojiPresentation
Boolean value that indicates whether the character has emoji presentation by default.
emojiModifier
Boolean value that indicates whether the character is used as an emoji modifier. Currently this includes only the skin tone modifier characters.
emojiModifierBase
Boolean value that indicates whether the character can serve as a base for emoji modifiers.
emojiComponent
Boolean value that indicates whether the character is used in emoji sequences but normally does not appear on emoji keyboards as a separate choice (e.g., keycap base characters or Regional_Indicator characters).
extendedPictographic
Boolean value that indicates whether the character is a pictographic symbol or otherwise similar in kind to characters with the Emoji property. This enables segmentation rules involving emoji to be specified stably, even in cases where an existing non-emoji pictographic symbol later comes to be treated as an emoji.

Unicode Codepoints

API Endpoints

GET /v1/codepoints/{hex}
Retrieve details of a single character

The UnicodeCodepoint resource is not an object like the other resources; it is simply a hexadecimal value that refers to a single character in the Unicode codespace.

This endpoint performs nearly the same function as the /v1/characters/-/{string} endpoint. However, sending a request for a character to the /v1/characters/-/{string} endpoint requires you to provide either the character itself or the URI encoded string representation of the character.

Since there are plenty of scenarios where it may be easier to supply the assigned codepoint for a character rather than the rendered glyph or URI-encoded value, the /v1/codepoints/{hex} endpoint allows you to request the same sets of character property groups as the /v1/characters/-/{string} endpoint.

The only difference between the two endpoints is that requests to the /v1/characters/-/{string} endpoint can retrieve data for one or more characters, while requests to the /v1/codepoints/{hex} endpoint can only be used to retrieve details of a single character.
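
To illustrate the difference between the two request styles, here is a minimal sketch that builds the path for each endpoint from the same character. The helper names are hypothetical, and the exact form accepted for {hex} (zero padding, an optional U+ prefix) is an assumption; only the path is built, since the base URL depends on where the API is hosted:

```python
from urllib.parse import quote


def codepoint_path(char: str) -> str:
    """Build a /v1/codepoints/{hex} path for a single character.
    Assumes {hex} is the bare code point padded to at least four digits."""
    if len(char) != 1:
        raise ValueError("this endpoint describes exactly one character")
    return f"/v1/codepoints/{ord(char):04X}"


def characters_path(text: str) -> str:
    """Build a /v1/characters/-/{string} path, URI-encoding the string
    as UTF-8 so it is safe to place in a URL."""
    return f"/v1/characters/-/{quote(text, safe='')}"


print(codepoint_path("∑"))   # /v1/codepoints/2211
print(characters_path("∑"))  # /v1/characters/-/%E2%88%91
```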

Unicode Blocks

API Endpoints

GET /v1/blocks/{name}
Retrieve one or more Block(s)
GET /v1/blocks
List Blocks
GET /v1/blocks/search
Search Blocks

The UnicodeBlock Object

The UnicodeBlock object represents a grouping of characters within the Unicode encoding space. Each block is generally, but not always, meant to supply glyphs used by one or more specific languages, or in some general application area such as mathematics, surveying, decorative typesetting, social forums, etc.

Each block is a uniquely named, contiguous, non-overlapping range of code points, containing a multiple of 16 code points (additionally, the starting codepoint for each block is a multiple of 16). A block may contain unassigned code points, which are reserved.

The UnicodeBlock object exposes a small set of properties such as the official name of the block, the range of code points assigned to the block and the total number of defined characters within the block:

UnicodeBlock Properties
id
This is NOT a property from the Unicode Standard. This is an integer value used to navigate within a paginated list of UnicodeBlock objects. The first block (U+0000..U+007F BASIC LATIN) has id=1 and each block is numbered sequentially in order of starting codepoint.
name
Unicode blocks are identified by unique names, which use only ASCII characters and are usually descriptive of the nature of the symbols (in English), such as "Tibetan" or "Supplemental Arrows-A".
plane
A string value equal to the abbreviated name of the Unicode Plane containing the block (e.g., "BMP" for Basic Multilingual Plane).
start
A string value equal to the first codepoint allocated to the block, expressed in U+hhhhhh format.
finish
A string value equal to the last codepoint allocated to the block, expressed in U+hhhhhh format.
total_allocated
An integer value equal to the total number of characters (defined or reserved) contained in the block.
total_defined
An integer value equal to the total number of characters with defined names, glyphs, etc. in the block.
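
The start and finish strings are enough to derive the size of a block and to check the multiple-of-16 rules described earlier. A minimal sketch, assuming the U+hhhh string format described above; the helper names are illustrative and not part of the API:

```python
def parse_codepoint(value: str) -> int:
    """Convert a string such as "U+007F" to the integer 0x7F."""
    if not value.upper().startswith("U+"):
        raise ValueError(f"not in U+hhhh format: {value!r}")
    return int(value[2:], 16)


def allocated_size(start: str, finish: str) -> int:
    """Number of code points in the inclusive range start..finish."""
    return parse_codepoint(finish) - parse_codepoint(start) + 1


def follows_block_rules(start: str, finish: str) -> bool:
    """Check that the range starts on a multiple of 16 and that its
    length is a multiple of 16."""
    return (parse_codepoint(start) % 16 == 0
            and allocated_size(start, finish) % 16 == 0)


# Basic Latin spans U+0000..U+007F.
print(allocated_size("U+0000", "U+007F"))       # 128
print(follows_block_rules("U+0000", "U+007F"))  # True
```

For a block returned by the API, the result of `allocated_size` would be expected to match total_allocated, while total_defined can be smaller because reserved code points are excluded.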

Unicode Planes

API Endpoints

GET /v1/planes/{number}
Retrieve one or more Plane(s)
GET /v1/planes
List Planes

The UnicodePlane Object

The UnicodePlane object represents a contiguous group of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16. The first two positions of a character's codepoint value (U+hhhhhh) correspond to the plane number in hex format (possible values 0x00...0x10).

Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly used characters. The higher planes 1 through 16 are called "supplementary planes". The last code point in plane 16 is the last code point in Unicode, U+10FFFF.
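
Because every plane holds exactly 2^16 code points, the plane number of a character can be computed directly from its code point. A minimal sketch of that arithmetic; the function names are illustrative and not part of the API:

```python
PLANE_SIZE = 1 << 16         # 65,536 code points per plane
LAST_CODE_POINT = 0x10FFFF   # last code point in plane 16


def plane_number(code_point: int) -> int:
    """Return the plane (0..16) that contains the given code point."""
    if not 0 <= code_point <= LAST_CODE_POINT:
        raise ValueError("code point outside the Unicode codespace")
    # Dropping the low 16 bits leaves the plane number.
    return code_point >> 16


def plane_range(number: int) -> tuple[int, int]:
    """Return the first and last code point of a plane."""
    if not 0 <= number <= 16:
        raise ValueError("plane number must be between 0 and 16")
    start = number * PLANE_SIZE
    return start, start + PLANE_SIZE - 1


print(plane_number(ord("A")))  # 0
print(plane_number(0x1F1E6))   # 1
first, last = plane_range(16)
print(f"U+{first:04X}..U+{last:04X}")  # U+100000..U+10FFFF
```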

UnicodePlane Properties
number
The official number that identifies the range of codepoints within a plane. The first two positions of a character's codepoint value (U+hhhhhh) correspond to the plane number in hex format (possible values 0x00...0x10). This is a decimal value, however, with possible values 0...16.
name

The official name of a plane, according to the Unicode Standard. As of version 15.0.0, seven of the total 17 planes have official names (the official abbreviation for each plane is also given in parentheses):

  1. Basic Multilingual Plane (BMP)
  2. Supplementary Multilingual Plane (SMP)
  3. Supplementary Ideographic Plane (SIP)
  4. Tertiary Ideographic Plane (TIP)
  5. Supplementary Special-purpose Plane (SSP)
  6. Supplementary Private Use Area-A (SPUA-A)
  7. Supplementary Private Use Area-B (SPUA-B)

The codepoints within Planes 4-13 (U+40000...U+DFFFF) are unassigned, and these planes currently have no official name/abbreviation.

abbreviation
An acronym that identifies the plane. The list in the previous definition contains the abbreviation for each plane along with the official name.
start
A string value equal to the first codepoint allocated to the plane, expressed in U+hhhhhh format.
finish
A string value equal to the last codepoint allocated to the plane, expressed in U+hhhhhh format.
total_allocated
An integer value equal to the total number of characters (defined or reserved) contained in the plane (always 2^16 = 65,536).
total_defined
An integer value equal to the total number of characters with defined names, glyphs, etc. in the plane.