Normalize decorative Unicode letters not handled by NFKC #19608
bramd wants to merge 4 commits into nvaccess:master
Extend unicodeNormalize to apply a supplementary translation table for
decorative Unicode letter characters that unicodedata.normalize("NFKC")
does not decompose: negative squared Latin capitals (U+1F170-U+1F189),
negative circled Latin capitals (U+1F150-U+1F169), and regional
indicator symbol letters (U+1F1E6-U+1F1FF).
efa23a1 to
6282518
Should this be reported to NFKC so it eventually gets fixed upstream?
…ormalization
Remove Regional Indicator Symbol Letters (U+1F1E6-U+1F1FF) from the supplementary normalization table, as decomposing them breaks flag emoji (e.g. 🇺🇸 would become "US" instead of being recognized as a flag).
Exclude 4 Negative Squared codepoints that have emoji semantics: U+1F170 (🅰 blood type A), U+1F171 (🅱 blood type B), U+1F17E (🅾 blood type O), U+1F17F (🅿 P button). The remaining 22 squared letters are still decomposed.
Good question. I have looked further into the origin of these characters. More importantly, the Unicode Normalization Stability Policy guarantees that once a character is encoded without a decomposition mapping, it can never receive one. Adding decomposition mappings retroactively would break the guarantee that normalized strings remain stable across Unicode versions.
Additionally, several codepoints in the Negative Squared range have emoji semantics (blood type A, blood type B, blood type O, P button). Decomposing those to plain letters would destroy their meaning. I've excluded those 4 emoji codepoints from the supplementary table in the latest commit and removed the Regional Indicator Symbols (U+1F1E6-U+1F1FF) entirely, since decomposing those would break flag emoji (e.g. 🇺🇸 becoming "US"). The regional indicator symbols were added specifically so that flags could be represented without encoding each flag separately in the standard. They probably won't render properly as single letters, so I don't expect them to be used that way anyway.
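To make the narrowed scope concrete, here is a minimal sketch of what the table could look like after these exclusions. It is not the code from this PR: the function and constant names are hypothetical, and only the codepoint ranges and the four excluded codepoints are taken from the discussion above.

```python
# Hypothetical names for illustration; not the helper added in this PR.
NEGATIVE_CIRCLED_A = 0x1F150  # range U+1F150..U+1F169
NEGATIVE_SQUARED_A = 0x1F170  # range U+1F170..U+1F189

# Codepoints with emoji semantics that must stay untouched:
# blood type A, blood type B, blood type O, P button.
EMOJI_EXCLUSIONS = frozenset({0x1F170, 0x1F171, 0x1F17E, 0x1F17F})


def build_supplementary_table() -> dict[int, str]:
    """Map each decorative capital to its plain Latin letter, skipping emoji."""
    table: dict[int, str] = {}
    for start in (NEGATIVE_CIRCLED_A, NEGATIVE_SQUARED_A):
        for offset in range(26):
            codepoint = start + offset
            if codepoint not in EMOJI_EXCLUSIONS:
                table[codepoint] = chr(ord("A") + offset)
    return table


TABLE = build_supplementary_table()

squared_c = "\U0001F172"            # 🅲, decomposed to "C"
blood_type_a = "\U0001F170"         # 🅰, excluded, left as-is
us_flag = "\U0001F1FA\U0001F1F8"    # 🇺🇸, regional indicators are not in the table

print(squared_c.translate(TABLE))
print(blood_type_a.translate(TABLE) == blood_type_a)
print(us_flag.translate(TABLE) == us_flag)
```

With this shape, regional indicators need no special handling: they are simply absent from the table, so `str.translate()` passes them through.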
I think if we are going to include this, we need to specifically list all additional processing we do on top of NFKC in the user guide.
Otherwise, I think this looks good.
Link to issue number:
Closes #17120
Summary of the issue:
NVDA's Unicode normalization (NFKC) does not decompose certain decorative Unicode letter characters, causing them to be read as their full Unicode name or as silence. This affects negative squared Latin capital letters (U+1F170–U+1F189), negative circled Latin capital letters (U+1F150–U+1F169), and regional indicator symbol letters (U+1F1E6–U+1F1FF).
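The gap can be reproduced outside NVDA with Python's standard `unicodedata` module. The snippet below is only a demonstration of the behavior described in this summary; the variable names are illustrative.

```python
import unicodedata

circled_a = "\u24B6"               # Ⓐ CIRCLED LATIN CAPITAL LETTER A
negative_squared_c = "\U0001F172"  # 🅲 NEGATIVE SQUARED LATIN CAPITAL LETTER C
negative_circled_a = "\U0001F150"  # 🅐 NEGATIVE CIRCLED LATIN CAPITAL LETTER A

# The ordinary circled letter has a compatibility decomposition,
# so NFKC folds it to its base letter.
print(unicodedata.normalize("NFKC", circled_a))

# The negative variants have no decomposition mapping, so NFKC
# returns them unchanged.
for ch in (negative_squared_c, negative_circled_a):
    unchanged = unicodedata.normalize("NFKC", ch) == ch
    print(unicodedata.name(ch), "unchanged by NFKC:", unchanged)
```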
Description of user facing changes:
When Unicode normalization is enabled, decorative Unicode letters such as negative squared (🅰🅱🅲), negative circled, and regional indicator symbol characters are now correctly read as their base Latin letters (A, B, C, etc.) in both speech and braille.
Description of developer facing changes:
- _buildSupplementaryNormalizationTable() builds a translation table for the three affected Unicode ranges.
- unicodeNormalize() applies the supplementary table via str.translate() before standard NFKC normalization.
- isUnicodeNormalized() detects characters in the supplementary table.
- UnicodeNormalizationOffsetConverter.__init__() is adjusted so the braille code path also handles these characters correctly.
Description of development approach:
The standard unicodedata.normalize("NFKC", ...) does not define decompositions for these Supplementary Multilingual Plane characters. A supplementary translation table maps each codepoint to its plain Latin letter. This table is applied via str.translate() before NFKC normalization in both the speech path (unicodeNormalize()) and the braille path (UnicodeNormalizationOffsetConverter). Since all mappings are single-codepoint to single-codepoint, the existing offset converter logic handles them correctly without changes to the offset mapping algorithm.
Testing strategy:
- test_textUtils unit tests pass.
- unicodeNormalize() correctly maps all three character ranges to A-Z.
- isUnicodeNormalized() returns False for supplementary characters.
Known issues with pull request:
None.
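As a rough sketch of the development approach described above (translate first, then NFKC, plus a matching normalization check), something along these lines would work in plain Python. The names `SUPPLEMENTARY_TABLE`, `normalize` and `is_normalized` are hypothetical stand-ins, not NVDA's actual functions, and the table covers only the two full letter ranges for brevity, ignoring the emoji exclusions raised in review.

```python
import unicodedata

# Illustrative table: negative circled (U+1F150..) and negative squared
# (U+1F170..) capitals, each mapped to a single plain Latin letter.
SUPPLEMENTARY_TABLE = {
    start + i: chr(ord("A") + i)
    for start in (0x1F150, 0x1F170)
    for i in range(26)
}


def normalize(text: str) -> str:
    """Apply the supplementary table, then standard NFKC normalization."""
    return unicodedata.normalize("NFKC", text.translate(SUPPLEMENTARY_TABLE))


def is_normalized(text: str) -> bool:
    """Treat text as unnormalized if it holds any supplementary character."""
    if any(ord(ch) in SUPPLEMENTARY_TABLE for ch in text):
        return False
    return unicodedata.is_normalized("NFKC", text)


decorated = "\U0001F177\U0001F178"  # 🅷🅸
translated = decorated.translate(SUPPLEMENTARY_TABLE)

print(normalize(decorated))
print(len(translated) == len(decorated))
print(is_normalized(decorated), is_normalized(normalize(decorated)))
```

Because every entry maps one codepoint to one codepoint, the translation step never changes the string length, which is why an offset converter that tracks positions through NFKC would not need a different mapping algorithm for this step.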
Code Review Checklist: