Normalize decorative Unicode letters not handled by NFKC#19608

Open
bramd wants to merge 4 commits into nvaccess:master from bramd:fix/unicode-normalization-squared-17120

@bramd (Contributor) commented Feb 12, 2026

Link to issue number:

Closes #17120

Summary of the issue:

NVDA's Unicode normalization (NFKC) does not decompose certain decorative Unicode letter characters, causing them to be read as their full Unicode name or as silence. This affects negative squared Latin capital letters (U+1F170–U+1F189), negative circled Latin capital letters (U+1F150–U+1F169), and regional indicator symbol letters (U+1F1E6–U+1F1FF).

Description of user facing changes:

When Unicode normalization is enabled, decorative Unicode letters such as negative squared (🅰🅱🅲), negative circled, and regional indicator symbol characters are now correctly read as their base Latin letters (A, B, C, etc.) in both speech and braille.

Description of developer facing changes:

  • Added _buildSupplementaryNormalizationTable() which builds a translation table for the three affected Unicode ranges.
  • Extended unicodeNormalize() to apply the supplementary table via str.translate() before standard NFKC normalization.
  • Extended isUnicodeNormalized() to detect characters in the supplementary table.
  • Applied the supplementary table in UnicodeNormalizationOffsetConverter.__init__() so the braille code path also handles these characters correctly.
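
The changes listed above could look roughly like the following. This is a minimal sketch and not the code from the PR: the names `build_supplementary_table`, `normalize`, and `is_normalized` are hypothetical stand-ins for `_buildSupplementaryNormalizationTable()`, `unicodeNormalize()`, and `isUnicodeNormalized()`, and the table covers the three ranges named in the summary as originally proposed.

```python
import unicodedata


def build_supplementary_table() -> dict[int, str]:
    """Map each decorative capital letter to its plain Latin letter (hypothetical helper)."""
    table: dict[int, str] = {}
    # First codepoint of each 26-letter range named in the PR summary:
    # negative circled, negative squared, regional indicator symbols.
    for start in (0x1F150, 0x1F170, 0x1F1E6):
        for offset in range(26):
            table[start + offset] = chr(ord("A") + offset)
    return table


SUPPLEMENTARY_TABLE = build_supplementary_table()


def normalize(text: str) -> str:
    """Apply the supplementary table first, then standard NFKC."""
    return unicodedata.normalize("NFKC", text.translate(SUPPLEMENTARY_TABLE))


def is_normalized(text: str) -> bool:
    """Text containing a supplementary character is treated as not normalized."""
    if any(ord(ch) in SUPPLEMENTARY_TABLE for ch in text):
        return False
    return unicodedata.is_normalized("NFKC", text)
```

With this sketch, `normalize("\U0001F170\U0001F171\U0001F172")` yields `"ABC"`, and characters that NFKC already handles, such as circled Ⓐ, still pass through the standard normalization step.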

Description of development approach:

The standard unicodedata.normalize("NFKC", ...) does not define decompositions for these Supplementary Multilingual Plane characters. A supplementary translation table maps each codepoint to its plain Latin letter. This table is applied via str.translate() before NFKC normalization in both the speech path (unicodeNormalize()) and the braille path (UnicodeNormalizationOffsetConverter). Since all mappings are single-codepoint to single-codepoint, the existing offset converter logic handles them correctly without changes to the offset mapping algorithm.
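
To illustrate the two claims in this paragraph, the snippet below checks that NFKC leaves a negative squared letter untouched while still folding circled and fullwidth letters, and that a one-to-one translation keeps the string length unchanged. It is a sketch using only Python's standard `unicodedata` module; lengths are measured in Python code points, which is an assumption about how offsets are counted and may not match every code path in NVDA.

```python
import unicodedata

negative_squared_c = "\U0001F172"  # 🅲, no compatibility decomposition
circled_a = "\u24B6"  # Ⓐ, decomposes under NFKC
fullwidth_a = "\uFF21"  # fullwidth A, decomposes under NFKC

nfkc_squared = unicodedata.normalize("NFKC", negative_squared_c)
nfkc_circled = unicodedata.normalize("NFKC", circled_a)
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth_a)

# A single-codepoint to single-codepoint table, as described above.
table = {0x1F172: "C"}
original = "a\U0001F172b"
translated = original.translate(table)

# Every character keeps its index, so no offset shifts are introduced.
same_length = len(translated) == len(original)
```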

Testing strategy:

  • All 89 existing test_textUtils unit tests pass.
  • Manual verification that unicodeNormalize() correctly maps all three character ranges to A-Z.
  • Manual verification that isUnicodeNormalized() returns False for supplementary characters.
  • Regression check that standard NFKC characters (circled Ⓐ, fullwidth A) still normalize correctly.
  • Manual testing of speech and braille output on a Unicode character table in Firefox.

Known issues with pull request:

None.

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

Extend unicodeNormalize to apply a supplementary translation table for
decorative Unicode letter characters that unicodedata.normalize("NFKC")
does not decompose: negative squared Latin capitals (U+1F170-U+1F189),
negative circled Latin capitals (U+1F150-U+1F169), and regional
indicator symbol letters (U+1F1E6-U+1F1FF).
@bramd bramd force-pushed the fix/unicode-normalization-squared-17120 branch from efa23a1 to 6282518 Compare February 12, 2026 18:59
@bramd bramd marked this pull request as ready for review February 15, 2026 14:30
@bramd bramd requested a review from a team as a code owner February 15, 2026 14:30
@bramd bramd requested a review from SaschaCowley February 15, 2026 14:30
@seanbudd (Member)

should this be reported to NFKC so it eventually gets fixed upstream?

…ormalization

Remove Regional Indicator Symbol Letters (U+1F1E6-U+1F1FF) from the
supplementary normalization table as decomposing them breaks flag emoji
(e.g. 🇺🇸 would become "US" instead of being recognized as a flag).

Exclude 4 Negative Squared codepoints that have emoji semantics:
U+1F170 (🅰 blood type A), U+1F171 (🅱 blood type B),
U+1F17E (🅾 blood type O), U+1F17F (🅿 P button).
The remaining 22 squared letters are still decomposed.
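
A table matching this commit message might be built as follows. This is a hedged sketch: `EMOJI_SQUARED` and `build_revised_table` are hypothetical names, and only the codepoints and counts come from the commit message.

```python
# The four negative squared codepoints with emoji semantics, per the commit message.
EMOJI_SQUARED = frozenset({0x1F170, 0x1F171, 0x1F17E, 0x1F17F})


def build_revised_table() -> dict[int, str]:
    """Negative circled A-Z plus the 22 non-emoji negative squared letters."""
    table: dict[int, str] = {}
    # Regional indicator symbols (U+1F1E6-U+1F1FF) are deliberately left out.
    for start in (0x1F150, 0x1F170):
        for offset in range(26):
            codepoint = start + offset
            if codepoint in EMOJI_SQUARED:
                continue  # keep 🅰 🅱 🅾 🅿 as emoji
            table[codepoint] = chr(ord("A") + offset)
    return table


REVISED_TABLE = build_revised_table()
```

Under these assumptions the table has 48 entries: 26 negative circled letters and 22 negative squared letters.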
@bramd (Contributor, Author) commented Feb 19, 2026

should this be reported to NFKC so it eventually gets fixed upstream?

Good question. I have looked further into the origin of these characters.
The characters (Negative Circled and Negative Squared Latin Capital Letters) were added in Unicode 6.0 (2010), sourced from the Japanese ARIB broadcasting standard. Unlike the older Circled Latin Letters (added in Unicode 1.1), which were given <circle> compatibility decompositions because they were considered typographic variants, the newer characters were encoded as autonomous symbols. It seems the policy was not to offer decompositions for characters imported from external standards.
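
The difference in decomposition data can be inspected directly with Python's standard `unicodedata.decomposition()`, which returns the raw mapping from the Unicode Character Database or an empty string when none is defined. A small sketch (the sample labels are only for illustration):

```python
import unicodedata

samples = {
    "circled A (U+24B6)": "\u24B6",
    "negative circled A (U+1F150)": "\U0001F150",
    "negative squared C (U+1F172)": "\U0001F172",
}

# Look up the decomposition mapping for each sample character.
decompositions = {
    label: unicodedata.decomposition(char) for label, char in samples.items()
}

for label, mapping in decompositions.items():
    print(f"{label}: {mapping or '(no decomposition)'}")
```

Only the older circled letter carries a `<circle>` mapping; the two newer characters report no decomposition, which is why NFKC leaves them alone.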

More importantly, the Unicode Normalization Stability Policy guarantees that once a character is encoded without a decomposition mapping, it can never receive one. Adding decomposition mappings retroactively would break the guarantee that normalized strings remain stable across Unicode versions.

Additionally, several codepoints in the Negative Squared range have emoji semantics (blood type A, blood type B, blood type O, P button). Decomposing those to plain letters would destroy their meaning. I've excluded those 4 emoji codepoints from the supplementary table in the latest commit, along with removing the Regional Indicator Symbols (U+1F1E6 - U+1F1FF) entirely since decomposing those would break flag emoji (e.g. 🇺🇸 becoming "US"). These regional symbols were specifically added to create flags without encoding many different flags in the standard and probably won't render properly as single characters/letters, so I don't expect them to be used as such anyway.
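
To make the flag concern concrete: a flag emoji is a pair of regional indicator symbols, so mapping each symbol to a plain letter turns the flag into a two-letter country code. The sketch below uses hypothetical helpers (`flag_for`, `decompose_regional_indicators`) purely to demonstrate the effect; it is not code from the PR.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_for(country_code: str) -> str:
    """Build a flag sequence from a two-letter country code (hypothetical helper)."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


def decompose_regional_indicators(text: str) -> str:
    """What the original table did: map each regional indicator to a plain letter."""
    table = {REGIONAL_INDICATOR_A + i: chr(ord("A") + i) for i in range(26)}
    return text.translate(table)


us_flag = flag_for("US")  # two codepoints that render as one flag where supported
flattened = decompose_regional_indicators(us_flag)  # the flag is lost, leaving "US"
```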

@seanbudd seanbudd added the blocked/needs-product-decision A product decision needs to be made. Decisions about NVDA UX or supported use-cases. label Feb 25, 2026
@seanbudd seanbudd requested review from seanbudd and removed request for SaschaCowley February 25, 2026 23:08
@seanbudd seanbudd assigned seanbudd and unassigned SaschaCowley Feb 25, 2026
@seanbudd (Member)

I think if we are going to include this, we need to specifically list all additional processing we do on top of NFKC in the user guide.

seanbudd previously approved these changes Feb 25, 2026
@seanbudd seanbudd self-requested a review February 25, 2026 23:54
@seanbudd seanbudd dismissed their stale review February 25, 2026 23:55

accident

@seanbudd (Member)

Otherwise, I think this looks good

Labels

blocked/needs-product-decision A product decision needs to be made. Decisions about NVDA UX or supported use-cases.

Development

Successfully merging this pull request may close these issues.

Unicode "negative squared latin" letters not picked up by normalisation algorithm