Nicolas Bouliane

File names, Unicode normalization problems, and how to fix them

There are many ways to represent the same accented character in Unicode.

For example, Ü can be represented as a single code point, Ü (U+00DC), or as a U followed by a combining diaeresis (U+0055 plus U+0308).

They look like the same character, but when compared, they are not equivalent:

>>> '\u00DC'
'Ü'
>>> '\u0055\u0308'
'Ü'
>>> '\u00DC' == '\u0055\u0308'
False

To avoid problems, we pick one way of representing accented characters, and we stick to it. This is called normalization. The two normalization forms that matter here are NFC, which prefers composed characters (like U+00DC), and NFD, which prefers decomposed characters (like U+0055 plus U+0308). Unicode also defines two compatibility forms, NFKC and NFKD, but they are not relevant to this problem.
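To make the difference concrete, here is a small sketch using Python's standard unicodedata module. The helper name describe is my own, for illustration only; it lists the code points that make up a string.

```python
from unicodedata import name, normalize


def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [(f"U+{ord(char):04X}", name(char)) for char in text]


composed = normalize("NFC", "U\u0308")   # NFC merges the pair into one code point
decomposed = normalize("NFD", "\u00DC")  # NFD splits it into base letter + mark

print(len(composed), describe(composed))
print(len(decomposed), describe(decomposed))
```

The composed string has length 1 and the decomposed string has length 2, even though both render as the same character.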

Different software and filesystems use different normalization forms. This can lead to problems. For example, I used Syncthing to back up files, and it converted NFC filenames to NFD. This broke All About Berlin, which looked for pages like ./glossary/Bürgergeld.md that no longer existed under that name. The ü in Bürgergeld.md was represented one way in the code and another way in the filename.

I wrote a small script to fix this. It’s further down in this post.

Python and Unicode normalization

In Python, the unicodedata module from the standard library handles normalization. You can use unicodedata.normalize to convert between normalization forms.

>>> from unicodedata import normalize
>>> '\u0055\u0308' == normalize('NFD', '\u00DC')
True
>>> '\u00DC' == normalize('NFC', '\u0055\u0308')
True
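If you only need to compare two strings, one option is to normalize both sides before comparing. The sketch below assumes NFC as the common form; same_text is a hypothetical helper name, not part of unicodedata. It also uses unicodedata.is_normalized, which requires Python 3.8 or newer.

```python
from unicodedata import is_normalized, normalize


def same_text(a, b):
    """Compare two strings after converting both to NFC."""
    return normalize("NFC", a) == normalize("NFC", b)


# The composed and decomposed spellings now compare as equal
print(same_text("B\u00fcrgergeld.md", "Bu\u0308rgergeld.md"))

# is_normalized checks a string's form without converting it
print(is_normalized("NFC", "Bu\u0308rgergeld.md"))
```

This is the kind of check that would have kept a lookup like ./glossary/Bürgergeld.md working regardless of which form the filename was stored in.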

After Syncthing borked my files with Unicode characters in their names, I wrote this short script to fix it. It converts the file names back to NFC.

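#!/usr/bin/env python3
from pathlib import Path
from unicodedata import normalize

# Assumption: fix every file under the current directory. Adjust as needed.
list_of_files = [path for path in Path('.').rglob('*') if path.is_file()]

for file in list_of_files:
    current_form = file.name
    normalized_form = normalize('NFC', current_form)
    if current_form != normalized_form:
        # Rename in place, keeping the file in the same directory
        file.rename(file.with_name(normalized_form))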
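Before renaming anything, it can help to preview what would change. The sketch below is one way to do that, not part of the original script: plan_renames is a hypothetical helper that takes plain file names (assumed to be from a single directory) and reports which ones would be renamed, plus any target names that would clash with another file.

```python
from unicodedata import normalize


def plan_renames(names, form="NFC"):
    """Return (renames, collisions) for the given file names.

    renames: list of (old_name, new_name) pairs that need renaming.
    collisions: set of target names shared by two different original names.
    """
    renames = []
    collisions = set()
    targets = {}  # normalized name -> first original name that maps to it
    for old in names:
        new = normalize(form, old)
        if new in targets and targets[new] != old:
            collisions.add(new)
        targets.setdefault(new, old)
        if new != old:
            renames.append((old, new))
    return renames, collisions


# Example: one decomposed (NFD) name and one plain ASCII name
renames, collisions = plan_renames(["Bu\u0308rgergeld.md", "Anmeldung.md"])
for old, new in renames:
    print(f"{old!r} -> {new!r}")
print("collisions:", collisions)
```

A collision means both the NFC and NFD spellings of a name already exist side by side, in which case a blind rename could overwrite one of the files.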

Syncthing and Unicode normalization

By default, Syncthing automatically fixes Unicode normalization errors in file names. In my case, it kept renaming the files to the “wrong” form for All About Berlin. To disable this feature, set autoNormalize to false in your config.xml.