Overcoming The Challenges Of Character Encoding To Deliver Effective Data Migration
By Brian Murphy, Datadobi

Filesystem deployments are becoming increasingly complex as organizations around the world develop environments approaching petabyte scale. When the time comes to migrate files, the requirements tied to these systems are equally detailed and nuanced. This is especially the case when migrating across different storage platforms and vendors.
It’s not uncommon, for example, for filesystem hardware technology updates to reveal many years of poor practice, often leading to very complicated datasets. In addition, there’s character encoding: the method of mapping binary data to the characters it represents. Taken together, these issues can significantly add to the complexity of data migration processes.
In the case of character encoding, there has never been a single character set for all characters in use around the world. This isn’t at all surprising given the sheer number and variety of people active on the internet and the 6,500+ languages in use around the globe. While standards have been defined, ISO-8859 being one example, the problem is that every region has had its own character set, each supporting a different set of characters and mapping those characters to different numbers.
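To illustrate how confusion and ambiguity can take hold: the Western European standard (ISO-8859-1) maps the value 216 to the Ø character used in Danish and Norwegian, whereas the Central European standard (ISO-8859-2) maps 216 to the Ř character used in Czech. The same stored byte therefore displays as a different character depending on which standard the reader assumes.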
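The ambiguity can be reproduced in a few lines. The following is a minimal sketch in Python, using only the standard library codecs; the variable names are illustrative and not taken from any migration tool.

```python
# One byte, value 216 (0xD8), as it might be stored in a legacy file name.
raw = bytes([216])

# Interpret the same byte under two different ISO-8859 parts.
western = raw.decode("iso-8859-1")   # Western European
central = raw.decode("iso-8859-2")   # Central European

print(western)  # Ø (U+00D8)
print(central)  # Ř (U+0158)

# The stored data is identical; only the interpretation differs.
print(western == central)  # False
```

Nothing in the byte itself says which interpretation is correct, which is why the encoding has to be known or configured from outside the data.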
In practical terms, when IT systems create files and directories on filesystems, the filesystem encodes each name using a particular character encoding. When it comes to network storage systems, it is extremely important to store file and directory names on the filesystem using the encoding that the filesystem expects. These filesystem technologies typically need to be capable of converting file and directory names to different encodings, depending on the client system and the protocol being used to access the data.
This is evident in the behavior of protocol clients: an NFSv4 client always expects names in UTF-8, whereas an SMBv2 client always expects UTF-16. As a result, it’s not uncommon for organizations to have datasets written by many different clients, with no standard encoding parameters implemented in the environment.
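To make the difference concrete, here is a small Python sketch that encodes one file name both ways. The file name is made up for illustration, and little-endian UTF-16 is used as the byte order that SMB puts on the wire.

```python
# A hypothetical file name containing one non-ASCII character.
name = "Řada.txt"

utf8_bytes = name.encode("utf-8")       # what an NFSv4 client expects
utf16_bytes = name.encode("utf-16-le")  # UTF-16, little-endian byte order

print(utf8_bytes.hex(" "))   # c5 98 61 64 61 2e 74 78 74
print(utf16_bytes.hex(" "))  # 58 01 52 00 61 00 64 00 61 00 2e 00 74 00 78 00 74 00

print(len(name), len(utf8_bytes), len(utf16_bytes))  # 8 9 16
```

The same eight-character name occupies 9 bytes in one encoding and 16 in the other, so a storage system serving both protocols has to convert names rather than pass the bytes through unchanged.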
When it comes to character-encoding issues related to filesystem migrations, it’s always best practice to use a data migration tool to clean up those datasets and prevent migration errors and potential data access issues. The key is a purpose-built migration tool with advanced features that allow users to dictate character-encoding parameters at the migration path, as well as configure a fallback encoding at the proxy layer (data mover). Features like this allow users to overcome complicated encoding issues during their filesystem migrations and clean up legacy datasets.
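The idea of a fallback encoding can be sketched generically. The Python below is an illustration only and does not describe how any particular migration product works: the function name `decode_name`, its parameters, and the choice of UTF-8 as primary with ISO-8859-1 as fallback are all assumptions made for the example.

```python
from typing import Tuple


def decode_name(raw: bytes,
                primary: str = "utf-8",
                fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode a raw file name, returning (text, encoding_used).

    Hypothetical helper: try the primary encoding first and, if the
    bytes are not valid in it, reinterpret them with the fallback.
    """
    try:
        return raw.decode(primary), primary
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


# A name written by a UTF-8 client and one written by a legacy
# ISO-8859-1 client; both are meant to read "Ø.txt".
modern = b"\xc3\x98.txt"
legacy = b"\xd8.txt"

print(decode_name(modern))  # ('Ø.txt', 'utf-8')
print(decode_name(legacy))  # ('Ø.txt', 'iso-8859-1')

# Once decoded, both names can be rewritten in one target encoding.
print(decode_name(legacy)[0].encode("utf-8") == modern)  # True
```

A fallback like this only gives the right result when the legacy encoding is actually known. As the ISO-8859-1 and ISO-8859-2 example shows, choosing the wrong fallback decodes without error but produces the wrong character.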
About The Author
Brian Murphy is senior systems engineer at Datadobi.