Have you ever opened a CSV file and found that parts of it contain junk characters? Diacritics and other special characters are often affected -- but they are not necessarily to blame!

It usually happens when a CSV file is opened in Microsoft Excel. Let me explain.

What is a file encoding?

When a text file is saved, the characters on the screen are stored as numbers, with each number representing a particular character. A file encoding tells the computer which character to display for each number. There are different encodings, which means the same number can produce different characters if it is interpreted using different encodings.

There are many different file encodings. Common ones include UTF-8, ASCII and, most annoyingly, Windows-1252.

Often, encodings share the same number-character pairings for the basic letters and numbers (A-Z and 0-9), so these will usually display correctly regardless of the encoding used to decode them. However, the number-character pairings for diacritics and other special characters are often different between encodings. The table below shows the same byte values and the characters they represent when interpreted as UTF-8 and as Windows-1252.

  Byte values       UTF-8    Windows-1252
  0xC3 0xA9         é        Ã©
  0xE2 0x82 0xAC    €        â‚¬
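The mismatch is easy to reproduce. As a minimal sketch in Python, here are the same bytes decoded with two different encodings:

```python
# The byte sequence below is the UTF-8 encoding of "café".
raw = "café".encode("utf-8")        # b'caf\xc3\xa9'

print(raw.decode("utf-8"))          # café  - the intended text
print(raw.decode("cp1252"))         # cafÃ© - same bytes misread as Windows-1252
```

The basic letters survive; only the accented character is mangled, exactly as the table suggests.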

You can start to see where things can go wrong.

Why do I care?

If you save a file in UTF-8 and subsequently open it in UTF-8, everything is okay. If you save a file in UTF-8 and read it using any other encoding, there is no guarantee that all of the characters will display correctly. This is recoverable as long as you don't save the data -- the data itself is not corrupted; your program is just using the wrong encoding to display it. However, if you save the data while it is misread, it will be genuinely corrupted.

It's especially dangerous because some characters are displayed as whitespace (i.e. they are invisible) but are actually represented by different numbers. Most programming languages and text-searching tools (find-and-replace, regular expressions) treat them as "real" characters rather than whitespace. You may corrupt the data and not even realise it.
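The no-break space (U+00A0) is a classic example -- it looks identical to a normal space on screen, but comparisons and searches see a different character:

```python
import re

normal = "John Smith"
sneaky = "John\u00a0Smith"   # contains a no-break space

print(normal == sneaky)                    # False
print(re.search(r"John Smith", sneaky))    # None - the literal space doesn't match
```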

What goes wrong?

In Windows, if you double-click a CSV file, Microsoft Excel will open it using the Windows-1252 encoding. You should never do this unless you are 100% certain that the data is actually encoded in Windows-1252. (Note that the encoding Excel uses to open the file probably depends on your computer's regional settings.) It's very counter-intuitive behaviour -- but hey, it will definitely keep us data scientists in a job!

Generally, you should get into the habit of opening CSV files in Excel using the following process:

  1. Open a blank instance of Excel
  2. Go to Data -> From Text/CSV -> choose your file -> then choose the appropriate file encoding. Excel calls this 'File Origin'.
  3. This creates a 'query' that pulls the data from the CSV file into your instance of Excel. I usually copy the data and 'Paste as Values' into a new sheet, then delete the query sheet. The data is then ready to use.
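If you work in code rather than Excel, the same principle applies: state the encoding explicitly rather than trusting a default. A minimal sketch in Python (the file name "cities.csv" is hypothetical):

```python
import csv

# Write a CSV with an explicit encoding...
with open("cities.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["city", "country"])
    writer.writerow(["São Paulo", "Brazil"])

# ...and read it back with the same, explicit encoding.
with open("cities.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows[1][0])   # São Paulo - diacritics survive the round trip
```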

Best practice recommendations

If you ever send or receive data in CSV format, include metadata, and in that metadata, make sure you specify the file encoding you used. UTF-8 seems to be the most widely used encoding, so you should probably use it too.
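One lightweight way to do this is a metadata "sidecar" file shipped alongside the CSV. The field names below are illustrative, not a standard:

```python
import json

# A hypothetical sidecar file recording the encoding (and other parsing
# details), so recipients never have to guess.
metadata = {
    "file": "sales.csv",
    "encoding": "utf-8",
    "delimiter": ",",
}
with open("sales.metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```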

What if I have data but don't know the encoding?

  1. Contact the author and ask what encoding it is. They will appreciate your concern for the shareability of their data, and hopefully they'll specify the encoding in future releases.
  2. Try different encodings. Start with the common ones listed above, or search for encodings specific to your locality and/or use case. Generally, if a few diacritics and special characters display correctly, you have probably guessed the right one -- but this is far from best practice.
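Trial-decoding can be automated. A rough sketch, with the caveat that a clean decode is only a hint, not proof, of the right encoding:

```python
# Candidate encodings to try, most specific first.
CANDIDATES = ["utf-8", "cp1252", "latin-1", "utf-16"]

def guess_encoding(raw: bytes) -> list[str]:
    """Return the candidate encodings under which the bytes decode cleanly."""
    plausible = []
    for enc in CANDIDATES:
        try:
            raw.decode(enc)
            plausible.append(enc)
        except UnicodeDecodeError:
            pass
    return plausible

print(guess_encoding("café".encode("utf-8")))
# ['utf-8', 'cp1252', 'latin-1']
```

Note that latin-1 maps every possible byte to a character, so it will always appear "plausible" -- you still have to eyeball the decoded text.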

What if I have opened the file using the wrong encoding, and subsequently changed it?

You either have to restore a backup of the file or manually fix the corrupted characters. There are a few tricks that can help.

  1. If you made changes to the corrupted data but still have a backup available, you can perform a 'diff' on the two files to identify which characters have changed.
  2. You can search for characters which are not printable. Various functions can do this in different programming languages. In Microsoft Excel, you can use the following:

=IF(LEN(CLEAN(A1))<>LEN(A1),"Non-printable characters present","Cell is clean")

I generally check for double-whitespace, extra whitespace on the sides of strings, and blank cells at the same time:

=IF(LEN(CLEAN([@territory]))<>LEN([@territory]),"non-printable characters", IF(LEN(TRIM([@territory]))<>LEN([@territory]),"Needs trimming", IF(LEN(TRIM([@territory]))=0,"Blank", "Clean")))
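The same three checks can be sketched in Python. Here str.isprintable() plays the role of CLEAN(), but it is stricter: CLEAN() only removes character codes 0-31, while isprintable() also flags characters like the no-break space:

```python
# A rough Python analogue of the Excel formula above: flag non-printable
# characters, untrimmed whitespace, and blanks, in that order.
def check_cell(value: str) -> str:
    if not value.isprintable():
        return "non-printable characters"
    if value != value.strip():
        return "Needs trimming"
    if value == "":
        return "Blank"
    return "Clean"

print(check_cell("Bavaria\u00a0"))  # non-printable characters
print(check_cell(" Bavaria"))       # Needs trimming
print(check_cell(""))               # Blank
print(check_cell("Bavaria"))        # Clean
```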