Gabi Danon: Bidirectional control characters in Mac filenames

A silly bug and a simple fix

Note: With the exception of the layout (which has been modified to match the rest of my site), this page hasn't been updated for quite a long time, and it might contain some information which is no longer valid.

Download version 1.1 of the CleanBiDi application and scripts; this version contains important bug fixes by Nir Soffer.

Scroll down to see the most recent updates.

The Problem:

On some Mac OS systems which use the Hebrew language kit, filenames on HFS+ volumes often contain special invisible Unicode characters that force the directionality to be left to right or right to left. On affected systems, even in a filename which contains only Latin characters, every space or dot (or any other character whose directionality is ambiguous) is preceded by the Unicode char 0x202D (in UTF-8: E2 80 AD) and followed by 0x202C (E2 80 AC):

0x202D (decimal 8237, UTF-8 E2 80 AD): this is the Unicode LEFT-TO-RIGHT OVERRIDE (LRO) char.
0x202C (decimal 8236, UTF-8 E2 80 AC): this is the Unicode POP DIRECTIONAL FORMATTING (PDF) char.

These special chars are invisible Unicode chars: LRO means "write from left to right until I tell you to stop"; PDF restores the previous directionality. I suppose that other directionality characters are also used in other environments. One additional control character which appears much less frequently, mainly in Hebrew file names, is RLO (RIGHT-TO-LEFT OVERRIDE), 0x202E (E2 80 AE). The LRO and RLO characters are usually not really needed, because Hebrew letters have the right-to-left directionality "built in" and Latin letters have the left-to-right directionality. These control characters are needed only in relatively rare ambiguous situations, where the surrounding text doesn't make it clear what the text direction should be.
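As a quick check of the numbers above, here is a minimal Python sketch that prints the code point, the official Unicode name, and the UTF-8 bytes of each of the three characters. It uses only the standard unicodedata module; the names BIDI_CONTROLS and describe are mine and are not taken from the CleanBiDi scripts.

```python
import unicodedata

# The three directional control characters discussed above: LRO, PDF, RLO.
BIDI_CONTROLS = ["\u202d", "\u202c", "\u202e"]


def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes in hex) for one character."""
    utf8_hex = " ".join("%02X" % b for b in ch.encode("utf-8"))
    return ("U+%04X" % ord(ch), unicodedata.name(ch), utf8_hex)


for ch in BIDI_CONTROLS:
    print("  ".join(describe(ch)))
```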

For a list of these and other Unicode characters, see this file: http://www.unicode.org/charts/PDF/U2000.pdf

Characters that trigger the LRO+PDF control characters include spaces, punctuation marks, and numbers.

In general, these control characters should not affect the system behavior, and users should not even be aware of the existence of these invisible characters. However, I have found a bug in the Finder under Mac OS X 10.1.x, which causes the Finder to become extremely slow and unresponsive when displaying windows that have files whose names contain these characters. An even bigger problem is that in some cases, Mac OS 9.2.2 doesn't hide these characters from the user, so a user who upgrades from 9.2.1 to 9.2.2 might discover that his system became practically unusable, or he may think that his filesystem has been messed up (when all that happened is that now he can see what was hidden from him before).

In the rest of this page, I will assume that you have Mac OS X installed. I know that this isn't true for many people who suffer from this problem, but that's the only way to apply the fix that I have found. I hope this will change in the future.

The easiest way to see whether a directory contains affected filenames is with the Terminal application, using two basic Unix commands. In a terminal window, 'cd' to the directory that you want to check; then type 'ls' to see a list of files in that directory. Each control character will appear as a sequence of 3 question marks in the listing.

Here's a list of problems that, as far as I know, are related to these characters:

  • The Finder in OS X versions prior to 10.2 is very slow when displaying files with these characters.
  • Maybe the most serious problem (which I haven't had, luckily) is that for some users, these control characters became visible (as question marks) in the Finder after upgrading to Mac OS 9.2.2. More details.
  • When copying files to your iDisk, sometimes the control characters are replaced by an 'x'; sometimes copying fails with an error message. More details.
  • File listings in the Terminal are often unreadable because of all the extra '???' sequences.
  • I suspect that problems where Java under Mac OS 9 doesn't work for many users of the Hebrew language kit are also related to this.
  • Problems when copying files between the Mac and Virtual PC.
  • I suspect that many other small annoyances that I never understood are related too.
  • One person told me he was unable to insert graphics into Office files until he got rid of these characters.
  • Yeda, Apple's representatives in Israel, have recently added to their site a page with instructions on how to use Mozilla to view Hebrew web pages. In their instructions, they say that if Mozilla quits immediately when you launch it, you should rename your hard disk, under Mac OS X— to the same name it already has. Hmmm.. let's see... why would they suggest something like that? Traditional Mac voodoo? No— keep reading.

The funny part is that in most cases, these characters surround a single character, such as a space or a period (so you get sequences like: LRO-space-PDF). What difference does it make if the space between words is "left to right" or "right to left"?! Is this just pure stupidity, or the computerized version of "the sound of one hand clapping"? Directionality has to do with how several characters are placed next to each other!!!
As far as I can see, the reason behind this bizarre thing is that in the old MacHebrew encoding, there are two distinct space characters (a Hebrew space and a Latin space), two periods (Hebrew and Latin), and so on; and in order to make sure that translating from MacHebrew to Unicode and back to MacHebrew will give exactly the same string as the original one, the single Unicode space/period/etc is preceded by either LRO or RLO. So, a Latin space in MacHebrew is translated into Unicode as LRO-space-PDF, and a Hebrew space is translated as RLO-space-PDF. There is a certain logic to this, but come on: why would you ever want to put a Hebrew space between two Latin words? Anyway, Mac OS X is based on Unicode and it's time to put MacHebrew out of its misery, even if it means that once in every 10,000 files we might get a Latin space instead of a Hebrew one...
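The round-trip idea described above can be sketched as a toy model. This is not Apple's converter and it does not use real MacHebrew byte values: legacy text is represented here as a list of (character, variant) pairs, where the variant is "latin" or "hebrew" for characters that exist twice in the old encoding and None for ordinary letters. The function names and the data layout are assumptions made only for illustration.

```python
LRO, RLO, PDF = "\u202d", "\u202e", "\u202c"


def to_unicode(legacy):
    """Convert toy legacy text to Unicode, wrapping doubled characters in overrides."""
    out = []
    for ch, variant in legacy:
        if variant == "latin":
            out.append(LRO + ch + PDF)
        elif variant == "hebrew":
            out.append(RLO + ch + PDF)
        else:
            out.append(ch)
    return "".join(out)


def from_unicode(text):
    """Invert to_unicode: an override-char-PDF triple selects the legacy variant."""
    legacy = []
    i = 0
    while i < len(text):
        if text[i] in (LRO, RLO) and i + 2 < len(text) and text[i + 2] == PDF:
            variant = "latin" if text[i] == LRO else "hebrew"
            legacy.append((text[i + 1], variant))
            i += 3
        else:
            legacy.append((text[i], None))
            i += 1
    return legacy
```

With this model, a Latin space and a Hebrew space produce different Unicode strings, so converting back recovers exactly the original pairs. Dropping the overrides would make the two spaces indistinguishable, which is the price of the fix proposed on this page.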

I think that since the Hebrew language kit uses the MacHebrew encoding, while HFS+ stores filenames as Unicode, the translation which adds the control characters would probably happen every time an OS which uses the HLK writes a new file to an HFS+ volume.

By the way, these control characters are added not just in filenames, but in any text which is converted from MacHebrew to Unicode using Apple's Text Encoding Converter (TEC). To see this, create a text file with the text "aa bb" in any text editor; then use the freeware Cyclone to convert this file from MacHebrew to UTF-8. If you then open the new file in a program such as HexEdit or HexEditor, you will see that it is 11 bytes long (and not 5 bytes long as the original), and contains 3-byte control characters before and after the space. (You can find all these programs at VersionTracker).
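The byte count is easy to verify without Cyclone. The following sketch builds the converted string by hand (it does not call any converter, so the exact output of TEC is taken from the description above) and compares the UTF-8 lengths.

```python
LRO, PDF = "\u202d", "\u202c"

original = "aa bb"
# What the conversion is described as producing: LRO-space-PDF in place of the space.
converted = "aa" + LRO + " " + PDF + "bb"

print(len(original.encode("utf-8")))    # 5
print(len(converted.encode("utf-8")))   # 11
print(converted.encode("utf-8").hex())  # 6161e280ad20e280ac6262
```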

How to fix this problem:

The simple answer is 'rename the problematic files under Mac OS X', since Mac OS X doesn't add these characters like older versions of the Mac OS did. To find out which files are problematic, go to the Terminal (again, under Mac OS X of course) and look at the file listing. In order to rename an affected file, go back to the Finder. You first need to rename the file to a temporary name: simply retyping the filename (under Mac OS X) won't work, because the new name is considered by the Finder to be the same as the old name. So first retype the filename with an extra space at the end, then rename it again by deleting the extra space.

Of course, renaming hundreds or thousands of files by hand is not practical. It seems like a simple matter to write a script that does this automatically. The problem is that, as far as I can tell, high-level Mac programs only 'see' a 'cleaned' version of the filename, and therefore writing an AppleScript that removes the control characters is harder than it appears to be. Luckily, the Unix environment does 'see' the raw filenames, and therefore it is possible to write scripts that are run from the command-line to do the work.

I have written 4 simple Python scripts to do the job, and an AppleScript Studio application which provides a simple graphical user interface for running the scripts. To run the scripts under Mac OS X 10.1.x, you must first install the command-line version of Python (which you can do with Fink; see instructions below). Mac OS X 10.2 and up already includes Python.

The first script, called cleanbidi.py, cleans the names of the files in the current directory only. If you want to run it the hard way (and not from the graphical interface), start by opening a terminal window, and change to the directory that you want to 'clean', and type:

python path-to-cleanbidi.py

For instance, if you have the file cleanbidi.py on the desktop and you want to clean the Documents folder on your startup drive, you should type:

cd /Documents
python ~/Desktop/cleanbidi.py

This script does not clean subdirectories, so you will have to run it on each directory separately. I did this mainly for safety reasons, so if you're afraid to risk your entire hard disk or just want to clean a single directory this script is for you. I also wrote another script, called rcleanbidi.py, which does scan subdirectories, if you're confident enough to do it all at once (in my testing so far I haven't had any problems with it). This second script can also take a command-line argument, so you can either run it without any arguments on the current directory, or with an optional argument specifying the directory to clean.
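For readers who want to see the general shape of such a script, here is a minimal sketch written for a current Python 3. It is not the code of cleanbidi.py or rcleanbidi.py; the function names are my own, and the check for a name that the filesystem treats as identical to the cleaned one is an assumption about how some volumes compare filenames.

```python
import os
import sys

# LRO, PDF and RLO: the characters this page says the cleaning scripts remove.
STRIP_TABLE = {ord(ch): None for ch in "\u202d\u202c\u202e"}


def clean_name(name):
    """Return name with every LRO, PDF and RLO character removed."""
    return name.translate(STRIP_TABLE)


def _taken_by_another_entry(old_path, new_path):
    """True if new_path exists and is a different entry from old_path.

    Assumption: some filesystems may treat the dirty and clean names as equal,
    in which case both paths resolve to the same entry and renaming is safe.
    """
    if not os.path.lexists(new_path):
        return False
    return not os.path.samestat(os.lstat(old_path), os.lstat(new_path))


def clean_directory(path="."):
    """Rename affected entries directly inside path; return (old, new) pairs."""
    renamed = []
    for name in os.listdir(path):
        new_name = clean_name(name)
        if new_name == name:
            continue
        old_path = os.path.join(path, name)
        new_path = os.path.join(path, new_name)
        if not new_name or _taken_by_another_entry(old_path, new_path):
            # Never overwrite another file; leave this one untouched.
            print("skipped: %r" % old_path, file=sys.stderr)
            continue
        os.rename(old_path, new_path)
        renamed.append((old_path, new_path))
    return renamed


def clean_tree(top="."):
    """Clean top and all its subdirectories, deepest directories first."""
    renamed = []
    # Bottom-up, so a directory is renamed only after its contents were visited.
    for dirpath, _dirnames, _filenames in os.walk(top, topdown=False):
        renamed.extend(clean_directory(dirpath))
    return renamed
```

In this sketch, clean_directory(".") corresponds to the single-directory behavior and clean_tree(path) to the recursive one. Try it on a scratch folder first, for the same safety reasons given above.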

If you just want to check which filenames will be changed without actually changing anything, you can use the scripts testbidi.py and rtestbidi.py, which just list the files that will be changed by the cleaning scripts. The testing scripts themselves will leave your files as they are, so they are 100% safe. A good way to use them is by sending their output to a text file, that you can later open with a program such as TextEdit. For example, the following command:
python ~/Desktop/testbidi.py > ~/Desktop/changes
will create a file called 'changes' on your desktop with a list of changes that would happen if you run cleanbidi.py in the same directory. When you open this file, just make sure to tell TextEdit that it's encoded as UTF-8 (in the 'Plain Text Encoding' popup menu in the Open dialog).
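A dry run of this kind can be sketched in a few lines. The following is an illustration only, not the code of testbidi.py: it lists the affected names in one directory and writes them to a UTF-8 text file, replacing each invisible character with a readable tag so the report is easy to inspect. The tags and function names are my own invention.

```python
import os

# Readable stand-ins for the invisible characters (tags chosen for this sketch).
BIDI_TAGS = {"\u202d": "<LRO>", "\u202c": "<PDF>", "\u202e": "<RLO>"}


def visible(name):
    """Replace each control character in name with its readable tag."""
    for ch, tag in BIDI_TAGS.items():
        name = name.replace(ch, tag)
    return name


def affected_names(path="."):
    """Return the sorted entries of path whose names contain a control character."""
    return sorted(
        name for name in os.listdir(path)
        if any(ch in name for ch in BIDI_TAGS)
    )


def write_report(path, report_file):
    """Write one tagged name per line to report_file (UTF-8); return the count."""
    names = affected_names(path)
    with open(report_file, "w", encoding="utf-8") as out:
        for name in names:
            out.write(visible(name) + "\n")
    return len(names)
```

Nothing is renamed here; the functions only read the directory listing.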

You can get all these scripts, together with detailed instructions in Hebrew (that don't assume any prior knowledge of Unix), here.

IMPORTANT: These scripts have worked very nicely for me, with no problems so far, but there are no guarantees that they will work for others and that they won't cause any damage. As far as I can see, the worst thing that can happen is that some complex filenames, mainly those that contain both Hebrew and Latin characters, will show up in the wrong order in the Finder. On my hard disk, this happened once after using an older version of the scripts: the "NisusWriter 6.0.3" folder showed as "Nisus 3.0.6 retirW" after the cleaning (and it doesn't contain any Hebrew characters!); Nisus itself still worked fine, of course. It turns out that the cleaned filename contained the RLO (Right-to-Left Override) character after the word "Nisus"; this character means "display from right to left until I tell you to stop (by using a PDF character)", so the Finder's behavior is correct. I have no idea why this character was put there in the first place, but a simple rename solved the problem. The current version of my scripts always removes the RLO character.
Anyway, I would be careful about where I use it; I don't recommend running it on sensitive system folders. But since the Mac OS itself (both pre-X and X) doesn't contain any files with Hebrew characters, I think this is pretty safe.
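The "orphaned override" situation behind the Nisus example can be detected mechanically. The sketch below is a simplified check that only knows about the three characters discussed on this page (it ignores the other directional formatting characters in Unicode): it reports every LRO or RLO that is never closed by a PDF. The function name is hypothetical.

```python
LRO, RLO, PDF = "\u202d", "\u202e", "\u202c"


def unclosed_overrides(name):
    """Return the override characters in name that no later PDF closes."""
    stack = []
    for ch in name:
        if ch in (LRO, RLO):
            stack.append(ch)   # an override opens a new direction
        elif ch == PDF and stack:
            stack.pop()        # a PDF closes the most recent override
    return stack


# The folder name as described above: an RLO left behind after "Nisus".
print(unclosed_overrides("Nisus" + RLO + "Writer 6.0.3") == [RLO])  # True
# A balanced LRO-space-PDF sequence leaves nothing open.
print(unclosed_overrides("test" + LRO + " " + PDF + "file"))         # []
```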

Note that after you clean a directory, you probably won't see the change in the Finder immediately. You will notice a significant increase in the speed of opening a 'cleaned' directory in the Finder after you log out and log in again, or after you force-quit the Finder.

On my computer, the results of running these scripts were simply amazing. Folders that used to take more than 5 seconds to open in the Finder, now open almost instantaneously. I let the scripts fix most of my hard disk (there are still some directories that I haven't cleaned — nothing serious, only 16,000 files...), and so far I would say this is one of the best upgrades I've ever done... It almost feels like a new computer. And I actually like the new column view, now that it works!

If you've never used the Terminal and you don't know anything about the Unix command line, the StuffIt archive which contains the scripts also includes a small application called CleanBiDi that provides a simple user interface for running the scripts. See the ReadMe file for details.

If you still want to try installing Fink and Python: First, go to this page and follow the instructions in 2.1 and 2.3. Then, the easiest way to install Python would be to type:
sudo apt-get update
(you will be asked for an administrator password). After this, type:
sudo apt-get install python-nox
You may be asked some questions — just press Enter. This should install Python. Maybe you'll have to relaunch the Terminal once before it actually works — I'm not sure. Now you're ready to run my scripts!
By the way, during the installation of Fink and Python you must be connected to the internet!
The only Unix commands that you really must know are ls, which shows you a list of files and directories (or ls -l that gives more details); and cd (followed by a space and the name of a directory), that changes the current directory. Note that cd .. moves up one directory; and that directory names are separated by slashes.

Please contact me if you have any comments or suggestions. If you tried using one of my scripts, let me know whether or not it worked for you.

Updates:

  • (28/11/2003) Version 1.1 of the CleanBiDi application has been released. This version, made available by Nir Soffer, fixes a silly bug which prevented the scripts from running if Python is not installed through Fink; this should make it more usable to most users of Mac OS X 10.2 and higher. Many internal improvements have also been made to the Python scripts. (Note that the application itself hasn't changed; therefore, it still contains the link to my no-longer-existent .Mac page).
  • (12/2/2003) Version 1.0.1 of the CleanBiDi application has been released. This version fixes a bug which causes the application to freeze when the selected folder contains special characters such as parentheses. The bug was identified and fixed by Mickey London.
  • (2/9/2002) After playing with Mac OS X 10.2 (Jaguar) for some time, and hearing what other users say, I can say that the slow Finder problem has been fixed, at last. According to some users, just using the Lucida Grande font from 10.2 in 10.1.x already solves the problem.
    10.2 also includes the Python interpreter, so if you still want to run my cleaning scripts under 10.2, you don't need to go through the process of installing Fink and Python.
  • (11/5/2002) Version 1.0 of the CleanBiDi application is ready! It is now fully functional, and for most users there is no longer any need to use the Terminal to run the scripts. See the ReadMe file for instructions.
  • (9/5/2002) I made a few minor improvements to the GUI application that runs the scripts; for instance, now you can see the output from the Perl script, and there's a status bar (but it doesn't move...). The Perl script was also changed to allow its output to be displayed, so be sure to use the new version.
    I removed the script that was supposed to delete colons from filenames, after finding out that it actually deletes slashes. The problem is that what looks like a colon in the Finder is really a slash, and vice versa. For the rare occasion where you find out that you have a colon in a Hebrew filename, you can boot into Mac OS 9 or earlier and rename the file manually.
  • (5/5/2002) The following tip is from Eden Orion: If you need to clean many computers which don't have Mac OS X installed, and you have an external hard disk, you can install Mac OS X on the external disk and boot from it; then just run the scripts as usual. To be extra cool: use an iPod as your external hard disk!
  • (5/5/2002) The following tip is from Shoshannah Forbes and Aaron Adelman: To further speed up the Finder in folders that contain Hebrew filenames, use a system font that contains Hebrew instead of Lucida Grande. If you can get a copy of the Lucida Grande used in the Mac OS X public beta, it's probably your best choice. Otherwise, use TinkerTool to set the system font to Tahoma or Arial from Microsoft, or any other Unicode font that contains Hebrew. Note that you might then see any control characters that you still have in filenames.
  • (28/4/2002) I've written a simple application which provides a basic graphical user interface for running the scripts. You must have the scripts in a folder called 'Cleanbidi' inside the /Applications/Utilities folder, or otherwise the program won't find them. It's a very primitive program which really doesn't do anything except for letting you choose a script and a folder on which to run it. If you try to run one of the Python scripts without having Python installed, don't expect any useful results... Also be aware that it displays all its output only after the script finishes, so it may seem 'stuck' for a while; and sometimes it doesn't display any output at all (with the Perl scripts). I'll try to fix this soon. To run this application, you must have Mac OS X 10.1.2 or higher.
  • (26/4/2002) Finally, somebody has translated my scripts into Perl, so they can be run by anyone with Mac OS X, without installing any additional software. Mickey London's Perl script is called rcleanbidi.pl, and is now included with the rest of the scripts. To run it, put it wherever you want, open a Terminal window and cd into the directory you want to clean, then type:
    perl path-to-rcleanbidi.pl
    For instance, if you have the file rcleanbidi.pl in the folder 'Cleanbidi' inside the Utilities folder and you want to clean the folder "Documents" on your startup disk, you should type:
    cd /Documents
    perl /Applications/Utilities/Cleanbidi/rcleanbidi.pl
    I have not thoroughly tested this script, so if you try it, please let me know if it worked or not. I already got some reports that it fails to rename quite a lot of files for some unknown reason; running it twice sometimes helps a little.
  • (26/2/2002) While cleaning some files with Hebrew names, I noticed something strange (which doesn't really have anything to do with the control characters problem, but it's quite important anyway): it seems that Mac OS 9 allows you to use the ':' character in a filename if you type it in Hebrew! Since this character is used as a directory separator in file paths, it shouldn't be allowed in filenames at all. Files that do have this character won't show up correctly in the Mac OS X Finder and you'll probably have problems when you try to open them. I recommend that you go over your files, under Mac OS 9, and see if you have any files with a colon in their name. If you find any, rename them immediately!
    My scripts won't be able to rename files that contain this character in their name, so if you see a message about a file that can't be renamed, and the message shows a bunch of question marks with a slash somewhere, this is probably a file with a Hebrew name that contains a colon.
  • (25/2/2002) I have heard from one person that my scripts don't work over a network to clean a remote drive. If you have a drive that you must clean but don't have Mac OS X installed on that computer, a radical solution (if you're really desperate) would be to take out the hard disk and physically connect it to a computer running Mac OS X. I don't recommend this unless you know exactly what you're doing!!!
  • Folders with filenames that contain Hebrew characters seem to slow the Finder even after the control characters have been removed. Sorry, but I don't have a solution for that!
  • On my computer, some filenames containing both Hebrew and numbers have become strictly right-to-left after cleaning, so the numbers are shown in reverse. Renaming these files by hand should fix this. (Under Mac OS X, since there's no Hebrew keyboard yet, you can use the free program Mellel to type Hebrew text, and then copy and paste this text into a filename).
  • (21/2/2002) A whole week has passed since I first published the information on this page, and so far I received no report of any sort of problem as a result of running my scripts. On the other hand, I did get reports from happy people for whom it worked perfectly, fixing the Finder problem in Mac OS X and the question marks in Mac OS 9.2.2. So I guess this is really a good solution, for those who can apply it. The main problem remains for Mac OS 9 users who don't have X installed. I hope that a fix will arrive soon — hopefully, from Apple.
  • (20/2/2002) Yusuke Kinoshita has suggested a really cool tip (which, like everything else on this page, comes with no warranty!): even if you don't have Python installed (so you can't run my testing scripts), you can still get a list of all affected files on your hard disks by typing one (very very complicated) command. At the Terminal, type (or better, copy and paste):
    sudo ls -RAl / | perl -ne 'use utf8; print if /[\x{202c}-\x{202e}]/' > ~/SearchResults.txt
    After you type your administrator password, it will take a few minutes, and finally you'll get a file in your Home folder called "SearchResults.txt" with a list of all affected files. Open it with TextEdit as a UTF-8 file to see what's going on.
    If you want to get really fancy, you can install the 'tree' program with Fink:
    sudo apt-get install tree
    and then get a more informative listing with this command:
    sudo tree -afFN / | perl -ne 'use utf8; print if /[\x{202c}-\x{202e}]/' > ~/SearchResult2.txt
    But then, if you know how to install things with Fink, you can also run my testing scripts:
    python ~/Desktop/rtestbidi.py / > ~/SearchResults3.txt
    (Thanks, Kino.)
  • (18/2/2002) IMPORTANT: Early versions of my scripts did not remove the RLO character, but did remove all PDF characters; as a result, some "orphaned" RLO characters might be left, which would make text appear right-to-left even where it shouldn't. The current scripts (modified on or after 17/2/2002) remove all RLOs as well; this means that after running the scripts there should be no directionality characters at all in the file names. If you used older versions of my scripts, you should run the new versions to make sure that everything is ok.
  • I added the two scripts testbidi.py and rtestbidi.py, which only test the filenames without actually doing anything. These scripts are totally harmless and don't contain any dangerous commands, so you can use them just to get a picture of how bad the situation is.
  • If your system is not affected by this problem but you still want to see what it's all about, you can add the control characters by hand using the "Unicode Hex Input" keyboard. If you press and hold the option key and then type a 4-digit unicode code, it's just like typing the character with that code. So you can insert the codes 202c, 202d and 202e to get these control characters.
    Try the following: create a new file on the desktop. Then click on the file name in the Finder, delete the old name, and rename it by choosing the Unicode Hex Input keyboard, and typing "test", then option+202d, space (without the option key), option+202c, and finally "file". You get something that looks like "test file", but has two invisible characters — one before and one after the space. Select the file and duplicate it 3 or 4 times (command-d), to get several messy files. Now, try to select all these files at once and drag them somewhere. How do you like that spinning cursor?
  • Several people have already contacted me to tell me that they find my solution correct. One person ('Mitz Pettel') said he thinks that these control characters are put in whenever the views font in the (pre-Mac OS X) Finder is a Hebrew font. He was also worried about whether my script changes the last modification date for files, which would be a problem for backups; I think that my scripts don't change the modification date for files, only for folders.
    See his comments here.
  • While trying to send a Stuffit archive with affected files to somebody who wanted to help, I noticed that Stuffit saves the files in their "cleaned" form (so now I have the opposite problem: I lose the control characters just when I want to keep them...). This suggests a very simple way to clean filenames without a script: just stuff a folder and expand it (under Mac OS X) — and the new folder should be "clean". Not very economical, but it should work. Be careful, however, with filenames that contain Hebrew; Stuffit (at least under Mac OS X) doesn't seem to handle these correctly.
  • Python will only recognize the commands in the scripts if the end of line marking is Unix-style. If StuffIt Expander is set to translate text files to have Mac end of lines, the scripts won't work. If nothing happens when you try to run one of the scripts, open it with BBEdit (or BBEdit Lite) and make sure that it has Unix line breaks (in the options in the Save As dialog).
  • While cleaning the filenames on my hard disk, I noticed that many documents and programs that behaved strangely in the past had 'dirty' names. Some examples are programs that only now show new Aqua icons that I never saw before; some Sherlock plugins, that just 'disappeared' after I installed them — but came back after I cleaned the hard disk; and the help files of some applications, that never opened when I tried to access them through the 'Help' menu. The more I investigate this, the more I wonder how I managed to use my Mac until now!
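
The Unicode Hex Input experiment from the updates above (a file that looks like "test file" but hides an LRO and a PDF around the space) can also be set up from a script instead of the keyboard. This is a minimal sketch with a hypothetical function name; it creates empty files in a directory of your choice, so point it at a scratch folder.

```python
import os

LRO, PDF = "\u202d", "\u202c"


def make_test_files(directory, copies=4):
    """Create empty files named 'test file0', 'test file1', ... with hidden LRO/PDF.

    The space in each name is wrapped as LRO-space-PDF, as in the experiment above.
    Returns the list of created paths.
    """
    created = []
    for i in range(copies):
        name = "test" + LRO + " " + PDF + "file" + str(i)
        path = os.path.join(directory, name)
        with open(path, "w"):
            pass  # an empty file is enough for the experiment
        created.append(path)
    return created
```

After calling make_test_files on a scratch folder, an 'ls' in the Terminal or a look in the Finder shows how your system treats such names.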

Conclusions:

  • How much trouble because of one silly thing! And what a simple solution!
  • We are so lucky that we have access to the Terminal. I cannot overemphasize the importance of this. Without the Terminal, I probably would have never been able to find out what causes these problems. And with all due respect to AppleScript, it doesn't come close to languages like Python or Perl when it comes to tasks like this.
    If you've never used the Terminal before, maybe it's time to learn some basic Unix! You can never know when you're going to find it useful.

Links:

