How to convert web pages from TibetanMachineWeb to Unicode

For instance, the texts at are encoded in the TibetanMachineWeb font. This font relies on an arcane encoding to produce the proper stacks of consonants and so on. Because of this, the texts offered by that site cannot be used as-is for any kind of sensible information processing. It is possible, however, to convert those files to Unicode or Wylie. Here’s the process. Unfortunately, it requires Microsoft software. I tried to find a procedure that works on Linux, but my efforts were thwarted. (Also, I was not inclined to test every html2rtf tool under the sun.) You probably need to have the TibetanMachineWeb fonts installed for this to work.
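Before going through the conversion, it can be worth checking that a saved page really does rely on these fonts. The following is a minimal Python sketch, not part of the procedure itself. It assumes the page names the font family in its markup (for example in a `face` attribute or a `font-family` style), and that the family names are TibetanMachineWeb plus the numbered variants TibetanMachineWeb1 through TibetanMachineWeb9. The function name `tmw_fonts_used` is made up for illustration.

```python
import re

# Assumed family names: TibetanMachineWeb and TibetanMachineWeb1..TibetanMachineWeb9.
TMW_FONT = re.compile(r"TibetanMachineWeb[1-9]?", re.IGNORECASE)


def tmw_fonts_used(html: str) -> set[str]:
    """Return the TibetanMachineWeb font names mentioned anywhere in the markup."""
    return {match.group(0) for match in TMW_FONT.finditer(html)}


if __name__ == "__main__":
    # Hypothetical fragment of a page that switches between two of the fonts.
    sample = (
        '<font face="TibetanMachineWeb">...</font>'
        '<span style="font-family: TibetanMachineWeb3">...</span>'
    )
    print(sorted(tmw_fonts_used(sample)))
```

An empty result suggests the page is not TibetanMachineWeb-encoded (or declares its fonts in an external stylesheet, which this sketch does not look at).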

1. Open the desired document in Internet Explorer.
2. Click on the “Page” icon on the right-hand side of the Internet Explorer window (it is at the same height as the tab titles). Select the “Edit with Microsoft Word” option.
3. Once in Word, select “File->Save As” and select “Rich Text Format” in the “Save as type” combo box.
4. Start Jskad and select “Tools->Launch Converter”. Point the converter at the RTF file saved in step 3 and pick one of the “TMW to…” options, for example “TMW to Unicode” for Unicode output.
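Once the converter has run, a quick sanity check is to see whether the result actually contains characters from the Unicode Tibetan block (U+0F00 to U+0FFF). The Python sketch below assumes the converted file is RTF and that non-ASCII characters are written with the RTF `\uN` escape, where N is a signed 16-bit decimal. That is an assumption about the output, not something verified against Jskad, and the function name `count_tibetan_in_rtf` is made up for illustration.

```python
import re

# RTF writes a non-ASCII character as \uN, where N is a signed 16-bit decimal.
RTF_UNICODE = re.compile(r"\\u(-?\d+)")


def count_tibetan_in_rtf(rtf: str) -> int:
    """Count \\uN escapes whose code point lies in the Tibetan block (U+0F00-U+0FFF)."""
    count = 0
    for match in RTF_UNICODE.finditer(rtf):
        code_point = int(match.group(1))
        if code_point < 0:
            # Negative values stand for code points above U+7FFF.
            code_point += 0x10000
        if 0x0F00 <= code_point <= 0x0FFF:
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical fragment: U+0F56 (3926) and U+0F7C (3964), each with a '?' fallback.
    print(count_tibetan_in_rtf(r"{\rtf1 \u3926?\u3964?}"))
```

A count of zero on a file that should be full of Tibetan text suggests the conversion did not produce Unicode, or that the output uses a different representation than the one assumed here.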

It is likely that other methods can achieve the same results, and some of them may be a little more efficient. (For instance, I think there are Word macros that convert Tibetan text from one format to another. I don’t use Word on a regular basis, so that’s not an option I’m exploring.)

