A quick guide to transliterating Arabic, Persian or Urdu on your computer

Scholars in the West who rely on sources in languages written in Arabic script (such as Arabic, Persian, Ottoman Turkish or Urdu) often need to write that script in a transliterated or romanized form, if only to search library catalogues. This post offers a quick guide to transliterating or romanizing languages written in Arabic script. Here, transliteration and romanization are used interchangeably to mean writing Arabic-script characters in Latin characters.

1. Transliteration systems

Transliteration and romanization systems add diacritic marks to Latin characters to render letters and sounds that don’t exist in English. Numerous transliteration standards are available (ALA-LC, ISO, IJMES for example), which can be confusing, but the most important thing is to be consistent once you have chosen a system. Note as well that each language, even if written in Arabic script, has its own transliteration system. Most North American libraries use the ALA-LC (Library of Congress) romanization tables, whereas a number of European libraries use the ISO 233 transliteration standard. Knowing the differences between ALA-LC and ISO 233 will help you search library catalogues much more efficiently. Last, some journals and publishers have their own transliteration system which they require authors to use: knowing which standard a specific publication follows will often make using it much easier.
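To illustrate how the two library standards can differ, here is a minimal Python sketch covering only a handful of Arabic consonants. The table is an illustrative subset written from memory of the published ALA-LC and ISO 233 tables, not a complete or authoritative mapping, and the names `ROMANIZATION` and `romanize_letter` are hypothetical.

```python
# Illustrative subset only: check the official ALA-LC and ISO 233 tables
# before relying on any of these values.
ROMANIZATION = {
    "\u062B": {"ALA-LC": "th", "ISO 233": "ṯ"},  # ARABIC LETTER THEH
    "\u062E": {"ALA-LC": "kh", "ISO 233": "ẖ"},  # ARABIC LETTER KHAH
    "\u0630": {"ALA-LC": "dh", "ISO 233": "ḏ"},  # ARABIC LETTER THAL
    "\u0634": {"ALA-LC": "sh", "ISO 233": "š"},  # ARABIC LETTER SHEEN
    "\u063A": {"ALA-LC": "gh", "ISO 233": "ġ"},  # ARABIC LETTER GHAIN
}


def romanize_letter(letter: str, system: str) -> str:
    """Return the romanized form of one Arabic letter in the given system."""
    return ROMANIZATION[letter][system]


if __name__ == "__main__":
    for letter, forms in ROMANIZATION.items():
        print(letter, forms["ALA-LC"], forms["ISO 233"])
```

In this subset, ALA-LC uses two plain Latin letters where ISO 233 uses a single letter with a diacritic, so the same title may need to be searched under two different spellings depending on the catalogue.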

2. Diacritic marks

The main challenge with romanization is the consistent encoding of letters with diacritic marks. Using a consistent encoding standard ensures that the marked letters display properly regardless of the document format, type of device, or operating system you are working on. Inconsistent encoding alters the text: letters turn into different signs and often become illegible.
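One concrete source of inconsistency is that the same marked letter can be stored in two ways: as a single precomposed character, or as a base letter followed by a combining mark. The following minimal Python sketch, using only the standard `unicodedata` module, shows the difference and one way to make text consistent. The helper name `normalize_romanized` is hypothetical, and choosing the NFC form is an assumption, not a requirement of any transliteration system.

```python
import unicodedata

# Two ways of storing the same visible letter "ā" (a with macron).
precomposed = "\u0101"   # one code point: LATIN SMALL LETTER A WITH MACRON
decomposed = "a\u0304"   # letter "a" followed by COMBINING MACRON


def normalize_romanized(text: str) -> str:
    """Return text in NFC form, so each marked letter is stored one way."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # The two strings look identical on screen but compare as different.
    print(precomposed == decomposed)
    # After normalization they compare as equal.
    print(normalize_romanized(precomposed) == normalize_romanized(decomposed))
```

Normalizing text before searching or comparing it avoids the situation where two visually identical words fail to match.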

3. Encoding standard

The computing standard for consistent encoding of non-Latin scripts is Unicode, whose characters are stored using one of the Unicode Transformation Formats (UTF). Developed since the early 1990s by a not-for-profit consortium whose members include large computing companies (Adobe, Apple, Google, IBM, Microsoft, Oracle) and governmental agencies, Unicode is regularly amended to include more characters. At present, it covers more than 150 scripts, including the Arabic script used for Arabic, Persian, Ottoman Turkish and Urdu, as well as the Latin letters with diacritics needed for their romanized forms. Several UTF encodings are available, but the most commonly used are UTF-8 (in particular for HTML web documents) and UTF-16 (especially for text documents in both Windows and macOS environments).
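As a rough illustration of how UTF-8 and UTF-16 differ, and of how letters "turn into different signs" when the wrong encoding is assumed, here is a minimal Python sketch. The sample word and variable names are chosen for illustration only.

```python
# A romanized word containing two letters with diacritics.
word = "Ḥadīth"

# The same six characters stored in two different UTF encodings.
utf8_bytes = word.encode("utf-8")
utf16_bytes = word.encode("utf-16-le")

# Reading UTF-8 bytes as if they were Latin-1 garbles the marked letters.
garbled = utf8_bytes.decode("latin-1")

# Reading them back with the encoding they were written in restores the word.
restored = utf8_bytes.decode("utf-8")

if __name__ == "__main__":
    print(len(word), len(utf8_bytes), len(utf16_bytes))
    print(garbled)
    print(restored)
```

The unmarked letters take one byte each in UTF-8 while the marked ones take two or three, whereas UTF-16 uses two bytes for every character in this word. The text survives only when a file is read with the same encoding that was used to write it.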

4. Typefaces (fonts)

To display letters with diacritics correctly, you need a typeface that includes glyphs for the corresponding Unicode characters, such as Arial Unicode MS on PCs, and Times New Roman, Helvetica or Lucida Grande on Mac. If they are not among the default typefaces on your computer, suitable fonts can often be downloaded for free from the internet.

5. Transliterated letters input

Once you have a typeface compatible with Unicode, you need a tool for entering characters and diacritic marks. Because regular keyboard layouts cannot accommodate key combinations for all characters with diacritics, operating systems provide alternative methods: the Microsoft Windows Character Map and the Extended Accent Codes for Mac give you access to the entire repertoire of Unicode characters.
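Character pickers of this kind typically identify each character by its Unicode code point and official name. As a minimal sketch of the same idea, the Python standard `unicodedata` module can look up a character by name and report its code point. The helper names `char_from_name` and `describe` are hypothetical, and the three characters are just examples of signs commonly needed in romanized text.

```python
import unicodedata


def char_from_name(name: str) -> str:
    """Return the character that has the given official Unicode name."""
    return unicodedata.lookup(name)


def describe(ch: str) -> str:
    """Return the code point and official name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    names = [
        "LATIN SMALL LETTER A WITH MACRON",
        "LATIN SMALL LETTER H WITH DOT BELOW",
        "MODIFIER LETTER LEFT HALF RING",
    ]
    for name in names:
        ch = char_from_name(name)
        print(ch, describe(ch))
```

The code point printed for each character (for example U+0101) is the same value you would search for in a character map to insert that character into a document.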

6. Additional information

The Arabic Macintosh website is a very valuable resource for Mac users interested in transliterating the Arabic script. The Digital Orientalist has dedicated a lengthy post to keyboard layouts in both macOS and Windows environments.
