
Displaying Foreign Languages (and Especially Chinese) on Web Sites
or
How to See or
Instead of

Click here to test your computer for its ability to display
Extended Roman, Greek, Russian, and Chinese using Unicode.

All You Need to Know

All pages on this site that include Esperanto, Romanized Chinese, or Chinese Characters are encoded in Unicode, the international standard for representing virtually all non-Roman or enhanced Roman writing systems on computers. Recent versions of Internet Explorer, Netscape Navigator, Firefox, and other browsers should display these correctly if you have the necessary type fonts. In most cases, appropriate fonts come with the system. Most pages will automatically switch to the correct coding system. If they don't, you can reset the coding manually under the browser's "view" menu.

I have generally defaulted to Microsoft's Times New Roman type font for general purposes and PMingLiU and SimSun for Chinese characters because of their clarity and large character sets. In the event that characters do not display correctly (and that you care), you can pick up the necessary fonts on the Microsoft or Apple sites if you thrash around clicking things long enough.

There is a good chance that you already have Chinese fonts on your computer, but that you need to activate them. For Windows, go to "Settings / Control Panel / Regional and Language Options" and click on things till it works. If you wish to input Chinese, go to "Settings / Control Panel / Keyboard" and mess around till it does what you want.

On PCs, Microsoft has designated "Arial Unicode MS" as a type font continuously enlarged to contain all characters in the latest official version of the Unicode Standard. It's not beautiful, since it must be spaced out to allow room for the ascenders and descenders of all the languages involved, but it should be complete if you have the latest version.

I use a PC and am not able to test pages on an Apple, but I have sometimes been appalled at how bad they look on Apples. I think the problem MAY be that the fonts routinely included with Apples need to be supplemented.

I also do most development work using Firefox, and sometimes I am surprised to see unanticipated differences from Internet Explorer. In most cases, however, these problems have not involved the correct display of non-Roman characters.


More In Case You Are Curious


The Problem

The internationalization and increasing sophistication of the internet raise a need for characters other than those of the Latin alphabet as used in English. Web pages in Arabic, Hebrew, Russian, Bengali, and Japanese are now common, and these languages are sometimes mixed with others, as when a Russian bibliographic reference occurs on a page that is otherwise in English.

Until recently, there were serious obstacles to this. They have now been overcome with the adoption of an international standard called Unicode, which I will describe below.

For Chinese, most internet sites still use one of two earlier standards, Big 5 (about 13,000 traditional characters available) and Guobiao (GB) (about 7,000 simplified characters available), which I will also describe below.

First, a few things to keep in mind about Chinese characters:

  1. Traditional characters are the ones that were more or less standardized about 200 BC. Simplified characters are simplifications and consolidations of these, for which the standard was established in 1956 (with a little tinkering afterwards).

  2. Simplified characters are universal in mainland China and Singapore. Traditional characters are used in Taiwan and elsewhere. Some characters used only in Cantonese are added to the total picture in Hong Kong.

  3. Despite the names, most of the characters used as part of the "simplified" mainland set are identical with characters used in the "traditional" set used in Taiwan.

  4. There are about 50,000 Chinese characters extant, not counting those that occur only in Buddhist texts or only in very local areas, so none of these computer coding schemes includes all Chinese characters. Both on the mainland and in Taiwan, Chinese characters are in use beyond those included in the computer coding schemes. Presumably when the dust settles on Unicode development and it is the universal standard, characters not included in it will drop out of active use anywhere except perhaps in shorthand.



Unicode

Unicode (UTF) is the newest standard, adopted by the International Organization for Standardization (ISO) in Geneva, and developed and sustained by the Unicode Consortium. Unicode uses sixteen-bit instead of 8-bit characters, so it can permit a type font to include 2^16 = 65,536 characters, whereas the standard 8-bit ASCII fonts can contain only 2^8 = 256 characters. Unicode is also known as ISO-10646-2. (Actually every statement in this paragraph is slightly wrong in detail; follow the links if you want more precision.)

Unicode will eventually provide programmers with the ability to include text in any combination of languages in the same document (or web page) without confusing the computer, and to do so in a way that any type of computer can correctly interpret. It has already made great strides in this direction, as you see from the examples in the computer test associated with this page. As the same examples may show instead, for many computers and computer users we are not quite there yet.

(To save space in files, there are various encodings that store Unicode numbers as sequences of 8-bit rather than 16-bit units, producing a distinction between UTF-8 and UTF-16, and in the web world there is even a convention for identifying Unicode characters with numeric references that stay within the older, 7-bit limits. You didn't want to know this, right?)
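For the curious, the arithmetic and the space-saving encodings described above can be checked in a few lines of Python. This is only a sketch using Python's built-in codecs; the variable names are mine, and the sample character (the first character of "China") is chosen purely for illustration.

```python
# How many distinct values fit in 8 bits and in 16 bits.
values_8_bit = 2 ** 8     # 256
values_16_bit = 2 ** 16   # 65,536

# Sample character: the "zhong" of Zhongguo (Unicode number 20013).
ch = "\u4e2d"
code_point = ord(ch)

# The same Unicode number stored three different ways.
utf8_bytes = ch.encode("utf-8")        # sequence of 8-bit units
utf16_bytes = ch.encode("utf-16-be")   # one 16-bit unit, big-endian
numeric_ref = ch.encode("ascii", "xmlcharrefreplace")  # 7-bit-safe web form

print(values_8_bit, values_16_bit)
print(code_point, utf8_bytes, utf16_bytes, numeric_ref)
```

The last form is the one that can be typed into a web page using nothing but plain 7-bit characters.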

Unicode and Chinese

Displaying Chinese characters requires that your computer have

Over 20,000 characters have been made available for Chinese in the Unicode (or "UTF") scheme (so far). These include all of the characters in the other two sets (Big 5 and Guobiao, mentioned earlier). For example, the word "China" (Zhongguo) in Unicode is simply the code numbers 20013-22283 (in decimal) in traditional characters and 20013-22269 in simplified characters.
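The code numbers just quoted can be turned back into characters, and vice versa, with Python's built-in chr and ord. This is a small sketch; the helper name code_numbers is hypothetical and exists only for this illustration.

```python
# Build "Zhongguo" from the decimal Unicode numbers quoted in the text.
traditional = chr(20013) + chr(22283)
simplified = chr(20013) + chr(22269)


def code_numbers(word):
    """Return the decimal Unicode numbers of a word, joined by hyphens."""
    return "-".join(str(ord(character)) for character in word)


print(traditional, code_numbers(traditional))
print(simplified, code_numbers(simplified))
```

Note that the first character is shared; only the second differs between the traditional and simplified spellings.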

Microsoft, while adopting Unicode as its internal system for the representation of all languages, including Chinese, has designed Chinese-specific type fonts that are keyed to the Big 5 and Extended Guobiao character sets in addition to Unicode. (See Chinese Type Fonts, below.)

Big 5 (B5) and Guobiao (GB)

Big 5 (B5), named for a coalition of five Taiwan computer companies that devised it, is used in Taiwan and normally produces traditional (full) characters. An extension includes the Mainland simplified characters.

Guobiao (GB) (Chinese for "National Standard") is used in Mainland China and normally produces simplified characters. About 7000 characters are available in the standard GB set, but extensions include additional characters, including all of those in the B5 set.

Additional extensions to these systems include some idiosyncratic Hong Kong characters. This has produced a bewildering range of variants on both schemes.

(A transform of basic GB codes is used in 7-bit Email transmissions and some other contexts and is called "Hanzi" (HZ) coding --Hanzi is Chinese for "Chinese characters"-- but this is not used for web sites because it inherently conflicts with the interpreters in most browsers. To the extent that one can confine one's needed characters to the 7000 of the GB/HZ system, however, it makes Email in Chinese efficient even on relatively old computer systems.)
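As a rough illustration of the HZ transform, Python happens to ship codecs named "gb2312" and "hz". The sketch below assumes those codecs behave as documented and shows that the HZ form stays within 7 bits while carrying the same GB codes with the high bit stripped; the variable names are mine.

```python
# "Zhongguo" in simplified characters.
text = "\u4e2d\u56fd"

gb_bytes = text.encode("gb2312")   # ordinary 8-bit GB codes
hz_bytes = text.encode("hz")       # 7-bit HZ form, wrapped in ~{ ... ~}

# Clearing the high bit of each GB byte gives the payload carried inside HZ.
stripped = bytes(b & 0x7F for b in gb_bytes)

print(gb_bytes)
print(hz_bytes)
print(stripped)
```

Every byte of the HZ form is below 128, which is why it survives mail systems that only pass 7-bit text.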

Chinese Type Fonts

Because the three different standards used to represent Chinese characters vary in the numerical codes that they use, each of the main systems in theory requires separate type fonts corresponding with its system of numerical codes.

A few type fonts are capable of being used with more than one encoding. (Don't ask me how this is possible. I am only a grubby consumer of this stuff.) Three widely distributed Microsoft fonts that do this are:

The Extended B5 and Extended GB schemes contain about 21,000 characters and are closely similar to each other and to Unicode (and to ISO-2022-CN, yet another new standard being quietly developed by Chinese software engineers in China and Taiwan). At this point, the main incompatibility is not in which characters are included, but in which computer codes they are assigned to.
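The point about codes can be seen by encoding one character under each scheme. The sketch below uses Python's built-in "big5" and "gb2312" codecs, which cover the basic rather than the extended sets, so treat it as an approximation of the situation described above.

```python
# One character that exists in both the traditional and simplified sets.
ch = "\u4e2d"

# The same character, three different numerical codes.
codes = {name: ch.encode(name) for name in ("big5", "gb2312", "utf-16-be")}
for name, raw in codes.items():
    print(name, raw.hex())

# Reading Big 5 bytes as if they were GB yields a different character.
misread = codes["big5"].decode("gb2312")
print(misread == ch)
```

This is the reason a page viewed under the wrong coding system shows plausible-looking but wrong characters rather than an error.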

In general my Unicode Chinese pages call for SimSun, occasionally as a second choice after PMingLiU. I have not been successful in discovering the names of comparable Macintosh fonts, if there are any. (Several Macintosh salesmen have promised to provide this information, but none has ever followed up.)



Web Sites

We have reached the point where it takes no more than a free web browser, equipped with its "international" kit, to read web sites in Chinese, or, for that matter, to read web sites in other languages --Hindi, Arabic, Greek, Bulgarian, whatever-- in the correct scripts.

To the best of my knowledge it is not (yet) possible to mix different language-specific coding systems (such as GB and Big 5) on the same web page --or anyway I haven't figured out how to do it-- so any given page can use only one coding system. For mixed-script pages, therefore, Unicode is effectively the only game in town.
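A quick way to see why is to try encoding a mixed-script string under each system. In this sketch the helper can_encode and the sample text (English, Hindi, and traditional Chinese) are my own illustrative choices, and the codecs are Python's built-in ones.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# "English", "Hindi" in Devanagari, and "Zhongguo" in traditional characters.
mixed = "English, \u0939\u093f\u0928\u094d\u0926\u0940, \u4e2d\u570b"

for encoding in ("utf-8", "big5", "gb2312"):
    print(encoding, can_encode(mixed, encoding))
```

Only the Unicode encoding accepts the whole string; the language-specific ones fail on the characters outside their own sets.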

In theory, the pages on this site will automatically set your browser to the correct coding scheme and type font through a charset declaration written into the header of the page. In a few situations that doesn't seem to work, and you may need to make the changes yourself by adjusting the "character set" entry under the "view" menu item. In my experience, Internet Explorer is more likely to render pages correctly without a lot of fuss than Netscape Navigator is. (If you copy pages to a different server, they may or may not display without adjustment. Some servers have settings that override the international "UTF-8" specification.)
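To make the mechanism concrete, here is a simplified sketch of what a browser does with such a header: look for a charset declaration near the top of the page and decode the rest accordingly. The function declared_charset, the regular expression, and the sample page are all hypothetical; real browsers follow a far more elaborate procedure and also consult what the server says.

```python
import re

# Matches charset=... inside a <meta> tag (a deliberately loose pattern).
CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_charset(page_bytes, default="utf-8"):
    """Return the charset named in the page header, or a default."""
    match = CHARSET_RE.search(page_bytes[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# A made-up page whose body contains "Zhongguo" stored as UTF-8.
sample_page = (
    b"<html><head>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    b"</head><body>" + "\u4e2d\u570b".encode("utf-8") + b"</body></html>"
)

charset = declared_charset(sample_page)
body_text = sample_page.decode(charset)
print(charset, "\u4e2d\u570b" in body_text)
```

If the declaration is missing or overridden, the bytes are still there but get decoded under the wrong scheme, which is when the manual "character set" adjustment becomes necessary.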

If the page does not specify the appropriate type font, or if you don't have it installed, the default set for that language in your browser should automatically be used. (It will also be used if you have set the browser to overrule the web author's choice of fonts.) If your browser lacks an appropriate font associated with that language, the display still may not work, and missing characters will be represented by dummy characters (usually little boxes or question marks).



Links: Technical Information

Lots of sites try to tell you about all this and to sell you stuff, much of which is poorly documented and barely works or works not at all. There are a few excellent sites and some wonderful products, however. Here are some that are among the best sites I have found:

Links: Free & Shareware Downloads

Fonts

Miscellaneous Notes
