X-Git-Url: https://vcs.maemo.org/git/?p=samba;a=blobdiff_plain;f=docs%2Fhtmldocs%2FSamba3-HOWTO%2Funicode.html;fp=docs%2Fhtmldocs%2FSamba3-HOWTO%2Funicode.html;h=b54f20e126cf7bb6f126fb6d361e51479b17130b;hp=0000000000000000000000000000000000000000;hb=6bca4ca307d55b6dc888e56cee47aebcddbce786;hpb=7fd70fa738b636089bcc6c961aa3eaa02f20dda2 diff --git a/docs/htmldocs/Samba3-HOWTO/unicode.html b/docs/htmldocs/Samba3-HOWTO/unicode.html new file mode 100644 index 0000000..b54f20e --- /dev/null +++ b/docs/htmldocs/Samba3-HOWTO/unicode.html @@ -0,0 +1,317 @@ +
+ +Every industry eventually matures. One of the great areas of maturation is in +the focus that has been given over the past decade to make it possible for anyone +anywhere to use a computer. It has not always been that way. In fact, not so long +ago, it was common for software to be written for exclusive use in the country of +origin. +
+Of all the effort that has been brought to bear on providing native +language support for all computer users, the efforts of the +Openi18n organization +are deserving of special mention. +
+ +Samba-2.x supported a single locale through a mechanism called +codepages. Samba-3 is destined to become a truly transglobal +file- and printer-sharing platform. +
+ +Computers communicate in numbers. In texts, each number is +translated to a corresponding letter. The meaning that will be assigned +to a certain number depends on the character set (charset) + that is used. +
+ + +A charset can be seen as a table that is used to translate numbers to +letters. Not all computers use the same charset (there are charsets +with German umlauts, Japanese characters, and so on). The American Standard Code +for Information Interchange (ASCII) encoding system has been the normative character +encoding scheme used by computers to date. ASCII itself defines 128 characters, and the 8-bit charsets that extend it contain up to +256 characters. Using this mode of encoding, each character takes exactly one byte. +
+ + +There are also charsets that support extended characters, but those need at least +twice as much storage space as does ASCII encoding. Such charsets can contain +256 * 256 = 65536 characters, which is more than all possible +characters one could think of. They are called multibyte charsets because they use +more than one byte to store one character. +
+ +One standardized multibyte charset encoding scheme is known as +unicode. A big advantage of using a +multibyte charset is that you only need one. There is no need to make sure two +computers use the same charset when they are communicating. +
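As a rough illustration of how the number stored for a letter depends on the charset, the following Python sketch encodes the same letter under three charsets. The codec names are those of Python's standard library and are used here only for illustration; they are not Samba parameters.

```python
# Encode one letter under several charsets to show that the stored
# byte values (and even the byte count) depend on the charset in use.
LETTER = "ä"  # German a-umlaut

encoded = {name: LETTER.encode(name) for name in ("cp850", "latin-1", "utf-8")}

for name, raw in encoded.items():
    print(f"{name:8} -> {raw.hex()} ({len(raw)} byte(s))")
```

The two single-byte charsets store different byte values for the same letter, while UTF-8 stores two bytes, so a reader that assumes the wrong charset will show the wrong character.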
+
+
+
+Old Windows clients use single-byte charsets, which Microsoft calls
+codepages. However, there is no support for
+negotiating the charset to be used in the SMB/CIFS protocol. Thus, you
+have to make sure you are using the same charset when talking to an older client.
+Newer clients (Windows NT, 200x, XP) talk Unicode over the wire.
+
+ + +As of Samba-3, Samba can (and will) talk Unicode over the wire. Internally, +Samba knows of three kinds of character sets: +
+
+
+ unix charset: This is the charset used internally by your operating system.
+ The default is UTF-8, which is fine for most
+ systems and covers all characters in all languages. The default
+ in previous Samba releases was to save filenames in the encoding of the
+ clients, for example, CP850 for Western European countries.
+
+ display charset: This is the charset Samba uses to print messages
+ on your screen. It should generally be the same as the unix charset.
+
+ dos charset: This is the charset Samba uses when communicating with
+ DOS and Windows 9x/Me clients. It will talk Unicode to all newer clients.
+ The default depends on the charsets you have installed on your system.
+ Run testparm -v | grep "dos charset" to see
+ what the default is on your system.
+
+ +Because previous Samba versions did not do any charset conversion, +characters in filenames are usually not correct in the UNIX charset but only +for the local charset used by the DOS/Windows clients. +
Bjoern Jacke has written a utility named convmv +that can convert whole directory structures to different charsets with one single command. +
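The kind of conversion involved can be sketched in a few lines of Python. This is only an illustration of the idea, not how convmv is implemented; the function name `reencode_name` and the sample filename are hypothetical, and for real directory trees a dedicated tool such as convmv is the safer choice.

```python
def reencode_name(raw_name: bytes, old_charset: str = "cp850",
                  new_charset: str = "utf-8") -> bytes:
    """Decode filename bytes from old_charset and re-encode them in new_charset."""
    return raw_name.decode(old_charset).encode(new_charset)


# "Büro.txt" as a client using CP850 would have stored it (0x81 is u-umlaut).
legacy = b"B\x81ro.txt"
converted = reencode_name(legacy)
print(converted.decode("utf-8"))
```

The sketch only computes the new name; renaming files on disk and handling names that are already in the target charset are left out on purpose.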
+Setting up Japanese charsets is quite difficult. This is mainly because: +
+ + The Windows character set is extended from the original legacy Japanese + standard (JIS X 0208) and is not standardized. This means that the strictly + standardized implementation cannot support the full Windows character set. +
+ + + + + + Mainly for historical reasons, there are several encoding methods in + Japanese, which are not fully compatible with each other. There are + two major encoding methods. One is the Shift_JIS series used in Windows + and some UNIXes. The other is the EUC-JP series used in most UNIXes + and Linux. Moreover, Samba previously also offered several unique encoding + methods, named CAP and HEX, to keep interoperability with CAP/NetAtalk and + UNIXes that can't use Japanese filenames. Some implementations of the + EUC-JP series can't support the full Windows character set. +
There are some code conversion tables between Unicode and legacy + Japanese character sets. One is compatible with Windows, another one + is based on the reference of the Unicode consortium, and others are + a mixed implementation. The Unicode consortium does not officially + define any conversion tables between Unicode and legacy character + sets, so there cannot be a single standard one. +
The character set and conversion tables available in iconv() depend + on the iconv library that is available. In addition, the Japanese locale + names may be different on different systems. This means that the value of + the charset parameters depends on the implementation of iconv() you are using. +
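The effect of differing conversion tables can be observed with the two Shift_JIS-family codecs that ship with Python. This is a sketch under the assumption that Python's `shift_jis` and `cp932` codecs stand in for a JIS-based table and a Windows-compatible table respectively; the tables in your iconv library may differ from both.

```python
# One 2-byte Shift_JIS sequence, decoded with two different tables.
WAVE = b"\x81\x60"

via_shift_jis = WAVE.decode("shift_jis")  # JIS-based mapping
via_cp932 = WAVE.decode("cp932")          # Windows-compatible mapping

print(f"shift_jis: U+{ord(via_shift_jis):04X}")
print(f"cp932:     U+{ord(via_cp932):04X}")
print("same character:", via_shift_jis == via_cp932)
```

Because the same two bytes map to different Unicode characters, a filename converted with one table may not round-trip through the other.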
+ + + + + Although Windows uses fixed 2-byte UCS-2 encoding internally, + Shift_JIS series encoding is normally used in Japanese environments, + much as ASCII encoding is used in English environments. +
+ + The dos charset and + display charset + should be set to the locale compatible with the character set + and encoding method used on Windows. This is usually CP932 + but sometimes has a different name. +
+ + + + The unix charset can be either Shift_JIS series, + EUC-JP series, or UTF-8. UTF-8 is always available, but the availability of other locales + and the name itself depends on the system. +
+ Additionally, you can consider using the Shift_JIS series as the + value of the unix charset + parameter by using the vfs_cap module, which does the same thing as + setting “coding system = CAP” in the Samba 2.2 series. +
+ Choosing a value for unix charset + is a difficult question. Here is a list of the details, advantages, and + disadvantages of each value. +
+ Shift_JIS series means a locale that is equivalent to Shift_JIS,
+ used as a standard on Japanese Windows. In the case of Shift_JIS,
+ for example, if a Japanese filename consisting of 0x8ba4 and 0x974c
+ (a 4-byte Japanese character string meaning “share”) and “.txt”
+ is written from Windows on Samba, the filename on UNIX becomes
+ 0x8ba4, 0x974c, “.txt” (an 8-byte BINARY string), same as Windows.
+
The Shift_JIS series is commonly used as the Japanese locale on some commercial + UNIXes, such as HP-UX and AIX (although it is also possible + to use the EUC-JP locale series there). If you use the Shift_JIS series on these platforms, + Japanese filenames created from Windows can also be referenced on + UNIX.
+ If your UNIX is already working with Shift_JIS and there is a user + who needs to use Japanese filenames written from Windows, the + Shift_JIS series is the best choice. However, broken filenames + may be displayed, and some commands that cannot handle non-ASCII + filenames may abort while parsing filenames. In particular, filenames + may contain “\ (0x5c)”, which needs to be handled carefully. + It is best not to touch filenames written from Windows on UNIX. +
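The 0x5c problem arises because the second byte of some 2-byte Shift_JIS characters has the same value as an ASCII backslash. The following Python sketch checks one well-known example. The helper name `has_backslash_byte` is hypothetical, and Python's `cp932` and `euc_jp` codecs are assumed to be close enough to the Shift_JIS and EUC-JP series for this purpose.

```python
BACKSLASH = 0x5C


def has_backslash_byte(text: str, charset: str) -> bool:
    """Return True if any byte of the encoded text equals 0x5c."""
    return BACKSLASH in text.encode(charset)


name = "表.txt"  # the name contains no backslash character

in_shift_jis = has_backslash_byte(name, "cp932")
in_euc_jp = has_backslash_byte(name, "euc_jp")
print("Shift_JIS series:", in_shift_jis)
print("EUC-JP series:   ", in_euc_jp)
```

A tool that scans the Shift_JIS bytes without knowing the charset may treat that byte as an escape character or path separator, which is why such filenames need care.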
+ Note that most Japanized free software actually works with EUC-JP + only. It is good practice to verify that the Japanized free software can work + with Shift_JIS. +
+ + + EUC-JP series means a locale that is equivalent to the industry + standard called EUC-JP, widely used in Japanese UNIX (although EUC + contains specifications for languages other than Japanese, such as + EUC-KR). In the case of EUC-JP series, for example, if a Japanese + filename consists of 0x8ba4 and 0x974c and “.txt” is written from + Windows on Samba, the filename on UNIX becomes 0xb6a6, 0xcdad, + “.txt” (an 8-byte BINARY string). +
+ + + + + + + + + + + EUC-JP is commonly used as the Japanese locale on open source UNIX systems such as Linux and FreeBSD, and on commercial UNIX systems such as Solaris, + IRIX, and Tru64 UNIX (although on Solaris it is also possible to use Shift_JIS and UTF-8, + and on Tru64 UNIX it is possible to use Shift_JIS). If you use the EUC-JP series, most Japanese filenames created from + Windows can also be referenced on UNIX. Also, most Japanized free software works only with EUC-JP. +
+ It is recommended to choose EUC-JP series when using Japanese filenames on UNIX. +
+ Although there is no character that needs to be carefully treated + like “\ (0x5c)”, broken filenames may be displayed and some + commands that cannot handle non-ASCII filenames may abort + while parsing filenames. +
+ + Moreover, if you built Samba using a separately installed libiconv, + the eucJP-ms locale included in libiconv and the EUC-JP series locale + included in the operating system may not be compatible. In this case, you may need to + avoid using incompatible characters in filenames. +
+ UTF-8 means a locale equivalent to UTF-8, the international standard defined by the Unicode consortium. In
+ UTF-8, a character is expressed using 1 to 4 bytes. In the case of the Japanese language,
+ most characters are expressed using 3 bytes. Since Windows uses Shift_JIS, where a character is expressed with 1
+ or 2 bytes, to express Japanese, the byte length of a UTF-8
+ string is roughly 1.5 times that of the original Shift_JIS string. In the case of UTF-8, for example, if a Japanese
+ filename consisting of 0x8ba4 and 0x974c and “.txt” is written from Windows on Samba, the filename
+ on UNIX becomes 0xe585, 0xb1e6, 0x9c89, “.txt” (a 10-byte BINARY string).
+
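The byte sequences quoted for the three locale families can be reproduced with a short Python sketch. It assumes that Python's `cp932`, `euc_jp`, and `utf-8` codecs agree with the Shift_JIS series, EUC-JP series, and UTF-8 for these two common characters; it says nothing about characters where the conversion tables differ.

```python
NAME = "共有.txt"  # "share" followed by ".txt"

# Encode the same filename in each of the three candidate charsets.
encoded = {codec: NAME.encode(codec) for codec in ("cp932", "euc_jp", "utf-8")}

for codec, raw in encoded.items():
    print(f"{codec:7} {len(raw):2d} bytes  {raw.hex(' ')}")
```

The Japanese part grows from 4 bytes in the Shift_JIS and EUC-JP series to 6 bytes in UTF-8, which is where the factor of 1.5 comes from.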
+ For systems where iconv() is not available or where iconv()'s locales + are not compatible with Windows, UTF-8 is the only locale available. +
+ There are no systems that use UTF-8 as the default locale for Japanese. +
+ Some broken filenames may be displayed, and some commands that + cannot handle non-ASCII filenames may abort while parsing + filenames. In particular, filenames may contain “\ (0x5c)”, which + must be handled carefully, so it is best not to touch filenames + written from Windows on UNIX. +
+ + + + In addition, although it is not directly related to Samba, + there are subtle differences between the iconv() function, which is + generally used on UNIX, and the functions used on other platforms, + such as Windows and Java. Conversion between + Shift_JIS and Unicode or UTF-8 must therefore be done with care and with awareness + of the limitations involved in the process. +
+ + Although Mac OS X uses UTF-8 as its encoding method for filenames, + it uses an extended UTF-8 specification that Samba cannot handle, so + UTF-8 locale is not available for Mac OS X. +
+ + + + CAP encoding means a specification used in CAP and NetAtalk, file + server software for Macintosh. In the case of CAP encoding, for + example, if a Japanese filename consisting of 0x8ba4 and 0x974c and + “.txt” is written from Windows on Samba, the filename on UNIX + becomes “:8b:a4:97L.txt” (a 14-byte ASCII string). +
+ For CAP encoding, a byte that cannot be expressed as an ASCII + character (0x80 or above) is encoded in an “:xx” form. You need to take + care with filenames that contain “\ (0x5c)”, but filenames are not + broken on a system that cannot handle non-ASCII filenames. +
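The “:xx” rule can be sketched in Python as follows. This is a minimal illustration of the rule as described here, not the vfs_cap implementation; the function name `cap_encode` is hypothetical, and details such as how a literal “:” in a filename is treated are deliberately left out.

```python
def cap_encode(raw: bytes) -> str:
    """Replace every byte >= 0x80 with ':xx' (lowercase hex); keep ASCII bytes."""
    return "".join(f":{b:02x}" if b >= 0x80 else chr(b) for b in raw)


# Start from the Shift_JIS series bytes of the example filename.
shift_jis_name = "共有.txt".encode("cp932")
cap_name = cap_encode(shift_jis_name)
print(cap_name, len(cap_name))
```

Note that the byte 0x4c is left as the ASCII letter “L”, which is why the encoded name ends in “:97L.txt”.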
+ The greatest merit of CAP encoding is the compatibility of encoding + filenames with CAP or NetAtalk. These are respectively the Columbia AppleTalk + Protocol, and the NetAtalk Open Source software project. + Since these software applications write a filename on UNIX with CAP encoding, if a + directory is shared with both Samba and NetAtalk, you need to use + CAP encoding to prevent non-ASCII filenames from being broken. +
+ However, recently, NetAtalk has been + patched on some systems to write filenames with EUC-JP (e.g., Japanese original Vine Linux). + In this case, you need to choose EUC-JP series instead of CAP encoding. +
+ vfs_cap itself is available for non-Shift_JIS series locales for + systems that cannot handle non-ASCII characters or systems that + share files with NetAtalk. +
+ To use CAP encoding on Samba-3, you should use the unix charset parameter and VFS + as in the VFS CAP smb.conf file. +
Example 29.1. VFS CAP
[global]
# the locale name "CP932" may be different
dos charset = CP932
unix charset = CP932
[cap-share]
# load the vfs_cap module for this share
vfs objects = cap
+ + + + + You should set CP932 if using GNU libiconv for unix charset. With this setting, + filenames in the “cap-share” share are written with CAP encoding. +
+Here is some additional information regarding individual implementations: +
+ To handle Japanese correctly, you should apply the patch + libiconv-1.8-cp932-patch.diff.gz + to libiconv-1.8. +
+ Using the patched libiconv-1.8, these settings are available: +
+dos charset = CP932
+unix charset = CP932 / eucJP-ms / UTF-8
+               |       |
+               |       +-- EUC-JP series
+               +-- Shift_JIS series
+display charset = CP932
+
+ Other Japanese locales (for example, Shift_JIS and EUC-JP) should not + be used because they lack compatibility with Windows. +
+ To handle Japanese correctly, you should apply a patch + to glibc-2.2.5/2.3.1/2.3.2 or should use the patch-merged versions, glibc-2.3.3 or later. +
+ Using the above glibc, these settings are available: +
dos charset = CP932
unix charset = CP932 / eucJP-ms / UTF-8
display charset = CP932
+
+ Other Japanese locales (for example, Shift_JIS and EUC-JP) should not + be used because they lack compatibility with Windows. +
+In the Samba-2.2 series and earlier, the “coding system” parameter was used. The default codepage in Samba +2.x was code page 850. In the Samba-3 series this has been replaced with the unix charset parameter. Table 29.1, Japanese Character Sets in Samba-2.2 and Samba-3, +shows the mapping table when migrating from the Samba-2.2 series to Samba-3. +
Table 29.1. Japanese Character Sets in Samba-2.2 and Samba-3
Samba-2.2 Coding System | Samba-3 unix charset
---|---
SJIS | Shift_JIS series
EUC | EUC-JP series
EUC3[a] | EUC-JP series
CAP | Shift_JIS series + VFS
HEX | currently none
UTF8 | UTF-8
UTF8-Mac[b] | currently none
others | none
[a] Only exists in Japanese Samba version
[b] Only exists in Japanese Samba version
“Samba is complaining about a missing CP850.so file.”
+ CP850 is the default dos charset. + The dos charset is used to convert data to the codepage used by your DOS clients. + If you do not have any DOS clients, you can safely ignore this message.
+ CP850 should be supported by your local iconv implementation. Make sure you have all the required packages installed.
+ If you compiled Samba from source, make sure that the configure process found iconv. This can be
+ confirmed by checking the config.log file that is generated when
+ configure is executed.