Aaron,

I can't say my answers are real crisp, but...

1 - If you are required to store Asian characters, what CCSID do you use?

I didn't catch where you are storing the data. As you know, there are a
number of possible CCSIDs that can get the job done. As you asked "what
CCSID do you use", my answer is 1200 (UTF-16), assuming we're talking DB2
tables/files. If you're talking IFS, then I've used both 1200 (UTF-16) and
1208 (UTF-8). And if we go far enough back, 13488 (UCS-2).
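
For anyone wanting to experiment off-platform, here is a minimal Python sketch of the three CCSIDs mentioned above. The mapping table and the `encode_for_ccsid` helper are hypothetical names for illustration only, not an IBM API. It assumes CCSID 1200 is big-endian UTF-16 and approximates 13488 (UCS-2) as UTF-16 restricted to characters in the Basic Multilingual Plane.

```python
# Hypothetical mapping from the CCSIDs named above to Python codec names.
# Assumption: CCSID 1200 is treated as big-endian UTF-16.
CCSID_TO_CODEC = {
    1200: "utf-16-be",   # UTF-16
    1208: "utf-8",       # UTF-8
    13488: "utf-16-be",  # UCS-2, approximated: same bytes as UTF-16 for BMP text
}


def encode_for_ccsid(text, ccsid):
    """Encode text with the Python codec standing in for the given CCSID."""
    # UCS-2 has no surrogate pairs, so it cannot hold characters above U+FFFF.
    if ccsid == 13488 and any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("UCS-2 cannot represent characters outside the BMP")
    return text.encode(CCSID_TO_CODEC[ccsid])


sample = "中文"  # two BMP characters
sizes = {ccsid: len(encode_for_ccsid(sample, ccsid)) for ccsid in CCSID_TO_CODEC}
print(sizes)  # {1200: 4, 1208: 6, 13488: 4}
```

The UCS-2 check is the practical difference between 13488 and 1200: text limited to the BMP encodes identically in both, while supplementary characters need UTF-16 surrogate pairs.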

2 - Are there IBM i server-side advantages to using UTF-16? (i.e. is text
searching quicker for Asian languages because they have two bytes in UTF-16
vs. 3 in UTF-8?)

Assuming DB2 storage, some consider the associated data types to be either
advantages or disadvantages. UTF-8 is implemented using the character data
type, while UTF-16 is implemented using the graphic data type. With graphic,
what you read is what you get, and what you write is what gets written (well,
depending on the interface used to DB2). With character, the job CCSID can
come into play. From my answer to #1 above you can safely guess that I
consider conversion to/from the job CCSID to be a disadvantage.

If you are really going to be doing searches through the Chinese text then
having 50% more bytes to fetch and compare could have an impact (though I
don't know how much data you're talking about, so this could just be a nit).
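
The 50% figure is easy to check with a few lines of Python. This is just a size comparison under the assumption that the text is all BMP Chinese characters; `utf8_overhead` is an illustrative helper name, and it says nothing about how DB2 actually scans the data.

```python
def utf8_overhead(text):
    """Return the UTF-8 size of text relative to its UTF-16 size."""
    return len(text.encode("utf-8")) / len(text.encode("utf-16-be"))


# Pure Chinese text: 3 bytes per character in UTF-8 vs. 2 in UTF-16.
print(utf8_overhead("中文文本"))     # 1.5

# Pure ASCII text goes the other way: 1 byte vs. 2.
print(utf8_overhead("plain ASCII"))  # 0.5
```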

3 - If you've used both UTF-8 and UTF-16, have you found one to be more
advantageous than the other?

Yes, but it usually is driven by the other application/platform. In general
I use UTF-16 when possible, but that really is just a personal preference.

Bruce



On Fri, Nov 22, 2013 at 3:14 PM, Aaron Bartell <aaronbartell@xxxxxxxxx> wrote:

My team has been looking at the advantages of using UTF-8 vs. UTF-16 if the
majority of the data you store is double-byte in nature. A number of
advantages and disadvantages are listed at this link:
http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16

One thing that stuck out to us on that link was this:

*Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in
UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could
take more space in UTF-8 if there are more of these characters than there
are ASCII characters. This happens for pure text[34] but rarely for HTML
documents or documents in XML based formats such as .docx or .odt. For
example, both the Japanese UTF-8 and the Hindi Unicode articles on
Wikipedia take more space in UTF-16 than in UTF-8.*

To sum it up, many (most?) Asian characters will be 3 bytes in UTF-8 and
only 2 bytes in UTF-16 (characters outside the Basic Multilingual Plane take
4 bytes in both). The majority of a webpage is markup, not actual
information from your database or text literals. Many (most?) new
applications convey data to the browser vs. staying only on the machine.
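
To make the markup point concrete, here is a small Python sketch comparing encoded sizes for bare Chinese text, the same text wrapped in a bit of HTML, and one supplementary-plane character. The sample strings and the `encoded_sizes` helper are invented for illustration; real pages will have a different mix of markup and data.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


plain = "中文文本"                        # database-style value, no markup
html = '<td class="name">中文文本</td>'  # same value inside ASCII markup
supplementary = "\U00020000"             # a CJK character outside the BMP

for label, sample in (("plain", plain), ("html", html), ("supplementary", supplementary)):
    utf8_size, utf16_size = encoded_sizes(sample)
    print(f"{label}: UTF-8={utf8_size} bytes, UTF-16={utf16_size} bytes")
```

With these samples the bare text is smaller in UTF-16, the HTML fragment is smaller in UTF-8 because the ASCII markup dominates, and the supplementary character costs 4 bytes either way.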

*I now have 3 questions:*

1 - If you are required to store Asian characters, what CCSID do you use?

2 - Are there IBM i server-side advantages to using UTF-16? (i.e. is text
searching quicker for Asian languages because they have two bytes in UTF-16
vs. 3 in UTF-8?)

3 - If you've used both UTF-8 and UTF-16, have you found one to be more
advantageous than the other?

Aaron Bartell
--
This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing list
To post a message email: MIDRANGE-L@xxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: http://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxx
Before posting, please take a moment to review the archives
at http://archive.midrange.com/midrange-l.






This mailing list archive is Copyright 1997-2024 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].
