Encoding the Database

Comments, bug reports, discussions on CCDICT.
Dylan Sung

Re: Encoding the Database

Post by Dylan Sung »

BTW, the SimSun fonts include Extension A of Unicode 3.0, so you can display the first 6000 characters or so, which originally appeared as rectangular boxes. Had you not selected the SimSun font, you could have scrolled down to find that one of the first readable characters is the character yi/yat/yit/ichi/itsu/hitotsu/il/one. That's what gave the game away for me.

Cheers,
Dyl.
Thomas Chin

Re: Encoding the Database

Post by Thomas Chin »

Hi,

Opening the file in Open Office the way Dylan described is exactly how I tested it. I have no additional directions, except that I tested it in OO v1.1 (this is probably also the version you used).

The font I used is a proprietary font based on Song with ExtA and ExtB. But even fonts such as MingLiU or Arial Unicode MS should enable you to display the largest part of the file (except ExtA).

You are also correct about the bug in the last part of the file; it was caused by a scripting error. I corrected it in a newer version:

http://www.chineselanguage.org/CCDICT/S ... 4.0.tar.gz

or the spreadsheet

http://www.chineselanguage.org/CCDICT/S ... -4.4.0.sxc

Sorry for the inconvenience.

Regards
Benjamin Barrett

Re: Encoding the Database

Post by Benjamin Barrett »

Thanks, Dylan, for the additional help.

I toyed around with it for a couple of hours on Open Office, but to no avail.

The best I can get is that the first 6000-plus characters are question marks or blank, but the characters after that are there.

The thing I did during those couple of hours was to change the language default setting for when an application does not support Unicode. I changed it between Japanese (my default) and English, and each variety of Chinese. This made a difference in whether the first 6000-plus characters were blanks, boxes or question marks, but regardless of the font (Unicode, Sim Sun, etc.), those first characters just don't show.

I'm pretty frustrated at this point. I think what I'll do is send the file to a fellow graduate student who is pretty knowledgeable in encoding systems. He said he'd take a look at it for me.

Still, if anyone has any other ideas, I would like to hear them, to help in the future and for other people.

Best
Benjamin Barrett
Graduate Student
Department of Linguistics, University of Washington

Thomas Chin

Re: Encoding the Database

Post by Thomas Chin »

Hi Benjamin,

Don't worry about the first 6000 chars. They are from ExtA, and you probably will not need them. You need an ExtA-supporting font to view these (most fonts do not support them).

In addition, there is not much pronunciation data in these records. My suggestion is to stick with the viewable part, which is probably the part you will actually need.

Regards,
Dylan Sung

Re: Encoding the Database

Post by Dylan Sung »

Hi Benjamin,

I pretty much agree with Thomas: the Extension A characters are rarer, or just variants. Some of them are characters missing from the original Chinese GB and Big5 standards, which are used to provide a direct conversion from traditional to simplified characters.

Unless you are working with ancient texts, you may not need those rare characters. For work on characters in current popular usage, those that appear in Unicode 2.1 (i.e. the 21000 or so characters you can see) are pretty much all you need.
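To make the distinction concrete, here is a minimal Python sketch (Python rather than a spreadsheet macro, so it can be run outside Open Office) that reports which CJK block a character falls in. The block ranges are the ones given in the Unicode code charts; the names `CJK_BLOCKS` and `cjk_block` are made up for this illustration. Because Extension A starts at U+3400 and the main block starts at U+4E00 (the character for "one"), a file sorted by code point would list the Extension A rows first, which fits what is described above.

```python
# Block ranges as published in the Unicode code charts.
# CJK_BLOCKS and cjk_block are hypothetical names used only for this sketch.
CJK_BLOCKS = [
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
]


def cjk_block(char):
    """Return the name of the CJK block containing char, or None."""
    cp = ord(char)
    for name, first, last in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None


if __name__ == "__main__":
    # One sample character from each block.
    for ch in ("\u3400", "\u4e00", "\U00020000"):
        print(f"U+{ord(ch):04X}", cjk_block(ch))
```

A row whose character reports Extension A or Extension B will only display with a font that covers that block.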


Cheers,
Dyl.
Benjamin Barrett

Re: Encoding the Database

Post by Benjamin Barrett »

Doh! Now I feel a lot better :)

At this point in time, I'm going to just look at some sound correspondences of the modern languages to get my feet wet.

If I have any updates or additional information, I'll write back to the forum.

Thanks for all the help, Thomas and Dylan.
Silent_Lamb

Re: Encoding the Database

Post by Silent_Lamb »

I'm using OpenOffice 1.1.3 and the Chinese characters are missing. All I get are funny-looking symbols instead. How do I fix this? Any ideas?
Thomas Chin

Re: Encoding the Database

Post by Thomas Chin »

What file did you open?
sunwukong
Posts: 6
Joined: Wed Jan 17, 2007 3:45 am
Location: Ione, WA

Port of Unihan to Excel

Post by sunwukong »

I've had some success porting the Unihan DB to Excel. So far I'm able to show the 4 digit U+nnnn characters but am having trouble with the 5 digit characters.

I'm using this VB macro to resolve the characters:

Sub aaa()
    '20050626, sunwukong (AT) povn(dot)com (Pat kirol)
    'Put the 4-digit Unicode hex values in column B and run this script;
    'it will insert the corresponding characters in column C.
    'In this case n runs from 1 to 6; adjust 6 to the last row
    'of 4-digit Unicode numbers.
    Dim n As Long

    For n = 1 To 6
        Debug.Print n   'show progress in the Immediate window
        'ChrW only accepts a single 16-bit value, so this works for
        '4-digit code points only
        Cells(n, 3).Value = ChrW("&H" & Cells(n, 2))
    Next n
End Sub

What is even more interesting is that you cannot get the characters to display consistently and must edit the font used for each character. In my case I am editing in a lookup code for the kTaiwanTelegraph field (CCT or CTC). I'm up against a brick wall because CCT=5983 is a 5-digit Unicode value and this routine will not handle it.
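The 4-digit versus 5-digit split can be checked outside Excel. The following is a small Python sketch, not part of the macro above, that parses a "U+nnnn" style value and counts how many 16-bit UTF-16 units the character needs; `parse_codepoint` and `utf16_units` are hypothetical helper names. A routine that writes one 16-bit value per character can only cover the cases where the count is 1.

```python
def parse_codepoint(text):
    """Parse a 'U+XXXX' or bare hex string into an integer code point."""
    text = text.strip()
    if text.upper().startswith("U+"):
        text = text[2:]
    return int(text, 16)


def utf16_units(codepoint):
    """Return how many 16-bit UTF-16 code units the code point needs."""
    # Each UTF-16 code unit is two bytes in the big-endian encoding.
    return len(chr(codepoint).encode("utf-16-be")) // 2


if __name__ == "__main__":
    for value in ("U+4E00", "U+20000"):
        cp = parse_codepoint(value)
        print(value, "needs", utf16_units(cp), "UTF-16 code unit(s)")
```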
tfc.chin
Posts: 50
Joined: Wed Mar 09, 2005 12:07 pm

Re: Port of Unihan to Excel

Post by tfc.chin »

sunwukong wrote: I've had some success porting the Unihan DB to Excel. So far I'm able to show the 4 digit U+nnnn characters but am having trouble with the 5 digit characters.
If you are using an MS Windows system you need to insert 5-digit (U+nnnnn) characters as a surrogate pair (recent versions of Office support them).

'*----------------------------------------------------------*
'* Name       : vbShiftRight                                *
'*----------------------------------------------------------*
'* Purpose    : Shift 32-bit integer value right 'n' bits.  *
'*----------------------------------------------------------*
'* Parameters : Value  Required. Value to shift.            *
'*            : Count  Required. Number of bit positions to *
'*            :        shift value.                         *
'*----------------------------------------------------------*
'* Description: This function is equivalent to the 'C'      *
'*            : language construct '>>'                     *
'*            : (for non-negative values).                  *
'*----------------------------------------------------------*
Public Function vbShiftRight(ByVal Value As Long, _
                             ByVal Count As Integer) As Long
    Dim i As Integer

    vbShiftRight = Value

    'Each integer division by 2 shifts the value right by one bit
    For i = 1 To Count
        vbShiftRight = vbShiftRight \ 2
    Next

End Function
'*----------------------------------------------------------*
'* Name       : WriteSurrogate                              *
'*----------------------------------------------------------*
'* Purpose    : Returns a surrogate pair of ISO10646:1993   *
'*            : CJK Extension B codepoints                  *
'*----------------------------------------------------------*
'* Parameters : Codepoint  Required. 5-digit string to be   *
'*            :            converted.                       *
'*----------------------------------------------------------*
'* Description: Based on the C++ conversion algorithm.      *
'*----------------------------------------------------------*
Function WriteSurrogate(Codepoint As String) As String
    Dim Code As Long
    Dim highSur As Long
    Dim lowSur As Long

    'Parse the 5-digit hex string, e.g. "20000" -> &H20000
    Code = Val("&H" & Codepoint)
    'High (leading) surrogate: (Code >> 10) + &HD7C0.
    'The trailing & forces a Long literal so the value stays positive.
    highSur = vbShiftRight(Code, 10) + &HD7C0&
    'Low (trailing) surrogate: &HDC00 plus the low 10 bits of Code
    lowSur = &HDC00& Or (Code And &H3FF&)
    WriteSurrogate = ChrW(highSur) & ChrW(lowSur)
End Function

Did not test the code for typos.
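For anyone who wants to check the arithmetic without Excel, here is a short Python sketch of the same conversion. It computes the pair in the textbook form (subtract 0x10000, split into two 10-bit halves) and in the shortcut form used in the VB function, (Code >> 10) + &HD7C0, so the two can be compared. The function names are invented for this example.

```python
def to_surrogate_pair(codepoint):
    """Return (high, low) UTF-16 surrogates for a supplementary code point."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def shortcut_pair(codepoint):
    """Same result using the (code >> 10) + 0xD7C0 form."""
    # 0xD7C0 is 0xD800 - (0x10000 >> 10), so the subtraction is folded in.
    return (codepoint >> 10) + 0xD7C0, 0xDC00 | (codepoint & 0x3FF)


if __name__ == "__main__":
    # U+20000 is the first code point of CJK Extension B.
    high, low = to_surrogate_pair(0x20000)
    print(f"U+20000 -> {high:04X} {low:04X}")
```

Both forms give the same pair for every supplementary code point, because 0x10000 has no bits set in its low 10 positions.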

Good luck,

Thomas