Encoding the Database

Comments, bug reports, discussions on CCDICT.
Benjamin Barrett

Encoding the Database

Postby Benjamin Barrett » Wed Oct 01, 2003 8:03 pm

I am beginning a project to show the phonetic differences between Japanese, Korean and Mandarin, and found this dictionary download.

I would like to reformat the txt data into a table of some sort. Is there a simple way to convert the character codes into the actual characters, or is there a good reference site to learn how to do this?

I tried to look around, but I'm not exactly sure what sort of converter it is I'm looking for.

Any help would be appreciated.

Benjamin Barrett, Graduate Student
Department of Linguistics
University of Washington
students.washington.edu/bjb5
gogaku@ix.netcom.com

Dylan Sung

Re: Encoding the Database

Postby Dylan Sung » Thu Oct 02, 2003 6:50 am

Thomas Chin can tell you what the encoding is, but there are some useful code converters available for download.

One such comes with NJStar's Communicator available via

http://www.njstar.com

If you don't want to lose information between encodings, it is better to work with one of the Unicode encodings. For example, Big5 to GB is lossy, since GB does not contain all the characters of Big5; equally, GB to Big5 is lossy, since there are characters in GB that cannot be found in Big5.
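A minimal sketch of that lossiness, using Python's built-in `big5` and `gb2312` codecs as stand-ins for "Big5" and "GB" (the helper name `can_encode` and the two sample characters are chosen for illustration only):

```python
def can_encode(char: str, encoding: str) -> bool:
    """Return True if the character exists in the given legacy encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+5011 is the traditional form 們, U+4EEC the simplified form 们.
for char in ("\u5011", "\u4eec"):
    support = {enc: can_encode(char, enc) for enc in ("big5", "gb2312", "utf-8")}
    print(f"U+{ord(char):04X}", support)
```

Each character is present in only one of the two legacy encodings, while UTF-8 can represent both, which is why converting through Unicode avoids the loss.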

See also the information in the unihan.txt file (although it may contain errors), which you can download from the Unicode site

http://www.unicode.org

Cheers,
Dyl.

Forum admin

Re: Encoding the Database

Postby Forum admin » Thu Oct 02, 2003 7:54 am

Hi,

The character code field in the text file is the hexadecimal representation of a Unicode codepoint in plain text. The remaining fields are in UTF-8 format (you probably do not have to deal with this).

I do not know what the easiest way would be for you but here are some suggestions:

1. Enclose the field (e.g. 4E00) between '&#x' and ';', resulting in e.g. &#x4E00;. A web browser interprets this and displays the character (here 一); from there you can copy the data to, for example, a word processor.

2. Use a scripting language to convert the code to a 'real' number and then output it as either UTF-8 or UCS-2 (Unicode 2-byte), or, after conversion, as Big5 or GB if you wish.

3. If all this is too complicated, contact me again. I might be able to provide the data in another format.
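Suggestions 1 and 2 could be sketched in Python roughly as follows. This is only an illustration under the assumption that the code field is a bare hexadecimal string such as 4E00; the function names are made up for this example.

```python
def codepoint_to_char(hex_code: str) -> str:
    """Turn a hexadecimal code point such as '4E00' into the character itself."""
    return chr(int(hex_code, 16))


def codepoint_to_ncr(hex_code: str) -> str:
    """Build the HTML numeric character reference described in suggestion 1."""
    return "&#x" + hex_code.upper() + ";"


char = codepoint_to_char("4E00")
print(codepoint_to_ncr("4E00"))   # reference a browser can render
print(char.encode("utf-8"))       # the character as UTF-8 bytes
print(char.encode("utf-16-be"))   # 2-byte form (for characters in the basic plane)
```

From the decoded character, `char.encode("big5")` or `char.encode("gb2312")` would give the legacy encodings, raising `UnicodeEncodeError` for characters those sets lack.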

Regards,

Thomas Chin

Benjamin Barrett

Re: Encoding the Database

Postby Benjamin Barrett » Mon Oct 06, 2003 6:34 am

Thanks for the help. I've managed to get the characters in the 7XXX range to show up, but not other ones. The code I'm using is:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
The quick brown fox &#x8D14;
</body>

With this, I get a character (贔, U+8D14) that looks like three shellfish, one on top of two others, but they are each one stroke short.

Generally, though, I just get a dot. Any ideas on what I'm doing wrong?

Also, would it be possible to get the text dump with the Korean as well? I think the easiest format would be XXXX.Y [tab] [field type] [data] [tab], etc., with each character separated with a carriage return. Right now, the character number repeats for each type of data.

Sorry if I'm asking too much, but this is an incredible project you have, and it would be a shame (and a waste of time) to re-do it.

Best regards
Benjamin Barrett, Graduate Student
Department of Linguistics
University of Washington
students.washington.edu/bjb5
gogaku@ix.netcom.com

Benjamin Barrett

Re: Encoding the Database

Postby Benjamin Barrett » Mon Oct 06, 2003 7:46 pm

I wrote: "Also, would it be possible to get the text dump with the Korean as well? I think the easiest format would be XXXX.Y [tab] [field type] [data] [tab], etc., with each character separated with a carriage return. Right now, the character number repeats for each type of data."

I guess what I should have said is any format that is easy to put into a database, or preferably Excel.
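The reshaping asked for here (one row per character instead of one line per field) could be sketched like this. The input layout of code, field name and value separated by tabs, and the field names themselves, are assumptions for illustration and not the actual CCDICT layout.

```python
def pivot_records(lines):
    """Group 'code<TAB>field<TAB>value' lines into one dict per character code."""
    table = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        code, field, value = line.split("\t", 2)
        table.setdefault(code, {})[field] = value
    return table


def to_rows(table, fields):
    """Flatten the grouped table into tab-separated rows, one per code."""
    return [
        "\t".join([code] + [values.get(field, "") for field in fields])
        for code, values in table.items()
    ]


# Hypothetical sample input; the real file's field names may differ.
sample = [
    "4E00.0\tMandarin\tyi1",
    "4E00.0\tKorean\til",
    "4E01.0\tMandarin\tding1",
]
for row in to_rows(pivot_records(sample), ["Mandarin", "Korean"]):
    print(row)
```

Missing fields come out as empty cells, so every row has the same number of columns and imports cleanly into a spreadsheet or database.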

Best

Forum admin

Re: Encoding the Database

Postby Forum admin » Thu Oct 09, 2003 7:10 am

Sorry for my late reply. I was away for a meeting for a couple of days. I'll prepare a data file ASAP.

Regards,

Thomas

Thomas Chin

Re: Encoding the Database

Postby Thomas Chin » Sun Oct 12, 2003 12:14 pm

Hi again,

I prepared a tab-separated data file with one character entry per row. The first column with the character codepoint is UTF8-encoded.

You should be able to import the file in MS Excel (you should check whether Excel can convert UTF8 to its native Unicode; I know MS Access can do it). You might need to set a Chinese font to view the characters.
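If a program does not recognise the file as UTF-8 on its own, one thing that may help is re-saving it with a UTF-8 byte order mark, which some applications (reportedly including Excel) use to detect the encoding. A rough sketch, with made-up file names and a tiny sample file standing in for the real data:

```python
import os
import tempfile


def resave_with_bom(src_path, dst_path):
    """Copy a UTF-8 text file, writing a byte order mark at the start."""
    # 'utf-8-sig' on read drops an existing BOM, so it is never doubled.
    with open(src_path, encoding="utf-8-sig", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(text)


# Demo on a small stand-in file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.txt")
dst = os.path.join(workdir, "sample-bom.txt")
with open(src, "w", encoding="utf-8", newline="") as f:
    f.write("\u4e00\tyi1\n")
resave_with_bom(src, dst)
with open(dst, "rb") as f:
    print(f.read()[:3])  # the three BOM bytes
```

Whether this is enough depends on the application and version, so treat it as something to try rather than a guaranteed fix.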

I have tested the import of the file in the spreadsheet of the Open Office suite myself and it converts without problems.

http://www.chineselanguage.org/CCDICT/S ... 3.0.tar.gz

Good luck,

Thomas Chin

Benjamin Barrett

Re: Encoding the Database

Postby Benjamin Barrett » Sat Oct 18, 2003 3:58 am

Thanks, Thomas.

I've been trying different things to get the characters to display. I can't get the file to work on my English or Japanese computer; both of them are Windows XP with Chinese and Unicode fonts installed.

I'll try a few more things...

Best

Benjamin Barrett

Re: Encoding the Database

Postby Benjamin Barrett » Sat Oct 18, 2003 8:17 pm

Thomas,

No luck. I even downloaded Open Office, but all I get is garbled characters for the Chinese glyphs whether I use Notepad, Memo Pad, Excel, Word, or Open Office.

I have two computers; one is English XP and the other is Japanese XP. On both, I've enabled Chinese, so I can't see what else could be the problem...

Can you describe the process you used to import/open the file in the spreadsheet application? Also, what font are you using?

TIA
Benjamin Barrett

Dylan Sung

Re: Encoding the Database

Postby Dylan Sung » Sun Oct 19, 2003 8:03 am

Hi Benjamin, Thomas,

I decided to download the .tar.gz file that Thomas has so kindly put online and after unpacking it, I got a .txt file. I have Open Office 1.0 so I opened it as though it was a database, but in text form. To do this, go to

File > Open > Files of type

Under "Files of type" you need to scroll down to "Text CSV", which is in the spreadsheet section of the type list. Once done, select the text file name

ccdict-4.3.0.txt

Click on the Open button, then set the encoding to UTF-8.

Once that's done, the file opens. You will see a lot of upright rectangular boxes on the far left column. In order for these to display, you need to change the font setting. Press "Control A" to select all the text, then you need to change the font in the font box in the top left of the screen, just below the filename.

If you have XP, and have Chinese installed, you should have a font file named SimSun or SimSun-18030. Select this font name, and let the thing do its business. Once done, the boxes become characters.

One problem I have found with the file is that from line 27098, all the characters are the same, which means that the file may be corrupted somehow.

They are all &#39669; (U+9AF5), from that line through line 28315. All the non-CJK text appears fine and is all in columns.
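A quick way to spot this kind of corruption is to look for long runs of the same value in the first column. The sketch below is only illustrative; the function name, the threshold and the sample data are made up.

```python
def find_repeated_runs(values, min_length=3):
    """Return (first_line, last_line, value) for runs of identical
    consecutive values; line numbers are 1-based."""
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the list or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_length:
                runs.append((start + 1, i, values[start]))
            start = i
    return runs


# Stand-in for the first column of the data file.
first_column = ["\u4e00", "\u4e01", "\u9af5", "\u9af5", "\u9af5", "\u9af5"]
for first, last, value in find_repeated_runs(first_column):
    print(f"lines {first}-{last}: U+{ord(value):04X} repeated")
```

Since each row should hold a different character, any run reported here points at a damaged stretch of the file.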

Does this help?

Cheers,

Dyl.

Dylan Sung

Re: Encoding the Database

Postby Dylan Sung » Sun Oct 19, 2003 8:08 am

BTW, the SimSun fonts include Extension A of Unicode 3.0, so you can display the first 6000 characters or so, which were originally the rectangular boxes. Had you not selected the SimSun font, you could have scrolled down to find one of the first readable characters, namely the character yi/yat/yit/ichi/itsu/hitotsu/il/one. That's what gave the game away for me.

Cheers,
Dyl.

Thomas Chin

Re: Encoding the Database

Postby Thomas Chin » Mon Oct 20, 2003 9:28 pm

Hi,

The description that Dylan gave for opening the file in Open Office is exactly how I tested it. I have no additional directions, except that I tested it in OO v1.1 (this is probably also the version you used).

The font I used is a proprietary font based on Song with ExtA and ExtB. But even fonts such as MingLiU or Arial Unicode MS should enable you to display the largest part of the file (except ExtA).

Dylan's report of the bug in the last part of the file is also correct; it was caused by a scripting error. I corrected it in a newer version:

http://www.chineselanguage.org/CCDICT/S ... 4.0.tar.gz

or the spreadsheet

http://www.chineselanguage.org/CCDICT/S ... -4.4.0.sxc

Sorry for the inconvenience.

Regards

Benjamin Barrett

Re: Encoding the Database

Postby Benjamin Barrett » Wed Oct 22, 2003 3:36 am

Thanks, Dylan, for the additional help.

I toyed around with it for a couple of hours on Open Office, but to no avail.

The best I can get is that the first 6000-plus characters are question marks or blank, but the characters after that are there.

The thing I did during those couple of hours was to change the language default setting for when an application does not support Unicode. I changed it between Japanese (my default) and English, and each variety of Chinese. This made a difference in whether the first 6000-plus characters were blanks, boxes or question marks, but regardless of the font (Unicode, Sim Sun, etc.), those first characters just don't show.

I'm pretty frustrated at this point. I think what I'll do is send the file to a fellow graduate student who is pretty knowledgeable in encoding systems. He said he'd take a look at it for me.

Still, if anyone has any other ideas, I would like to hear them, to help in the future and for other people.

Best
Benjamin Barrett
Graduate Student
Department of Linguistics, University of Washington

Thomas Chin

Re: Encoding the Database

Postby Thomas Chin » Wed Oct 22, 2003 6:54 am

Hi Benjamin,

Don't worry about the first 6000 chars. They are from ExtA, and you probably will not need them. You really need an ExtA-supporting font to view them (most fonts do not support them).

In addition, there is not much pronunciation data in these records. My suggestion is to stick with the viewable part, which is probably the part you will actually need.
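To separate the ExtA records from the rest, the character's code point can be checked against the Unicode block ranges. The sketch below uses the block boundaries from the Unicode standard; the function names and the sample rows are only illustrative.

```python
# (name, first code point, last code point) of the relevant Unicode blocks
CJK_BLOCKS = [
    ("Extension A", 0x3400, 0x4DBF),
    ("Unified Ideographs", 0x4E00, 0x9FFF),
    ("Extension B", 0x20000, 0x2A6DF),
]


def cjk_block(char):
    """Return the name of the CJK block a character belongs to, or None."""
    code_point = ord(char)
    for name, first, last in CJK_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


def drop_extension_a(rows):
    """Keep only rows whose first column is not an Extension A character."""
    return [row for row in rows if cjk_block(row[0]) != "Extension A"]


# Made-up rows: (character, reading)
rows = [("\u3400", ""), ("\u4e00", "yi1")]
print(drop_extension_a(rows))
```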

Regards,

Dylan Sung

Re: Encoding the Database

Postby Dylan Sung » Wed Oct 22, 2003 8:49 am

Hi Benjamin,

I pretty much agree with Thomas: the Extension A characters are rarer, or just variants. Some of them are characters that were missing from the original Chinese GB and Big5 standards and are used to provide a direct conversion from traditional to simplified characters.

Unless you are working with ancient texts, you may not need those rare characters. For work on characters in current popular usage, those that appear in Unicode 2.1 (i.e. the 21000 or so characters you can see) are pretty much all you need.


Cheers,
Dyl.

