Encoding the Database

Comments, bug reports, discussions on CCDICT.

Post by Benjamin Barrett »

I am beginning a project to show the phonetic differences between Japanese, Korean and Mandarin, and found this dictionary download.

I would like to reformat the txt data into a table of some sort. Is there a simple way to convert the character codes into the actual characters, or is there a good reference site to learn how to do this?

I tried to look around, but I'm not exactly sure what sort of converter it is I'm looking for.

Any help would be appreciated.

Benjamin Barrett, Graduate Student
Department of Linguistics
University of Washington
students.washington.edu/bjb5
gogaku@ix.netcom.com

Re: Encoding the Database

Post by Dylan Sung »

Thomas Chin can tell you what the encoding is, but there are some useful code converters available for download.

One such comes with NJStar's Communicator available via

http://www.njstar.com

If you don't want to lose information when converting between encodings, it is better to work with one of the Unicode encodings. For example, Big5 to GB is lossy, since GB does not contain all the characters of Big5; equally, GB to Big5 is lossy, since there are characters in GB that cannot be found in Big5.
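The lossiness described here can be checked directly. The following is a minimal Python sketch, not something from the original posts: it assumes Python's built-in `big5` and `gb2312` codecs are reasonable stand-ins for the two encodings, and the helper name `can_encode` is made up for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Traditional U+5011 and its simplified counterpart U+4EEC.
for ch in ("\u5011", "\u4eec"):
    for enc in ("big5", "gb2312", "utf-8"):
        print(f"U+{ord(ch):04X} {enc}: {can_encode(ch, enc)}")
```

Each of the two characters fits only one of the legacy encodings, while UTF-8 accepts both, which is the reason for keeping the data in Unicode until the last step.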

See also the information in the unihan.txt file (although it may contain errors), which you can download from the Unicode site:

http://www.unicode.org

Cheers,
Dyl.

Re: Encoding the Database

Post by Forum admin »

Hi,

The character code field in the text file is the hexadecimal value of a Unicode codepoint, written as plain text. The remaining fields are in UTF-8 (you probably do not have to deal with this).

I do not know what the easiest way would be for you but here are some suggestions:

1. Enclose the field (e.g. 4E00) between '&#x' and ';', resulting in e.g. '&#x4E00;'. A web browser interprets this and displays the character (一 in this case); from there you can copy the data into, for example, a word processor.

2. Use a scripting language to convert the code to a 'real' number and then output it as either UTF-8 or UCS-2 (2-byte Unicode), or, after conversion, as Big5 or GB if you wish.

3. If all this is too complicated, contact me again. I might be able to provide the data in another format.
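Suggestions 1 and 2 can be sketched in a few lines of Python. This is only an illustration under the assumption that the code field is a bare hexadecimal string such as 4E00; the function names `char_from_hex` and `html_reference` are hypothetical.

```python
def char_from_hex(code: str) -> str:
    """Convert a hexadecimal codepoint string such as '4E00' to its character."""
    return chr(int(code, 16))


def html_reference(code: str) -> str:
    """Build the HTML numeric character reference described in suggestion 1."""
    return f"&#x{code};"


code = "4E00"
ch = char_from_hex(code)
print(html_reference(code))    # text for a browser to interpret
print(ch.encode("utf-8"))      # UTF-8 bytes
print(ch.encode("utf-16-be"))  # 2-byte big-endian form; matches UCS-2 for BMP characters
```

From the Python string, `ch.encode("big5")` or `ch.encode("gb2312")` would give the legacy encodings, raising `UnicodeEncodeError` for characters the target set lacks.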

Regards,

Thomas Chin

Re: Encoding the Database

Post by Benjamin Barrett »

Thanks for the help. I've managed to get the characters in the 7XXX range to show up, but not other ones. The code I'm using is:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
The quick brown fox &#x8D14;
</body>
</html>

With this, I get a character &#x8D14; that looks like three shellfish, one on top of two others, but they are each one stroke short.

Generally, though, I just get a dot. Any ideas on what I'm doing wrong?

Also, would it be possible to get the text dump with the Korean as well? I think the easiest format would be XXXX.Y [tab] [field type] [data] [tab], etc., with each character separated with a carriage return. Right now, the character number repeats for each type of data.
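The reshaping asked for here, from one line per field to one row per character, can be sketched in Python. The input layout below (code, field type and data separated by tabs) and the field names `mandarin` and `korean` are assumptions made for the example, not the actual layout of the CCDICT file.

```python
def pivot(lines):
    """Group 'code<TAB>field<TAB>value' lines into one dict per character code."""
    table = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        code, field, value = line.split("\t", 2)
        # A repeated field for the same code overwrites the earlier value.
        table.setdefault(code, {})[field] = value
    return table


def to_rows(table, fields):
    """Yield one tab-separated row per character, with one column per field."""
    for code, values in table.items():
        yield "\t".join([code] + [values.get(f, "") for f in fields])


# Hypothetical sample input; the real field names and layout may differ.
sample = [
    "4E00.0\tmandarin\tyi1",
    "4E00.0\tkorean\til",
    "4E8C.0\tmandarin\ter4",
]
for row in to_rows(pivot(sample), ["mandarin", "korean"]):
    print(row)
```

Missing fields become empty cells, so every row has the same number of columns and the result can be opened in a spreadsheet.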

Sorry if I'm asking too much, but this is an incredible project you have, and it would be a shame (and a waste of time) to re-do it.

Best regards
Benjamin Barrett, Graduate Student
Department of Linguistics
University of Washington
students.washington.edu/bjb5
gogaku@ix.netcom.com

Re: Encoding the Database

Post by Benjamin Barrett »

I wrote: "Also, would it be possible to get the text dump with the Korean as well? I think the easiest format would be XXXX.Y [tab] [field type] [data] [tab], etc., with each character separated with a carriage return. Right now, the character number repeats for each type of data."

I guess what I should have said is any format that is easy to put into a database, or preferably Excel.

Best

Re: Encoding the Database

Post by Forum admin »

Sorry for my late reply. I was away for a meeting for a couple of days. I'll prepare a data file ASAP.

Regards,

Thomas

Re: Encoding the Database

Post by Thomas Chin »

Hi again,

I prepared a tab-separated data file with one character entry per row. The first column, which holds the character, is UTF-8 encoded.

You should be able to import the file into MS Excel (check whether Excel can convert UTF-8 to its native Unicode; I know MS Access can). You might need to set a Chinese font to view the characters.

I have tested the import of the file in the spreadsheet of the Open Office suite myself and it converts without problems.
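As a check that does not depend on any spreadsheet, a tab-separated UTF-8 file can also be read with a short script. The sketch below uses Python's standard `csv` module; the tiny file it writes is a stand-in for the real data, and `read_rows` is a hypothetical helper name.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Read a UTF-8, tab-separated file into a list of rows (lists of strings)."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quotation marks in the data as ordinary characters.
        return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


# Demo with a two-line stand-in file rather than the real download.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8", newline="") as handle:
        handle.write("\u4e00\tyi1\n\u4e8c\ter4\n")
    rows = read_rows(sample_path)

print(len(rows), [f"U+{ord(row[0]):04X}" for row in rows])
```

If the first column decodes to the expected codepoints here, the file itself is sound and any garbling lies in how the viewing program interprets or renders it.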

http://www.chineselanguage.org/CCDICT/S ... 3.0.tar.gz

Good luck,

Thomas Chin

Re: Encoding the Database

Post by Benjamin Barrett »

Thanks, Thomas.

I've been trying different things to get the characters to display. I can't get the file to work on my English or Japanese computer; both of them are Windows XP with Chinese and Unicode fonts installed.

I'll try a few more things...

Best

Re: Encoding the Database

Post by Benjamin Barrett »

Thomas,

No luck. I even downloaded Open Office, but all I get is garbled characters for the Chinese glyphs whether I use Notepad, Memo Pad, Excel, Word, or Open Office.
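One common cause of garbling like this is a program reading UTF-8 bytes as if they were in the system's legacy code page. The Python sketch below shows the effect under the assumption of code page 1252, the usual default on English Windows; a Japanese system would produce different garbage, and this may not be the cause in this particular case. The helper `add_bom` is hypothetical.

```python
import codecs

utf8_bytes = "\u4e00".encode("utf-8")   # one CJK character as stored in a UTF-8 file
misread = utf8_bytes.decode("cp1252")   # the same bytes read as Western European text
correct = utf8_bytes.decode("utf-8")    # the same bytes read as UTF-8

print(len(misread), "characters when misread;", len(correct), "when decoded as UTF-8")


def add_bom(data: bytes) -> bytes:
    """Prefix UTF-8 data with a byte order mark unless one is already present."""
    return data if data.startswith(codecs.BOM_UTF8) else codecs.BOM_UTF8 + data
```

Many Windows programs treat a leading byte order mark as a hint that a file is UTF-8, so saving a copy with `add_bom` applied is one thing to try when no encoding option is offered at import time.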

I have two computers; one is English XP and the other is Japanese XP. On both, I've enabled Chinese, so I can't see what else could be the problem...

Can you describe the process you used to import/open the file in the spreadsheet application? Also, what font are you using?

TIA
Benjamin Barrett

Re: Encoding the Database

Post by Dylan Sung »

Hi Benjamin, Thomas,

I decided to download the .tar.gz file that Thomas has so kindly put online, and after unpacking it I got a .txt file. I have Open Office 1.0, so I opened it as though it were a database in text form. To do this, go to:

File > Open > Files of type

Under "Files of type", scroll down to "Text CSV", which is in the spreadsheet section of the type list. Once done, select the text file name

ccdict-4.3.0.txt

Click on the Open button, then set the encoding to UTF-8.

Once that's done, the file opens. You will see a lot of upright rectangular boxes in the far left column. In order for these to display, you need to change the font setting. Press "Control A" to select all the text, then change the font in the font box in the top left of the screen, just below the filename.

If you have XP, and have Chinese installed, you should have a font file named SimSun or SimSun-18030. Select this font name, and let the thing do its business. Once done, the boxes become characters.

One problem I have found with the file is that from line 27098, all the characters are the same, which means that the file may be corrupted somehow.

They are all &#39669;, from that line through line 28315. All the non-CJK text appears fine and is all in columns.
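A repeated stretch like this can be located automatically rather than by scrolling. The following Python sketch scans a column of values for runs of identical consecutive entries; the sample list is stand-in data, and `find_runs` with its `min_length` parameter is a hypothetical helper. The decimal reference 39669 corresponds to U+9AF5, which is the value used in the sample.

```python
from itertools import groupby


def find_runs(values, min_length=3):
    """Return (start_line, end_line, value) for runs of identical consecutive values.

    Line numbers are 1-based; only runs of at least min_length are reported.
    """
    runs = []
    line = 1
    for value, group in groupby(values):
        count = sum(1 for _ in group)
        if count >= min_length:
            runs.append((line, line + count - 1, value))
        line += count
    return runs


# Stand-in data: the first column of a file in which one character repeats.
first_column = ["\u4e00", "\u4e8c", "\u9af5", "\u9af5", "\u9af5", "\u9af5"]
for start, end, value in find_runs(first_column):
    print(f"lines {start}-{end}: U+{ord(value):04X}")
```

Run against the first column of the real file, a report covering lines 27098 to 28315 would confirm the observation above.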

Does this help?

Cheers,

Dyl.