Do You PHP はてブロ

Moved from "Do You PHP はてな" to Hatena Blog

Full-text search of UTF-8 HTML with Namazu

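Work notes. The packaged nkf (1.92) is replaced with a build of nkf205 from the nkf_utf8 distribution, Namazu is pointed at the new binary through mknmzrc, and the index is then built with mknmz under an EUC-JP locale.

Reference: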

$ su -
# rpm -qa | grep nkf
nkf-1.92-6
# wget http://www01.tcp-ip.or.jp/~furukawa/nkf_utf8/nkf205.tar.gz
# tar zxf nkf205.tar.gz -C /usr/local/src/
# cd /usr/local/src/nkf205/
# make
# make test
perl test.pl

Basic Conversion test

JIS to JIS ... Ok
JIS to SJIS... Ok
JIS to EUC ... Ok
JIS to UTF8... Ok
JIS to U16L... Ok
JIS to U16B... Ok
SJIS to JIS ... Ok
SJIS to SJIS... Ok
SJIS to EUC ... Ok
SJIS to UTF8... Ok
SJIS to U16L... Ok
SJIS to U16B... Ok
EUC to JIS ... Ok
EUC to SJIS... Ok
EUC to EUC ... Ok
EUC to UTF8... Ok
EUC to U16L... Ok
EUC to U16B... Ok
UTF8 to JIS ... Ok
UTF8 to SJIS... Ok
UTF8 to EUC ... Ok
UTF8 to UTF8N.. Ok
UTF8 to UTF8... Ok
UTF8 to UTF8N.. Ok
UTF8 to U16L... Ok
UTF8 to U16L0.. Ok
UTF8 to U16B... Ok
UTF8 to U16B0.. Ok
JIS to JIS ... Ok
JIS to SJIS... Ok
JIS to EUC ... Ok
JIS to UTF8... Ok
SJIS to JIS ... Ok
SJIS to SJIS... Ok
SJIS to EUC ... Ok
SJIS to UTF8... Ok
EUC to JIS ... Ok
EUC to SJIS... Ok
EUC to EUC ... Ok
EUC to UTF8... Ok
UTF8 to JIS ... Ok
UTF8 to SJIS... Ok
UTF8 to EUC ... Ok
UTF8 to UTF8... Ok
Ambiguous Case. Ok
SJIS Input assumption Ok
Broken JIS Ok
Broken JIS is safe on Normal JIS? Ok
test_data/cp932 Ok
test_data/cp932inv Ok
test_data/no-cp932inv Ok


UCS Mapping Test
Shift_JIS to UTF-16
Normal UCS Mapping : Ok
Microsoft UCS Mapping : Ok


X0201 test

X0201 conversion: SJIS Ok
X0201 conversion: JIS Ok
X0201 conversion:SI/SO Ok
X0201 conversion: EUC Ok
X0201 conversion: UTF8 Ok
X0201 output: SJIS Ok
X0201 output: JIS Ok
X0201 output: EUC Ok
X0201 output: UTF8 Ok

MIME test

MIME decode (strict) Ok
MIME decode (nonstrict)Ok
MIME decode (unbuf) Ok
MIME decode (base64) Ok
MIME ISO-8859-1 (Q) Ok

Bug Fixes

test_data/cr Ok
test_data/fixed-qencode Ok
test_data/long-fold-1 Ok
test_data/long-fold Ok
test_data/mime_out Ok
test_data/mime_out2 Ok
test_data/multi-line Ok
test_data/nkf-19-bug-1 Ok
test_data/nkf-19-bug-2 Ok
test_data/nkf-19-bug-3 Ok
test_data/non-strict-mime Ok
test_data/q-encode-softrap Ok
test_data/rot13 Ok
test_data/slash Ok
test_data/z1space-0 Ok
test_data/z1space-1 Ok
test_data/z1space-2 Ok
# mv /usr/bin/nkf /usr/bin/nkf.org
# cp -p ./nkf /usr/bin/
# vi /etc/namazu/mknmzrc
# grep NKF /etc/namazu/mknmzrc
# $NKF = "module_nkf";
$NKF = "/usr/bin/nkf";
# exit
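
The mknmzrc change above was made by hand in vi. As a rough sketch, the same edit could be scripted as follows. It assumes the file still contains the commented-out `# $NKF = "module_nkf";` sample line seen in the grep output. `set_nkf_path` is a made-up helper name, not part of Namazu.

```shell
#!/bin/sh
# set_nkf_path FILE NKF_PATH
# Hypothetical helper (not part of Namazu): adds an active $NKF line to FILE
# right after the commented-out sample line "# $NKF = ...".
# If FILE has no such sample line, it is left unchanged.
set_nkf_path() {
    rc=$1
    nkf_path=$2

    # Leave the file alone if an uncommented $NKF assignment already exists.
    if grep -q '^[[:space:]]*\$NKF[[:space:]]*=' "$rc"; then
        return 0
    fi

    # Print every line; after the first commented-out sample, add the new one.
    awk -v p="$nkf_path" '
        { print }
        !done && /^#[ \t]*\$NKF[ \t]*=/ {
            printf "$NKF = \"%s\";\n", p
            done = 1
        }
    ' "$rc" > "$rc.tmp" || return 1

    # Overwrite in place so the owner and mode of FILE are kept.
    cat "$rc.tmp" > "$rc" && rm -f "$rc.tmp"
}
```

Run as root, something like `set_nkf_path /etc/namazu/mknmzrc /usr/bin/nkf` should reproduce the two lines shown by grep. Trying it on a copy of the file first is probably wise.
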
$ echo $LANG

$ export LANG=ja_JP.eucJP
$ mknmz -O /path/to/index /path/to/utf-8_html/
$
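
If the index comes out garbled, one thing to check is whether the nkf binary that Namazu calls really converts UTF-8. The sketch below feeds the UTF-8 bytes for "あ" through `nkf -W -e` (UTF-8 input, EUC-JP output) and compares the result with the expected EUC-JP bytes. `nkf_supports_utf8` is a hypothetical helper name, and this is only a smoke test, not a replacement for `make test`.

```shell
#!/bin/sh
# nkf_supports_utf8 [NKF_BIN]
# Hypothetical helper: returns 0 if NKF_BIN (default: nkf found on PATH)
# converts UTF-8 input to the expected EUC-JP bytes.
nkf_supports_utf8() {
    nkf_bin=${1:-nkf}

    # "あ" is E3 81 82 in UTF-8 and A4 A2 in EUC-JP.
    got=$(printf '\343\201\202' \
        | "$nkf_bin" -W -e 2>/dev/null \
        | od -An -tx1 \
        | tr -d ' \n')

    [ "$got" = "a4a2" ]
}
```

For example, `nkf_supports_utf8 /usr/bin/nkf && echo ok` should print `ok` for the newly installed binary. The old nkf 1.92, saved above as /usr/bin/nkf.org, would presumably fail the same check.
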