Converting ISO-2022-JP incorrect #191

andreas-globi · 2018-03-20T23:11:33Z

I'm not getting the correct text results for an email.

The code is pretty straightforward:

$Parser = new PhpMimeMailParser\Parser();
$Parser->setText ( $rawEmail );
$text = $Parser->getMessageBody( 'text' );

$rawEmail contains (edited for privacy):

Date: Mon, 19 Mar 2018 11:44:17 +0900
From: =?ISO-2022-JP?B?GyRCOzAldj8sPjs5MBsoQi8bJEIzdDwwMnE8UkpMQmc2PTs6GyhC?= <redacted@something.jp>
To: redacted@something.jp
Subject: =?ISO-2022-JP?B?GyRCMkhEQkk9ISFFOklVGyhC?=
X-Mailer: Sylpheed 3.4.2 (GTK+ 2.10.14; i686-pc-mingw32)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="Multipart=_Mon__19_Mar_2018_11_44_17_+0900_U0H1YwIu=LlJm2HL"

This is a multi-part message in MIME format.
--Multipart=_Mon__19_Mar_2018_11_44_17_+0900_U0H1YwIu=LlJm2HL
Content-Type: text/plain; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit


�$B%K%3%i%9MM�(B

�$BK\F|$O$*K;$7$$Cf$4MhE9$rD:$-$^$7$F�(B
�$B$"$j$,$H$&$4$6$$$^$7$?!#�(B
�$B2HDBI=$rE:IU$7$F$*$j$^$9$N$G$43NG'$r$*4j$$CW$7$^$9!#�(B

�$B$^$?!"4IM}7@Ls$K:]$7$F2<5-$N>pJs<}=8$r$*4j$$CW$7$^$9!#�(B
�$B-!7zC[3NG':Q=q%3%T!<�(B
�$B-"#J#B#RMM!!Aw6b@h8}:B�(B
�$B-#%m%8%c!<%9MM!!%U%k%M!<%`!&O"Mm@h�(B
�$B-$1|MM!!%U%k%M!<%`!&O"Mm@h�(B
�$B-%E:IU2HDBI=$NJg=8>r7o$G$h$$$+H]$+�(B
�$B-&=i4|HqMQ$dI_6b#0!&Ni6b#1$G$h$$$N$+H]$+�(B
�$B-'%Z%C%H$K$D$$$F$O$I$&$7$^$9$+�(B
�$B-(J*7o%Q!<%9;qNA0l<0�(B
�$B-)D99,7z@_C4Ev<TMM;aL>�(B

�$B0J>e!"$42sEzD:$1$k$H9,$$$G$9!#�(B
�$B4IM}NA$O:#2sFCJL$K#3!s$GBg>fIW$G$9!#�(B
�$B$h$m$7$/$*4j$$?=$7>e$2$^$9!#�(B

$Parser->getMessageBody( 'text' ) returns:

ニコラス様

本日はお忙しい中ご来店を頂きまして
ありがとうございました。
家賃表を添付しておりますのでご確認をお願い致します。

また、管理契約に際して下記の情報収集をお願い致します。
〃築確認済書コピー
■複贈厖諭〜金先口座
ロジャース様　フルネーム・連絡先
け様　フルネーム・連絡先
ヅ塞娉板舵修諒臀絃魴錣任茲い否か
初期費用や敷金０・礼金１でよいのか否か
Д撻奪箸砲弔い討呂匹Δ靴泙垢
物件パース資料一式
長幸建設担当者様氏名

以上、ご回答頂けると幸いです。
管理料は今回特別に３％で大丈夫です。
よろしくお願い申し上げます。

But what I expect (and see in Thunderbird) is:

ニコラス様

本日はお忙しい中ご来店を頂きまして
ありがとうございました。
家賃表を添付しておりますのでご確認をお願い致します。

また、管理契約に際して下記の情報収集をお願い致します。
①建築確認済書コピー
②ＪＢＲ様　送金先口座
③ロジャース様　フルネーム・連絡先
④奥様　フルネーム・連絡先
⑤添付家賃表の募集条件でよいか否か
⑥初期費用や敷金０・礼金１でよいのか否か
⑦ペットについてはどうしますか
⑧物件パース資料一式
⑨長幸建設担当者様氏名

以上、ご回答頂けると幸いです。
管理料は今回特別に３％で大丈夫です。
よろしくお願い申し上げます。

This is obviously very different.

Is it a bug? How can I fix this?


eXorus · 2018-03-21T12:21:16Z

Hello,

Thanks for your report. I think it's a bug, but I don't know how to resolve it.

It seems that the library is not able to decode the circled numbers ①②③④⑤⑥⑦⑧⑨.

eXorus · 2018-03-21T12:36:04Z

I just saw this comment in the PHP manual:

If you want to convert japanese to ISO-2022-JP it is highly recommended to use ISO-2022-JP-MS as the target encoding instead. This includes the extended character set and avoids ? in the text. For example the often used "1 in a circle" ① will be correctly converted then.
https://secure.php.net/manual/fr/function.mb-convert-encoding.php#99571
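
To illustrate the point of the quoted comment, here is a minimal sketch, assuming the mbstring extension is loaded. The helper name decodeJisMs is made up for this example; the two byte sequences are taken from the message body in the report above (ESC $ B switches to the two-byte Japanese set, ESC ( B switches back to ASCII).

```php
<?php
// Hypothetical helper for illustration only; it is not part of the library.
// Decodes an ISO-2022-JP byte string using mbstring's ISO-2022-JP-MS table,
// which also covers vendor extensions such as the circled digits.
function decodeJisMs(string $raw): string
{
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-2022-JP-MS');
}

// Single quotes keep PHP from treating "$B" as a variable.
$esc = "\x1b";

// "-!" (0x2D 0x21) sits in row 13, which plain JIS X 0208 leaves unassigned.
$circledOne = $esc . '$B-!' . $esc . '(B';

// Ordinary JIS X 0208 text from the first line of the mail body.
$greeting = $esc . '$B%K%3%i%9MM' . $esc . '(B';

echo decodeJisMs($circledOne), PHP_EOL; // ①
echo decodeJisMs($greeting), PHP_EOL;   // ニコラス様
```

The regular text decodes the same way with either charset name; only the row 13 characters depend on the extended table.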

eXorus · 2018-03-21T12:43:25Z

I found a way to do it. In Charset.php, I replaced

return iconv($this->getCharsetAlias($charset), 'UTF-8//TRANSLIT//IGNORE', $encodedString);

with

return mb_convert_encoding($encodedString, 'UTF-8', 'ISO-2022-JP-MS');

and the result is:

ニコラス様

本日はお忙しい中ご来店を頂きまして
ありがとうございました。
家賃表を添付しておりますのでご確認をお願い致します。

また、管理契約に際して下記の情報収集をお願い致します。
①建築確認済書コピー
②ＪＢＲ様　送金先口座
③ロジャース様　フルネーム・連絡先
④奥様　フルネーム・連絡先
⑤添付家賃表の募集条件でよいか否か
⑥初期費用や敷金０・礼金１でよいのか否か
⑦ペットについてはどうしますか
⑧物件パース資料一式
⑨長幸建設担当者様氏名

以上、ご回答頂けると幸いです。
管理料は今回特別に３％で大丈夫です。

This issue is related to PR #137. I need to finish that PR before I can fix this issue.

andreas-globi · 2018-03-21T14:43:53Z

PR #137 has been open for over a year.

Is there any quick fix I can apply to the current code-base to make it work without breaking all other charsets?

eXorus · 2018-03-21T17:00:41Z

In src/Charset.php you can replace:

    public function decodeCharset($encodedString, $charset)
    {
        if (strtolower($charset) == 'utf-8' || strtolower($charset) == 'us-ascii') {
            return $encodedString;
        } else {
            return iconv($this->getCharsetAlias($charset), 'UTF-8//TRANSLIT//IGNORE', $encodedString);
        }
    }

with the following (this requires the mbstring extension, which provides mb_convert_encoding, in your PHP build):

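    public function decodeCharset($encodedString, $charset)
    {
        if (strtolower($charset) == 'utf-8' || strtolower($charset) == 'us-ascii') {
            // Already UTF-8 compatible, nothing to convert.
            return $encodedString;
        } elseif (strtolower($charset) == 'iso-2022-jp' ) {
            // Decode with the extended ISO-2022-JP-MS table so that characters
            // such as ① are converted correctly. Requires mbstring.
            return mb_convert_encoding($encodedString, 'UTF-8', 'ISO-2022-JP-MS');
        } else {
            // All other charsets keep going through iconv.
            return iconv($this->getCharsetAlias($charset), 'UTF-8//TRANSLIT//IGNORE', $encodedString);
        }
    }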

andreas-globi · 2018-03-21T19:25:44Z

Thanks for that.

Any reason why you're using mb_convert_encoding for this charset only and not for all?

eXorus · 2018-03-22T08:16:03Z

I'm using mb_convert_encoding because iconv doesn't work with charset ISO-2022-JP-MS.

I think mb_convert_encoding is better than iconv, but mb_convert_encoding is not compiled into PHP by default, so for the time being we keep iconv.
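
Since mbstring may be missing, one possible way to apply the quick fix without breaking installs that lack it is to guard the call. This is only a sketch under stated assumptions: decodeToUtf8 is a hypothetical standalone function, not the library's decodeCharset, it skips the library's getCharsetAlias lookup, and returning the input unchanged when iconv fails is a choice made for this example.

```php
<?php
// Hypothetical standalone helper; not the library's actual method.
function decodeToUtf8(string $encoded, string $charset): string
{
    $charset = strtolower($charset);

    // Nothing to do for charsets that are already UTF-8 compatible.
    if ($charset === 'utf-8' || $charset === 'us-ascii') {
        return $encoded;
    }

    // Use the extended Japanese table only when mbstring is available.
    if ($charset === 'iso-2022-jp' && function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($encoded, 'UTF-8', 'ISO-2022-JP-MS');
    }

    // Otherwise fall back to iconv, as the library does today.
    $converted = iconv($charset, 'UTF-8//TRANSLIT//IGNORE', $encoded);

    // Assumption for this sketch: keep the raw input if iconv fails.
    return $converted === false ? $encoded : $converted;
}

echo decodeToUtf8('hello', 'US-ASCII'), PHP_EOL;
echo decodeToUtf8("\xe9", 'ISO-8859-1'), PHP_EOL;
```

With this shape, ISO-2022-JP mail still goes through iconv on systems without mbstring, so the circled digits remain wrong there, but every other charset behaves as before.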

eXorus · 2018-07-23T12:02:46Z

Fixed in release 3.0.0.

eXorus added the bug label Mar 21, 2018

eXorus closed this as completed Jul 23, 2018
