Converting ISO-2022-JP incorrect #191

Closed
andreas-globi opened this issue Mar 20, 2018 · 8 comments
@andreas-globi

I'm not getting the correct text results for an email.

The code is pretty straightforward:

$Parser = new PhpMimeMailParser\Parser();
$Parser->setText ( $rawEmail );
$text = $Parser->getMessageBody( 'text' );

$rawEmail contains (edited for privacy):

Date: Mon, 19 Mar 2018 11:44:17 +0900
From: =?ISO-2022-JP?B?GyRCOzAldj8sPjs5MBsoQi8bJEIzdDwwMnE8UkpMQmc2PTs6GyhC?= <redacted@something.jp>
To: redacted@something.jp
Subject: =?ISO-2022-JP?B?GyRCMkhEQkk9ISFFOklVGyhC?=
X-Mailer: Sylpheed 3.4.2 (GTK+ 2.10.14; i686-pc-mingw32)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="Multipart=_Mon__19_Mar_2018_11_44_17_+0900_U0H1YwIu=LlJm2HL"

This is a multi-part message in MIME format.
--Multipart=_Mon__19_Mar_2018_11_44_17_+0900_U0H1YwIu=LlJm2HL
Content-Type: text/plain; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit


�$B%K%3%i%9MM�(B

�$BK\F|$O$*K;$7$$Cf$4MhE9$rD:$-$^$7$F�(B
�$B$"$j$,$H$&$4$6$$$^$7$?!#�(B
�$B2HDBI=$rE:IU$7$F$*$j$^$9$N$G$43NG'$r$*4j$$CW$7$^$9!#�(B

�$B$^$?!"4IM}7@Ls$K:]$7$F2<5-$N>pJs<}=8$r$*4j$$CW$7$^$9!#�(B
�$B-!7zC[3NG':Q=q%3%T!<�(B
�$B-"#J#B#RMM!!Aw6b@h8}:B�(B
�$B-#%m%8%c!<%9MM!!%U%k%M!<%`!&O"Mm@h�(B
�$B-$1|MM!!%U%k%M!<%`!&O"Mm@h�(B
�$B-%E:IU2HDBI=$NJg=8>r7o$G$h$$$+H]$+�(B
�$B-&=i4|HqMQ$dI_6b#0!&Ni6b#1$G$h$$$N$+H]$+�(B
�$B-'%Z%C%H$K$D$$$F$O$I$&$7$^$9$+�(B
�$B-(J*7o%Q!<%9;qNA0l<0�(B
�$B-)D99,7z@_C4Ev<TMM;aL>�(B

�$B0J>e!"$42sEzD:$1$k$H9,$$$G$9!#�(B
�$B4IM}NA$O:#2sFCJL$K#3!s$GBg>fIW$G$9!#�(B
�$B$h$m$7$/$*4j$$?=$7>e$2$^$9!#�(B

$Parser->getMessageBody( 'text' ) returns:

ニコラス様

本日はお忙しい中ご来店を頂きまして
ありがとうございました。
家賃表を添付しておりますのでご確認をお願い致します。

また、管理契約に際して下記の情報収集をお願い致します。
〃築確認済書コピー
■複贈厖諭〜金先口座
ロジャース様 フルネーム・連絡先
け様 フルネーム・連絡先
ヅ塞娉板舵修諒臀絃魴錣任茲い否か
初期費用や敷金0・礼金1でよいのか否か
Д撻奪箸砲弔い討呂匹Δ靴泙垢
物件パース資料一式
長幸建設担当者様氏名

以上、ご回答頂けると幸いです。
管理料は今回特別に3%で大丈夫です。
よろしくお願い申し上げます。

But what I expect (and see in Thunderbird) is:

ニコラス様

本日はお忙しい中ご来店を頂きまして
ありがとうございました。
家賃表を添付しておりますのでご確認をお願い致します。

また、管理契約に際して下記の情報収集をお願い致します。
①建築確認済書コピー
②JBR様 送金先口座
③ロジャース様 フルネーム・連絡先
④奥様 フルネーム・連絡先
⑤添付家賃表の募集条件でよいか否か
⑥初期費用や敷金0・礼金1でよいのか否か
⑦ペットについてはどうしますか
⑧物件パース資料一式
⑨長幸建設担当者様氏名

以上、ご回答頂けると幸いです。
管理料は今回特別に3%で大丈夫です。
よろしくお願い申し上げます。

The two outputs differ on the numbered lines: the circled numbers ①②③④⑤⑥⑦⑧⑨ are missing and some of the text after them is garbled.

Is it a bug? How can I fix this?

@eXorus
Member

eXorus commented Mar 21, 2018

Hello,

Thanks for your report. I think it's a bug, but I don't know how to resolve it yet.

It seems that the library is not able to decode the circled numbers ①②③④⑤⑥⑦⑧⑨.
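
A small sketch of why those characters stand out in the raw body, offered as an illustration rather than as what the library does. Between `ESC $ B` and `ESC ( B`, each character is two 7-bit bytes, and subtracting 0x20 from each byte gives its JIS X 0208 row and cell. The helper name `jisKuten` is made up for this example. The garbled lines all start with a `-` byte, which lands in row 13. As far as I know, that row is unassigned in JIS X 0208 itself and is only filled by vendor extensions (the NEC special characters, which include the circled digits).

```php
<?php
// Return the JIS X 0208 row and cell for a two-byte code in the 7-bit form
// used between ESC $ B and ESC ( B. Hypothetical helper, for illustration only.
function jisKuten(string $pair): array
{
    if (strlen($pair) !== 2) {
        throw new InvalidArgumentException('expected exactly two bytes');
    }
    // Each byte is offset by 0x20 from its 1-based row/cell number.
    return [ord($pair[0]) - 0x20, ord($pair[1]) - 0x20];
}

// First byte pair of the first garbled line ("-!7zC[...").
[$row, $cell] = jisKuten('-!');
printf("-! => row %d, cell %d\n", $row, $cell);   // row 13, cell 1

// First byte pair of a line that decoded fine ("%K%3%i%9MM", ニコラス様).
[$row, $cell] = jisKuten('%K');
printf("%%K => row %d, cell %d\n", $row, $cell);  // row 5, cell 43
```

Row 5 is the katakana row, which every ISO-2022-JP decoder knows; row 13 is where decoders disagree.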

@eXorus
Member

eXorus commented Mar 21, 2018

I just saw this comment

If you want to convert japanese to ISO-2022-JP it is highly recommended to use ISO-2022-JP-MS as the target encoding instead. This includes the extended character set and avoids ? in the text. For example the often used "1 in a circle" ① will be correctly converted then.
https://secure.php.net/manual/fr/function.mb-convert-encoding.php#99571
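
A minimal sketch of what that comment suggests, applied in the decoding direction. It assumes the mbstring extension is loaded. The sample is the first three characters of the first garbled line from the report, with the escape bytes written out as `"\x1b"` because the raw ESC byte is not printable. The variable names are only for illustration.

```php
<?php
// ESC $ B switches to the two-byte JIS set, ESC ( B switches back to ASCII.
// Single quotes keep PHP from treating "$B" as a variable.
$esc = "\x1b";
$sample = $esc . '$B-!7zC[' . $esc . '(B';

if (function_exists('mb_convert_encoding')) {
    // Decode with the extended charset; per the report, these bytes are "①建築".
    $extended = mb_convert_encoding($sample, 'UTF-8', 'ISO-2022-JP-MS');
    echo "ISO-2022-JP-MS: ", $extended, "\n";

    // For comparison only: what plain ISO-2022-JP yields here may differ
    // between PHP versions, so no expected output is given.
    $plain = mb_convert_encoding($sample, 'UTF-8', 'ISO-2022-JP');
    echo "ISO-2022-JP:    ", $plain, "\n";
} else {
    echo "mbstring is not available\n";
}
```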

@eXorus
Member

eXorus commented Mar 21, 2018

I found a way to do it. In Charset.php, I replaced

return iconv($this->getCharsetAlias($charset), 'UTF-8//TRANSLIT//IGNORE', $encodedString);

with

return mb_convert_encoding($encodedString, 'UTF-8', 'ISO-2022-JP-MS');

and the result is:

ニコラス様

本日はお忙しい中ご来店を頂きまして
ありがとうございました。
家賃表を添付しておりますのでご確認をお願い致します。

また、管理契約に際して下記の情報収集をお願い致します。
①建築確認済書コピー
②JBR様 送金先口座
③ロジャース様 フルネーム・連絡先
④奥様 フルネーム・連絡先
⑤添付家賃表の募集条件でよいか否か
⑥初期費用や敷金0・礼金1でよいのか否か
⑦ペットについてはどうしますか
⑧物件パース資料一式
⑨長幸建設担当者様氏名

以上、ご回答頂けると幸いです。
管理料は今回特別に3%で大丈夫です。

This issue is related to PR #137. I need to finish that PR before I can fix this issue.

@andreas-globi
Author

PR #137 has been open for over a year.

Is there any quick fix I can apply to the current codebase to make it work without breaking all other charsets?

@eXorus
Member

eXorus commented Mar 21, 2018

In src/Charset.php you can replace:

    public function decodeCharset($encodedString, $charset)
    {
        if (strtolower($charset) == 'utf-8' || strtolower($charset) == 'us-ascii') {
            return $encodedString;
        } else {
            return iconv($this->getCharsetAlias($charset), 'UTF-8//TRANSLIT//IGNORE', $encodedString);
        }
    }

with the following (this requires mb_convert_encoding, which is provided by the mbstring extension, in your PHP build):

    public function decodeCharset($encodedString, $charset)
    {
        if (strtolower($charset) == 'utf-8' || strtolower($charset) == 'us-ascii') {
            return $encodedString;
        } elseif (strtolower($charset) == 'iso-2022-jp' ) {
            return mb_convert_encoding($encodedString, 'UTF-8', 'ISO-2022-JP-MS');
        } else {
            return iconv($this->getCharsetAlias($charset), 'UTF-8//TRANSLIT//IGNORE', $encodedString);
        }
    }
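
If mbstring might be missing on some installs, the same idea can be guarded. This is only a sketch under that assumption, not the library's code: `decodeCharsetSketch` is a hypothetical standalone function, and it skips the `getCharsetAlias()` lookup that the real method performs.

```php
<?php
// Hypothetical standalone variant of decodeCharset(); not the library's code.
function decodeCharsetSketch(string $encodedString, string $charset): string
{
    $charset = strtolower($charset);

    // Already UTF-8 compatible, nothing to convert.
    if ($charset === 'utf-8' || $charset === 'us-ascii') {
        return $encodedString;
    }

    // Use mbstring for ISO-2022-JP when it is available, so that the
    // extended characters (such as the circled digits) are decoded.
    if ($charset === 'iso-2022-jp' && function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($encodedString, 'UTF-8', 'ISO-2022-JP-MS');
    }

    // Otherwise fall back to iconv, as before. iconv() returns false on
    // failure; in that case the input is returned unchanged.
    $converted = iconv($charset, 'UTF-8//TRANSLIT//IGNORE', $encodedString);
    return $converted === false ? $encodedString : $converted;
}

echo decodeCharsetSketch('plain text', 'US-ASCII'), "\n"; // plain text
```

Without mbstring this falls back to the old iconv behaviour for ISO-2022-JP, so the circled numbers would still be lost there, but nothing else changes.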

@eXorus eXorus added the bug label Mar 21, 2018
@andreas-globi
Author

Thanks for that.

Any reason why you're using mb_convert_encoding for this charset only and not for all?

@eXorus
Member

eXorus commented Mar 22, 2018

I'm using mb_convert_encoding because iconv doesn't work with the ISO-2022-JP-MS charset.

I think mb_convert_encoding is better than iconv, but it is not compiled into PHP by default, so for the time being we keep iconv for the other charsets.
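
One way to find out at runtime whether mbstring can handle a given charset is to look it up in `mb_list_encodings()`. A rough sketch follows; the helper name `mbstringSupports` is invented for this example.

```php
<?php
// Report whether mbstring is loaded and lists the given encoding.
// Hypothetical helper, for illustration only.
function mbstringSupports(string $encoding): bool
{
    if (!extension_loaded('mbstring')) {
        return false;
    }
    // Compare case-insensitively against the names mbstring reports.
    foreach (mb_list_encodings() as $name) {
        if (strcasecmp($name, $encoding) === 0) {
            return true;
        }
    }
    return false;
}

var_dump(mbstringSupports('ISO-2022-JP-MS'));   // depends on the PHP build
var_dump(mbstringSupports('no-such-charset'));  // bool(false)
```

A check like this could decide per charset between mb_convert_encoding and iconv instead of hard-coding a single case.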

@eXorus
Member

eXorus commented Jul 23, 2018

Fixed in release 3.0.0.

@eXorus eXorus closed this as completed Jul 23, 2018