Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP8.1: mb_convert_encoding not working with ASCII chars above 127 #8744

Closed
felixwatzka opened this issue Jun 10, 2022 · 21 comments
Closed

PHP8.1: mb_convert_encoding not working with ASCII chars above 127 #8744

felixwatzka opened this issue Jun 10, 2022 · 21 comments

Comments

@felixwatzka
Copy link

felixwatzka commented Jun 10, 2022

Description

The following code:

<?php

echo mb_convert_encoding ( chr(252), 'UTF-8', 'ASCII' );

Resulted in this output:

?

But I expected this output instead:

ü

In fact, since PHP 8.1, no ASCII character above 127 will be converted correctly. You can try the above example also here: https://3v4l.org/7ZASZ

PHP Version

PHP 8.1.6

Operating System

Ubuntu 22.04

@cmb69
Copy link
Contributor

cmb69 commented Jun 10, 2022

Well, there are no ASCII characters above 127. The stricter behavior is likely deliberate. @alexdowad, could you please clarify?

@felixwatzka
Copy link
Author

Oh yeah, you're right. Didn't even think about this. So this was actually a bug in all older PHP versions?

We fixed our application by using ISO-8859-1 instead of ASCII as $from_encoding.

@alexdowad
Copy link
Contributor

Well, there are no ASCII characters above 127. The stricter behavior is likely deliberate. @alexdowad, could you please clarify?

Yep, that is right.

So this was actually a bug in all older PHP versions?

Yes, it was.

There were a lot of bugs in mbstring in older PHP versions. Most of them were so obscure that probably no user of PHP ever experienced them.

@cmb69
Copy link
Contributor

cmb69 commented Jun 10, 2022

Okay, closing as invalid then.

@alexdowad
Copy link
Contributor

...But thanks to @felixwatzka for reporting. Please keep letting us know if you notice anything else which seems unusual.

@rhulka
Copy link

rhulka commented Dec 4, 2022

I have a case when we load images from the NTEXT field (SQL Server) database. The images are saved to NTEXT from VBScript (I know that storing binary in NTEXT is deprecated and stupid, but the database is 20 years old and can't be easily changed).

In PHP 8.0.11 we used the following conversion:

mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')

Which returned the correct binary representation of the Image but in PHP 8.1.8+, it doesn't work. Some bytes are replaced with 3F

CleanShot 2022-12-04 at 15 22 42

Any suggestion on how to emulate old behavior? I also struggle to find what exactly changes did this and in what version.

UPD:

This still works

return @\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8');

@alexdowad
Copy link
Contributor

@rhulka Thank you very much for the report. I will be happy to investigate and report exactly what has caused the difference in behavior and when it changed.

If the change is unintentional, we will revert it; if it is intentional, we will advise how you can work around it.

Hope to check into this later today if possible. Thanks again for the report.

@rhulka
Copy link

rhulka commented Dec 4, 2022

@alexdowad, thanks a lot for the quick reply. We will go with UConverter at the moment. I understand that it's more of exploiting of "old bugs" in our case than normal usage, so do not rush.

To sum up

PHP 8

@\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8') === \UConverter::transcode($this->getRawData(), 'ibm-5348_P100-1997', 'UTF-8'); === mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')

PHP 8.1

@\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8') === \UConverter::transcode($this->getRawData(), 'ibm-5348_P100-1997', 'UTF-8'); !== mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')

@alexdowad
Copy link
Contributor

@rhulka Please also post the value of bin2hex($this->getRawData()) which demonstrates the problem.

@alexdowad
Copy link
Contributor

Just investigating but can't get very far without knowing what $this->getRawData() actually is.

@rhulka
Copy link

rhulka commented Dec 4, 2022

@alexdowad

ntext_image.txt - raw data from the database (NTEXT field SQLServer 2016), it's not a "text". I can't upload with the .bin extension.

d5fdc38d6d139b4e186810c19725ffa8 - here is result image after conversion in PHP 8.0.11 (MacOS and Debian in docker container)

image_after_encoding_in_php81 - after encoding in PHP 8.1.12

ntext_image_bin2hex.txt - bin2hex output

$image = mb_convert_encoding($rawDataFromDbNtextField, 'Windows-1252', 'UTF-8');
header('Content-Type: image/jpeg');
echo $image;
exit;

Thank you

@alexdowad
Copy link
Contributor

OK, it looks like the error markers (?) are being added not when decoding the UTF-8, but when converting to Windows-1252.

Please see this information on Windows-1252: https://en.wikipedia.org/wiki/Windows-1252

Looking at the conversion table there, you can see that some codepoints like U+0081 and U+008F have no mapping in Windows-1252. The old implementation of mbstring would convert U+0081 to 0x81 for Windows-1252 nonetheless.

The Wikipedia page does have an interesting note here:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too.

So if we want to improve compatibility with the Win32 API, we could restore mappings for U+0081, U+008D, U+0090, and U+009D.

I'm not sure if this would be enough to make Windows-1252 work for @rhulka's use case or not.

Any comments??

@alexdowad
Copy link
Contributor

Calling @cmb69

@alexdowad
Copy link
Contributor

Thinking about this a bit more. Given that mbstring is not "greenfield" but is a library with a long history, I think restoring the BC behavior is the right thing to do here.

@rhulka
Copy link

rhulka commented Dec 4, 2022

@alexdowad

I think it would be enough. The encoder in VBScript (a legacy app that we rewrite in PHP) does just that, that is why we got the idea to use this approach in PHP in the first place. I am not pro in encodings conversion, just noticed that behavior differs between versions, and U+0081, U+008D, U+0090... got replaced with 3F

@rhulka
Copy link

rhulka commented Dec 4, 2022

Does it was changed in this commit b5ff87c ? Then I can revert this and re-compile PHP from sources for testing

@alexdowad
Copy link
Contributor

@rhulka Yes, that should be the one.

@cmb69
Copy link
Contributor

cmb69 commented Dec 5, 2022

Given that mbstring is not "greenfield" but is a library with a long history, I think restoring the BC behavior is the right thing to do here.

I agree.

@alexdowad
Copy link
Contributor

I should have a patch within a day or so.

@themao
Copy link

themao commented Jan 24, 2023

Thank you for bringing up this issue, I figured out that our app was exploiting this "bug" by converting from us-ascii to UTF-8, which worked up to version 8.0 and stopped working in 8.1. The app was accepting special characters like German umlauts ä, ö, ü, ß and tried to convert them to a proper UTF-8 to send to a third-party app and this was broken after an update to php8.2, these characters were replaced with ?? instead. Now the app is using ISO-8859-1 as a source encoding an everything works properly.

I apologize for commenting on a closed issue, I just thought it would be useful to add my case here so someone could maybe find this page in Google, I couldn't do it so had to browse the issues list.

@alexdowad
Copy link
Contributor

Thanks for posting, @themao.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants