PHP8.1: mb_convert_encoding not working with ASCII chars above 127 #8744

felixwatzka · 2022-06-10T08:53:05Z

Description

The following code:

<?php

echo mb_convert_encoding ( chr(252), 'UTF-8', 'ASCII' );

Resulted in this output:

But I expected this output instead:

ü

In fact, since PHP 8.1, no ASCII character above 127 will be converted correctly. You can try the above example also here: https://3v4l.org/7ZASZ

PHP Version

PHP 8.1.6

Operating System

Ubuntu 22.04

The text was updated successfully, but these errors were encountered:

cmb69 · 2022-06-10T10:10:49Z

Well, there are no ASCII characters above 127. The stricter behavior is likely deliberate. @alexdowad, could you please clarify?

felixwatzka · 2022-06-10T11:04:24Z

Oh yeah, you're right. Didn't even think about this. So this was actually a bug in all older PHP versions?

We fixed our application by using ISO-8859-1 instead of ASCII as $from_encoding.

alexdowad · 2022-06-10T11:07:39Z

Well, there are no ASCII characters above 127. The stricter behavior is likely deliberate. @alexdowad, could you please clarify?

Yep, that is right.

So this was actually a bug in all older PHP versions?

Yes, it was.

There were a lot of bugs in mbstring in older PHP versions. Most of them were so obscure that probably no user of PHP ever experienced them.

cmb69 · 2022-06-10T11:16:32Z

Okay, closing as invalid then.

alexdowad · 2022-06-10T11:23:51Z

...But thanks to @felixwatzka for reporting. Please keep letting us know if you notice anything else which seems unusual.

rhulka · 2022-12-04T14:26:20Z

I have a case when we load images from the NTEXT field (SQL Server) database. The images are saved to NTEXT from VBScript (I know that storing binary in NTEXT is deprecated and stupid, but the database is 20 years old and can't be easily changed).

In PHP 8.0.11 we used the following conversion:

mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')

Which returned the correct binary representation of the Image but in PHP 8.1.8+, it doesn't work. Some bytes are replaced with 3F

Any suggestion on how to emulate old behavior? I also struggle to find what exactly changes did this and in what version.

UPD:

This still works

return @\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8');

alexdowad · 2022-12-04T15:29:19Z

@rhulka Thank you very much for the report. I will be happy to investigate and report exactly what has caused the difference in behavior and when it changed.

If the change is unintentional, we will revert it; if it is intentional, we will advise how you can work around it.

Hope to check into this later today if possible. Thanks again for the report.

rhulka · 2022-12-04T15:38:27Z

@alexdowad, thanks a lot for the quick reply. We will go with UConverter at the moment. I understand that it's more of exploiting of "old bugs" in our case than normal usage, so do not rush.

To sum up

PHP 8

@\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8') === \UConverter::transcode($this->getRawData(), 'ibm-5348_P100-1997', 'UTF-8'); === mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')

PHP 8.1

@\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8') === \UConverter::transcode($this->getRawData(), 'ibm-5348_P100-1997', 'UTF-8'); !== mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')

alexdowad · 2022-12-04T16:18:34Z

@rhulka Please also post the value of bin2hex($this->getRawData()) which demonstrates the problem.

alexdowad · 2022-12-04T16:47:01Z

Just investigating but can't get very far without knowing what $this->getRawData() actually is.

rhulka · 2022-12-04T17:12:26Z

@alexdowad

ntext_image.txt - raw data from the database (NTEXT field SQLServer 2016), it's not a "text". I can't upload with the .bin extension.

- here is result image after conversion in PHP 8.0.11 (MacOS and Debian in docker container)

- after encoding in PHP 8.1.12

ntext_image_bin2hex.txt - bin2hex output

$image = mb_convert_encoding($rawDataFromDbNtextField, 'Windows-1252', 'UTF-8');
header('Content-Type: image/jpeg');
echo $image;
exit;

Thank you

alexdowad · 2022-12-04T17:42:50Z

OK, it looks like the error markers (?) are being added not when decoding the UTF-8, but when converting to Windows-1252.

Please see this information on Windows-1252: https://en.wikipedia.org/wiki/Windows-1252

Looking at the conversion table there, you can see that some codepoints like U+0081 and U+008F have no mapping in Windows-1252. The old implementation of mbstring would convert U+0081 to 0x81 for Windows-1252 nonetheless.

The Wikipedia page does have an interesting note here:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too.

So if we want to improve compatibility with the Win32 API, we could restore mappings for U+0081, U+008D, U+0090, and U+009D.

I'm not sure if this would be enough to make Windows-1252 work for @rhulka's use case or not.

Any comments??

alexdowad · 2022-12-04T17:43:52Z

Calling @cmb69

alexdowad · 2022-12-04T17:46:45Z

Thinking about this a bit more. Given that mbstring is not "greenfield" but is a library with a long history, I think restoring the BC behavior is the right thing to do here.

rhulka · 2022-12-04T17:59:37Z

@alexdowad

I think it would be enough. The encoder in VBScript (a legacy app that we rewrite in PHP) does just that, that is why we got the idea to use this approach in PHP in the first place. I am not pro in encodings conversion, just noticed that behavior differs between versions, and U+0081, U+008D, U+0090... got replaced with 3F

rhulka · 2022-12-04T20:01:55Z

Does it was changed in this commit b5ff87c ? Then I can revert this and re-compile PHP from sources for testing

alexdowad · 2022-12-04T20:03:55Z

@rhulka Yes, that should be the one.

cmb69 · 2022-12-05T10:03:02Z

Given that mbstring is not "greenfield" but is a library with a long history, I think restoring the BC behavior is the right thing to do here.

I agree.

alexdowad · 2022-12-05T14:41:26Z

I should have a patch within a day or so.

themao · 2023-01-24T12:14:30Z

Thank you for bringing up this issue, I figured out that our app was exploiting this "bug" by converting from us-ascii to UTF-8, which worked up to version 8.0 and stopped working in 8.1. The app was accepting special characters like German umlauts ä, ö, ü, ß and tried to convert them to a proper UTF-8 to send to a third-party app and this was broken after an update to php8.2, these characters were replaced with ?? instead. Now the app is using ISO-8859-1 as a source encoding an everything works properly.

I apologize for commenting on a closed issue, I just thought it would be useful to add my case here so someone could maybe find this page in Google, I couldn't do it so had to browse the issues list.

alexdowad · 2023-01-24T12:27:30Z

Thanks for posting, @themao.

felixwatzka added Bug Status: Needs Triage labels Jun 10, 2022

cmb69 added the Extension: mbstring label Jun 10, 2022

cmb69 closed this as completed Jun 10, 2022

cmb69 added Status: Invalid and removed Status: Needs Triage labels Jun 10, 2022

alexdowad mentioned this issue Dec 6, 2022

Support Microsoft's "Best Fit" mappings for Windows-1252 text encoding #10060

Closed

youkidearitai mentioned this issue Mar 19, 2024

mb_check_encoding is detecting SJIS as Windows-1252 in PHP 8.1 #13746

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PHP8.1: mb_convert_encoding not working with ASCII chars above 127 #8744

PHP8.1: mb_convert_encoding not working with ASCII chars above 127 #8744

felixwatzka commented Jun 10, 2022 •

edited

cmb69 commented Jun 10, 2022

felixwatzka commented Jun 10, 2022

alexdowad commented Jun 10, 2022

cmb69 commented Jun 10, 2022

alexdowad commented Jun 10, 2022

rhulka commented Dec 4, 2022 •

edited

alexdowad commented Dec 4, 2022

rhulka commented Dec 4, 2022

alexdowad commented Dec 4, 2022

alexdowad commented Dec 4, 2022

rhulka commented Dec 4, 2022 •

edited

alexdowad commented Dec 4, 2022

alexdowad commented Dec 4, 2022

alexdowad commented Dec 4, 2022

rhulka commented Dec 4, 2022 •

edited

rhulka commented Dec 4, 2022

alexdowad commented Dec 4, 2022

cmb69 commented Dec 5, 2022

alexdowad commented Dec 5, 2022

themao commented Jan 24, 2023

alexdowad commented Jan 24, 2023

PHP8.1: mb_convert_encoding not working with ASCII chars above 127 #8744

PHP8.1: mb_convert_encoding not working with ASCII chars above 127 #8744

Comments

felixwatzka commented Jun 10, 2022 • edited

Description

PHP Version

Operating System

cmb69 commented Jun 10, 2022

felixwatzka commented Jun 10, 2022

alexdowad commented Jun 10, 2022

cmb69 commented Jun 10, 2022

alexdowad commented Jun 10, 2022

rhulka commented Dec 4, 2022 • edited

alexdowad commented Dec 4, 2022

rhulka commented Dec 4, 2022

alexdowad commented Dec 4, 2022

alexdowad commented Dec 4, 2022

rhulka commented Dec 4, 2022 • edited

alexdowad commented Dec 4, 2022

alexdowad commented Dec 4, 2022

alexdowad commented Dec 4, 2022

rhulka commented Dec 4, 2022 • edited

rhulka commented Dec 4, 2022

alexdowad commented Dec 4, 2022

cmb69 commented Dec 5, 2022

alexdowad commented Dec 5, 2022

themao commented Jan 24, 2023

alexdowad commented Jan 24, 2023

felixwatzka commented Jun 10, 2022 •

edited

rhulka commented Dec 4, 2022 •

edited

rhulka commented Dec 4, 2022 •

edited

rhulka commented Dec 4, 2022 •

edited