Bug 5102 - difficult to support user names that aren't UTF-8
Summary: difficult to support user names that aren't UTF-8
Status: NEW
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Other
Version: trunk
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: LowPrio
Assignee: Peter Åstrand
URL:
Keywords:
Depends on: 7593 4586
Blocks:
Reported: 2014-04-16 13:44 CEST by Pierre Ossman
Modified: 2022-11-15 16:47 CET (History)

Description Pierre Ossman cendio 2014-04-16 13:44:09 CEST
Whilst investigating bug 5098, we realised that we have a much broader problem with encoding of user names.

The first step of a ThinLinc connection is to connect to SSH. Although the SSH protocol mandates UTF-8 for the user name, OpenSSH completely ignores this and just treats it as a binary blob.

So, no matter how much fancy handling we do in ThinLinc, OpenSSH will never respect the locale. And even if it did, LANG is not properly set for sshd on Debian-based systems.

What this all means is that we have to make an assumption about what character encoding the user names are in. The client currently uses UTF-8 no matter what the client side locale is.
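
To illustrate the mismatch, here is a minimal Python sketch. The user name and the choice of ISO 8859-1 as the server-side encoding are assumptions made up for the example, and the byte-wise comparison only mimics how a binary blob gets compared; it is not OpenSSH code.

```python
# Hypothetical user name containing a non-ASCII character (U+00E5).
USER_NAME = "\u00e5sa"

# The ThinLinc client always sends the name as UTF-8.
sent_by_client = USER_NAME.encode("utf-8")

# Assumption: the server's user database stores names in ISO 8859-1.
stored_on_server = USER_NAME.encode("iso8859-1")


def same_user(a: bytes, b: bytes) -> bool:
    """Compare names byte by byte, the way a binary blob is compared."""
    return a == b


if __name__ == "__main__":
    print(sent_by_client)    # b'\xc3\xa5sa'
    print(stored_on_server)  # b'\xe5sa'
    print(same_user(sent_by_client, stored_on_server))  # False
```

The same name ends up as two different byte strings, so a comparison that ignores encodings cannot match them.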

One upside of restricting user names to UTF-8 is that we can get rid of all the locale_encode()/locale_decode() handling we have everywhere we use a user name. With some luck we can also get rid of the few other cases where they are used, which means that even server processes are no longer dependent on a proper locale. This is extra beneficial on Debian systems where locale is normally not set for system daemons (see bug 5098).
Comment 1 Pierre Ossman cendio 2016-07-11 10:50:50 CEST
One data point is that Gnome now requires a UTF-8 locale. See their FAQ on Gnome Terminal under Exit status 8:

https://wiki.gnome.org/Apps/Terminal/FAQ#Exit_status_8
Comment 2 Pierre Ossman cendio 2020-10-23 10:45:53 CEST
Python 3 also affects this issue: it implicitly converts using UTF-8 in some cases and using the current locale's encoding in others.

Python 2 generally didn't do any implicit conversions, except for file names, which use the current locale. For some odd reason Python 3 uses UTF-8 there instead, even though it is also aware of the current locale.
Comment 3 Pierre Ossman cendio 2020-10-23 15:54:44 CEST
Also note that we now start our services through systemd on all distributions (bug 4290), so we don't know what the current behaviour really is. Given the comments on bug 5098 it seems like the fix in r13738 might not be needed anymore.
Comment 4 Pierre Ossman cendio 2020-10-26 14:00:18 CET
(In reply to Pierre Ossman from comment #2)
> 
> Python 2 generally didn't do any implicit conversions, except for file names,
> which use the current locale. For some odd reason Python 3 uses UTF-8 there
> instead, even though it is also aware of the current locale.

Apparently this isn't true. Python 3 uses the locale for file names (and most system calls it seems). However, for a bad locale or the default (C) locale, it falls back to UTF-8 rather than ASCII.

Test cases:

> $ LANG=C python3 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> utf-8 utf-8
> 
> $ LANG=sv_SE.ISO8859-1 python3 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> utf-8 iso8859-1
> 
> $ LANG=sv_SE.UTF-8 python3 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> utf-8 utf-8
> 
> $ LANG=fofofo python3 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> utf-8 utf-8

Compared to Python 2:

> $ LANG=C python2 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> ('ascii', 'ANSI_X3.4-1968')
> 
> $ LANG=sv_SE.ISO8859-1 python2 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> ('ascii', 'ISO-8859-1')
> 
> $ LANG=sv_SE.UTF-8 python2 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> ('ascii', 'UTF-8')
> 
> $ LANG=fofofo python2 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> ('ascii', 'ANSI_X3.4-1968')

This change was introduced by PEP 540 (UTF-8 Mode), which first shipped in Python 3.7:

https://www.python.org/dev/peps/pep-0540/

However, Python 3.6 and older don't have this change, so they behave more like Python 2:

> $ LANG=C python3 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> utf-8 ascii
> 
> $ LANG=sv_SE.ISO8859-1 python3 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> utf-8 iso8859-1
> 
> $ LANG=sv_SE.UTF-8 python3 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> utf-8 utf-8
> 
> $ LANG=fofofo python3 -c 'import sys; print(sys.getdefaultencoding(), sys.getfilesystemencoding())'
> utf-8 ascii
Comment 5 Pierre Ossman cendio 2020-10-26 14:01:56 CET
Python 3 also doesn't accept "bytes" everywhere. E.g. pwd.getpwnam() requires "str", so we have no choice but to rely on Python's own encoding/decoding.
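
A rough sketch of what that reliance looks like in practice, assuming the user name arrives as raw bytes (e.g. from the network). The helper name lookup_user() is hypothetical; os.fsdecode() and pwd.getpwnam() are standard library calls. On POSIX, os.fsdecode() uses the filesystem encoding with the surrogateescape error handler, so bytes that are not valid in the current locale still survive a round trip through "str".

```python
import os
import pwd
from typing import Optional


def lookup_user(raw_name: bytes) -> Optional[pwd.struct_passwd]:
    """Look up a user whose name was received as raw bytes.

    pwd.getpwnam() only accepts str, so the bytes are first decoded
    with the filesystem encoding, which follows the locale.
    """
    name = os.fsdecode(raw_name)
    try:
        return pwd.getpwnam(name)
    except KeyError:
        # No such user in the password database.
        return None
```

Which user this finds depends on the locale the process was started with, which is exactly the dependency discussed in this bug.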
Comment 6 Pierre Ossman cendio 2020-10-27 09:25:38 CET
We'll be gradually removing the need for locale_encode()/locale_decode() as part of bug 4586. Once that is done all that is left here is to verify that services are indeed started with the correct locale, and poke OpenSSH about their handling.
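
For the verification step, something along these lines could serve as a startup check. This is only a sketch: the function names are made up here, and it only looks at what Python reports, not at what the init system actually set.

```python
import codecs
import locale
import sys


def is_utf8(encoding_name: str) -> bool:
    """Return True if the name is any alias of UTF-8 (e.g. "utf8", "UTF-8")."""
    try:
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        # Unknown encoding names are treated as "not UTF-8".
        return False


def process_uses_utf8() -> bool:
    """Check both the filesystem encoding and the locale's preferred encoding."""
    return (is_utf8(sys.getfilesystemencoding())
            and is_utf8(locale.getpreferredencoding(False)))


if __name__ == "__main__":
    if not process_uses_utf8():
        print("warning: not running with a UTF-8 locale", file=sys.stderr)
```

Note that on Python 3.7 and newer the filesystem encoding can be UTF-8 even under the C locale, so the two checks may disagree.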
Comment 7 Pierre Ossman cendio 2020-10-27 10:06:26 CET
Reported to OpenSSH:

https://bugzilla.mindrot.org/show_bug.cgi?id=3225
Comment 9 Linn cendio 2020-10-30 09:26:47 CET
With hiveconf now supporting Python 3, we have had to make a decision about which encoding it should use. Since the .hconf files shipped with ThinLinc are always encoded in UTF-8, hiveconf has to always use UTF-8, regardless of what the system's locale is.
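
In Python 3 terms the decision boils down to passing an explicit encoding instead of letting open() pick one from the locale. A minimal sketch, not hiveconf's actual code; read_config_text() is a hypothetical name:

```python
def read_config_text(path: str) -> str:
    """Read a configuration file as UTF-8, independent of the locale."""
    # Without the encoding argument, open() would normally use the
    # locale's preferred encoding, so results could differ between systems.
    with open(path, "r", encoding="utf-8") as config_file:
        return config_file.read()
```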

For more information, see bug 7557.
Comment 11 Pierre Ossman cendio 2021-04-21 10:34:36 CEST
systemd (and dbus) also seem to refuse to fully work with non-UTF-8 systems:

> POSIX does not specify the encoding of non-ASCII environment variable names or
> values and allows them to contain any non-zero byte, but neither dbus-daemon
> nor systemd supports environment variables with non-UTF-8 names or values.
> Accordingly, dbus-update-activation-environment assumes that any name or value
> that appears to be valid UTF-8 is intended to be UTF-8, and ignores other names
> or values with a warning. 

https://dbus.freedesktop.org/doc/dbus-update-activation-environment.1.html

Unfortunately, the user name commonly appears in environment variable values (e.g. the home directory and the ThinLinc session directory).
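
A quick way to see which variables would be affected is to check the raw environment for entries that are not valid UTF-8. The sketch below is an assumption about how one might test this, not something dbus or systemd provide; non_utf8_variables() is a made-up name and the sample environment is invented.

```python
import os
from typing import List, Mapping, Optional


def non_utf8_variables(env: Optional[Mapping[bytes, bytes]] = None) -> List[bytes]:
    """Return the names of variables whose name or value is not valid UTF-8."""
    if env is None:
        # os.environb gives the environment as raw bytes (POSIX only).
        env = os.environb
    bad = []
    for name, value in env.items():
        try:
            name.decode("utf-8")
            value.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(name)
    return sorted(bad)


if __name__ == "__main__":
    # Invented example: a home directory for a user name stored as ISO 8859-1.
    sample = {b"HOME": b"/home/\xe5sa", b"LANG": b"sv_SE.ISO8859-1"}
    print(non_utf8_variables(sample))  # [b'HOME']
```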
