Libraries Exposing 8-bit Binary Strings to Python 3, Best Practice?
A query came up on PyOpenGL-dev this morning about how to handle GLchar pointer arguments. These are binary-specified arguments: they are human-readable text *most* of the time (ASCII source code and identifiers, that kind of thing), but nothing about a GLchar pointer requires that they be ASCII. They *are* 8-bit character strings (that's what GLchar pointer means).
But it looks awkward for Python 3 users to specify b'identifier' and b'#shader-code...', so they are likely going to expect to be able to pass Unicode values in. That, I think, we can support without any real problems, but then users are going to be using Unicode to store and process their ASCII 8-bit shaders...
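Roughly, something like the sketch below; the helper name and the "encode text on the way in" policy are just illustration, not anything PyOpenGL actually ships:

    # Hypothetical helper (not PyOpenGL's actual API): a GLchar* entry point
    # could accept both bytes and str on Python 3 by encoding text on ingest.
    def as_glchar_bytes(value, encoding="ascii"):
        """Coerce a shader source string or identifier to 8-bit data."""
        if isinstance(value, bytes):
            return value                   # already binary, pass through untouched
        if isinstance(value, str):
            return value.encode(encoding)  # convenience path for text callers
        raise TypeError("expected bytes or str, got %r" % type(value))

    # Both spellings end up as the same 8-bit data:
    assert as_glchar_bytes(b"gl_Position") == as_glchar_bytes("gl_Position")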
The question is what to produce when we *return* something which is a GLchar pointer. Most of the time, these are ASCII human-readable strings; some of the time they are 8-bit character pointers with binary data. It seems whichever way we go, some corner cases will pop up where a user tries to compare, search or otherwise interact with a byte-string and a unicode object and the operation blows up because the two can't be auto-converted.
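For the record, the failure mode looks like this on plain Python 3 (no OpenGL involved):

    # Mixing the library's byte-string output with the user's text, Python 3:
    source = b"void main() {}"           # imagine this came back as GLchar* data
    print(source == "void main() {}")    # False -- bytes never compare equal to str
    try:
        "main" in source                 # searching bytes with a str...
    except TypeError as err:
        print(err)                       # ...raises: a bytes-like object is required, not 'str'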
So, is best practice to raise errors on ingest (refuse to guess, require explicit conversion to 8-bit)? Return unicode even if the data might be binary (make it convenient for the user in the common case of not caring)? Allow unicode ingest, but produce 8-bit output (introduce some corner cases that are likely to blow up "elsewhere" in the code)? Or do we have to explicitly code every single GLchar pointer entry point looking at whether that entry point is dealing with text or arbitrary data (and hoping that it always does the same)?
My sinking feeling is that the only way to provide both a "natural" interface and a safe/sane one will be the last of those. I'd rather go for the first or third option, just to make it simple and easy to explain. Python 3 experts, care to weigh in?
Comments
Nick Coghlan on 04/04/2012 7:06 p.m.
Yup, case-by-case is really the only way to do this "right" - it really does matter whether an API is dealing with encoded text or arbitrary binary data and, in the former case, whether or not you can make a reasonable guess as to what the encoding of that data is.
The simplest case is when it's always arbitrary binary data - then you just declare the Python API as using bytes. This can also be a good starting point for APIs where the data is stored as char arrays in C - the decision to use bytes to represent such data in Python is never going to be totally wrong from a correctness point of view, it's just sometimes inconvenient (since it places the burden of decoding on the API user rather than handling it for them).
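For illustration, with a made-up stand-in entry point rather than a real API:

    # A bytes-only API leaves decoding to the caller.
    def get_shader_info_log(shader):
        """Stand-in for a hypothetical entry point that returns raw 8-bit data."""
        return b"0:12(3): error: syntax error, unexpected '}'"

    raw = get_shader_info_log(None)       # the library hands back bytes as-is
    log = raw.decode("ascii", "replace")  # decoding is the caller's burden
    print(log)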
The simplest encoded text case is when the external API explicitly specifies an encoding, and there's no need to cope with data sources that may produce incorrectly encoded data. In this case, you just decode on input and encode on output using the externally specified encoding.
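A minimal sketch of that boundary conversion, assuming the external API mandates UTF-8:

    # Convert at the boundary so callers only ever see str.
    ENCODING = "utf-8"   # assumed to be fixed by the external specification

    def to_wire(text):
        return text.encode(ENCODING)      # str -> bytes on the way out

    def from_wire(data):
        return data.decode(ENCODING)      # bytes -> str on the way in

    assert from_wire(to_wire("héllo")) == "héllo"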
The next simplest case is where you have a standard encoding (or can make a reasonable guess at one), and just need unrecognised bytes to correctly survive a round trip through Python. This is where the "errors='surrogateescape'" error handler (which squirrels unrecognised bytes away as lone surrogate code points) can come in handy. This is the approach the interpreter and standard library use for OS-facing APIs like the command line arguments and filesystem access. APIs using this design are often best complemented with a raw "bytes" API as well (e.g. os.environ and os.environb).
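A quick demonstration of the round trip:

    # An undecodable byte survives as a lone surrogate and comes back on encode.
    raw = b"caf\xe9 latte"                                # a Latin-1 byte in "UTF-8" data
    text = raw.decode("utf-8", errors="surrogateescape")  # 0xE9 becomes '\udce9'
    assert text.encode("utf-8", errors="surrogateescape") == raw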
Beyond that case, there are no real general-purpose answers that are suitable for use at the library level. Sometimes it can make sense to offer "encoding" and "errors" arguments that are used for an implicit decode() or encode() call, but in such cases it may also be better to just document the expected encoding and let the API user sort it out.
For the specific case of GLchar, it sounds like your safest option will be to say "GLchar maps to Python 3 bytes" as the general rule, but also accept *7-bit* ASCII text (via input.encode("ascii", "strict")) in affected APIs (a number of standard library APIs that normally operate on binary data have been updated to also accept pure ASCII text as input in 3.3).
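That strict ASCII coercion accepts plain identifiers but refuses anything outside the 7-bit range, for example:

    # Strict ASCII: fine for identifiers, an error for anything else.
    print("gl_Position".encode("ascii", "strict"))   # b'gl_Position'
    try:
        "café".encode("ascii", "strict")
    except UnicodeEncodeError as err:
        print(err)   # 'ascii' codec can't encode character '\xe9' ...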
For output though, it really depends on the specific API involved. The standard library takes one of a number of different approaches depending on the context:
- bytes in -> bytes out, str in -> str out
- always bytes out
- always str out
For the latter cases, sometimes *both* interfaces are provided under separate names (e.g. os.getcwd() and os.getcwdb(), os.environ and os.environb).
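For example, the current-directory pair works like this:

    # The paired-API pattern: same information, one str view and one bytes view.
    import os
    print(os.getcwd())     # str, decoded with the filesystem encoding
    print(os.getcwdb())    # bytes, the raw path as the OS reports it
    # On a typical setup the two round-trip via os.fsencode():
    print(os.fsencode(os.getcwd()) == os.getcwdb())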
Mike Fletcher on 04/04/2012 8:56 p.m.
Yeah, that was about where I arrived (do manual analysis of every entry point). I just didn't like the location :) .
Mapping to bytes, and allowing unicode with a defined mapping to bytes seems like the most reasonable solution that doesn't introduce lots of subtle changes between 2.x and 3.x versions of the library. On 2.x, should someone *really* want to, they can pass in a u'' and have it encoded, and with implicit coercion, you'll probably never have a problem. On 3.x, if you happen to pass in Unicode, we'll let you go on until you try to compare it with a byte-string... then boom :) .