bytes(23) this is not your Python2.7 bytes...

Working on integrating a patch that lets parts of PyOpenGL work on PyPy 1.5 and Python 3.2 (thanks to Renaud).  I tried creating a little wrapper that abstracts away the various changes.  One piece of it is a function that takes a string, a unicode string, or an arbitrary object and returns a friendly 8-bit string representation of the object to pass to the C-level code.  It handles the unicode case explicitly (using an encode call) and just lets str() handle the rest.  So: check which type is the 8-bit string type, assign it to "bytes", and then do:

bytes( obj )

Except that when the integer value 23 is passed in on Python 3.2, I get b'\000'*23 instead of b'23'.  The rationale here escapes me.  How is that a useful response to "give me a bytes representation of 23"?  I'd accept b'23', b'\x00\x00\x00\x17', b'\x17\x00\x00\x00' or b'\x17', but twenty-three null bytes?  That definitely was not what I was expecting.

Turns out bytes([]) is b'' instead of b'[]' too, and the same goes for bytes({}).  That almost makes sense if you think of bytes() as "iterate over x, giving me a byte for each item in x", but when did ints become iterables with values of 0?
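For reference, here is a small sketch of how the Python 3 bytes() constructor treats the inputs mentioned above. The variable names are made up for illustration; the behaviour shown is that of the built-in itself, and the last line is just one way to get the text form.

```python
# Python 3: what bytes() does with the inputs discussed above.

# An int argument is treated as a length, giving that many zero bytes.
zero_filled = bytes(23)

# An iterable argument must yield ints in range(256), one byte per item.
from_empty_list = bytes([])
from_empty_dict = bytes({})      # iterating an empty dict yields nothing
from_codes = bytes([50, 51])     # 50 and 51 are the ASCII codes of '2' and '3'

# The text form has to be requested explicitly: str() first, then encode.
text_form = str(23).encode("ascii")

print(len(zero_filled), from_empty_list, from_codes, text_form)
```

So an int is read as a size and a list as a sequence of byte values; neither is ever converted to its text representation.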

Comments

  1. Antoine

    Antoine on 05/02/2011 10:04 a.m. #

    It's because there's no such thing as a standard "bytes representation" of an integer or a list. There is a "text representation", which you can get by calling str() - and then you can encode it to bytes using whichever character set you want.

  2. Mike Fletcher

    Mike Fletcher on 05/02/2011 10:19 a.m. #

    Which would mean you should throw an error if passed an integer (or a list). I can certainly code around the behavior; I'm annoyed that it is failing silently and subtly. \x00 x 23 is garbage data which will show up in some program that was expecting to get '23'. I can't even imagine a user use-case where the \x00 x 23 is what the user intended from bytes( 23 ), particularly as right up to Python 2.7.1 this call returned '23'.

  3. Mike Fletcher

    Mike Fletcher on 05/02/2011 10:39 a.m. #

    Okay, too hyperbolic there. I can imagine someone using it as a way of saying (bytes x 23)(), I'd just expect to use the same approach as ctypes for that case... I can't imagine a *good* use case.

  4. eryksun

    eryksun on 05/02/2011 11:26 a.m. #

    I also don't understand the rationale for bytes(int). That said, bytes in Python 3.x is of type 'bytes', not 'str'. If you want a string representation, use 'str', and if you want that as 'bytes', specify an encoding such as utf-8:

    >>> bytes(str(23), 'utf-8')
    b'23'

    You can use struct to get the bytes of, for example, a native signed long ('l', which is 32 bits on this platform; '<i' gives a fixed-size little-endian 32-bit int everywhere):

    >>> import struct
    >>> struct.pack('l', 23)
    b'\x17\x00\x00\x00'

    You can also use bytes on an array since it supports the buffer interface:

    >>> from array import array
    >>> bytes(array('l', [23]))
    b'\x17\x00\x00\x00'

  5. Mike Fletcher

    Mike Fletcher on 05/02/2011 2:18 p.m. #

    Sure, again, I can work around the bug, and I'll have to in many of the code-bases I port to 3.x. The problem is that the bug introduces silent, surprising failures with a subtle change that needs to be manually tracked down across all existing code-bases. I know, I know this is for the new Python programmers, so suck it up library developers. I just don't see this bytes() behavior as particularly useful (particularly the int behavior). If you can't call bytes() on X and get a useful value, or the value you have always received, then raise an error instead of propagating useless values throughout the application. The sky isn't falling, I'm just annoyed.

    Keep in mind, this code was written a long time ago, in many cases, where saying "make this an 8-bit string representation of this object to pass to a C extension" was the *precise* intention of the code. The same code was doing *precisely* the intended thing in checking for str/unicode objects... both cases reference "str" in the same name-spaces precisely for its 2.x meaning... that code needs to be revisited in each case to be sure it is still doing what is intended in the face of all expected inputs. Some of those references need to become bytes, some str, and some now need whole extra calls inserted to produce the result required (i.e. 8-bit string representations).

    Again, I can do it, it's just a bit of a PITA.

  6. Nick Coghlan

    Nick Coghlan on 05/03/2011 12:03 a.m. #

    The behaviour is definitely deliberate (documented in PEP 358 and 3137, the official docs and the bytearray docstring, albeit missing from the bytes docstring).

    However, I don't understand why any calls to "bytes" would be anywhere near code that "wants to get a friendly string (8-bit) representation" of *anything*.

    The correct way to write that is something like:

    val = str(obj)
    if str is not bytes: # Easy to make this work by setting "bytes = str" if bytes is not a builtin
        val = val.encode("utf-8") # or specific encoding

  7. eryksun

    eryksun on 05/03/2011 12:51 a.m. #

    If what you want is a bytes object, why wouldn't you call bytes?

    >>> str(23).encode('utf-8')

    vs

    >>> bytes(str(23), 'utf-8')

    Either way seems fine to me, but I think the 2nd case scores brownie points for being obvious. Ask for bytes; get bytes.

    However, I don't think it's obvious that bytes(23) should yield 23 null bytes. I think it should yield a TypeError for not being iterable. I think 23 * bytes([0]) is more obvious and in line with existing idioms such as 23 * [0]. Anyway, thanks for pointing out the relevant PEPs. I look forward to grokking the rationale for this.

  8. Marius Gedminas

    Marius Gedminas on 05/03/2011 5:39 a.m. #

    If bytes were a mutable type, I could see

    buf = bytes(23)
    buf[0] = 123
    buf[1] = 231
    ...

    making sense. But to get 23 null bytes, I'd use b'\0' * 23.

  9. Antoine

    Antoine on 05/03/2011 6:32 a.m. #

    The bytes constructor accepts a list as parameter because a bytes object is really a sequence of (byte-sized) integers:

    >>> b = bytes([1,2,3])
    >>> b
    b'\x01\x02\x03'
    >>> list(b)
    [1, 2, 3]

    You can actually give it any iterable of ints:

    >>> bytes(x & 1 for x in range(5))
    b'\x00\x01\x00\x01\x00'

  10. eryksun

    eryksun on 05/03/2011 12:06 p.m. #

    It looks like you're allocating memory for a C-like array. IMO, this idiom looks out of place in Python. I'm used to seeing either a single element copied or a sequence explicitly initialized, such as by using a generator:

    >>> from random import randrange
    >>> bytes(randrange(128, 256) for i in range(8))
    b'\xa9\xbe\xb4\xc1\xe6\xe1\xf6\xb7'

  11. Mike Fletcher

    Mike Fletcher on 05/03/2011 10:01 p.m. #

    Regarding why the call to bytes is there, see above, but to be more explicit: in the same module, two references to "str" exist, one used to check isinstance( value, str ) and the other pass_to_c( str( someobject ) ). Initial port is to map str -> bytes, but that now fails on str( someobject ).

    The reason the call to "bytes" shows up in this context is because up until 2.7.1 this is precisely one of the meanings of str( object ). Each time it has been used as such needs to be revisited and rewritten as described. Again, fixable, but this is a *silent* failure, the kind of thing that causes weird, unexplained data to show up in weird places long after the error has occurred.
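Pulling the thread together, the wrapper described in the post could be sketched roughly as below. This is a Python 3-only illustration following the str()-then-encode approach suggested in the comments; the name as_8_bit and the UTF-8 default are assumptions made for the example, not PyOpenGL's actual code.

```python
def as_8_bit(obj, encoding="utf-8"):
    """Return an 8-bit (bytes) text representation of obj.

    Hypothetical helper for illustration; Python 3 only.
    """
    if isinstance(obj, bytes):
        # Already an 8-bit string: pass it through unchanged.
        return obj
    if not isinstance(obj, str):
        # Arbitrary object: take its text form first, never bytes(obj).
        obj = str(obj)
    # Text (str) is encoded explicitly to get bytes.
    return obj.encode(encoding)


print(as_8_bit(23))      # the text form of the int, not 23 null bytes
print(as_8_bit("abc"))
print(as_8_bit(b"raw"))
```

The point of the sketch is that bytes() is only ever applied to nothing at all: ints, lists and other objects always go through str() first, so an unexpected type ends up as readable text rather than as a run of null bytes.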
