Showing posts with label Python. Show all posts
Showing posts with label Python. Show all posts

Sunday, February 17, 2013

Lists and Tuples

I had initially found the divide between lists and tuples in Python confusing. I came from a database background, so I have a certain expectation of what a tuple might be. If you read up on what the difference is in Python, you will find that a) tuples are immutable, and b) singleton tuples use a funny syntax. So just use lists, because it's easier to read and you can't go wrong that way. Oh, and they are both sequences, another overloaded term.

(Yes, there are some details omitted here, such as that since a tuple is immutable, it is hashable and can be used as a dictionary key. But I think that is used fairly seldomly.)

Then I came across Haskell and it dawned on me: Was this just a poorly mangled feature from Haskell? I don't know the history, but it looks a bit like it. You see, Haskell also has list and tuples. Lists are delimited by square brackets, and tuples are delimited by parentheses:

let alist = [1, 2, 3]
let atuple = (1, 2, 3)
(Technically, in Python, tuples are not delimited by parentheses, but they often appear that way.) But the difference is that Haskell does not use parentheses for any other purpose, such as delimiting function arguments. It uses spaces for that. (So it looks more like a shell script at times.)
Python: len([1, 2, 3])
Haskell: length [1, 2, 3]
But in Haskell, tuples are not mutable lists and lists are not mutable tuples. Tuples and lists are quite different but complementary things. A list can only contain elements of the same type. So you can have lists
[1, 2, 3, 4, 5]
["a", "b", "c", "d"]
but not
[1, 2, "a"]
A tuple, on the other hand, can contain values of different types
(1, 2, "a")
(3, 4, "b")
A particular type combination in a tuple creates a new type on the fly, which becomes interesting when you embed tuples in a list. So you can have a list
[(1, 2, "a"), (3, 4, "b")]
but not
[(1, 2, "a"), (3, 4, 5)]
Because Haskell is statically typed, it can verify this at compile time.

If you think in terms of relational databases, the term tuple in particular makes a lot of sense in this way. A result set from a database query would be a list of tuples.

The arrival of the namedtuple also supports the notion that tuples should be thought of as combining several pieces of data of different natures, but of course this is not enforced in either tuples or named tuples.

Now, none of this is relevant to Python. Because of duck typing, a database query result set might as well be a list of lists or a tuple of tuples or something different altogether that emulates sequences. But I found it useful to understand where this syntax and terminology might have come from.

Looking at the newer classes set and frozenset, it might also help to sometimes think of a tuple as a frozenlist instead, because this is closer to the role it plays in Python.

Tuesday, November 22, 2011

plpydbapi: DB-API for PL/Python

One thing that's weird about PL/Python is that its database access API is completely different from the standard Python DB-API. It is similar to PL/Perl and PL/Tcl, and the C "SPI" API, from which they are all derived, but that's little help for a Python programmer. (The reasons for this are lost in history. Probably laziness.) Moreover, the two APIs use the same function names for different purposes.

So I set out to develop a DB-API compatible layer on top of PL/Python: plpydbapi

Example:

CREATE FUNCTION test() RETURNS void
LANGUAGE plpythonu
AS $$
import plpydbapi

dbconn = plpydbapi.connect()
cursor = dbconn.cursor()
cursor.execute("SELECT ... FROM ...")
for row in cursor.fetchall():
    plpy.notice("got row %s" % row)
dbconn.close()
$$;

Granted, it's more verbose than the native PL/Python syntax, so you might not want to use it after all. But it can be helpful if database calls are nested in some other modules, or you just don't want to learn another database access API.

This started out more as an experiment, but it turns out that with the many improvements in PL/Python in PostgreSQL 9.1, it's possible to do this. (Subtransaction control and exception handling were the big issues.) The one gaping hole is that there is apparently no way to get metadata out of a query result. Something to address in PostgreSQL 9.2, perhaps.

Thanks go to the DB-API compliance test suite, which was extremely helpful in making this happen. (Nonetheless, the test suite is quite incomplete in some regards, so treat the result with care anyway.)

Another thing that I found neat about this project is that I managed to get the unit tests based on Python's unittest module to run in the PL/Python environment inside the PostgreSQL server. That's the power of unittest2.

Friday, May 7, 2010

Update: PostgreSQL doesn't really need a new Python driver

A couple of months ago, there was a brief stir in the PostgreSQL community about the state of the Python drivers and there being too many of them. While we can't really do much anymore about there being too many, the most popular driver, Psycopg, has since had a revival:
  • A new release 2.0.14 has been issued.
  • The license has been changed to GNU LGPL 3, alleviating various concerns about the previous "hacked up" license.
Of course, this doesn't yet fix every single bug that has been mentioned, but it ought to be the foundation for a viable future. The mailing list has been quite active lately, and the Git repository has seen a bunch of commits, by multiple developers. So if you have had issues with Psycopg, get involved now. And thanks to all who have made this "revival" happen.

Monday, January 25, 2010

PostgreSQL: The Universal Database Management System

I'm glad you asked, since I've been pondering this for a while.  $subject is my new project slogan.  Now I'm not sure whether we can actually use it, because a) it's stolen from Debian, and b) another (commercial, proprietary) database product already uses the "universal database" line.

I have come to appreciate that the "universality" of a software proposition can be a killer feature.  For example, Debian GNU/Linux, the "universal operating system", might not be the operating system that is the easiest to approach or use, but once you get to know it, the fact that it works well and the same way on server, desktop, and embedded ensures that you never have to worry about what operating system to use for a particular task.  Or Python, it's perhaps not the most geeky nor the most enterprisy programming language, but you can use it for servers, GUIs, scripting, system administration, like few other languages.  It might as well be the "universal programming language".  A lot of other software is not nearly universal, which means that whenever you move into a new area, you have to learn a lot of things from scratch and cannot easily apply and extend past experiences.  And often the results are then poor and expensive.

The nice thing about PostgreSQL is that you never have to worry about whether to use it, because you can be pretty sure that it will fit the job.  Even if you don't care whether something is "open source" or "most advanced".  But it will fit the job.  The only well-known exception is embedded databases, and frankly I think we should try to address that.

Tuesday, January 12, 2010

Procedural Languages in PostgreSQL 8.5: The One That Works!

While much of the PostgreSQL hacker world is abuzz over two-letter acronyms (HS, SR, VF), I will second Andrew's post and will generalize this to say, partially tooting my own horn, of course, that the next PostgreSQL release will be a great one for procedural languages. Behold:
  • PL/pgSQL is installed by default.
  • New DO statement allows ad hoc execution of PL code.
  • PL/pgSQL finally got a sane parser.
  • PL/Perl got a shot in the arm.
  • PL/Python got saner data type handling, Unicode support, and Py3k support.
  • Not directly related, but the coming PL/Proxy features are looking promising as well.
  • (Meanwhile, language historians will be interested to know that PL/Tcl has received exactly zero feature or bug-fix commits since 8.4.)
This will be a great boost for PostgreSQL the development platform.

Tuesday, December 22, 2009

Patience

Occasionally, there are concerns expressed about the adoption rate of Python 3.  Now that PostgreSQL 8.5alpha3 is released with Python 3 support in PL/Python, let's see what the schedule might be until this hits real use.

Python 3.0 was released in December 2008 and was admitted to be somewhat experimental.  At that point, PostgreSQL 8.4 was already in some kind of freeze, so adding a feature as signicant as Python 3 support was not feasible at that point.

In fact, we opted to do two significant rounds of fixing/enhancing/refactoring of PL/Python before tackling Python 3 support: fixed byte string (bytea) support and Unicode support.  Both of those benefit Python 2 users, and they made the eventual port to Python 3 quite simple.

PostgreSQL 8.5 might release around May 2010.  Debian squeeze is currently in testing and doesn't even contain Python 2.6 or Python 3.1 yet, due to some technical problems and a bit of infighting.  Debian freezes in March 2010, which means without PostgreSQL 8.5 but hopefully with Python 2.6 and 3.1.  The final release of squeeze should then be later in 2010, which means the earliest that a significant number of Debian users are going to look into moving any of their code at all nearer to Python 3 (via 2.6) is going to be late 2010.  (Other operating system distributions will have other schedules, of course.)

The next Debian release (squeeze+1), which will presumably include PostgreSQL 8.5 or later and a solid Python 3.x environment, will then not be released before January 2012, if we believe the current 18-months-plus-slip cycle of Debian.  So it will be mid-2012 until significant numbers have upgraded Debian and PostgreSQL to the then-current versions.  If you are sticking to a stable and supported operating system environment, this is the earliest time you actually have the option to migrate to the Python 3 variant of PL/Python across the board for your applications.  Of course in practice this is not going to be the first thing you are going to do, so by the time you actually port everything, it might be late 2012 or even 2013.  Also, if you are heavily invested in PL/Python, you are probably not going to upgrade much your other Python code before PL/Python is ready.

This will then be 3 years after the PL/Python 3 code is written and 4 years after the release of Python 3.0.  And 2 years before Python 2.x is expected to go out of maintenance in 2015.

So, to all users and developers: patience.

Incidentally, I fully expect to still be using IPv4 by then. ;-)

Thursday, July 16, 2009

Adding Color to the Console: Pygments vs. Source-highlight

The other day I write about code syntax highlighting with less and source-highlight, when someone pointed out the Pygments package, that can do the same thing. This called for

the great source-highlight vs. pygments face-off

Installation

Source-highlight is in C++ and needs the Boost regex library. About 3.5 MB together. Pygments is in Python and has no dependencies beyond that. About 2 MB installed. No problem on either side.

In Debian, the packages are source-highlight and python-pygments. Note that the Pygments command-line tool is called pygmentize.

Source-highlight is licensed under the GPL, Pygments under a BSD license.

Getting Started

pygmentize file.c
writes a colored version of file.c to the standard output. Nice.
source-highlight file.c
writes a colored version of file.c to file.c.html. As I had written before, the correct invocation for this purpose is
source-highlight -fesc -oSTDOUT file.c
That makes pygmentize slightly easier to use, I suppose.

Supported Languages

Source-highlight supports 30 languages, Pygments supports 136.

Source-highlight can produce output for DocBook, ANSI console, (X)HTML, Javadoc, LaTeX, Texinfo. Pygments can produce output for HTML, LaTeX, RTF, SVG, and several image formats.

Source-highlight supports styles, but only ships a few. Pygments supports styles and filters, and ships over a dozen styles.

So more options with Pygments here.

Also note that Pygments is a Python library that can be used, say, in web applications for syntax highlighting. This is used in Review Board, for example. Source-highlight is just a command-line tool, but it could of course also be invoked by other packages. Horde uses this, for instance.

Speed

To process all the C files in the PostgreSQL source tree (709271 lines), writing the ANSI console colored version of each file.c to file.c.txt:
source-highlight
25 seconds
pygmentize
5 minutes
So for interactive use with less, source-highlight is probably still the better option.

Robustness

pygmentize gave me a few errors of this sort during the processing of the PostgreSQL source tree:
*** Error while highlighting:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 83: ordinal not in range(128)
(file "/usr/lib/python2.5/site-packages/Pygments-1.0-py2.5.egg/pygments/formatters/terminal.py", line 93, in format)

*** Error while highlighting:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 196-198: invalid data
(file "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode)
That probably shouldn't happen.

source-highlight gave no spurious errors in my limited testing.

Miscellaneous

Source-highlight can highlight its own configuration files, which are in a custom language, and Pygments' configuration files, which are in Python. Conversely, Pygments can of course highlight its own configuration files, but doesn't know what to do with those of Source-highlight.

Summary

I will stick with Source-highlight for interactive use, but Pygments is a great alternative when you need more formatting options or want to integrate the package as a library.