Showing posts with label Python. Show all posts

Tuesday, June 22, 2021

Python 3.x threading comparison

I've put a comparison of different Python runtimes here.

In short, CPython, but also Pypy3 and Nuitka, threaded poorly, while MicroPython, the Python for tiny systems, threaded quite well - at least on this embarrassingly parallel problem. See the graph at the link above.
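To make the comparison concrete, here is a minimal sketch (my own illustrative code, not the benchmark itself) of the kind of embarrassingly parallel, CPU-bound workload involved; under CPython's GIL, the threaded variant typically gains little over the serial one:

```python
import threading

def burn(n):
    """CPU-bound busywork: a pure-Python sum of squares, no I/O."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_serial(chunks, n):
    """Run the same workload `chunks` times, one after another."""
    return [burn(n) for _ in range(chunks)]

def run_threaded(chunks, n):
    """Run the same workload across `chunks` threads."""
    results = [None] * chunks

    def worker(idx):
        results[idx] = burn(n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    # Same answers either way; timing each is the interesting part.
    print(run_serial(4, 10_000) == run_threaded(4, 10_000))
```

Timing the two functions with `time.perf_counter()` is what separates the runtimes: on CPython the threaded version is no faster, while a runtime with real thread parallelism can approach a per-core speedup.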

Thursday, October 1, 2020

Fast-paced, seven-part intro to Python for developers on YouTube

Hi folks. I've uploaded to YouTube a fast-paced, seven-part intro to Python for developers who already know at least one other Turing-complete, imperative programming language. I hope people find it useful.

Saturday, November 17, 2018

Python, Rust and C performance doing MD5

I put a performance comparison between Python, Rust and C doing MD5 calculations, here.

Interestingly, CPython and Pypy came out on top, even beating gcc and clang.

Granted, CPython and Pypy are probably calling the highly-optimized OpenSSL, but it's still noteworthy that sometimes Python can be pretty zippy.
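For example, CPython's standard-library hashlib hands the per-byte MD5 work to optimized C (OpenSSL where available), so a loop like this spends almost no time in the interpreter:

```python
import hashlib

def md5_hex(data: bytes, chunk_size: int = 1 << 16) -> str:
    """Hash data incrementally, as you would when reading a large file."""
    h = hashlib.md5()
    for start in range(0, len(data), chunk_size):
        h.update(data[start:start + chunk_size])
    return h.hexdigest()

print(md5_hex(b"abc"))  # 900150983cd24fb0d6963f7d28e17f72
```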

Friday, March 2, 2018

The House Robber Problem


I've put a Genetic Algorithm-based solution to "The House Robber Problem" here.

The problem has us maximizing the value from houses robbed, subject to the constraint that no two adjacent houses can be robbed.
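The GA solution is linked above; for reference, the problem also has an exact O(n) dynamic-programming solution, sketched here:

```python
def rob(values):
    """Max total from non-adjacent houses: O(n) time, O(1) space."""
    take, skip = 0, 0  # best totals if we rob / don't rob the current house
    for v in values:
        take, skip = skip + v, max(take, skip)
    return max(take, skip)

print(rob([2, 7, 9, 3, 1]))  # 12  (rob houses worth 2, 9 and 1)
```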

Tuesday, February 6, 2018


I've put a simple, Python 3.6 website dead link checker here.

You give it one or more URLs to search through, and one or more URL prefixes to (mostly) remain under, and it does the rest.

It's intended to be shell-callable, and can output CSV or JSON.
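The crawl at the heart of such a checker can be sketched like this (illustrative code, not the tool's actual implementation): a breadth-first walk that only follows links found under the given prefixes and records URLs whose fetch fails. The fetch function is injected so the sketch stays network-free:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href targets of all <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def find_dead_links(start_urls, prefixes, fetch):
    """fetch(url) -> HTML text, raising on a dead link.  Returns dead URLs."""
    seen, dead = set(), []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except Exception:
            dead.append(url)
            continue
        if any(url.startswith(p) for p in prefixes):  # only crawl under the prefixes
            parser = LinkParser()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
    return dead
```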

I hope people find it useful.

Sunday, June 5, 2016

from-table

I've put from-table here.

It's a small python3 script that knows how to extract one or more HTML tables as CSV data.  You can give it a URL or a file.  It can extract to stdout or to a series of numbered filenames (one file per table).
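The core idea can be sketched with nothing but the standard library (this is illustrative code, not from-table's actual implementation): walk the HTML, collect `<td>`/`<th>` text into rows, and emit one CSV document per table.

```python
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Accumulate each <table> as a list of rows of cell strings."""
    def __init__(self):
        super().__init__()
        self.tables, self.row, self.cell = [], [], None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.tables:
            self.tables[-1].append(self.row)

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

def tables_to_csv(html):
    """Return one CSV string per <table> in the document."""
    extractor = TableExtractor()
    extractor.feed(html)
    out = []
    for table in extractor.tables:
        buf = io.StringIO()
        csv.writer(buf).writerows(table)
        out.append(buf.getvalue())
    return out
```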

I hope folks find it useful.

Sunday, October 18, 2015

Backshift: not that slow, and for good reason

  • Backshift is a deduplicating backup program in Python.
  • At http://burp.grke.org/burp2/08results1.html you can find a performance comparison between some backup applications.


  • The comparison did not compare backshift, because backshift was believed to have prohibitively slow deduplication.
  • Backshift is truly not a speed-demon. It is designed to:
    1. minimize storage requirements
    2. minimize bandwidth requirements
    3. emphasize parallel (concurrent backups of different computers) performance to some extent
    4. allow expiration of old data that is no longer needed
  • Also, it was almost certainly not backshift's deduplication that was slow, it was:
    1. backshift's variable-length, content-based blocking algorithm. This makes Python inspect every byte of the backup, one byte at a time.
    2. backshift's use of xz compression. xz packs files very hard, reducing storage and bandwidth requirements, but it is known to be slower than something like gzip that doesn't compress as well.
  • Also, while the initial fullsave is slow, subsequent backups are much faster because they do not reblock or recompress any files that still have the same mtime and size as found in 1 of (up to) 3 previous backups.
  • Also, if you run backshift on Pypy, its variable-length, content-based blocking algorithm is many times faster than if you run it on CPython. Pypy is not only faster than CPython, it's also much faster than CPython augmented with Cython.
  • I sent G. P. E. Keeling an e-mail about this some time ago (the date of this writing is October 2015), but never received a response.
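For the curious, variable-length, content-based blocking looks roughly like this (a toy sketch, not backshift's actual algorithm): a rolling hash over every byte decides where blocks end, so identical content produces identical blocks even after insertions shift offsets. The per-byte loop is exactly the kind of work that CPython does slowly and Pypy does quickly.

```python
def content_defined_blocks(data, mask=0xFFF, window=16):
    """Split bytes into blocks; a boundary falls where rolling & mask == mask."""
    blocks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = (rolling * 31 + byte) & 0xFFFFFFFF  # toy rolling hash
        if i - start >= window and (rolling & mask) == mask:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])  # whatever remains after the last boundary
    return blocks
```

With `mask=0xFFF`, boundaries fire on roughly 1 in 4096 bytes, giving ~4 KiB average blocks; real implementations also enforce minimum and maximum block sizes.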

    Wednesday, August 12, 2015

    Latest python sorted dictionary comparison


    I recently completed another sorted dictionary comparison, and thought I'd share the results.

    This time I've eliminated the different mixes of get/set.  It's all 95% set and 5% get now.

    Also, I added sorteddict, which proved to be an excellent performer.

    And I added standard deviation to the graph and collapsible detail.

    The latest comparison can be found here.

    HTH someone.

    fdupes and equivs3e


    I recently saw an announcement of fdupes on linuxtoday.

    Upon investigating it a bit, I noticed that it uses almost exactly the same algorithm as my equivs3e program.

    Both are intended to find duplicate files in a filesystem, quickly.

    The main difference seems to be that fdupes is in C, and equivs3e is in Python.  Also, fdupes accepts a directory in argv (like tar), while equivs3e expects to have "find /directory -type f -print0" piped into it (like cpio).

    However, upon doing a quick performance comparison, it turns out that fdupes is quite a bit faster on large collections of small files, and equivs3e is quite a bit faster on collections of large files.  I really don't know why the Python code is sometimes outperforming the C code, given that they're so similar internally.
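The shared algorithm is roughly this (a sketch, not either tool's actual code): group files by size first, then hash only the files whose sizes collide, so most files are never read at all.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Return lists of paths whose files have identical content."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        for path in same_size:
            with open(path, "rb") as f:
                by_hash[hashlib.md5(f.read()).hexdigest()].append(path)

    return [group for group in by_hash.values() if len(group) > 1]
```

Real tools refine this further, e.g. hashing just the first few kilobytes before hashing whole files, or comparing byte-for-byte to rule out hash collisions.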

    I've added a "related work" section on my equivs3e page that compares equivs3e and fdupes.

    Anyway, I hope people find one or both of these programs useful.

    Saturday, September 6, 2014

    Fibonacci Heap implementation in Pure Python


    I've put a Pure Python Fibonacci Heap (priority queue) implementation at:

    1. https://pypi.python.org/pypi/fibonacci-heap-mod
    2. http://stromberg.dnsalias.org/~strombrg/fibonacci-heap-mod/
    It passes pylint and pep8, is thoroughly unit tested, and runs on CPython 2.[67], CPython 3.[01234], Pypy 2.3.1, Pypy3 2.3.1 and Jython 2.7b3.

    It's similar to the standard library's heapq module.  The main differences are that Fibonacci heaps have better big-O bounds for some operations, and support decrease-key without having to remove and re-add.

    This Fibonacci heap implementation is also better abstracted than heapq.
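To see why decrease-key matters: heapq has no such operation, and the usual workaround (sketched below) is lazy deletion - push a fresh entry and skip stale ones on pop. A Fibonacci heap instead decreases a key in place, in O(1) amortized time.

```python
import heapq

class LazyPQ:
    """Priority queue over heapq with decrease-key via lazy deletion."""
    def __init__(self):
        self._heap, self._best = [], {}

    def set_priority(self, item, priority):
        """Insert, or 'decrease-key', by pushing a fresh entry."""
        self._best[item] = priority
        heapq.heappush(self._heap, (priority, item))

    def pop_min(self):
        while self._heap:
            priority, item = heapq.heappop(self._heap)
            if self._best.get(item) == priority:  # skip stale entries
                del self._best[item]
                return item
        raise IndexError("pop from empty queue")

pq = LazyPQ()
pq.set_priority("a", 5)
pq.set_priority("b", 3)
pq.set_priority("a", 1)   # "decrease-key" by re-pushing
print(pq.pop_min())  # a
```

The cost of lazy deletion is that stale entries linger in the heap until popped, so heavy decrease-key workloads are exactly where a Fibonacci heap pulls ahead.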

    Friday, January 17, 2014

    Python dictionary-like trees with O(log₂ n) find_min and find_max, and O(n) ordered traversal


    I've made some changes to the tree comparison I did in 2012.

    I've added a few more dictionary-like tree datastructures.

    I also changed the methodology: Instead of running 4 million operations for all datastructures, I've told it to run each test to 8 million ops, or 2 minutes, whichever comes first.  This allowed comparison of datastructures that perform well at one workload and poorly at another.

    I also told it to use both random and sequential workloads this time - some datastructures are good at one and poor at the other.

    The chief reasons to use a tree instead of a standard dictionary, are:

    • You get find_min and find_max methods, which run in O(log₂ n) time.  Standard dictionaries do this in O(n) time, which is much slower for large n.
    • You get ordered iteration in O(n) time.  Standard dictionaries do this in O(n log n) time, which is also much slower for large n.
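For a plain dict, the equivalent operations fall back on scanning all the keys:

```python
d = {"pear": 3, "apple": 1, "mango": 2}

find_min = min(d)     # O(n): must scan every key
ordered = sorted(d)   # O(n log n): a full sort on every call

print(find_min)   # apple
print(ordered)    # ['apple', 'mango', 'pear']
```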

    The latest version of the comparison is here, as a series of graphs and as an HTML click-to-expand hierarchy.  It's also linked from the URL at the top of this entry.

    Friday, January 3, 2014

    Backshift announcement


    Backshift is a deduplicating filesystem backup tool in Python that compresses input data very hard and supports removing old data to make room for new data.

    Backshift is thoroughly, automatically tested; is thoroughly documented; runs on a variety of Python interpreters; runs on a wide assortment of operating systems; and passes pylint.

    Files to backup are selected by piping "find /dir -print0" into backshift, and files are restored by piping backshift to tar xvfp.  Usage examples can be found here.

    Here's a table comparing backshift and related tools.

    Here's a list of backshift's features and misfeatures.

    My hope is it will be useful to some of you.

    Results of the Python 2.x vs 3.x survey


    The results are at https://wiki.python.org/moin/2.x-vs-3.x-survey

    Thanks.

    Tuesday, December 31, 2013

    I put up a page listing some of the datastructures I've worked on over the years.  Most are in Python.  It can be found here.

    Listed are a few dictionary-like datastructures (including one that wraps others to allow duplicates), a linked list, a hash table (in C), and a bloom filter.

    Monday, December 30, 2013

    Python 2.x vs 3.x usage survey


    I've put together a 9-multiple-choice-question survey about Python 2.x vs 3.x at https://www.surveymonkey.com/s/N5N5PG2 .

    If you have a free minute, I hope you'll take it.

    Saturday, June 23, 2012

    Python and tree datastructures: A performance comparison


    At http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/ I've put together a comparison of a few in-memory tree datastructures for Python.  I hope to add heaps to the comparison someday, hence the URL.

    The trees have each been made to run on CPython 2.x, CPython 3.x, Pypy and Jython, and they are all passing pylint now.

    I tossed out two of the datastructures I wanted to compare, for the following reasons:

    1. 2-3 trees were tossed because they were giving incorrect results.  This may or may not be my fault.  I had to add a simple class to make them work like a dictionary.
    2. Red-Black trees were tossed because they were very slow with large trees - so much so that on the graphs, the better-performing datastructures got squished together into one fat line, making them hard to read.  The problem seemed to be related to garbage collection - perhaps the red-black tree implementation was creating and discarding lots of temporary objects.  This too may or may not be my fault.

    The top three datastructures I continued to examine were:
    1. Splay trees - sometimes first in performance.
    2. AVL trees - sometimes first in performance.
    3. Treaps - sometimes first in performance, never worse than second in performance.




    Wednesday, June 6, 2012

    Python Trees (and treap) evaluated

    At http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/ I've put together a comparison of a few in-memory tree datastructures for Python.  I hope to add heaps to the comparison someday, hence the URL.

    The trees have each been made to run on CPython 2.x, CPython 3.x, Pypy and Jython, and they are all passing pylint now.

    I tossed out two of the datastructures I wanted to compare, for the following reasons:

    1. 2-3 trees were tossed because they were giving incorrect results.  This may or may not be my fault.  I had to add a simple class to make them work like a dictionary.
    2. Red-Black trees were tossed because they were very slow with large trees - so much so that on the graphs, the better-performing datastructures got squished together into one fat line, making them hard to read.  The problem seemed to be related to garbage collection - perhaps the red-black tree implementation was creating and discarding lots of temporary objects.  This too may or may not be my fault.

    The top three datastructures I continued to examine were:
    1. Splay trees - sometimes first in performance.
    2. AVL trees - sometimes first in performance.
    3. Treaps - sometimes first in performance, never worse than second in performance.