256 (random?) colors

This painting by Gerhard Richter is called 256 Colors. The painter is fully committed to this kind of work, as you can see here. When I visited the San Francisco Museum of Modern Art (SFMOMA) (I’m getting literate…), the guide asked the following question:

Do you think the colors are positioned randomly or not?

Not a trivial question, is it? And you, would you say it is random? This work dates back to 1974, when computer screens mainly displayed green letters on a black background. So it seems the artist did not benefit from computer assistance.

There are many ways to translate this plain-English question into statistical terms. For example, are the colors, ignoring their positions, uniformly distributed? (OK, this does not mean (true) randomness at all, but it is a question…) It would be nice to have the 256 colors in RGB. In this color model, (0,0,0) is black and (255,255,255) is white. I think there are rather more dark colors than light ones, i.e. more data points near the (0,0,0) vertex than near the opposite one in the RGB cube. So a test would probably reject uniformity.
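
We don’t have the painting’s actual RGB values, so here is only a sketch of what such a test could look like, on hypothetical data: bin the RGB cube into 2×2×2 = 8 octants and compare the counts to what uniformity would predict.

```python
# Hypothetical data: the actual 256 colors of the painting are not available,
# so we simulate 256 RGB triples just to illustrate the test.
import numpy as np

rng = np.random.default_rng(42)
colors = rng.integers(0, 256, size=(256, 3))  # stand-in for the painting

# Assign each color to one of the 8 octants of the RGB cube.
octants = (colors >= 128) @ np.array([4, 2, 1])  # octant index in 0..7
counts = np.bincount(octants, minlength=8)

# Chi-square statistic against the uniform expectation of 256/8 = 32 per octant.
expected = 256 / 8
stat = ((counts - expected) ** 2 / expected).sum()

# The 95% quantile of a chi-square with 8 - 1 = 7 degrees of freedom is about 14.07.
print(stat, stat > 14.07)
```

With the real colors, a rejection would support the “more dark than light colors” impression; with these simulated uniform colors the test should usually not reject.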

A more subtle way to interpret uniformity in the painting would be to take into account the position of the colors… Any idea how to check that? I have no clue.

Here is a larger one, 1024 colors…

True randomness, lava lamp and atmospheric noise

What is the link between a lava lamp and random number generation?

Assessing what is random and what is not is not easy; after all, a die or a coin flip is governed only by the basic laws of mechanics. We call them random only because they are hard to predict. When you ask your computer for a random number, there is nothing really random in it either, since a computer can only run predefined methods. This is illustrated by a famous quote from John von Neumann:

Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.

Well then, how can we really choose random digits? This website

http://www.random.org/

proposes random numbers generated from atmospheric noise picked up by a radio receiver. In some sense, there is probably nothing random in it either, but it is hard enough to predict that the numbers can be considered truly random. From the website:

People use RANDOM.ORG for holding drawings, lotteries and sweepstakes, to drive games and gambling sites, for scientific applications and for art and music.

And indeed on their website there are random generators for rolling dice, drawing cards, picking calendar dates or even Jazz scales.

Apart from using atmospheric noise, there are other fun ways to catch some randomness, including taking pictures of a lava lamp and analysing the patterns, or timing the decay of a radioactive source, which is considered truly random in the sense that a stochastic process rules the decay. Of course all these methods let you draw only a few numbers at a time, but that can still be useful as a seed for a pseudo-random generator, or in cryptography. More information can be found on this page.
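
As a small illustration of that last idea, here is one way to use the operating system’s entropy pool (which mixes hard-to-predict physical events such as interrupt timings) to seed an ordinary pseudo-random generator in Python:

```python
import os
import random

# Draw 8 hard-to-predict bytes from the OS entropy pool and turn them into
# an integer seed; everything after this line is deterministic given the seed.
seed = int.from_bytes(os.urandom(8), "big")
prng = random.Random(seed)

rolls = [prng.randint(1, 6) for _ in range(5)]  # five "dice" rolls
print(rolls)
```

A few truly unpredictable bytes are enough here: the pseudo-random generator then stretches them into as many numbers as needed.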

Cloud services useful for Stat researchers

This post is not about statistics, but merely about a few tools that can be convenient for many researchers, in my opinion. At least some researchers here at CREST use them every day! These tools belong to the cloud computing/services trend.

One thing anyone working on a computer is afraid of is losing their files. It’s probably the #1 nightmare of any PhD student. Many solutions make this nothing but a bad memory, a disaster from ancient times, like the Plagues of Egypt, the Behemoth or Claude François.

A good solution is to synchronize your files with a secure, reliable server you can count on. For instance you can presumably count on Google’s servers not to lose your documents (although you cannot count on them not to read them, but that’s a different story). So you can store your files on Google Docs (or a similar service) for instance. It’s free, but it gets painful if you have to do that every day. An interesting and very convenient solution is Dropbox. It consists of a little program that synchronizes the files of a given folder of your computer with servers provided by Dropbox Inc, a three-year-old startup. Their free offer allows you to store 2 GB, and of course you have to pay if you want more space. The good points are the following:

  • it works on Windows, Mac OS and Linux, and you don’t need administrator privileges to install it on Windows (so you can install it at CREST, for instance),
  • if you have it on all your computers, your working directory gets synchronized seamlessly at startup; if you are used to sending emails to yourself or to syncing your files on a USB key, you’ll definitely find it convenient,
  • you can share a sub-folder with other Dropbox users, which is convenient for a team project,
  • 2 GB is not a lot if you store pictures and music files, but it’s plenty for TeX and program files,
  • you can make a file “public” and get a URL for it, which is really useful if you want to share a file too big to be sent by email.

Overall my (office) life has improved since I started using it, but the big catch is that your data is stored on a private company’s servers, so you have to trust them at least to some extent. Since I don’t work on sensitive matters I don’t mind, but that can obviously be prohibitive.

Other startups provide similar offers, although I haven’t tested them: box.net, SugarSync… I’m surprised this kind of service is not proposed by the main Internet companies (like Google or Yahoo), but maybe their generally bad reputation for not respecting their customers’ privacy would make it hard for them to propose such a service.

One step further in these cloud offers, and more focused on statistics, is the ability to run stat programs online. Some startups propose this service as well, like Monkey Analytics. On this site you can store Matlab, R and Python programs and run them on their servers. This way they provide both a storage and a computing service. You can then access the results online, even from a smartphone. There is no free offer, but there is a 30-day free trial. I suppose it’s interesting for travelling statisticians, or for statisticians who need a lot of computing power (though they don’t give much information about their clusters). Then again, you have to trust the private company behind the service…

Using graphics cards to perform statistical computation

Hi!

For my first post I’m going to blog about a trendy topic in computational statistics: the increasingly common use of graphics cards to perform all kinds of scientific computation (see here), and in particular statistical computation.

The fun thing about graphics cards (fun in a weird and definitely nerdy way) is that they were built to display pictures (obviously), mainly for movies and video games. These cards became more and more powerful from the 90s until now, helping games achieve ever higher levels of photorealism. But I’m not sure their early developers realized how useful they could be for scientific computation.

In recent years it has become clear that, for a limited amount of money, say $300, you get much more computational power by buying a GPU (a graphics processing unit) than a CPU (a standard processor). The gain can be of order 100 or even more!… meaning a program that takes several hours on a standard CPU can run in less than a minute on a GPU. Of course it doesn’t work for every computation (otherwise GPUs would simply replace CPUs and there would be nothing to blog about).

Compared to a CPU, a GPU is extremely good at parallel computation. It means that it can do hundreds of little tasks at the same time, whereas a standard CPU with two cores can do… two tasks, basically. Algorithms consisting of many independent little tasks can therefore be parallelized; other algorithms cannot.

  • For example, if you have a vector (x_0, x_1, \ldots, x_n) and want to take the log of each component, you can do it in parallel, because you don’t need the result of log(x_i) to compute log(x_j), for any i \neq j. Similarly, evaluating the likelihood of a vector of independent variables can also be parallelized.
  • On the contrary, suppose you want to compute the Fibonacci numbers, defined by F_0 = F_1 = 1 and the recurrence relation F_n = F_{n-1} + F_{n-2}. Then you obviously need to compute the n-th term before the (n+1)-th one. A program doing this computation is therefore sequential. Any algorithm based on a recurrence is thus hard to parallelize, and that includes for instance the Metropolis-Hastings algorithm. That said, I’m working on using parallel computation to improve Metropolis-Hastings based estimation… maybe more about this in a future post.
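
The contrast between the two bullets can be seen in a few lines of Python: the elementwise log is a data-parallel operation (numpy applies it to all components at once, and a GPU could do the same), while the Fibonacci recurrence forces a sequential loop.

```python
import numpy as np

# Parallelizable: each log(x_i) is independent of all the others.
x = np.linspace(1.0, 100.0, 1_000_000)
logs = np.log(x)  # vectorized; a GPU could compute all components at once

def fibonacci(n):
    """Sequential: with F_0 = F_1 = 1, each term needs the two previous ones."""
    a, b = 1, 1  # a = F_0, b = F_1
    for _ in range(n):
        a, b = b, a + b  # after k iterations, a = F_k
    return a

print(fibonacci(10))  # -> 89
```

The log example scales to as many cores (or GPU threads) as you have; the Fibonacci loop cannot be split up, no matter how many cores are available.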

My PhD focuses on Sequential Monte Carlo (SMC) methods, also known as particle filters, which happen to be very easy to parallelize. Hence my interest in the topic! Instead of waiting hours for my computations to finish, I could have the results in (near) real time. But then I would have no excuse for playing table football.

If you are interested, there are plenty of links to check. Owning an NVIDIA device, I use CUDA and its Python counterpart, PyCUDA. Apparently you can also use CUDA in R with this. There is also a MATLAB plugin. An alternative is the open standard called OpenCL, which lets you program both NVIDIA and ATI devices (the two main graphics card designers), so it’s probably the best solution, or at least will be in the future. In the statistical literature there have been recent articles, like this one in the SMC context. This website from the University of Oxford gives lots of links, and so does this one from Duke University. Programming on a GPU already interests a lot of people, notably those with intensive computations to perform (financial engineers, physicists, biologists…); I hope you’ll find it useful too.

Stay tuned for more exciting posts!…

Penalty shootout in FIFA World Cups

While looking for a post on Paul the octopus, Jérôme told me about this surprising football stat:

the team that begins a penalty shootout is more likely to win than the team that follows.

Here is such a bad memory, with Trézeguet’s shot hitting the bar… And yes, Italy shot first!

What do the data say about that?

Here are results on penalty shootouts in five main tournaments over the past decades: the World Cup, European Championship, Copa América, African Cup of Nations and Asian Cup. The trouble is that a crucial piece of information is missing: there is no mention of which team shot first. I managed to gather 14 data points in the following way (only for the World Cups): either using the number-of-penalties-taken column (which reveals the order when the numbers differ), or watching the relevant YouTube videos. The outcomes are shown below (please comment if I misinterpreted any result, or if you know a clever way to get more data; I’m no football scholar!): the team shooting first won 11 times out of 14!

Now, the question is to see whether this difference is statistically significant or not.

A simple model is the following: the random variable X_i is 1 when the match labelled i is won by the first team to shoot, and 0 otherwise (n=14,\; i=1,\ldots,n). Denote by p the probability that X_i=1. Then testing H_0 :\,p=1/2 against H_1 :\,p\neq 1/2 can be done via a \chi^2 test (or a Wald test, which has the same statistic in the Bernoulli model).

Under hypothesis H_0, the statistic T_n=4n(\bar X-1/2)^2 follows a \chi^2(1) distribution asymptotically. We have T_n=4.57. Compared to the 95% quantile of a \chi^2(1) variable, q=3.84, we can state that the probability of success is significantly higher for the team shooting first.
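
For completeness, this little computation can be checked in a few lines, using the identity P(\chi^2(1) > t) = \mathrm{erfc}(\sqrt{t/2}) so that no stats library is needed:

```python
import math

# 11 wins for the first-shooting team out of n = 14 shootouts.
n, wins = 14, 11
xbar = wins / n
T = 4 * n * (xbar - 0.5) ** 2  # T_n = 4n(xbar - 1/2)^2

# Tail probability of a chi-square(1): P(T > t) = erfc(sqrt(t/2)).
pval = math.erfc(math.sqrt(T / 2))

print(round(T, 2), round(pval, 3))  # T ≈ 4.57, p-value ≈ 0.03
```

The p-value of about 3% is consistent with rejecting H_0 at the 5% level, matching the comparison with the quantile q = 3.84.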

Any explanation? We can guess that the team shooting second is under more pressure than the first, and fails more often when trying to equalize. Indeed, a player whose shot wins the match for his team if he scores converts on average 93% of the time, against roughly 52% when a miss would lose the match…

PS: should Spain – Netherlands end up in a penalty shootout, football analysts say it would be in Spain’s interest. Indeed the Netherlands are among the 5 worst nations at this exercise (along with the UK). What if the Netherlands win the coin toss?

3 shootouts won by the team shooting second (year, stage, winner, loser, match score, shootout score, penalties taken):

1990 SF Argentina Italy 1-1 4-3 4 & 5
1990 SF West Germany England 1-1 4-3 4 & 5
1994 Final Brazil Italy 0-0 3-2 4 & 5

against 11 won by the team shooting first:

1986 QF West Germany Mexico 0-0 4-1 4 & 3
1998 Last 16 Argentina England 2-2 4-3 5 each
1998 QF France Italy 0-0 4-3 5 each
1998 SF Brazil Netherlands 1-1 4-2 4 each
2002 QF South Korea Spain 0-0 5-3 5 & 4
2006 Qualifier Australia Uruguay 0-1 1-0 4-2 5 & 4
2006 Last 16 Ukraine Switzerland 0-0 3-0 4 & 3
2006 QF Germany Argentina 1-1 4-2 4 each
2006 QF Portugal England 0-0 3-1 5 & 4
2006 Final Italy France 1-1 5-3 5 & 4
2010 Last 16 Paraguay Japan 0-0 5-3 5 & 4

Puzzle at the beach

When we were younger, we used to play a game of skill as a family at the beach. A simplified two-player version goes like this. Find two light stones. Draw a straight line in the sand. Step back a little, so that throwing your stone close to the line becomes difficult enough.

The objective of the game is to throw your stone on the good side of the line (a throw beyond the line means you lose) and, of course, closer to the line than your opponent. It is played sequentially… which favours Player 2.

In the following outcome of the game, Player 2 wins.

The questions are:

  • at what distance d from the line should Player 1 aim his throw? (see picture below)
  • what is his probability of winning?

The model can be the following: a throw aimed at a given point lands at a position distributed as a standard Gaussian \mathcal{N}(0,1) around this point. It means in particular that both players are equally skillful, and that the distance does not alter the precision. Assume that if Player 1 throws beyond the line, then Player 2 wins without even throwing. Assume finally that once Player 1 has successfully thrown his stone, Player 2 rationally maximizes his chances by aiming right in between the line and Player 1’s stone.
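
For those who want to play before the answers are posted, here is a Monte Carlo sketch of this model (the grid of aiming distances and the number of simulations are arbitrary choices of mine, not part of the puzzle):

```python
import numpy as np

def p1_win_prob(d, n_sims=200_000, seed=0):
    """Estimate Player 1's winning probability when aiming at distance d."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(d, 1.0, n_sims)   # Player 1's throws (negative = beyond line)
    x2 = rng.normal(x1 / 2.0, 1.0)    # Player 2 aims halfway between line and x1
    p1_ok = x1 >= 0                   # Player 1 stays on the good side
    p2_wins = (x2 >= 0) & (x2 < x1)   # Player 2 on the good side and closer
    return np.mean(p1_ok & ~p2_wins)

# Scan a grid of aiming distances to see which d looks best for Player 1.
grid = np.linspace(0.0, 3.0, 31)
probs = [p1_win_prob(d) for d in grid]
best = grid[int(np.argmax(probs))]
print(best, max(probs))
```

Using the same seed for every d acts as common random numbers, which makes the comparison across the grid much less noisy than independent simulations would be.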

Good luck. (answers in a few days)

Funny symbols in LaTeX

I’m sure you have always dreamed of funny symbols for your Beamer presentations or LaTeX documents. Standing ovations guaranteed at all your conferences and seminars.

Just take a look at this PDF document and enjoy the simpsons package.

\usepackage{simpsons}


Holidays coming soon

This blog has barely opened and I’m already speaking of holidays… but it seems relevant for statisticians: the blog Information Is Beautiful points out this World Map of Touristyness.


It utilizes pictures posted on Panoramio, an interesting way to use collaborative databases on the Internet. Too bad the map’s seemingly good quality at this scale does not actually allow zooming in to city scale. Other maps created for cities can be found here on Flickr.

While the first map seems to use only picture density, the city maps take advantage of more information: how long a user has been publishing pictures in a given city, and whether they have published pictures in other cities, is used to distinguish tourist (red) from local (blue) photographers.

Welcome!

Hi everyone!

Welcome to this new blog written by some PhD students and postdocs at CREST. The purpose of this blog is to share random stuff about statistics / econometrics / computing and anything more or less connected. Actually we didn’t really think about the purpose; we were too busy finding the blog’s name.
