Python Minimization Functions: Is the code optimal?

Question

I am rather new to Python and am porting my R functions to it. I have written the functions below; the primary function is called thetaMax. It minimizes the negative log-likelihood or, optionally, the negative log-posterior, depending on the option selected by method. The function works exactly as intended, and I have unit tested it against my R code.

However, because I am not familiar with Pythonic ways of doing this, I am curious which parts of the code below could be improved to 1) make the code shorter or more compact and 2) improve speed by taking advantage of vectorization where possible.

A complete reproducible example is below:

import numpy as np
from scipy.stats import binom  
from scipy.optimize import minimize  
from scipy.stats import norm

def prob3pl(theta, a, b, c, D = 1.7): 
    result = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return(result)        

def gpcm(theta, d, score, a, D = 1.7):
    Da = D * a
    result = np.exp(np.sum(Da * (theta - d[0:score])))/np.sum(np.exp(np.cumsum(Da * (theta - d))))
    return(result)

def thetaMax(x, indDichot, a, b, c, D, d, method = 'mle', **kwargs):
    method_options = ['mle', 'map']
    optional_args = kwargs
    if method not in method_options:
        raise ValueError("Invalid method. Expected one of: %s" % method_options)
    if method == 'map' and 'mu' not in optional_args:
        raise ValueError("You must enter a value for 'mu' for the posterior distribution. Example, mu = 0")
    if method == 'map' and 'sigma' not in optional_args:
        raise ValueError("You must enter a value for 'sigma' for the posterior distribution. Example, sigma = 1")        
    x1 = x[indDichot]
    x2 = np.delete(x, indDichot)
    result = [0] * len(x2)
    def fn(theta):
        if(len(x1) > 0):
            p = prob3pl(theta, a, b, c, D = D)
            logDichPart = np.log(binom.pmf(x1,1,p)).sum()
        else: 
            logDichPart = 0
        if(len(x2) > 0):
            for i in range(0,len(x2)):
                result[i] = gpcm(theta, d = d[i], score = x2[i], a = 1, D = D)        
            logPolyPart = np.log(result).sum()
        else:
            logPolyPart = 0
        if(method == 'mle'):
            LL = -(logDichPart + logPolyPart)
        elif(method == 'map'):        
            normal = np.log(norm.pdf(theta, loc = optional_args['mu'], scale = optional_args['sigma']))
            LL = -(logDichPart + logPolyPart + normal)
        return(LL)
    out = minimize(fn, x0=0)
    return(out)


### In order to run
d = np.array([[0, -1, .5, 1],[0,-.5,.2,1]])
a = np.array([1,1,1,1,1])
b = np.array([-1,.5,-.5,0,2])
c = np.array([0,0,0,0,0])
#x = np.array([1,1,0,1,0])
x = np.array([1,1,0,1,0,1,1])
indDichot = range(0,5,1) 

object = py.thetaMax(x,indDichot,a,b,c,D=1,d = d)

This is not a "complete reproducible example": NameError: name 'py' is not defined
– AJNeufeld, Commented Apr 28, 2020 at 18:10

AJNeufeld · Accepted Answer · 2020-04-28 18:34:07Z

PEP-8 Violations

The Style Guide for Python Code enumerates many conventions that all Python code should follow. You have deviated from these conventions in several areas:

mixedCase is discouraged. Function names, method names and variable names should all be snake_case. This means thetaMax should be called theta_max. (Exceptions are allowed for things like Da, where consistency with mathematical notation is more important than consistency with Python style.)
return is not a function call; it should not have parentheses. That is, return(result) should be written as return result.
if is also not a function call, nor is Python a C-like language; the condition should not be wrapped in parentheses. if(method == 'mle'): should be written as if method == 'mle':
Binary operators should have one space before and after them. From my quick perusal of the code, this is only violated by the division operator in gpcm.
Commas should be followed by one space.
The equal sign used in keyword=parameter should not have a space before or after it. This applies both to function definitions, such as def gpcm(theta, d, score, a, D=1.7):, and to function calls, such as p = prob3pl(theta, a, b, c, D=D)
Built-in Python identifiers should not be redefined without reason. object is such an identifier. A better name should be used.
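As a rough illustration of the points above, here is how the two helper functions might look after applying these conventions. The name prob_3pl and the intermediate names numerator and denominator are my own choices, not something the original code prescribes; the arithmetic is copied unchanged.

```python
import numpy as np


def prob_3pl(theta, a, b, c, D=1.7):
    # Same formula as prob3pl in the question, with PEP-8 spacing and naming.
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))


def gpcm(theta, d, score, a, D=1.7):
    # Da is kept in mixed case to match the mathematical notation.
    Da = D * a
    numerator = np.exp(np.sum(Da * (theta - d[:score])))
    denominator = np.sum(np.exp(np.cumsum(Da * (theta - d))))
    return numerator / denominator
```

Splitting the long expression in gpcm into two named parts also makes the spacing around the division operator easy to get right.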

Truthiness of lists.

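A built-in container (list, tuple, dict, set, str) used in a boolean context is truthy if the container is not empty, and falsy if the container is empty. There is no need to fetch the length of the container and test whether that value is greater than zero. Be aware that NumPy arrays do not follow this rule: taking the truth value of an array with more than one element raises a ValueError, so for an array such as x1 in this code, test x1.size or keep the len() check. For a plain list: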

if x1:

is more "Pythonic" than:

if(len(x1) > 0):
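One caveat when applying this to the code under review: x1 there is a NumPy array rather than a list, and arrays with more than one element refuse to be used as a boolean. The sketch below shows the difference; has_items is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np


def has_items(values):
    # Works for lists, tuples and NumPy arrays alike.
    return np.size(values) > 0


empty_list = []
if not empty_list:
    list_message = "an empty list is falsy"

x1 = np.array([1, 1, 0, 1, 0])
try:
    bool(x1)
    array_truth_ambiguous = False
except ValueError:
    # NumPy cannot decide whether "any" or "all" elements are meant.
    array_truth_ambiguous = True
```

For arrays, x1.size > 0 or len(x1) > 0 states the intent without ambiguity.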

Ranges Start at Zero by Default

By default, all ranges start at 0. This means range(0, len(x2)) is much more commonly written as range(len(x2)).

Iteration over a container

Python is a dynamic language in which an object can supply its own implementation of many operations, including subscripting and iteration. As a result, Python cannot optimize away the repeated index lookups in code like:

for i in range(0, len(x2)):
    ... use the value x2[i] ...

It will usually be more efficiently written as:

for x2_i in x2:
    ... use the value x2_i ...

If the index is needed along with the values, then enumerate() is used for the most efficient result:

for i, x2_i in enumerate(x2):
    result[i] = gpcm(theta, d=d[i], score=x2_i, a=1, D=D)

Iterating over two (or more) parallel lists (d and x2) would be done using zip:

for d_i, x2_i in zip(d, x2):
    ... = gpcm(theta, d=d_i, score=x2_i, a=1, D=D)

And if the indices are also needed, enumerate(zip(...)):

for i, (d_i, x2_i) in enumerate(zip(d, x2)):
    result[i] = gpcm(theta, d=d_i, score=x2_i, a=1, D=D)

But building up a complete list is more commonly done with a list comprehension:

result = [gpcm(theta, d=d_i, score=x2_i, a=1, D=D) for d_i, x2_i in zip(d, x2)]

which eliminates the need for the result = [0] * len(x2) pre-allocation.
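To check that these forms really are interchangeable, here is a small self-contained comparison using the gpcm function and the d array from the question. The values theta = 0.0 and x2 = [1, 1] are illustrative assumptions (x2 matches what np.delete would leave from the example x).

```python
import numpy as np


def gpcm(theta, d, score, a, D=1.7):
    Da = D * a
    return (np.exp(np.sum(Da * (theta - d[:score])))
            / np.sum(np.exp(np.cumsum(Da * (theta - d)))))


d = np.array([[0, -1, .5, 1], [0, -.5, .2, 1]])
x2 = np.array([1, 1])   # assumed polytomous scores
theta = 0.0             # assumed evaluation point

# Index-based loop with a pre-allocated list, as in the original code.
by_index = [0] * len(x2)
for i in range(len(x2)):
    by_index[i] = gpcm(theta, d=d[i], score=x2[i], a=1, D=1)

# Same values from a list comprehension over the zipped inputs.
by_zip = [gpcm(theta, d=d_i, score=x2_i, a=1, D=1)
          for d_i, x2_i in zip(d, x2)]
```

Both lists hold the same two probabilities; the comprehension simply drops the index bookkeeping.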

Keyword Arguments

What arguments can be passed to thetaMax()?

The answer is "any". The user has no way of knowing what is possible, or allowed. You can pass optimization_level=18 without error ... and without effect.

It would be safer to define the function with the allowed keyword arguments explicitly:

def thetaMax(x, indDichot, a, b, c, D, d, method='mle', *, mu=None, sigma=None):

    if method not in {'mle', 'map'}:
        raise ValueError("Invalid method")
    if method == 'map':
        if mu is None or sigma is None:
            raise ValueError("Both mu= and sigma= keyword arguments must be given")
    elif method == 'mle':
        if mu is not None or sigma is not None:
            raise ValueError("Neither mu= nor sigma= is appropriate for 'mle'")

As a bonus, having the values already extracted into their own variables is faster than looking up optional_args['mu'] and optional_args['sigma'] again on every call of fn() that minimize makes.
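The effect of explicit keyword-only parameters can be seen in isolation with a stripped-down sketch. check_method_args and raises are hypothetical helpers written for this illustration; they carry only the validation logic, not the optimization.

```python
def check_method_args(method='mle', *, mu=None, sigma=None):
    # Hypothetical stand-in for the argument checks at the top of thetaMax.
    if method not in {'mle', 'map'}:
        raise ValueError(f"Invalid method {method!r}; expected 'mle' or 'map'")
    if method == 'map' and (mu is None or sigma is None):
        raise ValueError("Both mu= and sigma= keyword arguments must be given")
    if method == 'mle' and (mu is not None or sigma is not None):
        raise ValueError("Neither mu= nor sigma= is appropriate for 'mle'")
    return method, mu, sigma


def raises(exc_type, func, *args, **kwargs):
    # Small helper: report whether the call raises the given exception type.
    try:
        func(*args, **kwargs)
    except exc_type:
        return True
    return False


# An unknown keyword is now rejected by Python itself.
unknown_rejected = raises(TypeError, check_method_args, 'mle',
                          optimization_level=18)
# A missing sigma for 'map' is rejected by the explicit check.
missing_sigma_rejected = raises(ValueError, check_method_args, 'map', mu=0)
```

With **kwargs, the first call would have succeeded silently; with named parameters, the misspelled or unsupported option fails at the call site.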

Additional speed may come from defining different fn() functions for the 6 different combinations of len(x1) > 0, len(x2) > 0, and method, thus removing the conditionals from inside the nested calls.
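One way to get a similar effect without writing each variant by hand is to decide once, outside the objective, which terms apply and then sum only those. This is a sketch of that idea rather than a drop-in replacement: build_objective and the toy term functions are hypothetical, and whether it is measurably faster for this problem would need profiling.

```python
def build_objective(log_dichot=None, log_poly=None, log_prior=None):
    # Keep only the terms that apply; the choice is made once, up front.
    terms = [t for t in (log_dichot, log_poly, log_prior) if t is not None]
    if not terms:
        raise ValueError("At least one log-likelihood term is required")

    def fn(theta):
        # No branching on data layout or method inside the hot path.
        return -sum(term(theta) for term in terms)

    return fn


# Toy stand-ins for the real log-likelihood and log-prior terms.
def toy_log_poly(theta):
    return -(theta - 1.0) ** 2


def toy_log_prior(theta):
    return -0.5 * theta ** 2


fn_mle = build_objective(log_poly=toy_log_poly)
fn_map = build_objective(log_poly=toy_log_poly, log_prior=toy_log_prior)
```

Either fn_mle or fn_map could then be handed to minimize in place of the single fn that re-checks its conditions on every evaluation.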

Wow, what a thorough and helpful review. Thank you!

– user350540, Commented Apr 28, 2020 at 20:13
