Finding pair of numbers that minimize a function

Question

This is a problem of a past bioinformatic contest (requires an account). My solution works but is too slow for some test cases.

Input format

The first line of the input contains one integer \$T\$, \$(1 \leq T \leq 3)\$ the number of test cases. Each test case is specified by four lines.

The first line of each test case contains three integer numbers \$M\$, \$K\$, \$N\$.

The second line contains \$M\$ numbers \$m_i\$ − masses of metabolites \$(0 < m_i\le 1000)\$.

The third line contains \$K\$ numbers \$a_i\$ − masses of adducts \$(-1000 \le a_i \le 1000)\$.

The fourth line contains \$N\$ numbers \$s_i\$ − masses of signals \$(0 < s_i\le 1000)\$.

All the masses are indicated with exactly six decimal places.

Output format

For each signal \$s_i\$ of each test case, print numbers \$j\$ and \$k\$ such that \$s_i = m_j+a_k+\Delta\$, \$m_j+a_k > 0\$ and the absolute value of \$\Delta\$ is smallest possible. If there are multiple numbers \$j\$ and \$k\$ with same absolute value of \$\Delta\$ for some signal, you can print any of them.

Sample input

3
2 2 5
1.000002 0.000002
0.500000 -0.500000
0.500001 0.500002 0.500003 1.000000 0.000001
2 2 5
1.000002 0.000001
0.500000 -0.500000
0.500001 0.500002 0.500003 1.000000 0.000001
5 4 7
0.000001 0.000002 0.000003 0.000004 0.000005
0.000002 0.000010 0.000001 -0.000001
0.000001 0.000002 0.000100 0.000005 0.000020 0.000010 0.000003

Sample output

Test cases

1.txt: \$M,K,N≤10\$
3.zip: \$M,K≤1000;N≤10^5\$
4.zip: \$M≤10^6;K,N≤1000\$
5.zip: \$M,K,N≤10^4\$
answers.zip: to test the solution

Code

from bisect import bisect_left
from time import perf_counter as pc

# Find in arr the closest number to n
def take_closest(arr, n):
    pos = bisect_left(arr, n)
    if pos == 0:
        return arr[0]
    if pos == len(arr):
        return arr[-1]
    before = arr[pos - 1]
    after = arr[pos]
    if after - n < n - before:
        return after
    else:
        return before


def solve(masses, adducts, signals):
    totals = {}
    for i, m in enumerate(masses):
        for j, a in enumerate(adducts):
            ma = m + a
            if ma > 0:
                totals[ma] = (i + 1, j + 1)
    skeys = sorted(totals.keys())
    for s in signals:
        closest = take_closest(skeys, s)
        yield totals[closest]


if __name__ == "__main__":
    test_num = 3
    of = open(f"out{test_num}.txt", "w")
    with open(f"{test_num}.txt", "r") as f:
        t0 = pc()
        t = int(f.readline())
        for _ in range(t):
            M, K, N = map(int, f.readline().strip().split())
            masses = list(map(float, f.readline().strip().split()))
            adducts = list(map(float, f.readline().strip().split()))
            signals = list(map(float, f.readline().strip().split()))

            for j, k in solve(masses, adducts, signals):
                of.write(f'{j} {k}\n')
        t1 = pc()
        print(f"Runtime: {round(t1-t0,3)} s")
    of.close()

Algorithm:

Store all sums \$m_i + a_j\$ in a dictionary with indices \$i,j\$ as values.
Sort signals
For each signal, find the closest number among the sorted keys of the dictionary using binary search.

Issues:

The solution works but is too slow for test case 4, while test case 5 takes around 10 minutes on my machine.

Any feedback is appreciated.

You can avoid creating a dict+sorted list: 1. Sort adducts. 2. For each signal[i]-mass[j] look for (bisect) adduct. — Pavlo Slavynskyy
– Pavlo Slavynskyy, Commented Jun 30, 2021 at 9:50
@PavloSlavynskyy Thanks, I'll try your idea but at the moment I am not sure if that will be enough. masses would need to be scanned for each signal, and M is kind of large in test case 4. Feel free to post an answer, will help me to understand better your idea. — Marc
– Marc, Commented Jun 30, 2021 at 10:32
Sorry, this is code_review, not write_my_code. But i've tested the idea - it's under 2 minutes for test case 5 on my i5-7500. — Pavlo Slavynskyy
– Pavlo Slavynskyy, Commented Jun 30, 2021 at 10:45
@PavloSlavynskyy I wasn't asking for code, but for a more formal answer. Comments should be for clarifications as far as I know. — Marc
– Marc, Commented Jun 30, 2021 at 11:07
The only technological constraint I can see on the problem page is "using programming". Does it stipulate Python? — Reinderien
– Reinderien, Commented Jun 30, 2021 at 17:22

Pavlo Slavynskyy · Accepted Answer · 2021-06-30 12:40:45Z

The complexity of your algorithm is:

Creating a dict of values: \$O(MNlog(MN))\$ (Loops for M and for N, log for adding into dict).

Sorting keys: \$O(MNlog(MN))\$ (sorting a list of the size MN)

Looking up: \$O(Klog(MN))\$ (because we do K lookups among an MN-sized list).

Total will be \$O((MN+K)log(MN))\$, for the case \$N=M=K\$ it will be \$O(N^2log(N^2))\$, not great, not terrible.

\$\Delta\$ is represented as an expression for 3 variables; to search for minimum (with log complexity), we still need to fix two other variables, which gives us \$O(N^2log(N))\$ - this is slightly better, but I think worth a try. The question is what variables should we loop over, and what to use for bisection search. The task is to find for every signal - so, we should have a loop over it. So the idea is something like this:

adducts_dict = {adducts[k]:k for k in range(len(adducts))}
adducts = sorted(adducts)
for s in signals:
    for j, m in enumerate(masses):
        #bisect find minimal distance from s-m to adduct; save that adduct and j
    yield (closest_j, adducts_dict[closest_adduct])

The complexity here will be: \$O(Klog(K) + Klog(K)+N(Mlog(K)+log(K))) = O((MN+K)log(K)\$. Slightly better.

One thing more: adducts and masses can be treated equally in the expression for \$\Delta\$ and swapped; this will give us \$O((KN+M)log(M))\$. It looks like it will be good to keep greatest of (M,N) added, not multiplied, so for the test case 4 you should sort and bisect search masses, not adducts (just swap the arrays and resulting pairs).

Reinderien · Accepted Answer · 2021-07-03 21:16:03Z

A minor addition: the "sample" test case is ambiguous. Consider:

$$s = 0.500,001$$ $$s - 1.000,002 - (-0.5) = s - 0.000,002 - 0.5 = -0.000,001 $$

With synthesized tests you don't want to allow for nondeterministic behaviour. Better to choose inputs that unambiguously point toward one correct answer.

LP: Don't do this

It's possible (though not advisable) to reframe your implementation as a mixed-integer linear programming problem where:

the structural variables are binary selection coefficients into the metabolite and adduct vectors
there are three auxiliary variables: to minimize the objective, and one for each of metabolite and adduct to enforce exactly one choice
since an abs needs to be applied, it requires two passes per value of s

This works(ish) but is very slow.

#include <assert.h>
#include <limits.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <glpk.h>


#define VERBOSE 1

const double epsilon = 1e-10;
        

static void fatal(const char *msg) {
    fprintf(stderr, "%s\n", msg);
    exit(1);
}


static void pfatal(const char *msg) {
    perror(msg);
    exit(1);
}


static void usage(const char *cmd) {
    fprintf(stderr, "Usage: %s problem-number [...]\n", cmd);
    exit(1);
}


static void open_files(int i_problem, FILE **file_in, FILE **file_ans) {
    char filename_in[NAME_MAX], filename_ans[NAME_MAX];
    
    snprintf(filename_in, NAME_MAX, "%d.txt", i_problem);
    *file_in = fopen(filename_in, "r");
    if (!*file_in) 
        pfatal("Failed to open input file");
    
    snprintf(filename_ans, NAME_MAX, "ans%d.txt", i_problem);
    *file_ans = fopen(filename_ans, "r");
    if (!*file_ans) 
        pfatal("Failed to open output file");
}


static void read_line(FILE *file, char *line, int n) {
    if (!fgets(line, n, file))
        pfatal("Input I/O");
        
    if (line[strlen(line) - 1] != '\n')
        fatal("Input line too long");
}


static void read_ints(FILE *file, int *array, int n) {
    const int field_chars = 12, buf_size = n*field_chars;
    char *line = malloc(buf_size);
    if (!line)
        pfatal("No memory for line");
    read_line(file, line, buf_size);
    
    const char *field = line;
    for (int i = 0; i < n; i++) {
        int consumed;
        if (sscanf(field, "%d%n", array + i, &consumed) != 1)
            pfatal("Bad input");
        field += consumed;
    }
    
    free(line);
}


static double *read_doubles(FILE *file, int n) {
    const int field_chars = 12, buf_size = n*field_chars;
    char *line = malloc(buf_size);
    if (!line)
        pfatal("No memory for line");
    read_line(file, line, buf_size);
    
    double *array = malloc(n * sizeof(double));
    if (!array)
        pfatal("No memory for input");
    
    const char *field = line;
    for (int i = 0; i < n-1; i++) {
        int consumed;
        if (sscanf(field, "%lf%n", array + i, &consumed) != 1)
            pfatal("Bad input");
        field += consumed;
    }
    
    free(line);
    return array;
}


static void read_case(
    FILE *file_in, 
    int *M, int *K, int *N,
    double **m, double **a, double **s
) {
    char line[256];
    read_line(file_in, line, sizeof(line));
    if (sscanf(line, "%d %d %d\n", M, K, N) != 3)
        fatal("Incorrect test case header format");
    
    printf("M=%d K=%d N=%d  ", *M, *K, *N);
    #if VERBOSE
        putchar('\n');
    #endif
    fflush(stdout);
    
    if (*M < 1) fatal("Out-of-range M");
    if (*K < 1) fatal("Out-of-range K");
    if (*N < 1) fatal("Out-of-range N");
    
    *m = read_doubles(file_in, *M),  // metabolites
    *a = read_doubles(file_in, *K),  // adducts
    *s = read_doubles(file_in, *N);  // signals
}


/*
For each given s, choose one m and one a to minimize |s - m - a|.
Show the indices in m and a.

In GLPK terms,
  x[:M+K]:   structural "col" variables, the actual selection coefficients
  x'[:3]:    auxiliary "row" variables, to constrain the solution
  z: objective, should approach s
  c: objective coefficients, equal to m and a concatenated
  A: constraint coefficients, three constraint rows, one col for each m,a
  l, u: lower and upper bounds
  
      c
z = [m m a a a][x]
               [x]
               [x]
               [x]
               [x]

        A
[x']   [m m a a a][x]
[x'] = [1 1 0 0 0][x]
[x']   [0 0 1 1 1][x]
                  [x]
                  [x]
                  
Min|maximize z = cx subject to x' = Ax, l <= x <= u, l' <= x' <= u'
Synthesizing "minimize abs(s - m - a)" translates to:
 - Maximize m+a subject to m+a <= s
 - Minimize m+a subject to m+a >= s
 - Take whichever solution is closer to s
*/


static glp_prob *make_prob(
    int i_problem, int i_test,     
    int M, int K, const double *m, const double *a
) {
    glp_term_out(GLP_OFF);
    glp_prob *lp = glp_create_prob();
    char name[64];
    snprintf(name, sizeof(name), "stepik-bioinfo-2021-%d.%d", i_problem, i_test);
    glp_set_prob_name(lp, name);
    glp_set_obj_name(lp, "m+a");
    
    // auxiliary "row" variables:
    //   0: tracking the objective function, to enforce minimum or maximum
    //   1: metabolite selection sum equal to 1
    //   2: adduct selection sum equal to 1
    glp_add_rows(lp, 3);
    glp_set_row_name(lp, 1, "objective_limit");
    glp_set_row_name(lp, 2, "fixed_sum_metabolite");
    glp_set_row_name(lp, 3, "fixed_sum_adduct");
    //  set_row_bnds(lp, 1) deferred to the min/max step
    glp_set_row_bnds(lp, 2, GLP_FX, 1, 1);
    glp_set_row_bnds(lp, 3, GLP_FX, 1, 1);
    
    // structural "column" variables, M+K selection vector of metabolites and 
    // adducts in [0, 1]
    glp_add_cols(lp, M + K);
    
    // The glpk array convention is dumb and 1-indexed, meaning every input
    // array needs a dummy prefix
    const int row_ind[4] = {INT_MIN, 1, 2, 3};
    
    char col_name[16];
    
    // Metabolites
    for (int i = 0; i < M; i++) {
        snprintf(col_name, sizeof(col_name), "m_%d", i+1);
        glp_set_col_name(lp, i+1, col_name);
        glp_set_col_kind(lp, i+1, GLP_BV);
        // implied: glp_set_col_bnds(lp, i+1, GLP_DB, 0, 1);
        glp_set_obj_coef(lp, i+1, m[i]);
        double constraints[4] = {NAN, m[i], 1, 0};
        glp_set_mat_col(lp, i+1, 3, row_ind, constraints);
    }
    
    // Adducts
    for (int i = 0; i < K; i++) {
        snprintf(col_name, sizeof(col_name), "a_%d", i+1);
        glp_set_col_name(lp, i+M+1, col_name);
        glp_set_col_kind(lp, i+M+1, GLP_BV);
        // implied: glp_set_col_bnds(lp, i+M+1, GLP_DB, 0, 1);
        glp_set_obj_coef(lp, i+M+1, a[i]);
        double constraints[4] = {NAN, a[i], 0, 1};
        glp_set_mat_col(lp, i+M+1, 3, row_ind, constraints);
    }
    
    return lp;
}


static int find_selected(glp_prob *lp, int n, int offset) {
    for (int i = 0; i < n; i++) {
        if (glp_mip_col_val(lp, i + offset + 1) > 0.5)
            return i;
    }
    fatal("Selected index not found");
}   


static double optimize(
    glp_prob *lp, int direction, int i_s, double s,
    const double *m, const double *a,
    int M, int K, int *j_max, int *k_max
) {
    const char *dir_str = direction == GLP_MIN ? "min" : "max";
    #if VERBOSE
        printf("    [%d] %s ", i_s, dir_str);
    #endif
    
    // Reset between optimization runs
    // glp_std_basis(lp);
    
    glp_set_obj_dir(lp, direction);
    int bound = direction == GLP_MIN ? GLP_LO : GLP_UP;
    glp_set_row_bnds(lp, 1, bound, s, s);
    
    int err = glp_simplex(lp, NULL);
    if (err) glp_error("GLPK simplex failure %d\n", err);
    int stat = glp_get_status(lp);
    if (stat == GLP_OPT) {
        err = glp_intopt(lp, NULL);
        if (err) glp_error("GLPK MIP failure %d\n", err);
        stat = glp_mip_status(lp);
    }
    
    if (stat != GLP_OPT) {
        #if VERBOSE
            printf("%lf: infeasible\n", s);
        #endif
        return INFINITY;
    }
    
    double obj = glp_mip_obj_val(lp);
    #if VERBOSE
        if (direction == GLP_MIN) printf("%.2le <- %.2le ", s, obj);
        else                      printf("%.2le -> %.2le ", obj, s);
    #endif
    
    *j_max = find_selected(lp, M, 0);
    *k_max = find_selected(lp, K, M);
    
    double error = fabs(obj - s);
    #if VERBOSE
        printf(
            "j=%d k=%2d err=%.1le act_err=%+.1le\n",
            *j_max+1, *k_max+1, error,
            s - m[*j_max] - a[*k_max]
        );
    #endif
    return error;
}


static void test_case(int i_problem, int i_test, FILE *file_in, FILE *file_ans) {
    printf("problem %d.%d ", i_problem, i_test);
    int M, K, N;
    double *m, *a, *s;
    read_case(file_in, &M, &K, &N, &m, &a, &s);
    
    glp_prob *lp = make_prob(i_problem, i_test, M, K, m, a);
    
    int matches = 0;
    
    int expected[2];
    
    for (int i_s = 0; i_s < N; i_s++) {
        int j = -1, k = -1;
        // Minimize m+a subject to m+a >= s
        double error = optimize(lp, GLP_MIN, i_s, s[i_s], m, a, M, K, &j, &k);
        
        if (error > epsilon) {
            int j1 = -1, k1 = -1;
            // Maximize m+a subject to m+a <= s
            double error1 = optimize(lp, GLP_MAX, i_s, s[i_s], m, a, M, K, &j1, &k1);
            
            if (error > error1) {
                error = error1;
                j = j1; k = k1;
            }
        }
        
        if (j < 0 || k < 0) fatal("No solution");
        
        read_ints(file_ans, expected, 2);
        #if VERBOSE
        printf("    Act %2d %2d  exp %2d %2d\n", j+1, k+1, expected[0], expected[1]);
        #endif
        if (j+1 == expected[0] && k+1 == expected[1])
            matches++;
    }
    
    glp_delete_prob(lp);
    free(m); free(a); free(s);
    
    printf("  matched %d/%d\n", matches, N);  
}

int main(int argc, const char **argv) {
    if (argc < 2) usage(*argv);
    
    printf("Using glpk %s\n", glp_version());
        
    for (int a = 1; a < argc; a++) {
        FILE *file_in, *file_ans;
        int i_problem;
        if (sscanf(argv[a], "%d", &i_problem) != 1)
            usage(*argv);
        
        open_files(i_problem, &file_in, &file_ans);
        
        int T;
        if (fscanf(file_in, "%d\n", &T) != 1) fatal("Bad test count");
        if (T < 1 || T > 3) fatal("Out-of-range test count");
        
        for (int i_test = 0; i_test < T; i_test++) {
            test_case(i_problem, i_test, file_in, file_ans);
        }
    }
    
    return 0;
}

Numpy vectorization

It's possible to use something vaguely close to your original implementation but using all numpy and no loops. This works-ish for problems 1.1, 2.2 and almost everything in 3.2 but

there's a few stray mismatches in 3.2;
I wasn't careful enough with memory so problem 4.1 dies from OOM - this could be fixed by switching to a KN lookup instead of an MK lookup; and
Problems 2.1 and 3.1 are totally wrong for some reason;

but it's still possible as a proof-of-concept to demonstrate how you would take your algorithm and vectorize it.

from sys import argv

import numpy as np


def solve_case(m: np.ndarray, a: np.ndarray, s: np.ndarray) -> np.ndarray:
    mrep = np.tile(m, len(a))
    jrep = np.tile(np.arange(len(m), dtype=np.int32), len(a))
    arep = np.repeat(a, len(m))
    krep = np.repeat(np.arange(len(a), dtype=np.int32), len(m))
    jk = np.vstack((jrep, krep))

    masum = mrep + arep
    order = masum.argsort()
    jk[:] = jk[:, order]
    masum[:] = masum[order]

    i = np.searchsorted(masum, s)
    lower = np.abs(s - masum[i - 1])
    upper = np.abs(s - masum[i])
    adj = lower < upper
    res = jk[:, i - adj]
    return res.T + 1


def solve(i_problem: int) -> None:
    with open(f'{i_problem}.txt') as file_in, \
            open(f'ans{i_problem}.txt') as file_ans:
        T = int(next(file_in))

        for i_case in range(1, T + 1):
            M, K, N = (int(x) for x in next(file_in).split())
            print(f'problem {i_problem}.{i_case}: M={M} K={K} N={N}', end=' ')

            m, a, s = (
                np.genfromtxt(file_in, dtype=np.float64, max_rows=1)
                for _ in range(3)
            )
            assert m.shape == (M,)
            assert a.shape == (K,)
            assert s.shape == (N,)

            actual = solve_case(m, a, s)
            expected = np.genfromtxt(file_ans, dtype=np.int32, max_rows=N)
            matched = np.sum(actual == expected) / actual.size
            print(f'{matched:.2%} matched')


def main() -> None:
    for arg in argv[1:]:
        solve(int(arg))


if __name__ == '__main__':
    main()

The test case in the question is actually the "sample test" provided by the contest. I agree that it is not the best test case. — Marc
– Marc, Commented Jul 3, 2021 at 5:41

Stack Exchange Network

Finding pair of numbers that minimize a function

Input format

Output format

Sample input

Sample output

Test cases

Code

2 Answers 2

LP: Don't do this

Numpy vectorization

You must log in to answer this question.

Hot Network Questions

Finding pair of numbers that minimize a function

Input format

Output format

Sample input

Sample output

Test cases

Code

2 Answers 2

LP: Don't do this

Numpy vectorization

You must log in to answer this question.

Related

Hot Network Questions