2
\$\begingroup\$

This C code does DBSCAN - Density-based spatial clustering of applications with noise. It's a cluster algorithm for turining unsupervised (no labels) data to become supervised (labels e.g class number) data. And it's also fast. The drawback with DBSCAN is that it consumes a lot of memory, but its speed is excellent!

So what I have done is that I have created a DBSCAN algorithm that using recursion. Once I was done with this, I notice that the recusion made a stack overflow if the row argument was large.

Question:

  1. I want to make sure that recursion functions in C must be allocated on the heap, not the stack. How can that be done?
  2. Is there any way I could make sure this function recursion_clustering not increase any memory and only using static memory? I have tried to use so much pointers as possible to prevent memory duplication.

Code:

/* Private function */
static bool recursion_clustering(const int32_t i, const size_t* row, const float* epsilon, const size_t* min_pts, const float C[], bool visited[], size_t idx[], int32_t* id, int32_t* count);

 /*
  * Create ID's of clusters
  * X[m*n]
  * idx[m] = Contains cluster ID. Noise ID is 0
  * epsilon = Raduis of the clusters
  * min_pts = Minimum points of a valid cluster
  */
void dbscan(const float X[], size_t idx[], const float epsilon, const size_t min_pts, const size_t row, const size_t column) {
    /* Create idx */
    memset(idx, 0, row * sizeof(size_t));

    /* Create pdist2 C */
    float* C = (float*)malloc(row * row * sizeof(float));
    pdist2(X, X, C, row, column, row, PDIST2_METRIC_EUCLIDEAN);

    /* Flags */
    bool* visited = (bool*)malloc(row * sizeof(bool));
    memset(visited, false, row * sizeof(bool));
    
    /* Iterate */
    int32_t i, count, id = 1;
    for (i = 0; i < row; i++) {
        if (recursion_clustering(i, &row, &epsilon, &min_pts, C, visited, idx, &id, &count)) {
            /* Set ID value */
            idx[i] = id;
            id++;
        }
    }

    /* Free */
    free(C);
    free(visited);
}

static bool recursion_clustering(const int32_t i, const size_t* row, const float* epsilon, const size_t* min_pts, const float C[], bool visited[], size_t idx[], int32_t* id, int32_t* count) {
    /* Declare variables */
    int32_t j;

    /* Check if already visited - else mark as visited */
    if (visited[i]) {
        return false;
    }
    else {
        visited[i] = true;
    }

    /* Begin first with the row of C */
    *count = 0;
    for (j = 0; j < *row; j++) {
        /* Check how many index exceeds min_pts */
        if (C[i * (*row) + j] <= *epsilon) {
            (*count)++;
        }
    }

    /* Check if count is less than min_pts */
    if (*count < *min_pts) {
        return false;
    }

    /* Iterate forward */
    for (j = i + 1; j < *row; j++) {
        if (!visited[j] && C[i * (*row) + j] <= *epsilon) {
            /* Recursion */
            if (recursion_clustering(j, row, epsilon, min_pts, C, visited, idx, id, count)){
                /* Set ID value */
                idx[j] = *id;
            }
        }
    }

    /* Iterate backward */
    for (j = i - 1; j >= 0; j--) {
        if (!visited[j]) {
            /* Recursion */
            if (recursion_clustering(j, row, epsilon, min_pts, C, visited, idx, id, count)) {
                /* Set ID value */
                idx[j] = *id;
            }
        }
    }

    /* Return true */
    return true;
}

If you want to run this code, you need this working example.

#define row 65
#define column 3

int main() {
    /* Define X, idx, epsilon and min_pts */
    float X[row * column] = { -0.251521, 1.045117, -1.281658,
        -1.974109, 0.278170, -1.023392,
        -0.957729, -0.977450, 0.477872,
        -0.449159, -1.016680, 0.095610,
        -1.785787, -1.403543, 0.483454,
        1.366889, -0.762590, -1.162454,
        2.129839, 0.358568, -2.118250,
        0.751071, -1.766582, 0.178434,
        -1.980847, -1.320933, -0.457778,
        -0.478030, 0.606917, -1.630624,
        3.674916, 0.088957, 0.877373,
        0.637213, 0.079176, 0.342038,
        1.142329, 0.629997, 0.311134,
        -0.878974, 0.042527, 0.736522,
        1.751637, -1.434299, -1.325140,
        1.110682, 1.091970, 1.434869,
        -0.504482, -2.504821, -1.245315,
        -0.102915, -0.203266, -0.849767,
        -0.822834, 1.158801, -0.405579,
        -1.278287, 0.391306, 0.857077,

        /* Outliers */
         45, 43, 0,
         23, -3, 2,

        10.6772, 10.7365, 9.9264,
        8.7785, 11.1680, 9.5915,
        8.6872, 9.6464, 10.3801,
        10.0142, 8.8311, 9.2021,
        8.4179, 9.8572, 11.6356,
        9.8487, 10.4979, 10.8620,
        10.0957, 9.7878, 12.2653,
        11.4528, 11.5186, 10.3050,
        10.9284, 9.9654, 10.4562,
        8.5272, 10.7451, 9.8355,
        10.1508, 10.2318, 10.2417,
        10.7342, 10.0689, 9.9918,
        10.4784, 9.2032, 10.6060,
        10.1309, 9.4392, 10.9674,
        10.6971, 10.3347, 11.0447,
        7.9489, 9.4566, 9.5258,
        10.4827, 10.3030, 10.5582,
        10.4496, 10.3880, 11.1661,
        11.0291, 10.0233, 9.9280,
        9.0638, 9.3650, 9.3670,

        /* Outliers */
         32, 54, 23,
         23, 51, 77,


    -34.233,  -30.841,  -31.720,
     -32.629,  -31.786,  -31.290,
     -31.466,  -31.984,  -33.254,
     -31.878,  -33.052,  -31.761,
     -33.528,  -30.921,  -32.836,
     -31.793,  -32.082,  -30.453,
     -31.812,  -32.417,  -31.874,
     -32.127,  -32.599,  -32.806,
     -32.979,  -32.096,  -31.754,
     -31.759,  -31.925,  -31.313,
     -30.531,  -31.838,  -31.179,
     -32.168,  -31.928,  -30.649,
     -31.049,  -32.092,  -31.408,
     -33.006,  -31.753,  -31.961,
     -32.092,  -32.391,  -31.501,
     -31.184,  -31.634,  -32.802,
     -30.658,  -31.616,  -31.493,
     -31.958,  -31.694,  -31.425,
     -33.114,  -32.029,  -31.459,
     -31.081,  -34.486,  -32.020,

        /* Outlier */
        -22, -34, 53 };

    size_t idx[row] = { 0 };
    float epsilon = 10;
    size_t min_pts = 3;

    /* Compute dbscan */
    dbscan(X, idx, epsilon, min_pts, row, column);

    /* Print idx */
    size_t i;
    for (i = 0; i < row; i++) {
        printf("%i\n", idx[i]);
    }

    return 0;
}

The output will be: (zeros are outliers = noise)

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
0
0
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
0
\$\endgroup\$
1

3 Answers 3

3
\$\begingroup\$

Check the return value of malloc():

malloc() and family returns NULL to signify an error. Failing to check it risks invoking undefined behavior by a subsequent null pointer dereference.

#if 0
    float* C = (float*)malloc(row * row * sizeof(float));
#else
    /* Allocate to the referenced object, not the type. 
     * It is easier to maintain.
     */
    float *C = malloc(row * row * sizeof *C);
#endif

Casting the return value of malloc() is redundant and adds clutter to the code. malloc() and family returns a generic void * that is implicitly converted to any other pointer type.

Note that you can replace malloc() + memset() with calloc().

Comment only where necessary:

/* Set ID value */
idx[i] = id;

This comment doesn't add anything helpful, the code suffices.

/* Free */
free(C);
free(visited);

Similarly, the function name suffices here. I can see that it frees the given argument.

/* Iterate */
int32_t i, count, id = 1;
for (i = 0; i < row; i++) {

Yes, a for loop iterates. You don't need to mention that in a comment. Note that you can limit the scope of i to the for loop by declaring it like so:

for (int32_t i = 0; i < row; ++i) {
...

Perhaps it would be better to declare i as a size_t, as row is an object of type size_t too.

/* Return true */
return true;

There was nothing unclear about the return statement that warranted this comment.

Most of the other comments serve no purpose, and should be removed.

\$\endgroup\$
1
  • 3
    \$\begingroup\$ There is a small advantage of sizeof *C * row * row over row * row * sizeof *C: Should row, which is now size_t, later get replaced with a narrower type, sizeof *C * row * row is less likely to overflow than row * row * sizeof *C. \$\endgroup\$
    – chux
    Commented Jan 2, 2024 at 11:10
2
\$\begingroup\$

Some side issues for OP.

Deeper math issues

Best to use float constants rather than double ones to initialize a float array.

float X[row * column] = { -0.251521, 1.045117, -1.281658,
// float X[row * column] = { -0.251521f, 1.045117f, -1.281658f,

Sometimes it makes a value difference and it quiets certain warnings.


There are multiple float that when printed to 6 places report 0.178434.

In general, use at least FLT_DECIMAL_DIG (9) significant digits to maintain an about 1-to-1 mapping of float to display value.

\$\endgroup\$
2
\$\begingroup\$

Clean-Up

Use calloc() Where Appropriate

You currently have:

    /* Flags */
    bool* visited = (bool*)malloc(row * sizeof(bool));
    memset(visited, false, row * sizeof(bool));

Since you immediately set this entire array to zeroes, it would be preferable to write:

bool* const visited = calloc(row, sizeof(bool));

You do not need an explicit pointer cast from void* in C++, and since you never mean to update the pointer visited to alias a different array, it should be declared const. This way, if you ever accidentally write something like visited = 0; instead of visited[j] = 0;, the compiler will catch the error. not happily let you shoot yourself in the foot by setting visited to a null pointer and seg-faulting the next time you dereference it. As a nice side bonus, this might help your compiler figure out that the base address of the array you are iterating through cannot change.

The advice in other answers, to check the return value for an out-of-memory-error, applies here as well.

Minor notes: before C23, which hasn’t quite come out yet (Happy New Year 2024!), a C program that uses bool, true and false should #include <stdbool.h>.

As others have brought up, many of the comments in this program are useless. It is obvious that the elements of the array are flags, from their type. If the name visited and the bound row tell maintainers what the flags represent, that’s a great variable name and no further comment is needed. (I would not use the name row for either the number of rows or a pointer thereto; I would expect a size_t named row to hold a row index.) If there are any possible surprises, or important invariants that the program needs to maintain, a comment should explain those.

Don’t Use Unnecessary Indirection

Currently, you pass around several values, including size_t row, by reference, giving pointers to them. This only pessimizes the program (although it would be much worse if you were passing a pointer to the sentinel value that could modify it, to a function call that escapes the body of the loop, as then the optimizer would likely assume that its value could change arbitrarily on each iteration). On every actually-existing architecture, a pointer is at least as expensive to pass as a size_t. If you use this idiom at all, use it only for large structures.

Global Variables Aren’t Always That Harmful

At present, you’re using tramp data, which many people consider an anti-pattern. That is, you’re adding several variables to your function signatures for the sole purpose of passing them to other functions. At a certain point, this gets ugly and confusing.

In a language that allowed nested functions (or closures), what you would rather do instead is declare recursion_clustering a nested helper of dbscan, and all the variables that are meant to be shared between calls to recursion_clustering within the scope of dbscan. The nested helper would then see them without needing any explicit parameters or space on the stack. However, C does not support this.

The Structured-Programming mitigation is to pack all that tramp data into a struct so you have only one extra parameter to pass around.

However, just as you already declare recursion_clustering in module scope (that is, static and at file scope), you could at least weigh the advantages and disadvantages of declaring some of the algorithm’s data the same way. This is less safe, because the compiler can no longer check that the variable is a static single assignment, and also because a multi-threaded program can no longer calculate more than one of these.

Consider a More Self-Explanatory Function Signature

One of the most helpful comments in the program is the one before dbscan, but it could be made clearer:

 /*
  * Create ID's of clusters
  * X[m*n]
  * idx[m] = Contains cluster ID. Noise ID is 0
  * epsilon = Raduis of the clusters
  * min_pts = Minimum points of a valid cluster
  */
void dbscan(const float X[], size_t idx[], const float epsilon, const size_t min_pts, const size_t row, const size_t column) {

Doxygen-style comments explaining a function contract are the exception to my earlier advice not to state the obvious. In this context, I really would prefer that you say what each parameter is supposed to be, and whether they are input parameters, output parameters or both. Maintainers need to know how to call the function. If this is truly unnecessary for something like a local helper, consider giving each parameter its own line and adding a // comment after the ones where it would be needed.

You don’t use the same notation for row and column in your argument list as in your comment, nor explain which is which, and you don’t take advantage of C’s notation to specify the size of array parameters. It would also be better to say a few words about what X[] contains, and perhaps even mention in the comment which of these are meant to be input and output parameters.

On a side note, ALL_CAPS identifiers in C are traditionally macros, and TitleCase identifiers (if used at all) are traditionally reserved for type names, so consider some other convention for your array names, such as xs and cs. And—without exception—explicitly say what the bounds of each of your arrays should be. Always, always, always, check for buffer overruns in C!

You might find this a bit clearer, with a slightly cleaned-up comment:

void dbscan( const size_t m,
             const size_t n,
             const float points[m][n],
             size_t cluster_ids[m],
             const float epsilon, 
             const size_t threshold
           ) {

It’s now much more apparent what the size of each array is, and the compiler would even have a prayer of catching a bug involving that. Also, arrays and only arrays have a plural name.

The one small gotcha there is that, with this definition of points as a rectangular array, you will need to cast it to a flat pointer when passing it to pdist2. But this is safe.

Your Main Question

You ask whether it’s possible to store all this algorithm’s data on the heap rather than the stack. A general algorithm for this is to convert the recursive calls into a while loop with some kind of trampoline data structure, possibly a stack on the heap, containing the current state. For the present algorithm, this might be as simple as a dynamic array backtracking the values of i. The remaining data that you pass to each recursive call appears to be state shared between every call.

It also appears at a glance as if your algorithm could probably be simplified to use either tail-recursion or iteration. If it replaced the big visited lookup table with a series of loops that walk the array in the correct order and skip already-visited rows, it would be much more efficient. But I haven’t investigated this.

\$\endgroup\$
2
  • \$\begingroup\$ Good answer! Is there a benefit to const-qualifying m and n in dbscan()? \$\endgroup\$
    – Madagascar
    Commented Jan 3, 2024 at 3:07
  • 1
    \$\begingroup\$ @Harith Thanks. There are at least two benefits: the bounds are never supposed to be modified, so if I do, the compiler will now catch the bug. Less frequently, if I’m using m or n in a loop condition, and there’s also an alias to it in the same scope, declaring the bound const can sometimes tell the optimizer that the body of the loop never modified m or n. The static analyzer might not be able to figure this out, especially if I’ve passed &m or &n to some external function. \$\endgroup\$
    – Davislor
    Commented Jan 7, 2024 at 12:54

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.