1
$\begingroup$

I have an interesting problem I am trying to solve and I cannot find any non-deep methods available to solve it.

Problem Description

Plain

The real life problem this relates to are handwritten digits on a piece of paper. I have an object detector that can detect single handwritten digits (0-9). I want to find a way to "stick" these digits together when the single digits are actually forming a multi-digit number. Therefore, the data I have are lists of bounding boxes (x, y, width, height, category) on different images and labels between them indicating whether they are part of a larger number.
This is a true supervised clustering problem due to each point having a "true" clustering since each individual digit belongs to a group of digits (or on a group of its own). In addition, each image contains its own set of points, so the problem can't be formulated as semi-supervised since a brand new unseen image will have no existing cluster labels to work from.

Mathematical

Let $D$ be a collection of lists $P$ of bounding boxes with true cluster labels, $D = \{P_0, P_1, ..., P_i\}$ where $P_i = \{(x_{0_{i}}, y_{0_{i}}, \text{width}_{0_{i}}, \text{height}_{0_{i}}, \text{category}_{0_{i}}, \text{cluster}_{0_{i}}), ...\}$.

Devise an algorithm which, given a collection of training lists $T = \{P_{t0}, P_{t1}, ..., P_{tn}\}$, can predict the "cluster" of an unseen list $P_{u}$.

Solution Ideas

I mentioned that I couldn't find any non-deep methods. I was able to devise a setup for a graph neural network where each image was a new graph, each node was a bounding box, and each edge indicated whether or not the nodes belonged to the same cluster. This has provided some promising initial results, but since I only have 46 images, there wasn't a lot of training data to go on.

I have attempted to use unsupervised methods and see how they perform, but my testing has not shown good results. K-Means requires me to know the number of clusters ahead of time, and using the silhouette score when there are a large number of clusters and a small number of elements per cluster has lead to inaccuracies. DBSCAN worked the best with the maximum distance parameter set to the maximum distance between two elements in a cluster observed in training data + 10%. Even this did not do well in the test set.

My third setup for the problem was to set it up as a supervised classification problem. I created a dataset in the following way. I found the maximum relative distance between two digits that were in the same cluster in the dataset (it was ~3% of image width). Then, for each bounding box, filtered for all bounding boxes on the same image that were within that distance, added both boxes as a single line in the dataset, and whether or not they were linked (each entry in the dataset looked like (x1, y1, w1, h1, c1, x2, y2, w2, h2, c2, are_linked). I then trained a random forest model to predict the "are_linked" boolean, and it performed very well on the test set (0.7 accuracy on the "True" class), but when I tried to get it to predict all the links on an image, only 14% of the clusters it identified were actually correct.


I am very curious to hear if anyone has an alternative formulation of the problem they think could work. Sorry this is so long winded, but I lack even the vocabulary for this problem.

$\endgroup$
2
  • $\begingroup$ Just out of curiosity: How is your performance when you query ChatGPT with the image? $\endgroup$ Commented Oct 28 at 22:02
  • 1
    $\begingroup$ Just tried it for a few sheets and it confidently gives the wrong answer, although it kind of had an idea of the magnitude of the numbers in the sense that it predicted 10 instead of 1000 when the number had two digits. $\endgroup$ Commented Oct 29 at 15:15

0

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.