I have an interesting problem I am trying to solve, and I cannot find any non-deep methods for it.
Problem Description
Plain
The real-life problem this relates to is handwritten digits on a piece of paper. I have an object detector that can detect single handwritten digits (0-9). I want to find a way to "stick" these digits together when they actually form a multi-digit number. The data I have, therefore, are lists of bounding boxes (x, y, width, height, category) on different images, plus labels between them indicating whether they are part of a larger number.
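For concreteness, the data for a single image looks something like this (the coordinates and groupings below are made up):

```python
# One image's detections: (x, y, width, height, category) plus the group label
# that ties together digits belonging to the same multi-digit number.
boxes = [
    (120, 340, 28, 40, 1, "group_0"),  # the "1" of a handwritten "17"
    (150, 342, 30, 41, 7, "group_0"),  # the "7" of the same "17"
    (410, 200, 27, 39, 5, "group_1"),  # a digit standing on its own
]
```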
This is a true supervised clustering problem: each point has a "true" cluster assignment, since each individual digit belongs to a group of digits (or forms a group of its own). In addition, each image contains its own set of points, so the problem can't be formulated as semi-supervised; a brand-new, unseen image will have no existing cluster labels to work from.
Mathematical
Let $D$ be a collection of lists $P$ of bounding boxes with true cluster labels, $D = \{P_0, P_1, \ldots, P_n\}$, where $P_i = \{(x_{i,0}, y_{i,0}, \text{width}_{i,0}, \text{height}_{i,0}, \text{category}_{i,0}, \text{cluster}_{i,0}), \ldots\}$.
Devise an algorithm which, given a collection of training lists $T = \{P_{t_0}, P_{t_1}, \ldots, P_{t_m}\}$, can predict the cluster label of each bounding box in an unseen list $P_{u}$.
Solution Ideas
I mentioned that I couldn't find any non-deep methods. I was able to devise a setup for a graph neural network where each image is a graph, each node is a bounding box, and each edge indicates whether or not the two nodes belong to the same cluster. This has given some promising initial results, but since I only have 46 images, there isn't a lot of training data to go on.
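For reference, here is a minimal sketch of that kind of edge-classification setup, assuming PyTorch Geometric (the `GCNConv` layers and sizes are placeholders, not exactly what I trained):

```python
import torch
from torch import nn
from torch_geometric.nn import GCNConv

class DigitLinker(nn.Module):
    """Scores, for each candidate edge, whether its two boxes share a cluster."""
    def __init__(self, in_dim=5, hidden=32):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.edge_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x, edge_index):
        # x: (n, 5) node features (x, y, width, height, category)
        # edge_index: (2, m) candidate edges within one image-graph
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index)
        src, dst = edge_index
        # Edge logit from the concatenated embeddings of the two endpoints.
        return self.edge_head(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)
```

The edge logits would be trained against the binary same-cluster labels with something like `BCEWithLogitsLoss`.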
I have attempted to use unsupervised methods and see how they perform, but my testing has not shown good results. K-Means requires me to know the number of clusters ahead of time, and using the silhouette score to choose it has led to inaccuracies when there are a large number of clusters and a small number of elements per cluster. DBSCAN worked the best, with the maximum-distance parameter set to the maximum distance between two elements of the same cluster observed in the training data, plus 10%. Even this did not do well on the test set.
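To illustrate, the DBSCAN variant was roughly the following (scikit-learn, applied here to box centres; the eps value is a placeholder for the one derived from training data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_digits(centres, eps):
    """centres: (n, 2) array of box centres for one image, in relative coordinates.
    min_samples=1 so that an isolated digit forms its own cluster."""
    return DBSCAN(eps=eps, min_samples=1).fit_predict(centres)

# eps = max within-cluster distance observed in training, plus 10% (placeholder value)
eps = 0.03 * 1.1
labels = cluster_digits(np.array([[0.10, 0.20], [0.12, 0.20], [0.55, 0.70]]), eps)
print(labels)  # -> [0 0 1]
```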
My third setup was to treat it as a supervised classification problem, and I created a dataset in the following way. I found the maximum relative distance between two digits that were in the same cluster anywhere in the dataset (~3% of image width). Then, for each bounding box, I filtered for all bounding boxes on the same image within that distance, and added each such pair as a single row in the dataset together with whether or not the two boxes were linked (each entry looked like (x1, y1, w1, h1, c1, x2, y2, w2, h2, c2, are_linked)). I then trained a random forest model to predict the "are_linked" boolean, and it performed very well on the test set (0.7 accuracy on the "True" class), but when I tried to get it to predict all the links on an image, only 14% of the clusters it identified were actually correct.
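Here is a sketch of that pairwise setup, with the predicted links merged into per-image clusters via connected components (one way of doing the merging; the pair-building step is omitted and the training arrays are placeholders):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.ensemble import RandomForestClassifier

# Placeholder pairwise training data: each row is
# (x1, y1, w1, h1, c1, x2, y2, w2, h2, c2), with a 0/1 "are_linked" label.
pairs_train = np.random.rand(500, 10)
linked_train = np.random.randint(0, 2, 500)

rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
rf.fit(pairs_train, linked_train)

def predict_clusters(n_boxes, pair_index, pair_features):
    """pair_index: (m, 2) int array of candidate pairs (boxes within the distance
    threshold); pair_features: (m, 10) array of their concatenated box features."""
    linked = rf.predict(pair_features).astype(bool)
    i, j = pair_index[linked].T
    # Adjacency matrix of predicted links; each connected component is one
    # predicted multi-digit number (or a lone digit).
    adj = coo_matrix((np.ones(i.size), (i, j)), shape=(n_boxes, n_boxes))
    _, cluster_labels = connected_components(adj, directed=False)
    return cluster_labels
```

Connected components is just one way of merging the pairwise predictions into groups; a single false link merges two groups, so errors at the pair level compound at the cluster level.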
I am very curious to hear if anyone has an alternative formulation of the problem that they think could work. Sorry this is so long-winded, but I lack even the vocabulary for this problem.