
My project has the following steps:

  1. Use the elbow method to determine the features and the number of clusters for k-means.
  2. Run k-means on the data (with the chosen features and number of clusters) and obtain the cluster labels.
  3. Train a supervised ML model to predict the cluster index.

So my question is: what is the optimal cross-validation strategy for training k-means? Should I split the data into train and test sets, fit the k-means model on the train set, and apply it to the test set? I would then use only the test data with cluster indices for step 3.

The key point is that I will use the data together with the cluster indices to train a supervised ML model. So I guess I cannot run the cluster analysis on all the data and then feed it all into the supervised ML model.

My intuition is that I should not fit k-means on all the data, attach the cluster indices, and feed all the labeled data to step 3. Am I correct?
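To make the question concrete, here is a minimal sketch of the split-first workflow I have in mind (scikit-learn; `X` below is just a random placeholder for my features, and the cluster/model settings are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the real feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))

# Split first, so the clustering never sees the test rows
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Step 2: fit k-means on the training subset only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
y_train = kmeans.labels_            # cluster indices for the training rows
y_test = kmeans.predict(X_test)     # assign test rows to the fitted centers

# Step 3: supervised model that predicts the cluster index
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```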

  • I don't quite understand what you plan on doing. After clustering, you already have the cluster indices for all observations. Or are you saying you want to cluster a subset $A$ of your observations, and then predict which cluster the points in a different subset $B$ belong to? But in k-means, this is trivial, because by definition, each data point is assigned to the cluster with the closest center (up to recalculation of cluster centers); no need for ML, just calculate distances. Can you explain what you want to do? Commented Jun 18 at 5:52
  • @StephanKolassa: my guess is that for the application, new data should be assigned to the clusters obtained during training. I.e., the task is really setting up a predictive model, and the cluster analysis is done to make the model training easier. Too long for a comment, so see my answer for a scenario where this makes sense for model development (but of course makes the data unusable for validation purposes). A second scenario that comes to my mind is allowing for increased complexity by having several low-complexity classifiers reproduce the clustering, with a post-processing step mapping cluster → class. Commented Jun 18 at 10:47

1 Answer


Yes, you need to split your data before the clustering step and run the cluster analysis only on the training subset.


I assume your goal is to predict the cluster index for new (unseen, unknown) data.

You need to evaluate on data that was not used during the cluster analysis: setting up the groups (in data space, i.e. the cluster analysis) is part of your model training.

Plausibility check: otherwise, you'd create a self-fulfilling prophecy, since by construction the clusters are (easily) separable in data space. (Having a classifier distinguish groups that form clusters in the classifier's data space is rather trivial.)
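To illustrate how trivial that task is: on the training data, simply assigning each point to its nearest cluster center already reproduces the k-means labels exactly (at convergence), no learning required. A minimal sketch with random placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random placeholder data, purely for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)

# Nearest-center assignment reproduces the k-means labels on the
# training data (at convergence) -- no classifier needed
dist = np.linalg.norm(X[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
print((dist.argmin(axis=1) == kmeans.labels_).all())  # True
```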


@StephanKolassa: here's a scenario where I have seen such a construction in the literature:

Task: tumor tissue recognition from spectroscopic images, i.e. data with many wavelengths (variates) recorded for many pixels (x/y locations) of biological tissue thin sections.

A classifier is set up to predict the tissue type (including various normal and tumor tissue classes) for each such pixel/location from its spectrum. Each thin section covers various types of biological tissue.

The training data thus needs to consist of spectra with corresponding tissue type labels. This labelling is done by a pathologist. A very convenient (or maybe tempting) way of obtaining labels for large numbers of spectra is to run a cluster analysis on the spectra, display the results to the pathologist, and have them assign a tissue type to each cluster.

For training data, this is fine, though one should already consider whether data-driven model optimization is negatively affected by the built-in separability causing over-optimism in any internal verification steps.

For validation (as in establishing fitness for purpose of a clinical diagnostic tool), there is no way around obtaining data whose reference labels were produced independently, so using clusters of the spectroscopic data as an intermediate step to help the pathologist is unacceptable.


A second scenario that comes to my mind:

Modelling fewer ($c$) classes by more ($n > c$) clusters plus a low-complexity classifier for the $n$ clusters. The cluster predictions are then post-processed into the $c$ class labels.

  • This is certainly a valid approach for training.
  • I'd expect it to be less efficient than directly using a classifier that allows this sort of complexity within a class: such a classifier would not "waste effort" on distinguishing clusters that belong to the same class.

Again, the data used to establish generalization performance needs to be independent of the cluster analysis.
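For concreteness, here is a minimal sketch of this construction (scikit-learn; the random data, the choice of $n = 6$ clusters for $c = 2$ classes, and logistic regression as the low-complexity classifier are all placeholder assumptions, not a recommendation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Placeholder data: c = 2 classes, modelled via n = 6 clusters
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)      # the c class labels

# Cluster the data, then train a low-complexity classifier
# to predict the cluster index
kmeans = KMeans(n_clusters=6, n_init=10, random_state=2).fit(X)
cluster_id = kmeans.labels_
clf = LogisticRegression(max_iter=1000).fit(X, cluster_id)

# Post-processing: map each cluster to its majority class
cluster_to_class = np.array(
    [np.bincount(y[cluster_id == k]).argmax() for k in range(6)]
)
y_pred = cluster_to_class[clf.predict(X)]
```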

