$\begingroup$

Problem description

I have a dataset that combines multiple sources, all gathering the same kind of data. I have loaded the data into the columns of a pandas DataFrame. The sources all describe the same kind of information, but not to the same extent: a given source may lack one or several columns. Most of the columns contain lists.
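To make the layout concrete, here is a minimal sketch of how a list-valued column could be turned into binary indicator features while keeping "this source does not provide the column" distinct from "the value is absent". The frame, the column names `source` and `tags`, and the helper `multi_hot` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy frame: source "B" does not provide the "tags" column at all.
df = pd.DataFrame({
    "source": ["A", "A", "B"],
    "tags": [["x", "y"], ["y"], None],
})

def multi_hot(series: pd.Series) -> pd.DataFrame:
    """Return one 0/1 column per distinct list element.

    Rows whose list is missing stay NaN in every column, so a
    missing-aware metric can later skip them instead of reading them as 0.
    """
    present = series.notna()
    # One row per list element, keeping the original row index.
    exploded = series[present].explode()
    # Indicator per element, then collapse back to one row per original index.
    dummies = pd.get_dummies(exploded, dtype=float).groupby(level=0).max()
    # Reindex to the full index: rows without a list become all-NaN.
    return dummies.reindex(series.index)

encoded = multi_hot(df["tags"])
print(encoded)
```

Keeping NaN rather than 0 for missing lists is a design choice made for this sketch; filling with 0 would make every row from a source that lacks the column look identical on those features.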

I want to build a clustering model in order to find out patterns and similarities among items for a given cluster.

Steps tried

  • Engineered the features so that only categorical data remains
  • Used CLARA with different metrics (Jaccard, Hamming, and a masked Hamming that does not penalize missing data)
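For reference, a masked Hamming distance can be sketched as below. This is one possible definition and not necessarily the exact metric used in the steps above. It assumes features are numerically coded and that NaN marks a missing value; the function name `masked_hamming` is hypothetical.

```python
import numpy as np

def masked_hamming(a, b):
    """Fraction of mismatches over positions observed in both vectors.

    Positions where either vector is NaN are ignored. If no position is
    observed in both vectors, the distance is undefined and NaN is returned.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = ~np.isnan(a) & ~np.isnan(b)
    n_observed = observed.sum()
    if n_observed == 0:
        return float("nan")
    mismatches = (a[observed] != b[observed]).sum()
    return float(mismatches / n_observed)

# Only the first two positions are observed in both rows; one of them differs.
d = masked_hamming([1, 0, np.nan, 1], [1, 1, 0, np.nan])
print(d)  # 0.5
```

Under this definition, two rows from sources that share few columns are compared on only a handful of features, so their distance rests on much less evidence than the distance between two rows from the same source.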

However, this produced very strange clusters, each containing rows from only one source (figure: clusters' distribution).
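A quick way to quantify this effect is to cross-tabulate cluster labels against sources. The sketch below uses made-up labels; the column names `source` and `cluster` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical cluster assignments for five rows from three sources.
labels = pd.DataFrame({
    "source": ["A", "A", "B", "B", "C"],
    "cluster": [0, 0, 1, 1, 1],
})

# Share of each source within each cluster (every row sums to 1).
composition = pd.crosstab(labels["cluster"], labels["source"], normalize="index")
print(composition)

# Largest single-source share per cluster; values near 1 mean the
# cluster is essentially made of one source.
dominance = composition.max(axis=1)
print(dominance)
```

A cluster whose dominance is close to 1 is mostly recovering the source identity rather than a pattern shared across sources.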

I have no background in data science, so any advice would be appreciated. Do you think using only categorical data for clustering is a good idea? I wanted to use only "global" data so that the clustering would not try to form clusters based on continuous data.

I also need a scalable clustering algorithm, since my dataset is quite big (150K rows coming from 7 sources). That is why I tried CLARA.

What I would like

Because the sources contain mostly the same information, clusters should contain data from every source (although it could happen that a single source is the only one providing the data in a given cluster).

Questions

  • Does it make sense to keep only categorical features?
  • How should I handle missing data? Imputation? A metric that takes missingness into account?
$\endgroup$
  • $\begingroup$ I would start off by analysing and clustering just one or a handful of datasets that have no missing data (or so little missing data that you can drop it initially). I would build an understanding of the data from that, and then think about which dataset(s) I could work in next (ones with little missing data that I could reasonably impute). $\endgroup$ Commented Sep 8, 2025 at 20:23
  • $\begingroup$ The thing is, my dataset will not be known in advance, so I have to find a generic way to handle this case. Depending on the sources, the rate of missing data will vary, but there will always be some. $\endgroup$ Commented Sep 9, 2025 at 7:01
