Timeline for answer to Kmean clustering on text data by Latent

Current License: CC BY-SA 4.0

Post Revisions

11 events
Feb 23, 2019 at 20:43 comment added jen ki I was looking through one-hot encoding and tried it on my dataset, but I don't know how to approach it.
Feb 23, 2019 at 9:02 comment added Latent @Anony-Mousse I've updated the main answer and added some more advanced methods that can be more beneficial for categorical clustering.
Feb 23, 2019 at 9:01 history edited Latent CC BY-SA 4.0
updated more approaches
Feb 23, 2019 at 7:50 comment added Has QUIT--Anony-Mousse While you can use one-hot encoding and similar, that usually yields quite poor and uninterpretable results. Using a method that is actually designed for text or factors is better.
Feb 22, 2019 at 17:09 comment added Latent @jenki, your dataset is categorical. I've added the common method for handling that type of data to the main answer. There are more advanced methods, but one-hot encoding is (as far as I know) the most common method for that type of data.
Feb 22, 2019 at 17:07 history edited Latent CC BY-SA 4.0
added 544 characters in body
Feb 22, 2019 at 16:11 comment added jen ki Hi, thanks for replying. In my dataset, I have text and some numeric values with plus and pound symbols. This is where I got the data from, and it's related to crime: old.datahub.io/dataset/uk-criminal-justice/resource/….
Feb 22, 2019 at 15:25 history edited Latent CC BY-SA 4.0
added 511 characters in body
Feb 22, 2019 at 15:13 comment added HFulcher Maybe you could flesh out your answer a bit more by suggesting what preprocessing could be done to convert strings to a suitable format?
Feb 22, 2019 at 15:08 history edited Latent CC BY-SA 4.0
added 1 character in body
Feb 22, 2019 at 15:03 history answered Latent CC BY-SA 4.0