Timeline for answer to Kmean clustering on text data by Latent

Current License: CC BY-SA 4.0

Post Revisions

11 events

when	what		by	license	comment
Feb 23, 2019 at 20:43	comment	added	jen ki		I was looking into one-hot encoding and tried it on my dataset, but I don't know how to approach it.
Feb 23, 2019 at 9:02	comment	added	Latent		@Anony-Mousse I've updated the main answer and added some more advanced methods that may be more beneficial for categorical clustering.
Feb 23, 2019 at 9:01	history	edited	Latent	CC BY-SA 4.0	updated more approaches
Feb 23, 2019 at 7:50	comment	added	Has QUIT--Anony-Mousse		While you can use one-hot encoding and similar, that usually yields quite poor and uninterpretable results. Using a method that is actually designed for text or factors is better.
Feb 22, 2019 at 17:09	comment	added	Latent		@jenki, your dataset is categorical, so I've added the common method for handling that type of data to the main answer. There are more advanced methods, but one-hot encoding is (as far as I know) the most common method for that type of data.
Feb 22, 2019 at 17:07	history	edited	Latent	CC BY-SA 4.0	added 544 characters in body
Feb 22, 2019 at 16:11	comment	added	jen ki		Hi, thanks for replying. In my dataset, I have text and some numeric values with plus and pound symbols. This is where I got the data from, and it's related to crime: old.datahub.io/dataset/uk-criminal-justice/resource/….
Feb 22, 2019 at 15:25	history	edited	Latent	CC BY-SA 4.0	added 511 characters in body
Feb 22, 2019 at 15:13	comment	added	HFulcher		Maybe you could flesh out your answer a bit more by suggesting what preprocessing could be done to convert strings to a suitable format?
Feb 22, 2019 at 15:08	history	edited	Latent	CC BY-SA 4.0	added 1 character in body
Feb 22, 2019 at 15:03	history	answered	Latent	CC BY-SA 4.0
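
The comments above mention one-hot encoding as a common way to turn categorical values into numbers before clustering. A minimal sketch of that idea, in plain Python, might look like the following. The function name `one_hot_encode` and the toy records are hypothetical and are not taken from the answer or from the linked dataset. As one commenter notes, clustering on one-hot vectors often gives poor and hard-to-interpret results, so treat this only as an illustration of the encoding step.

```python
def one_hot_encode(rows):
    """Turn rows of categorical values into 0/1 vectors.

    Each column gets one output position per distinct value seen in it.
    Returns the per-column category lists and the encoded rows.
    """
    n_cols = len(rows[0])
    # Distinct values per column, sorted so the output layout is stable.
    categories = [sorted({row[i] for row in rows}) for i in range(n_cols)]
    encoded = []
    for row in rows:
        vector = []
        for i, value in enumerate(row):
            # Exactly one position per column is set to 1.
            vector.extend(1 if value == cat else 0 for cat in categories[i])
        encoded.append(vector)
    return categories, encoded


# Hypothetical toy records: (offence type, region).
rows = [("theft", "north"), ("fraud", "south"), ("theft", "south")]
categories, encoded = one_hot_encode(rows)
```

Each encoded row has one entry per (column, value) pair, so the resulting numeric vectors could then be passed to a clustering routine that expects numeric input.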