2
$\begingroup$

I have a dataset with latitude/longitude of hotels of a "destination".
A destination is a city neighbourhood, whole city, or small region, usually having between 3 and 50 hotels.

About 1% of the latitudes/longitudes are totally erroneous (due to GPS/software/human error).

For instance, here are the locations of hotels in Nelson and Chiyoda:

Nelson: Nelson
Chiyoda: Chiyoda

As you can see, Nelson has no erroneous coordinates, most hotels in the city center and 3 ones in the suburbs. But Chiyoda clearly has one erroneous hotel.

I have tried using the LOF algorithm with a k=2
The Chiyoda outlier get 52 while the 3 suburub hotels in Nelson get 13, 3, 3.

QUESTION: Is LOF is the most adapted algorithm to my problem?
Also, I chose k=2 pretty randomly, it might not be the best.

The algorithm must work for tiny but dense city neighbourhoods like Upper East Side, but also for larger and more sparsely-populated areas like The Hamptons.
Algorithm speed is not a problem, I have time.

$\endgroup$

2 Answers 2

1
$\begingroup$

Given the small number of hotels within each region, consider using agglomerative clustering and look for large changes in some cluster metric.

If being implemented this by hand, it sounds like each region only has one cluster which makes things much easier by using a modified agglomerative clustering algorithm. For example, first join the two locations that are closest together into a single cluster. Then for each location not in the cluster, look at the average distance from it to every location already in the cluster, and add the candidate location with the smallest average. Continue until there is one cluster containing all locations.

As the cluster is being grown, compute the average distance between all pairs of locations in the cluster after adding each location. Then plot the average distance vs the number of locations in the cluster, there will be a sudden increase in distance when outliers are added to the cluster.

There are many possible variations to be considered. When looking for candidate locations to add to the cluster, consider

  1. Minimizing Average distance from candidate location to all other locations in the cluster
  2. Finding candidate location with minimum distance to the closest location in the cluster
  3. Finding candidate location with minimum distance to the farthest location in the cluster.

When plotting the current "quality" if the cluster, consider these metrics:

  1. Average distance between all pairs of locations in the cluster
  2. Maximum distance between any pair of locations in the cluster.

If there is concern about multiple clusters within each region, take a look at Wikipedia's full description of agglomerative clustering. It also provides many more variations.

$\endgroup$
3
  • 1
    $\begingroup$ Thanks! If being implemented this by hand → I'd rather avoid reimplementing if it has been done already. Actually I use ELKI which has these outlier implementations: elki.dbs.ifi.lmu.de/releases/release0.6.5~20141030/doc/de/lmu/… (in "All Known Implementing Classes"). Any idea which one fits your description, maybe? :-) $\endgroup$ Commented May 18, 2015 at 2:20
  • $\begingroup$ Very nice package. I don't think they've implemented an clustering "elbow" solution. But they've got a few clustering methods to find outliers $\endgroup$ Commented May 18, 2015 at 9:54
  • $\begingroup$ I don't think clustering is the way to go. The standard outlier detection methods are well understood, and may just work. $\endgroup$ Commented May 31, 2015 at 10:24
1
$\begingroup$

$k=2$ is most likely too small to really benefit from LOF. But you may not need LOF at all (you really should read the Wikipedia page to see its intuition! with $k=2$ you compare to just one or two other values, this will be rather unstable). kNN outlier may work better on tiny data sets, with smaller values of $k$ than LOF.

If all you want to detect are single extremely off coordinates, then $k$NN outlier with $k=1$ may just work as desired - it takes the distance to the next hotel as outlier score. But to automatically remove outliers, you would need to decide on a threshold, i.e. "if there is no other hotel within a radius of 10 miles, I will discard this as erroneous".

LOF may yield a threshold that is more comparable across different countries / cities. But in my experience, you would need at least $k=10$ (and larger data sets).

$\endgroup$
4
  • $\begingroup$ So you suggest en.wikipedia.org/wiki/K-nearest_neighbors_algorithm right? "if there is no other hotel within a radius of 10 miles" <- Did you write "10 miles" as a metaphor, or will I actually have to hard-code a particular distance (the same 10 miles might work great for The Hamptons but not for Upper East Side)? Thanks! $\endgroup$ Commented Jun 1, 2015 at 6:26
  • $\begingroup$ I'm suggesting KNNOutlierDetection or whatever the ELKI class was called. The kNN algorithm in Wikipedia is a classifier, but you don't have labels. Yes, if you eventually want a yes/no decision, you will end up with choosing a radius in miles (assuming you use geo distance). $\endgroup$ Commented Jun 1, 2015 at 6:32
  • $\begingroup$ My destinations are areas whose "diameter" vary from 1km to 200km. Rather than choosing something like "10 miles", I would rather choose something like "2.5 times the average distance from the barycenter of all hotels" or something similar. Something that can scale, working for both small destinations and large destinations. $\endgroup$ Commented Jun 1, 2015 at 6:46
  • $\begingroup$ Then you may benefit from LOF, have a look at it, to understand how to choose k. Or simply a Gaussian model. $\endgroup$ Commented Jun 1, 2015 at 16:37

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.