Curse of Dimensionality
In high-dimensional spaces, distances (especially the Euclidean Distance) become less meaningful
- the distance between each pair of points is almost the same
- this holds for many data distributions and distance measures
$$\lim_{d \to \infty} \frac{\text{dist}_\max - \text{dist}_\min}{\text{dist}_\min} = 0$$
- the Curse of Dimensionality makes similarity functions behave poorly
- KNN makes less sense: as dimensionality grows, distances become more uniform, so the "nearest" neighbor is barely nearer than any other point
- this also makes clustering difficult
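The limit above can be checked empirically. This is a small sketch (names and constants are my own, not from the source): it samples points uniformly from the unit hypercube and computes the relative contrast $(\text{dist}_\max - \text{dist}_\min)/\text{dist}_\min$ over all pairwise Euclidean distances, which shrinks as the dimensionality grows:

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """(dist_max - dist_min) / dist_min over all pairwise Euclidean distances."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, dim))              # uniform points in the unit hypercube
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared distances via the Gram trick
    d = np.sqrt(np.clip(d2, 0, None))             # clip guards tiny negative round-off
    pairs = d[np.triu_indices(n_points, k=1)]     # each pair once, no self-distances
    return (pairs.max() - pairs.min()) / pairs.min()

for dim in (2, 10, 100, 1000):
    print(f"d={dim:5d}  contrast={relative_contrast(200, dim):.3f}")
```

In low dimensions the nearest pair is orders of magnitude closer than the farthest; by $d = 1000$ the contrast drops below 1, i.e. all pairwise distances are nearly the same.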
Similarity between two high-dimensional points can be misleading
- points may appear similar even though they should belong to different clusters
How to deal with it?
NLP and Embeddings
solution: Word Embeddings
- embed the words into a dense low-dimensional space
- express the joint distribution in terms of these embeddings
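A toy illustration of the idea (the vectors here are made up for the example, not trained embeddings): each word maps to a dense low-dimensional vector, and similarity is computed in that space via cosine similarity rather than over sparse high-dimensional one-hot or count features:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real ones (e.g. word2vec, GloVe)
# are trained from corpora and typically have 50-300 dimensions.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.75, 0.70, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine(u, v):
    # cosine similarity: dot product of the unit-normalized vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```

Because the space is dense and low-dimensional, distances between embeddings stay discriminative, sidestepping the concentration effect described above.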
Papers
- Beyer, Kevin, et al. "When is 'nearest neighbor' meaningful?" (1999). [1]
- Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." (2009)