Data Mining And Data Warehousing
DBSCAN
Density-Base Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together points that are close to each other in dense regions, and marks points in low-density areas as outliers (noise).
It is used when:
- Clusters are of arbitrary shapes
- Automatically detect the number of clusters
- Data contains noise or outliers
Key Concepts
- Core Point: A point with at least min_samples neighbors within a distance eps.
- eps distance: minimum distance from core point to another point.
- min points: number of points close to given point such that the point is core point
- Density reachable and density connected: Density-reachable refers to a point being reached via a chain of core points within a specified radius, while density-connected means two points can be reached from a common core point, also within the specified radius.
- Border Point: A point within eps of a core point, but not a core point itself.
- Noise (Outlier): A point that is not a core point and not within eps of any core point.
Working of DBSCAN
- For each point, find its eps-neighborhood (nearby points within a radius eps).
- If it has greater than min_samples neighbors, it becomes a core point and starts forming a new cluster.
- Expand the cluster by recursively adding:
- Core points\' neighbors
- Border points
- Label points as noise that do not belong to any cluster.
Advantages
- No need to specify the number of clusters (unlike K-Means).
- Can find arbitrarily shaped clusters.
- Automatically detects outliers.
- Works well for spatial and noisy datasets.
Limitations
- Not good with varying densities (some where very dense data, somewhere low dense data) in the same dataset.
- Sensitive to parameter settings (eps, min_samples).
- Not ideal for high-dimensional data (distance becomes less meaningful).