Density based clustering
Broad category of feature space analysis techniques rely on the estimation of
the probability density function (PDF) of the data set. Estimated density
reveals patterns in data distribution where
dense regions correspond to clusters of the data separated
by regions of low density or noise. Clustering is based on
determining local maxima (modes) of PDF and associated
basins of attraction. The main concept is based on the density attractor
notion [22].
Each density attractor of PDF delineates associated basin of attraction, a
set of points
for which hill-climbing procedure
started at converges to .
Hill-climbing procedure can be guided by a gradient of PDF
[23,24] or step-wise [22].
Definition 2 (Strong density attractor)
Strong density attractor is density attractor satisfying
|
(2) |
where is a noise level.
Definition 2 defines significant density
attractors with regard to the uniformly distributed noise and non-specific
data samples, called outliers. Due to noise and outliers, PDF can exhibit
local variations with the density low compared to the modes corresponding to
significant clusters. Implicitly defined number of clusters corresponds to the
number of modes above noise level.
Kernel density estimation [25] is a PDF estimation method based on
the concept that the density function at a continuity point can be estimated
using the sample observation that falls within a region around that point.
Let
be a set of data samples in -dimensional
vector space
, and
, a radially symmetric kernel satisfying
|
(3) |
The inherent distribution of the data set at
can be estimated as the average of the identically scaled kernels
centered at each sample:
|
(4) |
where bandwidth is the scaling factor.
The shape of the PDF estimate is strongly influenced by the bandwidth ,
which defines the scale of observation. Larger values of
result in smoother density estimate, while for smaller values the contribution
of each sample to overall density has the emphasized local character,
resulting in density estimate revealing details on a finer scale.
The fixed bandwidth , constant across
, affects the
estimation performance when the data exhibit local scale
variations. Frequently more smoothing is needed in the tails of the
distribution, while less smoothing is needed in the dense regions.
Underlying data distribution can be more accurately described using the
variable bandwidth kernel.
By selecting a different bandwidth for each sample , the
sample point density estimator [5,26] is defined:
|
(5) |
Improvements can be obtained by setting the bandwidth
to be reciprocal to the square root [27] of the density
|
(6) |
where is the fixed bandwidth and is a proportionality
constant. Since is unknown density to be estimated, the practical
approach [26] is to use an initial pilot estimate
and to take proportionality constant as the
geometric mean of all
.
Damir Krstinic
2011-11-04