
Grid-based clustering with adaptive kernel density estimation

Grid-based clustering methods discretize the data domain into a finite number of cells, forming a multidimensional grid structure on which the operations are performed. These techniques are typically fast since they depend on the number of cells in each dimension of the quantized space rather than the number of actual data objects [28, p. 243].

We propose a new two-step technique for discrete density estimation. First, an initial histogram is created by counting the input data samples that populate each cell of the discretized domain. Second, the sample point density estimator (5) is applied to obtain the final density estimate, with the initial estimate used to compute the variable bandwidth (6). We now proceed to introduce the general grid-based clustering technique.


Discrete density estimation

Discrete density estimation for the data set $ D=\{x_1, ..., x_N\} \subset \mathbb{R}^d$ starts with domain discretization. A bounding $ d$-dimensional hyperrectangle is determined and partitioned into $ d$-dimensional hypercubes with side length $ \sigma$. Since the density at a point $ x$ is strongly influenced by a limited set of samples near that point, it can be approximated by a local density function $ \hat{f}^D(x)$. We adopt the sample point density estimation technique (5), in which each sample is assigned a variable bandwidth (6), resulting in the adaptive local density function:
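The domain discretization and the counting step can be sketched as follows; this is a minimal illustration assuming NumPy, with the cell width taken as given (the helper name and sample data are illustrative, not from the text):

```python
import numpy as np

def initial_histogram(D, sigma):
    """Count samples per hypercubic cell of side `sigma` (initial estimate)."""
    D = np.asarray(D, dtype=float)
    lo = D.min(axis=0)                            # corner of the bounding hyperrectangle
    idx = np.floor((D - lo) / sigma).astype(int)  # cell index of each sample
    shape = idx.max(axis=0) + 1
    hist = np.zeros(shape, dtype=int)
    np.add.at(hist, tuple(idx.T), 1)              # one pass over the data
    return hist, lo

rng = np.random.default_rng(0)
D = rng.normal(size=(500, 2))
hist, lo = initial_histogram(D, sigma=0.5)
print(hist.sum())   # all 500 samples are counted
```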

$\displaystyle \hat{f}^D(x) = \frac{1}{N} \sum_{x_l\in \mathcal{N}_l(x)} \frac{1}{h(x_l)^d}K \left( \frac{x-x_l}{h(x_l)} \right)$ (7)

The variable bandwidth is defined by:

$\displaystyle h(x_l) = h_0 \sqrt{\frac{\lambda}{\tilde{f}^D(x_l)}},$ (8)

where $ \tilde{f}^D$ is the initial density estimate obtained by counting the data samples populating each cell of the multidimensional histogram. The proportionality constant $ \lambda$ is set to the geometric mean of $ \tilde{f}^D$ [26]. The adaptive neighborhood of each sample $ x_l$ is defined by:

$\displaystyle \mathcal{N}_l(x) = \left\{ x_l : d(x_l,x) \leq \tau(x_l)\sigma \right\}$ (9)

where $ \tau(x_l)$ is the variable proportionality factor. Following (6), we define $ \tau(x_l)$ by:

$\displaystyle \tau(x_l) = \tau_0 \sqrt{\frac{\lambda}{\tilde{f}^D(x_l)}},$ (10)

where $ \tau_0$ is the fixed proportionality factor. The cell width $ \sigma$ is computed from the fixed bandwidth $ h_0$ by:

$\displaystyle \sigma = \frac{h_0}{\rho}$ (11)

where $ \rho$ is the discretization parameter. Setting the bandwidth to a multiple of the cell width intuitively prescribes the set of neighboring cells in which the contribution of each sample should be accounted for.
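The relations (8), (10) and (11) can be sketched in code; a minimal version assuming NumPy, with illustrative values for $h_0$, $\tau_0$ and $\rho$ and with the geometric mean taken over populated cells:

```python
import numpy as np

def variable_bandwidth(f_tilde, h0, tau0, rho):
    """Per-cell bandwidth h, neighborhood factor tau, and cell width sigma."""
    pop = f_tilde[f_tilde > 0]                # populated cells only
    lam = np.exp(np.mean(np.log(pop)))        # geometric mean of f~ (lambda)
    scale = np.sqrt(lam / np.where(f_tilde > 0, f_tilde, np.inf))
    h = h0 * scale                            # variable bandwidth (8)
    tau = tau0 * scale                        # proportionality factor (10)
    sigma = h0 / rho                          # cell width (11)
    return h, tau, sigma

f_tilde = np.array([4.0, 1.0, 1.0, 4.0])      # toy cell counts
h, tau, sigma = variable_bandwidth(f_tilde, h0=1.0, tau0=2.0, rho=2.0)
print(sigma)    # 0.5
```

Note how denser cells ($\tilde{f}^D = 4$, above the geometric mean of 2) receive a bandwidth below $h_0$, while sparse cells receive a larger one.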

The introduced PDF estimation technique requires two passes through the data to compute the initial and the final density estimates. The efficiency of this procedure can be increased so that it requires only one pass, while preserving accuracy. Let

$\displaystyle Z(x, x_j) = \frac{1}{h(x_j)^d}K\left(\frac{x-x_j}{h(x_j)} \right)$ (12)

be the contribution of the sample $ x_j$ at point $ x$. Inserting (12) into (7) yields

$\displaystyle \hat{f}^D(x) = \frac{1}{N} \sum_{x_l\in \mathcal{N}_l(x)} Z(x, x_l)$ (13)

Let $ c_u$ denote the spatial coordinates of the center of the $ u$-th histogram cell. The contribution of all samples which populate the $ v$-th cell to the density of the $ u$-th cell is approximately

$\displaystyle \hat{Z}(c_u, c_v) = \tilde{f}^D(c_v) \frac{1}{h(c_v)^d}K\left(\frac{c_u-c_v}{h(c_v)} \right)$ (14)

The contribution is accounted for as if all samples were located at the center of the cell: the adaptively scaled kernel centered at the $ v$-th cell is weighted by the number of samples populating that cell. Inserting (14) into (13) we obtain:

$\displaystyle \hat{f}^D(c_u) \approx \frac{1}{N} \sum_{c_l : d(c_u, c_l) \leq \tau(c_l)\sigma} \hat{Z}(c_u, c_l),$ (15)

where $ \hat{f}^D(c_u)$ is the estimated discrete density of the $ u$-th cell. The approximation (14) accounts for the contribution of all samples populating a single cell in one pass.
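A minimal one-dimensional sketch of the one-pass estimate (15), assuming NumPy and a Gaussian kernel; the kernel choice and the parameter values are illustrative assumptions, not fixed by the text:

```python
import numpy as np

def cell_density(f_tilde, sigma, h, tau):
    """Approximate density at cell centers from cell counts (single pass)."""
    M = len(f_tilde)
    N = f_tilde.sum()
    centers = (np.arange(M) + 0.5) * sigma
    f_hat = np.zeros(M)
    for v in np.flatnonzero(f_tilde):            # one pass over populated cells
        reach = tau[v] * sigma                   # adaptive neighborhood radius
        mask = np.abs(centers - centers[v]) <= reach
        u = centers[mask]
        # contribution Z^(c_u, c_v): cell count times the scaled kernel (14)
        K = np.exp(-0.5 * ((u - centers[v]) / h[v]) ** 2) / np.sqrt(2.0 * np.pi)
        f_hat[mask] += f_tilde[v] * K / h[v]
    return f_hat / N

f_tilde = np.array([0, 3, 5, 2, 0])              # toy histogram
f_hat = cell_density(f_tilde, sigma=1.0, h=np.ones(5), tau=2.0 * np.ones(5))
print(np.argmax(f_hat))   # densest cell: index 2
```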

The basic feature of all grid-based clustering techniques is low computational complexity. However, discretization unavoidably introduces error and lowers the performance in disclosing clusters of different scales, which are often present in real data. Some techniques attempt to solve this problem by allowing arbitrary mesh refinement in densely populated areas [29,30]. The resulting grid structure has complex neighboring relations between adjacent cells, which introduces computational overhead into the clustering procedure. Moreover, an additional pass through the data is often required to construct the adaptive grid.

The proposed technique is based on a single-resolution grid, so a simple hill-climbing procedure can efficiently detect modes and the associated basins of attraction. The estimator self-tunes to local scale variations in the data set through the variable bandwidth and the adaptive neighbourhood. The density estimation algorithm requires one pass through the input data and an additional pass through the populated cells, yielding computational complexity $ O(N+rM)$, where $ N$ is the cardinality of the data set, $ M$ is the number of populated cells, and the constant $ r$, proportional to $ \rho^d$, is the average number of neighboring cells in which a kernel contribution is accounted for.


Clustering and noise level

Clustering is based on determining strong density attractors and the associated basins of attraction of the estimated discrete density. The subset of input samples pertaining to the basin of attraction of a strong attractor is assigned a unique cluster label. Samples pertaining to basins of attraction of attractors with density below the noise level $ \xi$ are labeled as noise. The stepwise hill-climbing procedure, started from populated cells (satisfying $ \tilde{f}^D > 0$), has two stopping criteria:
  1. A local maximum is detected
    If the density $ \hat{f}^D$ of the detected maximum is below the noise level $ \xi$, all cells on the path of the hill-climbing procedure are labeled as noise. Otherwise, a strong attractor has been detected and a new label denoting a new cluster is assigned to all cells on the path of the procedure.
  2. A cell with an already assigned label is detected
    Continuation of the procedure from this point leads to an already detected maximum. The encountered label is assigned to all cells on the path of the hill-climbing procedure, regardless of whether it represents noise or clustered data.
The clustering procedure stops when all populated cells are labeled. Since prior knowledge would be required to anticipate an absolute noise level, we define $ \xi$ relative to the PDF by:

$\displaystyle \xi = \varepsilon \Lambda,$ (16)

where $ \Lambda$ is the geometric mean of $ \hat{f}^D$ and $ \varepsilon $ is an algorithm parameter. Note that the value $ \Lambda$ differs from the proportionality constant $ \lambda$, which is defined as the geometric mean of the initial density estimate. The computational complexity of the hill-climbing procedure is $ O(M)$.
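The labeling procedure with the relative noise level (16) can be sketched in one dimension; this is a hedged illustration assuming NumPy, where the step rule (move to the denser adjacent cell) and the label encoding are our assumptions, not specified in the text:

```python
import numpy as np

def hill_climb_labels(f_hat, f_tilde, eps):
    """Label each populated cell by stepwise hill-climbing on f_hat."""
    pop = f_hat[f_tilde > 0]
    Lam = np.exp(np.mean(np.log(pop)))   # geometric mean of the final estimate
    xi = eps * Lam                       # relative noise level (16)
    labels = np.full(len(f_hat), -2)     # -2: unvisited, -1: noise, >=0: cluster
    next_label = 0
    for start in np.flatnonzero(f_tilde):
        if labels[start] != -2:
            continue
        path, u = [], start
        while True:
            path.append(u)
            nbrs = [v for v in (u - 1, u + 1) if 0 <= v < len(f_hat)]
            best = max(nbrs, key=lambda v: f_hat[v])
            if f_hat[best] <= f_hat[u]:      # criterion 1: local maximum
                if f_hat[u] < xi:
                    lab = -1                 # weak attractor: noise
                else:
                    lab = next_label         # strong attractor: new cluster
                    next_label += 1
                break
            if labels[best] != -2:           # criterion 2: labeled cell reached
                lab = labels[best]
                break
            u = best
        labels[np.array(path)] = lab
    return labels

f_hat = np.array([0.1, 0.5, 0.9, 0.4, 0.2, 0.6, 1.0, 0.3])  # two modes
labels = hill_climb_labels(f_hat, np.ones(8), eps=0.1)
print(labels.tolist())   # [0, 0, 0, 0, 1, 1, 1, 1]
```

With two modes at indices 2 and 6 and a small $\varepsilon$, both maxima exceed $\xi$, so every cell ends up in one of two clusters and none is labeled noise.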

Damir Krstinic 2011-11-04