➢ Scalability:

Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results.

Highly scalable clustering algorithms are needed.

➢ Ability to deal with different types of attributes:

Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

➢ Discovery of clusters with arbitrary shape:

Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density.

However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
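For concreteness, the two distance measures mentioned above can be sketched in a few lines of Python (the function names here are illustrative, not taken from any particular library):

```python
from math import sqrt

def euclidean(p, q):
    # Straight-line distance: square root of the summed squared differences.
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of the absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

# For the points (0, 0) and (3, 4):
print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

An algorithm that forms clusters by thresholding either measure around a center point naturally favors compact, roughly spherical groups, which is why detecting arbitrary shapes requires different machinery.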

➢ Minimal requirements for domain knowledge to determine input parameters:

Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.

➢ Ability to deal with noisy data:

Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.

➢ Incremental clustering and insensitivity to the order of input records:

Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data.

That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects.

It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.

➢ High dimensionality:

A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.

➢ Constraint-based clustering:

Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city’s rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.

➢ Interpretability and usability:

Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

A Categorization of Major Clustering Methods:

➢ Partitioning Methods

➢ Hierarchical Methods

➢ Density-Based Methods

➢ Grid-Based Methods

➢ Model-Based Methods

Partitioning Methods:

A partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements:

Each group must contain at least one object, and

Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.

Hierarchical Methods:

A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.

❖ The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.

❖ The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.

There are two approaches to improving the quality of hierarchical clustering:

❖ Perform careful analysis of object “linkages” at each hierarchical partitioning, such as in Chameleon, or

❖ Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation.
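The bottom-up idea is easy to see in code. Below is a minimal single-linkage sketch in Python (1-D points, brute-force pair search); it is illustrative only, not the algorithm used by Chameleon:

```python
def single_linkage(points, k):
    """Agglomerative (bottom-up) clustering: start with one cluster per
    object and repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]          # each object starts as its own group
    while len(clusters) > k:
        # Single-linkage distance: closest pair of points, one from each cluster.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge: a step that is never undone
    return clusters

# 1-D example with two obvious groups:
print(single_linkage([1, 2, 3, 10, 11, 12], 2))  # [[1, 2, 3], [10, 11, 12]]
```

Note that the merge on the last line of the loop is irreversible, which is exactly the rigidity described above; the microclustering idea mitigates it by deferring final decisions to a second method.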

Density-based methods:

❖ Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes.

❖ Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.

❖ DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.
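The radius-and-minimum-points rule can be sketched as follows. This is a simplified 1-D version of the DBSCAN idea, using the conventional parameter names eps (neighborhood radius) and min_pts (density threshold); it is not a faithful reimplementation of DBSCAN itself:

```python
def density_cluster(points, eps, min_pts):
    """Grow clusters while each point's eps-neighborhood holds at least
    min_pts points; points that never qualify remain noise (label -1).
    Assumes distinct numeric points."""
    labels = {p: None for p in points}
    cluster_id = -1
    for p in points:
        if labels[p] is not None:
            continue
        neighbors = [q for q in points if abs(p - q) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1                     # noise (may be claimed later)
            continue
        cluster_id += 1                        # start a new cluster at core point p
        labels[p] = cluster_id
        frontier = list(neighbors)
        while frontier:                        # keep growing while density holds
            q = frontier.pop()
            if labels[q] in (None, -1):
                labels[q] = cluster_id
                q_neighbors = [r for r in points if abs(q - r) <= eps]
                if len(q_neighbors) >= min_pts:    # q is itself a core point
                    frontier.extend(r for r in q_neighbors
                                    if labels[r] in (None, -1))
    return labels
```

Because clusters grow by neighborhood connectivity rather than by distance to a single center, chains of dense points of any shape end up in one cluster, while the isolated point is filtered out as noise.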

Grid-Based Methods:

❖ Grid-based methods quantize the object space into a finite number of cells that form a grid structure.


❖ All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.

❖ STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based.
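A minimal sketch of the quantization step, assuming 2-D numeric points and a hypothetical cell_size parameter; real grid-based methods then cluster over these cell summaries rather than over the raw points:

```python
def grid_counts(points, cell_size):
    """Quantize 2-D points into grid cells; later clustering steps work
    on the per-cell summaries instead of the raw objects."""
    cells = {}
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))  # cell holding (x, y)
        cells[cell] = cells.get(cell, 0) + 1               # per-cell point count
    return cells

counts = grid_counts([(0.2, 0.3), (0.8, 0.9), (5.1, 5.2)], cell_size=1.0)
print(counts)  # {(0, 0): 2, (5, 5): 1}
```

Any pass over counts touches only the occupied cells, so its cost depends on the grid resolution rather than on the number of input objects, which is the speed advantage noted above.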

Model-Based Methods:

❖ Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model.

❖ A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points.

❖ It also leads to a way of automatically determining the number of clusters based on standard statistics, taking “noise” or outliers into account and thus yielding robust clustering methods.

Tasks in Data Mining:

➢ Clustering High-Dimensional Data

➢ Constraint-Based Clustering

Clustering High-Dimensional Data:

It is a particularly important task in cluster analysis because many applications require the analysis of objects containing a large number of features or dimensions.

For example, text documents may contain thousands of terms or keywords as features, and DNA microarray data may provide information on the expression levels of thousands of genes under hundreds of conditions.

Clustering high-dimensional data is challenging due to the curse of dimensionality.

Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data.
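The sparsity effect can be observed directly. The sketch below (illustrative only; contrast is not a standard library function) compares the relative spread between the nearest and farthest of n random points as the dimensionality grows:

```python
import random

def contrast(dim, n=100, seed=1):
    """Relative contrast (d_max - d_min) / d_min between a random query
    point and n random points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n):
        p = [rng.random() for _ in range(dim)]
        dists.append(sum((a - b) ** 2 for a, b in zip(query, p)) ** 0.5)
    return (max(dists) - min(dists)) / min(dists)

# As dim grows, pairwise distances concentrate around a common value,
# so the contrast shrinks and "nearest point" loses its meaning:
# contrast(2) is far larger than contrast(100).
```

This distance concentration is one face of the curse of dimensionality and is the motivation for searching in subspaces instead of the full space.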

CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data, rather than over the entire data space.

Frequent pattern–based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters.

Constraint-Based Clustering:

It is a clustering approach that performs clustering by incorporating user-specified or application-oriented constraints.

A constraint expresses a user’s expectation or describes properties of the desired clustering results, and provides an effective means for communicating with the clustering process.

Various kinds of constraints can be specified, either by a user or as per application requirements.

Spatial clustering is employed in the presence of obstacles and under user-specified constraints. In addition, semi-supervised clustering employs pairwise constraints in order to improve the quality of the resulting clustering.

Classical Partitioning Methods:

The most well-known and commonly used partitioning methods are

❖ The k-Means Method

❖ k-Medoids Method

Partitioning Clustering: The k-Means Method:

The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.

Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.

The k-means algorithm proceeds as follows.

❖ First, it randomly selects k of the objects, each of which initially represents a cluster mean or center.

❖ For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean.

❖ It then computes the new mean for each cluster.

❖ This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p − m_i|²

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and m_i is the mean of cluster C_i.
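The steps above can be sketched in plain Python. This is a minimal illustrative implementation (the function and parameter names are my own, not the book's pseudocode):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """k-means: pick k initial means at random, assign each object to the
    nearest mean, recompute the means, repeat until assignments stabilize."""
    rng = random.Random(seed)
    means = rng.sample(points, k)              # step 1: k random objects as centers
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each object to the cluster with the nearest mean
        new_assignment = [
            min(range(k),
                key=lambda i: sum((p - m) ** 2 for p, m in zip(pt, means[i])))
            for pt in points
        ]
        if new_assignment == assignment:       # criterion stopped improving
            break
        assignment = new_assignment
        # step 3: recompute each cluster mean as the centroid of its members
        for i in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == i]
            if members:
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assignment

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
means, assign = kmeans(pts, 2)
print(sorted(means))  # [(0.0, 0.5), (10.0, 10.5)]
```

Each iteration can only decrease the square-error E defined above, which is why the relocation loop converges; the result, however, depends on the random initial centers.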
