4.1 Cluster Analysis Intro

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to theobjects in other clusters

A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression

Cluster analysis tools based on k-means, k-medoids, and several methods have also been built into many statisticalanalysis software packages or systems, such as S-Plus, SPSS, and SAS.

4.1.1 Applications:

Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, dataanalysis, and image processing.

In business, clustering can help marketers discover distinct groups in their customer bases and characterizecustomer groups based on purchasing patterns.

In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations

Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value,and geographic location, as well as the identification of groups ofautomobile insurance policy holders with a high average claim cost

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity.

Clustering can also be used for outlier detection,Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.

4.1.1 Typical Requirements Of Clustering InData Mining:

Ø Scalability:

Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set maylead to biased results.

Highly scalable clustering algorithms are needed.

Ø Ability to deal with different types of attributes:

Many algorithms are designed to cluster interval-based (numerical) data. However, applications may requireclustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Ø Discovery of clusters with arbitrary shape:

Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density.

However, a cluster could be of any shape. It is important to develop algorithms thatcan detect clusters of arbitrary shape.

Ø Minimal requirements for domain knowledge to determine input parameters:

Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number ofdesired clusters). The clustering results can be quite sensitive to input parameters. Parameters are oftendifficult to determine, especially for data sets containing high-dimensional objects. This not only burdensusers, but it also makes the quality of clustering difficult to control.

Ø Ability to deal with noisy data:

Most real-world databases contain outliers or missing, unknown, or erroneous data.

Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.

Ø Incremental clustering and insensitivity to the order of input records:

Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existingclustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data.

That is, given a set of data objects, such an algorithm may return dramatically different clusteringsdepending on the order of presentation of the input objects.

It is important to develop incremental clustering algorithms and algorithms thatare insensitive to the order ofinput.

Ø High dimensionality:

A database or a data warehouse can contain several dimensionsor attributes.Many clustering algorithms aregood at handling low-dimensional data,involving only two to three dimensions. Human eyes are good at judging the qualityof clustering for up to three dimensions. Finding clusters of data objects inhighdimensionalspace is challenging, especially considering that such data can be sparseand highly skewed.

Ø Constraint-based clustering:

Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. Todecide upon this, you may cluster households while considering constraints such as the city’s rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.

Ø Interpretability and usability:

Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goalmay influence the selection of clustering features and methods.

4.2 Major Clustering Methods:

Ø Partitioning Methods

Ø Hierarchical Methods

Ø Density-Based Methods

Ø Grid-Based Methods

Ø Model-Based Methods

4.2.1 Partitioning Methods:

A partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements:

Each group must contain at least one object, and
Each object must belong to exactly one group

A partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

The general criterion of a good partitioning is that objects in the same cluster are close or related to each other,whereas objects of different clusters are far apart or very different.

4.1.1 Hierarchical Methods:

A hierarchical method creates a hierarchical decomposition ofthe given set of data objects. A hierarchical method can be classified as being eitheragglomerative or divisive, based on howthe hierarchical decomposition is formed.

v Theagglomerative approach, also called the bottom-up approach, starts with each objectforming a separate group. It successively merges the objects or groups that are closeto one another, until all of the groups are merged into one or until a termination condition holds.

v The divisive approach, also calledthe top-down approach, starts with all of the objects in the same cluster. In each successiveiteration, a cluster is split up into smaller clusters, until eventually each objectis in one cluster,or until a termination condition holds.

Hierarchical methods suffer fromthe fact that once a step (merge or split) is done,it can never be undone. This rigidity is useful in that it leads to smaller computationcosts by not having toworry about a combinatorial number of different choices.

There are two approachesto improving the quality of hierarchical clustering:

v Perform careful analysis ofobject ―linkages‖ at each hierarchical partitioning, such as in Chameleon, or

v Integratehierarchical agglomeration and other approaches by first using a hierarchicalagglomerative algorithmto group objects into microclusters, and then performingmacroclustering on the microclusters using another clustering method such as iterative relocation.

4.1.1 Density-based methods:

v Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes.

v Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; thatis, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers)and discover clusters ofarbitrary shape.

v DBSCAN and its extension, OPTICS, are typical density-based methods that growclusters according to adensity-based connectivity analysis. DENCLUE is a methodthat clusters objects based on the analysis of the value distributions of density functions.

4.1.2 Grid-Based Methods:

v Grid-based methods quantize the object space into a finite number of cells that form a grid structure.

v All of the clustering operations are performed on the grid structure i.e., on the quantized space. The mainadvantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.

v STING is a typical example of a grid-based method. Wave Cluster applies wavelet transformation for clustering analysis and is both grid-based and density-based.

4.1.3 Model-Based Methods:

v Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to thegiven model.

v A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points.

v It also leads to a way of automatically determining the number of clusters based on standard statistics, taking ―noise‖ or outliers into account and thus yielding robust clustering methods.

4.2 Tasks in Data Mining:

Ø Clustering High-Dimensional Data

Ø Constraint-Based Clustering

4.2.1 Clustering High-Dimensional Data:

It is a particularly important task in cluster analysis because many applications require the analysis of objects containing a largenumber of features or dimensions.

For example, text documents may contain thousands of terms or keywords as features, and DNA micro array data may provide information on the expression levels of thousands of genes under hundreds of conditions

Clustering high-dimensional data is challenging due to the curse of dimensionality

Many dimensions may not be relevant. As the number of dimensions increases,
thedata become increasingly sparse so that the distance measurement between pairs ofpoints becomemeaningless and the average density of points anywhere in the data islikely to be low. Therefore, a different clustering methodology needs to be developedfor high-dimensional data.

CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces ofthe data, ratherthan over the entire data space.

Frequent pattern–based clustering,another clustering methodology, extractsdistinct frequent patterns among subsets ofdimensions that occur frequently. It uses such patterns to group objects and generatemeaningful clusters

4.1.1 Constraint-Based Clustering:

It is a clustering approach that performs clustering by incorporation of user-specified or application-oriented constraints.

A constraint expresses a user’s expectation or describes properties of the desired clustering results, and providesan effective means for communicating with the clustering process.

Various kinds of constraints can be specified, either by a user or as per application requirements

Spatial clustering employs with the existence of obstacles and clustering under user- specified constraints. In addition, semi-supervised clusteringemploys forpairwise constraints in order to improvethe quality of the resulting clustering

4.1 Classical Partitioning Methods:

The mostwell-known and commonly used partitioningmethods are

v The k-Means Method

v k-Medoids Method

4.1.1 Centroid-Based Technique: The K-Means Method:

The k-means algorithm takes the input parameter, k, and partitions a set of n objects intok clusters so that theresulting intracluster similarity is high but the intercluster similarity is low.

Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as thecluster’s centroid or center of gravity.

The k-means algorithm proceeds as follows.

First, it randomly selects k of the objects, each of which initially represents a cluster mean or center.

For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distancebetween the object and the cluster mean

It then computes the new mean for each cluster.

This process iterates until the criterion function converges

Typically, the square-error criterion is used, defined as

whereE is the sum of the square error for all objects in the data set pis the point in space representing a given object

miis the mean of cluster Ci

The k-means partitioning algorithm

The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster

Clustering of a set of objects based on the k-means method

4.4.1 The k-Medoids Method:

The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort thedistribution of data. This effect is particularly exacerbated due to the use of the square-error function

Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar

Thepartitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

whereE is the sum of the absolute error for all objects in the data set

pis the point inspace representing a given object in clusterCj

ojis the representative object of Cj

The initial representative objects are chosen arbitrarily. The iterative process of replacing representative objects by non representative objects continues as long as the quality of the resulting clustering is improved.

This quality is estimated using a cost function that measures the average dissimilaritybetween an object and therepresentative object of its cluster.

To determine whether a non representative object, oj random, is a good replacement for a current representativeobject, oj, the following four cases are examined for each of the nonrepresentative objects.

Case 1:

pcurrently belongs to representative object, oj. If ojis replaced by orandomasa representative object and p is closest to one of the other representative objects, oi,i≠j, then p is reassigned to oi.

Case 2:

pcurrently belongs to representative object, oj. If ojis replaced by orandomasa representative object and p is closest to orandom, then p is reassigned to orandom.

Case 3:

pcurrently belongs to representative object, oi, i≠j. If ojis replaced by orandomas a representative object and p is stillclosest to oi, then the assignment does notchange.

Case 4:

pcurrently belongs to representative object, oi, i≠j. If ojis replaced byorandomas a representative object and p is closest to orandom, then p is reassigned to random.

Four cases of the cost function for k-medoids clustering

Thek-MedoidsAlgorithm:

The k-medoids algorithm for partitioning based on medoid or central objects.

The k-medoids method ismore robust than k-means in the presence of noise and outliers, because a medoid is lessinfluenced by outliers or other extreme values than a mean. However, its processing ismore costly than the k-means method

4.1 Hierarchical Clustering Methods:

A hierarchical clustering method works by grouping data objects into a tree of clusters

The quality of a pure hierarchical clusteringmethod suffers fromits inability to performadjustment once amerge or split decision hasbeen executed. That is, if a particular merge or split decision later turns out to have been apoor choice, the method cannot backtrack and correct it.

Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion.

Agglomerative hierarchical clustering:

This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.

Most hierarchical clustering methods belong to this category. They differ only in their definition of interclustersimilarity.

4.1.1 Divisive hierarchical clustering:

This top-down strategy does the reverse of agglomerativehierarchical clustering by starting with all objects in onecluster.

It subdividesthe cluster into smaller and smaller pieces, until each object forms a cluster on itsown or until it satisfies certain termination conditions, such as a desired number ofclusters is obtained or the diameter of each cluster is within a certain threshold.

4.1 Constraint-Based Cluster Analysis:

Constraint-based clustering finds clusters that satisfy user-specified preferences orconstraints. Depending on the nature of the constraints, constraint-based clusteringmay adopt rather different approaches.

There are a few categories of constraints.

Ø Constraints on individual objects:

We can specify constraints on the objects to beclustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the setofobjects to be clustered. It can easily be handled by preprocessing after which the problem reduces to aninstance ofunconstrained clustering.

Ø Constraints on the selection of clustering parameters:

A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired numberof clusters in a k-means algorithm; or e the radius and the minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-basedclustering.

Ø Constraints on distance or similarity functions:

We can specify different distance orsimilarity functions for specific attributes of the objects to be clustered, or differentdistance measures for specific pairs of objects.When clustering sportsmen, for example,we may use different weighting schemes for height, body weight, age, and skilllevel. Although this will likely change the mining results, it may not alter the clusteringprocess per se. However, in some cases, such changes may make the evaluationof the distance function nontrivial, especially when it is tightly intertwined with the clusteringprocess.

Ø User-specified constraints on the properties of individual clusters:

A user may like tospecify desired characteristics of the resulting clusters, which may strongly influencethe clustering process.

Ø Semi-supervised clustering based on partial supervision:

The quality of unsupervisedclustering can be significantly improved using some weak form of supervision.This may be in the formof pairwise constraints (i.e., pairs of objects labeled as belongingto the same or different cluster). Such a constrained clustering process is calledsemi-supervised clustering.

4.2 Outlier Analysis:

There exist data objects that do not comply with the general behavior or model of the data. Such data objects,which are grossly different from or inconsistent with the remaining set of data, are called outliers.

Many data mining algorithms try to minimize the influence of outliers or eliminate them all together. This, however, could result in the loss of important hidden information because one person’s noise could be another person’s signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task,referred to as outlier mining.

It can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunicationservices. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medicalanalysis for finding unusual responses to variousmedical treatments.

Outlier mining can be described as follows: Given a set of n data points or objectsand k, the expected number ofoutliers, find the top k objects that are considerablydissimilar, exceptional, or inconsistent with respect to the remaining data. The outliermining problem can be viewed as two subproblems:

Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined

Types of outlier detection:

Ø Statistical Distribution-Based Outlier Detection

Ø Distance-Based Outlier Detection

Ø Density-Based Local Outlier Detection

Ø Deviation-Based Outlier Detection

4.1.1 Statistical Distribution-Based Outlier Detection:

The statistical distribution-based approach to outlier detection assumes a distributionor probability model for the given data set (e.g., a normal or Poisson distribution) andthen identifies outliers with respect to the model using a discordancy test. Application ofthe test requires knowledge of the data set parameters knowledge of distribution parameters such as the mean and variance and theexpected number of outliers.

A statistical discordancy test examines two hypotheses:

A working hypothesis
An alternative hypothesis

A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

The hypothesis is retained if there is no statistically significant evidence supporting its rejection. A discordancy test verifies whether an object, oi, is significantly large (or small) in relation to the distribution F. Different test statistics have been proposed for use as a discordancy test, depending on the available knowledge of the data. Assuming that some statistic, T, has been chosen for discordancy testing, and thevalue of the statistic for object oi is vi, then the distribution of T is constructed. Significance probability,SP(vi)=Prob(T > vi), is evaluated. If SP(vi) is sufficiently small, then oi is discordant and the workinghypothesis is rejected.

An alternative hypothesis, H, which states that oi comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen because oimay be an outlier under one modeland a perfectly valid value under another. The alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when oi is really an outlier.

There are different kinds of alternative distributions.

Inherent alternative distribution:

In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G:

H :oi € G, where i = 1, 2,…, n

F and G may be different distributions or differ only in parameters of the same distribution.There are constraints on the form of the G distribution in that it must have potential to produce outliers. For example, it may have a different mean or dispersion, or a longer tail

Mixture alternative distribution

The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population,

G. In this case, the alternative hypothesis is

Slippage alternative distribution

This alternative states that all of the objects (apart from some prescribed small number) arise independently from the initial model, F, with its given parameters, whereas the remaining objects are independent observations from a modified version of F in which the parameters have been shifted.

There are two basic types of procedures for detecting outliers:

Block procedures:

In this case, either all of the suspect objects are treated as outliersor all of them are accepted as consistent.

Consecutive procedures:

An example of such a procedure is the insideoutprocedure. Its main idea is that the object that is least likely tobe an outlier istested first. If it is found to be an outlier, then all of the

more extreme values are alsoconsidered outliers; otherwise, the next most extreme object is tested, and so on.Thisprocedure tends to be more effective than block procedures.

4.1.1 Distance-Based Outlier Detection:

The notion of distance-based outliers was introduced to counter the main limitationsimposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB)outlier with parameters pct and dmin,that is, a DB(pct;dmin)-outlier, if at least a fraction,pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather thatrelying on statistical tests, we can think of distance-based outliers as thoseobjects that do not have enoughneighbors, where neighbors are defined based ondistance from the given object. In comparison with statistical-based methods, distancebased outlier detection generalizes the ideas behind discordancy testing for various standarddistributions. Distance-based outlier detection avoids the excessive computationthat can be associated with fitting the observed distribution into some standard distributionand in selecting discordancy tests.

For many discordancy tests, it can be shown that if an object, o, is an outlier accordingto the given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct anddmin.

For example, if objects that lie three or more standard deviations from the mean

are considered to be outliers, assuming a normal distribution, then this definition can be generalized by a DB(0.9988, 0.13s) outlier.

Several efficient algorithms for mining distance-based outliers have been developed.

Index-based algorithm:

Given a data set, the index-based algorithm uses multidimensionalindexing structures, such as R-trees or k-d trees, to search for neighbors of eachobject o within radius dminaround that object. Let Mbe the maximum number ofobjects within the dmin-neighborhood of an outlier. Therefore, onceM+1 neighborsof object o are found, it is clear that o is not an outlier. This algorithm has a worst-casecomplexity of O(n2k), where n is the number of objects in the data set and k is thedimensionality. The index-based algorithm scales well as k increases. However, thiscomplexity evaluation takes only the search time into account, even though the taskof building an index in itself can be computationally intensive.

Nested-loop algorithm:

The nested-loop algorithm has the same computational complexityas the index-based algorithm but avoids index structure construction and triesto minimize the number of I/Os. It divides the memory buffer space into two halvesand the data set into several logical blocks. By carefully choosing the order in whichblocks are loaded into each half, I/O efficiency can be achieved.

Cell-based algorithm:

To avoidO(n2) computational complexity, a cell-based algorithm was developed for memory- resident data sets. Its complexity is O(ck+n), where c is a constant depending on the number of cells and k is the dimensionality.

In this method, the data space is partitioned into cells with a side length equal to

Eachcell has two layers surrounding it. The first layer is one cell thick, while the second is

cells thick, rounded up to the closest integer. The algorithm countsoutliers on a cell-by-cell rather than an object-by-object basis. For a given cell, itaccumulates three counts—the number of objects in the cell, in the cell and the firstlayer together, and in the cell and both layers together. Let’s refer to these counts ascell count, cell + 1 layer count, and cell + 2 layers count, respectively.

Let Mbe the maximum number ofoutliers that can exist in the dmin-neighborhood of an outlier.

An object, o, in the current cell is considered an outlier only if cell + 1 layer countis less than or equal to M. If this condition does not hold, then all of the objectsin the cell can be removed from further investigation as theycannot be outliers.

If cell_+ 2_layers_count is less than or equal to M, then all of the objects in thecell are considered outliers. Otherwise, if this number is more than M, then itis possible that some of the objects in the cell may be outliers.To detect theseoutliers, object-by-object processing is used where, for each object, o, in the cell,objects in the second layer of o are examined. For objects in the cell, only thoseobjects having no more than M points intheir dmin-neighborhoods are outliers.The dmin-neighborhood of an object consists of the object’s cell, all of its firstlayer, and some of its second layer.

A variation to the algorithm is linear with respect to n and guarantees that no morethan three passes over the data set are required. It can be used for large disk-residentdata sets, yet does not scale well for high dimensions

4.1.1 Density-Based Local Outlier Detection:

Statistical and distance-based outlier detection both depend on the overall or globaldistribution of the given setof data points, D. However, data are usually not uniformlydistributed. These methods encounter difficulties when analyzing data with rather different

density distributions.

To define the local outlier factor of an object, we need to introduce the concepts ofk- distance, k-distanceneighborhood, reachability distance,13 and local reachability density.

These are defined as follows:

The k-distance of an object p is the maximal distance that p gets from its k- nearestneighbors. Thisdistance is denoted as k-distance(p). It is defined as the distance, d(p, o), between p and an object o 2 D, suchthat for at least k objects, o0 2 D, it holds that d(p, o’)_d(p, o). That is, there are at least k objects inDthat are asclose asor closer to p than o, and for at most k-1 objects, o00 2 D, it holds that d(p;o’’) <d(p, o).

That is, there are at most k-1 objects that are closer to p than o. You may bewondering at this point how k is determined. The LOF method links to density-basedclustering in that it sets k to the parameter rMinPts,which specifies the minimumnumberof points for use in identifying clusters based on density.

Here, MinPts (as k) is used to define the local neighborhood of an object, p.

The k-distance neighborhood of an object p is denoted Nkdistance(p)(p), or Nk(p)for short. By setting k to MinPts, we get NMinPts(p). It contains the MinPts-nearestneighbors of p. That is, it contains every object whose distance is not greater than theMinPts-distance of p.

The reachability distance of an object p with respect to object o (where o is within theMinPts-nearestneighbors of p), is defined as reach

distMinPts(p, o) = max{MinPtsdistance(o), d(p, o)}.

Intuitively, if an object p is far away , then the reachabilitydistance between the two is simply their actual distance.However, if they are sufficientlyclose (i.e., where p is within the MinPts-distance neighborhood of o), thentheactual distance is replaced by the MinPts- distance of o. This helps to significantlyreduce the statistical fluctuations of d(p, o) for all of the p close to o.

The higher thevalue of MinPts is, the more similar is the reachability distance for objects withinthe same neighborhood.

Intuitively, the local reachability density of p is the inverse of the average reachability density based on the MinPts-nearestneighbors of p. It is defined as

The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It is defined as

It is the average of the ratio of the local reachability density of p and those of p’s MinPts-nearest neighbors. It is easy to see that the lower p’s local reachability density is, and the higher the local reachability density of p’s MinPts-nearest neighbors are, the higher LOF(p) is.

4.1.1 Deviation-Based Outlier Detection:

Deviation-based outlier detection does not use statistical tests or distance-basedmeasures to identify exceptionalobjects. Instead, it identifies outliers by examining themain characteristics of objects in a group.Objects that―deviate‖ fromthisdescription areconsidered outliers. Hence, in this approach the term deviations is typically used to referto outliers. In this section, we study two techniques for deviation-based outlier detection.The firstsequentially compares objects in a set, while the second employs an OLAPdata cube approach.

Sequential Exception Technique:

The sequential exception technique simulates the way in which humans can distinguishunusual objects from among a series of supposedly like objects. It uses implicit redundancyof the data. Given a data set, D, of n objects, it builds a sequence of subsets,{D1, D2, …,Dm}, of these objects with 2<=m <= n such that

Dissimilarities are assessed between subsets in the sequence. The technique introducesthe following key terms.

Exception set:

This is the set of deviations or outliers. It is defined as the smallestsubset of objects whose removal results in the greatest reduction of dissimilarity in the residual set.

Dissimilarity function:

This function does not require a metric distance between theobjects. It is any function that, if given a set of objects, returns a lowvalue if the objectsare similar to one another. The greater the dissimilarity among theobjects, the higherthe value returned by the function. The dissimilarity of a subset is incrementally computedbased on the subset prior to it in the sequence. Given a subset of n numbers, {x1, …,xn}, a possible dissimilarity function is the variance of the numbers in theset, that is,

where x is the mean of the n numbers in the set. For character strings, the dissimilarityfunction may be in the form of a pattern string (e.g., containing wildcard charactersthat is used to cover all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the strings in Dj-1 does not cover any string in Dj that isnot in Dj-1.

Cardinality function:

This is typically the count of the number of objects in a given set.

Smoothing factor:

This function is computed for each subset in the sequence. Itassesses how much the dissimilarity can be reduced byremoving the subset from theoriginal set of objects.