
Understanding Weka and R


samples, of irises. For this dataset, the input attributes are numerical attributes, meaning that they are given as real numbers, in this case in centimeters. The output attribute is a nominal attribute, in other words, a name for a particular species of Iris.



Working with a Dataset in Weka

When you install Weka, you can expect to find a Weka icon on your desktop. When you start Weka, you will see the Weka GUI Chooser screen, as shown in Figure 3.7.

From the Weka GUI Chooser, we can select any of the applications: Explorer, Experimenter, KnowledgeFlow, Workbench and Simple CLI. A brief description of these applications is given in Table 3.1.

In this chapter, we will cover only the Explorer application, since it lets us apply data mining algorithms directly.


3.1.1  Removing input/output attributes

Since we have just removed an attribute, this is a good point to save our work. Here, we choose to save the modified dataset in Weka's preferred .arff file format. As expected, we are now working with just 5 attributes in total, having removed the ‘Instance’ attribute, as shown in Figure 3.10.
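For readers who want to see the same step outside the GUI, the following is a minimal R sketch, not the Weka procedure itself. It assumes that R's built-in iris data frame can stand in for the chapter's dataset, adds a hypothetical Instance column to mimic the row identifier, removes it again, and then writes the result to a temporary .arff file using write.arff( ) from the foreign package, if that package is installed.

```r
# Built-in iris data: 4 numeric input attributes plus the nominal Species output.
data(iris)

# Assumption: mimic the chapter's file by adding a row-identifier column.
iris_with_id <- cbind(Instance = seq_len(nrow(iris)), iris)

# Remove the 'Instance' attribute, keeping the other 5 attributes.
iris_clean <- iris_with_id[, names(iris_with_id) != "Instance"]
print(names(iris_clean))

# Save in ARFF format (only if the 'foreign' package is available).
arff_path <- tempfile(fileext = ".arff")  # hypothetical output location
if (requireNamespace("foreign", quietly = TRUE)) {
  foreign::write.arff(iris_clean, file = arff_path)
}
```

The temporary file path is only for illustration; in practice you would pass the file name you want to open later in Weka.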

The Explorer Preprocess screen provides several types of information about our dataset. There are three main elements of this screen, as shown in Figure 3.11.

i.  The class designator,

ii.  The attribute histogram,

iii.  The attribute statistics.


3.1.1  Histogram

To see a histogram for any attribute, select it in the Attributes section of the Preprocess tab.

Figure 3.13 shows the histogram of the distribution of Petal widths for all three species. In this histogram, dark blue is Setosa, red is Versicolor, and bright blue is Virginica. The histogram shows, for example, that there are 49 samples in the lowest bin for Petal width, all of which are Iris-Setosa, and 23 samples in the highest bin, all of which are Virginica. It also shows that there are 41 samples in the middle bin, most of which are Versicolor irises while the rest are Virginica. Now, click the Visualize All button (shown in Figure 3.13) to see the histograms of all the attributes together.
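The idea behind a class-colored histogram can be sketched in R by cutting Petal width into bins and counting the species within each bin. This is only an illustration under stated assumptions: it uses R's built-in iris data frame, and the choice of three equal-width bins is hypothetical, so the counts need not match the bins Weka draws in Figure 3.13.

```r
# Built-in iris data; Petal.Width is numeric, Species is the nominal class.
data(iris)

# Assumption: 3 equal-width bins, chosen only for illustration.
bins <- cut(iris$Petal.Width, breaks = 3)

# Rows are bins, columns are species: the colored segments of each bar.
bin_by_species <- table(bins, iris$Species)
print(bin_by_species)

# Total number of samples per bin: the height of each bar.
print(rowSums(bin_by_species))
```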


First, consider the Distinct characteristic. It shows how many different values a particular attribute takes on in the dataset. For the Petal width attribute, Figure 3.16 shows a segment of the Iris dataset containing seven of the 150 samples. For just this segment of seven samples, we see four distinct values for Petal width, i.e., 0.1, 0.2, 0.6 and 0.3. There are a total of 22 distinct values for Petal width in the entire dataset of 150 samples.

The Unique characteristic on the Explorer screen tells us how many values of an attribute appear only once in the full dataset, i.e., among all 150 samples. In the seven-sample segment shown in Figure 3.16, three Petal width values (0.1, 0.6 and 0.3) each appear only once. However, in the entire dataset of 150 samples only 2 values are unique, as indicated in Figure 3.15.

To practice this concept, let us find distinct and unique values for the following dataset: 23, 45, 56, 23, 78, 90, 56, 34, 90


Distinct: 6 (23, 45, 56, 78, 90, 34)

Unique: 3 (45, 78, 34)
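The same exercise can be checked with a short R sketch. It is an illustration rather than Weka's own computation, and the variable names are chosen here for convenience. Note that R's unique( ) function returns what Weka calls the distinct values; values that occur exactly once have to be found by counting occurrences.

```r
x <- c(23, 45, 56, 23, 78, 90, 56, 34, 90)

# Distinct: every different value, counted once.
distinct_values <- unique(x)

# Unique (in Weka's sense): values that occur exactly once.
counts <- table(x)
unique_values <- as.numeric(names(counts)[counts == 1])

cat("Distinct:", length(distinct_values), "\n")
cat("Unique:", length(unique_values), "\n")
```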


3.1.1  Visualizer

It is also possible to do data visualization on our dataset from the Weka GUI Chooser. On the GUI Chooser, choose Visualization, and then Plot as shown in Figure 3.19.


Jitter: Sometimes a dataset contains overlapping data points, which makes them difficult to analyze. To spread the data slightly, we add artificial random noise to the coordinates of the plotted points. This process of adding noise is called jitter.
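The effect of jitter can be sketched with R's base jitter( ) function. This is an illustration of the general idea, not of how Weka implements its slider; the sample points and the noise amount of 0.1 are assumptions made for the example.

```r
set.seed(42)  # make the random noise reproducible

# Five points, several of which share exactly the same coordinates.
x <- c(1, 1, 1, 2, 2)
y <- c(3, 3, 3, 4, 4)

amount <- 0.1  # hypothetical noise level: each value moves by at most 0.1

# jitter() adds uniform random noise in [-amount, amount] to each value.
x_jit <- jitter(x, amount = amount)
y_jit <- jitter(y, amount = amount)

print(cbind(x, y, x_jit, y_jit))
```

A larger amount separates overlapping points more clearly but also moves them further from their true positions.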

In Weka, the jitter slider can be used when points have the same or similar x-y coordinates. Increasing jitter adds noise to the values, thereby separating overlapping points. The plot screen also shows which colors are associated with which classes on the various Weka screens. Note that plots of attributes can also be obtained from the Weka Explorer. To do this, switch from the Preprocess tab to the Visualize tab. Screenshots of Weka plots can be taken by left-clicking in the plot area that you want to capture. This brings up a Save screen with several file type options. If you want a screenshot of the entire screen, you can use your computer’s Print Screen feature.

The next section introduces R, a well-known programming language for implementing data mining algorithms.

Introduction to R

R is a programming language for statistical computing and graphics. It is named after the first letter of the first names of its two authors, Robert Gentleman and Ross Ihaka, and was developed at the University of Auckland in New Zealand. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac.

This section covers the installation of R, basic operations in R, and the loading of datasets in R. Let us discuss each of these step by step.

3.1.1  Features of R

The basic features of R are given as follows.

•  R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.

•  R has an effective data handling and storage facility.

•  R provides a suite of operators for calculations on arrays, lists, vectors and matrices.

•  R provides a large, coherent and integrated collection of tools for classification, clustering, time-series analysis, and linear and non-linear modeling.

•  R provides graphical facilities for data analysis and display, either directly on the computer screen or for printing on paper.
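A few of these language features can be seen in a short sketch. The function and variable names below are chosen for illustration only.

```r
# Conditionals and loops.
for (i in 1:3) {
  if (i %% 2 == 0) cat(i, "is even\n") else cat(i, "is odd\n")
}

# A user-defined recursive function.
factorial_rec <- function(n) {
  if (n <= 1) return(1)
  n * factorial_rec(n - 1)
}
print(factorial_rec(5))

# Operators work element-wise on whole vectors.
v <- c(1, 2, 3)
v_doubled <- v * 2
print(v_doubled)

# %*% is matrix multiplication; matrix() fills values column by column.
m <- matrix(1:4, nrow = 2)
m_squared <- m %*% m
print(m_squared)
```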

3.1.1  Installing R

R can be downloaded from one of the mirror sites available at http://cran.r-project.org/mirrors.html as shown in Figure 3.21.

Depending on your needs, you can either work at the R command prompt or write your program in an R script file. After R is started, a console waits for input. At the prompt (>), you can start typing your program.

Variable Assignment and Output Printing in R

In R, a variable name consists of letters, numbers, and the dot or underscore characters. A variable name starts with a letter, or with a dot that is not followed by a number. Variables can be assigned values using the leftward (<-), rightward (->) or equals (=) operator. The values of variables can be printed using the print( ) or cat( ) function. The cat( ) function combines multiple items into one continuous print output, as shown in Figure 3.23.
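The three assignment operators and the two printing functions can be tried at the prompt as follows. This is a minimal sketch; the variable names and values are assumptions made for illustration and are not necessarily those shown in Figure 3.23.

```r
# Leftward, rightward and equals assignment.
var1 <- 5
"hello" -> var2
var3 = 23.5

# Valid names may contain dots and underscores, and may start with a dot
# as long as the dot is not followed by a number.
.hidden_value <- 1
my.var_2 <- 2

# print() shows one object at a time.
print(var1)
print(var2)

# cat() joins several items into one continuous line of output.
cat("var1 is", var1, "and var3 is", var3, "\n")
```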


Data Types

In R, there is no need to declare the data type of variables as we do in other programming languages such as Java and C. Variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable.

The data type of a variable can be identified using the class( ) function, as shown in Figure 3.24. Here, the variables var1 and var3 have the data type numeric, since we assigned the numeric values 5 and 23.5 to var1 and var3 respectively. Similarly, var2 has the data type character, since a character string is assigned to it.
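The behavior described above can be reproduced with a short sketch. The values 5 and 23.5 come from the text; the string assigned to var2 and the two extra variables at the end are assumptions added for illustration.

```r
var1 <- 5
var2 <- "data mining"  # hypothetical string; any character value behaves the same
var3 <- 23.5

# class() reports the data type taken from the assigned R-object.
print(class(var1))  # "numeric"
print(class(var2))  # "character"
print(class(var3))  # "numeric"

# The type follows the value, so other objects give other classes.
var4 <- 5L    # the L suffix creates an integer
var5 <- TRUE
print(class(var4))  # "integer"
print(class(var5))  # "logical"
```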
