Matlab: Reading and Reformatting IRIS Dataset
This is the start of a series on classification methods. In any data analysis, the data first has to be read from a source. This article details the steps that have to be taken in Matlab to read a simple comma-delimited data file. We have chosen the IRIS dataset as our sample dataset. Please note that although there are other functions written specifically for reading data files, they may run into problems when, for example, string values exist in the file.
The first step in reading a file is to open it with the file open (fopen) function, which returns a file identifier that the later read calls use.
%Initialization of the input file
fileName = 'iris.data';
fid = fopen(fileName);
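fopen returns -1 when the file cannot be opened, so it can be worth checking the identifier before reading. The following is a minimal sketch, assuming iris.data sits in the current folder; the variable name errMsg is only illustrative.

```matlab
% Open the data file for reading and stop early if that fails.
fileName = 'iris.data';                 % assumed to be in the current folder
[fid, errMsg] = fopen(fileName, 'r');   % second output holds the system error text
if fid == -1
    error('Could not open %s: %s', fileName, errMsg);
end
```

With this check in place, a wrong path produces a clear message here rather than a confusing failure in the first read call.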
Let’s say we need to read the inputs and outputs separately. The IRIS dataset consists of 150 data points, each with 4 inputs and 1 output. The output is the name of the flower and the inputs are characteristics of the flower. The IRIS dataset is widely used to test the accuracy of different classification methods. To hold the data in Matlab at run time, the easiest option is to use arrays. However, strings of different lengths cannot be stored together in an ordinary array. To resolve this issue, the output names are compared against the known flower names and assigned to numbered categories. In later articles we will show how these categories are used to solve classification problems.
%Flower categories saved as category numbers in output array
str_out1 = 'Iris-setosa';
str_out2 = 'Iris-versicolor';
str_out3 = 'Iris-virginica';
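As a small illustration of how a flower name maps to a category number, the three names can also be kept in a cell array and looked up with strcmp. This is only a sketch of the idea; the names classNames, label and category are illustrative and are not used elsewhere in this article.

```matlab
% Map a flower name to its category number by position in a cell array.
classNames = {'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'};
label = 'Iris-versicolor';                     % example name as read from one line
category = find(strcmp(label, classNames));    % index of the matching name
if isempty(category)
    category = 0;                              % assumed marker for an unknown name
end
disp(category)                                 % displays 2
```

The numeric category can then be stored in an ordinary array next to the four inputs.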
A good coding practice is to always initialize the arrays.
%Number of data points read from the dataset so far
pcursor = 0;
%The first 4 columns of data
inputs = zeros(150,4);
%Flower categories 1, 2 or 3 are saved in this variable.
outputs = zeros(150,1);
In order to read and reformat the dataset a while loop is used with a condition that breaks the loop when the EOF is reached. Here is the while loop for reading and reformatting.
while true
    %Read next line in the file
    tline = fgetl(fid);
    %Check for EOF (fgetl returns -1) or a blank trailing line
    if ~ischar(tline) || length(tline) < 2
        break;
    end
    %Move to the next row of the arrays
    pcursor = pcursor + 1;
    %Find location of commas
    commaLocs = strfind(tline, ',');
    %Extract data from the data file
    %Input assignment
    inputs(pcursor,1) = str2double(tline(1:(commaLocs(1)-1)));
    inputs(pcursor,2) = str2double(tline((commaLocs(1)+1):(commaLocs(2)-1)));
    inputs(pcursor,3) = str2double(tline((commaLocs(2)+1):(commaLocs(3)-1)));
    inputs(pcursor,4) = str2double(tline((commaLocs(3)+1):(commaLocs(4)-1)));
    %Output assignment
    str_out = tline((commaLocs(4)+1):length(tline));
    switch str_out
        case str_out1
            outputs(pcursor,:) = 1;
        case str_out2
            outputs(pcursor,:) = 2;
        case str_out3
            outputs(pcursor,:) = 3;
    end
end
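To see what the comma positions look like for a single record, the slicing can be tried on one sample line outside the loop. The sketch below uses strfind to locate the commas; the sample line is typed in by hand rather than read from the file, and firstInput and flowerName are illustrative names.

```matlab
% Parse one comma-delimited record by slicing between comma positions.
tline = '5.1,3.5,1.4,0.2,Iris-setosa';     % sample record typed in by hand
commaLocs = strfind(tline, ',');           % positions of the commas: [4 8 12 16]
firstInput = str2double(tline(1:(commaLocs(1)-1)));    % text before the first comma -> 5.1
flowerName = tline((commaLocs(4)+1):length(tline));    % text after the last comma
fprintf('%g %s\n', firstInput, flowerName);
```

Each input is the text between two neighbouring commas, and the flower name is whatever follows the fourth comma.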
Note that the pcursor variable is set to zero during initialization and keeps track of the current row in the loop. The commaLocs variable stores the positions of the commas in the current line. Finally, the file should be closed, using its file identifier, to release the open file.
fclose(fid);
Please leave comments or tweet at: http://www.twitter.com/auto_trade