Matlab: Reading and Reformatting IRIS Dataset

This is the first in a series of articles on classification methods. In any data analysis, the data first has to be read from a source. This article details the steps needed in Matlab to read a simple comma-delimited data file, using the IRIS dataset as the sample. Please note that although Matlab has other functions dedicated to reading data files, they may run into problems when, for example, the file contains string values.

The first step in reading the file is to open it with the file open (fopen) function, which returns a file identifier used by the later read calls.

%Initialization of the input file
fileName = 'iris.data';
fid = fopen(fileName);
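
If the file cannot be found, fopen does not raise an error; it returns -1 instead, and the problem only shows up later when reading fails. As a small sketch, assuming iris.data sits in the current folder or on the Matlab path, the open call can be guarded like this (the msg variable and the error text are illustrative additions, not part of the original code):

```matlab
%Assumption: iris.data is in the current folder or on the Matlab path
fileName = 'iris.data';
%The second output of fopen holds a system message when opening fails
[fid, msg] = fopen(fileName, 'r');
if fid == -1
    error('Could not open %s: %s', fileName, msg);
end
```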

Let’s say we need to read the inputs and outputs separately. The IRIS dataset consists of 150 data points, each with 4 inputs and 1 output. The output is the name of the flower and the inputs are characteristics of the flower. The IRIS dataset is commonly used to test the accuracy of different classification methods. The easiest way to hold the data in Matlab at run time is in arrays. However, the flower names cannot be stored in a numeric array, and as rows of a character array they would all need the same length. To resolve this, each output name is compared against the known names and assigned a category number. In later articles we will show how these categories are used to solve classification problems.

%Flower categories saved as category numbers in output array
str_out1 = 'Iris-setosa';
str_out2 = 'Iris-versicolor';
str_out3 = 'Iris-virginica';
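
As a minimal illustration of the name-to-category idea, the sketch below maps one label to its number with strcmp. The classNames cell array and the sample label are hypothetical names introduced only for this example; the article itself uses the three str_out variables and a switch statement.

```matlab
%Class names listed in the order of their category numbers
classNames = {'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'};
%Hypothetical label, as it would appear at the end of one line of the file
label = 'Iris-versicolor';
%strcmp returns a logical vector; find gives the position of the match
category = find(strcmp(label, classNames));
```

A cell array is used here because, unlike a character array, it can hold strings of different lengths.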

A good coding practice is to always initialize the arrays.

%Number of data points in the dataset
no_data = 150;
%The first 4 columns of data
inputs = zeros(no_data,4);
%Flower category 1, 2 or 3 is saved in this variable
outputs = zeros(no_data,1);
%Row counter used while reading the file
pcursor = 0;

In order to read and reformat the dataset, a while loop is used with a condition that breaks the loop when the end of the file (EOF) is reached. Here is the while loop for reading and reformatting.

while true
    %Read next line in the file
    tline = fgetl(fid);
    %Check for EOF (fgetl returns -1) or a blank trailing line
    if ~ischar(tline) || length(tline) < 2
        break;
    end
    %Move to the next row of the arrays
    pcursor = pcursor + 1;
    %Find location of commas (strfind replaces the older findstr)
    commaLocs = strfind(tline, ',');
    %Extract data from the line
    %Input assignment
    inputs(pcursor,1) = str2double(tline(1:(commaLocs(1)-1)));
    inputs(pcursor,2) = str2double(tline((commaLocs(1)+1):(commaLocs(2)-1)));
    inputs(pcursor,3) = str2double(tline((commaLocs(2)+1):(commaLocs(3)-1)));
    inputs(pcursor,4) = str2double(tline((commaLocs(3)+1):(commaLocs(4)-1)));
    %Output assignment
    str_out = tline((commaLocs(4)+1):end);
    switch str_out
        case str_out1
            outputs(pcursor) = 1;
        case str_out2
            outputs(pcursor) = 2;
        case str_out3
            outputs(pcursor) = 3;
    end
end
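
To make the comma-based slicing in the loop concrete, here is a single line parsed by hand. The sample line is written in the format of iris.data, and the names firstInput and fourthInput are used only for this illustration.

```matlab
%Hypothetical sample line in the same format as iris.data
tline = '5.1,3.5,1.4,0.2,Iris-setosa';
%Positions of the four commas in the line: [4 8 12 16]
commaLocs = strfind(tline, ',');
%Everything before the first comma is the first input
firstInput = str2double(tline(1:(commaLocs(1)-1)));
%The text between the third and fourth commas is the fourth input
fourthInput = str2double(tline((commaLocs(3)+1):(commaLocs(4)-1)));
%Everything after the last comma is the flower name
str_out = tline((commaLocs(4)+1):end);
```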

Note that the pcursor variable is set to zero during initialization and keeps track of the current row in the loop. The commaLocs variable holds the positions of the commas in the line. Finally, the file should be closed to release the file identifier.

fclose(fid);
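
For comparison only, and not as part of the walkthrough above, the same kind of file can probably also be read with textscan, which accepts a mixed numeric and string format. The sketch below is written under that assumption and uses a hypothetical two-line sample saved to a temporary file so that it runs on its own; the names tmpName, C, names, sampleInputs and sampleOutputs are illustrative.

```matlab
%Write a hypothetical two-line sample to a temporary file
tmpName = [tempname '.data'];
fid = fopen(tmpName, 'w');
fprintf(fid, '5.1,3.5,1.4,0.2,Iris-setosa\n');
fprintf(fid, '7.0,3.2,4.7,1.4,Iris-versicolor\n');
fclose(fid);

%Read four numeric columns and one string column, separated by commas
fid = fopen(tmpName, 'r');
C = textscan(fid, '%f%f%f%f%s', 'Delimiter', ',');
fclose(fid);
delete(tmpName);

%Combine the numeric columns into one matrix
sampleInputs = [C{1} C{2} C{3} C{4}];
%Map each flower name to its category number
names = C{5};
[~, sampleOutputs] = ismember(names, {'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'});
```

The line-by-line loop shown earlier remains useful when each line needs custom handling or when the file layout is irregular.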

Please leave comments or tweet at: http://www.twitter.com/auto_trade


~ by infinova on March 13, 2010.

2 Responses to “Matlab: Reading and Reformatting IRIS Dataset”

  1. Thank you 🙂

    Just,

    fid = fclose(fileName);

    line is error in my matlab.. how about -> fclose(fid);

    Have nice work and research 🙂

  2. Very nice!
    thanks
