Connecting Matlab to MSSQL: Preparation Steps

April 28, 2011

Keeping data in a database has several advantages over a conventional file system, such as reliability, fast access, and the ability to play back records. For these reasons, databases are well suited to analysis tasks where a large amount of data has to be stored and constantly analysed.

My research background is in Machine Learning, and I have always had to look at different sections of data in different ways in order to understand it and design my algorithms. Building an efficient database takes a lot of research and practice, which comes with time. But for now, let’s focus on a very basic question: how do we connect from Matlab to MSSQL? This will give us a head start on the journey.

I start with a brief explanation of setting up MSSQL tables, and then cover a few settings that should be configured before connecting; in my next post I will include a sample program showing how to write data to and read data from an MSSQL database. All screenshots are from MSSQL 2000, but the interface hasn’t changed much in newer versions, so you should still be able to follow along.

Creating a Database in MSSQL 2000

Databases can be created and modified using SQL Server Enterprise Manager. Open Enterprise Manager, right-click on Databases, and select New Database.

[Screenshot: New Database]

Creating a Table 

Once the database is created, a new node appears where tables can be set up. Again, you can build a new table with a simple right-click and New Table.

[Screenshot: New Table]

In the window above we can set each column’s name and data type. For example, if the first column is an ID and we want the ID to increment for each new row, we can set its data type to ‘int’ and, in the column properties, set Identity to ‘Yes’ with an Identity Increment of 1.

Security Setting in SQL

We can secure our database by adding a user name and password. To do this, go to Security, open ‘Logins’, add a new login, and select the database name ’check’. We can also choose either Windows authentication or SQL Server authentication.

[Screenshot: Security Setting]

Under Database Access, select the database and permit full access to it, because we want to access this database from MATLAB.

Connection with MATLAB

To make the connection from MATLAB, we need the Database Toolbox installed. The Database Toolbox supports both Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC).

We use the Windows ODBC driver to connect MATLAB and SQL Server. To set it up, go to Control Panel -> Administrative Tools -> Data Sources (ODBC) and add a new data source. The ODBC administrator can create data sources for Excel, Access, MS SQL, MySQL and more.

Finally we test the connection to see if everything is working.

In my next post, we will develop Matlab code to read from and write to the MSSQL database.

Matlab: Reading and Reformatting IRIS Dataset

March 13, 2010

This is the start of a series on classification methods. In any data analysis, the data first has to be read from a source. This article details the steps taken in Matlab to read a simple comma-delimited data file. Here, we have chosen the IRIS dataset as our sample dataset. Please note that although there are other functions specifically for reading data files, they may run into problems when, for example, string values exist in the file.

The first step in reading the file is to open it by passing its name to the file open (fopen) function.

%Initialization of the input file
fileName = 'iris.data';
fid = fopen(fileName);

Let’s say we need to read the inputs and outputs separately. The IRIS dataset consists of 150 data points, each with 4 inputs and 1 output. The output is the name of the flower and the inputs are characteristics of the flower. The IRIS dataset was designed to test the accuracy of different classification methods. To have the data available in Matlab at run time, the easiest option is to use arrays. However, strings cannot be stored in numeric arrays because their lengths are inconsistent. To resolve this issue, the output names are compared with the three flower names and assigned to numeric categories. In later articles we will show how these categories are used to solve classification problems.

%Flower categories saved as category numbers in output array
str_out1 = 'Iris-setosa';
str_out2 = 'Iris-versicolor';
str_out3 = 'Iris-virginica';

A good coding practice is to always initialize the arrays.

%Number of data points in the dataset
no_data = 150;
%The first 4 columns of data
inputs = zeros(no_data,4);
%Flower categories 1, 2 or 3 is saved in this variable.
outputs = zeros(no_data,1);
%Row cursor, incremented for every line read from the file
pcursor = 0;

In order to read and reformat the dataset, a while loop is used with a condition that breaks the loop when the end of the file (EOF) is reached. Here is the while loop for reading and reformatting.

while(1)
    pcursor = pcursor + 1;
    %Read next line in the file
    tline = fgetl(fid);
    %Check for EOF
    if (length(tline) < 2)
        break;
    end
    %Find location of commas
    commaLocs = findstr(',', tline);
    %Extract data from the data file
    %Input assignment
    inputs(pcursor,1) = str2double(tline(1:(commaLocs(1)-1)));
    inputs(pcursor,2) = str2double(tline((commaLocs(1)+1):(commaLocs(2)-1)));
    inputs(pcursor,3) = str2double(tline((commaLocs(2)+1):(commaLocs(3)-1)));
    inputs(pcursor,4) = str2double(tline((commaLocs(3)+1):(commaLocs(4)-1)));
    %Output assignment
    str_out = tline((commaLocs(4)+1):length(tline));
    switch str_out
        case str_out1
            outputs(pcursor,:) = 1;
        case str_out2
            outputs(pcursor,:) = 2;
        case str_out3
            outputs(pcursor,:) = 3;
    end
end

It should be noted that the pcursor variable is initialized to zero (see the initialization block above) and is responsible for keeping track of the current row while looping. The commaLocs variable holds the locations of the commas in the current line. Finally, the file should be closed to release the file handle.

%Close the file to release the file handle
fclose(fid);

Please leave comments or tweet at: http://www.twitter.com/auto_trade

Motor Fault Diagnosis: Overload

January 28, 2010

The overload fault occurs as a result of loading the motor beyond its nominal load value. Initially this results in an increase in current to compensate for the torque and to overcome the negative current generated as a result of the overload (because of flux collision). This compensation stabilizes the operation of the motor through the feedback link between the motor and the generator. The compensation process stops when the generator reaches its current limits.

As a result, in minor overload cases the motor works inefficiently and heat is generated; in induction motors the slip angle is increased. Finally, if the generated current is not sufficient to overcome the negative current, the load and the friction, the motor enters the locked mode, where it is 100% inefficient. In severe cases the roles of the motor and the load can change: the motor can act as a generator if the negative current exceeds the generator current.

The overload fault is important to manufacturers because it can result in an inefficient production line with less output and an increase in energy consumption. Finally, the overload generates heat, which can cause insulation meltdown, short circuits and critical failure of the system.

Please leave us your comments and contact us for customised consultancy regarding intelligent condition monitoring and fault detection platforms.

Model-based Vs. Model-less AI Fault Diagnosis

January 27, 2010

The main aim of diagnostics is the identification of the cause-effect chain, which cannot be fully achieved by relying only on the available machine parameters.

Restricting the analysis to deterministic systems means that faulty machines can be modelled successfully. Relying on these models, the behaviour of a specific machine can be accurately analysed, but extending the results to other machines is not straightforward. Another issue is the accuracy of the model and whether it applies to all operating modes of the machine. The majority of machine models only correspond to the linear part of the system. For example, during motor overload, where the motor operates in a non-linear region, or during short winding faults, where the number of windings and eventually the core changes, the linear model is not applicable.

Artificial Intelligence (AI), on the other hand, is not as accurate in the linear part of the system, but it can extend itself to the non-linear part of the operation, or even to other similar machinery of the same type.

Classification of Stator Electrical Faults in Rotary Machinery

January 26, 2010

Faults in electro-mechanical rotary machinery, or motors, are divided into two major categories: electrical and mechanical faults. Here, the electrical faults that appear in the stator of motors are briefly explained. Basically, two classes of stator winding failure can be considered, namely asymmetry in the stator winding, such as an open phase, and a short circuit of a few turns in a phase winding.

The former allows the machine to operate at reduced torque, while the latter leads to catastrophic failure in a short time. In the case of winding asymmetry, the motor parameters, and hence the model, change, while with shorted turns the structure of the equations changes because of the increased number of state variables.

A short-circuit fault is one of the most difficult faults to detect. The traditional protection circuit might not trip and the motor might keep running. The heating in the circuit then causes the insulation to melt down. If undetected, the fault will propagate and result in phase-to-phase or phase-to-ground faults. Ground current ultimately results in irreversible damage to the core, and the machine has to be removed from the process.

Distance between documents

January 26, 2010

A ‘distance’ between documents. Granted, it is an odd concept, but one that works reasonably well. After you have converted your documents into bags of words, assigning a numerical value to each token, you can then calculate the distance between two of these documents.

Imagine a 3D vector space in which a single document resides as a point. This vector space also contains all your other documents, each at a particular distance from the others. But this would only be the case if your documents contained just 3 tokens. If they contain 20 tokens, then you would have to consider a 20D vector space, so a vector space is usually referred to as n-dimensional. Furthermore, the distance is built up dimension by dimension: you only ever compare the value of the same token in two different documents.

The most common measure is Euclidean distance, and it is the one that will be discussed here. Mahalanobis distance is another measure, which has the advantage of being unaffected by scale (scale invariant), something that is not an issue here as tf*idf values are normalised, but it is also a bit more computationally intensive.

function euclidean(array $objectX, array $objectY) {
	// Assuming we're dealing with two associative arrays.
	// $objectX[token] = tfidf
	$missingValue = 0.0001;
	$dist = 0;
	// Union of the tokens appearing in either document
	$tokens = array_keys(array_merge($objectX, $objectY));
	foreach($tokens as $token) {
		// Pad a token missing from one document with a small value
		if(!isset($objectY[$token])) { $objectY[$token] = $missingValue; }
		if(!isset($objectX[$token])) { $objectX[$token] = $missingValue; }
		$dist += pow(($objectX[$token] - $objectY[$token]), 2);
	}
	// Note: this is the squared Euclidean distance; take sqrt($dist) for the true distance
	return $dist;
}

The most important point to make is the thinking behind $missingValue. If a token is present in one document but missing from the other, we assign a small value to the document it is missing from (i.e. the token has little bearing upon the meaning of that document). If it is missing from both, it contributes 0 to the distance, so on that token we have a perfect match (which is appropriate). You can use either pow(,2) or abs() to make each term an unsigned value.

Consequently, the distance between two documents is ‘the sum of their parts’.
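To see the function in use, here is a quick, hedged sketch; the two tf*idf vectors below are invented purely for illustration and are not taken from any real corpus.

// Hypothetical tf*idf vectors for two short documents (values made up for the example)
$docA = array('hello' => 0.027, 'world' => 0.012, 'spam' => 0.080);
$docB = array('hello' => 0.030, 'world' => 0.010);

// 'spam' is missing from $docB, so euclidean() pads it with the small $missingValue
echo euclidean($docA, $docB); // prints a small positive (squared) distance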

TF*IDF Weighting

January 26, 2010

What is this used for?

You have a document such as an email, but you need to turn each word, or token, into a numeric value for that document so that it can be meaningfully classified or processed. This is typically known as a ‘bag of words’ approach, where the meaning of a document is based purely upon the words (in no particular order) contained within it. To improve the accuracy of the method, but only in cases where you have lots of data in your corpus (collection of documents), you can change this to a bigram or trigram model, where tokens are combined into sequences two or three words long. This is great because it captures each word in its context, but on the other hand, if there isn’t much data available, there won’t be enough bigrams or trigrams to match against new data. Another approach is to use Porter’s stemming algorithm, but the meanings of individual tokens can be lost during the stemming process, so this approach isn’t always favoured.
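As an aside, here is a minimal sketch of how bigrams might be built from an ordered token array; the function name and approach are my own for illustration, not something from the original post.

// Combine neighbouring tokens into two-word bigrams, e.g. array('free', 'money', 'now')
// becomes array('free money', 'money now'). Trigrams work the same way with a window of 3.
function bigrams(array $tokens) {
	$bigrams = array();
	for ($i = 0; $i < count($tokens) - 1; $i++) {
		$bigrams[] = $tokens[$i] . ' ' . $tokens[$i + 1];
	}
	return $bigrams;
}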

The Wikipedia article also explains this principle quite well. Though, evidently, you need to get the tokens from the document in the first place, and I would suggest using some PHP like this, which puts spaces around exclamation marks, question marks, full stops, commas, asterisks and speech marks, but leaves words like ‘don’t’ as one token, without splitting them.

// Pad the listed punctuation with spaces so each mark becomes its own token,
// then split on whitespace, ignoring empty strings
$message = preg_replace("/([!?.,\*\"]{1})/", " \\1 ", $message);
$tokens = preg_split("/\s+/", $message, -1, PREG_SPLIT_NO_EMPTY);

Term Frequency (or TF)

There are a few alternatives to TF*IDF weighting, though I’ve used it frequently in the past and it usually does a good job. To start off with, what does ‘term frequency’ (TF) refer to? For each token in the tokens array, we count the number of times the token appears in the current email. To normalise this against the email’s length (if one email is longer than another it will have more tokens in it, and consequently there’s a higher probability of the same token occurring more often), we divide the count by the total number of tokens in the current email. So that’s term frequency, or TF.
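As a rough sketch of the idea (the function name is mine, not from the original post), term frequencies for a single document could be computed like this:

// Term frequency: the number of occurrences of each token in the document,
// divided by the total number of tokens in that document
function termFrequencies(array $tokens) {
	$counts = array_count_values($tokens);
	$total = count($tokens);
	$tf = array();
	foreach ($counts as $token => $count) {
		$tf[$token] = $count / $total;
	}
	return $tf;
}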

Inverse Document Frequency (or IDF)

Now, inverse document frequency. This sounds a bit scarier and takes a little more work to calculate, but it is also reasonably simple, and it measures the relative importance of the token (for disambiguating a document from the others). For each token, we count the number of documents the token appears in, divide the total size of the collection (how many documents we have) by this count, and take the log. For a single token, say ‘hello’, which appears in an email 3 times: the email is 100 tokens long, so we take 3 / 100 = 0.03, which is the term frequency. Our collection of documents, or corpus, is 300 documents large, and ‘hello’ appears in 120 of these, so we have ln(300/120) = 0.916. The log is just used as a scaling factor, and to stop a certain amount of saturation towards 0.

So multiplying, tf * idf = 0.03 * 0.916 ≈ 0.027, which is the value of the individual token as a disambiguator; it isn’t very high, because the token occurs in a lot of other documents.
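To make the arithmetic above concrete, here is a small hedged sketch; the helper name is mine, and the numbers simply reproduce the worked example.

// Inverse document frequency: natural log of (corpus size / documents containing the token)
function idf($corpusSize, $documentsWithToken) {
	return log($corpusSize / $documentsWithToken);
}

// Worked example from above: 'hello' appears 3 times in a 100-token email,
// and in 120 of the 300 documents in the corpus
$tf = 3 / 100;                  // 0.03
$weight = $tf * idf(300, 120);  // roughly 0.027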