Local Support Vector Machine for Noise Reduction

SW Author: Nicola Segata <segata@disi.unitn.it>
Dipartimento di Ingegneria e Scienza dell'Informazione, University of Trento

Version: 0.9
Date: 01.10.2008


An updated version of LSVM-nr is available as part of the Fast Local Kernel Machine Library (FaLKM-lib; Segata 2009), which is freely available with source code for research and education purposes here.

Overview

Local Support Vector Machine Noise Reduction (LSVM-nr) [Segata, Blanzieri, Delany, Cunningham, 2008] is a novel approach to noise reduction based on local Support Vector Machines (LSVM) [Blanzieri, Melgani, 2006, 2008], which brings the benefits of maximal margin classifiers to bear on noise reduction. This provides a more robust alternative to the majority rule on which almost all existing noise reduction techniques are based. Roughly speaking, for each training sample an SVM is trained on its neighbourhood; if the SVM classification of the central sample disagrees with its actual class, there is evidence in favour of removing the sample from the training set. There is empirical evidence that nearest neighbor based classifiers generalize better on a number of real datasets when trained on data edited with LSVM-nr. In particular, some experiments suggest that LSVM-nr is especially effective in the spam filtering application domain, for datasets affected by Gaussian noise, and in the presence of uneven class densities.
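
The editing rule described above can be illustrated with a minimal Python sketch. It is not the LSVM-nr implementation and rests on several assumptions: labels are binary (-1 or +1), neighbours are found by Euclidean distance on dense vectors, the central sample is excluded from its own neighbourhood, and a hard SVM decision replaces the probabilistic output that LSVM-nr thresholds. The local SVM is trained with a small dual coordinate descent solver (bias folded into the kernel) swapped in for LibSVM, purely to keep the example self-contained. All function names are hypothetical.

```python
import math


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_local_svm(points, labels, kernel=linear_kernel, cost=1.0, passes=200):
    """Train a C-SVM by dual coordinate descent (stand-in for LibSVM).

    The bias term is emulated by adding 1 to every kernel value.
    Returns a decision function f(x); sign(f(x)) is the predicted class.
    """
    n = len(points)
    gram = [[kernel(points[i], points[j]) + 1.0 for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(passes):
        for i in range(n):
            margin = labels[i] * sum(alpha[j] * labels[j] * gram[i][j] for j in range(n))
            # Gradient of the dual objective w.r.t. alpha[i] is (margin - 1);
            # take a Newton step on this coordinate and clip it to [0, cost].
            alpha[i] = min(max(alpha[i] - (margin - 1.0) / gram[i][i], 0.0), cost)

    def decision(x):
        return sum(alpha[j] * labels[j] * (kernel(points[j], x) + 1.0) for j in range(n))

    return decision


def lsvm_noise_reduction(samples, k, kernel=linear_kernel, cost=1.0):
    """Return the samples whose label agrees with their local SVM.

    samples: list of (features, label) pairs with label in {-1, +1}.
    k: neighbourhood size (assumed here to exclude the sample itself).
    """
    kept = []
    for i, (x, y) in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        others.sort(key=lambda s: math.dist(s[0], x))
        hood = others[:k]
        hood_labels = {label for _, label in hood}
        if len(hood_labels) == 1:
            # A single-class neighbourhood needs no SVM.
            predicted = hood_labels.pop()
        else:
            decision = train_local_svm(
                [p for p, _ in hood], [l for _, l in hood], kernel, cost
            )
            predicted = 1 if decision(x) >= 0 else -1
        if predicted == y:
            kept.append((x, y))
    return kept
```

Calling `lsvm_noise_reduction(samples, k=6)` on a toy set made of two well separated clusters plus one point labelled against the cluster it sits in drops that point, because the local SVM trained on its neighbours assigns it to the surrounding class.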

References and documents

LSVM-nr is described in the following document:

[Segata, Blanzieri, Delany, Cunningham, 2008]

N. Segata, E. Blanzieri, S.J. Delany, P. Cunningham, Noise Reduction for Instance-Based Learning with a Local Maximal Margin Approach. Technical Report.

LSVM-nr is based on a probabilistic variant of Local Support Vector Machines (LSVM). The main references for LSVM are:

[Blanzieri, Melgani, 2006]

E. Blanzieri, F. Melgani, An Adaptive SVM Nearest Neighbor Classifier for Remotely Sensed Imagery. IEEE International Conference on Geoscience and Remote Sensing Symposium, 2006. pp. 3931-3934.

[Blanzieri, Melgani, 2008]

E. Blanzieri, F. Melgani, Nearest Neighbor Classification of Remote Sensing Images With the Maximal Margin Principle. IEEE Transactions on Geoscience and Remote Sensing, vol. 46, no. 6, June 2008, pp. 1804-1811.

[Segata, Blanzieri, 2008]

N. Segata, E. Blanzieri, Empirical Assessment of Classification Accuracy of Local SVM. DISI Technical Report.

Software

For now, LSVM-nr can be obtained only as a win32 executable, available here. The software for performing LSVM model selection (LSVM-cv) is available here. Both LSVM-nr and LSVM-cv use LibSVM version 2.86 for training and evaluating the local SVM models.

How to use

LSVM-nr takes the unedited dataset and produces the edited dataset. Both datasets are represented with sparse instance vectors encoded in the file format used by LibSVM and SVM-light.
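
For readers unfamiliar with that sparse format, each line holds a label followed by `index:value` pairs for the non-zero features, for example `+1 1:0.5 3:2`. The Python sketch below reads and writes such lines; the function names are hypothetical, and it assumes plain classification data (no SVM-light `qid:` tokens), with anything after a `#` treated as a comment.

```python
def parse_sparse_line(line):
    """Parse 'label index:value ...' into (label, {index: value})."""
    body = line.split("#", 1)[0].split()
    label = float(body[0])
    features = {}
    for token in body[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def format_sparse_line(label, features):
    """Write (label, {index: value}) back as one sparse-format line."""
    parts = [repr(float(label))]
    # Feature indices are written in ascending order.
    parts += [f"{index}:{float(value)!r}" for index, value in sorted(features.items())]
    return " ".join(parts)
```

Features that do not appear on a line are taken to be zero, which is what keeps the files compact for high-dimensional data.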

LSVM-nr is called with the following parameters:

LSVM-nr [options] input_unedited_set_file_name output_edited_file_name

Available options are:

-k k: set the LSVM neighborhood size (default 1/10 input set cardinality)
-l l: set LSVM probabilistic output threshold for noise removal (default 0.5)
-u u: set LSVM probabilistic output threshold for redundancy reduction (default 1.0, i.e. no redundancy reduction)
-t kernel_type : set type of kernel function (default 0)
        0 -- linear: u'*v
        1 -- polynomial: (gamma*u'*v + coef0)^degree
        2 -- radial basis function: exp(-gamma*|u-v|^2)
-d degree : set degree in kernel function (default 2)
-g gamma : set gamma in kernel function (default 1)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking: whether to use the shrinking heuristics, 0 or 1 (default 1)
-wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)
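
As an illustration of the three kernel types selectable with -t, the formulas above can be written out in Python as follows. This is a sketch on dense vectors with hypothetical function names; the keyword defaults mirror the documented defaults for -d, -g and -r, and the executable itself evaluates kernels through LibSVM on sparse vectors.

```python
import math


def linear(u, v):
    """-t 0: u'*v"""
    return sum(a * b for a, b in zip(u, v))


def polynomial(u, v, gamma=1.0, coef0=0.0, degree=2):
    """-t 1: (gamma*u'*v + coef0)^degree"""
    return (gamma * linear(u, v) + coef0) ** degree


def rbf(u, v, gamma=1.0):
    """-t 2: exp(-gamma*|u-v|^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


# Maps the value passed to -t onto the corresponding function.
KERNELS = {0: linear, 1: polynomial, 2: rbf}
```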
 
LSVM-cv is used to select the parameters (kernel type, kernel parameters and regularization parameter C) for LSVM-nr:

LSVM-cv [options] training_set_file_name

Available options are:

-k k: neighborhood size (default 1/10 input set cardinality)
-t kernel_type : set type of kernel function (default 0)
        0 -- linear: u'*v
        1 -- polynomial: (gamma*u'*v + coef0)^degree
        2 -- radial basis function: exp(-gamma*|u-v|^2)
-d degree : set degree in kernel function (default 2)
-g gamma : set gamma in kernel function (default 1)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking: whether to use the shrinking heuristics, 0 or 1 (default 1)
-wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)
-f folds: number of folds of cross validation (default 10)
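
Model selection usually means running LSVM-cv over a grid of candidate parameters. The Python sketch below only builds the command lines for an RBF-kernel grid over gamma and C, using the options listed above; the function name and the grid values are hypothetical. The output format of LSVM-cv is not documented here, so the sketch deliberately leaves running the commands and reading their results to the caller.

```python
from itertools import product


def lsvm_cv_commands(training_file, gammas, costs, k=None, folds=10,
                     executable="LSVM-cv"):
    """Build one LSVM-cv argument list per (gamma, C) pair, RBF kernel."""
    commands = []
    for gamma, cost in product(gammas, costs):
        argv = [executable, "-t", "2", "-g", str(gamma), "-c", str(cost),
                "-f", str(folds)]
        if k is not None:
            # Omitting -k leaves the tool's default neighbourhood size.
            argv += ["-k", str(k)]
        argv.append(training_file)
        commands.append(argv)
    return commands
```

Each returned list can be passed to `subprocess.run` on a machine where the win32 executable is available. Once a parameter combination has been chosen, the same -t, -g, -c and -k values can be given to LSVM-nr.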
 

Last modified October 6, 2008 by Nicola Segata <segata@disi.unitn.it>