FaLKM-lib

a Library for Fast Local Kernel Machines

Author: Nicola Segata <nsegata@hsph.harvard.edu>

Current version: 1.1
Date: 5.06.2010


The paper "Fast and Scalable Local Kernel Machines", which describes the FaLK-SVM techniques with extensive theoretical and empirical analyses, has been published in JMLR [Segata, Blanzieri 2010].
Version 1.1 available.
The paper "Fast Local Support Vector Machines for Large Datasets" [Segata, Blanzieri 2009b], describing a very preliminary version of FaLK-SVM, received the Best Paper Award at the 6th International Conference on Machine Learning and Data Mining.

Overview

FaLKM-lib is a library for fast local kernel machines, implemented in C++ as an extension of LibSVM. It contains modules for classification, regression and noise reduction. For computational reasons, all the neighbourhood operations are performed with our implementation of Cover Trees [Beygelzimer, Kakade, Langford, 2006], whereas the training and testing of the local SVMs is performed using LibSVM [Chang, Lin 2001] version 2.88. FaLKM-lib contains modules for very fast and scalable learning that have been shown to be statistically more accurate than SVM and, using some of the implemented strategies, faster than SVM and thus scalable to very large datasets (several million samples for non-high-dimensional data). The FaLKM-lib library and its algorithms are concisely introduced and formally described in [Segata, 2009].

If you have any general comments, suggestions or problems using the code, please write to nsegata@hsph.harvard.edu.

Brief Description

Local Support Vector Machines (LSVM) were introduced by [Blanzieri, Melgani, 2006, 2008] and proved to be statistically more accurate than SVM, as shown in [Segata, Blanzieri 2010, Segata, Blanzieri 2009a]. LSVM learning can be made scalable and dramatically faster than SVM using some approximation strategies, as shown in [Segata, Blanzieri 2009b], together with some other modifications (not published yet). The LSVM approach has also proven very effective for noise reduction (an important preprocessing step for other analyses and for instance-based learning), both in the original LSVM setting [Segata, Blanzieri, Delany, Cunningham, 2008] and in its scalable version [Segata, Blanzieri, Cunningham, 2009]. FaLKM-lib implements the Cover Tree data structure [Beygelzimer, Kakade, Langford, 2006] for fast retrieval of neighbourhoods in the feature space, and integrates LibSVM [Chang, Lin 2001] version 2.88 for SVM training and prediction.

FaLKM-lib consists of the following five modules:

- FkNN: kernelised k-nearest neighbours for classification and regression;
- FkNNSVM: local SVMs for classification and regression;
- FaLK-SVM: fast and scalable local SVMs for large and very large datasets;
- FkNNSVM-nr: noise reduction based on local SVMs;
- FaLKNR: fast and scalable noise reduction based on local SVMs.

In addition to the traditional cross-validation model selection, the modules contain primitives for performing local model selection as described for FaLKNR in [Segata, Blanzieri, Cunningham, 2009]. Moreover, the library can handle multi-class data and outputs the accuracy, the precision, the recall and the F-measure.

The mathematical formulations on which the modules of FaLKM-lib are based are available in [Segata, 2009]; explanations of how to use FaLKM-lib can be found in the README file in the library package and at the end of this page.

Source Code and Binaries

The current release of FaLKM-lib is 1.1 (June 2010). It is freely available for research and educational purposes (please read the COPYRIGHT before downloading FaLKM-lib) and can be obtained by downloading the package as a tar.gz archive. The previous release (1.01) is still available.

The archives contain the sources, the Makefile for building the executables on Unix-based systems, the Makefile.win for building the executables on Windows systems with the Visual Studio compiler, the README file with detailed explanations of the usage, and the COPYRIGHT file. FaLKM-lib has been tested mainly under Linux, so the Unix-based version is the more stable and safe one.

Previous versions

The current version is 1.1. Please write to segata@disi.unitn.it if you need versions different from the current one.

How to cite FaLKM-lib

If you use FaLKM-lib please cite the library as:

Nicola Segata, FaLKM-lib: a Library for Fast Local Kernel Machines, 2009. Software available at http://disi.unitn.it/~segata/FaLKM-lib

The bibtex entry is:
@TECHREPORT{Segata2009,
  AUTHOR      = {Nicola Segata},
  TITLE       = {{FaLKM-lib v1.0: a Library for Fast Local Kernel Machines}},
  INSTITUTION = {DISI, University of Trento, Italy},
  YEAR        = {2009},
  NUMBER      = {DISI-09-025},
  NOTE        = {Software available at \url{http://disi.unitn.it/~segata/FaLKM-lib}},
}
Please also cite the paper corresponding to the FaLKM-lib module you use: for FkNNSVM cite [Blanzieri, Melgani, 2008]; for FaLK-SVM cite [Segata, Blanzieri 2010] and [Segata, Blanzieri 2009b]; for FkNNSVM-nr cite [Segata, Blanzieri, Delany, Cunningham, 2008]; for FaLKNR cite [Segata, Blanzieri, Cunningham, 2009].

Acknowledgements

The author of FaLKM-lib would like to thank Enrico Blanzieri for the fruitful discussions on local approaches to learning with kernels and for his crucial support in the implementation work, and Padraig Cunningham and Sarah Jane Delany for their suggestions about the implemented noise reduction algorithms.

References

[Segata, Blanzieri, 2010]

N. Segata, E. Blanzieri, Fast and Scalable Local Kernel Machines. JMLR 11(Jun):1883-1926, 2010.

[Segata, 2009]

N. Segata, FaLKM-lib v1.0: a Library for Fast Local Kernel Machines. Technical Report DISI-09-025. Department of Information Engineering and Computer Science, University of Trento, Italy, 2009. Software available at http://disi.unitn.it/~segata/FaLKM-lib

[Blanzieri, Melgani, 2006]

E. Blanzieri, F. Melgani, An Adaptive SVM Nearest Neighbor Classifier for Remotely Sensed Imagery. IEEE International Conference on Geoscience and Remote Sensing Symposium, 2006. pp. 3931-3934.

[Blanzieri, Melgani, 2008]

E. Blanzieri, F. Melgani, Nearest Neighbor Classification of Remote Sensing Images With the Maximal Margin Principle. IEEE Transactions on Geoscience and Remote Sensing. June 2008 Volume: 46, Issue: 6 On page(s): 1804-1811

[Segata, Blanzieri, 2009a]

N. Segata, E. Blanzieri, Empirical Assessment of Classification Accuracy of Local SVM. Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning (Benelearn 09), pp. 47-55, 2009.

[Segata, Blanzieri, 2009b]

N. Segata, E. Blanzieri, Fast Local Support Vector Machines for Large Datasets. In "Machine Learning and Data Mining in Pattern Recognition", LNCS, Springer, pp 295-310. Proceedings of the 6th International Conference on Machine Learning and Data Mining 2009. Best Paper Award

[Segata, Blanzieri, Cunningham, 2009]

N. Segata, E. Blanzieri, P. Cunningham, A Scalable Noise Reduction Technique for Large Case-Based Systems. In "Case-Based Reasoning Research and Development", LNCS, Springer, pp. 328-342. Proceedings of the International Conference on Case-Based Reasoning 2009.

[Segata, Blanzieri, Delany, Cunningham, 2008]

N. Segata, E. Blanzieri, S.J. Delany, P. Cunningham, Noise Reduction for Instance-Based Learning with a Local Maximal Margin Approach. Journal of Intelligent Information Systems (in press)

[Beygelzimer, Kakade, Langford, 2006]

A. Beygelzimer, S. Kakade, J. Langford, Cover Trees for Nearest Neighbor. Proceedings of the 23rd International Conference on Machine Learning, 2006.

[Bottou, Vapnik 1992]

L. Bottou, V. Vapnik, Local Learning Algorithms. Neural Computation, 4(6), 1992.

[Chang, Lin 2001]

C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

How to use

All modules accept input and generate output according to the LibSVM and SVM-light file format, which is defined as:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float> 
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>
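For concreteness, a line such as `+1 1:0.43 3:0.12 9284:0.2 # abcdef` encodes a positive example with three non-zero sparse features. A minimal parser for this format can be sketched as follows (a hypothetical helper for illustration, not part of FaLKM-lib; the optional "qid" field is skipped for simplicity):

```python
# Minimal parser for the LibSVM / SVM-light line format described above.
# Returns (target, {feature_index: value}) for one line.
def parse_libsvm_line(line):
    line = line.split('#', 1)[0].strip()   # drop the optional "# <info>" part
    tokens = line.split()
    target = float(tokens[0])              # <target> .=. +1 | -1 | 0 | <float>
    features = {}
    for tok in tokens[1:]:
        feat, val = tok.split(':')
        if feat == 'qid':                  # SVM-light query id: ignored here
            continue
        features[int(feat)] = float(val)   # sparse <feature>:<value> pairs
    return target, features

print(parse_libsvm_line("+1 1:0.43 3:0.12 9284:0.2 # abcdef"))
```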
In the following we briefly describe the modules of the library, listing the available options and some examples.

FkNN

FkNN takes the training set and the testing set as input and writes the class predictions for the testing examples into the output file. If the cross-validation option -v is enabled, FkNN accepts the training set only. FkNN prints to standard output the accuracy (together with the F-measure, precision and recall) for classification, and the mean squared error for regression. FkNN is called in the following way:
Usage: FkNN [options] training_dataset_file [testing_dataset_file output_label_file]
and the available options are:
-K neighbourhood size (default 1)
-T classification (0, default) or regression (1)
-C conservative variant for the majority rule, (def.  0 = disabled, 1 enabled)
-R relaxation parameter for cover trees (default 1)
-t kernel_type : set type of kernel function (default 0)
        0 -- linear: u'*v
        1 -- polynomial: (gamma*u'*v + coef0)^degree
        2 -- radial basis function: exp(-gamma*|u-v|^2) [eq. to -t 0!!]
-d degree : set degree in kernel function (default 3)
-r coef0 : set coef0 in kernel function (default 1)
-s knn strategy : (default 0 = majority rule)
-g gamma : set gamma in kernel function (default = 1 for polynomial kernels)
-v n : n-fold cross validation mode
-S silent mode (default 0)	
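The -t option lets FkNN retrieve neighbours using the metric induced by the chosen kernel in feature space, d(x,y)^2 = K(x,x) - 2 K(x,y) + K(y,y). A sketch of this induced distance with the inhomogeneous polynomial kernel (function names are illustrative, not FaLKM-lib's internals):

```python
import math

# Polynomial kernel as listed in the -t 1 option: (gamma*u'*v + coef0)^degree.
def poly_kernel(x, y, gamma=1.0, coef0=1.0, degree=3):
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

# Distance induced by a kernel k in its feature space:
# d(x, y)^2 = k(x,x) - 2 k(x,y) + k(y,y).
def kernel_distance(x, y, k):
    d2 = k(x, x) - 2.0 * k(x, y) + k(y, y)
    return math.sqrt(max(d2, 0.0))   # guard against tiny negative round-off

x, y = [1.0, 0.0], [0.0, 1.0]
print(kernel_distance(x, y, poly_kernel))
```

With the linear kernel (-t 0) the induced distance reduces to the ordinary Euclidean distance, which is why -t 2 (RBF) is listed as equivalent to -t 0 for neighbourhood retrieval.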
 

Examples

The nearest neighbor results reported in [Segata, Blanzieri, Cunningham, 2009] are obtained in the following way:
./FkNN -t 0 -K 1 -R 3 training_set.tr testing_set.te output_file.out > accuracy_results.nn
a 10-fold CV with a neighbourhood size of 5, using the metric induced by an inhomogeneous polynomial kernel of degree 3, can be obtained as follows:
./FkNN -t 1 -g 1 -d 3 -r 1 -K 5 -v 10 training_set.tr 

FkNNSVM

The crucial parameters for FkNNSVM are the choice between the classification (-Z 0) and the regression (-Z 3) option, the neighbourhood size -K, the SVM regularisation parameter -c, the kernel -t and its parameters. The option -g sets the value of the inverse of the RBF kernel width; setting -g to a negative value (with absolute value in [0,1]) indicates that the width parameter is chosen automatically using a data-dependent strategy based on the histogram of the distances, using as percentile the absolute value of -g multiplied by 100. With -wi it is possible to set different penalising constants for the different classes in order to improve the accuracy on unbalanced datasets. FkNNSVM is called in the following way:
Usage: FkNNSVM [options] training_dataset_file [test_dataset_file output_label_file]
and the available options are:
-Z local SVM type (default 0 = C-SVM, 3 for epsilon-SVR)
-R relaxation parameter for cover trees (default 1)
-K neighbourhood size k
-t kernel_type : set type of kernel function (default 2)
    0 -- linear: u'*v
    1 -- polynomial: (gamma*u'*v + coef0)^degree
    2 -- radial basis function: exp(-gamma*|u-v|^2)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-d degree : set degree in kernel function (default 3)
-g gamma :  set gamma in kernel func., negative values for estimation based on
            histogram of distances (def. = -0.5 for RBF, 1 for pol. kernels)
-m cachesize : set cache memory size in MB (default 100)
-r coef0 : set coef0 in kernel function
-S silent mode (default 0)
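The data-dependent width selection triggered by negative -g values can be sketched as follows. The percentile lookup mirrors the description above; the final mapping from the selected distance sigma to gamma (here 1/sigma^2, matching the exp(-gamma*|u-v|^2) form of the kernel) is an assumption for illustration, and the exact formula used by FaLKM-lib may differ:

```python
import math

# Sketch: negative -g values select the (|g|*100)-th percentile of the
# pairwise-distance histogram as the kernel width sigma.
# NOTE: gamma = 1/sigma^2 below is an illustrative assumption, not
# necessarily FaLKM-lib's exact mapping.
def estimate_gamma(points, g=-0.5):
    dists = sorted(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )
    idx = min(int(abs(g) * len(dists)), len(dists) - 1)
    sigma = dists[idx]               # |g|*100-th percentile distance
    return 1.0 / (sigma * sigma)     # gamma for exp(-gamma*|u-v|^2)
```

So -g -0.5 (the default for RBF) picks the median of the local distance histogram, while -g -0.1 picks the 10th percentile.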

Examples

The following setting applies the local SVM approach with a neighbourhood size of 200 and regularisation parameter C equal to 10, with the RBF kernel function whose width is computed on every local model as the median of the histogram of distances in the local neighbourhood:
./FkNNSVM -t 2 -c 10 -g -0.5 -K 200 training_set.tr testing_set.te output_file.out

FaLK-SVM

FaLK-SVM is the implementation of a modification of FkNNSVM that makes it scalable to large and very large datasets. FaLK-SVM is subdivided into a training module, called FaLK-SVM-train, that trains the model, and a prediction module, called FaLK-SVM-predict, that makes the predictions on unseen examples using the trained model.

FaLK-SVM-train takes the training set and the name of the output file in which the trained model will be stored (unless model selection is enabled with the -v option). The option -P, which takes values between 0 and 1, specifies the assignment neighbourhood size k', regulating the level of redundancy in covering the training set with local models. FaLK-SVM-train can also enable local model selection with -L 1; in this case one also needs the options setting the number of local models used for the procedure (with -M), the possible values of k (with -N), and the grid of SVM and kernel parameters to be considered: the regularisation parameter among the values specified with -C, and the RBF kernel width or the degree of the polynomial kernel among the values specified with -G (possibly using negative values as detailed for FkNNSVM). It is also possible to train the model to obtain probability estimates rather than predicted class labels with -b 1. The other options correspond to the FkNNSVM ones.

Specifically, the command line syntax is:

Usage: FaLK-SVM-train [options] input_dataset_file [model_file]
-t kernel_type : set type of kernel function (default 2)
    0 -- linear: u'*v
    1 -- polynomial: (gamma*u'*v + coef0)^degree
    2 -- radial basis function: exp(-gamma*|u-v|^2)
-L local model selection (default 0)
    0 -- no local model selection
    1 -- local model selection on C K and kernel parameters on -M local models
         (the ranges are specified with -C -G -N)
-R relaxation parameter for cover trees (default 1)
-K neighbourhood size k
-P assignment neighbourhood size as fraction of K such that: K'=K*P (def. 0.5)
-d degree : set degree in kernel function (default 3)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-g gamma :  set gamma in kernel func., negative values for estimation based on
            histogram of distances (def. = -0.5 for RBF, 1 for pol. kernels)
-C c values for local grid model selection separated by ':'
-G G values for local grid model selection separated by ':'
-N K values for local grid model selection separated by ':'
-M number of local models used for local model selection
-m cachesize : set cache memory size in MB (default 100)
-wi weight: set the parameter C of class i to weight*C, (default 1)
-b probability_estimates : whether to train local models for probability
                           estimates, 0 or 1 (default 0)
-v n: n-fold cross validation mode
-S silent mode (default 0)
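The idea behind -L 1 is to evaluate the parameter grid on only -M sampled local neighbourhoods instead of the whole training set. A hedged sketch of that selection loop (the names and the inner evaluation function are illustrative placeholders, not FaLKM-lib's actual code):

```python
import itertools
import random

# Sketch of local model selection: score each (C, g, k) triple on M sampled
# local models and keep the best-scoring triple.
# `evaluate_local_model(data, center, C, g, k)` is a placeholder for training
# and validating one local SVM around `center`.
def local_model_selection(data, C_grid, G_grid, K_grid, M,
                          evaluate_local_model, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(range(len(data)), min(M, len(data)))
    best, best_score = None, float('-inf')
    for C, g, k in itertools.product(C_grid, G_grid, K_grid):
        # average validation score over the M local models
        score = sum(evaluate_local_model(data, c, C, g, k)
                    for c in centers) / len(centers)
        if score > best_score:
            best, best_score = (C, g, k), score
    return best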

FaLK-SVM-predict uses the model trained with FaLK-SVM-train to predict the labels of the query examples. It is possible to use the FaLK-SVMc variant (with -T 1) instead of the default FaLK-SVM approach for prediction, in order to improve the computational performance. The probability output of FaLK-SVM can be enabled with -b 1 (the model must have been trained by FaLK-SVM-train with the same option).

FaLK-SVM-predict is launched as follows:

Usage: FaLK-SVM-predict [options] test_file model_file output_label_file
-T local model retrieval strategy (default 0)
    0 -- nn with all points
    1 -- nn with centers only
    2 -- knn with centers only
-b probability_estimates : whether to train local models for probability
                           estimates, 0 or 1 (default 0)
-K k for knn with centers only (default 3)
-M performance measure
-S silent mode (default 0)

Examples

The following example shows how to train a model and predict the testing labels using FaLK-SVM with a neighbourhood size of 1000, an assignment neighbourhood size k' of 500 (0.5 times the neighbourhood size), regularisation parameter C equal to 10, and the RBF kernel function whose width is computed on every local model as the 10th percentile of the histogram of distances in the local neighbourhood:
./FaLK-SVM-train -t 2 -c 10 -g -0.1 -K 1000 -P 0.5 training_set.tr model_file.m
./FaLK-SVM-predict testing_file.te model_file.m predictions.p
For estimating the empirical error using cross-validation with the same settings, the command line is:
./FaLK-SVM-train -t 2 -c 10 -g -0.1 -K 1000 -P 0.5 -v 10 training_set.tr
It is also possible to perform local model selection and directly train the model. Based on the previous example, if we want to perform local model selection (using 10 local models), choosing C among {1, 10, 100, 1000}, g (the inverse of the width of the RBF kernel) among {-0.1, -0.5, -0.9} (i.e. the 10th, 50th and 90th percentiles of the distances) and the neighbourhood size k among {250, 500, 1000, 2000}, the proper command line is:
./FaLK-SVM-train -t 2 -L 1 -C 1.0:10.0:100.0:1000.0 -G -0.1:-0.5:-0.9
    -N 250:500:1000:2000 -P 0.5 -M 10 training_set.tr model_file.m
./FaLK-SVM-predict testing_file.te model_file.m predictions.p
The same model can be evaluated faster (but probably with lower accuracy) using only the centers of the models to retrieve the local SVM:
./FaLK-SVM-predict -T 1 testing_file.te model_file.m predictions.p
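The -C, -G and -N grids above are plain colon-separated value lists. A trivial helper for generating or checking them (not part of FaLKM-lib, just a convenience sketch):

```python
# Parse a colon-separated grid specification such as the argument of
# "-C 1.0:10.0:100.0:1000.0" into a list of floats.
def parse_grid(spec):
    return [float(v) for v in spec.split(':')]

print(parse_grid("1.0:10.0:100.0:1000.0"))
```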

FkNNSVM-nr

FkNNSVM-nr takes the dataset and returns the dataset without the noisy examples. The principal options are the neighbourhood size -K, the regularisation parameter -c, the kernel -t and its parameters -d, -g and -r. All these options are specified in the same way as for FkNNSVM. For FkNNSVM-nr it is also possible to perform local model selection, enabling it with -L 1 and specifying the grid of parameters with the -C, -G, -M and -N options in the same way as for FaLK-SVM. The command line of FkNNSVM-nr is:
Usage: FkNNSVM-nr [options] input_dataset_file output_edited_dataset_file
-R relaxation parameter for cover trees (default 1)
-K neighbourhood size k
-L local model selection (default 0)
    0 -- no local model selection
    1 -- local model selection on C K and kernel parameters on -M local models
         (the ranges are specified with -C -G -N)
-t kernel_type : set type of kernel function (default 2)
        0 -- linear: u'*v
        1 -- polynomial: (gamma*u'*v + coef0)^degree
        2 -- radial basis function: exp(-gamma*|u-v|^2)
-C c values for local grid model selection separated by ':'
-G G values for local grid model selection separated by ':'
-N K values for local grid model selection separated by ':'
-M number of local models used for local model selection
-m cachesize : set cache memory size in MB (default 100)
-wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)
-T balancing threshold for selecting local models for local model selection
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-d degree : set degree in kernel function (default 3)
-g gamma :  set gamma in kernel func., negative values for estimation based on
            histogram of distances (def. = -0.5 for RBF, 1 for pol. kernels)
-r coef0 : set coef0 in kernel function

Examples

The following example shows how to apply FkNNSVM-nr with the linear kernel, performing local model selection on a total of 20 local models and choosing the regularisation parameter c in {0.01, 1.0, 100.0, 10000.0} and the neighbourhood size k in {25, 50, 100, 200}:
./FkNNSVM-nr -t 0 -L 1 -C 0.01:1.0:100.0:10000.0 -N 25:50:100:200 -T 0.4 -M 20
    -R 1 training_set.tr edited_training_set.tr > output.txt
The next example uses the same settings as the previous one, but with the RBF kernel function. Consequently, the local model selection procedure also has to choose the width of the RBF kernel among {-0.1, -0.5, -0.9} (i.e. the 10th, 50th and 90th percentiles of the distances).
./FkNNSVM-nr -t 2 -L 1 -C 0.01:1.0:100.0:10000.0 -G -0.1:-0.5:-0.9 -N 25:50:100
    -T 0.4 -M 20 -R 1 training_set.tr edited_training_set.tr > output.txt

FaLKNR

FaLKNR takes the training set and the name of the edited set file to which the training set without the noisy examples is written. Its options are basically the same as those of FkNNSVM-nr: the neighbourhood size -K, the kernel -t, the regularisation parameter -c, the kernel parameters -d, -g and -r, and the local model selection options -L 1, -C, -G, -N, -M. The -P option specifies the size of the assignment neighbourhood (the k' parameter of LKM) as k' = k*P. The command line for FaLKNR is:
Usage: FaLKNR [options] input_dataset_file edited_dataset_file
-K neighbourhood size k
-P assignment neighbourhood size as fraction of K such that: K'=K*P (def. 0.5)
-t kernel_type : set type of kernel function (default 2)
        0 -- linear: u'*v
        1 -- polynomial: (gamma*u'*v + coef0)^degree
        2 -- radial basis function: exp(-gamma*|u-v|^2)
-L local model selection (default 0)
    0 -- no local model selection
    1 -- local model selection on C K and kernel parameters on -M local models
         (the ranges are specified with -C -G -N)
-R relaxation parameter for cover trees (default 3)
-d degree : set degree in kernel function (default 3)
-g gamma :  set gamma in kernel func., negative values for estimation based on
            histogram of distances (def. = -0.5 for RBF, 1 for pol. kernels)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C (default 1)
-C c values for local grid model selection separated by ':'
-G G values for local grid model selection separated by ':'
-N K values for local grid model selection separated by ':'
-M number of local models used for local model selection
-m cachesize : set cache memory size in MB (default 100)
-wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)
-S silent mode (default 0)

Examples

The results detailed in [Segata, Blanzieri, Cunningham, 2009] were obtained by performing local model selection on C, estimating the g parameter (the inverse of the RBF kernel width) from the median of the histogram of distances in the local neighbourhood, with the neighbourhood size fixed to 1000 and the assignment neighbourhood size to 250 (0.25*k). The command for this setting is:
./FaLKNR -t 2 -L 1 -R 3 -K 1000 -P 0.25 -C 1.0:10.0:100.0 -g -0.5 -M 10
    training_set.tr edited_training_set.tr

Last modified: December 17, 2009, by Nicola Segata <segata@disi.unitn.it>