FaLKM-liba Library for Fast Local Kernel MachinesAuthor: Nicola Segata <nsegata@hsph.harvard.edu> Current version: 1.1 |
FaLKM-lib is a library for fast local kernel machine implemented in C++ extedning LibSVM. It contains modules for classification, regression and noise reduction. All the neighbourhood operations are implemented with our implementation of Cover Trees [Beygelzimer, Kakade, Langford, 2006] for computational reasons, whereas the training and testing of the Local SVMs is performed using LibSVM [Chang, Lin 2001] version 2.88. FaLKM-lib contains modules for very fast and scalable learning that showed to be statistically more accurate than SVM and, using some implemented strategies, faster than SVM and thus scalable for very large datasets (some million of samples for non high-dimensional data). The FaLKM-lib library and its algorithms are concisely introduced and formally described in the document [Segata, 2009] available here.
If you have any general comments, suggestions or problems using the code, please write to nsegata@hsph.harvard.edu.
Local Support Vector Machines (LSVM) have been introduced by [Blanzieri, Melgani, 2006, 2008] and proved to be statisticlaly more accurate than SVM as shown in [Segata, Blanzieri 2010, Segata, Blanzieri 2009a]. LSVM learning can be made scalable and dramatically faster than SVM using some approximation strategies of LSVM as shown in [Segata, Blanzieri 2009b] and some other modifications (not published yet). The LSVM approach showed to be very effective also for noise reduction (which is an important preprocessing step for other analysis and for instance-based learning) both with the original LSVM setting [Segata, Blanzieri, Delany, Cunningham, 2008] and with its scalable version [Segata, Blanzieri, Cunningham, 2009]. FaLKM-lib implements the Cover Tree data-structure [Beygelzimer, Kakade, Langford, 2006] for fast retrieval of neighborhoods in the feature space, and integrates LibSVM [Chang, Lin 2001] version 2.88 for SVM training and prediction.
FaLKM-lib consists in the following five modules:
The matematical formulations on which the modules of FaLKM-lib are based are available here, the explantions about how to use FaLKM-lin can be found in the README file in library package and at the end of this page.
The current release of FaLKM-lib is the 1.1 (May 2009) and it is freely available for research and educational purposes (please read the COPYRIGHT before downloading FaLKM-lib) and can be obtained downloading the package as tar.gz archive. The previous release (1.01) is still available here.
The archives contain the sources, the makefile for building the executables in Unix-based systems, the Makefile.win for building the executables in Windows systems using Visual Studio compiler, the README file with detailed explantations for the usage, and the COPYRIGHT file. FaLKM-lib has been tested mainly under Linux, so the more stable and safe version is the Unix-based one.
Nicola Segata, FaLKM-lib: a Library for Fast Local Kernel Machines, 2009. Software available at http://disi.unitn.it/~segata/FaLKM-lib
The bibtex entry is:@TECHREPORT{Segata2009, AUTHOR = {Nicola Segata}, TITLE = {{FaLKM-lib v1.0: a Library for Fast Local Kernel Machines}}, INSTITUTION = {DISI, University of Trento, Italy}, YEAR = {2009}, note = {Software available at \url{http://disi.unitn.it/~segata/FaLKM-lib}} number = {DISI-09-025}, }
[Segata, Blanzieri, 2010] |
N. Segata, E. Blanzieri, Fast and Scalable Local Kernel Machines. JMLR 11(Jun):1883−1926, 2010. |
[Segata, 2009] |
N. Segata, FaLKM-lib v1.0: a Library for Fast Local Kernel Machines. Technical Report DISI-09-025. Department of Information Engineering and Computer Science, University of Trento, Italy, 2009. Software available at http://disi.unitn.it/~segata/FaLKM-lib |
[Blanzieri, Melgani, 2006] |
E. Blanzieri, F. Melgani, An Adaptive SVM Nearest Neighbor Classifier for Remotely Sensed Imagery. IEEE International Conference on Geoscience and Remote Sensing Symposium, 2006. pp. 3931-3934. |
[Blanzieri, Melgani, 2008] |
E. Blanzieri, F. Melgani, Nearest Neighbor Classification of Remote Sensing Images With the Maximal Margin Principle. IEEE Transactions on Geoscience and Remote Sensing. June 2008 Volume: 46, Issue: 6 On page(s): 1804-1811 |
[Segata, Blanzieri, 2009a] |
N. Segata, E. Blanzieri, Empirical Assessment of Classification Accuracy of Local SVM. Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning (Benelearn 09), pp 47-55. Slides available here. |
[Segata, Blanzieri, 2009b] |
N. Segata, E. Blanzieri, Fast Local Support Vector Machines for Large Datasets. In "Machine Learning and Data Mining in Pattern Recognition", LNCS, Springer, pp 295-310. Proceedings of the 6th International Conference on Machine Learning and Data Mining 2009. Best Paper Award |
[Segata, Blanzieri, Cunningham, 2009] |
N. Segata, E. Blanzieri, P. Cunningham, A Scalable Noise Reduction Technique for Large Case-Based Systems. In "Case-Based Reasoning Research and Development", LNCS, Springer, pp 328-342. Proceedings of the International Conference on Case-Based Reasoning 2009. Slides available here |
[Segata, Blanzieri, Delany, Cunningham, 2008] |
N. Segata, E. Blanzieri, S.J. Delany, P. Cunningham, Noise Reduction for Instance-Based Learning with a Local Maximal Margin Approach. Journal of Intelligent Information Systems (in press) |
[Beygelzimer, Kakade, Langford, 2006] |
A. Beygelzimer, A. Kakade, J. Langford, Cover trees for nearest neighbor. Proceedings of the 23rd international conference on Machine learning, 2006. |
[Bottou, Vapnik 1992] |
L. Bottou, V. Vapnik, Local learning algorithms. Neural computation 4:(6), 1992. |
[Chang, Lin 2001] |
C.C. Chang, C.J. Lin, LIBSVM : a library for support vector machines. 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. |
All modules accept inputs and generate output accordingly to the LibSVM and SVM-light file format which is defined as:
Usage: FkNN [options] training_dataset_file [testing_dataset_file output_label_file]
-K neighbourhood size (default 1) -T classification (0, default) or regression (1) -C conservative variant for the majority rule, (def. 0 = disabled, 1 enabled) -R relaxation parameter for cover trees (default 1) -t kernel_type : set type of kernel function (default 0) 0 -- linear: u'*v 1 -- polynomial: (gamma*u'*v + coef0)^degree 2 -- radial basis function: exp(-gamma*|u-v|^2) [eq. to -t 0!!] -d degree : set degree in kernel function (default 3) -r coef0 : set coef0 in kernel function (default 1) -s knn strategy : (default 0 = majority rule) -g gamma : set gamma in kernel function (default = 1 for polynomial kernels) -v n : n-fold cross validation mode -S silent mode (default 0)
./FkNN -t 0 -K 1 -R 3 training_set.tr testing_set.te output_file.out > accuracy_results.nn
./FkNN -t 1 -g 1 -d 3 -r 1 -K 5 -v 10 training_set.tr
Usage: FkNNSVM [options] training_dataset_file [test_dataset_file output_label_file]
-Z local SVM type (default 0 = C-SVM, 3 for epsilon-SVR) -R relaxation parameter for cover trees (default 1) -K neighbourhood size k -t kernel_type : set type of kernel function (default 2) 0 -- linear: u'*v 1 -- polynomial: (gamma*u'*v + coef0)^degree 2 -- radial basis function: exp(-gamma*|u-v|^2) -c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1) -wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1) -p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1) -d degree : set degree in kernel function (default 3) -g gamma : set gamma in kernel func., negative values for estimation based on histogram of distances (def. = -0.5 for RBF, 1 for pol. kernels) -m cachesize : set cache memory size in MB (default 100) -r coef0 : set coef0 in kernel function -S silent mode (default 0)
./FkNNSVM -t 2 -c 10 -g -0.5 -K 200 training_set.tr testing_set.te output_file.out
FaLK-SVM-train takes the training set and the name of the output file on which the trained model will be stored (unless the model selection is enabled with -v option). The option -P, which takes values between 0 and 1, specifies the assignment neighbourhood size k' regulating the level of redundancy in covering the training set with local models. FaLK-SVM-train can also enable the local model selection option with -L 1; on this case other needs also the options for setting the number of local models that will be used for the procedure (with -M), the possible values for k with -N and the grid of SVM and kernel parameters to be considered: the regularisation parameter among the values specified with -C, the RBF kernel width and the degree of the polynomial kernels among the values specified with -G (possibly using negative values as detailed for FkNNSVM). It is also possible to train the model for obtaining probability estimates rather than predicted class labels with -b 1. The other options correspond to the FkNNSVM ones.
Specifically, the command line syntax is:
Usage: FaLK-SVM-train [options] input_dataset_file [model_file]
-t kernel_type : set type of kernel function (default 2) 0 -- linear: u'*v 1 -- polynomial: (gamma*u'*v + coef0)^degree 2 -- radial basis function: exp(-gamma*|u-v|^2) -L local model selection (default 0) 0 -- no local model selection 1 -- local model selection on C K and kernel parameters on -M local models (the ranges are specified with -C -G -N) -R relaxation parameter for cover trees (default 1) -K neighbourhood size k -P assignment neighbourhood size as fraction of K such that: K'=K*P (def. 0.5) -d degree : set degree in kernel function (default 3) -r coef0 : set coef0 in kernel function (default 0) -c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1) -g gamma : set gamma in kernel func., negative values for estimation based on histogram of distances (def. = -0.5 for RBF, 1 for pol. kernels) -C c values for local grid model selection separated by ':' -G G values for local grid model selection separated by ':' -N K values for local grid model selection separated by ':' -M number of local models used for local model selection -m cachesize : set cache memory size in MB (default 100) -wi weight: set the parameter C of class i to weight*C, (default 1) -b probability_estimates : whether to train local models for probability estimates, 0 or 1 (default 0) -v n: n-fold cross validation mode -S silent mode (default 0)
FaLK-SVM-predict uses the model trained with FaLK-SVM-train to predict the label of the query examples. It is possible to specify to use the FaLK-SVMc variant (with -T 1) instead of the default FaLK-SVM approach for prediction, in order to increase the computational performances. The probability output of FaLK-SVM can be enabled with -b 1 (it is necessary that the model has been trained be FaLK-SVM-train with the same option).
FaLK-SVM-predict is launched as follows:
Usage: FaLK-SVM-predict [options] test_file model_file output_label_file
-T local model retrieval strategy (default 0) 0 -- nn with all points 1 -- nn with centers only 2 -- knn with centers only -b probability_estimates : whether to train local models for probability estimates, 0 or 1 (default 0) -K k for knn with centers only (default 3) -M performance measure -S silent mode (default 0)
./FaLK-SVM-train -t 2 -c 10 -g -0.1 -K 1000 -P 0.5 training_set.tr model_file.m ./FaLK-SVM-predict testing_file.te model_file.m predictions.p
./FaLK-SVM-train -t 2 -c 10 -g -0.1 -K 1000 -P 0.5 -v 10 training_set.tr
./FaLK-SVM-train -t 2 -L 1 -C 1.0:10.0:100.0:1000.0 -G -0.1:-0.5:-0.9 -N 250:500:1000:2000 -P 0.5 -M 10 training_set.tr model_file.m ./FaLK-SVM-predict testing_file.te model_file.m predictions.p
./FaLK-SVM-predict -T 1 testing_file.te model_file.m predictions.p
Usage: FkNNSVM-nr [options] input_dataset_file output_edited_dataset_file
-R relaxation parameter for cover trees (default 1) -K neighbourhood size k -L local model selection (default 0) 0 -- no local model selection 1 -- local model selection on C K and kernel parameters on -M local models (the ranges are specified with -C -G -N) -t kernel_type : set type of kernel function (default 2) 0 -- linear: u'*v 1 -- polynomial: (gamma*u'*v + coef0)^degree 2 -- radial basis function: exp(-gamma*|u-v|^2) -C c values for local grid model selection separated by ':' -G G values for local grid model selection separated by ':' -N K values for local grid model selection separated by ':' -M number of local models used for local model selection -m cachesize : set cache memory size in MB (default 100) -wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1) -T balancing threshold for selecting local models for local model selection -c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1) -d degree : set degree in kernel function (default 3) -g gamma : set gamma in kernel func., negative values for estimation based on histogram of distances (def. = -0.5 for RBF, 1 for pol. kernels) -r coef0 : set coef0 in kernel function
./FkNNSVM-nr -t 0 -L 1 -C 0.01:1.0:100.0:10000.0 -N 25:50:100:200 -T 0.4 -M 20 -R 1 training_set.tr edited_training_set.tr > output.txt
./FkNNSVM-nr -t 2 -L 1 -C 0.01:1.0:100.0:10000.0 -G -0.1:-0.5:-0.9 -N 25:50:100 -T 0.4 -M 20 -R 1 training_set.tr edited_training_set.tr > output.txt
Usage: FaLKNR [options] input_dataset_file edited_dataset_file
-K neighbourhood size k -P assignment neighbourhood size as fraction of K such that: K'=K*P (def. 0.5) -t kernel_type : set type of kernel function (default 2) 0 -- linear: u'*v 1 -- polynomial: (gamma*u'*v + coef0)^degree 2 -- radial basis function: exp(-gamma*|u-v|^2) -L local model selection (default 0) 0 -- no local model selection 1 -- local model selection on C K and kernel parameters on -M local models (the ranges are specified with -C -G -N) -R relaxation parameter for cover trees (default 3) -d degree : set degree in kernel function (default 3) -g gamma : set gamma in kernel func., negative values for estimation based on histogram of distances (def. = -0.5 for RBF, 1 for pol. kernels) -r coef0 : set coef0 in kernel function (default 0) -c cost : set the parameter C (default 1) -C c values for local grid model selection separated by ':' -G G values for local grid model selection separated by ':' -N K values for local grid model selection separated by ':' -M number of local models used for local model selection -m cachesize : set cache memory size in MB (default 100) -wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1) -S silent mode (default 0)
./FaLKNR -t 2 -L 1 -R 3 -K 1000 -P 0.25 -C 1.0:10.0:100.0 -g -0.5 -M 10 training_set.tr edited_training_set.tr
Last modified -december 17, 2009 by Nicola Segata <segata@disi.unitn.it>