Title: | Resampling Algorithms for Multi-Label Datasets |
---|---|
Description: | Collection of state-of-the-art multi-label resampling algorithms. The objective of these algorithms is to achieve balance in multi-label datasets. |
Authors: | Miguel Ángel Dávila [cre], Francisco Charte [aut] |
Maintainer: | Miguel Ángel Dávila <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.3 |
Built: | 2025-02-12 04:56:25 UTC |
Source: | https://github.com/madr0008/mldr.resampling |
Auxiliary function used by MLeNN. Computes the Hamming Distance between two instances
adjustedHammingDist(x, y, D)
x |
Index of sample 1 |
y |
Index of sample 2 |
D |
mld |
The Hamming Distance between the instances
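As a rough illustration of the distance MLeNN relies on, the sketch below computes an adjusted Hamming distance between two binary label vectors in base R. It assumes the adjustment divides the number of differing labels by the total number of active labels in both instances, as described in the MLeNN paper. The helper name `adjusted_hamming` and the toy vectors are hypothetical; the package function works on sample indexes within an mld and may differ in detail.

```r
# Hypothetical helper: adjusted Hamming distance between two binary label vectors.
# Assumption: differing labels are divided by the active labels of both instances.
adjusted_hamming <- function(a, b) {
  stopifnot(length(a) == length(b))
  differing <- sum(a != b)        # labels on which the two instances disagree
  active <- sum(a) + sum(b)       # active labels in both instances
  if (active == 0) return(0)      # no active labels at all: treat as identical
  differing / active
}

x <- c(1, 0, 1, 0, 0)
y <- c(1, 1, 0, 0, 0)
adjusted_hamming(x, y)  # 2 differing labels over 4 active labels
```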
Auxiliary function used to calculate the distances between an instance and the ones with a specific active label. Euclidean distance is calculated for numeric attributes, and VDM for non numeric ones.
calculateDistances(sample, rest, label, D, tableVDM = NULL)
sample |
Index of the sample whose distances to other samples we want to know |
rest |
Indexes of the samples to which we will calculate the distance |
label |
Label that must be active |
D |
mld |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
A list with the distance to the rest of samples
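The numeric part of this calculation can be sketched in base R as follows. This is only an illustration of the Euclidean component on a toy matrix; the names `X`, `sample_idx` and `rest_idx` are hypothetical, and the VDM contribution for non-numeric attributes is left out.

```r
# Toy numeric attributes: one row per instance
X <- matrix(c(0, 0,
              3, 4,
              6, 8,
              1, 1), ncol = 2, byrow = TRUE)

sample_idx <- 1          # instance whose distances we want
rest_idx <- c(2, 3, 4)   # instances to compare against

# Subtract the sample's attribute values from each of the other rows
diffs <- sweep(X[rest_idx, , drop = FALSE], 2, X[sample_idx, ])

# Euclidean distance from the sample to every instance in rest_idx
dists <- sqrt(rowSums(diffs^2))
dists
```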
Auxiliary function used to calculate an auxiliary table to make VDM calculation faster
calculateTableVDM(D)
D |
mld |
A dataframe with tables, useful for VDM calculation
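To give an idea of what such a precomputed table might hold, the sketch below tabulates, for each value of one nominal attribute, the proportion of instances in which a label is active. This is an assumption about the kind of information VDM needs, not a description of the package's internal data frame; `values`, `label` and `vdm_table` are hypothetical names.

```r
# Toy nominal attribute and a single binary label
values <- c("red", "red", "blue", "blue", "blue", "green")
label  <- c(1, 0, 1, 1, 0, 0)

# Proportion of instances with the label active, per attribute value.
# Computing this once avoids recounting for every pair of instances.
vdm_table <- tapply(label, values, mean)
vdm_table
```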
Auxiliary function used by resample. It executes an algorithm, given as a string, and stores the resulting MLD in an ARFF file
executeAlgorithm(D, a, P, k, TH, strategy, outputDirectory, neighbors, neighbors2, tableVDM)
D |
mld |
a |
String with the name of the algorithm to be applied. |
P |
Percentage in which the original dataset is increased/decreased (if required by the algorithm) |
k |
Number of neighbors taken into account for each instance (if required by the algorithm) |
TH |
Threshold for the Hamming Distance in order to consider an instance different to another one (if required by the algorithm) |
strategy |
Strategy for choosing the synthetic labels (if required by the algorithm). Possible values: "union", "intersection" and "ranking" (default) |
outputDirectory |
Route with the directory where the generated ARFF file will be stored |
neighbors |
Structure with all instances and neighbors in the dataset, useful in MLSOL and MLUL |
neighbors2 |
Structure with some instances and neighbors in the dataset, useful in MLeNN and MLTL |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Time (in seconds) taken to execute the algorithm (NULL if no algorithm was executed)
Auxiliary function used by MLSOL. Creates a synthetic sample based on two other samples, taking into account their types
generateInstanceMLSOL(seedInstance, refNeigh, t, D)
seedInstance |
Index of the sample we are using as "template" |
refNeigh |
Index of the reference neighbor |
t |
types of the instances |
D |
mld |
A synthetic sample derived from the one passed as a parameter and its neighbors
Auxiliary function used by MLSOL and MLUL. Computes the kNN of every instance in a dataset
getAllNeighbors(D, d, tableVDM = NULL)
D |
mld |
d |
Vector with the instances of the dataset which have one or more label active (ideally, all of them) |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
A list of vectors with the indexes of the neighbors for each instance
Auxiliary function used by MLeNN and MLTL. Gets the kNN of every instance in a dataset, when compared to some of the rest
getAllNeighbors2(neighbors, d, k)
neighbors |
Structure with all the neighbors in the dataset, regardless of which ones to be compared |
d |
Vector with the instances of the dataset which are going to be compared |
k |
Number of neighbors to be retrieved |
A list of vectors with the indexes of the neighbors for each instance
Auxiliary function used by MLUL. For each instance in the dataset, given the neighbors structure, we compute its reverse nearest neighbors
getAllReverseNeighbors(d, neighbors, k)
d |
Vector with the instances of the dataset which have one or more label active (ideally, all of them) |
neighbors |
Structure with the neighbors of every instance in the dataset |
k |
Number of neighbors to be considered |
A list of vectors with the indexes of the reverse nearest neighbors of every instance in the dataset
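The idea of reverse nearest neighbors can be sketched in base R: instance i is a reverse neighbor of j when j appears among the k nearest neighbors of i. The neighbor list below is a toy example and the variable names are hypothetical; the structure the package actually uses may be organized differently.

```r
# neighbors[[i]] holds the nearest neighbors of instance i, closest first
neighbors <- list(c(2, 3), c(1, 3), c(2, 4), c(3, 1))
k <- 2

# For each instance j, collect the instances that list j among their k neighbors
reverse_neighbors <- lapply(seq_along(neighbors), function(j) {
  which(vapply(neighbors, function(nn) j %in% nn[seq_len(k)], logical(1)))
})
reverse_neighbors
```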
Auxiliary function used by MLSOL and MLUL. For each instance in the dataset, we compute, for each label, the proportion of neighbors having an opposite class with respect to the proper instance
getC(D, d, neighbors, k)
D |
mld |
d |
Vector with the instances of the dataset which have one or more label active (ideally, all of them) |
neighbors |
Structure with the neighbors of every instance in the dataset |
k |
Number of neighbors taken into account for each instance |
A structure with the proportion of neighbors having an opposite class with respect to an instance and label
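A minimal base R sketch of this proportion, on a toy label matrix, might look as follows. The names `labels`, `neighbors` and `C` are illustrative, and the result is shown as a plain matrix with one row per instance and one column per label, which is an assumption about layout rather than the package's actual structure.

```r
# Toy label matrix: one row per instance, one column per label
labels <- matrix(c(1, 0,
                   1, 1,
                   0, 1,
                   0, 0), ncol = 2, byrow = TRUE)
neighbors <- list(c(2, 3), c(1, 3), c(2, 4), c(3, 1))
k <- 2

# For each instance and label: share of the k neighbors whose label value differs
C <- t(vapply(seq_len(nrow(labels)), function(i) {
  nn <- neighbors[[i]][seq_len(k)]
  own <- matrix(labels[i, ], nrow = length(nn), ncol = ncol(labels), byrow = TRUE)
  colMeans(labels[nn, , drop = FALSE] != own)
}, numeric(ncol(labels))))
C
```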
Auxiliary function used to compute the neighbors of an instance
getNN(sample, rest, label, D, tableVDM = NULL)
sample |
Index of the sample whose neighbors we want to know |
rest |
Indexes of the samples among which we will search |
label |
Label that must be active, in order to calculate the distances |
D |
mld |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
A vector with the indexes inside rest of the neighbors
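Because the result refers to positions inside rest rather than to dataset indexes, a small base R sketch may help. It assumes the distances have already been computed; `dists`, `rest_idx` and `k` are hypothetical toy values.

```r
rest_idx <- c(4, 7, 9, 12)       # dataset indexes of the candidate instances
dists <- c(2.5, 0.7, 1.9, 0.1)   # distance from the sample to each candidate
k <- 2

# Positions inside rest_idx of the k closest candidates
nn_positions <- order(dists)[seq_len(k)]

# Mapping those positions back to dataset indexes
nn_dataset <- rest_idx[nn_positions]
```

Here `nn_positions` is what a function returning "indexes inside rest" would give, while `nn_dataset` is the same neighbors expressed as dataset indexes.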
Get the number of cores available for parallel computing
getNumCores()
The number of cores available for parallel computing
getNumCores()
Auxiliary function used by MLSOL and MLUL. For non outlier instances, it aggregates the values of C, taking into account the global class imbalance
getS(D, d, C, minoritary)
D |
mld |
d |
Vector with the instances of the dataset which have one or more label active (ideally, all of them) |
C |
Structure with the proportion of neighbors having an opposite class with respect to an instance and label |
minoritary |
Vector with the minority class of each label (normally, 1) |
A structure with the proportion of neighbors having an opposite class with respect to an instance and label, normalized by the global class imbalance
Auxiliary function used by MLUL. It computes the influence of each instance with respect to its reverse neighbors
getU(D, d, rNeighbors, S)
D |
mld |
d |
Vector with the instances of the dataset which have one or more label active (ideally, all of them) |
rNeighbors |
Structure with the reverse nearest neighbors of each instance of the dataset |
S |
Structure with the proportion of neighbors having an opposite class with respect to an instance and label, normalized by the global class imbalance |
A list of values of influence for each instance with respect to its reverse neighbors
Auxiliary function used by MLUL. It calculates, for each instance, how important it is in the dataset
getV(w, u)
w |
List of weights for each instance |
u |
List of influences in reverse neighbors for each instance |
A list with the values of importance of each instance in the dataset
Auxiliary function used by MLSOL and MLUL. For non outlier instances, it aggregates the values of S for each label
getW(S)
S |
Structure with the proportion of neighbors having an opposite class with respect to an instance and label, normalized by the global class imbalance |
A vector of weights to be considered when oversampling for each instance
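One plausible way to use such weights is as sampling probabilities when picking seed instances. The sketch below shows that idea in base R; it is an assumption about how the weights are consumed, and `w` and `seeds` are hypothetical names with toy values.

```r
# Toy weights, one per instance; instance 2 has weight 0 and is never picked
w <- c(0.5, 0, 0.3, 0.2)

set.seed(42)  # only to make the illustration reproducible
# Draw 10 seed instances, with replacement, proportionally to their weights
seeds <- sample(seq_along(w), size = 10, replace = TRUE, prob = w)
seeds
```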
Auxiliary function used by MLSOL. Categorizes each pair instance-label of the dataset with a type
initTypes(C, neighbors, k, minoritary, D, d)
C |
List of vectors with one value for each pair instance-label |
neighbors |
Structure with the k nearest neighbors of each instance of the dataset |
k |
Number of neighbors to be considered for each instance |
minoritary |
Vector with the minority value of each label (normally, 1) |
D |
mld |
d |
Vector with the instances of the dataset which have one or more label active (ideally, all of them) |
A structure with the type assigned to each instance-label pair
This function implements the LP-ROS algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, whose aim is to identify instances with minority labelsets, and randomly clone them.
LPROS(D, P)
D |
mld |
P |
Percentage in which the original dataset is increased |
A mld object containing the preprocessed multilabel dataset
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163, 3-16.
library(mldr)
library(mldr.resampling)
LPROS(birds, 25)
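To make the notion of a minority labelset concrete, the base R sketch below counts how often each label combination occurs in a toy label matrix. It assumes that a labelset counts as minority when it has fewer instances than the mean labelset size, following the cited paper; all names are hypothetical and this is not the package's implementation.

```r
# Toy label matrix: one row per instance, one column per label
labels <- matrix(c(1, 0, 0,
                   1, 1, 0,
                   1, 0, 0,
                   1, 0, 1,
                   1, 1, 0,
                   1, 0, 0), ncol = 3, byrow = TRUE)

# Encode each instance's labelset as a string such as "100"
labelsets <- apply(labels, 1, paste, collapse = "")

# Number of instances per labelset
sizes <- table(labelsets)
counts <- as.vector(sizes)

# Assumption: labelsets below the mean size are candidates for cloning
minority_sets <- names(sizes)[counts < mean(counts)]
minority_sets
```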
This function implements the LP-RUS algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, whose aim is to identify instances with majority labelsets, and randomly delete them from the original dataset.
LPRUS(D, P)
D |
mld |
P |
Percentage in which the original dataset is decreased |
A mld object containing the preprocessed multilabel dataset
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163, 3-16.
library(mldr)
library(mldr.resampling)
LPRUS(birds, 25)
This function implements the MLeNN algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, whose aim is to identify instances with majority labels, and remove their neighbors that are too different from them in terms of active labels.
MLeNN(D, TH = 0.5, k = 3, neighbors = NULL, tableVDM = NULL)
D |
mld |
TH |
threshold for the Hamming Distance in order to consider an instance different to another one. Defaults to 0.5. |
k |
number of nearest neighbours to check for each instance. Defaults to 3. |
neighbors |
Structure with instances and neighbors. If it is empty, it will be calculated by the function |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
An mldr object containing the preprocessed multilabel dataset
Francisco Charte, Antonio J. Rivera, María J. del Jesus, and Francisco Herrera. MLeNN: A First Approach to Heuristic Multilabel Undersampling. Intelligent Data Engineering and Automated Learning – IDEAL 2014. ISBN 978-3-319-10840-7.
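The roles of TH and k can be illustrated with a small base R sketch. It counts how many of an instance's k neighbors are "too different" under the threshold, and flags the case where at least half of them are. The at-least-half rule follows the MLeNN paper and is an assumption about this package; the distances are toy values and the variable names are hypothetical.

```r
TH <- 0.5   # threshold on the Hamming distance
k <- 3      # number of neighbors checked

# Toy label distances between one instance and its k nearest neighbors
dists_to_neighbors <- c(0.8, 0.6, 0.2)

# Neighbors whose labels differ more than the threshold allows
n_different <- sum(dists_to_neighbors > TH)

# Assumption: the comparison is considered failed when half or more differ
flagged <- n_different >= k / 2
flagged
```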
This function implements an algorithm that uses the concept of reverse nearest neighbors, in order to create new instances for each label. Then, several radial SVMs, one for each label, are trained in order to predict each label of the synthetic instances.
MLRkNNOS(D, k, tableVDM = NULL)
D |
mld |
k |
Number of neighbors to be considered when creating a synthetic instance |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
A mld object containing the preprocessed multilabel dataset
Sadhukhan, P., & Palit, S. (2019). Reverse-nearest neighborhood based oversampling for imbalanced, multi-label datasets. Pattern Recognition Letters, 125, 813-820
This function implements the ML-ROS algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, whose aim is to identify instances with minority labels, and randomly clone them.
MLROS(D, P)
D |
mld |
P |
Percentage in which the original dataset is increased |
A mld object containing the preprocessed multilabel dataset
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163, 3-16.
library(mldr)
library(mldr.resampling)
MLROS(birds, 25)
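The imbalance measures from the cited paper can be sketched in base R to show how minority labels might be identified: IRLbl is the count of the most frequent label divided by the count of each label, and MeanIR is the average of those ratios. Treating labels with IRLbl above MeanIR as minority follows that paper; the toy matrix and variable names are hypothetical, and the package may compute this through mldr's own measures instead.

```r
# Toy label matrix with named labels
labels <- matrix(c(1, 0, 0,
                   1, 1, 0,
                   1, 0, 0,
                   1, 0, 1,
                   1, 1, 0,
                   1, 0, 0), ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("a", "b", "c")))

counts <- colSums(labels)        # instances in which each label is active
irlbl <- max(counts) / counts    # imbalance ratio per label
mean_ir <- mean(irlbl)           # mean imbalance ratio

# Labels more imbalanced than average
minority_labels <- names(irlbl)[irlbl > mean_ir]
minority_labels
```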
This function implements the ML-RUS algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, whose aim is to identify instances with majority labels, and randomly delete them from the original dataset.
MLRUS(D, P)
D |
mld |
P |
Percentage in which the original dataset is decreased |
A mld object containing the preprocessed multilabel dataset
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163, 3-16.
library(mldr)
library(mldr.resampling)
MLRUS(birds, 25)
This function implements the MLSMOTE algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, whose aim is to identify instances with minority labels, and generate synthetic instances based on their neighbor instances.
MLSMOTE(D, k, strategy = "ranking", tableVDM = NULL)
D |
mld |
k |
Number of neighbors to be considered when creating a synthetic instance |
strategy |
Strategy for choosing the synthetic labels. Possible values: "union", "intersection" and "ranking" (default) |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
A mld object containing the preprocessed multilabel dataset
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation. Knowledge-Based Systems, 89, 385-397.
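The two ingredients of a synthetic instance, interpolated attributes and a labelset chosen by one of the three strategies, can be sketched in base R. The sketch follows the description in the MLSMOTE paper (for "ranking", a label is kept when it is active in at least half of the instances considered). It uses a fixed interpolation factor instead of a random one so the result is reproducible, and all names and values are hypothetical rather than taken from the package.

```r
# Numeric attributes of the seed instance and of its reference neighbor
seed_features  <- c(1.0, 2.0)
neigh_features <- c(3.0, 6.0)

gap <- 0.25  # fixed here; a random value in [0, 1] would normally be drawn
synthetic_features <- seed_features + gap * (neigh_features - seed_features)

# Labelsets of the seed and of two neighbors (one row per instance)
label_rows <- rbind(seed = c(1, 0, 1),
                    n1   = c(1, 1, 0),
                    n2   = c(1, 0, 0))
active <- colSums(label_rows)

labels_union        <- as.integer(active > 0)                      # active anywhere
labels_intersection <- as.integer(active == nrow(label_rows))      # active everywhere
labels_ranking      <- as.integer(active >= nrow(label_rows) / 2)  # active in half or more
```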
This function implements the MLSOL algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, which applies oversampling on difficult regions of the instance space, in order to help classifiers distinguish labels.
MLSOL(D, P, k, neighbors = NULL, tableVDM = NULL)
D |
mld |
P |
Percentage in which the original dataset is increased |
k |
Number of neighbors to be considered when computing the neighbors of an instance |
neighbors |
Structure with all instances and neighbors in the dataset. If it is empty, it will be calculated by the function |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
A mld object containing the preprocessed multilabel dataset
Liu, B., Blekas, K., & Tsoumakas, G. (2022). Multi-label sampling based on local label imbalance. Pattern Recognition, 122, 108294.
This function implements the MLTL algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, whose aim is to identify Tomek links (majority instances with a very different neighbor), and remove them. It is similar to MLeNN with the number of neighbors set to 1.
MLTL(D, TH, neighbors = NULL, tableVDM = NULL)
D |
mld |
TH |
threshold for the Hamming Distance in order to consider an instance different to another one. |
neighbors |
Structure with instances and neighbors. If it is empty, it will be calculated by the function |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
An mldr object containing the preprocessed multilabel dataset
Pereira, R. M., Costa, Y. M., & Silla Jr, C. N. (2020). MLTL: A multi-label approach for the Tomek Link undersampling algorithm. Neurocomputing, 383, 95-105.
This function implements the MLUL algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, which applies undersampling, removing difficult instances according to their neighbors.
MLUL(D, P, k, neighbors = NULL, tableVDM = NULL)
D |
mld |
P |
Percentage in which the original dataset is decreased |
k |
Number of neighbors to be considered when computing the neighbors of an instance |
neighbors |
Structure with all instances and neighbors in the dataset. If it is empty, it will be calculated by the function |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
A mld object containing the preprocessed multilabel dataset
Liu, B., Blekas, K., & Tsoumakas, G. (2022). Multi-label sampling based on local label imbalance. Pattern Recognition, 122, 108294.
Auxiliary function used by MLSMOTE. Creates a synthetic sample based on values of attributes and labels of its neighbors
newSample(seedInstance, refNeigh, neighbors, strategy, D)
seedInstance |
Sample we are using as "template" |
refNeigh |
Reference neighbor |
neighbors |
Neighbors to take into account |
strategy |
Strategy for choosing the synthetic labels: union, intersection or ranking |
D |
mld |
A synthetic sample derived from the one passed as a parameter and its neighbors
This function implements the REMEDIAL algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, whose aim is to decouple frequent and rare classes appearing in the same instance. To do so, it adds new instances to the dataset and edits the labels present in them.
REMEDIAL(mld)
mld |
mldr object with the multilabel dataset to preprocess |
An mldr object containing the preprocessed multilabel dataset
F. Charte, A. J. Rivera, M. J. del Jesus, F. Herrera. "Resampling Multilabel Datasets by Decoupling Highly Imbalanced Labels". Proc. 2015 International Conference on Hybrid Artificial Intelligent Systems (HAIS 2015), pp. 489-501, Bilbao, Spain, 2015. Implementation from the original mldr package
library(mldr)
library(mldr.resampling)
REMEDIAL(birds)
Interface function of the package. It executes one or several algorithms, given as strings, and stores the resulting MLDs in ARFF files
resample(D, algorithms, P = 25, k = 3, TH = 0.5, strategy = "ranking", params, outputDirectory = tempdir())
D |
mld |
algorithms |
String, or string vector, with the name(s) of the algorithm(s) to be applied. |
P |
Percentage in which the original dataset is increased/decreased, if required by the algorithm(s). Defaults to 25 |
k |
Number of neighbors taken into account for each instance, if required by the algorithm(s). Defaults to 3 |
TH |
Threshold for the Hamming Distance in order to consider an instance different to another one, if required by the algorithm(s). Defaults to 0.5 |
strategy |
Strategy for choosing the synthetic labels, if required by the algorithm. Defaults to ranking |
params |
Dataframe with 4 columns: name of the algorithm, P, k and TH, in that order, to execute several algorithms with different values for their parameters |
outputDirectory |
Route with the directory where generated ARFF files will be stored. Defaults to a temporary directory |
Dataframe with the times (in seconds) taken to execute each algorithm
library(mldr)
library(mldr.resampling)
resample(birds, "LPROS", P=25)
resample(birds, c("LPROS", "LPRUS"), P=30)
Set the number of cores available for parallel computing
setNumCores(n)
n |
The new value for the number of cores |
No return value, called in order to change the number of cores
setNumCores(8)
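Rather than hard-coding a value, the number of cores could be derived from the machine. The sketch below uses `parallel::detectCores()` from base R and leaves one core free; this is a suggestion, not behaviour documented by the package. The call to `setNumCores` is guarded so that the snippet also runs where mldr.resampling is not installed.

```r
# Detect the number of cores; detectCores() can return NA on some platforms
cores <- parallel::detectCores()
if (is.na(cores)) cores <- 1L

# Leave one core free for the rest of the system, but never go below 1
n_cores <- max(1L, cores - 1L)

# Apply the setting only if the package is available
if (requireNamespace("mldr.resampling", quietly = TRUE)) {
  mldr.resampling::setNumCores(n_cores)
}
```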
Enable/Disable parallel computing
setParallel(beParallel)
beParallel |
A boolean indicating if parallel computing is to be enabled (TRUE) or disabled (FALSE) |
No return value, called in order to enable or disable parallel computing
setParallel(TRUE)
Auxiliary function used to calculate the Value Difference Metric (VDM) between two instances considering their non numeric attributes
vdm(D, sample, y, label, tableVDM = NULL)
D |
mld |
sample |
Index of the first sample |
y |
Index of the second sample |
label |
Label that will be considered in calculations |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
A value for the distance
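For a single nominal attribute and one binary label, the Value Difference Metric compares how often the label is active for each of the two attribute values. The base R sketch below sums, over both classes of the label, the difference in conditional probabilities raised to a power q. The exponent q = 2 and the helper name `vdm_value` are assumptions for illustration; the package may use a different exponent or combine several attributes.

```r
# Hypothetical helper: VDM between two attribute values for one binary label.
# p1, p2: proportion of instances with the label active for each value.
vdm_value <- function(p1, p2, q = 2) {
  # Contribution of class "active" plus contribution of class "inactive"
  abs(p1 - p2)^q + abs((1 - p1) - (1 - p2))^q
}

p_red   <- 0.5  # label active in half of the "red" instances
p_green <- 0    # label never active for "green" instances

vdm_value(p_red, p_green)
```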