The goal of this research is to design and study methods that combine zero-shot learning and quantification learning, with two aims: improving quantification performance in cases where targets lack instances, and overcoming the difficulty posed by the combination of prior and concept dataset shifts. Specifically, two approaches are studied: a classifier-based approach, in which a zero-shot classifier predicts class labels for the targets and a quantification algorithm subsequently optimizes the aggregate predictions, and a similarity-based approach, in which each task is treated as an independent quantification problem and the final result is obtained through interpolation. All methods are explored for the binary case.
In this section, we provide a brief introduction to the quantification and zero-shot learning paradigms, presenting some algorithms and a classification of both types of learning. This is not strictly necessary, since in many cases classifiers and quantifiers are used as black boxes and their internal architecture does not need to be fully understood; however, we consider this explanation helpful for following the rest of the document.
Quantification learning is a supervised learning paradigm that arises from the need to estimate target distributions in a dataset. Instead of predicting the target label of a single instance, quantification aims to estimate the prevalence of a particular target label in a given bag of data.
The most intuitive approach consists of using a classifier to classify all the instances of the bag and then counting the predictions for each label. However, it has been shown that this approach, usually called Classify & Count (CC), is sensitive to the classifier's error and does not provide accurate estimates, so other approaches have been explored.
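As a minimal sketch of Classify & Count, assuming a scikit-learn-style classifier (the data and variable names are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classify_and_count(clf, bag):
    """Estimate the positive-class prevalence of a bag by classifying
    every instance and counting the fraction of positive predictions."""
    return float(np.mean(clf.predict(bag) == 1))

# Illustrative usage with synthetic data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X_train, y_train)

bag = rng.normal(size=(100, 5))      # test bag with unknown prevalence
print(classify_and_count(clf, bag))  # CC prevalence estimate
```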
All state-of-the-art quantification algorithms can be classified into three groups:
- Adjusted Count (AC) variations: These methods are based on the idea of applying the CC algorithm and then correcting the estimates. The representative of this group is the AC algorithm, which corrects the CC estimate by means of the true-positive rate (tpr) and false-positive rate (fpr) of the classifier, both obtained during the training phase. The main idea is that the predicted prevalence is the sum of the true label probability multiplied by the tpr and the false label probability multiplied by the fpr:
$$ h(\mathcal{T}) = p \cdot \text{tpr} + (1 - p) \cdot \text{fpr} $$
Solving for $p$ gives the corrected estimate (a sketch of this correction is given after this list):
$$ \hat{p} = \frac{h(\mathcal{T}) - \text{fpr}}{\text{tpr} - \text{fpr}} $$
This assumes that the conditional probability $P(x \mid y)$ remains constant between training and testing, i.e., that only prior probability shift occurs.
- Mixture Models (MM) or distribution matching approaches: These methods modify the training distribution so that it matches the test distribution. Usually, the parameters used for matching the distributions are directly the estimated prevalences $p$.
- Purely quantification algorithms: These methods include algorithms adapted from other paradigms to the quantification task as well as algorithms specifically designed for quantification. The key idea is to incorporate quantification directly during training, typically by using quantification-specific losses or by embedding prevalence estimation into the learning goal. Some examples are quantification decision trees and quantification using the Q-measure.
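A minimal sketch of the AC correction, assuming tpr and fpr have been estimated beforehand on held-out training data (e.g., via cross-validation); names are illustrative:

```python
import numpy as np

def adjusted_count(clf, bag, tpr, fpr):
    """Correct the Classify & Count estimate using the classifier's
    true-positive and false-positive rates from the training phase:
        p_hat = (p_cc - fpr) / (tpr - fpr)
    The raw correction may fall outside [0, 1], so it is clipped."""
    p_cc = np.mean(clf.predict(bag) == 1)  # Classify & Count estimate
    return float(np.clip((p_cc - fpr) / (tpr - fpr), 0.0, 1.0))
```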
In this project, methods from the first two groups are explored. In particular, we use CC, AC, DeBias, QUANTy, and SORDy, which are already implemented in the QuantificationLib package.
Zero-shot learning is another supervised learning paradigm; it aims to make predictions for targets that have no available instances. The lack of instances introduces the need for a type of information different from the common features, usually called side information. Side information describes how common features and targets are related; this relationship, usually modeled through a conditional probability, is what allows knowledge learned from observed tasks to be transferred to targets without instances.
As in quantification, the most intuitive approach, which consists of using side information as common features (the Baseline method), is not the most appropriate one. Alternative approaches have been explored, model-based and instance-based approaches being the most common. The former, which is the focus of this research, can be generally classified as:
- Correspondence methods: These methods learn a function that maps side information to task models using the observed tasks; this function is later applied to the side information of the unobserved tasks.
- Relationship methods: These methods assume a relationship between the side information of observed and unobserved tasks, and make predictions using the models from the observed tasks together with their similarity to the unobserved tasks.
- Combination methods: They divide the task into a series of basic elements, learn models for these elements and then combine them via an inference process.
Although the majority of methods defined for zero-shot learning are designed for computer vision or NLP, the ones used and described above are defined for a general zero-shot learning (GZSL) scenario.
The combination of both problems, in what we call Zero-Shot Quantification (ZSQ), suggests an improvement in quantification performance when there are bags of instances with an unknown relationship to the targets. In this section, a brief but technical description of both approaches is given.
The idea consists of using a zero-shot classifier to classify the instances of the target bags and then applying an adequate quantifier.
Specifically, the classifiers were the Baseline (BS), Dyadic (DYA), Direct Side Information Learning (DSIL), and Similarity (SR) methods. The quantifiers included CC, AC, SORDy, QUANTy, and DeBias from the QuantificationLib library. All possible combinations of these classifiers and quantifiers were tested (Table 1). The BS and CC combinations were particularly useful, as they allowed us to assess the effectiveness of zero-shot and quantification techniques in isolation.
| | CC | AC | SORDy | QUANTy | DeBias |
|---|---|---|---|---|---|
| BS | BS+CC | BS+AC | BS+SORDy | BS+QUANTy | BS+DeBias |
| DYA | DYA+CC | DYA+AC | DYA+SORDy | DYA+QUANTy | DYA+DeBias |
| DSIL | DSIL+CC | DSIL+AC | DSIL+SORDy | DSIL+QUANTy | DSIL+DeBias |
| SR | SR+CC | SR+AC | SR+SORDy | SR+QUANTy | SR+DeBias |
Table 1: Considered classifier and quantifier combinations.
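Conceptually, each cell of Table 1 is a simple composition. The sketch below treats the zero-shot classifier as a black box with a scikit-learn-style predict method and reuses the AC correction as the quantification step; the interface is illustrative, not the actual project code:

```python
import numpy as np

def czsq_estimate(zs_classifier, bag, tpr, fpr):
    """Classifier-based ZSQ: a zero-shot classifier labels the instances
    of the unseen task's bag, and a quantifier (here, the AC correction)
    adjusts the aggregated predictions."""
    p_cc = np.mean(zs_classifier.predict(bag) == 1)              # CC step
    return float(np.clip((p_cc - fpr) / (tpr - fpr), 0.0, 1.0))  # AC step
```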
The main problem with this method is that zero-shot quantification involves both concept shift and prior probability shift, while many quantifiers assume a constant conditional probability $P(x \mid y)$ between training and testing, an assumption that concept shift violates.
"The method involves treating each task as an independent quantification subproblem. A quantifier is trained for each task in the training dataset without considering side information. During the testing phase, the trained quantifiers are applied to the unobserved tasks in the testing dataset, and the side information is used to interpolate their results using Inverse Distance Weighting (IDW) interpolation.
Figure 2: SZSQ model. In the training stage, a quantifier is trained for each task. In the testing stage, the results obtained from the trained quantifiers are interpolated, and the final outcome is the predicted prevalence.

The Portuguese bank marketing dataset was used and adapted for ZSQ, as side information and multiple tasks were required to apply ZSQ methods.
The bank marketing dataset contains both common and semantic information about the client, which can be separated to improve model performance. This semantic information, collected as side information features, includes data that can socially categorize clients, such as their age, job status, or education.
After selecting the side information, tasks were defined by applying KMeans clustering on the data instances based on their side information. The quality of the resulting clusters was evaluated using the Silhouette score.
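A minimal sketch of this step with scikit-learn (the number of clusters and the placeholder data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# side_info: matrix of side-information features (one row per client)
rng = np.random.default_rng(0)
side_info = rng.normal(size=(500, 4))  # placeholder for the real features

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
task_labels = kmeans.fit_predict(side_info)  # each cluster becomes a task

# Silhouette score in [-1, 1]; higher means better-separated clusters
print(silhouette_score(side_info, task_labels))
```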
This project uses Python 3.10.6. The main libraries are:
- NumPy 2.2.3
- Scikit-learn 1.6.1
- Pandas 2.2.3
- QuantificationLib 0.1.2
However, other libraries might be needed. They can be installed using the requirements.txt file, for example:
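pip install -r requirements.txt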
The dataset is generated from the bank-marketing.csv file. Its data is organized into tasks using a clustering algorithm, allowing the construction of a dataset suitable for zero-shot quantification.
This dataset can then be used to generate multiple smaller datasets for training and testing zero-shot quantification algorithms. To do so, task_selection_random.py must be executed:

python -m src.task_selection_random
Inside this file, we can modify the conditions for each dataset (illustrative values are shown after this list):
- DATA_NAME: The name of the parent dataset.
- VERBOSE: Level of output detail.
- NUM_TRAIN_TASKS: The number of tasks to include in the training dataset.
- NUM_TEST_TASKS: The number of tasks to include in the testing dataset.
- NUM_DATASETS: The number of datasets to generate.
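For instance, a configuration such as the following (the values are illustrative, not the project defaults) generates 30 datasets, each with 8 training tasks and 2 testing tasks:

```python
# Illustrative configuration inside task_selection_random.py
DATA_NAME = "bank-marketing"  # parent dataset
VERBOSE = 1                   # level of output detail
NUM_TRAIN_TASKS = 8           # tasks in each training dataset
NUM_TEST_TASKS = 2            # tasks in each testing dataset
NUM_DATASETS = 30             # number of datasets to generate
```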
The training and testing can be performed using the training_classifier_based.py or training_similarity_based.py scripts, using the command:

python -m src.training_classifier_based --classifier your_classifier --quantifier your_quantifier --data your_data
Where the arguments specify:
- --classifier: The classifier to use (from the available options).
- --quantifier: The quantification method to use.
- --data: The index of the dataset to use.
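For example, to train the BS classifier with the CC quantifier on the dataset with index 0 (the index value is illustrative):

python -m src.training_classifier_based --classifier BS --quantifier CC --data 0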
A Makefile is also provided to automate the training of multiple models and datasets. To use it, simply run:
make
The results from the testing/validation phase are saved as JSON files, one per method and dataset, in the results/random directory.
There are other scripts that can be executed to study this problem. Their syntax is similar to that of the scripts above, so we only mention their names:
- compute_errors and compute_errors_similarity: For computing the absolute error of the results with respect to the ground-truth prevalence.
- create_zsq_dataset: This script generates a CSV file from the original ARFF bank-marketing file. During the creation of the file a one-hot encoding is applied and the side-information is selected from the available social features.
- task_selection_random: Script for creating the datasets.
- dataset_study: For studying the dataset and data visualization.
- friedman_nemenyi_test and friedman_nemenyi_test_similarity: For applying the Friedman-Nemenyi statistical test to the results.
- results_analysis and results_analysis_similarity: For analyzing the results of the Friedman–Nemenyi test and its average ranks.
The results suggest an improvement in prevalence estimation when using a zero-shot classifier in quantification, i.e., the CZSQ method. In the case of SZSQ, no improvement was observed.
The prevalence results were compared with the ground-truth value using the Absolute Error (AE) measure:

$$ AE(\hat{p}, p) = |\hat{p} - p| $$

where $p$ represents the actual prevalence or ground-truth value of the dataset and $\hat{p}$ the estimated prevalence.
These errors were used for comparing the combinations of zero-shot classifier and quantifier using the Friedman-Nemenyi test.
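A minimal sketch of this comparison, assuming the scikit-posthocs package for the Nemenyi post-hoc step (the error data and method names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# One column per method, one row per dataset (illustrative random errors)
rng = np.random.default_rng(0)
errors = pd.DataFrame(rng.random((30, 3)), columns=["BS+CC", "BS+AC", "BS+QUANTy"])

# Friedman test: do the methods' error distributions differ significantly?
stat, p_value = friedmanchisquare(*[errors[c] for c in errors.columns])
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.3f}")

# Nemenyi post-hoc test for pairwise method comparisons
print(sp.posthoc_nemenyi_friedman(errors))
```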
| | BS | DSIL | Dyadic | SR |
|---|---|---|---|---|
| CC | 10.61 | 9.66 | **8.95** | 15.95 |
| AC | 9.03 | **6.66** | 7.16 | 11.42 |
| DeBias | 15.26 | 14.68 | **14.61** | 16.21 |
| QUANTy | 8.84 | 7.79 | **6.42\*** | 9.95 |
| SORDy | 11.53 | 9.21 | **7.18** | 8.89 |
Table 2: Average scores of combinations of quantification and classification algorithms in CZSQ. The value in each cell represents the average score of a combination of a zero-shot classifier (column name) and a quantifier (row name). The best results are highlighted in bold and the best overall result is highlighted with an asterisk (*).
| BS+CC | BS+AC | BS+DeBias | BS+QUANTy | BS+SORDy | CC | AC | DeBias | QUANTy | SORDy |
|---|---|---|---|---|---|---|---|---|---|
| 3.89 | 3.58 | 5.95 | **3.47\*** | 4.16 | 7.13 | 6.39 | 8.00 | 6.26 | **6.16** |
Table 3: Average ranks for CZSQ methods employing the BS classifier and for SZSQ methods. The SZSQ methods are named according to the base quantifier employed during the process. The best results are highlighted in bold and the best overall result is highlighted with an asterisk (*).
The main ideas have been taken from the following articles: