ZeroShotQuantification


🎯 Objective

The goal of this research is to design and study methods that combine zero-shot learning and quantification learning, with the aim of improving quantification performance when targets lack instances and of overcoming the difficulty posed by the combination of prior probability shift and concept shift. Specifically, two approaches are studied: a classifier-based approach, in which a zero-shot classifier predicts class labels for the targets and a quantification algorithm subsequently adjusts the aggregated predictions, and a similarity-based approach, in which each task is treated as an independent quantification problem and the final result is obtained through interpolation. All methods are explored for the binary case.

📖 Theoretical background

In this section, we provide a brief introduction to the quantification and zero-shot learning paradigms. Some algorithms and a classification of both types of learning are presented. This is not strictly necessary, since in many cases classifiers and quantifiers are used as black boxes and their internal architecture does not need to be fully understood, but we consider the explanation helpful for following the rest of the document.

Quantification learning

Quantification learning is a supervised learning paradigm that arises from the need to estimate target distributions in a dataset. Instead of predicting the target label of a particular instance, quantification aims to estimate the prevalence of each target label in a given bag of data.

The most intuitive approach consists of using a classifier to label all the instances in the bag and then counting the predictions for each label. However, it has been shown that this approach, usually called Classify & Count (CC), is sensitive to the classifier's error and does not provide accurate estimates, so other approaches have been explored.

All state-of-the-art quantification algorithms can be classified into three groups:

  1. Adjusted Count (AC) variations: These methods apply the CC algorithm and then correct the estimates. The representative of this group is the AC algorithm, which corrects the CC estimate using the true-positive rate (tpr) and false-positive rate (fpr) of the classifier, both obtained during the training phase. The main idea is that the predicted prevalence is the sum of the true label probability multiplied by the tpr and the false label probability multiplied by the fpr (see the sketch after this list):

    $$ h(\mathcal{T}) = p \cdot \text{tpr} + (1 - p) \cdot \text{fpr} $$

       This is assuming that the conditional probability $P(\mathcal{X}|\mathcal{Y})$ does not change.

  2. Mixture Models (MM) or distribution matching approaches: These methods modify the training distribution so that it matches the test distribution. Usually, the parameters used to match the distributions are directly the estimated prevalences $p$.
  3. Purely quantification algorithms: These methods are specifically designed for quantification, often adapting algorithms from other paradigms. The key idea is to incorporate quantification directly during training, typically by using a quantification-specific loss or by including prevalence estimation in the learning goal. Some examples are quantification decision trees and quantification using the Q-measure.
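
To make the AC correction concrete, the following minimal sketch (plain NumPy, not the quantificationLib implementation) computes a CC estimate for a bag and then inverts the formula above, $p = (\hat{p}_{CC} - \text{fpr}) / (\text{tpr} - \text{fpr})$. The function names and the toy values are illustrative assumptions.

```python
import numpy as np

def cc_estimate(predictions):
    """Classify & Count: fraction of instances predicted as positive."""
    return float(np.mean(predictions))

def ac_estimate(predictions, tpr, fpr):
    """Adjusted Count: correct the CC estimate with the classifier's tpr and
    fpr (estimated during training, e.g. via cross-validation), by solving
    p_cc = p * tpr + (1 - p) * fpr for p."""
    p_cc = cc_estimate(predictions)
    p = (p_cc - fpr) / (tpr - fpr)
    return float(np.clip(p, 0.0, 1.0))  # keep the prevalence inside [0, 1]

# Toy example: binary predictions for a test bag and training-phase rates
preds = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
print(cc_estimate(preds))                    # raw CC prevalence: 0.6
print(ac_estimate(preds, tpr=0.9, fpr=0.2))  # corrected AC prevalence: ~0.57
```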

In this project, methods from the first two groups are explored. In particular, we use CC, AC, DeBias, QUANTy and SORDy, which are already implemented in the quantificationLib package.

Zero-shot learning

Zero-shot learning is another supervised learning paradigm, one that tries to make predictions for targets that have no available instances. The lack of instances introduces the need for information other than the common features, usually called side information. Side information describes how the common features and the targets are related, since this relationship, usually modeled with the conditional probability $P(\mathcal{X}|\mathcal{Y})$, changes from task to task. This change is known as concept shift and makes it possible to differentiate tasks based on the relationship between features and targets.

As in quantification, the most intuitive approach, which consists of using the side information as common features (the Baseline method), is not the most appropriate one. Alternative approaches have been explored, with model-based and instance-based approaches being the most common. The former, which is the focus of this research, can be generally classified as:

  1. Correspondence methods: These methods learn a function that relates the side information to the models of the observed tasks; this function is later used on the unobserved tasks.
  2. Relationship methods: These methods assume a relationship between the side information of the observed and unobserved tasks and make predictions using the models of the observed tasks and their similarity with the unobserved tasks.
  3. Combination methods: These methods divide the task into a series of basic elements, learn models for these elements and then combine them via an inference process.

Although most zero-shot learning methods are designed for computer vision or NLP, the ones used and described above are defined for a general zero-shot learning (GZSL) scenario.

🧪 Zero-shot quantification

The combination of both problems, which we call Zero-Shot Quantification (ZSQ), suggests an improvement in quantification performance when there are bags of instances whose relationship with the targets is unknown. In this section a brief but technical description of both approaches is given.

Classifier-based Zero-Shot Quantifier (CZSQ)

The idea consists of using a zero-shot classifier to predict the target labels and then applying an adequate quantifier.

Specifically, the classifiers were the Baseline (BS), Dyadic (DYA), Direct Side Information Learning (DSIL), and Similarity (SR) methods. The quantifiers were CC, AC, SORDy, QUANTy and DeBias from the quantificationLib library. All possible combinations of these classifiers and quantifiers were tested (Table 1). The BS and CC combinations were particularly useful, as they allowed us to determine the effectiveness of considering zero-shot or quantification techniques in isolation.

|      | CC      | AC      | SORDy      | QUANTy      | DeBias      |
|------|---------|---------|------------|-------------|-------------|
| BS   | BS+CC   | BS+AC   | BS+SORDy   | BS+QUANTy   | BS+DeBias   |
| DYA  | DYA+CC  | DYA+AC  | DYA+SORDy  | DYA+QUANTy  | DYA+DeBias  |
| DSIL | DSIL+CC | DSIL+AC | DSIL+SORDy | DSIL+QUANTy | DSIL+DeBias |
| SR   | SR+CC   | SR+AC   | SR+SORDy   | SR+QUANTy   | SR+DeBias   |

Table 1: Considered classifier and quantifier combinations.

The main problem with this method is that zero-shot quantification involves both concept shift and prior probability shift, while many quantifiers assume a constant conditional probability $P(\mathcal{X}|\mathcal{Y})$, which is incompatible with concept shift. We address this by changing the assumption to a constant $P(\mathcal{X}, \mathcal{S}|\mathcal{Y})$, as we now have side information. The accuracy of the quantifiers depends on the plausibility of this assumption.


Figure 1: CZSQ model. In the training stage, a zero-shot classifier is trained on the training data and the quantifier is built by combining it with an adequate aggregation method. In the testing stage, the quantifier is applied to make predictions: the zero-shot classifier predicts the target labels and the aggregation method estimates the prevalence.
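
To make the pipeline concrete, here is a minimal sketch of the simplest combination, BS+CC, on synthetic data: the Baseline classifier just appends the side information to the common features (a scikit-learn LogisticRegression stands in for the project's classifiers), and CC aggregates its predictions into a prevalence. All names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical observed tasks: common features X, side information S, labels y
X_train = rng.normal(size=(200, 5))
S_train = np.repeat([[0.2, 0.8], [0.7, 0.3]], 100, axis=0)  # two observed tasks
y_train = rng.integers(0, 2, size=200)

# Baseline (BS) zero-shot classifier: side information used as extra features
clf = LogisticRegression(max_iter=1000)
clf.fit(np.hstack([X_train, S_train]), y_train)

# Unobserved task: a test bag whose instances share new side information
X_test = rng.normal(size=(50, 5))
S_test = np.tile([0.5, 0.5], (50, 1))
predictions = clf.predict(np.hstack([X_test, S_test]))

# Classify & Count aggregation: the prevalence is the fraction of positives
print(predictions.mean())
```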

Similarity-based Zero-Shot Quantifier (SZSQ)

"The method involves treating each task as an independent quantification subproblem. A quantifier is trained for each task in the training dataset without considering side information. During the testing phase, the trained quantifiers are applied to the unobserved tasks in the testing dataset, and the side information is used to interpolate their results using Inverse Distance Weighting (IDW) interpolation.


Figure 2: SZSQ model. In the training stage, a quantifier is trained for each task. In the testing stage, the results obtained from the trained quantifiers are interpolated, and the final outcome is the predicted prevalence.

📊 Datasets: ZSQ Bank-marketing dataset

The Portuguese bank marketing dataset was used and adapted for ZSQ, as side information and multiple tasks were required to apply ZSQ methods.

The bank marketing dataset contains both common and semantic information about the client, which can be separated to improve model performance. This semantic information, collected as side information features, includes data that can socially categorize clients, such as their age, job status, or education.

After selecting the side information, tasks were defined by applying KMeans clustering on the data instances based on their side information. The quality of the resulting clusters was evaluated using the Silhouette score.
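
The following sketch illustrates this clustering step under stated assumptions: the CSV path and the side-information column selection are hypothetical (the actual selection and one-hot encoding are performed by create_zsq_dataset), and the range of cluster counts is arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical file and column selection; we assume the side-information
# features are already numeric (e.g. one-hot encoded) in the generated CSV.
data = pd.read_csv("bank-marketing.csv")
side_info = data.filter(regex="age|job|education")  # placeholder selection

# Try several numbers of clusters (tasks) and inspect their silhouette scores
for n_tasks in range(2, 11):
    labels = KMeans(n_clusters=n_tasks, n_init=10, random_state=0).fit_predict(side_info)
    print(n_tasks, silhouette_score(side_info, labels))
```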

🛠️ How to use it

📦 Dependencies

This project uses Python 3.10.6. The main libraries are:

  • NumPy 2.2.3
  • Scikit-learn 1.6.1
  • Pandas 2.2.3
  • QuantificationLib 0.1.2

However, other libraries might be needed. They can be installed using the requirements.txt file.
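
Assuming a standard pip setup, the dependencies listed in requirements.txt can be installed with:

pip install -r requirements.txt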

🧩 Task selection and generation of the dataset

The dataset is generated using the bank-marketing.csv dataset. Its data is organized in tasks using a clustering algorithm, allowing the construction of a dataset suitable for zero-shot quantification.

This dataset can then be used to generate multiple smaller datasets for training and testing zero-shot quantification algorithms. To do so, task_selection_random.py must be executed:

python -m src.task_selection_random

Inside this file we can modify the conditions for each dataset:

  • DATA_NAME: The name of the parent dataset.
  • VERBOSE: Level of output detail.
  • NUM_TRAIN_TASKS: The number of tasks to include in the training dataset.
  • NUM_TEST_TASKS: The number of tasks to include in the testing dataset.
  • NUM_DATASETS: The number of datasets to generate.

🏋️ Training and testing

The training and testing can be performed using the training_classifier_based.py or training_similarity_based.py scripts with the command:

python -m src.training_classifier_based --classifier your_classifier --quantifier your_quantifier --data your_data

Where the arguments specify:

  • --classifier: The classifier to use (from the available options).
  • --quantifier: The quantification method to use.
  • --data: The index of the dataset to use.

A Makefile is also provided to automate the training of multiple models and datasets. To use it, simply run:

make

The results from the testing/validation phase are saved as JSON files, one per method and dataset in the results/random directory.

➕ Others

There are other scripts that can be executed to study this problem. Their syntax is similar to that of the scripts above, so we only mention their names:

  1. compute_errors and compute_errors_similarity: For computing the absolute error of the results with respect to the ground-truth prevalence.
  2. create_zsq_dataset: This script generates a CSV file from the original ARFF bank-marketing file. During the creation of the file a one-hot encoding is applied and the side-information is selected from the available social features.
  3. task_selection_random: Script for creating the datasets.
  4. dataset_study: For studying the dataset and data visualization.
  5. friedman_nemenyi_test and friedman_nemenyi_test_similarity: For applying the Friedman-Nemenyi statistical test to the results.
  6. results_analysis and results_analysis_similarity: For analyzing the results of the Friedman–Nemenyi test and its average ranks.

📈 Results

The results suggest an improvement in prevalence estimation when a zero-shot classifier is used within quantification, i.e., the CZSQ method. In the case of SZSQ, no improvement was observed.

The prevalence results were compared with the ground-truth value using the Absolute Error (AE) measure.

$$ AE = |h(\mathcal{T}) - p|, $$

where $p$ represents the actual prevalence (ground-truth value) of the dataset.

These errors were used for comparing the combinations of zero-shot classifier and quantifier using the Friedman-Nemenyi test.
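
As a rough sketch of how such a comparison can be set up (the project's friedman_nemenyi_test scripts may differ), the snippet below runs a Friedman test over a hypothetical AE matrix with SciPy and computes average ranks like those reported in Tables 2 and 3; the Nemenyi post-hoc step is omitted and all values are invented.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical AE matrix: one row per dataset, one column per method
# (e.g. the classifier+quantifier combinations of Table 1).
errors = np.array([
    [0.08, 0.05, 0.12, 0.04],
    [0.10, 0.07, 0.15, 0.06],
    [0.09, 0.06, 0.14, 0.05],
    [0.11, 0.08, 0.13, 0.07],
])

# Friedman test: are the methods' errors significantly different overall?
stat, p_value = friedmanchisquare(*errors.T)
print(f"Friedman statistic={stat:.3f}, p-value={p_value:.3f}")

# Average rank per method (lower error -> lower, i.e. better, rank), which is
# the kind of quantity the Nemenyi post-hoc test compares.
avg_ranks = rankdata(errors, axis=1).mean(axis=0)
print("Average ranks:", avg_ranks)
```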

|        | BS    | DSIL  | Dyadic | SR    |
|--------|-------|-------|--------|-------|
| CC     | 10.61 | 9.66  | 8.95   | 15.95 |
| AC     | 9.03  | 6.66  | 7.16   | 11.42 |
| DeBias | 15.26 | 14.68 | 14.61  | 16.21 |
| QUANTy | 8.84  | 7.79  | 6.42*  | 9.95  |
| SORDy  | 11.53 | 9.21  | 7.18   | 8.89  |

Table 2: Average scores of combinations of quantification and classification algorithms in CZSQ. The value in each cell represents the average score of a combination of a zero-shot classifier (column name) and a quantifier (row name). The best results are highlighted in bold and the best overall result is highlighted with an asterisk (*).

| BS+CC | BS+AC | BS+DeBias | BS+QUANTy | BS+SORDy | CC   | AC   | DeBias | QUANTy | SORDy |
|-------|-------|-----------|-----------|----------|------|------|--------|--------|-------|
| 3.89  | 3.58  | 5.95      | 3.47*     | 4.16     | 7.13 | 6.39 | 8.00   | 6.26   | 6.16  |

Table 3: Average ranks for CZSQ methods employing the BS classifier and for SZSQ methods. The SZSQ methods are named according to the base quantifier employed during the process. The best results are highlighted in bold and the best overall result is highlighted with an asterisk (*).

📚 References

The main ideas have been taken from the following articles:

Fdez-Díaz, Miriam, Elena Montañés, and José Quevedo. 2023. “Direct Side Information Learning for Zero-Shot Regression.” Neurocomputing 561: 126873. https://doi.org/10.1016/j.neucom.2023.126873.

González, Pablo, Alberto Castaño, Nitesh Chawla, and Juan del Coz. 2017. “A Review on Quantification Learning.” ACM Computing Surveys 50: 1–40. https://doi.org/10.1145/3117807.

Fdez-Díaz, Miriam, José Quevedo, and Elena Montañés. 2022. “Target Inductive Methods for Zero-Shot Regression.” Information Sciences 599. https://doi.org/10.1016/j.ins.2022.03.075.

Castaño, Alberto, Jaime Alonso, Pablo González, Pablo Pérez, and Juan José del Coz. 2024. “QuantificationLib: A Python Library for Quantification and Prevalence Estimation.” SoftwareX 26: 101728. https://doi.org/10.1016/j.softx.2024.101728.

Demšar, Janez. 2006. “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research 7 (1): 1–30. https://dl.acm.org/doi/10.5555/1248547.1248548.
