Welcome to SPEAR’s documentation!
SPEAR: Semi-Supervised Data Programming
We present SPEAR, an open-source Python library for data programming with semi-supervision. The package implements several recent data programming approaches, including facilities to programmatically label and build training data. SPEAR facilitates weak supervision in the form of pre-defined rules or heuristics that associate ‘noisy’ labels (or prelabels) with the training dataset. These noisy labels are aggregated to assign labels to the unlabeled data for downstream tasks. Several label aggregation approaches have been proposed that first aggregate the noisy labels and then train a model on the ‘noisily’ labeled set in a cascaded manner, while other approaches ‘jointly’ aggregate the labels and train the model. In the Python package, we integrate several cascade and joint data-programming approaches while providing facilities to define rules. The code and tutorial notebooks are available here.
Labeling
This module takes inspiration from and builds upon Ratner et al. [RBE+20]
LF
- class spear.labeling.lf.core.LabelingFunction(name: str, f: Callable[..., int], label=None, resources: Optional[Mapping[str, Any]] = None, pre: Optional[List[spear.labeling.preprocess.core.BasePreprocessor]] = None, cont_scorer: Optional[spear.labeling.continuous_scoring.core.BaseContinuousScorer] = None)[source]¶
Base class for labeling function
- Parameters
name (str) – name for this LF object
f (Callable[..., int]) – core function which labels the input
label (enum) – Which class this LF corresponds to
resources (Optional[Mapping[str, Any]], optional) – Additional resources for core function. Defaults to None.
pre (Optional[List[BasePreprocessor]], optional) – Preprocessors to apply on input before labeling. Defaults to None.
cont_scorer (Optional[BaseContinuousScorer], optional) – Continuous Scorer to calculate the confidence score. Defaults to None.
- class spear.labeling.lf.core.labeling_function(name: Optional[str] = None, label=None, resources: Optional[Mapping[str, Any]] = None, pre: Optional[List[spear.labeling.preprocess.core.BasePreprocessor]] = None, cont_scorer: Optional[spear.labeling.continuous_scoring.core.BaseContinuousScorer] = None)[source]¶
Decorator class for a labeling function
- Parameters
name (Optional[str], optional) – Name for this labeling function. Defaults to None.
label (Optional[Enum], optional) – An enum. Which class this LF corresponds to. Defaults to None.
resources (Optional[Mapping[str, Any]], optional) – Additional resources for the LF. Defaults to None.
pre (Optional[List[BasePreprocessor]], optional) – Preprocessors to apply on input before labeling . Defaults to None.
cont_scorer (Optional[BaseContinuousScorer], optional) – Continuous Scorer to calculate the confidence score. Defaults to None.
- Raises
ValueError – If the decorator is missing parentheses
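The decorator pattern described above, including the missing-parentheses check, can be sketched roughly as follows. This is a minimal illustration and not SPEAR's implementation: `SketchLF`, `sketch_labeling_function`, the `SPAM`/`ABSTAIN` values and the way resources are passed as keyword arguments are all assumptions made for the example.

```python
from typing import Any, Callable, Mapping, Optional


class SketchLF:
    """Hypothetical stand-in for a labeling-function object (not SPEAR's class)."""

    def __init__(self, name: str, f: Callable[..., Any], label: Any = None,
                 resources: Optional[Mapping[str, Any]] = None) -> None:
        self.name = name
        self.label = label  # the class this LF corresponds to
        self._f = f
        self._resources = dict(resources or {})

    def __call__(self, x: Any) -> Any:
        # Assumption: resources reach the core function as keyword arguments.
        return self._f(x, **self._resources)


class sketch_labeling_function:
    """Hypothetical decorator class; it must be applied with parentheses."""

    def __init__(self, name: Optional[str] = None, label: Any = None,
                 resources: Optional[Mapping[str, Any]] = None) -> None:
        # Without parentheses, Python passes the decorated function itself
        # as the first argument, which is how the mistake is detected here.
        if callable(name):
            raise ValueError("Decorator is missing parentheses")
        self.name = name
        self.label = label
        self.resources = resources

    def __call__(self, f: Callable[..., Any]) -> SketchLF:
        return SketchLF(self.name or f.__name__, f, self.label, self.resources)


SPAM, ABSTAIN = 1, None  # illustrative label values, not SPEAR constants


@sketch_labeling_function(label=SPAM, resources={"keywords": ("free", "winner")})
def has_spam_keyword(text, keywords):
    return SPAM if any(k in text.lower() for k in keywords) else ABSTAIN
```

Writing `@sketch_labeling_function` without the trailing `()` raises the `ValueError` at decoration time, so the mistake surfaces when the module is imported rather than when the function is first applied to data.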
Continuous scoring
- class spear.labeling.continuous_scoring.core.BaseContinuousScorer(name: str, cf: Callable[..., int], resources: Optional[Mapping[str, Any]] = None)[source]¶
Base Class for Continuous Scoring function used by the Labeling Function
- Parameters
name (str) – Name of the continuous scoring function
cf (Callable[..., int]) – Core function which calculates continuous score
resources (Optional[Mapping[str, Any]], optional) – Resources for the scorer. Defaults to None.
- class spear.labeling.continuous_scoring.core.continuous_scorer(name: Optional[str] = None, resources: Optional[Mapping[str, Any]] = None)[source]¶
Decorator class for continuous scoring.
- Parameters
name (Optional[str], optional) – Name for the decorator. Defaults to None.
resources (Optional[Mapping[str, Any]], optional) – Resources for the scorer. Defaults to None.
- Raises
ValueError – If decorator is missing parentheses.
LFApply
- class spear.labeling.apply.core.ApplierMetadata(faults: Dict[str, int])[source]¶
Metadata about Applier call.
- property faults¶
Alias for field number 0
- class spear.labeling.apply.core.BaseLFApplier(lf_set: spear.labeling.lf_set.core.LFSet)[source]¶
Base class for LF applier objects, which execute a set of LFs on a collection of data points. Subclasses should operate on a single data point collection format (e.g. DataFrame). Subclasses must implement the apply method.
- Parameters
lf_set (LFSet) – Instance of LFSet that holds the set of labeling functions to be applied to the data
- Raises
ValueError – If names of LFs are not unique
- spear.labeling.apply.core.apply_lfs_to_data_point(x: Any, index: int, lfs: List[spear.labeling.lf.core.LabelingFunction], f_caller: spear.labeling.apply.core._FunctionCaller) → List[Tuple[int, int, int, float]][source]¶
Label a single data point with a set of LFs
- Parameters
x (DataPoint) – Data point to label
index (int) – Index of the data point
lfs (List[LabelingFunction]) – List of LFs to label x with
f_caller (_FunctionCaller) – A _FunctionCaller to record failed LF executions
- Returns
A list of (data point index, LF index, label enum, confidence) tuples
- Return type
RowData
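The row format returned here can be illustrated with a small pure-Python sketch. It is not SPEAR's code: the `label_point` helper, the convention that each LF returns a `(label, confidence)` pair, the `-1` abstain value and the choice to skip abstaining LFs are assumptions made for the example.

```python
from typing import Any, Callable, List, Sequence, Tuple

ABSTAIN = -1  # assumption: abstain is encoded as -1 in this sketch

Row = Tuple[int, int, int, float]  # (data point index, LF index, label, confidence)


def label_point(x: Any, index: int,
                lfs: Sequence[Callable[[Any], Tuple[int, float]]]) -> List[Row]:
    """Collect one row per LF that votes on x (abstains are skipped here)."""
    rows: List[Row] = []
    for j, lf in enumerate(lfs):
        label, confidence = lf(x)
        if label != ABSTAIN:
            rows.append((index, j, label, confidence))
    return rows


def short_text(x: str) -> Tuple[int, float]:
    return (0, 1.0) if len(x) < 10 else (ABSTAIN, 0.0)


def has_digit(x: str) -> Tuple[int, float]:
    return (1, 0.5) if any(c.isdigit() for c in x) else (ABSTAIN, 0.0)


rows = label_point("call 911", 3, [short_text, has_digit])
print(rows)  # [(3, 0, 0, 1.0), (3, 1, 1, 0.5)]
```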
- class spear.labeling.apply.core.LFApplier(lf_set: spear.labeling.lf_set.core.LFSet)[source]¶
LF applier for a list of data points (e.g. SimpleNamespace) or a NumPy array.
- Parameters
lf_set (LFSet) – Instance of LFSet that holds the set of labeling functions to be applied to the data
- apply(data_points: Union[Sequence[Any], numpy.ndarray], progress_bar: bool = True, fault_tolerant: bool = False, return_meta: bool = False) → Union[numpy.ndarray, Tuple[numpy.ndarray, spear.labeling.apply.core.ApplierMetadata]][source]¶
Label list of data points or a NumPy array with LFs.
- Parameters
data_points (Union[DataPoints, np.ndarray]) – List of data points or NumPy array to be labeled by LFs
progress_bar (bool, optional) – Whether to display a progress bar. Defaults to True.
fault_tolerant (bool, optional) – Whether to output -1 if LF execution fails. Defaults to False.
return_meta (bool, optional) – Whether to return metadata from the apply call. Defaults to False.
- Returns
- np.ndarray:
Matrix of labels emitted by LFs
- ApplierMetadata:
Metadata, such as fault counts, for the apply call
- Return type
Union[np.ndarray, Tuple[np.ndarray, ApplierMetadata]]
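The interplay of `fault_tolerant` and the returned fault counts can be sketched as below. This is a simplified stand-in under stated assumptions, not the library's applier: `apply_lfs` is a hypothetical helper, LFs are passed as `(name, function)` pairs, and the label matrix is a list of lists rather than a NumPy array.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

ABSTAIN = -1  # value emitted when a fault-tolerant run swallows an LF error


def apply_lfs(data_points: Sequence[Any],
              lfs: Sequence[Tuple[str, Callable[[Any], int]]],
              fault_tolerant: bool = False) -> Tuple[List[List[int]], Dict[str, int]]:
    """Build a label matrix (rows: data points, columns: LFs) plus fault counts."""
    faults: Dict[str, int] = {name: 0 for name, _ in lfs}
    matrix: List[List[int]] = []
    for x in data_points:
        row: List[int] = []
        for name, lf in lfs:
            try:
                row.append(lf(x))
            except Exception:
                if not fault_tolerant:
                    raise  # default behaviour: surface the LF error
                faults[name] += 1
                row.append(ABSTAIN)
        matrix.append(row)
    return matrix, faults


def is_long(x: str) -> int:
    return 1 if len(x) > 5 else ABSTAIN


def first_upper(x: str) -> int:
    return 0 if x[0].isupper() else ABSTAIN  # raises IndexError on ""


L, faults = apply_lfs(["Hello world", "", "ok"],
                      [("is_long", is_long), ("first_upper", first_upper)],
                      fault_tolerant=True)
print(L)       # [[1, 0], [-1, -1], [-1, -1]]
print(faults)  # {'is_long': 0, 'first_upper': 1}
```

Note that a swallowed error and a genuine abstain look identical in the matrix, so the fault counts are the only way to tell them apart.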
LFSet
- class spear.labeling.lf_set.core.LFSet(name: str, lfs: List[spear.labeling.lf.core.LabelingFunction] = [])[source]¶
Class for Set of Labeling Functions
- Parameters
name (str) – Name for this LFset.
lfs (List[LabelingFunction], optional) – List of LFs to add to this object. Defaults to [].
- get_lfs() → Set[spear.labeling.lf.core.LabelingFunction][source]¶
Returns LFs contained in this LFSet object
- Returns
LFs in this LFSet
- Return type
Set[LabelingFunction]
- add_lf(lf: spear.labeling.lf.core.LabelingFunction) → None[source]¶
Adds single LF to this LFSet
- Parameters
lf (LabelingFunction) – LF to add
- add_lf_list(lf_list: List[spear.labeling.lf.core.LabelingFunction]) → None[source]¶
Adds a list of LFs to this LFSet
- Parameters
lf_list (List[LabelingFunction]) – List of LFs to add to this LFSet
- remove_lf(lf: spear.labeling.lf.core.LabelingFunction) → None[source]¶
Removes a LF from this set
- Parameters
lf (LabelingFunction) – LF to remove from this set
- Raises
Warning – If LF not already in LFset
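A container with the same four operations might look like the following sketch. `SketchLFSet` is a hypothetical class written for illustration; whether SPEAR stores LFs in a set internally or issues the warning through `warnings.warn` is an assumption.

```python
import warnings
from typing import Callable, Iterable, Set


class SketchLFSet:
    """Hypothetical container mirroring the documented LFSet operations."""

    def __init__(self, name: str, lfs: Iterable[Callable] = ()) -> None:
        self.name = name
        self._lfs: Set[Callable] = set(lfs)

    def get_lfs(self) -> Set[Callable]:
        return set(self._lfs)  # return a copy so callers cannot mutate the set

    def add_lf(self, lf: Callable) -> None:
        self._lfs.add(lf)

    def add_lf_list(self, lf_list: Iterable[Callable]) -> None:
        self._lfs.update(lf_list)

    def remove_lf(self, lf: Callable) -> None:
        if lf in self._lfs:
            self._lfs.remove(lf)
        else:
            warnings.warn("LF is not in this set")
```

The sketch uses an empty tuple as the default for `lfs`; a mutable default such as `[]` is shared between calls in Python, which is harmless only as long as it is never modified in place.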
LFAnalysis
- class spear.labeling.analysis.core.LFAnalysis(enum, L: numpy.ndarray, rules=None)[source]¶
Run analysis on LFs using label matrix.
- Parameters
L (np.ndarray) – Label matrix where L_{i,j} is the label given by the jth LF to the ith x instance
lfs (Optional[List[LabelingFunction]], optional) – Labeling functions used to generate L. Defaults to None.
abstain (int, optional) – label associated with abstain. Defaults to -1.
- Raises
ValueError – If number of LFs and number of LF matrix columns differ
- label_coverage() → float[source]¶
Compute the fraction of data points with at least one label.
- Returns
Fraction of data points with labels
- Return type
float
Example
>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_coverage()
0.8
- label_overlap() → float[source]¶
Compute the fraction of data points with at least two (non-abstain) labels.
- Returns
Fraction of data points with overlapping labels
- Return type
float
Example
>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_overlap()
0.6
- label_conflict() → float[source]¶
Compute the fraction of data points with conflicting (non-abstain) labels.
- Returns
Fraction of data points with conflicting labels
- Return type
float
Example
>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_conflict()
0.2
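The three dataset-level statistics above can be reproduced by hand. The following pure-Python sketch is not the library's implementation; it uses a list of lists in place of the NumPy matrix and assumes `-1` marks an abstain, as in the examples.

```python
from typing import List

ABSTAIN = -1  # abstain value used in the examples above


def label_coverage(L: List[List[int]]) -> float:
    """Fraction of rows with at least one non-abstain label."""
    return sum(any(v != ABSTAIN for v in row) for row in L) / len(L)


def label_overlap(L: List[List[int]]) -> float:
    """Fraction of rows with at least two non-abstain labels."""
    return sum(sum(v != ABSTAIN for v in row) >= 2 for row in L) / len(L)


def label_conflict(L: List[List[int]]) -> float:
    """Fraction of rows whose non-abstain labels disagree."""
    return sum(len({v for v in row if v != ABSTAIN}) >= 2 for row in L) / len(L)


L = [
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
]
print(label_coverage(L), label_overlap(L), label_conflict(L))  # 0.8 0.6 0.2
```

Only the third row conflicts (labels 1 and 0), which is why the conflict fraction is 1/5 even though three rows overlap.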
- lf_polarities() → List[List[int]][source]¶
Infer the polarities of each LF based on evidence in a label matrix.
- Returns
Unique output labels for each LF
- Return type
List[List[int]]
Example
>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_polarities()
[[0, 1], [0], [0]]
- lf_coverages() → numpy.ndarray[source]¶
Compute frac. of examples each LF labels.
- Returns
Fraction of labeled examples for each LF
- Return type
np.ndarray
Example
>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_coverages()
array([0.4, 0.8, 0.4])
- lf_overlaps(normalize_by_coverage: bool = False) → numpy.ndarray[source]¶
Compute frac. of examples each LF labels that are labeled by another LF. An overlapping example is one that at least one other LF returns a (non-abstain) label for. Note that the maximum possible overlap fraction for an LF is the LF’s coverage, unless normalize_by_coverage=True, in which case it is 1.
- Parameters
normalize_by_coverage (bool, optional) – Normalize by coverage of the LF, so that it returns the percent of LF labels that have overlaps. Defaults to False.
- Returns
Fraction of overlapping examples for each LF
- Return type
np.ndarray
Example
>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_overlaps()
array([0.4, 0.6, 0.4])
>>> LFAnalysis(L).lf_overlaps(normalize_by_coverage=True)
array([1.  , 0.75, 1.  ])
- lf_conflicts(normalize_by_overlaps: bool = False) → numpy.ndarray[source]¶
Compute frac. of examples each LF labels and labeled differently by another LF. A conflicting example is one that at least one other LF returns a different (non-abstain) label for. Note that the maximum possible conflict fraction for an LF is the LF’s overlaps fraction, unless normalize_by_overlaps=True, in which case it is 1.
- Parameters
normalize_by_overlaps (bool, optional) – Normalize by overlaps of the LF, so that it returns the percent of LF overlaps that have conflicts. Defaults to False.
- Returns
Fraction of conflicting examples for each LF
- Return type
np.ndarray
Example
>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_conflicts()
array([0.2, 0.2, 0. ])
>>> LFAnalysis(L).lf_conflicts(normalize_by_overlaps=True)
array([0.5       , 0.33333333, 0.        ])
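To make the two normalization options concrete, here is a pure-Python sketch of the per-LF statistics. It is an illustration rather than the library's code, with a list of lists standing in for the NumPy matrix and `-1` assumed to mean abstain; the helper `_other_votes` is hypothetical.

```python
from typing import List

ABSTAIN = -1  # abstain value used in the examples above


def _other_votes(row: List[int], j: int) -> List[int]:
    """Non-abstain labels that the other LFs gave to this row."""
    return [v for k, v in enumerate(row) if k != j and v != ABSTAIN]


def lf_coverages(L: List[List[int]]) -> List[float]:
    return [sum(row[j] != ABSTAIN for row in L) / len(L) for j in range(len(L[0]))]


def lf_overlaps(L: List[List[int]], normalize_by_coverage: bool = False) -> List[float]:
    out = []
    for j in range(len(L[0])):
        labeled = [row for row in L if row[j] != ABSTAIN]
        overlapping = [row for row in labeled if _other_votes(row, j)]
        # Denominator: all rows, or only the rows this LF labeled.
        denom = len(labeled) if normalize_by_coverage else len(L)
        out.append(len(overlapping) / denom if denom else 0.0)
    return out


def lf_conflicts(L: List[List[int]], normalize_by_overlaps: bool = False) -> List[float]:
    out = []
    for j in range(len(L[0])):
        overlapping = [row for row in L
                       if row[j] != ABSTAIN and _other_votes(row, j)]
        conflicting = [row for row in overlapping
                       if any(v != row[j] for v in _other_votes(row, j))]
        # Denominator: all rows, or only the rows where this LF overlaps.
        denom = len(overlapping) if normalize_by_overlaps else len(L)
        out.append(len(conflicting) / denom if denom else 0.0)
    return out


L = [
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
]
```

For the second LF, three of its four labeled rows are also labeled by another LF, so the overlap is 3/5 = 0.6 unnormalized and 3/4 = 0.75 when normalized by coverage.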
- lf_empirical_accuracies(Y: numpy.ndarray) → numpy.ndarray[source]¶
Compute empirical accuracy against a set of labels Y for each LF. Usually, Y represents development set labels.
- Parameters
Y (np.ndarray) – [n] np.ndarray of gold labels
- Returns
Empirical accuracies for each LF
- Return type
np.ndarray
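One common way to define a per-LF empirical accuracy is the fraction of an LF's non-abstain votes that match the gold label. The sketch below follows that definition as an assumption; the source does not state how abstains are treated, and the gold labels `Y` are made up for illustration.

```python
from typing import List

ABSTAIN = -1  # assumption: abstain is encoded as -1


def lf_empirical_accuracies(L: List[List[int]], Y: List[int]) -> List[float]:
    """Accuracy of each LF over the rows where it does not abstain."""
    accuracies = []
    for j in range(len(L[0])):
        votes = [(row[j], y) for row, y in zip(L, Y) if row[j] != ABSTAIN]
        correct = sum(v == y for v, y in votes)
        accuracies.append(correct / len(votes) if votes else 0.0)
    return accuracies


L = [
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
]
Y = [0, 1, 1, 0, 0]  # hypothetical gold labels for the five rows
print(lf_empirical_accuracies(L, Y))  # [1.0, 0.75, 1.0]
```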
- lf_summary(Y: Optional[numpy.ndarray] = None, plot: Optional[bool] = False) → pandas.DataFrame[source]¶
Create a pandas DataFrame with the various per-LF statistics.
- Parameters
Y (Optional[np.ndarray], optional) – [n] np.ndarray of gold labels. If provided, the empirical accuracy for each LF will be calculated. Defaults to None.
plot (Optional[bool], optional) – If set to true a bar graph is plotted. Defaults to False.
- Returns
Summary statistics for each LF
- Return type
DataFrame
Pre Labels
- class spear.labeling.prelabels.core.PreLabels(name: str, data: Sequence[Any], rules: spear.labeling.lf_set.core.LFSet, num_classes: int, labels_enum, data_feats: Optional[Sequence[Any]] = numpy.array, gold_labels: Optional[Sequence[Any]] = numpy.array, exemplars: Sequence[Any] = numpy.array)[source]¶
Generates noisy labels and continuous scores from LFs applied to data
- Parameters
name (str) – Name for this object.
data (DataPoints) – Datapoints.
gold_labels (Optional[DataPoints]) – Labels for datapoints if available.
rules (LFSet) – Set of Rules to generate noisy labels for the dataset.
exemplars (DataPoints) – [description]
- get_labels()[source]¶
Applies LFs to the dataset to generate noisy labels and returns noisy labels and confidence scores
- Returns
Noisy Labels and Confidences
- Return type
Tuple(DataPoints, DataPoints)
- analyse_lfs(plot=False)[source]¶
Analyse the lfs in LFSet on data
- Parameters
plot (bool, optional) – Plot the values. Defaults to False.
- Returns
dataframe consisting of Polarity, Coverage, Overlap, Conflicts, Empirical Acc
- Return type
DataFrame
CAGE
Chatterjee et al. [CRS20]
- class spear.cage.core.Cage(path_json, n_lfs)[source]¶
- Cage class:
Class for Data Programming using CAGE [Note: from here on, graphical model(gm) and CAGE algorithm terms are used interchangeably]
- Parameters
path_json – Path to json file consisting of number to string(class name) map
n_lfs – number of labelling functions used to generate pickle files
- save_params(save_path)[source]¶
member function to save parameters of Cage
- Parameters
save_path – path to pickle file to save parameters
- load_params(load_path)[source]¶
member function to load parameters to Cage
- Parameters
load_path – path to pickle file to load parameters
- fit_and_predict_proba(path_pkl, path_test=None, path_log=None, qt=0.9, qc=0.85, metric_avg=['binary'], n_epochs=100, lr=0.01)[source]¶
- Parameters
path_pkl – Path to pickle file of input data in standard format
path_test – Path to the pickle file containing test data in standard format
path_log – Path to log file. No log is produced if path_test is None. Default is None, in which case accuracies/f1_scores are printed to the terminal
qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
metric_avg – List of average metric to be used in calculating f1_score, default is [‘binary’]. Use None for not calculating f1_score
n_epochs – Number of epochs, default is 100
lr – Learning rate for torch.optim, default is 0.01
- Returns
numpy.ndarray of shape (num_instances, num_classes) where i,j-th element is the probability of ith instance being the jth class(the jth value when sorted in ascending order of values in Enum)
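Since `qt` and `qc` accept either a single float or one value per LF, a caller may want to normalize both forms before use. The helper below is a hypothetical sketch of that validation using plain lists, not part of SPEAR.

```python
from typing import List, Sequence, Union


def per_lf_quality(q: Union[float, Sequence[float]], n_lfs: int) -> List[float]:
    """Expand a scalar to one value per LF and check the documented constraints."""
    if isinstance(q, (int, float)):
        values = [float(q)] * n_lfs
    else:
        values = [float(v) for v in q]
    if len(values) != n_lfs:
        raise ValueError(f"expected {n_lfs} values, got {len(values)}")
    if not all(0.0 <= v <= 1.0 for v in values):
        raise ValueError("values must be between 0 and 1")
    return values


print(per_lf_quality(0.9, 3))          # [0.9, 0.9, 0.9]
print(per_lf_quality([0.8, 0.85], 2))  # [0.8, 0.85]
```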
- fit_and_predict(path_pkl, path_test=None, path_log=None, qt=0.9, qc=0.85, metric_avg=['binary'], n_epochs=100, lr=0.01, need_strings=False)[source]¶
- Parameters
path_pkl – Path to pickle file of input data in standard format
path_test – Path to the pickle file containing test data in standard format
path_log – Path to log file. No log is produced if path_test is None. Default is None, in which case accuracies/f1_scores are printed to the terminal
qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
metric_avg – List of average metric to be used in calculating f1_score, default is [‘binary’]
n_epochs – Number of epochs, default is 100
lr – Learning rate for torch.optim, default is 0.01
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False
- Returns
numpy.ndarray of shape (num_instances,) which are aggregated/predicted labels. Elements are numbers or strings depending on whether need_strings is False or True, respectively.
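The relationship between the probability output and the label output, including the effect of `need_strings`, can be sketched as follows. The function `labels_from_proba` and the integer-keyed `class_map` are assumptions for illustration; the only behaviour taken from the text is that column j corresponds to the jth class value in ascending order.

```python
from typing import Dict, List, Union


def labels_from_proba(proba: List[List[float]], class_map: Dict[int, str],
                      need_strings: bool = False) -> List[Union[int, str]]:
    """Pick the most probable class per row, as a class value or a class name."""
    values = sorted(class_map)  # column j corresponds to the j-th smallest value
    labels: List[Union[int, str]] = []
    for row in proba:
        j = max(range(len(row)), key=row.__getitem__)  # argmax over the row
        labels.append(class_map[values[j]] if need_strings else values[j])
    return labels


class_map = {1: "SPAM", 0: "HAM"}  # hypothetical number-to-name map
proba = [[0.9, 0.1], [0.3, 0.7]]
print(labels_from_proba(proba, class_map))                     # [0, 1]
print(labels_from_proba(proba, class_map, need_strings=True))  # ['HAM', 'SPAM']
```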
- predict_proba(path_test, qc=0.85)[source]¶
Used to predict labels based on a pickle file with path path_test
- Parameters
path_test – Path to the pickle file containing test data set in standard format
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
- Returns
numpy.ndarray of shape (num_instances, num_classes) where i,j-th element is the probability of ith instance being the jth class (the jth value when sorted in ascending order of values in Enum) [Note: no aggregation/algorithm-running will be done using the current input]
- predict(path_test, qc=0.85, need_strings=False)[source]¶
Used to predict labels based on a pickle file with path path_test
- Parameters
path_test – Path to the pickle file containing test data set in standard format
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False
- Returns
numpy.ndarray of shape (num_instances,) which are predicted labels. Elements are numbers or strings depending on whether need_strings is False or True, respectively. [Note: no aggregation/algorithm-running will be done using the current input]
Joint Learning(JL)
Maheshwari et al. [MCK+20]
From here on, feature model (fm) refers to the feature-based classification model
- class spear.jl.core.JL(path_json, n_lfs, n_features, feature_model='nn', n_hidden=512)[source]¶
- Joint_Learning class:
[Note: from here on, feature model(fm) and feature-based classification model are used interchangeably. graphical model(gm) and CAGE algorithm terms are used interchangeably]
Loss functions (useful for loss_func_mask in the fit_and_predict_proba and fit_and_predict functions):
Loss function number | Calculated over | Loss function
1 | L | Cross Entropy(prob_from_feature_model, true_labels)
2 | U | Entropy(prob_from_feature_model)
3 | U | Cross Entropy(prob_from_feature_model, prob_from_graphical_model)
4 | L | Negative Log Likelihood
5 | U | Negative Log Likelihood(marginalised over true labels)
6 | L and U | KL Divergence(prob_feature_model, prob_graphical_model)
7 | _ | Quality guide
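Because `loss_func_mask` is a list of seven 0/1 flags indexed from zero while the loss functions are numbered from one, an off-by-one mistake is easy to make. The helper below is a hypothetical convenience, not part of SPEAR, that builds the mask from the loss function numbers listed above.

```python
from typing import Iterable, List


def make_loss_func_mask(enabled: Iterable[int]) -> List[int]:
    """Return a 7-element mask with 1 at position i when loss (i + 1) is enabled."""
    chosen = set(enabled)
    if not chosen <= set(range(1, 8)):
        raise ValueError("loss function numbers must be in 1..7")
    return [1 if i + 1 in chosen else 0 for i in range(7)]


# Enable losses 1, 3 and 6 only.
print(make_loss_func_mask([1, 3, 6]))  # [1, 0, 1, 0, 0, 1, 0]
```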
- Parameters
path_json – Path to json file containing the dictionary of number to string(class name) map
n_lfs – number of labelling functions used to generate pickle files
n_features – number of features for each instance in the first array of pickle file aka feature matrix
feature_model – The model intended to be used for features, allowed values are ‘lr’(Logistic Regression) or ‘nn’(Neural network with 2 hidden layers) string, default is ‘nn’
n_hidden – Number of hidden layer nodes if feature model is ‘nn’, type is integer, default is 512
- save_params(save_path)[source]¶
member function to save parameters of JL
- Parameters
save_path – path to pickle file to save parameters
- load_params(load_path)[source]¶
member function to load parameters to JL
- Parameters
load_path – path to pickle file to load parameters
- fit_and_predict_proba(path_L, path_U, path_V, path_T, loss_func_mask, batch_size, lr_fm, lr_gm, use_accuracy_score, path_log=None, return_gm=False, n_epochs=100, start_len=7, stop_len=10, is_qt=True, is_qc=True, qt=0.9, qc=0.85, metric_avg='binary')[source]¶
- Parameters
path_L – Path to pickle file of labelled instances
path_U – Path to pickle file of unlabelled instances
path_V – Path to pickle file of validation instances
path_T – Path to pickle file of test instances
loss_func_mask – list of size 7 where loss_func_mask[i] should be 1 if Loss function (i+1) should be included, 0 else. Checkout Eq(3) in [MCK+20]
batch_size – Batch size, type should be integer
lr_fm – Learning rate for feature model, type is integer or float
lr_gm – Learning rate for graphical model(cage algorithm), type is integer or float
use_accuracy_score – The score to use for termination condition on validation set. True for accuracy_score, False for f1_score
path_log – Path to log file to append log. Default is None, in which case accuracies/f1_scores are printed to the terminal
return_gm – Return the predictions of graphical model? the allowed values are True, False. Default value is False
n_epochs – Number of epochs in each run, type is integer, default is 100
start_len – A parameter used in validation, refers to the least epoch after which validation checks need to be performed, type is integer, default is 7
stop_len – A parameter used in validation, refers to the least number of continuous epochs of non-increasing validation accuracy after which the training should be stopped, type is integer, default is 10
is_qt – True if quality guide is available(and will be provided in ‘qt’ argument). False if quality guide is intended to be found from validation instances. Default is True
is_qc – True if quality index is available(and will be provided in ‘qc’ argument). False if quality index is intended to be found from validation instances. Default is True
qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
metric_avg – Average metric to be used in calculating f1_score/precision/recall, default is ‘binary’
- Returns
If return_gm is True, the return value is two numpy arrays of predicted probabilities, each of shape (num_instances, num_classes): the first from the feature model, the second from the graphical model. Otherwise, the return value is a single numpy array of shape (num_instances, num_classes) from the feature model. For a given model, the i,j-th element is the probability of the ith instance being the jth class (the jth value when sorted in ascending order of values in Enum) under that model. It is suggested to use the probabilities of the feature model
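The roles of `start_len` and `stop_len` in the termination condition can be illustrated with a small simulation. This is one plausible reading of the parameter descriptions, not the library's training loop: validation checks are skipped for the first `start_len` epochs, and training stops once `stop_len` consecutive checked epochs fail to improve on the best validation score. The function name and the score lists are hypothetical.

```python
from typing import Sequence


def stopping_epoch(val_scores: Sequence[float], start_len: int = 7,
                   stop_len: int = 10) -> int:
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    stale = 0  # consecutive checked epochs without improvement
    for epoch, score in enumerate(val_scores, start=1):
        if epoch <= start_len:
            continue  # no validation checks during the warm-up epochs
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= stop_len:
                return epoch
    return len(val_scores)  # ran through every epoch without stopping early


scores = [0.5, 0.6, 0.7, 0.7, 0.7, 0.65, 0.9]
print(stopping_epoch(scores, start_len=2, stop_len=3))  # 6
```

In the example the score never improves after epoch 3, so epochs 4, 5 and 6 count as three stale epochs and training stops before the final epoch is reached.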
- fit_and_predict(path_L, path_U, path_V, path_T, loss_func_mask, batch_size, lr_fm, lr_gm, use_accuracy_score, path_log=None, return_gm=False, n_epochs=100, start_len=7, stop_len=10, is_qt=True, is_qc=True, qt=0.9, qc=0.85, metric_avg='binary', need_strings=False)[source]¶
- Parameters
path_L – Path to pickle file of labelled instances
path_U – Path to pickle file of unlabelled instances
path_V – Path to pickle file of validation instances
path_T – Path to pickle file of test instances
loss_func_mask – list of size 7 where loss_func_mask[i] should be 1 if Loss function (i+1) should be included, 0 else. Checkout Eq(3) in [MCK+20]
batch_size – Batch size, type should be integer
lr_fm – Learning rate for feature model, type is integer or float
lr_gm – Learning rate for graphical model(cage algorithm), type is integer or float
use_accuracy_score – The score to use for termination condition on validation set. True for accuracy_score, False for f1_score
path_log – Path to log file to append log. Default is None, in which case accuracies/f1_scores are printed to the terminal
return_gm – Return the predictions of graphical model? the allowed values are True, False. Default value is False
n_epochs – Number of epochs in each run, type is integer, default is 100
start_len – A parameter used in validation, refers to the least epoch after which validation checks need to be performed, type is integer, default is 7
stop_len – A parameter used in validation, refers to the least number of continuous epochs of non-increasing validation accuracy after which the training should be stopped, type is integer, default is 10
is_qt – True if quality guide is available(and will be provided in ‘qt’ argument). False if quality guide is intended to be found from validation instances. Default is True
is_qc – True if quality index is available(and will be provided in ‘qc’ argument). False if quality index is intended to be found from validation instances. Default is True
qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
metric_avg – Average metric to be used in calculating f1_score/precision/recall, default is ‘binary’
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False
- Returns
If return_gm is True, the return value is two numpy arrays of predicted labels, each of shape (num_instances,): the first from the feature model, the second from the graphical model. Otherwise, the return value is a single numpy array of predicted labels of shape (num_instances,) from the feature model. It is suggested to use the probabilities of the feature model
- predict_gm_proba(path_test, qc=0.85)[source]¶
Used to find the predicted labels based on the trained parameters of graphical model(CAGE)
- Parameters
path_test – Path to the pickle file containing test data set
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
- Returns
numpy.ndarray of shape (num_instances, num_classes) where i,j-th element is the probability of ith instance being the jth class (the jth value when sorted in ascending order of values in Enum) [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model
- predict_fm_proba(x_test)[source]¶
Used to find the predicted labels based on the trained parameters of feature model
- Parameters
x_test – numpy array of shape (num_instances, num_features) containing data whose labels are to be predicted
- Returns
numpy.ndarray of shape (num_instances, num_classes) where i,j-th element is the probability of ith instance being the jth class (the jth value when sorted in ascending order of values in Enum) [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model
- predict_gm(path_test, qc=0.85, need_strings=False)[source]¶
Used to find the predicted labels based on the trained parameters of graphical model(CAGE)
- Parameters
path_test – Path to the pickle file containing test data set
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False
- Returns
numpy.ndarray of shape (num_instances,) which are predicted labels. Elements are numbers or strings depending on whether need_strings is False or True, respectively. [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model
- predict_fm(x_test, need_strings=False)[source]¶
Used to find the predicted labels based on the trained parameters of feature model
- Parameters
x_test – numpy array of shape (num_instances, num_features) containing data whose labels are to be predicted
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False
- Returns
numpy.ndarray of shape (num_instances,) which are predicted labels. Elements are numbers or strings depending on whether need_strings is False or True, respectively. [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model
Subset Selection
Uses facilityLocation from submodlib library which is also provided by DECILE for submodular optimization
- spear.jl.subset_selection.rand_subset(n_all, n_instances)[source]¶
A function to choose random indices of the input instances to be labeled
- Parameters
n_all – number of available instances, type is integer
n_instances – number of instances to be labelled, type is integer
- Returns
A numpy.ndarray of the indices(of shape (n_sup,) and each element in the range [0,n_all-1)) to be labeled
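Random subset selection amounts to sampling indices without replacement. The sketch below uses only the standard library and returns a sorted list instead of a NumPy array; the `seed` parameter and the sorting are additions for reproducibility in this example, not documented features of `rand_subset`.

```python
import random
from typing import List, Optional


def rand_subset_sketch(n_all: int, n_instances: int,
                       seed: Optional[int] = None) -> List[int]:
    """Choose n_instances distinct indices from range(n_all)."""
    if not 0 <= n_instances <= n_all:
        raise ValueError("n_instances must be between 0 and n_all")
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return sorted(rng.sample(range(n_all), n_instances))


indices = rand_subset_sketch(n_all=100, n_instances=5, seed=0)
print(len(indices), len(set(indices)))  # 5 5
```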
- spear.jl.subset_selection.unsup_subset(x_train, n_unsup)[source]¶
A function for unsupervised subset selection(the subset to be labeled)
- Parameters
x_train – A numpy.ndarray of shape (n_instances, n_features). All the data, intended to be used for training
n_unsup – number of instances to be found during unsupervised subset selection, type is integer
- Returns
numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,x_train.shape[0])), the result of subset selection
- spear.jl.subset_selection.sup_subset(path_json, path_pkl, n_sup, qc=0.85)[source]¶
A helper function for supervised subset selection (the subset to be labeled) which returns the indices along with the data loaded from the pickle file
- Parameters
path_json – Path to json file of number to string(class name) map
path_pkl – Path to the pickle file containing all the training data in standard format
n_sup – Number of instances to be found during supervised subset selection
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
- Returns
numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,num_instances)), the result of subset selection AND the data which is list of contents of path_pkl
- spear.jl.subset_selection.sup_subset_indices(path_json, path_pkl, n_sup, qc=0.85)[source]¶
A function for supervised subset selection (the subset to be labeled) which just returns indices
- Parameters
path_json – Path to json file of number to string(class name) map
path_pkl – Path to the pickle file containing all the training data in standard format
n_sup – Number of instances to be found during supervised subset selection
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
- Returns
numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,num_instances)), the result of subset selection
- spear.jl.subset_selection.sup_subset_save_files(path_json, path_pkl, path_save_L, path_save_U, n_sup, qc=0.85)[source]¶
A function for supervised subset selection(the subset to be labeled) which makes separate pickle files of data, one for those to be labelled, other that can be left unlabelled
- Parameters
path_json – Path to json file of number to string(class name) map
path_pkl – Path to the pickle file containing all the training data in standard format
path_save_L – Path to save the pickle file of set of instances to be labelled. Note that instances are not labelled yet. Extension should be .pkl
path_save_U – Path to save the pickle file of set of instances that can be left unlabelled. Extension should be .pkl
n_sup – number of instances to be found during supervised subset selection
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
- Returns
numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,num_instances)), the result of subset selection. Also two pickle files are saved at path_save_L and path_save_U
- spear.jl.subset_selection.replace_in_pkl(path, path_save, np_array, index)[source]¶
A function to replace one of the numpy arrays stored in the pickle file with np_array and save the result to a new pickle file
- Parameters
path – Path to the pickle file containing all the data in standard format
path_save – Path to save the pickle file after replacing the ‘L’(true labels numpy array) of data in path pickle file
np_array – The data which is to be used to replace the data in path pickle file with
index – Index of the numpy array, in data of path pickle file, to be replaced with np_array. Value should be in [0,8]
- Returns
No return value. A pickle file is generated at path_save
- spear.jl.subset_selection.insert_true_labels(path, path_save, labels)[source]¶
A function to insert the true labels, after labeling the instances, to the pickle file
- Parameters
path – Path to the pickle file containing all the data in standard format
path_save – Path to save the pickle file after replacing the ‘L’(true labels numpy array) of data in path pickle file
labels – The true labels of the data in pickle file. numpy.ndarray of shape (num_instances, 1)
- Returns
No return value. A pickle file is generated at path_save
CAGE, JL - UTILS¶
Note: The arguments whose shapes are mentioned in ‘[….]’ are torch tensors.
Data loaders¶
The common utils to CAGE and JL algorithms are in this file. Don’t change the name or location of this file.
- spear.utils.data_editor.is_dict_trivial(dict)[source]¶
A helper function that checks whether every key in the dictionary is equal to its value, ignoring keys that are None
- Parameters
dict – the dictionary
- Returns
True if all keys(which are not None) are equal to respective values. False otherwise
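The check described above can be sketched in plain Python. This is an illustrative re-implementation, not the library's source: the name `is_dict_trivial_sketch` is hypothetical, and it assumes that "null" means a key equal to None.

```python
def is_dict_trivial_sketch(mapping):
    """Return True when every non-None key maps to itself (assumed behaviour)."""
    return all(key == value for key, value in mapping.items() if key is not None)


identity_map = {0: 0, 1: 1, None: 5}  # the None key is ignored by the check
shifted_map = {0: 1, 1: 0}            # keys and values differ
```

A trivial dictionary in this sense means that the Enum values already coincide with the indices used inside the algorithms, so no remapping is needed.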
- spear.utils.data_editor.get_data(path, check_shapes=True, class_map=None)[source]¶
- Standard format in pickle file contains the NUMPY ndarrays x, l, m, L, d, r, s, n, k and an int n_classes
x: (num_instances, num_features), x[i][j] is jth feature of ith instance. Note that the dimension of this array can vary depending on the dimension of input
l: (num_instances, num_lfs), l[i][j] is the prediction of jth LF(co-domain: the values used in Enum) on ith instance. l[i][j] = None implies Abstain
m: (num_instances, num_lfs), m[i][j] is 1 if jth LF didn’t Abstain on ith instance. Else it’s 0
L: (num_instances, 1), L[i] is true label(co-domain: the values used in Enum) of ith instance, if available. Else L[i] is None
d: (num_instances, 1), d[i] is 1 if ith instance is labelled. Else it is 0
r: (num_instances, num_lfs), r[i][j] is 1 if ith instance is an exemplar for jth rule. Else it’s 0
s: (num_instances, num_lfs), s[i][j] is the continuous score of ith instance given by jth continuous LF. If jth LF is not continuous, then s[i][j] is None
n: (num_lfs,), n[i] is 1 if ith LF has a continuous counterpart, else n[i] is 0
k: (num_lfs,), k[i] is the class of ith LF, co-domain: the values used in Enum
n_classes: total number of classes
In case the numpy array is not available(can be possible for x, L, d, r, s), it is stored as numpy.zeros(0)
- Parameters
path – path to pickle file with data in the format above
check_shapes – if true, checks whether the shapes of numpy arrays in pickle file are consistent as per the format mentioned above. Else it doesn’t check. Default is True.
class_map – dictionary of class numbers(sorted, mapped to [0,n_classes-1]) as per the Enum defined in labeling part. l,L are modified(needed inside algorithms) before returning, using class_map. Default is None which doesn’t do any mapping
- Returns
A list containing all the numpy arrays mentioned above. The arrays l, L are modified using the class_map
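To make the standard format concrete, the sketch below builds the nine arrays for a toy problem and checks their shapes against the list above. It does not call SPEAR and does not touch the pickle layout; the toy values and the helper `shapes_are_consistent` are hypothetical, and the helper only imitates what `check_shapes=True` is described as doing.

```python
import numpy as np

num_instances, num_features, num_lfs, n_classes = 3, 2, 2, 2

x = np.zeros((num_instances, num_features))
# None marks an abstaining LF, as in the description of l.
l = np.array([[0, None], [None, 1], [0, 1]], dtype=object)
# m is 1 wherever the LF did not abstain.
m = np.array([[0 if v is None else 1 for v in row] for row in l])
L = np.array([[0], [None], [1]], dtype=object)   # None where no true label exists
d = np.array([[1], [0], [1]])                    # 1 where the instance is labelled
r = np.zeros((num_instances, num_lfs))
s = np.full((num_instances, num_lfs), None, dtype=object)  # no continuous LFs
n = np.zeros(num_lfs)
k = np.array([0, 1])


def shapes_are_consistent(x, l, m, L, d, r, s, n, k):
    """Hypothetical helper: verify the listed shapes, skipping empty arrays."""
    rows, lfs = l.shape
    expected = [
        (m, (rows, lfs)), (L, (rows, 1)), (d, (rows, 1)),
        (r, (rows, lfs)), (s, (rows, lfs)), (n, (lfs,)), (k, (lfs,)),
    ]
    for arr, shape in expected:
        # Unavailable arrays are stored as numpy.zeros(0) and are not checked.
        if arr.size != 0 and arr.shape != shape:
            return False
    return x.size == 0 or x.shape[0] == rows
```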
- spear.utils.data_editor.get_classes(path)[source]¶
The json file should contain a dictionary of number to string(class name) map as defined in Enum
- Parameters
path – path to json file with contents mentioned above
- Returns
A dictionary (number to string(class name) map)
- spear.utils.data_editor.get_predictions(proba, class_map, class_dict, need_strings)[source]¶
This function takes the probability of each instance belonging to each class and returns the class of each instance, chosen as the one with maximum probability
- Parameters
proba – probability numpy.ndarray of shape (num_instances, num_classes)
class_map – dictionary mapping the class numbers(as per Enum class defined) to numbers in range [0, num_classes-1]
class_dict – dictionary consisting of number to string(class name) mapping as per the Enum class defined
need_strings – If True, the output contains strings(of class names), else it consists of numbers(class numbers as used in Enum definition)
- Returns
numpy.ndarray of shape (num_instances,), where each element is the class of the corresponding instance, given as a class name if need_strings is True and as a class number otherwise
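The relationship between proba, class_map and class_dict can be sketched with NumPy as follows. This is an assumption-laden illustration, not the library's code: the function name is hypothetical, and it assumes class_map is invertible so that an argmax column index can be mapped back to the Enum number.

```python
import numpy as np


def get_predictions_sketch(proba, class_map, class_dict, need_strings):
    """Pick the most probable class per instance and map it back to Enum values."""
    # class_map goes Enum number -> column index, so invert it for the lookup.
    index_to_number = {index: number for number, index in class_map.items()}
    best_columns = np.argmax(proba, axis=1)
    numbers = np.array([index_to_number[int(i)] for i in best_columns])
    if need_strings:
        return np.array([class_dict[int(num)] for num in numbers])
    return numbers


proba = np.array([[0.2, 0.8], [0.9, 0.1]])
class_map = {3: 0, 7: 1}            # toy Enum numbers 3 and 7
class_dict = {3: "HAM", 7: "SPAM"}  # toy class names
```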
- spear.utils.data_editor.get_enum(np_array, enm)[source]¶
This function is used to convert a numpy array of numbers to a numpy array of enums based on the Enum class provided ‘enm’
- Parameters
np_array – a numpy.ndarray of any shape consisting of numbers
enm – A class derived from ‘Enum’ class, which must map every number in np_array to an enum
- Returns
numpy.ndarray of the same shape as np_array but now contains enums(as per the mapping in ‘enm’) instead of numbers
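A minimal sketch of such a conversion, using only the standard `enum` module and NumPy, is shown here. The `ClassLabels` Enum and the function name `to_enum_sketch` are made up for illustration; the real function may be implemented differently.

```python
from enum import Enum

import numpy as np


class ClassLabels(Enum):  # hypothetical Enum for the example
    HAM = 0
    SPAM = 1


def to_enum_sketch(np_array, enm):
    """Return an object array of the same shape holding enum members."""
    flat = np_array.ravel()
    out = np.empty(flat.size, dtype=object)
    for position, value in enumerate(flat):
        out[position] = enm(int(value))  # Enum lookup by value
    return out.reshape(np_array.shape)


numbers = np.array([[0, 1], [1, 0]])
enums = to_enum_sketch(numbers, ClassLabels)
```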
CAGE and JL utils¶
From here on, Graphical model(gm) refers to the CAGE algorithm and Feature model(fm) refers to the feature-based classification model
The common utils to CAGE and JL algorithms are in this file. Don’t change the name or location of this file.
- spear.utils.utils_cage.phi(theta, l, device)[source]¶
Graphical model utils: A helper function
- Parameters
theta – [n_classes, n_lfs], the parameters
l – [n_lfs]
device – ‘cuda’ if drivers are available, else ‘cpu’
- Returns
a tensor of shape [n_classes, n_lfs], element wise product of input tensors(each row of theta dot product with l)
- spear.utils.utils_cage.calculate_normalizer(theta, k, n_classes, device)[source]¶
Graphical model utils: Used to find Z(the normaliser) in CAGE. Eq(4) in [CRS20]
- Parameters
theta – [n_classes, n_lfs], the parameters
k – [n_lfs], labels corresponding to LFs
n_classes – num of classes/labels
device – ‘cuda’ if drivers are available, else ‘cpu’
- Returns
a real value, representing the normaliser
- spear.utils.utils_cage.probability_l_y(theta, m, k, n_classes, device)[source]¶
Graphical model utils: Used to find probability involving the term psi_theta(in Eq(1) in [CRS20]), the potential function for all LFs
- Parameters
theta – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
device – ‘cuda’ if drivers are available, else ‘cpu’
- Returns
a tensor of shape [n_instances, n_classes], the psi_theta value for each instance, for each class(true label y)
- spear.utils.utils_cage.probability_s_given_y_l(pi, s, y, m, k, continuous_mask, qc)[source]¶
Graphical model utils: Used to find probability involving the term psi_pi(in Eq(1) in [CRS20]), the potential function for all continuous LFs
- Parameters
pi – [n_lfs], the parameters for the class y
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
y – a value in [0, n_classes-1], representing true label, for which psi_pi is calculated
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has a continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1
- Returns
a tensor of shape [n_instances], the psi_pi value for each instance, for the given label(true label y)
- spear.utils.utils_cage.probability(theta, pi, m, s, k, n_classes, continuous_mask, qc, device)[source]¶
Graphical model utils: Used to find probability of given instances for all possible true labels(y’s). Eq(1) in [CRS20]
- Parameters
theta – [n_classes, n_lfs], the parameters
pi – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has a continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1
device – ‘cuda’ if drivers are available, else ‘cpu’
- Returns
a tensor of shape [n_instances, n_classes], the probability for an instance being a particular class
- spear.utils.utils_cage.log_likelihood_loss(theta, pi, m, s, k, n_classes, continuous_mask, qc, device)[source]¶
Graphical model utils: Negative of log likelihood loss. Negative of Eq(6) in [CRS20]
- Parameters
theta – [n_classes, n_lfs], the parameters
pi – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has a continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1
device – ‘cuda’ if drivers are available, else ‘cpu’
- Returns
a real value, negative of summation over (the log of probability for an instance, marginalised over y(true labels))
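The return value described above can be illustrated independently of the CAGE parameters. The sketch below starts from a ready-made matrix of joint probabilities (instance together with label y) and is not the library's implementation; `joint_proba` and the function name are hypothetical, and NumPy stands in for torch.

```python
import numpy as np


def negative_log_likelihood_sketch(joint_proba):
    """joint_proba[i][y] is the probability of instance i together with label y."""
    marginal = joint_proba.sum(axis=1)   # marginalise over the true label y
    return -np.log(marginal).sum()       # negative sum of log-probabilities


joint_proba = np.array([[0.1, 0.3], [0.2, 0.2]])
loss = negative_log_likelihood_sketch(joint_proba)
```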
- spear.utils.utils_cage.precision_loss(theta, k, n_classes, a, device)[source]¶
Graphical model utils: Negative of the regularizer term in Eq(9) in [CRS20]
- Parameters
theta – [n_classes, n_lfs], the parameters
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
a – [n_lfs], a[i] is the quality guide for ith LF. Value(s) must be between 0 and 1
device – ‘cuda’ if drivers are available, else ‘cpu’
- Returns
a real value, negative of regularizer term
- spear.utils.utils_cage.predict_gm_labels(theta, pi, m, s, k, n_classes, continuous_mask, qc, device)[source]¶
Graphical model utils: Used to predict the labels after the training is done
- Parameters
theta – [n_classes, n_lfs], the parameters
pi – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has a continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1
device – ‘cuda’ if drivers are available, else ‘cpu’
- Returns
numpy.ndarray of shape (n_instances,), the predicted class for an instance
JL utils¶
- spear.utils.utils_jl.log_likelihood_loss_supervised(theta, pi, y, m, s, k, n_classes, continuous_mask, qc, device)[source]¶
Joint Learning utils: Negative log likelihood loss, used in loss 4 in [MCK+20]
- Parameters
theta – [n_classes, n_lfs], the parameters
pi – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has a continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF
device – ‘cuda’ if drivers are available, else ‘cpu’
- Returns
a real value, summation over (the log of probability for an instance)
- spear.utils.utils_jl.entropy(probabilities)[source]¶
Joint Learning utils: Entropy, used in loss 2 in [MCK+20]
- Parameters
probabilities – [num_unsup_instances, num_classes], probabilities[i][j] is probability of ith instance being jth class
- Returns
a real value, the entropy value of given probability
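As a reference point, the Shannon entropy of row-wise class distributions can be computed as in the sketch below. Two things are assumptions here, since the entry above does not state them: that the per-instance entropies are averaged into a single value, and that natural logarithms are used. NumPy stands in for torch.

```python
import numpy as np


def entropy_sketch(probabilities, eps=1e-12):
    """Mean Shannon entropy (natural log) over the rows of `probabilities`."""
    p = np.clip(probabilities, eps, 1.0)            # avoid log(0)
    per_instance = -(p * np.log(p)).sum(axis=1)     # entropy of each row
    return per_instance.mean()                      # assumed reduction: mean


uniform = np.array([[0.5, 0.5], [0.5, 0.5]])
one_hot = np.array([[1.0, 0.0]])
```

A uniform distribution over two classes gives the maximum value log 2, while a one-hot distribution gives a value of approximately zero.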
- spear.utils.utils_jl.kl_divergence(probs_p, probs_q)[source]¶
Joint Learning utils: KL divergence of two probabilities, used in loss 6 in [MCK+20]
- Parameters
probs_p – [num_instances, num_classes]
probs_q – [num_instances, num_classes]
- Returns
a real value, the KL divergence of given probabilities
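For orientation, the KL divergence between two sets of row-wise distributions can be sketched as follows. The averaging over instances and the clipping constant `eps` are assumptions made for this example, and NumPy is used in place of torch.

```python
import numpy as np


def kl_divergence_sketch(probs_p, probs_q, eps=1e-12):
    """Mean over instances of sum_j p_ij * log(p_ij / q_ij)."""
    p = np.clip(probs_p, eps, 1.0)   # avoid log(0) and division by zero
    q = np.clip(probs_q, eps, 1.0)
    return (p * np.log(p / q)).sum(axis=1).mean()


probs_p = np.array([[0.5, 0.5]])
probs_q = np.array([[0.25, 0.75]])
```

The divergence is zero when both arguments agree and is not symmetric in its arguments.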
- spear.utils.utils_jl.find_indices(data, data_sub)[source]¶
A helper function for subset selection
- Parameters
data – the complete data, torch tensor of shape [num_instances, num_classes]
data_sub – the subset of ‘data’ whose indices are to be found. Should be of same shape as ‘data’
- Returns
list of indices, to be found from the result of apricot library
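One way such a lookup could work is to match each selected row against the rows of the complete data. The sketch below is a guess at the idea, not the library's code: it uses NumPy arrays instead of torch tensors, returns the first exact match for every row, and raises IndexError if a row is absent.

```python
import numpy as np


def find_indices_sketch(data, data_sub):
    """Return, for every row of data_sub, the index of its first match in data."""
    indices = []
    for row in data_sub:
        matches = np.where((data == row).all(axis=1))[0]
        indices.append(int(matches[0]))  # IndexError if the row is not in data
    return indices


data = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
data_sub = np.array([[0.5, 0.5], [1.0, 0.0]])
```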
Feature-based Models¶
- class spear.jl.models.models.LogisticRegression(*args: Any, **kwargs: Any)[source]¶
Class for Logistic Regression, used in Joint learning class/Algorithm
- Parameters
input_size – number of features
output_size – number of classes
Hls¶
Hls Checkmate¶
- class spear.Implyloss.checkmate.BestCheckpointSaver(save_dir, num_to_keep=1, maximize=True, saver=None)[source]¶
Maintains a directory containing only the best n checkpoints
Inside the directory is a best_checkpoints JSON file containing a dictionary mapping of the best checkpoint filepaths to the values by which the checkpoints are compared. Only the best n checkpoints are contained in the directory and JSON file.
This is a light-weight wrapper class only intended to work in simple, non-distributed settings. It is not intended to work with the tf.Estimator framework.
- handle(value, sess, global_step_tensor)[source]¶
Func Desc: Updates the set of best checkpoints based on the given result.
Input: value: The value by which to rank the checkpoint. sess: A tf.Session to use to save the checkpoint. global_step_tensor: A tf.Tensor representing the global step.
Output: True or False
- spear.Implyloss.checkmate.get_best_checkpoint(best_checkpoint_dir, select_maximum_value=True)[source]¶
Func Desc: Reads the best_checkpoints file in the best_checkpoint_dir directory. Returns the filepath in the best_checkpoints file associated with the highest value if select_maximum_value is True, or the filepath associated with the lowest value if select_maximum_value is False.
Input: best_checkpoint_dir: Directory containing the best_checkpoints JSON file. select_maximum_value: If True, select the filepath associated with the highest value. Otherwise, select the filepath associated with the lowest value.
Output: The full path to the best checkpoint file
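The selection logic described above can be sketched with the standard library alone. This is not the package's implementation: the function name is hypothetical, and the sketch assumes the best_checkpoints file is a flat JSON dictionary from checkpoint names to scores, with names resolved relative to the directory.

```python
import json
import os
import tempfile


def best_checkpoint_sketch(best_checkpoint_dir, select_maximum_value=True):
    """Return the path whose recorded value is highest (or lowest)."""
    with open(os.path.join(best_checkpoint_dir, "best_checkpoints")) as handle:
        checkpoints = json.load(handle)          # assumed: {name: value}
    chooser = max if select_maximum_value else min
    best_name = chooser(checkpoints, key=checkpoints.get)
    return os.path.join(best_checkpoint_dir, best_name)


# Toy directory with two recorded checkpoints (names and scores are made up).
with tempfile.TemporaryDirectory() as ckpt_dir:
    with open(os.path.join(ckpt_dir, "best_checkpoints"), "w") as handle:
        json.dump({"model-10": 0.71, "model-20": 0.84}, handle)
    best_high = best_checkpoint_sketch(ckpt_dir)
    best_low = best_checkpoint_sketch(ckpt_dir, select_maximum_value=False)
    expected_high = os.path.join(ckpt_dir, "model-20")
    expected_low = os.path.join(ckpt_dir, "model-10")
```

Setting select_maximum_value to False suits metrics such as a loss, where lower is better.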
Hls Checkpoints¶
- spear.Implyloss.checkpoints.test_mru_checkpoints(num_to_keep)[source]¶
Func Desc: Runs different sessions while changing the checkpoint number that is currently being worked with and tests the same
Input: num_to_keep(int) - a limit on the size of the global step for checkpoint traversal
Output:
- spear.Implyloss.checkpoints.test_checkpoint()[source]¶
Func Desc: tests whether the checkpoints stored are as expected
Input:
Output:
Hls Data Feeders¶
Hls Data Feeders Utils¶
- spear.Implyloss.data_feeder_utils.change_values(l, user_class_to_num_map)[source]¶
Func Desc: Replace the class labels in l by sequential labels - 0,1,2,..
Input: l - the class label matrix user_class_to_num_map - dictionary storing mapping from original class labels to sequential labels
Output: l - with sequential labels
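A possible NumPy rendering of this relabelling is given below. It is a sketch under the assumption that every value in l is a key of the mapping; the function name and the toy labels are hypothetical.

```python
import numpy as np


def change_values_sketch(l, user_class_to_num_map):
    """Replace each original class label in l by its sequential label."""
    out = l.copy()
    for original, sequential in user_class_to_num_map.items():
        # Masks are computed on the untouched input so remaps cannot chain.
        out[l == original] = sequential
    return out


l = np.array([[3, 7], [7, 10]])
user_class_to_num_map = {3: 0, 7: 1, 10: 2}
relabelled = change_values_sketch(l, user_class_to_num_map)
```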
- spear.Implyloss.data_feeder_utils.load_data(fname, jname, num_load=None)[source]¶
Func Desc: load the data from the given file
Input: fname - filename num_load (default - None)
Output: the structured F_d_U_Data
- spear.Implyloss.data_feeder_utils.get_rule_classes(l, num_classes)[source]¶
Func Desc: get the different rule_classes
Input: l ([batch_size, num_rules]) num_classes (int) - the number of available classes
Output: rule_classes ([num_rules,1]) - the list of valid classes labelled by rules (say class 2 by r0, class 1 by r1, class 4 by r2 => [2,1,4])
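The example in the output description ([2,1,4]) can be reproduced with the sketch below. It assumes that an uncovered instance carries the label num_classes, and that each rule always emits the same class; the function name is hypothetical and the real routine may differ.

```python
import numpy as np


def get_rule_classes_sketch(l, num_classes):
    """Return a [num_rules, 1] array with the class each rule labels."""
    rule_classes = []
    for j in range(l.shape[1]):
        column = l[:, j]
        fired = column[column != num_classes]   # drop uncovered instances
        # A rule that never fires keeps the placeholder num_classes.
        rule_classes.append(int(fired[0]) if fired.size else num_classes)
    return np.array(rule_classes).reshape(-1, 1)


# Three rules over three instances; 5 (= num_classes) means "not covered".
l = np.array([[2, 5, 5], [5, 1, 4], [2, 5, 4]])
rule_classes = get_rule_classes_sketch(l, num_classes=5)
```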
- spear.Implyloss.data_feeder_utils.extract_rules_satisfying_min_coverage(m, min_coverage)[source]¶
Func Desc: extract the rules that satisfy the specified minimum coverage
Input: m ([batch_size, num_rules]) - mij specifies whether ith example is associated with the jth rule min_coverage
Output: satisfying_rules - list of satisfying rules not_satisfying_rules - list of not satisfying rules rule_map_new_to_old rule_map_old_to_new
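The four outputs listed above can be illustrated with a short NumPy sketch. It assumes that coverage is the number of instances a rule fires on and that a rule qualifies when its coverage is at least min_coverage; neither detail is confirmed by the entry, and the function name is hypothetical.

```python
import numpy as np


def extract_rules_sketch(m, min_coverage):
    """Split rule indices by coverage and build index maps between numberings."""
    coverage = m.sum(axis=0)   # instances covered by each rule
    satisfying = [j for j, c in enumerate(coverage) if c >= min_coverage]
    not_satisfying = [j for j, c in enumerate(coverage) if c < min_coverage]
    rule_map_new_to_old = dict(enumerate(satisfying))
    rule_map_old_to_new = {old: new for new, old in rule_map_new_to_old.items()}
    return satisfying, not_satisfying, rule_map_new_to_old, rule_map_old_to_new


m = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 1]])
kept, dropped, new_to_old, old_to_new = extract_rules_sketch(m, min_coverage=2)
```

The two maps let later steps renumber the columns of rule matrices after the low-coverage rules are dropped.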
- spear.Implyloss.data_feeder_utils.remap_2d_array(arr, map_old_to_new)[source]¶
Func Desc: remap those columns of 2D array that are present in map_old_to_new
Input: arr ([batch_size, num_rules]) map_old_to_new
Output: modified array
- spear.Implyloss.data_feeder_utils.remap_1d_array(arr, map_old_to_new)[source]¶
Func Desc: remap those positions of 1D array that are present in map_old_to_new
Input: arr ([batch_size, num_rules]) map_old_to_new
Output: modified array
- spear.Implyloss.data_feeder_utils.modify_d_or_U_using_rule_map(raw_U_or_d, rule_map_old_to_new)[source]¶
Func Desc: Modify d or U using the rule map
Input: raw_U_or_d - the raw data (labelled(d) or unlabelled(U)) rule_map_old_to_new - the rule map
Output: the modified raw_U_or_d
- spear.Implyloss.data_feeder_utils.shuffle_F_d_U_Data(data)[source]¶
Func Desc: shuffle the input data along the 0th axis i.e. among the different instances
Input: data
Output: the structured and shuffled F_d_U_Data
Hls Gen Cross Entropy Utils¶
Hls Model¶
Hls PR Utils¶
- spear.Implyloss.pr_utils.exp_term_for_constraints(rule_classes, num_classes, C)[source]¶
Func Desc: Compute the exponential term for the constraints
Input: rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C
Output: the required exponential term
- spear.Implyloss.pr_utils.pr_product_term(weights, rule_classes, num_classes, C)[source]¶
Func Desc: Compute the probability product term for the constraints
Input: weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C
Output: the required product term
- spear.Implyloss.pr_utils.get_q_y_from_p(f_probs, weights, rule_classes, num_classes, C)[source]¶
Func Desc: Compute the q_y term from the p (f_network) distribution
Input: f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C
Output: the required q_y term
- spear.Implyloss.pr_utils.get_q_r_from_p(f_probs, weights, rule_classes, num_classes, C)[source]¶
Func Desc: Compute the q_r term from the p (f_network) distribution
Input: f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C
Output: the required q_r term
- spear.Implyloss.pr_utils.theta_term_in_pr_loss(f_logits, f_probs, weights, rule_classes, num_classes, C, d)[source]¶
Func Desc: Compute the theta term in the pr loss
Input: f_logits ([batch_size, num_classes]) f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C d ([batch_size,1])
Output: the required theta term (third term in equation 14) - used to supervise f (classification) network from instances in U
- spear.Implyloss.pr_utils.phi_term_in_pr_loss(m, w_logits, f_probs, weights, rule_classes, num_classes, C, d)[source]¶
Func Desc: Compute the phi term in the pr loss
Input: m ([batch_size, num_rules]) - mij = 1 if ith example is associated with jth rule w_logits ([batch_size, num_rules]) f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C d ([batch_size,1])
Output: the required phi term (fourth term in equation 14) - used to supervise w (rule) network from instances in U
- spear.Implyloss.pr_utils.pr_loss(m, f_logits, w_logits, f_probs, weights, rule_classes, num_classes, C, d)[source]¶
Func Desc: Compute the pr loss
Input: m ([batch_size, num_rules]) - mij = 1 if ith example is associated with jth rule f_logits w_logits ([batch_size, num_rules]) - logit before sigmoid activation in w network f_probs ([batch_size, num_classes]) - output of f network weights ([batch_size, num_rules]) - the sigmoid output from w network rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C - lambda in equation 10 (hyperparameter) d ([batch_size,1]) - if ith instance is from “d” set (labelled data) d[i] = 1, else if ith instance is from “U” set, d[i] = 0
Output: the required pr loss
Hls Test¶
- class spear.Implyloss.test.HLSTest(hls)[source]¶
Class Desc: This Class is designed to test the HLS model and its accuracy and precision obtained on the validation and test datasets
- maybe_save_predictions(save_filename, x, l, m, preds, d)[source]¶
Func Desc: Saves the predictions obtained from the model if required
Input: self save_filename - the filename where the predictions have to be saved if required x ([batch_size, num_features]) l ([batch_size, num_rules]) m ([batch_size, num_rules]) preds d ([batch_size,1]) - d[i] = 1 if the ith data instance is from the labelled dataset
Output:
- test_f(datafeeder, log_output=False, data_type='test_f', save_filename=None, use_joint_f_w=False)[source]¶
Func Desc: tests the f_network (classification network)
Input: self datafeeder - the datafeeder object log_output (default - False) data_type (fixed to test_f) - the type of the data that we want to test save_filename (default - None) - the file where we can possibly store the test results use_joint_f_w (default - False)
Output: precision recall f1_score support
- test_w(datafeeder, log_output=False, data_type='test_w', save_filename=None)[source]¶
Func Desc: tests the w_network (rule network)
Input: self datafeeder - the datafeeder object log_output (default - False) data_type (fixed to test_w) - the type of the data that we want to test save_filename (default - None) - the file where we can possibly store the test results
Analyzes: the obtained w_predictions
Hls Train¶
- class spear.Implyloss.train.HLSTrain(hls, f_d_metrics_pickle, f_d_U_metrics_pickle, f_d_adam_lr, f_d_U_adam_lr, early_stopping_p, f_d_primary_metric, mode, data_dir)[source]¶
Func Desc: This Class is designed to train the HLS model using the Implyloss Algorithm
- make_f_summary_ops()[source]¶
Func Desc: make the summary of all the essential parameters of f_network
Input: Self
Summarizes: f_d_loss_ph f_d_loss f_d_f1_score_ph f_d_f1_score f_d_accuracy_ph f_d_accuracy f_d_avg_f1_score_ph f_d_avg_f1_score f_d_summaries
- report_f_d_perfs_to_tensorboard(f_d_loss, metrics_dict, global_step)[source]¶
Func Desc: report the f_d_performance to tensorboard
Input: self f_d_loss metrics_dict global_step
Output:
- train_f_on_d(datafeeder, num_epochs)[source]¶
Func Desc: trains the f_network (classification network) on labelled data
Input: self datafeeder - datafeeder object num_epochs - number of epochs for training
Output:
- train_f_on_d_U(datafeeder, num_epochs, loss_type)[source]¶
Func Desc: trains the f_network (classification network) on labelled and unlabelled data
Input: self datafeeder - datafeeder object num_epochs - number of epochs for training loss_type - different available losses
Output:
- get_metric(run_type, metrics_dict)[source]¶
Func desc: get the metrics
Input: self run_type metrics_dict
Output: the required metrics_dict
- save_metrics(run_type, metrics_dict)[source]¶
Func desc: save the metrics
Input: self run_type metrics_dict
Prints: The saved metric file
Hls Utils¶
- spear.Implyloss.utils.get_data(path)[source]¶
func desc: takes the pickle file and arranges it in a matrix list form so as to set the member variables accordingly. The expected order in the pickle file is the NUMPY arrays x, l, m, L, d, r, s, n, k, where x: [num_instances, num_features]; l: [num_instances, num_rules]; m: [num_instances, num_rules]; L: [num_instances, 1]; d: [num_instances, 1]; r: [num_instances, num_rules]; s: [num_instances, num_rules]; n: [num_rules], mask for s; k: [num_rules], LF classes, range 0 to num_classes-1
- spear.Implyloss.utils.analyze_w_predictions(x, l, m, L, d, weights, probs, rule_classes)[source]¶
func desc: analyze the rule network by computing the precisions of the rules and comparing old and new rule stats
input: x: [num_instances, num_features] l: [num_instances, num_rules] m: [num_instances, num_rules] L: [num_instances, 1] d: [num_instances, 1] weights: [num_instances, num_rules] probs: [num_instances, num_classes] rule_classes: [num_rules,1]
output: void, prints the required statistics
- spear.Implyloss.utils.convert_weights_to_m(weights)[source]¶
func desc: converts weights to m
input: weights([batch_size, num_rules]) - the weights matrix corresponding to rule network(w_network) in the algorithm
output: m([batch_size, num_rules]) - the rule coverage matrix where m_ij = 1 if jth rule covers ith instance
- spear.Implyloss.utils.convert_m_to_l(m, rule_classes, num_classes)[source]¶
func desc: converts m to l
input: m([batch_size, num_rules]) - the rule coverage matrix where m_ij = 1 if jth rule covers ith instance rule_classes - num_classes(non_negative integer) - number of available classes
output: l([batch_size, num_rules]) - labels assigned by the rules
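The two conversions above (weights to m, then m to l) can be pictured with the NumPy sketch below. Both the 0.5 threshold and the use of num_classes as the label for uncovered instances are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np


def weights_to_m_sketch(weights, threshold=0.5):
    """Mark a rule as covering an instance when its weight exceeds the threshold."""
    return (weights > threshold).astype(int)   # threshold value is an assumption


def m_to_l_sketch(m, rule_classes, num_classes):
    """Give covered entries the rule's class and uncovered ones num_classes."""
    classes = np.asarray(rule_classes).reshape(1, -1)  # broadcast over instances
    return np.where(m == 1, classes, num_classes)


weights = np.array([[0.9, 0.2], [0.4, 0.7]])
rule_classes = np.array([[2], [0]])   # rule 0 labels class 2, rule 1 labels class 0
m = weights_to_m_sketch(weights)
l = m_to_l_sketch(m, rule_classes, num_classes=3)
```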
- spear.Implyloss.utils.get_rule_precision(l, L, m)[source]¶
func desc: get the precision of the rules
input: l([batch_size, num_rules]) - labels assigned by the rules L([batch_size, 1]) - L_i = 1 if the ith instance has already a label assigned to it in the dataset m([batch_size, num_rules]) - the rule coverage matrix where m_ij = 1 if jth rule covers ith instance
output: micro_p - macro_p - comp -
- spear.Implyloss.utils.merge_dict_a_into_b(a, b)[source]¶
func desc: set the dict values of b to that of a
input: a, b : dicts
output: void
- spear.Implyloss.utils.print_tf_global_variables()[source]¶
Func Desc: prints all the global variables
Input:
Output:
- spear.Implyloss.utils.print_var_list(var_list)[source]¶
Func Desc: Prints the given variable list
Input: var_list
Output:
- spear.Implyloss.utils.pretty_print(data_structure)[source]¶
Func Desc: prints the given data structure in the desired format
Input: data_structure
Output:
- spear.Implyloss.utils.get_list_or_None(s, dtype=<class 'int'>)[source]¶
Func Desc: Returns the list of types of the variables in the string s
Input: s - string dtype function (default - int)
Output: None or list
- spear.Implyloss.utils.get_list(s)[source]¶
Func Desc: returns the output of get_list_or_None as a list
Input: s - list
Output: lst - list
- spear.Implyloss.utils.None_if_zero(n)[source]¶
Func Desc: the max(0,n) function with None if n<=0
Input: n - integer
Output: if n>0 then n else None
- spear.Implyloss.utils.boolean(s)[source]¶
Func Desc: returns the expected boolean value for the given string
Input: s - string
Output: boolean or error
- spear.Implyloss.utils.set_to_list_of_values_if_None_or_empty(lst, val, num_vals)[source]¶
Func Desc: returns lst if it is not empty else returns a same length list but with all its entries equal to val
Input: lst - list val - value num_vals (integer) - length of the list lst
Output: lst or same length val list
- spear.Implyloss.utils.conv_l_to_lsnork(l, m)[source]¶
func desc: in snorkel convention, if a rule does not cover an instance, it is assigned the label -1. We follow the convention where we assign the label num_classes instead of -1; valid class labels range over {0,1,…num_classes-1}. conv_l_to_lsnork converts l in our format to snorkel’s format
input: l([batch_size, num_rules]) - rule label matrix m([batch_size, num_rules]) - rule coverage matrix
output: lsnork([batch_size, num_rules])
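The change of convention amounts to overwriting the uncovered entries with -1. The sketch below shows one way to do that with NumPy; it is an illustration with a hypothetical function name, not the package's code.

```python
import numpy as np


def conv_l_to_lsnork_sketch(l, m):
    """Keep labels where the rule covers the instance, use -1 elsewhere."""
    return np.where(m == 1, l, -1)


num_classes = 3
l = np.array([[2, num_classes], [num_classes, 0]])  # num_classes marks "not covered"
m = np.array([[1, 0], [0, 1]])
lsnork = conv_l_to_lsnork_sketch(l, m)
```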
- spear.Implyloss.utils.compute_accuracy(support, recall)[source]¶
func desc: compute the required accuracy
input: support recall
output: accuracy
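Overall accuracy can be recovered from per-class recall and support, because recall times support is the number of correctly classified instances of a class. The sketch below shows that identity; whether the package computes it in exactly this way is an assumption, and the function name is hypothetical.

```python
import numpy as np


def compute_accuracy_sketch(support, recall):
    """Accuracy as the support-weighted mean of per-class recall."""
    support = np.asarray(support, dtype=float)
    recall = np.asarray(recall, dtype=float)
    correct = (support * recall).sum()   # correctly classified instances
    return correct / support.sum()


support = [2, 2]       # two instances of each class
recall = [1.0, 0.5]    # all of class 0 and half of class 1 are correct
accuracy = compute_accuracy_sketch(support, recall)
```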
- spear.Implyloss.utils.dump_labels_to_file(save_filename, x, l, m, L, d, weights=None, f_d_U_probs=None, rule_classes=None)[source]¶
Func Desc: dumps the given data into a pickle file
Input: save_filename - the name of the pickle file in which the arguments/data is required to be saved x ([batch_size x num_features]) l ([batch_size x num_rules]) m ([batch_size x num_rules]) L ([batch_size x 1]) d ([batch_size x 1]) weights (default - None) f_d_U_probs (default - None) rule_classes (default - None)
Output:
- spear.Implyloss.utils.load_from_pickle_with_per_class_sampling_factor(fname, per_class_sampling_factor)[source]¶
Func Desc: load the data from the given pickle file with per class sampling factor
Input: fname - name of the pickle file from which data need to be loaded per_class_sampling_factor
Output: the required matrices x1 ([batch_size x num_features]) l1 ([batch_size x num_rules]) m1 ([batch_size x num_rules]) L1 ([batch_size x 1]) d1 ([batch_size x 1])
- spear.Implyloss.utils.combine_d_covered_U_pickles(d_name, infer_U_name, out_name, d_sampling_factor, U_sampling_factor)[source]¶
Func Desc: combine the labelled and unlabelled data, merge the corresponding parameters together and store them in new file
Input: d_name - the pickle file storing labelled data infer_U_name - the pickle file storing unlabelled data out_name - the name of the file where merged output needs to be stored d_sampling_factor - the per_class_sampling_factor for labelled data U_sampling_factor - the per_class_sampling_factor for unlabelled data
Output:
Bibliography¶
- CRS20
Oishik Chatterjee, Ganesh Ramakrishnan, and Sunita Sarawagi. Robust data programming with precision-guided labeling functions. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):3397–3404, Apr. 2020. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5742, doi:10.1609/aaai.v34i04.5742.
- MCK+20
Ayush Maheshwari, Oishik Chatterjee, KrishnaTeja Killamsetty, Rishabh K. Iyer, and Ganesh Ramakrishnan. Data programming using semi-supervision and subset selection. CoRR, 2020. URL: https://arxiv.org/abs/2008.09887, arXiv:2008.09887.
- RBE+20
Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: rapid training data creation with weak supervision. The VLDB Journal, 29, May 2020. doi:10.1007/s00778-019-00552-1.