Welcome to SPEAR’s documentation!

SPEAR: Semi-Supervised Data Programming

We present SPEAR, an open-source Python library for data programming with semi-supervision. The package implements several recent data programming approaches, including facilities to programmatically label and build training data. SPEAR facilitates weak supervision in the form of rules or heuristics, which associate ‘noisy’ labels (or prelabels) with the training dataset. These noisy labels are aggregated to assign labels to the unlabeled data for downstream tasks. Several label aggregation approaches have been proposed: some aggregate the noisy labels and then train a model on the ‘noisily’ labeled set in a cascaded manner, while others aggregate the labels and train the model ‘jointly’. The package integrates several cascade and joint data programming approaches and provides facilities for defining rules. The code and tutorial notebooks are available here.

Labeling

This module takes inspiration from and builds upon Ratner et al. [RBE+20].

LF

class spear.labeling.lf.core.LabelingFunction(name: str, f: Callable[..., int], label=None, resources: Optional[Mapping[str, Any]] = None, pre: Optional[List[spear.labeling.preprocess.core.BasePreprocessor]] = None, cont_scorer: Optional[spear.labeling.continuous_scoring.core.BaseContinuousScorer] = None)

Base class for labeling function

Parameters

name (str) – name for this LF object
f (Callable[..., int]) – core function which labels the input
label (enum) – Which class this LF corresponds to
resources (Optional[Mapping[str, Any]], optional) – Additional resources for core function. Defaults to None.
pre (Optional[List[BasePreprocessor]], optional) – Preprocessors to apply on input before labeling. Defaults to None.
cont_scorer (Optional[BaseContinuousScorer], optional) – Continuous Scorer to calculate the confidence score. Defaults to None.

class spear.labeling.lf.core.labeling_function(name: Optional[str] = None, label=None, resources: Optional[Mapping[str, Any]] = None, pre: Optional[List[spear.labeling.preprocess.core.BasePreprocessor]] = None, cont_scorer: Optional[spear.labeling.continuous_scoring.core.BaseContinuousScorer] = None)

Decorator class for a labeling function

Parameters

name (Optional[str], optional) – Name for this labeling function. Defaults to None.
label (Optional[Enum], optional) – An enum. Which class this LF corresponds to. Defaults to None.
resources (Optional[Mapping[str, Any]], optional) – Additional resources for the LF. Defaults to None.
pre (Optional[List[BasePreprocessor]], optional) – Preprocessors to apply on input before labeling. Defaults to None.
cont_scorer (Optional[BaseContinuousScorer], optional) – Continuous Scorer to calculate the confidence score. Defaults to None.

Raises

ValueError – If the decorator is missing parentheses
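The "missing parentheses" error is typical of decorator classes that take arguments. The following is a minimal sketch of that pattern, not SPEAR's implementation: the class labeled_rule and the attributes rule_name and rule_label are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Optional


class labeled_rule:
    """Hypothetical stand-in for a decorator class that takes arguments."""

    def __init__(self, name: Optional[str] = None, label: Any = None) -> None:
        # When used as `@labeled_rule` (no parentheses), Python passes the
        # decorated function itself as the first argument, i.e. as `name`.
        if callable(name):
            raise ValueError("Looks like this decorator is missing parentheses!")
        self.name = name
        self.label = label

    def __call__(self, f: Callable[..., Any]) -> Callable[..., Any]:
        # Attach metadata to the function and return it unchanged.
        f.rule_name = self.name or f.__name__
        f.rule_label = self.label
        return f


@labeled_rule(label="SPAM")
def contains_offer(text: str) -> bool:
    return "offer" in text.lower()


missing_parens_error = ""
try:
    @labeled_rule  # parentheses forgotten on purpose
    def broken(text: str) -> bool:
        return False
except ValueError as err:
    missing_parens_error = str(err)
```

Writing `@labeled_rule()` or `@labeled_rule(label=...)` avoids the error, because the class is then instantiated first and the instance is applied to the function.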

Continuous scoring

class spear.labeling.continuous_scoring.core.BaseContinuousScorer(name: str, cf: Callable[..., int], resources: Optional[Mapping[str, Any]] = None)

Base Class for Continuous Scoring function used by the Labeling Function

Parameters

name (str) – Name of the continuous scoring function
cf (Callable[..., int]) – Core function which calculates continuous score
resources (Optional[Mapping[str, Any]], optional) – Resources for the scorer. Defaults to None.

class spear.labeling.continuous_scoring.core.continuous_scorer(name: Optional[str] = None, resources: Optional[Mapping[str, Any]] = None)

Decorator class for continuous scoring.

Parameters

name (Optional[str], optional) – Name for the decorator. Defaults to None.
resources (Optional[Mapping[str, Any]], optional) – Resources for the scorer. Defaults to None.

Raises

ValueError – If the decorator is missing parentheses.

LFApply

class spear.labeling.apply.core.ApplierMetadata(faults: Dict[str, int])

Metadata about Applier call.

property faults: Alias for field number 0

class spear.labeling.apply.core.BaseLFApplier(lf_set: spear.labeling.lf_set.core.LFSet)

Base class for LF applier objects, which execute a set of LFs on a collection of data points. Subclasses should operate on a single data point collection format (e.g. DataFrame). Subclasses must implement the apply method.

Parameters: lf_set (LFSet) – Instance of LFSet holding the set of labeling functions to apply to the data
Raises: ValueError – If names of LFs are not unique

spear.labeling.apply.core.apply_lfs_to_data_point(x: Any, index: int, lfs: List[spear.labeling.lf.core.LabelingFunction], f_caller: spear.labeling.apply.core._FunctionCaller) → List[Tuple[int, int, int, float]]

Label a single data point with a set of LFs

Parameters

x (DataPoint) – Data point to label
index (int) – Index of the data point
lfs (List[LabelingFunction]) – List of LFs to label x with
f_caller (_FunctionCaller) – A _FunctionCaller to record failed LF executions

Returns

A list of (data point index, LF index, label enum, confidence) tuples

Return type

RowData

class spear.labeling.apply.core.LFApplier(lf_set: spear.labeling.lf_set.core.LFSet)

LF applier for a list of data points (e.g. SimpleNamespace) or a NumPy array.

Parameters: lf_set (LFSet) – Instance of LFSet holding the set of labeling functions to apply to the data

apply(data_points: Union[Sequence[Any], numpy.ndarray], progress_bar: bool = True, fault_tolerant: bool = False, return_meta: bool = False) → Union[numpy.ndarray, Tuple[numpy.ndarray, spear.labeling.apply.core.ApplierMetadata]]

Label list of data points or a NumPy array with LFs.

Parameters

data_points (Union[DataPoints, np.ndarray]) – List of data points or NumPy array to be labeled by LFs
progress_bar (bool, optional) – Whether to display a progress bar. Defaults to True.
fault_tolerant (bool, optional) – Whether to output -1 if LF execution fails. Defaults to False.
return_meta (bool, optional) – Whether to return metadata from the apply call. Defaults to False.

Returns

np.ndarray: Matrix of labels emitted by LFs
ApplierMetadata: Metadata, such as fault counts, for the apply call

Return type

Union[np.ndarray, Tuple[np.ndarray, ApplierMetadata]]
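To make the roles of fault_tolerant and the fault counts concrete, here is a minimal pure-Python sketch of applying a set of rules to a list of data points. It is an illustration under stated assumptions, not SPEAR's implementation: apply_rules, the rule names and the use of plain callables instead of LabelingFunction objects are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

ABSTAIN = -1  # assumption: -1 marks "no label", as in the fault_tolerant description


def apply_rules(
    data_points: Sequence[Any],
    rules: Dict[str, Callable[[Any], int]],
    fault_tolerant: bool = False,
) -> Tuple[List[List[int]], Dict[str, int]]:
    """Return a (num_points x num_rules) label matrix and per-rule fault counts."""
    faults = {name: 0 for name in rules}
    matrix = []
    for x in data_points:
        row = []
        for name, rule in rules.items():
            try:
                row.append(rule(x))
            except Exception:
                if not fault_tolerant:
                    raise  # surface the failure to the caller
                faults[name] += 1
                row.append(ABSTAIN)  # a failed rule counts as an abstain
        matrix.append(row)
    return matrix, faults


rules = {
    "has_digit": lambda s: 1 if any(c.isdigit() for c in s) else ABSTAIN,
    "is_long": lambda s: 0 if len(s) > 10 else ABSTAIN,
}
# The last data point is None, so both rules fail on it.
matrix, faults = apply_rules(
    ["call 911", "hello there, world", None], rules, fault_tolerant=True
)
```

With fault_tolerant=False the same call would raise on the third data point instead of recording a fault.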

LFSet

class spear.labeling.lf_set.core.LFSet(name: str, lfs: List[spear.labeling.lf.core.LabelingFunction] = [])

Class for Set of Labeling Functions

Parameters

name (str) – Name for this LFset.
lfs (List[LabelingFunction], optional) – List of LFs to add to this object. Defaults to [].

get_lfs() → Set[spear.labeling.lf.core.LabelingFunction]

Returns LFs contained in this LFSet object

Returns: LFs in this LFSet
Return type: Set[LabelingFunction]

add_lf(lf: spear.labeling.lf.core.LabelingFunction) → None

Adds single LF to this LFSet

Parameters: lf (LabelingFunction) – LF to add

add_lf_list(lf_list: List[spear.labeling.lf.core.LabelingFunction]) → None

Adds a list of LFs to this LFSet

Parameters: lf_list (List[LabelingFunction]) – List of LFs to add to this LFSet

remove_lf(lf: spear.labeling.lf.core.LabelingFunction) → None

Removes an LF from this set

Parameters: lf (LabelingFunction) – LF to remove from this set
Raises: Warning – If the LF is not already in the LFSet

LFAnalysis

class spear.labeling.analysis.core.LFAnalysis(enum, L: numpy.ndarray, rules=None)

Run analysis on LFs using label matrix.

Parameters

L (np.ndarray) – Label matrix where L_{i,j} is the label given by the jth LF to the ith x instance
lfs (Optional[List[LabelingFunction]], optional) – Labeling functions used to generate ‘L’. Defaults to None.
abstain (int, optional) – label associated with abstain. Defaults to -1.

Raises

ValueError – If number of LFs and number of LF matrix columns differ

label_coverage() → float

Compute the fraction of data points with at least one label.

Returns: Fraction of data points with labels
Return type: float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_coverage()
0.8

label_overlap() → float

Compute the fraction of data points with at least two (non-abstain) labels.

Returns: Fraction of data points with overlapping labels
Return type: float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_overlap()
0.6

label_conflict() → float

Compute the fraction of data points with conflicting (non-abstain) labels.

Returns: Fraction of data points with conflicting labels
Return type: float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_conflict()
0.2
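The three dataset-level statistics (coverage, overlap, conflict) can be reproduced by hand. The following is a minimal pure-Python sketch, not SPEAR's implementation: the helper names are hypothetical, and -1 is assumed to be the abstain label, as in the examples.

```python
ABSTAIN = -1  # assumption: abstain label used in the examples

L = [
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
]


def votes(row):
    """Non-abstain labels in one row of the label matrix."""
    return [label for label in row if label != ABSTAIN]


def label_coverage(matrix):
    # Rows with at least one non-abstain label.
    return sum(len(votes(row)) >= 1 for row in matrix) / len(matrix)


def label_overlap(matrix):
    # Rows with at least two non-abstain labels.
    return sum(len(votes(row)) >= 2 for row in matrix) / len(matrix)


def label_conflict(matrix):
    # Rows whose non-abstain labels disagree with each other.
    return sum(len(set(votes(row))) >= 2 for row in matrix) / len(matrix)


coverage = label_coverage(L)   # 4 of 5 rows are labeled
overlap = label_overlap(L)     # 3 of 5 rows have two or more labels
conflict = label_conflict(L)   # only the third row has disagreeing labels
```

Every conflicting row is also an overlapping row, and every overlapping row is covered, so conflict ≤ overlap ≤ coverage always holds.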

lf_polarities() → List[List[int]]

Infer the polarities of each LF based on evidence in a label matrix.

Returns: Unique output labels for each LF
Return type: List[List[int]]

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_polarities()
[[0, 1], [0], [0]]

lf_coverages() → numpy.ndarray

Compute the fraction of examples each LF labels.

Returns: Fraction of labeled examples for each LF
Return type: np.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_coverages()
array([0.4, 0.8, 0.4])

lf_overlaps(normalize_by_coverage: bool = False) → numpy.ndarray

Compute the fraction of examples each LF labels that are also labeled by another LF. An overlapping example is one for which at least one other LF returns a (non-abstain) label. Note that the maximum possible overlap fraction for an LF is the LF’s coverage, unless normalize_by_coverage=True, in which case it is 1.

Parameters: normalize_by_coverage (bool, optional) – Normalize by coverage of the LF, so that it returns the percent of LF labels that have overlaps. Defaults to False.
Returns: Fraction of overlapping examples for each LF
Return type: np.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_overlaps()
array([0.4, 0.6, 0.4])
>>> LFAnalysis(L).lf_overlaps(normalize_by_coverage=True)
array([1.  , 0.75, 1.  ])

lf_conflicts(normalize_by_overlaps: bool = False) → numpy.ndarray

Compute the fraction of examples each LF labels that are labeled differently by another LF. A conflicting example is one for which at least one other LF returns a different (non-abstain) label. Note that the maximum possible conflict fraction for an LF is the LF’s overlap fraction, unless normalize_by_overlaps=True, in which case it is 1.

Parameters: normalize_by_overlaps (bool, optional) – Normalize by overlaps of the LF, so that it returns the percent of LF overlaps that have conflicts. Defaults to False.
Returns: Fraction of conflicting examples for each LF
Return type: np.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_conflicts()
array([0.2, 0.2, 0. ])
>>> LFAnalysis(L).lf_conflicts(normalize_by_overlaps=True)
array([0.5       , 0.33333333, 0.        ])
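The per-LF statistics and their normalized variants all derive from three per-column counts. The sketch below is a pure-Python illustration, not SPEAR's implementation: per_lf_counts and ratios are hypothetical helpers, and -1 is assumed to be the abstain label.

```python
def per_lf_counts(matrix, abstain=-1):
    """Per LF (column): rows it labels, rows it overlaps on, rows it conflicts on."""
    n_lfs = len(matrix[0])
    covered, overlapping, conflicting = [0] * n_lfs, [0] * n_lfs, [0] * n_lfs
    for row in matrix:
        for j, label in enumerate(row):
            if label == abstain:
                continue
            # Non-abstain labels that the other LFs gave to the same row.
            others = [v for k, v in enumerate(row) if k != j and v != abstain]
            covered[j] += 1
            overlapping[j] += 1 if others else 0
            conflicting[j] += 1 if any(v != label for v in others) else 0
    return covered, overlapping, conflicting


def ratios(numerators, denominators):
    """Element-wise division that yields 0.0 where the denominator is 0."""
    return [n / d if d else 0.0 for n, d in zip(numerators, denominators)]


L = [
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
]
covered, overlapping, conflicting = per_lf_counts(L)
n_rows = [len(L)] * len(covered)

lf_coverages = ratios(covered, n_rows)
lf_overlaps = ratios(overlapping, n_rows)
lf_overlaps_by_coverage = ratios(overlapping, covered)
lf_conflicts = ratios(conflicting, n_rows)
lf_conflicts_by_overlaps = ratios(conflicting, overlapping)
```

Normalizing only changes the denominator: the unnormalized fractions divide by the number of rows, while the normalized ones divide by the LF's own coverage or overlap count.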

lf_empirical_accuracies(Y: numpy.ndarray) → numpy.ndarray

Compute empirical accuracy against a set of labels Y for each LF. Usually, Y represents development set labels.

Parameters: Y (np.ndarray) – [n] np.ndarray of gold labels
Returns: Empirical accuracies for each LF
Return type: np.ndarray

lf_summary(Y: Optional[numpy.ndarray] = None, plot: Optional[bool] = False) → pandas.DataFrame

Create a pandas DataFrame with the various per-LF statistics.

Parameters

Y (Optional[np.ndarray], optional) – [n] np.ndarray of gold labels. If provided, the empirical accuracy for each LF will be calculated. Defaults to None.
plot (Optional[bool], optional) – If set to true a bar graph is plotted. Defaults to False.

Returns

Summary statistics for each LF

Return type

DataFrame

Pre Labels

class spear.labeling.prelabels.core.PreLabels(name: str, data: Sequence[Any], rules: spear.labeling.lf_set.core.LFSet, num_classes: int, labels_enum, data_feats: Optional[Sequence[Any]] = numpy.array, gold_labels: Optional[Sequence[Any]] = numpy.array, exemplars: Sequence[Any] = numpy.array)

Generates noisy labels and continuous scores from the LFs applied to the data

Parameters

name (str) – Name for this object.
data (DataPoints) – Datapoints.
gold_labels (Optional[DataPoints]) – Labels for datapoints if available.
rules (LFSet) – Set of Rules to generate noisy labels for the dataset.
exemplars (DataPoints) – [description]

get_labels()

Applies LFs to the dataset to generate noisy labels and returns noisy labels and confidence scores

Returns: Noisy Labels and Confidences
Return type: Tuple(DataPoints, DataPoints)

analyse_lfs(plot=False)

Analyse the LFs in the LFSet on the data

Parameters: plot (bool, optional) – Plot the values. Defaults to False.
Returns: DataFrame consisting of Polarity, Coverage, Overlap, Conflicts, Empirical Acc
Return type: DataFrame

generate_json(filename=None)

Generates a json file with label value to label name mapping

Parameters: filename (str, optional) – Name for json file. Defaults to None.

generate_pickle(filename=None)

Generates a pickle file with noisy labels, confidence and other Metadata

Parameters: filename (str, optional) – Name for pickle file. Defaults to None.

CAGE

Chatterjee et al. [CRS20]

class spear.cage.core.Cage(path_json, n_lfs)

Cage class: class for data programming using CAGE [Note: from here on, the terms graphical model (gm) and CAGE algorithm are used interchangeably]

Parameters

path_json – Path to json file consisting of number to string(class name) map
n_lfs – number of labelling functions used to generate pickle files

save_params(save_path)

member function to save parameters of Cage

Parameters: save_path – path to pickle file to save parameters

load_params(load_path)

member function to load parameters to Cage

Parameters: load_path – path to pickle file to load parameters

fit_and_predict_proba(path_pkl, path_test=None, path_log=None, qt=0.9, qc=0.85, metric_avg=['binary'], n_epochs=100, lr=0.01)

Parameters

path_pkl – Path to pickle file of input data in standard format
path_test – Path to the pickle file containing test data in standard format
path_log – Path to log file. No log is produced if path_test is None. Default is None, in which case accuracies/f1_scores are printed to the terminal
qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
metric_avg – List of averaging methods to be used in calculating f1_score, default is [‘binary’]. Use None to skip calculating f1_score
n_epochs – Number of epochs, default is 100
lr – Learning rate for torch.optim, default is 0.01

Returns

numpy.ndarray of shape (num_instances, num_classes) where the (i, j)-th element is the probability of the ith instance being the jth class (the jth value when the Enum values are sorted in ascending order)

fit_and_predict(path_pkl, path_test=None, path_log=None, qt=0.9, qc=0.85, metric_avg=['binary'], n_epochs=100, lr=0.01, need_strings=False)

Parameters

path_pkl – Path to pickle file of input data in standard format
path_test – Path to the pickle file containing test data in standard format
path_log – Path to log file. No log is produced if path_test is None. Default is None, in which case accuracies/f1_scores are printed to the terminal
qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
metric_avg – List of averaging methods to be used in calculating f1_score, default is [‘binary’]
n_epochs – Number of epochs, default is 100
lr – Learning rate for torch.optim, default is 0.01
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

numpy.ndarray of shape (num_instances,) containing the aggregated/predicted labels. Elements are strings if need_strings is True and numbers otherwise.
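The relation between the probability matrix and the predicted labels can be sketched as follows. This is an illustration only, not SPEAR's code: ClassLabels and to_labels are hypothetical names, and using the Enum member name as the class string is an assumption (the library reads class names from the json map).

```python
from enum import Enum
from typing import List, Sequence, Union


class ClassLabels(Enum):  # hypothetical label Enum
    SPAM = 1
    HAM = 0


def to_labels(
    proba: Sequence[Sequence[float]], need_strings: bool = False
) -> List[Union[int, str]]:
    """Pick the most probable class per row of a (num_instances, num_classes) matrix."""
    # Column j corresponds to the j-th Enum member when sorted by ascending value.
    ordered = sorted(ClassLabels, key=lambda member: member.value)
    labels = []
    for row in proba:
        best = max(range(len(row)), key=lambda j: row[j])
        member = ordered[best]
        labels.append(member.name if need_strings else member.value)
    return labels


proba = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
as_values = to_labels(proba)
as_strings = to_labels(proba, need_strings=True)
```

Note that the column order follows the sorted Enum values, not the order in which the members are declared: HAM (value 0) is column 0 here even though SPAM is declared first.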

predict_proba(path_test, qc=0.85)

Used to predict class probabilities from the pickle file at path path_test

Parameters

path_test – Path to the pickle file containing test data set in standard format
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of shape (num_instances, num_classes) where the (i, j)-th element is the probability of the ith instance being the jth class (the jth value when the Enum values are sorted in ascending order) [Note: no aggregation/algorithm-running will be done using the current input]

predict(path_test, qc=0.85, need_strings=False)

Used to predict labels from the pickle file at path path_test

Parameters

path_test – Path to the pickle file containing test data set in standard format
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

numpy.ndarray of shape (num_instances,) containing the predicted labels. Elements are strings if need_strings is True and numbers otherwise. [Note: no aggregation/algorithm-running will be done using the current input]

Joint Learning (JL)

Maheshwari et al. [MCK+20]

From here on, feature model (fm) means feature-based classification model

class spear.jl.core.JL(path_json, n_lfs, n_features, feature_model='nn', n_hidden=512)

Joint_Learning class:

[Note: from here on, the terms feature model (fm) and feature-based classification model are used interchangeably, as are graphical model (gm) and CAGE algorithm]

Loss functions (useful for loss_func_mask in the fit_and_predict_proba and fit_and_predict functions):

Loss function number | Calculated over | Loss function
1 | L | Cross Entropy(prob_from_feature_model, true_labels)
2 | U | Entropy(prob_from_feature_model)
3 | U | Cross Entropy(prob_from_feature_model, prob_from_graphical_model)
4 | L | Negative Log Likelihood
5 | U | Negative Log Likelihood(marginalised over true labels)
6 | L and U | KL Divergence(prob_feature_model, prob_graphical_model)
7 | _ | Quality guide
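The loss_func_mask argument is a list of size 7 in which entry i is 1 if loss function (i+1) from the list above is included and 0 otherwise. A small helper makes the off-by-one explicit. This is a convenience sketch: make_loss_func_mask is a hypothetical name and not part of SPEAR.

```python
from typing import Iterable, List

N_LOSSES = 7  # seven loss functions are listed above


def make_loss_func_mask(selected: Iterable[int]) -> List[int]:
    """Build a 0/1 mask from 1-based loss function numbers."""
    chosen = set(selected)
    unknown = chosen - set(range(1, N_LOSSES + 1))
    if unknown:
        raise ValueError(
            f"Loss function numbers must be in 1..{N_LOSSES}, got {sorted(unknown)}"
        )
    # Position i (0-based) of the mask switches loss function i + 1 on or off.
    return [1 if number in chosen else 0 for number in range(1, N_LOSSES + 1)]


# Enable loss functions 1, 3, 4 and 6; leave 2, 5 and 7 out.
mask = make_loss_func_mask([1, 3, 4, 6])
```

The resulting list can then be passed as loss_func_mask to fit_and_predict_proba or fit_and_predict.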

Parameters

path_json – Path to json file containing the dictionary of number to string(class name) map
n_lfs – number of labelling functions used to generate pickle files
n_features – number of features for each instance in the first array of pickle file aka feature matrix
feature_model – The model intended to be used for features; allowed values are the strings ‘lr’ (logistic regression) or ‘nn’ (neural network with 2 hidden layers), default is ‘nn’
n_hidden – Number of hidden layer nodes if feature model is ‘nn’, type is integer, default is 512

save_params(save_path)

member function to save parameters of JL

Parameters: save_path – path to pickle file to save parameters

load_params(load_path)

member function to load parameters to JL

Parameters: load_path – path to pickle file to load parameters

fit_and_predict_proba(path_L, path_U, path_V, path_T, loss_func_mask, batch_size, lr_fm, lr_gm, use_accuracy_score, path_log=None, return_gm=False, n_epochs=100, start_len=7, stop_len=10, is_qt=True, is_qc=True, qt=0.9, qc=0.85, metric_avg='binary')

Parameters

path_L – Path to pickle file of labelled instances
path_U – Path to pickle file of unlabelled instances
path_V – Path to pickle file of validation instances
path_T – Path to pickle file of test instances
loss_func_mask – List of size 7 where loss_func_mask[i] should be 1 if loss function (i+1) should be included and 0 otherwise. See Eq. (3) in [MCK+20]
batch_size – Batch size, type should be integer
lr_fm – Learning rate for feature model, type is integer or float
lr_gm – Learning rate for graphical model(cage algorithm), type is integer or float
use_accuracy_score – The score to use for termination condition on validation set. True for accuracy_score, False for f1_score
path_log – Path to the log file to append to. Default is None, in which case accuracies/f1_scores are printed to the terminal
return_gm – Return the predictions of graphical model? the allowed values are True, False. Default value is False
n_epochs – Number of epochs in each run, type is integer, default is 100
start_len – A parameter used in validation: the earliest epoch after which validation checks are performed; type is integer, default is 7
stop_len – A parameter used in validation: the minimum number of consecutive epochs of non-increasing validation accuracy after which training should be stopped; type is integer, default is 10
is_qt – True if quality guide is available(and will be provided in ‘qt’ argument). False if quality guide is intended to be found from validation instances. Default is True
is_qc – True if quality index is available(and will be provided in ‘qc’ argument). False if quality index is intended to be found from validation instances. Default is True
qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
metric_avg – Average metric to be used in calculating f1_score/precision/recall, default is ‘binary’

Returns

If return_gm is True, the return value is two numpy arrays of predicted probabilities, each of shape (num_instances, num_classes): the first from the feature model, the second from the graphical model. Otherwise, the return value is a single numpy array of shape (num_instances, num_classes) from the feature model. For a given model, the (i, j)-th element is the probability of the ith instance being the jth class (the jth value when the Enum values are sorted in ascending order) under that model. It is suggested to use the probabilities of the feature model
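The interplay of start_len and stop_len can be pictured with a small early-stopping sketch. It rests on assumptions that the reference text does not settle: validation scores are compared against the best score seen so far, and epochs are counted from 1. The function stopping_epoch is hypothetical and is not SPEAR's training loop.

```python
from typing import Optional, Sequence


def stopping_epoch(
    val_scores: Sequence[float], start_len: int = 7, stop_len: int = 10
) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    stale = 0  # consecutive checked epochs without an improvement
    for epoch, score in enumerate(val_scores, start=1):
        if epoch <= start_len:
            continue  # assumption: no validation checks during the first start_len epochs
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= stop_len:
                return epoch
    return None


scores = [0.5, 0.5, 0.5, 0.6, 0.7, 0.7, 0.69, 0.7, 0.65]
stop_at = stopping_epoch(scores, start_len=3, stop_len=3)
never = stopping_epoch(list(range(20)))
```

In this run the score last improves at epoch 5, so epochs 6, 7 and 8 are the three stale epochs and training stops at epoch 8, before the ninth score is examined.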

fit_and_predict(path_L, path_U, path_V, path_T, loss_func_mask, batch_size, lr_fm, lr_gm, use_accuracy_score, path_log=None, return_gm=False, n_epochs=100, start_len=7, stop_len=10, is_qt=True, is_qc=True, qt=0.9, qc=0.85, metric_avg='binary', need_strings=False)

Parameters

path_L – Path to pickle file of labelled instances
path_U – Path to pickle file of unlabelled instances
path_V – Path to pickle file of validation instances
path_T – Path to pickle file of test instances
loss_func_mask – List of size 7 where loss_func_mask[i] should be 1 if loss function (i+1) should be included and 0 otherwise. See Eq. (3) in [MCK+20]
batch_size – Batch size, type should be integer
lr_fm – Learning rate for feature model, type is integer or float
lr_gm – Learning rate for graphical model(cage algorithm), type is integer or float
use_accuracy_score – The score to use for termination condition on validation set. True for accuracy_score, False for f1_score
path_log – Path to the log file to append to. Default is None, in which case accuracies/f1_scores are printed to the terminal
return_gm – Return the predictions of graphical model? the allowed values are True, False. Default value is False
n_epochs – Number of epochs in each run, type is integer, default is 100
start_len – A parameter used in validation: the earliest epoch after which validation checks are performed; type is integer, default is 7
stop_len – A parameter used in validation: the minimum number of consecutive epochs of non-increasing validation accuracy after which training should be stopped; type is integer, default is 10
is_qt – True if quality guide is available(and will be provided in ‘qt’ argument). False if quality guide is intended to be found from validation instances. Default is True
is_qc – True if quality index is available(and will be provided in ‘qc’ argument). False if quality index is intended to be found from validation instances. Default is True
qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
metric_avg – Average metric to be used in calculating f1_score/precision/recall, default is ‘binary’
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

If return_gm is True, the return value is two numpy arrays of predicted labels, each of shape (num_instances,): the first from the feature model, the second from the graphical model. Otherwise, the return value is a single numpy array of predicted labels of shape (num_instances,) from the feature model. It is suggested to use the probabilities of the feature model

predict_gm_proba(path_test, qc=0.85)

Used to find the predicted labels based on the trained parameters of graphical model(CAGE)

Parameters

path_test – Path to the pickle file containing test data set
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of shape (num_instances, num_classes) where the (i, j)-th element is the probability of the ith instance being the jth class (the jth value when the Enum values are sorted in ascending order) [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model

predict_fm_proba(x_test)

Used to find the predicted labels based on the trained parameters of feature model

Parameters: x_test – numpy array of shape (num_instances, num_features) containing data whose labels are to be predicted
Returns: numpy.ndarray of shape (num_instances, num_classes) where the (i, j)-th element is the probability of the ith instance being the jth class (the jth value when the Enum values are sorted in ascending order) [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model

predict_gm(path_test, qc=0.85, need_strings=False)

Used to find the predicted labels based on the trained parameters of graphical model(CAGE)

Parameters

path_test – Path to the pickle file containing test data set
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

numpy.ndarray of shape (num_instances,) containing the predicted labels. Elements are strings if need_strings is True and numbers otherwise. [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model

predict_fm(x_test, need_strings=False)

Used to find the predicted labels based on the trained parameters of feature model

Parameters

x_test – numpy array of shape (num_instances, num_features) containing data whose labels are to be predicted
need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

numpy.ndarray of shape (num_instances,) containing the predicted labels. Elements are strings if need_strings is True and numbers otherwise. [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model

Subset Selection

Uses facilityLocation from the submodlib library, which is also provided by DECILE, for submodular optimization

spear.jl.subset_selection.rand_subset(n_all, n_instances)

A function to choose random indices of the input instances to be labeled

Parameters

n_all – number of available instances, type is integer
n_instances – number of instances to be labelled, type is integer

Returns

A numpy.ndarray of the indices to be labeled (of shape (n_sup,), with each element in the range [0,n_all-1))

spear.jl.subset_selection.unsup_subset(x_train, n_unsup)

A function for unsupervised subset selection(the subset to be labeled)

Parameters

x_train – A numpy.ndarray of shape (n_instances, n_features). All the data, intended to be used for training
n_unsup – number of instances to be found during unsupervised subset selection, type is integer

Returns

numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,x_train.shape[0])), the result of subset selection

spear.jl.subset_selection.sup_subset(path_json, path_pkl, n_sup, qc=0.85)

A helper function for supervised subset selection(the subset to be labeled) which just returns indices

Parameters

path_json – Path to json file of number to string(class name) map
path_pkl – Path to the pickle file containing all the training data in standard format
n_sup – Number of instances to be found during supervised subset selection
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,num_instances)), the result of subset selection AND the data which is list of contents of path_pkl

spear.jl.subset_selection.sup_subset_indices(path_json, path_pkl, n_sup, qc=0.85)

A function for supervised subset selection (the subset to be labeled) which just returns indices

Parameters

path_json – Path to json file of number to string(class name) map
path_pkl – Path to the pickle file containing all the training data in standard format
n_sup – Number of instances to be found during supervised subset selection
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,num_instances)), the result of subset selection

spear.jl.subset_selection.sup_subset_save_files(path_json, path_pkl, path_save_L, path_save_U, n_sup, qc=0.85)

A function for supervised subset selection(the subset to be labeled) which makes separate pickle files of data, one for those to be labelled, other that can be left unlabelled

Parameters

path_json – Path to json file of number to string(class name) map
path_pkl – Path to the pickle file containing all the training data in standard format
path_save_L – Path to save the pickle file of set of instances to be labelled. Note that instances are not labelled yet. Extension should be .pkl
path_save_U – Path to save the pickle file of set of instances that can be left unlabelled. Extension should be .pkl
n_sup – number of instances to be found during supervised subset selection
qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,num_instances)), the result of subset selection. Also two pickle files are saved at path_save_L and path_save_U

spear.jl.subset_selection.replace_in_pkl(path, path_save, np_array, index)

A function to replace one of the numpy arrays stored in the pickle file with a given array

Parameters

path – Path to the pickle file containing all the data in standard format
path_save – Path to save the pickle file after replacing the array at position index in the data of the path pickle file
np_array – The array that replaces the existing array in the data of the path pickle file
index – Index of the numpy array, in data of path pickle file, to be replaced with np_array. Value should be in [0,8]

Returns

No return value. A pickle file is generated at path_save

spear.jl.subset_selection.insert_true_labels(path, path_save, labels)[source]¶

A function to insert the true labels, after labeling the instances, to the pickle file

Parameters

path – Path to the pickle file containing all the data in standard format
path_save – Path to save the pickle file after replacing the ‘L’(true labels numpy array) of data in path pickle file
labels – The true labels of the data in pickle file. numpy.ndarray of shape (num_instances, 1)

Returns

No return value. A pickle file is generated at path_save

CAGE, JL - UTILS¶

Note: The arguments whose shapes are mentioned in ‘[….]’ are torch tensors.

Data loaders¶

The common utils to CAGE and JL algorithms are in this file. Don’t change the name or location of this file.

spear.utils.data_editor.is_dict_trivial(dict)[source]¶

A helper function that checks whether every key in the dictionary equals its value, ignoring keys that are None

Parameters: dict – the dictionary
Returns: True if all keys(which are not None) are equal to respective values. False otherwise
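The check described above can be sketched in a few lines. `is_trivial_mapping` is a hypothetical stand-in written from the description, not the library's code.

```python
def is_trivial_mapping(mapping):
    """Return True when every key that is not None maps to itself."""
    return all(key == value for key, value in mapping.items() if key is not None)


print(is_trivial_mapping({0: 0, 1: 1, None: 2}))  # the None key is ignored
print(is_trivial_mapping({0: 1, 1: 0}))           # keys differ from values
```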

spear.utils.data_editor.get_data(path, check_shapes=True, class_map=None)[source]¶

Standard format in pickle file contains the NUMPY ndarrays x, l, m, L, d, r, s, n, k and an int n_classes
x: (num_instances, num_features), x[i][j] is jth feature of ith instance. Note that the dimension of this array can vary depending on the dimension of input

l: (num_instances, num_lfs), l[i][j] is the prediction of jth LF(co-domain: the values used in Enum) on ith instance. l[i][j] = None implies Abstain

m: (num_instances, num_lfs), m[i][j] is 1 if jth LF didn’t Abstain on ith instance. Else it’s 0

L: (num_instances, 1), L[i] is true label(co-domain: the values used in Enum) of ith instance, if available. Else L[i] is None

d: (num_instances, 1), d[i] is 1 if ith instance is labelled. Else it is 0

r: (num_instances, num_lfs), r[i][j] is 1 if ith instance is an exemplar for jth rule. Else it’s 0

s: (num_instances, num_lfs), s[i][j] is the continuous score of ith instance given by jth continuous LF. If jth LF is not continuous, then s[i][j] is None

n: (num_lfs,), n[i] is 1 if ith LF has continuous counterpart, else n[i] is 0

k: (num_lfs,), k[i] is the class of ith LF, co-domain: the values used in Enum

n_classes: total number of classes

In case the numpy array is not available(can be possible for x, L, d, r, s), it is stored as numpy.zeros(0)
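The relationship between l and m, and between L and d, follows directly from the definitions above. The sketch below derives m and d from toy values; plain nested lists stand in for the numpy arrays so that it runs without dependencies, and the values are made up for illustration.

```python
# l[i][j]: label given by LF j to instance i (None means Abstain).
l = [[1, None],
     [None, None],
     [0, 1]]
# L[i][0]: true label of instance i, None when it is not available.
L = [[1], [None], [None]]

# m[i][j] is 1 exactly when LF j did not abstain on instance i.
m = [[0 if label is None else 1 for label in row] for row in l]
# d[i][0] is 1 exactly when instance i carries a true label.
d = [[0 if row[0] is None else 1] for row in L]

print(m)
print(d)
```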

Parameters

path – path to pickle file with data in the format above
check_shapes – if true, checks whether the shapes of numpy arrays in pickle file are consistent as per the format mentioned above. Else it doesn’t check. Default is True.
class_map – dictionary mapping the class numbers, as per the Enum defined in the labeling part, to sorted indices in [0,n_classes-1]. l and L are modified using class_map (needed inside the algorithms) before returning. Default is None, which doesn’t do any mapping

Returns

A list containing all the numpy arrays mentioned above. The arrays l, L are modified using the class_map
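A class_map of the kind described above can be built by sorting the Enum values and numbering them. The sketch below is illustrative only: the Enum values are made up, `remap` is a hypothetical helper, and leaving None (Abstain) untouched is an assumption about behaviour that the text does not specify.

```python
enum_values = [7, 3, 9]  # hypothetical values used in the labeling Enum

# Sorted class numbers are mapped onto 0 .. n_classes-1.
class_map = {value: index for index, value in enumerate(sorted(enum_values))}


def remap(labels, class_map):
    """Map Enum values to indices; None is kept as None in this sketch."""
    return [[None if v is None else class_map[v] for v in row] for row in labels]


l = [[3, None], [9, 7]]
l_mapped = remap(l, class_map)
print(class_map)
print(l_mapped)
```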

spear.utils.data_editor.get_classes(path)[source]¶

The json file should contain a dictionary of number to string(class name) map as defined in Enum

Parameters: path – path to json file with contents mentioned above
Returns: A dictionary (number to string(class name) map)

spear.utils.data_editor.get_predictions(proba, class_map, class_dict, need_strings)[source]¶

This function takes the probability of each instance belonging to each class and returns the class of each instance, chosen as the one with maximum probability

Parameters

proba – probability numpy.ndarray of shape (num_instances, num_classes)
class_map – dictionary mapping the class numbers(as per Enum class defined) to numbers in range [0, num_classes-1]
class_dict – dictionary consisting of number to string(class name) mapping as per the Enum class defined
need_strings – If True, the output contains strings (class names), else it consists of numbers (class numbers as used in Enum definition)

Returns

numpy.ndarray of shape (num_instances,), where each element is the class of the corresponding instance: a class name if need_strings is True, a class number otherwise
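The argmax-then-translate logic can be sketched as below. `predict_classes` is a hypothetical re-implementation on plain lists, written from the parameter descriptions; the class numbers and names are made up.

```python
def predict_classes(proba, class_map, class_dict, need_strings):
    """Pick the most probable column per row and translate it back.

    class_map: Enum class number -> column index in proba.
    class_dict: Enum class number -> class name.
    """
    inverse_map = {index: number for number, index in class_map.items()}
    predictions = []
    for row in proba:
        best_index = max(range(len(row)), key=row.__getitem__)
        number = inverse_map[best_index]
        predictions.append(class_dict[number] if need_strings else number)
    return predictions


proba = [[0.2, 0.8], [0.9, 0.1]]
class_map = {3: 0, 7: 1}             # hypothetical Enum values 3 and 7
class_dict = {3: "HAM", 7: "SPAM"}   # hypothetical class names
names = predict_classes(proba, class_map, class_dict, need_strings=True)
numbers = predict_classes(proba, class_map, class_dict, need_strings=False)
print(names, numbers)
```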

spear.utils.data_editor.get_enum(np_array, enm)[source]¶

This function is used to convert a numpy array of numbers to a numpy array of enums based on the Enum class provided ‘enm’

Parameters

np_array – a numpy.ndarray of any shape consisting of numbers
enm – An class derived from ‘Enum’ class, which must contain map from every number in np_array to an enum

Returns

numpy.ndarray of the same shape as np_array, but containing enums (as per the mapping in ‘enm’) instead of numbers
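Looking up an Enum member by value is what makes this conversion possible. The sketch below applies it to a nested list instead of a numpy array; `ClassLabels` and `to_enum` are hypothetical names used only for illustration.

```python
from enum import Enum


class ClassLabels(Enum):  # hypothetical Enum, as one might define for labeling
    HAM = 0
    SPAM = 1


def to_enum(values, enm):
    """Recursively replace numbers in a nested list with members of enm."""
    if isinstance(values, list):
        return [to_enum(v, enm) for v in values]
    return enm(values)  # Enum lookup by value, e.g. ClassLabels(1)


result = to_enum([[0, 1], [1, 1]], ClassLabels)
print(result)
```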

CAGE and JL utils¶

From here on, Graphical model (gm) refers to the CAGE algorithm and Feature model (fm) refers to the feature-based classification model

The common utils to CAGE and JL algorithms are in this file. Don’t change the name or location of this file.

spear.utils.utils_cage.phi(theta, l, device)[source]¶

Graphical model utils: A helper function

Parameters

theta – [n_classes, n_lfs], the parameters
l – [n_lfs]
device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a tensor of shape [n_classes, n_lfs], element wise product of input tensors(each row of theta dot product with l)
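The broadcasted product described in the return value can be illustrated on nested lists. This is a sketch only: `phi_sketch` is a hypothetical name, it ignores the device argument, and the real function operates on torch tensors.

```python
def phi_sketch(theta, l):
    """Multiply each row of theta element-wise by the vector l."""
    return [[t * x for t, x in zip(row, l)] for row in theta]


theta = [[1.0, 2.0, 3.0],
         [0.5, 0.5, 0.5]]   # shape [n_classes=2, n_lfs=3]
l = [1, 0, 1]               # shape [n_lfs=3]
out = phi_sketch(theta, l)
print(out)
```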

spear.utils.utils_cage.calculate_normalizer(theta, k, n_classes, device)[source]¶

Graphical model utils: Used to find Z(the normaliser) in CAGE. Eq(4) in [CRS20]

Parameters

theta – [n_classes, n_lfs], the parameters
k – [n_lfs], labels corresponding to LFs
n_classes – num of classes/labels
device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a real value, representing the normaliser

spear.utils.utils_cage.probability_l_y(theta, m, k, n_classes, device)[source]¶

Graphical model utils: Used to find probability involving the term psi_theta(in Eq(1) in [CRS20]), the potential function for all LFs

Parameters

theta – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a tensor of shape [n_instances, n_classes], the psi_theta value for each instance, for each class(true label y)

spear.utils.utils_cage.probability_s_given_y_l(pi, s, y, m, k, continuous_mask, qc)[source]¶

Graphical model utils: Used to find probability involving the term psi_pi(in Eq(1) in [CRS20]), the potential function for all continuous LFs

Parameters

pi – [n_lfs], the parameters for the class y
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
y – a value in [0, n_classes-1], representing true label, for which psi_pi is calculated
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1

Returns

a tensor of shape [n_instances], the psi_pi value for each instance, for the given label(true label y)

spear.utils.utils_cage.probability(theta, pi, m, s, k, n_classes, continuous_mask, qc, device)[source]¶

Graphical model utils: Used to find probability of given instances for all possible true labels(y’s). Eq(1) in [CRS20]

Parameters

theta – [n_classes, n_lfs], the parameters
pi – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1
device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a tensor of shape [n_instances, n_classes], the probability for an instance being a particular class

spear.utils.utils_cage.log_likelihood_loss(theta, pi, m, s, k, n_classes, continuous_mask, qc, device)[source]¶

Graphical model utils: Negative of log likelihood loss. Negative of Eq(6) in [CRS20]

Parameters

theta – [n_classes, n_lfs], the parameters
pi – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1
device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a real value, negative of summation over (the log of probability for an instance, marginalised over y(true labels))
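The return value can be read as: marginalise each instance's probability over y, take the log, sum over instances, and negate. A minimal sketch, assuming a precomputed matrix `joint` where joint[i][y] is the probability of instance i with true label y (the name and the toy numbers are illustrative):

```python
import math


def negative_log_likelihood(joint):
    """-sum_i log(sum_y joint[i][y]) for a nested list of probabilities."""
    return -sum(math.log(sum(row)) for row in joint)


joint = [[0.1, 0.4],
         [0.25, 0.25]]   # each row marginalises to 0.5
loss = negative_log_likelihood(joint)
print(loss)
```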

spear.utils.utils_cage.precision_loss(theta, k, n_classes, a, device)[source]¶

Graphical model utils: Negative of the regularizer term in Eq(9) in [CRS20]

Parameters

theta – [n_classes, n_lfs], the parameters
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
a – [n_lfs], a[i] is the quality guide for ith LF. Value(s) must be between 0 and 1
device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a real value, negative of regularizer term

spear.utils.utils_cage.predict_gm_labels(theta, pi, m, s, k, n_classes, continuous_mask, qc, device)[source]¶

Graphical model utils: Used to predict the labels after the training is done

Parameters

theta – [n_classes, n_lfs], the parameters
pi – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1
device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

numpy.ndarray of shape (n_instances,), the predicted class for an instance

JL utils¶

spear.utils.utils_jl.log_likelihood_loss_supervised(theta, pi, y, m, s, k, n_classes, continuous_mask, qc, device)[source]¶

Joint Learning utils: Negative log likelihood loss, used in loss 4 in [MCK+20]

Parameters

theta – [n_classes, n_lfs], the parameters
pi – [n_classes, n_lfs], the parameters
m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0
s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF
k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1
n_classes – num of classes/labels
continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counterpart, else it is 0
qc – a float value OR [n_lfs], qc[i] quality index for ith LF
device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a real value, summation over (the log of probability for an instance)

spear.utils.utils_jl.entropy(probabilities)[source]¶

Joint Learning utils: Entropy, Used in loss 2 in [MCK+20]

Parameters: probabilities – [num_unsup_instances, num_classes], probabilities[i][j] is probability of ith instance being jth class
Returns: a real value, the entropy value of given probability
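For reference, Shannon entropy over a batch of class distributions can be sketched as below. The natural logarithm and the averaging over instances are assumptions; the text above does not say which reduction the library applies. `mean_entropy` is a hypothetical name.

```python
import math


def mean_entropy(probabilities):
    """Average over rows of -sum_j p_j * log(p_j); zero entries contribute 0."""
    total = 0.0
    for row in probabilities:
        total -= sum(p * math.log(p) for p in row if p > 0)
    return total / len(probabilities)


probabilities = [[0.5, 0.5],   # maximally uncertain: entropy log(2)
                 [1.0, 0.0]]   # fully certain: entropy 0
value = mean_entropy(probabilities)
print(value)
```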

spear.utils.utils_jl.kl_divergence(probs_p, probs_q)[source]¶

Joint Learning utils: KL divergence of two probabilities, used in loss 6 in [MCK+20]

Parameters

probs_p – [num_instances, num_classes]
probs_q – [num_instances, num_classes]

Returns

a real value, the KL divergence of given probabilities
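A sketch of the KL divergence between two batches of distributions follows. The direction KL(p || q), the natural logarithm, and averaging over instances are assumptions made for illustration; `kl_sketch` is a hypothetical name, and q is assumed positive wherever p is positive.

```python
import math


def kl_sketch(probs_p, probs_q):
    """Average over rows of sum_j p_j * log(p_j / q_j)."""
    total = 0.0
    for p_row, q_row in zip(probs_p, probs_q):
        total += sum(p * math.log(p / q) for p, q in zip(p_row, q_row) if p > 0)
    return total / len(probs_p)


p = [[1.0, 0.0]]
q = [[0.5, 0.5]]
print(kl_sketch(p, q))  # divergence of a certain prediction from a uniform one
print(kl_sketch(q, q))  # identical distributions
```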

spear.utils.utils_jl.find_indices(data, data_sub)[source]¶

A helper function for subset selection

Parameters

data – the complete data, torch tensor of shape [num_instances, num_classes]
data_sub – the subset of ‘data’ whose indices are to be found. Should be of same shape as ‘data’

Returns

list of indices, to be found from the result of apricot library

spear.utils.utils_jl.get_similarity_kernel(preds)[source]¶

A helper function for subset selection

Parameters: preds – numpy.ndarray of shape (num_samples,)
Returns: numpy.ndarray of shape (num_samples, num_samples)

Feature-based Models¶

class spear.jl.models.models.LogisticRegression(*args: Any, **kwargs: Any)[source]¶

Class for Logistic Regression, used in Joint learning class/Algorithm

Parameters

input_size – number of features
output_size – number of classes

forward(x)[source]¶

class spear.jl.models.models.DeepNet(*args: Any, **kwargs: Any)[source]¶

Class for Deep neural network, used in Joint learning class/Algorithm

Parameters

input_size – number of features
hidden_size – number of nodes in each of the two hidden layers
output_size – number of classes

forward(x)[source]¶

Hls¶

Hls Checkmate¶

class spear.Implyloss.checkmate.BestCheckpointSaver(save_dir, num_to_keep=1, maximize=True, saver=None)[source]¶

Maintains a directory containing only the best n checkpoints

Inside the directory is a best_checkpoints JSON file containing a dictionary mapping of the best checkpoint filepaths to the values by which the checkpoints are compared. Only the best n checkpoints are contained in the directory and JSON file.

This is a light-weight wrapper class only intended to work in simple, non-distributed settings. It is not intended to work with the tf.Estimator framework.

handle(value, sess, global_step_tensor)[source]¶

Func Desc: Updates the set of best checkpoints based on the given result.

Input: value: The value by which to rank the checkpoint. sess: A tf.Session to use to save the checkpoint global_step_tensor: A tf.Tensor represent the global step

Output: True or False
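The bookkeeping behind "keep only the best n" can be sketched without TensorFlow. `BestValueTracker` and its `handle(value, path)` signature are hypothetical: the real class saves checkpoints through a tf.Session and persists the mapping to the best_checkpoints JSON file, neither of which is reproduced here.

```python
class BestValueTracker:
    """Keeps only the num_to_keep best (path, value) pairs, in memory."""

    def __init__(self, num_to_keep=1, maximize=True):
        self.num_to_keep = num_to_keep
        self.maximize = maximize
        self.best = {}  # checkpoint path -> value used for ranking

    def handle(self, value, path):
        """Record the candidate if it ranks among the best; return True if kept."""
        candidates = dict(self.best)
        candidates[path] = value
        ranked = sorted(candidates.items(), key=lambda kv: kv[1],
                        reverse=self.maximize)
        self.best = dict(ranked[: self.num_to_keep])
        return path in self.best


tracker = BestValueTracker(num_to_keep=2, maximize=True)
kept = [tracker.handle(0.5, "ckpt-1"),
        tracker.handle(0.7, "ckpt-2"),
        tracker.handle(0.6, "ckpt-3"),   # pushes out ckpt-1
        tracker.handle(0.1, "ckpt-4")]   # worse than both survivors
print(kept, tracker.best)
```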

spear.Implyloss.checkmate.get_best_checkpoint(best_checkpoint_dir, select_maximum_value=True)[source]¶

Func Desc: Reads the best_checkpoints file in the best_checkpoint_dir directory. Returns the filepath in the best_checkpoints file associated with the highest value if select_maximum_value is True, or the filepath associated with the lowest value if select_maximum_value is False.

Input: best_checkpoint_dir: Directory containing best_checkpoints JSON file select_maximum_value: If True, select the filepath associated with the highest value. Otherwise, select the filepath associated with the lowest value.

Output: The full path to the best checkpoint file
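The selection step can be sketched with the standard library alone. `pick_best` is a hypothetical re-implementation; it assumes the JSON keys are checkpoint names relative to the directory, which may differ from what the library actually stores.

```python
import json
import os
import tempfile


def pick_best(best_checkpoint_dir, select_maximum_value=True):
    """Read the best_checkpoints JSON file and return the winning path."""
    with open(os.path.join(best_checkpoint_dir, "best_checkpoints")) as fh:
        values = json.load(fh)  # assumed shape: {checkpoint name: value}
    chooser = max if select_maximum_value else min
    best_name = chooser(values, key=values.get)
    return os.path.join(best_checkpoint_dir, best_name)


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "best_checkpoints"), "w") as fh:
        json.dump({"model-10": 0.71, "model-20": 0.83}, fh)
    best_high = os.path.basename(pick_best(tmp))
    best_low = os.path.basename(pick_best(tmp, select_maximum_value=False))

print(best_high, best_low)
```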

Hls Checkpoints¶

spear.Implyloss.checkpoints.test_mru_checkpoints(num_to_keep)[source]¶

Func Desc: Runs different sessions while changing the checkpoint number that is currently being worked with and tests the same

Input: num_to_keep(int) - a limit on the size of the global step for checkpoint traversal

Output:

spear.Implyloss.checkpoints.test_checkpoint()[source]¶

Func Desc: tests whether the checkpoints stored are as expected

Input:

Output:

spear.Implyloss.checkpoints.test_best_ckpt()[source]¶

Func Desc: test for the best checkpoint so far

Input:

Output:

spear.Implyloss.checkpoints.test_checkmate()[source]¶

Func Desc: test whether the checkmate model is working fine

Input:

Output:

Hls Data Feeders¶

Hls Data Feeders Utils¶

spear.Implyloss.data_feeder_utils.change_values(l, user_class_to_num_map)[source]¶

Func Desc: Replace the class labels in l by sequential labels - 0,1,2,..

Input: l - the class label matrix user_class_to_num_map - dictionary storing mapping from original class labels to sequential labels

Output: l - with sequential labels
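The relabeling amounts to a dictionary lookup per entry. A minimal sketch on nested lists, assuming every value in l appears as a key in the map (`to_sequential` and the label values are illustrative):

```python
def to_sequential(l, user_class_to_num_map):
    """Replace each original class label by its sequential label."""
    return [[user_class_to_num_map[v] for v in row] for row in l]


user_class_to_num_map = {10: 0, 20: 1, 30: 2}  # hypothetical original labels
l = [[10, 30], [20, 20]]
l_seq = to_sequential(l, user_class_to_num_map)
print(l_seq)
```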

spear.Implyloss.data_feeder_utils.load_data(fname, jname, num_load=None)[source]¶

Func Desc: load the data from the given file

Input: fname - filename num_load (default - None)

Output: the structured F_d_U_Data

spear.Implyloss.data_feeder_utils.get_rule_classes(l, num_classes)[source]¶

Func Desc: get the different rule_classes

Input: l ([batch_size, num_rules]) num_classes (int) - the number of available classes

Output: rule_classes ([num_rules,1]) - the list of valid classes labelled by rules (say class 2 by r0, class 1 by r1, class 4 by r2 => [2,1,4])

spear.Implyloss.data_feeder_utils.extract_rules_satisfying_min_coverage(m, min_coverage)[source]¶

Func Desc: extract the rules that satisfy the specified minimum coverage

Input: m ([batch_size, num_rules]) - mij specifies whether ith example is associated with the jth rule min_coverage

Output: satisfying_rules - list of satisfying rules not_satisfying_rules - list of not satisfying rules rule_map_new_to_old rule_map_old_to_new
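One way to realise this filtering is sketched below. It assumes coverage means the number of instances a rule fires on and that a rule satisfies the threshold when its coverage is at least min_coverage; both are assumptions, and `split_rules_by_coverage` is a hypothetical name.

```python
def split_rules_by_coverage(m, min_coverage):
    """Split rule indices by column sums of m and build index maps."""
    num_rules = len(m[0])
    coverage = [sum(row[j] for row in m) for j in range(num_rules)]
    satisfying = [j for j in range(num_rules) if coverage[j] >= min_coverage]
    not_satisfying = [j for j in range(num_rules) if coverage[j] < min_coverage]
    # New indices are assigned to surviving rules in their original order.
    rule_map_new_to_old = dict(enumerate(satisfying))
    rule_map_old_to_new = {old: new for new, old in rule_map_new_to_old.items()}
    return satisfying, not_satisfying, rule_map_new_to_old, rule_map_old_to_new


m = [[1, 0, 1],
     [1, 0, 0],
     [1, 1, 1]]   # column sums: 3, 1, 2
result = split_rules_by_coverage(m, min_coverage=2)
print(result)
```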

spear.Implyloss.data_feeder_utils.remap_2d_array(arr, map_old_to_new)[source]¶

Func Desc: remap those columns of 2D array that are present in map_old_to_new

Input: arr ([batch_size, num_rules]) map_old_to_new

Output: modified array

spear.Implyloss.data_feeder_utils.remap_1d_array(arr, map_old_to_new)[source]¶

Func Desc: remap those positions of 1D array that are present in map_old_to_new

Input: arr ([batch_size, num_rules]) map_old_to_new

Output: modified array

spear.Implyloss.data_feeder_utils.modify_d_or_U_using_rule_map(raw_U_or_d, rule_map_old_to_new)[source]¶

Func Desc: Modify d or U using the rule map

Input: raw_U_or_d - the raw data (labelled(d) or unlabelled(U)) rule_map_old_to_new - the rule map

Output: the modified raw_U_or_d

spear.Implyloss.data_feeder_utils.shuffle_F_d_U_Data(data)[source]¶

Func Desc: shuffle the input data along the 0th axis i.e. among the different instances

Input: data

Output: the structured and shuffled F_d_U_Data

spear.Implyloss.data_feeder_utils.oversample_f_d(x, labels, sampling_dist)[source]¶

Func Desc: Oversample the labelled data using the arguments provided

Input: x ([batch_size, num_features]) - the data labels sampling_dist

spear.Implyloss.data_feeder_utils.oversample_d(raw_d, sampling_dist)[source]¶

Func Desc: performs oversampling on the raw labelled data using the given distribution

Input: raw_d - raw labelled data sampling_dist - the given sampling dist

Output: F_d_U_Data

Hls Gen Cross Entropy Utils¶

spear.Implyloss.gen_cross_entropy_utils.generalized_cross_entropy(logits, one_hot_labels, q=0.6)[source]¶

Func Desc: Computes the generalized cross entropy loss

Input: logits([batch_size, num_classes]) - weights one_hot_labels([batch_size, num_classes]) q (default = 0.6)

Output: loss
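For orientation, the commonly used form of generalized cross entropy is (1 - p^q) / q, where p is the softmax probability assigned to the true class. The sketch below implements that form on plain lists; averaging over the batch is an assumption, and `gce_loss` is a hypothetical name rather than the library function.

```python
import math


def softmax(row):
    """Numerically stable softmax of one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gce_loss(logits, one_hot_labels, q=0.6):
    """Mean of (1 - p_true**q) / q over the batch (reduction is assumed)."""
    losses = []
    for row, target in zip(logits, one_hot_labels):
        probs = softmax(row)
        p_true = sum(p * t for p, t in zip(probs, target))
        losses.append((1.0 - p_true ** q) / q)
    return sum(losses) / len(losses)


uncertain = gce_loss([[0.0, 0.0]], [[1, 0]])      # p_true = 0.5
confident = gce_loss([[20.0, -20.0]], [[1, 0]])   # p_true close to 1
print(uncertain, confident)
```

As q approaches 0 this form approaches the usual cross entropy, and at q = 1 it is linear in p, which is why intermediate values of q are used as a trade-off under label noise.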

spear.Implyloss.gen_cross_entropy_utils.generalized_cross_entropy_bernoulli(p, q=0.2)[source]¶

Func Desc: computes the bernoulli generalized cross entropy

Input: p - base q (default = 0.2) - exponent

Output: loss

Hls Model¶

Hls PR Utils¶

spear.Implyloss.pr_utils.exp_term_for_constraints(rule_classes, num_classes, C)[source]¶

Func Desc: Compute the exponential term for the constraints

Input: rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C

Output: the required exponential term

spear.Implyloss.pr_utils.pr_product_term(weights, rule_classes, num_classes, C)[source]¶

Func Desc: Compute the probability product term for the constraints

Input: weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C

Output: the required product term

spear.Implyloss.pr_utils.get_q_y_from_p(f_probs, weights, rule_classes, num_classes, C)[source]¶

Func Desc: Compute the q_y term from the p (f_network) distribution

Input: f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C

Output: the required q_y term

spear.Implyloss.pr_utils.get_q_r_from_p(f_probs, weights, rule_classes, num_classes, C)[source]¶

Func Desc: Compute the q_r term from the p (f_network) distribution

Input: f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C

Output: the required q_r term

spear.Implyloss.pr_utils.theta_term_in_pr_loss(f_logits, f_probs, weights, rule_classes, num_classes, C, d)[source]¶

Func Desc: Compute the theta term in the pr loss

Input: f_logits ([batch_size, num_classes]) f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C d ([batch_size,1])

Output: the required theta term (third term in equation 14) - used to supervise f (classification) network from instances in U

spear.Implyloss.pr_utils.phi_term_in_pr_loss(m, w_logits, f_probs, weights, rule_classes, num_classes, C, d)[source]¶

Func Desc: Compute the phi term in the pr loss

Input: m ([batch_size, num_rules]) - mij = 1 if ith example is associated with jth rule w_logits ([batch_size, num_rules]) f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C d ([batch_size,1])

Output: the required phi term (fourth term in equation 14) - used to supervise w (rule) network from instances in U

spear.Implyloss.pr_utils.pr_loss(m, f_logits, w_logits, f_probs, weights, rule_classes, num_classes, C, d)[source]¶

Func Desc: Compute the pr loss

Input: m ([batch_size, num_rules]) - mij = 1 if ith example is associated with jth rule f_logits w_logits ([batch_size, num_rules]) - logit before sigmoid activation in w network f_probs ([batch_size, num_classes]) - output of f network weights ([batch_size, num_rules]) - the sigmoid output from w network rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C - lambda in equation 10 (hyperparameter) d ([batch_size,1]) - if ith instance is from “d” set (labelled data) d[i] = 1, else if ith instance is from “U” set, d[i] = 0

Output: the required pr loss

Hls Test¶

class spear.Implyloss.test.HLSTest(hls)[source]¶

Class Desc: This Class is designed to test the HLS model and its accuracy and precision obtained on the validation and test datasets

maybe_save_predictions(save_filename, x, l, m, preds, d)[source]¶

Func Desc: Saves the predictions obtained from the model if required

Input: self save_filename - the filename where the predictions have to be saved if required x ([batch_size, num_features]) l ([batch_size, num_rules]) m ([batch_size, num_rules]) preds d ([batch_size,1]) - d[i] = 1 if the ith data instance is from the labelled dataset

Output:

test_f(datafeeder, log_output=False, data_type='test_f', save_filename=None, use_joint_f_w=False)[source]¶

Func Desc: tests the f_network (classification network)

Input: self datafeeder - the datafeeder object log_output (default - False) data_type (fixed to test_f) - the type of the data that we want to test save_filename (default - None) - the file where we can possibly store the test results use_joint_f_w (default - False)

Output: precision recall f1_score support

test_w(datafeeder, log_output=False, data_type='test_w', save_filename=None)[source]¶

Func Desc: tests the w_network (rule network)

Input: self datafeeder - the datafeeder object log_output (default - False) data_type (fixed to test_w) - the type of the data that we want to test save_filename (default - None) - the file where we can possibly store the test results

Analyzes: the obtained w_predictions

Hls Train¶

class spear.Implyloss.train.HLSTrain(hls, f_d_metrics_pickle, f_d_U_metrics_pickle, f_d_adam_lr, f_d_U_adam_lr, early_stopping_p, f_d_primary_metric, mode, data_dir)[source]¶

Class Desc: This Class is designed to train the HLS model using the Implyloss Algorithm

make_f_summary_ops()[source]¶

Func Desc: make the summary of all the essential parameters of f_network

Input: Self

Summarizes: f_d_loss_ph f_d_loss f_d_f1_score_ph f_d_f1_score f_d_accuracy_ph f_d_accuracy f_d_avg_f1_score_ph f_d_avg_f1_score f_d_summaries

report_f_d_perfs_to_tensorboard(f_d_loss, metrics_dict, global_step)[source]¶

Func Desc: report the f_d_performance to tensorboard

Input: self f_d_loss metrics_dict global_step

Output:

train_f_on_d(datafeeder, num_epochs)[source]¶

Func Desc: trains the f_network (classification network) on labelled data

Input: self datafeeder - datafeeder object num_epochs - number of epochs for training

Output:

train_f_on_d_U(datafeeder, num_epochs, loss_type)[source]¶

Func Desc: trains the f_network (classification network) on labelled and unlabelled data

Input: self datafeeder - datafeeder object num_epochs - number of epochs for training loss_type - different available losses

Output:

init_metrics()[source]¶

Func desc: initialize the metrics

Input: self

Output:

get_metric(run_type, metrics_dict)[source]¶

Func desc: get the metrics

Input: self run_type metrics_dict

Output: the required metrics_dict

save_metrics(run_type, metrics_dict)[source]¶

Func desc: save the metrics

Input: self run_type metrics_dict

Prints: The saved metric file

maybe_save_metrics_dict(run_type, metrics_dict)[source]¶

Func desc: save the metric if it is the best till now

Input: self run_type metrics_dict

Output: True or False denoting whether the current metric is saved or not

Prints: The saved metric file

compute_f_d_metrics(metrics_dict, precision, recall, f1_score, support, global_epoch, f_d_global_step)[source]¶

Func desc: compute the f_d metrics

input: self metrics_dict precision recall f1_score support global_epoch f_d_global_step

output: void

evaluates: metrics_dict, accuracy

Hls Utils¶

spear.Implyloss.utils.get_data(path)[source]¶

func desc: takes the pickle file and arranges it in a matrix list form so as to set the member variables accordingly

expected order in pickle file is NUMPY arrays x, l, m, L, d, r, s, n, k

x: [num_instances, num_features]
l: [num_instances, num_rules]
m: [num_instances, num_rules]
L: [num_instances, 1]
d: [num_instances, 1]
r: [num_instances, num_rules]
s: [num_instances, num_rules]
n: [num_rules] Mask for s
k: [num_rules] LF classes, range 0 to num_classes-1

spear.Implyloss.utils.analyze_w_predictions(x, l, m, L, d, weights, probs, rule_classes)[source]¶

func desc: analyze the rule network by computing the precisions of the rules and comparing old and new rule stats

input: x: [num_instances, num_features] l: [num_instances, num_rules] m: [num_instances, num_rules] L: [num_instances, 1] d: [num_instances, 1] weights: [num_instances, num_rules] probs: [num_instances, num_classes] rule_classes: [num_rules,1]

output: void, prints the required statistics

spear.Implyloss.utils.convert_weights_to_m(weights)[source]¶

func desc: converts weights to m

input: weights([batch_size, num_rules]) - the weights matrix corresponding to rule network(w_network) in the algorithm

output: m([batch_size, num_rules]) - the rule coverage matrix where m_ij = 1 if jth rule covers ith instance

spear.Implyloss.utils.convert_m_to_l(m, rule_classes, num_classes)[source]¶

func desc: converts m to l

input: m([batch_size, num_rules]) - the rule coverage matrix where m_ij = 1 if jth rule covers ith instance rule_classes - num_classes(non_negative integer) - number of available classes

output: l([batch_size, num_rules]) - labels assigned by the rules

spear.Implyloss.utils.get_rule_precision(l, L, m)[source]¶

func desc: get the precision of the rules

input: l([batch_size, num_rules]) - labels assigned by the rules L([batch_size, 1]) - L_i = 1 if the ith instance has already a label assigned to it in the dataset m([batch_size, num_rules]) - the rule coverage matrix where m_ij = 1 if jth rule covers ith instance

output: micro_p - macro_p - comp -

spear.Implyloss.utils.merge_dict_a_into_b(a, b)[source]¶

func desc: set the dict values of b to that of a

input: a, b : dicts

output: void

spear.Implyloss.utils.print_tf_global_variables()[source]¶

Func Desc: prints all the global variables

Input:

Output:

spear.Implyloss.utils.print_var_list(var_list)[source]¶

Func Desc: Prints the given variable list

Input: var_list

Output:

spear.Implyloss.utils.pretty_print(data_structure)[source]¶

Func Desc: prints the given data structure in the desired format

Input: data_structure

Output:

spear.Implyloss.utils.get_list_or_None(s, dtype=<class 'int'>)[source]¶

Func Desc: Returns the list of types of the variables in the string s

Input: s - string dtype function (default - int)

Output: None or list

spear.Implyloss.utils.get_list(s)[source]¶

Func Desc: returns the output of get_list_or_None as a list

Input: s - list

Output: lst - list

spear.Implyloss.utils.None_if_zero(n)[source]¶

Func Desc: the max(0,n) function, with None if n<=0

Input: n - integer

Output: if n>0 then n else None
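The stated behaviour fits in one line. `none_if_not_positive` is a hypothetical stand-in written from the description above:

```python
def none_if_not_positive(n):
    """Return n when it is positive, otherwise None."""
    return n if n > 0 else None


print(none_if_not_positive(5), none_if_not_positive(0), none_if_not_positive(-3))
```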

spear.Implyloss.utils.boolean(s)[source]¶

Func Desc: returns the expected boolean value for the given string

Input: s - string

Output: boolean or error

spear.Implyloss.utils.set_to_list_of_values_if_None_or_empty(lst, val, num_vals)[source]¶

Func Desc: returns lst if it is not empty else returns a same length list but with all its entries equal to val

Input: lst - list val - value num_vals (integer) - length of the list lst

Output: lst or same length val list

spear.Implyloss.utils.conv_l_to_lsnork(l, m)[source]¶

func desc: converts l in our format to snorkel’s format. In snorkel’s convention, a rule that does not cover an instance assigns it the label -1; we follow the convention of assigning the label num_classes instead of -1. Valid class labels range over {0,1,…num_classes-1}

input: l([batch_size, num_rules]) - rule label matrix m([batch_size, num_rules]) - rule coverage matrix

output: lsnork([batch_size, num_rules])
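The conversion between the two abstain conventions can be sketched on nested lists. `to_snorkel_labels` is a hypothetical name, and using m (rather than comparing l against num_classes) to decide which entries become -1 is an assumption based on the function's arguments.

```python
def to_snorkel_labels(l, m):
    """Keep the label where the rule covers the instance, else use -1."""
    return [[label if covered else -1 for label, covered in zip(l_row, m_row)]
            for l_row, m_row in zip(l, m)]


num_classes = 3
l = [[0, 3],
     [3, 2]]   # 3 == num_classes marks "rule does not cover the instance"
m = [[1, 0],
     [0, 1]]
lsnork = to_snorkel_labels(l, m)
print(lsnork)
```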

spear.Implyloss.utils.compute_accuracy(support, recall)[source]¶

func desc: compute the required accuracy

input: support recall

output: accuracy
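Accuracy can be recovered from per-class support and recall because support times recall is the number of correctly classified instances of each class. A sketch under that reading (the library's exact computation is not shown in the text, and `accuracy_from_recall` is a hypothetical name):

```python
def accuracy_from_recall(support, recall):
    """Overall accuracy = sum_c support[c] * recall[c] / sum_c support[c]."""
    correct = sum(s * r for s, r in zip(support, recall))
    return correct / sum(support)


support = [10, 30]     # number of true instances per class
recall = [0.5, 1.0]    # per-class recall
accuracy = accuracy_from_recall(support, recall)
print(accuracy)        # (5 + 30) / 40
```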

spear.Implyloss.utils.dump_labels_to_file(save_filename, x, l, m, L, d, weights=None, f_d_U_probs=None, rule_classes=None)[source]¶

Func Desc: dumps the given data into a pickle file

Input: save_filename - the name of the pickle file in which the arguments/data is required to be saved x ([batch_size x num_features]) l ([batch_size x num_rules]) m ([batch_size x num_rules]) L ([batch_size x 1]) d ([batch_size x 1]) weights (default - None) f_d_U_probs (default - None) rule_classes (default - None)

Output:

spear.Implyloss.utils.load_from_pickle_with_per_class_sampling_factor(fname, per_class_sampling_factor)[source]¶

Func Desc: load the data from the given pickle file with per class sampling factor

Input: fname - name of the pickle file from which data need to be loaded per_class_sampling_factor

Output: the required matrices x1 ([batch_size x num_features]) l1 ([batch_size x num_rules]) m1 ([batch_size x num_rules]) L1 ([batch_size x 1]) d1 ([batch_size x 1])

spear.Implyloss.utils.combine_d_covered_U_pickles(d_name, infer_U_name, out_name, d_sampling_factor, U_sampling_factor)[source]¶

Func Desc: combine the labelled and unlabelled data, merge the corresponding parameters together and store them in new file

Input: d_name - the pickle file storing labelled data infer_U_name - the pickle file storing unlabelled data out_name - the name of the file where merged output needs to be stored d_sampling_factor - the per_class_sampling_factor for labelled data U_sampling_factor - the per_class_sampling_factor for unlabelled data

Output:

spear.Implyloss.utils.updated_theta_copy(grads, variables, lr, mode)[source]¶

Func Desc: updates the theta (parameters) using the given learning rate, grads and variables

Input: grads - gradients variables lr - learning rate mode

Output: vals - list of the updated gradients

Bibliography¶

CRS20(1,2,3,4,5,6,7): Oishik Chatterjee, Ganesh Ramakrishnan, and Sunita Sarawagi. Robust data programming with precision-guided labeling functions. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):3397–3404, Apr. 2020. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5742, doi:10.1609/aaai.v34i04.5742.
MCK+20(1,2,3,4,5,6): Ayush Maheshwari, Oishik Chatterjee, KrishnaTeja Killamsetty, Rishabh K. Iyer, and Ganesh Ramakrishnan. Data programming using semi-supervision and subset selection. CoRR, 2020. URL: https://arxiv.org/abs/2008.09887, arXiv:2008.09887.
RBE+20: Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: rapid training data creation with weak supervision. The VLDB Journal, 29, May 2020. doi:10.1007/s00778-019-00552-1.