Welcome to SPEAR’s documentation!

SPEAR: Semi-Supervised Data Programming

We present SPEAR, an open-source Python library for data programming with semi-supervision. The package implements several recent data programming approaches, including facilities to programmatically label and build training data. SPEAR facilitates weak supervision in the form of rules/heuristics and associates ‘noisy’ labels (or prelabels) with the training dataset. These noisy labels are aggregated to assign labels to the unlabeled data for downstream tasks. Several label aggregation approaches have been proposed: some aggregate the noisy labels and then train on the ‘noisily’ labeled set in a cascaded manner, while others ‘jointly’ aggregate and train the model. In the Python package, we integrate several cascade and joint data-programming approaches while also providing the facility to define rules. The code and tutorial notebooks are available here.
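To make the idea of rules emitting noisy labels that are then aggregated more concrete, here is a minimal sketch in plain Python. It does not use SPEAR itself: the rule functions, the `ABSTAIN` value of -1, and the majority-vote aggregation are assumptions chosen for illustration, and the library's aggregation approaches (such as CAGE) are more sophisticated than a majority vote.

```python
from collections import Counter

ABSTAIN = -1      # assumed abstain value for this sketch
SPAM, HAM = 1, 0  # hypothetical class values


def lf_contains_free(text):
    # Rule: messages mentioning "free" are likely spam.
    return SPAM if "free" in text.lower() else ABSTAIN


def lf_contains_meeting(text):
    # Rule: messages mentioning "meeting" are likely legitimate.
    return HAM if "meeting" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    # Rule: two or more exclamation marks suggest spam.
    return SPAM if text.count("!") >= 2 else ABSTAIN


def majority_vote(votes):
    # Ignore abstains; return ABSTAIN when no rule fired or the vote is tied.
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


rules = [lf_contains_free, lf_contains_meeting, lf_many_exclamations]
texts = ["Free prize!! Click now", "Meeting moved to 3pm", "See you soon"]

# One row of noisy labels per data point, one column per rule.
noisy = [[rule(t) for rule in rules] for t in texts]
aggregated = [majority_vote(row) for row in noisy]
```

In a cascaded setup, the aggregated labels would then be used to train a downstream model; a joint approach would instead learn the aggregation and the model together.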

Labeling

This module takes inspiration from and builds upon Ratner et al. [RBE+20]

LF

class spear.labeling.lf.core.LabelingFunction(name: str, f: Callable[..., int], label=None, resources: Optional[Mapping[str, Any]] = None, pre: Optional[List[spear.labeling.preprocess.core.BasePreprocessor]] = None, cont_scorer: Optional[spear.labeling.continuous_scoring.core.BaseContinuousScorer] = None)[source]

Base class for labeling function

Parameters
  • name (str) – name for this LF object

  • f (Callable[..., int]) – core function which labels the input

  • label (enum) – Which class this LF corresponds to

  • resources (Optional[Mapping[str, Any]], optional) – Additional resources for core function. Defaults to None.

  • pre (Optional[List[BasePreprocessor]], optional) – Preprocessors to apply on input before labeling. Defaults to None.

  • cont_scorer (Optional[BaseContinuousScorer], optional) – Continuous Scorer to calculate the confidence score. Defaults to None.

class spear.labeling.lf.core.labeling_function(name: Optional[str] = None, label=None, resources: Optional[Mapping[str, Any]] = None, pre: Optional[List[spear.labeling.preprocess.core.BasePreprocessor]] = None, cont_scorer: Optional[spear.labeling.continuous_scoring.core.BaseContinuousScorer] = None)[source]

Decorator class for a labeling function

Parameters
  • name (Optional[str], optional) – Name for this labeling function. Defaults to None.

  • label (Optional[Enum], optional) – An enum. Which class this LF corresponds to. Defaults to None.

  • resources (Optional[Mapping[str, Any]], optional) – Additional resources for the LF. Defaults to None.

  • pre (Optional[List[BasePreprocessor]], optional) – Preprocessors to apply on input before labeling . Defaults to None.

  • cont_scorer (Optional[BaseContinuousScorer], optional) – Continuous Scorer to calculate the confidence score. Defaults to None.

Raises

ValueError – If the decorator is missing parentheses
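The ValueError above concerns writing the decorator without parentheses. The following sketch shows one common way a decorator class can detect that mistake. The class `labeled_rule` and the attributes it sets are hypothetical and are not SPEAR's implementation; they only illustrate the mechanism.

```python
from typing import Callable, Optional


class labeled_rule:
    """Hypothetical decorator class used only for this illustration."""

    def __init__(self, name: Optional[str] = None, label=None):
        # When written as `@labeled_rule` (no parentheses), Python passes the
        # decorated function itself as the first positional argument.
        if callable(name):
            raise ValueError("Looks like this decorator is missing parentheses!")
        self.name = name
        self.label = label

    def __call__(self, f: Callable[..., int]) -> Callable[..., int]:
        # Attach metadata to the function and return it unchanged.
        f.rule_name = self.name or f.__name__
        f.rule_label = self.label
        return f


@labeled_rule(name="has_digit", label=1)
def has_digit(x):
    return 1 if any(ch.isdigit() for ch in x) else -1


try:
    @labeled_rule  # parentheses intentionally omitted
    def broken(x):
        return -1
    missing_parens_detected = False
except ValueError:
    missing_parens_detected = True
```

Checking `callable(name)` works because a correctly written call passes a string or None as the name, never a function.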

Continuous scoring

class spear.labeling.continuous_scoring.core.BaseContinuousScorer(name: str, cf: Callable[..., int], resources: Optional[Mapping[str, Any]] = None)[source]

Base Class for Continuous Scoring function used by the Labeling Function

Parameters
  • name (str) – Name of the continuous scoring function

  • cf (Callable[..., int]) – Core function which calculates continuous score

  • resources (Optional[Mapping[str, Any]], optional) – Resources for the scorer. Defaults to None.

class spear.labeling.continuous_scoring.core.continuous_scorer(name: Optional[str] = None, resources: Optional[Mapping[str, Any]] = None)[source]

Decorator class for continuous scoring.

Parameters
  • name (Optional[str], optional) – Name for the decorator. Defaults to None.

  • resources (Optional[Mapping[str, Any]], optional) – Resources for the scorer. Defaults to None.

Raises

ValueError – If decorator is missing parentheses.

LFApply

class spear.labeling.apply.core.ApplierMetadata(faults: Dict[str, int])[source]

Metadata about Applier call.

property faults

Alias for field number 0

class spear.labeling.apply.core.BaseLFApplier(lf_set: spear.labeling.lf_set.core.LFSet)[source]

Base class for LF applier objects, which execute a set of LFs on a collection of data points. Subclasses should operate on a single data point collection format (e.g. DataFrame). Subclasses must implement the apply method.

Parameters

lf_set (LFSet) – Instance of LFSet holding the set of labeling functions to be applied to the data

Raises

ValueError – If names of LFs are not unique

spear.labeling.apply.core.apply_lfs_to_data_point(x: Any, index: int, lfs: List[spear.labeling.lf.core.LabelingFunction], f_caller: spear.labeling.apply.core._FunctionCaller)List[Tuple[int, int, int, float]][source]

Label a single data point with a set of LFs

Parameters
  • x (DataPoint) – Data point to label

  • index (int) – Index of the data point

  • lfs (List[LabelingFunction]) – List of LFs to label x with

  • f_caller (_FunctionCaller) – A _FunctionCaller to record failed LF executions

Returns

A list of (data point index, LF index, label enum, confidence) tuples

Return type

RowData
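As an illustration of the tuple format described above, the sketch below collects (data point index, LF index, label, confidence) tuples into a label matrix and a confidence matrix with NumPy. The helper name `rows_to_matrices`, the abstain value of -1, and the zero default confidence are assumptions of this sketch; SPEAR's internal handling may differ.

```python
import numpy as np

ABSTAIN = -1  # assumed abstain value for this sketch


def rows_to_matrices(row_data, n_points, n_lfs):
    # Start with "every LF abstains" and zero confidence everywhere.
    labels = np.full((n_points, n_lfs), ABSTAIN, dtype=int)
    scores = np.zeros((n_points, n_lfs), dtype=float)
    for point_idx, lf_idx, label, confidence in row_data:
        labels[point_idx, lf_idx] = label
        scores[point_idx, lf_idx] = confidence
    return labels, scores


# Hypothetical row data for two data points and two LFs.
row_data = [(0, 0, 1, 0.9), (0, 1, 0, 0.4), (1, 1, 0, 0.7)]
labels, scores = rows_to_matrices(row_data, n_points=2, n_lfs=2)
```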

class spear.labeling.apply.core.LFApplier(lf_set: spear.labeling.lf_set.core.LFSet)[source]

LF applier for a list of data points (e.g. SimpleNamespace) or a NumPy array.

Parameters

lf_set (LFSet) – Instance of LFSet holding the set of labeling functions to be applied to the data

apply(data_points: Union[Sequence[Any], numpy.ndarray], progress_bar: bool = True, fault_tolerant: bool = False, return_meta: bool = False)Union[numpy.ndarray, Tuple[numpy.ndarray, spear.labeling.apply.core.ApplierMetadata]][source]

Label list of data points or a NumPy array with LFs.

Parameters
  • data_points (Union[DataPoints, np.ndarray]) – List of data points or NumPy array to be labeled by LFs

  • progress_bar (bool, optional) – Whether to display a progress bar. Defaults to True.

  • fault_tolerant (bool, optional) – Whether to output -1 if LF execution fails. Defaults to False.

  • return_meta (bool, optional) – Whether to return metadata from the apply call. Defaults to False.

Returns

np.ndarray:

Matrix of labels emitted by LFs

ApplierMetadata:

Metadata, such as fault counts, for the apply call

Return type

Union[np.ndarray, Tuple[np.ndarray, ApplierMetadata]]
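The fault_tolerant and return_meta behaviour described above can be pictured with the following self-contained sketch. It is not SPEAR's LFApplier: the function `apply_rules`, the `Meta` namedtuple, and the (name, function) rule pairs are hypothetical stand-ins that only mirror the documented idea of emitting -1 and counting faults per LF.

```python
from collections import Counter, namedtuple

import numpy as np

# Hypothetical stand-in with a single `faults` field, as documented above.
Meta = namedtuple("Meta", ["faults"])


def apply_rules(rules, data_points, fault_tolerant=False):
    faults = Counter()
    # Pre-fill with -1 so a failed rule leaves an abstain in place.
    L = np.full((len(data_points), len(rules)), -1, dtype=int)
    for i, x in enumerate(data_points):
        for j, (name, rule) in enumerate(rules):
            try:
                L[i, j] = rule(x)
            except Exception:
                if not fault_tolerant:
                    raise
                faults[name] += 1
    return L, Meta(faults=dict(faults))


rules = [
    ("is_long", lambda x: 1 if len(x) > 5 else -1),
    # Raises IndexError on the empty string.
    ("starts_upper", lambda x: 0 if x[0].isupper() else -1),
]
L, meta = apply_rules(rules, ["Hello world", "", "tiny"], fault_tolerant=True)
```

With fault_tolerant=False, the same call would propagate the IndexError instead of recording it.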

LFSet

class spear.labeling.lf_set.core.LFSet(name: str, lfs: List[spear.labeling.lf.core.LabelingFunction] = [])[source]

Class for Set of Labeling Functions

Parameters
  • name (str) – Name for this LFset.

  • lfs (List[LabelingFunction], optional) – List of LFs to add to this object. Defaults to [].

get_lfs()Set[spear.labeling.lf.core.LabelingFunction][source]

Returns LFs contained in this LFSet object

Returns

LFs in this LFSet

Return type

Set[LabelingFunction]

add_lf(lf: spear.labeling.lf.core.LabelingFunction)None[source]

Adds single LF to this LFSet

Parameters

lf (LabelingFunction) – LF to add

add_lf_list(lf_list: List[spear.labeling.lf.core.LabelingFunction])None[source]

Adds a list of LFs to this LFSet

Parameters

lf_list (List[LabelingFunction]) – List of LFs to add to this LFSet

remove_lf(lf: spear.labeling.lf.core.LabelingFunction)None[source]

Removes a LF from this set

Parameters

lf (LabelingFunction) – LF to remove from this set

Raises

Warning – If LF not already in LFset

LFAnalysis

class spear.labeling.analysis.core.LFAnalysis(enum, L: numpy.ndarray, rules=None)[source]

Run analysis on LFs using label matrix.

Parameters
  • L (np.ndarray) – Label matrix where L_{i,j} is the label given by the jth LF to the ith x instance

  • lfs (Optional[List[LabelingFunction]], optional) – Labeling functions used to generate L. Defaults to None.

  • abstain (int, optional) – label associated with abstain. Defaults to -1.

Raises

ValueError – If number of LFs and number of LF matrix columns differ

label_coverage()float[source]

Compute the fraction of data points with at least one label.

Returns

Fraction of data points with labels

Return type

float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_coverage()
0.8
label_overlap()float[source]

Compute the fraction of data points with at least two (non-abstain) labels.

Returns

Fraction of data points with overlapping labels

Return type

float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_overlap()
0.6
label_conflict()float[source]

Compute the fraction of data points with conflicting (non-abstain) labels.

Returns

Fraction of data points with conflicting labels

Return type

float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_conflict()
0.2
lf_polarities()List[List[int]][source]

Infer the polarities of each LF based on evidence in a label matrix.

Returns

Unique output labels for each LF

Return type

List[List[int]]

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_polarities()
[[0, 1], [0], [0]]
lf_coverages()numpy.ndarray[source]

Compute the fraction of examples each LF labels.

Returns

Fraction of labeled examples for each LF

Return type

np.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_coverages()
array([0.4, 0.8, 0.4])
lf_overlaps(normalize_by_coverage: bool = False)numpy.ndarray[source]

Compute the fraction of examples each LF labels that are also labeled by another LF. An overlapping example is one that at least one other LF returns a (non-abstain) label for. Note that the maximum possible overlap fraction for an LF is the LF’s coverage, unless normalize_by_coverage=True, in which case it is 1.

Parameters

normalize_by_coverage (bool, optional) – Normalize by coverage of the LF, so that it returns the percent of LF labels that have overlaps. Defaults to False.

Returns

Fraction of overlapping examples for each LF

Return type

np.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_overlaps()
array([0.4, 0.6, 0.4])
>>> LFAnalysis(L).lf_overlaps(normalize_by_coverage=True)
array([1.  , 0.75, 1.  ])
lf_conflicts(normalize_by_overlaps: bool = False)numpy.ndarray[source]

Compute the fraction of examples each LF labels that are labeled differently by another LF. A conflicting example is one that at least one other LF returns a different (non-abstain) label for. Note that the maximum possible conflict fraction for an LF is the LF’s overlaps fraction, unless normalize_by_overlaps=True, in which case it is 1.

Parameters

normalize_by_overlaps (bool, optional) – Normalize by overlaps of the LF, so that it returns the percent of LF overlaps that have conflicts. Defaults to False.

Returns

Fraction of conflicting examples for each LF

Return type

np.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_conflicts()
array([0.2, 0.2, 0. ])
>>> LFAnalysis(L).lf_conflicts(normalize_by_overlaps=True)
array([0.5       , 0.33333333, 0.        ])
lf_empirical_accuracies(Y: numpy.ndarray)numpy.ndarray[source]

Compute empirical accuracy against a set of labels Y for each LF. Usually, Y represents development set labels.

Parameters

Y (np.ndarray) – [n] np.ndarray of gold labels

Returns

Empirical accuracies for each LF

Return type

np.ndarray
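The following NumPy sketch shows one plausible way to compute per-LF empirical accuracy from a label matrix and gold labels. It assumes that accuracy is measured only over the examples an LF actually labels and that -1 means abstain; whether the library uses exactly this convention is not stated here. The function name and the gold labels `Y` are made up for the example.

```python
import numpy as np


def empirical_accuracies(L, Y, abstain=-1):
    L = np.asarray(L)
    Y = np.asarray(Y)
    voted = L != abstain                 # where each LF emitted a label
    correct = (L == Y[:, None]) & voted  # where that label matches gold
    n_voted = voted.sum(axis=0)
    # LFs that never vote get accuracy 0 (a convention of this sketch).
    return np.divide(
        correct.sum(axis=0),
        n_voted,
        out=np.zeros(L.shape[1]),
        where=n_voted > 0,
    )


L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])
Y = np.array([0, 0, 1, 0, 0])  # hypothetical gold labels
acc = empirical_accuracies(L, Y)
```

Here the second LF labels four examples and gets three of them right, so its accuracy is 0.75, while the other two LFs are correct on every example they label.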

lf_summary(Y: Optional[numpy.ndarray] = None, plot: Optional[bool] = False)pandas.DataFrame[source]

Create a pandas DataFrame with the various per-LF statistics.

Parameters
  • Y (Optional[np.ndarray], optional) – [n] np.ndarray of gold labels. If provided, the empirical accuracy for each LF will be calculated. Defaults to None.

  • plot (Optional[bool], optional) – If set to true a bar graph is plotted. Defaults to False.

Returns

Summary statistics for each LF

Return type

DataFrame

Pre Labels

class spear.labeling.prelabels.core.PreLabels(name: str, data: Sequence[Any], rules: spear.labeling.lf_set.core.LFSet, num_classes: int, labels_enum, data_feats: Optional[Sequence[Any]] = numpy.array, gold_labels: Optional[Sequence[Any]] = numpy.array, exemplars: Sequence[Any] = numpy.array)[source]

Generate noisy labels and continuous scores from LFs applied on data

Parameters
  • name (str) – Name for this object.

  • data (DataPoints) – Datapoints.

  • gold_labels (Optional[DataPoints]) – Labels for datapoints if available.

  • rules (LFSet) – Set of Rules to generate noisy labels for the dataset.

  • exemplars (DataPoints) – [description]

get_labels()[source]

Applies LFs to the dataset to generate noisy labels and returns noisy labels and confidence scores

Returns

Noisy Labels and Confidences

Return type

Tuple(DataPoints, DataPoints)

analyse_lfs(plot=False)[source]

Analyse the lfs in LFSet on data

Parameters

plot (bool, optional) – Plot the values. Defaults to False.

Returns

DataFrame consisting of Polarity, Coverage, Overlap, Conflicts, Empirical Acc

Return type

DataFrame

generate_json(filename=None)[source]

Generates a json file with label value to label name mapping

Parameters

filename (str, optional) – Name for json file. Defaults to None.
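As a rough picture of a label value to label name mapping stored as JSON, consider the sketch below. The enum `ClassLabels`, the helper `write_label_map`, and the exact file layout are assumptions; the format SPEAR actually writes may differ.

```python
import enum
import json
import os
import tempfile


class ClassLabels(enum.Enum):
    # Hypothetical label enum for this sketch.
    SPAM = 1
    HAM = 0


def write_label_map(labels_enum, filename):
    # Map each numeric label value to its class name.
    mapping = {member.value: member.name for member in labels_enum}
    with open(filename, "w") as fh:
        json.dump(mapping, fh)
    return mapping


path = os.path.join(tempfile.mkdtemp(), "labels.json")
write_label_map(ClassLabels, path)

with open(path) as fh:
    # JSON object keys are always strings, so 1 comes back as "1".
    loaded = json.load(fh)
```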

generate_pickle(filename=None)[source]

Generates a pickle file with noisy labels, confidence and other Metadata

Parameters

filename (str, optional) – Name for pickle file. Defaults to None.


CAGE

Chatterjee et al. [CRS20]

class spear.cage.core.Cage(path_json, n_lfs)[source]
Cage class:

Class for Data Programming using CAGE [Note: from here on, graphical model(gm) and CAGE algorithm terms are used interchangeably]

Parameters
  • path_json – Path to json file consisting of number to string(class name) map

  • n_lfs – number of labelling functions used to generate pickle files

save_params(save_path)[source]

member function to save parameters of Cage

Parameters

save_path – path to pickle file to save parameters

load_params(load_path)[source]

member function to load parameters to Cage

Parameters

load_path – path to pickle file to load parameters

fit_and_predict_proba(path_pkl, path_test=None, path_log=None, qt=0.9, qc=0.85, metric_avg=['binary'], n_epochs=100, lr=0.01)[source]
Parameters
  • path_pkl – Path to pickle file of input data in standard format

  • path_test – Path to the pickle file containing test data in standard format

  • path_log – Path to log file. No log is produced if path_test is None. Default is None, in which case accuracies/f1_scores are printed to the terminal

  • qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

  • metric_avg – List of average metric to be used in calculating f1_score, default is [‘binary’]. Use None for not calculating f1_score

  • n_epochs – Number of epochs, default is 100

  • lr – Learning rate for torch.optim, default is 0.01

Returns

numpy.ndarray of shape (num_instances, num_classes) where i,j-th element is the probability of ith instance being the jth class(the jth value when sorted in ascending order of values in Enum)
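Since qt and qc accept either a float or an array of shape (n_lfs,), a caller may want to validate them before training. The helper below is a hypothetical sketch of such a check, not part of SPEAR; it broadcasts a scalar to one value per LF and enforces the documented range of 0 to 1.

```python
import numpy as np


def as_quality_vector(q, n_lfs):
    arr = np.asarray(q, dtype=float)
    if arr.ndim == 0:
        # A single float applies to every labeling function.
        arr = np.full(n_lfs, float(arr))
    if arr.shape != (n_lfs,):
        raise ValueError(f"expected shape ({n_lfs},), got {arr.shape}")
    if np.any(arr < 0) or np.any(arr > 1):
        raise ValueError("values must be between 0 and 1")
    return arr


qt = as_quality_vector(0.9, n_lfs=3)
qc = as_quality_vector([0.85, 0.8, 0.9], n_lfs=3)
```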

fit_and_predict(path_pkl, path_test=None, path_log=None, qt=0.9, qc=0.85, metric_avg=['binary'], n_epochs=100, lr=0.01, need_strings=False)[source]
Parameters
  • path_pkl – Path to pickle file of input data in standard format

  • path_test – Path to the pickle file containing test data in standard format

  • path_log – Path to log file. No log is produced if path_test is None. Default is None, in which case accuracies/f1_scores are printed to the terminal

  • qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

  • metric_avg – List of average metric to be used in calculating f1_score, default is [‘binary’]

  • n_epochs – Number of epochs, default is 100

  • lr – Learning rate for torch.optim, default is 0.01

  • need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

numpy.ndarray of shape (num_instances,) containing the aggregated/predicted labels. Elements are strings if need_strings is True and numbers otherwise.
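The relationship between the probability output and the label output can be sketched as follows: column j of the probability matrix corresponds to the jth class value in ascending order, and need_strings switches between class values and class names. The enum and the helper `proba_to_labels` are hypothetical and only illustrate this convention; they are not SPEAR code.

```python
import enum

import numpy as np


class ClassLabels(enum.Enum):
    # Hypothetical label enum for this sketch.
    HAM = 0
    SPAM = 1


def proba_to_labels(proba, labels_enum, need_strings=False):
    # Column j corresponds to the jth enum value in ascending order.
    members = sorted(labels_enum, key=lambda m: m.value)
    idx = np.argmax(proba, axis=1)
    if need_strings:
        return np.array([members[i].name for i in idx])
    return np.array([members[i].value for i in idx])


proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
as_values = proba_to_labels(proba, ClassLabels)
as_names = proba_to_labels(proba, ClassLabels, need_strings=True)
```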

predict_proba(path_test, qc=0.85)[source]

Used to predict labels based on a pickle file with path path_test

Parameters
  • path_test – Path to the pickle file containing test data set in standard format

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of shape (num_instances, num_classes) where the i,j-th element is the probability of the ith instance being the jth class (the jth value when sorted in ascending order of values in Enum) [Note: no aggregation/algorithm-running will be done using the current input]

predict(path_test, qc=0.85, need_strings=False)[source]

Used to predict labels based on a pickle file with path path_test

Parameters
  • path_test – Path to the pickle file containing test data set in standard format

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

  • need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

numpy.ndarray of shape (num_instances,) containing the predicted labels. Elements are strings if need_strings is True and numbers otherwise. [Note: no aggregation/algorithm-running will be done using the current input]


Joint Learning(JL)

Maheshwari et al. [MCK+20]

From here on, feature model (fm) refers to the feature-based classification model

class spear.jl.core.JL(path_json, n_lfs, n_features, feature_model='nn', n_hidden=512)[source]
Joint_Learning class:

[Note: from here on, feature model(fm) and feature-based classification model are used interchangeably. graphical model(gm) and CAGE algorithm terms are used interchangeably]

Loss functions (the numbers are useful for loss_func_mask in the fit_and_predict_proba and fit_and_predict functions):

Loss function number | Calculated over | Loss function
1 | L | Cross Entropy(prob_from_feature_model, true_labels)
2 | U | Entropy(prob_from_feature_model)
3 | U | Cross Entropy(prob_from_feature_model, prob_from_graphical_model)
4 | L | Negative Log Likelihood
5 | U | Negative Log Likelihood(marginalised over true labels)
6 | L and U | KL Divergence(prob_feature_model, prob_graphical_model)
7 | _ | Quality guide
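To show how a mask of size 7 can switch individual loss terms on and off, here is a small NumPy sketch. The entropy and KL divergence helpers use the standard definitions, averaged over instances; the averaging, the unweighted sum, and the numeric values for the other loss terms are assumptions of this sketch and not taken from the library.

```python
import numpy as np

EPS = 1e-12  # guards against log(0)


def entropy(p):
    # Mean Shannon entropy of the rows of p (loss 2 in the table above).
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + EPS)).sum(axis=1).mean())


def kl_divergence(p, q):
    # Mean KL(p || q) over rows (loss 6 in the table above).
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float((p * (np.log(p + EPS) - np.log(q + EPS))).sum(axis=1).mean())


def combine_losses(loss_values, loss_func_mask):
    # loss_func_mask[i] == 1 includes loss function (i + 1); 0 excludes it.
    if len(loss_values) != 7 or len(loss_func_mask) != 7:
        raise ValueError("expected seven loss values and a mask of size 7")
    return sum(v for v, m in zip(loss_values, loss_func_mask) if m == 1)


uniform = [[0.5, 0.5]]
# Hypothetical per-term values; terms 2 and 6 are computed, the rest are made up.
values = [0.7, entropy(uniform), 0.5, 1.1, 0.9, kl_divergence(uniform, uniform), 0.4]
total = combine_losses(values, [1, 0, 1, 0, 0, 1, 0])
```

With this mask only losses 1, 3, and 6 contribute, so the total is 0.7 + 0.5 + 0.0.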

Parameters
  • path_json – Path to json file containing the dictionary of number to string(class name) map

  • n_lfs – number of labelling functions used to generate pickle files

  • n_features – number of features for each instance in the first array of pickle file aka feature matrix

  • feature_model – The model to be used for features. Allowed values are the strings ‘lr’ (Logistic Regression) or ‘nn’ (Neural network with 2 hidden layers). Default is ‘nn’

  • n_hidden – Number of hidden layer nodes if feature model is ‘nn’, type is integer, default is 512

save_params(save_path)[source]

member function to save parameters of JL

Parameters

save_path – path to pickle file to save parameters

load_params(load_path)[source]

member function to load parameters to JL

Parameters

load_path – path to pickle file to load parameters

fit_and_predict_proba(path_L, path_U, path_V, path_T, loss_func_mask, batch_size, lr_fm, lr_gm, use_accuracy_score, path_log=None, return_gm=False, n_epochs=100, start_len=7, stop_len=10, is_qt=True, is_qc=True, qt=0.9, qc=0.85, metric_avg='binary')[source]
Parameters
  • path_L – Path to pickle file of labelled instances

  • path_U – Path to pickle file of unlabelled instances

  • path_V – Path to pickle file of validation instances

  • path_T – Path to pickle file of test instances

  • loss_func_mask – list of size 7 where loss_func_mask[i] should be 1 if Loss function (i+1) should be included, 0 otherwise. See Eq. (3) in [MCK+20]

  • batch_size – Batch size, type should be integer

  • lr_fm – Learning rate for feature model, type is integer or float

  • lr_gm – Learning rate for graphical model(cage algorithm), type is integer or float

  • use_accuracy_score – The score to use for termination condition on validation set. True for accuracy_score, False for f1_score

  • path_log – Path to the log file to append to. Default is None, in which case accuracies/f1_scores are printed to the terminal

  • return_gm – Return the predictions of graphical model? the allowed values are True, False. Default value is False

  • n_epochs – Number of epochs in each run, type is integer, default is 100

  • start_len – A parameter used in validation, refers to the least epoch after which validation checks need to be performed, type is integer, default is 7

  • stop_len – A parameter used in validation, refers to the minimum number of consecutive epochs of non-increasing validation accuracy after which training should be stopped, type is integer, default is 10

  • is_qt – True if quality guide is available(and will be provided in ‘qt’ argument). False if quality guide is intended to be found from validation instances. Default is True

  • is_qc – True if quality index is available(and will be provided in ‘qc’ argument). False if quality index is intended to be found from validation instances. Default is True

  • qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

  • metric_avg – Average metric to be used in calculating f1_score/precision/recall, default is ‘binary’

Returns

If return_gm is True, the return value is two numpy arrays of predicted probabilities, each of shape (num_instances, num_classes): the first from the feature model, the second from the graphical model. Otherwise, the return value is a single numpy array of shape (num_instances, num_classes) from the feature model. For a given model, the i,j-th element is the probability of the ith instance being the jth class (the jth value when sorted in ascending order of values in Enum) using that model. It is suggested to use the probabilities of the feature model
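The roles of start_len and stop_len can be illustrated with a simple early-stopping loop over validation scores. This is one interpretation of the parameter descriptions above, written as a standalone sketch: validation checks begin only after start_len epochs, and training stops once the score has failed to improve for stop_len consecutive checks. The library's exact bookkeeping may differ.

```python
def stopping_epoch(val_scores, start_len=7, stop_len=10):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    stale = 0  # consecutive checked epochs without improvement
    for epoch, score in enumerate(val_scores, start=1):
        if epoch <= start_len:
            continue  # no validation checks yet
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= stop_len:
                return epoch
    # Never triggered: training runs for all epochs.
    return len(val_scores)


# Hypothetical validation accuracies, one per epoch.
scores = [0.5, 0.6, 0.7, 0.7, 0.7, 0.65]
stopped_at = stopping_epoch(scores, start_len=2, stop_len=2)
```

In this run, checks start at epoch 3; epochs 4 and 5 bring no improvement over 0.7, so training stops at epoch 5.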

fit_and_predict(path_L, path_U, path_V, path_T, loss_func_mask, batch_size, lr_fm, lr_gm, use_accuracy_score, path_log=None, return_gm=False, n_epochs=100, start_len=7, stop_len=10, is_qt=True, is_qc=True, qt=0.9, qc=0.85, metric_avg='binary', need_strings=False)[source]
Parameters
  • path_L – Path to pickle file of labelled instances

  • path_U – Path to pickle file of unlabelled instances

  • path_V – Path to pickle file of validation instances

  • path_T – Path to pickle file of test instances

  • loss_func_mask – list of size 7 where loss_func_mask[i] should be 1 if Loss function (i+1) should be included, 0 otherwise. See Eq. (3) in [MCK+20]

  • batch_size – Batch size, type should be integer

  • lr_fm – Learning rate for feature model, type is integer or float

  • lr_gm – Learning rate for graphical model(cage algorithm), type is integer or float

  • use_accuracy_score – The score to use for termination condition on validation set. True for accuracy_score, False for f1_score

  • path_log – Path to the log file to append to. Default is None, in which case accuracies/f1_scores are printed to the terminal

  • return_gm – Return the predictions of graphical model? the allowed values are True, False. Default value is False

  • n_epochs – Number of epochs in each run, type is integer, default is 100

  • start_len – A parameter used in validation, refers to the least epoch after which validation checks need to be performed, type is integer, default is 7

  • stop_len – A parameter used in validation, refers to the minimum number of consecutive epochs of non-increasing validation accuracy after which training should be stopped, type is integer, default is 10

  • is_qt – True if quality guide is available(and will be provided in ‘qt’ argument). False if quality guide is intended to be found from validation instances. Default is True

  • is_qc – True if quality index is available(and will be provided in ‘qc’ argument). False if quality index is intended to be found from validation instances. Default is True

  • qt – Quality guide of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.9

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

  • metric_avg – Average metric to be used in calculating f1_score/precision/recall, default is ‘binary’

  • need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

If return_gm is True, the return value is two numpy arrays of predicted labels, each of shape (num_instances,): the first from the feature model, the second from the graphical model. Otherwise, the return value is a single numpy array of predicted labels of shape (num_instances,) from the feature model. It is suggested to use the predictions of the feature model

predict_gm_proba(path_test, qc=0.85)[source]

Used to find the predicted labels based on the trained parameters of graphical model(CAGE)

Parameters
  • path_test – Path to the pickle file containing test data set

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of shape (num_instances, num_classes) where the i,j-th element is the probability of the ith instance being the jth class (the jth value when sorted in ascending order of values in Enum) [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model

predict_fm_proba(x_test)[source]

Used to find the predicted labels based on the trained parameters of feature model

Parameters

x_test – numpy array of shape (num_instances, num_features) containing data whose labels are to be predicted

Returns

numpy.ndarray of shape (num_instances, num_classes) where the i,j-th element is the probability of the ith instance being the jth class (the jth value when sorted in ascending order of values in Enum) [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model

predict_gm(path_test, qc=0.85, need_strings=False)[source]

Used to find the predicted labels based on the trained parameters of graphical model(CAGE)

Parameters
  • path_test – Path to the pickle file containing test data set

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

  • need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

numpy.ndarray of shape (num_instances,) containing the predicted labels. Elements are strings if need_strings is True and numbers otherwise. [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model

predict_fm(x_test, need_strings=False)[source]

Used to find the predicted labels based on the trained parameters of feature model

Parameters
  • x_test – numpy array of shape (num_instances, num_features) containing data whose labels are to be predicted

  • need_strings – If True, the output will be in the form of strings(class names). Else it is in the form of class values(given to classes in Enum). Default is False

Returns

numpy.ndarray of shape (num_instances,) containing the predicted labels. Elements are strings if need_strings is True and numbers otherwise. [Note: no aggregation/algorithm-running will be done using the current input]. It is suggested to use the probabilities of the feature model


Subset Selection

Uses facilityLocation from the submodlib library, which is also provided by DECILE, for submodular optimization

spear.jl.subset_selection.rand_subset(n_all, n_instances)[source]

A function to choose random indices of the input instances to be labeled

Parameters
  • n_all – number of available instances, type is integer

  • n_instances – number of instances to be labelled, type is integer

Returns

A numpy.ndarray of the indices(of shape (n_sup,) and each element in the range [0,n_all-1)) to be labeled
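Choosing random indices without repetition can be done with NumPy's random generator, as in the sketch below. The function name, the `seed` argument, and the sorting of the result are choices made for this illustration and are not claimed to match the library's function.

```python
import numpy as np


def random_subset_indices(n_all, n_instances, seed=None):
    if not 0 < n_instances <= n_all:
        raise ValueError("n_instances must be in (0, n_all]")
    rng = np.random.default_rng(seed)
    # Sample distinct indices from range(n_all), then sort for readability.
    return np.sort(rng.choice(n_all, size=n_instances, replace=False))


indices = random_subset_indices(n_all=10, n_instances=4, seed=0)
```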

spear.jl.subset_selection.unsup_subset(x_train, n_unsup)[source]

A function for unsupervised subset selection(the subset to be labeled)

Parameters
  • x_train – A numpy.ndarray of shape (n_instances, n_features). All the data, intended to be used for training

  • n_unsup – number of instances to be found during unsupervised subset selection, type is integer

Returns

numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,x_train.shape[0])), the result of subset selection
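For intuition about facility-location-based selection, the sketch below implements a plain greedy maximiser of the facility location objective in NumPy. It is not submodlib and not SPEAR's code: the similarity measure (maximum pairwise distance minus Euclidean distance) and the naive greedy loop are assumptions chosen to keep the example short.

```python
import numpy as np


def greedy_facility_location(x, budget):
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Pairwise Euclidean distances turned into non-negative similarities.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    sim = dists.max() - dists
    # best_cover[i]: similarity of point i to its closest selected point.
    best_cover = np.zeros(n)
    selected = []
    for _ in range(budget):
        # Marginal gain of candidate c: sum_i max(sim[i, c] - best_cover[i], 0).
        gains = np.maximum(sim - best_cover[:, None], 0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never pick the same point twice
        choice = int(np.argmax(gains))
        selected.append(choice)
        best_cover = np.maximum(best_cover, sim[:, choice])
    return np.array(selected)


# Two well-separated clusters of three points each.
x_train = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1],
])
picked = greedy_facility_location(x_train, budget=2)
```

With a budget of two, the greedy choice takes one representative from each cluster, which is the kind of diverse coverage that subset selection for labeling aims for.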

spear.jl.subset_selection.sup_subset(path_json, path_pkl, n_sup, qc=0.85)[source]

A helper function for supervised subset selection (the subset to be labeled) which returns the selected indices along with the data

Parameters
  • path_json – Path to json file of number to string(class name) map

  • path_pkl – Path to the pickle file containing all the training data in standard format

  • n_sup – Number of instances to be found during supervised subset selection

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,num_instances)), the result of subset selection AND the data which is list of contents of path_pkl

spear.jl.subset_selection.sup_subset_indices(path_json, path_pkl, n_sup, qc=0.85)[source]

A function for supervised subset selection (the subset to be labeled) which just returns indices

Parameters
  • path_json – Path to json file of number to string(class name) map

  • path_pkl – Path to the pickle file containing all the training data in standard format

  • n_sup – Number of instances to be found during supervised subset selection

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,num_instances)), the result of subset selection

spear.jl.subset_selection.sup_subset_save_files(path_json, path_pkl, path_save_L, path_save_U, n_sup, qc=0.85)[source]

A function for supervised subset selection(the subset to be labeled) which makes separate pickle files of data, one for those to be labelled, other that can be left unlabelled

Parameters
  • path_json – Path to json file of number to string(class name) map

  • path_pkl – Path to the pickle file containing all the training data in standard format

  • path_save_L – Path to save the pickle file of set of instances to be labelled. Note that instances are not labelled yet. Extension should be .pkl

  • path_save_U – Path to save the pickle file of set of instances that can be left unlabelled. Extension should be .pkl

  • n_sup – number of instances to be found during supervised subset selection

  • qc – Quality index of shape (n_lfs,) of type numpy.ndarray OR a float. Values must be between 0 and 1. Default is 0.85

Returns

numpy.ndarray of indices(shape is (n_sup,), each element lies in [0,num_instances)), the result of subset selection. Also two pickle files are saved at path_save_L and path_save_U

spear.jl.subset_selection.replace_in_pkl(path, path_save, np_array, index)[source]

A function to replace one of the numpy arrays stored in the pickle file with np_array, for example to insert the true labels after labeling the instances

Parameters
  • path – Path to the pickle file containing all the data in standard format

  • path_save – Path to save the pickle file after replacing the ‘L’(true labels numpy array) of data in path pickle file

  • np_array – The data which is to be used to replace the data in path pickle file with

  • index – Index of the numpy array, in data of path pickle file, to be replaced with np_array. Value should be in [0,8]

Returns

No return value. A pickle file is generated at path_save

spear.jl.subset_selection.insert_true_labels(path, path_save, labels)[source]

A function to insert the true labels, after labeling the instances, to the pickle file

Parameters
  • path – Path to the pickle file containing all the data in standard format

  • path_save – Path to save the pickle file after replacing the ‘L’(true labels numpy array) of data in path pickle file

  • labels – The true labels of the data in pickle file. numpy.ndarray of shape (num_instances, 1)

Returns

No return value. A pickle file is generated at path_save


CAGE, JL - UTILS

Note: The arguments whose shapes are mentioned in ‘[….]’ are torch tensors.

Data loaders

The common utils to CAGE and JL algorithms are in this file. Don’t change the name or location of this file.

spear.utils.data_editor.is_dict_trivial(dict)[source]

A helper function that checks whether every key in the dictionary is equal to its value, ignoring keys that are None

Parameters

dict – the dictionary

Returns

True if all keys(which are not None) are equal to respective values. False otherwise
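The check described above can be sketched in plain Python. The name is_trivial_mapping is hypothetical and this is not the library's implementation; it only assumes that keys equal to None are skipped.

```python
def is_trivial_mapping(mapping):
    """Return True when every key that is not None maps to itself."""
    return all(key == value for key, value in mapping.items() if key is not None)


print(is_trivial_mapping({0: 0, 1: 1, None: 2}))  # the None key is ignored
print(is_trivial_mapping({0: 1, 1: 0}))
```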

spear.utils.data_editor.get_data(path, check_shapes=True, class_map=None)[source]

The standard format in the pickle file contains the numpy ndarrays x, l, m, L, d, r, s, n, k and an int n_classes

x: (num_instances, num_features), x[i][j] is jth feature of ith instance. Note that the dimension of this array can vary depending on the dimension of input

l: (num_instances, num_lfs), l[i][j] is the prediction of jth LF(co-domain: the values used in Enum) on ith instance. l[i][j] = None means Abstain

m: (num_instances, num_lfs), m[i][j] is 1 if jth LF didn’t Abstain on ith instance. Else it’s 0

L: (num_instances, 1), L[i] is true label(co-domain: the values used in Enum) of ith instance, if available. Else L[i] is None

d: (num_instances, 1), d[i] is 1 if ith instance is labelled. Else it is 0

r: (num_instances, num_lfs), r[i][j] is 1 if ith instance is an exemplar for jth rule. Else it’s 0

s: (num_instances, num_lfs), s[i][j] is the continuous score of ith instance given by jth continuous LF. If jth LF is not continuous, then s[i][j] is None

n: (num_lfs,), n[i] is 1 if ith LF has continuous counter part, else n[i] is 0

k: (num_lfs,), k[i] is the class of ith LF, co-domain: the values used in Enum

n_classes: total number of classes

In case a numpy array is not available (possible for x, L, d, r, s), it is stored as numpy.zeros(0)

Parameters
  • path – path to pickle file with data in the format above

  • check_shapes – if true, checks whether the shapes of numpy arrays in pickle file are consistent as per the format mentioned above. Else it doesn’t check. Default is True.

  • class_map – dictionary mapping the class numbers, as per the Enum defined in the labeling part, to sorted indices in [0,n_classes-1]. l and L are modified using class_map before returning, as needed inside the algorithms. Default is None, which doesn’t do any mapping

Returns

A list containing all the numpy arrays mentioned above. The arrays l, L are modified using the class_map
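The relationship between l, m and class_map described above can be sketched with plain Python lists standing in for the numpy arrays. The helper names are hypothetical, and leaving abstains as None after mapping is an assumption, not a description of what get_data does internally.

```python
def build_class_map(enum_values):
    """Map the sorted Enum values onto 0 .. n_classes-1."""
    return {value: index for index, value in enumerate(sorted(enum_values))}


def coverage_mask(l):
    """m[i][j] is 1 if the jth LF labelled the ith instance, else 0."""
    return [[0 if label is None else 1 for label in row] for row in l]


def apply_class_map(l, class_map):
    """Replace Enum values by mapped indices; abstains (None) are kept as None here."""
    return [[None if label is None else class_map[label] for label in row] for row in l]


# Two LFs over three instances; the Enum is assumed to use the values 3 and 7.
l = [[3, None], [None, 7], [3, 7]]
class_map = build_class_map({3, 7})
m = coverage_mask(l)
l_mapped = apply_class_map(l, class_map)
```

Here m follows directly from which entries of l are None, which is the consistency that check_shapes-style validation relies on.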

spear.utils.data_editor.get_classes(path)[source]

The json file should contain a dictionary of number to string(class name) map as defined in Enum

Parameters

path – path to json file with contents mentioned above

Returns

A dictionary (number to string(class name) map)
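JSON object keys are always strings, so a number to class name map read from JSON needs its keys converted before it can be indexed by class number. A minimal sketch, with the hypothetical helper parse_class_names working on a JSON string rather than a file path; whether get_classes performs this conversion itself is an assumption.

```python
import json


def parse_class_names(json_text):
    """Parse a number-to-class-name map, converting JSON's string keys to ints."""
    raw = json.loads(json_text)
    return {int(number): name for number, name in raw.items()}


class_names = parse_class_names('{"0": "HAM", "1": "SPAM"}')
```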

spear.utils.data_editor.get_predictions(proba, class_map, class_dict, need_strings)[source]

This function takes the probability of each instance belonging to each class and returns the class each instance belongs to, using the maximum of the probabilities

Parameters
  • proba – probability numpy.ndarray of shape (num_instances, num_classes)

  • class_map – dictionary mapping the class numbers(as per Enum class defined) to numbers in range [0, num_classes-1]

  • class_dict – dictionary consisting of number to string(class name) mapping as per the Enum class defined

  • need_strings – If True, the output contains strings(of class names), else it consists of numbers(class numbers as used in Enum definition)

Returns

numpy.ndarray of shape (num_instances,) whose elements are the predicted class of each instance: class names if need_strings is True, class numbers otherwise
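The steps above (take the most probable column, map it back through class_map, optionally look up the name in class_dict) can be sketched as follows. Plain lists stand in for the numpy array, and predict_classes is a hypothetical name, not the library function.

```python
def predict_classes(proba, class_map, class_dict, need_strings):
    """Pick the most probable class per instance and translate it back."""
    # Invert class_map: mapped index in [0, num_classes-1] -> Enum class number.
    index_to_number = {index: number for number, index in class_map.items()}
    predictions = []
    for row in proba:
        best_index = max(range(len(row)), key=row.__getitem__)
        number = index_to_number[best_index]
        predictions.append(class_dict[number] if need_strings else number)
    return predictions


proba = [[0.2, 0.8], [0.9, 0.1]]
class_map = {3: 0, 7: 1}            # Enum value -> mapped index
class_dict = {3: "HAM", 7: "SPAM"}  # Enum value -> class name
```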

spear.utils.data_editor.get_enum(np_array, enm)[source]

This function is used to convert a numpy array of numbers to a numpy array of enums based on the Enum class provided ‘enm’

Parameters
  • np_array – a numpy.ndarray of any shape consisting of numbers

  • enm – A class derived from the ‘Enum’ class, which must contain a mapping from every number in np_array to an enum

Returns

numpy.ndarray of the same shape as np_array, but containing enums(as per the mapping in ‘enm’) instead of numbers
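The conversion can be illustrated with nested lists in place of a numpy array. ClassLabels and to_enums are hypothetical names; the sketch relies only on the standard behaviour that calling an Enum class with a value returns the matching member.

```python
from enum import Enum


class ClassLabels(Enum):
    HAM = 0
    SPAM = 1


def to_enums(values, enm):
    """Recursively replace numbers by members of enm, preserving the nesting."""
    if isinstance(values, list):
        return [to_enums(value, enm) for value in values]
    return enm(values)  # Enum lookup by value; raises ValueError if unmapped


labels = to_enums([[0, 1], [1, 0]], ClassLabels)
```

A number with no corresponding member raises ValueError, which is why enm must cover every number in the input.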


CAGE and JL utils

From here on, graphical model (gm) refers to the CAGE algorithm and feature model (fm) refers to the feature-based classification model

The common utils to CAGE and JL algorithms are in this file. Don’t change the name or location of this file.

spear.utils.utils_cage.phi(theta, l, device)[source]

Graphical model utils: A helper function

Parameters
  • theta – [n_classes, n_lfs], the parameters

  • l – [n_lfs]

  • device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a tensor of shape [n_classes, n_lfs], element wise product of input tensors(each row of theta dot product with l)

spear.utils.utils_cage.calculate_normalizer(theta, k, n_classes, device)[source]

Graphical model utils: Used to find Z(the normaliser) in CAGE. Eq(4) in [CRS20]

Parameters
  • theta – [n_classes, n_lfs], the parameters

  • k – [n_lfs], labels corresponding to LFs

  • n_classes – num of classes/labels

  • device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a real value, representing the normaliser

spear.utils.utils_cage.probability_l_y(theta, m, k, n_classes, device)[source]

Graphical model utils: Used to find probability involving the term psi_theta(in Eq(1) in [CRS20]), the potential function for all LFs

Parameters
  • theta – [n_classes, n_lfs], the parameters

  • m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0

  • k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1

  • n_classes – num of classes/labels

  • device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a tensor of shape [n_instances, n_classes], the psi_theta value for each instance, for each class(true label y)

spear.utils.utils_cage.probability_s_given_y_l(pi, s, y, m, k, continuous_mask, qc)[source]

Graphical model utils: Used to find probability involving the term psi_pi(in Eq(1) in [CRS20]), the potential function for all continuous LFs

Parameters
  • pi – [n_lfs], the parameters for the class y

  • s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF

  • y – a value in [0, n_classes-1], representing true label, for which psi_pi is calculated

  • m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0

  • k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1

  • continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counter part, else it is 0

  • qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1

Returns

a tensor of shape [n_instances], the psi_pi value for each instance, for the given label(true label y)

spear.utils.utils_cage.probability(theta, pi, m, s, k, n_classes, continuous_mask, qc, device)[source]

Graphical model utils: Used to find probability of given instances for all possible true labels(y’s). Eq(1) in [CRS20]

Parameters
  • theta – [n_classes, n_lfs], the parameters

  • pi – [n_classes, n_lfs], the parameters

  • m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0

  • s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF

  • k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1

  • n_classes – num of classes/labels

  • continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counter part, else it is 0

  • qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1

  • device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a tensor of shape [n_instances, n_classes], the probability for an instance being a particular class

spear.utils.utils_cage.log_likelihood_loss(theta, pi, m, s, k, n_classes, continuous_mask, qc, device)[source]

Graphical model utils: Negative of log likelihood loss. Negative of Eq(6) in [CRS20]

Parameters
  • theta – [n_classes, n_lfs], the parameters

  • pi – [n_classes, n_lfs], the parameters

  • m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0

  • s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF

  • k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1

  • n_classes – num of classes/labels

  • continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counter part, else it is 0

  • qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1

  • device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a real value, negative of summation over (the log of probability for an instance, marginalised over y(true labels))

spear.utils.utils_cage.precision_loss(theta, k, n_classes, a, device)[source]

Graphical model utils: Negative of the regularizer term in Eq(9) in [CRS20]

Parameters
  • theta – [n_classes, n_lfs], the parameters

  • k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1

  • n_classes – num of classes/labels

  • a – [n_lfs], a[i] is the quality guide for ith LF. Value(s) must be between 0 and 1

  • device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a real value, negative of regularizer term

spear.utils.utils_cage.predict_gm_labels(theta, pi, m, s, k, n_classes, continuous_mask, qc, device)[source]

Graphical model utils: Used to predict the labels after the training is done

Parameters
  • theta – [n_classes, n_lfs], the parameters

  • pi – [n_classes, n_lfs], the parameters

  • m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0

  • s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF

  • k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1

  • n_classes – num of classes/labels

  • continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counter part, else it is 0

  • qc – a float value OR [n_lfs], qc[i] quality index for ith LF. Value(s) must be between 0 and 1

  • device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

numpy.ndarray of shape (n_instances,), the predicted class for an instance


JL utils

spear.utils.utils_jl.log_likelihood_loss_supervised(theta, pi, y, m, s, k, n_classes, continuous_mask, qc, device)[source]

Joint Learning utils: Negative log likelihood loss, used in loss 4 in [MCK+20]

Parameters
  • theta – [n_classes, n_lfs], the parameters

  • pi – [n_classes, n_lfs], the parameters

  • m – [n_instances, n_lfs], m[i][j] is 1 if jth LF is triggered on ith instance, else it is 0

  • s – [n_instances, n_lfs], s[i][j] is the continuous score of ith instance given by jth continuous LF

  • k – [n_lfs], k[i] is the class of ith LF, range: 0 to num_classes-1

  • n_classes – num of classes/labels

  • continuous_mask – [n_lfs], continuous_mask[i] is 1 if ith LF has continuous counter part, else it is 0

  • qc – a float value OR [n_lfs], qc[i] quality index for ith LF

  • device – ‘cuda’ if drivers are available, else ‘cpu’

Returns

a real value, summation over (the log of probability for an instance)

spear.utils.utils_jl.entropy(probabilities)[source]

Joint Learning utils: Entropy, Used in loss 2 in [MCK+20]

Parameters

probabilities – [num_unsup_instances, num_classes], probabilities[i][j] is probability of ith instance being jth class

Returns

a real value, the entropy value of given probability
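For reference, the entropy of a probability matrix can be sketched in plain Python as below. Averaging over instances and using the natural logarithm are assumptions; the documentation above only states that a single real value is returned.

```python
import math


def mean_entropy(probabilities):
    """Average Shannon entropy (natural log) over the rows of a probability matrix."""
    total = 0.0
    for row in probabilities:
        # Terms with p == 0 contribute 0 (the limit of p * log p as p -> 0).
        total -= sum(p * math.log(p) for p in row if p > 0)
    return total / len(probabilities)
```

A uniform row over two classes gives log 2, the maximum for two classes, while a one-hot row gives 0.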

spear.utils.utils_jl.kl_divergence(probs_p, probs_q)[source]

Joint Learning utils: KL divergence of two probabilities, used in loss 6 in [MCK+20]

Parameters
  • probs_p – [num_instances, num_classes]

  • probs_q – [num_instances, num_classes]

Returns

a real value, the KL divergence of given probabilities
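A minimal sketch of the KL divergence between two row-wise probability matrices, using lists instead of tensors. The direction KL(p || q) and the averaging over instances are assumptions, and q is assumed to be positive wherever p is.

```python
import math


def mean_kl_divergence(probs_p, probs_q):
    """Average KL(p || q) over rows; assumes q > 0 wherever p > 0."""
    total = 0.0
    for row_p, row_q in zip(probs_p, probs_q):
        total += sum(p * math.log(p / q) for p, q in zip(row_p, row_q) if p > 0)
    return total / len(probs_p)
```

The value is 0 when the two distributions are identical and positive otherwise.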

spear.utils.utils_jl.find_indices(data, data_sub)[source]

A helper function for subset selection

Parameters
  • data – the complete data, torch tensor of shape [num_instances, num_classes]

  • data_sub – the subset of ‘data’ whose indices are to be found. Should be of same shape as ‘data’

Returns

list of indices, to be found from the result of apricot library

spear.utils.utils_jl.get_similarity_kernel(preds)[source]

A helper function for subset selection

Parameters

preds – numpy.ndarray of shape (num_samples,)

Returns

numpy.ndarray of shape (num_samples, num_samples)


Feature-based Models

class spear.jl.models.models.LogisticRegression(*args: Any, **kwargs: Any)[source]

Class for Logistic Regression, used in Joint learning class/Algorithm

Parameters
  • input_size – number of features

  • output_size – number of classes

forward(x)[source]

class spear.jl.models.models.DeepNet(*args: Any, **kwargs: Any)[source]

Class for Deep neural network, used in Joint learning class/Algorithm

Parameters
  • input_size – number of features

  • hidden_size – number of nodes in each of the two hidden layers

  • output_size – number of classes

forward(x)[source]

Hls

Hls Checkmate

class spear.Implyloss.checkmate.BestCheckpointSaver(save_dir, num_to_keep=1, maximize=True, saver=None)[source]

Maintains a directory containing only the best n checkpoints

Inside the directory is a best_checkpoints JSON file containing a dictionary mapping of the best checkpoint filepaths to the values by which the checkpoints are compared. Only the best n checkpoints are contained in the directory and JSON file.

This is a light-weight wrapper class only intended to work in simple, non-distributed settings. It is not intended to work with the tf.Estimator framework.

handle(value, sess, global_step_tensor)[source]

Func Desc: Updates the set of best checkpoints based on the given result.

Input: value - the value by which to rank the checkpoint; sess - a tf.Session to use to save the checkpoint; global_step_tensor - a tf.Tensor representing the global step

Output: True or False
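The bookkeeping behind "keep only the best n" can be sketched without TensorFlow. BestValueTracker is a hypothetical class that tracks names and values only; it does not save checkpoints or write the best_checkpoints JSON file, and returning True exactly when the candidate is kept is an assumption about the meaning of the output.

```python
class BestValueTracker:
    """Keeps the best num_to_keep (checkpoint name, value) pairs."""

    def __init__(self, num_to_keep=1, maximize=True):
        self.num_to_keep = num_to_keep
        self.maximize = maximize
        self.best = {}  # checkpoint name -> value

    def handle(self, name, value):
        """Record a candidate; return True if it is among the best kept so far."""
        self.best[name] = value
        ranked = sorted(self.best, key=self.best.get, reverse=self.maximize)
        self.best = {kept: self.best[kept] for kept in ranked[: self.num_to_keep]}
        return name in self.best


tracker = BestValueTracker(num_to_keep=2)
results = [tracker.handle(name, value) for name, value in
           [("ckpt-1", 0.70), ("ckpt-2", 0.80), ("ckpt-3", 0.60), ("ckpt-4", 0.90)]]
```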

spear.Implyloss.checkmate.get_best_checkpoint(best_checkpoint_dir, select_maximum_value=True)[source]

Func Desc: Reads the best_checkpoints file in the best_checkpoint_dir directory. Returns the filepath in the best_checkpoints file associated with the highest value if select_maximum_value is True, or the filepath associated with the lowest value if select_maximum_value is False.

Input: best_checkpoint_dir: Directory containing best_checkpoints JSON file select_maximum_value: If True, select the filepath associated with the highest value. Otherwise, select the filepath associated with the lowest value.

Output: The full path to the best checkpoint file
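Since the best_checkpoints file is described as a JSON dictionary from checkpoint filepaths to values, the selection step can be sketched as below. pick_best_checkpoint is a hypothetical helper that takes the JSON text directly instead of reading it from best_checkpoint_dir, and the file contents shown are made up for illustration.

```python
import json


def pick_best_checkpoint(best_checkpoints_json, select_maximum_value=True):
    """Return the checkpoint path with the highest (or lowest) recorded value."""
    values_by_path = json.loads(best_checkpoints_json)
    choose = max if select_maximum_value else min
    return choose(values_by_path, key=values_by_path.get)


# Illustrative contents only; real paths and values depend on the training run.
contents = '{"model-10": 0.71, "model-20": 0.83, "model-30": 0.78}'
```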

Hls Checkpoints

spear.Implyloss.checkpoints.test_mru_checkpoints(num_to_keep)[source]

Func Desc: Runs different sessions while changing the checkpoint number that is currently being worked with and tests the same

Input: num_to_keep(int) - a limit on the size of the global step for checkpoint traversal

Output:

spear.Implyloss.checkpoints.test_checkpoint()[source]

Func Desc: tests whether the checkpoints stored are as expected

Input:

Output:

spear.Implyloss.checkpoints.test_best_ckpt()[source]

Func Desc: test for the best checkpoint so far

Input:

Output:

spear.Implyloss.checkpoints.test_checkmate()[source]

Func Desc: test whether the checkmate model is working fine

Input:

Output:

Hls Data Feeders

Hls Data Feeders Utils

spear.Implyloss.data_feeder_utils.change_values(l, user_class_to_num_map)[source]

Func Desc: Replace the class labels in l by sequential labels - 0,1,2,..

Input: l - the class label matrix user_class_to_num_map - dictionary storing mapping from original class labels to sequential labels

Output: l - with sequential labels

spear.Implyloss.data_feeder_utils.load_data(fname, jname, num_load=None)[source]

Func Desc: load the data from the given file

Input: fname - filename num_load (default - None)

Output: the structured F_d_U_Data

spear.Implyloss.data_feeder_utils.get_rule_classes(l, num_classes)[source]

Func Desc: get the different rule_classes

Input: l ([batch_size, num_rules]) num_classes (int) - the number of available classes

Output: rule_classes ([num_rules,1]) - the list of valid classes labelled by rules (say class 2 by r0, class 1 by r1, class 4 by r2 => [2,1,4])
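The example in the output description can be reproduced with a short sketch. It assumes that an entry equal to num_classes marks an instance the rule does not cover (the convention stated for conv_l_to_lsnork in the Hls utils) and that each rule always assigns the same class; rule_classes_from_labels is a hypothetical name.

```python
def rule_classes_from_labels(l, num_classes):
    """For each rule (column of l), return the class it assigns when it fires.

    Entries equal to num_classes are treated as "rule did not fire".
    Rules that never fire get None.
    """
    num_rules = len(l[0])
    classes = []
    for j in range(num_rules):
        fired = [row[j] for row in l if row[j] != num_classes]
        classes.append(fired[0] if fired else None)
    return classes


# 3 rules over 3 instances, 5 classes; the value 5 marks "not covered".
l = [[2, 5, 5], [5, 1, 4], [2, 1, 5]]
rule_classes = rule_classes_from_labels(l, 5)
```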

spear.Implyloss.data_feeder_utils.extract_rules_satisfying_min_coverage(m, min_coverage)[source]

Func Desc: extract the rules that satisfy the specified minimum coverage

Input: m ([batch_size, num_rules]) - mij specifies whether ith example is associated with the jth rule min_coverage

Output: satisfying_rules - list of satisfying rules not_satisfying_rules - list of not satisfying rules rule_map_new_to_old rule_map_old_to_new
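A sketch of the four outputs listed above, using a list-of-lists coverage matrix. Measuring coverage as the number of covered instances (rather than a fraction) and using >= as the threshold test are assumptions; split_rules_by_coverage is a hypothetical name.

```python
def split_rules_by_coverage(m, min_coverage):
    """Split rule indices by whether they cover at least min_coverage instances."""
    num_rules = len(m[0])
    coverage = [sum(row[j] for row in m) for j in range(num_rules)]
    satisfying = [j for j in range(num_rules) if coverage[j] >= min_coverage]
    not_satisfying = [j for j in range(num_rules) if coverage[j] < min_coverage]
    # Kept rules are renumbered 0, 1, 2, ... in their original order.
    new_to_old = dict(enumerate(satisfying))
    old_to_new = {old: new for new, old in new_to_old.items()}
    return satisfying, not_satisfying, new_to_old, old_to_new


m = [[1, 0, 1], [1, 0, 0], [1, 1, 1]]
kept, dropped, new_to_old, old_to_new = split_rules_by_coverage(m, min_coverage=2)
```

The two maps are inverses of each other, which is what functions such as remap_2d_array need to move between the original and the reduced rule numbering.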

spear.Implyloss.data_feeder_utils.remap_2d_array(arr, map_old_to_new)[source]

Func Desc: remap those columns of 2D array that are present in map_old_to_new

Input: arr ([batch_size, num_rules]) map_old_to_new

Output: modified array

spear.Implyloss.data_feeder_utils.remap_1d_array(arr, map_old_to_new)[source]

Func Desc: remap those positions of 1D array that are present in map_old_to_new

Input: arr ([batch_size, num_rules]) map_old_to_new

Output: modified array

spear.Implyloss.data_feeder_utils.modify_d_or_U_using_rule_map(raw_U_or_d, rule_map_old_to_new)[source]

Func Desc: Modify d or U using the rule map

Input: raw_U_or_d - the raw data (labelled(d) or unlabelled(U)) rule_map_old_to_new - the rule map

Output: the modified raw_U_or_d

spear.Implyloss.data_feeder_utils.shuffle_F_d_U_Data(data)[source]

Func Desc: shuffle the input data along the 0th axis i.e. among the different instances

Input: data

Output: the structured and shuffled F_d_U_Data

spear.Implyloss.data_feeder_utils.oversample_f_d(x, labels, sampling_dist)[source]

Func Desc: Oversample the labelled data using the arguments provided

Input: x ([batch_size, num_features]) - the data labels sampling_dist

spear.Implyloss.data_feeder_utils.oversample_d(raw_d, sampling_dist)[source]

Func Desc: performs oversampling on the raw labelled data using the given distribution

Input: raw_d - raw labelled data sampling_dist - the given sampling dist

Output: F_d_U_Data

Hls Gen Cross Entropy Utils

spear.Implyloss.gen_cross_entropy_utils.generalized_cross_entropy(logits, one_hot_labels, q=0.6)[source]

Func Desc: Computes the generalized cross entropy loss

Input: logits([batch_size, num_classes]) - weights one_hot_labels([batch_size, num_classes]) q (default = 0.6)

Output: loss
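For orientation, the generalized cross entropy loss is commonly defined as (1 - p_y^q) / q, where p_y is the softmax probability of the true class. The sketch below computes that quantity per example with plain lists; it is not the library's TensorFlow implementation, and any reduction over the batch is left out because the documentation does not specify one.

```python
import math


def generalized_cross_entropy(logits, one_hot_labels, q=0.6):
    """Per-example GCE loss (1 - p_y**q) / q, with p = softmax(logits)."""
    losses = []
    for row, labels in zip(logits, one_hot_labels):
        shift = max(row)  # subtract the max for numerical stability
        exps = [math.exp(z - shift) for z in row]
        total = sum(exps)
        p_true = sum(e / total * y for e, y in zip(exps, labels))
        losses.append((1.0 - p_true ** q) / q)
    return losses


losses = generalized_cross_entropy([[0.0, 0.0], [50.0, 0.0]], [[1, 0], [1, 0]])
```

With q = 1 the expression reduces to 1 - p_y, and as q approaches 0 it approaches the usual cross entropy -log p_y, so q trades off between the two.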

spear.Implyloss.gen_cross_entropy_utils.generalized_cross_entropy_bernoulli(p, q=0.2)[source]

Func Desc: computes the bernoulli generalized cross entropy

Input: p - base q (default = 0.2) - exponent

Output: loss

Hls Model

Hls PR Utils

spear.Implyloss.pr_utils.exp_term_for_constraints(rule_classes, num_classes, C)[source]

Func Desc: Compute the exponential term for the constraints

Input: rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C

Output: the required exponential term

spear.Implyloss.pr_utils.pr_product_term(weights, rule_classes, num_classes, C)[source]

Func Desc: Compute the probability product term for the constraints

Input: weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C

Output: the required product term

spear.Implyloss.pr_utils.get_q_y_from_p(f_probs, weights, rule_classes, num_classes, C)[source]

Func Desc: Compute the q_y term from the p (f_network) distribution

Input: f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C

Output: the required q_y term

spear.Implyloss.pr_utils.get_q_r_from_p(f_probs, weights, rule_classes, num_classes, C)[source]

Func Desc: Compute the q_r term from the p (f_network) distribution

Input: f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C

Output: the required q_r term

spear.Implyloss.pr_utils.theta_term_in_pr_loss(f_logits, f_probs, weights, rule_classes, num_classes, C, d)[source]

Func Desc: Compute the theta term in the pr loss

Input: f_logits ([batch_size, num_classes]) f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C d ([batch_size,1])

Output: the required theta term (third term in equation 14) - used to supervise f (classification) network from instances in U

spear.Implyloss.pr_utils.phi_term_in_pr_loss(m, w_logits, f_probs, weights, rule_classes, num_classes, C, d)[source]

Func Desc: Compute the phi term in the pr loss

Input: m ([batch_size, num_rules]) - mij = 1 if ith example is associated with jth rule w_logits ([batch_size, num_rules]) f_probs ([batch_size, num_classes]) weights ([batch_size, num_rules]) - the w_network weights rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C d ([batch_size,1])

Output: the required phi term (fourth term in equation 14) - used to supervise w (rule) network from instances in U

spear.Implyloss.pr_utils.pr_loss(m, f_logits, w_logits, f_probs, weights, rule_classes, num_classes, C, d)[source]

Func Desc: Compute the pr loss

Input: m ([batch_size, num_rules]) - mij = 1 if ith example is associated with jth rule f_logits w_logits ([batch_size, num_rules]) - logit before sigmoid activation in w network f_probs ([batch_size, num_classes]) - output of f network weights ([batch_size, num_rules]) - the sigmoid output from w network rule_classes ([num_rules,1]) - a list of classes associated with the rules num_classes (int) C - lambda in equation 10 (hyperparameter) d ([batch_size,1]) - if ith instance is from “d” set (labelled data) d[i] = 1, else if ith instance is from “U” set, d[i] = 0

Output: the required pr loss

Hls Test

class spear.Implyloss.test.HLSTest(hls)[source]

Class Desc: This Class is designed to test the HLS model and its accuracy and precision obtained on the validation and test datasets

maybe_save_predictions(save_filename, x, l, m, preds, d)[source]

Func Desc: Saves the predictions obtained from the model if required

Input: self save_filename - the filename where the predictions have to be saved if required x ([batch_size, num_features]) l ([batch_size, num_rules]) m ([batch_size, num_rules]) preds d ([batch_size,1]) - d[i] = 1 if the ith data instance is from the labelled dataset

Output:

test_f(datafeeder, log_output=False, data_type='test_f', save_filename=None, use_joint_f_w=False)[source]

Func Desc: tests the f_network (classification network)

Input: self datafeeder - the datafeeder object log_output (default - False) data_type (fixed to test_f) - the type of the data that we want to test save_filename (default - None) - the file where we can possibly store the test results use_joint_f_w (default - False)

Output: precision recall f1_score support

test_w(datafeeder, log_output=False, data_type='test_w', save_filename=None)[source]

Func Desc: tests the w_network (rule network)

Input: self datafeeder - the datafeeder object log_output (default - False) data_type (fixed to test_w) - the type of the data that we want to test save_filename (default - None) - the file where we can possibly store the test results

Analyzes: the obtained w_predictions

Hls Train

class spear.Implyloss.train.HLSTrain(hls, f_d_metrics_pickle, f_d_U_metrics_pickle, f_d_adam_lr, f_d_U_adam_lr, early_stopping_p, f_d_primary_metric, mode, data_dir)[source]

Class Desc: This class is designed to train the HLS model using the Implyloss algorithm

make_f_summary_ops()[source]

Func Desc: make the summary of all the essential parameters of f_network

Input: Self

Summarizes: f_d_loss_ph f_d_loss f_d_f1_score_ph f_d_f1_score f_d_accuracy_ph f_d_accuracy f_d_avg_f1_score_ph f_d_avg_f1_score f_d_summaries

report_f_d_perfs_to_tensorboard(f_d_loss, metrics_dict, global_step)[source]

Func Desc: report the f_d_performance to tensorboard

Input: self f_d_loss metrics_dict global_step

Output:

train_f_on_d(datafeeder, num_epochs)[source]

Func Desc: trains the f_network (classification network) on labelled data

Input: self datafeeder - datafeeder object num_epochs - number of epochs for training

Output:

train_f_on_d_U(datafeeder, num_epochs, loss_type)[source]

Func Desc: trains the f_network (classification network) on labelled and unlabelled data

Input: self datafeeder - datafeeder object num_epochs - number of epochs for training loss_type - different available losses

Output:

init_metrics()[source]

Func desc: initialize the metrics

Input: self

Output:

get_metric(run_type, metrics_dict)[source]

Func desc: get the metrics

Input: self run_type metrics_dict

Output: the required metrics_dict

save_metrics(run_type, metrics_dict)[source]

Func desc: save the metrics

Input: self run_type metrics_dict

Prints: The saved metric file

maybe_save_metrics_dict(run_type, metrics_dict)[source]

Func desc: save the metric if it is the best till now

Input: self run_type metrics_dict

Output: True or False denoting whether the current metric is saved or not

Prints: The saved metric file

compute_f_d_metrics(metrics_dict, precision, recall, f1_score, support, global_epoch, f_d_global_step)[source]

Func desc: compute the f_d metrics

input: self metrics_dict precision recall f1_score support global_epoch f_d_global_step

output: void

evaluates: metrics_dict, accuracy

Hls Utils

spear.Implyloss.utils.get_data(path)[source]

func desc: takes the pickle file and arranges it in a matrix list form so as to set the member variables accordingly. The expected order in the pickle file is the numpy arrays x, l, m, L, d, r, s, n, k

x: [num_instances, num_features]

l: [num_instances, num_rules]

m: [num_instances, num_rules]

L: [num_instances, 1]

d: [num_instances, 1]

r: [num_instances, num_rules]

s: [num_instances, num_rules]

n: [num_rules], mask for s

k: [num_rules], LF classes, range 0 to num_classes-1

spear.Implyloss.utils.analyze_w_predictions(x, l, m, L, d, weights, probs, rule_classes)[source]

func desc: analyze the rule network by computing the precisions of the rules and comparing old and new rule stats

input: x: [num_instances, num_features] l: [num_instances, num_rules] m: [num_instances, num_rules] L: [num_instances, 1] d: [num_instances, 1] weights: [num_instances, num_rules] probs: [num_instances, num_classes] rule_classes: [num_rules,1]

output: void, prints the required statistics

spear.Implyloss.utils.convert_weights_to_m(weights)[source]

func desc: converts weights to m

input: weights([batch_size, num_rules]) - the weights matrix corresponding to rule network(w_network) in the algorithm

output: m([batch_size, num_rules]) - the rule coverage matrix where m_ij = 1 if jth rule covers ith instance

spear.Implyloss.utils.convert_m_to_l(m, rule_classes, num_classes)[source]

func desc: converts m to l

input: m([batch_size, num_rules]) - the rule coverage matrix where m_ij = 1 if jth rule covers ith instance rule_classes - num_classes(non_negative integer) - number of available classes

output: l([batch_size, num_rules]) - labels assigned by the rules

spear.Implyloss.utils.get_rule_precision(l, L, m)[source]

func desc: get the precision of the rules

input: l([batch_size, num_rules]) - labels assigned by the rules L([batch_size, 1]) - L_i = 1 if the ith instance has already a label assigned to it in the dataset m([batch_size, num_rules]) - the rule coverage matrix where m_ij = 1 if jth rule covers ith instance

output: micro_p - macro_p - comp -

spear.Implyloss.utils.merge_dict_a_into_b(a, b)[source]

func desc: set the dict values of b to that of a

input: a, b : dicts

output: void

spear.Implyloss.utils.print_tf_global_variables()[source]

Func Desc: prints all the global variables

Input:

Output:

spear.Implyloss.utils.print_var_list(var_list)[source]

Func Desc: Prints the given variable list

Input: var_list

Output:

spear.Implyloss.utils.pretty_print(data_structure)[source]

Func Desc: prints the given data structure in the desired format

Input: data_structure

Output:

spear.Implyloss.utils.get_list_or_None(s, dtype=<class 'int'>)[source]

Func Desc: Returns the list of types of the variables in the string s

Input: s - string dtype function (default - int)

Output: None or list

spear.Implyloss.utils.get_list(s)[source]

Func Desc: returns the output of get_list_or_None as a list

Input: s - list

Output: lst - list

spear.Implyloss.utils.None_if_zero(n)[source]

Func Desc: the max(0,n) function, with None returned if n<=0

Input: n - integer

Output: if n>0 then n else None

spear.Implyloss.utils.boolean(s)[source]

Func Desc: returns the expected boolean value for the given string

Input: s - string

Output: boolean or error

spear.Implyloss.utils.set_to_list_of_values_if_None_or_empty(lst, val, num_vals)[source]

Func Desc: returns lst if it is not None or empty, else returns a list of length num_vals with all its entries equal to val

Input: lst - list val - value num_vals (integer) - length of the list lst

Output: lst or a list of length num_vals filled with val

spear.Implyloss.utils.conv_l_to_lsnork(l, m)[source]

func desc: converts l from our format to Snorkel’s format. In the Snorkel convention, a rule that does not cover an instance assigns it the label -1. We follow the convention of assigning the label num_classes instead of -1; valid class labels range over {0,1,…,num_classes-1}

input: l([batch_size, num_rules]) - rule label matrix m([batch_size, num_rules]) - rule coverage matrix

output: lsnork([batch_size, num_rules])
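The conversion described above amounts to writing -1 wherever the coverage matrix is 0. A minimal sketch with lists in place of arrays; to_snorkel_labels is a hypothetical name and the input values are illustrative.

```python
def to_snorkel_labels(l, m):
    """Replace the label of every uncovered (m == 0) entry by -1."""
    return [
        [label if covered == 1 else -1 for label, covered in zip(l_row, m_row)]
        for l_row, m_row in zip(l, m)
    ]


# 3 classes, so uncovered entries carry the label 3 in the convention above.
l = [[0, 3], [3, 2]]
m = [[1, 0], [0, 1]]
lsnork = to_snorkel_labels(l, m)
```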

spear.Implyloss.utils.compute_accuracy(support, recall)[source]

func desc: compute the required accuracy

input: support recall

output: accuracy
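Accuracy can be recovered from per-class support and recall because recall times support is the number of correctly predicted instances of that class. The sketch below shows that relationship; whether compute_accuracy uses exactly this expression is an assumption, and accuracy_from_recall is a hypothetical name.

```python
def accuracy_from_recall(support, recall):
    """Accuracy = sum_c support[c] * recall[c] / sum_c support[c]."""
    correct = sum(s * r for s, r in zip(support, recall))
    return correct / sum(support)


# Class 0: 10 instances, half recalled; class 1: 30 instances, all recalled.
accuracy = accuracy_from_recall([10, 30], [0.5, 1.0])
```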

spear.Implyloss.utils.dump_labels_to_file(save_filename, x, l, m, L, d, weights=None, f_d_U_probs=None, rule_classes=None)[source]

Func Desc: dumps the given data into a pickle file

Input: save_filename - the name of the pickle file in which the arguments/data is required to be saved x ([batch_size x num_features]) l ([batch_size x num_rules]) m ([batch_size x num_rules]) L ([batch_size x 1]) d ([batch_size x 1]) weights (default - None) f_d_U_probs (default - None) rule_classes (default - None)

Output:

spear.Implyloss.utils.load_from_pickle_with_per_class_sampling_factor(fname, per_class_sampling_factor)[source]

Func Desc: load the data from the given pickle file with per class sampling factor

Input: fname - name of the pickle file from which data need to be loaded per_class_sampling_factor

Output: the required matrices x1 ([batch_size x num_features]) l1 ([batch_size x num_rules]) m1 ([batch_size x num_rules]) L1 ([batch_size x 1]) d1 ([batch_size x 1])

spear.Implyloss.utils.combine_d_covered_U_pickles(d_name, infer_U_name, out_name, d_sampling_factor, U_sampling_factor)[source]

Func Desc: combine the labelled and unlabelled data, merge the corresponding parameters together and store them in new file

Input: d_name - the pickle file storing labelled data infer_U_name - the pickle file storing unlabelled data out_name - the name of the file where merged output needs to be stored d_sampling_factor - the per_class_sampling_factor for labelled data U_sampling_factor - the per_class_sampling_factor for unlabelled data

Output:

spear.Implyloss.utils.updated_theta_copy(grads, variables, lr, mode)[source]

Func Desc: updates the theta (parameters) using the given learning rate, grads and variables

Input: grads - gradients variables lr - learning rate mode

Output: vals - list of the updated gradients

Bibliography

CRS20

Oishik Chatterjee, Ganesh Ramakrishnan, and Sunita Sarawagi. Robust data programming with precision-guided labeling functions. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):3397–3404, Apr. 2020. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5742, doi:10.1609/aaai.v34i04.5742.

MCK+20

Ayush Maheshwari, Oishik Chatterjee, KrishnaTeja Killamsetty, Rishabh K. Iyer, and Ganesh Ramakrishnan. Data programming using semi-supervision and subset selection. CoRR, 2020. URL: https://arxiv.org/abs/2008.09887, arXiv:2008.09887.

RBE+20

Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: rapid training data creation with weak supervision. The VLDB Journal, 29, May 2020. doi:10.1007/s00778-019-00552-1.