3. dossier.label — store ground truth data as labels

A simple storage interface for labels (ground truth data).

dossier.label provides a convenient interface to a kvlayer table for storing ground truth data, otherwise known as “labels.” Each label, at the highest level, maps two things (addressed by content identifiers) to a coreferent value. This coreferent value is an indication by a human that these two things are “the same”, “not the same” or “I don’t know if they are the same.” Sameness in this case is determined by the human doing the annotation.

Each label also contains an annotator_id, which identifies the human that created the label. A timestamp (in seconds since the Unix epoch) is also included on every label.

3.1. Example

Using a storage backend in your code requires a working kvlayer configuration, which is usually written in a YAML file like so:

kvlayer:
  app_name: store
  namespace: dossier
  storage_type: redis
  storage_addresses: ["redis.example.com:6379"]

And here’s a full working example that uses local memory to store labels:

from dossier.label import Label, LabelStore, CorefValue
import kvlayer
import yakonfig

yaml = """
kvlayer:
  app_name: store
  namespace: dossier
  storage_type: local
"""
with yakonfig.defaulted_config([kvlayer], yaml=yaml):
    label_store = LabelStore(kvlayer.client())

    lab = Label('a', 'b', 'annotator', CorefValue.Positive)
    label_store.put(lab)

    assert lab == label_store.get('a', 'b', 'annotator')

See the documentation for yakonfig for more details on the configuration setup.

class dossier.label.LabelStore(kvlclient)

A label database.

__init__(kvlclient)

Create a new label store.

Parameters:kvlclient (kvlayer._abstract_storage.AbstractStorage) – kvlayer client
Return type:LabelStore
put(*labels)

Add one or more labels to the store.

Parameters:labels (Label) – one or more labels to add
get(cid1, cid2, annotator_id, subid1='', subid2='')

Retrieve a label from the store.

When subid1 and subid2 are empty, then a label without subtopic identifiers will be returned.

If there are multiple labels stored with the same parts, the single most recent one will be returned. If there are no labels with these parts, exceptions.KeyError will be raised.

Parameters:
  • cid1 (str) – content id
  • cid2 (str) – content id
  • annotator_id (str) – annotator id
  • subid1 (str) – subtopic id
  • subid2 (str) – subtopic id
Return type:

Label

Raises:

KeyError if no label could be found.

directly_connected(ident)

Return a generator of labels connected to ident.

ident may be a content_id or a (content_id, subtopic_id).

If no labels are defined for ident, then the generator will yield no labels.

Note that this only returns directly connected labels. It will not follow transitive relationships.

Parameters:ident (str or (str, str)) – content id or (content id and subtopic id)
Return type:generator of Label
connected_component(ident)

Return a connected component generator for ident.

ident may be a content_id or a (content_id, subtopic_id).

Given an ident, return the corresponding connected component by following all positive transitivity relationships.

For example, if (a, b, 1) is a label and (b, c, 1) is a label, then connected_component('a') will return both labels even though a and c are not directly connected.

(Note that even though this returns a generator, it will still consume memory proportional to the number of labels in the connected component.)
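
The traversal described above can be sketched without a storage backend. The snippet below is a minimal illustration, not the LabelStore implementation: it models labels as plain (content_id1, content_id2, positive) tuples, and both helper functions are hypothetical stand-ins for the methods of the same name.

```python
from collections import deque

def directly_connected(labels, ident):
    """Yield the labels that mention ident (no transitive hops)."""
    for lab in labels:
        if ident in (lab[0], lab[1]):
            yield lab

def connected_component(labels, ident):
    """Yield every positive label reachable from ident."""
    seen_ids = {ident}
    seen_labels = set()
    queue = deque([ident])
    while queue:
        current = queue.popleft()
        for lab in directly_connected(labels, current):
            cid1, cid2, positive = lab
            # Only positive labels are followed transitively.
            if not positive or lab in seen_labels:
                continue
            seen_labels.add(lab)
            yield lab
            other = cid2 if cid1 == current else cid1
            if other not in seen_ids:
                seen_ids.add(other)
                queue.append(other)

# Hypothetical data: a-b and b-c are positive, c-d is negative.
labels = [('a', 'b', True), ('b', 'c', True), ('c', 'd', False)]
component = list(connected_component(labels, 'a'))
```

In this sketch the seen_ids and seen_labels sets grow with the size of the component, which is why a generator interface does not by itself save memory.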

Parameters:ident (str or (str, str)) – content id or (content id and subtopic id)
Return type:generator of Label
expand(ident)

Return expanded set of labels from a connected component.

The connected component is derived from ident. ident may be a content_id or a (content_id, subtopic_id). If ident identifies a subtopic, then expansion is done on a subtopic connected component (and expanded labels retain subtopic information).

The labels returned by LabelStore.connected_component() contain only the labels stored in the LabelStore, and do not include the labels you can infer from the connected component. This method returns both the data-backed labels and the inferred labels.

Subtopic assignments of the expanded labels will be empty. The annotator_id will be an arbitrary annotator_id within the connected component.
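
To make the difference between stored and inferred labels concrete, here is a small sketch under simplifying assumptions: labels are plain (content_id1, content_id2) pairs, all of them positive and all in one connected component. The helper expand_pairs is hypothetical and is not part of dossier.label.

```python
from itertools import combinations

def expand_pairs(component):
    """Return every pair of content ids implied by a connected component."""
    content_ids = sorted({cid for pair in component for cid in pair})
    # Any two members of one component are coreferent by transitivity.
    return list(combinations(content_ids, 2))

# Hypothetical stored labels: a-b and b-c.
stored = [('a', 'b'), ('b', 'c')]
expanded = expand_pairs(stored)
inferred = [pair for pair in expanded if pair not in stored]
```

Here expanded holds the two stored pairs plus the a-c pair, which was never stored and appears only in inferred.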

Parameters:ident (str or (str, str)) – content id or (content id and subtopic id)
Return type:

list of Label

everything(include_deleted=False, content_id=None, subtopic_id=None)

Returns a generator of all labels in the store.

If include_deleted is True, labels that have been overwritten with more recent labels are also included. If content_id is not None, only labels for that content ID are retrieved; if subtopic_id is also not None, only that subtopic is retrieved, otherwise all subtopics are retrieved. The returned labels are always in sorted order: content IDs come first, and labels with the same content, subtopic, and annotator IDs are sorted newest first.

Return type:generator of Label
delete_all()

Deletes all labels in the store.

class dossier.label.Label(content_id1, content_id2, annotator_id, value, subtopic_id1=None, subtopic_id2=None, epoch_ticks=None, rating=None, meta=None)

An immutable unit of ground truth data.

This is a statement that the item at content_id1, subtopic_id1 refers (or does not refer) to the same thing as the item at content_id2, subtopic_id2. This assertion was recorded by annotator_id, a string identifying a user, at epoch_ticks, with CorefValue value.

On creation, the tuple is normalized such that the pair of content_id1 and subtopic_id1 are less than the pair of content_id2 and subtopic_id2.
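
The normalization rule can be illustrated with a short sketch. The normalize function below is hypothetical, and treating a missing subtopic as an empty string for comparison is an assumption made here, not something this documentation states.

```python
def normalize(content_id1, subtopic_id1, content_id2, subtopic_id2):
    """Order the two (content_id, subtopic_id) pairs so the smaller comes first."""
    # Assumption: a missing subtopic compares as '' so the tuples stay comparable.
    first = (content_id1, subtopic_id1 or '')
    second = (content_id2, subtopic_id2 or '')
    if second < first:
        first, second = second, first
    return first, second

swapped = normalize('c2', 's2', 'c1', 's1')
unchanged = normalize('c1', None, 'c2', None)
```

One consequence of this kind of ordering is that the same pair of items ends up with the same identifiers no matter which order the caller supplied them in.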

Labels are comparable, sortable, and hashable. The sort order compares the two content IDs, the two subtopic IDs, annotator_id, epoch_ticks (most recent is smallest), and then other fields.

content_id1

The first content ID.

content_id2

The second content ID.

annotator_id

An identifier of the user making this assertion.

value

A CorefValue stating whether this is a positive or negative coreference assertion.

subtopic_id1

An identifier defining a section or region of content_id1.

subtopic_id2

An identifier defining a section or region of content_id2.

epoch_ticks

The time at which annotator_id made this assertion, in seconds since the Unix epoch.

rating

An additional score showing the relative importance of this label, mirroring streamcorpus.Rating.

meta

Any additional metadata about this label, structured as a dictionary.

__init__(content_id1, content_id2, annotator_id, value, subtopic_id1=None, subtopic_id2=None, epoch_ticks=None, rating=None, meta=None)

Create a new label.

Parameters are assigned to their corresponding fields, with some normalization. If value is an int, the corresponding CorefValue is used instead. If epoch_ticks is None then the current time is used. If rating is None then it will be 1 if value is CorefValue.Positive and 0 otherwise.
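
The defaulting rules above can be sketched as follows. This is an illustration only: StandInCorefValue and normalize_fields are hypothetical names, and the integer values given to the enum members are placeholders rather than the numbering used by dossier.label.

```python
import enum
import time

class StandInCorefValue(enum.Enum):
    # Placeholder numbering for illustration; not taken from dossier.label.
    Negative = 0
    Unknown = 1
    Positive = 2

def normalize_fields(value, epoch_ticks=None, rating=None):
    """Apply the documented defaults to value, epoch_ticks and rating."""
    if isinstance(value, int):
        # An int is replaced by the corresponding enum member.
        value = StandInCorefValue(value)
    if epoch_ticks is None:
        epoch_ticks = time.time()  # current time in seconds
    if rating is None:
        rating = 1 if value is StandInCorefValue.Positive else 0
    return value, epoch_ticks, rating

value, ticks, rating = normalize_fields(2)
_, fixed_ticks, negative_rating = normalize_fields(
    StandInCorefValue.Negative, epoch_ticks=0)
```

Note that the check for epoch_ticks uses "is None", so an explicit timestamp of 0 is kept rather than replaced by the current time.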

__contains__(v)

Tests membership of identifiers.

If v is a tuple of (content_id, subtopic_id), then the pair is checked for membership. Otherwise, v must be a str and is checked for equality to one of the content ids in this label.

As a special case, if v is (content_id, None), then v is treated as if it were content_id.

>>> l = Label('c1', 'c2', 'a', 1)
>>> 'c1' in l
True
>>> 'a' in l
False
>>> ('c1', None) in l
True
>>> ('c1', 's1') in l
False
>>> ll = Label('c1', 'c2', 'a', 1, 's1', 's2')
>>> 'c1' in ll
True
>>> ('c1', None) in ll
True
>>> ('c1', 's1') in ll
True
other(content_id)

Returns the other content id.

If content_id == self.content_id1, then return self.content_id2 (and vice versa). Raises exceptions.KeyError if content_id is neither one.

>>> l = Label('c1', 'c2', 'a', 1)
>>> l.other('c1')
'c2'
>>> l.other('c2')
'c1'
>>> l.other('a')
Traceback (most recent call last):
    ...
KeyError: 'a'
subtopic_for(content_id)

Get the subtopic id that corresponds with a content id.

>>> l = Label('c1', 'c2', 'a', 1, 's1', 's2')
>>> l.subtopic_for('c1')
's1'
>>> l.subtopic_for('c2')
's2'
>>> l.subtopic_for('a')
Traceback (most recent call last):
    ...
KeyError: 'a'
Parameters:content_id (str) – content ID to look up
Returns:subtopic ID for content_id
Raises exceptions.KeyError:
 if content_id is neither content ID in this label
same_subject_as(other)

Determine if two labels are about the same thing.

This predicate returns True if self and other have the same content IDs, subtopic IDs, and annotator ID. The other fields may have any value.

>>> import time
>>> t = time.time()
>>> l1 = Label('c1', 'c2', 'a', CorefValue.Positive,
...            epoch_ticks=t)
>>> l2 = Label('c1', 'c2', 'a', CorefValue.Negative,
...            epoch_ticks=t)
>>> l1.same_subject_as(l2)
True
>>> l1 == l2
False
static most_recent(labels)

Filter an iterator to return the most recent for each subject.

labels is any iterator over Label objects. It should be sorted with the most recent first, which is the natural sort order that sorted() and the LabelStore adapter will return. The result of this is a generator of the same labels but with any that are not the most recent for the same subject (according to same_subject_as()) filtered out.
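
The filtering described above relies on the input being sorted so that labels with the same subject are adjacent and the newest comes first. The sketch below shows one way such a filter could work; it is not the library's implementation. Labels are modeled as (content_id1, content_id2, annotator_id, epoch_ticks, value) tuples, subtopic IDs are left out for brevity, and the sample data is hypothetical.

```python
from itertools import groupby

def most_recent(labels):
    """Yield the first label of each run of labels sharing a subject."""
    # The subject here is (content_id1, content_id2, annotator_id).
    for _, group in groupby(labels, key=lambda lab: lab[:3]):
        yield next(group)

labels = [
    ('c1', 'c2', 'ann', 100, 'Positive'),
    ('c1', 'c2', 'ann', 200, 'Negative'),
    ('c1', 'c3', 'ann', 150, 'Positive'),
]
# Sort by subject, then newest first by negating the timestamp.
ordered = sorted(labels, key=lambda lab: (lab[:3], -lab[3]))
latest = list(most_recent(ordered))
```

If the input were not sorted this way, groupby would see the same subject in separate runs and the filter would let older labels through, which is why the sorted order matters.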

class dossier.label.CorefValue

A human-assigned value for a coreference judgment.

The judgment is always made with respect to a pair of content items.

Variables:
  • Negative – The two items are not coreferent.
  • Unknown – It is unknown whether the two items are coreferent.
  • Positive – The two items are coreferent.

3.1.1. dossier.label command-line tool

dossier.label is a command line application for viewing the raw label data inside the database. Generally, this is a debugging tool for developers.

Run dossier.label --help for the available commands.