1. dossier.fc — feature collections

Collections of named features.

This module provides dossier.fc.FeatureCollection and a number of supporting classes. A feature collection is a dictionary mapping a feature name to a feature representation. The representation is typically a dossier.fc.StringCounter, a subclass of collections.Counter, which is fundamentally a mapping from a string value to an integer count.

This representation allows several values to be stored for each piece of information about an entity of interest. The weights in the underlying counters can be used to indicate how common a particular value is, or how many times it appears in the source documents.

from dossier.fc import FeatureCollection

president = FeatureCollection()
president['NAME']['Barack Obama'] += 1
president['entity_type']['PER'] += 1
president['PER_ADDRESS']['White House'] += 1
president['PER_ADDRESS']['1600 Pennsylvania Ave.'] += 1

Feature collections can have representations other than the basic string-counter representation; these representations may not preserve the source strings, but will still be suitable for machine-learning applications.

Feature collections can be serialized to RFC 7049 CBOR format, similar to a binary JSON representation. They can also be stored sequentially in flat files using dossier.fc.FeatureCollectionChunk as an accessor.

class dossier.fc.FeatureCollection(data=None, read_only=False)[source]

Bases: _abcoll.MutableMapping

A collection of features.

This is a dictionary from feature name to a collections.Counter or similar object. In typical use callers will not try to instantiate individual dictionary elements, but will fall back on the collection’s default-value behavior:

fc = FeatureCollection()
fc['NAME']['John Smith'] += 1

The default feature type is StringCounter.

Feature collection construction and serialization:

__init__(data=None, read_only=False)[source]

Creates a new empty feature collection.

If data is a dictionary-like object with a structure similar to that of a feature collection (i.e., a dict of multisets), then it is used to initialize the feature collection.

classmethod loads(data)[source]

Create a feature collection from a CBOR byte string.

dumps()[source]

Create a CBOR byte string from a feature collection.

classmethod from_dict(data, read_only=False)[source]

Recreate a feature collection from a dictionary.

The dictionary is of the format dumped by to_dict(). Additional information, such as whether the feature collection should be read-only, is not included in this dictionary, and is instead passed as parameters to this function.

to_dict()[source]

Dump a feature collection’s features to a dictionary.

This does not include additional data, such as whether or not the collection is read-only. The returned dictionary is suitable for serialization into JSON, CBOR, or similar data formats.
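
As a rough illustration of why a dictionary of counts serializes easily, the sketch below round-trips a plain dict of collections.Counter objects through the standard json module. The dict is only a stand-in for the output of to_dict(); the exact layout the library produces is not shown here and is an assumption.

```python
import json
from collections import Counter

# Stand-in for the result of to_dict(): feature name -> {value: count}.
# The real layout produced by dossier.fc may differ; this is an assumption.
features = {
    'NAME': Counter({'Barack Obama': 1}),
    'PER_ADDRESS': Counter({'White House': 1, '1600 Pennsylvania Ave.': 1}),
}

# Counter is a dict subclass, so json can serialize it directly.
encoded = json.dumps(features, sort_keys=True)

# Decoding yields plain dicts; rebuild the counters explicitly.
decoded = {name: Counter(values) for name, values in json.loads(encoded).items()}
```

Note that the read-only flag has no place in this structure, which matches the description above: it would have to be supplied separately when the collection is rebuilt.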

static register_serializer(feature_type, obj)[source]

This is a static method that lets you define your own feature type serializers. feature_type should be the name of the feature type that you want to define serialization for. Currently, the valid values are StringCounter, Unicode, SparseVector, or DenseVector.

Note that this function is not thread safe.

obj must be an object with three attributes defined.

obj.loads is a function that takes a Python data structure decoded from CBOR and returns a new feature counter.

obj.dumps is a function that takes a feature counter and returns a Python data structure that can be serialized by CBOR.

obj.constructor is a function with no parameters that returns the Python type that can be used to construct new features. It should be possible to call obj.constructor() to get a new and empty feature counter.
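
A minimal sketch of an object with the three attributes described above, using collections.Counter as the feature type. The helper names loads_counter, dumps_counter, and counter_serializer are hypothetical, and the sketch assumes that calling obj.constructor() should produce a new, empty feature. The registration call is left as a comment because it needs dossier.fc to be installed.

```python
from collections import Counter
from types import SimpleNamespace


def loads_counter(data):
    # `data` is the plain structure decoded from CBOR, assumed here to be a dict.
    return Counter(data)


def dumps_counter(counter):
    # Return only built-in types so that a CBOR encoder can handle the result.
    return dict(counter)


# Hypothetical serializer object with the three required attributes.
counter_serializer = SimpleNamespace(
    loads=loads_counter,
    dumps=dumps_counter,
    constructor=Counter,  # calling constructor() yields a new, empty counter
)

# With dossier.fc installed, registration would presumably look like:
# FeatureCollection.register_serializer('StringCounter', counter_serializer)

original = Counter({'John Smith': 2})
restored = counter_serializer.loads(counter_serializer.dumps(original))
empty = counter_serializer.constructor()
```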

Feature collection values and attributes:

read_only

Flag if this feature collection is read-only.

When a feature collection is read-only, no part of it can be modified. Individual feature counters cannot be added, deleted, or changed. This attribute is preserved across serialization and deserialization.

generation

Get the generation number for this feature collection.

This is the highest generation number across all counters in the collection, if the counters support generation numbers. This collection has not changed if the generation number has not changed.

DISPLAY_PREFIX = '#'

Prefix on names of features that are human-readable.

Processing may convert a feature name to a similar feature #name that is human-readable, while converting the original feature to a form that is machine-readable only; for instance, replacing strings with integers for faster comparison.
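
The conversion described above could look roughly like the following sketch, which works on a plain dict of counters rather than a real FeatureCollection. The helper split_display_feature and its integer-id scheme are hypothetical and only illustrate the naming convention.

```python
from collections import Counter

DISPLAY_PREFIX = '#'


def split_display_feature(features, name):
    """Keep readable strings under '#name' and integer ids under 'name'.

    Hypothetical helper, not part of dossier.fc.
    """
    readable = features[name]
    # Assign small integer ids in sorted order so the mapping is deterministic.
    ids = {value: index for index, value in enumerate(sorted(readable))}
    result = dict(features)
    result[DISPLAY_PREFIX + name] = Counter(readable)
    result[name] = Counter({ids[value]: count for value, count in readable.items()})
    return result


features = {'NAME': Counter({'Barack Obama': 2, 'B. Obama': 1})}
converted = split_display_feature(features, 'NAME')
```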

EPHEMERAL_PREFIX = '_'

Prefix on names of features that are not persisted.

to_dict() and dumps() will not write out features that begin with this character.
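
The filtering rule can be pictured with a short sketch over a plain dict. The helper persistent_features and the feature name _scratch are hypothetical; the library applies the rule internally when serializing.

```python
from collections import Counter

EPHEMERAL_PREFIX = '_'


def persistent_features(features):
    """Return only the features whose names lack the ephemeral prefix."""
    return {name: value for name, value in features.items()
            if not name.startswith(EPHEMERAL_PREFIX)}


features = {
    'NAME': Counter({'John Smith': 1}),
    '_scratch': Counter({'temporary': 3}),  # hypothetical working feature
}
kept = persistent_features(features)
```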

Feature collection computation:

__add__(other)[source]

Add features from two FeatureCollections.

>>> from collections import Counter
>>> fc1 = FeatureCollection({'foo': Counter('abbb')})
>>> fc2 = FeatureCollection({'foo': Counter('bcc')})
>>> fc1 + fc2
FeatureCollection({'foo': Counter({'b': 4, 'c': 2, 'a': 1})})

Note that if a feature in either of the collections is not an instance of collections.Counter, then it is ignored.

__sub__(other)[source]

Subtract the features of one FeatureCollection from another.

>>> from collections import Counter
>>> fc1 = FeatureCollection({'foo': Counter('abbb')})
>>> fc2 = FeatureCollection({'foo': Counter('bcc')})
>>> fc1 - fc2
FeatureCollection({'foo': Counter({'b': 2, 'a': 1})})

Note that if a feature in either of the collections is not an instance of collections.Counter, then it is ignored.

__mul__(coef)[source]
__imul__(coef)[source]

In-place multiplication by a scalar.
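
As a sketch of what scalar multiplication means for a collection of counters, the function below scales every count in a plain dict of features. The name scale_in_place is hypothetical, and skipping values that are not counters is an assumption, not documented library behavior.

```python
from collections import Counter


def scale_in_place(features, coef):
    """Multiply every count in every Counter-valued feature by coef."""
    for value in features.values():
        if isinstance(value, Counter):
            # Reassigning existing keys does not change the dict's size,
            # so iterating over the keys here is safe.
            for key in value:
                value[key] *= coef
    return features


features = {'foo': Counter('abbb'), 'note': 'text'}
scale_in_place(features, 2)
```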

total()[source]

Returns the sum of all counts in all features that are multisets.
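
The computation can be sketched over a plain dict of features as follows. The function name total_count is hypothetical, and treating "multiset" as "instance of collections.Counter" is an assumption.

```python
from collections import Counter


def total_count(features):
    """Sum the counts of every Counter-valued feature; skip other values."""
    return sum(sum(value.values())
               for value in features.values()
               if isinstance(value, Counter))


features = {
    'NAME': Counter({'John Smith': 2, 'J. Smith': 1}),
    'note': 'free text',  # not a multiset, so it does not contribute
}
grand_total = total_count(features)
```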

merge_with(other, multiset_op, other_op=None)[source]

Merge this feature collection with another.

Merges two feature collections using the given multiset_op on each corresponding multiset and returns a new FeatureCollection. The contents of the two original feature collections are not modified.

For each feature name in both feature sets, if either feature collection being merged has a collections.Counter instance as its value, then the two values are merged by calling multiset_op with both values as parameters. If either feature collection has something other than a collections.Counter, and other_op is not None, then other_op is called with both values to merge them. If other_op is None and neither feature collection holds a counter value for a feature, then that feature will not be present in the result.

Parameters:
  • other (FeatureCollection) – The feature collection to merge into self.
  • multiset_op (fun(Counter, Counter) -> Counter) – Function to merge two counters
  • other_op (fun(object, object) -> object) – Function to merge two non-counters
Return type: FeatureCollection
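
The merge rules above can be sketched on plain dicts of features. This is one reading of the description, not the library's implementation: the helper merge_features is hypothetical, a feature missing from one side is treated as an empty counter, and multiset_op is applied only when both values are counters.

```python
import operator
from collections import Counter


def merge_features(left, right, multiset_op, other_op=None):
    """Merge two {name: feature} dicts into a new dict (hypothetical helper)."""
    merged = {}
    for name in sorted(set(left) | set(right)):
        a = left.get(name, Counter())
        b = right.get(name, Counter())
        if isinstance(a, Counter) and isinstance(b, Counter):
            merged[name] = multiset_op(a, b)
        elif other_op is not None:
            merged[name] = other_op(a, b)
        # Otherwise the feature is dropped from the result.
    return merged


left = {'foo': Counter('abbb'), 'note': 'x'}
right = {'foo': Counter('bcc'), 'note': 'y'}

# Without other_op, the non-counter feature 'note' is absent from the result.
summed = merge_features(left, right, operator.add)

# With other_op, non-counter values are merged too (here by concatenation).
joined = merge_features(left, right, operator.add, other_op=lambda a, b: a + b)
```

Passing operator.add or operator.sub as multiset_op reproduces the counter arithmetic shown for __add__ and __sub__; neither input dict is modified.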

class dossier.fc.StringCounter(*args, **kwargs)[source]

Bases: collections.Counter

Simple counter based on exact string matching.

This is a subclass of collections.Counter that includes a generation counter so that it can be used in a cache.

StringCounter is the default feature type in a feature collection, so you typically don’t have to instantiate a StringCounter explicitly:

fc = FeatureCollection()
fc['NAME']['John Smith'] += 1

But instantiating directly works too:

sc = StringCounter()
sc['John Smith'] += 1

fc = FeatureCollection({'NAME': sc})
fc['NAME']['John Smith'] += 1
assert fc['NAME']['John Smith'] == 2

Note that instances of this class support all the methods defined for a collections.Counter, but only the ones unique to StringCounter are listed here.

__init__(*args, **kwargs)[source]

Initialize a StringCounter with existing counts:

>>> sc = StringCounter(a=4, b=2, c=0)
>>> sc['b']
2

See the documentation for collections.Counter for more examples.

truncate_most_common(*args, **kwargs)[source]

Sorts the counter and, in place, keeps only the truncation_length most common items.
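
The effect can be approximated on a plain collections.Counter as in the sketch below. The standalone function and its explicit truncation_length parameter are assumptions for illustration; the documented method accepts *args and **kwargs.

```python
from collections import Counter


def truncate_most_common(counter, truncation_length):
    """Keep only the truncation_length most common items, modifying counter."""
    keep = dict(counter.most_common(truncation_length))
    counter.clear()
    counter.update(keep)


counter = Counter('aaabbc')
truncate_most_common(counter, 2)
```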

read_only

Flag indicating whether this collection is read-only.

This flag always begins as False; it cannot be set via the constructor, for compatibility with collections.Counter. If this flag is set, then any operation that mutates the counter will raise ReadOnlyException.
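
The behavior of the flag can be imitated with a small Counter subclass. This is an illustrative stand-in, not the StringCounter implementation: GuardedCounter and ReadOnlyError are hypothetical names, and only item assignment and deletion are guarded here.

```python
from collections import Counter


class ReadOnlyError(Exception):
    """Hypothetical stand-in for dossier.fc.ReadOnlyException."""


class GuardedCounter(Counter):
    """Counter that refuses item changes once read_only is set."""

    def __init__(self, *args, **kwargs):
        # The flag starts as False and is not a constructor argument.
        self.read_only = False
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        if self.read_only:
            raise ReadOnlyError('counter is read-only')
        super().__setitem__(key, value)

    def __delitem__(self, key):
        if self.read_only:
            raise ReadOnlyError('counter is read-only')
        super().__delitem__(key)


counter = GuardedCounter(a=1)
counter['a'] += 1          # allowed while the flag is False
counter.read_only = True
try:
    counter['a'] += 1      # now rejected
    blocked = False
except ReadOnlyError:
    blocked = True
```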

generation

Generation number for this counter instance.

This number is incremented by every operation that mutates the counter object. If two collections are the same object and have the same generation number, then they are identical.

Having this property allows a pair of id(sc) and the generation to be an immutable hashable key for things like memoization operations, accounting for the possibility of the counter changing over time.

>>> sc = StringCounter({'a': 1})
>>> cache = {(id(sc), sc.generation): 1}
>>> (id(sc), sc.generation) in cache
True
>>> sc['a']
1
>>> (id(sc), sc.generation) in cache
True
>>> sc['a'] += 1
>>> sc['a']
2
>>> (id(sc), sc.generation) in cache
False
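
The same idea can be sketched without dossier.fc. GenerationCounter below is a hypothetical stand-in that bumps a generation number on item assignment and deletion only, and cached_total is a hypothetical memoized function keyed on the (id, generation) pair.

```python
from collections import Counter


class GenerationCounter(Counter):
    """Illustrative Counter with a generation number (not StringCounter)."""

    def __init__(self, *args, **kwargs):
        self.generation = 0
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.generation += 1

    def __delitem__(self, key):
        super().__delitem__(key)
        self.generation += 1


_cache = {}
computed_keys = []  # records each cache miss, for demonstration only


def cached_total(counter):
    """Return the sum of counts, recomputing only after a mutation."""
    key = (id(counter), counter.generation)
    if key not in _cache:
        computed_keys.append(key)
        _cache[key] = sum(counter.values())
    return _cache[key]


sc = GenerationCounter({'a': 1})
first = cached_total(sc)    # computed
again = cached_total(sc)    # served from the cache
sc['a'] += 1                # mutation changes the generation
updated = cached_total(sc)  # computed again under a new key
```

Because id() values can be reused after an object is garbage-collected, a cache like this is only safe while the counter stays alive.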

class dossier.fc.SparseVector[source]

Bases: object

An abstract class for sparse vectors.

Currently, there is no default implementation of a sparse vector.

Other implementations of sparse vectors must inherit from this class. Otherwise they cannot be used inside a dossier.fc.FeatureCollection.

class dossier.fc.DenseVector[source]

Bases: object

An abstract class for dense vectors.

Currently, there is no default implementation of a dense vector.

Other implementations of dense vectors must inherit from this class. Otherwise they cannot be used inside a dossier.fc.FeatureCollection.

dossier.fc.FeatureCollectionChunk

Accessor for feature collections stored sequentially in flat files.

class dossier.fc.ReadOnlyException[source]

Bases: dossier.fc.exceptions.BaseException

Code attempted to modify a read-only feature collection.

This occurs when adding, deleting, or making other in-place modifications to a FeatureCollection that has its read_only flag set. It also occurs when attempting to make changes to a StringCounter contained in such a collection.

class dossier.fc.SerializationError[source]

Bases: dossier.fc.exceptions.BaseException

A problem occurred serializing or deserializing.

This can occur if a FeatureCollection has an unrecognized feature type, or if a CBOR input does not have the correct format.