5. dossier.web — DossierStack web services

5.1. dossier.web provides REST web services for Dossier Stack

class dossier.web.WebBuilder(add_default_routes=True)[source]

A builder for constructing DossierStack web applications.

DossierStack web services have a lot of knobs, so instead of a single function with a giant list of parameters, we get a “builder” that lets one mutably construct data for building a web application.

These “knobs” include, but are not limited to: adding routes or other Bottle applications, injecting services into routes, setting a URL prefix and adding filters and search engines.

__init__(add_default_routes=True)[source]

Introduce a new builder.

You can use method chaining to configure your web application options. e.g.,

app = WebBuilder().enable_cors().get_app()
app.run()

This code will create a new Bottle web application that enables CORS (Cross Origin Resource Sharing).

If add_default_routes is False, then the default set of routes in dossier.web.routes is not added. This is only useful if you want to compose multiple Bottle applications constructed through multiple instances of WebBuilder.

get_app()[source]

Eliminate the builder by producing a new Bottle application.

This should be the final call in your method chain. It uses all of the built up options to create a new Bottle application.

Return type:bottle.Bottle
mount(prefix)[source]

Mount the application on to the given URL prefix.

Parameters:prefix (str) – A URL prefix
Return type:WebBuilder
set_config(config_instance)[source]

Set the config instance.

By default, this is an instance of dossier.web.Config, which provides services like kvlclient and label_store. Custom services should probably subclass dossier.web.Config, but it’s not strictly necessary so long as it provides the same set of services (which are used for dependency injection into Bottle routes).

Parameters:config_instance (dossier.web.Config) – A config instance.
Return type:WebBuilder
add_search_engine(name, engine)[source]

Adds a search engine with the given name.

engine must be the class object rather than an instance. The class must be a subclass of dossier.web.SearchEngine, which should provide a means of obtaining recommendations given a query.

The engine must be a class so that its dependencies can be injected when the corresponding route is executed by the user.

If engine is None, then it removes a possibly existing search engine named name.

Parameters:
  • name (str) – The name of the search engine. This appears in the list of search engines provided to the user, and is how the search engine is invoked via REST.
  • engine (type) – A search engine class.
Return type:

WebBuilder
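
The reason for registering a class rather than an instance can be sketched with a small stand-alone registry. This is not the dossier.web implementation: EngineRegistry, EchoEngine and the create() helper are hypothetical names, and a plain dict stands in for the injected store.

```python
class EngineRegistry(object):
    """Hypothetical registry that maps names to search engine classes."""

    def __init__(self):
        self._engines = {}

    def add_search_engine(self, name, engine):
        if engine is None:
            # Passing None removes a possibly existing engine.
            self._engines.pop(name, None)
        else:
            self._engines[name] = engine
        return self  # allow method chaining, like the builder

    def names(self):
        return sorted(self._engines)

    def create(self, name, **services):
        # The class is only instantiated here, at request time, so its
        # dependencies can be supplied per request.
        return self._engines[name](**services)


class EchoEngine(object):
    """Hypothetical engine that just remembers the injected store."""

    def __init__(self, store):
        self.store = store


registry = EngineRegistry().add_search_engine("echo", EchoEngine)
engine = registry.create("echo", store={"p|abc": {}})
registry.add_search_engine("echo", None)
```

After the last call, registry.names() is empty again, which mirrors the documented behaviour of passing None.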

add_filter(name, filter)[source]

Adds a filter with the given name.

filter must be the class object rather than an instance. The class must be a subclass of dossier.web.Filter, which should provide a means of creating a predicate function.

The filter must be a class so that its dependencies can be injected when the corresponding route is executed by the user.

If filter is None, then it removes a possibly existing filter named name.

Parameters:
  • name (str) – The name of the filter. This is how the filter is invoked via REST.
  • filter (type) – A filter class.
Return type:

WebBuilder

add_routes(routes)[source]

Merges a Bottle application into this one.

Parameters:routes (bottle.Bottle or [bottle route]) – A Bottle application or a sequence of routes.
Return type:WebBuilder
inject(name, closure)[source]

Injects closure() into name parameters in routes.

This sets up dependency injection for parameters named name. When a route is invoked that has a parameter name, then closure() is passed as that parameter’s value.

(The closure indirection is so the caller can control the time of construction for objects. For example, you may want to check the health of a database connection.)

Parameters:
  • name (str) – Parameter name.
  • closure (function) – A function with no parameters.
Return type:

WebBuilder
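
A minimal sketch of name-based injection, assuming that injection is driven purely by parameter names as described above. The helper call_with_injection and the example route are hypothetical and are not how dossier.web wires this into Bottle.

```python
import inspect


def call_with_injection(route, providers, **url_args):
    """Call route, filling parameters named in providers with closure()."""
    kwargs = dict(url_args)
    for name in inspect.signature(route).parameters:
        if name in providers and name not in kwargs:
            # The closure runs when the route is invoked, not when it
            # is registered, so the caller controls construction time.
            kwargs[name] = providers[name]()
    return route(**kwargs)


# Hypothetical service and route, for illustration only.
connections = []


def make_store():
    connections.append("opened")
    return {"p|abc": {"NAME": "example"}}


def get_fc(store, cid):
    return store.get(cid)


providers = {"store": make_store}
result = call_with_injection(get_fc, providers, cid="p|abc")
```

Because make_store is only called inside call_with_injection, a health check or reconnect could be placed in the closure and would run on every request.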

enable_cors()[source]

Enables Cross Origin Resource Sharing.

This makes sure the necessary headers are set so that this web application’s routes can be accessed from other origins.

Return type:WebBuilder

Here are the available search engines by default:

class dossier.web.search_engines.plain_index_scan(store)[source]

Return a random sample of an index scan.

This scans all indexes defined for all values in the query corresponding to those indexes.

class dossier.web.search_engines.random(store)[source]

Return random results with the same name.

This finds all content objects that have a matching name and returns limit results at random.

If there is no NAME index defined, then this always returns no results.

Here are the available filter predicates by default:

class dossier.web.filters.already_labeled(label_store)[source]

Filter results that have a label associated with them.

If a result has a direct label between it and the query, then it will be removed from the list of results.

class dossier.web.filters.nilsimsa_near_duplicates(label_store, store, nilsimsa_feature_name='nilsimsa_all', threshold=119)[source]

Filter results that nilsimsa says are highly similar.

To perform filtering, this requires that the FCs carry a StringCounter at nilsimsa_feature_name; results whose nilsimsa comparison score is higher than threshold are filtered out. threshold defaults to 119, which is in the range [-128, 128] per the definition of nilsimsa. nilsimsa_feature_name defaults to ‘nilsimsa_all’.

A note about speed performance: the order complexity of this filter is linear in the number of results that get through the filter. While that is unfortunate, it is inherent to the nature of using comparison-based locality sensitive hashing (LSH). Other LSH techniques, such as shingle hashing with simhash tend to have less fidelity, but can be efficiently indexed to allow O(1) lookups in a filter like this.

Before refactoring this to use nilsimsa directly, this was using a “kernel” function that had nilsimsa buried inside it, and it had this kind of speed performance:

dossier/web/tests/test_filter_preds.py::test_near_duplicates_speed_perf 4999 filtered to 49 in 2.838213 seconds, 1761.319555 per second

After refactoring to use nilsimsa directly in this function, the constant factors get better, and the order complexity is still linear in the number of items that the filter has emitted, because it has to remember them and scan over them. Thresholding in the nilsimsa.compare_digests function helps considerably: four times faster on this synthetic test data when there are many different documents, which is the typical case:

Without thresholding in the nilsimsa.compare_digests:

dossier/web/tests/test_filter_preds.py::test_nilsimsa_near_duplicates_speed_perf 5049 filtered to 49 in 0.772274 seconds, 6537.834870 per second
dossier/web/tests/test_filter_preds.py::test_nilsimsa_near_duplicates_speed_perf 1049 filtered to 49 in 0.162775 seconds, 6444.477004 per second
dossier/web/tests/test_filter_preds.py::test_nilsimsa_near_duplicates_speed_perf 209 filtered to 9 in 0.009348 seconds, 22357.355097 per second

With thresholding in the nilsimsa.compare_digests:

dossier/web/tests/test_filter_preds.py::test_nilsimsa_near_duplicates_speed_perf 5049 filtered to 49 in 0.249705 seconds, 20219.853262 per second
dossier/web/tests/test_filter_preds.py::test_nilsimsa_near_duplicates_speed_perf 1549 filtered to 49 in 0.112724 seconds, 13741.549025 per second
dossier/web/tests/test_filter_preds.py::test_nilsimsa_near_duplicates_speed_perf 209 filtered to 9 in 0.009230 seconds, 22643.802754 per second
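
The "remember and scan" behaviour that makes this filter linear in the number of emitted results can be sketched as follows. This is an illustration only: bit_similarity is a stand-in that scores 256-bit integers in the range [-128, 128] and is not nilsimsa.compare_digests, and make_near_duplicate_predicate is a hypothetical name.

```python
def make_near_duplicate_predicate(similarity, threshold=119):
    """Return a predicate that rejects items too similar to earlier ones."""
    emitted = []  # every item that has passed the filter so far

    def predicate(digest):
        # Linear scan over everything already emitted.
        for seen in emitted:
            if similarity(digest, seen) > threshold:
                return False
        emitted.append(digest)
        return True

    return predicate


def bit_similarity(a, b):
    # Stand-in score: 128 minus the number of differing bits.
    return 128 - bin(a ^ b).count("1")


digests = [0b0, 0b1, (1 << 200) - 1, 0b11]
keep = make_near_duplicate_predicate(bit_similarity)
kept = [d for d in digests if keep(d)]
```

Here the second and fourth digests differ from the first by one and two bits, so they score above the threshold and are dropped, while the third differs in 200 bits and is kept.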

Some useful utility functions.

dossier.web.streaming_sample(seq, k, limit=None)[source]

Streaming sample.

Iterate over seq (once!) keeping k random elements with uniform distribution.

As a special case, if k is None, then list(seq) is returned.

Parameters:
  • seq – iterable of things to sample from
  • k – size of desired sample
  • limit – stop reading seq after considering this many
Returns:

list of elements from seq, length k (or less if seq is short)
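
One standard way to get this behaviour is reservoir sampling. The sketch below follows the documented parameters but is not the library's source; streaming_sample_sketch and its rng argument are hypothetical.

```python
import itertools
import random


def streaming_sample_sketch(seq, k, limit=None, rng=random):
    """Keep k uniformly chosen elements from a single pass over seq."""
    if k is None:
        return list(seq)
    sample = []
    # islice stops reading seq after `limit` items (None means no limit).
    for seen, item in enumerate(itertools.islice(seq, limit)):
        if len(sample) < k:
            sample.append(item)
        else:
            # Item number seen + 1 replaces a kept element with
            # probability k / (seen + 1).
            j = rng.randint(0, seen)
            if j < k:
                sample[j] = item
    return sample


short = streaming_sample_sketch(range(3), 10)
picked = streaming_sample_sketch(iter(range(1000)), 10)
```

Since seq is consumed only once, this also works for generators that cannot be rewound.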

5.1.1. Search engine and filter interfaces

class dossier.web.interface.SearchEngine[source]

Bases: dossier.web.interface.Queryable

Defines an interface for search engines.

A search engine, at a high level, takes a query feature collection and returns a list of results, where each result is itself a feature collection.

Note that this is an abstract class. Implementors must provide the SearchEngine.recommendations() method.

__init__()[source]

Create a new search engine.

The creation of a search engine is distinct from the operation of a search engine. Namely, the creation of a search engine is subject to dependency injection. The following parameters are special in that they will be automatically populated with special values if present in your __init__:

If you want to expand the set of items that can be injected, then you must subclass dossier.web.Config, define your new services as instance attributes, and set your new config instance with dossier.web.WebBuilder.set_config().

Return type:A callable with a signature isomorphic to dossier.web.SearchEngine.__call__().
recommendations()[source]

Return recommendations.

The return type is loosely specified. In particular, it must be a dictionary with at least one key, results, which maps to a list of tuples of (content_id, FC). The returned dictionary may contain other keys.

results()[source]

Returns results as a JSON encodable Python value.

This calls SearchEngine.recommendations() and converts the results returned into JSON encodable values. Namely, feature collections are slimmed down to only features that are useful to an end-user.

respond(response)[source]

Perform the actual web response.

This is usually just a JSON encoded dump of the search results, but implementors may choose to implement this differently (e.g., with a cache).

Parameters:response (bottle.Response) – A web response object.
Return type:str
add_filter(name, filter)[source]

Add a filter to this search engine.

Parameters:filter (Filter) – A filter.
Return type:SearchEngine
create_filter_predicate()[source]

Creates a filter predicate.

The list of available filters is given by calls to add_filter, and the list of filters to use is given by parameters in params.

In this default implementation, multiple filters can be specified with the filter parameter. Each filter is initialized with the same set of query parameters given to the search engine.

The returned function accepts a (content_id, FC) and returns True if and only if every selected predicate returns True on the same input.
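
The "if and only if every selected predicate returns True" rule amounts to a conjunction of predicates. A minimal sketch, assuming plain dicts stand in for feature collections; make_conjunction, has_name and is_profile are hypothetical names.

```python
def make_conjunction(predicates):
    """Combine predicates so a result passes only if all of them accept it."""
    predicates = list(predicates)

    def combined(result):
        return all(pred(result) for pred in predicates)

    return combined


def has_name(result):
    content_id, fc = result
    return "NAME" in fc


def is_profile(result):
    content_id, fc = result
    return content_id.startswith("p|")


results = [("p|1", {"NAME": "x"}), ("w|2", {"NAME": "y"}), ("p|3", {})]
kept = list(filter(make_conjunction([has_name, is_profile]), results))
```

With no predicates selected, the combined function accepts everything, because all() of an empty sequence is True.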

class dossier.web.interface.Filter[source]

Bases: dossier.web.interface.Queryable

A filter for results returned by search engines.

A filter is a yakonfig.Configurable object (or one that can be auto-configured) that returns a callable for creating a predicate that will filter results produced by a search engine.

A filter has one abstract method: Filter.create_predicate().

create_predicate()[source]

Creates a predicate for this filter.

The predicate should accept a tuple of (content_id, FC) and return True if and only if the given result should be included in the list of recommendations provided to the user.

class dossier.web.interface.Queryable[source]

Queryable supports parameterization from URLs and config.

Queryable is meant to be subclassed by things that have two fundamental things in common:

  1. Requires a single query identifier.
  2. Can be optionally configured from either user provided URL parameters or admin provided configuration.

Queryable provides a common interface for these two things, while also providing a way to declare a schema for the parameters. This schema is used to convert values from the URL/config into typed Python values.

Parameter schema

The param_schema class variable can define rudimentary type conversion from strings to typed Python values such as unicode or int.

param_schema is a dictionary that maps keys (parameter name) to a parameter type. A parameter type is itself a dictionary with the following keys:

type
Required. Must be one of 'bool', 'int', 'float', 'bytes' or 'unicode'.
min
Optional for 'int' and 'float' types. Specifies a minimum value.
max
Optional for 'int' and 'float' types. Specifies a maximum value.
encoding
Specifies an encoding for 'unicode' types.
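
A rough sketch of how such a schema could drive conversion. The details are assumptions rather than documented behaviour: the accepted spellings for 'bool', clamping (instead of rejecting) out-of-range numbers, and dropping keys absent from the schema are all choices made for this example, and 'unicode' maps to Python 3 str.

```python
def convert_param(spec, raw):
    """Convert one raw string value according to a parameter type dict."""
    kind = spec["type"]
    if kind == "bool":
        value = str(raw).lower() in ("1", "true", "yes", "on")  # assumption
    elif kind == "int":
        value = int(raw)
    elif kind == "float":
        value = float(raw)
    elif kind == "unicode":
        encoding = spec.get("encoding", "utf-8")
        value = raw.decode(encoding) if isinstance(raw, bytes) else str(raw)
    elif kind == "bytes":
        value = raw if isinstance(raw, bytes) else str(raw).encode("utf-8")
    else:
        raise ValueError("unknown parameter type: %r" % (kind,))
    if kind in ("int", "float"):
        # Assumption: clamp to [min, max] rather than raising an error.
        if "min" in spec:
            value = max(spec["min"], value)
        if "max" in spec:
            value = min(spec["max"], value)
    return value


def apply_schema(schema, raw_params):
    """Convert every raw parameter that the schema knows about."""
    return {k: convert_param(schema[k], v)
            for k, v in raw_params.items() if k in schema}


schema = {
    "limit": {"type": "int", "min": 0, "max": 100},
    "strict": {"type": "bool"},
    "score": {"type": "float"},
}
typed = apply_schema(schema, {"limit": "500", "strict": "true",
                              "score": "0.5", "other": "x"})
```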

If you want to inherit the schema of a parent class, then you can use:

param_schema = dict(ParentClass.param_schema, **{
    # your extra types here
})
Variables:
  • query_content_id – The query content id.
  • query_params – The raw query parameters, as a bottle.MultiDict.
  • config_params – The raw configuration parameters. This must be maintained explicitly, but will be incorporated in the values for params. If k is in config_params, then k’s default value is config_params[k] (which is overridden by query_params[k] if it exists).
  • params – The combined and typed values of query_params and config_params.
__init__()[source]

Creates a new instance of Queryable.

This initializes a default empty state, where all parameter dictionaries are empty and query_content_id is None.

To take advantage of dependency injected configuration, you’ll want to write your own constructor that sets config parameters explicitly:

def __init__(self, param1=None, param2=5):
    self.config_params = {
        'param1': param1,
        'param2': param2,
    }
    super(MyClass, self).__init__()

It’s important to call the parent constructor after config_params has been set so that the schema is applied correctly.

set_query_id(query_content_id)[source]

Set the query id.

Parameters:query_content_id (str) – The query content identifier.
Return type:Queryable
set_query_params(query_params)[source]

Set the query parameters.

The query parameters should be a dictionary mapping keys to strings or lists of strings.

Parameters:query_params (name |--> (str | [str])) – query parameters
Return type:Queryable
add_query_params(query_params)[source]

Overwrite the given query parameters.

This is the same as Queryable.set_query_params(), except it overwrites existing parameters individually, whereas set_query_params deletes all existing keys in query_params.
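
The difference between the two methods can be illustrated with plain dictionaries. QueryParamsSketch is a hypothetical stand-in, not Queryable itself, and returning self from add_query_params is an assumption made so that calls can be chained.

```python
class QueryParamsSketch(object):
    """Hypothetical holder that mimics the two update styles."""

    def __init__(self):
        self.query_params = {}

    def set_query_params(self, query_params):
        # Replace everything: keys not given here are dropped.
        self.query_params = dict(query_params)
        return self

    def add_query_params(self, query_params):
        # Overwrite key by key: other existing keys survive.
        self.query_params.update(query_params)
        return self


q = QueryParamsSketch().set_query_params({"limit": "10", "filter": "a"})
q.add_query_params({"limit": "5"})
after_add = dict(q.query_params)
q.set_query_params({"limit": "5"})
after_set = dict(q.query_params)
```

After add_query_params the filter key is still present; after set_query_params it is gone.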

5.1.2. Web service for active learning

dossier.web.routes is a stateful REST web service that can drive Dossier Stack’s active ranking models and user interface, as well as other search technologies.

There are only a few API end points. They provide searching, storage and retrieval of feature collections along with storage of ground truth data as labels. Labels are typically used in the implementation of a search engine to filter or improve the recommendations returned.

The API end points are documented as functions in this module.

Search feature collections.

The route for this endpoint is: /dossier/v1/<content_id>/search/<search_engine_name>.

content_id must be a profile content identifier; namely, it must start with p|. (This restriction may be lifted at some point.)

search_engine_name corresponds to the search strategy to use. The list of available search engines can be retrieved with the v1_search_engines() endpoint.

This endpoint returns a JSON payload which is an object with a single key, results. results is a list of objects, where the objects each have content_id and fc attributes. content_id is the unique identifier for the result returned, and fc is a JSON serialization of a feature collection.

There are also two query parameters:

  • limit limits the number of results to the number given.
  • filter sets the filtering function. The default filter function, already_labeled, will filter out any feature collections that have already been labeled with the query content_id.
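
A client-side sketch of building a search URL and reading the payload described above. The base URL http://localhost:8080 is hypothetical, percent-encoding the path segments is an assumption about what the server expects, and search_url and parse_results are illustrative helpers rather than part of dossier.web.

```python
import json
from urllib.parse import quote, urlencode


def search_url(base, content_id, engine_name, limit=None, filters=()):
    """Build the URL for the search endpoint."""
    path = "/dossier/v1/{}/search/{}".format(
        quote(content_id, safe=""), quote(engine_name, safe=""))
    query = []
    if limit is not None:
        query.append(("limit", str(limit)))
    # The filter parameter may be repeated to select several filters.
    query.extend(("filter", name) for name in filters)
    url = base.rstrip("/") + path
    if query:
        url += "?" + urlencode(query)
    return url


def parse_results(payload_text):
    """Turn the JSON payload into a list of (content_id, fc) pairs."""
    payload = json.loads(payload_text)
    return [(r["content_id"], r["fc"]) for r in payload["results"]]


url = search_url("http://localhost:8080", "p|abc", "random",
                 limit=5, filters=["already_labeled"])
pairs = parse_results('{"results": [{"content_id": "p|x", "fc": {}}]}')
```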
dossier.web.routes.v1_search_engines(search_engines)[source]

List available search engines.

The route for this endpoint is: /dossier/v1/search_engines.

This endpoint returns a JSON payload which is an object with two keys: default and names. default corresponds to a chosen default search engine. This value will always correspond to a valid search engine. names is an array of all available search engines (including default).

dossier.web.routes.v1_fc_get(visid_to_dbid, store, cid)[source]

Retrieve a single feature collection.

The route for this endpoint is: /dossier/v1/feature-collections/<content_id>.

This endpoint returns a JSON serialization of the feature collection identified by content_id.

dossier.web.routes.v1_fc_put(request, response, visid_to_dbid, store, cid)[source]

Store a single feature collection.

The route for this endpoint is: PUT /dossier/v1/feature-collections/<content_id>.

content_id is the id to associate with the given feature collection. The feature collection should be in the request body serialized as JSON.

This endpoint returns status 201 upon successful storage. An existing feature collection with id content_id is overwritten.

dossier.web.routes.v1_random_fc_get(response, dbid_to_visid, store)[source]

Retrieves a random feature collection from the database.

The route for this endpoint is: GET /dossier/v1/random/feature-collection.

Assuming the database has at least one feature collection, this end point returns an array of two elements. The first element is the content id and the second element is a feature collection (in the same format returned by dossier.web.routes.v1_fc_get()).

If the database is empty, then a 404 error is returned.

Note that currently, this may not be a uniformly random sample.

dossier.web.routes.v1_label_put(request, response, visid_to_dbid, config, label_hooks, label_store, cid1, cid2, annotator_id)[source]

Store a single label.

The route for this endpoint is: PUT /dossier/v1/labels/<content_id1>/<content_id2>/<annotator_id>.

content_id1 and content_id2 are the ids of the feature collections to associate. annotator_id is a string that identifies the human that created the label. The value of the label should be in the request body as one of the following three values: -1 for not coreferent, 0 for “I don’t know if they are coreferent” and 1 for coreferent.

Optionally, the query parameters subtopic_id1 and subtopic_id2 may be specified. Neither, both or either may be given. subtopic_id1 corresponds to a subtopic in content_id1 and subtopic_id2 corresponds to a subtopic in content_id2.

This endpoint returns status 201 upon successful storage. Any existing labels with the given ids are overwritten.
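
A sketch of preparing such a request with the standard library, without sending it. The base URL is hypothetical, and sending the value as a bare ASCII number in the body is an assumption based on the description above; label_put_request is an illustrative helper.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def label_put_request(base, cid1, cid2, annotator_id, value,
                      subtopic_id1=None, subtopic_id2=None):
    """Build (but do not send) a PUT request that stores one label."""
    if value not in (-1, 0, 1):
        raise ValueError("label value must be -1, 0 or 1")
    path = "/dossier/v1/labels/{}/{}/{}".format(
        *(quote(part, safe="") for part in (cid1, cid2, annotator_id)))
    query = []
    if subtopic_id1 is not None:
        query.append(("subtopic_id1", subtopic_id1))
    if subtopic_id2 is not None:
        query.append(("subtopic_id2", subtopic_id2))
    url = base.rstrip("/") + path
    if query:
        url += "?" + urlencode(query)
    # Assumption: the body is just the label value as text.
    return Request(url, data=str(value).encode("ascii"), method="PUT")


req = label_put_request("http://localhost:8080", "p|a", "p|b", "alice", 1,
                        subtopic_id1="s1")
```

Passing req to urllib.request.urlopen would perform the call; a 201 status would then indicate successful storage.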

dossier.web.routes.v1_label_direct(request, response, visid_to_dbid, dbid_to_visid, label_store, cid, subid=None)[source]

Return directly connected labels.

The routes for this endpoint are /dossier/v1/label/<cid>/direct and /dossier/v1/label/<cid>/subtopic/<subid>/direct.

This returns all directly connected labels for cid. Or, if a subtopic id is given, then only directly connected labels for (cid, subid) are returned.

The data returned is a JSON list of labels. Each label is a dictionary with the following keys: content_id1, content_id2, subtopic_id1, subtopic_id2, annotator_id, epoch_ticks and value.

dossier.web.routes.v1_label_connected(request, response, visid_to_dbid, dbid_to_visid, label_store, cid, subid=None)[source]

Return a connected component of positive labels.

The routes for this endpoint are /dossier/v1/label/<cid>/connected and /dossier/v1/label/<cid>/subtopic/<subid>/connected.

This returns the edges for the connected component of either cid or (cid, subid) if a subtopic identifier is given.

The data returned is a JSON list of labels. Each label is a dictionary with the following keys: content_id1, content_id2, subtopic_id1, subtopic_id2, annotator_id, epoch_ticks and value.

dossier.web.routes.v1_label_expanded(request, response, label_store, visid_to_dbid, dbid_to_visid, cid, subid=None)[source]

Return an expansion of the connected component of positive labels.

The routes for this endpoint are /dossier/v1/label/<cid>/expanded and /dossier/v1/label/<cid>/subtopic/<subid>/expanded.

This returns the edges for the expansion of the connected component of either cid or (cid, subid) if a subtopic identifier is given. Note that the expansion of a set of labels does not provide any new information content over a connected component. It is provided as a convenience for clients that want all possible labels in a connected component, regardless of whether one explicitly exists or not.

The data returned is a JSON list of labels. Each label is a dictionary with the following keys: content_id1, content_id2, subtopic_id1, subtopic_id2, annotator_id, epoch_ticks and value.
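
The relationship between a connected component and its expansion can be sketched on bare content ids. This ignores subtopics, annotators and label values, and it is not how dossier.label stores or traverses labels; connected_component and expansion are illustrative names.

```python
from itertools import combinations


def connected_component(positive_edges, start):
    """Return the set of ids reachable from start via positive labels."""
    adjacency = {}
    for a, b in positive_edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def expansion(positive_edges, start):
    """Return every pair in the component, explicit or implied."""
    nodes = sorted(connected_component(positive_edges, start))
    return list(combinations(nodes, 2))


positive = [("a", "b"), ("b", "c"), ("x", "y")]
component = connected_component(positive, "a")
pairs = expansion(positive, "a")
```

The pair ("a", "c") appears in the expansion even though no explicit label exists for it, which is the convenience the expanded endpoint provides.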

dossier.web.routes.v1_label_negative_inference(request, response, visid_to_dbid, dbid_to_visid, label_store, cid)[source]

Return inferred negative labels.

The route for this endpoint is: /dossier/v1/label/<cid>/negative-inference.

Negative labels are inferred by first getting all other content ids connected to cid through a negative label. For each directly adjacent cid', the connected components of cid and cid' are traversed to find negative labels.

The data returned is a JSON list of labels. Each label is a dictionary with the following keys: content_id1, content_id2, subtopic_id1, subtopic_id2, annotator_id, epoch_ticks and value.
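
One reading of the inference described above, sketched on bare content ids: every member of cid's positive component is taken to be not coreferent with every member of the component of each negatively labeled neighbor. This is an interpretation for illustration, not the dossier.label algorithm; component and negative_inference are hypothetical helpers.

```python
def component(positive_edges, start):
    """Ids reachable from start through positive labels."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for a, b in positive_edges:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
            elif b == node and a not in seen:
                seen.add(a)
                frontier.append(a)
    return seen


def negative_inference(positive_edges, negative_edges, cid):
    """Infer negative pairs between cid's component and its opponents'."""
    mine = component(positive_edges, cid)
    # Content ids directly connected to cid through a negative label.
    adjacent = set()
    for a, b in negative_edges:
        if a == cid:
            adjacent.add(b)
        elif b == cid:
            adjacent.add(a)
    inferred = set()
    for other in adjacent:
        for theirs in component(positive_edges, other):
            for ours in mine:
                inferred.add((ours, theirs))
    return inferred


positive = [("a", "b"), ("c", "d")]
negative = [("a", "c")]
inferred = negative_inference(positive, negative, "a")
```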

5.1.3. Managing folders and sub-folders

In many places where active learning is used, it can be useful to provide the user with a means to group and categorize topics. In an active learning setting, it is essential that we try to capture a user’s grouping of topics so that it can be used for ground truth data. To that end, dossier.web exposes a set of web service endpoints for managing folders and subfolders for a particular user. Folders and subfolders are stored and managed by dossier.label, which means they are automatically available as ground truth data.

The actual definition of what a folder or subfolder is depends on the task the user is trying to perform. We tend to think of a folder as a general topic and a subfolder as a more specific topic or “subtopic.” For example, a topic might be “cars” and some subtopics might be “dealerships with cars I want to buy” or “electric cars.”

The following end points allow one to add or list folders and subfolders. There is also an endpoint for listing all of the items in a single subfolder, where each item is a pair of (content_id, subtopic_id).

In general, the identifier of a folder/subfolder is also used as its name, similar to how identifiers in Wikipedia work. For example, if a folder has a name “My Cars”, then its identifier is My_Cars. More specifically, given any folder name NAME, its corresponding identifier can be obtained with NAME.replace(' ', '_').

All web routes accept and return identifiers (so space characters are disallowed).
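
The name-to-identifier rule is small enough to show directly. folder_identifier and is_identifier are hypothetical helper names; the only behaviour taken from the text is NAME.replace(' ', '_') and the ban on space characters.

```python
def folder_identifier(name):
    """Convert a folder or subfolder name to its identifier."""
    return name.replace(" ", "_")


def is_identifier(value):
    """Identifiers passed to the web routes may not contain spaces."""
    return " " not in value


fid = folder_identifier("My Cars")
```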

dossier.web.routes.v1_folder_list(request, kvlclient)[source]

Retrieves a list of folders for the current user.

The route for this endpoint is: GET /dossier/v1/folder.

(Temporarily, the “current user” can be set via the annotator_id query parameter.)

The payload returned is a list of folder identifiers.

dossier.web.routes.v1_folder_add(request, response, kvlclient, fid)[source]

Adds a folder belonging to the current user.

The route for this endpoint is: PUT /dossier/v1/folder/<fid>.

If the folder was added successfully, 201 status is returned.

(Temporarily, the “current user” can be set via the annotator_id query parameter.)

dossier.web.routes.v1_subfolder_list(request, response, kvlclient, fid)[source]

Retrieves a list of subfolders in a folder for the current user.

The route for this endpoint is: GET /dossier/v1/folder/<fid>/subfolder.

(Temporarily, the “current user” can be set via the annotator_id query parameter.)

The payload returned is a list of subfolder identifiers.

dossier.web.routes.v1_subfolder_add(request, response, kvlclient, fid, sfid, cid, subid=None)[source]

Adds a subtopic to a subfolder for the current user.

The route for this endpoint is: PUT /dossier/v1/folder/<fid>/subfolder/<sfid>/<cid>/<subid>.

fid is the folder identifier, e.g., My_Folder.

sfid is the subfolder identifier, e.g., My_Subtopic.

cid and subid are the content id and subtopic id of the subtopic being added to the subfolder.

If the subfolder does not already exist, it is created automatically. N.B. An empty subfolder cannot exist!

If the subtopic was added successfully, 201 status is returned.

(Temporarily, the “current user” can be set via the annotator_id query parameter.)

dossier.web.routes.v1_subtopic_list(request, response, kvlclient, fid, sfid)[source]

Retrieves a list of items in a subfolder.

The route for this endpoint is: GET /dossier/v1/folder/<fid>/subfolder/<sfid>.

(Temporarily, the “current user” can be set via the annotator_id query parameter.)

The payload returned is a list of two element arrays. The first element in the array is the item’s content id and the second element is the item’s subtopic id.

dossier.web.routes.v1_folder_delete(request, response, kvlclient, fid, sfid=None, cid=None, subid=None)[source]

Deletes a folder, subfolder or item.

The routes for this endpoint are:

  • DELETE /dossier/v1/folder/<fid>
  • DELETE /dossier/v1/folder/<fid>/subfolder/<sfid>
  • DELETE /dossier/v1/folder/<fid>/subfolder/<sfid>/<cid>
  • DELETE /dossier/v1/folder/<fid>/subfolder/<sfid>/<cid>/<subid>
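
The four routes nest, so a client can build the path from however many identifiers it has. A minimal sketch, assuming path segments should be percent-encoded; folder_delete_path is an illustrative helper, not part of dossier.web.

```python
from urllib.parse import quote


def folder_delete_path(fid, sfid=None, cid=None, subid=None):
    """Build the DELETE path for a folder, subfolder or item."""
    if cid is not None and sfid is None:
        raise ValueError("cid requires sfid")
    if subid is not None and cid is None:
        raise ValueError("subid requires cid")
    parts = ["/dossier/v1/folder", quote(fid, safe="")]
    if sfid is not None:
        parts += ["subfolder", quote(sfid, safe="")]
    if cid is not None:
        parts.append(quote(cid, safe=""))
    if subid is not None:
        parts.append(quote(subid, safe=""))
    return "/".join(parts)


folder_path = folder_delete_path("My_Cars")
item_path = folder_delete_path("My_Cars", "Electric", "p|abc", "s1")
```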
dossier.web.routes.v1_folder_rename(request, response, kvlclient, fid_src, fid_dest, sfid_src=None, sfid_dest=None)[source]

Rename a folder or a subfolder.

The routes for this endpoint are:

  • POST /dossier/v1/<fid_src>/rename/<fid_dest>
  • POST /dossier/v1/<fid_src>/subfolder/<sfid_src>/rename/<fid_dest>/subfolder/<sfid_dest>

Foldering for Dossier Stack.