peyutil: Python utilities for Open Tree

Simple utility functions used by Open Tree python code.

These functions are used by packages that descend from peyotl, but do not depend on any part of peyotl.

The package is organized into subpackages for ease of maintenance. However, the public functions are imported into peyutil.__init__.py, so users can ignore the subpackage structure. The subpackage structure is retained in the documentation structure below as an organizational aid.

See https://opentreeoflife.github.io/ for general information about the Open Tree of Life project.

IO functions

Implemented in the input_output subpackage.

peyutil.download(url, encoding='utf-8')[source]

Returns the text fetched via http GET from URL, read as encoding.

peyutil.expand_path(p)[source]

Helper function to expand ~ and any environmental vars in a path string.

peyutil.expand_to_abspath(p)[source]

Calls expand_path and then converts to an absolute path.
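
The two path helpers above can be approximated with the standard library alone. The following is a minimal sketch, not the peyutil source: the function names ending in _sketch and the PEYUTIL_DEMO_DIR variable are hypothetical, and the order in which ~ and environment variables are expanded is an assumption.

```python
import os


def expand_path_sketch(p):
    """Expand a leading ~ and any $VAR references in the path string p."""
    return os.path.expandvars(os.path.expanduser(p))


def expand_to_abspath_sketch(p):
    """Expand p as above, then anchor it at the current working directory."""
    return os.path.abspath(expand_path_sketch(p))


# PEYUTIL_DEMO_DIR is a made-up variable used only for this illustration.
os.environ["PEYUTIL_DEMO_DIR"] = "demo_data"
raw = os.path.join("$PEYUTIL_DEMO_DIR", "trees.json")
relative = expand_path_sketch(raw)
absolute = expand_to_abspath_sketch(raw)
```

Here relative is still a relative path with the variable filled in, while absolute is the same path resolved against the current working directory.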

peyutil.open_for_group_write(fp, mode, encoding='utf-8')[source]

Open with mode=mode and permissions ‘-rw-rw-r--’.

Group writable is the default on some systems/accounts, but it is important that it be present on our deployment machine.
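
One way to obtain the permissions described above on a POSIX system is to open the file and then set its mode explicitly. This is a sketch under that assumption, for text modes only; open_group_writable_sketch is a hypothetical name and the real function may set permissions differently.

```python
import os
import stat
import tempfile

# Owner and group may read and write, others may only read: -rw-rw-r--
GROUP_WRITE_MODE = (stat.S_IRUSR | stat.S_IWUSR |
                    stat.S_IRGRP | stat.S_IWGRP |
                    stat.S_IROTH)


def open_group_writable_sketch(fp, mode, encoding="utf-8"):
    """Open fp in a text mode, then force the group-writable permission bits."""
    fo = open(fp, mode, encoding=encoding)
    os.chmod(fp, GROUP_WRITE_MODE)
    return fo


demo_path = os.path.join(tempfile.mkdtemp(), "note.txt")
with open_group_writable_sketch(demo_path, "w") as out:
    out.write("hello\n")
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
```

Setting the mode after opening avoids depending on the process umask, which is what makes group write access the default on only some accounts.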

peyutil.parse_study_tree_list(fp)[source]

Takes a filepath to a list of study+trees and returns dicts with that info.

The fp filepath in this function can refer to a JSON file or text file. If fp is JSON, it should unpack as an iterable with each element either:

  • a dict with ‘study_id’ and ‘tree_id’ keys, or

  • a string

If fp does not parse, it is assumed to be a text file. Lines with the first graphical character being # will be skipped. Others will be treated as strings.

Every element in the input iterable that is a dict must have ‘study_id’ and ‘tree_id’ keys (otherwise an AssertionError will be raised). Qualifying elements will be returned.

Every string element in the input iterable should be in the pattern:

  • pg_(STUDYID)_(TREEID) or

  • ot_(STUDYID)_(TREEID)

If no AssertionError is raised, the return value will be a list of dicts, each of which will have a ‘study_id’ and a ‘tree_id’ key (though they may have extra info if the input was JSON). This function served as a normalizer as we moved from a string to a JSON representation of tree reference lists.
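
The normalization described above can be sketched on an already-loaded iterable. This is an illustration only: normalize_tree_refs_sketch and the regular expression are hypothetical, the IDs are made up, and keeping the pg_ or ot_ prefix as part of study_id is an assumption that the text above does not settle.

```python
import re

# Assumed shape: a pg_ or ot_ prefixed study ID, an underscore, then the tree ID.
_REF_PAT = re.compile(r"^((?:pg|ot)_[^_]+)_(.+)$")


def normalize_tree_refs_sketch(elements):
    """Return a list of dicts that each have 'study_id' and 'tree_id' keys."""
    normalized = []
    for el in elements:
        if isinstance(el, dict):
            # Dicts pass through unchanged, extra keys included.
            assert "study_id" in el and "tree_id" in el
            normalized.append(el)
        else:
            m = _REF_PAT.match(el)
            # Rejecting unmatched strings is a choice made for this sketch.
            assert m is not None, "unrecognized reference: {}".format(el)
            normalized.append({"study_id": m.group(1), "tree_id": m.group(2)})
    return normalized


refs = normalize_tree_refs_sketch([
    "pg_2539_tree6294",
    {"study_id": "ot_134", "tree_id": "tree1", "note": "extra info is kept"},
])
```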

peyutil.pretty_dict_str(d, indent=2)[source]

Shows JSON indented representation of d.

peyutil.read_as_json(in_filename, encoding='utf-8')[source]

Returns the content of the JSON at the filepath in_filename.

peyutil.read_filepath(filepath, encoding='utf-8')[source]

Returns the text content of filepath.

peyutil.write_to_filepath(content, filepath, encoding='utf-8', mode='w', group_writeable=False)[source]

Writes content to the filepath; may create parent directory.

Uses the specified file mode and data encoding. If group_writeable is True, the output file will have permissions to be writable by the group (on POSIX systems).

peyutil.write_as_json(blob, dest, indent=0, sort_keys=True)[source]

Writes blob as JSON to the filepath or outstream dest.

If dest is not a string, it is assumed to be an object with .write(). Uses utf-8 encoding if a filepath is given (does not change the encoding if dest is already open).
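
The filepath-or-stream behaviour can be illustrated with a small dispatcher. This is a Python 3 sketch under the assumption that a str destination is a filepath and anything else is a writable stream; write_json_sketch is a hypothetical name and the sample IDs are made up.

```python
import io
import json
import os
import tempfile


def write_json_sketch(blob, dest, indent=0, sort_keys=True):
    """Write blob as JSON to a filepath (str) or to an already open stream."""
    if isinstance(dest, str):
        # A path: open it ourselves, always as UTF-8.
        with io.open(dest, "w", encoding="utf-8") as out:
            json.dump(blob, out, indent=indent, sort_keys=sort_keys)
    else:
        # An open stream: write as-is and leave its encoding alone.
        json.dump(blob, dest, indent=indent, sort_keys=sort_keys)


blob = {"tree_id": "tree1", "study_id": "pg_10"}

stream = io.StringIO()
write_json_sketch(blob, stream)
from_stream = json.loads(stream.getvalue())

path = os.path.join(tempfile.mkdtemp(), "blob.json")
write_json_sketch(blob, path)
with io.open(path, encoding="utf-8") as inp:
    from_file = json.load(inp)
```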

peyutil.write_pretty_dict_str(out, obj, indent=2)[source]

Writes JSON indented representation of obj to out.

String manipulations

peyutil.increment_slug(s)[source]

Generate next slug for a series.

Some docstore types will use slugs (see slugify) as document ids. To support unique ids, we’ll serialize them as follows:

  • TestUserA/my-test

  • TestUserA/my-test-2

  • TestUserA/my-test-3

  • …
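
The numbering scheme for a series of slugs can be sketched as follows. increment_slug_sketch is a hypothetical stand-in, and treating any trailing "-<digits>" as the series counter is an assumption of this sketch.

```python
import re

_TRAILING_NUM = re.compile(r"^(.*)-(\d+)$")


def increment_slug_sketch(s):
    """Return the next slug in the series: x -> x-2 -> x-3 and so on."""
    m = _TRAILING_NUM.match(s)
    if m is None:
        # No counter yet, so this is the first duplicate.
        return s + "-2"
    return "{}-{}".format(m.group(1), int(m.group(2)) + 1)


second = increment_slug_sketch("TestUserA/my-test")
third = increment_slug_sketch(second)
```

A slug that naturally ends in a number (for example "top-10") would be read as a counter by this sketch; whether peyutil handles that case differently is not stated here.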

peyutil.slugify(s)[source]

Convert any string to a “slug” for filename and URL part.

EXAMPLE: “Trees about bees” => ‘trees-about-bees’

EXAMPLE: “My favorites!” => ‘my-favorites’

N.B. Its behavior should match this client-side slugify function, so we can accurately “preview” slugs in the browser: https://github.com/OpenTreeOfLife/opentree/blob/553546942388d78545cc8dcc4f84db78a2dd79ac/curator/static/js/curation-helpers.js#L391-L397

TODO: Should we also trim leading and trailing spaces (or dashes in the final slug)?
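
A slug conversion consistent with the two examples above might look like the sketch below. The exact character rules are assumptions chosen to reproduce those examples; slugify_sketch is a hypothetical name and is not guaranteed to match either the Python or the client-side implementation.

```python
import re


def slugify_sketch(s):
    """Lower-case s, drop punctuation, and join the words with dashes."""
    slug = s.lower()
    slug = re.sub(r"[^a-z0-9 -]", "", slug)  # keep letters, digits, spaces, dashes
    slug = re.sub(r" +", "-", slug)          # each run of spaces becomes one dash
    slug = re.sub(r"-+", "-", slug)          # collapse repeated dashes
    return slug


bees = slugify_sketch("Trees about bees")
favorites = slugify_sketch("My favorites!")
padded = slugify_sketch(" padded ")
```

In this sketch leading and trailing spaces survive as dashes (padded is "-padded-"), which is the situation the TODO above asks about.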

peyutil.underscored2camel_case(v)[source]

Converts ott_id to ottId.
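
A minimal sketch of this conversion, assuming that only the first letter of each part after an underscore is upper-cased (underscored2camel_case_sketch is a hypothetical name):

```python
def underscored2camel_case_sketch(v):
    """Turn an underscore_separated name into camelCase."""
    head, *rest = v.split("_")
    # Upper-case the first character of every part after the first.
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


converted = underscored2camel_case_sketch("ott_id")
unchanged = underscored2camel_case_sketch("name")
```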

Misc functions

peyutil.doi2url(v)[source]

Canonicalizes a DOI.

Takes a string form of a DOI (raw, http…, or doi:) and returns a string in the URL form.
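
The three accepted input forms can be illustrated with a small sketch. The resolver prefix used here is an assumption (the URL base that peyutil emits is not stated above), doi2url_sketch is a hypothetical name, and the DOI is a placeholder.

```python
# Assumed resolver base; the prefix used by the real function may differ.
DOI_URL_PREFIX = "https://doi.org/"


def doi2url_sketch(v):
    """Return a URL form for a raw DOI, a doi:-prefixed DOI, or a URL."""
    v = v.strip()
    if v.startswith("http"):
        return v                      # already in URL form
    if v.lower().startswith("doi:"):
        v = v[4:].strip()             # drop the doi: scheme
    return DOI_URL_PREFIX + v


from_raw = doi2url_sketch("10.1000/xyz123")
from_scheme = doi2url_sketch("doi:10.1000/xyz123")
from_url = doi2url_sketch("https://doi.org/10.1000/xyz123")
```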

peyutil.reverse_dict(d)[source]

Returns a dict v->k for the k->v mapping in d.
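
Inverting a mapping is a one-line comprehension. The sketch below (reverse_dict_sketch is a hypothetical name) also shows a consequence worth keeping in mind: values must be hashable, and when two keys share a value only one of them survives.

```python
def reverse_dict_sketch(d):
    """Return a new dict mapping each value of d back to its key."""
    return {v: k for k, v in d.items()}


inverted = reverse_dict_sketch({"ott_id": "ottId", "tree_id": "treeId"})
# Both keys map to 1, so the later key replaces the earlier one.
collapsed = reverse_dict_sketch({"a": 1, "b": 1})
```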

Newick Tree tokenization

See https://evolution.genetics.washington.edu/phylip/newicktree.html

peyutil provides some utilities for breaking string input into Newick tokens or (on a higher level) events that happen when parsing a Newick string.

class peyutil.NewickTokenType(value)[source]

Enum of Newick Token Types.

class peyutil.NewickTokenizer(stream=None, newick=None, filepath=None)[source]

Class providing a Newick token iteration interface.

Name tokens are stripped of whitespace and comments.

The Newick input is passed to __init__ as a stream, a newick string, or a filepath.

__iter__()[source]

Returns self as the internal state is used to achieve iteration.

__next__()[source]

Deletes comments from previous tokens, then advances to the next token.

file_pos()[source]

Returns a string describing the position within the input stream.

next()

Deletes comments from previous tokens, then advances to the next token.

tokens()[source]

Returns a list of remaining tokens (all if not yet iterated over).
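
To make the idea of Newick tokens concrete, here is a deliberately simplified tokenizer built on regular expressions. It is not the NewickTokenizer implementation: newick_tokens_sketch is a hypothetical function, it returns plain strings rather than typed tokens, and it would wrongly strip a bracketed comment that appears inside a quoted label.

```python
import re

_COMMENT = re.compile(r"\[[^\]]*\]")
# Punctuation, or a single-quoted label, or a run of other non-space characters.
_TOKEN = re.compile(r"[(),:;]|'(?:[^']|'')*'|[^(),:;\s]+")


def newick_tokens_sketch(newick):
    """Split a Newick string into punctuation and name/number tokens."""
    without_comments = _COMMENT.sub("", newick)
    return _TOKEN.findall(without_comments)


tokens = newick_tokens_sketch("((A:0.1,B:0.2)[internal]AB:0.3, C);")
```

Whitespace between tokens and the bracketed comment are dropped, so the result contains only structural characters, labels, and branch lengths.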

class peyutil.NewickEvents(value)[source]

Newick Event types for event-based parsing.

class peyutil.NewickEventFactory(tokenizer=None, newick=None, filepath=None, event_handler=None)[source]

Class providing a Newick event iteration interface.

Higher level interface for reading newick strings into a series of events. Each event will be a dict with the keys:

  • 'type': a member of the NewickEvents Enum, and

  • 'comments': a list of all comments contained

NewickEvents.TIP and NewickEvents.CLOSE_SUBTREE events can also have a label and/or an edge_info string.

NOTE: for the sake of performance, the value of the comments field may be the same list in different events! Client code must make a copy of the list if it needs a stable copy, for example to process comments later.

Input is supplied via the tokenizer, newick, or filepath argument of __init__.

If event_handler is not None in __init__, the initializer will iterate over all events, passing each one to the handler. So no iteration is needed or supported.

If event_handler is None, the object will be ready for iteration.

__iter__()[source]

Returns self as the internal state is used to achieve iteration.

__next__()[source]

Deletes comments and returns next event dict.

next()

Deletes comments and returns next event dict.
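
The warning about the shared comments list is easiest to see with a stand-in generator. The code below does not use NewickEventFactory; events_sharing_comments is a hypothetical generator that reuses one list across events, which is the behaviour the note describes, and the string "TIP" stands in for the enum member.

```python
def events_sharing_comments():
    """Yield event dicts whose 'comments' value is one reused list object."""
    comments = []
    for label, note in [("A", "first"), ("B", "second")]:
        del comments[:]           # the same list is emptied and refilled
        comments.append(note)
        yield {"type": "TIP", "label": label, "comments": comments}


kept_alias = []
kept_copy = []
for event in events_sharing_comments():
    kept_alias.append(event["comments"])        # stores a reference
    kept_copy.append(list(event["comments"]))   # stores a snapshot
```

After the loop, every entry of kept_alias shows only the last comment, while kept_copy preserves what each event carried when it was produced.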

Python 2+3 compatibility

Because Open Tree uses Python 2 and Python 3, peyutil provides some conditional importing to make basic usage in client code work without conditioning on python version:

  • StringIO: imported from io in py3 and cStringIO in py2

  • UNICODE: str in py3 and unicode in py2

  • urlencode: from urllib.parse.urlencode or urllib.urlencode

  • primitive_string_types: a tuple of primitive string types (str,) or (str, unicode)
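
The names listed above can be produced by a version check at import time. This sketch shows one common way to do it and is not necessarily how peyutil is written; only the Python 3 branch runs on a modern interpreter.

```python
import sys

if sys.version_info[0] >= 3:
    from io import StringIO
    from urllib.parse import urlencode
    UNICODE = str
    primitive_string_types = (str,)
else:
    # Python 2 only: these names do not exist on Python 3.
    from cStringIO import StringIO
    from urllib import urlencode
    UNICODE = unicode  # noqa: F821
    primitive_string_types = (str, unicode)  # noqa: F821

# Client code can now use the names without checking the version itself.
buf = StringIO()
buf.write("study_id=pg_10")
query = urlencode({"study_id": "pg_10"})
is_text = isinstance(UNICODE("abc"), primitive_string_types)
```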

peyutil.is_int_type(x)[source]

Returns True if x is an integer type.

peyutil.is_str_type(x)[source]

Returns True if x is a string type.

peyutil.get_utf_8_string_io_writer()[source]

Returns a (strio, wrapper) tuple. Backward compat. layer for 2.7.

  1. wrapper.write(…) operations support adding content

  2. When writes are done, call flush_utf_8_writer(wrapper)

  3. the string can be recovered using strio.getvalue() (you’ll need to call strio.getvalue().decode('utf-8') if you are in Python 2.7)

peyutil.flush_utf_8_writer(wrapper)[source]

No-op in Python 3.

You must call this on the wrapper instance from get_utf_8_string_io_writer when you are done writing.
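
The write, flush, recover sequence can be sketched with the standard codecs module. This shows the bytes-backed pattern that the Python 2.7 note refers to, which also runs on Python 3; utf_8_writer_sketch is a hypothetical name, and the real Python 3 branch may simply return a text buffer instead.

```python
import codecs
import io


def utf_8_writer_sketch():
    """Return (strio, wrapper): a bytes buffer and a UTF-8 encoding writer."""
    strio = io.BytesIO()
    wrapper = codecs.getwriter("utf-8")(strio)
    return strio, wrapper


strio, wrapper = utf_8_writer_sketch()
wrapper.write(u"Caf\u00e9 ")   # text goes in through the wrapper
wrapper.write(u"tree")
wrapper.flush()                # make sure everything reached the buffer
text = strio.getvalue().decode("utf-8")
```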

For internal use only

Several functions implemented here have drawbacks (e.g. lack of thread safety) that make them unwise to use in general. They tend to be used in a few spots of other Open Tree of Life code that is executed in contexts in which they are known to be safe to use.

peyutil.any_early_exit(iterable, predicate)[source]

Tests each element in iterable by calling predicate(element).

Returns True on the first element for which predicate is true, and False if there is no such element.
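
The early-exit behaviour amounts to a short loop. In this sketch (any_early_exit_sketch and is_negative are hypothetical names) the predicate records what it was asked about, which shows that elements after the first match are never tested.

```python
def any_early_exit_sketch(iterable, predicate):
    """Return True at the first element satisfying predicate, else False."""
    for element in iterable:
        if predicate(element):
            return True
    return False


seen = []


def is_negative(x):
    seen.append(x)   # record every element that actually gets tested
    return x < 0


found = any_early_exit_sketch([3, -1, 7], is_negative)
none_found = any_early_exit_sketch([], is_negative)
```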

peyutil.get_unique_filepath(stem)[source]

Returns a unique stem# string. NOT thread-safe!

Returns stem, or stem# where # is the smallest positive integer for which the path does not exist. Useful for temp dirs where the client code wants an obvious ordering.
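
A sketch of the numbering rule, assuming the bare stem is tried first and the counter starts at 1 (unique_filepath_sketch is a hypothetical name):

```python
import os
import tempfile


def unique_filepath_sketch(stem):
    """Return stem, or stem + N for the smallest N >= 1 that does not exist."""
    if not os.path.exists(stem):
        return stem
    n = 1
    while os.path.exists("{}{}".format(stem, n)):
        n += 1
    return "{}{}".format(stem, n)


stem = os.path.join(tempfile.mkdtemp(), "run")
first = unique_filepath_sketch(stem)
os.mkdir(first)
second = unique_filepath_sketch(stem)
os.mkdir(second)
third = unique_filepath_sketch(stem)
```

The gap between the existence check and the later creation of the path is why this approach is not thread-safe: two callers can be handed the same name.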

peyutil.pretty_timestamp(t=None, style=0)[source]

NOT Recommended. Simple time formatter. legacy artifact!

Used in peyotl test reporting. t defaults to the current time. If style is 0, strftime uses the Y-m-d format. If style is not 0 and not a string, YmdHMS is the format. Otherwise style is passed to strftime.
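
The three style cases can be sketched as below. Treating t as a datetime.datetime is an assumption of this sketch, and pretty_timestamp_sketch is a hypothetical name.

```python
import datetime


def pretty_timestamp_sketch(t=None, style=0):
    """Format t using a date-only, compact, or caller-supplied format."""
    if t is None:
        t = datetime.datetime.now()
    if isinstance(style, str):
        fmt = style               # caller supplied a strftime format
    elif style == 0:
        fmt = "%Y-%m-%d"          # date only
    else:
        fmt = "%Y%m%d%H%M%S"      # compact date and time
    return t.strftime(fmt)


moment = datetime.datetime(2020, 1, 2, 3, 4, 5)
date_only = pretty_timestamp_sketch(moment)
compact = pretty_timestamp_sketch(moment, style=1)
custom = pretty_timestamp_sketch(moment, style="%H:%M")
```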

peyutil.propinquity_fn_to_study_tree(inp_fn, strip_extension=True)[source]

For internal use only. Parses a filename to study+tree.

This should only be called by propinquity - other code should be treating these filenames (and the keys that are based on them) as opaque strings.

Takes a filename (or key if strip_extension is False), returns (study_id, tree_id)

propinquity provides a map to look up the study ID and tree ID (and git SHA) from these strings.

Utilities for writing tests

peyutil.test.support.helper.testing_read_json(fp)[source]

Reads a UTF-8 JSON from filepath.

peyutil.test.support.helper.testing_write_json(o, fp)[source]

Writes a UTF-8 JSON to filepath.

peyutil.test.support.helper.testing_through_json(d)[source]

Returns a deserialized version of the JSON serialization of d.
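
Round-tripping through JSON is useful in tests because serialization changes some Python types. A minimal sketch using the standard json module (through_json_sketch is a hypothetical name):

```python
import json


def through_json_sketch(d):
    """Serialize d to a JSON string and parse it straight back."""
    return json.loads(json.dumps(d))


original = {"ids": (1, 2), 3: "three"}
round_tripped = through_json_sketch(original)
```

The tuple comes back as a list and the integer key comes back as a string, so comparing against the round-tripped form avoids false mismatches with data that was read from a JSON file.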

peyutil.test.support.helper.testing_dict_eq(a, b)[source]

You should just call a == b. This is a legacy of a verbose dict compare.

peyutil.test.support.helper.testing_conv_key_unicode_literal(d)[source]

Not intended for widespread use. Used in some NexSON tests.

peyutil.test.support.pathmap.get_test_path_mapper()[source]

Factory for PathMapForTests object for the package.

class peyutil.test.support.pathmap.PathMapForTests(path_map_filepath)[source]

Class with attributes that make it easy to find different types of testing data.

Uses the parent directory of path_map_filepath to find other testing dirs.

all_files(prefix)[source]

Returns a set of filepaths to all test data.

amendment_file_obj(filename)[source]

Returns a readable file object from amendment_source_path call.

amendment_obj(filename)[source]

Returns a JSON load result from filename in the amendments test dir.

amendment_source_path(filename=None)[source]

Returns an absolute filepath to filename in TESTS/data/amendments dir.

collection_file_obj(filename)[source]

Returns a readable file object from collection_source_path call.

collection_obj(filename)[source]

Returns a JSON load result from filename in the collection test dir.

collection_source_path(filename=None)[source]

Returns an absolute filepath to filename in TESTS/data/collections dir.

equal_blob_check(unit_test, diff_file_tag, first, second)[source]

Trips unit_test failure if first != second after writing diff files.

json_source_path(filename=None)[source]

Returns the fullpath for testing a JSON in filename.

named_output_path(filename=None, suffix_timestamp=True)[source]

Returns a filepath to filename in the testing output dir.

If suffix_timestamp is True, results of pretty_timestamp will be appended to the filename.

named_output_stream(filename=None, suffix_timestamp=True)[source]

Returns a writable file stream for the file from a named_output_path call.

nexml_source_path(filename=None)[source]

Returns the fullpath for testing a NeXML in filename.

nexson_file_obj(filename)[source]

Returns readable file object for testing NexSON in filename.

nexson_obj(filename)[source]

Returns a JSON load result from filename in the TESTS/data/nexson dir.

nexson_source_path(filename=None)[source]

Returns the fullpath for testing a NexSON in filename.

next_unique_filepath(fp)[source]

Not thread safe! Adds a numeric suffix to fp until the path does not exist.

next_unique_scratch_filepath(fn)[source]

Returns the full path to a scratch file starting with fn in the scratch dir.

script_source_path(filename=None)[source]

Returns the full path to a filename in the package’s scripts dir.

shared_test_dir()[source]

Returns the fullpath to the shared-api-tests dir in the tests.

class peyutil.test.support.struct_diff.DictDiff[source]

Class for creating readable diffs between dicts.

Use the DictDiff.create factory function to avoid pitfalls associated with having to learn all of the methods of this class.

Use the DictDiff.create method.

add_addition(k, v)[source]

Records a new k->v pair. Don’t call, if you used the factory.

add_deletion(k, v)[source]

Records a missing k->v pair. Don’t call, if you used the factory.

add_modification(k, v)[source]

Records a new value for k->v pair. Don’t call, if you used the factory.

additions_expr(par='')[source]

Returns a list of strings describing additions. par can be a prefix.

static create(src, dest, **kwargs)[source]

Factory function for a DictDiff object comparing src to dest.

Inefficient comparison of src and dest dicts. Recurses through dicts and lists. Returns None if there is no difference and a DictDiff object if there are differences. **kwargs can contain:

  • wrap_dict_in_list: default False. If True and one value is present as a list and the other is a dict, then the dict will be converted to a list of one dict for the comparison. This is helpful given the BadgerFish convention of emitting single elements as a dict, but >1 elements as a list of dicts.

deletions_expr(par='')[source]

Returns a list of strings describing deletions. par can be a prefix.

edits_expr(par='')[source]

Should probably be private. Returns a list of strings describing edits.

finish()[source]

Sorts edits. Not needed if you used DictDiff.create.

modification_expr(par='')[source]

Returns a list of strings describing modifications. par can be a prefix.

patch(src)[source]

Applies the DictDiff to src dict.
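
The additions, deletions, and modifications that a dict diff records, and what patching means, can be shown with a flat (non-recursive) sketch. This is far simpler than DictDiff: both functions are hypothetical, the diff is a plain dict rather than an object, and nested dicts and lists are compared only by equality.

```python
def flat_dict_diff_sketch(src, dest):
    """Return None if src == dest, else the edits that turn src into dest."""
    additions = {k: v for k, v in dest.items() if k not in src}
    deletions = {k: v for k, v in src.items() if k not in dest}
    modifications = {k: v for k, v in dest.items()
                     if k in src and src[k] != v}
    if not (additions or deletions or modifications):
        return None
    return {"additions": additions,
            "deletions": deletions,
            "modifications": modifications}


def patch_sketch(src, diff):
    """Apply a diff from flat_dict_diff_sketch to a copy of src."""
    patched = dict(src)
    for k in diff["deletions"]:
        del patched[k]
    patched.update(diff["additions"])
    patched.update(diff["modifications"])
    return patched


src = {"name": "Apis", "rank": "genus", "flags": []}
dest = {"name": "Apis", "rank": "species", "ott_id": 1}
diff = flat_dict_diff_sketch(src, dest)
patched = patch_sketch(src, diff)
```

Returning None for equal inputs mirrors the factory convention described above, which lets a test write a plain truthiness check on the result.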

class peyutil.test.support.struct_diff.ListDiff[source]

Class for creating readable diffs between lists.

Use the ListDiff.create factory function to avoid pitfalls associated with having to learn all of the methods of this class.

Use the ListDiff.create method.

add_deletion(ind, obj)[source]

Records a missing obj at ind. Don’t call, if you used the factory.

add_insertion(ind, add_offset, obj)[source]

Records a new obj at ind. Don’t call, if you used the factory.

add_modificaton(ind, obj)[source]

Records altered obj at ind. Don’t call, if you used the factory.

additions_expr(par='')[source]

Returns a list of strings describing additions. par can be a prefix.

static create(src, dest, **kwargs)[source]

Factory function for a ListDiff object comparing src to dest.

Returns None if src and dest are equal. Inefficient comparison of src and dest lists. Recurses through dicts and lists. Returns (is_identical, modifications, additions, deletions), where is_identical is a boolean that is True if the dicts have contents that compare equal, and the other three are dicts:

  • attributes in both, but with different values

  • attributes in dest but not in src

  • attributes in src but not in dest

Returned dicts may alias objects in src and dest.

deletions_expr(par='')[source]

Returns a list of strings describing deletions. par can be a prefix.

edits_expr(par='')[source]

Should probably be private. Returns a list of strings describing edits.

finish()[source]

Sorts edits. Not needed if you used the factory.

modification_expr(par='')[source]

Returns a list of strings describing modifications. par can be a prefix.

patch(src)[source]

Applies the ListDiff to src list.

class peyutil.test.support.struct_diff.ListEdit(src_ind, obj)[source]

Base class for encapsulating a single edit to a list.

Notes obj at src_ind in edited form.

class peyutil.test.support.struct_diff.ListAddition(src_ind, add_offset, obj)[source]

Subclass for encapsulating an addition to a list.

Same as base class init.

__repr__()[source]

Standard repr.

__str__()[source]

Str calls repr.

class peyutil.test.support.struct_diff.ListElModification(src_ind, obj)[source]

Subclass for encapsulating a modification to a list.

Same as base class init.

__repr__()[source]

Standard repr.

__str__()[source]

Str calls repr.