Storage
Pymerkle is unopinionated on how leaves are appended to the tree, i.e., how entries should be stored in concrete. “Leaves” is an abstraction for the contiguously indexed data which the tree operates upon, no matter what their concrete form in persistent or volatile memory is. Specifying how to store entries and how to encode them (so that they become amenable to hashing operations) belongs to the particular application logic and amounts to implementing the internal storage interface presented in this section.
Interface
A Merkle-tree implementation is a concrete subclass of the BaseMerkleTree
abstract base class. The latter encapsulates the cryptographic
functionality in a storage agnostic fashion, i.e., without making assumptions
about how entries are stored and accessed. It operates on top of an abstract
storage interface, which is to be implemented by any concrete subclass:
from pymerkle import BaseMerkleTree
class MerkleTree(BaseMerkleTree):
def __init__(self, algorithm='sha256'):
"""
Storage setup and superclass initialization
"""
def _encode_entry(self, data):
"""
Prepares data entry for hashing
"""
def _store_leaf(self, data, digest):
"""
Stores data hash in a new leaf and returns index
"""
def _get_leaf(self, index):
"""
Returns the hash stored by the leaf specified
"""
def _get_leaves(self, offset, width):
"""
Returns hashes corresponding to the specified leaf range
"""
def _get_size(self):
"""
Returns the current number of leaves
"""
_encode_entry
: converts data entry to binary, so that it becomes amenable to hashing._store_leaf
: stores the output of hashing alogn with the original entry and returns the leaf index._get_leaf
: leaf hash by index (counting from one)_get_leaves
: an iterable of the leaf hashes corresponding to the specified range_get_size
: current tree size (number of leaves).
Various strategies are here possible. For example, data entry could be
further processed by _store_leaf
in order to conform to a given database
schema and have the hash value stored in the appropriate table.
Or, if a predefined schema is given that does not make space for hashes,
the hash value could be forwarded to a dedicated datastore for future access;
_get_leaf
and _get_leaves
would then have to access that separate datastore
in order to make available the hash value.
Note
It is important to implement _get_leaves
as efficiently as
possible depending on your working framework.
See Optimizations for details.
Here the exact interface to be implemented:
# pymerkle/base.py
class BaseMerkleTree(MerkleHasher, metaclass=ABCMeta):
...
@abstractmethod
def _encode_entry(self, data):
"""
Should return the binary format of the provided data entry.
:param data: data to encode
:type data: whatever expected according to application logic
:rtype: bytes
"""
@abstractmethod
def _store_leaf(self, data, digest):
"""
Should create a new leaf storing the provided data entry along with
its hash value.
:param data: data entry
:type data: whatever expected according to application logic
:param digest: hashed data
:type digest: bytes
:returns: index of newly appended leaf counting from one
:rtype: int
"""
@abstractmethod
def _get_leaf(self, index):
"""
Should return the hash stored at the specified leaf.
:param index: leaf index counting from one
:type index: int
:rtype: bytes
"""
@abstractmethod
def _get_leaves(self, offset, width):
"""
Should return in respective order the hashes stored by the leaves in
the specified range.
:param offset: starting position counting from zero
:type offset: int
:param width: number of leaves to consider
:type width: int
:rtype: iterable of bytes
"""
@abstractmethod
def _get_size(self):
"""
Should return the current number of leaves
:rtype: int
"""
...
Implementations
Pymerkle provides out of the box the following concrete implementations
of BaseMerkleTree
.
In memory
Warning
This is a very memory inefficient implementation. Use it for debugging, testing and investigating the tree structure.
InmemoryTree
is a non-persistent implementation where nodes reside in
runtime.
from pymerkle import InmemoryTree
tree = InmemoryTree(algorithm='sha256')
Data is expected to be provided in binary:
index = tree.append_entry(b'foo')
It is hashed without further processing and can be accessed as follows:
data = tree.leaves[index - 1].entry
assert data == b'foo'
State coincides with the value of the current root-node:
assert tree.get_state() == tree.root.value
Nodes have a right
, left
and parent
attribute, pointing to their
right child, left child and parent node respectively. (Leaf nodes have no
children, whereas the current root-node has no parent). These linkages allow
for concrete path traversals. For example, the following loop detects the
root-node starting from the first leaf of a non-empty tree:
leaf = tree.leaves[0]
curr = leaf
while curr.parent:
curr = curr.parent
assert curr == tree.root
Concrete path traversals are used under the hood for visualizing the tree by means of printing:
>>> print(tree)
└─346ec544...
├──bbe0bdaf...
│ ├──39286a4a...
│ │ ├──1d2039fa...
│ │ └──48590412...
│ └──0bf15c4f...
│ ├──b06d6958...
│ └──5a43bc14...
└──4c715fb1...
├──7a4b8eff...
│ ├──2e219794...
│ └──1c0c3f26...
└──e9345fea...
├──2c3bb97e...
└──dcd08bea...
Sqlite
SqliteTree
uses a SQLite database to persistently store entries.
It is a wrapper of sqlite3, suitable for leightweight applications
that do not require separate server processes for the database.
from pymerkle import SqliteTree
tree = SqliteTree('merkle.db')
This opens a connection to the provided database, which will also be created if not already existent.
Note
The database schema consists of a single table called leaf with two columns: index, which is the primary key serving as leaf index, and entry, which is a blob field storing the appended data.
Data is expected to be provided in binary:
index = tree.append_entry(b'foo')
It is hashed without further processing and can be accessed as follows:
data = tree.get_entry(index)
assert data == b'foo'
In order to efficiently append multiple entries at once, you can do the following:
entries = [f'entry-{i + 1}'.encode() for i in range(100000)]
tree.append_entries(entries, chunksize=1024)
where chunksize
controls the number of insertions per database transaction
(defaults to 100,000).
It is suggested to close the connection to the database when ready:
tree.con.close()
Alternatively, initialize the tree as context-manager to ensure that this will be done without taking explicit care:
with SqliteTree('merkle.db') as tree:
...
Examples
Warning
The following exaples are only for the purpose of reference and understanding
Simple list
This is a simple non-persistent implementation utilizing a list as storage. It expects entries to be strings, which it encodes in utf-8 before hashing.
from pymerkle import BaseMerkleTree
class MerkleTree(BaseMerkleTree):
def __init__(self, algorithm='sha256'):
self.hashes = []
super().__init__(algorithm)
def _encode_entry(self, data):
return data.encode('utf-8')
def _store_leaf(self, data, digest):
self.hashes += [digest]
return len(self.hashes)
def _get_leaf(self, index):
value = self.hashes[index - 1]
return value
def _get_leaves(self, offset, width):
values = self.hashes[offset: offset + width]
return values
def _get_size(self):
return len(self.hashes)
Unix DBM
This is a hasty implementing using dbm to persistently store entries in
a "merkledb"
file. It expects strings as entries and encodes them in
utf-8 before hashing.
import dbm
from pymerkle import BaseMerkleTree
class MerkleTree(BaseMerkleTree):
def __init__(self, algorithm='sha256'):
self.dbfile = 'merkledb'
self.mode = 0o666
with dbm.open(self.dbfile, 'c', mode=self.mode) as db:
pass
super().__init__(algorithm)
def _encode_entry(self, data):
return data.encode('utf-8')
def _store_leaf(self, data, digest):
with dbm.open(self.dbfile, 'w', mode=self.mode) as db:
index = len(db) + 1
db[hex(index)] = b'|'.join(data, digest)
return index
def _get_leaf(self, index):
with dbm.open(self.dbfile, 'r', mode=self.mode) as db:
value = db[hex(index)].split(b'|')[1]
return value
def _get_leaves(self, offset, width):
values = []
with dbm.open(self.dbfile, 'r', mode=self.mode) as db:
for index in range(offset + 1, width + 1):
value = db[hex(index)].split(b'|')[index]
values += [value]
return value
def _get_size(self):
with dbm.open(self.dbfile, 'r', mode=self.mode) as db:
size = len(db)
return size