[bitcoin-dev] Merkle trees and mountain ranges

Discussion:

Bram Cohen via bitcoin-dev

2016-06-15 00:14:23 UTC

This is in response to Peter Todd's proposal for Merkle Mountain Range
commitments in blocks:

https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html

I'm in strong agreement that there's a compelling need to put UTXO
commitments in blocks, and that the big barrier to getting it done is
performance, particularly latency. But I have strong disagreements (or
perhaps the right word is skepticism) about the details.

Peter proposes that there should be both UTXO and STXO commitments, and
they should be based on Merkle Mountain Ranges based on Patricia Tries. My
first big disagreement is about the need for STXO commitments. I think
they're unnecessary and a performance problem. The STXO set is much larger
than the utxo set and requires much more memory and horespower to maintain.
Most if not all of its functionality can in practice be done using the utxo
set. Almost anything accepting proofs of inclusion and exclusion will have
a complete history of block headers, so to prove inclusion in the stxo set
you can use a utxo proof of inclusion in the past and a proof of exclusion
for the most recent block. In the case of a txo which has never been
included at all, it's generally possible to show that an ancestor of the
txo in question was at one point included but that an incompatible
descendant of it (or the ancestor itself) is part of the current utxo set.
Generating these sorts of proofs efficiently can for some applications
require a complete STXO set, but that can done with a non-merkle set,
getting the vastly better performance of an ordinary non-cryptographic
hashtable.

The fundamental approach to handling the latency problem is to have the
utxo commitments trail a bit. Computing utxo commitments takes a certain
amount of time, too much to hold up block propagation but still hopefully
vastly less than the average amount of time between blocks. Trailing by a
single block is probably a bad idea because you sometimes get blocks back
to back, but you never get blocks back to back to back to back. Having the
utxo set be trailing by a fixed amount - five blocks is probably excessive
- would do a perfectly good job of keeping latency from every becoming an
issue. Smaller commitments for the utxos added and removed in each block
alone could be added without any significant performance penalty. That way
all blocks would have sufficient commitments for a completely up to date
proofs of inclusion and exclusion. This is not a controversial approach.

Now I'm going to go out on a limb. My thesis is that usage of a mountain
range is unnecessary, and that a merkle tree in the raw can be made
serviceable by sprinkling magic pixie dust on the performance problem.

There are two causes of performance problems for merkle trees: hashing
operations and memory cache misses. For hashing functions, the difference
between a mountain range and a straight merkle tree is roughly that in a
mountain range there's one operation for each new update times the number
of times that thing will get merged into larger hills. If there are fewer
levels of hills the number of operations is less but the expense of proof
of inclusion will be larger. For raw merkle trees the number of operations
per thing added is the log base 2 of the number of levels in the tree,
minus the log base 2 of the number of things added at once since you can do
lazy evaluation. For practical Bitcoin there are (very roughly) a million
things stored, or 20 levels, and there are (even more roughly) about a
thousand things stored per block, so each thing forces about 20 - 10 = 10
operations. If you follow the fairly reasonable guideline of mountain range
hills go up by factors of four, you instead have 20/2 = 10 operations per
thing added amortized. Depending on details this comparison can go either
way but it's roughly a wash and the complexity of a mountain range is
clearly not worth it at least from the point of view of CPU costs.

But CPU costs aren't the main performance problem in merkle trees. The
biggest issues is cache misses, specifically l1 and l2 cache misses. These
tend to take a long time to do, resulting in the CPU spending most of its
time sitting around doing nothing. A naive tree implementation is pretty
much the worst thing you can possibly build from a cache miss standpoint,
and its performance will be completely unacceptable. Mountain ranges do a
fabulous job of fixing this problem, because all their updates are merges
so the metrics are more like cache misses per block instead of cache misses
per transaction.

The magic pixie dust I mentioned earlier involves a bunch of subtle
implementation details to keep cache coherence down which should get the
number of cache misses per transaction down under one, at which point it
probably isn't a bottleneck any more. There is an implementation in the
works here:

https://github.com/bramcohen/MerkleSet

This implementation isn't finished yet! I'm almost there, and I'm
definitely feeling time pressure now. I've spent quite a lot of time on
this, mostly because of a bunch of technical reworkings which proved
necessary. This is the last time I ever write a database for kicks. But
this implementation is good on all important dimensions, including:

Lazy root calculation
Few l1 and l2 cache misses
Small proofs of inclusion/exclusion
Reasonably simple implementation
Reasonably efficient in memory
Reasonable defense against malicious insertion attacks

There is a bit of a false dichotomy with the mountain range approach.
Mountain ranges need underlying merkle trees, and mine are semantically
nearly identically to Peter's. This is not a coincidence - I adopted
patricia tries at his suggestion. There are a bunch of small changes which
allow a more efficient implementation. I believe that my underlying merkle
tree is unambiguously superior in every way, but the question of whether a
mountain range is worth it is one which can only be answered empirically,
and that requires a bunch of implementation work to be done, starting with
me finishing my merkle tree implementation and then somebody porting it to
C and optimizing it. The Python version has details which are ridiculous
and only make sense once it gets ported, and even under the best of
conditions Python performance is not strongly indicative of C performance.

Peter Todd via bitcoin-dev

2016-06-16 00:10:40 UTC

Permalink

Post by Bram Cohen via bitcoin-dev
This is in response to Peter Todd's proposal for Merkle Mountain Range
https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html
I'm in strong agreement that there's a compelling need to put UTXO
commitments in blocks, and that the big barrier to getting it done is
performance, particularly latency. But I have strong disagreements (or
perhaps the right word is skepticism) about the details.
Peter proposes that there should be both UTXO and STXO commitments, and

No, that's incorrect - I'm only proposing TXO commitments, not UTXO nor STXO
commitments.

Post by Bram Cohen via bitcoin-dev
they should be based on Merkle Mountain Ranges based on Patricia Tries. My
first big disagreement is about the need for STXO commitments. I think
they're unnecessary and a performance problem. The STXO set is much larger
than the utxo set and requires much more memory and horespower to maintain.

Again, I'm not proposing STXO commitments precisely because the set of _spent_
transactions grows without bound. TXO commitments with committed sums of
remaining unspent TXO's and with pruning of old history are special in this
regard, because once spent the data associated with spent transactions can be
discarded completely, and at the same time, data associated with old history
can be pruned with responsibility for keeping it resting on the shoulders of
those owning those coins.

Post by Bram Cohen via bitcoin-dev
Most if not all of its functionality can in practice be done using the utxo
set. Almost anything accepting proofs of inclusion and exclusion will have
a complete history of block headers, so to prove inclusion in the stxo set
you can use a utxo proof of inclusion in the past and a proof of exclusion
for the most recent block. In the case of a txo which has never been
included at all, it's generally possible to show that an ancestor of the
txo in question was at one point included but that an incompatible
descendant of it (or the ancestor itself) is part of the current utxo set.
Generating these sorts of proofs efficiently can for some applications
require a complete STXO set, but that can done with a non-merkle set,
getting the vastly better performance of an ordinary non-cryptographic
hashtable.

TXO commitments allows you to do all of this without requiring miners to have
unbounded storage to create new blocks.

Post by Bram Cohen via bitcoin-dev
The fundamental approach to handling the latency problem is to have the
utxo commitments trail a bit. Computing utxo commitments takes a certain
amount of time, too much to hold up block propagation but still hopefully
vastly less than the average amount of time between blocks. Trailing by a
single block is probably a bad idea because you sometimes get blocks back
to back, but you never get blocks back to back to back to back. Having the
utxo set be trailing by a fixed amount - five blocks is probably excessive
- would do a perfectly good job of keeping latency from every becoming an
issue. Smaller commitments for the utxos added and removed in each block
alone could be added without any significant performance penalty. That way
all blocks would have sufficient commitments for a completely up to date
proofs of inclusion and exclusion. This is not a controversial approach.

Agreed - regardless of approach adding latency to commitment calculations of
all kinds is something I think we all agree can work in principle, although
obviously it should be a last resort technique when optimization fails.

Post by Bram Cohen via bitcoin-dev
Now I'm going to go out on a limb. My thesis is that usage of a mountain
range is unnecessary, and that a merkle tree in the raw can be made
serviceable by sprinkling magic pixie dust on the performance problem.

It'd help if you specified exactly what type of merkle tree you're talking
about here; remember that the certificate transparency RFC appears to have
reinvented merkle mountain ranges, and they call them "merkle trees". Bitcoin
meanwhile uses a so-called "merkle tree" that's broken, and Zcash uses a
partially filled fixed-sized perfect tree.

Post by Bram Cohen via bitcoin-dev
There are two causes of performance problems for merkle trees: hashing
operations and memory cache misses. For hashing functions, the difference
between a mountain range and a straight merkle tree is roughly that in a
mountain range there's one operation for each new update times the number
of times that thing will get merged into larger hills. If there are fewer
levels of hills the number of operations is less but the expense of proof
of inclusion will be larger. For raw merkle trees the number of operations
per thing added is the log base 2 of the number of levels in the tree,
minus the log base 2 of the number of things added at once since you can do
lazy evaluation. For practical Bitcoin there are (very roughly) a million
things stored, or 20 levels, and there are (even more roughly) about a
thousand things stored per block, so each thing forces about 20 - 10 = 10
operations. If you follow the fairly reasonable guideline of mountain range
hills go up by factors of four, you instead have 20/2 = 10 operations per
thing added amortized. Depending on details this comparison can go either
way but it's roughly a wash and the complexity of a mountain range is
clearly not worth it at least from the point of view of CPU costs.

I'm having a hard time understanding this paragraph; could you explain what you
think is happening when things are "merged into larger hills"?

Post by Bram Cohen via bitcoin-dev
But CPU costs aren't the main performance problem in merkle trees. The
biggest issues is cache misses, specifically l1 and l2 cache misses. These
tend to take a long time to do, resulting in the CPU spending most of its
time sitting around doing nothing. A naive tree implementation is pretty
much the worst thing you can possibly build from a cache miss standpoint,
and its performance will be completely unacceptable. Mountain ranges do a
fabulous job of fixing this problem, because all their updates are merges
so the metrics are more like cache misses per block instead of cache misses
per transaction.
The magic pixie dust I mentioned earlier involves a bunch of subtle
implementation details to keep cache coherence down which should get the
number of cache misses per transaction down under one, at which point it
probably isn't a bottleneck any more. There is an implementation in the

As UTXO/STXO/TXO sets are all enormously larger than L1/L2 cache, it's
impossible to get CPU cache misses below one for update operations. The closest
thing to an exception is MMR's, which due to their insertion-ordering could
have good cache locality for appends, in the sense that the mountain tips
required to recalculate the MMR tip digest will already be in cache from the
previous append. But that's not sufficient, as you also need to modify old
TXO's further back in the tree to mark them as spent - that data is going to be
far larger than L1/L2 cache.

Post by Bram Cohen via bitcoin-dev
https://github.com/bramcohen/MerkleSet
This implementation isn't finished yet! I'm almost there, and I'm
definitely feeling time pressure now. I've spent quite a lot of time on
this, mostly because of a bunch of technical reworkings which proved
necessary. This is the last time I ever write a database for kicks. But
Lazy root calculation
Few l1 and l2 cache misses
Small proofs of inclusion/exclusion

Have you looked at the pruning system that my proofchains work implements?

--
https://petertodd.org 'peter'[:-1]@petertodd.org

Bram Cohen via bitcoin-dev

2016-06-16 01:16:26 UTC

Permalink

Post by Peter Todd via bitcoin-dev

Post by Bram Cohen via bitcoin-dev
Peter proposes that there should be both UTXO and STXO commitments, and

No, that's incorrect - I'm only proposing TXO commitments, not UTXO nor STXO
commitments.

What do you mean by TXO commitments? If you mean that it only records
insertions rather than deletions, then that can do many of the same proofs
but has no way of proving that something is currently in the UTXO set,
which is functionality I'd like to provide.

When I say 'merkle tree' what I mean is a patricia trie. What I assume is
meant by a merkle mountain range is a series of patricia tries of
decreasing size each of which is an addition to the previous one, and
they're periodically consolidated into larger tries so the old ones can go
away. This data structure has the nice property that it's both in sorted
order and has less than one cache miss per operation because the
consolidation operations can be batched and done linearly. There are a
number of different things you could be describing if I misunderstood.

Post by Peter Todd via bitcoin-dev
I'm not proposing STXO commitments precisely because the set of _spent_
transactions grows without bound.

I'm worried that once there's real transaction fees everyone might stop
consolidating dust and the set of unspent transactions might grow without
bound as well, but that's a topic for another day.

Post by Peter Todd via bitcoin-dev

It'd help if you specified exactly what type of merkle tree you're talking
about here; remember that the certificate transparency RFC appears to have
reinvented merkle mountain ranges, and they call them "merkle trees".
Bitcoin
meanwhile uses a so-called "merkle tree" that's broken, and Zcash uses a
partially filled fixed-sized perfect tree.

What I'm making is a patricia trie. Its byte level definition is very
similar to the one in your MMR codebase.

Each level of the tree has a single metadata byte and followed by two
hashes. The hashes are truncated by one byte and the hash function is a
non-padding variant of sha256 (right now it's just using regular sha256,
but that's a nice optimization which allows everything to fit in a single
block).

The possible metadata values are: TERM0, TERM1, TERMBOTH, ONLY0, ONLY1,
MIDDLE. They mean:

TERM0, TERM1: There is a single thing in the tree on the specified side.
The thing hashed on that side is that thing verbatim. The other side has
more than one thing and the hash of it is the root of everything below.

TERMBOTH: There are exactly two things below and they're included inline.
Note that two things is a special case, if there are more you sometimes
have ONLY0 or ONLY1.

ONLY0, ONLY1: There are more than two things below and they're all on the
same side. This makes proofs of inclusion and exclusion simpler, and makes
some implementation details easier, for example there's always something at
every level with perfect memory positioning. It doesn't cause much extra
memory usage because of the TERMBOTH exception for exactly two things.

MIDDLE: There two or more things on both sides.

There's also a SINGLETON special case for a tree which contains only one
thing, and an EMPTY special value for tree which doesn't contain anything.

The main differences to your patricia trie are the non-padding sha256 and
that each level doesn't hash in a record of its depth and the usage of
ONLY0 and ONLY1.

Post by Peter Todd via bitcoin-dev

operations

Post by Bram Cohen via bitcoin-dev
per thing added is the log base 2 of the number of levels in the tree,
minus the log base 2 of the number of things added at once since you can

Post by Bram Cohen via bitcoin-dev
lazy evaluation. For practical Bitcoin there are (very roughly) a million
things stored, or 20 levels, and there are (even more roughly) about a
thousand things stored per block, so each thing forces about 20 - 10 = 10
operations. If you follow the fairly reasonable guideline of mountain

range

Post by Bram Cohen via bitcoin-dev
hills go up by factors of four, you instead have 20/2 = 10 operations per
thing added amortized. Depending on details this comparison can go either
way but it's roughly a wash and the complexity of a mountain range is
clearly not worth it at least from the point of view of CPU costs.

I'm having a hard time understanding this paragraph; could you explain what you
think is happening when things are "merged into larger hills"?

I'm talking about the recalculation of mountain tips, assuming we're on the
same page about what 'MMR' means.

Post by Peter Todd via bitcoin-dev
As UTXO/STXO/TXO sets are all enormously larger than L1/L2 cache, it's
impossible to get CPU cache misses below one for update operations. The closest
thing to an exception is MMR's, which due to their insertion-ordering could
have good cache locality for appends, in the sense that the mountain tips
required to recalculate the MMR tip digest will already be in cache from the
previous append. But that's not sufficient, as you also need to modify old
TXO's further back in the tree to mark them as spent - that data is going to be
far larger than L1/L2 cache.

This makes me think we're talking about subtly different things for MMRs.
The ones I described above have sub-1 cache miss per update due to the
amortized merging together of old mountains.

Technically even a patricia trie utxo commitment can have sub-1 cache
misses per update if some of the updates in a single block are close to
each other in memory. I think I can get practical Bitcoin updates down to a
little bit less than one l2 cache miss per update, but not a lot less.

Peter Todd via bitcoin-dev

2016-06-16 03:26:12 UTC