Discussion:
Merkle trees and mountain ranges
(too old to reply)
Bram Cohen via bitcoin-dev
2016-06-15 00:14:23 UTC
Permalink
This is in response to Peter Todd's proposal for Merkle Mountain Range
commitments in blocks:

https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html

I'm in strong agreement that there's a compelling need to put UTXO
commitments in blocks, and that the big barrier to getting it done is
performance, particularly latency. But I have strong disagreements (or
perhaps the right word is skepticism) about the details.

Peter proposes that there should be both UTXO and STXO commitments, and
they should be based on Merkle Mountain Ranges based on Patricia Tries. My
first big disagreement is about the need for STXO commitments. I think
they're unnecessary and a performance problem. The STXO set is much larger
than the utxo set and requires much more memory and horespower to maintain.
Most if not all of its functionality can in practice be done using the utxo
set. Almost anything accepting proofs of inclusion and exclusion will have
a complete history of block headers, so to prove inclusion in the stxo set
you can use a utxo proof of inclusion in the past and a proof of exclusion
for the most recent block. In the case of a txo which has never been
included at all, it's generally possible to show that an ancestor of the
txo in question was at one point included but that an incompatible
descendant of it (or the ancestor itself) is part of the current utxo set.
Generating these sorts of proofs efficiently can for some applications
require a complete STXO set, but that can done with a non-merkle set,
getting the vastly better performance of an ordinary non-cryptographic
hashtable.

The fundamental approach to handling the latency problem is to have the
utxo commitments trail a bit. Computing utxo commitments takes a certain
amount of time, too much to hold up block propagation but still hopefully
vastly less than the average amount of time between blocks. Trailing by a
single block is probably a bad idea because you sometimes get blocks back
to back, but you never get blocks back to back to back to back. Having the
utxo set be trailing by a fixed amount - five blocks is probably excessive
- would do a perfectly good job of keeping latency from every becoming an
issue. Smaller commitments for the utxos added and removed in each block
alone could be added without any significant performance penalty. That way
all blocks would have sufficient commitments for a completely up to date
proofs of inclusion and exclusion. This is not a controversial approach.

Now I'm going to go out on a limb. My thesis is that usage of a mountain
range is unnecessary, and that a merkle tree in the raw can be made
serviceable by sprinkling magic pixie dust on the performance problem.

There are two causes of performance problems for merkle trees: hashing
operations and memory cache misses. For hashing functions, the difference
between a mountain range and a straight merkle tree is roughly that in a
mountain range there's one operation for each new update times the number
of times that thing will get merged into larger hills. If there are fewer
levels of hills the number of operations is less but the expense of proof
of inclusion will be larger. For raw merkle trees the number of operations
per thing added is the log base 2 of the number of levels in the tree,
minus the log base 2 of the number of things added at once since you can do
lazy evaluation. For practical Bitcoin there are (very roughly) a million
things stored, or 20 levels, and there are (even more roughly) about a
thousand things stored per block, so each thing forces about 20 - 10 = 10
operations. If you follow the fairly reasonable guideline of mountain range
hills go up by factors of four, you instead have 20/2 = 10 operations per
thing added amortized. Depending on details this comparison can go either
way but it's roughly a wash and the complexity of a mountain range is
clearly not worth it at least from the point of view of CPU costs.

But CPU costs aren't the main performance problem in merkle trees. The
biggest issues is cache misses, specifically l1 and l2 cache misses. These
tend to take a long time to do, resulting in the CPU spending most of its
time sitting around doing nothing. A naive tree implementation is pretty
much the worst thing you can possibly build from a cache miss standpoint,
and its performance will be completely unacceptable. Mountain ranges do a
fabulous job of fixing this problem, because all their updates are merges
so the metrics are more like cache misses per block instead of cache misses
per transaction.

The magic pixie dust I mentioned earlier involves a bunch of subtle
implementation details to keep cache coherence down which should get the
number of cache misses per transaction down under one, at which point it
probably isn't a bottleneck any more. There is an implementation in the
works here:

https://github.com/bramcohen/MerkleSet

This implementation isn't finished yet! I'm almost there, and I'm
definitely feeling time pressure now. I've spent quite a lot of time on
this, mostly because of a bunch of technical reworkings which proved
necessary. This is the last time I ever write a database for kicks. But
this implementation is good on all important dimensions, including:

Lazy root calculation
Few l1 and l2 cache misses
Small proofs of inclusion/exclusion
Reasonably simple implementation
Reasonably efficient in memory
Reasonable defense against malicious insertion attacks

There is a bit of a false dichotomy with the mountain range approach.
Mountain ranges need underlying merkle trees, and mine are semantically
nearly identically to Peter's. This is not a coincidence - I adopted
patricia tries at his suggestion. There are a bunch of small changes which
allow a more efficient implementation. I believe that my underlying merkle
tree is unambiguously superior in every way, but the question of whether a
mountain range is worth it is one which can only be answered empirically,
and that requires a bunch of implementation work to be done, starting with
me finishing my merkle tree implementation and then somebody porting it to
C and optimizing it. The Python version has details which are ridiculous
and only make sense once it gets ported, and even under the best of
conditions Python performance is not strongly indicative of C performance.
Peter Todd via bitcoin-dev
2016-06-16 00:10:40 UTC
Permalink
Post by Bram Cohen via bitcoin-dev
This is in response to Peter Todd's proposal for Merkle Mountain Range
https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html
I'm in strong agreement that there's a compelling need to put UTXO
commitments in blocks, and that the big barrier to getting it done is
performance, particularly latency. But I have strong disagreements (or
perhaps the right word is skepticism) about the details.
Peter proposes that there should be both UTXO and STXO commitments, and
No, that's incorrect - I'm only proposing TXO commitments, not UTXO nor STXO
commitments.
Post by Bram Cohen via bitcoin-dev
they should be based on Merkle Mountain Ranges based on Patricia Tries. My
first big disagreement is about the need for STXO commitments. I think
they're unnecessary and a performance problem. The STXO set is much larger
than the utxo set and requires much more memory and horespower to maintain.
Again, I'm not proposing STXO commitments precisely because the set of _spent_
transactions grows without bound. TXO commitments with committed sums of
remaining unspent TXO's and with pruning of old history are special in this
regard, because once spent the data associated with spent transactions can be
discarded completely, and at the same time, data associated with old history
can be pruned with responsibility for keeping it resting on the shoulders of
those owning those coins.
Post by Bram Cohen via bitcoin-dev
Most if not all of its functionality can in practice be done using the utxo
set. Almost anything accepting proofs of inclusion and exclusion will have
a complete history of block headers, so to prove inclusion in the stxo set
you can use a utxo proof of inclusion in the past and a proof of exclusion
for the most recent block. In the case of a txo which has never been
included at all, it's generally possible to show that an ancestor of the
txo in question was at one point included but that an incompatible
descendant of it (or the ancestor itself) is part of the current utxo set.
Generating these sorts of proofs efficiently can for some applications
require a complete STXO set, but that can done with a non-merkle set,
getting the vastly better performance of an ordinary non-cryptographic
hashtable.
TXO commitments allows you to do all of this without requiring miners to have
unbounded storage to create new blocks.
Post by Bram Cohen via bitcoin-dev
The fundamental approach to handling the latency problem is to have the
utxo commitments trail a bit. Computing utxo commitments takes a certain
amount of time, too much to hold up block propagation but still hopefully
vastly less than the average amount of time between blocks. Trailing by a
single block is probably a bad idea because you sometimes get blocks back
to back, but you never get blocks back to back to back to back. Having the
utxo set be trailing by a fixed amount - five blocks is probably excessive
- would do a perfectly good job of keeping latency from every becoming an
issue. Smaller commitments for the utxos added and removed in each block
alone could be added without any significant performance penalty. That way
all blocks would have sufficient commitments for a completely up to date
proofs of inclusion and exclusion. This is not a controversial approach.
Agreed - regardless of approach adding latency to commitment calculations of
all kinds is something I think we all agree can work in principle, although
obviously it should be a last resort technique when optimization fails.
Post by Bram Cohen via bitcoin-dev
Now I'm going to go out on a limb. My thesis is that usage of a mountain
range is unnecessary, and that a merkle tree in the raw can be made
serviceable by sprinkling magic pixie dust on the performance problem.
It'd help if you specified exactly what type of merkle tree you're talking
about here; remember that the certificate transparency RFC appears to have
reinvented merkle mountain ranges, and they call them "merkle trees". Bitcoin
meanwhile uses a so-called "merkle tree" that's broken, and Zcash uses a
partially filled fixed-sized perfect tree.
Post by Bram Cohen via bitcoin-dev
There are two causes of performance problems for merkle trees: hashing
operations and memory cache misses. For hashing functions, the difference
between a mountain range and a straight merkle tree is roughly that in a
mountain range there's one operation for each new update times the number
of times that thing will get merged into larger hills. If there are fewer
levels of hills the number of operations is less but the expense of proof
of inclusion will be larger. For raw merkle trees the number of operations
per thing added is the log base 2 of the number of levels in the tree,
minus the log base 2 of the number of things added at once since you can do
lazy evaluation. For practical Bitcoin there are (very roughly) a million
things stored, or 20 levels, and there are (even more roughly) about a
thousand things stored per block, so each thing forces about 20 - 10 = 10
operations. If you follow the fairly reasonable guideline of mountain range
hills go up by factors of four, you instead have 20/2 = 10 operations per
thing added amortized. Depending on details this comparison can go either
way but it's roughly a wash and the complexity of a mountain range is
clearly not worth it at least from the point of view of CPU costs.
I'm having a hard time understanding this paragraph; could you explain what you
think is happening when things are "merged into larger hills"?
Post by Bram Cohen via bitcoin-dev
But CPU costs aren't the main performance problem in merkle trees. The
biggest issues is cache misses, specifically l1 and l2 cache misses. These
tend to take a long time to do, resulting in the CPU spending most of its
time sitting around doing nothing. A naive tree implementation is pretty
much the worst thing you can possibly build from a cache miss standpoint,
and its performance will be completely unacceptable. Mountain ranges do a
fabulous job of fixing this problem, because all their updates are merges
so the metrics are more like cache misses per block instead of cache misses
per transaction.
The magic pixie dust I mentioned earlier involves a bunch of subtle
implementation details to keep cache coherence down which should get the
number of cache misses per transaction down under one, at which point it
probably isn't a bottleneck any more. There is an implementation in the
As UTXO/STXO/TXO sets are all enormously larger than L1/L2 cache, it's
impossible to get CPU cache misses below one for update operations. The closest
thing to an exception is MMR's, which due to their insertion-ordering could
have good cache locality for appends, in the sense that the mountain tips
required to recalculate the MMR tip digest will already be in cache from the
previous append. But that's not sufficient, as you also need to modify old
TXO's further back in the tree to mark them as spent - that data is going to be
far larger than L1/L2 cache.
Post by Bram Cohen via bitcoin-dev
https://github.com/bramcohen/MerkleSet
This implementation isn't finished yet! I'm almost there, and I'm
definitely feeling time pressure now. I've spent quite a lot of time on
this, mostly because of a bunch of technical reworkings which proved
necessary. This is the last time I ever write a database for kicks. But
Lazy root calculation
Few l1 and l2 cache misses
Small proofs of inclusion/exclusion
Have you looked at the pruning system that my proofchains work implements?
--
https://petertodd.org 'peter'[:-1]@petertodd.org
Bram Cohen via bitcoin-dev
2016-06-16 01:16:26 UTC
Permalink
Post by Peter Todd via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
Peter proposes that there should be both UTXO and STXO commitments, and
No, that's incorrect - I'm only proposing TXO commitments, not UTXO nor STXO
commitments.
What do you mean by TXO commitments? If you mean that it only records
insertions rather than deletions, then that can do many of the same proofs
but has no way of proving that something is currently in the UTXO set,
which is functionality I'd like to provide.

When I say 'merkle tree' what I mean is a patricia trie. What I assume is
meant by a merkle mountain range is a series of patricia tries of
decreasing size each of which is an addition to the previous one, and
they're periodically consolidated into larger tries so the old ones can go
away. This data structure has the nice property that it's both in sorted
order and has less than one cache miss per operation because the
consolidation operations can be batched and done linearly. There are a
number of different things you could be describing if I misunderstood.
Post by Peter Todd via bitcoin-dev
I'm not proposing STXO commitments precisely because the set of _spent_
transactions grows without bound.
I'm worried that once there's real transaction fees everyone might stop
consolidating dust and the set of unspent transactions might grow without
bound as well, but that's a topic for another day.
Post by Peter Todd via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
Now I'm going to go out on a limb. My thesis is that usage of a mountain
range is unnecessary, and that a merkle tree in the raw can be made
serviceable by sprinkling magic pixie dust on the performance problem.
It'd help if you specified exactly what type of merkle tree you're talking
about here; remember that the certificate transparency RFC appears to have
reinvented merkle mountain ranges, and they call them "merkle trees".
Bitcoin
meanwhile uses a so-called "merkle tree" that's broken, and Zcash uses a
partially filled fixed-sized perfect tree.
What I'm making is a patricia trie. Its byte level definition is very
similar to the one in your MMR codebase.

Each level of the tree has a single metadata byte and followed by two
hashes. The hashes are truncated by one byte and the hash function is a
non-padding variant of sha256 (right now it's just using regular sha256,
but that's a nice optimization which allows everything to fit in a single
block).

The possible metadata values are: TERM0, TERM1, TERMBOTH, ONLY0, ONLY1,
MIDDLE. They mean:

TERM0, TERM1: There is a single thing in the tree on the specified side.
The thing hashed on that side is that thing verbatim. The other side has
more than one thing and the hash of it is the root of everything below.

TERMBOTH: There are exactly two things below and they're included inline.
Note that two things is a special case, if there are more you sometimes
have ONLY0 or ONLY1.

ONLY0, ONLY1: There are more than two things below and they're all on the
same side. This makes proofs of inclusion and exclusion simpler, and makes
some implementation details easier, for example there's always something at
every level with perfect memory positioning. It doesn't cause much extra
memory usage because of the TERMBOTH exception for exactly two things.

MIDDLE: There two or more things on both sides.

There's also a SINGLETON special case for a tree which contains only one
thing, and an EMPTY special value for tree which doesn't contain anything.

The main differences to your patricia trie are the non-padding sha256 and
that each level doesn't hash in a record of its depth and the usage of
ONLY0 and ONLY1.
Post by Peter Todd via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
There are two causes of performance problems for merkle trees: hashing
operations and memory cache misses. For hashing functions, the difference
between a mountain range and a straight merkle tree is roughly that in a
mountain range there's one operation for each new update times the number
of times that thing will get merged into larger hills. If there are fewer
levels of hills the number of operations is less but the expense of proof
of inclusion will be larger. For raw merkle trees the number of
operations
Post by Bram Cohen via bitcoin-dev
per thing added is the log base 2 of the number of levels in the tree,
minus the log base 2 of the number of things added at once since you can
do
Post by Bram Cohen via bitcoin-dev
lazy evaluation. For practical Bitcoin there are (very roughly) a million
things stored, or 20 levels, and there are (even more roughly) about a
thousand things stored per block, so each thing forces about 20 - 10 = 10
operations. If you follow the fairly reasonable guideline of mountain
range
Post by Bram Cohen via bitcoin-dev
hills go up by factors of four, you instead have 20/2 = 10 operations per
thing added amortized. Depending on details this comparison can go either
way but it's roughly a wash and the complexity of a mountain range is
clearly not worth it at least from the point of view of CPU costs.
I'm having a hard time understanding this paragraph; could you explain what you
think is happening when things are "merged into larger hills"?
I'm talking about the recalculation of mountain tips, assuming we're on the
same page about what 'MMR' means.
Post by Peter Todd via bitcoin-dev
As UTXO/STXO/TXO sets are all enormously larger than L1/L2 cache, it's
impossible to get CPU cache misses below one for update operations. The closest
thing to an exception is MMR's, which due to their insertion-ordering could
have good cache locality for appends, in the sense that the mountain tips
required to recalculate the MMR tip digest will already be in cache from the
previous append. But that's not sufficient, as you also need to modify old
TXO's further back in the tree to mark them as spent - that data is going to be
far larger than L1/L2 cache.
This makes me think we're talking about subtly different things for MMRs.
The ones I described above have sub-1 cache miss per update due to the
amortized merging together of old mountains.

Technically even a patricia trie utxo commitment can have sub-1 cache
misses per update if some of the updates in a single block are close to
each other in memory. I think I can get practical Bitcoin updates down to a
little bit less than one l2 cache miss per update, but not a lot less.
Peter Todd via bitcoin-dev
2016-06-16 03:26:12 UTC
Permalink
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
Peter proposes that there should be both UTXO and STXO commitments, and
No, that's incorrect - I'm only proposing TXO commitments, not UTXO nor STXO
commitments.
What do you mean by TXO commitments? If you mean that it only records
insertions rather than deletions, then that can do many of the same proofs
but has no way of proving that something is currently in the UTXO set,
which is functionality I'd like to provide.
I think you need to re-read my original post on TXO commitments, specifically
where I say:

# TXO Commitments

A merkle tree committing to the state of __all transaction outputs, both spent
and unspent__, we can provide a method of compactly proving the current state of
an output.

https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html
Post by Bram Cohen via bitcoin-dev
When I say 'merkle tree' what I mean is a patricia trie. What I assume is
meant by a merkle mountain range is a series of patricia tries of
decreasing size each of which is an addition to the previous one, and
they're periodically consolidated into larger tries so the old ones can go
away. This data structure has the nice property that it's both in sorted
order and has less than one cache miss per operation because the
consolidation operations can be batched and done linearly. There are a
number of different things you could be describing if I misunderstood.
Nope, MMR's are completely unlike what you just described.
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
I'm not proposing STXO commitments precisely because the set of _spent_
transactions grows without bound.
I'm worried that once there's real transaction fees everyone might stop
consolidating dust and the set of unspent transactions might grow without
bound as well, but that's a topic for another day.
Ok, but then if you're concerned about that risk, why introduce a data
structure - the STXO set - that's _guaranteed_ to grow without bound?
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
Now I'm going to go out on a limb. My thesis is that usage of a mountain
range is unnecessary, and that a merkle tree in the raw can be made
serviceable by sprinkling magic pixie dust on the performance problem.
It'd help if you specified exactly what type of merkle tree you're talking
about here; remember that the certificate transparency RFC appears to have
reinvented merkle mountain ranges, and they call them "merkle trees".
Bitcoin
meanwhile uses a so-called "merkle tree" that's broken, and Zcash uses a
partially filled fixed-sized perfect tree.
What I'm making is a patricia trie. Its byte level definition is very
similar to the one in your MMR codebase.
Which codebase exactly? I have both a insertion-ordered list (MMR) and a
key:value mapping (referred to as a "merbinner tree" in the codebase) in the
proofchains codebase. They're very different data structures.
Post by Bram Cohen via bitcoin-dev
Each level of the tree has a single metadata byte and followed by two
hashes. The hashes are truncated by one byte and the hash function is a
non-padding variant of sha256 (right now it's just using regular sha256,
but that's a nice optimization which allows everything to fit in a single
block).
The possible metadata values are: TERM0, TERM1, TERMBOTH, ONLY0, ONLY1,
TERM0, TERM1: There is a single thing in the tree on the specified side.
The thing hashed on that side is that thing verbatim. The other side has
more than one thing and the hash of it is the root of everything below.
TERMBOTH: There are exactly two things below and they're included inline.
Note that two things is a special case, if there are more you sometimes
have ONLY0 or ONLY1.
ONLY0, ONLY1: There are more than two things below and they're all on the
same side. This makes proofs of inclusion and exclusion simpler, and makes
some implementation details easier, for example there's always something at
every level with perfect memory positioning. It doesn't cause much extra
memory usage because of the TERMBOTH exception for exactly two things.
MIDDLE: There two or more things on both sides.
There's also a SINGLETON special case for a tree which contains only one
thing, and an EMPTY special value for tree which doesn't contain anything.
The main differences to your patricia trie are the non-padding sha256 and
that each level doesn't hash in a record of its depth and the usage of
ONLY0 and ONLY1.
I'm rather confused, as the above sounds nothing like what I've implemented,
which only has leaf nodes, inner nodes, and the special empty node singleton,
for both the MMR and merbinner trees.
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
I'm having a hard time understanding this paragraph; could you explain what you
think is happening when things are "merged into larger hills"?
I'm talking about the recalculation of mountain tips, assuming we're on the
same page about what 'MMR' means.
Yeah, we're definitely not...

In MMR's append operations never need to modify mountain contents.
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
As UTXO/STXO/TXO sets are all enormously larger than L1/L2 cache, it's
impossible to get CPU cache misses below one for update operations. The closest
thing to an exception is MMR's, which due to their insertion-ordering could
have good cache locality for appends, in the sense that the mountain tips
required to recalculate the MMR tip digest will already be in cache from the
previous append. But that's not sufficient, as you also need to modify old
TXO's further back in the tree to mark them as spent - that data is going to be
far larger than L1/L2 cache.
This makes me think we're talking about subtly different things for MMRs.
The ones I described above have sub-1 cache miss per update due to the
amortized merging together of old mountains.
Again, see above.
Post by Bram Cohen via bitcoin-dev
Technically even a patricia trie utxo commitment can have sub-1 cache
misses per update if some of the updates in a single block are close to
each other in memory. I think I can get practical Bitcoin updates down to a
little bit less than one l2 cache miss per update, but not a lot less.
I'm very confused as to why you think that's possible. When you say "practical
Bitcoin updates", what exactly is the data structure you're proposing to
update? How is it indexed?
--
https://petertodd.org 'peter'[:-1]@petertodd.org
Bram Cohen via bitcoin-dev
2016-06-16 09:07:26 UTC
Permalink
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
What do you mean by TXO commitments? If you mean that it only records
insertions rather than deletions, then that can do many of the same
proofs
Post by Bram Cohen via bitcoin-dev
but has no way of proving that something is currently in the UTXO set,
which is functionality I'd like to provide.
I think you need to re-read my original post on TXO commitments, specifically
# TXO Commitments
A merkle tree committing to the state of __all transaction outputs, both spent
and unspent__, we can provide a method of compactly proving the current state of
an output.
https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html
Okay, clearly my assumptions about the parts of that post I didn't read
carefully were way off. I'll have to look through it carefully to be able
to make coherent apples to apples comparisons.
Post by Bram Cohen via bitcoin-dev
I'm worried that once there's real transaction fees everyone might stop
Post by Bram Cohen via bitcoin-dev
consolidating dust and the set of unspent transactions might grow without
bound as well, but that's a topic for another day.
Ok, but then if you're concerned about that risk, why introduce a data
structure - the STXO set - that's _guaranteed_ to grow without bound?
I'm not proposing STXO set commitments either. My point was that there
should be incentives for collecting dust. That has nothing to do with this
thread though and should be discussed separately (also I don't feel like
discussing it because I don't have a good proposal).
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
What I'm making is a patricia trie. Its byte level definition is very
similar to the one in your MMR codebase.
Which codebase exactly? I have both a insertion-ordered list (MMR) and a
key:value mapping (referred to as a "merbinner tree" in the codebase) in the
proofchains codebase. They're very different data structures.
I'm talking about your merbinner trees. I read through that part of your
codebase carefully and got the impression that the MMR tree section used it
as a building block.
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
The main differences to your patricia trie are the non-padding sha256 and
that each level doesn't hash in a record of its depth and the usage of
ONLY0 and ONLY1.
I'm rather confused, as the above sounds nothing like what I've implemented,
which only has leaf nodes, inner nodes, and the special empty node singleton,
for both the MMR and merbinner trees.
It's quite a bit like merbinner trees. I've basically taken the leaf nodes
and smushed them into the inner nodes above them, thus saving a hashing
operation and some memory. They're both binary radix trees.
Post by Bram Cohen via bitcoin-dev
Technically even a patricia trie utxo commitment can have sub-1 cache
Post by Bram Cohen via bitcoin-dev
misses per update if some of the updates in a single block are close to
each other in memory. I think I can get practical Bitcoin updates down
to a
Post by Bram Cohen via bitcoin-dev
little bit less than one l2 cache miss per update, but not a lot less.
I'm very confused as to why you think that's possible. When you say "practical
Bitcoin updates", what exactly is the data structure you're proposing to
update? How is it indexed?
My calculations are: a Bitcoin block contains about 2000 updates. The l2
cache is about 256 kilobytes, and if an update is about 32 bytes times two
for the parents, grandparents, etc. then an l2 cache can contain about 4000
values. If the current utxo size is about 2000 * 4000 = 8,000,000 in size
then about half the pages which contain a transaction will contain a second
one. I think the utxo set is currently about an order of magnitude greater
than that, so the number of such collisions will be fairly mall, hence my
'less than one but not a lot less' comment.

As for how it's indexed, at a crypto definition level it's just a binary
radix tree. In terms of how it's indexed in memory, that involves some
optimizations to avoid cache misses. Memory is allocated into blocks of
about the size of an 12 cache (or maybe an l1 cache, it will require some
testing and optimization). Blocks are either branch blocks, which keep
everything in fixed positions, or leaf blocks, which contain fixed size
entries for nodes plus indexes within the same leaf block of their
children. Branch blocks can have many children which can be either branch
blocks or leaf blocks, but typically are either all branch blocks or all
leaf blocks. Branch blocks always have exactly one parent. Leaf blocks
always have all their inputs come from a single branch block, but there can
be multiple ones of those. When a branch block overflows it first tries to
put stuff into the last leaf block it used, and if there's no more room it
allocates a new one. It's fairly common for branches to have just a few
leaf children, but they also could have a lot, depending on whether the
base 2 log of the number of things currently in the set modulo the number
levels in a branch is a small number.

Usually when an update is done it consists of first checking the
appropriate output of the root block (it's jumped to directly to avoid
unnecessary memory lookups. If there's nothing there the algorithm will
walk back until it finds something.) That leads directly to (usually)
another branch whose output is jumped to directly again. At Bitcoin utxo
set sizes that will usually lead to a leaf block, which is then walked down
manually to find the actual terminal node, which is then updated, and the
parent, grandparent, etc. is then marked invalid until something which was
already marked invalid is hit, and it exits. Calculation of hash values is
done lazily.
Peter Todd via bitcoin-dev
2016-06-17 04:34:35 UTC
Permalink
Post by Bram Cohen via bitcoin-dev
Okay, clearly my assumptions about the parts of that post I didn't read
carefully were way off. I'll have to look through it carefully to be able
to make coherent apples to apples comparisons.
Thanks!
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
I'm worried that once there's real transaction fees everyone might stop
Post by Bram Cohen via bitcoin-dev
consolidating dust and the set of unspent transactions might grow without
bound as well, but that's a topic for another day.
Ok, but then if you're concerned about that risk, why introduce a data
structure - the STXO set - that's _guaranteed_ to grow without bound?
I'm not proposing STXO set commitments either. My point was that there
should be incentives for collecting dust. That has nothing to do with this
thread though and should be discussed separately (also I don't feel like
discussing it because I don't have a good proposal).
Ah, yeah, I misunderstood you there; as expected absolutely no-one is proposing
STXO set commitments. :)
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
The main differences to your patricia trie are the non-padding sha256 and
that each level doesn't hash in a record of its depth and the usage of
ONLY0 and ONLY1.
I'm rather confused, as the above sounds nothing like what I've implemented,
which only has leaf nodes, inner nodes, and the special empty node singleton,
for both the MMR and merbinner trees.
It's quite a bit like merbinner trees. I've basically taken the leaf nodes
and smushed them into the inner nodes above them, thus saving a hashing
operation and some memory. They're both binary radix trees.
Ah, I see what you mean now.

So above you said that in merbinner trees each node "hash[es] in a record of
its depth" That's actually incorrect: each node commits to the prefix that all
keys below that level start with, not just the depth.

This means that in merbinner trees, cases where multiple keys share parts of
the same prefix are handled efficiently, without introducing extra levels
unnecessarily; there's no need for the ONLY0/1 nodes as the children of an
inner node will always be on different sides.

When keys are randomly distributed, this isn't a big deal; OTOH against
attackers who are choosing keys, e.g. by grinding hashes, merbinner trees
always have maximum depths in proportion to log2(n) of the actual number of
items in the tree. Grinding is particularly annoying to deal with due to the
birthday attack: creating a ground prefix 64 bits long only takes 32 bits worth
of work.


In my deterministic expressions work one of the ideas I've been tossing around
is rather than always using hash digests directly for when you need to commit
to some data, we could instead extend the idea of a digest to that of a
"commitment", where a commitment is simply some short, but variable-sized,
string that uniquely maps to a given set of data. Secondly, commitments do
*not* always guarantee that the original data can't be recovered from the
commitment itself.

By allowing commitments to be variable sized - say 0 to ~64 bytes - we get a
number of advantages:

1) Data shorter than the length of a digest (32 bytes) can be included in the
commitment itself, improving efficiency.

2) Data a little longer than a digest can have hashing delayed, to better fill
up blocks.

In particular, case #2 handles your leaf node optimizations generically,
without special cases and additional complexity. It'd also be a better way to
do the ONLY0/1 cases, as if the "nothing on this side" symbol is a single byte,
each additional colliding level would simply extend the commitment without
hashing. In short, you'd have nearly the same level of optimization even if at
the cryptography level your tree consists of only leaves, inner nodes, and nil.

Another advantage of variable sized commitments is that it can help make clear
to users when it's possible to brute force the message behind the commitment.
For instance, digest from a hashed four byte integer can be trivially reversed
by just trying all combinations. Equally, if that integer is concatenated with
a 32 byte digest that the attacker knows, the value of the integer can be brute
forced.
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
Technically even a patricia trie utxo commitment can have sub-1 cache
Post by Bram Cohen via bitcoin-dev
misses per update if some of the updates in a single block are close to
each other in memory. I think I can get practical Bitcoin updates down
to a
Post by Bram Cohen via bitcoin-dev
little bit less than one l2 cache miss per update, but not a lot less.
I'm very confused as to why you think that's possible. When you say "practical
Bitcoin updates", what exactly is the data structure you're proposing to
update? How is it indexed?
My calculations are: a Bitcoin block contains about 2000 updates. The l2
cache is about 256 kilobytes, and if an update is about 32 bytes times two
for the parents, grandparents, etc. then an l2 cache can contain about 4000
values. If the current utxo size is about 2000 * 4000 = 8,000,000 in size
then about half the pages which contain a transaction will contain a second
one. I think the utxo set is currently about an order of magnitude greater
than that, so the number of such collisions will be fairly mall, hence my
'less than one but not a lot less' comment.
Your estimate of updates requiring 32 bytes of data is *way* off.

Each inner node updated on the path to a leaf node will itself require 32 bytes
of data to be fetched - the digest of the sibling. As of block 416,628, there
are 39,167,128 unspent txouts, giving us a tree about 25 levels deep.

So if I want to update a single leaf, I need to read:

25 nodes * 32 bytes/node = 800 bytes

of data. Naively, that'd mean our 2,000 updates needs to read 1.6MB from RAM,
which is 6.4x bigger than the L2 cache - it's just not going to fit.

Taking into account the fact that this is a batched update improves things a
little bit. For a node at level i with random access patterns and N accesses
total our amortised cost is 1/(1 + N/2^i) Summing that over 2,000 leaf updates
and 25 levels gives us ~29,000 total updates, 0.9MB, which is still a lot
larger than L2 cache.

While this might fit in L3 cache - usually on the order of megabytes - this is
a rather optimistic scenario anyway: we're assuming no other cache pressure and
100% hit rate.

Anyway hashing is pretty slow. The very fast BLAKE2 is about 3 cycles/byte
(SHA256 is about 15 cycles/byte) so hashing that same data would take around
200 cycles, and probably quite a bit more in practice due to overheads from our
short message lengths; fetching a cache line from DRAM only takes about 1,000
cycles. I'd guess that once other overheads are taken into account, even if you
could eliminate L2/L3 cache-misses it wouldn't be much of an improvement.
Post by Bram Cohen via bitcoin-dev
As for how it's indexed, at a crypto definition level it's just a binary
radix tree. In terms of how it's indexed in memory, that involves some
optimizations to avoid cache misses. Memory is allocated into blocks of
about the size of an 12 cache (or maybe an l1 cache, it will require some
testing and optimization). Blocks are either branch blocks, which keep
everything in fixed positions, or leaf blocks, which contain fixed size
entries for nodes plus indexes within the same leaf block of their
children. Branch blocks can have many children which can be either branch
blocks or leaf blocks, but typically are either all branch blocks or all
leaf blocks. Branch blocks always have exactly one parent. Leaf blocks
always have all their inputs come from a single branch block, but there can
be multiple ones of those. When a branch block overflows it first tries to
put stuff into the last leaf block it used, and if there's no more room it
allocates a new one. It's fairly common for branches to have just a few
leaf children, but they also could have a lot, depending on whether the
base 2 log of the number of things currently in the set modulo the number
levels in a branch is a small number.
Usually when an update is done it consists of first checking the
appropriate output of the root block (it's jumped to directly to avoid
unnecessary memory lookups. If there's nothing there the algorithm will
walk back until it finds something.) That leads directly to (usually)
another branch whose output is jumped to directly again. At Bitcoin utxo
set sizes that will usually lead to a leaf block, which is then walked down
manually to find the actual terminal node, which is then updated, and the
parent, grandparent, etc. is then marked invalid until something which was
already marked invalid is hit, and it exits. Calculation of hash values is
done lazily.
I think it's safe to say that given our working set is significantly larger
than the L2/L3 cache available, none of the above optimizations are likely to
help much. Better to just keep the codebase simple and use standard techniques.
--
https://petertodd.org 'peter'[:-1]@petertodd.org
Bram Cohen via bitcoin-dev
2016-06-18 02:43:47 UTC
Permalink
Post by Peter Todd via bitcoin-dev
So above you said that in merbinner trees each node "hash[es] in a record of
its depth" That's actually incorrect: each node commits to the prefix that all
keys below that level start with, not just the depth.
I considered a similar trick at the implementation rather than the
definition level: A node doesn't have to store the prefix which is implicit
in its position. That would create a fair number of headaches though,
because I'm using fixed size stuff in important ways, and it could at most
save about 10% of memory, so it goes into the 'maybe later' bucket.
Post by Peter Todd via bitcoin-dev
This means that in merbinner trees, cases where multiple keys share parts of
the same prefix are handled efficiently, without introducing extra levels
unnecessarily; there's no need for the ONLY0/1 nodes as the children of an
inner node will always be on different sides.
When keys are randomly distributed, this isn't a big deal; OTOH against
attackers who are choosing keys, e.g. by grinding hashes, merbinner trees
always have maximum depths in proportion to log2(n) of the actual number of
items in the tree. Grinding is particularly annoying to deal with due to the
birthday attack: creating a ground prefix 64 bits long only takes 32 bits worth
of work.
Yes an attacker can force the tree to be deeper in places, but it's
mitigated in several ways: (1) The way I'm using memory it won't cause a
whole new block to be allocated, it will just force log(attack strength) -
log(n) nodes to be used (2) logarithmic growth being what it is that isn't
such a big amount (3) With the special casing of TERMBOTH an attacker needs
three things with the same prefix to pull off an attack rather than two,
which is quite a bit harder to pull off.

That said, it wouldn't be all that hard to change how the hashing function
works to do a single hash for a whole series of ONLY in a row instead of a
new one at every level, which would make the attacker only able to force
extra memory usage instead of extra CPU, but this is a slightly annoying
thing to write to stop a fairly lame attack, so I'm at least not doing it
for my initial implementation. I could likely be convinced that it's worth
doing before an actual release though. There's another implementation trick
to do the same thing for memory usage, which is much more in the 'do later'
category because it doesn't involve changing the format and hence it can be
put off.
Post by Peter Todd via bitcoin-dev
In particular, case #2 handles your leaf node optimizations generically,
without special cases and additional complexity. It'd also be a better way to
do the ONLY0/1 cases, as if the "nothing on this side" symbol is a single byte,
each additional colliding level would simply extend the commitment without
hashing. In short, you'd have nearly the same level of optimization even if at
the cryptography level your tree consists of only leaves, inner nodes, and nil.
I'm taking pains to make all the hashing be of fixed-size things, so that a
non-padding variant of a secure hashing algorithm can be used. The chains
of ONLY thing above would force a special exception to that, which can be
done but is annoying. Making things smaller than a single block (64 bytes)
won't speed up hashing time, and making things a single byte longer than
that doubles it.
Post by Peter Todd via bitcoin-dev
Another advantage of variable sized commitments is that it can help make clear
to users when it's possible to brute force the message behind the commitment.
For instance, digest from a hashed four byte integer can be trivially reversed
by just trying all combinations. Equally, if that integer is concatenated with
a 32 byte digest that the attacker knows, the value of the integer can be brute
forced.
I'm hashing all strings before inserting to get them to be a fixed size and
avoid a few different attacks. In Bitcoin most of the strings added are
longer than that so it's a form of compression. A custom hash function
could be used which 'hashes' very short strings by repeating them verbatim
could be used, but seems like not such a hot idea. I'm making extensive use
of things being fixed size everywhere, which improves performance in a lot
of ways.
Post by Peter Todd via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
Technically even a patricia trie utxo commitment can have sub-1 cache
Post by Bram Cohen via bitcoin-dev
misses per update if some of the updates in a single block are close
to
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
each other in memory. I think I can get practical Bitcoin updates
down
Post by Bram Cohen via bitcoin-dev
to a
Post by Bram Cohen via bitcoin-dev
little bit less than one l2 cache miss per update, but not a lot
less.
Post by Bram Cohen via bitcoin-dev
I'm very confused as to why you think that's possible. When you say "practical
Bitcoin updates", what exactly is the data structure you're proposing
to
Post by Bram Cohen via bitcoin-dev
update? How is it indexed?
I'll re-answer this because I did a terrible job before. The entire data
structure consists of nodes which contain a metadata byte (TERM0, ONLY1,
etc.) followed by fixes size secure hashes, and (in some cases) pointers to
where the children are. The secure hashes in parent nodes are found by
hashing their children verbatim (or the stored string in the case of a
TERM). This is very conventional. All of the cleverness is in where in
memory these nodes are stored so that tracing down the tree causes very few
cache misses.

(The alternate approach is to have each node store its own hash rather than
that be stored by the parent. That approach means that when you're
recalculating you have to look up siblings which doubles the number of
cache misses. Not such a hot idea.)

At the root there's a branch block. It consists of all nodes up to some
fixed depth - let's say 12 - with that depth set so that it roughly fits
within a single memory page. Branch blocks are arranged with the nodes in
fixed position defined by the prefix they correspond to, and the terminals
have outpointers to other blocks. Because they're all clustered together, a
lookup or update will only require a single

Below the root block are other branch blocks. Each of them has a fixed 12
bit prefix it is responsible for. When doing a lookup a second cache miss
will be hit for levels 13-24, because those are all clustered in the same
branch block.

Below the second level of root block (at Bitcoin utxo set scale - this
varies based on how much is stored) there are leaf blocks. A leaf block
consists of nodes with outpointers to its own children which must be within
the same leaf block. All entry points into a leaf block are from the same
branch block, and the leaf block has no out pointers to other blocks. When
a leaf block overflows the entry point into it which overflowed is moved
into the active leaf for that branch, and if that's full a new one is
allocated. There's some subtlety to exactly how this is done, but I've
gotten everything down to simple expedient tricks with straightforward
implementations. The thing which matters for now is that there's only a
single cache miss for each leaf node, because they also fit in a page.

So at Bitcoin scale there will probably only be 3 cache misses for a
lookup, and that's a worst case scenario. The first one is probably always
warm, bringing it down to 2, and if you do a bunch in sorted order they'll
probably hit the same second level branches repeatedly bringing it down to
1, and might even average less than that if there are enough that the leaf
block has multiple things being accessed.

(These same tricks can be applied to merbinner tree implementation as well,
although that would be a bit less parsimonious with memory, by a small
constant factor.)
Post by Peter Todd via bitcoin-dev
Anyway hashing is pretty slow. The very fast BLAKE2 is about 3 cycles/byte
(SHA256 is about 15 cycles/byte) so hashing that same data would take around
200 cycles, and probably quite a bit more in practice due to overheads from our
short message lengths; fetching a cache line from DRAM only takes about 1,000
cycles. I'd guess that once other overheads are taken into account, even if you
could eliminate L2/L3 cache-misses it wouldn't be much of an improvement.
Those numbers actually back up my claims about performance. If you're doing
a single update and recalculating the root afterwards, then the amount of
rehashing to be done is about 30 levels deep times 64 bytes per thing
hashed times 15 cycles per byte then it's about 28,800 cycles of hashing.
If you have a naive radix tree implementation which hits a cache miss at
every level then that's 30,000 cycles, which is about half the performance
time, certainly worth optimizing. If instead of sha256 you use blake2
(Which sounds like a very good idea!) then hashing for an update will be
about 5760 cycles and performance will be completely dominated by cache
misses. If a more cache coherent implementation is used, then the cost of
cache misses will be 3000 cycles, which will be a non-factor with sha256
and a significant but not dominating one with blake2.

It's reasonable to interpret those numbers as saying that blake2 and cache
coherent implementation are both clearly worth it (probably necessary for
real adoption) and that an amortized binary radix tree is tempting but not
worth it.
Peter Todd via bitcoin-dev
2016-06-18 23:01:43 UTC
Permalink
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
So above you said that in merbinner trees each node "hash[es] in a record of
its depth" That's actually incorrect: each node commits to the prefix that all
keys below that level start with, not just the depth.
I considered a similar trick at the implementation rather than the
definition level: A node doesn't have to store the prefix which is implicit
in its position. That would create a fair number of headaches though,
because I'm using fixed size stuff in important ways, and it could at most
save about 10% of memory, so it goes into the 'maybe later' bucket.
Wait, are you saying you think committing to the prefix is a "trick"? It's just
a very simple - and possibly not-optimal - way of committing to what data
should be accessible under a given node. An alternative would have been ensure
that in terms of _cryptographic_ tree position.

By "position", are you talking about position within RAM? That may or may not
be a viable optimization, but it's quite separate from the question of the
cryptographic structure of the data.
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
This means that in merbinner trees, cases where multiple keys share parts of
the same prefix are handled efficiently, without introducing extra levels
unnecessarily; there's no need for the ONLY0/1 nodes as the children of an
inner node will always be on different sides.
When keys are randomly distributed, this isn't a big deal; OTOH against
attackers who are choosing keys, e.g. by grinding hashes, merbinner trees
always have maximum depths in proportion to log2(n) of the actual number of
items in the tree. Grinding is particularly annoying to deal with due to the
birthday attack: creating a ground prefix 64 bits long only takes 32 bits worth
of work.
Yes an attacker can force the tree to be deeper in places, but it's
mitigated in several ways: (1) The way I'm using memory it won't cause a
whole new block to be allocated, it will just force log(attack strength) -
log(n) nodes to be used (2) logarithmic growth being what it is that isn't
such a big amount (3) With the special casing of TERMBOTH an attacker needs
three things with the same prefix to pull off an attack rather than two,
which is quite a bit harder to pull off.
That said, it wouldn't be all that hard to change how the hashing function
works to do a single hash for a whole series of ONLY in a row instead of a
new one at every level, which would make the attacker only able to force
extra memory usage instead of extra CPU, but this is a slightly annoying
thing to write to stop a fairly lame attack, so I'm at least not doing it
for my initial implementation. I could likely be convinced that it's worth
doing before an actual release though. There's another implementation trick
to do the same thing for memory usage, which is much more in the 'do later'
category because it doesn't involve changing the format and hence it can be
put off.
Post by Peter Todd via bitcoin-dev
In particular, case #2 handles your leaf node optimizations generically,
without special cases and additional complexity. It'd also be a better way to
do the ONLY0/1 cases, as if the "nothing on this side" symbol is a single byte,
each additional colliding level would simply extend the commitment without
hashing. In short, you'd have nearly the same level of optimization even if at
the cryptography level your tree consists of only leaves, inner nodes, and nil.
I'm taking pains to make all the hashing be of fixed-size things, so that a
non-padding variant of a secure hashing algorithm can be used. The chains
of ONLY thing above would force a special exception to that, which can be
done but is annoying. Making things smaller than a single block (64 bytes)
won't speed up hashing time, and making things a single byte longer than
that doubles it.
Have you seen how BLAKE2 omits padding when the data to be hashed happens to be
exactly one block in size? It's significantly faster than SHA256, and that's a
standard part of the algorithm already.
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
Technically even a patricia trie utxo commitment can have sub-1 cache
Post by Bram Cohen via bitcoin-dev
misses per update if some of the updates in a single block are close
to
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
each other in memory. I think I can get practical Bitcoin updates
down
Post by Bram Cohen via bitcoin-dev
to a
Post by Bram Cohen via bitcoin-dev
little bit less than one l2 cache miss per update, but not a lot
less.
Post by Bram Cohen via bitcoin-dev
I'm very confused as to why you think that's possible. When you say "practical
Bitcoin updates", what exactly is the data structure you're proposing
to
Post by Bram Cohen via bitcoin-dev
update? How is it indexed?
I'll re-answer this because I did a terrible job before. The entire data
structure consists of nodes which contain a metadata byte (TERM0, ONLY1,
etc.) followed by fixes size secure hashes, and (in some cases) pointers to
where the children are. The secure hashes in parent nodes are found by
hashing their children verbatim (or the stored string in the case of a
TERM). This is very conventional. All of the cleverness is in where in
memory these nodes are stored so that tracing down the tree causes very few
cache misses.
(The alternate approach is to have each node store its own hash rather than
that be stored by the parent. That approach means that when you're
recalculating you have to look up siblings which doubles the number of
cache misses. Not such a hot idea.)
Have you benchmarked the cost of a hash operation vs. the cost of a cache miss?
What are the actual numbers?
Post by Bram Cohen via bitcoin-dev
At the root there's a branch block. It consists of all nodes up to some
fixed depth - let's say 12 - with that depth set so that it roughly fits
within a single memory page. Branch blocks are arranged with the nodes in
fixed position defined by the prefix they correspond to, and the terminals
have outpointers to other blocks. Because they're all clustered together, a
lookup or update will only require a single
A single....?
Post by Bram Cohen via bitcoin-dev
Below the root block are other branch blocks. Each of them has a fixed 12
bit prefix it is responsible for. When doing a lookup a second cache miss
will be hit for levels 13-24, because those are all clustered in the same
branch block.
So, is this also how the data structure looks cryptographically, or is the way
it's hashed separate from the above description?
Post by Bram Cohen via bitcoin-dev
Below the second level of root block (at Bitcoin utxo set scale - this
varies based on how much is stored) there are leaf blocks. A leaf block
consists of nodes with outpointers to its own children which must be within
the same leaf block. All entry points into a leaf block are from the same
branch block, and the leaf block has no out pointers to other blocks. When
a leaf block overflows the entry point into it which overflowed is moved
into the active leaf for that branch, and if that's full a new one is
allocated. There's some subtlety to exactly how this is done, but I've
gotten everything down to simple expedient tricks with straightforward
implementations. The thing which matters for now is that there's only a
single cache miss for each leaf node, because they also fit in a page.
Page as in 4096 bytes? But L1/L2/L3 cache is arranged in terms of 64 byte cache
lines - where do pages come in here?

At Bitcoin UTXO set scale, how large do you think these data structures are?
Post by Bram Cohen via bitcoin-dev
So at Bitcoin scale there will probably only be 3 cache misses for a
lookup, and that's a worst case scenario. The first one is probably always
warm, bringing it down to 2, and if you do a bunch in sorted order they'll
probably hit the same second level branches repeatedly bringing it down to
1, and might even average less than that if there are enough that the leaf
block has multiple things being accessed.
"Sorted order" - what exact type of sorting do you mean here?
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
Anyway hashing is pretty slow. The very fast BLAKE2 is about 3 cycles/byte
(SHA256 is about 15 cycles/byte) so hashing that same data would take around
200 cycles, and probably quite a bit more in practice due to overheads from our
short message lengths; fetching a cache line from DRAM only takes about 1,000
cycles. I'd guess that once other overheads are taken into account, even if you
could eliminate L2/L3 cache-misses it wouldn't be much of an improvement.
Those numbers actually back up my claims about performance. If you're doing
a single update and recalculating the root afterwards, then the amount of
rehashing to be done is about 30 levels deep times 64 bytes per thing
hashed times 15 cycles per byte then it's about 28,800 cycles of hashing.
If you have a naive radix tree implementation which hits a cache miss at
every level then that's 30,000 cycles, which is about half the performance
time, certainly worth optimizing. If instead of sha256 you use blake2
(Which sounds like a very good idea!) then hashing for an update will be
about 5760 cycles and performance will be completely dominated by cache
misses. If a more cache coherent implementation is used, then the cost of
cache misses will be 3000 cycles, which will be a non-factor with sha256
and a significant but not dominating one with blake2.
But that's assuming the dataset in question fits in cache; I don't see how it
does. Since it doesn't, I'm argung the total % improvement by _any_ cache
optimization on the subset that does fit in cache will be relatively small.

Again, how large a dataset do you think you're working with here?
--
https://petertodd.org 'peter'[:-1]@petertodd.org
Bram Cohen via bitcoin-dev
2016-07-15 23:00:57 UTC
Permalink
On Sat, Jun 18, 2016 at 4:01 PM, Peter Todd <***@petertodd.org> wrote:
Have you seen how BLAKE2 omits padding when the data to be hashed happens
Post by Peter Todd via bitcoin-dev
to be
exactly one block in size? It's significantly faster than SHA256, and that's a
standard part of the algorithm already.
That's very convenient! I didn't know it, but had 'look up how blake2 does
padding' in my list of stuff to do. I'm leaning heavily towards using
blake2b at this point, at least for internal hashing.
Post by Peter Todd via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
At the root there's a branch block. It consists of all nodes up to some
fixed depth - let's say 12 - with that depth set so that it roughly fits
within a single memory page. Branch blocks are arranged with the nodes in
fixed position defined by the prefix they correspond to, and the
terminals
Post by Bram Cohen via bitcoin-dev
have outpointers to other blocks. Because they're all clustered
together, a
Post by Bram Cohen via bitcoin-dev
lookup or update will only require a single
A single....?
Okay, I've figured out the root cause of general confusion here. It's
mostly my fault.

There are a few different media on which data can be stored, with different
properties in terms of how long it takes to retrieve data from them, and
how much of a readahead they typically have. I was misreading the l2 cache
size as the main memory readahead amount, which is... probably wrong? The
readahead properties of memory aren't well documented and apparently vary a
lot. On SSDs it typically pulls down a kilobyte at once and they call them
pages, hence my use of that term above. But since my real point is that my
implementation should act as a silver bullet which can have acceptable
performance even on extremely bad devices, I'll give an analysis of how
well it works when everything is stored on a regular spinning hard drive.

Let's say you're storing 100 million items, which will fit within 10
gigabytes. If you set the block depths to about 10 bits they'll be about
32K, and if you set the size of leaf blocks to be about the same then
memory efficiency will be good because the leaf blocks will store twigs of
about 2^7 in size while having 2^10 space so they'll fit reasonably. Almost
everything will be three blocks from root, so updates will generally
require two disk seeks (plus one more for a write but those are generally
faster because they get batched.)

For latency numbers, I'm going off these:
https://gist.github.com/jboner/2841832

If the blockchain is very full of simple transactions and a disk seek takes
15 milliseconds, then going with the assumption that a block is about 600
seconds and the blockchain can handle 4 transactions per second and each of
them is 3 updates (one utxo spent plus two new ones created) that's 15 *
600 * 4 * 3 * 2 milliseconds per block, or about 200 seconds per block, so
if the uxto roots trail by a few blocks even a ludicrously underpowered
node could keep up.

On an SSD keeping up is completely trivial, the problem becomes one of how
quickly you can validate an entire blockchain history. There a read takes
about 0.15 milliseconds and you have to do 5 of them instead of 2 because
the natural memory block size is 4k, so it's about 1 millisecond per
update, or 600 * 4 * 3 total time for each block, which is about 7 seconds.
That's large enough that making the utxo root trail by two blocks is still
a good idea, but small enough that it isn't a big factor in the cost of
running a node. It's enough that validating a complete block history might
take a while though, and even validating just the last year would take a
few days. This is very conservative and it's assuming that *everything* is
kept on an SSD though. If the numbers work better and a few layers are kept
in regular memory validating a whole year of history might only take a few
hours.

Hopefully that all makes a fairly good case that raw merkle tree utxo root
trailing by a few blocks is a viable strategy. The data structures in the
MMR proposal are fairly complicated and the analysis of them talks in
somewhat vague terms about things being fast and slow. A similar analysis
of the MMR proposal specifying storage media and expectations of latency
numbers would clarify the reasoning a lot.

(By the way, sorry for the slow response - I got preempted by a bunch of
other work duties.)

Bram Cohen via bitcoin-dev
2016-06-18 03:22:04 UTC
Permalink
Post by Bram Cohen via bitcoin-dev
Post by Bram Cohen via bitcoin-dev
The fundamental approach to handling the latency problem is to have the
utxo commitments trail a bit. Computing utxo commitments takes a certain
amount of time, too much to hold up block propagation but still hopefully
vastly less than the average amount of time between blocks. Trailing by a
single block is probably a bad idea because you sometimes get blocks back
to back, but you never get blocks back to back to back to back. Having
the
Post by Bram Cohen via bitcoin-dev
utxo set be trailing by a fixed amount - five blocks is probably
excessive
Post by Bram Cohen via bitcoin-dev
- would do a perfectly good job of keeping latency from every becoming an
issue. Smaller commitments for the utxos added and removed in each block
alone could be added without any significant performance penalty. That
way
Post by Bram Cohen via bitcoin-dev
all blocks would have sufficient commitments for a completely up to date
proofs of inclusion and exclusion. This is not a controversial approach.
Agreed - regardless of approach adding latency to commitment calculations of
all kinds is something I think we all agree can work in principle, although
obviously it should be a last resort technique when optimization fails.
An important point: Adding latency to utxo commitments does not imply
latency to proofs of inclusion and exclusion! If roots of what's added and
deleted in each block are added as well, then a proof of inclusion can be
done by having a proof of inclusion of the trailing utxo set followed by a
proof of exclusion from all the following deletion sets, or a proof of
inclusion in one of the single block addition sets followed by proofs of
exclusion from all the more recent deletion sets. Likewise a proof of
exclusion can be a proof of exclusion from the utxo set followed by proofs
of exclusion from all the more recent addition sets or a single proof of
inclusion in a recent deletion set.

This does make proofs larger (except in the case of recent deletions and
maybe recent additions) and adds complexity, so it shouldn't be done unless
necessary. But validation before block propagation needs to be extremely
fast, so for utxo roots this trick is probably both necessary and
sufficient.
Peter Todd via bitcoin-dev
2016-06-18 22:09:29 UTC
Permalink
Post by Bram Cohen via bitcoin-dev
Post by Peter Todd via bitcoin-dev
Agreed - regardless of approach adding latency to commitment calculations of
all kinds is something I think we all agree can work in principle, although
obviously it should be a last resort technique when optimization fails.
An important point: Adding latency to utxo commitments does not imply
latency to proofs of inclusion and exclusion! If roots of what's added and
deleted in each block are added as well, then a proof of inclusion can be
done by having a proof of inclusion of the trailing utxo set followed by a
proof of exclusion from all the following deletion sets, or a proof of
inclusion in one of the single block addition sets followed by proofs of
exclusion from all the more recent deletion sets. Likewise a proof of
exclusion can be a proof of exclusion from the utxo set followed by proofs
of exclusion from all the more recent addition sets or a single proof of
inclusion in a recent deletion set.
This does make proofs larger (except in the case of recent deletions and
maybe recent additions) and adds complexity, so it shouldn't be done unless
necessary.
So, to be clear you're assuming that blocks commit to key:value maps of the
block contents, specifically a pre-block "UTXO deletion/things that this block
spent" set? First of all, it's interesting how the much smaller dataset of a
pre-block key:value map would make L2/L3 caching optimizations much more likely
to be relevant. :)


That type of solution would be very similar to the solutions treechains would
need to prove coins haven't been doublespent. Basically, in treechains the
system as a whole is a datastructure indexed by time and prefix. So, if you
want to prove a valid spend you need to convince me of three things:

1. The coin existed as of time t1 at prefix p

2. At t2, p, a valid spend was published.

3. Between t1 and t2 at prefix p no other valid spend was published.

Paths to any prefix p as of time t, will have about log2(len(p)) size (beyond
the top-level chain), similar to your above suggestion. Of course, unlike your
above suggestion, in treechains it's not clear if step #1 can be done without
another n*log(N)-ish sized proof in a truly trustless environment!
Post by Bram Cohen via bitcoin-dev
But validation before block propagation needs to be extremely
fast, so for utxo roots this trick is probably both necessary and
sufficient.
I'm _not_ of the optinion that validation before propagation needs to be done
at all - I think it's perfectly reasonable to propgate blocks that you have not
validated at all (beyond checking PoW as an anti-DoS measure). The time it
takes miners to start mining the next block - and collecting fees - is however
very important.

In practice, I think we're mostly in agreement here, but because I'm happy to
propagate prior to validating I'd be ok with protocol designs that required
miners to have relatively large amounts of RAM - say 32GB - dedicated to UTXO
lookup because that wouldn't require relay nodes to also have those kinds of
resources available to them once validationless propagation was implemented.
--
https://petertodd.org 'peter'[:-1]@petertodd.org
Loading...