I wonder how this new approach compares to serializable snapshot isolation (SSI) https://wiki.postgresql.org/wiki/SSI
I'm not familiar with these technologies, but when I was studying databases, SSI was touted as the "better" 2PL on the horizon. I wonder how SSI compares to 2PLSF, and why it wasn't mentioned here?
I’m working on a memory model for a new platform which is almost entirely based around copy on write, snapshots and SSI. But in distributed effects you still need locks and two phase transactions and so on. These are largely complementary features imo.
For Wait-Or-Die, do you really need to fetch_and_add to get a transaction ID? Do you need a transaction ID at all? It sounds like the goal is just to have an arbitrary-but-consistent ordering of active transactions so that different transactions can agree on who should wait and who should die in the event of a conflict. So why not just use the thread ID? Or even a random number might work (if ties are treated as “die”, so the worst that happens is that both transactions unnecessarily abort, and then retry with new random numbers).
While it’s not mentioned, I suppose you want to prioritize older transactions in order to prevent long-running transactions from being starved by shorter-running transactions. (If one long transaction conflicts with an average of, say, three short transactions, and on each conflict it’s effectively random who wins, then each long transaction has only a 1/8 chance of winning all three conflicts and being able to commit.)
But preventing starvation only requires older transactions to be prioritized most of the time, not every single time, especially not if the transaction is only slightly older. So some kind of timestamp / cycle counter should work fine, even if there’s skew between threads or other sources of inaccuracy. Ties could be broken by thread ID, or again by having both sides abort.
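A minimal Python sketch of that Wait-Die rule, assuming a coarse per-transaction timestamp with the thread ID as a tiebreaker (all names here are illustrative, not from the paper):

```python
# Wait-Die sketch: on a lock conflict, the older transaction (smaller
# priority tuple) waits; the younger one dies (aborts and retries with
# a fresh priority). Exact ties die, so the worst case is that both
# sides retry -- no fetch_and_add counter needed.

def priority(timestamp, thread_id):
    # Lexicographic order: timestamp first, thread ID breaks ties.
    return (timestamp, thread_id)

def should_wait(my_priority, holder_priority):
    """Return True if the requester should wait for the lock holder,
    False if it should die and retry."""
    return my_priority < holder_priority

# Example: A started earlier than B on another thread.
a = priority(timestamp=100, thread_id=1)
b = priority(timestamp=105, thread_id=2)
assert should_wait(a, b)       # older A waits for the lock B holds
assert not should_wait(b, a)   # younger B dies and retries
```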
I would use a ULID rather than thread ID.
They (and others) are great for this kind of case - and many others.
ulid is great, I use it a lot.
There is also a draft for a new UUID variant, UUID v7, that will be very similar to how ULID works.
https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...
I use ULID a lot too, and it's frustrating how close the new spec UUIDs are to it without actually being the same… so I've got a bunch of code to modify once Postgres supports generating the new UUIDs server-side without extensions or stored procedures. Relatively painless work, but frustrating since it could have been avoided.
> frustrating since it could have been avoided.
How? UUID is a structured format so the only options I can see is ulid creating their own unregistered uuid variant (probably a terrible idea) or adding ulid support to postgres (nothing to do with uuid).
I'd love for Postgres to adopt ULID as a first-class variant of the same basic 128-bit-wide, binary-optimized column type it uses for UUIDs, but I don't expect it will. While ULID is "popular", it's probably not popular enough for them to commit to maintaining it in the long run. Also, the smart money ahead of time would have been for the ULID spec to sacrifice a few data bits and leave the version-specifying sections of the bit-field layout unused in the ULID binary spec (https://github.com/ulid/spec#binary-layout-and-byte-order), for the sake of future compatibility with "proper" UUIDs. Performing one quick bit-field modification on every row in PostgreSQL (set a bit) would have been less painful than load, parse, and repack to re-compute the appropriate UUIDv7s (or UUIDv8s, for some reason), since the primary-key update transaction should be roughly the same speed either way.
Yes. V7 looks good but ULID already does it. Rather than change code, I'm just staying with ULID. All give us 128 bits. That should be enough for anyone ;/
I do not quite understand the last figure for the relaxed AVL tree. For the 100% lookup case (rightmost), the TL2 algorithm should scale linearly with the number of threads. For read-only transactions, TL2 needs to sample the global version, then for every read make sure the local version is less than or equal to the sampled version. Given this, it is difficult to understand why the graph is sub-linear and why TL2 is not as fast as the other STM implementations.
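For concreteness, here's a toy Python sketch of the TL2-style read-only validation described above, with the global version clock and per-location versioned locks reduced to plain counters (a simplification: real TL2 also checks the lock bit on each location):

```python
# TL2-style read-only transaction sketch: sample the global version
# once, then verify every location read was not written after the
# sample. On failure the transaction aborts and the caller retries.

class TL2Store:
    def __init__(self):
        self.global_clock = 0
        self.data = {}      # key -> value
        self.version = {}   # key -> global clock value at last write

    def write(self, key, value):
        # Writers advance the global clock and stamp the location.
        self.global_clock += 1
        self.data[key] = value
        self.version[key] = self.global_clock

    def read_only_txn(self, keys):
        rv = self.global_clock  # sample the global version once
        out = {}
        for k in keys:
            out[k] = self.data[k]
            # Written after our sample? Snapshot is inconsistent.
            if self.version[k] > rv:
                return None     # abort; caller retries
        return out

store = TL2Store()
store.write("x", 1)
store.write("y", 2)
assert store.read_only_txn(["x", "y"]) == {"x": 1, "y": 2}
```

Note that nothing here is written by readers, which is why one would expect read-only workloads to scale linearly.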
I don't see such a graph for TL2? I do see one for TLRW though which does use reader locks, hence the scalability cap.
The chart doesn't seem to be visible in iOS Safari, but I can see it in Firefox on desktop. However, the figure seems to be the same as the one from the linked paper: https://zenodo.org/record/7886718
Ah ok I found it there, I see what you mean now. My only guess would be cache effects? With that large AVL tree (1 M entries, so likely dozens of MiB), you are escaping L2 cache and hitting shared L3 or main memory for a large portion of lookups, and are bandwidth-constrained at the die level, thus adding that knee (which I think is visible with some of the other algorithms as well).
But TL2 shouldn't be using more memory than, say, TinySTM, I don't think. If the implementations are object-based and nodes are cache-aligned, adding an extra 8 bytes for the versioned lock shouldn't bump the node size to more than a cache line.
However, I do think the TL2 implementation as described in the paper is memory-based, as is TinySTM, so every read needs to do a hash to locate the corresponding lock/metadata. The read-only transactions for TL2 and TinySTM seem identical to me, which is why I am so confused.
Looking at other figures from the 2PLSF paper, TL2 at 100% lookup on the hash set and skip list looks like such a dog compared to the other algorithms.
One thing I didn't quite catch: I thought most write transactions need consistent reads on many (contended) "objects", but the actual writes are often just one or two objects. Is 2PL addressing this, or does a write transaction take write locks on all objects?
You take either read OR write locks for whatever you need, and keep all locks until you commit the transaction.
That (possibly brief) moment where you hold all the necessary locks is your point of linearizability: it's as if everything happened at that exact moment.
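A minimal sketch of that discipline in Python, using plain mutexes in place of per-object reader-writer locks (names are illustrative):

```python
import threading

class TwoPhaseTxn:
    """Strict 2PL sketch: acquire read/write locks as needed
    (growing phase), then release everything only at commit
    (shrinking phase). The moment all locks are held is the
    point of linearizability."""

    def __init__(self):
        self.held = []

    def lock(self, lk):
        lk.acquire()           # growing phase: acquisitions only
        self.held.append(lk)

    def commit(self):
        for lk in reversed(self.held):
            lk.release()       # shrinking phase: releases only
        self.held.clear()

a, b = threading.Lock(), threading.Lock()
txn = TwoPhaseTxn()
txn.lock(a)    # e.g. read lock on object A
txn.lock(b)    # e.g. write lock on object B
txn.commit()   # everything released together at the end
assert not a.locked() and not b.locked()
```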
You can take locks on all objects before you write. Most systems don’t do that.
Usually there are multiple writes to complete a business process. In each step stale data is read to create a write request similar to how you place your order with the wait staff at the restaurant. The wait staff will orchestrate your request. First the chef prepares your meal according to the order ticket. Second the wait staff delivers your meal based on the chef’s internal work order. Lastly they take your payment based on the original order ticket. Notice there are at least three microtransactions based on stale data. No one is holding locks until the diner finishes eating.
IIUC, that's the microservice way, and it's missing cancellations, rollbacks, and timeouts.
What happens if the chef can't fulfill the meal, or the customer leaves because the order is taking too long, etc.? Most real-life systems will not charge the customer if the chef can't make the meal, and will cancel the cooking if the customer leaves.
The analogy may be going too far ;)
Interesting re the transaction ID generation. Sled generates ~70-120 million transaction IDs per second. The author points out that it’s still a scaling bottleneck but I wonder if Sled’s approach might raise the TPS for the contended case.
huh? no mention of serializable snapshot isolation (SSI), used by most commercial dbs for years. what gives?
MVCC is what most databases do.
MVCC is optimistic concurrency control – you may end up having to retry your write transaction multiple times, which might suck if the transaction is expensive. Pessimistic concurrency control like locking may be cheaper in those cases, since the cost of locking may be drastically lower. Having one doesn't preclude the other!
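A toy sketch of that optimistic retry loop in Python, with a single versioned cell standing in for the database (illustrative names, not a real MVCC engine):

```python
# Optimistic concurrency sketch: read a version, do the (possibly
# expensive) work without holding any lock, and commit only if the
# version is unchanged -- otherwise retry. Under heavy contention,
# the repeated retries are what can make pessimistic locking cheaper.

class VersionedCell:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def compare_and_set(self, expected_version, new_value):
        if self.version != expected_version:
            return False        # someone else committed first
        self.value = new_value
        self.version += 1
        return True

def optimistic_update(cell, fn, max_retries=10):
    for _ in range(max_retries):
        value, version = cell.value, cell.version
        new = fn(value)         # expensive work, no locks held
        if cell.compare_and_set(version, new):
            return new
    raise RuntimeError("too much contention; a lock may be cheaper")

cell = VersionedCell(41)
assert optimistic_update(cell, lambda v: v + 1) == 42
```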
Isn’t 2PL a limited applicability construct?
All my incoming and outgoing “transactions” are over loosely-coupled API calls.
Is there a point I’m missing?
IMO it’s about what’s happening underneath those API calls. You might be doing lots of POST requests that touch the same underlying sql table.
This is a great paper that provides a framework for comparing two-phase commit and Paxos.
https://lamport.azurewebsites.net/video/consensus-on-transac...
Two-phase locking is different from two-phase commit, in spite of an overlap in their naming. Two-phase commit is relevant to be compared against Paxos - both of which fall under the category of consensus protocols.
Two-phase locking is a concurrency control mechanism.
A little summary:
2PC: An algorithm used in the context of distributed transactions where each machine handles a different part of the transaction. This means that nothing is redundant - the success of each and every participant is required for the transaction to be committed.
Paxos/Raft/consensus: An algorithm usually used in the context of distributed replication. Since every participant is doing the same thing, it's tolerable if a few fail or give outputs that diverge from the majority.
2PL: A method of acquiring multiple locks such that first you acquire all the required locks (first phase), then you do what you need to do, and then you release all the locks (second phase). This is in contrast to a locking scheme where lock acquisitions and releases are interspersed. This isn't strictly limited to distributed systems, although it's common to see 2PC with 2PL.
If this piques your interest, read the Spanner paper! Spanner uses all three - 2PC with 2PL for distributed read-write transactions, and Paxos for replication.
PS: "Distributed" just means there's more than one machine involved, any of which may fail independently, and communication among these machines happens over unreliable wire.
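To make the 2PC summary above concrete, here's a toy Python sketch of the two phases, with participants reduced to objects that vote (illustrative only; real 2PC also has to log decisions durably and handle coordinator failure):

```python
# 2PC sketch: phase 1 collects a vote from every participant; the
# transaction commits only if the vote is unanimous, because each
# participant handles a different, non-redundant part of the work.
# Phase 2 broadcasts the decision.

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: prepare
    decision = all(votes)                         # unanimity required
    for p in participants:                        # phase 2: decide
        p.commit() if decision else p.abort()
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "active"
    def prepare(self):
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

assert two_phase_commit([Participant(True), Participant(True)])
assert not two_phase_commit([Participant(True), Participant(False)])
```

Note the contrast with Paxos/Raft: there, a majority suffices because every participant holds the same data; here, one "no" vote (or one crashed participant) blocks the commit.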
In a multi-replica system, where, say, we cannot tolerate any failures or lags, is 2PC used in practice to achieve consensus? Or are there other methods for achieving such strict consensus?
2PC (or atomic commitment more generally) is needed for sharded/partitioned systems with different data on each node. In these systems, each node gets a vote on whether a transaction should be allowed to commit. Replication, making multiple copies of the same data, doesn't need 2PC. Instead, algorithms like Paxos, Raft, or chain replication are used.
Concurrency control > Concurrency control in databases > Why is concurrency control needed?: https://en.m.wikipedia.org/wiki/Concurrency_control#Why_is_c...?
Locks (computer science) > Disadvantages: https://en.wikipedia.org/wiki/Lock_(computer_science)#Disadv...
Two-phase locking (2PL) https://en.wikipedia.org/wiki/Two-phase_locking
Two-phase commit protocol (2PC) https://en.wikipedia.org/wiki/Two-phase_commit_protocol
Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
Raft: https://en.wikipedia.org/wiki/Raft_(algorithm)
Consensus (computer science) https://en.wikipedia.org/wiki/Consensus_(computer_science)
Spanner: https://en.wikipedia.org/wiki/Spanner_(database)
Non-blocking algorithm; "lock-free concurrency", "wait-free" https://en.wikipedia.org/wiki/Non-blocking_algorithm
"Ask HN: Why don't PCs have better entropy sources?" [for generating txids/uuids] https://news.ycombinator.com/item?id=30877296
"100-Gbit/s Integrated Quantum Random Number Generator Based on Vacuum Fluctuations" https://link.aps.org/doi/10.1103/PRXQuantum.4.010330
Re: tests of randomness: https://mail.python.org/archives/list/[email protected]...
TIL there's a regular heartbeat in the quantum foam: a regular, monotonic heartbeat in quantum Rydberg wave-packet interference. That should be useful for distributed applications with and without vector clocks and an initial time-synchronization service (White Rabbit > PTP > NTP, the Network Time Protocol) https://journals.aps.org/prresearch/abstract/10.1103/PhysRev... :
> The [quantum time-keeping application of this research] relies on the unique fingerprint that is created by the time-dependent photoionization of these complex wave packets. These fingerprints determine how much time has passed since the wave packet was formed and provide an assurance that the measured time is correct. Unlike any other clock, this quantum watch does not utilize a counter and is fully quantum mechanical in its nature. The quantum watch has the potential to become an invaluable tool in pump-probe spectroscopy due to its simplicity, assurance of accuracy, and ability to provide an absolute timestamp, i.e., there is no need to find time zero.
IIUC a Rydberg antenna can read and/or write such noise?
"Patterns of Distributed Systems (2022)" https://news.ycombinator.com/item?id=36504073
I find it interesting that “distributed” and “concurrent” end up falling under the mathematical concept of nondeterminism with respect to correctness. Of course a practically efficient implementation has additional concerns.
2500 years later and the best hypotenuse algorithm is still Pythagoras’.
You jest, but this is a computationally slow algorithm that only works for right-angled triangles in a Cartesian space. There are very reasonable and fast trigonometric approximations. The general form would be the law of cosines, which applies to any triangle. We can derive this from more abstract metrics over an inner product space: a sort of vector space of any dimension.
Funny you should say that. Here [1] is an interesting improvement. More interestingly, he shows his work.
[1] https://www.cs.utexas.edu/~EWD/transcriptions/EWD09xx/EWD975...
We are so arrogant sometimes. The passage of time has nothing to do with reality. People seem to think that given enough time and energy and ingenuity we can do anything we want. This is not how the real world works.
Sometimes we find a good solution, and it's the best possible one, and we found it early on. We think we can do better, but we can't. A classic example of this is Euclid's fifth axiom. It wasn't shown until the 19th century that this axiom is independent of the others and genuinely necessary, yet everyone from Euclid to Gauss tried to get rid of it. Foolish.
It seems to me that while it's foolish to assume we can always improve on something, it's not at all foolish to try to do better (even if the attempts don't succeed). Asking "can we do better than this?" and trying to find a better way is a prerequisite for progress, after all.
An excellent illustration of how a lack of diversity in thinking can hinder forward progress. Thank God everyone doesn't think the same as you.
I think it's a testament to the human condition there are those of us willing to entertain the folly of pushing the boundaries, or of being foolish as you put it.
I'm a beginner here and I find this topic interesting. I really want there to be an easy-to-deploy consistency solution.
If I have a distributed microservice architecture and I want to keep multiple datastores in synchronization or "consistent" what's the industry best practice?
A few days ago I was trying to solve the inconsistency problem with "settled timestamps", which is a kind of multiversioning idea, except that timestamps that elapse in the absence of a reported error represent a valid save/commit. Kind of like two-phase commit, with the second phase being time. The idea is that we watch the clocks of other servers, and if they don't update, then we know we cannot trust their settled timestamps. (My intent was to allow scaling consistency across many servers: we don't need to wait for a response to every update, we only need to wait for the next timestamp interval.)
Here's my multithreaded, multiprocess Python code to test indeterminacy. Ten threads all send each other random updates. They also broadcast their own timestamp and their own view of every other server's timestamps.
https://replit.com/@Chronological/InconsistencySimulation#ma... (click Run and watch the output, you'll have to wait 10 seconds)
A read in this simulation is the MIN of all servers' reported timestamps.
Ten seconds into the simulation, we ask every thread for its own view of what the counter value is. Sometimes they will all report the same value; a lot of the time they will be split-brained.
I am aware that wall clock timestamps are not suitable for ordering in a distributed system and that logical or vector clocks should be used for ordering.
If you can get the simulation to all report the same number at any point in time, then that would be great :-)
Ordering in distributed systems is significant: the eventual consistency of the simulation means that some values can arrive late but still affect the result, so it is not linearizable. The Bloom language tries to solve this.
I'm specifically interested in scaling WITH consistency but I think this is quite difficult.
> If I have a distributed microservice architecture and I want to keep multiple datastores in synchronization or "consistent" what's the industry best practice?
Not to use a distributed microservice architecture.
For distributed systems, the main idea is a centralized write ordering journal that is replayed by individual nodes.
Multiple systems write sequentially to the central journal. The journal simply takes requests, like a key-value store. The journal is replicated to all the nodes. The nodes read from the journal and perform the complex logic requested.
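A toy Python sketch of that journal-replay idea, with the journal as an in-memory list and nodes replaying it in order (illustrative names, ignoring replication and failure handling):

```python
# Centralized write-ordering journal sketch: all writes are appended
# to one ordered log, and every node reaches the same state by
# replaying the log in order. Order is decided exactly once, centrally.

class Journal:
    def __init__(self):
        self.entries = []   # the single source of write order

    def append(self, op):
        self.entries.append(op)
        return len(self.entries) - 1   # position = global order

class Node:
    def __init__(self):
        self.state = {}
        self.applied = 0

    def replay(self, journal):
        # Apply every entry we haven't seen yet, in journal order.
        for key, value in journal.entries[self.applied:]:
            self.state[key] = value
        self.applied = len(journal.entries)

j = Journal()
j.append(("x", 1))
j.append(("x", 2))   # the later write wins because order is global
n1, n2 = Node(), Node()
n1.replay(j)
n2.replay(j)
assert n1.state == n2.state == {"x": 2}
```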
If business software practices is not enough of a proof, look at any massive online games. They all use one central server as a source of truth about the game world, and broadcast that state to the clients. Anything a client reports that diverges from the central server view is either corrected, rejected, or becomes a reason to disconnect the client for cheating attempts.
If you need strict order, that order should happen in strictly one place. (The universe itself does not support strict order at a distance, as Special Relativity shows.)
Bitcoin: Am I joke to you?
Everyone: yes.
There's no performance gain from bitcoin's approach though. The distributed consensus is there for trustless operation, not performance. It's many orders of magnitude slower (and a few more orders of magnitude more power inefficient!) than just having one server handle it.
Isn't bitcoin actually eventual consensus determined by increasing unlikelihood of finding a conflicting sequence of hashes?
Bitcoin is an interesting exception to the above. Issue is latency.
that is server/many clients, not distributed
I’d suggest to re-evaluate if you really, really need (a) distributed data stores and (b) synchronous consistency. Things become much simpler if you can forego one of them.
I suspect TigerBeetle DB will be the industry benchmark for consistent, high throughput, fault tolerant, distributed databases in 5 years.
Is your inter-machine messaging asynchronous and is it possible for one of your machines to crash?
Look at the Raft protocol. In general you’d rather not integrate at the protocol layer; it’s standard to use a consistent store like etcd that implements Raft for the coordination/data that needs to be serializable.
(Kubernetes uses etcd so it scales pretty well for a strongly-consistent k/v store.)
You said "multiple datastores", so I'm assuming you have heterogeneous data and something like CockroachDB isn't an option.
> I'm a beginner in this topic and I find this topic interesting.
Not trying to gatekeep, but rolling your own is dangerous. See https://aphyr.com/ for the gold standard in testing (great educational material). You can use Jepsen to test your distributed systems. But it's better to just use datastores that Kyle has shown are solid.
Thanks for your reply.
I've experimented with a toy Raft implementation, but I haven't Jepsen-tested it and it's incomplete.
I did write a Jepsen test for a different, eventually consistent protocol, which understandably fails the linearizability test because I'm still learning: eventually consistent is not linearizable.
https://GitHub.com/samsquire/eventually-consistent-mesh
I want to have my cake and eat it too. Scalability and consistency.