I wonder how this new approach compares to serializable snapshot isolation (SSI) https://wiki.postgresql.org/wiki/SSI
I'm not familiar with these technologies, but when I was studying databases, SSI was touted as the "better" 2PL on the horizon. I wonder how SSI compares to 2PLSF, and why it wasn't mentioned here?
I’m working on a memory model for a new platform which is almost entirely based around copy on write, snapshots and SSI. But in distributed effects you still need locks and two phase transactions and so on. These are largely complementary features imo.
For Wait-Or-Die, do you really need to fetch_and_add to get a transaction ID? Do you need a transaction ID at all? It sounds like the goal is just to have an arbitrary-but-consistent ordering of active transactions so that different transactions can agree on who should wait and who should die in the event of a conflict. So why not just use the thread ID? Or even a random number might work (if ties are treated as “die”, so the worst that happens is that both transactions unnecessarily abort, and then retry with new random numbers).
While it’s not mentioned, I suppose you want to prioritize older transactions in order to prevent long-running transactions from being starved by shorter-running transactions. (If one long transaction conflicts with an average of, say, three short transactions, and on each conflict it’s effectively random who wins, then each long transaction has only a 1/8 chance of winning all three conflicts and being able to commit.)
But preventing starvation only requires older transactions to be prioritized most of the time, not every single time, especially not if the transaction is only slightly older. So some kind of timestamp / cycle counter should work fine, even if there’s skew between threads or other sources of inaccuracy. Ties could be broken by thread ID, or again by having both sides abort.
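A minimal Python sketch of that Wait-Die rule, assuming a coarse per-transaction timestamp with the thread ID as a tiebreaker (all names here are illustrative, not from the paper):

```python
# Wait-Die sketch: on a lock conflict, the older transaction (smaller
# priority tuple) waits; the younger one dies (aborts and retries with
# a fresh priority). Exact ties die, so the worst case is that both
# sides retry -- no fetch_and_add counter needed.

def priority(timestamp, thread_id):
    # Lexicographic order: timestamp first, thread ID breaks ties.
    return (timestamp, thread_id)

def should_wait(my_priority, holder_priority):
    """Return True if the requester should wait for the lock holder,
    False if it should die and retry."""
    return my_priority < holder_priority

# Example: A started earlier than B on another thread.
a = priority(timestamp=100, thread_id=1)
b = priority(timestamp=105, thread_id=2)
assert should_wait(a, b)       # older A waits for the lock B holds
assert not should_wait(b, a)   # younger B dies and retries
```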
I would use a ULID rather than thread ID.
They (and others) are great for this kind of case - and many others.
ulid is great, I use it a lot.
There is also a draft for a new UUID variant, UUID v7, that will be very similar to how ULID works.
https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...
I use ULID a lot too, and it's frustrating how close the new spec UUIDs are to it without actually being the same… so I've got a bunch of code to modify once Postgres supports generating the new UUIDs server-side without extensions or stored procedures. Relatively painless work, but frustrating since it could have been avoided.
> frustrating since it could have been avoided.
How? UUID is a structured format so the only options I can see is ulid creating their own unregistered uuid variant (probably a terrible idea) or adding ulid support to postgres (nothing to do with uuid).
I'd love for Postgres to adopt ULID as a first-class variant of the same basic 128-bit-wide, binary-optimized column type it uses for UUIDs, but I don't expect it will. While ULID is "popular", it's probably not popular enough for them to commit to maintaining it in the long run. Also, the smart money ahead of time would have been for the ULID spec to sacrifice a few data bits and leave the version-specifying sections of the bit-field layout unused in the ULID binary spec (https://github.com/ulid/spec#binary-layout-and-byte-order), for the sake of future compatibility with "proper" UUIDs. Performing one quick bit-field modification on every row in PostgreSQL (set a bit) would have been less painful than load, parse, and repack to re-compute the appropriate UUIDv7s (or UUIDv8s, for some reason), since the primary-key update transaction should be roughly the same speed either way.
Yes. V7 looks good but ULID already does it. Rather than change code, I'm just staying with ULID. All give us 128 bits. That should be enough for anyone ;/
I do not quite understand the last figure for the relaxed AVL tree. For the 100% lookup case (rightmost), the TL2 algorithm should scale linearly with the number of threads. For read-only transactions, TL2 needs to sample the global version, then for every read make sure the local version is less than or equal to the sampled version. Given this, it is difficult to understand why the graph is sub-linear and why TL2 is not as fast as the other STM implementations.
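For concreteness, here's a toy Python sketch of the TL2-style read-only validation described above, with the global version clock and per-location versioned locks reduced to plain counters (a simplification: real TL2 also checks the lock bit on each location):

```python
# TL2-style read-only transaction sketch: sample the global version
# once, then verify every location read was not written after the
# sample. On failure the transaction aborts and the caller retries.

class TL2Store:
    def __init__(self):
        self.global_clock = 0
        self.data = {}      # key -> value
        self.version = {}   # key -> global clock value at last write

    def write(self, key, value):
        # Writers advance the global clock and stamp the location.
        self.global_clock += 1
        self.data[key] = value
        self.version[key] = self.global_clock

    def read_only_txn(self, keys):
        rv = self.global_clock  # sample the global version once
        out = {}
        for k in keys:
            out[k] = self.data[k]
            # Written after our sample? Snapshot is inconsistent.
            if self.version[k] > rv:
                return None     # abort; caller retries
        return out

store = TL2Store()
store.write("x", 1)
store.write("y", 2)
assert store.read_only_txn(["x", "y"]) == {"x": 1, "y": 2}
```

Note that nothing here is written by readers, which is why one would expect read-only workloads to scale linearly.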
I don't see such a graph for TL2? I do see one for TLRW though which does use reader locks, hence the scalability cap.
The chart doesn't seem to be visible in iOS Safari, but I can see it in Firefox on desktop. However, the figure seems to be the same as the one from the linked paper: https://zenodo.org/record/7886718
Ah ok I found it there, I see what you mean now. My only guess would be cache effects? With that large AVL tree (1 M entries, so likely dozens of MiB), you are escaping L2 cache and hitting shared L3 or main memory for a large portion of lookups, and are bandwidth-constrained at the die level, thus adding that knee (which I think is visible with some of the other algorithms as well).
But TL2 shouldn't be using more memory than, say, TinySTM, I don't think. If the implementations are object-based and nodes are cache-aligned, adding an extra 8 bytes for the versioned lock shouldn't bump the node size to more than a cache line.
However, I do think the TL2 implementation as described in the paper is memory-based, as is TinySTM, so every read needs to do a hash to locate the corresponding lock/metadata. The read-only transactions for TL2 and TinySTM seem identical to me, which is why I am so confused.
Looking at other figures from the 2PLSF paper, TL2 at 100% lookup on the hash set and skip list looks like such a dog compared to the other algorithms.
One thing I didn't quite catch: I thought most write transactions need consistent reads on many (contended) "objects", but the actual writes are often just one or two objects. Is 2PL addressing this, or does a write transaction take write locks on all objects?
You take either read OR write locks for whatever you need, and keep all locks until you commit the transaction.
That (possibly brief) moment where you hold all the necessary locks is your point of linearizability: it's as if everything happened at that exact moment.
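A minimal sketch of that discipline in Python, using plain mutexes in place of per-object reader-writer locks (names are illustrative):

```python
import threading

class TwoPhaseTxn:
    """Strict 2PL sketch: acquire read/write locks as needed
    (growing phase), then release everything only at commit
    (shrinking phase). The moment all locks are held is the
    point of linearizability."""

    def __init__(self):
        self.held = []

    def lock(self, lk):
        lk.acquire()           # growing phase: acquisitions only
        self.held.append(lk)

    def commit(self):
        for lk in reversed(self.held):
            lk.release()       # shrinking phase: releases only
        self.held.clear()

a, b = threading.Lock(), threading.Lock()
txn = TwoPhaseTxn()
txn.lock(a)    # e.g. read lock on object A
txn.lock(b)    # e.g. write lock on object B
txn.commit()   # everything released together at the end
assert not a.locked() and not b.locked()
```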
You can take locks on all objects before you write. Most systems don’t do that.
Usually there are multiple writes to complete a business process. In each step stale data is read to create a write request similar to how you place your order with the wait staff at the restaurant. The wait staff will orchestrate your request. First the chef prepares your meal according to the order ticket. Second the wait staff delivers your meal based on the chef’s internal work order. Lastly they take your payment based on the original order ticket. Notice there are at least three microtransactions based on stale data. No one is holding locks until the diner finishes eating.
IIUC, that's the microservice way, and it's missing cancellations, rollbacks, and timeouts.
What happens if the chef can't fulfill the meal, or the customer leaves because the order is taking too long, etc.? Most real-life systems will not charge the customer if the chef can't make the meal, and will cancel the cooking if the customer leaves.
The analogy may be going too far ;)
Interesting re the transaction ID generation. Sled generates ~70-120 million transaction IDs per second. The author points out that it’s still a scaling bottleneck but I wonder if Sled’s approach might raise the TPS for the contended case.
huh? no mention of serializable snapshot isolation (SSI), used by most commercial dbs for years. what gives?
MVCC is what most databases do.
MVCC is optimistic concurrency control – you may end up having to retry your write transaction multiple times, which might suck if the transaction is expensive. Pessimistic concurrency control like locking may be cheaper in those cases, since the cost of locking may be drastically lower. Having one doesn't preclude the other!
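A toy sketch of that optimistic retry loop in Python, with a single versioned cell standing in for the database (illustrative names, not a real MVCC engine):

```python
# Optimistic concurrency sketch: read a version, do the (possibly
# expensive) work without holding any lock, and commit only if the
# version is unchanged -- otherwise retry. Under heavy contention,
# the repeated retries are what can make pessimistic locking cheaper.

class VersionedCell:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def compare_and_set(self, expected_version, new_value):
        if self.version != expected_version:
            return False        # someone else committed first
        self.value = new_value
        self.version += 1
        return True

def optimistic_update(cell, fn, max_retries=10):
    for _ in range(max_retries):
        value, version = cell.value, cell.version
        new = fn(value)         # expensive work, no locks held
        if cell.compare_and_set(version, new):
            return new
    raise RuntimeError("too much contention; a lock may be cheaper")

cell = VersionedCell(41)
assert optimistic_update(cell, lambda v: v + 1) == 42
```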
Isn’t 2PL a limited applicability construct?
All my incoming and outgoing “transactions” are over loosely-coupled API calls.
Is there a point I’m missing?
IMO it’s about what’s happening underneath those API calls. You might be doing lots of POST requests that touch the same underlying sql table.
This is a great paper that provides a framework for comparing two-phase commit and Paxos.
https://lamport.azurewebsites.net/video/consensus-on-transac...
Two-phase locking is different from two-phase commit, in spite of an overlap in their naming. Two-phase commit is relevant to be compared against Paxos - both of which fall under the category of consensus protocols.
Two-phase locking is a concurrency control mechanism.
A little summary:
2PC: An algorithm used in the context of distributed transactions where each machine handles a different part of the transaction. This means that nothing is redundant - the success of each and every participant is required for the transaction to be committed.
Paxos/Raft/consensus: An algorithm usually used in the context of distributed replication. Since every participant is doing the same thing, it's tolerable if a few fail or give outputs that diverge from the majority.
2PL: A method of acquiring multiple locks such that first you acquire all the required locks (first phase), then you do what you need to do, and then you release all the locks (second phase). This is in contrast to a locking scheme where lock acquisitions and releases are interspersed. This isn't strictly limited to distributed systems, although it's common to see 2PC with 2PL.
If this piques your interest, read the Spanner paper! Spanner uses all three - 2PC with 2PL for distributed read-write transactions, and Paxos for replication.
PS: "Distributed" just means there's more than one machine involved, any of which may fail independently, and communication among these machines happens over unreliable wire.
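To make the 2PC summary above concrete, here's a toy Python sketch of the two phases, with participants reduced to objects that vote (illustrative only; real 2PC also has to log decisions durably and handle coordinator failure):

```python
# 2PC sketch: phase 1 collects a vote from every participant; the
# transaction commits only if the vote is unanimous, because each
# participant handles a different, non-redundant part of the work.
# Phase 2 broadcasts the decision.

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: prepare
    decision = all(votes)                         # unanimity required
    for p in participants:                        # phase 2: decide
        p.commit() if decision else p.abort()
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "active"
    def prepare(self):
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

assert two_phase_commit([Participant(True), Participant(True)])
assert not two_phase_commit([Participant(True), Participant(False)])
```

Note the contrast with Paxos/Raft: there, a majority suffices because every participant holds the same data; here, one "no" vote (or one crashed participant) blocks the commit.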
In a multi-replica system, where, say, we cannot tolerate any failures or lags, is 2PC used in practice to achieve consensus? Or are there other methods for achieving such strict consensus?
2PC (or atomic commitment more generally) is needed for sharded/partitioned systems with different data on each node. In these systems, each node gets a vote on whether a transaction should be allowed to commit. Replication, making multiple copies of the same data, doesn't need 2PC. Instead, algorithms like Paxos, Raft, or chain replication are used.
Concurrency control > Concurrency control in databases > Why is concurrency control needed?: https://en.m.wikipedia.org/wiki/Concurrency_control#Why_is_c...?
Locks (computer science) > Disadvantages: https://en.wikipedia.org/wiki/Lock_(computer_science)#Disadv...
Two-phase locking (2PL) https://en.wikipedia.org/wiki/Two-phase_locking
Two-phase commit protocol (2PC) https://en.wikipedia.org/wiki/Two-phase_commit_protocol
Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
Raft: https://en.wikipedia.org/wiki/Raft_(algorithm)
Consensus (computer science) https://en.wikipedia.org/wiki/Consensus_(computer_science)
Spanner: https://en.wikipedia.org/wiki/Spanner_(database)
Non-blocking algorithm; "lock-free concurrency", "wait-free" https://en.wikipedia.org/wiki/Non-blocking_algorithm
"Ask HN: Why don't PCs have better entropy sources?" [for generating txids/uuids] https://news.ycombinator.com/item?id=30877296
"100-Gbit/s Integrated Quantum Random Number Generator Based on Vacuum Fluctuations" https://link.aps.org/doi/10.1103/PRXQuantum.4.010330
Re: tests of randomness: https://mail.python.org/archives/list/[email protected]...
TIL there's a regular heartbeat in the quantum foam: a regular, monotonic heartbeat in quantum Rydberg wave-packet interference. That should be useful for distributed applications with and without vector clocks and an initial time-synchronization service (White Rabbit > PTP > NTP, the Network Time Protocol) https://journals.aps.org/prresearch/abstract/10.1103/PhysRev... :
> The [quantum time-keeping application of this research] relies on the unique fingerprint that is created by the time-dependent photoionization of these complex wave packets. These fingerprints determine how much time has passed since the wave packet was formed and provide an assurance that the measured time is correct. Unlike any other clock, this quantum watch does not utilize a counter and is fully quantum mechanical in its nature. The quantum watch has the potential to become an invaluable tool in pump-probe spectroscopy due to its simplicity, assurance of accuracy, and ability to provide an absolute timestamp, i.e., there is no need to find time zero.
IIUC a Rydberg antenna can read and/or write such noise?
"Patterns of Distributed Systems (2022)" https://news.ycombinator.com/item?id=36504073
I find it interesting that “distributed” and “concurrent” end up falling under the mathematical concept of nondeterminism with respect to correctness. Of course a practically efficient implementation has additional concerns.
2500 years later and the best hypotenuse algorithm is still Pythagoras’.
You jest, but this is a computationally slow algorithm that only works for right-angled triangles in a Cartesian space. There are very reasonable and fast trigonometric approximations. The general form would be the law of cosines, which applies to any triangle. We can derive this from more abstract metrics over an inner product space: a sort of vector space of any dimension.
Funny you should say that. Here [1] is an interesting improvement. More interestingly, he shows his work.
[1] https://www.cs.utexas.edu/~EWD/transcriptions/EWD09xx/EWD975...
We are so arrogant sometimes. The passage of time has nothing to do with reality. People seem to think that given enough time and energy and ingenuity we can do anything we want. This is not how the real world works.
Sometimes we find a good solution, and it's the best possible one, and we found it early on. We think we can do better, but we can't. A classic example of this is Euclid's fifth axiom. It wasn't shown until the 19th century that this axiom is independent of the others and genuinely necessary, yet everyone from Euclid to Gauss tried to get rid of it. Foolish.
It seems to me that while it's foolish to assume we can always improve on something, it's not at all foolish to try to do better (even if the attempts don't succeed). Asking "can we do better than this?" and trying to find a better way is a prerequisite for progress, after all.
An excellent illustration of how a lack of diversity in thinking can hinder forward progress. Thank God everyone doesn't think the same as you.
I think it's a testament to the human condition there are those of us willing to entertain the folly of pushing the boundaries, or of being foolish as you put it.
I'm a beginner here and I find this topic interesting. I really want there to be an easy-to-deploy consistency solution.
If I have a distributed microservice architecture and I want to keep multiple datastores in synchronization or "consistent" what's the industry best practice?
A few days ago I was trying to solve the inconsistency problem with "settled timestamps", which is a kind of multiversioning idea, except that timestamps that elapse in the absence of a reported error represent a valid save/commit. Kind of like two-phase commit, with the second phase being time. The idea is that we watch the clocks of other servers, and if they don't update, then we know we cannot trust their settled timestamps. (My intent was to allow scaling consistency across many servers: we don't need to wait for a response to every update, we only need to wait for the next timestamp interval.)
Here's my multithreaded, multiprocess Python code to test indeterminacy. Ten threads all send each other random updates. They also broadcast their own timestamp and their own view of every other server's timestamps.
https://replit.com/@Chronological/InconsistencySimulation#ma... (click Run and watch the output, you'll have to wait 10 seconds)
A read in this simulation is the MIN of all servers' reported timestamps.
Ten seconds into the simulation, we ask every thread for its own view of what the counter value is. Sometimes they will all report the same value; a lot of the time they will be split-brained.
I am aware that wall clock timestamps are not suitable for ordering in a distributed system and that logical or vector clocks should be used for ordering.
If you can get the simulation to all report the same number at any point in time, then that would be great :-)
Ordering in distributed systems is significant: the eventual consistency of the simulation means that some values can arrive late but still affect the result, so it is not linearizable. The Bloom language tries to solve this.
I'm specifically interested in scaling WITH consistency but I think this is quite difficult.
> If I have a distributed microservice architecture and I want to keep multiple datastores in synchronization or "consistent" what's the industry best practice?
Not to use a distributed microservice architecture.
For distributed systems, the main idea is a centralized write ordering journal that is replayed by individual nodes.
Multiple systems write sequentially to the central journal. The journal simply takes requests, like a key-value store. The journal is replicated to all the nodes. The nodes read from the journal and perform the complex logic requested.
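A toy Python sketch of that journal-replay idea, with the journal as an in-memory list and nodes replaying it in order (illustrative names, ignoring replication and failure handling):

```python
# Centralized write-ordering journal sketch: all writes are appended
# to one ordered log, and every node reaches the same state by
# replaying the log in order. Order is decided exactly once, centrally.

class Journal:
    def __init__(self):
        self.entries = []   # the single source of write order

    def append(self, op):
        self.entries.append(op)
        return len(self.entries) - 1   # position = global order

class Node:
    def __init__(self):
        self.state = {}
        self.applied = 0

    def replay(self, journal):
        # Apply every entry we haven't seen yet, in journal order.
        for key, value in journal.entries[self.applied:]:
            self.state[key] = value
        self.applied = len(journal.entries)

j = Journal()
j.append(("x", 1))
j.append(("x", 2))   # the later write wins because order is global
n1, n2 = Node(), Node()
n1.replay(j)
n2.replay(j)
assert n1.state == n2.state == {"x": 2}
```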
If business software practices is not enough of a proof, look at any massive online games. They all use one central server as a source of truth about the game world, and broadcast that state to the clients. Anything a client reports that diverges from the central server view is either corrected, rejected, or becomes a reason to disconnect the client for cheating attempts.
If you need strict order, that order should happen in strictly one place. (The universe itself does not support strict order at a distance, as Special Relativity shows.)
Bitcoin: Am I joke to you?
Everyone: yes.
There's no performance gain from bitcoin's approach though. The distributed consensus is there for trustless operation, not performance. It's many orders of magnitude slower (and a few more orders of magnitude more power inefficient!) than just having one server handle it.
Isn't bitcoin actually eventual consensus determined by increasing unlikelihood of finding a conflicting sequence of hashes?
Bitcoin is an interesting exception to the above. Issue is latency.
that is server/many clients, not distributed
I’d suggest to re-evaluate if you really, really need (a) distributed data stores and (b) synchronous consistency. Things become much simpler if you can forego one of them.
I suspect TigerBeetle DB will be the industry benchmark for consistent, high throughput, fault tolerant, distributed databases in 5 years.
Is your inter-machine messaging asynchronous and is it possible for one of your machines to crash?
Look at the Raft protocol. In general you’d rather not integrate at the protocol layer; it’s standard to use a consistent store like etcd that implements Raft for the coordination/data that needs to be serializable.
(Kubernetes uses etcd so it scales pretty well for a strongly-consistent k/v store.)
You said "multiple datastores", so I'm assuming you have heterogeneous data and something like CockroachDB isn't an option.
> I'm a beginner in this topic and I find this topic interesting.
Not trying to gatekeep, but rolling your own is dangerous. See https://aphyr.com/ for the gold standard in testing (great educational material). You can use Jepsen to test your distributed systems. But it's better to just use datastores that Kyle has shown are solid.
Thanks for your reply.
I've experimented with a toy Raft implementation, but I haven't Jepsen-tested it and it's incomplete.
I did write a Jepsen test for a different, eventually consistent protocol, which understandably fails the linearizability test because I'm still learning: eventually consistent is not linearizable.
https://GitHub.com/samsquire/eventually-consistent-mesh
I want to have my cake and eat it too. Scalability and consistency.