Question

Our web application has a certain resource type that is shared across multiple users and may thus be read and written by anyone at any time. We resorted to the usual suspects, database transactions and software locks with expiration times, and with a low user base and low concurrency this worked well and prevented many race conditions that would otherwise have occurred.

We are now slowly starting to scale and are finding that with a few hundred concurrent users, occasionally (about once a day) a request fails, either by timing out while waiting for a lock (which I assume happens when many requests try to acquire or are waiting on the same lock at the same time) or because of a database deadlock. Today this is only a minor annoyance, since it is still relatively infrequent and the affected user can retry the request and it will usually get through. But we are naturally concerned that this will not scale well.

How is this scenario typically addressed? I can only imagine it's a relatively common problem, but I'm not sure what specific techniques can be applied here without redesigning the entire application. We're currently looking into lock-free designs, but at first glance rewriting our application to be lock-free would seem to be a titanic task. Any advice would be appreciated.


Solution

Welcome to the deep, dark recesses of the Concurrency Jungle! This is where many app developers fear to tread--and fear it for good reason.

@kdgregory gets it exactly right: This is not one problem, but a very large, very sticky class of problems.

Most hard concurrency problems are addressable at scale, with sufficient effort and investment. None, however, are particularly simple, easy, or cheap. All of the world's high-concurrency solutions--e.g. the insides of operating systems and databases, high-scale message passing and trading systems, financial markets, travel booking systems, global web apps--have received enormous ongoing development efforts and investments, in most cases over years or decades. Most of them have required fundamental engine rewrites/roto-tilling to reach new scale plateaus.

It sounds as if you are at one of those plateaus. Your fears about "having to redesign the entire application" and about a lock-free rewrite being "a titanic task" are well-founded. If you aren't daunted by the effort involved in scaling up a highly concurrent app, you just don't understand concurrency.

I would start by reconsidering the "certain resource type that is shared across multiple users, and may thus be read and written by anyone at any time." Is it truly possible for any and every user to share this resource / these resources? Or is it just that many users may share each one? That's a slightly subtle distinction. If truly everyone may share (M:1), that's a genuine single bottleneck. If it's more accurately M:N, and the users that can share each of the N resources can be grouped or partitioned, then it may be more readily addressed, for example by:

  • Sharding. Sharding would partition users onto different servers, along with their related critical resource(s). Organizing by shards/partitions isn't automagically easy, but it's easier than many other concurrency strategies, because it nicely strength-reduces the problem into smaller sub-problems, each of which is more manageable and performant. If pure sharding isn't enough, it's sometimes possible to combine sharding with function or data shipping: that is, shard for the common cases, then move shared computations to where the data is, or dynamically recognize users that would be better migrated to other shards as the other resources they interact with are discovered. And if your central shared resource is read-mostly or read-more-often-than-written, it can possibly also be replicated across shards. (A minimal routing sketch follows the note below.)

Truly complete sharing problems are harder--especially if you have heavy updates to your critical shared resources.
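To make the routing piece concrete, here is a minimal sketch in Python of one common way to do it: a consistent-hash ring that maps a resource's partition key to the shard that owns it, so contention for a given resource never crosses shard boundaries. The shard names and the "account:42" partition key are made-up placeholders, not anything from your system.

```python
import bisect
import hashlib

class ShardRouter:
    """Consistent-hash ring mapping partition keys to shard identifiers."""

    def __init__(self, shards, virtual_nodes=64):
        # Each shard gets several virtual nodes so keys spread out evenly.
        self._ring = []
        for shard in shards:
            for i in range(virtual_nodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def shard_for(self, partition_key):
        """Return the shard responsible for the given partition key."""
        h = self._hash(str(partition_key))
        idx = bisect.bisect(self._keys, h) % len(self._ring)
        return self._ring[idx][1]

# Usage: route every read/write for a shared resource to its owning shard,
# so locking for that resource stays local to one server.
router = ShardRouter(["shard-a", "shard-b", "shard-c"])
print(router.shard_for("account:42"))
```

The point of the hash ring (rather than a simple modulo) is that adding or removing a shard only remaps a small fraction of keys, which matters once you start migrating users between shards.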

  • Scale Up. For some scale ranges, this can be addressed with "scale up" servers--buying much bigger, shared-memory servers. The servers are a lot more expensive than traditional web gear, but the large memory complement and extremely high-speed system interconnect go a long way toward solving communication/coordination problems. Even if the gear is expensive, you don't have to worry so much about software rewrites, which can make it an economic win. ("Send in the mainframes!")
  • Parallel Scale Up. A fallback from there is logically scale-up data stores: parallel database and middleware engines (e.g. DB2, NonStop SQL, Teradata) that, while parallel internally, appear to your code as unified, less-parallel services. (Internally such servers often use InfiniBand or proprietary low-latency system-to-system interconnects, making them a hybrid between full "scale up" and distributed servers.)

Many web apps, however, skip the various levels of "scale up" solutions and head directly for the final frontier:

  • Fully distributed. Rewrite to use middleware/services with exposed parallelism. Examples would be Hadoop, Amazon Web Services' SimpleDB database and SQS queueing services, the parallelism models of many NoSQL databases, and the API semantics of most high-scale web services (e.g. Twitter). To use these, developers must embrace much different and usually much weaker concurrency semantics than traditional databases offer (e.g. "eventual consistency") and must accept responsibility for app-managed concurrency (also a weaker semantic and service level than app developers have been accustomed to; see the sketch after the next paragraph). This preference for exposed parallelism in web apps is often cultural, since so many web apps are "green fields" and there is a strong preference for scale-out strategies and "parallel everything" (really, "distributed everything," since even scale-up systems are highly parallel).

This final "entirely distributed, parallelism exposed" infrastructure is popular among the highest scale web apps (e.g. Google, Facebook, Twitter), but it's also where your "large rewrite effort" fear comes in most directly, because adopting these would likely change the level at which your app manages interactions/sharing. I'd suggest reviewing the opportunities for the more modest sharding and/or scale-up strategies before diving into the full "eventual consistency" and "rewrite for different sharing semantics" thickets.
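To make "app-managed concurrency" less abstract, here is a minimal sketch of the optimistic read-modify-write loop that such stores typically push onto the application. The store interface here (a `get` returning a value plus version, and a `put_if_version` conditional write raising `ConflictError`) is a hypothetical stand-in, not any specific product's API.

```python
import random
import time

class ConflictError(Exception):
    """Raised when the stored version changed between read and write."""

def update_with_retry(store, key, mutate, max_attempts=5):
    """Read-modify-write with a version check, retrying on conflicts."""
    for attempt in range(max_attempts):
        value, version = store.get(key)      # read current value and its version
        new_value = mutate(value)            # apply the change locally
        try:
            store.put_if_version(key, new_value, expected_version=version)
            return new_value                  # conditional write succeeded
        except ConflictError:
            # Another writer got in first; back off briefly and try again.
            time.sleep(random.uniform(0, 0.05 * (2 ** attempt)))
    raise RuntimeError(f"gave up after {max_attempts} conflicting updates to {key!r}")

# Usage: increment a shared counter without holding any lock.
# new_total = update_with_retry(store, "resource:42:count", lambda v: v + 1)
```

Nothing blocks here, so there are no lock timeouts or deadlocks, but the application now owns conflict detection, retries, and deciding what "losing" a race means for its data.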

Other tips

Would it be possible to use an update-log approach? That is, instead of directly updating the resource, code would simply write an update entry to a table/queue/whatever-makes-sense, and a periodic task (or some other mechanism) would apply the updates to the resource. Now you only have lock contention on the log, which should be significantly lower than contention for your shared resource.
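A rough sketch of that idea, assuming for illustration a numeric resource and Python's sqlite3 module; the `resources` and `resource_updates` tables are made-up names, and a real update entry might carry more structure than a single delta:

```python
import sqlite3

def record_update(conn, resource_id, delta):
    """Request path: append an entry to the log; never touch the resource row."""
    conn.execute(
        "INSERT INTO resource_updates (resource_id, delta, applied) VALUES (?, ?, 0)",
        (resource_id, delta),
    )
    conn.commit()

def apply_pending_updates(conn, resource_id):
    """Periodic task: fold all pending entries into the resource in one transaction."""
    with conn:  # only this worker ever contends for the resource row
        rows = conn.execute(
            "SELECT id, delta FROM resource_updates "
            "WHERE resource_id = ? AND applied = 0",
            (resource_id,),
        ).fetchall()
        if not rows:
            return
        total = sum(delta for _, delta in rows)
        conn.execute(
            "UPDATE resources SET value = value + ? WHERE id = ?",
            (total, resource_id),
        )
        placeholders = ",".join("?" * len(rows))
        conn.execute(
            f"UPDATE resource_updates SET applied = 1 WHERE id IN ({placeholders})",
            [row_id for row_id, _ in rows],
        )
```

The trade-off is freshness: readers see the resource as of the last apply pass, so this fits best when updates commute (counters, appends) or when slightly stale reads are acceptable.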

Licensed under: CC-BY-SA with attribution
Not affiliated with softwareengineering.stackexchange