Picture this: you're a dev at a payment services company. The thing with payment transactions is that each one (say, sending money from John's account to Jane's) is supposed to be processed exactly once. It's 3 AM, your PagerDuty is blowing up with alerts about cron jobs processing transactions multiple times, and you're staring at Redis locks that should prevent this exact problem. Sound familiar?
We didn't end up sending money to someone 20 times, but our problem was similar. This is the story of how we went from a complex Redis-based locking nightmare to a beautifully simple database-centric solution, and why you should probably ditch those distributed locks too.
Let me set the scene. We have this automated testing platform where developers would push their OpenAPI specs, and our job workers would generate tests in the background. Think Postman, but the test generation is automated. Simple enough, right? Wrong.
The system worked like this:

- Developers upload an OpenAPI spec to the platform
- Each spec becomes one or more test-generation tasks, stored in a tasks table with status = 'pending'
- A fleet of cron-triggered job workers picks up pending tasks and generates the test suites in the background
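For reference, here's a rough sketch of the tasks table everything below leans on. It uses the same execute_query helper the rest of the post assumes (never shown here), and the types and defaults are guesses rather than our production schema:

```python
# Hypothetical schema sketch - column names match the code in this post,
# but the exact types and defaults are assumptions.
execute_query("""
    CREATE TABLE IF NOT EXISTS tasks (
        id         SERIAL PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        status     TEXT        NOT NULL DEFAULT 'pending',  -- 'pending' -> 'processing' -> whatever terminal states you need
        created_at TIMESTAMPTZ NOT NULL DEFAULT now()
    )
""")
```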
Our "brilliant" solution was Redis user-level locking:
```python
def main():
    redis = redis_client()

    # Get users with pending tasks
    users_with_pending_tasks = get_users_with_pending_tasks()

    # Try to acquire lock for a user
    for user in users_with_pending_tasks:
        user_id = user["user_id"]
        lock_acquired = redis.setnx(f"process_lock_{user_id}", user_id)

        if lock_acquired:
            print(f"Acquired lock for user {user_id}")
            process_user_tasks(user_id)
            redis.delete(f"process_lock_{user_id}")  # Release lock
            break
```
This looked solid on paper. One job per user, clean separation, what could go wrong?
Turns out, a lot. Here's what we discovered after days of debugging.
Jobs would crash (because who writes perfect code?), leaving behind zombie locks in Redis. Developer #12345's tasks would be forever locked, waiting for a job that no longer exists. We'd have to manually clean these up, which is about as fun as debugging CSS alignment issues.
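Avoiding zombie locks properly isn't trivial, either. A more robust Redis version needs at least a TTL on the lock and a token-checked release. Here's a rough sketch with redis-py (the key names and timeout are illustrative, not our code), mostly to show how much extra ceremony "doing it right" demands:

```python
import uuid
import redis

r = redis.Redis()

def acquire_lock(user_id, ttl_seconds=300):
    token = str(uuid.uuid4())
    # nx=True -> only set if the key doesn't already exist;
    # ex=ttl_seconds -> expire automatically so a crashed job can't hold it forever
    if r.set(f"process_lock_{user_id}", token, nx=True, ex=ttl_seconds):
        return token
    return None

RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def release_lock(user_id, token):
    # Only delete the lock if we still own it; otherwise we might free a lock
    # that expired and was already re-acquired by another worker
    r.eval(RELEASE_SCRIPT, 1, f"process_lock_{user_id}", token)
```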
Even worse, we had this beautiful race condition where two jobs would:

1. Both fetch the same list of users with pending tasks
2. Job A grabs the lock for a user, processes their tasks, and releases it
3. Job B, still working off the data it fetched before Job A finished, grabs the freshly released lock and processes the same tasks all over again
It was like watching two developers simultaneously fix the same bug in different branches.
The most frustrating part? The logic was sound. User-level locking should prevent duplicate processing. But we were still getting duplicate test suites generated, and developers were opening GitHub issues faster than we could close them.
After staring at this code for the hundredth time, my teammate dropped this gem: "Why are we even using Redis for this? Isn't our database already designed to handle concurrency?"
They were right. Here we were, adding this complex external dependency when PostgreSQL has been solving concurrency problems since before Redis was even a twinkle in antirez's eye.
The solution was embarrassingly simple:
```python
def claim_task(task_id):
    result = execute_query(
        "UPDATE tasks SET status = 'processing' WHERE id = %s AND status = 'pending'",
        (task_id,)
    )
    return result.rowcount > 0  # True if we successfully claimed it
```
That's it. A dozen lines of Redis lock juggling replaced by one conditional UPDATE.
When two jobs try to claim the same task simultaneously:
- The first job finds the row with status = 'pending', updates it to 'processing', and gets rowcount = 1
- The second job sees status = 'processing' (not 'pending'), updates nothing, and gets rowcount = 0

The database engine handles all the locking, isolation, and consistency for us. It's literally what ACID properties were designed for.
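If you want to convince yourself, here's a tiny standalone demo of the claim-by-conditional-UPDATE idea. It uses SQLite only so it runs anywhere without setup (note the `?` placeholders instead of `%s`); the semantics are the same as the PostgreSQL version above:

```python
import sqlite3

# Standalone demo: one pending task, two attempts to claim it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO tasks (id, status) VALUES (1, 'pending')")
conn.commit()

def claim_task(task_id):
    cur = conn.execute(
        "UPDATE tasks SET status = 'processing' WHERE id = ? AND status = 'pending'",
        (task_id,),
    )
    conn.commit()
    return cur.rowcount > 0  # exactly one caller gets True

print(claim_task(1))  # True  - first worker wins the claim
print(claim_task(1))  # False - nothing left in 'pending' to update
```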
Here's what our main processing loop became:
```python
def main():
    # Get all pending tasks - dead simple
    pending_tasks = get_tasks_by_status("pending")

    for task in pending_tasks:
        # Try to atomically claim this task
        if claim_task(task["id"]):
            print(f"Got task {task['id']}, let's go!")
            process_task(task)
        else:
            print(f"Task {task['id']} stolen by another worker, moving on...")


def claim_task(task_id):
    result = execute_query(
        "UPDATE tasks SET status = 'processing' WHERE id = %s AND status = 'pending'",
        (task_id,)
    )
    return result.rowcount > 0
```
No Redis calls. No lock cleanup. No timeouts. No complex error handling. Just pure, boring database operations.
But wait, there's more! (There's always more, isn't there?)
Our celebration was short-lived. Within a week, we discovered a new problem: one customer - let's call them MegaCorp - uploaded their monolithic API spec with 1,000 endpoints (yes, we've all been there). Guess what happened? All our job instances started fighting over that user's tasks, completely ignoring everyone else.
Meanwhile, Sarah uploaded her simple microservice spec with 3 endpoints and watched it sit in the queue for hours while MegaCorp's monolith hogged all the workers. Classic tragedy of the commons.
This is where the simplicity of our solution became both a blessing and a curse. It treated every task equally, regardless of which user it belonged to - which meant one heavy user could drown out everyone else.
We went with the simplest solution first: limit each user to 2 concurrent test generation tasks max.
```python
def main():
    pending_tasks = get_tasks_by_status("pending")

    for task in pending_tasks:
        # Check if this user is already hogging resources
        user_task_count = get_user_task_count(task["user_id"], "processing")

        if user_task_count >= 2:  # Max 2 concurrent tasks per user
            print(f"User {task['user_id']} hit their limit, skipping...")
            continue

        # Try to claim the task
        if claim_task(task["id"]):
            print(f"Processing task for user {task['user_id']}")
            process_task(task)


def get_user_task_count(user_id, status):
    result = execute_query(
        "SELECT COUNT(*) as count FROM tasks WHERE user_id = %s AND status = %s",
        (user_id, status)
    )
    return result["count"]
```
Simple, effective, and Sarah is happy again.
If you want to be fancy (and reduce database calls), you can do the fairness check and task claiming in one atomic operation:
```python
def claim_task_with_fairness(task_id, max_concurrent=2):
    result = execute_query("""
        UPDATE tasks t1
        SET status = 'processing'
        WHERE t1.id = %s
          AND t1.status = 'pending'
          AND (
              SELECT COUNT(*) FROM tasks t2
              WHERE t2.user_id = t1.user_id AND t2.status = 'processing'
          ) < %s
    """, (task_id, max_concurrent))
    return result.rowcount > 0
```
This is beautiful from a database perspective - one query does it all. But it's harder to debug when things go wrong, and trust me, things will go wrong.
For maximum fairness, you can prioritize users who have fewer tasks running:
```python
def get_fair_pending_tasks():
    """Get tasks ordered by user fairness - users with fewer running tasks go first"""
    return execute_query("""
        SELECT t1.*
        FROM tasks t1
        LEFT JOIN (
            SELECT user_id, COUNT(*) as running_count
            FROM tasks
            WHERE status = 'processing'
            GROUP BY user_id
        ) t2 ON t1.user_id = t2.user_id
        WHERE t1.status = 'pending'
        ORDER BY COALESCE(t2.running_count, 0) ASC, t1.created_at ASC
    """)
```
This query is doing some heavy lifting:

- The subquery counts how many tasks each user currently has in 'processing'
- The LEFT JOIN keeps users with zero running tasks in the result, and COALESCE turns their missing count into 0
- The ORDER BY puts users with the fewest running tasks first, breaking ties by task age
It's more complex but gives you true round-robin fairness. MegaCorp's monolith still gets processed, but not at Sarah's microservice's expense.
You could also add a last_processed_at timestamp to users and prioritize those who haven't been processed recently. But honestly, that's probably overkill unless you're running something like GitHub Actions at scale.
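If you ever do go down that road, it might look something like this - the users table layout and the last_processed_at column are assumptions here, not something we actually shipped:

```python
# Hypothetical: prioritize users whose tasks we haven't touched in the longest
# time. Assumes a users.last_processed_at column that workers update after
# finishing a task for that user.
def get_least_recently_served_tasks():
    return execute_query("""
        SELECT t.*
        FROM tasks t
        JOIN users u ON u.id = t.user_id
        WHERE t.status = 'pending'
        ORDER BY u.last_processed_at ASC NULLS FIRST, t.created_at ASC
    """)
```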
I spent weeks building a distributed locking system when PostgreSQL was sitting there like "I've literally been doing this since 1996, but okay…"
Databases are designed for concurrency. ACID properties exist for exactly these scenarios. Use them.
Every line of Redis locking code was a potential failure point:

- Locks with no expiry that outlived the jobs that created them
- Manual cleanup whenever a zombie lock wedged a user's queue
- The acquire/process/release dance wrapped around every piece of work
- A whole extra piece of infrastructure to run, monitor, and keep consistent with the database
The database solution eliminated all of this. Sometimes the best code is the code you don't write.
We learned this the hard way when MegaCorp's monolith starved all the microservice users. If you're building anything where multiple users compete for resources, think about fairness from day one, not when your users start filing angry GitHub issues.
We went with the two-query approach (check user count, then claim task) rather than the fancy single-query version. Why? Because when something breaks at 3 AM, you want to be able to debug it quickly.
Let's be honest about the trade-offs:
Database Approach:

- No extra infrastructure: the task state and the "lock" are the same row in the same database
- Claims are atomic conditional UPDATEs, so the engine handles isolation for you
- No separate lock keys to leak, expire, or clean up
- Easy to debug: a task's status is sitting right there in the table
- The cost: every poll and every claim is another query against your database

Redis Approach:

- Keeps locking traffic off your primary database
- In-memory speed and built-in TTLs for genuinely short-lived locks
- The cost: an extra dependency to run and monitor, zombie locks when jobs die, and more code to get subtly wrong
For our use case, the database approach was clearly better. Your mileage may vary.
Go with database atomicity when:

- The state you're protecting already lives in the database (like our tasks table)
- "Do this exactly once" can be expressed as a conditional state transition on a row
- You care more about simplicity and debuggability than raw locking throughput

Stick with Redis/distributed locks when:

- The thing you're guarding isn't a database row (an external API, a file, a rate limit)
- The services that need to coordinate don't share a database
- Lock churn is so high that the extra database load would actually hurt
We replaced 50 lines of complex Redis locking logic with 5 lines of SQL and immediately solved our duplicate processing problem. Sometimes the best engineering solution is the boring one.
Your database has been solving concurrency problems longer than most of us have been writing code. Maybe it's time to trust it.


