0

Situation: I have Model have a relation 1-1, sample:

class User(models.Model):
    user_namme = models.CharField(max_length=40)
    type = models.CharField(max_length=255)
    created_at = models.DatetimeField()
    ...

class Book(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)

And I have a around 200,000 records.

  • Languague: Python
  • Framework: Django
  • Database: Postgres

Question:

  • How can I delete 200,000 records above with minimal cost?

Solution I have tried:

user_ids = Users.objects.filter(type='sample', created_date__gte='2022-11-15 08:00', created_date__lt="2022-11-15 08:30").values_list('id',flat=True)[:200000] # Fetch 200,000 user ids. 
for i, _ in enumerate(user_ids[:: 1000]):
    with transaction.atomic():
        batch_start = i * self.batch_size
        batch_end = batch_start + self.batch_size
        _, deleted = Users.objects.filter(id__in=user_ids[batch_start,batch_end]
    

With this solution, my server use arround:

  • 600MB CPU
  • 300MB RAM
  • Take more 15 minutes to finish workload.

I wonder do anyone have a better solution?

3
  • Isn't the right answer here? stackoverflow.com/questions/4867852/…. Bulk delete all the books followed by all the users? Commented Nov 22, 2022 at 14:02
  • @RobertMoskal, Thanks for your answer, but in my case, I must filter on User Model due to legacy data (UserModel have field query + index). And in fact, my User Model has more than 1 relation one-one to another Model.
    – Thành Lý
    Commented Nov 23, 2022 at 3:48
  • Sounds like a job for a sql statement. Commented Nov 23, 2022 at 4:46

3 Answers 3

2

By first principles, nothing beats raw(Django query) SQL in terms of speed because it operates closest to the database!

cursor.execute(“DELETE FROM DB WHERE Column = %s”)

Or else you can do it by:

Variable = Model.objects.filter(variable=variable)

if Variable.exists():

    Variable.delete()
0
1

Thanks, everyone. I have tried the solution with RawQuery

user_ids = Users.objects.filter(type='sample', created_date__gte='2022-11-15 08:00', created_date__lt="2022-11-15 08:30").values_list('id',flat=True)[:200000] # Fetch 200,000 user ids. 

for i in range(0, 3):
    user_ids_str = ""
    for user_id in user_ids.iterator(chunk_size=5000):
        user_ids_str += f"{user_id},"
    query = f"""
            DELETE FROM "user" WHERE "user"."id" IN ({user_ids_str});
            DELETE FROM "book" WHERE "user"."id" IN ({user_ids_str});
    """
    with transaction.atomic():
        with connection.cursor() as c:
            c.execute("SET statement_timeout = '10min';")
            c.execute(query)

This one can remove 600000 records and take around 10 minutes. And the server used around:

  • CPU: 50MB
  • RAM: 200MB
1
  • If you are using straight sql why not do a join on the user table to delete the books and the delete all the users using the created_date criteria? Let the database do all the work! Commented Nov 26, 2022 at 2:50
0

If you are using straight sql why not do a join on the user table with the date criteria to delete the books and then delete all the users using the created_date criteria? Let the database do all the work!

Even without writing the join

DELETE FROM "book" WHERE "user"."id" IN (select id from user where created_date >= '2022-11-15 08:00' and...)
DELETE FROM "user" WHERE created_date >= '2022-11-15 08:00' and...

would be better than what you have.

2
  • If you're using SQL why not let the DB cascade on its own (You'll have to ensure ON DELETE CASCADE, etc. are there already)? Let the database do all the work! Commented Nov 26, 2022 at 3:12
  • Some folks don't have the ability to change schema! Commented Nov 26, 2022 at 3:59

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.