19

I need to ignore duplicate inserts when using insert_many with pymongo, where the duplicates are based on the index. I've seen this question asked on Stack Overflow before, but I haven't seen a useful answer.

Here's my code snippet:

import logging
import pymongo

logger = logging.getLogger(__name__)
try:
    results = mongo_connection[db][collection].insert_many(documents, ordered=False, bypass_document_validation=True)
except pymongo.errors.BulkWriteError as e:
    logger.error(e)

I would like insert_many to ignore duplicates and not throw an exception (which fills up my error logs). Alternatively, is there a separate exception handler I could use, so that I can just ignore the errors? I miss "w=0"...

Thanks

6 Comments
  • Even with ordered=False, bulk "inserts" still throw errors, even though the rest of the batch actually commits. The choice is up to you: either try..except and essentially "ignore" the duplicate key error, or, if you really don't want to live with that, use "upserts" instead (see the sketch after these comments). That does require what is effectively a "find" on each document, but by its nature it "cannot" create a duplicate key. It's just how it works. Commented Jun 30, 2017 at 4:00
  • How do I ignore the specific "duplicate key" error? I don't want to inadvertently ignore other errors. Commented Jun 30, 2017 at 4:03
  • Well, the BulkWriteError or whatever the particular class is in python ( need to look that up ) will list each error in an array. Those entries have an error code, which is E11000 off the top of my head. Simply process and ignore those, and of course really "throw/complain/log/whatever" on any other code present. Commented Jun 30, 2017 at 4:05
  • This is the error string: "batch op errors occurred", which is not very specific. Commented Jun 30, 2017 at 4:07
  • 1
    Dear S.M.Styvane, Yes this question has been asked before, unfortunately none of the answers were satisfactory. Hence the reason for re-posting. But in this case, the answer is correct, and useful. Commented Jul 3, 2017 at 5:35
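
As a sketch of the "upserts" alternative mentioned in the first comment (assuming pymongo's bulk_write with UpdateOne; the collection and documents here are illustrative, not from the question):

from pymongo import MongoClient, UpdateOne

client = MongoClient()
coll = client.test.duptest

docs = [{'_id': 1, 'name': 'a'}, {'_id': 1, 'name': 'a'}, {'_id': 2, 'name': 'b'}]

# One upsert per document: match on _id and only write the other
# fields when the upsert actually inserts. A repeated _id becomes a
# harmless match instead of a duplicate key error.
ops = [
    UpdateOne(
        {'_id': d['_id']},
        {'$setOnInsert': {k: v for k, v in d.items() if k != '_id'}},
        upsert=True,
    )
    for d in docs
]
result = coll.bulk_write(ops, ordered=False)
print(result.upserted_count, result.matched_count)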

3 Answers

26

You can deal with this by inspecting the errors produced with BulkWriteError. This is actually an "object" with several properties; the interesting parts are under its details attribute:

import pymongo
from bson.json_util import dumps
from pymongo import MongoClient

client = MongoClient()
db = client.test
collection = db.duptest

# The second document duplicates the first _id
docs = [{'_id': 1}, {'_id': 1}, {'_id': 2}]

try:
    result = collection.insert_many(docs, ordered=False)
except pymongo.errors.BulkWriteError as e:
    # details holds the per-document errors from the bulk operation
    print(dumps(e.details['writeErrors'], indent=2))

On a first run, this will give the list of errors under e.details['writeErrors']:

[
  {
    "index": 1,
    "code": 11000,
    "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }",
    "op": {"_id": 1}
  }
]

On a second run, you see three errors because all items existed:

[
  {
    "index": 0,
    "code": 11000,
    "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }",
    "op": {"_id": 1}
  },
  {
    "index": 1,
    "code": 11000,
    "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }",
    "op": {"_id": 1}
  },
  {
    "index": 2,
    "code": 11000,
    "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 2 }",
    "op": {"_id": 2}
  }
]

So all you need to do is filter the array for entries with "code": 11000, and only "panic" when something else is in there:

panic = [err for err in e.details['writeErrors'] if err['code'] != 11000]

if panic:
    print("really panic")

That gives you a mechanism for ignoring the duplicate key errors while, of course, paying attention to anything that is actually a problem.
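
If you want this wrapped up for reuse, here is a minimal sketch; the helper name insert_many_ignore_dups is mine, not part of pymongo:

import pymongo

DUPLICATE_KEY_ERROR = 11000

def insert_many_ignore_dups(collection, documents):
    """Insert documents, swallowing duplicate key errors but
    re-raising the original error when anything else went wrong."""
    try:
        collection.insert_many(documents, ordered=False)
    except pymongo.errors.BulkWriteError as e:
        panic = [err for err in e.details['writeErrors']
                 if err['code'] != DUPLICATE_KEY_ERROR]
        if panic:
            raise  # not just duplicates; surface the BulkWriteError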


2 Comments

I did not know about the details field in the exception object, thanks!
@vgoklani It's kind of hidden, and not even really documented :( So even I had to go digging for it, even though I "knew it was there somewhere". Hence the delay since my last comments.
4

The correct solution is to use a WriteConcern with w=0 and ordered=False:

import pymongo
from pymongo.write_concern import WriteConcern

mongodb_connection[db][collection].with_options(
    write_concern=WriteConcern(w=0)
).insert_many(messages, ordered=False)
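
One caveat worth knowing: with w=0 the write is unacknowledged, so the driver never hears back about duplicate key errors, or about any other write error. A minimal sketch against a local test collection (names assumed) shows the effect:

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient()  # assumes a local mongod
coll = client.test.duptest.with_options(write_concern=WriteConcern(w=0))

# Unacknowledged write: no BulkWriteError is raised, even though
# the second document repeats the first _id. Genuine failures are
# silently dropped as well, which may explain mixed results.
coll.insert_many([{'_id': 1}, {'_id': 1}, {'_id': 2}], ordered=False)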

1 Comment

It does not seem to work.
3

Adding more to Neil's solution.

Passing the ordered=False and bypass_document_validation=True params allows the remaining pending insertions to proceed even when a duplicate key exception occurs.

from pymongo import MongoClient, errors

DB_CLIENT = MongoClient()
MY_DB = DB_CLIENT['my_db']
TEST_COLL = MY_DB.dup_test_coll

doc_list = [
    {
        "_id": "82aced0eeab2467c93d04a9f72bf91e1",
        "name": "shakeel"
    },
    {
        "_id": "82aced0eeab2467c93d04a9f72bf91e1",  # duplicate error: 11000
        "name": "shakeel"
    },
    {
        "_id": "fab9816677774ca6ab6d86fc7b40dc62",  # this new doc gets inserted
        "name": "abc"
    }
]

try:
    # inserts new documents even on error
    TEST_COLL.insert_many(doc_list, ordered=False, bypass_document_validation=True)
except errors.BulkWriteError as e:
    print(f"Articles bulk insertion error {e}")

    panic_list = list(filter(lambda x: x['code'] != 11000, e.details['writeErrors']))
    if len(panic_list) > 0:
        print(f"these are not duplicate errors {panic_list}")

And since we are talking about duplicates, it's worth checking this solution as well.

2 Comments

Does this write duplicate data into collection or just ignore the duplicates?
No, it raises an exception with error code 11000 for the duplicates.
