Intro

In this post, we will look at the defaultdict datatype from the collections module.

The collections.defaultdict is a subclass of the built-in dict that accepts a callable (default_factory) during its initialization. Then, when you try to access a key in the initialized object, it does a regular dict lookup to fetch the key’s value. If the key is missing, it calls the default_factory that generates a value for the key requested. This key-value pair is then stored in the dict and the value is returned.

Use cases

The primary use case is when you need to build up a dict based on an iterable and don’t want to check if keys exist before operating on them. This is extremely common when grouping items, counting occurrences, or building nested data structures.

For example, when grouping items by some attribute, you’d normally need to check if the key exists before appending to a list. With defaultdict(list), you can just append directly - if the key doesn’t exist, it automatically creates an empty list first.

Another common scenario is counting: instead of checking if a key exists before incrementing a counter, you can use defaultdict(int) and increment directly, since missing keys default to 0.
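
As a quick illustration of the counting case, here is a minimal sketch that tallies the characters of a string. The `text` value and the `counts` name are made up for this example:

```python
from collections import defaultdict

text = "mississippi"  # sample input, chosen only for illustration

counts = defaultdict(int)
for ch in text:
    counts[ch] += 1  # a missing key starts at int() == 0

print(dict(counts))
# [Out]: {'m': 1, 'i': 4, 's': 4, 'p': 2}
```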

The beauty of defaultdict is that it eliminates boilerplate conditional logic, making your code cleaner and often more readable!

Usage

Here’s a basic example showing the difference between regular dict and defaultdict:

from collections import defaultdict

# regular dict - requires existence checks
d = {}
if "fruits" not in d:
    d["fruits"] = []
d["fruits"].append("apple")
d["fruits"].append("banana")

# another way with a regular dict: setdefault()
# (works, but is arguably less readable)
d = {}
d.setdefault("fruits", []).append("apple")
d.setdefault("fruits", []).append("banana")

# defaultdict - no checks needed
dd = defaultdict(list)
dd["fruits"].append("apple")  # just works!
dd["fruits"].append("banana")
print(dd)
# [Out]: defaultdict(<class 'list'>, {'fruits': ['apple', 'banana']})

💡 A bit about dict internals and default_factory

When we do a lookup like d["key"], Python calls the __getitem__ method. If dict.__getitem__ cannot find the key, it checks whether the class defines a __missing__ method and, if so, calls it with the key.

A regular dict does not define __missing__, so the lookup raises a KeyError. A defaultdict does define it: its __missing__ calls the default_factory callable (the first argument passed to defaultdict, as stated above), stores the return value under the missing key, and returns it.
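
To make the mechanism concrete, here is a rough pure-Python sketch of the same idea. `MyDefaultDict` is a hypothetical, simplified stand-in written for this post; it is not how the real defaultdict is implemented, and it only relies on the documented __missing__ hook of dict subclasses:

```python
class MyDefaultDict(dict):
    """Hypothetical, simplified stand-in for collections.defaultdict."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        # dict.__getitem__ calls this hook only when `key` is absent
        if self.default_factory is None:
            raise KeyError(key)
        value = self.default_factory()  # called with no arguments
        self[key] = value               # store the new key-value pair
        return value


d = MyDefaultDict(list)
d["a"].append(1)
print(d)
# [Out]: {'a': [1]}

print(d.get("b"))  # .get() never goes through __missing__
# [Out]: None
```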

The default_factory can be any callable, including builtins, functions and lambdas. It can also be None or omitted entirely; in that case, looking up a non-existent key raises a KeyError, just like a regular dict.

Because this callable decides what is returned when a non-existent key is looked up, it must be callable without any arguments. Its return value is then assigned to the key.

Consider some examples:

from collections import defaultdict

# callable as `list`
dd = defaultdict(list)
print(dd["foo"])
# [Out]: []

# callable as `int`
dd = defaultdict(int)
print(dd["foo"])
# [Out]: 0

# callable as a lambda
dd = defaultdict(lambda: {"bar": []})
print(dd["foo"])
# [Out]: {'bar': []}

# `None` instead of a callable
dd = defaultdict(None)
print(dd["foo"])
# [Out]: KeyError 'foo'

# callable as a function
def cust():
    return "DEFAULT"
dd = defaultdict(cust)
print(dd["foo"])
# [Out]: DEFAULT
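
The "no arguments" requirement is easy to trip over, so here is a small sketch of what happens when it is violated. The lambdas and key names are made up for illustration, and the exact error messages may differ between Python versions:

```python
from collections import defaultdict

# a non-callable first argument is rejected right away
try:
    defaultdict(0)
except TypeError as exc:
    print("TypeError:", exc)

# a factory that expects an argument only fails on the first missing-key
# lookup, because the factory is called with no arguments (no key is passed)
dd = defaultdict(lambda key: key.upper())
try:
    dd["foo"]
except TypeError:
    print("factory was called without arguments")
# [Out]: factory was called without arguments

# to use a fixed default value, wrap it in a zero-argument lambda
dd = defaultdict(lambda: 10)
dd["points"] += 5
print(dd["points"])
# [Out]: 15
```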

The default_factory argument can be any callable that returns a value. This can be used to demonstrate various use cases, viz.:

  • To directly append into a list:

    dd = defaultdict(list)
    dd["items"].append(1)
    print(dd["items"])
    # [Out]: [1]
    
  • To count grouped objects:

    dd = defaultdict(int)
    for i in range(11):
        if i % 2 == 0:
            dd["even"] += 1
        else:
            dd["odd"] += 1
    print(dd)
    # [Out]: defaultdict(<class 'int'>, {'even': 6, 'odd': 5})
    
  • Set custom default values:

    dd = defaultdict(lambda: "N/A")
    print(dd["missing_key"])
    # [Out]: N/A
    
  • Nested (2D) data structures:

    dd = defaultdict(lambda: defaultdict(int))
    dd["user1"]["score"] += 10
    dd["user1"]["score"] += 5
    print(dd["user1"]["score"])
    # [Out]: 15
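
The nested example above is fixed at two levels. If the depth is not known in advance, one possible approach is a factory that refers to itself. The `tree` helper and the `config` keys below are hypothetical names used only for this sketch:

```python
from collections import defaultdict


def tree():
    # hypothetical helper: every missing key creates another nested tree
    return defaultdict(tree)


config = tree()
config["db"]["primary"]["host"] = "localhost"
config["db"]["primary"]["port"] = 5432

print(config["db"]["primary"]["port"])
# [Out]: 5432

print("replica" in config["db"])  # membership tests do not create keys
# [Out]: False
```

Keep in mind that with this pattern any mistyped key in a bracket lookup quietly creates a new empty branch.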
    

Real-life scenarios

Here are some practical examples where defaultdict shines.

Grouping items by attribute

One of the most common uses - grouping a list of items by some property:

from collections import defaultdict

# group users by their department
users = [
    {"name": "Alice", "dept": "Engineering"},
    {"name": "Bob", "dept": "Sales"},
    {"name": "Charlie", "dept": "Engineering"},
    {"name": "Diana", "dept": "HR"},
    {"name": "Eve", "dept": "Sales"},
]

by_dept = defaultdict(list)
for user in users:
    by_dept[user["dept"]].append(user["name"])

print(dict(by_dept))
# [Out]: {
#   'Engineering': ['Alice', 'Charlie'],
#   'Sales': ['Bob', 'Eve'],
#   'HR': ['Diana']
# }

Building an inverted index

When working with search functionality, you often need to build an inverted index - mapping words to document IDs:

from collections import defaultdict

documents = {
    1: "Learn Python",
    2: "Webdev with Python",
    3: "Webdev with NodeJS",
}

# build inverted index: word -> set of doc IDs
iindex = defaultdict(set)
for doc_id, content in documents.items():
    for word in content.split():
        iindex[word.lower()].add(doc_id)

print(dict(iindex))
# [Out]: {
#    'learn': {1},
#    'python': {1, 2},
#    'webdev': {2, 3},
#    'with': {2, 3},
#    'nodejs': {3}
# }

# now you can quickly find documents containing a word
print(iindex["python"])
# [Out]: {1, 2}
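
Building on the index above, a multi-word query could be answered by intersecting the sets. This is only a sketch: the `search` helper is a hypothetical name, and the index is rebuilt here so the snippet runs on its own:

```python
from collections import defaultdict

documents = {
    1: "Learn Python",
    2: "Webdev with Python",
    3: "Webdev with NodeJS",
}

iindex = defaultdict(set)
for doc_id, content in documents.items():
    for word in content.split():
        iindex[word.lower()].add(doc_id)


def search(index, query):
    # hypothetical helper: IDs of documents that contain every query word
    words = query.lower().split()
    if not words:
        return set()
    # .get() avoids inserting empty sets for words that were never indexed
    return set.intersection(*(index.get(w, set()) for w in words))


print(search(iindex, "webdev python"))
# [Out]: {2}
print(search(iindex, "rust"))
# [Out]: set()
print("rust" in iindex)  # the failed query did not grow the index
# [Out]: False
```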

Counting occurrences with grouping

A scenario I encounter frequently - counting events or items with multiple dimensions:

import json
from collections import defaultdict

# log entries with user and action
logs = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "view_page"),
    ("alice", "view_page"),
    ("bob", "logout"),
    ("alice", "login"),
]

# count actions per user
user_actions = defaultdict(lambda: defaultdict(int))
for user, action in logs:
    user_actions[user][action] += 1

print(json.dumps(user_actions, indent=4))
# [Out]: {
#     "alice": {
#         "login": 2,
#         "view_page": 2
#     },
#     "bob": {
#         "login": 1,
#         "logout": 1
#     }
# }
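
json.dumps() accepts the nested defaultdict directly because defaultdict is a dict subclass. If you would rather hand out plain dicts (so that later lookups cannot add keys by accident), one option is a small recursive conversion. The `to_plain_dict` helper below is a hypothetical name, and the data is a shortened version of the example above:

```python
from collections import defaultdict


def to_plain_dict(obj):
    # hypothetical helper: recursively turn defaultdicts into regular dicts
    if isinstance(obj, defaultdict):
        return {key: to_plain_dict(value) for key, value in obj.items()}
    return obj


user_actions = defaultdict(lambda: defaultdict(int))
user_actions["alice"]["login"] += 2
user_actions["bob"]["logout"] += 1

plain = to_plain_dict(user_actions)
print(plain)
# [Out]: {'alice': {'login': 2}, 'bob': {'logout': 1}}

# the plain copy raises KeyError instead of silently adding keys
try:
    plain["carol"]
except KeyError:
    print("no such user")
# [Out]: no such user
```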

Accumulating values for API response aggregation

When aggregating data from multiple API calls or database queries:

from collections import defaultdict

# simulating results from multiple API calls
api_responses = [
    {'user_id': 1, 'purchase': 100},
    {'user_id': 2, 'purchase': 50},
    {'user_id': 1, 'purchase': 200},
    {'user_id': 3, 'purchase': 75},
    {'user_id': 2, 'purchase': 30},
]

# aggregate total purchases per user
user_totals = defaultdict(int)
for response in api_responses:
    user_totals[response["user_id"]] += response["purchase"]

print(dict(user_totals))
# [Out]: {1: 300, 2: 80, 3: 75}

Caveats

The .get() method of dict does NOT use the default factory

This is probably the most surprising behavior. When you use .get() on a defaultdict, it behaves exactly like a regular dict - it does NOT call the default factory.

From the official docs:

Note that __missing__() is not called for any operations besides __getitem__(). This means that get() will, like normal dictionaries, return None as a default rather than using default_factory.

In case you aren’t aware, __getitem__ is the dict method that is called when we do a lookup using square brackets - any_dict["key"]. So dd["missing_key"] (where dd is a defaultdict) will call the default_factory, but dd.get("missing_key") will not:

from collections import defaultdict

dd = defaultdict(list)

# using bracket notation - triggers default factory
print(dd[1])
# [Out]: []


# using .get() - does NOT trigger default factory!
print(dd.get(2))
# [Out]: None

# similar to `dict`, `.get` also does not add an entry
print(dd)
# [Out]: defaultdict(<class 'list'>, {1: []})

# and a fallback passed to .get() works just like in a regular dict
print(dd.get(3, []))
# [Out]: []

Passing None or nothing as default_factory

If you create a defaultdict without a factory function (or with None), it behaves like a regular dict when accessing missing keys - you get a KeyError:

from collections import defaultdict

dd1 = defaultdict()
print(dd1[1])
# [Out]: KeyError 1

dd2 = defaultdict(None)
print(dd2[1])
# [Out]: KeyError 1

# But this works
dd3 = defaultdict(lambda: None)  # [1]
print(dd3[1])
# [Out]: None
print(dd3)  # key WAS added this time
# [Out]: defaultdict(<function <lambda> at 0x...>, {1: None})

Note [1]: This works because default_factory is set to a lambda that simply returns None whenever a missing key is requested.
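
The None behaviour can also be put to use after the fact. default_factory is a writable attribute, so a possible pattern is to build the structure with a factory and then set it to None, after which missing keys raise KeyError again. A minimal sketch, with made-up keys:

```python
from collections import defaultdict

dd = defaultdict(list)
dd["a"].append(1)

# `default_factory` is a writable attribute; None disables auto-creation
dd.default_factory = None

try:
    dd["b"]
except KeyError:
    print("missing key is no longer created")
# [Out]: missing key is no longer created

print(dd)
# [Out]: defaultdict(None, {'a': [1]})
```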

Comparing defaultdict with regular dict

Two defaultdicts are equal to each other and to a regular dict if they all have the same items, even if they have different default_factory values:

from collections import defaultdict

dd1 = defaultdict(list, {"a": 1})
dd2 = defaultdict(int, {"a": 1})   # different factory
regular = {"a": 1}

print(dd1 == dd2)
# [Out]: True

print(dd1 == regular)
# [Out]: True

# but watch out for empty defaultdicts with different factories
dd3 = defaultdict(list)
dd4 = defaultdict(int)

print(dd3 == dd4)  # both empty, so equal
# [Out]: True

This means that default_factory is not considered in equality checks.

Using in on a defaultdict does not trigger default_factory

The in operator checks for key existence but does NOT trigger the default factory:

from collections import defaultdict

dd = defaultdict(list)

# "in" operator doesn't trigger factory
print("a" in dd)
# [Out]: False

print(dd)  # still empty
# [Out]: defaultdict(<class 'list'>, {})

# but bracket access does
_ = dd["a"]
print("a" in dd)
# [Out]: True

This actually makes sense - you don’t want membership testing to have side effects!

Unexpected mutations during iteration

Looking up non-existent keys inside a loop triggers the default_factory and silently adds those keys to the defaultdict. This follows from what we saw earlier: a bracket lookup for a missing key calls the __missing__() method, which in turn calls the default_factory and creates the entry.

from collections import defaultdict

dd = defaultdict(int)
dd["x"] = 1

for key in ["x", "y", "z"]:
    print(key, dd[key])  # this adds `y` and `z` to `dd`!

# instead, check first or use .get()
for key in ["x", "y", "z"]:
    if key in dd:
        print(key, dd[key])
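
Two related points are worth a short sketch: the .get() alternative mentioned in the comment above, and what happens if the loop runs over the defaultdict itself rather than over a separate list of keys. The `_total` suffix is just a made-up key for illustration:

```python
from collections import defaultdict

dd = defaultdict(int)
dd["x"] = 1

# .get() reads a value (or a fallback) without creating entries
for key in ["x", "y", "z"]:
    print(key, dd.get(key, 0))
print(len(dd))
# [Out]: 1

# looking up a missing key while iterating over the defaultdict itself
# changes its size mid-loop, which Python reports as a RuntimeError
try:
    for key in dd:
        dd[key + "_total"]  # missing key, so a new entry is inserted
except RuntimeError as exc:
    print("RuntimeError:", exc)
# [Out]: RuntimeError: dictionary changed size during iteration
```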