Intro
In this post, we will look at the defaultdict datatype from the collections module.
The collections.defaultdict is a subclass of the built-in dict that accepts a callable (default_factory) during its initialization. Then, when you try to access a key in the initialized object, it does a regular dict lookup to fetch the key’s value. If the key is missing, it calls the default_factory that generates a value for the key requested. This key-value pair is then stored in the dict and the value is returned.
Use cases
The primary use case is when you need to build up a dict based on an iterable and don’t want to check if keys exist before operating on them. This is extremely common when grouping items, counting occurrences, or building nested data structures.
For example, when grouping items by some attribute, you’d normally need to check if the key exists before appending to a list. With defaultdict(list), you can just append directly - if the key doesn’t exist, it automatically creates an empty list first.
Another common scenario is counting: instead of checking if a key exists before incrementing a counter, you can use defaultdict(int) and increment directly, since missing keys default to 0.
The beauty of defaultdict is that it eliminates boilerplate conditional logic, making your code cleaner and often more readable!
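As a rough illustration of the boilerplate being removed, here is a minimal sketch of the counting case described above. The `words` list and the variable names are invented for this example.

```python
from collections import defaultdict

# hypothetical input, invented for this sketch
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# plain dict - every increment needs an existence check
counts = {}
for word in words:
    if word not in counts:
        counts[word] = 0
    counts[word] += 1

# defaultdict(int) - missing keys start at int() == 0
dd_counts = defaultdict(int)
for word in words:
    dd_counts[word] += 1

print(counts)
# [Out]: {'apple': 3, 'banana': 2, 'cherry': 1}
print(dd_counts == counts)
# [Out]: True
```

Both loops produce the same counts; the defaultdict version just drops the conditional.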
Usage
Here’s a basic example showing the difference between regular dict and defaultdict:
from collections import defaultdict
# regular dict - requires existence checks
d = {}
if "fruits" not in d:
    d["fruits"] = []
d["fruits"].append("apple")
d["fruits"].append("banana")
# another way to do it using a regular dict: setdefault()
# but this isn't very Pythonic and is less readable as well
d = {}  # start over so the items are not appended twice
d.setdefault("fruits", []).append("apple")
d.setdefault("fruits", []).append("banana")
# defaultdict - no checks needed
dd = defaultdict(list)
dd["fruits"].append("apple") # just works!
dd["fruits"].append("banana")
print(dd)
# [Out]: defaultdict(<class 'list'>, {'fruits': ['apple', 'banana']})
💡 A bit about dict internals and default_factory
In the standard dict implementation, a lookup like d["key"] calls the __getitem__ method. If this method cannot find the key and the class defines a __missing__ method, it calls __missing__(key).
A regular dict does not define __missing__, so the lookup raises a KeyError. A defaultdict, however, defines it to call the default_factory callable (which is the first argument passed to defaultdict, as stated above). The return value of this callable is assigned to the key and is also returned.
The default_factory can be any callable, including builtins, functions and lambdas. It can also be None or left out entirely. If None or nothing is passed, a lookup for a non-existent key raises a KeyError, like a regular dict.
As this callable decides what is returned when a non-existent key is looked up, it must be callable without any arguments.
Consider some examples:
from collections import defaultdict
# callable as `list`
dd = defaultdict(list)
print(dd["foo"])
# [Out]: []
# callable as `int`
dd = defaultdict(int)
print(dd["foo"])
# [Out]: 0
# callable as a lambda
dd = defaultdict(lambda: {"bar": []})
print(dd["foo"])
# [Out]: {'bar': []}
# callable as a function
def cust():
    return "DEFAULT"
dd = defaultdict(cust)
print(dd["foo"])
# [Out]: DEFAULT
# callable as `None` (kept last, because the lookup raises)
dd = defaultdict(None)
print(dd["foo"])
# [Out]: KeyError: 'foo'
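To make the __missing__ hook more concrete, here is a minimal sketch of a dict subclass that imitates the behaviour described above. `MiniDefaultDict` is a hypothetical name made up for this post; the real defaultdict is implemented in C and handles more details (such as its repr and copying).

```python
class MiniDefaultDict(dict):
    """Hypothetical, simplified stand-in for defaultdict (not the real implementation)."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        # dict.__getitem__ calls this hook only when `key` is absent
        if self.default_factory is None:
            raise KeyError(key)
        value = self.default_factory()  # called with no arguments
        self[key] = value               # store the generated value
        return value


md = MiniDefaultDict(list)
md["foo"].append(1)
print(md)
# [Out]: {'foo': [1]}
print(md.get("bar"))  # .get() bypasses __missing__
# [Out]: None
print("bar" in md)
# [Out]: False
```

Only the square-bracket lookup goes through __missing__, which is why .get() and the in operator leave the dict untouched here.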
The default_factory argument can be any callable that returns a value. This enables various use cases, for example:
To directly append into a list:
dd = defaultdict(list)
dd["items"].append(1)
print(dd["items"])
# [Out]: [1]
To count grouped objects:
dd = defaultdict(int)
for i in range(11):
    if i % 2 == 0:
        dd["even"] += 1
    else:
        dd["odd"] += 1
print(dd)
# [Out]: defaultdict(<class 'int'>, {'even': 6, 'odd': 5})
Set custom default values:
dd = defaultdict(lambda: "N/A")
print(dd["missing_key"])
# [Out]: N/A
Nested (2D) data structures:
dd = defaultdict(lambda: defaultdict(int))
dd["user1"]["score"] += 10
dd["user1"]["score"] += 5
print(dd["user1"]["score"])
# [Out]: 15
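The nested idea can be pushed further than two levels. The following is a sketch, not something from the standard library: `tree` is a hypothetical helper whose factory is itself, so every missing key becomes another nested defaultdict. The `config` keys are made up.

```python
from collections import defaultdict


def tree():
    # hypothetical helper: each missing key creates another tree
    return defaultdict(tree)


config = tree()
config["server"]["http"]["port"] = 8080
config["server"]["http"]["host"] = "localhost"

print(config["server"]["http"]["port"])
# [Out]: 8080
print(list(config["server"]["http"]))
# [Out]: ['port', 'host']
```

Keep in mind that merely reading a missing path in such a structure creates empty nodes along the way.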
Real-life scenarios
Here are some practical examples where defaultdict shines.
Grouping items by attribute
One of the most common uses - grouping a list of items by some property:
from collections import defaultdict
# group users by their department
users = [
    {"name": "Alice", "dept": "Engineering"},
    {"name": "Bob", "dept": "Sales"},
    {"name": "Charlie", "dept": "Engineering"},
    {"name": "Diana", "dept": "HR"},
    {"name": "Eve", "dept": "Sales"},
]
by_dept = defaultdict(list)
for user in users:
    by_dept[user["dept"]].append(user["name"])
print(dict(by_dept))
# [Out]: {
# 'Engineering': ['Alice', 'Charlie'],
# 'Sales': ['Bob', 'Eve'],
# 'HR': ['Diana']
# }
Building an inverted index
When working with search functionality, you often need to build an inverted index - mapping words to document IDs:
from collections import defaultdict
documents = {
    1: "Learn Python",
    2: "Webdev with Python",
    3: "Webdev with NodeJS",
}
# build inverted index: word -> set of doc IDs
iindex = defaultdict(set)
for doc_id, content in documents.items():
    for word in content.split():
        iindex[word.lower()].add(doc_id)
print(dict(iindex))
# [Out]: {
# 'learn': {1},
# 'python': {1, 2},
# 'webdev': {2, 3},
# 'with': {2, 3},
# 'nodejs': {3}
# }
# now you can quickly find documents containing a word
print(iindex["python"])
# [Out]: {1, 2}
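Building on the index above, a multi-word query could be sketched as a set intersection. `search_all` is a hypothetical helper written for this post; it uses .get() so that looking up unknown words does not add empty sets to the index.

```python
from collections import defaultdict

documents = {1: "Learn Python", 2: "Webdev with Python", 3: "Webdev with NodeJS"}

iindex = defaultdict(set)
for doc_id, content in documents.items():
    for word in content.split():
        iindex[word.lower()].add(doc_id)


def search_all(index, query):
    """Hypothetical helper: IDs of documents containing every word in `query`."""
    words = query.lower().split()
    if not words:
        return set()
    # copy the first set so the index itself is never modified
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


print(search_all(iindex, "webdev python"))
# [Out]: {2}
print(search_all(iindex, "rust"))
# [Out]: set()
print("rust" in iindex)  # the failed search did not add a key
# [Out]: False
```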
Counting occurrences with grouping
A scenario I encounter frequently - counting events or items with multiple dimensions:
import json
from collections import defaultdict
# log entries with user and action
logs = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "view_page"),
    ("alice", "view_page"),
    ("bob", "logout"),
    ("alice", "login"),
]
# count actions per user
user_actions = defaultdict(lambda: defaultdict(int))
for user, action in logs:
    user_actions[user][action] += 1
print(json.dumps(user_actions, indent=4))
# [Out]: {
# "alice": {
# "login": 2,
# "view_page": 2
# },
# "bob": {
# "login": 1,
# "logout": 1
# }
# }
Accumulating values for API response aggregation
When aggregating data from multiple API calls or database queries:
from collections import defaultdict
# simulating results from multiple API calls
api_responses = [
    {'user_id': 1, 'purchase': 100},
    {'user_id': 2, 'purchase': 50},
    {'user_id': 1, 'purchase': 200},
    {'user_id': 3, 'purchase': 75},
    {'user_id': 2, 'purchase': 30},
]
# aggregate total purchases per user
user_totals = defaultdict(int)
for response in api_responses:
    user_totals[response["user_id"]] += response["purchase"]
print(dict(user_totals))
# [Out]: {1: 300, 2: 80, 3: 75}
Caveats
The .get() method of dict does NOT use the default factory
This is probably the most surprising behavior. When you use .get() on a defaultdict, it behaves exactly like a regular dict - it does NOT call the default factory.
From the official docs:
Note that __missing__() is not called for any operations besides __getitem__(). This means that get() will, like normal dictionaries, return None as a default rather than using default_factory.
In case you aren’t aware, __getitem__ is the dict method that is called when we do a lookup using square brackets - any_dict["key"]. So dd["missing_key"] (where dd is a defaultdict) will call the default_factory, but dd.get("missing_key") will not:
from collections import defaultdict
dd = defaultdict(list)
# using bracket notation - triggers default factory
print(dd[1])
# [Out]: []
# using .get() - does NOT trigger default factory!
print(dd.get(2))
# [Out]: None
# similar to `dict`, `.get` also does not add an entry
print(dd)
# [Out]: defaultdict(<class 'list'>, {1: []})
# and it works just like in the usual dict
print(dd.get(3, []))
# [Out]: []
Passing None or nothing as default_factory
If you create a defaultdict without a factory function (or with None), it behaves like a regular dict when accessing missing keys - you get a KeyError:
from collections import defaultdict
dd1 = defaultdict()
print(dd1[1])
# [Out]: KeyError 1
dd2 = defaultdict(None)
print(dd2[1])
# [Out]: KeyError 1
# But this works
dd3 = defaultdict(lambda: None) # [1]
print(dd3[1])
# [Out]: None
print(dd3) # key WAS added this time
# [Out]: defaultdict(<function <lambda> at 0x...>, {1: None})
Note [1]: This works because default_factory is set to a lambda, which is a callable that simply returns None whenever a missing key is requested.
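A related trick, sketched here under the assumption that you want a defaultdict to stop creating keys once it has been built: default_factory is a writable attribute, so it can be set to None afterwards. The `by_dept` data is made up for illustration.

```python
from collections import defaultdict

by_dept = defaultdict(list)
by_dept["Engineering"].append("Alice")

# building is done: switch off auto-creation of missing keys
by_dept.default_factory = None

print(by_dept["Engineering"])
# [Out]: ['Alice']

try:
    by_dept["Sales"]
except KeyError as exc:
    print("KeyError:", exc)
# [Out]: KeyError: 'Sales'
```

From this point on, a typo in a key raises an error instead of silently adding an empty list.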
Comparing defaultdict with regular dict
Two defaultdicts are equal to each other and to a regular dict if they all have the same items, even if they have different default_factory values:
from collections import defaultdict
dd1 = defaultdict(list, {"a": 1})
dd2 = defaultdict(int, {"a": 1}) # different factory
regular = {"a": 1}
print(dd1 == dd2)
# [Out]: True
print(dd1 == regular)
# [Out]: True
# but watch out for empty defaultdicts with different factories
dd3 = defaultdict(list)
dd4 = defaultdict(int)
print(dd3 == dd4) # both empty, so equal
# [Out]: True
This means that default_factory is not considered in equality checks.
The in operator does not trigger default_factory
The in operator checks for key existence but does NOT trigger the default factory:
from collections import defaultdict
dd = defaultdict(list)
# "in" operator doesn't trigger factory
print("a" in dd)
# [Out]: False
print(dd) # still empty
# [Out]: defaultdict(<class 'list'>, {})
# but bracket access does
_ = dd["a"]
print("a" in dd)
# [Out]: True
This actually makes sense - you don’t want membership testing to have side effects!
Unexpected mutations during iteration
Inside a loop, accessing non-existent keys of a defaultdict triggers the default_factory and silently adds entries. This follows from what we saw earlier: a lookup triggers the __missing__() method, which in turn calls the default_factory that creates the entry.
from collections import defaultdict
dd = defaultdict(int)
dd["x"] = 1
for key in ["x", "y", "z"]:
    print(key, dd[key])  # this adds `y` and `z` to `dd`!
# instead, check first or use .get()
dd = defaultdict(int)  # start again with a fresh dict
dd["x"] = 1
for key in ["x", "y", "z"]:
    if key in dd:
        print(key, dd[key])
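The .get() alternative mentioned in the comment, and what happens when the loop runs over the defaultdict itself, can be sketched as follows. The "_copy" suffix is just a made-up way to produce a missing key.

```python
from collections import defaultdict

dd = defaultdict(int)
dd["x"] = 1

# .get() reads without inserting anything
for key in ["x", "y", "z"]:
    print(key, dd.get(key, 0))
# [Out]: x 1
# [Out]: y 0
# [Out]: z 0
print(len(dd))
# [Out]: 1

# looping over dd itself while touching missing keys fails outright
error = None
try:
    for key in dd:
        dd[key + "_copy"]  # missing key -> new entry -> dict size changes
except RuntimeError as exc:
    error = exc
print(type(error).__name__)
# [Out]: RuntimeError
```

In the second loop Python refuses to continue because the dict changed size during iteration, so this variant of the mistake is at least loud rather than silent.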