Intro
In this post, we will look at the groupby function from the itertools module.
The itertools.groupby function groups consecutive elements from an iterable. It accepts an iterable and an optional callable key that computes a key value for each element. It returns an iterator that yields tuples of (key, group), where group itself is an iterator over the consecutive elements that share that key value. If key is not specified or passed as None, then the value of each element is used as key value.
⚠️ A few critical points to remember: groupby only groups consecutive elements with the same key. If your data has non-consecutive elements with the same key, you’ll need to sort the data first. Also, each group iterator can only be consumed once, so if you want to access the groups later, you should store them as lists. See the Caveats section for more details.
Use cases
The primary use case of groupby is when you need to group consecutive items in an iterable by some property or computed value. This is extremely useful when processing sorted data, log files, or any sequential data where similar items appear together.
Common scenarios include:
- Processing sorted data where you need to aggregate items by category
- Analyzing log files where consecutive entries often share attributes (timestamps, user IDs, etc.)
- Run-length encoding for data compression
- Finding consecutive sequences in data
- Breaking down data into chunks based on some property
The beauty of groupby is its memory efficiency - it processes data lazily without loading everything into memory, making it perfect for large datasets.
Usage
Here’s a basic example showing how groupby works.
Note that we have already sorted the list in a way that we want the elements to be grouped:
from itertools import groupby
# group names by their first letter
names = ["Alice", "Bob", "Charlie", "Chuck", "Dan"]
for key, group in groupby(names, key=lambda x: x[0]):
    print(f"{key}: {list(group)}")
# [Out]:
# A: ['Alice']
# B: ['Bob']
# C: ['Charlie', 'Chuck']
# D: ['Dan']
Let’s see what happens when the elements are not sorted in the way that we want them to be grouped:
from itertools import groupby
# group by even or odd
numbers = [2, 4, 6, 11, 3, 5, 8, 10, 9]
for key, group in groupby(numbers, key=lambda x: x % 2 == 0):
    print(f"Even: {key}, Values: {list(group)}")
# [Out]:
# Even: True, Values: [2, 4, 6]
# Even: False, Values: [11, 3, 5]
# Even: True, Values: [8, 10]
# Even: False, Values: [9]
Notice how the even numbers [8, 10] formed a separate group from [2, 4, 6]. They were not consecutive in the original list: when the element 11 was encountered, the value of the key changed (it evaluated to True for the even numbers, then to False for 11), so a new group was started. The same thing happened at 8 and again at 9.
I mentioned above that elements need to be sorted in the same way that we want them grouped. Here’s what I mean by that:
Say I have a list of dicts to be grouped by a key named city in each dict. Then the list itself must also be sorted by that same key. Generally, for complex data structures, it is a good idea to store the key function (or lambda) in a variable and use it in both places: to sort the list and to pass as key to groupby:
from itertools import groupby
details = [
{"name": "Alice", "city": "Berlin"},
{"name": "Bob", "city": "Amsterdam"},
{"name": "Charlie", "city": "Mumbai"},
{"name": "Chuck", "city": "Berlin"},
{"name": "Dan", "city": "Warsaw"},
{"name": "Eve", "city": "New York"},
{"name": "Frank", "city": "Mumbai"},
{"name": "Grace", "city": "Berlin"},
]
sort_key = lambda x: x["city"]
# alternatively:
# sort_key = operator.itemgetter("city")
details.sort(key=sort_key)
for key, group in groupby(details, sort_key):
    names = [each["name"] for each in group]
    print(f"{key}: {names}")
# [Out]:
# Amsterdam: ['Bob']
# Berlin: ['Alice', 'Chuck', 'Grace']
# Mumbai: ['Charlie', 'Frank']
# New York: ['Eve']
# Warsaw: ['Dan']
By now you must have figured out that the key parameter can be any callable, such as a function or a lambda, or simply None. I’d encourage you to try out different options!
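As a starting point for that experimentation, here is a small sketch that passes a built-in function, an unbound string method, and None as key. The sample data and variable names are made up for illustration.

```python
from itertools import groupby

# key as a built-in function: group consecutive words of equal length
words = ["to", "be", "or", "not", "and", "is"]
by_length = [(k, list(g)) for k, g in groupby(words, key=len)]
print(by_length)
# [Out]: [(2, ['to', 'be', 'or']), (3, ['not', 'and']), (2, ['is'])]

# key as an unbound method: split a string into digit / non-digit runs
runs = [(k, "".join(g)) for k, g in groupby("ab12cd3", key=str.isdigit)]
print(runs)
# [Out]: [(False, 'ab'), (True, '12'), (False, 'cd'), (True, '3')]

# key=None (the default): each element acts as its own key
identity = [(k, list(g)) for k, g in groupby([1, 1, 2, 3, 3], key=None)]
print(identity)
# [Out]: [(1, [1, 1]), (2, [2]), (3, [3, 3])]
```

Note that the words of length 2 end up in two separate groups, because "is" is not adjacent to the others.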
Real-life scenarios
Here are some practical examples where groupby shines.
Processing sorted log entries
When analyzing logs that are stored chronologically, groupby can be used to group entries by timestamp, user, status code, or any other attribute. For example, grouping them by hour:
from itertools import groupby
def get_hour_from_timestamp(entry):
    return entry["timestamp"].split(":")[0]
# log entries sorted by timestamp
logs = [
{"timestamp": "10:00:36", "status": 200, "url": "/home"},
{"timestamp": "10:10:41", "status": 200, "url": "/about"},
{"timestamp": "10:20:14", "status": 404, "url": "/missing"},
{"timestamp": "11:10:26", "status": 404, "url": "/gone"},
{"timestamp": "11:40:45", "status": 500, "url": "/api"},
{"timestamp": "12:30:16", "status": 200, "url": "/home"},
]
for hour, entries in groupby(logs, key=get_hour_from_timestamp):
    print(hour, entries)  # do further processing with `entries`
# [Out]:
# 10 <itertools._grouper object at 0x..>
# 11 <itertools._grouper object at 0x..>
# 12 <itertools._grouper object at 0x..>
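Each entries object printed above is a lazy iterator, so any further processing has to happen inside the loop. One possible shape for that processing is sketched below: counting requests and error responses per hour. The shortened logs list and the summarize_hour helper are hypothetical, and treating any status of 400 or above as an error is an assumption.

```python
from itertools import groupby

def get_hour_from_timestamp(entry):
    return entry["timestamp"].split(":")[0]

def summarize_hour(entries):
    """Hypothetical helper: consume one group in a single pass."""
    total = errors = 0
    for entry in entries:
        total += 1
        if entry["status"] >= 400:  # assumption: 4xx and 5xx count as errors
            errors += 1
    return {"requests": total, "errors": errors}

# same shape as the log entries above, sorted by timestamp
logs = [
    {"timestamp": "10:00:36", "status": 200, "url": "/home"},
    {"timestamp": "10:20:14", "status": 404, "url": "/missing"},
    {"timestamp": "11:40:45", "status": 500, "url": "/api"},
    {"timestamp": "12:30:16", "status": 200, "url": "/home"},
]

# each group is fully consumed by summarize_hour before groupby advances
hourly = {
    hour: summarize_hour(entries)
    for hour, entries in groupby(logs, key=get_hour_from_timestamp)
}
print(hourly)
# [Out]: {'10': {'requests': 2, 'errors': 1}, '11': {'requests': 1, 'errors': 1}, '12': {'requests': 1, 'errors': 0}}
```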
Run-length encoding
One of the classic uses of groupby - compressing consecutive repeated values:
from itertools import groupby
def run_length_encode(data):
    """Compress consecutive repeated values."""
    return [(key, len(list(group))) for key, group in groupby(data)]

def run_length_decode(encoded):
    """Decompress run-length encoded data."""
    return [key for key, count in encoded for _ in range(count)]
# compress
original = "AAABBCCCCAAD"
compressed = run_length_encode(original)
print(compressed)
# [Out]: [('A', 3), ('B', 2), ('C', 4), ('A', 2), ('D', 1)]
# decompress
decompressed = "".join(run_length_decode(compressed))
assert decompressed == original
print(decompressed)
# [Out]: AAABBCCCCAAD
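A closely related trick covers the "finding consecutive sequences" scenario from the use cases. The sketch below pairs each number with its index and groups by their difference, which stays constant within a run of consecutive integers. The consecutive_runs name is hypothetical, and the function assumes the input is a sorted sequence of integers without duplicates.

```python
from itertools import groupby

def consecutive_runs(numbers):
    """Split a sorted sequence of distinct ints into runs of consecutive values."""
    runs = []
    # value - index is constant within a run of consecutive integers
    for _, group in groupby(enumerate(numbers), key=lambda pair: pair[1] - pair[0]):
        runs.append([value for _, value in group])
    return runs

print(consecutive_runs([1, 2, 3, 7, 8, 10]))
# [Out]: [[1, 2, 3], [7, 8], [10]]
```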
Grouping database query results
groupby is also handy when you fetch sorted data from a database and need to aggregate it.
ℹ️ Although it is usually better to delegate this to the database engine itself (using the GROUP BY clause in SQL), there might be instances when you need the ungrouped data in multiple places. In that case, groupby can be used.
from itertools import groupby
from operator import itemgetter
# simulating sorted database results
orders = [
{"customer_id": 1, "product": "apple", "quantity": 5},
{"customer_id": 1, "product": "banana", "quantity": 3},
{"customer_id": 2, "product": "orange", "quantity": 2},
{"customer_id": 2, "product": "apple", "quantity": 1},
{"customer_id": 3, "product": "banana", "quantity": 4},
]
# aggregate by customer (data is already sorted by customer_id)
summary = {}
for customer_id, orders_group in groupby(orders, key=itemgetter("customer_id")):
    summary[customer_id] = {"total_items": 0, "products": []}
    for order in orders_group:
        summary[customer_id]["total_items"] += order["quantity"]
        summary[customer_id]["products"].append(order["product"])
print(summary)
# [Out]:
# {
# 1: {'total_items': 8, 'products': ['apple', 'banana']},
# 2: {'total_items': 3, 'products': ['orange', 'apple']},
# 3: {'total_items': 4, 'products': ['banana']}
# }
Caveats
Only groups consecutive elements
This is the most important caveat. groupby does NOT group all elements with the same key together, only consecutive ones. This differs from SQL’s GROUP BY, which groups rows irrespective of their order.
from itertools import groupby
data = [1, 2, 1, 2, 1]
# without sorting
for key, group in groupby(data):
    print(f"{key}: {list(group)}")
# [Out]:
# 1: [1]
# 2: [2]
# 1: [1]
# 2: [2]
# 1: [1]
# with sorting - now it groups properly
for key, group in groupby(sorted(data)):
    print(f"{key}: {list(group)}")
# [Out]:
# 1: [1, 1, 1]
# 2: [2, 2]
If you need to group all elements with the same key regardless of position, use collections.defaultdict or collections.Counter instead.
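A minimal sketch of both alternatives, reusing the same data list as above. Unlike groupby, these build every group in memory, so they trade laziness for order-independence.

```python
from collections import Counter, defaultdict

data = [1, 2, 1, 2, 1]

# collect every element under its key, regardless of position
groups = defaultdict(list)
for item in data:
    groups[item].append(item)
print(dict(groups))
# [Out]: {1: [1, 1, 1], 2: [2, 2]}

# if only the group sizes matter, Counter is enough
counts = Counter(data)
print(counts)
# [Out]: Counter({1: 3, 2: 2})
```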
Not consuming the group before moving on
The group iterator is tied to the main groupby iterator. Once you move to the next group, the previous group’s iterator is exhausted:
from itertools import groupby
data = [1, 1, 2, 2, 3]
result = []
for key, group in groupby(data):
    result.append((key, group))
# try accessing the groups again
for key, group in result:
    print(f"{key}: {list(group)}")
# [Out]:
# 1: []
# 2: []
# 3: []
Solution: Convert groups to lists immediately. In the snippet above, change the append call to:
result.append((key, list(group)))
Advancing the groupby iterator invalidates previous groups
Even if you save a group iterator, advancing the main groupby iterator will invalidate it:
from itertools import groupby
data = [1, 1, 2, 2]
it = groupby(data)
key1, group1 = next(it)
key2, group2 = next(it) # this invalidates group1
print(f"{key1}: {list(group1)}") # empty!
# [Out]: 1: []
print(f"{key2}: {list(group2)}")
# [Out]: 2: [2, 2]
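The same applies to a group that was only partly consumed. In the sketch below (the data and variable names are illustrative), one element is taken from the first group before advancing; the remaining elements of that group are skipped and can no longer be read through the saved iterator.

```python
from itertools import groupby

data = [1, 1, 1, 2, 2]
it = groupby(data)

key1, group1 = next(it)
first = next(group1)      # consume only one element of the first group

key2, group2 = next(it)   # skips the rest of group 1 to reach key 2
leftover = list(group1)   # nothing left to read from the old group
second = list(group2)

print(first, leftover)
# [Out]: 1 []
print(key2, second)
# [Out]: 2 [2, 2]
```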
Memory considerations with large groups
While groupby itself is memory-efficient, converting large groups to lists loads every element of a group into memory at once. Where possible, process each group inside the loop as you iterate over it, treating it like any other generator, instead of converting it to a list.
The beauty of groupby is that it’s lazy - use it that way when dealing with large data!
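As a rough sketch of what lazy usage can look like, the snippet below feeds groupby from a generator and reduces each group with sum(), so no intermediate list is ever built. The read_events generator is a hypothetical stand-in for a large stream that is already sorted by user.

```python
from itertools import groupby

def read_events():
    """Hypothetical stand-in for a large stream already sorted by user."""
    for user, amount in [("ann", 5), ("ann", 7), ("bob", 1), ("cy", 2), ("cy", 8)]:
        yield user, amount

totals = {}
for user, events in groupby(read_events(), key=lambda event: event[0]):
    # sum() pulls items from the group one at a time; no list is built
    totals[user] = sum(amount for _, amount in events)

print(totals)
# [Out]: {'ann': 12, 'bob': 1, 'cy': 10}
```

Only the running totals are kept, so memory use depends on the number of distinct users rather than the number of events.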