Efficiently Processing Large JSON Files in Python Without Loading Everything Into Memory

Introduction

Processing large JSON files can quickly exhaust your system’s memory if you try to load the entire file at once. This is a common challenge in data engineering, ETL, and analytics workflows. Fortunately, Python offers tools to process such files efficiently by streaming the data and only keeping what’s necessary in memory.

This post demonstrates how to use ijson for streaming JSON parsing and memory-profiler to monitor memory usage. We’ll also show how to set up your environment with uv for reproducible installs.

Why Not Just Use json.load()?

The standard json module’s json.load() parses the entire file into a single in-memory Python object, and that object typically occupies several times more memory than the file does on disk. For files approaching or exceeding your available RAM, this leads to crashes or severe slowdowns. Streaming parsers like ijson instead read the file incrementally, yielding one item at a time.
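
For contrast, here is the naive approach, sketched against a hypothetical data.json; it works for small files but builds the whole document in memory before you can filter anything:

import json

# Naive approach: the entire file is parsed into one Python object,
# so memory use grows with file size.
with open("data.json", "r") as f:
    data = json.load(f)

matching = [obj for obj in data if obj.get("vpn") == "ABC"]
print(f"Found {len(matching)} matching objects.")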

Setting Up Your Environment

First, initialize your Python project with uv and add the required dependencies:

uv init
uv add ijson memory-profiler
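
If you want a quick sanity check that both packages are available in the project environment, uv run executes a command inside it:

uv run python -c "import ijson, memory_profiler; print('imports OK')"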

Streaming JSON Processing Example

Suppose you have a large JSON array of objects and want to keep only the items whose "vpn" field matches a target value (e.g., "ABC"), writing them to an output file. Here’s how you can do that with ijson, streaming both the input and the output:

import json
import time

import ijson
from memory_profiler import profile

backend = ijson  # You can also use ijson.get_backend("yajl2_c") for speed

objects_num = 0

@profile
def filter_large_json(input_file, output_file, target_vpn):
    global objects_num
    with open(input_file, "rb") as infile, open(output_file, "w") as outfile:
        outfile.write("[")
        first = True
        # use_float=True makes ijson yield floats instead of Decimal,
        # which json.dump below would otherwise fail to serialize.
        for obj in backend.items(infile, "item", use_float=True):
            if obj.get("vpn") == target_vpn:
                if not first:
                    outfile.write(",")
                json.dump(obj, outfile)
                first = False
                objects_num += 1
        outfile.write("]")
    print(f"Filtered {objects_num} objects with vpn '{target_vpn}'.")

if __name__ == "__main__":
    start_time = time.time()
    filter_large_json("test.json", "output.json", "ABC")
    print("--- %s seconds ---" % (time.time() - start_time))

Profiling Memory Usage

Because filter_large_json is decorated with @profile, running the script directly (e.g., python your_script.py) prints a line-by-line memory report for that function. To record memory usage over time and plot it as a graph, use the mprof commands that ship with memory-profiler:

mprof run your_script.py
mprof plot
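
If you prefer a programmatic measurement over the command-line tools, memory_profiler also provides a memory_usage() helper. A minimal sketch, assuming it runs in the same module as the filter_large_json function defined above:

from memory_profiler import memory_usage

# Sample memory (in MiB) roughly every 0.1 s while the function runs,
# then report the peak value observed.
samples = memory_usage(
    (filter_large_json, ("test.json", "output.json", "ABC")),
    interval=0.1,
)
print(f"Peak memory: {max(samples):.1f} MiB")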

Conclusion

By streaming both the parse and the output, memory usage stays roughly constant regardless of input size, so you can handle files of virtually any size, limited by disk space rather than RAM. This approach is essential for scalable data pipelines and analytics.

References

https://pythonspeed.com/articles/json-memory-streaming/
https://www.dataquest.io/blog/python-json-tutorial/
https://pytutorial.com/python-json-streaming-handle-large-datasets-efficiently/
https://github.com/kashifrazzaqui/json-streamer