Import massive JSON arrays in Rails without melting your server
June 30, 2025
I recently found myself in the situation of having to process a multi-gigabyte JSON file in a Rails application. The file structure was an array of homogeneous objects, like the following example:
```json
[
  {
    "sku": "ABC123",
    "name": "Product Name",
    "description": "Product Description",
    "price": 19.99
  },
  {
    "sku": "DEF456",
    "name": "Another Product",
    "description": "Another Product Description",
    "price": 29.99
  }
  // ... 2GB of such data
]
```

The naive solution
```ruby
products = JSON.parse($stdin.read)
products.each do |product|
  Product.upsert(product, unique_by: :sku, on_duplicate: :update)
end
```

There are two main problems with this approach:
- Building a 2GB String and then parsing it is not acceptable: it consumes a ton of memory.
- Processing the rows one by one is way too slow.
Creating all products at once
The second problem is easy to solve:
```ruby
products = JSON.parse($stdin.read)
Product.upsert_all(products, unique_by: :sku, on_duplicate: :update)
```

But this creates a new problem: all existing rows will be locked while the statement is being executed.
Batch inserts
The solution to fix this problem is to insert only a subset of the data at a time. For example in batches of 1000 products:
```ruby
products = JSON.parse($stdin.read)
products.each_slice(1_000) do |batch|
  Product.upsert_all(batch, unique_by: :sku, on_duplicate: :update)
end
```

This solution should be faster than the naive one, performing 1,000 times fewer database round trips while locking a reasonable number of rows on each iteration.

But the memory problem remains: the entire dataset is still loaded into Ruby memory.
Parsing large JSON arrays in Ruby
There is no built-in tool in Ruby to efficiently process massive JSON arrays.
I was able to find two solutions:
- Using yajl-ruby (a binding to yajl)
- Using json-stream, an evented parser (à la SAX)
I had a preference for using json-stream, but it required writing a lot of Ruby code.
Pre-processing with jq
If I could convert the input to a JSON Sequence (RFC 7464) where every object is on a new line, then parsing the input would be much simpler. Instead of parsing a multi-gigabyte JSON array all at once, we could read the file line by line (which is memory-efficient), parse each individual JSON object (which is fast), and process those objects in batches.
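As a minimal sketch of this idea, the following uses only the json standard library, with an in-memory StringIO standing in for the real multi-gigabyte input (the product data here is made up for illustration):

```ruby
require 'json'
require 'stringio'

# StringIO stands in for the real input; in the actual pipeline this
# would be $stdin. Each JSON object sits on its own line.
input = StringIO.new(<<~SEQ)
  {"sku":"ABC123","price":19.99}
  {"sku":"DEF456","price":29.99}
  {"sku":"GHI789","price":39.99}
SEQ

# each_line reads one line at a time, so only the current batch of
# lines is held in memory, never the whole dataset.
skus_by_batch = []
input.each_line.each_slice(2) do |lines|
  batch = lines.map { |line| JSON.parse(line, symbolize_names: true) }
  # The real task would call Product.upsert_all(batch, ...) here.
  skus_by_batch << batch.map { _1[:sku] }
end
```

After the loop, `skus_by_batch` is `[["ABC123", "DEF456"], ["GHI789"]]`: two batches, with no point at which all three objects were parsed at once.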
It turns out that this is exactly what the jq -c '.[]' command does. It would turn the example input above into the following JSON Sequence:

```json
{ "sku": "ABC123", "name": "Product Name", "description": "Product Description", "price": 19.99 }
{ "sku": "DEF456", "name": "Another Product", "description": "Another Product Description", "price": 29.99 }
```

But the jq command above has the same problem as our Ruby programs so far: it loads the entire file into memory before processing it.
Enter jq streaming
jq has a built-in streaming mode that allows us to process the JSON file without loading it into memory.
The syntax is a bit different from the regular mode: it uses the --stream option to read the input as a stream of tokens.
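In streaming mode, jq never builds the whole parsed document. Instead it emits [path, value] events for each leaf, plus a bare [path] event when a container closes. For the first product of our example, the event stream looks roughly like this:

```json
[[0,"sku"],"ABC123"]
[[0,"name"],"Product Name"]
[[0,"description"],"Product Description"]
[[0,"price"],19.99]
[[0,"price"]]
```

The 1|truncate_stream(...) part drops the leading array index from each path, and fromstream reassembles the remaining events back into one complete object per top-level array element.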
The following command will transform the JSON input stream into a JSON Sequence stream.
```shell
jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
```

On the Rails side, a rake task reads the resulting JSON Sequence from stdin, line by line, in batches of 1,000:

```ruby
task process_products: :environment do
  $stdin.each_line.each_slice(1_000) do |product_lines|
    products_data = product_lines.map { JSON.parse(_1, symbolize_names: true) }
    Product.upsert_all(products_data, unique_by: :sku, on_duplicate: :update)
  end
end
```

Putting it all together:

```shell
cat products_2gb.json | \
  jq -cn --stream 'fromstream(1|truncate_stream(inputs))' | \
  bin/rake process_products
```

Finally, we have reached a solution that is both memory-efficient and fast.