Import massive JSON arrays in Rails without melting your server
June 30, 2025
I recently found myself in the situation of having to process a multi-gigabyte JSON file in a Rails application. The file structure was an array of homogeneous objects, like the following example:
```json
[
  {
    "sku": "ABC123",
    "name": "Product Name",
    "description": "Product Description",
    "price": 19.99
  },
  {
    "sku": "DEF456",
    "name": "Another Product",
    "description": "Another Product Description",
    "price": 29.99
  }
  // ... 2GB of such data
]
```

The naive solution
```ruby
products = JSON.parse($stdin.read)
products.each do |product|
  Product.upsert(product, unique_by: :sku, on_duplicate: :update)
end
```

There are two main problems with this approach:
- Building a 2GB String and then parsing it is not acceptable: it consumes a ton of memory.
- Processing the rows one by one is way too slow.
Creating all products at once
The second problem is easy to solve:
```ruby
products = JSON.parse($stdin.read)
Product.upsert_all(products, unique_by: :sku, on_duplicate: :update)
```

But this creates a new problem: all existing rows will be locked while the statement is being executed.
Batch inserts
The solution to fix this problem is to insert only a subset of the data at a time. For example in batches of 1000 products:
```ruby
products = JSON.parse($stdin.read)
products.each_slice(1_000) do |batch|
  Product.upsert_all(batch, unique_by: :sku, on_duplicate: :update)
end
```

This solution should be faster than the naive one, performing 1,000 times fewer database round trips while locking a reasonable number of rows on each iteration.

But the memory problem remains: the entire dataset is still loaded into Ruby memory.
Parsing large JSON arrays in Ruby
There is no built-in tool in Ruby to efficiently process massive JSON arrays.
I was able to find two solutions:
- Using yajl-ruby (a binding to yajl)
- Using json-stream, an evented parser (à la SAX)
I had a preference for using json-stream, but it required writing a lot of Ruby code.
Pre-processing with jq
If I could convert the input to a JSON Sequence (RFC 7464) where every object is on a new line, then parsing the input would be much simpler. Instead of parsing a multi-gigabyte JSON array all at once, we could read the file line by line (which is memory-efficient), parse each individual JSON object (which is fast), and process those objects in batches.
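As a minimal sketch of this idea, the following uses only the json standard library, with an in-memory StringIO standing in for the real multi-gigabyte input (the product data here is made up for illustration):

```ruby
require 'json'
require 'stringio'

# StringIO stands in for the real input; in the actual pipeline this
# would be $stdin. Each JSON object sits on its own line.
input = StringIO.new(<<~SEQ)
  {"sku":"ABC123","price":19.99}
  {"sku":"DEF456","price":29.99}
  {"sku":"GHI789","price":39.99}
SEQ

# each_line reads one line at a time, so only the current batch of
# lines is held in memory, never the whole dataset.
skus_by_batch = []
input.each_line.each_slice(2) do |lines|
  batch = lines.map { |line| JSON.parse(line, symbolize_names: true) }
  # The real task would call Product.upsert_all(batch, ...) here.
  skus_by_batch << batch.map { _1[:sku] }
end
```

After the loop, `skus_by_batch` is `[["ABC123", "DEF456"], ["GHI789"]]`: two batches, with no point at which all three objects were parsed at once.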
It turns out that this is exactly what the jq -c '.[]' command does. It would turn the example input above into the following JSON Sequence:

```json
{ "sku": "ABC123", "name": "Product Name", "description": "Product Description", "price": 19.99 }
{ "sku": "DEF456", "name": "Another Product", "description": "Another Product Description", "price": 29.99 }
```

But the jq command above has the same problem as our Ruby programs so far: it loads the entire file into memory before processing it.
Enter jq streaming
jq has a built-in streaming mode that allows us to process the JSON file without loading it into memory.
The syntax is a bit different from the regular mode: it uses the --stream option to read the input as a stream of tokens.
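In streaming mode, jq never builds the whole parsed document. Instead it emits [path, value] events for each leaf, plus a bare [path] event when a container closes. For the first product of our example, the event stream looks roughly like this:

```json
[[0,"sku"],"ABC123"]
[[0,"name"],"Product Name"]
[[0,"description"],"Product Description"]
[[0,"price"],19.99]
[[0,"price"]]
```

The 1|truncate_stream(...) part drops the leading array index from each path, and fromstream reassembles the remaining events back into one complete object per top-level array element.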
The following command will transform the JSON input stream into a JSON Sequence stream.
```shell
jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
```

On the Rails side, a rake task reads the resulting JSON Sequence from stdin, line by line, in batches of 1,000:

```ruby
task process_products: :environment do
  $stdin.each_line.each_slice(1_000) do |product_lines|
    products_data = product_lines.map { JSON.parse(_1, symbolize_names: true) }
    Product.upsert_all(products_data, unique_by: :sku, on_duplicate: :update)
  end
end
```

Putting it all together:

```shell
cat products_2gb.json | \
  jq -cn --stream 'fromstream(1|truncate_stream(inputs))' | \
  bin/rake process_products
```

Finally, we have reached a solution that is both memory-efficient and fast.