I'm working on an application aimed at power users for whom advanced search tools on real-time data are a hard requirement. My team built a solid SQL-based search engine that works pretty well, but we're increasingly asked for new ways to filter results and for more and more data to be taken into account.

This is the story of a software architect standing at a fork in the road:

  • Should I keep investing in PostgreSQL, which has the benefit of being the source of truth and always up-to-date,
  • Or should I switch to Elasticsearch, which offers a powerful search and filtering engine out-of-the-box, but comes with the complexity of syncing data?

Context

Let me give a bit of context around the app and its search needs:

We're talking ~20 related tables, with cardinalities ranging from a few dozen records to millions.

At the moment, I have a home-made search engine that accepts both fuzzy full-text queries and structured filters. Filters can be grouped with AND / OR logic, and support predicates such as any/all/none of, equals, <, >, etc. Everything gets validated and ends up compiled into a big, tailored SQL query with CTEs, joins, aggregates, etc. The result then gets decorated, ordered, paginated, and sent back to the frontend.
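
To give an idea, here is a rough, hypothetical sketch of how one of these filter-specific CTEs could be built (SkillFilter, the required_skills table, and the column names are illustrative, not the actual code):

# Hypothetical filter object; each active filter contributes its own CTE.
class SkillFilter
  def initialize(skill_ids)
    @skill_ids = skill_ids
  end

  # Returns a named CTE selecting the ids of the projects matching this filter.
  def to_cte
    condition = ActiveRecord::Base.sanitize_sql_array(["skill_id IN (?)", @skill_ids])
    "matching_skills AS (SELECT project_id FROM required_skills WHERE #{condition})"
  end
end

# The final query only injects the CTEs of the filters actually in use:
#   WITH #{active_filters.map(&:to_cte).join(", ")} SELECT ...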

It works fine, and to be honest, I find the Ruby code orchestrating this rather elegant. It's packaged in query objects, each filter has its own CTE and is injected only when needed, ids can be interpolated early for enhanced efficiency, etc. It's a fine piece of software, yet it's starting to feel like I'm pushing SQL and my centralized PostgreSQL database beyond their comfort zone. I'm worried about its ability to scale given the new requirements.

Aside from the scaling uncertainty, I have to say that finding (and keeping) developers who are comfortable working on SQL queries hundreds of lines long is tough (no ORM bottle-feeding here). Finding developers able to stare at EXPLAIN ANALYZE outputs to squeeze out performance is even tougher. I know it's not a technical argument, but ensuring the company's ability to maintain and evolve its system in the coming months/years is a valid one nonetheless, and it's one of the things I have to take into account given my position.

The two candidates

At this point, I'm standing at a fork in the road, with no crystal ball to foresee how each path will perform or age.

Option 1: Stay with PostgreSQL

This means continuing to invest in the current engine: improving query generation and index strategies, adding foreign keys to cut down on intermediate joins, and extending tsvector-based full-text search (currently, only a couple of tables use it). I'd probably need to implement some caching here and there too.
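
For instance, extending the tsvector coverage to another table could look roughly like this (a sketch assuming a projects table with a name column and PostgreSQL 12+ for generated columns):

class AddSearchVectorToProjects < ActiveRecord::Migration[7.2]
  def change
    # Generated column kept up to date by PostgreSQL itself.
    add_column :projects, :search_vector, :virtual,
               type: :tsvector,
               as: "to_tsvector('simple', coalesce(name, ''))",
               stored: true

    add_index :projects, :search_vector, using: :gin
  end
end

# Then, in the query objects:
#   Project.where("search_vector @@ plainto_tsquery('simple', ?)", query)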

PostgreSQL is my source of truth, it's real-time by design, and I already have a working implementation of the search engine. It just looks like the safest choice, but scaling deep multi-table search with joins, conditions, and fuzzy text filtering is painful. And I'm not sure how far it'll take me.

Option 2: Switch to Elasticsearch

I've used a similar tool before (Algolia), and I know it's not suited for all use cases.

In an e-commerce app, for instance, it's perfect: you index your product catalog, update it every now and then, and no one's going to rage-quit your app just because they changed the price of a shoe and it didn't show up in the next search within 300ms.

This app is definitely not e-commerce.

It's used by power users who bulk-edit hundreds of records, drag-and-drop things across Kanban views, receive inputs from public-facing forms, and expect their searchable back-office dashboard to reflect all of those changes instantly.

This is not the kind of workload you'd normally throw at Elasticsearch, since its indexing is meant to be asynchronous, so I evaluate this option with a bit of skepticism. But ironically, it might be the one that fits best.

Elasticsearch already does most of what I need (a rough query sketch follows the list):

  • Full-text search with fuzziness, boosts, tokenization
  • Filtering with booleans, ranges, term queries, etc.
  • Nested documents
  • Pagination, ranking and sorting
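
For a sense of what a compiled search could look like on the Elasticsearch side, here's a rough sketch (field names and values are illustrative, and it assumes required_skills is mapped as a nested field):

ES_CLIENT.search(
  index: 'projects',
  body: {
    query: {
      bool: {
        must: [
          # Fuzzy full-text part of the query, boosting the project name.
          { multi_match: { query: 'urgnt databse migration', fields: ['name^3', 'description'], fuzziness: 'AUTO' } }
        ],
        filter: [
          # Structured filters, equivalent to the current SQL predicates.
          { term: { status: 'active' } },
          { range: { created_at: { gte: 'now-30d/d' } } },
          { nested: { path: 'required_skills', query: { terms: { 'required_skills.skill_id': [4, 8] } } } }
        ]
      }
    },
    from: 0,
    size: 25,
    sort: [{ updated_at: 'desc' }]
  }
)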

But:

  • It's not the source of truth, which means I'd have to deal with data synchronization. If done asynchronously as it's supposed to be, and if the sync lags even for a few seconds, I'll be dealing with UX issues I don't even want to start thinking about. Users won't know it's "eventual consistency." They'll just say "it's broken."
  • Users must be able to perform bulk actions on filtered results. This is easy when PostgreSQL handles both filtering and updating, but selecting records via Elasticsearch (with its imposed pagination) and then applying updates on PostgreSQL would be extra gymnastics to integrate into my app.

Where I'm leaning

Against my initial instincts, I'm leaning toward Elasticsearch despite the added complexity (a dependency on a new datastore, syncing data synchronously when it's rather designed for asynchronicity).

Why? Because I believe it will give users a better experience, with less custom code and fewer bottlenecks long term. It's designed to solve this exact problem: searching across large, structured, and nested data sets, fast.

With the right architecture around it, I think it can work.

Planned implementation

Making Elasticsearch real-time

Elasticsearch supports a refresh=wait_for option, which lets a write request wait for the document to become searchable (that is, until the next refresh cycle) before responding.

index.refresh_interval
How often to perform a refresh operation, which makes recent changes to the index visible to search. Defaults to 1s. Can be set to -1 to disable refresh. If this setting is not explicitly set, shards that haven’t seen search traffic for at least index.search.idle.after seconds will not receive background refreshes until they receive a search request. Searches that hit an idle shard where a refresh is pending will trigger a refresh as part of the search operation for that shard only. This behavior aims to automatically optimize bulk indexing in the default case when no searches are performed. In order to opt out of this behavior an explicit value of 1s should set as the refresh interval.

Source: https://www.elastic.co/docs/reference/elasticsearch/index-settings/index-modules#index-refresh-interval-setting

TL;DR:

  • I must calibrate index.refresh_interval to my needs. 500ms? 1s? (see the settings call below)
  • Indexing documents on ES synchronously may cost (up to) an extra ~1 second (or whatever index.refresh_interval is set to).
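
Adjusting the interval is a one-liner with the Ruby client. For example (500ms is just a starting point to calibrate, not a recommendation):

ES_CLIENT.indices.put_settings(
  index: 'projects',
  body: { index: { refresh_interval: '500ms' } }
)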

If I wrap all the database writes and the ES indexing call in a transaction-like process, I can make sure the user only gets a response once their change is fully live in both PG and ES. This would solve most of the synchronization issues, but I still have to plug every model that affects the search document (remember, there are ~20) into "something" that triggers an ES reindex. I also need to make sure I don't reindex the same document 15 times in the same request (yes, accepts_nested_attributes_for, I'm talking about you).

My goal is to come up with a simple architecture that ensures:

  • Only one reindex per record per transaction, even when multiple models change
  • Only reindex if the database commit succeeds
  • Decoupled from controllers and services, so that no matter where the update comes from (even from the console), the ES cache gets updated.
  • Simple to reason about and extend

Let's assume the main model I want to index is Project. Here's how it'd work.

1. A registry to track projects to reindex

Every time a model that affects search is saved, I register the affected project_ids into a thread-local registry:

NOTE: My server is run by Puma in a multi-threaded context.

# lib/project_reindex_registry.rb
class ProjectReindexRegistry
  def self.add(project_ids)
    registry.merge(Array(project_ids)) # accept a single id or a collection
  end

  def self.flush
    project_ids = registry.to_a # the Set already deduplicates ids
    clear
    ProjectIndexer.sync(project_ids) unless project_ids.empty?
  end

  def self.clear
    Thread.current[:project_reindex_registry] = Set.new
  end

  def self.registry
    Thread.current[:project_reindex_registry] ||= Set.new
  end
end

This prevents redundant indexing during a single request or background job. It is thread-safe, but it must be flushed/cleared before handling a new request.
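
In practice it boils down to this:

ProjectReindexRegistry.add([42, 43])
ProjectReindexRegistry.add(43)   # already registered, the Set deduplicates it
ProjectReindexRegistry.flush     # reindexes projects 42 and 43 once, then clears the registry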

2. Models are responsible for marking the cache as dirty

Instead of stuffing controller logic or callbacks everywhere, I created a small concern:

# app/models/concerns/triggers_project_reindex.rb
module TriggersProjectReindex
  extend ActiveSupport::Concern

  included do
    after_commit :queue_project_reindex, if: :project_related_attributes_changed?
  end

  def queue_project_reindex
    project_id = resolve_project_id
    ProjectReindexRegistry.add(project_id) if project_id
  end

  # You can override this method per model
  def resolve_project_id
    try(:project_id) || (respond_to?(:project) ? project&.id : nil)
  end

  def project_related_attributes_changed?
    # NOTE: in an after_commit callback, saved_change_to_attribute? is the right dirty check.
    %i[...].any? { saved_change_to_attribute?(_1) } # TODO: list the attributes that affect the search document
  end
end

Now, any model that affects the search index can just include TriggersProjectReindex, and that's it.

Example:

class ProjectMembership < ApplicationRecord
  include TriggersProjectReindex
  belongs_to :project
  belongs_to :user
end

class RequiredSkill < ApplicationRecord
  include TriggersProjectReindex
  belongs_to :project
end

3. Flushing the registry after all transactions successfully commit

I'm glad I worked on updating my app from Rails 5 to the latest Rails 7 version. It gives me access to a tool that'll come in handy: after_all_transactions_commit.

To make sure reindexing only happens when the DB transaction succeeds, I hooked into Active Record's transaction system.

# config/initializers/project_reindex_hook.rb
ActiveRecord.after_all_transactions_commit do
  ProjectReindexRegistry.flush
end

Now, all project IDs are flushed and reindexed after the transaction is fully committed.

I'm still questioning the separation of concerns here. On the one hand, it's really comfortable to set, application-wide, logic that ensures updated projects get reindexed. On the other hand, how come a config initializer (which is supposed to initialize the Rails app itself or its dependencies) is aware of a data integrity requirement and knows about its implementation details? I know ActiveStorage's mirror strategy is also set up in the application's configuration, but I'm not a huge fan of this approach. It's an acceptable tradeoff, but if I can think of a better way, I'll gladly reconsider this solution.

4. The actual reindexing

Here's how reindexing would be handled behind the scenes:

class ProjectIndexer
  def self.sync(project_ids)
    # Eager-load every association the document builder needs.
    projects = Project.includes(:memberships, :users, :answers).where(id: project_ids)
    projects.find_each do |project|
      doc = ProjectDocumentBuilder.build(project)

      ES_CLIENT.index(
        index: 'projects',
        id: project.id,
        body: doc,
        refresh: 'wait_for' # REMINDER: ensures it's immediately searchable
      )
    end
  end
end

And the document builder is just a plain Ruby class that turns a Project into a JSON blob:

class ProjectDocumentBuilder
  def self.build(project)
    {
      id: project.id,
      name: project.name,
      # ...
    }
  end
end

This is easy to test, extend, and evolve over time.
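
For example, a first spec could be as small as this (assuming RSpec; the attributes are illustrative):

# spec/services/project_document_builder_spec.rb
RSpec.describe ProjectDocumentBuilder do
  it 'exposes the searchable fields of a project' do
    project = Project.new(id: 1, name: 'Search engine revamp')

    doc = described_class.build(project)

    expect(doc).to include(id: 1, name: 'Search engine revamp')
  end
end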

Filtering using Elasticsearch, updating on PostgreSQL

When a user wants to bulk edit records, here is how it's going to work (a job sketch follows the list):

  • Put the user on hold with some "Please wait" modal
  • Send the filtering query to the backend with the update to apply
  • The backend passes the request to an asynchronous job (and returns a 202 ACCEPTED HTTP status code)
  • The async job:
    • Fetches the filtered results and loops on all available pages to collect ids
    • Applies the update on PostgreSQL
    • Notifies the frontend via Websocket that the task is done
  • The frontend gets feedback that the action has been completed successfully (or not)
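
Here's a rough sketch of what that job could look like (BulkEditChannel, the es_query payload, and the search_after pagination are assumptions to refine, not final code):

class BulkEditJob < ApplicationJob
  PAGE_SIZE = 1_000

  def perform(user_id, es_query, attributes)
    project_ids = collect_ids(es_query)

    # update_all skips callbacks, so reindex explicitly once the update is done.
    Project.where(id: project_ids).update_all(attributes)
    ProjectIndexer.sync(project_ids)

    BulkEditChannel.broadcast_to(User.find(user_id), status: 'done', count: project_ids.size)
  end

  private

  # Pages through the filtered results with search_after to avoid
  # Elasticsearch's from/size window limit.
  def collect_ids(es_query)
    ids = []
    search_after = nil

    loop do
      body = { query: es_query, sort: [{ id: 'asc' }], size: PAGE_SIZE, _source: false }
      body[:search_after] = search_after if search_after

      hits = ES_CLIENT.search(index: 'projects', body: body)['hits']['hits']
      break if hits.empty?

      ids.concat(hits.map { |hit| hit['_id'].to_i })
      search_after = hits.last['sort']
    end

    ids
  end
end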

Handling intensive updates

One of the features of the application is public-facing and may ingest high loads of inputs. We're talking hundreds/thousands of submissions per minute. These inputs impact how records are sorted and filtered in the back-office dashboard.

Reindexing the full document in Elasticsearch every time would be overkill since:

  • Most of the document is unchanged.
  • These updates only affect a specific field or a small subset (think incrementing a score, that kind of thing).
  • Full reindexing would create unnecessary load on both PG and the ES cluster.

The solution I'm considering is, for these updates specifically, to perform partial reindexing. I would only update the subpart of the document that changed, trusting the rest of the document is already up-to-date.

I might either do an atomic increment if the cost of interpreting/executing a Painless script turns out to be negligible:

client.update(
  index: 'projects',
  id: '123',
  body: {
    script: {
      source: 'ctx._source.total_views += params.count',
      lang: "painless",
      params: {
        count: 1
      }
    }
  }
)

Or a plain partial-document update if that's less costly:

client.update(
  index: 'projects',
  id: '123',
  body: {
    doc: {
      total_views: project.total_views
    }
  }
)

Notice I am not using the refresh: 'wait_for' option here. The reason is simple: I don't want public submissions to feel like a DDoS attack on the ES cluster.

I'm using Elasticsearch's default async refresh behavior. Combined with some optimistic UI, this will allow me to avoid blocking the request while still offering a snappy user experience.

If that's not enough, I can always take it further. I may batch updates and push them to ES every minute or so, depending on how real-time it needs to feel.
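
If it comes to that, the batched variant could look roughly like this (ViewCounterBuffer is a hypothetical buffer, Redis-backed or otherwise, drained by a recurring job):

class FlushViewCountersJob < ApplicationJob
  def perform
    counters = ViewCounterBuffer.drain # hypothetical: { project_id => pending increment }
    return if counters.empty?

    # One bulk request instead of one update per submission.
    body = counters.map do |project_id, count|
      { update: {
          _index: 'projects',
          _id: project_id,
          data: {
            script: {
              source: 'ctx._source.total_views += params.count',
              lang: 'painless',
              params: { count: count }
            }
          }
        } }
    end

    ES_CLIENT.bulk(body: body)
  end
end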

A safeguard against unforeseen errors

Thousands of things could go wrong, as always. One thing is for sure, though: I must ensure that one request doesn't mess with the next one. Since we're using the current thread to store a registry, we must make sure it's cleaned up before the next request comes in.

The safest way I know to do that is to use a Rack middleware. It's not ideal as the logic gets even more scattered (some happens in the application's initializers, some in the model concerns, and now some in a Rack middleware too), but I can't think of a better way for the moment. It's okay, though, I'm still in the brainstorming phase and I still have time to improve this design. If nothing better bubbles up, there are several techniques I can use to make the general architecture easy to understand and to link the code where needed.

# app/middleware/clear_project_reindex_registry_middleware.rb

class ClearProjectReindexRegistryMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # TODO: Decide whether or not to also clear the registry before calling the next app,
    # or at least log an error in case the registry isn't in a clean state.
    @app.call(env)
  ensure
    ProjectReindexRegistry.clear
  end
end

It needs to be registered with the middleware stack. That's usually done in config/application.rb. Depending on the testing strategy I choose later, I might decide to exclude it from test environments to make test isolation easier.

config.middleware.use ClearProjectReindexRegistryMiddleware

Clean, scalable, predictable

This architecture is fast to implement and easy to reason about:

  • Controllers don't need to care about indexing.
  • Models just say "this save affects the project" and the system takes care of the rest.
  • Indexing only happens when the transaction commits.
  • We get real-time search without weird caching hacks.

Conclusion

There's no perfect answer. PostgreSQL is solid, reliable, and familiar, but it wasn't built for Google-style search. Elasticsearch is fast, flexible, and powerful, but the real-time requirement turns data synchronization into a real challenge that needs a lot of care.

Whichever we pick as a team, the key is understanding the tradeoffs and making sure everyone understands it's a bet. It's the kind of topic we can't have any certainty about until we try.

Thank you for reading!
Younes SERRAJ