Searching on steroids

There are plenty of indexing/search servers out there, like Solr, Sphinx, Elasticsearch, Google Search Appliance and many more.

But out of all the above, Elasticsearch is getting the most attention because of its popularity. This post is only about Elasticsearch and its integration with Rails.

So, how would someone describe what Elasticsearch is?

From God’s perspective, it’s an open source, distributed, RESTful search engine, i.e. it has a very advanced distributed model, speaks JSON natively, and exposes many advanced search features, all seamlessly expressed through a JSON DSL and REST APIs.

♦ Inception

The standard way of using ES in any application is as a secondary data store, i.e. the data is stored in some kind of SQL/NoSQL database and the required documents are continuously upserted from it into ES. Pretty neat.

Some of us might wonder: why not use the database itself as the search engine instead of ES, since it involves less work and has all the features needed? The answer is no, you shouldn’t, because when the data starts hitting the roof, the database gives up and won’t show results in real time.

Then some of us might wonder: why not use Elasticsearch as the primary store? The answer is again no, you shouldn’t (for now at least). The reasons: it doesn’t support transactions or associations and, most of all, there are no ORMs/ODMs yet that support ActiveRecord-like features (callbacks, validations, eager loading, etc.).

There are plenty of gems out there to get started with ES, but the ones I prefer are elasticsearch-rails and elasticsearch-model, as they allow more customization for querying.

♦ The Architect

Consider an Article model which has many comments and an author.

class Article
  include Mongoid::Document

  field :title
  field :body
  field :status
  field :publishing_date, type: Date
  has_many :comments
  belongs_to :author
end

To map Article in ES, you have to specify its JSON structure:

class Article
  include Mongoid::Document
  include Elasticsearch::Model    # note this inclusion 

  # ....

  # JSON structure to be indexed
  INDEXED_FIELDS = {
    only: [:title, :publishing_date, :status],

    include: {
      author: {
        only: [:_id, :name]
      },

      comments: {
        only: [:body],
        include: {
          author: {
            only: [:_id, :name]
          }
        }
      }
    }
  }

  # It will get called while indexing an article
  def as_indexed_json(options = {})
    as_json(INDEXED_FIELDS)
  end
end
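To get a feel for what actually lands in ES, here is a minimal sketch of the document shape that INDEXED_FIELDS describes. It runs on plain hashes, so it needs neither Rails nor Mongoid. The pick helper is a hypothetical stand-in for as_json(only:, include:), and the sample article is made up; the real as_json returns string keys and may serialize ids differently.

```ruby
require 'json'

INDEXED_FIELDS = {
  only: [:title, :publishing_date, :status],
  include: {
    author: { only: [:_id, :name] },
    comments: { only: [:body], include: { author: { only: [:_id, :name] } } }
  }
}.freeze

# Hypothetical stand-in for as_json(only:, include:) that works on plain
# hashes, so the document shape is visible without Rails or Mongoid.
def pick(record, spec)
  doc = record.slice(*spec.fetch(:only, []))
  spec.fetch(:include, {}).each do |name, nested|
    value = record[name]
    next if value.nil?
    doc[name] = value.is_a?(Array) ? value.map { |v| pick(v, nested) } : pick(value, nested)
  end
  doc
end

# Made-up sample data standing in for an Article with its associations.
SAMPLE_ARTICLE = {
  title: 'Searching on steroids', body: 'Long text...', status: 'published',
  publishing_date: '2016-05-01',
  author: { _id: 'a1', name: 'Alice', email: 'alice@example.com' },
  comments: [
    { body: 'Nice post', created_at: '2016-05-02',
      author: { _id: 'a2', name: 'Bob', email: 'bob@example.com' } }
  ]
}.freeze

# Prints only the whitelisted fields; body, email and created_at are dropped.
puts JSON.pretty_generate(pick(SAMPLE_ARTICLE, INDEXED_FIELDS))
```

Anything not listed under only or include never reaches the index, which keeps the documents small.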

ES needs to know what datatype a field has and how it should be handled while searching. You can specify both using mappings:

class Article
  include Mongoid::Document
  include Elasticsearch::Model

  # ...

  mappings do
    indexes :status, index: :not_analyzed
    indexes :publishing_date, type: :date

    indexes :author do
      indexes :_id, index: :not_analyzed
    end

    indexes :comments do
      indexes :author do
        indexes :_id, index: :not_analyzed
      end
    end
  end

  # ...
end

PS: You don’t have to specify mappings for all fields, only for those that need customization. If you don’t specify a mapping for a field, ES infers its datatype from the first value it sees, and a string field is analyzed by default (ES analyzes the text and breaks it into tokens).
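Why bother with not_analyzed at all? Here is a rough sketch in plain Ruby. The analyze method is only an approximation of ES’s standard analyzer (lowercase, split on non-alphanumeric characters), and the in-review status is a hypothetical value picked because it contains a hyphen.

```ruby
# Very rough approximation of the standard analyzer: lowercase the text
# and split it on anything that is not a letter or a digit.
def analyze(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# An analyzed field is indexed as tokens, a not_analyzed one as a single term.
def indexed_terms(value, analyzed:)
  analyzed ? analyze(value) : [value]
end

# A term query does no analysis: it looks up the exact term in the index.
def term_match?(terms, query)
  terms.include?(query)
end

p analyze('Searching on Steroids')
p term_match?(indexed_terms('in-review', analyzed: true), 'in-review')
p term_match?(indexed_terms('in-review', analyzed: false), 'in-review')
```

The analyzed version is stored as the two tokens in and review, so an exact lookup for in-review finds nothing. That is the reason status and the _id fields are mapped as not_analyzed above.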

Sometimes it’s worth storing additional fields in ES which you won’t need while searching, but which are required while building a response. For example, when fetching a list of articles, the response should also contain the author’s avatar URL. If it’s also stored in ES, there is no need to make a database call to get it, which would increase the response time.

♦ The Transporter

Elasticsearch::Model also adds support for importing data from the database into ES. You may need to import all the data at the very beginning, or again after some amendments. If you want the index to update automatically whenever documents change, you need to include the Elasticsearch::Model::Callbacks module.

There are a few ways to do the importing.

• Standard way

The standard way of importing all the data in a collection:

Article.import

You can also add scopes while importing, so that only specific documents will get imported.

Article.published.import

import also accepts other options, such as force: true to recreate the index, refresh: true to refresh the index, and batch_size: 100 to fetch 100 documents at a time from the collection.

• Using rake

elasticsearch-rails provides a rake task for importing data. It accepts the same options as import, passed through environment variables.

rake elasticsearch:import:model CLASS='Article' SCOPE='published'

You can use this task if you want to set up a cron job, or if you want to import all collections in one go:

rake elasticsearch:import:all DIR=app/models

• Custom way

If you want more customization for indexing, you can implement it yourself, for example to make asynchronous updates on document changes or to perform multiple operations in a single request.
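As an illustration of the "multiple operations in a single request" case, here is a minimal sketch that builds a Bulk API body from a list of pending changes. The bulk_body helper and the [action, id, document] triples are hypothetical, and the articles index and article type names are assumed defaults. The array-of-hashes form with a data key is what the Ruby client’s bulk method accepts, as far as I know.

```ruby
require 'json'

# Builds the body for one bulk request. Each change is a hypothetical
# [action, id, document] triple collected by the application.
def bulk_body(changes, index: 'articles', type: 'article')
  changes.map do |action, id, doc|
    meta = { _index: index, _type: type, _id: id }
    meta[:data] = doc if action == :index # delete actions carry no document
    { action => meta }
  end
end

# Made-up changes: two upserts and one removal.
PENDING_CHANGES = [
  [:index,  '1', { title: 'Searching on steroids', status: 'published' }],
  [:index,  '2', { title: 'Work in progress', status: 'draft' }],
  [:delete, '3', nil]
].freeze

body = bulk_body(PENDING_CHANGES)
puts JSON.pretty_generate(body)

# With elasticsearch-model loaded, the body could then go out in one request:
#   Article.__elasticsearch__.client.bulk(body: body)
```

Collecting changes and flushing them from a background job gives you the asynchronous variant with the same body.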

If indexing is taking too long, here are some things you can consider:

  • Adding proper indexes on the database side.
  • Fetching fewer documents per database request, so that they don’t take up much of your RAM.
  • If possible, avoiding eager loading of associations.
  • Disabling logs or reducing log verbosity, so that unwanted lines like database queries don’t take up time.

♦ In pursuit of happyness 

Everything is set up, all the data is imported, upsert actions are in place. Now all you want to do is make search happen and smirk at the performance 😏. I won’t strain you with ES tutorials, as there are plenty; rather, I’ll share some common scenarios I came across and their database query analogies.

• Article.published

Article.search(query: { constant_score: { filter: { term: { status: 'published' } } } }).results

I’ve used a constant_score query, as it boosts performance by not calculating the document score. One more thing to note is the use of a term query instead of a match query, as the query is about an exact match and not a partial match.

The above query can also be written using a filter:

Article.search(filter: { term: { status: 'published' } }).results

filter queries are used for simple inclusion or exclusion checks. They are non-scoring queries, which makes them faster than scoring queries. In addition, ES caches frequently used non-scoring queries in memory for faster access. So, be careful when choosing between query and filter, depending on the use case.
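Since the constant_score wrapper shows up in most of the queries here, a tiny helper keeps things readable. This is just a sketch: constant_score is a hypothetical helper name, not something the gems provide.

```ruby
require 'json'

# Wraps any filter clause in a constant_score query, so ES skips
# relevance scoring for it.
def constant_score(filter)
  { query: { constant_score: { filter: filter } } }
end

PUBLISHED_QUERY = constant_score({ term: { status: 'published' } })

# This is the JSON that Article.search(PUBLISHED_QUERY) would send to ES.
puts JSON.pretty_generate(PUBLISHED_QUERY)
```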

• Article.in(status: ['published', 'draft'])

Article.search(query: { constant_score: { filter: { terms: { status: ['published', 'draft'] } } } }).results

The only difference from the previous example is the use of terms instead of term, as the query is about ORing the values.

• Article.where(status: 'published', :publishing_date.gte => 1.month.ago)

Article.search(query: { constant_score: { filter: { bool: { must: [ { term: { status: 'published' } }, { range: { publishing_date: { gte: 'now-1M' } } } ] } } } }).results
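Writing these nested hashes by hand gets old quickly. Here is a minimal sketch of a where-like builder that turns simple conditions into filter clauses. Both where_filter and clause_for are hypothetical helper names, and the value-to-clause mapping (scalar to term, Array to terms, Hash to range) is my own convention. It filters on the model’s publishing_date field.

```ruby
require 'json'

# Picks a filter clause based on the value: scalar -> term,
# Array -> terms (OR), Hash -> range.
def clause_for(field, value)
  case value
  when Hash  then { range: { field => value } }
  when Array then { terms: { field => value } }
  else            { term:  { field => value } }
  end
end

# ANDs all conditions together inside a non-scoring constant_score query.
def where_filter(conditions)
  clauses = conditions.map { |field, value| clause_for(field, value) }
  filter  = clauses.size == 1 ? clauses.first : { bool: { must: clauses } }
  { query: { constant_score: { filter: filter } } }
end

RECENT_PUBLISHED = where_filter({ status: 'published', publishing_date: { gte: 'now-1M' } })
puts JSON.pretty_generate(RECENT_PUBLISHED)
```

The resulting hash can be passed straight to Article.search, and a single condition collapses to the plain term or terms form used in the earlier examples.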

• Article.any_of({ title: /mcdonald/ }, { body: /mcdonald/ }) + Author.any_of({ first_name: /mcdonald/ }, { last_name: /mcdonald/ })

Article.search(query: { multi_match: { query: 'mcdonald', fields: [:title, :body, 'author.*_name'] } }).results

Most of the time, multi_match is used for autocomplete features. It accepts optional parameters to refine your search criteria. For example:

  • type: :best_fields, which searches for Captain America in a single field rather than captain in one field and america in another.
  • type: :phrase, for matching exact word order, i.e. Iron Man will match a field containing iron man but not man iron.
  • fuzziness: 2, which allows 2 edits in a query term, i.e. DeaddPoool will match deadpool.
  • and there are many others.
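To see what "2 edits" means for fuzziness, here is a small sketch that counts single-character edits between two terms. It uses plain Levenshtein distance as a simplification; as far as I know, ES by default also counts a swap of two adjacent characters as one edit. The edit_distance method is written for illustration and is not part of any gem.

```ruby
# Plain Levenshtein distance: the minimum number of single-character
# insertions, deletions and substitutions needed to turn a into b.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Terms are lowercased by analysis before they are compared.
puts edit_distance('DeaddPoool'.downcase, 'deadpool') # => 2
```

Dropping the extra d and the extra o gives exactly two edits, so fuzziness: 2 is just enough for this typo.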

• Article.collection.aggregate([ { '$match': { status: 'published' } }, { '$group': { _id: '$publishing_date', count: { '$sum': 1 } } } ])

This query returns the count of published articles grouped by publishing_date. Its ES analogy would be:

Article.search(size: 0, query: { constant_score: { filter: { term: { status: 'published' } } } }, aggs: { grouped_articles: { terms: { field: :publishing_date, size: 0 } } }).aggregations.grouped_articles.buckets

Note the use of size: 0 in the query. If you only care about the aggregated results and not about the query results, make sure to use this parameter, as it leaves the matching documents out of the response.
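If buckets are new to you, this plain Ruby sketch mimics what the filter plus terms aggregation computes, using made-up in-memory articles. The terms_buckets helper is hypothetical. Real buckets also come back with key and doc_count, though ES reports date keys as epoch milliseconds rather than Date objects.

```ruby
require 'date'

# Made-up sample data standing in for the indexed articles.
ARTICLES = [
  { title: 'A', status: 'published', publishing_date: Date.new(2016, 5, 1) },
  { title: 'B', status: 'published', publishing_date: Date.new(2016, 5, 1) },
  { title: 'C', status: 'draft',     publishing_date: Date.new(2016, 5, 1) },
  { title: 'D', status: 'published', publishing_date: Date.new(2016, 5, 2) }
].freeze

# One bucket per distinct value with its document count, most frequent
# first (the default ordering of a terms aggregation).
def terms_buckets(docs, field)
  docs.group_by { |doc| doc[field] }
      .map { |key, group| { 'key' => key, 'doc_count' => group.size } }
      .sort_by { |bucket| [-bucket['doc_count'], bucket['key']] }
end

published = ARTICLES.select { |article| article[:status] == 'published' }
terms_buckets(published, :publishing_date).each do |bucket|
  puts "#{bucket['key']}: #{bucket['doc_count']}"
end
```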

• Article.count

Article.search(aggs: { articles: { terms: { field: :_type, size: 0 } } }).aggregations.articles.buckets.first.doc_count

You might ask: why use an aggregation to get a count? Because it’s fast, and because an ES search only returns results up to its maximum window size. Your document count might be greater than that, in which case you would need to use the Scroll API until the last page to get all the documents.

There is still a lot to learn about ES, and new features are still being added to it. The team is also working on the 5.0.0 release, which is expected to take less disk space and be twice as fast, with a 25% increase in search performance.

Thanks for reading. Happy searching!
