There are plenty of indexing/search servers available out there, like Solr, Sphinx, Elasticsearch, Google Search Appliance and many more.
But out of all of the above, Elasticsearch is the one gaining the most attention because of its popularity. This post is going to tell you about it and its integration in Rails.
So, how would someone describe what Elasticsearch is?
From God's perspective, it's an open source, distributed, RESTful search engine i.e. it has a very advanced distributed model, speaks JSON natively, and exposes many advanced search features, all seamlessly expressed through a JSON DSL and REST APIs.
♦ Inception
The standard way of using ES in any application is as a secondary data store i.e. data is stored in some kind of SQL/NoSQL database, and the required documents are then continuously upserted from it into ES. Pretty neat.
Some of us might think: why not use the database itself as a search engine instead of ES, as it involves less work and it has all the features needed? The answer is NO, you shouldn't, because when the data starts hitting the roof, the database gives up and won't show results in real time.
Then some of us might think: why not use Elasticsearch as the primary store? The answer is again NO, you shouldn't (for now at least). The reasons are: it doesn't support transactions or associations, and most of all, there are no ORMs/ODMs yet that support ActiveRecord-like features (callbacks, validations, eager loading etc).
There are plenty of gems out there to start with ES, but the ones I would prefer are elasticsearch-rails and elasticsearch-model, as they provide more customization for querying.
♦ The Architect
Consider an Article model which has many comments and an author.
class Article
  include Mongoid::Document

  field :title
  field :body
  field :status
  field :publishing_date, type: Date

  has_many :comments
  belongs_to :author
end
To map Article in ES, you have to specify its JSON structure as,
class Article
  include Mongoid::Document
  include Elasticsearch::Model # note this inclusion

  # ....

  # JSON structure to be indexed
  INDEXED_FIELDS = {
    only: [:title, :publishing_date, :status],
    include: {
      author: { only: [:_id, :name] },
      comments: {
        only: [:body],
        include: {
          author: { only: [:_id, :name] }
        }
      }
    }
  }

  # It will get called while indexing an article
  def as_indexed_json(options = {})
    as_json(INDEXED_FIELDS)
  end
end
ES needs to know what datatype each field has and how it should be handled while searching. You can specify both using mappings as,
class Article
  include Mongoid::Document
  include Elasticsearch::Model

  # ...

  mappings do
    indexes :status, index: :not_analyzed
    indexes :publishing_date, type: :date
    indexes :author do
      indexes :_id, index: :not_analyzed
    end
    indexes :comments do
      indexes :author do
        indexes :_id, index: :not_analyzed
      end
    end
  end

  # ...
end
PS: You don't have to specify mappings for all fields, only for those that need customization. If you don't specify a mapping for a field, ES will assume its datatype is string and that it needs to be analyzed (it'll analyze the text and break it into tokens).
Sometimes it's worth storing additional fields in ES which you won't require while searching, but which are required while building a response. For example, while getting the list of articles, the response should also contain the author's avatar URL. If it's also stored on the ES side, there is no need to make a database call to get it, which would otherwise increase response time.
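As a sketch of that idea (avatar_url is an assumed field on the Author model, not shown earlier), the avatar URL can simply be added to the indexed JSON structure:

```ruby
# Hypothetical extension of INDEXED_FIELDS: author.avatar_url is an
# assumed field, stored in ES purely for building responses, not for search.
INDEXED_FIELDS = {
  only: [:title, :publishing_date, :status],
  include: {
    author: { only: [:_id, :name, :avatar_url] },
    comments: {
      only: [:body],
      include: { author: { only: [:_id, :name, :avatar_url] } }
    }
  }
}
```

With this, the search response already carries everything the article list needs, so the database is never touched while rendering it.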
♦ The Transporter
Elasticsearch::Model also adds support for importing data from the database to ES. You may need to import all data at the very beginning, or when there are amendments. If you want an index to update automatically on document changes, you need to include the Elasticsearch::Model::Callbacks module.
There are a few ways in which you can do the importing.
• Standard way
The standard way of importing all the data in a collection.
Article.import
You can also add scopes while importing, so that only specific documents will get imported.
Article.published.import
import accepts some other options like force: true to recreate the index, refresh: true to refresh the index, and batch_size: 100 to fetch 100 documents at a time from the collection.
• Using rake
elasticsearch-rails provides a rake task for importing data. It accepts the same options as import through environment variables.
rake elasticsearch:import:model CLASS='Article' SCOPE='published'
You can use this task if you want to set up a CRON job, or if you want to import all collections in one go as,
rake elasticsearch:import:all DIR=app/models
• Custom way
If you want more customization for indexing, you can implement it yourself. For example, making asynchronous updates on document changes, or combining multiple operations into a single request.
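As a minimal sketch of the second idea (ArticleIndexer is a hypothetical helper, not part of the gems), operations can be buffered and flushed through ES's Bulk API in a single request:

```ruby
# Hypothetical helper: buffers index/delete operations and sends them
# to ES in one Bulk API call instead of one request per document.
class ArticleIndexer
  def initialize(client, index_name)
    @client = client        # e.g. Article.__elasticsearch__.client
    @index_name = index_name
    @buffer = []
  end

  # Queue an upsert; body would typically come from article.as_indexed_json
  def index(id, body)
    @buffer << { index: { _index: @index_name, _id: id, data: body } }
  end

  def delete(id)
    @buffer << { delete: { _index: @index_name, _id: id } }
  end

  # Send everything buffered so far in a single bulk request
  def flush
    return if @buffer.empty?
    @client.bulk(body: @buffer)
    @buffer = []
  end
end
```

Combined with a background job, the flush can also run asynchronously so that writes never block the request cycle.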
If indexing the data is taking too much time, then below are some things you can consider:
- Adding proper indexes on the database side.
- Fetching fewer documents per database request, so that it doesn't eat up much of your RAM.
- If possible, avoiding eager loading of associations.
- Disabling logs or lowering the log level, so that unwanted lines like database queries don't take up time.
♦ In pursuit of happyness
Everything is set up, all data is imported, upsert actions are in place. Now all you want to do is make search happen and smirk at the performance 😏. I won't strain your read by adding yet another ES tutorial as there are plenty; rather, I'll be sharing some common scenarios I came across and their analogy with database queries.
• Article.published
Article.search(query: { constant_score: { filter: { term: { status: 'published' } } } }).results
I've used a constant_score query as it boosts performance by not caring about the document score. One more thing to note is the use of a term query instead of a match query, as the query is about an exact match and not a partial match.
The above query can also be written using a filter query as,
Article.search(filter: { term: { status: 'published' } }).results
filter queries are used for simple checks of inclusion or exclusion. They are non-scoring queries, which makes them faster than scoring queries. In addition, ES caches such non-scoring queries in memory for faster access if they are called frequently. So, you have to be careful about choosing between query and filter depending upon the use case.
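Since the constant_score wrapper shows up in most of the examples that follow, it can be extracted into a small helper (a hypothetical convenience method; only the hash it builds comes from the queries above):

```ruby
# Hypothetical helper: wraps one or more non-scoring filters in a
# constant_score query body, ready to pass to Article.search.
def constant_score_query(*filters)
  filter = filters.one? ? filters.first : { bool: { must: filters } }
  { query: { constant_score: { filter: filter } } }
end

published = constant_score_query(term: { status: 'published' })
# Article.search(published).results
```

Multiple filters get ANDed inside a bool/must clause, matching the range example further below.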
• Article.in(status: ['published', 'draft'])
Article.search(query: { constant_score: { filter: { terms: { status: ['published', 'draft'] } } } }).results
The only difference from the above example is using terms instead of term, as the query is about ORing the values.
• Article.where(status: 'published', :date.gte => 1.month.ago)
Article.search(query: { constant_score: { filter: { bool: { must: [ { term: { status: 'published' } }, { range: { date: { gte: 'now-1M' } } } ] } } } }).results
• Article.any_of(title: /mcdonald/, body: /mcdonald/) + Author.any_of(first_name: /mcdonald/, last_name: /mcdonald/)
Article.search(query: { multi_match: { query: 'mcdonald', fields: [:title, :body, 'author.*_name'] } }).results
Most of the time, multi_match is used for the autocomplete feature. It accepts optional parameters to refine your search criteria. For example,
- type: :best_fields, which searches Captain America in a single field rather than captain in one field and america in another.
- type: :phrase, for matching exact sentence order i.e. Iron Man will be matched with a field containing iron man and not man iron.
- fuzziness: 2, which will allow 2 edits in a query i.e. DeaddPoool will be matched with deadpool.
- and there are many others.
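Putting these together, a fuzzy autocomplete query over the fields indexed above could look like this (a sketch; the search term is just an example):

```ruby
autocomplete = {
  query: {
    multi_match: {
      query: 'deaddpoool',                      # user's (misspelled) input
      fields: [:title, :body, 'author.*_name'], # same fields as above
      type: :best_fields,                       # score fields separately, take the best
      fuzziness: 2                              # tolerate up to 2 edits
    }
  }
}
# Article.search(autocomplete).results
```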
• Article.collection.aggregate([{ '$match': { status: 'published' } }, { '$group': { _id: '$publishing_date', count: { '$sum': 1 } } }])
This query will return published articles count grouped by publishing_date. Its analogy in ES query would be,
Article.search(size: 0, query: { constant_score: { filter: { term: { status: 'published' } } } }, aggs: { grouped_articles: { terms: { field: :publishing_date, size: 0 } } }).aggregations.grouped_articles.buckets
Note the use of size: 0 in the query. If you only care about the aggregated results and not the query hits, then make sure to use this parameter, as it removes the query hits from the response.
• Article.count
Article.search(aggs: { articles: { terms: { field: :_type, size: 0 } } }).aggregations.articles.buckets.first.doc_count
You might ask: why an aggregation for getting the count? That's because it's fast, and ES queries search only up to the max window size. Your document count might be greater than that, in which case you would need to use the Scroll API till the last page to get all documents.
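A sketch of that last point (each_hit is a hypothetical helper; client is any object responding to #search and #scroll, e.g. Article.__elasticsearch__.client, though the exact scroll call signature may vary between client versions):

```ruby
# Hypothetical helper: pages through every hit using the Scroll API,
# keeping the scroll context alive for 1 minute between requests.
def each_hit(client, index:, query: { match_all: {} })
  response = client.search(index: index, scroll: '1m',
                           body: { query: query })
  until (hits = response['hits']['hits']).empty?
    hits.each { |hit| yield hit }
    response = client.scroll(scroll_id: response['_scroll_id'], scroll: '1m')
  end
end
```

Unlike bumping the size parameter, scrolling works no matter how far past the max window size the result set grows.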
There are a lot of things yet to be learned about ES, and new features are still getting added to it. They are also working on the 5.0.0 release, which will take less disk space and will be twice as fast, with a 25% increase in search performance.
Thanks for reading, Happy searching !!