Wednesday, August 27, 2014

Solr vs ElasticSearch

Background

The first thing you should know about Solr and ElasticSearch is that they are competing search servers. Both ElasticSearch and Solr are built on top of Lucene, so many of their core features are identical. If you are unfamiliar, Lucene is a search engine packaged together in a set of jar files. Many custom applications embed the Lucene jar files directly into their application and manually create and search their Lucene index through the Lucene APIs.
Solr and ES take those Lucene APIs, add features on top of them, and expose them through an easy-to-deploy web server (such as Tomcat or Jetty). Instead of coding against the Lucene Java API, developers can send HTTP requests to the search server and index or search that way.
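As a rough illustration of what "HTTP instead of the Java API" looks like, the sketch below builds (but does not send) one search request for each server using only the Python standard library. The hosts, ports, the Solr core name people, and the ElasticSearch index name people are assumptions made up for this example; the /select and /_search paths are the usual search endpoints, but check them against the version you run.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical local servers; adjust host, port, and names for your setup.
SOLR_BASE = "http://localhost:8983/solr/people"
ES_BASE = "http://localhost:9200/people"


def solr_search_request(query: str) -> Request:
    """Build a GET request for Solr's select handler; the query travels as a URL parameter."""
    params = urlencode({"q": query, "wt": "json"})
    return Request(f"{SOLR_BASE}/select?{params}", method="GET")


def es_search_request(query: dict) -> Request:
    """Build a POST request for ElasticSearch's _search endpoint; the query travels as a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(
        f"{ES_BASE}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


solr_req = solr_search_request("name:ryan*")
es_req = es_search_request({"term": {"user": "kimchy"}})
print(solr_req.get_method(), solr_req.full_url)
print(es_req.get_method(), es_req.full_url)
```

Passing either object to urllib.request.urlopen would actually send it, assuming a server is listening at that address.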

Distributed Search

Foundations

Solr was released in 2008. The Solr committers focused on building new search features. Later, it became obvious that distributed search was a highly desired feature. In October of 2012 Solr released the SolrCloud feature set, which was supposed to make distributed search easy. People like to say that Solr brought distributed search on as an afterthought. ElasticSearch, on the other hand, was released in 2010 and was designed specifically to make up for the distributed features Solr lacked. For this reason, you may find it easier and more intuitive to start up an ElasticSearch cluster than a SolrCloud cluster.
Winner: ElasticSearch

Coordination

ElasticSearch uses its own internal coordination mechanism to handle cluster state, while Solr uses ZooKeeper. This means that in order to run SolrCloud, you have to have a ZooKeeper quorum set up. For a lot of folks using different components in the Hadoop ecosystem, this isn’t a problem since they will most likely already have a ZooKeeper quorum running. In addition, by using ZooKeeper, Solr can avoid a split-brain scenario that ElasticSearch is vulnerable to. I’ll mark this section as a toss-up.
Winner: Toss Up
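The split-brain point comes down to majority arithmetic: if a partition must hold a strict majority of the voting nodes before it may elect a leader, two disjoint partitions can never both qualify. Here is a minimal sketch of that arithmetic. It is not the election code of ZooKeeper or ElasticSearch, and the function names are made up for this example.

```python
def quorum_size(voting_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return voting_nodes // 2 + 1


def can_elect_leader(partition_size: int, voting_nodes: int) -> bool:
    """A partition may elect a leader only if it holds a strict majority."""
    return partition_size >= quorum_size(voting_nodes)


# A 5-node ensemble split 3 / 2 by a network partition:
print(can_elect_leader(3, 5))  # True: the larger side keeps working
print(can_elect_leader(2, 5))  # False: the smaller side cannot elect a leader
```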

Shard Splitting

Shards are the partitioning unit for the Lucene index, and both Solr and ElasticSearch have them. You can distribute your index by placing shards on different machines in a cluster. Until April 2013, neither Solr nor ElasticSearch allowed you to change the number of shards in your index. So if you decided to split your index into 10 shards on day one, and two years later you wanted to add another 5 shards, you could not do that without completely starting over (reindexing everything). As of April 2013 Solr supports shard splitting, which allows you to create more shards by splitting existing shards. ElasticSearch still does not support this.
Winner: Solr
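To see why changing the shard count has traditionally meant reindexing, assume documents are routed to shards with a plain hash-modulo scheme. This is a simplification for illustration, not the exact routing either engine uses, and shard_for and fraction_moved are hypothetical helpers. Going from 10 to 15 shards under this scheme, roughly two thirds of documents would be expected to land on a different shard.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard by hashing its id (simplified hash-mod routing)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def fraction_moved(doc_ids, old_shards: int, new_shards: int) -> float:
    """Fraction of documents whose target shard changes when the shard count changes."""
    moved = sum(
        1 for d in doc_ids if shard_for(d, old_shards) != shard_for(d, new_shards)
    )
    return moved / len(doc_ids)


doc_ids = [f"doc-{i}" for i in range(10_000)]
moved = fraction_moved(doc_ids, 10, 15)
print(f"{moved:.0%} of documents would land on a different shard")
```

Splitting an existing shard sidesteps this problem because only the documents of the shard being split have to be redistributed.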

Automatic Shard Rebalancing

Let’s say you’re in charge of capacity planning for your ElasticSearch index. Today you have 5 machines, but you know you will have budget for 20 machines by the end of this year. To make the best use of those 20 machines next year, you decide that it makes the most sense to split your index into 10 shards and have 1 replica of each shard (10 shards and 10 replica shards = 20 total shards). Then you would have either 1 shard or 1 replica shard on each machine in your cluster. Since you only have 5 machines today, multiple shards will have to share the same machine. As you add new machines, ElasticSearch will automatically rebalance and move shards to the new nodes in the cluster. This automatic shard rebalancing behavior does not exist in Solr.
Winner: ElasticSearch
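The capacity-planning arithmetic above can be sketched with a toy placement routine. It is a greedy stand-in written for this example, not ElasticSearch's actual allocator: each shard copy goes to the least-loaded node that does not already hold the other copy of the same shard. The function name allocate and its return shape are assumptions, and it needs at least 2 nodes.

```python
from collections import defaultdict


def allocate(num_primaries: int, num_nodes: int) -> dict:
    """Greedy placement of one primary and one replica per shard.

    Returns {node_index: [(shard_number, kind), ...]}. Requires num_nodes >= 2
    so that a replica never shares a node with its primary.
    """
    placement = defaultdict(list)
    copies = [(s, "primary") for s in range(num_primaries)]
    copies += [(s, "replica") for s in range(num_primaries)]
    for shard, kind in copies:
        # Only nodes that do not yet hold a copy of this shard are eligible.
        candidates = [
            n for n in range(num_nodes)
            if all(held != shard for held, _ in placement[n])
        ]
        target = min(candidates, key=lambda n: len(placement[n]))
        placement[target].append((shard, kind))
    return dict(placement)


for nodes in (5, 20):
    loads = sorted(len(copies) for copies in allocate(10, nodes).values())
    print(f"{nodes} nodes -> shard copies per node: {loads}")
```

With 5 nodes every machine carries 4 shard copies, and with 20 nodes every machine carries exactly 1, which is the layout the plan is aiming for.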

Schema

Schema-less?

To be 100% clear, both Solr and ElasticSearch provide dynamic typing so that you can index new fields on the fly (after you have already defined your schema).
Winner: Users

Schema Creation

ElasticSearch will automagically create your schema based on the data you are indexing. Solr on the other hand requires you to define a schema before you index anything. In production for either Solr or ElasticSearch, you’ll want to define your schema before you index anything. This is because there are many advanced analyzers/filters you will want to apply on the data before you index it.
Winner: Both
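As a toy illustration of what automatic schema creation involves, the sketch below guesses a field type from each value in a document. The type names and the infer_schema helper are assumptions for this example, not ElasticSearch's actual mapping logic. Note what the guess cannot tell you: which analyzer or filters a text field should use, which is why you still want an explicit schema in production.

```python
def infer_field_type(value) -> str:
    """Guess a field type from a JSON-style value (illustrative type names)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list) and value:
        return infer_field_type(value[0])  # assume a homogeneous list
    return "string"


def infer_schema(doc: dict) -> dict:
    """Map each top-level field of a document to a guessed type."""
    return {field: infer_field_type(value) for field, value in doc.items()}


doc = {"name": "ryan", "age": 31, "score": 4.5, "active": True, "tags": ["zombies"]}
print(infer_schema(doc))
```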

Nested Typing

ElasticSearch supports complex nested types. For example, you could have an address field that contains a home field and a work field. Each of those fields would have street, city, state, and zip fields. These nested types only work for 1 (parent) to many (child) relationships. There are also a lot of “gotchas” here. For example, with parent/child relationships, all members of a relationship must fit onto one shard in your index. And for nested fields, updating may be extremely slow if you make any updates to any field in the nest. Solr does not support nested typing; the document structure must be flat. The fact that these options exist in ElasticSearch is very cool, but you have to be very careful with how you use them.
Winner: ElasticSearch
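To make the difference concrete, here is the address example as a nested document, along with one possible way to flatten it for a schema that only allows flat fields. The flatten helper and the underscore naming convention are assumptions for illustration, not something either product prescribes.

```python
def flatten(doc: dict, prefix: str = "") -> dict:
    """Collapse nested dicts into a single level, joining key names with underscores."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


nested = {
    "name": "ryan",
    "address": {
        "home": {"street": "1 Main St", "city": "Denver", "state": "CO", "zip": "80202"},
        "work": {"street": "9 Oak Ave", "city": "Boulder", "state": "CO", "zip": "80301"},
    },
}

for field, value in flatten(nested).items():
    print(field, "=", value)
```

The flat form works for a fixed number of sub-objects like home and work, but it does not extend naturally to an arbitrary list of children, which is where real nested types earn their keep.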

Queries

Query Syntax

Solr’s query syntax is built from field:value pairs, with () used to group and nest clauses. For example:
q=((name:ryan* AND haircolor:brown) OR interest:zombies) OR (job:engineer*)
ElasticSearch uses JSON. For example, here is an ElasticSearch query:
{
    "bool" : {
        "must" : {
            "term" : { "user" : "kimchy" }
        },
        "must_not" : {
            "range" : {
                "age" : { "from" : 10, "to" : 20 }
            }
        },
        "should" : [
            {
                "term" : { "tag" : "wow" }
            },
            {
                "term" : { "tag" : "elasticsearch" }
            }
        ],
        "minimum_should_match" : 1
    }
}
Winner: Users
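If the JSON form is new to you, it may help to read the bool query as plain boolean logic. The sketch below evaluates a tiny subset of the query structure (term, range, bool) against an in-memory document. It is a teaching aid under stated assumptions, not how ElasticSearch executes queries: it ignores scoring and analysis, treats the from/to bounds as inclusive, and the matches function is made up for this example.

```python
def as_list(clause):
    """A clause may be a single query dict or a list of them."""
    return clause if isinstance(clause, list) else [clause]


def matches(query: dict, doc: dict) -> bool:
    """Evaluate a term, range, or bool query against one document."""
    kind, body = next(iter(query.items()))
    if kind == "term":
        field, wanted = next(iter(body.items()))
        value = doc.get(field)
        return wanted in value if isinstance(value, list) else value == wanted
    if kind == "range":
        field, bounds = next(iter(body.items()))
        value = doc.get(field)
        # Assumption: both bounds are inclusive.
        return value is not None and bounds["from"] <= value <= bounds["to"]
    if kind == "bool":
        must = all(matches(q, doc) for q in as_list(body.get("must", [])))
        must_not = any(matches(q, doc) for q in as_list(body.get("must_not", [])))
        should_hits = sum(matches(q, doc) for q in as_list(body.get("should", [])))
        return must and not must_not and should_hits >= body.get("minimum_should_match", 0)
    raise ValueError(f"unsupported query type: {kind}")


query = {
    "bool": {
        "must": {"term": {"user": "kimchy"}},
        "must_not": {"range": {"age": {"from": 10, "to": 20}}},
        "should": [{"term": {"tag": "wow"}}, {"term": {"tag": "elasticsearch"}}],
        "minimum_should_match": 1,
    }
}

print(matches(query, {"user": "kimchy", "age": 30, "tag": "wow"}))  # True
print(matches(query, {"user": "kimchy", "age": 15, "tag": "wow"}))  # False: age is in the excluded range
```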

Distributed Group By

Solr supports distributed group by (including grouped sorting, filtering, faceting, etc.); ElasticSearch does not. This feature seems like a no-brainer for almost any search application, which is why I call it out specifically here.
Winner: Solr
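What makes group by harder in a distributed index is that each shard only sees its own documents, so something has to merge the per-shard groups. The sketch below shows the general scatter-gather idea with made-up data. It is not Solr's implementation, and merge_grouped and its input shape are assumptions for this example.

```python
import heapq
from collections import defaultdict


def merge_grouped(shard_results, top_n: int = 2) -> dict:
    """Merge per-shard grouped hits, keeping the top_n highest-scoring docs per group.

    shard_results: one {group_value: [(score, doc_id), ...]} dict per shard.
    """
    merged = defaultdict(list)
    for result in shard_results:
        for group, hits in result.items():
            merged[group].extend(hits)
    return {group: heapq.nlargest(top_n, hits) for group, hits in merged.items()}


# Hypothetical results from two shards, grouped by a "job" field.
shard_a = {"engineer": [(0.9, "d1"), (0.4, "d2")], "chef": [(0.7, "d3")]}
shard_b = {"engineer": [(0.8, "d4")], "designer": [(0.6, "d5")]}

for group, hits in merge_grouped([shard_a, shard_b]).items():
    print(group, hits)
```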

Percolation Queries

ElasticSearch allows you to register certain queries that can generate notifications when indexed documents match that query. This is really great for things like alerts. This may cause performance issues if you have too many percolated queries as each document that is indexed will be queried by each percolated query. If the newly indexed document is returned by one of the percolated queries then an alert is sent out.
Winner: ElasticSearch
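The performance caveat is easier to see in a toy version of the idea. In the sketch below, queries are registered up front and every incoming document is checked against all of them, so the work per indexed document grows with the number of registered queries. The Percolator class and its methods are hypothetical and only mimic the concept; they are not ElasticSearch's API.

```python
class Percolator:
    """Toy percolator: stored queries are plain predicates over a document dict."""

    def __init__(self):
        self._queries = {}

    def register(self, name, predicate):
        """Store a named query to be checked against future documents."""
        self._queries[name] = predicate

    def percolate(self, doc):
        """Return the names of all registered queries that match this document."""
        # Every registered query runs for every document.
        return [name for name, predicate in self._queries.items() if predicate(doc)]


percolator = Percolator()
percolator.register("zombie-alert", lambda d: "zombies" in d.get("interests", []))
percolator.register("engineer-alert", lambda d: d.get("job", "").startswith("engineer"))

new_doc = {"name": "ryan", "job": "engineer", "interests": ["zombies"]}
for alert in percolator.percolate(new_doc):
    print("send alert:", alert)
```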

Community

Users

ElasticSearch is still fairly new but its community is growing very quickly. Solr has been around for much longer and therefore has a larger user base.
Winner: Solr

Vendor Support

MapR, Cloudera, and DataStax have all chosen Solr for their search technology. InfoChimps is using ElasticSearch. I haven’t heard any word on whether HortonWorks is even looking into search at this point. LucidWorks employs many of the Solr committers and provides an enterprise Solr product with more features, while the ElasticSearch company provides most of the support for its own product. Think Big also supports Solr and ElasticSearch, especially when it comes to integrating these technologies with big data. I see DataStax and Cloudera as thought leaders in this area, which is why I give the win to Solr.
Winner: Solr

Conclusion

So ElasticSearch won four categories and Solr won four. Regardless of how the counts were going to end up, I never wanted to say that ElasticSearch is better than Solr or that Solr is better than ElasticSearch. At the end of the day, Solr and ElasticSearch are very close to each other in feature sets, and it would be really difficult to choose one over the other without knowing the exact requirements your organization has.


Original post: https://thinkbiganalytics.com/solr-vs-elastic-search/
