Background
The first thing you should know about Solr and
ElasticSearch is that they are competing search servers. Both
are built on top of Lucene, so many of their core
features are identical. If you are unfamiliar, Lucene is a search
engine library distributed as a set of jar files. Many custom applications
embed the Lucene jar files directly and create and search
their Lucene index by hand through the Lucene APIs.
Solr and ES take those Lucene APIs, add features on top of
them, and expose them through an easy-to-deploy web server
(such as Tomcat or Jetty). Instead of coding against the Lucene Java API,
developers can send HTTP requests to the search server and
index and search that way.
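To make "HTTP commands" a little more concrete, here is a minimal sketch of how a search request URL could be built for each server. The host names, ports, and the collection/index name articles are assumptions made up for illustration; the /select and /_search endpoints and the q parameter are the standard search entry points of Solr and ElasticSearch respectively. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Hypothetical hosts and collection/index name, for illustration only.
SOLR_BASE = "http://localhost:8983/solr/articles"
ES_BASE = "http://localhost:9200/articles"


def solr_select_url(query: str, rows: int = 10) -> str:
    """Build a Solr search URL; the whole query travels as URL parameters."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?" + urlencode(params)


def es_search_url(query: str, size: int = 10) -> str:
    """Build an ElasticSearch URI-search URL using the q= shortcut."""
    params = {"q": query, "size": size}
    return f"{ES_BASE}/_search?" + urlencode(params)


if __name__ == "__main__":
    print(solr_select_url("title:lucene"))
    print(es_search_url("title:lucene"))
```

Either URL could then be fetched with any HTTP client, which is the whole point: no Lucene jar files are needed on the client side.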
Distributed Search
Foundations
Solr was released in 2008. The Solr committers focused on
building new search features. Later, it became obvious that distributed
search was a highly desired feature. In October of 2012 Solr released
the SolrCloud feature set, which was meant to make distributed search
easy. People like to say that Solr added distributed search as an
afterthought. ElasticSearch, on the other hand, was released in 2010 and
was designed from the start to make up for Solr’s lack of distributed
features. For this reason, you may find it easier and more intuitive to
start up an ElasticSearch cluster than a SolrCloud cluster.
Winner: ElasticSearch
Coordination
ElasticSearch uses its own internal coordination mechanism
to handle cluster state, while Solr uses ZooKeeper. This means that in order
to run SolrCloud, you have to have a ZooKeeper quorum set up. For a
lot of folks using different components in the Hadoop ecosystem, this
isn’t a problem since they will most likely already have a ZooKeeper
quorum running. In addition, by using ZooKeeper, Solr can avoid the
split-brain scenario (a partitioned cluster ending up with two masters)
that ElasticSearch is vulnerable to. I’ll mark this
section as a toss up.
Winner: Toss Up
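The split-brain point comes down to simple majority arithmetic. The sketch below is not code from either project; the function names are made up for illustration. It only shows the rule a quorum-based system relies on: a partition may elect a master only if it holds a strict majority of the master-eligible nodes, so two halves of a split cluster can never both win.

```python
def quorum_size(master_eligible_nodes: int) -> int:
    """Smallest group that is a strict majority of the cluster."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(partition_size: int, master_eligible_nodes: int) -> bool:
    """A partition may elect a master only if it holds a majority."""
    return partition_size >= quorum_size(master_eligible_nodes)


if __name__ == "__main__":
    # A 5-node cluster split 3/2: only the larger side keeps a master.
    print(can_elect_master(3, 5), can_elect_master(2, 5))
    # A 4-node cluster split 2/2: neither side may elect one.
    print(can_elect_master(2, 4))
```

This is also why quorums are usually sized with an odd number of nodes: an even split of an even-sized cluster leaves no side with a majority.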
Shard Splitting
Shards are the partitioning unit for the Lucene index, and both
Solr and ElasticSearch have them. You can distribute your index by
placing shards on different machines in a cluster. Until April 2013,
neither Solr nor ElasticSearch would allow you to change the number of
shards in your index. So if you decided to split your index
into 10 shards on day one, and two years later you wanted to add another 5
shards, you could not do that without completely starting over
(reindexing everything). As of April 2013 Solr supports shard splitting,
which allows you to create more shards by splitting existing shards.
ElasticSearch still does not support this.
Winner: Solr
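Why is changing the shard count so painful, and why does splitting help? Here is a simplified sketch, under stated assumptions. It assumes documents are routed by a hash of their id: with modulo routing (hash % number of shards), changing the shard count sends most documents to a different shard, while with hash-range routing a single shard's range can be cut in half without touching the other shards. The MD5-based hash, the 32-bit hash space, and the shard names are all illustrative choices, not what either engine actually uses.

```python
import hashlib
from typing import Dict, Iterable, Tuple

HASH_MAX = 2**32 - 1  # toy 32-bit hash space


def stable_hash(doc_id: str) -> int:
    """Deterministic 32-bit hash of a document id (illustrative only)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def modulo_shard(doc_id: str, num_shards: int) -> int:
    """Modulo routing: the target shard depends on the total shard count."""
    return stable_hash(doc_id) % num_shards


def moved_fraction(doc_ids: Iterable[str], old: int, new: int) -> float:
    """Fraction of documents whose shard changes when the count changes."""
    ids = list(doc_ids)
    moved = sum(1 for d in ids if modulo_shard(d, old) != modulo_shard(d, new))
    return moved / len(ids)


def split_range(lo: int, hi: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Split the inclusive hash range [lo, hi] into two halves."""
    mid = (lo + hi) // 2
    return (lo, mid), (mid + 1, hi)


def range_shard(doc_id: str, ranges: Dict[str, Tuple[int, int]]) -> str:
    """Range routing: find the shard whose hash range contains the id."""
    h = stable_hash(doc_id)
    for name, (lo, hi) in ranges.items():
        if lo <= h <= hi:
            return name
    raise ValueError("hash falls outside every shard range")


if __name__ == "__main__":
    ids = [f"doc-{i}" for i in range(1000)]
    # Going from 10 to 15 modulo-routed shards relocates most documents.
    print(moved_fraction(ids, 10, 15))
    # Splitting one range only affects documents that lived in that shard.
    print(split_range(0, HASH_MAX // 2))
```

In the modulo case the only safe option is to reindex everything; in the range case only the documents of the shard being split need to be rewritten.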
Automatic Shard Rebalancing
Let’s say you’re in charge of capacity planning for your
ElasticSearch index. Today you have 5 machines, but you know you
will have budget for 20 machines by the end of this year. To
make the best use of those 20 machines, you decide that it
makes the most sense to split your index into 10 shards with 1 replica
of each shard (10 shards and 10 replica shards = 20 total shards). Then
you would have either 1 shard or 1 replica shard on each machine in your
cluster. Since you only have 5 machines today, multiple shards will
have to share the same machine. As you add new machines, ElasticSearch
will automatically rebalance and move shards to the new nodes in the
cluster. This automatic shard rebalancing behavior does not exist in
Solr.
Winner: ElasticSearch
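The arithmetic in the capacity-planning example can be checked with a toy allocator. To be clear, this is not ElasticSearch's allocation algorithm; it is a hypothetical placement rule (primary on node shard % n, replica half a cluster away) that just shows 20 shard copies landing 4 per machine on 5 machines and 1 per machine on 20, with a shard and its replica never on the same node.

```python
from typing import Dict, List, Tuple

Placement = Dict[int, List[Tuple[int, str]]]


def allocate(num_shards: int, num_nodes: int) -> Placement:
    """Place one primary and one replica per shard across the nodes.

    Toy rule: the primary goes to node (shard % n) and the replica goes
    half a cluster further along, so the two copies never share a node.
    """
    if num_nodes < 2:
        raise ValueError("need at least 2 nodes to separate primary and replica")
    offset = num_nodes // 2  # always between 1 and n - 1 when n >= 2
    placement: Placement = {node: [] for node in range(num_nodes)}
    for shard in range(num_shards):
        placement[shard % num_nodes].append((shard, "primary"))
        placement[(shard + offset) % num_nodes].append((shard, "replica"))
    return placement


def copies_per_node(placement: Placement) -> List[int]:
    """Number of shard copies hosted by each node, in node order."""
    return [len(copies) for _, copies in sorted(placement.items())]


if __name__ == "__main__":
    print(copies_per_node(allocate(10, 5)))   # today's 5 machines
    print(copies_per_node(allocate(10, 20)))  # the future 20 machines
```

Rerunning the allocation with a larger node count is the manual version of what automatic rebalancing does for you as machines join the cluster.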
Schema
Schema-less?
To be 100% clear, both Solr and ElasticSearch provide
dynamic typing so that you can index new fields on the fly (after you
have already defined your schema).
Winner: Users
Schema Creation
ElasticSearch will automagically create your schema based
on the data you are indexing. Solr, on the other hand, requires you to
define a schema before you index anything. In production, for either Solr
or ElasticSearch, you’ll want to define your schema before you index
anything. This is because there are many advanced analyzers/filters you
will want to apply to the data before you index it.
Winner: Both
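Here is a toy illustration of what "create your schema based on the data" means, and why you would not want to rely on it in production. The type names and the first-value-wins rule are assumptions for the sketch, not ElasticSearch's actual mapping rules: the type of a field is guessed from the first value seen, so one sloppy document can lock in the wrong type.

```python
from typing import Dict, Iterable


def infer_field_type(value: object) -> str:
    """Guess a field type from a single JSON-like value (illustrative names)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def infer_schema(docs: Iterable[Dict[str, object]]) -> Dict[str, str]:
    """Build a schema where the first value seen for a field decides its type."""
    schema: Dict[str, str] = {}
    for doc in docs:
        for field, value in doc.items():
            schema.setdefault(field, infer_field_type(value))
    return schema


if __name__ == "__main__":
    docs = [
        {"name": "Ryan", "zip": 94105},  # zip sent as a number first
        {"name": "Kim", "zip": "02134"},  # later sent as a string
    ]
    print(infer_schema(docs))
```

In this sketch zip ends up typed as a number because of the first document, which is exactly the kind of surprise an explicit schema avoids.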
Nested Typing
ElasticSearch supports complex nested types. For example,
you could have an address field that contains a home field and a work
field. Each of those fields would have street, city, state, and zip
fields. These nested types only work for 1 (parent) to many (child)
relationships. There are also a lot of “gotchas” here. For example,
with parent fields, all members of a relationship must fit onto one
shard in your index. And with nested fields, updates may be extremely
slow, since changing any field in the nest has a cost. Solr does not
support nested typing; the document structure must be flat. The fact
that these options exist in ElasticSearch is very cool, but you have to
be very careful with how you use them.
http://www.elasticsearch.org/guide/reference/mapping/nested-type/
http://www.elasticsearch.org/guide/reference/mapping/object-type/
http://www.elasticsearch.org/guide/reference/mapping/parent-field/
Winner: ElasticSearch
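To picture the difference, here is the address example as a nested document, along with one common workaround for a flat schema: flattening the nest into prefixed field names. The underscore naming convention and the sample values are assumptions for illustration, not anything Solr prescribes.

```python
from typing import Dict


def flatten(doc: Dict[str, object], prefix: str = "", sep: str = "_") -> Dict[str, object]:
    """Flatten nested dictionaries into a single level of prefixed keys."""
    flat: Dict[str, object] = {}
    for key, value in doc.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the nested object, carrying the path as a prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    nested = {
        "name": "Ryan",
        "address": {
            "home": {"street": "1 Main St", "city": "Springfield", "state": "IL", "zip": "62701"},
            "work": {"street": "9 Market St", "city": "Chicago", "state": "IL", "zip": "60601"},
        },
    }
    for field, value in flatten(nested).items():
        print(field, "=", value)
```

Flattening works for a fixed nest like home/work, but it falls apart for an open-ended list of children, which is where real nested types earn their keep.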
Queries
Query Syntax
Solr’s query syntax is built from field:value pairs, using parentheses to group and nest clauses. For example:
q=((name:ryan* AND haircolor:brown) OR interest:zombies) OR (job:engineer*)
ElasticSearch uses JSON. For example, here is an ElasticSearch query:
{
    "bool" : {
        "must" : {
            "term" : { "user" : "kimchy" }
        },
        "must_not" : {
            "range" : {
                "age" : { "from" : 10, "to" : 20 }
            }
        },
        "should" : [
            {
                "term" : { "tag" : "wow" }
            },
            {
                "term" : { "tag" : "elasticsearch" }
            }
        ],
        "minimum_should_match" : 1
    }
}
Winner: Users
Distributed Group By
Solr supports distributed group by (including grouped
sorting, filtering, faceting, etc.); ElasticSearch does not. This feature
seems like a no-brainer for almost any search application, which is
why I call it out specifically here.
Winner: Solr
Percolation Queries
ElasticSearch allows you to register queries that
generate notifications when newly indexed documents match them. This
is really great for things like alerts. Each document that is
indexed is run against every percolated query, and if it
is returned by one of them, an alert is
sent out. This may cause performance
issues if you have too many percolated queries.
Winner: ElasticSearch
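The idea behind percolation, and the reason its cost grows with the number of registered queries, fits in a few lines. This is a toy model only, not ElasticSearch's percolator API: the class name, the method names, and the use of plain Python predicates as "queries" are all assumptions for the sketch.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Predicate = Callable[[Document], bool]


class ToyPercolator:
    """Stores named queries and tests each incoming document against all of them."""

    def __init__(self) -> None:
        self._queries: Dict[str, Predicate] = {}

    def register(self, name: str, predicate: Predicate) -> None:
        """Register a query (here just a predicate) under a name."""
        self._queries[name] = predicate

    def percolate(self, doc: Document) -> List[str]:
        """Return the names of all registered queries the document matches.

        Every registered query is evaluated, so the work per indexed
        document grows linearly with the number of queries.
        """
        return [name for name, matches in self._queries.items() if matches(doc)]


if __name__ == "__main__":
    percolator = ToyPercolator()
    percolator.register("zombie-alert", lambda d: d.get("interest") == "zombies")
    percolator.register("engineers", lambda d: str(d.get("job", "")).startswith("engineer"))
    # Each matching name is where an alert would be sent.
    print(percolator.percolate({"name": "ryan", "interest": "zombies", "job": "engineer"}))
```

With a handful of queries this is cheap; with hundreds of thousands, every single indexed document pays for all of them.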
Community
Users
ElasticSearch is still fairly new but its community is
growing very quickly. Solr has been around for much longer and therefore
has a larger user base.
Winner: Solr
Vendor Support
MapR, Cloudera, and DataStax have all chosen Solr for their
search technology. InfoChimps is using ElasticSearch. I haven’t heard
any word on whether HortonWorks is even looking into search at this point.
LucidWorks employs many of the Solr committers and provides an enterprise
Solr product with more features, while the ElasticSearch company provides most of
the support for its own product. Think Big also supports Solr and
ElasticSearch, especially when it comes to integrating these
technologies with big data. I see DataStax and Cloudera as thought
leaders in this area, which is why I give the win to Solr.
Winner: Solr
Conclusion
So ElasticSearch won four categories and Solr
won four. Regardless of how the counts were going to end up, I never
wanted to say that ElasticSearch is better than Solr or Solr is better
than ElasticSearch. At the end of the day Solr and ElasticSearch are
very close to each other in feature sets, and it would be really
difficult to choose one over the other without knowing
the exact requirements your organization has.
Origin post: https://thinkbiganalytics.com/solr-vs-elastic-search/