Question

I'm trying to bootstrap a product, but money is tight, so I would like to keep server costs as low as possible.

My requirement is that I need to index millions of records in elasticsearch that keep coming in at a rate of 20 records per second. I also need to run search queries and percolate queries often. I currently have a basic DigitalOcean droplet serving the website, which also hosts the elasticsearch node. It has a mere 512 MB of RAM, so I often run into out-of-heap-memory errors and elasticsearch becomes non-responsive.

I have a few computers at home to spare. What I would like to do is set up a master elasticsearch node on my home network, which will index all the data and also handle the percolate queries. It will push periodic updates to a slave elasticsearch node on the web server, and the slave node will handle the search queries. Is this setup possible?

If it is not possible, what is the minimum RAM I would need in the current scenario to keep elasticsearch happy?

Will indexing in bulk (like 100 documents at a time) instead of one document at a time make a difference?
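
(For context, a rough sketch of what I mean by bulk indexing, using the standard _bulk endpoint; the "records" index and the field names are just placeholders, and the batching-every-100-documents logic is the pattern I have in mind.)

```python
# Sketch: send 100 documents in a single _bulk request instead of 100
# separate index calls. Index/type names here are hypothetical.
import json
import requests

ES_URL = "http://localhost:9200"

def bulk_index(docs):
    # The _bulk endpoint expects newline-delimited JSON:
    # an action line followed by the document source line.
    lines = []
    for doc in docs:
        # Older ES versions require a "_type"; drop it on 7.x and later.
        lines.append(json.dumps({"index": {"_index": "records", "_type": "record"}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"
    # Newer versions require the NDJSON content type; older ones ignore it.
    resp = requests.post(ES_URL + "/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
    return resp.json()

# Buffer incoming records and flush in batches of 100.
buffer = [{"id": i, "text": "example %d" % i} for i in range(100)]
print(bulk_index(buffer)["errors"])
```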

Will switching to Sphinx make a difference for my use case?

(The reason I chose elasticsearch over Sphinx is: 1. Elasticsearch has a flexible document schema, which is an advantage as the product is still in the defining phase. 2. The percolate feature in Elasticsearch, which I use heavily.)

Thank you very much.


Solution

You can manually set up something similar to master/slave replication using the Elasticsearch Snapshot and Restore mechanism:

Snapshot And Restore

The snapshot and restore module allows to create snapshots of individual indices or an entire cluster into a remote repository. At the time of the initial release only shared file system repository was supported, but now a range of backends are available via officially supported repository plugins.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html

Snapshot and Restore lets you back up individual indices or an entire cluster to a repository such as a shared file system (Amazon S3 and Microsoft Azure repositories are available via plugins) and then restore them. You could take periodic snapshots of the index on your home Elasticsearch cluster and restore them to your search cluster in the cloud. You can control all of this via the normal REST API, so you could make it happen automatically on a schedule.
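
As a rough sketch of what that automation could look like (the hostnames, repository name, and "fs" location below are assumptions; the repository just has to be reachable by both clusters, e.g. an S3 bucket via the cloud plugin):

```python
# Driving snapshot/restore over the REST API; run this on a schedule (cron).
import datetime
import requests

HOME_ES = "http://home-es:9200"      # indexing cluster at home (assumed hostname)
CLOUD_ES = "http://droplet-es:9200"  # search cluster on the web server (assumed)
REPO = "my_backup"                   # hypothetical repository name

def register_repo(es_url):
    # The same repository definition must be registered on both clusters.
    requests.put("%s/_snapshot/%s" % (es_url, REPO), json={
        "type": "fs",                              # or "s3"/"azure" with the plugin
        "settings": {"location": "/mnt/es_snapshots"},
    }).raise_for_status()

def push_snapshot():
    name = "snap_" + datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
    # Take the snapshot on the home cluster and wait for it to finish.
    requests.put("%s/_snapshot/%s/%s?wait_for_completion=true"
                 % (HOME_ES, REPO, name)).raise_for_status()
    # Restore it on the cloud cluster (the target index must be closed or absent).
    requests.post("%s/_snapshot/%s/%s/_restore"
                  % (CLOUD_ES, REPO, name)).raise_for_status()

register_repo(HOME_ES)
register_repo(CLOUD_ES)
push_snapshot()
```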

That addresses the indexing portion of your performance problem, provided you have sufficient resources on your home network (servers with enough memory, and a connection with enough upload capacity to get your index pushed to the cloud).

Regarding query performance, you need as much memory as you can get. Personally I'd look at some of Amazon's EC2 memory-optimized instances, which provide more memory at the expense of disk or CPU, as many ES installations (like yours) are primarily memory bound.

I'd also suggest something I've done when dealing with heap issues: a short script that watches the log file for heap errors and, when they occur, restarts Jetty or Tomcat or whatever container you are running Elasticsearch in. Not a solution, but it certainly helps when ES dies in the middle of the night.
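
Something along these lines is what I mean (the log path and restart command are assumptions; adjust them for your install):

```python
# Watchdog sketch: follow the ES log and restart the service when an
# OutOfMemoryError appears.
import subprocess
import time

LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"   # assumed location
RESTART_CMD = ["service", "elasticsearch", "restart"]   # assumed init script

def follow(path):
    # Generator yielding new lines appended to the log file.
    with open(path) as f:
        f.seek(0, 2)  # jump to the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

for line in follow(LOG_PATH):
    if "OutOfMemoryError" in line:
        subprocess.call(RESTART_CMD)
        time.sleep(60)  # give the node a minute before watching again
```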

Other tips

Elasticsearch is fantastic at indexing millions of records, but it needs lots of memory to be efficient. Our production servers have 30 GB of memory pinned just for ES. I don't see any way you can index millions of records and expect reasonable response times with 512 MB.

Perhaps look into using Azure or EC2 to keep your costs down.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow