Upgrade to Pro — share decks privately, control downloads, hide ads and more …

500 Billion Documents & Counting: Scaling Elast...

Elastic Co
March 18, 2015
31k

500 Billion Documents & Counting: Scaling Elasticsearch for Production

First deployed as a proof of concept in mid-2012, Verizon Business moved fully into production with Elasticsearch in mid-2013 and has continued to push forward ever since. Bhaskar will take you through this entire history - including a peek inside the architecture handling over 500 Billion documents - with a look forward at the next year for Verizon and Elasticsearch.

Presented by Bhaskar Karambelkar, Verizon Enterprise Solutions

Elastic Co

March 18, 2015
Tweet

Transcript

  1. { } CC-BY-ND 4.0 Introduction • Security Data Scientist /

    Tech Lead @ Verizon. • Avid Elasticsearch user and advocate since 2012. • Interested in information security and data analytics. 2
  2. { } CC-BY-ND 4.0 Why Elasticsearch ? • Logs stayed

    on disks. • No easy way to fetch/search logs. • Hard to scale. • Untapped potential in log data DRUM ROLLS…… Log Management. 3 Before Elasticsearch Disks RDBMS Logs Events Incidents Tickets
  3. { } CC-BY-ND 4.0 Expectations • Must be able to

    store massive amounts of data. • Must be able to get data in at a very high rate and get data out at an acceptable rate. • Should be schema agnostic. • Must be able to search/filter/analyze data. • Must support MULTI-TENANCY. • Distributed/fault-tolerant/load-balanced/etc. 4
  4. { } CC-BY-ND 4.0 Progress 5 Jul  ‘13 Sept  ‘13

    Nov’  13 Dec  ‘14 Boxes/Nodes 14/14 28/28 28/56 128/128 Cores/RAM/DISK 12/128GB/ 3TBx12 8/64GB/1TBx6 AVG  DAILY  VOLUME 500  M 1  B. 2.5/3  B 10+  B TOTAL  VOLUME ~  10  B. ~  100  B. ~  200  B >  500  B  &  counting  
  5. { } CC-BY-ND 4.0 Know your data and platform •

    Volume / Velocity / Variety / Veracity will affect your choices and their interactions even more so. • Select a proper base config for your nodes. Get the CPU- Cores x RAM x Disk ratio right. • Decide on self hosted vs. cloud hosted. • Prefer JBODs for data disks over RAID, SAN/NAS. • Virtualization vs. bare metal: know the tradeoffs. • SSDs vs. Spinning Disks, (Speed vs. Capacity). 6
  6. { } CC-BY-ND 4.0 The Must DO’s • Change Cluster

    name. • Dedicated Master, Data and Client nodes. • Use Aliases from get go. • Keep all nodes in same subnet. Use unicast discovery. • Have enough memory for JVM heap + FS Cache. • Tune kernel parameters, user/process/file/network limits. • Always CHECK JVM version compatibility. Also, stick to Oracle JVM. • Learn Query DSL and Elasticsearch APIs. 7
  7. { } CC-BY-ND 4.0 The Should DO’s • Tune JVM

    params, but avoid going overboard. • Tune network/connectivity parameters. • Tune recovery parameters. • Configure gateway parameters. • Tune thread pools: Bulk/Index/Search. • Prefer bulk indexing. • Tune caching parameters, especially field data. • In general don’t be afraid to tweak-n-tune till you hit performance sweet spot. • Having a knowledge of text analytics in the team will go a long way. 8
  8. { } CC-BY-ND 4.0 The Do NOTs • Avoid running

    Elasticsearch along with another service on the same box. • Avoid vertical scaling i.e. avoid 2+ nodes per box. • Don’t grow cluster beyond ~150 nodes. Deploy multiple clusters and use Tribe node. • Don’t allow unrestricted/unsupervised querying unless you know the user base. • Never send data for indexing or search queries to Master nodes. 9
  9. { } CC-BY-ND 4.0 Compared to Hadoop • In terms

    of scalability & design Elasticsearch ≠ Hadoop • Not inferior, just different. • Bulkier nodes for Hadoop, leaner for Elasticsearch. • Scale Elasticsearch horizontally, never vertically. • Load characteristics, (CPU/Mem/IO), differ. 10
  10. { } CC-BY-ND 4.0 Indexing • Bulk indexing with refresh

    interval = -1. Send data directly to data nodes. • Set aside more resources for bulk thread pool. • We saw slight performance gain when using raw TCP over HTTP. • Build new index per month/week/day/hour(?) and use aliases. • Number of Indexes x Shards x Replicas not only affects storage but also Memory. 11
  11. { } CC-BY-ND 4.0 Indexing cont. • Don’t use a

    lot of “types” in a single index. • Use mappings/templates for pre-defining field types. • Disable ‘_all’ field unless really needed. Avoid ‘storing’ fields outside of ‘_source’. • Know which fields to NOT index. Decide which analyzer/ token-filter/char-filter works best. 12
  12. { } CC-BY-ND 4.0 Searching • Use filters. • Avoid

    ‘query string’ query. • Know difference between “bool” and ‘and/or/not’ filters. • Know the impact of faceting/aggregations/sorting on field data cache for high cardinality fields like “timestamp”. • For bulk searching use ‘scroll’. • Rely on explain/validate queries for performance tuning. • Search Templates for simplified queries. • Send searches to client nodes. Not to data nodes and never ever to Master nodes. 13
  13. { } CC-BY-ND 4.0 Monitoring / Management • For monitoring

    we prefer Nagios, but Marvel works really well too. • Use automated deployment / configuration management via Chef/Puppet/Salt/Ansible. • Prefer to share a single config file for all node types. • We retain raw data in HDFS for a year in case of data loss / re- indexing required. Data in Elasticsearch retained for max 90 days. • Know when to use rolling upgrades vs full upgrades. 14
  14. { } This work is licensed under the Creative Commons

    Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons PO Box 1866 Mountain View, CA 94042 USA CC-BY-ND 4.0