Tuesday, 16 October 2018

ElasticSearch

Elasticsearch is a full-text search and analytics engine. It is robust, highly available and distributed in nature. It supports aggregations, log analysis, geo-location data and machine learning.
Data is stored as documents. A document is similar to a row in an RDBMS.

Elastic Stack
  • Kibana -> Analytics and visualization platform
  • Logstash -> Data processing pipeline
  • X-Pack -> Security, monitoring, ML, graph, SQL for querying documents
  • Beats -> Lightweight data shippers

Sharding

  • Divides an index into smaller pieces. Each piece is called a shard
  • Sharding is done at the index level
  • This allows the data volume to scale horizontally
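
To make the idea of spreading an index across shards concrete, here is a minimal sketch of how a document could be assigned to a shard. Elasticsearch hashes a routing value (the document ID by default) and takes it modulo the number of primary shards. The function name pick_shard, the five-shard index and the CRC32 hash are assumptions for illustration only; Elasticsearch itself uses a Murmur3 hash.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (by default the document id) to a shard number."""
    # Stand-in hash: Elasticsearch itself uses Murmur3, not CRC32.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


NUM_SHARDS = 5  # hypothetical index with five primary shards

# Distribute twenty hypothetical document ids over the shards.
shards = {n: [] for n in range(NUM_SHARDS)}
for doc_id in (str(i) for i in range(1, 21)):
    shards[pick_shard(doc_id, NUM_SHARDS)].append(doc_id)

for n, ids in shards.items():
    print(f"shard {n}: {ids}")
```

Because the shard is derived from the number of primary shards, that number cannot simply be changed after the index is created without moving documents around.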
Full-text search
  • Searches the terms of all documents
  • Content is parsed and stored beforehand
  • Comparable to a Google search
Analytics
  • Search is zooming in -> finding a needle in a haystack.
  • Analytics is:
    • Opposite of search
    • Zooming out and looking at a bigger picture

Components of the Elastic Stack
  • Logstash
    • Helps centralize event data such as logs and metrics in any format
    • It can transform the data before sending it on to a stash (output)
    • It is a server-side component. Its role is to centralize data from various input sources, transform it and forward it to an output.
  • Beats
    • Open source lightweight data shippers
    • It is a client-side component and its role is complementary to Logstash.
    • It is built on a core library, libbeat, which provides APIs to
      • ship data from the source
      • configure input options
      • implement logging
    • The Elastic team builds various Beats such as Packetbeat, Filebeat, Metricbeat, Winlogbeat, Auditbeat and Heartbeat
  • Kibana
    • Visualization tool of the Elastic Stack
    • Helps to gain powerful insights about data. It is called the window into the Elastic Stack
    • It offers many visualizations such as histograms, maps, line charts, time series and more
    • Offers management tools
      • Manage settings and configure X-Pack security settings
    • Offers development tools
      • Build and test REST API requests
  • X-Pack
    • It adds security, monitoring, alerting, reporting and graph capabilities
    • Security
      • Authentication and authorization
      • Secure access to Elasticsearch and Kibana
      • The extension helps to configure field-level and document-level security
    • Monitoring
      • Monitor cluster, node and index-level metrics
      • Plugin that maintains performance history.
    • Graph
  • Elastic Cloud 


How does it work: When a document is added to an Elasticsearch index, an inverted index is built by breaking the document down into terms, a form optimized for search. Once the inverted index is built, the document is ready for search.
  • It indexes all fields of the document
  • Other optimizations make it lightning fast
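
The inverted index described above can be sketched in a few lines of Python. This is an illustration of the general technique, not Lucene's actual data structure: the analyze function is a crude stand-in for a real analyzer, the sample documents are made up, and the search helper requires every query term to be present, which is stricter than the default behaviour of a match query.

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase and split on non-alphanumerics (a crude stand-in for an analyzer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_inverted_index(docs):
    """Return {term: set of ids of the documents containing that term}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents, keyed by id.
docs = {1: "Movie one", 2: "Movie two", 3: "A second movie about two friends"}
index = build_inverted_index(docs)
print(sorted(search(index, "movie two")))  # prints [2, 3]
```

A lookup only touches the posting sets of the query terms, which is why search stays fast even when the number of documents grows.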

Use Cases: 
  • Uber - marketplace dynamics
  • Salesforce - log analysis for usage trends
  • eBay - search through 800 million+ listings
  • New York Times - search through 164 years of publications
Command to start an Elasticsearch cluster
  • bin\elasticsearch
  • Listens on localhost:9200
Kibana
  • bin\kibana
  • Listens on localhost:5601
DataTypes
  • String 
    • text
      • useful for full-text search on fields such as a description
      • fields are analyzed before indexing
    • keyword
      • enables analytics on string fields
      • these fields support sorting, filtering and aggregations
Dynamic Mapping 
  • Elasticsearch infers the data types of all fields when the first document is indexed into a type that does not yet exist
  • GET /catalog/_mapping/product
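
As a rough illustration of dynamic mapping, the sketch below guesses a field type from the JSON value of the first document. It is a simplification under stated assumptions: the function names are hypothetical, date detection is reduced to the yyyy-MM-dd pattern only, and the keyword sub-field that Elasticsearch adds to dynamically mapped strings is left out.

```python
import re

# Simplification: only plain yyyy-MM-dd strings are treated as dates.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def infer_type(value):
    """Guess a field type from a JSON value, loosely following dynamic mapping."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"  # whole numbers are mapped as long
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if DATE_RE.match(value) else "text"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def infer_mapping(doc):
    """Return {field name: inferred type} for a flat document."""
    return {field: infer_type(value) for field, value in doc.items()}


doc = {"name": "Movie one", "actor_count": 10, "date": "2015-02-10"}
print(infer_mapping(doc))
```

Note that the inferred type for a whole number is long, so an explicit mapping is the way to get a narrower type such as integer.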
CRUD Operations
  • Adding or creating a document in a type within an Elasticsearch index is called an indexing operation.
  • PUT /my_movies/movie/1/_create
    {
      "name":"Movie one",
      "actor_count":10,
      "date":"2015-02-10"
    }
Elasticsearch APIs
  • Document
    • Query that matches all documents from all indices of the cluster (returns 10 hits by default)
      • GET /_search
    • All documents in one index, one type, several indices or all indices
      • GET /catalog/_search
      • GET /catalog/product/_search
      • GET /catalog,my_index/product/_search
      • GET /_all/product/_search

  • Search
  • Aggregations
  • Indices
  • Cluster
  • Cat
What is an Index:
  • It's a logical namespace that points to one or more shards in an Elasticsearch cluster.
  • It is the place where data is stored in the form of documents
  • An index is broken into shards, and shards are containers of data
  • By default an index has 5 primary shards, each with one replica (the 6.x default; Elasticsearch 7.0 and later default to 1 primary shard)
Relational DB  ------------------  Elasticsearch
  Database     ------------------    Index
  Table        ------------------    Type
  Row          ------------------    Document
  Column       ------------------    Field

Create & Drop Index
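  • Create: curl -XPUT http://localhost:9200/my_test_index
  • Inspect: curl -XGET http://localhost:9200/my_test_index
  • Drop: curl -XDELETE http://localhost:9200/my_test_index
Type
  • It is a representation of a class of similar documents
  • Type is optional
  • e.g. index/type/document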
Index creation with type
________________
PUT my_movies
{
  "mappings":{
    "movie":{
      "properties":{
        "name":{"type":"text"},
        "actor_count":{"type":"integer"},
        "date":{"type":"date"}
      }
    }
  }
}

Adding document
_________________
PUT /my_movies/movie/1/_create
{
  "name":"Movie one",
  "actor_count":10,
  "date":"2015-02-10"
}

DELETE /my_movies/movie/1

Queries: match/term

GET /my_movies/movie/_count
{
  "query": {
    "match":{
      "name": "Movie two"
    }
  }
}

Query DSL is a powerful JSON-based language for executing queries in Elasticsearch. It supports two kinds of clauses.
  • Leaf query
    • Match, term or range queries, which search for a given value in a given field
      • Match
      • Term (term-level queries)
        • Exists - Documents where the field is not null
        • Type - Matches documents based on mapping type
        • Range - Documents whose field value falls within a range of values
  • Compound query
    • Combines leaf queries and other compound queries
Query Context - Matches documents and calculates a _score
GET /my_movies/movie/_search?explain
{
  "query": {
    "match":{
      "name": "Movie two"
    }
  }
}

Filter Context - seeks a yes or no answer to whether a document matches
GET /my_movies/movie/_search
{
  "query": {
    "bool":{
      "must":[{"match":{"name": "Movie two"}}],
      "filter":[{"term":{"actor_count": 10}}]
    }
  }
}
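
The same bool query can be assembled and sent from a program. The sketch below uses only the Python standard library; the helper names bool_query and search are hypothetical, and the base URL assumes a local node on port 9200 as in the start-up notes. The search function is defined but not called here, since it needs a running cluster.

```python
import json
import urllib.request


def bool_query(must, filters):
    """Build a bool query: 'must' clauses are scored, 'filter' clauses are yes/no."""
    return {"query": {"bool": {"must": must, "filter": filters}}}


def search(base_url, path, body):
    """POST the body to a _search endpoint and return the parsed JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = bool_query(
    must=[{"match": {"name": "Movie two"}}],      # query context: contributes to _score
    filters=[{"term": {"actor_count": 10}}],      # filter context: match or no match
)
print(json.dumps(body, indent=2))
```

With a node running, a call such as search("http://localhost:9200", "/my_movies/movie/_search", body) would return the response as a Python dict.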

Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.
  • Supervised Learning - Labeled data (tables)
    • For organized/structured data
    • Used for predictions
  • Unsupervised Learning
    • Analyses unstructured data
    • e.g. weblogs, to determine anomalies and more
    • Elasticsearch uses unsupervised learning
  • Semi-supervised Learning
    • Uses both labeled and unlabeled data to create models.
Machine Learning usage
  • Anomaly detection
  • Predictive analytics - e.g. house prices
  • Grouping (clustering)
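
To give a feel for unsupervised anomaly detection on something like weblog traffic, here is a minimal sketch that flags values lying far from the mean. It uses a plain z-score, which is a swapped-in textbook technique and not the model that the Elastic machine learning feature builds; the function name, the threshold and the request counts are made up for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical requests-per-minute counts taken from a weblog, with one spike.
requests_per_minute = [120, 118, 125, 122, 119, 121, 123, 117, 124, 120,
                       122, 950, 121, 119, 120, 123, 118, 122, 121, 119]

print(find_anomalies(requests_per_minute))
```

No labels are needed: the data itself defines what is normal, which is the sense in which the approach is unsupervised.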
Elasticsearch and machine learning
  • Makes model building easier
  • Intuitive UI
  • Easy to feed data and update the model
  • Works in tandem with Elasticsearch
Machine Learning steps:
  • Data - Get the data
  • Train the model
  • Feed data


Shards
  • Partitions of index.
Term, Range & Boosting
  • Term searches for the exact term given in the query. It is most effective when querying keyword fields for exact matches. (Keyword fields are not analyzed; text fields are analyzed.)
  • Range retrieves documents whose values fall within a range for the selected fields.
  • Boosting is used to give one query more weight relative to another.
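
The difference between analyzed text fields and non-analyzed keyword fields is easiest to see with a small simulation. This is a sketch under assumptions: analyze is a crude stand-in for the standard analyzer, and term_query and match_query are hypothetical helpers that only mimic the lookup behaviour, not real client calls.

```python
import re


def analyze(text):
    """Lowercase and split on non-alphanumerics (a crude stand-in for an analyzer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


doc_value = "Movie Two"

# What ends up in the index for each field type.
text_field_terms = set(analyze(doc_value))  # analyzed: {"movie", "two"}
keyword_field_terms = {doc_value}           # not analyzed: {"Movie Two"}


def term_query(indexed_terms, value):
    """Exact lookup: the query value is NOT analyzed."""
    return value in indexed_terms


def match_query(indexed_terms, value):
    """The query value is analyzed first; any matching term counts as a hit."""
    return any(term in indexed_terms for term in analyze(value))


print("term on keyword field:", term_query(keyword_field_terms, "Movie Two"))
print("term on text field:   ", term_query(text_field_terms, "Movie Two"))
print("match on text field:  ", match_query(text_field_terms, "Movie Two"))
```

The term lookup against the text field fails because the index holds the lowercased tokens rather than the original string, which is why term is best reserved for keyword fields.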
Aggregations:
  • Metrics aggregations compute a value over a field in a set of documents, for example the average of a numeric field
    • Cardinality: Single-value metric that counts distinct values.
    • Extended stats
    • Geo aggregations such as geo bounds use longitude and latitude data from a set of documents to calculate a box that encloses all lon/lat locations
  • Bucket aggregations group the documents matched by a search into buckets, giving a distribution of the results.
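
What these aggregations compute can be reproduced on a handful of in-memory documents. The sketch below is plain Python, not the aggregation API: the sample movies and the genre field are made up, and the distinct count is exact here whereas the cardinality aggregation in Elasticsearch returns an approximate count.

```python
from collections import Counter
from statistics import mean

# Hypothetical documents; "genre" is an invented field for the example.
movies = [
    {"name": "Movie one", "actor_count": 10, "genre": "drama"},
    {"name": "Movie two", "actor_count": 25, "genre": "comedy"},
    {"name": "Movie three", "actor_count": 7, "genre": "drama"},
    {"name": "Movie four", "actor_count": 32, "genre": "action"},
    {"name": "Movie five", "actor_count": 12, "genre": "comedy"},
]

# Metrics aggregation: one number computed over a numeric field.
avg_actor_count = mean(m["actor_count"] for m in movies)

# Cardinality: number of distinct values (exact here, approximate in Elasticsearch).
distinct_genres = len({m["genre"] for m in movies})


def histogram(values, interval):
    """Bucket aggregation: count values per fixed-width bucket, keyed by bucket start."""
    buckets = Counter((v // interval) * interval for v in values)
    return dict(sorted(buckets.items()))


actor_histogram = histogram([m["actor_count"] for m in movies], interval=10)

print("avg actor_count:", avg_actor_count)
print("distinct genres:", distinct_genres)
print("histogram:", actor_histogram)
```

Metrics aggregations collapse the documents into a single value, while bucket aggregations keep the grouping so further metrics can be computed per bucket.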
Elastic Stack - It is a full search and analytics stack. Elasticsearch is at the heart of the Elastic Stack, providing storage, search and analytical capabilities. It is built on a radically different technology, Apache Lucene.

Components of Elastic Stack:
  • Logstash - Centralizes data from input sources, then transforms and forwards it to an output.
  • Beats - Complementary to Logstash. A client-side component that provides APIs to ship data from the source, configure input options and implement logging.
  • Elasticsearch - Provides storage, search and analytical capabilities.
  • Kibana - Visualization tool of the Elastic Stack; the window into the Elastic Stack. It also has management tools to manage settings and X-Pack security features, and development tools to build and test REST API requests.
  • X-Pack - Adds security, monitoring, alerting, reporting and graph capabilities to the Elastic Stack.
    • Security - Authentication and authorization capabilities for Elasticsearch and Kibana.
    • Monitoring - Monitors cluster, node and index-level metrics.
    • Reporting -
    • Alerting -
    • Graph -
  • Elastic Cloud - 

Elasticsearch (distributed, RESTful search and analytics) is an analytics engine designed to scale horizontally.
  • Key Characteristics
    • Search, Index and analyze data
    • Language agnostic
    • Built-in machine learning 
    • Scalable, Highly available, Distributed
  • Goals
    • Lightning fast search
    • Analytics Engine
    • Near Real-time
      • When a document is added to an Elasticsearch index, an inverted index is created. Once the inverted index is created, the document is available for search.
    • Powerful REST API
  • Features
    • Aggregations
    • Log Analysis
    • Geo-location data analysis
    • Machine learning
  • Installing Elasticsearch
    • Download Elasticsearch from elastic.co
    • Extract the downloaded file into a directory
    • Change to the Elasticsearch directory
    • Run the command to start the cluster
  • Installing Kibana
    • Download Kibana from elastic.co
    • Extract the downloaded file into a directory
    • Edit the Kibana configuration file
    • Change to the Kibana directory and start it
What is an Index
    • It is a logical namespace that points to one or more shards (partitions, or containers of data) in an Elasticsearch cluster
    • It is the place where data is stored in the form of documents
    • An index is broken into shards, and shards are containers for data. The default number of shards in an index is 5.


Docs in Elasticsearch (a doc is like a row in an RDBMS table)
  • A document is an individual entry and the primary unit for adding data.
  • A type is a representation of a class of similar documents (a table in an RDBMS).
  • Type is optional in the Elastic world
What is a cluster
  • A cluster is one or more Elasticsearch instances running on a given network
  • A node is an Elasticsearch instance. Nodes can handle HTTP and transport protocols.

Shards and replicas
  •  
  1. Bulk API

ElasticSearch APIs
  • Document 
    • pretty=true (query parameter that pretty-prints the JSON response)
  • Search
  • Aggregation
  • Indices
  • Cluster
  • Cat

 
