Elasticsearch: Installation and Use with CodeIgniter

1 Elasticsearch

Basic concepts

Let's look at the main concepts of Elasticsearch:

  • Cluster: A collection of nodes (servers) that together hold all the data.
  • Node: A single server that stores part of the data and takes part in the cluster's indexing and querying.
  • Index: Forget SQL indexes. An ES index is a collection of documents.
  • Shards: A subset of the documents of an index. An index can be split into multiple shards.
  • Type: A definition of the schema of a document inside an index. Before Elasticsearch 6.0 an index could hold several types; indices created in 6.x allow only one.
  • Document: A JSON object holding some data. This is the basic unit of information in ES.
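To make the relationship between index, type and document concrete, here is a minimal sketch of how the three names combine into the REST address of a single document. The helper name doc_url is made up for illustration; the index, type and id values are the ones used in the indexing example later in this post.

```shell
# doc_url HOST INDEX TYPE ID
# Prints the REST address of one document (hypothetical helper, not part of Elasticsearch).
doc_url() {
  printf 'http://%s/%s/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Example: the document with id 1, of type "video", in the index "hoathinhnet".
doc_url localhost:9200 hoathinhnet video 1
```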

1 Concepts and usage

Elasticsearch is an open-source full-text search engine. It lets you store and search data in near real time. Full-text search in Elasticsearch is typically considerably faster than in an SQL database. You can also search for a phrase, and the engine returns the results within a few seconds, depending on how large the Elasticsearch data set is.

  • Elasticsearch is a search engine.
  • Elasticsearch is built to run as a server that exposes a RESTful interface.
  • It is built on top of Apache Lucene.
  • It is written in Java.
  • It is open-source software released under the Apache License.
  • It is similar to Solr (Apache).
  • Elasticsearch can be integrated with applications written in the following languages:
    • Java, JavaScript
    • Groovy, .NET
    • PHP, Perl
    • Python, Ruby
  • Who uses Elasticsearch:
    • Mozilla, Quora
    • SoundCloud, GitHub
    • Stack Exchange
    • Center for Open Science
    • Reverb, Netflix

When to use Elasticsearch: when visitors to your website rely mainly on search, for example a large e-commerce site, a music site such as mp3.zings.vn, or simply a movie site such as phim.nhatvl.com.

2: Usage model

Allow me to borrow a diagram from mastercode to give you a concrete picture.

3 : Installing Elasticsearch on CentOS 7

First, install Java on CentOS 7:

sudo yum install java-1.8.0-openjdk.x86_64

Once it finishes, verify the installation with the command:

java -version
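Elasticsearch 6.x needs Java 8 or newer, so it can be useful to turn the version string into a plain number before going on. The following is only a sketch: java_major is a made-up helper, and it assumes the two common version formats ("1.8.0_201" for Java 8, "11.0.2" for later releases).

```shell
# java_major VERSION_STRING
# Prints the major Java version for strings such as 1.8.0_201 or 11.0.2.
java_major() {
  case "$1" in
    1.*) rest=${1#1.}; printf '%s\n' "${rest%%.*}" ;;  # old scheme: 1.8.x means Java 8
    *)   printf '%s\n' "${1%%.*}" ;;                   # new scheme: 11.x means Java 11
  esac
}

# Possible usage (java -version writes to stderr, hence 2>&1):
# v=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p')
# [ "$(java_major "$v")" -ge 8 ] || echo "Java 8 or newer is required" >&2
```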

Next, download Elasticsearch to the server in one of two ways:

Option 1: download the .rpm file from https://www.elastic.co/downloads/elasticsearch and upload it to the VPS.

Option 2: download it directly with the command:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.7.0.rpm

Then run:

sudo rpm -ivh elasticsearch-6.7.0.rpm
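The two commands above can be wrapped in a small script that also verifies the download before installing it. This is a sketch under assumptions: the function names and the ES_VERSION variable are made up, and it relies on Elastic publishing a .sha512 checksum file next to the RPM at the same URL. Nothing is downloaded until install_es is called.

```shell
ES_VERSION="${ES_VERSION:-6.7.0}"

# Prints the download URL of the RPM for a given version.
rpm_url() {
  printf 'https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-%s.rpm\n' "$1"
}

# Downloads the RPM and its checksum, verifies it, then installs it.
install_es() {
  url=$(rpm_url "$ES_VERSION")
  wget "$url" &&
  wget "$url.sha512" &&
  sha512sum -c "elasticsearch-$ES_VERSION.rpm.sha512" &&
  sudo rpm -ivh "elasticsearch-$ES_VERSION.rpm"
}

# Run the installation explicitly when you are ready:
# install_es
```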

As a result, Elasticsearch is installed in /usr/share/elasticsearch/, the configuration files live in /etc/elasticsearch, and the init script is at /etc/init.d/elasticsearch.

To have Elasticsearch start and stop together with the VPS, run the following command:

sudo systemctl enable elasticsearch.service

Configure Elasticsearch in the file /etc/elasticsearch/elasticsearch.yml

Uncomment and edit the following lines:

node.name: "My First Node"
cluster.name: mycluster1
network.bind_host: localhost

Note: choose names that are easy to remember. Here network.bind_host is set to localhost only. For a larger deployment you should move Elasticsearch to a dedicated server and enter that server's IP address here.
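If you prefer not to edit the file by hand, the same change can be scripted. The sketch below is one possible approach, not the only one: set_setting is a made-up helper that removes any active line for a key, leaves the commented examples alone, and appends the new value at the end of the file. It assumes GNU sed (as shipped with CentOS) for the -i option.

```shell
# set_setting FILE KEY VALUE
# Ensures FILE contains exactly one active "KEY: VALUE" line.
set_setting() {
  file=$1
  key=$2
  value=$3
  # Escape the dots in the key so that sed matches them literally.
  pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
  # Delete active (uncommented) definitions of the key; commented lines stay.
  sed -i -E "/^${pattern}:/d" "$file"
  # Append the new definition, with the required space after the colon.
  printf '%s: %s\n' "$key" "$value" >> "$file"
}

# Possible usage (needs write access to the file):
# set_setting /etc/elasticsearch/elasticsearch.yml cluster.name mycluster1
# set_setting /etc/elasticsearch/elasticsearch.yml network.bind_host localhost
```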

That is all you need if you are installing only a single Elasticsearch server.

The following configuration is only for websites with extremely heavy search traffic, where several servers are set up as masters and slaves.

In the same file, remove the # at the start of the node.master line and set the value to false to make the node a slave (a node that is never elected master), or to true if you want it to be eligible as master.

For a node that acts only as a master, or as a search load balancer, set the line:

node.data: false
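The three combinations of node.master and node.data described in the comments of the stock elasticsearch.yml can be summarised in a small helper. This is only an illustration: role_settings and the role names data, master and balancer are made up here, while the two setting names come from the configuration file itself.

```shell
# role_settings ROLE
# Prints the node.master / node.data lines for a given role.
role_settings() {
  case "$1" in
    data)     printf 'node.master: false\nnode.data: true\n' ;;   # holds data, never master
    master)   printf 'node.master: true\nnode.data: false\n' ;;   # master only, stores no data
    balancer) printf 'node.master: false\nnode.data: false\n' ;;  # search load balancer
    *)        echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Possible usage: append the settings for a data-only node.
# role_settings data | sudo tee -a /etc/elasticsearch/elasticsearch.yml
```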

Then start the Elasticsearch service (use restart instead of start if it is already running):

sudo service elasticsearch start

Check the status of Elasticsearch:

service elasticsearch status

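4 : Securing Elasticsearch

Elasticsearch has no built-in security mechanism here, so we need to protect it with iptables, allowing only specific IP addresses to connect, and add the following line to elasticsearch.yml. Note that this setting belongs to Elasticsearch 1.x; it was removed in later releases, and a 6.x node refuses to start with a setting it does not recognise, so skip this line on 6.7.0.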

script.disable_dynamic: true

Then restart the service:

sudo service elasticsearch restart
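The iptables restriction mentioned above could look like the following sketch. The helper es_firewall_rules is made up, and 203.0.113.10 is a placeholder address from the documentation range; replace it with the IP of your web server. The function only prints the rules so that you can review them before applying anything.

```shell
# es_firewall_rules ALLOWED_IP [PORT]
# Prints iptables commands that open PORT (default 9200) to the loopback
# interface and to ALLOWED_IP, and drop every other connection to it.
es_firewall_rules() {
  allowed_ip=$1
  port=${2:-9200}
  printf 'iptables -A INPUT -i lo -p tcp --dport %s -j ACCEPT\n' "$port"
  printf 'iptables -A INPUT -p tcp -s %s --dport %s -j ACCEPT\n' "$allowed_ip" "$port"
  printf 'iptables -A INPUT -p tcp --dport %s -j DROP\n' "$port"
}

# Review the rules first:
es_firewall_rules 203.0.113.10
# Then apply them, for example:
# es_firewall_rules 203.0.113.10 | sudo sh
```

Port 9300, used for node-to-node traffic, can be handled the same way by passing 9300 as the second argument.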

5: Testing

Run the following command on the command line:

curl -X GET 'http://localhost:9200'

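If you see a response like the following (this sample comes from a 1.7.3 node; the exact fields and version numbers depend on the release you installed):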

{
  "status" : 200,
  "name" : "Franz Kafka",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.3",
    "build_hash" : "05d4530971ef0ea46d0f4fa6ee64dbc8df659682",
    "build_timestamp" : "2015-10-15T09:14:17Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

then the installation was successful.
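To check the result from a script instead of reading the JSON by eye, the version number can be pulled out of the response. This is a rough sketch: es_version is a made-up helper that uses a simple sed pattern rather than a real JSON parser, which is enough for this response but not for arbitrary JSON.

```shell
# es_version
# Reads the root endpoint's JSON on stdin and prints the value of "number".
es_version() {
  sed -n 's/.*"number" *: *"\([^"]*\)".*/\1/p'
}

# Possible usage:
# curl -s 'http://localhost:9200' | es_version
```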

6 : Usage

Elasticsearch exposes a RESTful API, so the usual CRUD operations are easy to perform: create, read, update and delete.

Indexing (creating) a document:

curl -X POST 'http://localhost:9200/hoathinhnet/video/1' -H 'Content-Type: application/json' -d '{ "name": "Cướp biển vùng caribe","description":"Cướp biển vùng caribe" }'
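The command above covers the "create" part of CRUD. The remaining operations, plus a simple search, can be sketched as small wrappers around curl. The function names and the ES_URL variable are made up for this example; the URL shapes follow the 6.x document and search APIs, and the index, type and id are the ones used above.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# Read: es_get INDEX TYPE ID
es_get() {
  curl -s -X GET "$ES_URL/$1/$2/$3"
}

# Update: es_update INDEX TYPE ID JSON, where JSON looks like {"doc": {...}}
es_update() {
  curl -s -X POST "$ES_URL/$1/$2/$3/_update" -H 'Content-Type: application/json' -d "$4"
}

# Delete: es_delete INDEX TYPE ID
es_delete() {
  curl -s -X DELETE "$ES_URL/$1/$2/$3"
}

# Search: es_search INDEX QUERY (the query must already be URL-encoded)
es_search() {
  curl -s -X GET "$ES_URL/$1/_search?q=$2"
}

# Possible usage:
# es_get hoathinhnet video 1
# es_update hoathinhnet video 1 '{"doc": {"description": "Phim phiêu lưu"}}'
# es_search hoathinhnet 'name:caribe'
# es_delete hoathinhnet video 1
```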

Check the state of Elasticsearch by listing its indices:

curl 'localhost:9200/_cat/indices?v'

Uninstalling Elasticsearch

yum remove elasticsearch

Below is the /etc/elasticsearch/elasticsearch.yml file I use to open the Elasticsearch server to access from other servers. Remember the space after the colon in network.host: 0.0.0.0, otherwise it will not work.

##################### Elasticsearch Configuration Example #####################

# This file contains an overview of various configuration settings,
# targeted at operations staff. Application developers should
# consult the guide at <http://elasticsearch.org/guide>.
#
# The installation procedure is covered at
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/setup.html>.
#
# Elasticsearch comes with reasonable defaults for most settings,
# so you can try it out without bothering with configuration.
#
# Most of the time, these defaults are just fine for running a production
# cluster. If you're fine-tuning your cluster, or wondering about the
# effect of certain configuration option, please _do ask_ on the
# mailing list or IRC channel [http://elasticsearch.org/community].

# Any element in the configuration can be replaced with environment variables
# by placing them in ${...} notation. For example:
#
#node.rack: ${RACK_ENV_VAR}

# For information on supported formats and syntax for the config file, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html>


################################### Cluster ###################################

# Cluster name identifies your cluster for auto-discovery. If you're running
# multiple clusters on the same network, make sure you're using unique names.
#
cluster.name: elasticsearch


#################################### Node #####################################

# Node names are generated dynamically on startup, so you're relieved
# from configuring them manually. You can tie this node to a specific name:
#
node.name: "Franz Kafka"

# Every node can be configured to allow or deny being eligible as the master,
# and to allow or deny to store the data.
#
# Allow this node to be eligible as a master node (enabled by default):
#
#node.master: true
#
# Allow this node to store data (enabled by default):
#
#node.data: true

# You can exploit these settings to design advanced cluster topologies.
#
# 1. You want this node to never become a master node, only to hold data.
#    This will be the "workhorse" of your cluster.
#
#node.master: false
#node.data: true
#
# 2. You want this node to only serve as a master: to not store any data and
#    to have free resources. This will be the "coordinator" of your cluster.
#
#node.master: true
#node.data: false
#
# 3. You want this node to be neither master nor data node, but
#    to act as a "search load balancer" (fetching data from nodes,
#    aggregating results, etc.)
#
#node.master: false
#node.data: false

# Use the Cluster Health API [http://localhost:9200/_cluster/health], the
# Node Info API [http://localhost:9200/_nodes] or GUI tools
# such as <http://www.elasticsearch.org/overview/marvel/>,
# <http://github.com/karmi/elasticsearch-paramedic>,
# <http://github.com/lukas-vlcek/bigdesk> and
# <http://mobz.github.com/elasticsearch-head> to inspect the cluster state.

# A node can have generic attributes associated with it, which can later be used
# for customized shard allocation filtering, or allocation awareness. An attribute
# is a simple key value pair, similar to node.key: value, here is an example:
#
#node.rack: rack314

# By default, multiple nodes are allowed to start from the same installation location
# to disable it, set the following:
#node.max_local_storage_nodes: 1


#################################### Index ####################################

# You can set a number of options (such as shard/replica options, mapping
# or analyzer definitions, translog settings, ...) for indices globally,
# in this file.
#
# Note, that it makes more sense to configure index settings specifically for
# a certain index, either when creating it or by using the index templates API.
#
# See <http://elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html> and
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html>
# for more information.

# Set the number of shards (splits) of an index (5 by default):
#
#index.number_of_shards: 5

# Set the number of replicas (additional copies) of an index (1 by default):
#
#index.number_of_replicas: 1

# Note, that for development on a local machine, with small indices, it usually
# makes sense to "disable" the distributed features:
#
#index.number_of_shards: 1
#index.number_of_replicas: 0

# These settings directly affect the performance of index and search operations
# in your cluster. Assuming you have enough machines to hold shards and
# replicas, the rule of thumb is:
#
# 1. Having more *shards* enhances the _indexing_ performance and allows to
#    _distribute_ a big index across machines.
# 2. Having more *replicas* enhances the _search_ performance and improves the
#    cluster _availability_.
#
# The "number_of_shards" is a one-time setting for an index.
#
# The "number_of_replicas" can be increased or decreased anytime,
# by using the Index Update Settings API.
#
# Elasticsearch takes care about load balancing, relocating, gathering the
# results from nodes, etc. Experiment with different settings to fine-tune
# your setup.

# Use the Index Status API (<http://localhost:9200/A/_status>) to inspect
# the index status.


#################################### Paths ####################################

# Path to directory containing configuration (this file and logging.yml):
#
#path.conf: /path/to/conf

# Path to directory where to store index data allocated for this node.
#
#path.data: /path/to/data
#
# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favouring locations with most free
# space on creation. For example:
#
#path.data: /path/to/data1,/path/to/data2

# Path to temporary files:
#
#path.work: /path/to/work

# Path to log files:
#
#path.logs: /path/to/logs

# Path to where plugins are installed:
#
#path.plugins: /path/to/plugins


#################################### Plugin ###################################

# If a plugin listed here is not installed for current node, the node will not start.
#
#plugin.mandatory: mapper-attachments,lang-groovy


################################### Memory ####################################

# Elasticsearch performs poorly when JVM starts swapping: you should ensure that
# it _never_ swaps.
#
# Set this property to true to lock the memory:
#
#bootstrap.mlockall: true

# Make sure that the ES_MIN_MEM and ES_MAX_MEM environment variables are set
# to the same value, and that the machine has enough memory to allocate
# for Elasticsearch, leaving enough memory for the operating system itself.
#
# You should also make sure that the Elasticsearch process is allowed to lock
# the memory, eg. by using `ulimit -l unlimited`.


############################## Network And HTTP ###############################

# Elasticsearch, by default, binds itself to the 0.0.0.0 address, and listens
# on port [9200-9300] for HTTP traffic and on port [9300-9400] for node-to-node
# communication. (the range means that if the port is busy, it will automatically
# try the next port).

# Set the bind address specifically (IPv4 or IPv6):
#
#network.bind_host: 168.235.86.129

# Set the address other nodes will use to communicate with this node. If not
# set, it is automatically derived. It must point to an actual IP address.
#
network.host: 0.0.0.0 

# Set both 'bind_host' and 'publish_host':
#
#network.host: 192.168.0.1

# Set a custom port for the node to node communication (9300 by default):
#
#transport.tcp.port: 9300

# Enable compression for all communication between nodes (disabled by default):
#
#transport.tcp.compress: true

# Set a custom port to listen for HTTP traffic:
#
#http.port: 9200

# Set a custom allowed content length:
#
#http.max_content_length: 100mb

# Disable HTTP completely:
#
#http.enabled: false


################################### Gateway ###################################

# The gateway allows for persisting the cluster state between full cluster
# restarts. Every change to the state (such as adding an index) will be stored
# in the gateway, and when the cluster starts up for the first time,
# it will read its state from the gateway.

# There are several types of gateway implementations. For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html>.

# The default gateway type is the "local" gateway (recommended):
#
#gateway.type: local

# Settings below control how and when to start the initial recovery process on
# a full cluster restart (to reuse as much local data as possible when using shared
# gateway).

# Allow recovery process after N nodes in a cluster are up:
#
#gateway.recover_after_nodes: 1

# Set the timeout to initiate the recovery process, once the N nodes
# from previous setting are up (accepts time value):
#
#gateway.recover_after_time: 5m

# Set how many nodes are expected in this cluster. Once these N nodes
# are up (and recover_after_nodes is met), begin recovery process immediately
# (without waiting for recover_after_time to expire):
#
#gateway.expected_nodes: 2


############################# Recovery Throttling #############################

# These settings allow to control the process of shards allocation between
# nodes during initial recovery, replica allocation, rebalancing,
# or when adding and removing nodes.

# Set the number of concurrent recoveries happening on a node:
#
# 1. During the initial recovery
#
#cluster.routing.allocation.node_initial_primaries_recoveries: 4
#
# 2. During adding/removing nodes, rebalancing, etc
#
#cluster.routing.allocation.node_concurrent_recoveries: 2

# Set to throttle throughput when recovering (eg. 100mb, by default 20mb):
#
#indices.recovery.max_bytes_per_sec: 20mb

# Set to limit the number of open concurrent streams when
# recovering a shard from a peer:
#
#indices.recovery.concurrent_streams: 5


################################## Discovery ##################################

# Discovery infrastructure ensures nodes can be found within a cluster
# and master node is elected. Multicast discovery is the default.

# Set to ensure a node sees N other master eligible nodes to be considered
# operational within the cluster. This should be set to a quorum/majority of 
# the master-eligible nodes in the cluster.
#
#discovery.zen.minimum_master_nodes: 1

# Set the time to wait for ping responses from other nodes when discovering.
# Set this option to a higher value on a slow or congested network
# to minimize discovery failures:
#
#discovery.zen.ping.timeout: 3s

# For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html>

# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
#
# 1. Disable multicast discovery (enabled by default):
#
#discovery.zen.ping.multicast.enabled: false
#
# 2. Configure an initial list of master nodes in the cluster
#    to perform discovery when new nodes (master or data) are started:
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

# EC2 discovery allows to use AWS EC2 API in order to perform discovery.
#
# You have to install the cloud-aws plugin for enabling the EC2 discovery.
#
# For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-ec2.html>
#
# See <http://elasticsearch.org/tutorials/elasticsearch-on-ec2/>
# for a step-by-step tutorial.

# GCE discovery allows to use Google Compute Engine API in order to perform discovery.
#
# You have to install the cloud-gce plugin for enabling the GCE discovery.
#
# For more information, see <https://github.com/elasticsearch/elasticsearch-cloud-gce>.

# Azure discovery allows to use Azure API in order to perform discovery.
#
# You have to install the cloud-azure plugin for enabling the Azure discovery.
#
# For more information, see <https://github.com/elasticsearch/elasticsearch-cloud-azure>.

################################## Slow Log ##################################

# Shard level query and fetch threshold logging.

#index.search.slowlog.threshold.query.warn: 10s
#index.search.slowlog.threshold.query.info: 5s
#index.search.slowlog.threshold.query.debug: 2s
#index.search.slowlog.threshold.query.trace: 500ms

#index.search.slowlog.threshold.fetch.warn: 1s
#index.search.slowlog.threshold.fetch.info: 800ms
#index.search.slowlog.threshold.fetch.debug: 500ms
#index.search.slowlog.threshold.fetch.trace: 200ms

#index.indexing.slowlog.threshold.index.warn: 10s
#index.indexing.slowlog.threshold.index.info: 5s
#index.indexing.slowlog.threshold.index.debug: 2s
#index.indexing.slowlog.threshold.index.trace: 500ms

################################## GC Logging ################################

#monitor.jvm.gc.young.warn: 1000ms
#monitor.jvm.gc.young.info: 700ms
#monitor.jvm.gc.young.debug: 400ms

#monitor.jvm.gc.old.warn: 10s
#monitor.jvm.gc.old.info: 5s
#monitor.jvm.gc.old.debug: 2s

################################## Security ################################

# Uncomment if you want to enable JSONP as a valid return transport on the
# http server. With this enabled, it may pose a security risk, so disabling
# it unless you need it is recommended (it is disabled by default).
#
#http.jsonp.enable: true

References: https://viblo.asia/p/elasticsearch-phan-1-gDVK2k60ZLj

https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-elasticsearch-on-centos-7
