Centos7下配置Logstash7.2与mysql8.0的数据同步

字数 1513阅读 655

环境说明

Centos7.5.1804、Elasticsearch7.2、Mysql8.0、Kibana7.2均已安装。
传送门
Centos7安装Elasticsearch7.2版本
Centos7安装mysql8.0
Kibana7.2的安装和配置

开始安装

logstash要与mysql进行数据同步,除了安装logstash本身以外,还需要安装 logstash-input-jdbc 插件和mysql对应版本的数据库链接驱动。
下载安装包

[root@localhost ~]# cd /usr/local/src/tar.gz/
[root@localhost tar.gz]# wget https://artifacts.elastic.co/downloads/logstash/logstash-7.2.0.tar.gz
# 省略
[root@localhost tar.gz]# ll
total 887456
-rw-r--r--. 1 root root 336647987 Aug  2 11:39 elasticsearch-7.2.0-linux-x86_64.tar.gz
-rw-r--r--. 1 root root 213198521 Aug  2 16:54 kibana-7.2.0-linux-x86_64.tar.gz
-rw-r--r--. 1 root root 171299740 Aug  3 12:56 logstash-7.2.0.tar.gz
-rw-r--r--. 1 root root 187599951 Aug  2 15:31 openjdk-11.0.1_linux-x64_bin.tar.gz

百度云
链接:https://pan.baidu.com/s/1KgHX_2_lbCGAb33O73g18g
提取码:yneo

解压到指定目录

#解压
[root@localhost tar.gz]# tar -xzvf logstash-7.2.0.tar.gz -C /usr/local/src/
#切换目录,更改文件夹名
[root@localhost src]# mv logstash-7.2.0/ ./logstash

安装logstash-input-jdbc

[root@localhost bin]# pwd
/usr/local/src/logstash/bin
[root@localhost bin]# ./logstash-plugin install logstash-input-jdbc
Validating logstash-input-jdbc
Installing logstash-input-jdbc
Installation successful

插件安装成功

下载mysql8.0版本的数据库链接驱动

[root@localhost src]# pwd
/usr/local/src
[root@localhost src]# wget http://central.maven.org/maven2/mysql/mysql-connector-java/8.0.11/mysql-connector-java-8.0.11.jar
--2019-08-03 13:34:40--  http://central.maven.org/maven2/mysql/mysql-connector-java/8.0.11/mysql-connector-java-8.0.11.jar
Resolving central.maven.org (central.maven.org)... 151.101.24.209
Connecting to central.maven.org (central.maven.org)|151.101.24.209|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2036609 (1.9M) [application/java-archive]
Saving to: ‘mysql-connector-java-8.0.11.jar’

100%[=============================================================================================================================================================================>] 2,036,609   11.1KB/s   in 2m 46s 

2019-08-03 13:37:30 (12.0 KB/s) - ‘mysql-connector-java-8.0.11.jar’ saved [2036609/2036609]

百度云
链接:https://pan.baidu.com/s/1NKYXjxqTox3M_TAClYXVpw
提取码:7ojx

数据同步

一下内容部分参考 ES 译文之如何使用 Logstash 实现关系型数据库与 ElasticSearch 之间的数据同步
数据同步概述
Logstash 的 JDBC input 插件进行 ElasticSearch 和 MySQL 之间的数据同步。从概念上讲,JDBC 插件将通过周期性的轮询以发现上次迭代后的新增和更新的数据。为了正常工作,几个条件需要满足:

ElasticSearch 中 _id 设置必须来自 MySQL 中 id 字段。它提供了 MySQL 和 ElasticSearch 之间文档数据的映射关系。如果一条记录在 MySQL 更新,那么,ElasticSearch 所有关联文档都应该被重写。要说明的是,重写 ElasticSearch 中的文档和更新操作的效率相同。在内部实现上,一个更新操作由删除一个旧文档和创建一个新文档两部分组成。

当 MySQL 中插入或更新一条记录时,必须包含一个字段用于保存字段的插入或更新时间。如此一来, Logstash 就可以实现每次请求只获取上次轮询后更新或插入的记录。Logstash 每次轮询都会保存从 MySQL 中读取到的最新的插入或更新时间,该时间大于上次轮询最新时间。

如果满足了上述条件,我们就可以配置 Logstash 周期性的从 MySQL 中读取所有最新更新或插入的记录,然后写入到 Elasticsearch 中。

MySQL 设置
这里建议在客户端工具(navicat)操作,避免犯一些低级错误。
MySQL 库和表的配置如下:

CREATE DATABASE es_db;

USE es_db;

DROP TABLE IF EXISTS es_table;

CREATE TABLE es_table (
  id BIGINT(20) UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY unique_id (id),
  client_name VARCHAR(32) NOT NULL,
  modification_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  insertion_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

配置中有几点需要说明,如下:

es_table,MySQL 的数据表,我们将把它的数据同步到 ElasticSearch 中;
id,记录的唯一标识。注意,id 定义为主键的同时,也定义为唯一建,可以保证每个 id 在表中只出现一次。同步 ElasticSearch 时,将会转化为文档的 _id;
client_name,表示用户定义用来保存数据的字段,为使博文保持简洁,我们只定义了一个字段,更多字段也很容易加入。接下来的演示,我们会更新该字段,用以说明不仅仅新插入记录会同步到 MySQL,更新记录同样会同步到 MySQL;
modification_time,用于保存记录的更新或插入时间,它使得 Logstash 可以在每次轮询时只请求上次轮询后新增更新的记录;
insertion_time,该字段用于一条记录插入时间,主要是为演示方便,对同步而言,并非必须;

MySQL 操作
插入记录:

INSERT INTO es_table (id, client_name) VALUES (<id>, <client name>);

更新记录:

UPDATE es_table SET client_name = <new client name> WHERE id=<id>;

数据库同步配置
创建配置文件logstash.conf

[root@localhost config]# pwd
/usr/local/src/logstash/config
[root@localhost bin]# touch logstash.conf

配置文件完成内容如下

input {
    jdbc {
        jdbc_driver_library => "/usr/local/src/mysql-connector-java-8.0.11.jar"
        # 数据库连接驱动,新版的有cj
        jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
        jdbc_connection_string => "jdbc:mysql://localhost:3306/es_db?useUnicode=true&characterEncoding=utf-8&useSSL=false"
        jdbc_user => "root"
        jdbc_password => "123456"
        schedule => "*/5 * * * * *"
        # 时区设置为上海
        jdbc_default_timezone => "Asia/Shanghai"
        jdbc_page_size => "50000"
        record_last_run => "true"
        jdbc_paging_enabled => true
        # 增量更新索引的标识字段
        tracking_column => "unix_ts_in_secs"
        use_column_value => "true"
        last_run_metadata_path => "/usr/local/src/logstash/config/last_id"
        lowercase_column_names => "false"
        tracking_column_type => "numeric"
        # 为true表示重启logstash重新读取数据库所有内容,false会从上次读取的内容开始往后读取
        clean_run => "true"
        # 数据库文档的查询sql
        statement => "SELECT id AS docId, client_name, modification_time, insertion_time, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs FROM es_table WHERE UNIX_TIMESTAMP(modification_time) > :sql_last_value AND modification_time < NOW() ORDER BY modification_time desc"
    }
}

filter {
  mutate {
    copy => { "id" => "[@metadata][id]"}
    remove_field => ["id", "@version", "unix_ts_in_secs"]
  }
  date {
    match => [ "modification_time", "yyyyMMddHHmm" ]
    timezone => "Asia/Shanghai"
  }
}

output {
    elasticsearch {
        # 索引名称
        index => "index_test_idx"
        # es文档的id为数据库表的id
        document_id => "%{docId}"
        hosts => ["http://localhost:9200"]
    }
}

启动logstash
启动前先启动mysql、Elasticsearch,如果你要查看运行结果,就启动kibana。

[root@localhost config]# cd ../bin/
[root@localhost bin]# ./logstash -f ../config/logstash.conf

运行log日志

[root@localhost bin]# ./logstash -f ../config/logstash.conf 
Thread.exclusive is deprecated, use Thread::Mutex
Sending Logstash logs to /usr/local/src/logstash/logs which is now configured via log4j2.properties
[2019-08-03T14:16:31,237][INFO ][logstash.setting.writabledirectory] Creating directory {:setting=>"path.queue", :path=>"/usr/local/src/logstash/data/queue"}
[2019-08-03T14:16:31,287][INFO ][logstash.setting.writabledirectory] Creating directory {:setting=>"path.dead_letter_queue", :path=>"/usr/local/src/logstash/data/dead_letter_queue"}
[2019-08-03T14:16:32,002][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2019-08-03T14:16:32,060][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"7.2.0"}
[2019-08-03T14:16:32,127][INFO ][logstash.agent           ] No persistent UUID file found. Generating new UUID {:uuid=>"793d3e86-56b6-42d6-b9da-709be9a18fd6", :path=>"/usr/local/src/logstash/data/uuid"}
[2019-08-03T14:16:54,091][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}}
[2019-08-03T14:16:55,986][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://localhost:9200/"}
[2019-08-03T14:16:56,421][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>7}
[2019-08-03T14:16:56,437][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>7}
[2019-08-03T14:16:56,527][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["http://localhost:9200"]}
[2019-08-03T14:16:56,815][INFO ][logstash.outputs.elasticsearch] Using default mapping template
[2019-08-03T14:16:56,964][WARN ][org.logstash.instrument.metrics.gauge.LazyDelegatingGauge] A gauge metric of an unknown type (org.jruby.specialized.RubyArrayOneObject) has been create for key: cluster_uuids. This may result in invalid serialization.  It is recommended to log an issue to the responsible developer/development team.
[2019-08-03T14:16:56,967][INFO ][logstash.javapipeline    ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>1, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>125, :thread=>"#<Thread:0x2ecfb767 run>"}
[2019-08-03T14:16:57,426][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"index_patterns"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s", "number_of_shards"=>1}, "mappings"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}
[2019-08-03T14:16:57,587][INFO ][logstash.outputs.elasticsearch] Installing elasticsearch template to _template/logstash
[2019-08-03T14:16:57,971][INFO ][logstash.javapipeline    ] Pipeline started {"pipeline.id"=>"main"}
[2019-08-03T14:16:58,323][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
/usr/local/src/logstash/vendor/bundle/jruby/2.5.0/gems/rufus-scheduler-3.0.9/lib/rufus/scheduler/cronline.rb:77: warning: constant ::Fixnum is deprecated
[2019-08-03T14:17:00,864][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
[2019-08-03T14:17:14,905][INFO ][logstash.inputs.jdbc     ] (0.439507s) SELECT version()
[2019-08-03T14:17:15,302][INFO ][logstash.inputs.jdbc     ] (0.126791s) SELECT version()
[2019-08-03T14:17:19,797][INFO ][logstash.inputs.jdbc     ] (0.123508s) SELECT count(*) AS `count` FROM (SELECT id AS docId, client_name, modification_time, insertion_time, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs FROM es_table WHERE UNIX_TIMESTAMP(modification_time) > 0 AND modification_time < NOW() ORDER BY modification_time desc) AS `t1` LIMIT 1
[2019-08-03T14:19:40,269][INFO ][logstash.inputs.jdbc     ] (0.001810s) SELECT count(*) AS `count` FROM (SELECT id AS docId, client_name, modification_time, insertion_time, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs FROM es_table WHERE UNIX_TIMESTAMP(modification_time) > 0 AND modification_time < NOW() ORDER BY modification_time desc) AS `t1` LIMIT 1
[2019-08-03T14:19:40,364][INFO ][logstash.inputs.jdbc     ] (0.002150s) SELECT * FROM (SELECT id AS docId, client_name, modification_time, insertion_time, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs FROM es_table WHERE UNIX_TIMESTAMP(modification_time) > 0 AND modification_time < NOW() ORDER BY modification_time desc) AS `t1` LIMIT 50000 OFFSET 0

查看同步结果

image.png

文档删除

不知道你是否已经发现,如果一个文档从 MySQL 中删除,并不会同步到 ElasticSearch 。关于这个问题,列举一些可供我们考虑的方案,如下:

MySQL 中的记录可通过包含 is_deleted 字段用以表明该条记录是否有效。一旦发生更新,is_deleted 也会同步更新到 ElasticSearch 中。如果通过这种方式,在执行 MySQL 或 ElasticSearch 查询时,我们需要重写查询语句来过滤掉 is_deleted 为 true 的记录。同时,需要一些后台进程将 MySQL 和 ElasticSearch 中的这些文档删除。

另一个可选方案,应用系统负责 MySQL 和 ElasticSearch 中数据的删除,即应用系统在删除 MySQL 中数据的同时,也要负责将 ElasticSearch 中相应的文档删除。

几个注意点

这里可能会遇到几个问题:
1、id映射
为了方便之后的数据操作,建议以数据库id作为ES同步的_id,但是我在操作的时候发现数据库id没有办法映射过去。
我看网上别人是这么写的:


image.png

还有这么写的


image.png

但是在我这里都不行,然后我在写查询sql的时候为id指定了别名(随便起的)docId,然后在映射的时候,使用的别名。
image.png

解决了这个问题。

2、时区问题
我们同步是增量更新的,以更新时间这个字段作为增量同步的分割标识。我在做同步的时候发现除了第一次启动数据全量更新到mysql,之后添加数据、修改数据都没有同步到ES中。查看同步日志发现在查询sql中sql_last_value的值大约比首次同步时间快约14个小时,这个是时区不同导致的问题。但是把时区改为上海重启后还是不行。
后更改查询sql把时间字段转为时间戳数字,时间戳数字是全球统一的,把它映射到sql_last_value,问题解决。

3、SELECT 语句的正确性分析
这里建议看看这边文章,我的mysql部分内容是参考这里的,解释的很好;
https://www.elastic.co/cn/blog/how-to-keep-elasticsearch-synchronized-with-a-relational-database-using-logstash

最后
建议自己去创建索引,可以指定分区和备份等索引参数,然后再开启同步

推荐阅读更多精彩内容