分类目录归档:全文搜索

mediawiki的Dockerfile配置

1.sudo docker pull mediawiki
2.sudo docker run –name mediawiki -p 7089:80 -d mediawiki
3.save the LocalSettings.php’s file and copy to mymediawiki/html’s directory.
4.mkdir mymediawiki’s dir
5.write Dockerfile just like bellow.

FROM mediawiki
COPY html/LocalSettings.php /var/www/html/

5.build and run.

sudo docker build -t mymediawiki .
sudo docker run --name mymediawiki -p 7089:80 -d mymediawiki

6.check it.

logstash抓取nginx日志

以下是基于elk+lnmp开源进行测试验证。
也可以参考官网的实现方法:https://kibana.logstash.es/content/logstash/plugins/codec/json.html
https://kibana.logstash.es/content/logstash/plugins/codec/multiline.html
在官网文档中,有较多应用场景:
https://kibana.logstash.es/content/
https://kibana.logstash.es/content/logstash/examples/

1.抓取nginx日志

input {
    file {
        # path => ["/home/wwwlogs/h5.vim.vim.com.log", "/home/wwwlogs/h5.vim.vim.com2.log"]
	path => "/home/wwwlogs/h5.vim.vim.com.log"
        exclude => "*.zip"
        type => "java"
        add_field => [ "domain", "h5.vim.vim.com" ]
        codec => multiline {
                      pattern => "^\s+"
                      what => previous
              }
    }
    file {
        # path => ["/home/wwwlogs/h5.api.vim.vim.com.log", "/home/wwwlogs/h5.api.vim.vim.com2.log"]
	path => "/home/wwwlogs/h5.api.vim.vim.com.log"
        exclude => ["*.zip", "*.gz"]
        type => "java"
        add_field => [ "domain", "h5.api.vim.vim.com" ]
        codec => multiline {
                        pattern => "^\s+"
                        what => previous
                 }
    }
}
filter {
 
}
output {
    stdout { 
		codec => rubydebug 
	}
    elasticsearch {
        hosts => ["0.0.0.0:9200"]
        index => "logstash-%{domain}-%{+YYYY.MM.dd}"
    }
}

2.定期清理索引

#!/bin/bash
 
# --------------------------------------------------------------
# This script is to delete ES indices older than specified days.
# Version: 1.0
# --------------------------------------------------------------
 
function usage() {
        echo "Usage: `basename $0` -s ES_SERVER -d KEEP_DAYS [-w INTERVAL]"
}
 
 
PREFIX='logstash-'
WAITTIME=2
NOW=`date  +%s.%3N`
LOGPATH=/apps/logs/elasticsearch
 
 
while getopts d:s:w: opt
do
        case $opt in
        s) SERVER="$OPTARG";;
        d) KEEPDAYS="$OPTARG";;
        w) WAITTIME="$OPTARG";;
        *) usage;;
        esac
done
 
if [ -z "$SERVER" -o -z "$KEEPDAYS" ]; then
        usage
fi
 
if [ ! -d $LOGPATH ]; then
        mkdir -p $LOGPATH
fi
 
 
INDICES=`curl -s $SERVER/_cat/indices?h=index | grep -P '^logstash-.*\d{4}.\d{2}.\d{2}' | sort`
for index in $INDICES
do
        date=`echo $index | awk -F '-' '{print $NF}' | sed 's/\./-/g' | xargs -I{} date -d {} +%s.%3N`
        delta=`echo "($NOW-$date)/86400" | bc`
        if [ $delta -gt $KEEPDAYS ]; then
                echo "deleting $index" | tee -a $LOGPATH/es_delete_indices.log
                curl -s -XDELETE $SERVER/$index | tee -a $LOGPATH/es_delete_indices.log
                echo | tee -a $LOGPATH/es_delete_indices.log
                sleep $WAITTIME
        fi
done

机器学习的一些库

Gensim是一个相当专业的计算相似度的Python工具包。
在文本处理中,比如商品评论挖掘,有时需要了解每个评论分别和商品的描述之间的相似度,以此衡量评论的客观性。
评论和商品描述的相似度越高,说明评论的用语比较官方,不带太多感情色彩,比较注重描述商品的属性和特性,角度更客观。
http://radimrehurek.com/gensim/

————————————-
图像识别类库
https://github.com/tesseract-ocr/tesseract

原本由惠普开发的图像识别类库tesseract-ocr已经更新到2.04, 就是最近Google支持的那个OCR。原先是惠普写的,现在Open source了。

安装Elasticsearch&Kibana&X-Pack

1.下载文件:
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.2.0.tar.gz
https://artifacts.elastic.co/downloads/kibana/kibana-5.2.0-linux-x86_64.tar.gz
https://artifacts.elastic.co/downloads/logstash/logstash-5.2.0.tar.gz
https://artifacts.elastic.co/downloads/packs/x-pack/x-pack-5.2.0.zip
https://artifacts.elastic.co/downloads/beats/heartbeat/heartbeat-5.2.0-linux-x86_64.tar.gz
https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-5.2.0-linux-x86_64.tar.gz
https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-5.2.0-linux-x86_64.tar.gz
2.分别解压elasticsearch\kibana\logstash后,各自执行以下命令安装xpack。

bin/elasticsearch-plugin install file:///path/to/file/x-pack-5.2.0.zip
bin/kibana-plugin install file:///path/to/file/x-pack-5.2.0.zip
bin/logstash-plugin install file:///path/to/file/x-pack-5.2.0.zip
卸载命令
bin/elasticsearch-plugin remove x-pack
bin/kibana-plugin remove x-pack
bin/logstash-plugin remove x-pack

3.启动相应应用

bin/elasticsearch
bin/kibana
bin/logstash

4.登录相关后台

----------------------------
kibana的后台:
http://localhost:5601
帐号与密码
Username: elastic Password: changeme
------------------------------
elasticsearch的restfullAPI
http://localhost:9200
-----------------------
logstash的后台

参考文档:
https://www.elastic.co/start
https://www.elastic.co/guide/en/x-pack/current/xpack-introduction.html
https://www.elastic.co/guide/en/x-pack/current/installing-xpack.html

solr快速入门

1.下载合适的solr版本。当前官网最新版本是6.4.1,但经验证,6.4.1版本在其管理后台中操作dataimport时,会显示空白页。故本人不建议使用最新版本进行学习和应用开发。经验证,其5.5.3版本的各项功能是可以正常工作的。
下载5.5.3版本,快捷路径是:https://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr
http://lucene.apache.org/solr/mirrors-solr-latest-redir.html
选择任意一个镜像,在进入镜像后,选择parent目录。


选择5.5.3的版本:https://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/5.5.3/
https://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/5.5.3/solr-5.5.3.zip
2.如下是其目录结构,初学者,应该习惯阅读Readme.txt文件,该文件记录了一些基本使用操作,很方便学习。

3.内置了几个例子,需要使用特殊命令开启,初学者应该每个例子都体验一下。

bin/solr -e <EXAMPLE> where <EXAMPLE> is one of:  
    cloud        : SolrCloud example
    dih          : Data Import Handler (rdbms, mail, rss, tika)
    schemaless   : Schema-less example (schema is inferred from data during indexing)
    techproducts : Kitchen sink example providing comprehensive examples of Solr features

4.体验dih例子。

bin/solr -e dih

5.打开管理后台页面:

在实际测试过程中,发现在window中,dataimport等一些相关操作,会失败。只有linux的才会成功,具体原因没有去分析。
http://mysql.mvware.com:8983/solr/


6.logging界面,当执行dataimport或其它操作,如果有错误或执行失败,可以检查该日志信息。

7.在CoreSelector中选择solr项,并选择dataimport项.

8.在dataimport项中,调试你的配置文件,经过该步骤,已经可以在query项和schemabrowser项中查询到相关记录了。

9.在dataimport项中,执行全量更新和增量更新,dataimport项是需要在solrconfig.xml中配置的。

solrconfig.xml中的requestHandler配置

<requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">solr-data-config.xml</str>
    </lst>
  </requestHandler>

solr-data-config.xml中的配置

<dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://127.0.0.1:8983/solr/db "
            query="*:*"
            fl="*,orig_version_l:_version_,ignored_price_c:price_c"/>
  </document>
</dataConfig>

10.浏览schemabrowser中的各个schema项,在solr6.x版本中,增加了schema的增删项,更方便从零搭建core项。

11.通过documents项,增加数据记录,通过schemabrowser或manage-schema.xml配置文件中可知道当前的schema有如下:。

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
   <field name="weight" type="float" indexed="true" stored="true"/>
   <field name="price"  type="float" indexed="true" stored="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />
   <field name="store" type="location" indexed="true" stored="true"/>
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="subject" type="text_general" indexed="true" stored="true"/>
   <field name="description" type="text_general" indexed="true" stored="true"/>
   <field name="comments" type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="keywords" type="text_general" indexed="true" stored="true"/>
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="resourcename" type="text_general" indexed="true" stored="true"/>
   <field name="url" type="text_general" indexed="true" stored="true"/>
   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>


在DocumentType选择JSON项,然后输入内容如下,并点击submit按钮提交:

{"id":12345, "author":"author_121","text":"text_121", "title":"title_1121"}
{"id":22345, "url":"url_121"}

如果执行成功,则提示如下:

Status: success
Response:
{
  "responseHeader": {
    "status": 0,
    "QTime": 2
  }
}

如果出错呢?也会有相应的错误提示,可依据提示进行修改输入项内容。

12.通过query项,进行查找刚才的输入项。

13.也可以通schemabrowser中的记录,快速跳转搜索的内容。