
Elasticsearch index cleanup

  •  0
  • user1189332  · Tech Community  · 6 years ago

    Elasticsearch v5.6.*

    I'm looking for a way to implement a mechanism by which one of my indices (roughly 1 million documents a day) can automatically manage its storage constraints.

    For example: I define the maximum number of documents or the maximum index size as a variable 'n'. I would write a scheduler that checks whether the threshold 'n' has been reached. If it has, I want to delete the oldest 'x' documents (based on time).

    I have a couple of questions here:

    Obviously, I don't want to delete too much or too little. How do I know what 'x' should be? Can I simply tell Elasticsearch, "hey, delete the oldest documents worth 5GB"? My goal is just to free up a fixed amount of storage. Is this possible?

    Secondly, I'd like to know what the best practice here is. Clearly I don't want to reinvent the wheel; if something out there (for example Curator, which I only recently heard of) can already do this, I'd be happy to use it.

    3 Answers  |  last activity 6 years ago
        1
  •  3
  •   Val    6 years ago

    In your case, the best practice is to use time-based indices, whether daily, weekly, or monthly, picking whichever fits the volume of data you have and the retention period you want. You can also use the Rollover API to decide when a new index needs to be created (based on time, number of documents, or index size).
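
    As a minimal sketch (the alias name logs_write, the index name logs-000001, and the thresholds below are made up for illustration; in 5.6 the documented rollover conditions are max_age and max_docs), a rollover setup could look like this:

    # Bootstrap an initial index with a write alias that the Rollover API can manage
    curl -X PUT "http://localhost:9200/logs-000001" -H 'Content-Type: application/json' -d'
    {
      "aliases": { "logs_write": {} }
    }'

    # Roll over to a new index once the current one is 7 days old
    # or holds more than 1 million documents
    curl -X POST "http://localhost:9200/logs_write/_rollover" -H 'Content-Type: application/json' -d'
    {
      "conditions": {
        "max_age":  "7d",
        "max_docs": 1000000
      }
    }'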

    Deleting an entire index is much easier than deleting the documents within an index that match certain criteria. If you do the latter, the documents are marked as deleted, but the space is not freed until the underlying segments are merged. If you delete an entire time-based index, however, you are guaranteed to free up the space.
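
    To make the contrast concrete (the index names and the @timestamp field in this sketch are assumptions, not taken from the question):

    # Delete-by-query: matching documents are only marked as deleted; the disk
    # space is reclaimed later, when the underlying segments are merged
    curl -X POST "http://localhost:9200/myindex/_delete_by_query" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "range": { "@timestamp": { "lt": "now-30d" } }
      }
    }'

    # Deleting an entire time-based index frees its space right away
    curl -X DELETE "http://localhost:9200/myindex-2018-05-01"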

        2
  •  2
  •   Kelby    5 years ago

    I came up with a fairly simple bash script solution for cleaning up time-based indices in Elasticsearch that I thought I'd share in case anyone is interested. Curator seems to be the standard answer for doing this, but I really didn't want to install and manage a Python application with all of its dependencies. It doesn't get much simpler than a bash script executed via cron, and it has no dependencies beyond core Linux.

    #!/bin/bash
    
    # Make sure expected arguments were provided
    if [ $# -lt 3 ]; then
        echo "Invalid number of arguments!"
        echo "This script is used to clean time based indices from Elasticsearch. The indices must have a"
        echo "trailing date in a format that can be represented by the UNIX date command such as '%Y-%m-%d'."
        echo ""
        echo "Usage: `basename $0` host_url index_prefix num_days_to_keep [date_format]"
        echo "The date_format argument is optional and defaults to '%Y-%m-%d'"
        echo "Example: `basename $0` http://localhost:9200 cflogs- 7"
        echo "Example: `basename $0` http://localhost:9200 elasticsearch_metrics- 31 %Y.%m.%d"
        exit
    fi
    
    elasticsearchUrl=$1
    indexNamePrefix=$2
    numDaysDataToKeep=$3
    dateFormat=%Y-%m-%d
    if [ $# -ge 4 ]; then
        dateFormat=$4
    fi
    
    # Get the current date in a 'seconds since epoch' format
    curDateInSecondsSinceEpoch=$(date +%s)
    #echo "curDateInSecondsSinceEpoch=$curDateInSecondsSinceEpoch"
    
    # Subtract numDaysDataToKeep from current epoch value to get the last day to keep
    let "targetDateInSecondsSinceEpoch=$curDateInSecondsSinceEpoch - ($numDaysDataToKeep * 86400)"
    #echo "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch"
    
    while : ; do
        # Subtract one day from the target date epoch
       let "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch - 86400"
       #echo "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch"
    
       # Convert targetDateInSecondsSinceEpoch into a YYYY-MM-DD format
       targetDateString=$(date --date="@$targetDateInSecondsSinceEpoch" +$dateFormat)
       #echo "targetDateString=$targetDateString"
    
       # Format the index name using the prefix and the calculated date string
       indexName="$indexNamePrefix$targetDateString"
       #echo "indexName=$indexName"
    
       # First check if an index with this date pattern exists
        # Curl options:
        #  -s   silent mode. Don't show progress meter or error messages
        #  -w "%{http_code}\n" Causes curl to display the HTTP status code only after a completed transfer.
        #  -I Fetch the HTTP-header only in the response. For HEAD commands there is no body so this keeps curl from waiting on it.
        #  -o /dev/null Prevents the output in the response from being displayed. This does not apply to the -w output though.
       httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -I -X HEAD "$elasticsearchUrl/$indexName")
       #echo "httpCode=$httpCode"
       if [ $httpCode -ne 200 ]
       then
          echo "Index $indexName does not exist. Stopping processing."
          break;
       fi
    
       # Send the command to Elasticsearch to delete the index. Save the HTTP return code in a variable
       httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -X DELETE $elasticsearchUrl/$indexName)
       #echo "httpCode=$httpCode"
    
       if [ $httpCode -eq 200 ]
       then
          echo "Successfully deleted index $indexName."
        else
          echo "FAILURE! Delete command failed with return code $httpCode. Continuing processing with next day."
          continue;
       fi
    
       # Verify the index no longer exists. Should return 404 when the index isn't found.
       httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -I -X HEAD "$elasticsearchUrl/$indexName")
       #echo "httpCode=$httpCode"
       if [ $httpCode -eq 200 ]
       then
          echo "FAILURE! Delete command responded successfully, but index still exists. Continuing processing with next day."
          continue;
       fi
    
    done
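
    As a usage note, the script can then be scheduled with cron; the schedule, script path, and index prefix below are only examples:

    # Run the cleanup every night at 02:00, keeping 7 days of "cflogs-" indices
    0 2 * * * /usr/local/bin/clean_es_indices.sh http://localhost:9200 cflogs- 7 >> /var/log/es-index-cleanup.log 2>&1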
    
    
        3
  •  1
  •   untergeek    6 years ago

    I answered this same question on the discussion forum: https://discuss.elastic.co/t/elasticsearch-efficiently-cleaning-up-the-indices-to-save-space/137019

    Deleting documents is not a best practice if the index is continuously growing. It sounds like you have time-series data. If that's true, then what you want are time-series indices, or better yet, rollover indices.

    5GB is also a rather small amount to purge, as a single Elasticsearch shard can healthily grow to 20GB-50GB in size. Are you storage constrained? How many nodes do you have?
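
    If you want to check how much headroom you actually have, the cat APIs are a quick way to see per-node disk usage and per-index size on disk (the host below is just an example):

    # Disk used/available per node, plus how many shards each node holds
    curl -s "http://localhost:9200/_cat/allocation?v"

    # Indices sorted by on-disk size, largest first
    curl -s "http://localhost:9200/_cat/indices?v&h=index,docs.count,store.size&s=store.size:desc"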