mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.

AGPL-3.0 License

mimir - 2.8.0

Published by lamida over 1 year ago

This release contains 223 PRs from 53 authors, including new contributors Abdurrahman J. Allawala, Ashray Jain, Cyrill N, Daniel Barnes, Dave, David van der Spek, day4me, Devin Trejo, Dmitriy Okladin, Gabriel Santos, inbarpatashnik, Johannes Tandler, Julien Girard, KingJ, Miller, Rafał Boniecki, Raphael Ferreira, Raúl Marín, Ruslan Kovalov, Shagit Ziganshin, shanmugara, Wilfried ROSET. Thank you!

Grafana Mimir version 2.8.0 release notes

Grafana Labs is excited to announce version 2.8 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Features and enhancements

  • Experimental support for using Redis as a cache. Mimir can now use Redis for caching results, chunks, index, and metadata.
  • Experimental support for fetching secrets from Vault for TLS configuration.
  • Experimental support for querying native histograms. This support is not finalized because the related Prometheus API is also experimental, so the exact behavior might change in future releases.
  • Query-frontend and ruler now use the protobuf internal query result payload format by default. This reduces the CPU and memory utilization of the querier, query-frontend, and ruler, and the network bandwidth consumed between these components.
  • Query-frontend cached results now contain a timestamp. This allows Mimir to check whether cached results are still valid based on the current TTL configured for the tenant. Results cached by a previous Mimir version are used until they expire from the cache, which can take up to 7 days. If you need the per-tenant TTL to take effect sooner, flush the results cache manually.
  • Optimized regular expression label matchers. This reduces CPU utilization in ingesters and store-gateways when running queries containing regular expression label matchers.
  • Store-gateway now uses streaming for the LabelNames RPC. This improves memory utilization in the store-gateway when calling the LabelNames RPC.
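
The timestamp-based validity check for cached query results mentioned in the list above can be pictured roughly as follows. This is a minimal Python sketch, not Mimir's implementation: the CachedResult type, its fields, and the is_still_valid function are hypothetical names chosen for illustration. Only the 7-day default TTL comes from the changelog.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CachedResult:
    """Hypothetical cache entry; Mimir's real payload format differs."""
    payload: bytes
    cached_at: datetime  # timestamp stored alongside the cached result


def is_still_valid(entry: CachedResult, tenant_ttl: timedelta, now: datetime) -> bool:
    """Return True while the entry is younger than the tenant's current TTL."""
    return now - entry.cached_at < tenant_ttl


now = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
entry = CachedResult(payload=b"...", cached_at=now - timedelta(days=2))

# With the default 7-day TTL, a 2-day-old entry is still usable.
print(is_still_valid(entry, timedelta(days=7), now))  # True
# After lowering the tenant TTL to 1 day, the same entry is rejected.
print(is_still_valid(entry, timedelta(days=1), now))  # False
```

An entry written without a timestamp cannot be checked this way, which is consistent with the note that results cached by earlier versions are used until they expire on their own.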

Helm chart improvements

The Grafana Mimir and Grafana Enterprise Metrics Helm chart is now released independently. See the Grafana Mimir Helm chart documentation.

Important changes

In Grafana Mimir 2.8 we have removed the following previously deprecated or experimental metrics:

  • cortex_bucket_store_series_get_all_duration_seconds
  • cortex_bucket_store_series_merge_duration_seconds
  • cortex_ingester_tsdb_wal_replay_duration_seconds

The following configuration options are deprecated and will be removed in Grafana Mimir 2.10:

  • The CLI flag -blocks-storage.tsdb.max-tsdb-opening-concurrency-on-startup and its respective YAML configuration option tsdb.max_tsdb_opening_concurrency_on_startup.

The following configuration options that were deprecated in 2.6 are removed:

  • The CLI flag -store.max-query-length and its respective YAML configuration option limits.max_query_length.

The following configuration options that were deprecated in 2.5 are removed:

  • The CLI flag -azure.msi-resource.

The following experimental options and features are now stable:

  • The protobuf internal query result payload format, which is now enabled by default

The default value of the block storage retention period has changed: -blocks-storage.tsdb.retention-period was 24h and is now 13h.
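
The changelog adds that if you override -querier.query-store-after to a value greater than the default 12h, you should increase -blocks-storage.tsdb.retention-period accordingly. The Python sketch below shows one way to reason about that. It assumes you want to keep the same one-hour margin that the two defaults imply. That margin is an assumption made for illustration, not an official rule, and suggested_retention is a hypothetical helper.

```python
from datetime import timedelta

DEFAULT_QUERY_STORE_AFTER = timedelta(hours=12)  # default named in the changelog
DEFAULT_RETENTION_PERIOD = timedelta(hours=13)   # new default in 2.8
# Assumption: keep the margin implied by the two defaults (1 hour).
MARGIN = DEFAULT_RETENTION_PERIOD - DEFAULT_QUERY_STORE_AFTER


def suggested_retention(query_store_after: timedelta) -> timedelta:
    """Suggest a retention period that stays ahead of query-store-after."""
    if query_store_after <= DEFAULT_QUERY_STORE_AFTER:
        return DEFAULT_RETENTION_PERIOD
    return query_store_after + MARGIN


print(suggested_retention(timedelta(hours=12)))  # 13:00:00
print(suggested_retention(timedelta(hours=20)))  # 21:00:00
```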

Bug fixes

  • Querier: Streaming remote read will now continue to return multiple chunks per frame after the first frame. PR 4423
  • Query-frontend: don't retry queries which error inside PromQL. PR 4643
  • Store-gateway & query-frontend: report more consistent statistics for fetched index bytes. PR 4671

Changelog

2.8.0

Grafana Mimir

  • [CHANGE] Ingester: changed experimental CLI flag from -out-of-order-blocks-external-label-enabled to -ingester.out-of-order-blocks-external-label-enabled #4440
  • [CHANGE] Store-gateway: The following metrics have been removed: #4332
    • cortex_bucket_store_series_get_all_duration_seconds
    • cortex_bucket_store_series_merge_duration_seconds
  • [CHANGE] Ingester: changed default value of -blocks-storage.tsdb.retention-period from 24h to 13h. If you're running Mimir with a custom configuration and you're overriding -querier.query-store-after to a value greater than the default 12h then you should increase -blocks-storage.tsdb.retention-period accordingly. #4382
  • [CHANGE] Ingester: the configuration parameter -blocks-storage.tsdb.max-tsdb-opening-concurrency-on-startup has been deprecated and will be removed in Mimir 2.10. #4445
  • [CHANGE] Query-frontend: Cached results now contain timestamp which allows Mimir to check if cached results are still valid based on current TTL configured for tenant. Results cached by previous Mimir version are used until they expire from cache, which can take up to 7 days. If you need to use per-tenant TTL sooner, please flush results cache manually. #4439
  • [CHANGE] Ingester: the cortex_ingester_tsdb_wal_replay_duration_seconds metric has been removed. #4465
  • [CHANGE] Query-frontend and ruler: use protobuf internal query result payload format by default. This feature is no longer considered experimental. #4557 #4709
  • [CHANGE] Ruler: reject creating federated rule groups while tenant federation is disabled. Previously the rule groups would be silently dropped during bucket sync. #4555
  • [CHANGE] Compactor: the /api/v1/upload/block/{block}/finish endpoint now returns a 429 status code when the compactor has reached the limit specified by -compactor.max-block-upload-validation-concurrency. #4598
  • [CHANGE] Compactor: when starting a block upload the maximum byte size of the block metadata provided in the request body is now limited to 1 MiB. If this limit is exceeded a 413 status code is returned. #4683
  • [CHANGE] Store-gateway: cache key format for expanded postings has changed. This will invalidate the expanded postings in the index cache when deployed. #4667
  • [FEATURE] Cache: Introduce experimental support for using Redis for results, chunks, index, and metadata caches. #4371
  • [FEATURE] Vault: Introduce experimental integration with Vault to fetch secrets used to configure TLS for clients. Server TLS secrets will still be read from a file. tls-ca-path, tls-cert-path and tls-key-path will denote the path in Vault for the following CLI flags when -vault.enabled is true: #4446.
    • -distributor.ha-tracker.etcd.*
    • -distributor.ring.etcd.*
    • -distributor.forwarding.grpc-client.*
    • -querier.store-gateway-client.*
    • -ingester.client.*
    • -ingester.ring.etcd.*
    • -querier.frontend-client.*
    • -query-frontend.grpc-client-config.*
    • -query-frontend.results-cache.redis.*
    • -blocks-storage.bucket-store.index-cache.redis.*
    • -blocks-storage.bucket-store.chunks-cache.redis.*
    • -blocks-storage.bucket-store.metadata-cache.redis.*
    • -compactor.ring.etcd.*
    • -store-gateway.sharding-ring.etcd.*
    • -ruler.client.*
    • -ruler.alertmanager-client.*
    • -ruler.ring.etcd.*
    • -ruler.query-frontend.grpc-client-config.*
    • -alertmanager.sharding-ring.etcd.*
    • -alertmanager.alertmanager-client.*
    • -memberlist.*
    • -query-scheduler.grpc-client-config.*
    • -query-scheduler.ring.etcd.*
    • -overrides-exporter.ring.etcd.*
  • [FEATURE] Distributor, ingester, querier, query-frontend, store-gateway: add experimental support for native histograms. Requires that the experimental protobuf query result response format is enabled by -query-frontend.query-result-response-format=protobuf on the query frontend. #4286 #4352 #4354 #4376 #4377 #4387 #4396 #4425 #4442 #4494 #4512 #4513 #4526
  • [FEATURE] Added -<prefix>.s3.storage-class flag to configure the S3 storage class for objects written to S3 buckets. #4300
  • [FEATURE] Add freebsd to the target OS when generating binaries for a Mimir release. #4654
  • [FEATURE] Ingester: Add prepare-shutdown endpoint which can be used as part of Kubernetes scale down automations. #4718
  • [ENHANCEMENT] Add timezone information to Alpine Docker images. #4583
  • [ENHANCEMENT] Ruler: sync rules when the ruler is JOINING the ring instead of ACTIVE, in order to reduce missed rule iterations during ruler restarts. #4451
  • [ENHANCEMENT] Allow defining the service name used for tracing via the JAEGER_SERVICE_NAME environment variable. #4394
  • [ENHANCEMENT] Querier and query-frontend: add experimental, more performant protobuf query result response format enabled with -query-frontend.query-result-response-format=protobuf. #4304 #4318 #4375
  • [ENHANCEMENT] Compactor: added experimental configuration parameter -compactor.first-level-compaction-wait-period, to configure how long the compactor should wait before compacting 1st level blocks (uploaded by ingesters). This configuration option reduces the chance that the compactor begins compacting blocks before all ingesters have uploaded their blocks to the storage. #4401
  • [ENHANCEMENT] Store-gateway: use more efficient chunks fetching and caching. #4255
  • [ENHANCEMENT] Query-frontend and ruler: add experimental, more performant protobuf internal query result response format enabled with -ruler.query-frontend.query-result-response-format=protobuf. #4331
  • [ENHANCEMENT] Ruler: increased tolerance for missed iterations on alerts, reducing the chances of flapping firing alerts during ruler restarts. #4432
  • [ENHANCEMENT] Optimized .* and .+ regular expression label matchers. #4432
  • [ENHANCEMENT] Optimized regular expression label matchers with alternates (e.g. a|b|c). #4647
  • [ENHANCEMENT] Added an in-memory cache for regular expression matchers, to avoid parsing and compiling the same expression multiple times when used in recurring queries. #4633
  • [ENHANCEMENT] Query-frontend: results cache TTL is now configurable by using -query-frontend.results-cache-ttl and -query-frontend.results-cache-ttl-for-out-of-order-time-window options. These values can also be specified per tenant. Default values are unchanged (7 days and 10 minutes respectively). #4385
  • [ENHANCEMENT] Ingester: added advanced configuration parameter -blocks-storage.tsdb.wal-replay-concurrency representing the maximum number of CPUs used during WAL replay. #4445
  • [ENHANCEMENT] Ingester: added metric cortex_ingester_tsdb_open_duration_seconds_total to measure the total time it takes to open all existing TSDBs. The time tracked by this metric also includes the TSDBs' WAL replay duration. #4465
  • [ENHANCEMENT] Store-gateway: use streaming implementation for LabelNames RPC. The batch size for streaming is controlled by -blocks-storage.bucket-store.batch-series-size. #4464
  • [ENHANCEMENT] Memcached: Add support for TLS or mTLS connections to cache servers. #4535
  • [ENHANCEMENT] Compactor: blocks index files are now validated for correctness for blocks uploaded via the TSDB block upload feature. #4503
  • [ENHANCEMENT] Compactor: block chunks and segment files are now validated for correctness for blocks uploaded via the TSDB block upload feature. #4549
  • [ENHANCEMENT] Ingester: added configuration options to configure the "postings for matchers" cache of each compacted block queried from ingesters: #4561
    • -blocks-storage.tsdb.block-postings-for-matchers-cache-ttl
    • -blocks-storage.tsdb.block-postings-for-matchers-cache-size
    • -blocks-storage.tsdb.block-postings-for-matchers-cache-force
  • [ENHANCEMENT] Compactor: validation of blocks uploaded via the TSDB block upload feature is now configurable on a per tenant basis: #4585
    • -compactor.block-upload-validation-enabled has been added; compactor_block_upload_validation_enabled can be used to override it per tenant
    • -compactor.block-upload.block-validation-enabled was the previous global flag and has been removed
  • [ENHANCEMENT] TSDB Block Upload: block upload validation concurrency can now be limited with -compactor.max-block-upload-validation-concurrency. #4598
  • [ENHANCEMENT] OTLP: Add support for converting OTel exponential histograms to Prometheus native histograms. The ingestion of native histograms must be enabled, please set -ingester.native-histograms-ingestion-enabled to true. #4063 #4639
  • [ENHANCEMENT] Query-frontend: add metric cortex_query_fetched_index_bytes_total to measure TSDB index bytes fetched to execute a query. #4597
  • [ENHANCEMENT] Query-frontend: add experimental limit to enforce a max query expression size in bytes via -query-frontend.max-query-expression-size-bytes or max_query_expression_size_bytes. #4604
  • [ENHANCEMENT] Query-tee: improve message logged when comparing responses and one response contains a non-JSON payload. #4588
  • [ENHANCEMENT] Distributor: add ability to set per-distributor limits via distributor_limits block in runtime configuration in addition to the existing configuration. #4619
  • [ENHANCEMENT] Querier: reduce peak memory consumption for queries that touch a large number of chunks. #4625
  • [ENHANCEMENT] Query-frontend: added experimental -query-frontend.query-sharding-max-regexp-size-bytes limit to query-frontend. When set to a value greater than 0, query-frontend disables query sharding for any query with a regexp matcher longer than the configured limit. #4632
  • [ENHANCEMENT] Store-gateway: include statistics from LabelValues and LabelNames calls in cortex_bucket_store_series* metrics. #4673
  • [ENHANCEMENT] Query-frontend: improve readability of distributed tracing spans. #4656
  • [ENHANCEMENT] Update Docker base images from alpine:3.17.2 to alpine:3.17.3. #4685
  • [ENHANCEMENT] Querier: improve performance when shuffle sharding is enabled and the shard size is large. #4711
  • [ENHANCEMENT] Ingester: improve performance when Active Series Tracker is in use. #4717
  • [ENHANCEMENT] Store-gateway: optionally select -blocks-storage.bucket-store.series-selection-strategy, which can limit the impact of large posting lists (when many series share the same label name and value). #4667 #4695 #4698
  • [ENHANCEMENT] Querier: cache the converted float histogram from the chunk iterator, so there is no need to look up the chunk every time to get the converted float histogram. #4684
  • [ENHANCEMENT] Ruler: Improve rule upload performance when not enforcing per-tenant rule group limits. #4828
  • [ENHANCEMENT] Improved memory limit on the in-memory cache used for regular expression matchers. #4751
  • [BUGFIX] Querier: Streaming remote read will now continue to return multiple chunks per frame after the first frame. #4423
  • [BUGFIX] Store-gateway: when using fine-grained chunks caching, the metrics cortex_bucket_store_series_data_touched and cortex_bucket_store_series_data_size_touched_bytes with stage="processed" now report the correct values for chunks held in memory. #4449
  • [BUGFIX] Compactor: fixed reporting a compaction error when compactor is correctly shut down while populating blocks. #4580
  • [BUGFIX] OTLP: Do not drop exemplars of the OTLP Monotonic Sum metric. #4063
  • [BUGFIX] Packaging: flag /etc/default/mimir and /etc/sysconfig/mimir as config to prevent overwrite. #4587
  • [BUGFIX] Query-frontend: don't retry queries which error inside PromQL. #4643
  • [BUGFIX] Store-gateway & query-frontend: report more consistent statistics for fetched index bytes. #4671
  • [BUGFIX] Native histograms: fix how IsFloatHistogram determines if mimirpb.Histogram is a float histogram. #4706
  • [BUGFIX] Query-frontend: fix query sharding for native histograms. #4666
  • [BUGFIX] Ring status page: fixed the owned tokens percentage value displayed. #4730
  • [BUGFIX] Querier: fixed chunk iterator that can return sample with wrong timestamp. #4450
  • [BUGFIX] Packaging: fix preremove script preventing upgrades. #4801
  • [BUGFIX] Security: updates Go to version 1.20.4 to fix CVE-2023-24539, CVE-2023-24540, CVE-2023-29400. #4903
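
The in-memory cache for regular expression matchers listed among the enhancements above avoids parsing and compiling the same expression repeatedly. The idea can be illustrated with a short Python sketch. This is not Mimir's Go implementation: compiled_matcher, matches, and the cache size of 1024 are hypothetical choices made for illustration. The full anchoring reflects how Prometheus regular expression label matchers behave.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=1024)  # hypothetical bound on the number of cached patterns
def compiled_matcher(pattern: str) -> re.Pattern:
    """Compile a label matcher once and reuse it on later calls."""
    # Prometheus regex label matchers are fully anchored, so wrap the pattern.
    return re.compile(f"^(?:{pattern})$")


def matches(pattern: str, value: str) -> bool:
    return compiled_matcher(pattern).match(value) is not None


# A recurring query evaluates the same matcher several times.
for _ in range(3):
    matches("a|b|c", "b")

info = compiled_matcher.cache_info()
print(info.hits, info.misses)  # 2 1
```

Only the first call pays the compilation cost; the bounded cache size plays the role of the memory limit mentioned in the related changelog entry.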

Mixin

  • [ENHANCEMENT] Queries: Display data touched per sec in bytes instead of number of items. #4492
  • [ENHANCEMENT] _config.job_names.<job> values can now be arrays of regular expressions in addition to a single string. Strings are still supported and behave as before. #4543
  • [ENHANCEMENT] Queries dashboard: remove mention of store-gateway "streaming enabled" in panels because store-gateway only supports streaming series since Mimir 2.7. #4569
  • [ENHANCEMENT] Ruler: Add panel description for Read QPS panel in Ruler dashboard to explain values when in remote ruler mode. #4675
  • [BUGFIX] Ruler dashboard: show data for reads from ingesters. #4543
  • [BUGFIX] Pod selector regex for deployments: change (.*-mimir-) to (.*mimir-). #4603
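
The effect of the pod selector change from (.*-mimir-) to (.*mimir-) can be checked with a few lines of Python. The pod names below are hypothetical, and how the mixin applies the expression inside its queries is not shown here. The sketch only compares the two patterns when matched from the start of a name.

```python
import re

OLD = re.compile(r"(.*-mimir-)")  # pattern before the fix
NEW = re.compile(r"(.*mimir-)")   # pattern after the fix

# Hypothetical pod names: one with a release prefix, one without.
pods = ["prod-mimir-distributor", "mimir-distributor-7d9f"]

for name in pods:
    print(name, bool(OLD.match(name)), bool(NEW.match(name)))
# prod-mimir-distributor True True
# mimir-distributor-7d9f False True
```

The old pattern requires a hyphen before "mimir-", so names that begin directly with "mimir-" are not matched. The new pattern accepts both forms.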

Jsonnet

  • [CHANGE] Ruler: changed ruler deployment max surge from 0 to 50%, and max unavailable from 1 to 0. #4381
  • [CHANGE] Memcached connections parameters -blocks-storage.bucket-store.index-cache.memcached.max-idle-connections, -blocks-storage.bucket-store.chunks-cache.memcached.max-idle-connections and -blocks-storage.bucket-store.metadata-cache.memcached.max-idle-connections settings are now configured based on max-get-multi-concurrency and max-async-concurrency. #4591
  • [CHANGE] Add support to use external Redis as cache. Following are some changes in the jsonnet config: #4386 #4640
    • Renamed memcached_*_enabled config options to cache_*_enabled
    • Renamed memcached_*_max_item_size_mb config options to cache_*_max_item_size_mb
    • Added cache_*_backend config options
  • [CHANGE] Store-gateway StatefulSets with disabled multi-zone deployment are also unregistered from the ring on shutdown. This eliminates resharding during rollouts, at the cost of extra effort when scaling down store-gateways. For more information see Scaling down store-gateways. #4713
  • [ENHANCEMENT] Alertmanager: add alertmanager_data_disk_size and alertmanager_data_disk_class configuration options, by default no storage class is set. #4389
  • [ENHANCEMENT] Update rollout-operator to v0.4.0. #4524
  • [ENHANCEMENT] Update memcached to memcached:1.6.19-alpine. #4581
  • [ENHANCEMENT] Add support for mTLS connections to Memcached servers. #4553
  • [ENHANCEMENT] Update the memcached-exporter to v0.11.2. #4570
  • [ENHANCEMENT] Autoscaling: Add autoscaling_query_frontend_memory_target_utilization, autoscaling_ruler_query_frontend_memory_target_utilization, and autoscaling_ruler_memory_target_utilization configuration options, for controlling the corresponding autoscaler memory thresholds. Each has a default of 1, i.e. 100%. #4612
  • [ENHANCEMENT] Distributor: add ability to set per-distributor limits via distributor_instance_limits using runtime configuration. #4627
  • [BUGFIX] Add missing query sharding settings for user_24M and user_32M plans. #4374

Mimirtool

  • [ENHANCEMENT] Backfill: mimirtool will now sleep and retry if it receives a 429 response while trying to finish an upload due to validation concurrency limits. #4598
  • [ENHANCEMENT] The gauge panel type is now supported in mimirtool analyze dashboard. #4679
  • [ENHANCEMENT] Set a User-Agent header on requests to Mimir or Prometheus servers. #4700
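
The backfill behaviour above (sleep and retry when finishing an upload returns 429) follows a common client pattern. The Python sketch below illustrates that pattern only. It is not mimirtool's code: finish_with_retry, its parameters, and the fixed delay are hypothetical, and the server is simulated with a list of status codes instead of a real HTTP call.

```python
import time
from typing import Callable


def finish_with_retry(
    finish_upload: Callable[[], int],
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call finish_upload until it returns a status other than 429."""
    status = 429
    for attempt in range(1, max_attempts + 1):
        status = finish_upload()
        if status != 429:
            return status
        if attempt < max_attempts:
            sleep(delay_seconds)  # wait before asking again
    return status  # still 429 after the last attempt


# Simulated server: busy twice, then the upload is accepted.
responses = iter([429, 429, 200])
print(finish_with_retry(lambda: next(responses), sleep=lambda _: None))  # 200
```

Passing the sleep function as a parameter keeps the example fast to run and easy to test. A real client would normally leave the default in place.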

Mimir Continuous Test

  • [FEATURE] Allow continuous testing of native histograms as well by enabling the flag -tests.write-read-series-test.histogram-samples-enabled. The metrics exposed by the tool now have a new label called type with possible values of float, histogram_float_counter, histogram_float_gauge, histogram_int_counter, histogram_int_gauge. The following metrics are impacted: #4457
    • mimir_continuous_test_writes_total
    • mimir_continuous_test_writes_failed_total
    • mimir_continuous_test_queries_total
    • mimir_continuous_test_queries_failed_total
    • mimir_continuous_test_query_result_checks_total
    • mimir_continuous_test_query_result_checks_failed_total
  • [ENHANCEMENT] Added a new metric mimir_continuous_test_build_info that reports version information, similar to the existing cortex_build_info metric exposed by other Mimir components. #4712
  • [ENHANCEMENT] Add coherency for the selected ranges and instants of test queries. #4704

Query-tee

Documentation

  • [CHANGE] Clarify what deprecation means in the lifecycle of configuration parameters. #4499
  • [CHANGE] Update compactor split-groups and split-and-merge-shards recommendation on component page. #4623
  • [FEATURE] Add instructions about how to configure native histograms. #4527
  • [ENHANCEMENT] Runbook for MimirCompactorHasNotSuccessfullyRunCompaction extended to include scenario where compaction has fallen behind. #4609
  • [ENHANCEMENT] Add explanation for QPS values for reads in remote ruler mode and writes generally, to the Ruler dashboard page. #4629
  • [ENHANCEMENT] Expand zone-aware replication page to cover single physical availability zone deployments. #4631
  • [FEATURE] Add instructions to use puppet module. #4610

Tools

  • [ENHANCEMENT] tsdb-index: iteration over index is now faster when any equal matcher is supplied. #4515

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.7.3...mimir-2.8.0

mimir - 2.7.3

Published by pstibrany over 1 year ago

2.7.3

Grafana Mimir

  • [BUGFIX] Security: updates Go to version 1.20.4 to fix CVE-2023-24539, CVE-2023-24540, CVE-2023-29400. #4905

Full Changelog: https://github.com/grafana/mimir/compare/mimir-2.7.2...mimir-2.7.3

mimir - 2.6.2

Published by pstibrany over 1 year ago

Changelog

2.6.2

Grafana Mimir

  • [BUGFIX] Security: updates Go to version 1.20.4 to fix CVE-2023-24539, CVE-2023-24540, CVE-2023-29400. #4903

Full Changelog: https://github.com/grafana/mimir/compare/mimir-2.6.1...mimir-2.6.2

mimir - 2.8.0-rc.2

Published by lamida over 1 year ago

This release contains 2 PRs from 2 authors. Thank you!

Changelog

2.8.0-rc.2

Grafana Mimir

  • [ENHANCEMENT] Ruler: Improve rule upload performance when not enforcing per-tenant rule group limits. #4828

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.8.0-rc.1...mimir-2.8.0-rc.2

mimir - 2.8.0-rc.1

Published by lamida over 1 year ago

This release contains 8 PRs from 2 authors. Thank you!

Changelog

2.8.0-rc.1

Grafana Mimir

  • [ENHANCEMENT] Improved memory limit on the in-memory cache used for regular expression matchers. #4751
  • [ENHANCEMENT] Go: update to 1.20.3. #4773
  • [BUGFIX] Packaging: fix preremove script preventing upgrades. #4801

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.8.0-rc.0...mimir-2.8.0-rc.1

mimir - 2.6.1

Published by aldernero over 1 year ago

This release contains 3 PRs from 2 authors. Thank you!

Changelog

2.6.1

Grafana Mimir

  • [BUGFIX] Security: updates Go to version 1.20.3 to fix CVE-2023-24538 #4798

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.6.0...mimir-2.6.1

mimir - 2.7.2

Published by aldernero over 1 year ago

This release contains 2 PRs from 2 authors. Thank you!

Changelog

2.7.2

Grafana Mimir

  • [BUGFIX] Security: updated Go version to 1.20.3 to fix CVE-2023-24538 #4795

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.7.1...mimir-2.7.2

mimir - 2.8.0-rc.0

Published by lamida over 1 year ago

This release contains 210 PRs from 53 authors, including new contributors Abdurrahman J. Allawala, Ashray Jain, Cyrill N, Daniel Barnes, Dave, David van der Spek, day4me, Devin Trejo, Dmitriy Okladin, Gabriel Santos, inbarpatashnik, Johannes Tandler, Julien Girard, KingJ, Miller, Rafał Boniecki, Raphael Ferreira, Raúl Marín, Ruslan Kovalov, Shagit Ziganshin, shanmugara, Wilfried ROSET. Thank you!

Grafana Mimir version 2.8.0-rc.0 release notes

Grafana Labs is excited to announce version 2.8 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Features and enhancements

  • Changed default value of block storage retention period. The default value for -blocks-storage.tsdb.retention-period was 24h and is now 13h.
  • Query-frontend cached results now contain a timestamp. This allows Mimir to check whether cached results are still valid based on the current TTL configured for the tenant. Results cached by a previous Mimir version are used until they expire from the cache, which can take up to 7 days. If you need the per-tenant TTL to take effect sooner, flush the results cache manually.
  • Experimental support for using Redis as a cache. Mimir can now use Redis for caching results, chunks, index, and metadata.
  • Experimental support for fetching secrets from Vault for TLS configuration.

Helm chart improvements

The Grafana Mimir and Grafana Enterprise Metrics Helm chart is now released independently. See the Grafana Mimir Helm chart documentation.

Important changes

In Grafana Mimir 2.8 we have removed the following previously deprecated or experimental configuration options or metrics.

The following metrics have been removed: cortex_bucket_store_series_get_all_duration_seconds, cortex_bucket_store_series_merge_duration_seconds, and cortex_ingester_tsdb_wal_replay_duration_seconds.

The following configuration options are deprecated and will be removed in Grafana Mimir 2.10:

  • The CLI flag -blocks-storage.tsdb.max-tsdb-opening-concurrency-on-startup and its respective YAML configuration option tsdb.max_tsdb_opening_concurrency_on_startup.

The following experimental options and features are now stable:

  • Use protobuf internal query result payload format by default.

Bug fixes

  • Querier: Streaming remote read will now continue to return multiple chunks per frame after the first frame. PR 4423
  • Query-frontend: don't retry queries which error inside PromQL. PR 4643
  • Store-gateway & query-frontend: report more consistent statistics for fetched index bytes. PR 4671
  • Native histograms: fix how IsFloatHistogram determines if mimirpb.Histogram is a float histogram. PR 4706
  • Query-frontend: fix query sharding for native histograms. PR 4666

Changelog

2.8.0-rc.0

Grafana Mimir

  • [CHANGE] Ingester: changed experimental CLI flag from -out-of-order-blocks-external-label-enabled to -ingester.out-of-order-blocks-external-label-enabled #4440
  • [CHANGE] Store-gateway: The following metrics have been removed: #4332
    • cortex_bucket_store_series_get_all_duration_seconds
    • cortex_bucket_store_series_merge_duration_seconds
  • [CHANGE] Ingester: changed default value of -blocks-storage.tsdb.retention-period from 24h to 13h. If you're running Mimir with a custom configuration and you're overriding -querier.query-store-after to a value greater than the default 12h then you should increase -blocks-storage.tsdb.retention-period accordingly. #4382
  • [CHANGE] Ingester: the configuration parameter -blocks-storage.tsdb.max-tsdb-opening-concurrency-on-startup has been deprecated and will be removed in Mimir 2.10. #4445
  • [CHANGE] Query-frontend: Cached results now contain timestamp which allows Mimir to check if cached results are still valid based on current TTL configured for tenant. Results cached by previous Mimir version are used until they expire from cache, which can take up to 7 days. If you need to use per-tenant TTL sooner, please flush results cache manually. #4439
  • [CHANGE] Ingester: the cortex_ingester_tsdb_wal_replay_duration_seconds metric has been removed. #4465
  • [CHANGE] Query-frontend and ruler: use protobuf internal query result payload format by default. This feature is no longer considered experimental. #4557 #4709
  • [CHANGE] Ruler: reject creating federated rule groups while tenant federation is disabled. Previously the rule groups would be silently dropped during bucket sync. #4555
  • [CHANGE] Compactor: the /api/v1/upload/block/{block}/finish endpoint now returns a 429 status code when the compactor has reached the limit specified by -compactor.max-block-upload-validation-concurrency. #4598
  • [CHANGE] Compactor: when starting a block upload the maximum byte size of the block metadata provided in the request body is now limited to 1 MiB. If this limit is exceeded a 413 status code is returned. #4683
  • [CHANGE] Store-gateway: cache key format for expanded postings has changed. This will invalidate the expanded postings in the index cache when deployed. #4667
  • [FEATURE] Cache: Introduce experimental support for using Redis for results, chunks, index, and metadata caches. #4371
  • [FEATURE] Vault: Introduce experimental integration with Vault to fetch secrets used to configure TLS for clients. Server TLS secrets will still be read from a file. tls-ca-path, tls-cert-path and tls-key-path will denote the path in Vault for the following CLI flags when -vault.enabled is true: #4446.
    • -distributor.ha-tracker.etcd.*
    • -distributor.ring.etcd.*
    • -distributor.forwarding.grpc-client.*
    • -querier.store-gateway-client.*
    • -ingester.client.*
    • -ingester.ring.etcd.*
    • -querier.frontend-client.*
    • -query-frontend.grpc-client-config.*
    • -query-frontend.results-cache.redis.*
    • -blocks-storage.bucket-store.index-cache.redis.*
    • -blocks-storage.bucket-store.chunks-cache.redis.*
    • -blocks-storage.bucket-store.metadata-cache.redis.*
    • -compactor.ring.etcd.*
    • -store-gateway.sharding-ring.etcd.*
    • -ruler.client.*
    • -ruler.alertmanager-client.*
    • -ruler.ring.etcd.*
    • -ruler.query-frontend.grpc-client-config.*
    • -alertmanager.sharding-ring.etcd.*
    • -alertmanager.alertmanager-client.*
    • -memberlist.*
    • -query-scheduler.grpc-client-config.*
    • -query-scheduler.ring.etcd.*
    • -overrides-exporter.ring.etcd.*
  • [FEATURE] Distributor, ingester, querier, query-frontend, store-gateway: add experimental support for native histograms. Requires that the experimental protobuf query result response format is enabled by -query-frontend.query-result-response-format=protobuf on the query frontend. #4286 #4352 #4354 #4376 #4377 #4387 #4396 #4425 #4442 #4494 #4512 #4513 #4526
  • [FEATURE] Added -<prefix>.s3.storage-class flag to configure the S3 storage class for objects written to S3 buckets. #4300
  • [FEATURE] Add freebsd to the target OS when generating binaries for a Mimir release. #4654
  • [FEATURE] Ingester: Add prepare-shutdown endpoint which can be used as part of Kubernetes scale down automations. #4718
  • [ENHANCEMENT] Add timezone information to Alpine Docker images. #4583
  • [ENHANCEMENT] Ruler: sync rules when the ruler is JOINING the ring instead of ACTIVE, in order to reduce missed rule iterations during ruler restarts. #4451
  • [ENHANCEMENT] Allow defining the service name used for tracing via the JAEGER_SERVICE_NAME environment variable. #4394
  • [ENHANCEMENT] Querier and query-frontend: add experimental, more performant protobuf query result response format enabled with -query-frontend.query-result-response-format=protobuf. #4304 #4318 #4375
  • [ENHANCEMENT] Compactor: added experimental configuration parameter -compactor.first-level-compaction-wait-period, to configure how long the compactor should wait before compacting 1st level blocks (uploaded by ingesters). This configuration option reduces the chance that the compactor begins compacting blocks before all ingesters have uploaded their blocks to the storage. #4401
  • [ENHANCEMENT] Store-gateway: use more efficient chunks fetching and caching. #4255
  • [ENHANCEMENT] Query-frontend and ruler: add experimental, more performant protobuf internal query result response format enabled with -ruler.query-frontend.query-result-response-format=protobuf. #4331
  • [ENHANCEMENT] Ruler: increased tolerance for missed iterations on alerts, reducing the chances of flapping firing alerts during ruler restarts. #4432
  • [ENHANCEMENT] Optimized .* and .+ regular expression label matchers. #4432
  • [ENHANCEMENT] Optimized regular expression label matchers with alternates (e.g. a|b|c). #4647
  • [ENHANCEMENT] Added an in-memory cache for regular expression matchers, to avoid parsing and compiling the same expression multiple times when used in recurring queries. #4633
  • [ENHANCEMENT] Query-frontend: results cache TTL is now configurable by using -query-frontend.results-cache-ttl and -query-frontend.results-cache-ttl-for-out-of-order-time-window options. These values can also be specified per tenant. Default values are unchanged (7 days and 10 minutes respectively). #4385
  • [ENHANCEMENT] Ingester: added advanced configuration parameter -blocks-storage.tsdb.wal-replay-concurrency representing the maximum number of CPUs used during WAL replay. #4445
  • [ENHANCEMENT] Ingester: added metric cortex_ingester_tsdb_open_duration_seconds_total to measure the total time it takes to open all existing TSDBs. The time tracked by this metric also includes the TSDB WAL replay duration. #4465
  • [ENHANCEMENT] Store-gateway: use streaming implementation for LabelNames RPC. The batch size for streaming is controlled by -blocks-storage.bucket-store.batch-series-size. #4464
  • [ENHANCEMENT] Memcached: Add support for TLS or mTLS connections to cache servers. #4535
  • [ENHANCEMENT] Compactor: blocks index files are now validated for correctness for blocks uploaded via the TSDB block upload feature. #4503
  • [ENHANCEMENT] Compactor: block chunks and segment files are now validated for correctness for blocks uploaded via the TSDB block upload feature. #4549
  • [ENHANCEMENT] Ingester: added configuration options to configure the "postings for matchers" cache of each compacted block queried from ingesters: #4561
    • -blocks-storage.tsdb.block-postings-for-matchers-cache-ttl
    • -blocks-storage.tsdb.block-postings-for-matchers-cache-size
    • -blocks-storage.tsdb.block-postings-for-matchers-cache-force
  • [ENHANCEMENT] Compactor: validation of blocks uploaded via the TSDB block upload feature is now configurable on a per tenant basis: #4585
    • -compactor.block-upload-validation-enabled has been added, compactor_block_upload_validation_enabled can be used to override per tenant
    • -compactor.block-upload.block-validation-enabled was the previous global flag and has been removed
  • [ENHANCEMENT] TSDB Block Upload: block upload validation concurrency can now be limited with -compactor.max-block-upload-validation-concurrency. #4598
  • [ENHANCEMENT] OTLP: Add support for converting OTel exponential histograms to Prometheus native histograms. The ingestion of native histograms must be enabled: set -ingester.native-histograms-ingestion-enabled to true. #4063 #4639
  • [ENHANCEMENT] Query-frontend: add metric cortex_query_fetched_index_bytes_total to measure TSDB index bytes fetched to execute a query. #4597
  • [ENHANCEMENT] Query-frontend: add experimental limit to enforce a max query expression size in bytes via -query-frontend.max-query-expression-size-bytes or max_query_expression_size_bytes. #4604
  • [ENHANCEMENT] Query-tee: improve message logged when comparing responses and one response contains a non-JSON payload. #4588
  • [ENHANCEMENT] Distributor: add ability to set per-distributor limits via distributor_limits block in runtime configuration in addition to the existing configuration. #4619
  • [ENHANCEMENT] Querier: reduce peak memory consumption for queries that touch a large number of chunks. #4625
  • [ENHANCEMENT] Query-frontend: added experimental -query-frontend.query-sharding-max-regexp-size-bytes limit to query-frontend. When set to a value greater than 0, query-frontend disables query sharding for any query with a regexp matcher longer than the configured limit. #4632
  • [ENHANCEMENT] Store-gateway: include statistics from LabelValues and LabelNames calls in cortex_bucket_store_series* metrics. #4673
  • [ENHANCEMENT] Query-frontend: improve readability of distributed tracing spans. #4656
  • [ENHANCEMENT] Update Docker base images from alpine:3.17.2 to alpine:3.17.3. #4685
  • [ENHANCEMENT] Querier: improve performance when shuffle sharding is enabled and the shard size is large. #4711
  • [ENHANCEMENT] Ingester: improve performance when Active Series Tracker is in use. #4717
  • [ENHANCEMENT] Store-gateway: optionally select -blocks-storage.bucket-store.series-selection-strategy, which can limit the impact of large posting lists (when many series share the same label name and value). #4667 #4695 #4698
  • [ENHANCEMENT] Querier: Cache the converted float histogram from the chunk iterator, so there is no need to look up the chunk every time to get the converted float histogram. #4684
  • [BUGFIX] Querier: Streaming remote read will now continue to return multiple chunks per frame after the first frame. #4423
  • [BUGFIX] Store-gateway: when using fine-grained chunks caching, the metrics cortex_bucket_store_series_data_touched and cortex_bucket_store_series_data_size_touched_bytes with stage="processed" now report the correct values for chunks held in memory. #4449
  • [BUGFIX] Compactor: fixed reporting a compaction error when compactor is correctly shut down while populating blocks. #4580
  • [BUGFIX] OTLP: Do not drop exemplars of the OTLP Monotonic Sum metric. #4063
  • [BUGFIX] Packaging: flag /etc/default/mimir and /etc/sysconfig/mimir as config to prevent overwrite. #4587
  • [BUGFIX] Query-frontend: don't retry queries which error inside PromQL. #4643
  • [BUGFIX] Store-gateway & query-frontend: report more consistent statistics for fetched index bytes. #4671
  • [BUGFIX] Native histograms: fix how IsFloatHistogram determines if mimirpb.Histogram is a float histogram. #4706
  • [BUGFIX] Query-frontend: fix query sharding for native histograms. #4666
  • [BUGFIX] Ring status page: fixed the owned tokens percentage value displayed. #4730
  • [BUGFIX] Querier: fixed chunk iterator that could return a sample with the wrong timestamp. #4450
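
The configurable results cache TTL (#4385) can be illustrated with a minimal Python sketch. It assumes each cached result carries the time it was written and that per-tenant overrides are a plain dictionary; the override key names and the helper functions are hypothetical and are not Mimir's actual implementation. Only the two default values (7 days and 10 minutes) come from the entry above.

```python
from datetime import datetime, timedelta, timezone

# Defaults stated in the changelog entry for #4385.
DEFAULT_TTL = timedelta(days=7)
DEFAULT_OOO_TTL = timedelta(minutes=10)


def effective_ttl(overrides, tenant, in_out_of_order_window):
    """Pick the TTL for a tenant, falling back to the defaults.

    `overrides` maps tenant ID to a dict of settings; the key names
    used here are assumptions made for this sketch.
    """
    tenant_cfg = overrides.get(tenant, {})
    if in_out_of_order_window:
        return tenant_cfg.get(
            "results_cache_ttl_for_out_of_order_time_window", DEFAULT_OOO_TTL
        )
    return tenant_cfg.get("results_cache_ttl", DEFAULT_TTL)


def is_still_valid(cached_at, now, ttl):
    """A cached result is usable while its age is below the TTL."""
    return now - cached_at < ttl


# Example: tenant-a lowers its TTL to one day, tenant-b keeps the default.
overrides = {"tenant-a": {"results_cache_ttl": timedelta(days=1)}}
now = datetime(2023, 5, 1, tzinfo=timezone.utc)
cached_at = now - timedelta(days=2)
valid_for_a = is_still_valid(cached_at, now, effective_ttl(overrides, "tenant-a", False))
valid_for_b = is_still_valid(cached_at, now, effective_ttl(overrides, "tenant-b", False))
```

In this sketch a two-day-old entry is rejected for the tenant with the shorter TTL and accepted for the tenant that uses the default.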

Mixin

  • [ENHANCEMENT] Queries: Display data touched per sec in bytes instead of number of items. #4492
  • [ENHANCEMENT] _config.job_names.<job> values can now be arrays of regular expressions in addition to a single string. Strings are still supported and behave as before. #4543
  • [ENHANCEMENT] Queries dashboard: remove mention of store-gateway "streaming enabled" in panels because store-gateway only supports streaming series since Mimir 2.7. #4569
  • [ENHANCEMENT] Ruler: Add panel description for Read QPS panel in Ruler dashboard to explain values when in remote ruler mode. #4675
  • [BUGFIX] Ruler dashboard: show data for reads from ingesters. #4543
  • [BUGFIX] Pod selector regex for deployments: change (.*-mimir-) to (.*mimir-). #4603
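
The effect of the pod selector change (#4603) can be checked with a short Python sketch. The two prefix patterns are taken from the entry above; the way they are combined with a component name, and the pod names, are assumptions made for illustration and may differ from how the mixin builds its selectors.

```python
import re

OLD_PREFIX = r"(.*-mimir-)"  # requires a "-" before "mimir-"
NEW_PREFIX = r"(.*mimir-)"   # accepts any (or no) text before "mimir-"


def selects(prefix_regex, component, pod_name):
    """Return True when pod_name starts with the prefix followed by the component.

    This composition is a hypothetical simplification for the example.
    """
    return re.match(prefix_regex + re.escape(component), pod_name) is not None


# Hypothetical pod names.
pods = ["prod-mimir-distributor-0", "mimir-distributor-0"]
old_matches = [p for p in pods if selects(OLD_PREFIX, "distributor", p)]
new_matches = [p for p in pods if selects(NEW_PREFIX, "distributor", p)]
```

With the old pattern, a pod whose name begins directly with `mimir-` is not selected because nothing precedes the required hyphen; the new pattern selects both names.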

Jsonnet

  • [CHANGE] Ruler: changed ruler deployment max surge from 0 to 50%, and max unavailable from 1 to 0. #4381
  • [CHANGE] The Memcached connection parameters -blocks-storage.bucket-store.index-cache.memcached.max-idle-connections, -blocks-storage.bucket-store.chunks-cache.memcached.max-idle-connections and -blocks-storage.bucket-store.metadata-cache.memcached.max-idle-connections are now configured based on max-get-multi-concurrency and max-async-concurrency. #4591
  • [CHANGE] Add support to use external Redis as cache. Following are some changes in the jsonnet config: #4386 #4640
    • Renamed memcached_*_enabled config options to cache_*_enabled
    • Renamed memcached_*_max_item_size_mb config options to cache_*_max_item_size_mb
    • Added cache_*_backend config options
  • [CHANGE] Store-gateway StatefulSets with disabled multi-zone deployment are also unregistered from the ring on shutdown. This eliminated resharding during rollouts, at the cost of extra effort during scaling down store-gateways. For more information see Scaling down store-gateways. #4713
  • [ENHANCEMENT] Alertmanager: add alertmanager_data_disk_size and alertmanager_data_disk_class configuration options, by default no storage class is set. #4389
  • [ENHANCEMENT] Update rollout-operator to v0.4.0. #4524
  • [ENHANCEMENT] Update memcached to memcached:1.6.19-alpine. #4581
  • [ENHANCEMENT] Add support for mTLS connections to Memcached servers. #4553
  • [ENHANCEMENT] Update the memcached-exporter to v0.11.2. #4570
  • [ENHANCEMENT] Autoscaling: Add autoscaling_query_frontend_memory_target_utilization, autoscaling_ruler_query_frontend_memory_target_utilization, and autoscaling_ruler_memory_target_utilization configuration options, for controlling the corresponding autoscaler memory thresholds. Each has a default of 1, i.e. 100%. #4612
  • [ENHANCEMENT] Distributor: add ability to set per-distributor limits via distributor_instance_limits using runtime configuration. #4627
  • [BUGFIX] Add missing query sharding settings for user_24M and user_32M plans. #4374
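
For the jsonnet option renames listed above (#4386 #4640), the following Python sketch shows one way to translate old option names to the new ones. It only encodes the two rename patterns from the list; the helper names and the sample keys are hypothetical, and it does not add the new cache_*_backend options.

```python
import re

# Rename rules taken from the changelog entry; "*" becomes a capture group.
_RENAMES = [
    (re.compile(r"^memcached_(.+)_enabled$"), r"cache_\1_enabled"),
    (re.compile(r"^memcached_(.+)_max_item_size_mb$"), r"cache_\1_max_item_size_mb"),
]


def migrate_key(key):
    """Return the new name for a renamed option, or the key unchanged."""
    for pattern, replacement in _RENAMES:
        if pattern.match(key):
            return pattern.sub(replacement, key)
    return key


def migrate_config(config):
    """Apply migrate_key to every top-level key of a config dict."""
    return {migrate_key(key): value for key, value in config.items()}


# Sample keys are illustrative only.
old_config = {
    "memcached_chunks_enabled": True,
    "memcached_chunks_max_item_size_mb": 1,
    "replication_factor": 3,
}
new_config = migrate_config(old_config)
```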

Mimirtool

  • [ENHANCEMENT] Backfill: mimirtool will now sleep and retry if it receives a 429 response while trying to finish an upload due to validation concurrency limits. #4598
  • [ENHANCEMENT] gauge panel type is supported now in mimirtool analyze dashboard. #4679
  • [ENHANCEMENT] Set a User-Agent header on requests to Mimir or Prometheus servers. #4700
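
The sleep-and-retry behaviour described for backfill (#4598) follows a common client pattern, sketched below in Python. The attempt limit, the fixed delay, and the function names are assumptions for illustration; the entry does not state mimirtool's actual retry policy.

```python
import time


def finish_upload_with_retry(send, max_attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or attempts run out.

    `send` is any callable returning an HTTP status code. `sleep` is
    injectable so the retry loop can be exercised without waiting.
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status != 429:
            return status
        if attempt < max_attempts:
            sleep(delay_seconds)  # back off before asking again
    return status


# Example with a fake server that is busy twice, then accepts the request.
responses = iter([429, 429, 200])
sleeps = []
final_status = finish_upload_with_retry(lambda: next(responses), sleep=sleeps.append)
```

Passing `sleeps.append` in place of `time.sleep` records the requested delays instead of pausing, which keeps the example fast to run.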

Mimir Continuous Test

  • [FEATURE] Allow continuous testing of native histograms as well by enabling the flag -tests.write-read-series-test.histogram-samples-enabled. The metrics exposed by the tool now have a new label called type, with possible values float, histogram_float_counter, histogram_float_gauge, histogram_int_counter and histogram_int_gauge. The following metrics are affected: #4457
    • mimir_continuous_test_writes_total
    • mimir_continuous_test_writes_failed_total
    • mimir_continuous_test_queries_total
    • mimir_continuous_test_queries_failed_total
    • mimir_continuous_test_query_result_checks_total
    • mimir_continuous_test_query_result_checks_failed_total
  • [ENHANCEMENT] Added a new metric mimir_continuous_test_build_info that reports version information, similar to the existing cortex_build_info metric exposed by other Mimir components. #4712
  • [ENHANCEMENT] Add coherency for the selected ranges and instants of test queries. #4704

Documentation

  • [CHANGE] Clarify what deprecation means in the lifecycle of configuration parameters. #4499
  • [CHANGE] Update compactor split-groups and split-and-merge-shards recommendation on component page. #4623
  • [FEATURE] Add instructions about how to configure native histograms. #4527
  • [ENHANCEMENT] Runbook for MimirCompactorHasNotSuccessfullyRunCompaction extended to include scenario where compaction has fallen behind. #4609
  • [ENHANCEMENT] Add explanation for QPS values for reads in remote ruler mode and writes generally, to the Ruler dashboard page. #4629
  • [ENHANCEMENT] Expand zone-aware replication page to cover single physical availability zone deployments. #4631
  • [FEATURE] Add instructions to use puppet module. #4610

Tools

  • [ENHANCEMENT] tsdb-index: iteration over index is now faster when any equal matcher is supplied. #4515

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.7.1...mimir-2.8.0-rc.0

mimir - 2.7.1

Published by aldernero over 1 year ago

This release contains 177 PRs from 43 authors, including new contributors Bartosz Cisek, dggmsa, gmintoco, Ihor Urazov, James Ross, Jean-Philippe Quéméner, Jon Gutschon, l3ioo, lpugoy, Nicolás Pazos, Oscar, Reto Kupferschmid, ying-jeanne. Thank you!

Grafana Mimir version 2.7.1 release notes

Grafana Labs is excited to announce version 2.7.1 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Note: During the release process, version 2.7.0 was tagged too early, before completing the release checklist and production testing. Release 2.7.1 doesn't include any code changes since 2.7.0, but now has proper release notes, published documentation, and has been fully tested in our production environment.

Features and enhancements

  • Store-gateway streaming enabled by default The new default value of 5000 for -blocks-storage.bucket-store.batch-series-size enables store-gateway streaming in the default configuration. This means that series are loaded from object storage in batches rather than buffering them all in memory before returning to the querier. Enabling streaming can reduce memory utilization peaks in the store-gateway.
  • Store-gateway index header reader no longer uses mmap by default Along with streaming enabled in the store-gateway, this change contributes to more efficient memory usage. See the Important changes section for more details.
  • Support for keep_firing_for option to ruler configuration This new option determines the amount of time an alert should keep firing while the ruler expression doesn't return results.
  • More efficient chunks fetching and caching Enable with the new experimental feature flag -blocks-storage.bucket-store.chunks-cache.fine-grained-chunks-caching-enabled=true. This should reduce CPU, memory utilization, and receive bandwidth of a store-gateway.
  • Experimental query sharding improvements:
    A new configuration parameter, -query-frontend.query-sharding-target-series-per-shard, allows query sharding to take into account cardinality of similar requests executed previously when computing the maximum number of shards to use. If you want to try it out, we recommend starting with a value of 2500.
  • Experimental support for native histogram ingestion:
    Native histograms can now be ingested. The new per-tenant limit -ingester.native-histograms-ingestion-enabled controls whether native histograms are stored or ignored. The support for querying native histograms is not complete yet and it's expected to be available in the next release.
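
To make the role of -query-frontend.query-sharding-target-series-per-shard more concrete, here is a minimal Python sketch. It assumes the shard count is the estimated series cardinality divided by the target, rounded up and capped by a configured maximum; this formula and the function name are assumptions for illustration, not a description of Mimir's actual algorithm. The value 2500 is the starting point recommended above.

```python
import math


def shards_for_query(estimated_series, target_series_per_shard, max_shards):
    """Estimate how many shards to use for a query (assumed formula)."""
    if target_series_per_shard <= 0:
        raise ValueError("target_series_per_shard must be positive")
    wanted = math.ceil(estimated_series / target_series_per_shard)
    # Always use at least one shard and never exceed the configured cap.
    return max(1, min(max_shards, wanted))


small = shards_for_query(100, 2500, 16)
medium = shards_for_query(10_000, 2500, 16)
large = shards_for_query(1_000_000, 2500, 16)
```

Under this assumption, low-cardinality queries avoid the overhead of unnecessary sharding while high-cardinality queries still hit the configured maximum.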

Alertmanager improvements

  • New metrics The following upstream metrics are now exposed:
    • cortex_alertmanager_dispatcher_aggregation_groups
    • cortex_alertmanager_dispatcher_alert_processing_duration_seconds

Helm chart improvements

The Grafana Mimir and Grafana Enterprise Metrics Helm chart is now released independently. See the Grafana Mimir Helm chart documentation.

Important changes

In Grafana Mimir 2.7, the default values of the following configuration options have changed:

  • -blocks-storage.bucket-store.batch-series-size is now enabled by default with a value of 5000.
  • -ruler.evaluation-delay-duration has changed from 0 to 1m.

In Grafana Mimir 2.7, the following configuration options are now deprecated:

  • -blocks-storage.bucket-store.chunks-cache.subrange-size since there's no benefit to changing the default of 16000
  • -blocks-storage.bucket-store.consistency-delay has been deprecated and will be removed in Mimir 2.9.
  • -compactor.consistency-delay has been deprecated and will be removed in Mimir 2.9.
  • -ingester.ring.readiness-check-ring-health has been deprecated and will be removed in Mimir 2.9.

In Grafana Mimir 2.7, the following options, metrics, and labels have been removed:

  • Experimental support for ephemeral storage introduced in Mimir 2.6.0 has been removed.
    • The following options are no longer available:
      • -blocks-storage.ephemeral-tsdb.*
      • -distributor.ephemeral-series-enabled
      • -distributor.ephemeral-series-matchers
      • -ingester.max-ephemeral-series-per-user
      • -ingester.instance-limits.max-ephemeral-series
    • The following metrics have been removed:
      • cortex_ingester_ephemeral_series
      • cortex_ingester_ephemeral_series_created_total
      • cortex_ingester_ephemeral_series_removed_total
      • cortex_ingester_ingested_ephemeral_samples_total
      • cortex_ingester_ingested_ephemeral_samples_failures_total
      • cortex_ingester_memory_ephemeral_users
      • cortex_ingester_queries_ephemeral_total
      • cortex_ingester_queried_ephemeral_samples
      • cortex_ingester_queried_ephemeral_series
    • Additionally, querying using the {__mimir_storage__="ephemeral"} selector no longer works. All label values with the ephemeral- prefix within the reason label of the cortex_discarded_samples_total metric are no longer available.
  • The store-gateway default index header reader no longer uses mmap and the mmap-based index header reader has been removed. The following flags have been changed:
    • -blocks-storage.bucket-store.index-header.map-populate-enabled has been removed
    • -blocks-storage.bucket-store.index-header.stream-reader-enabled has been removed
    • -blocks-storage.bucket-store.index-header.stream-reader-max-idle-file-handles has been renamed to -blocks-storage.bucket-store.index-header.max-idle-file-handles, and the corresponding configuration file option has been renamed from stream_reader_max_idle_file_handles to max_idle_file_handles
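
Before upgrading, it can help to scan existing command-line arguments for the flags removed or renamed above. The following Python sketch does that with the flag names from this section; the function name and the output format are hypothetical, and it only inspects CLI flags, not YAML configuration files.

```python
# Flag names copied from the "Important changes" list for Mimir 2.7.
REMOVED_FLAGS = {
    "-blocks-storage.bucket-store.index-header.map-populate-enabled",
    "-blocks-storage.bucket-store.index-header.stream-reader-enabled",
    "-distributor.ephemeral-series-enabled",
    "-distributor.ephemeral-series-matchers",
    "-ingester.max-ephemeral-series-per-user",
    "-ingester.instance-limits.max-ephemeral-series",
}
REMOVED_PREFIXES = ("-blocks-storage.ephemeral-tsdb.",)
RENAMED_FLAGS = {
    "-blocks-storage.bucket-store.index-header.stream-reader-max-idle-file-handles":
        "-blocks-storage.bucket-store.index-header.max-idle-file-handles",
}


def check_args(args):
    """Return one message per argument that uses a removed or renamed flag."""
    findings = []
    for arg in args:
        name = arg.split("=", 1)[0]  # drop "=value" if present
        if name in RENAMED_FLAGS:
            findings.append(f"{name}: renamed to {RENAMED_FLAGS[name]}")
        elif name in REMOVED_FLAGS or name.startswith(REMOVED_PREFIXES):
            findings.append(f"{name}: removed")
    return findings


for message in check_args([
    "-target=all",
    "-blocks-storage.bucket-store.index-header.stream-reader-enabled=true",
]):
    print(message)
```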

Bug fixes

  • Store-gateway: return Canceled rather than Aborted or Internal error when the calling querier cancels a label names or values request, and return Internal if processing the request fails for another reason. PR 4061
  • Querier: track canceled requests with status code 499 in the metrics instead of 503 or 422. PR 4099
  • Ingester: compact out-of-order data during /ingester/flush or when TSDB is idle. PR 4180
  • Ingester: conversion of global limits max-series-per-user, max-series-per-metric, max-metadata-per-user and max-metadata-per-metric into corresponding local limits now takes into account the number of ingesters in each zone. PR 4238
  • Ingester: track cortex_ingester_memory_series metric consistently with cortex_ingester_memory_series_created_total and cortex_ingester_memory_series_removed_total. PR 4312
  • Querier: fixed a bug which was incorrectly matching series with regular expression label matchers with begin/end anchors in the middle of the regular expression. PR 4340

Changelog

2.7.1

Grafana Mimir

  • [CHANGE] Ingester: the configuration parameter -ingester.ring.readiness-check-ring-health has been deprecated and will be removed in Mimir 2.9. #4422
  • [CHANGE] Ruler: changed default value of -ruler.evaluation-delay-duration option from 0 to 1m. #4250
  • [CHANGE] Querier: Errors with status code 422 coming from the store-gateway are propagated and not converted to the consistency check error anymore. #4100
  • [CHANGE] Store-gateway: When a query hits max_fetched_chunks_per_query and max_fetched_series_per_query limits, an error with the status code 422 is created and returned. #4056
  • [CHANGE] Packaging: Migrate FPM packaging solution to NFPM. Rationalize packages dependencies and add package for all binaries. #3911
  • [CHANGE] Store-gateway: Deprecate flag -blocks-storage.bucket-store.chunks-cache.subrange-size since there's no benefit to changing the default of 16000. #4135
  • [CHANGE] Experimental support for ephemeral storage introduced in Mimir 2.6.0 has been removed. The following options are no longer available: #4252
    • -blocks-storage.ephemeral-tsdb.*
    • -distributor.ephemeral-series-enabled
    • -distributor.ephemeral-series-matchers
    • -ingester.max-ephemeral-series-per-user
    • -ingester.instance-limits.max-ephemeral-series
      Querying using the {__mimir_storage__="ephemeral"} selector no longer works. All label values with the ephemeral- prefix in the reason label of the cortex_discarded_samples_total metric are no longer available. The following metrics have been removed:
    • cortex_ingester_ephemeral_series
    • cortex_ingester_ephemeral_series_created_total
    • cortex_ingester_ephemeral_series_removed_total
    • cortex_ingester_ingested_ephemeral_samples_total
    • cortex_ingester_ingested_ephemeral_samples_failures_total
    • cortex_ingester_memory_ephemeral_users
    • cortex_ingester_queries_ephemeral_total
    • cortex_ingester_queried_ephemeral_samples
    • cortex_ingester_queried_ephemeral_series
  • [CHANGE] Store-gateway: use mmap-less index-header reader by default and remove mmap-based index header reader. The following flags have changed: #4280
    • -blocks-storage.bucket-store.index-header.map-populate-enabled has been removed
    • -blocks-storage.bucket-store.index-header.stream-reader-enabled has been removed
    • -blocks-storage.bucket-store.index-header.stream-reader-max-idle-file-handles has been renamed to -blocks-storage.bucket-store.index-header.max-idle-file-handles, and the corresponding configuration file option has been renamed from stream_reader_max_idle_file_handles to max_idle_file_handles
  • [CHANGE] Store-gateway: the streaming store-gateway is now enabled by default. The new default setting for -blocks-storage.bucket-store.batch-series-size is 5000. #4330
  • [CHANGE] Compactor: the configuration parameter -compactor.consistency-delay has been deprecated and will be removed in Mimir 2.9. #4409
  • [CHANGE] Store-gateway: the configuration parameter -blocks-storage.bucket-store.consistency-delay has been deprecated and will be removed in Mimir 2.9. #4409
  • [FEATURE] Ruler: added keep_firing_for support to alerting rules. #4099
  • [FEATURE] Distributor, ingester: ingestion of native histograms. The new per-tenant limit -ingester.native-histograms-ingestion-enabled controls whether native histograms are stored or ignored. #4159
  • [FEATURE] Query-frontend: Introduce experimental -query-frontend.query-sharding-target-series-per-shard to allow query sharding to take into account cardinality of similar requests executed previously. This feature uses the same cache that's used for results caching. #4121 #4177 #4188 #4254
  • [ENHANCEMENT] Go: update go to 1.20.1. #4266
  • [ENHANCEMENT] Ingester: added out_of_order_blocks_external_label_enabled shipper option to label out-of-order blocks before shipping them to cloud storage. #4182 #4297
  • [ENHANCEMENT] Ruler: introduced concurrency when loading per-tenant rules configuration. This improvement is expected to speed up the ruler start up time in a Mimir cluster with a large number of tenants. #4258
  • [ENHANCEMENT] Compactor: Add reason label to cortex_compactor_runs_failed_total. The value can be shutdown or error. #4012
  • [ENHANCEMENT] Store-gateway: enforce max_fetched_series_per_query. #4056
  • [ENHANCEMENT] Query-frontend: Disambiguate logs for failed queries. #4067
  • [ENHANCEMENT] Query-frontend: log caller user agent in query stats logs. #4093
  • [ENHANCEMENT] Store-gateway: add data_type label, with values postings, series, and chunks, to cortex_bucket_store_partitioner_extended_ranges_total, cortex_bucket_store_partitioner_expanded_ranges_total, cortex_bucket_store_partitioner_requested_ranges_total, cortex_bucket_store_partitioner_expanded_bytes_total, and cortex_bucket_store_partitioner_requested_bytes_total. #4095
  • [ENHANCEMENT] Store-gateway: Reduce memory allocation rate when loading TSDB chunks from Memcached. #4074
  • [ENHANCEMENT] Query-frontend: track cortex_frontend_query_response_codec_duration_seconds and cortex_frontend_query_response_codec_payload_bytes metrics to measure the time taken and bytes read / written while encoding and decoding query result payloads. #4110
  • [ENHANCEMENT] Alertmanager: expose additional upstream metrics cortex_alertmanager_dispatcher_aggregation_groups, cortex_alertmanager_dispatcher_alert_processing_duration_seconds. #4151
  • [ENHANCEMENT] Querier and query-frontend: add experimental, more performant protobuf internal query result response format enabled with -query-frontend.query-result-response-format=protobuf. #4153
  • [ENHANCEMENT] Store-gateway: use more efficient chunks fetching and caching. This should reduce CPU, memory utilization, and receive bandwidth of a store-gateway. Enable with -blocks-storage.bucket-store.chunks-cache.fine-grained-chunks-caching-enabled=true. #4163 #4174 #4227
  • [ENHANCEMENT] Query-frontend: Wait for in-flight queries to finish before shutting down. #4073 #4170
  • [ENHANCEMENT] Store-gateway: added encode and other stage to cortex_bucket_store_series_request_stage_duration_seconds metric. #4179
  • [ENHANCEMENT] Ingester: log state of TSDB when shipping or forced compaction can't be done due to unexpected state of TSDB. #4211
  • [ENHANCEMENT] Update Docker base images from alpine:3.17.1 to alpine:3.17.2. #4240
  • [ENHANCEMENT] Store-gateway: add a stage label to the metrics cortex_bucket_store_series_data_fetched, cortex_bucket_store_series_data_size_fetched_bytes, cortex_bucket_store_series_data_touched, cortex_bucket_store_series_data_size_touched_bytes. This label only applies to data_type="chunks". For fetched metrics with data_type="chunks" the stage label has 2 values: fetched - the chunks or bytes that were fetched from the cache or the object store, refetched - the chunks or bytes that had to be refetched from the cache or the object store because their size was underestimated during the first fetch. For touched metrics with data_type="chunks" the stage label has 2 values: processed - the chunks or bytes that were read from the fetched chunks or bytes and were processed in memory, returned - the chunks or bytes that were selected from the processed bytes to satisfy the query. #4227 #4316
  • [ENHANCEMENT] Compactor: improve the partial block check related to compactor.partial-block-deletion-delay to potentially issue fewer requests to object storage. #4246
  • [ENHANCEMENT] Memcached: added -*.memcached.min-idle-connections-headroom-percentage support to configure the minimum number of idle connections to keep open as a percentage (0-100) of the number of recently used idle connections. This feature is disabled when set to a negative value (default), which means idle connections are kept open indefinitely. #4249
  • [ENHANCEMENT] Querier and store-gateway: optimized regular expression label matchers with case insensitive alternate operator. #4340 #4357
  • [ENHANCEMENT] Compactor: added the experimental flag -compactor.block-upload.block-validation-enabled with the default true to configure whether block validation occurs on backfilled blocks. #3411
  • [ENHANCEMENT] Ingester: apply a jitter to the first TSDB head compaction interval configured via -blocks-storage.tsdb.head-compaction-interval. Subsequent checks will happen at the configured interval. This should help to spread the TSDB head compaction among different ingesters over the configured interval. #4364
  • [ENHANCEMENT] Ingester: the maximum accepted value for -blocks-storage.tsdb.head-compaction-interval has been increased from 5m to 15m. #4364
  • [BUGFIX] Store-gateway: return Canceled rather than Aborted or Internal error when the calling querier cancels a label names or values request, and return Internal if processing the request fails for another reason. #4061
  • [BUGFIX] Querier: track canceled requests with status code 499 in the metrics instead of 503 or 422. #4099
  • [BUGFIX] Ingester: compact out-of-order data during /ingester/flush or when TSDB is idle. #4180
  • [BUGFIX] Ingester: conversion of global limits max-series-per-user, max-series-per-metric, max-metadata-per-user and max-metadata-per-metric into corresponding local limits now takes into account the number of ingesters in each zone. #4238
  • [BUGFIX] Ingester: track cortex_ingester_memory_series metric consistently with cortex_ingester_memory_series_created_total and cortex_ingester_memory_series_removed_total. #4312
  • [BUGFIX] Querier: fixed a bug which was incorrectly matching series with regular expression label matchers with begin/end anchors in the middle of the regular expression. #4340
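
The jitter applied to the first TSDB head compaction interval (#4364) can be pictured with the Python sketch below. It assumes the first check is delayed by a uniformly random amount within one interval and that later checks follow at the configured interval; the distribution and the function names are assumptions, since the entry does not specify them.

```python
import random


def first_check_delay(interval_seconds, rng=random):
    """Delay before the first check: a random point within one interval (assumed)."""
    return rng.uniform(0, interval_seconds)


def check_times(interval_seconds, count, rng=random):
    """Offsets, in seconds from startup, of the first `count` checks."""
    first = first_check_delay(interval_seconds, rng)
    # Subsequent checks happen at the configured interval.
    return [first + i * interval_seconds for i in range(count)]


# Two ingesters started at the same moment get different first-check offsets
# (seeded generators are used here only to make the example repeatable).
ingester_a = check_times(60, 3, random.Random(1))
ingester_b = check_times(60, 3, random.Random(2))
```

Because each ingester draws its own offset, head compactions that would otherwise coincide are spread across the interval.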

Mixin

  • [CHANGE] Move auto-scaling panel rows down beneath logical network path in Reads and Writes dashboards. #4049
  • [CHANGE] Make distributor auto-scaling metric panels show desired number of replicas. #4218
  • [CHANGE] Alerts: The alert MimirMemcachedRequestErrors has been renamed to MimirCacheRequestErrors. #4242
  • [ENHANCEMENT] Alerts: Added MimirAutoscalerKedaFailing alert firing when a KEDA scaler is failing. #4045
  • [ENHANCEMENT] Add auto-scaling panels to ruler dashboard. #4046
  • [ENHANCEMENT] Add gateway auto-scaling panels to Reads and Writes dashboards. #4049 #4216
  • [ENHANCEMENT] Dashboards: distinguish between label names and label values queries. #4065
  • [ENHANCEMENT] Add query-frontend and ruler-query-frontend auto-scaling panels to Reads and Ruler dashboards. #4199
  • [BUGFIX] Alerts: Fixed MimirAutoscalerNotActive to not fire if scaling metric does not exist, to avoid false positives on scaled objects with 0 min replicas. #4045
  • [BUGFIX] Alerts: MimirCompactorHasNotSuccessfullyRunCompaction is no longer triggered by frequent compactor restarts. #4012
  • [BUGFIX] Tenants dashboard: Correctly show the ruler-query-scheduler queue size. #4152

Jsonnet

  • [CHANGE] Create the query-frontend-discovery service only when Mimir is deployed in microservice mode without query-scheduler. #4353
  • [CHANGE] Add results cache backend config to ruler-query-frontend configuration to allow cache reuse for cardinality-estimation based sharding. #4257
  • [ENHANCEMENT] Add support for ruler auto-scaling. #4046
  • [ENHANCEMENT] Add optional weight param to newQuerierScaledObject and newRulerQuerierScaledObject to allow running multiple querier deployments on different node types. #4141
  • [ENHANCEMENT] Add support for query-frontend and ruler-query-frontend auto-scaling. #4199
  • [BUGFIX] Shuffle sharding: when applying user class limits, honor the minimum shard size configured in $._config.shuffle_sharding.*. #4363

Mimirtool

  • [FEATURE] Added keep_firing_for support to rules configuration. #4099
  • [ENHANCEMENT] Add -tls-insecure-skip-verify to rules, alertmanager and backfill commands. #4162

Query-tee

  • [CHANGE] Increase default value of -backend.read-timeout to 150s, to accommodate default querier and query frontend timeout of 120s. #4262
  • [ENHANCEMENT] Log errors that occur while performing requests to compare two endpoints. #4262
  • [ENHANCEMENT] When comparing two responses that both contain an error, only consider the comparison failed if the errors differ. Previously, if either response contained an error, the comparison always failed, even if both responses contained the same error. #4262
  • [ENHANCEMENT] Include the value of the X-Scope-OrgID header when logging a comparison failure. #4262
  • [BUGFIX] Parameters (expression, time range etc.) for a query request where the parameters are in the HTTP request body rather than in the URL are now logged correctly when responses differ. #4265
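
The revised comparison rule for responses that contain errors (#4262) can be expressed as a small Python sketch. The response shape (a dict with optional `error` and `body` keys) and the function name are hypothetical. Treating a single-sided error as a failure and comparing bodies by plain equality are simplifying assumptions, not query-tee's actual comparison logic.

```python
def comparison_failed(first, second):
    """Decide whether two backend responses should be reported as different."""
    err_a = first.get("error")
    err_b = second.get("error")
    if err_a is not None and err_b is not None:
        # Both backends failed: only a difference in the errors is a failure.
        return err_a != err_b
    if err_a is not None or err_b is not None:
        # Only one backend failed (assumed to count as a failed comparison).
        return True
    # Neither failed: fall back to a simplified payload comparison.
    return first.get("body") != second.get("body")


same_error = comparison_failed({"error": "timeout"}, {"error": "timeout"})
different_error = comparison_failed({"error": "timeout"}, {"error": "bad_data"})
one_error = comparison_failed({"error": "timeout"}, {"body": "[]"})
```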

Documentation

  • [ENHANCEMENT] Add guide on alternative migration method for Thanos to Mimir. #3554
  • [ENHANCEMENT] Restore "Migrate from Cortex" for Jsonnet. #3929
  • [ENHANCEMENT] Document migration from microservices to read-write deployment mode. #3951
  • [ENHANCEMENT] Do not error when there is nothing to commit as part of a publish. #4058
  • [ENHANCEMENT] Explain how to run Mimir locally using docker-compose. #4079
  • [ENHANCEMENT] Docs: use long flag names in runbook commands. #4088
  • [ENHANCEMENT] Clarify how ingester replication happens. #4101
  • [ENHANCEMENT] Improvements to the Get Started guide. #4315
  • [BUGFIX] Added indentation to Azure and SWIFT backend definition. #4263

Tools

  • [ENHANCEMENT] Adapt tsdb-print-chunk for native histograms. #4186
  • [ENHANCEMENT] Adapt tsdb-index-health for blocks containing native histograms. #4186
  • [ENHANCEMENT] Adapt tsdb-chunks tool to handle native histograms. #4186

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.6.0...mimir-2.7.1

mimir - 2.6.0

Published by 56quarters over 1 year ago

This release contains 259 PRs from 40 authors, including new contributors breadly7, bubu11e, Đurica Yuri Nikolić, Felix Beuke, Jack, klagroix, Martin Chodur, Ørjan Ommundsen, Sascha Sternheim, Wu Zhiyuan. Thank you!

Grafana Mimir version 2.6.0 release notes

Grafana Labs is excited to announce version 2.6 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Features and enhancements

  • Lower memory usage in store-gateway by streaming series results
    The store-gateway can now stream results back to the querier instead of buffering them. This is expected to greatly reduce peak memory consumption while keeping latency the same. This is still an experimental feature, but Grafana Labs is already running it in production with no known issues. To enable it, set the -blocks-storage.bucket-store.batch-series-size configuration option; if you want to try it out, we recommend a value of 5000.

  • Improved stability in store-gateway by removing mmap usage
    The store-gateway can now use an alternate code path to read index-headers that does not use memory-mapped files. This is expected to improve stability of the store-gateway. This is still an experimental feature, but Grafana Labs is already running it in production with no known issues. To enable it, set -blocks-storage.bucket-store.index-header.stream-reader-enabled=true.
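
The difference between buffering and streaming series can be illustrated with a minimal Python sketch. It is not the store-gateway implementation (Mimir is written in Go); the function names and the string representation of a series are hypothetical, and batch_size only stands in for the role played by -blocks-storage.bucket-store.batch-series-size.

```python
from typing import Iterable, Iterator, List


def buffered(series: Iterable[str]) -> List[str]:
    """Buffering: materialize every series in memory before returning anything."""
    return list(series)


def stream_in_batches(series: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Streaming: hand over at most batch_size series at a time.

    Only one batch is held in memory at once, which is why peak memory
    is bounded by the batch size rather than by the total result size.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[str] = []
    for s in series:
        batch.append(s)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


if __name__ == "__main__":
    # Hypothetical input: 12 series produced lazily.
    all_series = (f'metric{{instance="{i}"}}' for i in range(12))
    for b in stream_in_batches(all_series, batch_size=5):
        print(len(b))
```

With 12 series and a batch size of 5, the consumer receives batches of 5, 5 and 2 series, and never holds more than 5 at a time.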

Alertmanager improvements

  • Webex support Alertmanager can now use Webex to send alerts.

  • tenantID template function A new template function tenantID, returning the ID of the tenant owning the alert, has been added.

  • grafanaExploreURL template function A new template function grafanaExploreURL, returning the URL to the Grafana explore page with range query, has been added.

Helm chart improvements

The Grafana Mimir and Grafana Enterprise Metrics Helm chart is now released independently. See the corresponding documentation for more information.

Important changes

In Grafana Mimir 2.6 we have removed the following previously deprecated or experimental configuration options:

  • The CLI flag -blocks-storage.bucket-store.max-concurrent-reject-over-limit and its respective YAML configuration option blocks_storage.bucket_store.max_concurrent_reject_over_limit.
  • The CLI flag -query-frontend.align-querier-with-step and its respective YAML configuration option frontend.align_querier_with_step.

The following configuration options are deprecated and will be removed in Grafana Mimir 2.8:

  • The CLI flag -store.max-query-length and its respective YAML configuration option limits.max_query_length have been replaced with -querier.max-partial-query-length and limits.max_partial_query_length.
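
If you keep component arguments in scripts or manifests, the rename above can be applied mechanically. The following Python sketch is a hypothetical helper, not part of Mimir or mimirtool; the only fact it relies on is the flag mapping stated in the deprecation notice.

```python
from typing import Dict, List

# Mapping taken from the deprecation notice above.
DEPRECATED_FLAGS: Dict[str, str] = {
    "-store.max-query-length": "-querier.max-partial-query-length",
}


def migrate_args(args: List[str]) -> List[str]:
    """Return a copy of args with deprecated flag names replaced.

    Handles both the "-flag=value" and the "-flag value" forms.
    Values and unknown flags are passed through unchanged.
    """
    migrated = []
    for arg in args:
        # Split only on the first "=", so values containing "=" survive.
        name, sep, value = arg.partition("=")
        migrated.append(DEPRECATED_FLAGS.get(name, name) + sep + value)
    return migrated


if __name__ == "__main__":
    print(migrate_args(["-target=querier", "-store.max-query-length=12h"]))
```

The YAML option needs the same treatment: limits.max_query_length becomes limits.max_partial_query_length.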

The following experimental options and features are now stable:

  • The CLI flag -query-frontend.max-total-query-length and its respective YAML configuration option limits.max_total_query_length.
  • The CLI flags -distributor.request-rate-limit and -distributor.request-burst-limit and their respective YAML configuration options limits.request_rate_limit and limits.request_rate_burst.
  • The CLI flag -ingester.max-global-exemplars-per-user and its respective YAML configuration option limits.max_global_exemplars_per_user.
  • The CLI flag -ingester.tsdb-config-update-period and its respective YAML configuration option ingester.tsdb_config_update_period.
  • The API endpoint /api/v1/query_exemplars.
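
As a rough illustration of calling the now-stable exemplars endpoint, the Python sketch below only builds the request and does not send it. Several details are assumptions rather than facts from these release notes: the /prometheus path prefix, the host name, and the tenant ID are placeholders for your own setup, and the query, start and end parameters follow the Prometheus HTTP API for exemplar queries.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_exemplars_request(base_url: str, tenant: str, query: str,
                            start: int, end: int) -> Request:
    """Build (but do not send) a GET request for /api/v1/query_exemplars.

    The "/prometheus" prefix is an assumption about the deployment;
    adjust it if your installation exposes the API elsewhere.
    """
    params = urlencode({"query": query, "start": start, "end": end})
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query_exemplars?{params}"
    # X-Scope-OrgID carries the tenant ID in multi-tenant setups.
    return Request(url, headers={"X-Scope-OrgID": tenant})


if __name__ == "__main__":
    # Hypothetical host, tenant and time range (Unix seconds).
    req = build_exemplars_request("http://mimir:8080", "tenant-1", "up", 1, 2)
    print(req.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the call against a reachable instance.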

Bug fixes

  • Alertmanager: Fix template spurious deletion with relative data dir. PR 3604
  • Security: Update prometheus/exporter-toolkit for CVE-2022-46146. PR 3675
  • Security: Update golang.org/x/net for CVE-2022-41717. PR 3755
  • Debian package: Fix post-install, environment file path and user creation. PR 3720
  • Memberlist: Fix panic during Mimir startup when Mimir receives gossip message before it's ready. PR 3746
  • Update github.com/thanos-io/objstore to address issue with Multipart PUT on s3-compatible Object Storage. PR 3802 PR 3821
  • Querier: Canceled requests are no longer reported as "consistency check" failures. PR 3837 PR 3927
  • Distributor: Don't panic when metric_relabel_configs in overrides contains null element. PR 3868
  • Ingester, Compactor: Fix panic that can occur when compaction fails. PR 3955

Changelog

2.6.0

Grafana Mimir

  • [CHANGE] Querier: Introduce -querier.max-partial-query-length to limit the time range for partial queries at the querier level and deprecate -store.max-query-length. #3825 #4017
  • [CHANGE] Store-gateway: Remove experimental -blocks-storage.bucket-store.max-concurrent-reject-over-limit flag. #3706
  • [CHANGE] Ingester: If shipping is enabled block retention will now be relative to the upload time to cloud storage. If shipping is disabled block retention will be relative to the creation time of the block instead of the mintime of the last block created. #3816
  • [CHANGE] Query-frontend: Deprecated CLI flag -query-frontend.align-querier-with-step has been removed. #3982
  • [FEATURE] Store-gateway: streaming of series. The store-gateway can now stream results back to the querier instead of buffering them. This is expected to greatly reduce peak memory consumption while keeping latency the same. You can enable this feature by setting -blocks-storage.bucket-store.batch-series-size to a value in the high thousands (5000-10000). This is still an experimental feature and is subject to a changing API and instability. #3540 #3546 #3587 #3606 #3611 #3620 #3645 #3355 #3697 #3666 #3687 #3728 #3739 #3751 #3779 #3839
  • [FEATURE] Alertmanager: Added support for the Webex receiver. #3758
  • [FEATURE] Limits: Added the -validation.separate-metrics-group-label flag. This allows further separation of the cortex_discarded_samples_total metric by an additional group label - which is configured by this flag to be the value of a specific label on an incoming timeseries. Active groups are tracked and inactive groups are cleaned up on a defined interval. The maximum number of groups tracked is controlled by the -max-separate-metrics-groups-per-user flag. #3439
  • [FEATURE] Overrides-exporter: Added experimental ring support to overrides-exporter via -overrides-exporter.ring.enabled. When enabled, the ring is used to establish a leader replica for the export of limit override metrics. #3908 #3953
  • [FEATURE] Ephemeral storage (experimental): Mimir can now accept samples into "ephemeral storage". Such samples are available for querying for a short amount of time (-blocks-storage.ephemeral-tsdb.retention-period, defaults to 10 minutes), and then removed from memory. To use ephemeral storage, distributor must be configured with -distributor.ephemeral-series-enabled option. Series matching -distributor.ephemeral-series-matchers will be marked for storing into ephemeral storage in ingesters. Each tenant needs to have ephemeral storage enabled by using -ingester.max-ephemeral-series-per-user limit, which defaults to 0 (no ephemeral storage). Ingesters have new -ingester.instance-limits.max-ephemeral-series limit for total number of series in ephemeral storage across all tenants. If ingestion of samples into ephemeral storage fails, cortex_discarded_samples_total metric will use values prefixed with ephemeral- for reason label. Querying of ephemeral storage is possible by using {__mimir_storage__="ephemeral"} as metric selector. Following new metrics related to ephemeral storage are introduced: #3897 #3922 #3961 #3997 #4004
    • cortex_ingester_ephemeral_series
    • cortex_ingester_ephemeral_series_created_total
    • cortex_ingester_ephemeral_series_removed_total
    • cortex_ingester_ingested_ephemeral_samples_total
    • cortex_ingester_ingested_ephemeral_samples_failures_total
    • cortex_ingester_memory_ephemeral_users
    • cortex_ingester_queries_ephemeral_total
    • cortex_ingester_queried_ephemeral_samples
    • cortex_ingester_queried_ephemeral_series
  • [ENHANCEMENT] Added new metric thanos_shipper_last_successful_upload_time: Unix timestamp (in seconds) of the last successful TSDB block uploaded to the bucket. #3627
  • [ENHANCEMENT] Ruler: Added -ruler.alertmanager-client.tls-enabled configuration for alertmanager client. #3432 #3597
  • [ENHANCEMENT] Activity tracker logs now have component=activity-tracker label. #3556
  • [ENHANCEMENT] Distributor: remove labels with empty values #2439
  • [ENHANCEMENT] Query-frontend: track query HTTP requests in the Activity Tracker. #3561
  • [ENHANCEMENT] Store-gateway: Add experimental alternate implementation of index-header reader that does not use memory mapped files. The index-header reader is expected to improve stability of the store-gateway. You can enable this implementation with the flag -blocks-storage.bucket-store.index-header.stream-reader-enabled. #3639 #3691 #3703 #3742 #3785 #3787 #3797
  • [ENHANCEMENT] Query-scheduler: add cortex_query_scheduler_cancelled_requests_total metric to track the number of requests that are already cancelled when dequeued. #3696
  • [ENHANCEMENT] Store-gateway: add cortex_bucket_store_partitioner_extended_ranges_total metric to keep track of the ranges that the partitioner decided to overextend and merge in order to save API calls to the object storage. #3769
  • [ENHANCEMENT] Compactor: Auto-forget unhealthy compactors after ten failed ring heartbeats. #3771
  • [ENHANCEMENT] Ruler: change default value of -ruler.for-grace-period from 10m to 2m and update help text. The new default value reflects how we operate Mimir at Grafana Labs. #3817
  • [ENHANCEMENT] Ingester: Added experimental flags to force usage of postings for matchers cache. These flags will be removed in the future and it's not recommended to change them. #3823
    • -blocks-storage.tsdb.head-postings-for-matchers-cache-ttl
    • -blocks-storage.tsdb.head-postings-for-matchers-cache-size
    • -blocks-storage.tsdb.head-postings-for-matchers-cache-force
  • [ENHANCEMENT] Ingester: Improved series selection performance when some of the matchers do not match any series. #3827
  • [ENHANCEMENT] Alertmanager: Add new template function tenantID returning the ID of the tenant owning the alert. #3758
  • [ENHANCEMENT] Alertmanager: Add additional template function grafanaExploreURL returning URL to grafana explore with range query. #3849
  • [ENHANCEMENT] Reduce overhead of debug logging when filtered out. #3875
  • [ENHANCEMENT] Update Docker base images from alpine:3.16.2 to alpine:3.17.1. #3898
  • [ENHANCEMENT] Ingester: Add new /ingester/tsdb_metrics endpoint to return tenant-specific TSDB metrics. #3923
  • [ENHANCEMENT] Query-frontend: CLI flag -query-frontend.max-total-query-length and its associated YAML configuration is now stable. #3882
  • [ENHANCEMENT] Ruler: rule groups now support optional and experimental align_evaluation_time_on_interval field, which causes all evaluations to happen on interval-aligned timestamp. #4013
  • [ENHANCEMENT] Query-scheduler: ring-based service discovery is now stable. #4028
  • [BUGFIX] Log the names of services that are not yet running rather than unsupported value type when calling /ready and some services are not running. #3625
  • [BUGFIX] Alertmanager: Fix template spurious deletion with relative data dir. #3604
  • [BUGFIX] Security: update prometheus/exporter-toolkit for CVE-2022-46146. #3675
  • [BUGFIX] Security: update golang.org/x/net for CVE-2022-41717. #3755
  • [BUGFIX] Debian package: Fix post-install, environment file path and user creation. #3720
  • [BUGFIX] memberlist: Fix panic during Mimir startup when Mimir receives gossip message before it's ready. #3746
  • [BUGFIX] Store-gateway: fix cortex_bucket_store_partitioner_requested_bytes_total metric to not double count overlapping ranges. #3769
  • [BUGFIX] Update github.com/thanos-io/objstore to address issue with Multipart PUT on s3-compatible Object Storage. #3802 #3821
  • [BUGFIX] Distributor, Query-scheduler: Make sure ring metrics include a cortex_ prefix as expected by dashboards. #3809
  • [BUGFIX] Querier: canceled requests are no longer reported as "consistency check" failures. #3837 #3927
  • [BUGFIX] Distributor: don't panic when metric_relabel_configs in overrides contains null element. #3868
  • [BUGFIX] Distributor: don't panic when OTLP histograms don't have any buckets. #3853
  • [BUGFIX] Ingester, Compactor: fix panic that can occur when compaction fails. #3955
  • [BUGFIX] Store-gateway: return Canceled rather than Aborted error when the calling querier cancels the request. #4007
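
The align_evaluation_time_on_interval entry above refers to evaluations landing on interval-aligned timestamps. One plausible reading of that phrase, sketched in Python with a hypothetical function name, is that each evaluation timestamp is a whole multiple of the rule group's interval. This is an illustration of the arithmetic only, not the ruler's actual scheduling code.

```python
def next_aligned_evaluation(now: int, interval: int) -> int:
    """Return the first timestamp >= now that is a whole multiple of interval.

    Both arguments are in seconds (Unix time for now).
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    remainder = now % interval
    if remainder == 0:
        return now  # already aligned
    return now + (interval - remainder)


if __name__ == "__main__":
    # With a 60s interval, an evaluation due at t=125 moves to t=180.
    print(next_aligned_evaluation(125, 60))
```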

Mixin

  • [ENHANCEMENT] Alerts: Added MimirIngesterInstanceHasNoTenants alert that fires when an ingester replica is not receiving write requests for any tenant. #3681
  • [ENHANCEMENT] Alerts: Extended MimirAllocatingTooMuchMemory to check read-write deployment containers. #3710
  • [ENHANCEMENT] Alerts: Added MimirAlertmanagerInstanceHasNoTenants alert that fires when an alertmanager instance owns no tenants. #3826
  • [ENHANCEMENT] Alerts: Added MimirRulerInstanceHasNoRuleGroups alert that fires when a ruler replica is not assigned any rule group to evaluate. #3723
  • [ENHANCEMENT] Support for baremetal deployment for alerts and scaling recording rules. #3719
  • [ENHANCEMENT] Dashboards: querier autoscaling now supports multiple scaled objects (configurable via $._config.autoscale.querier.hpa_name). #3962
  • [BUGFIX] Alerts: Fixed MimirIngesterRestarts alert when Mimir is deployed in read-write mode. #3716
  • [BUGFIX] Alerts: Fixed MimirIngesterHasNotShippedBlocks and MimirIngesterHasNotShippedBlocksSinceStart alerts for when Mimir is deployed in read-write or monolithic modes and updated them to use new thanos_shipper_last_successful_upload_time metric. #3627
  • [BUGFIX] Alerts: Fixed MimirMemoryMapAreasTooHigh alert when Mimir is deployed in read-write mode. #3626
  • [BUGFIX] Alerts: Fixed MimirCompactorSkippedBlocksWithOutOfOrderChunks matching on non-existent label. #3628
  • [BUGFIX] Dashboards: Fix Rollout Progress dashboard incorrectly using Gateway metrics when Gateway was not enabled. #3709
  • [BUGFIX] Tenants dashboard: Make it compatible with all deployment types. #3754
  • [BUGFIX] Alerts: Fixed MimirCompactorHasNotUploadedBlocks to not fire if compactor has nothing to do. #3793
  • [BUGFIX] Alerts: Fixed MimirAutoscalerNotActive to not fire if scaling metric is 0, to avoid false positives on scaled objects with 0 min replicas. #3999

Jsonnet

  • [CHANGE] Replaced the deprecated policy/v1beta1 with policy/v1 when configuring a PodDisruptionBudget for read-write deployment mode. #3811
  • [CHANGE] Removed -server.http-write-timeout default option value from querier and query-frontend, as it defaults to a higher value in the code now, and cannot be lower than -querier.timeout. #3836
  • [CHANGE] Replaced -store.max-query-length with -query-frontend.max-total-query-length in the query-frontend config. #3879
  • [CHANGE] Changed default mimir_backend_data_disk_size from 100Gi to 250Gi. #3894
  • [ENHANCEMENT] Update rollout-operator to v0.2.0. #3624
  • [ENHANCEMENT] Add user_24M and user_32M classes to operations config. #3367
  • [ENHANCEMENT] Update memcached image from memcached:1.6.16-alpine to memcached:1.6.17-alpine. #3914
  • [ENHANCEMENT] Allow configuring the ring for overrides-exporter. #3995
  • [BUGFIX] Apply ingesters and store-gateways per-zone CLI flags overrides to read-write deployment mode too. #3766
  • [BUGFIX] Apply overrides-exporter CLI flags to mimir-backend when running Mimir in read-write deployment mode. #3790
  • [BUGFIX] Fixed mimir-write and mimir-read Kubernetes service to correctly balance requests among pods. #3855 #3864 #3906
  • [BUGFIX] Fixed ruler-query-frontend and mimir-read gRPC server configuration to force clients to periodically re-resolve the backend addresses. #3862
  • [BUGFIX] Fixed mimir-read CLI flags to ensure query-frontend configuration takes precedence over querier configuration. #3877

Mimirtool

  • [ENHANCEMENT] Update mimirtool config convert to work with Mimir 2.4, 2.5, 2.6 changes. #3952
  • [ENHANCEMENT] Mimirtool is now available to install through Homebrew with brew install mimirtool. #3776
  • [ENHANCEMENT] Added --concurrency to mimirtool rules sync command. #3996
  • [BUGFIX] Fix summary output from mimirtool rules sync to display correct number of groups created and updated. #3918

Documentation

  • [BUGFIX] Querier: Remove assertion that the -querier.max-concurrent flag must also be set for the query-frontend. #3678
  • [ENHANCEMENT] Update migration from cortex documentation. #3662
  • [ENHANCEMENT] Query-scheduler: documented how to migrate from DNS-based to ring-based service discovery. #4028

Tools

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.5.0...mimir-2.6.0

mimir - 2.6.0-rc.0

Published by 56quarters over 1 year ago

This release contains 255 PRs from 40 authors, including new contributors breadly7, bubu11e, Đurica Yuri Nikolić, Felix Beuke, Jack, klagroix, Martin Chodur, Ørjan Ommundsen, Sascha Sternheim, Wu Zhiyuan. Thank you!

Grafana Mimir version 2.6.0-rc.0 release notes

Grafana Labs is excited to announce version 2.6.0-rc.0 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Features and enhancements

  • Lower memory usage in store-gateway by streaming series results
    The store-gateway can now stream results back to the querier instead of buffering them. This is expected to greatly reduce peak memory consumption while keeping latency the same. This is still an experimental feature, but Grafana Labs is already running it in production with no known issues. To enable it, set the -blocks-storage.bucket-store.batch-series-size configuration option; if you want to try it out, we recommend a value of 5000.

  • Improved stability in store-gateway by removing mmap usage
    The store-gateway can now use an alternate code path to read index-headers that does not use memory-mapped files. This is expected to improve stability of the store-gateway. This is still an experimental feature, but Grafana Labs is already running it in production with no known issues. To enable it, set -blocks-storage.bucket-store.index-header.stream-reader-enabled=true.

Alertmanager improvements

  • Webex support Alertmanager can now use Webex to send alerts.

  • tenantID template function A new template function tenantID, returning the ID of the tenant owning the alert, has been added.

  • grafanaExploreURL template function A new template function grafanaExploreURL, returning the URL to the Grafana explore page with range query, has been added.

Helm chart improvements

The Grafana Mimir and Grafana Enterprise Metrics Helm chart is now released independently. See the corresponding documentation for more information.

Important changes

In Grafana Mimir 2.6 we have removed the following previously deprecated or experimental configuration options:

  • The CLI flag -blocks-storage.bucket-store.max-concurrent-reject-over-limit and its respective YAML configuration option blocks_storage.bucket_store.max_concurrent_reject_over_limit.
  • The CLI flag -query-frontend.align-querier-with-step and its respective YAML configuration option frontend.align_querier_with_step.

The following configuration options are deprecated and will be removed in Grafana Mimir 2.8:

  • The CLI flag -store.max-query-length and its respective YAML configuration option limits.max_query_length have been replaced with -querier.max-partial-query-length and limits.max_partial_query_length.

The following experimental options and features are now stable:

  • The CLI flag -query-frontend.max-total-query-length and its respective YAML configuration option limits.max_total_query_length.
  • The CLI flags -distributor.request-rate-limit and -distributor.request-burst-limit and their respective YAML configuration options limits.request_rate_limit and limits.request_rate_burst.
  • The CLI flag -ingester.max-global-exemplars-per-user and its respective YAML configuration option limits.max_global_exemplars_per_user.
  • The CLI flag -ingester.tsdb-config-update-period and its respective YAML configuration option ingester.tsdb_config_update_period.
  • The API endpoint /api/v1/query_exemplars.

Bug fixes

  • Alertmanager: Fix template spurious deletion with relative data dir. PR 3604
  • Security: Update prometheus/exporter-toolkit for CVE-2022-46146. PR 3675
  • Security: Update golang.org/x/net for CVE-2022-41717. PR 3755
  • Debian package: Fix post-install, environment file path and user creation. PR 3720
  • Memberlist: Fix panic during Mimir startup when Mimir receives gossip message before it's ready. PR 3746
  • Update github.com/thanos-io/objstore to address issue with Multipart PUT on s3-compatible Object Storage. PR 3802 PR 3821
  • Querier: Canceled requests are no longer reported as "consistency check" failures. PR 3837 PR 3927
  • Distributor: Don't panic when metric_relabel_configs in overrides contains null element. PR 3868
  • Ingester, Compactor: Fix panic that can occur when compaction fails. PR 3955

Changelog

2.6.0-rc.0

Grafana Mimir

  • [CHANGE] Querier: Introduce -querier.max-partial-query-length to limit the time range for partial queries at the querier level and deprecate -store.max-query-length. #3825 #4017
  • [CHANGE] Store-gateway: Remove experimental -blocks-storage.bucket-store.max-concurrent-reject-over-limit flag. #3706
  • [CHANGE] Ingester: If shipping is enabled block retention will now be relative to the upload time to cloud storage. If shipping is disabled block retention will be relative to the creation time of the block instead of the mintime of the last block created. #3816
  • [CHANGE] Query-frontend: Deprecated CLI flag -query-frontend.align-querier-with-step has been removed. #3982
  • [FEATURE] Store-gateway: streaming of series. The store-gateway can now stream results back to the querier instead of buffering them. This is expected to greatly reduce peak memory consumption while keeping latency the same. You can enable this feature by setting -blocks-storage.bucket-store.batch-series-size to a value in the high thousands (5000-10000). This is still an experimental feature and is subject to a changing API and instability. #3540 #3546 #3587 #3606 #3611 #3620 #3645 #3355 #3697 #3666 #3687 #3728 #3739 #3751 #3779 #3839
  • [FEATURE] Alertmanager: Added support for the Webex receiver. #3758
  • [FEATURE] Limits: Added the -validation.separate-metrics-group-label flag. This allows further separation of the cortex_discarded_samples_total metric by an additional group label - which is configured by this flag to be the value of a specific label on an incoming timeseries. Active groups are tracked and inactive groups are cleaned up on a defined interval. The maximum number of groups tracked is controlled by the -max-separate-metrics-groups-per-user flag. #3439
  • [FEATURE] Overrides-exporter: Added experimental ring support to overrides-exporter via -overrides-exporter.ring.enabled. When enabled, the ring is used to establish a leader replica for the export of limit override metrics. #3908 #3953
  • [FEATURE] Ephemeral storage (experimental): Mimir can now accept samples into "ephemeral storage". Such samples are available for querying for a short amount of time (-blocks-storage.ephemeral-tsdb.retention-period, defaults to 10 minutes), and then removed from memory. To use ephemeral storage, distributor must be configured with -distributor.ephemeral-series-enabled option. Series matching -distributor.ephemeral-series-matchers will be marked for storing into ephemeral storage in ingesters. Each tenant needs to have ephemeral storage enabled by using -ingester.max-ephemeral-series-per-user limit, which defaults to 0 (no ephemeral storage). Ingesters have new -ingester.instance-limits.max-ephemeral-series limit for total number of series in ephemeral storage across all tenants. If ingestion of samples into ephemeral storage fails, cortex_discarded_samples_total metric will use values prefixed with ephemeral- for reason label. Querying of ephemeral storage is possible by using {__mimir_storage__="ephemeral"} as metric selector. Following new metrics related to ephemeral storage are introduced: #3897 #3922 #3961 #3997 #4004
    • cortex_ingester_ephemeral_series
    • cortex_ingester_ephemeral_series_created_total
    • cortex_ingester_ephemeral_series_removed_total
    • cortex_ingester_ingested_ephemeral_samples_total
    • cortex_ingester_ingested_ephemeral_samples_failures_total
    • cortex_ingester_memory_ephemeral_users
    • cortex_ingester_queries_ephemeral_total
    • cortex_ingester_queried_ephemeral_samples
    • cortex_ingester_queried_ephemeral_series
  • [ENHANCEMENT] Added new metric thanos_shipper_last_successful_upload_time: Unix timestamp (in seconds) of the last successful TSDB block uploaded to the bucket. #3627
  • [ENHANCEMENT] Ruler: Added -ruler.alertmanager-client.tls-enabled configuration for alertmanager client. #3432 #3597
  • [ENHANCEMENT] Activity tracker logs now have component=activity-tracker label. #3556
  • [ENHANCEMENT] Distributor: remove labels with empty values #2439
  • [ENHANCEMENT] Query-frontend: track query HTTP requests in the Activity Tracker. #3561
  • [ENHANCEMENT] Store-gateway: Add experimental alternate implementation of index-header reader that does not use memory mapped files. The index-header reader is expected to improve stability of the store-gateway. You can enable this implementation with the flag -blocks-storage.bucket-store.index-header.stream-reader-enabled. #3639 #3691 #3703 #3742 #3785 #3787 #3797
  • [ENHANCEMENT] Query-scheduler: add cortex_query_scheduler_cancelled_requests_total metric to track the number of requests that are already cancelled when dequeued. #3696
  • [ENHANCEMENT] Store-gateway: add cortex_bucket_store_partitioner_extended_ranges_total metric to keep track of the ranges that the partitioner decided to overextend and merge in order to save API calls to the object storage. #3769
  • [ENHANCEMENT] Compactor: Auto-forget unhealthy compactors after ten failed ring heartbeats. #3771
  • [ENHANCEMENT] Ruler: change default value of -ruler.for-grace-period from 10m to 2m and update help text. The new default value reflects how we operate Mimir at Grafana Labs. #3817
  • [ENHANCEMENT] Ingester: Added experimental flags to force usage of postings for matchers cache. These flags will be removed in the future and it's not recommended to change them. #3823
    • -blocks-storage.tsdb.head-postings-for-matchers-cache-ttl
    • -blocks-storage.tsdb.head-postings-for-matchers-cache-size
    • -blocks-storage.tsdb.head-postings-for-matchers-cache-force
  • [ENHANCEMENT] Ingester: Improved series selection performance when some of the matchers do not match any series. #3827
  • [ENHANCEMENT] Alertmanager: Add new template function tenantID returning the ID of the tenant owning the alert. #3758
  • [ENHANCEMENT] Alertmanager: Add additional template function grafanaExploreURL returning URL to grafana explore with range query. #3849
  • [ENHANCEMENT] Reduce overhead of debug logging when filtered out. #3875
  • [ENHANCEMENT] Update Docker base images from alpine:3.16.2 to alpine:3.17.1. #3898
  • [ENHANCEMENT] Ingester: Add new /ingester/tsdb_metrics endpoint to return tenant-specific TSDB metrics. #3923
  • [ENHANCEMENT] Query-frontend: CLI flag -query-frontend.max-total-query-length and its associated YAML configuration is now stable. #3882
  • [ENHANCEMENT] Ruler: rule groups now support optional and experimental align_evaluation_time_on_interval field, which causes all evaluations to happen on interval-aligned timestamp. #4013
  • [ENHANCEMENT] Query-scheduler: ring-based service discovery is now stable. #4028
  • [BUGFIX] Log the names of services that are not yet running rather than unsupported value type when calling /ready and some services are not running. #3625
  • [BUGFIX] Alertmanager: Fix template spurious deletion with relative data dir. #3604
  • [BUGFIX] Security: update prometheus/exporter-toolkit for CVE-2022-46146. #3675
  • [BUGFIX] Security: update golang.org/x/net for CVE-2022-41717. #3755
  • [BUGFIX] Debian package: Fix post-install, environment file path and user creation. #3720
  • [BUGFIX] memberlist: Fix panic during Mimir startup when Mimir receives gossip message before it's ready. #3746
  • [BUGFIX] Store-gateway: fix cortex_bucket_store_partitioner_requested_bytes_total metric to not double count overlapping ranges. #3769
  • [BUGFIX] Update github.com/thanos-io/objstore to address issue with Multipart PUT on s3-compatible Object Storage. #3802 #3821
  • [BUGFIX] Distributor, Query-scheduler: Make sure ring metrics include a cortex_ prefix as expected by dashboards. #3809
  • [BUGFIX] Querier: canceled requests are no longer reported as "consistency check" failures. #3837 #3927
  • [BUGFIX] Distributor: don't panic when metric_relabel_configs in overrides contains null element. #3868
  • [BUGFIX] Distributor: don't panic when OTLP histograms don't have any buckets. #3853
  • [BUGFIX] Ingester, Compactor: fix panic that can occur when compaction fails. #3955
  • [BUGFIX] Store-gateway: return Canceled rather than Aborted error when the calling querier cancels the request. #4007
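
Two store-gateway entries above concern the partitioner: extended ranges are merged to save object storage calls, and requested bytes should not double count overlapping ranges. The Python sketch below shows a generic gap-based merge under stated assumptions. The function names, the half-open range convention and the max_gap parameter are hypothetical, and the code is not taken from Mimir.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def merge_ranges(ranges: List[Range], max_gap: int) -> Tuple[List[Range], int]:
    """Merge ranges that overlap or lie at most max_gap bytes apart.

    Returns the merged ranges and the number of merges that had to
    bridge a gap, meaning unrequested bytes are fetched to save a call.
    """
    merged: List[Range] = []
    extended = 0
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            if start > prev_end:
                extended += 1  # a real gap was bridged
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged, extended


def requested_bytes(ranges: List[Range]) -> int:
    """Total requested bytes, counting overlapping regions only once."""
    merged, _ = merge_ranges(ranges, max_gap=0)
    return sum(end - start for start, end in merged)


if __name__ == "__main__":
    wanted = [(0, 10), (5, 20), (25, 30), (100, 110)]
    print(merge_ranges(wanted, max_gap=10))
    print(requested_bytes(wanted))
```

In this example four requested ranges collapse into two fetches, one of which bridges a 5-byte gap, while the requested total stays at 35 bytes because the overlap between the first two ranges is counted once.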

Mixin

  • [ENHANCEMENT] Alerts: Added MimirIngesterInstanceHasNoTenants alert that fires when an ingester replica is not receiving write requests for any tenant. #3681
  • [ENHANCEMENT] Alerts: Extended MimirAllocatingTooMuchMemory to check read-write deployment containers. #3710
  • [ENHANCEMENT] Alerts: Added MimirAlertmanagerInstanceHasNoTenants alert that fires when an alertmanager instance owns no tenants. #3826
  • [ENHANCEMENT] Alerts: Added MimirRulerInstanceHasNoRuleGroups alert that fires when a ruler replica is not assigned any rule group to evaluate. #3723
  • [ENHANCEMENT] Support for baremetal deployment for alerts and scaling recording rules. #3719
  • [ENHANCEMENT] Dashboards: querier autoscaling now supports multiple scaled objects (configurable via $._config.autoscale.querier.hpa_name). #3962
  • [BUGFIX] Alerts: Fixed MimirIngesterRestarts alert when Mimir is deployed in read-write mode. #3716
  • [BUGFIX] Alerts: Fixed MimirIngesterHasNotShippedBlocks and MimirIngesterHasNotShippedBlocksSinceStart alerts for when Mimir is deployed in read-write or monolithic modes and updated them to use new thanos_shipper_last_successful_upload_time metric. #3627
  • [BUGFIX] Alerts: Fixed MimirMemoryMapAreasTooHigh alert when Mimir is deployed in read-write mode. #3626
  • [BUGFIX] Alerts: Fixed MimirCompactorSkippedBlocksWithOutOfOrderChunks matching on non-existent label. #3628
  • [BUGFIX] Dashboards: Fix Rollout Progress dashboard incorrectly using Gateway metrics when Gateway was not enabled. #3709
  • [BUGFIX] Tenants dashboard: Make it compatible with all deployment types. #3754
  • [BUGFIX] Alerts: Fixed MimirCompactorHasNotUploadedBlocks to not fire if compactor has nothing to do. #3793
  • [BUGFIX] Alerts: Fixed MimirAutoscalerNotActive to not fire if scaling metric is 0, to avoid false positives on scaled objects with 0 min replicas. #3999

Jsonnet

  • [CHANGE] Replaced the deprecated policy/v1beta1 with policy/v1 when configuring a PodDisruptionBudget for read-write deployment mode. #3811
  • [CHANGE] Removed -server.http-write-timeout default option value from querier and query-frontend, as it defaults to a higher value in the code now, and cannot be lower than -querier.timeout. #3836
  • [CHANGE] Replaced -store.max-query-length with -query-frontend.max-total-query-length in the query-frontend config. #3879
  • [CHANGE] Changed default mimir_backend_data_disk_size from 100Gi to 250Gi. #3894
  • [ENHANCEMENT] Update rollout-operator to v0.2.0. #3624
  • [ENHANCEMENT] Add user_24M and user_32M classes to operations config. #3367
  • [ENHANCEMENT] Update memcached image from memcached:1.6.16-alpine to memcached:1.6.17-alpine. #3914
  • [ENHANCEMENT] Allow configuring the ring for overrides-exporter. #3995
  • [BUGFIX] Apply ingesters and store-gateways per-zone CLI flags overrides to read-write deployment mode too. #3766
  • [BUGFIX] Apply overrides-exporter CLI flags to mimir-backend when running Mimir in read-write deployment mode. #3790
  • [BUGFIX] Fixed mimir-write and mimir-read Kubernetes service to correctly balance requests among pods. #3855 #3864 #3906
  • [BUGFIX] Fixed ruler-query-frontend and mimir-read gRPC server configuration to force clients to periodically re-resolve the backend addresses. #3862
  • [BUGFIX] Fixed mimir-read CLI flags to ensure query-frontend configuration takes precedence over querier configuration. #3877

Mimirtool

  • [ENHANCEMENT] Update mimirtool config convert to work with Mimir 2.4, 2.5, 2.6 changes. #3952
  • [ENHANCEMENT] Mimirtool is now available to install through Homebrew with brew install mimirtool. #3776
  • [ENHANCEMENT] Added --concurrency to mimirtool rules sync command. #3996
  • [BUGFIX] Fix summary output from mimirtool rules sync to display correct number of groups created and updated. #3918

Documentation

  • [BUGFIX] Querier: Remove assertion that the -querier.max-concurrent flag must also be set for the query-frontend. #3678
  • [ENHANCEMENT] Update the documentation on migrating from Cortex. #3662
  • [ENHANCEMENT] Query-scheduler: documented how to migrate from DNS-based to ring-based service discovery. #4028

Tools

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.5.0...mimir-2.6.0-rc.0

mimir - 2.5.0

Published by pstibrany almost 2 years ago

This release contains 230 PRs from 43 authors, including new contributors Aldo D'Aquino, Anıl Mısırlıoğlu, Charles Korn, Danny Staple, Dylan Crees, Eduardo Silvi, FG, Jesse Weaver, KarlisAG, Leegin-darknight, Rohan Kumar, Wille Faler, Y.Horie, manohar-koukuntla, paulroche, songjiayang, Éamon Ryan. Thank you!

Grafana Mimir version 2.5 release notes

Grafana Labs is excited to announce version 2.5 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Features and enhancements

  • Alertmanager Discord support
    Alertmanager can now be configured to send alerts in Discord channels.

  • Configurable TLS minimum version and cipher suites
    We added the flags -server.tls-min-version and -server.tls-cipher-suites that can be used to define the minimum TLS version and the supported cipher suites in all HTTP and gRPC servers in Mimir.

  • Lower memory usage in store-gateway, ingester and alertmanager
    We made various changes related to how index lookups are performed and how the active series custom trackers are implemented, which results in better performance and lower overall memory usage in the store-gateway and ingester.
    We also optimized the alertmanager, which results in a 50% reduction in memory usage in use cases with larger numbers of tenants.

  • Improved Mimir dashboards
    We added two new dashboards named Mimir / Overview resources and Mimir / Overview networking. Furthermore, we have made various improvements to the following existing dashboards:

    • Mimir / Overview: Add "remote read", "metadata", and "exemplar" queries.
    • Mimir / Writes: Add optional row about the distributor's new forwarding feature.
    • Mimir / Tenants: Add insights into the read path.
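
As a rough illustration of the TLS flags described above, a Jsonnet deployment might pass them to a component as shown below. This is a sketch under assumptions, not a verified configuration: it assumes the component's CLI flags are exposed as an extendable querier_args object, and the values were picked only as examples. Check the configuration reference for the values your Mimir version supports.

```jsonnet
{
  // Assumption: the querier's CLI flags can be extended through querier_args.
  querier_args+:: {
    // Example value: do not accept TLS versions older than 1.2.
    'server.tls-min-version': 'VersionTLS12',
    // Example value: comma-separated list of allowed cipher suite names.
    'server.tls-cipher-suites': 'TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384',
  },
}
```

Because the flags apply to every HTTP and gRPC server, the same two entries would presumably need to be repeated for each component that should enforce them.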

Helm chart improvements

  • Zone aware replication
    Helm now supports deploying the ingesters and store-gateways across different availability zones. Replication is also zone-aware, so multiple instances in one zone can fail without any service interruption. Rollouts can also be performed faster, because many instances in each zone can be restarted together instead of all restarting in sequence.

    This is a breaking change. For details on how to upgrade, review the Helm changelog.

  • Running without root privileges
    All Mimir, GEM and Agent processes no longer require root privileges to run.

  • Unified reverse proxy (gateway) configuration for Mimir and GEM
    This change allows for an easier upgrade path from Mimir to GEM, without any downtime. The unified configuration also makes it possible to autoscale the GEM gateway pods and it supports OpenShift Route. The change also deprecates the nginx section in the configuration. The section will be removed in release 7.0.0.

  • Updated MinIO
    The MinIO sub-chart was updated from 4.x to 5.0.0. Note that this update inherits a breaking change because the MinIO gateway mode was removed.

  • Updated sizing plans
    We updated our sizing plans to better reflect how we recommend running Mimir and GEM in production. Note that this includes a breaking change for users of the "small" plan. For more details, see the Helm changelog.

  • Various quality of life improvements

    • Rollout strategies without downtime
    • Read path and compactor configuration refresh, providing better default settings
    • OTLP ingestion support in the Nginx configuration
    • A default configuration for alertmanager, so the user interface and the sending of alerts from the ruler work out of the box

Bug fixes

  • Flusher: Added Overrides as a dependency to prevent panics when starting with -target=flusher. PR 3151
  • Query-frontend: properly close gRPC streams to the query-scheduler to stop memory and goroutine leaks. PR 3302
  • Ruler: persist evaluation delay configured in the rulegroup. PR 3392
  • Fix panics in OTLP ingest path when parse errors occur. PR 3538

Changelog

2.5.0

Grafana Mimir

  • [CHANGE] Flag -azure.msi-resource is now ignored, and will be removed in Mimir 2.7. This setting is now made automatically by Azure. #2682
  • [CHANGE] Experimental flag -blocks-storage.tsdb.out-of-order-capacity-min has been removed. #3261
  • [CHANGE] Distributor: Wrap errors from pushing to ingesters with useful context, for example clarifying timeouts. #3307
  • [CHANGE] The default value of -server.http-write-timeout has changed from 30s to 2m. #3346
  • [CHANGE] Reduce period of health checks in connection pools for querier->store-gateway, ruler->ruler, and alertmanager->alertmanager clients to 10s. This reduces the time to fail a gRPC call when the remote stops responding. #3168
  • [CHANGE] Hide TSDB block ranges period config from doc and mark it experimental. #3518
  • [FEATURE] Alertmanager: added Discord support. #3309
  • [ENHANCEMENT] Added -server.tls-min-version and -server.tls-cipher-suites flags to configure cipher suites and min TLS version supported by HTTP and gRPC servers. #2898
  • [ENHANCEMENT] Distributor: Add age filter to forwarding functionality, to not forward samples which are older than defined duration. If such samples are not ingested, cortex_discarded_samples_total{reason="forwarded-sample-too-old"} is increased. #3049 #3113
  • [ENHANCEMENT] Store-gateway: Reduce memory allocation when generating ids in index cache. #3179
  • [ENHANCEMENT] Query-frontend: truncate queries based on the configured creation grace period (--validation.create-grace-period) to avoid querying too far into the future. #3172
  • [ENHANCEMENT] Ingester: Reduce activity tracker memory allocation. #3203
  • [ENHANCEMENT] Query-frontend: Log more detailed information in the case of a failed query. #3190
  • [ENHANCEMENT] Added -usage-stats.installation-mode configuration to track the installation mode via the anonymous usage statistics. #3244
  • [ENHANCEMENT] Compactor: Add new cortex_compactor_block_max_time_delta_seconds histogram for detecting if compaction of blocks is lagging behind. #3240 #3429
  • [ENHANCEMENT] Ingester: reduced the memory footprint of active series custom trackers. #2568
  • [ENHANCEMENT] Distributor: Include X-Scope-OrgId header in requests forwarded to configured forwarding endpoint. #3283 #3385
  • [ENHANCEMENT] Alertmanager: reduced memory utilization in Mimir clusters with a large number of tenants. #3309
  • [ENHANCEMENT] Add experimental flag -shutdown-delay to allow components to wait after receiving SIGTERM and before stopping. During this time the component returns 503 from the /ready endpoint. #3298
  • [ENHANCEMENT] Go: update to go 1.19.3. #3371
  • [ENHANCEMENT] Alerts: added RulerRemoteEvaluationFailing alert, firing when communication between ruler and frontend fails in remote operational mode. #3177 #3389
  • [ENHANCEMENT] Clarify which S3 signature versions are supported in the error "unsupported signature version". #3376
  • [ENHANCEMENT] Store-gateway: improved index header reading performance. #3393 #3397 #3436
  • [ENHANCEMENT] Store-gateway: improved performance of series matching. #3391
  • [ENHANCEMENT] Move the validation of incoming series before the distributor's forwarding functionality, so that we don't forward invalid series. #3386 #3458
  • [ENHANCEMENT] S3 bucket configuration now validates that the endpoint does not have the bucket name prefix. #3414
  • [ENHANCEMENT] Query-frontend: added "fetched index bytes" to query statistics, so that the statistics contain the total bytes read by store-gateways from TSDB block indexes. #3206
  • [ENHANCEMENT] Distributor: push wrapper should only receive unforwarded samples. #2980
  • [BUGFIX] Flusher: Add Overrides as a dependency to prevent panics when starting with -target=flusher. #3151
  • [BUGFIX] Updated golang.org/x/text dependency to fix CVE-2022-32149. #3285
  • [BUGFIX] Query-frontend: properly close gRPC streams to the query-scheduler to stop memory and goroutine leaks. #3302
  • [BUGFIX] Ruler: persist evaluation delay configured in the rulegroup. #3392
  • [BUGFIX] Ring status pages: show 100% ownership as "100%", not "1e+02%". #3435
  • [BUGFIX] Fix panics in OTLP ingest path when parse errors exist. #3538

Mixin

  • [CHANGE] Alerts: Change MimirSchedulerQueriesStuck for time to 7 minutes to account for the time it takes for HPA to scale up. #3223
  • [CHANGE] Dashboards: Removed the Querier > Stages panel from the Mimir / Queries dashboard. #3311
  • [CHANGE] Configuration: The format of the autoscaling section of the configuration has changed to support more components. #3378
    • Instead of specific config variables for each component, they are listed in a dictionary. For example, autoscaling.querier_enabled becomes autoscaling.querier.enabled.
  • [FEATURE] Dashboards: Added "Mimir / Overview resources" dashboard, providing a high-level view of a Mimir cluster's resource utilization. #3481
  • [FEATURE] Dashboards: Added "Mimir / Overview networking" dashboard, providing a high-level view of a Mimir cluster's network bandwidth, inflight requests and TCP connections. #3487
  • [FEATURE] Compile baremetal mixin along k8s mixin. #3162 #3514
  • [ENHANCEMENT] Alerts: Add MimirRingMembersMismatch firing when a component does not have the expected number of running jobs. #2404
  • [ENHANCEMENT] Dashboards: Add optional row about the Distributor's metric forwarding feature to the Mimir / Writes dashboard. #3182 #3394 #3394 #3461
  • [ENHANCEMENT] Dashboards: Remove the "Instance Mapper" row from the "Alertmanager Resources Dashboard". This is a Grafana Cloud specific service and not relevant for external users. #3152
  • [ENHANCEMENT] Dashboards: Add "remote read", "metadata", and "exemplar" queries to "Mimir / Overview" dashboard. #3245
  • [ENHANCEMENT] Dashboards: Use non-red colors for non-error series in the "Mimir / Overview" dashboard. #3246
  • [ENHANCEMENT] Dashboards: Add support for multi-zone deployments to the experimental read-write deployment mode. #3256
  • [ENHANCEMENT] Dashboards: If enabled, add new row to the Mimir / Writes dashboard for distributor autoscaling metrics. #3378
  • [ENHANCEMENT] Dashboards: Add read path insights row to the "Mimir / Tenants" dashboard. #3326
  • [ENHANCEMENT] Alerts: Add runbook urls for alerts. #3452
  • [ENHANCEMENT] Configuration: Make it possible to configure namespace label, job label, and job prefix. #3482
  • [ENHANCEMENT] Dashboards: improved resources and networking dashboards to work with read-write deployment mode too. #3497 #3504 #3519 #3531
  • [ENHANCEMENT] Alerts: Added "MimirDistributorForwardingErrorRate" alert, which fires on high error rates in the distributor’s forwarding feature. #3200
  • [ENHANCEMENT] Improve phrasing in Overview dashboard. #3488
  • [BUGFIX] Dashboards: Fix legend showing persistentvolumeclaim when using deployment_type=baremetal for Disk space utilization panels. #3173 #3184
  • [BUGFIX] Alerts: Fixed MimirGossipMembersMismatch alert when Mimir is deployed in read-write mode. #3489
  • [BUGFIX] Dashboards: Remove "Inflight requests" from object store panels because the panel is not tracking the inflight requests to object storage. #3521
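
The autoscaling configuration change listed above (#3378) could look like this in a mixin configuration override. This is a minimal sketch, assuming the settings are overridden under _config and that only the querier entry is being enabled. The old key is shown as a comment for comparison.

```jsonnet
{
  _config+:: {
    autoscaling+: {
      // Old format (no longer used): querier_enabled: true,
      // New format: one dictionary entry per component.
      querier+: {
        enabled: true,
      },
    },
  },
}
```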

Jsonnet

  • [CHANGE] Replaced the deprecated policy/v1beta1 with policy/v1 when configuring a PodDisruptionBudget. #3284
  • [CHANGE] Common storage configuration is now used to configure object storage in all components. This is a breaking change in terms of Jsonnet manifests and also a CLI flag update for components that use object storage, so it will require a rollout of those components. The changes include: #3257
    • blocks_storage_backend was renamed to storage_backend and is now used as the common storage backend for all components.
      • So were the related blocks_storage_azure_account_(name|key) and blocks_storage_s3_endpoint configurations.
    • storage_s3_endpoint is now rendered by default using the aws_region configuration instead of a hardcoded us-east-1.
    • ruler_client_type and alertmanager_client_type were renamed to ruler_storage_backend and alertmanager_storage_backend respectively, and their corresponding CLI flags won't be rendered unless explicitly set to a value different from the one in storage_backend (like local).
    • alertmanager_s3_bucket_name, alertmanager_gcs_bucket_name and alertmanager_azure_container_name have been removed, and replaced by a single alertmanager_storage_bucket_name configuration used for all object storages.
    • genericBlocksStorageConfig configuration object was removed, so any extensions to it will now be ignored. Use blockStorageConfig instead.
    • rulerClientConfig and alertmanagerStorageClientConfig configuration objects were renamed to rulerStorageConfig and alertmanagerStorageConfig respectively, so any extensions to their previous names will now be ignored. Use the new names instead.
    • The CLI flags *.s3.region are no longer rendered as they are optional and the region can be inferred by Mimir by performing an initial API call to the endpoint.
    • The migration to this change should usually consist of:
      • Renaming blocks_storage_backend key to storage_backend.
      • For Azure/S3:
        • Renaming blocks_storage_(azure|s3)_* configurations to storage_(azure|s3)_*.
        • If ruler_storage_(azure|s3)_* and alertmanager_storage_(azure|s3)_* keys were different from the blocks_storage_* ones, they should now be provided using CLI flags. See the configuration reference for more details.
      • Removing ruler_client_type and alertmanager_client_type if their values match the storage_backend, or renaming them to their new names otherwise.
      • Reviewing any possible extensions to genericBlocksStorageConfig, rulerClientConfig and alertmanagerStorageClientConfig and moving them to the corresponding new options.
      • Renaming the alertmanager's bucket name configuration from provider-specific to the new alertmanager_storage_bucket_name key.
  • [CHANGE] The overrides-exporter.libsonnet file is now always imported. The overrides-exporter can be enabled in Jsonnet by setting the following: #3379
    {
      _config+:: {
        overrides_exporter_enabled: true,
      }
    }
    
  • [FEATURE] Added support for experimental read-write deployment mode. Enabling the read-write deployment mode on an existing Mimir cluster is a destructive operation, because the cluster will be re-created. If you're creating a new Mimir cluster, you can deploy it in read-write mode by adding the following configuration: #3379 #3475 #3405
    {
      _config+:: {
        deployment_mode: 'read-write',
    
        // See operations/mimir/read-write-deployment.libsonnet for more configuration options.
        mimir_write_replicas: 3,
        mimir_read_replicas: 2,
        mimir_backend_replicas: 3,
      }
    }
    
  • [ENHANCEMENT] Add autoscaling support to the mimir-read component when running in read-write deployment mode. #3419
  • [ENHANCEMENT] Added $._config.usageStatsConfig to track the installation mode via the anonymous usage statistics. #3294
  • [ENHANCEMENT] The query-tee node port ($._config.query_tee_node_port) is now optional. #3272
  • [ENHANCEMENT] Add support for autoscaling distributors. #3378
  • [ENHANCEMENT] Make auto-scaling logic ensure integer KEDA thresholds. #3512
  • [BUGFIX] Fixed query-scheduler ring configuration for dedicated ruler's queries and query-frontends. #3237 #3239
  • [BUGFIX] Jsonnet: Fix auto-scaling so that ruler-querier CPU threshold is a string-encoded integer millicores value. #3520
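
The migration steps for the common storage configuration (#3257) can be sketched as follows. This is an illustrative example for an S3 setup, not a complete configuration: the region and bucket name are placeholder values, and the old keys appear only in comments.

```jsonnet
{
  _config+:: {
    // Previously: blocks_storage_backend: 's3',
    storage_backend: 's3',

    // storage_s3_endpoint is rendered from aws_region by default,
    // so the old blocks_storage_s3_endpoint key is dropped here.
    aws_region: 'eu-west-1',  // placeholder value

    // ruler_client_type and alertmanager_client_type are removed because
    // their values matched storage_backend. If they differed, they would be
    // renamed instead, for example:
    // ruler_storage_backend: 'local',

    // Replaces the provider-specific alertmanager_s3_bucket_name key.
    alertmanager_storage_bucket_name: 'example-alertmanager-bucket',  // placeholder value
  },
}
```

Extensions to genericBlocksStorageConfig, rulerClientConfig and alertmanagerStorageClientConfig are not covered by this sketch and need to be reviewed separately.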

Mimirtool

  • [FEATURE] Added mimirtool alertmanager verify command to validate configuration without uploading. #3440
  • [ENHANCEMENT] Added mimirtool rules delete-namespace command to delete all of the rule groups in a namespace including the namespace itself. #3136
  • [ENHANCEMENT] Refactor mimirtool analyze prometheus: add concurrency and resiliency #3349
    • Add --concurrency flag. Default: number of logical CPUs
  • [BUGFIX] --log.level=debug now correctly prints the response from the remote endpoint when a request fails. #3180

Documentation

  • [ENHANCEMENT] Documented how to configure HA deduplication using Consul in a Mimir Helm deployment. #2972
  • [ENHANCEMENT] Improve MimirQuerierAutoscalerNotActive runbook. #3186
  • [ENHANCEMENT] Improve MimirSchedulerQueriesStuck runbook to reflect debug steps with querier auto-scaling enabled. #3223
  • [ENHANCEMENT] Use imperative for docs titles. #3178 #3332 #3343
  • [ENHANCEMENT] Docs: mention gRPC compression in "Production tips". #3201
  • [ENHANCEMENT] Update ADOPTERS.md. #3224 #3225
  • [ENHANCEMENT] Add a note for jsonnet deploying. #3213
  • [ENHANCEMENT] out-of-order runbook update with use case. #3253
  • [ENHANCEMENT] Fixed TSDB retention mentioned in the "Recover source blocks from ingesters" runbook. #3280
  • [ENHANCEMENT] Run Grafana Mimir in production using the Helm chart. #3072
  • [ENHANCEMENT] Use common configuration in the tutorial. #3282
  • [ENHANCEMENT] Updated detailed steps for migrating blocks from Thanos to Mimir. #3290
  • [ENHANCEMENT] Add scheme to DNS service discovery docs. #3450
  • [BUGFIX] Remove reference to file that no longer exists in contributing guide. #3404
  • [BUGFIX] Fix some minor typos in the contributing guide and on the runbooks page. #3418
  • [BUGFIX] Fix small typos in API reference. #3526
  • [BUGFIX] Fixed TSDB retention mentioned in the "Recover source blocks from ingesters" runbook. #3278
  • [BUGFIX] Fixed configuration example in the "Configuring the Grafana Mimir query-frontend to work with Prometheus" guide. #3374

Tools

  • [FEATURE] Add copyblocks tool, to copy Mimir blocks between two GCS buckets. #3264
  • [ENHANCEMENT] copyblocks: copy no-compact global markers and optimize min time filter check. #3268
  • [ENHANCEMENT] Mimir rules GitHub action: Added the ability to change default value of label when running prepare command. #3236
  • [BUGFIX] Mimir rules GitHub action: Fix single line output. #3421

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.4.0...mimir-2.5.0

mimir - 2.5.0-rc.0

Published by replay almost 2 years ago

This release contains 227 PRs from 43 authors, including new contributors Aldo D'Aquino, Anıl Mısırlıoğlu, Charles Korn, Danny Staple, Dylan Crees, Eduardo Silvi, FG, Jesse Weaver, KarlisAG, Leegin-darknight, Rohan Kumar, Wille Faler, Y.Horie, manohar-koukuntla, paulroche, songjiayang, Éamon Ryan. Thank you!

Grafana Mimir version 2.5.0-rc.0 release notes

Grafana Labs is excited to announce version 2.5.0-rc.0 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Features and enhancements

  • Alertmanager Discord support
    Alertmanager can now be configured to send alerts in Discord channels.

  • Configurable TLS minimum version and cipher suites
    We added the flags -server.tls-min-version and -server.tls-cipher-suites that can be used to define the minimum TLS version and the supported cipher suites in all HTTP and gRPC servers in Mimir.

  • Lower memory usage in store-gateway, ingester and alertmanager
    We made various changes related to how index lookups are performed and how the active series custom trackers are implemented, which results in better performance and lower overall memory usage in the store-gateway and ingester.
    We also optimized the alertmanager, which results in a 50% reduction in memory usage in use cases with larger numbers of tenants.

  • Improved Mimir dashboards
    We added two new dashboards named Mimir / Overview resources and Mimir / Overview networking. Furthermore, we have made various improvements to the following existing dashboards:

    • Mimir / Overview: Add "remote read", "metadata", and "exemplar" queries.
    • Mimir / Writes: Add optional row about the distributor's new forwarding feature.
    • Mimir / Tenants: Add insights into the read path.

Helm chart improvements

  • Zone aware replication
    Helm now supports deploying the ingesters and store-gateways across different availability zones. Replication is also zone-aware, so multiple instances in one zone can fail without any service interruption. Rollouts can also be performed faster, because many instances in each zone can be restarted together instead of all restarting in sequence.

    This is a breaking change. For details on how to upgrade, review the Helm changelog.

  • Running without root privileges
    All Mimir, GEM and Agent processes no longer require root privileges to run.

  • Unified reverse proxy (gateway) configuration for Mimir and GEM
    This change allows for an easier upgrade path from Mimir to GEM, without any downtime. The unified configuration also makes it possible to autoscale the GEM gateway pods and it supports OpenShift Route. The change also deprecates the nginx section in the configuration. The section will be removed in release 7.0.0.

  • Updated MinIO
    The MinIO sub-chart was updated from 4.x to 5.0.0. Note that this update inherits a breaking change because the MinIO gateway mode was removed.

  • Updated sizing plans
    We updated our sizing plans to better reflect how we recommend running Mimir and GEM in production. Note that this includes a breaking change for users of the "small" plan. For more details, see the Helm changelog.

  • Various quality of life improvements

    • Rollout strategies without downtime
    • Read path and compactor configuration refresh, providing better default settings
    • OTLP ingestion support in the Nginx configuration
    • A default configuration for alertmanager, so the user interface and the sending of alerts from the ruler work out of the box

Bug fixes

  • Flusher: Added Overrides as a dependency to prevent panics when starting with -target=flusher. PR 3151
  • Query-frontend: properly close gRPC streams to the query-scheduler to stop memory and goroutine leaks. PR 3302
  • Ruler: persist evaluation delay configured in the rulegroup. PR 3392
  • Fix panics in OTLP ingest path when parse errors occur. PR 3538

Changelog

2.5.0-rc.0

Grafana Mimir

  • [CHANGE] Flag -azure.msi-resource is now ignored, and will be removed in Mimir 2.7. This setting is now made automatically by Azure. #2682
  • [CHANGE] Experimental flag -blocks-storage.tsdb.out-of-order-capacity-min has been removed. #3261
  • [CHANGE] Distributor: Wrap errors from pushing to ingesters with useful context, for example clarifying timeouts. #3307
  • [CHANGE] The default value of -server.http-write-timeout has changed from 30s to 2m. #3346
  • [CHANGE] Reduce period of health checks in connection pools for querier->store-gateway, ruler->ruler, and alertmanager->alertmanager clients to 10s. This reduces the time to fail a gRPC call when the remote stops responding. #3168
  • [CHANGE] Hide TSDB block ranges period config from doc and mark it experimental. #3518
  • [FEATURE] Alertmanager: added Discord support. #3309
  • [ENHANCEMENT] Added -server.tls-min-version and -server.tls-cipher-suites flags to configure cipher suites and min TLS version supported by HTTP and gRPC servers. #2898
  • [ENHANCEMENT] Distributor: Add age filter to forwarding functionality, to not forward samples which are older than defined duration. If such samples are not ingested, cortex_discarded_samples_total{reason="forwarded-sample-too-old"} is increased. #3049 #3113
  • [ENHANCEMENT] Store-gateway: Reduce memory allocation when generating ids in index cache. #3179
  • [ENHANCEMENT] Query-frontend: truncate queries based on the configured creation grace period (--validation.create-grace-period) to avoid querying too far into the future. #3172
  • [ENHANCEMENT] Ingester: Reduce activity tracker memory allocation. #3203
  • [ENHANCEMENT] Query-frontend: Log more detailed information in the case of a failed query. #3190
  • [ENHANCEMENT] Added -usage-stats.installation-mode configuration to track the installation mode via the anonymous usage statistics. #3244
  • [ENHANCEMENT] Compactor: Add new cortex_compactor_block_max_time_delta_seconds histogram for detecting if compaction of blocks is lagging behind. #3240 #3429
  • [ENHANCEMENT] Ingester: reduced the memory footprint of active series custom trackers. #2568
  • [ENHANCEMENT] Distributor: Include X-Scope-OrgId header in requests forwarded to configured forwarding endpoint. #3283 #3385
  • [ENHANCEMENT] Alertmanager: reduced memory utilization in Mimir clusters with a large number of tenants. #3309
  • [ENHANCEMENT] Add experimental flag -shutdown-delay to allow components to wait after receiving SIGTERM and before stopping. During this time the component returns 503 from the /ready endpoint. #3298
  • [ENHANCEMENT] Go: update to go 1.19.3. #3371
  • [ENHANCEMENT] Alerts: added RulerRemoteEvaluationFailing alert, firing when communication between ruler and frontend fails in remote operational mode. #3177 #3389
  • [ENHANCEMENT] Clarify which S3 signature versions are supported in the error "unsupported signature version". #3376
  • [ENHANCEMENT] Store-gateway: improved index header reading performance. #3393 #3397 #3436
  • [ENHANCEMENT] Store-gateway: improved performance of series matching. #3391
  • [ENHANCEMENT] Move the validation of incoming series before the distributor's forwarding functionality, so that we don't forward invalid series. #3386 #3458
  • [ENHANCEMENT] S3 bucket configuration now validates that the endpoint does not have the bucket name prefix. #3414
  • [ENHANCEMENT] Query-frontend: added "fetched index bytes" to query statistics, so that the statistics contain the total bytes read by store-gateways from TSDB block indexes. #3206
  • [ENHANCEMENT] Distributor: push wrapper should only receive unforwarded samples. #2980
  • [BUGFIX] Flusher: Add Overrides as a dependency to prevent panics when starting with -target=flusher. #3151
  • [BUGFIX] Updated golang.org/x/text dependency to fix CVE-2022-32149. #3285
  • [BUGFIX] Query-frontend: properly close gRPC streams to the query-scheduler to stop memory and goroutine leaks. #3302
  • [BUGFIX] Ruler: persist evaluation delay configured in the rulegroup. #3392
  • [BUGFIX] Ring status pages: show 100% ownership as "100%", not "1e+02%". #3435
  • [BUGFIX] Fix panics in OTLP ingest path when parse errors exist. #3538

Mixin

  • [CHANGE] Alerts: Change MimirSchedulerQueriesStuck for time to 7 minutes to account for the time it takes for HPA to scale up. #3223
  • [CHANGE] Dashboards: Removed the Querier > Stages panel from the Mimir / Queries dashboard. #3311
  • [CHANGE] Configuration: The format of the autoscaling section of the configuration has changed to support more components. #3378
    • Instead of specific config variables for each component, they are listed in a dictionary. For example, autoscaling.querier_enabled becomes autoscaling.querier.enabled.
  • [FEATURE] Dashboards: Added "Mimir / Overview resources" dashboard, providing a high-level view of a Mimir cluster's resource utilization. #3481
  • [FEATURE] Dashboards: Added "Mimir / Overview networking" dashboard, providing a high-level view of a Mimir cluster's network bandwidth, inflight requests and TCP connections. #3487
  • [FEATURE] Compile baremetal mixin along k8s mixin. #3162 #3514
  • [ENHANCEMENT] Alerts: Add MimirRingMembersMismatch firing when a component does not have the expected number of running jobs. #2404
  • [ENHANCEMENT] Dashboards: Add optional row about the Distributor's metric forwarding feature to the Mimir / Writes dashboard. #3182 #3394 #3394 #3461
  • [ENHANCEMENT] Dashboards: Remove the "Instance Mapper" row from the "Alertmanager Resources Dashboard". This is a Grafana Cloud specific service and not relevant for external users. #3152
  • [ENHANCEMENT] Dashboards: Add "remote read", "metadata", and "exemplar" queries to "Mimir / Overview" dashboard. #3245
  • [ENHANCEMENT] Dashboards: Use non-red colors for non-error series in the "Mimir / Overview" dashboard. #3246
  • [ENHANCEMENT] Dashboards: Add support for multi-zone deployments to the experimental read-write deployment mode. #3256
  • [ENHANCEMENT] Dashboards: If enabled, add new row to the Mimir / Writes dashboard for distributor autoscaling metrics. #3378
  • [ENHANCEMENT] Dashboards: Add read path insights row to the "Mimir / Tenants" dashboard. #3326
  • [ENHANCEMENT] Alerts: Add runbook urls for alerts. #3452
  • [ENHANCEMENT] Configuration: Make it possible to configure namespace label, job label, and job prefix. #3482
  • [ENHANCEMENT] Dashboards: improved resources and networking dashboards to work with read-write deployment mode too. #3497 #3504 #3519 #3531
  • [ENHANCEMENT] Alerts: Added "MimirDistributorForwardingErrorRate" alert, which fires on high error rates in the distributor’s forwarding feature. #3200
  • [ENHANCEMENT] Improve phrasing in Overview dashboard. #3488
  • [BUGFIX] Dashboards: Fix legend showing persistentvolumeclaim when using deployment_type=baremetal for Disk space utilization panels. #3173 #3184
  • [BUGFIX] Alerts: Fixed MimirGossipMembersMismatch alert when Mimir is deployed in read-write mode. #3489
  • [BUGFIX] Dashboards: Remove "Inflight requests" from object store panels because the panel is not tracking the inflight requests to object storage. #3521

Jsonnet

  • [CHANGE] Replaced the deprecated policy/v1beta1 with policy/v1 when configuring a PodDisruptionBudget. #3284
  • [CHANGE] Common storage configuration is now used to configure object storage in all components. This is a breaking change in terms of Jsonnet manifests and also a CLI flag update for components that use object storage, so it will require a rollout of those components. The changes include: #3257
    • blocks_storage_backend was renamed to storage_backend and is now used as the common storage backend for all components.
      • So were the related blocks_storage_azure_account_(name|key) and blocks_storage_s3_endpoint configurations.
    • storage_s3_endpoint is now rendered by default using the aws_region configuration instead of a hardcoded us-east-1.
    • ruler_client_type and alertmanager_client_type were renamed to ruler_storage_backend and alertmanager_storage_backend respectively, and their corresponding CLI flags won't be rendered unless explicitly set to a value different from the one in storage_backend (like local).
    • alertmanager_s3_bucket_name, alertmanager_gcs_bucket_name and alertmanager_azure_container_name have been removed, and replaced by a single alertmanager_storage_bucket_name configuration used for all object storages.
    • genericBlocksStorageConfig configuration object was removed, and so any extensions to it will be now ignored. Use blockStorageConfig instead.
    • rulerClientConfig and alertmanagerStorageClientConfig configuration objects were renamed to rulerStorageConfig and alertmanagerStorageConfig respectively, and so any extensions to their previous names will be now ignored. Use the new names instead.
    • The CLI flags *.s3.region are no longer rendered as they are optional and the region can be inferred by Mimir by performing an initial API call to the endpoint.
    • The migration to this change should usually consist of:
      • Renaming blocks_storage_backend key to storage_backend.
      • For Azure/S3:
        • Renaming blocks_storage_(azure|s3)_* configurations to storage_(azure|s3)_*.
        • If ruler_storage_(azure|s3)_* and alertmanager_storage_(azure|s3)_* keys were different from the blocks_storage_* ones, they should now be provided using CLI flags; see the configuration reference for more details.
      • Removing ruler_client_type and alertmanager_client_type if their values match the storage_backend, or renaming them to their new names otherwise.
      • Reviewing any possible extensions to genericBlocksStorageConfig, rulerClientConfig and alertmanagerStorageClientConfig and moving them to the corresponding new options.
      • Renaming the alertmanager's bucket name configuration from provider-specific to the new alertmanager_storage_bucket_name key.
  • [CHANGE] The overrides-exporter.libsonnet file is now always imported. The overrides-exporter can be enabled in Jsonnet by setting the following: #3379
    {
      _config+:: {
        overrides_exporter_enabled: true,
      }
    }
    
  • [FEATURE] Added support for experimental read-write deployment mode. Enabling the read-write deployment mode on an existing Mimir cluster is a destructive operation, because the cluster will be re-created. If you're creating a new Mimir cluster, you can deploy it in read-write mode by adding the following configuration: #3379 #3475 #3405
    {
      _config+:: {
        deployment_mode: 'read-write',
    
        // See operations/mimir/read-write-deployment.libsonnet for more configuration options.
        mimir_write_replicas: 3,
        mimir_read_replicas: 2,
        mimir_backend_replicas: 3,
      }
    }
    
  • [ENHANCEMENT] Add autoscaling support to the mimir-read component when running the read-write-deployment model. #3419
  • [ENHANCEMENT] Added $._config.usageStatsConfig to track the installation mode via the anonymous usage statistics. #3294
  • [ENHANCEMENT] The query-tee node port ($._config.query_tee_node_port) is now optional. #3272
  • [ENHANCEMENT] Add support for autoscaling distributors. #3378
  • [ENHANCEMENT] Make auto-scaling logic ensure integer KEDA thresholds. #3512
  • [BUGFIX] Fixed query-scheduler ring configuration for dedicated ruler's queries and query-frontends. #3237 #3239
  • [BUGFIX] Jsonnet: Fix auto-scaling so that ruler-querier CPU threshold is a string-encoded integer millicores value. #3520
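
The storage-configuration migration steps listed under #3257 above are mostly mechanical key renames, so they can be illustrated with a short script. The following is a minimal sketch, not part of the Mimir tooling: it assumes the `_config` overrides have been exported to a flat Python dict, and the helper name `migrate_config` is hypothetical. Only the key names come from the changelog entry.

```python
import re

# Key renames taken from the changelog entry for #3257.
# The helper itself is hypothetical and only illustrates the migration steps.
EXACT_RENAMES = {
    "blocks_storage_backend": "storage_backend",
    "ruler_client_type": "ruler_storage_backend",
    "alertmanager_client_type": "alertmanager_storage_backend",
    "alertmanager_s3_bucket_name": "alertmanager_storage_bucket_name",
    "alertmanager_gcs_bucket_name": "alertmanager_storage_bucket_name",
    "alertmanager_azure_container_name": "alertmanager_storage_bucket_name",
}

# blocks_storage_(azure|s3)_* becomes storage_(azure|s3)_*
PREFIX_RENAME = re.compile(r"^blocks_storage_(azure|s3)_")


def migrate_config(old):
    """Return a copy of `old` with the renames from the changelog applied."""
    new = {}
    for key, value in old.items():
        if key in EXACT_RENAMES:
            key = EXACT_RENAMES[key]
        else:
            key = PREFIX_RENAME.sub(r"storage_\1_", key)
        new[key] = value

    # The ruler and alertmanager backends only need to be set when they
    # differ from the common storage backend.
    backend = new.get("storage_backend")
    for key in ("ruler_storage_backend", "alertmanager_storage_backend"):
        if key in new and new[key] == backend:
            del new[key]
    return new


old_config = {
    "blocks_storage_backend": "s3",
    "blocks_storage_s3_endpoint": "s3.eu-west-1.amazonaws.com",
    "ruler_client_type": "s3",
    "alertmanager_client_type": "local",
    "alertmanager_s3_bucket_name": "alerts",
}
new_config = migrate_config(old_config)
```

The sketch does not cover extensions to genericBlocksStorageConfig, rulerClientConfig, or alertmanagerStorageClientConfig, which still need a manual review.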

Mimirtool

  • [FEATURE] Added mimirtool alertmanager verify command to validate configuration without uploading. #3440
  • [ENHANCEMENT] Added mimirtool rules delete-namespace command to delete all of the rule groups in a namespace including the namespace itself. #3136
  • [ENHANCEMENT] Refactor mimirtool analyze prometheus: add concurrency and resiliency #3349
    • Add --concurrency flag. Default: number of logical CPUs
  • [BUGFIX] --log.level=debug now correctly prints the response from the remote endpoint when a request fails. #3180

Documentation

  • [ENHANCEMENT] Documented how to configure HA deduplication using Consul in a Mimir Helm deployment. #2972
  • [ENHANCEMENT] Improve MimirQuerierAutoscalerNotActive runbook. #3186
  • [ENHANCEMENT] Improve MimirSchedulerQueriesStuck runbook to reflect debug steps with querier auto-scaling enabled. #3223
  • [ENHANCEMENT] Use imperative for docs titles. #3178 #3332 #3343
  • [ENHANCEMENT] Docs: mention gRPC compression in "Production tips". #3201
  • [ENHANCEMENT] Update ADOPTERS.md. #3224 #3225
  • [ENHANCEMENT] Add a note for jsonnet deploying. #3213
  • [ENHANCEMENT] out-of-order runbook update with use case. #3253
  • [ENHANCEMENT] Fixed TSDB retention mentioned in the "Recover source blocks from ingesters" runbook. #3280
  • [ENHANCEMENT] Run Grafana Mimir in production using the Helm chart. #3072
  • [ENHANCEMENT] Use common configuration in the tutorial. #3282
  • [ENHANCEMENT] Updated detailed steps for migrating blocks from Thanos to Mimir. #3290
  • [ENHANCEMENT] Add scheme to DNS service discovery docs. #3450
  • [BUGFIX] Remove reference to file that no longer exists in contributing guide. #3404
  • [BUGFIX] Fix some minor typos in the contributing guide and on the runbooks page. #3418
  • [BUGFIX] Fix small typos in API reference. #3526
  • [BUGFIX] Fixed TSDB retention mentioned in the "Recover source blocks from ingesters" runbook. #3278
  • [BUGFIX] Fixed configuration example in the "Configuring the Grafana Mimir query-frontend to work with Prometheus" guide. #3374

Tools

  • [FEATURE] Add copyblocks tool, to copy Mimir blocks between two GCS buckets. #3264
  • [ENHANCEMENT] copyblocks: copy no-compact global markers and optimize min time filter check. #3268
  • [ENHANCEMENT] Mimir rules GitHub action: Added the ability to change default value of label when running prepare command. #3236
  • [BUGFIX] Mimir rules GitHub action: Fix single line output. #3421

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.4.0...mimir-2.5.0-rc.0

mimir - 2.4.0

Published by pracucci almost 2 years ago

This release contains 190 PRs from 29 authors, including new contributors Fayzal Ghantiwala, Furkan Türkal, Joe Blubaugh, Justin Lei, Nicolas DUPEUX, Paul Puschmann, Radu Domnu, Shubham Ranjan. Thank you!

Grafana Mimir version 2.4.0 release notes

Grafana Labs is excited to announce version 2.4 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Note: If you are upgrading from Grafana Mimir 2.3, review the list of important changes that follow.

Features and enhancements

  • Query-scheduler ring-based service discovery:
    The query-scheduler is an optional, stateless component that retains a queue of queries to execute, and distributes the workload to available queriers. To use the query-scheduler, query-frontends and queriers are required to discover the addresses of the query-scheduler instances.

    In addition to DNS-based service discovery, Mimir 2.4 introduces the ring-based service discovery for the query-scheduler. When enabled, the query-schedulers join their own hash ring (similar to other Mimir components), and the query-frontends and queriers discover query-scheduler instances via the ring.

    Ring-based service discovery makes it easier to set up the query-scheduler in environments where you can't easily define a DNS entry that resolves to the running query-scheduler instances. For more information, refer to query-scheduler configuration.

  • New API endpoint exposes per-tenant limits:
    Mimir 2.4 introduces a new API endpoint, which is available on all Mimir components that load the runtime configuration. The endpoint exposes the limits of the authenticated tenant. You can use this new API endpoint when developing custom integrations with Mimir that require looking up the actual limits that are applied on a given tenant. For more information, refer to Get tenant limits.

  • New TLS configuration options:
    Mimir 2.4 introduces new options to configure the accepted TLS cipher suites, and the minimum versions for the HTTP and gRPC clients that are used between Mimir components, or by Mimir to communicate to external services such as Consul or etcd.

    You can use these new configuration options to override the default TLS settings and meet your security policy requirements. For more information, refer to Securing Grafana Mimir communications with TLS.

  • Maximum range query length limit:
    Mimir 2.4 introduces the new configuration option -query-frontend.max-total-query-length to limit the maximum range query length, which is computed as the query's end minus start timestamp. This limit is enforced in the query-frontend and defaults to -store.max-query-length if unset.

    The new configuration option allows you to set different limits between the received query maximum length (-query-frontend.max-total-query-length) and the maximum length of partial queries after splitting and sharding (-store.max-query-length).
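
The interaction between the two limits can be sketched as follows. This is a simplified illustration rather than Mimir's implementation: the function names are hypothetical, and treating a zero limit as "no limit" is an assumption made for the example.

```python
from datetime import timedelta


def effective_total_limit(max_total_query_length, store_max_query_length):
    """Fall back to the store limit when the total limit is unset or zero."""
    if not max_total_query_length:
        return store_max_query_length
    return max_total_query_length


def range_query_allowed(start_s, end_s, max_total_query_length, store_max_query_length):
    """Check the received query length (end minus start) against the limit.

    Assumption for this sketch: an effective limit of zero disables the check.
    """
    limit = effective_total_limit(max_total_query_length, store_max_query_length)
    length = timedelta(seconds=end_s - start_s)
    if not limit:
        return True
    return length <= limit


# A 3h query is rejected when only the 2h store limit is configured,
# but accepted once a separate, larger total limit is set.
rejected = range_query_allowed(0, 3 * 3600, None, timedelta(hours=2))
accepted = range_query_allowed(0, 3 * 3600, timedelta(hours=12), timedelta(hours=2))
```

In the second case the 3h query passes the query-frontend check, while the partial queries produced by splitting and sharding remain subject to the store limit.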

The following experimental features have been promoted to stable:

  • Ruler's remote evaluation mode
  • Memberlist cluster label verification

Helm chart improvements

The mimir-distributed Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.4 release, we’re also releasing version 3.2 of the mimir-distributed Helm chart.

Notable enhancements follow. For the full list of changes, see the Helm chart changelog.

  • Added support for topologySpreadConstraints.
  • Replaced the default anti-affinity rules with topologySpreadConstraints for all components, which puts fewer restrictions on where Kubernetes can run pods.
  • Important: if you are not using the sizing plans (small.yaml, large.yaml, capped-small.yaml, capped-large.yaml) in production, you must reintroduce pod anti-affinity rules for the ingester and store-gateway. This also fixes a missing label selector for the ingester.
    Merge the following with your custom values file:
    ingester:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: target
                    operator: In
                    values:
                      - ingester
              topologyKey: "kubernetes.io/hostname"
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - ingester
              topologyKey: "kubernetes.io/hostname"
    store_gateway:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: target
                    operator: In
                    values:
                      - store-gateway
              topologyKey: "kubernetes.io/hostname"
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - store-gateway
              topologyKey: "kubernetes.io/hostname"
    
  • Updated the anti-affinity rules in the sizing plans (small.yaml, large.yaml, capped-small.yaml, capped-large.yaml). The sizing plans now enforce that no two pods of the ingester, store-gateway, or alertmanager StatefulSets are scheduled on the same Node. Pods from different StatefulSets can share a Node.
  • Added support for the OpenShift Route resource for nginx.

Important changes

In Grafana Mimir 2.4, the default values of the following configuration options have changed:

  • -distributor.remote-timeout has changed from 20s to 2s.
  • -distributor.forwarding.request-timeout has changed from 10s to 2s.
  • -blocks-storage.tsdb.head-compaction-concurrency has changed from 5 to 1.
  • The hash-ring heartbeat period for distributors, ingesters, rulers, and compactors has increased from 5s to 15s.

In Grafana Mimir 2.4, the following deprecated configuration options have been removed:

  • The YAML configuration option limits.active_series_custom_trackers_config.
  • The CLI flag -ingester.ring.join-after and its respective YAML configuration option ingester.ring.join_after.
  • The CLI flag -querier.shuffle-sharding-ingesters-lookback-period and its respective YAML configuration option querier.shuffle_sharding_ingesters_lookback_period.

With Grafana Mimir 2.4, the anonymous usage statistics tracking is enabled by default.
Mimir maintainers use this anonymous information to learn more about how the open source community runs Mimir and what the Mimir team should focus on when working on the next features and documentation improvements.
If possible, we ask you to keep the usage reporting feature enabled.
If you want to opt out of anonymous usage statistics reporting, refer to Disable the anonymous usage statistics reporting.

Bug fixes

  • PR 2979: Fix the remote write HTTP response status code returned by Mimir when it fails to write to only one ingester (the quorum is still honored when running Mimir with the default replication factor of 3) and some series are not ingested because of validation errors or limits being reached.
  • PR 3005: Fix the querier to re-balance its worker connections when a query-frontend or query-scheduler instance is terminated.
  • PR 2963: Fix the remote read endpoint to correctly support the Accept-Encoding: snappy HTTP request header.

Changelog

2.4.0

Grafana Mimir

  • [CHANGE] Distributor: change the default value of -distributor.remote-timeout to 2s from 20s and -distributor.forwarding.request-timeout to 2s from 10s to improve distributor resource usage when ingesters crash. #2728 #2912
  • [CHANGE] Anonymous usage statistics tracking: added the -ingester.ring.store value. #2981
  • [CHANGE] Series metadata HELP that is longer than -validation.max-metadata-length is now truncated silently, instead of being dropped with a 400 status code. #2993
  • [CHANGE] Ingester: changed default setting for -ingester.ring.readiness-check-ring-health from true to false. #2953
  • [CHANGE] Anonymous usage statistics tracking has been enabled by default, to help Mimir maintainers make better decisions to support the open source community. #2939 #3034
  • [CHANGE] Anonymous usage statistics tracking: added the minimum and maximum value of -ingester.out-of-order-time-window. #2940
  • [CHANGE] The default hash ring heartbeat period for distributors, ingesters, rulers and compactors has been increased from 5s to 15s. Now the default heartbeat period for all Mimir hash rings is 15s. #3033
  • [CHANGE] Reduce the default TSDB head compaction concurrency (-blocks-storage.tsdb.head-compaction-concurrency) from 5 to 1, in order to reduce CPU spikes. #3093
  • [CHANGE] Ruler: the ruler's remote evaluation mode (-ruler.query-frontend.address) is now stable. #3109
  • [CHANGE] Limits: removed the deprecated YAML configuration option active_series_custom_trackers_config. Please use active_series_custom_trackers instead. #3110
  • [CHANGE] Ingester: removed the deprecated configuration option -ingester.ring.join-after. #3111
  • [CHANGE] Querier: removed the deprecated configuration option -querier.shuffle-sharding-ingesters-lookback-period. The value of -querier.query-ingesters-within is now used internally for shuffle sharding lookback, while you can use -querier.shuffle-sharding-ingesters-enabled to enable or disable shuffle sharding on the read path. #3111
  • [CHANGE] Memberlist: cluster label verification feature (-memberlist.cluster-label and -memberlist.cluster-label-verification-disabled) is now marked as stable. #3108
  • [CHANGE] Distributor: only a single per-tenant forwarding endpoint can be configured now. Support for per-rule endpoints has been removed. #3095
  • [FEATURE] Query-scheduler: added an experimental ring-based service discovery support for the query-scheduler. Refer to query-scheduler configuration for more information. #2957
  • [FEATURE] Introduced the experimental endpoint /api/v1/user_limits exposed by all components that load runtime configuration. This endpoint exposes realtime limits for the authenticated tenant, in JSON format. #2864 #3017
  • [FEATURE] Query-scheduler: added the experimental configuration option -query-scheduler.max-used-instances to restrict the number of query-schedulers effectively used regardless of how many replicas are running. This feature can be useful when using the experimental read-write deployment mode. #3005
  • [ENHANCEMENT] Go: updated to Go 1.19.2. #2637 #3127 #3129
  • [ENHANCEMENT] Runtime config: don't unmarshal runtime configuration files if they haven't changed. This can save a bit of CPU and memory on every component using runtime config. #2954
  • [ENHANCEMENT] Query-frontend: Add cortex_frontend_query_result_cache_skipped_total and cortex_frontend_query_result_cache_attempted_total metrics to track the reason why query results are not cached. #2855
  • [ENHANCEMENT] Distributor: pool more connections per host when forwarding request. Mark requests as idempotent so they can be retried under some conditions. #2968
  • [ENHANCEMENT] Distributor: failure to send request to forwarding target now also increments cortex_distributor_forward_errors_total, with status_code="failed". #2968
  • [ENHANCEMENT] Distributor: added support for forwarding push requests via gRPC, using httpgrpc messages from weaveworks/common library. #2996
  • [ENHANCEMENT] Query-frontend / Querier: increase internal backoff period used to retry connections to query-frontend / query-scheduler. #3011
  • [ENHANCEMENT] Querier: do not log "error processing requests from scheduler" when the query-scheduler is shutting down. #3012
  • [ENHANCEMENT] Query-frontend: query sharding process is now time-bounded and it is cancelled if the request is aborted. #3028
  • [ENHANCEMENT] Query-frontend: improved Prometheus response JSON encoding performance. #2450
  • [ENHANCEMENT] TLS: added configuration parameters to configure the client's TLS cipher suites and minimum version. The following new CLI flags have been added: #3070
    • -alertmanager.alertmanager-client.tls-cipher-suites
    • -alertmanager.alertmanager-client.tls-min-version
    • -alertmanager.sharding-ring.etcd.tls-cipher-suites
    • -alertmanager.sharding-ring.etcd.tls-min-version
    • -compactor.ring.etcd.tls-cipher-suites
    • -compactor.ring.etcd.tls-min-version
    • -distributor.forwarding.grpc-client.tls-cipher-suites
    • -distributor.forwarding.grpc-client.tls-min-version
    • -distributor.ha-tracker.etcd.tls-cipher-suites
    • -distributor.ha-tracker.etcd.tls-min-version
    • -distributor.ring.etcd.tls-cipher-suites
    • -distributor.ring.etcd.tls-min-version
    • -ingester.client.tls-cipher-suites
    • -ingester.client.tls-min-version
    • -ingester.ring.etcd.tls-cipher-suites
    • -ingester.ring.etcd.tls-min-version
    • -memberlist.tls-cipher-suites
    • -memberlist.tls-min-version
    • -querier.frontend-client.tls-cipher-suites
    • -querier.frontend-client.tls-min-version
    • -querier.store-gateway-client.tls-cipher-suites
    • -querier.store-gateway-client.tls-min-version
    • -query-frontend.grpc-client-config.tls-cipher-suites
    • -query-frontend.grpc-client-config.tls-min-version
    • -query-scheduler.grpc-client-config.tls-cipher-suites
    • -query-scheduler.grpc-client-config.tls-min-version
    • -query-scheduler.ring.etcd.tls-cipher-suites
    • -query-scheduler.ring.etcd.tls-min-version
    • -ruler.alertmanager-client.tls-cipher-suites
    • -ruler.alertmanager-client.tls-min-version
    • -ruler.client.tls-cipher-suites
    • -ruler.client.tls-min-version
    • -ruler.query-frontend.grpc-client-config.tls-cipher-suites
    • -ruler.query-frontend.grpc-client-config.tls-min-version
    • -ruler.ring.etcd.tls-cipher-suites
    • -ruler.ring.etcd.tls-min-version
    • -store-gateway.sharding-ring.etcd.tls-cipher-suites
    • -store-gateway.sharding-ring.etcd.tls-min-version
  • [ENHANCEMENT] Store-gateway: Add -blocks-storage.bucket-store.max-concurrent-reject-over-limit option to allow requests that exceed the max number of inflight object storage requests to be rejected. #2999
  • [ENHANCEMENT] Query-frontend: allow setting a separate limit on the total (before splitting/sharding) query length of range queries with the new experimental -query-frontend.max-total-query-length flag, which defaults to -store.max-query-length if unset or set to 0. #3058
  • [ENHANCEMENT] Query-frontend: Lower TTL for cache entries overlapping the out-of-order samples ingestion window (re-using -ingester.out-of-order-allowance from ingesters). #2935
  • [ENHANCEMENT] Ruler: added support to forcefully disable recording and/or alerting rules evaluation. The following new configuration options have been introduced, which can be overridden on a per-tenant basis in the runtime configuration: #3088
    • -ruler.recording-rules-evaluation-enabled
    • -ruler.alerting-rules-evaluation-enabled
  • [ENHANCEMENT] Distributor: Add age filter to forwarding functionality, to not forward samples which are older than defined duration. #3049
  • [ENHANCEMENT] Distributor: Improved error messages reported when the distributor fails to remote write to ingesters. #3055
  • [ENHANCEMENT] Improved tracing spans tracked by distributors, ingesters and store-gateways. #2879 #3099 #3089
  • [ENHANCEMENT] Ingester: improved the performance of label value cardinality endpoint. #3044
  • [ENHANCEMENT] Ruler: use backoff retry on remote evaluation #3098
  • [ENHANCEMENT] Query-frontend: Include multiple tenant IDs in query logs when present instead of dropping them. #3125
  • [ENHANCEMENT] Query-frontend: truncate queries based on the configured blocks retention period (-compactor.blocks-retention-period) to avoid querying past this period. #3134
  • [ENHANCEMENT] Alertmanager: reduced memory utilization in Mimir clusters with a large number of tenants. #3143
  • [ENHANCEMENT] Store-gateway: added extra span logging to improve observability. #3131
  • [BUGFIX] Querier: Fix 400 response while handling streaming remote read. #2963
  • [BUGFIX] Fix a bug causing the query-frontend, query-scheduler, and querier to not fail when one of their internal components fails. #2978
  • [BUGFIX] Querier: re-balance the querier worker connections when a query-frontend or query-scheduler is terminated. #3005
  • [BUGFIX] Distributor: Now returns the quorum error from ingesters. For example, with replication_factor=3, two HTTP 400 errors and one HTTP 500 error, now the distributor will always return HTTP 400. Previously the behaviour was to return the error which the distributor first received. #2979
  • [BUGFIX] Ruler: fix panic when ruler.external_url is explicitly set to an empty string ("") in YAML. #2915
  • [BUGFIX] Alertmanager: Fix support for the Telegram API URL in the global settings. #3097
  • [BUGFIX] Alertmanager: Fix parsing of label matchers without label value in the API used to retrieve alerts. #3097
  • [BUGFIX] Ruler: Fix not restoring alert state for rule groups when other ruler replicas shut down. #3156
  • [BUGFIX] Updated golang.org/x/net dependency to fix CVE-2022-27664. #3124
  • [BUGFIX] Fix the distributor returning a 500 status code when a 400 was received from the ingester. #3211
  • [BUGFIX] Fix incorrect OS value set in Mimir v2.3.* RPM packages. #3221
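
The quorum-error fix listed above (#2979) can be illustrated with the example from that entry: with replication_factor=3, two HTTP 400 errors and one HTTP 500 error now always yield HTTP 400, regardless of arrival order. The following is a minimal sketch of that selection rule, not the distributor's actual code. The function name is hypothetical, and collapsing status codes into 400 and 500 families is a simplifying assumption.

```python
from collections import Counter


def quorum_error_status(error_statuses, replication_factor=3):
    """Return the error family (400 or 500) reported by a quorum of ingesters.

    Hypothetical helper: 4xx codes are grouped as 400 and everything else
    as 500. Returns None when no family reaches a quorum.
    """
    quorum = replication_factor // 2 + 1
    families = Counter(400 if 400 <= status < 500 else 500 for status in error_statuses)
    for family, count in families.items():
        if count >= quorum:
            return family
    return None


# The result no longer depends on which error was received first.
first = quorum_error_status([500, 400, 400])
second = quorum_error_status([400, 400, 500])
```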

Mixin

  • [CHANGE] Alerts: MimirQuerierAutoscalerNotActive is now critical and fires after 1h instead of 15m. #2958
  • [FEATURE] Dashboards: Added "Mimir / Overview" dashboards, providing a high-level view over a Mimir cluster. #3122 #3147 #3155
  • [ENHANCEMENT] Dashboards: Updated the "Writes" and "Rollout progress" dashboards to account for samples ingested via the new OTLP ingestion endpoint. #2919 #2938
  • [ENHANCEMENT] Dashboards: Include per-tenant request rate in "Tenants" dashboard. #2874
  • [ENHANCEMENT] Dashboards: Include inflight object store requests in "Reads" dashboard. #2914
  • [ENHANCEMENT] Dashboards: Make queries used to find job, cluster and namespace for dropdown menus configurable. #2893
  • [ENHANCEMENT] Dashboards: Include rate of label and series queries in "Reads" dashboard. #3065 #3074
  • [ENHANCEMENT] Dashboards: Fix legend showing on per-pod panels. #2944
  • [ENHANCEMENT] Dashboards: Use the "req/s" unit on panels showing the requests rate. #3118
  • [ENHANCEMENT] Dashboards: Use a consistent color across dashboards for the error rate. #3154

Jsonnet

  • [FEATURE] Added support for query-scheduler ring-based service discovery. #3128
  • [ENHANCEMENT] Querier autoscaling is now slower on scale downs: scale down 10% every 1m instead of 100%. #2962
  • [BUGFIX] Memberlist: gossip_member_label is now set for ruler-queriers. #3141

Mimirtool

  • [ENHANCEMENT] mimirtool analyze: Store the query errors instead of exiting during the analysis. #3052
  • [BUGFIX] mimir-tool remote-read: fix returns where some conditions return a nil error even if there is an error. #3053

Documentation

  • [ENHANCEMENT] Added documentation on how to configure storage retention. #2970
  • [ENHANCEMENT] Improved gRPC clients config documentation. #3020
  • [ENHANCEMENT] Added documentation on how to manage alerting and recording rules. #2983
  • [ENHANCEMENT] Improved MimirSchedulerQueriesStuck runbook. #3006
  • [ENHANCEMENT] Added "Cluster label verification" section to memberlist documentation. #3096
  • [ENHANCEMENT] Mention compression in multi-zone replication documentation. #3107
  • [BUGFIX] Fixed configuration option names in "Enabling zone-awareness via the Grafana Mimir Jsonnet". #3018
  • [BUGFIX] Fixed mimirtool analyze parameters documentation. #3094
  • [BUGFIX] Fixed YAML configuration in the "Manage the configuration of Grafana Mimir with Helm" guide. #3042
  • [BUGFIX] Fixed Alertmanager capacity planning documentation. #3132

Tools

  • [BUGFIX] trafficdump: Fixed panic occurring when -success-only=true and the captured request failed. #2863

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.3.1...mimir-2.4.0

mimir - 2.4.0-rc.1

Published by pracucci about 2 years ago

This release contains 8 PRs from 2 authors. Thank you!

Changelog

2.4.0-rc.1

Grafana Mimir

  • [BUGFIX] Fix the distributor returning a 500 status code when a 400 was received from the ingester. #3211
  • [BUGFIX] Fix incorrect OS value set in Mimir v2.3.* RPM packages. #3221

All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.4.0-rc.0...mimir-2.4.0-rc.1

mimir - 2.4.0-rc.0

Published by pracucci about 2 years ago

This release contains 166 PRs from 29 authors. Thank you!

Grafana Mimir version 2.4.0-rc.0 release notes

Grafana Labs is excited to announce version 2.4 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Note: If you are upgrading from Grafana Mimir 2.3, review the list of important changes that follow.

Features and enhancements

  • Query-scheduler ring-based service discovery: The query-scheduler is an optional, stateless component that retains a queue of queries to execute, and distributes the workload to available queriers. To use the query-scheduler, query-frontends and queriers are required to discover the addresses of the query-scheduler instances.

    In addition to DNS-based service discovery, Mimir 2.4 introduces the ring-based service discovery for the query-scheduler. When enabled, the query-schedulers join their own hash ring (similar to other Mimir components), and the query-frontends and queriers discover query-scheduler instances via the ring.

    Ring-based service discovery makes it easier to set up the query-scheduler in environments where you can’t easily define a DNS entry that resolves to the running query-scheduler instances. For more information, refer to query-scheduler configuration.

  • New API endpoint exposes per-tenant limits: Mimir 2.4 introduces a new API endpoint, which is available on all Mimir components that load the runtime configuration. The endpoint exposes the limits of the authenticated tenant. You can use this new API endpoint when developing custom integrations with Mimir that require looking up the actual limits that are applied on a given tenant. For more information, refer to Get tenant limits.

  • New TLS configuration options: Mimir 2.4 introduces new options to configure the accepted TLS cipher suites, and the minimum versions for the HTTP and gRPC clients that are used between Mimir components, or by Mimir to communicate to external services such as Consul or etcd.

    You can use these new configuration options to override the default TLS settings and meet your security policy requirements. For more information, refer to Securing Grafana Mimir communications with TLS.

  • Maximum range query length limit: Mimir 2.4 introduces the new configuration option -query-frontend.max-total-query-length to limit the maximum range query length, which is computed as the query’s end minus start timestamp. This limit is enforced in the query-frontend and defaults to -store.max-query-length if unset.

    The new configuration option allows you to set different limits between the received query maximum length (-query-frontend.max-total-query-length) and the maximum length of partial queries after splitting and sharding (-store.max-query-length).
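
For the per-tenant limits endpoint described in the list above, a custom integration might build its request roughly as follows. This is a sketch under stated assumptions: the path /api/v1/user_limits is the one named in the 2.4.0 changelog, the base URL and tenant ID are placeholders, and passing the tenant in the X-Scope-OrgID header is an assumption about how the request is authenticated in your setup. The request is only constructed here, not sent.

```python
import urllib.request

# Path as named in the 2.4.0 changelog entry for this endpoint.
USER_LIMITS_PATH = "/api/v1/user_limits"


def build_user_limits_request(base_url, tenant_id):
    """Build (but do not send) a GET request for the tenant's limits.

    Assumptions: `base_url` points at a Mimir component that loads the
    runtime configuration, and the tenant is identified by X-Scope-OrgID.
    """
    url = base_url.rstrip("/") + USER_LIMITS_PATH
    request = urllib.request.Request(url, method="GET")
    request.add_header("X-Scope-OrgID", tenant_id)
    return request


# Placeholder address and tenant; replace with values for your deployment.
req = build_user_limits_request("http://mimir.example.local:8080/", "tenant-a")

# To send it and decode the JSON body (requires a reachable Mimir instance):
#   import json
#   with urllib.request.urlopen(req) as resp:
#       limits = json.load(resp)
```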

Helm chart improvements

The mimir-distributed Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.4 release, we’re also releasing version 3.2 of the mimir-distributed Helm chart.

Notable enhancements follow. For the full list of changes, see the Helm chart changelog.

  • Added support for topologySpreadConstraints.

  • Replaced the default anti-affinity rules with topologySpreadConstraints for all components, which puts fewer restrictions on where Kubernetes can run pods.

  • Important: if you are not using the sizing plans (small.yaml, large.yaml, capped-small.yaml, capped-large.yaml) in production, you must reintroduce pod anti-affinity rules for the ingester and store-gateway. This also fixes a missing label selector for the ingester. Merge the following with your custom values file:

    ingester:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: target
                    operator: In
                    values:
                      - ingester
              topologyKey: "kubernetes.io/hostname"
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - ingester
              topologyKey: "kubernetes.io/hostname"
    store_gateway:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: target
                    operator: In
                    values:
                      - store-gateway
              topologyKey: "kubernetes.io/hostname"
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/component
                    operator: In
                    values:
                      - store-gateway
              topologyKey: "kubernetes.io/hostname"
    
  • Updated the anti-affinity rules in the sizing plans (small.yaml, large.yaml, capped-small.yaml, capped-large.yaml). The sizing plans now enforce that no two pods of the ingester, store-gateway, or alertmanager StatefulSets are scheduled on the same Node. Pods from different StatefulSets can share a Node.

  • Added support for the OpenShift Route resource for nginx.

Important changes

In Grafana Mimir 2.4, the default values of the following configuration options have changed:

  • -distributor.remote-timeout has changed from 20s to 2s.
  • -distributor.forwarding.request-timeout has changed from 10s to 2s.
  • -blocks-storage.tsdb.head-compaction-concurrency has changed from 5 to 1.
  • The hash-ring heartbeat period for distributors, ingesters, rulers, and compactors has increased from 5s to 15s.
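Operators who want to keep the pre-2.4 behavior while evaluating the new defaults can pin the old values explicitly. The following minimal Python sketch only assembles the command-line flags named in the list above. The helper name `pin_previous_defaults` is hypothetical, and the hash-ring heartbeat flags are left out because their per-component names are not given here.

```python
# Previous (pre-2.4) defaults for the flags named in the release notes.
PREVIOUS_DEFAULTS = {
    "-distributor.remote-timeout": "20s",
    "-distributor.forwarding.request-timeout": "10s",
    "-blocks-storage.tsdb.head-compaction-concurrency": "5",
}


def pin_previous_defaults(overrides=None):
    """Return CLI arguments that set each flag to its pre-2.4 default.

    `overrides` is a hypothetical convenience for replacing single values.
    """
    values = dict(PREVIOUS_DEFAULTS)
    values.update(overrides or {})
    return [f"{flag}={value}" for flag, value in values.items()]


if __name__ == "__main__":
    # Print the arguments on one line, ready to append to a mimir command.
    print(" ".join(pin_previous_defaults()))
```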

With Grafana Mimir 2.4, anonymous usage statistics tracking is enabled by default. Mimir maintainers use this anonymous information to learn more about how the open source community runs Mimir and what the Mimir team should focus on when working on the next features and documentation improvements. If possible, we ask you to keep the usage reporting feature enabled. If you want to opt out of anonymous usage statistics reporting, refer to Disable the anonymous usage statistics reporting.

Bug fixes

  • PR 2979: Fix remote write HTTP response status code returned by Mimir when failing to write only to one ingester (the quorum is still honored when running Mimir with the default replication factor of 3) and some series are not ingested because of validation errors or some limits being reached.
  • PR 3005: Fix the querier to re-balance its workers connections when a query-frontend or query-scheduler instance is terminated.
  • PR 2963: Fix the remote read endpoint to correctly support the Accept-Encoding: snappy HTTP request header.

Changelog

2.4.0-rc.0

Grafana Mimir

  • [CHANGE] Distributor: change the default value of -distributor.remote-timeout to 2s from 20s and -distributor.forwarding.request-timeout to 2s from 10s to improve distributor resource usage when ingesters crash. #2728 #2912
  • [CHANGE] Anonymous usage statistics tracking: added the -ingester.ring.store value. #2981
  • [CHANGE] Series metadata HELP that is longer than -validation.max-metadata-length is now truncated silently, instead of being dropped with a 400 status code. #2993
  • [CHANGE] Ingester: changed default setting for -ingester.ring.readiness-check-ring-health from true to false. #2953
  • [CHANGE] Anonymous usage statistics tracking has been enabled by default, to help Mimir maintainers make better decisions to support the open source community. #2939 #3034
  • [CHANGE] Anonymous usage statistics tracking: added the minimum and maximum value of -ingester.out-of-order-time-window. #2940
  • [CHANGE] The default hash ring heartbeat period for distributors, ingesters, rulers and compactors has been increased from 5s to 15s. Now the default heartbeat period for all Mimir hash rings is 15s. #3033
  • [CHANGE] Reduce the default TSDB head compaction concurrency (-blocks-storage.tsdb.head-compaction-concurrency) from 5 to 1, in order to reduce CPU spikes. #3093
  • [CHANGE] Ruler: the ruler's remote evaluation mode (-ruler.query-frontend.address) is now stable. #3109
  • [CHANGE] Limits: removed the deprecated YAML configuration option active_series_custom_trackers_config. Please use active_series_custom_trackers instead. #3110
  • [CHANGE] Ingester: removed the deprecated configuration option -ingester.ring.join-after. #3111
  • [CHANGE] Querier: removed the deprecated configuration option -querier.shuffle-sharding-ingesters-lookback-period. The value of -querier.query-ingesters-within is now used internally for shuffle sharding lookback, while you can use -querier.shuffle-sharding-ingesters-enabled to enable or disable shuffle sharding on the read path. #3111
  • [CHANGE] Memberlist: cluster label verification feature (-memberlist.cluster-label and -memberlist.cluster-label-verification-disabled) is now marked as stable. #3108
  • [CHANGE] Distributor: only a single per-tenant forwarding endpoint can be configured now. Support for per-rule endpoints has been removed. #3095
  • [CHANGE] Query-frontend: truncate queries based on the configured blocks retention period (-compactor.blocks-retention-period) to avoid querying past this period. #3134
  • [FEATURE] Query-scheduler: added an experimental ring-based service discovery support for the query-scheduler. Refer to query-scheduler configuration for more information. #2957
  • [FEATURE] Introduced the experimental endpoint /api/v1/user_limits exposed by all components that load runtime configuration. This endpoint exposes realtime limits for the authenticated tenant, in JSON format. #2864 #3017
  • [FEATURE] Query-scheduler: added the experimental configuration option -query-scheduler.max-used-instances to restrict the number of query-schedulers effectively used regardless of how many replicas are running. This feature can be useful when using the experimental read-write deployment mode. #3005
  • [ENHANCEMENT] Go: updated to go 1.19.2. #2637 #3127 #3129
  • [ENHANCEMENT] Runtime config: don't unmarshal runtime configuration files if they haven't changed. This can save a bit of CPU and memory on every component using runtime config. #2954
  • [ENHANCEMENT] Query-frontend: Add cortex_frontend_query_result_cache_skipped_total and cortex_frontend_query_result_cache_attempted_total metrics to track the reason why query results are not cached. #2855
  • [ENHANCEMENT] Distributor: pool more connections per host when forwarding request. Mark requests as idempotent so they can be retried under some conditions. #2968
  • [ENHANCEMENT] Distributor: failure to send request to forwarding target now also increments cortex_distributor_forward_errors_total, with status_code="failed". #2968
  • [ENHANCEMENT] Distributor: added support for forwarding push requests via gRPC, using httpgrpc messages from the weaveworks/common library. #2996
  • [ENHANCEMENT] Query-frontend / Querier: increase internal backoff period used to retry connections to query-frontend / query-scheduler. #3011
  • [ENHANCEMENT] Querier: do not log "error processing requests from scheduler" when the query-scheduler is shutting down. #3012
  • [ENHANCEMENT] Query-frontend: query sharding process is now time-bounded and it is cancelled if the request is aborted. #3028
  • [ENHANCEMENT] Query-frontend: improved Prometheus response JSON encoding performance. #2450
  • [ENHANCEMENT] TLS: added configuration parameters to configure the client's TLS cipher suites and minimum version. The following new CLI flags have been added: #3070
    • -alertmanager.alertmanager-client.tls-cipher-suites
    • -alertmanager.alertmanager-client.tls-min-version
    • -alertmanager.sharding-ring.etcd.tls-cipher-suites
    • -alertmanager.sharding-ring.etcd.tls-min-version
    • -compactor.ring.etcd.tls-cipher-suites
    • -compactor.ring.etcd.tls-min-version
    • -distributor.forwarding.grpc-client.tls-cipher-suites
    • -distributor.forwarding.grpc-client.tls-min-version
    • -distributor.ha-tracker.etcd.tls-cipher-suites
    • -distributor.ha-tracker.etcd.tls-min-version
    • -distributor.ring.etcd.tls-cipher-suites
    • -distributor.ring.etcd.tls-min-version
    • -ingester.client.tls-cipher-suites
    • -ingester.client.tls-min-version
    • -ingester.ring.etcd.tls-cipher-suites
    • -ingester.ring.etcd.tls-min-version
    • -memberlist.tls-cipher-suites
    • -memberlist.tls-min-version
    • -querier.frontend-client.tls-cipher-suites
    • -querier.frontend-client.tls-min-version
    • -querier.store-gateway-client.tls-cipher-suites
    • -querier.store-gateway-client.tls-min-version
    • -query-frontend.grpc-client-config.tls-cipher-suites
    • -query-frontend.grpc-client-config.tls-min-version
    • -query-scheduler.grpc-client-config.tls-cipher-suites
    • -query-scheduler.grpc-client-config.tls-min-version
    • -query-scheduler.ring.etcd.tls-cipher-suites
    • -query-scheduler.ring.etcd.tls-min-version
    • -ruler.alertmanager-client.tls-cipher-suites
    • -ruler.alertmanager-client.tls-min-version
    • -ruler.client.tls-cipher-suites
    • -ruler.client.tls-min-version
    • -ruler.query-frontend.grpc-client-config.tls-cipher-suites
    • -ruler.query-frontend.grpc-client-config.tls-min-version
    • -ruler.ring.etcd.tls-cipher-suites
    • -ruler.ring.etcd.tls-min-version
    • -store-gateway.sharding-ring.etcd.tls-cipher-suites
    • -store-gateway.sharding-ring.etcd.tls-min-version
  • [ENHANCEMENT] Store-gateway: Add -blocks-storage.bucket-store.max-concurrent-reject-over-limit option to allow requests that exceed the max number of inflight object storage requests to be rejected. #2999
  • [ENHANCEMENT] Query-frontend: allow setting a separate limit on the total (before splitting/sharding) query length of range queries with the new experimental -query-frontend.max-total-query-length flag, which defaults to -store.max-query-length if unset or set to 0. #3058
  • [ENHANCEMENT] Query-frontend: Lower TTL for cache entries overlapping the out-of-order samples ingestion window (re-using -ingester.out-of-order-allowance from ingesters). #2935
  • [ENHANCEMENT] Ruler: added support to forcefully disable recording and/or alerting rules evaluation. The following new configuration options have been introduced, which can be overridden on a per-tenant basis in the runtime configuration: #3088
    • -ruler.recording-rules-evaluation-enabled
    • -ruler.alerting-rules-evaluation-enabled
  • [ENHANCEMENT] Distributor: Add age filter to forwarding functionality, to not forward samples which are older than defined duration. #3049
  • [ENHANCEMENT] Distributor: Improved error messages reported when the distributor fails to remote write to ingesters. #3055
  • [ENHANCEMENT] Improved tracing spans tracked by distributors, ingesters and store-gateways. #2879 #3099 #3089
  • [ENHANCEMENT] Ingester: improved the performance of label value cardinality endpoint. #3044
  • [ENHANCEMENT] Ruler: use backoff retry on remote evaluation. #3098
  • [ENHANCEMENT] Query-frontend: Include multiple tenant IDs in query logs when present instead of dropping them. #3125
  • [ENHANCEMENT] Alertmanager: reduced memory utilization in Mimir clusters with a large number of tenants. #3143
  • [ENHANCEMENT] Store-gateway: added extra span logging to improve observability. #3131
  • [BUGFIX] Querier: Fix 400 response while handling streaming remote read. #2963
  • [BUGFIX] Fix a bug that prevented query-frontend, query-scheduler, and querier from failing when one of their internal components fails. #2978
  • [BUGFIX] Querier: re-balance the querier worker connections when a query-frontend or query-scheduler is terminated. #3005
  • [BUGFIX] Distributor: Now returns the quorum error from ingesters. For example, with replication_factor=3, two HTTP 400 errors and one HTTP 500 error, now the distributor will always return HTTP 400. Previously the behaviour was to return the error which the distributor first received. #2979
  • [BUGFIX] Ruler: fix panic when ruler.external_url is explicitly set to an empty string ("") in YAML. #2915
  • [BUGFIX] Alertmanager: Fix support for the Telegram API URL in the global settings. #3097
  • [BUGFIX] Alertmanager: Fix parsing of label matchers without label value in the API used to retrieve alerts. #3097
  • [BUGFIX] Ruler: Fix not restoring alert state for rule groups when other ruler replicas shut down. #3156
  • [BUGFIX] Updated golang.org/x/net dependency to fix CVE-2022-27664. #3124
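The quorum behavior described in the distributor bugfix above (#2979) can be illustrated with a small Python sketch. It is not Mimir's implementation: the function name `quorum_status` and the rule "return the status code reported by a majority of replicas" are simplifying assumptions chosen to reproduce the example from that changelog entry.

```python
from collections import Counter
from itertools import permutations


def quorum_status(error_statuses, replication_factor=3):
    """Return the error status code shared by a quorum of replicas, or None.

    Assumption: quorum means a strict majority of the replication factor.
    The result does not depend on the order in which the errors arrive.
    """
    quorum = replication_factor // 2 + 1
    for status, count in Counter(error_statuses).most_common():
        if count >= quorum:
            return status
    return None


if __name__ == "__main__":
    # Two HTTP 400 errors and one HTTP 500 error, in every arrival order.
    for order in permutations([400, 400, 500]):
        print(order, "->", quorum_status(order))
```

Because the decision is based on counts rather than on which error arrived first, every arrival order of two 400s and one 500 yields 400 in this sketch.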

Mixin

  • [CHANGE] Alerts: MimirQuerierAutoscalerNotActive is now critical and fires after 1h instead of 15m. #2958
  • [FEATURE] Dashboards: Added "Mimir / Overview" dashboards, providing a high-level view over a Mimir cluster. #3122 #3147 #3155
  • [ENHANCEMENT] Dashboards: Updated the "Writes" and "Rollout progress" dashboards to account for samples ingested via the new OTLP ingestion endpoint. #2919 #2938
  • [ENHANCEMENT] Dashboards: Include per-tenant request rate in "Tenants" dashboard. #2874
  • [ENHANCEMENT] Dashboards: Include inflight object store requests in "Reads" dashboard. #2914
  • [ENHANCEMENT] Dashboards: Make queries used to find job, cluster and namespace for dropdown menus configurable. #2893
  • [ENHANCEMENT] Dashboards: Include rate of label and series queries in "Reads" dashboard. #3065 #3074
  • [ENHANCEMENT] Dashboards: Fix legend showing on per-pod panels. #2944
  • [ENHANCEMENT] Dashboards: Use the "req/s" unit on panels showing the requests rate. #3118
  • [ENHANCEMENT] Dashboards: Use a consistent color across dashboards for the error rate. #3154

Jsonnet

  • [FEATURE] Added support for query-scheduler ring-based service discovery. #3128
  • [ENHANCEMENT] Querier autoscaling is now slower on scale downs: scale down 10% every 1m instead of 100%. #2962
  • [BUGFIX] Memberlist: gossip_member_label is now set for ruler-queriers. #3141

Mimirtool

  • [ENHANCEMENT] mimirtool analyze: Store the query errors instead of exiting during the analysis. #3052
  • [BUGFIX] mimirtool remote-read: fix code paths that returned a nil error even when an error had occurred. #3053

Documentation

  • [ENHANCEMENT] Added documentation on how to configure storage retention. #2970
  • [ENHANCEMENT] Improved gRPC clients config documentation. #3020
  • [ENHANCEMENT] Added documentation on how to manage alerting and recording rules. #2983
  • [ENHANCEMENT] Improved MimirSchedulerQueriesStuck runbook. #3006
  • [ENHANCEMENT] Added "Cluster label verification" section to memberlist documentation. #3096
  • [ENHANCEMENT] Mention compression in multi-zone replication documentation. #3107
  • [BUGFIX] Fixed configuration option names in "Enabling zone-awareness via the Grafana Mimir Jsonnet". #3018
  • [BUGFIX] Fixed mimirtool analyze parameters documentation. #3094
  • [BUGFIX] Fixed YAML configuration in the "Manage the configuration of Grafana Mimir with Helm" guide. #3042
  • [BUGFIX] Fixed Alertmanager capacity planning documentation. #3132

Tools

  • [BUGFIX] trafficdump: Fixed panic occurring when -success-only=true and the captured request failed. #2863
mimir - mimir-2.3.1

Published by treid314 about 2 years ago

This release contains 5 PRs from 1 author. Thank you!

2.3.1

Grafana Mimir

  • [BUGFIX] Query-frontend: query sharding took exponential time to map binary expressions. #3027
  • [BUGFIX] Distributor: Stop panics on OTLP endpoint when a single metric has multiple timeseries. #3040

Full Changelog: https://github.com/grafana/mimir/compare/mimir-2.3.0...mimir-2.3.1

mimir - mimir-2.3.0

Published by treid314 about 2 years ago

Grafana Mimir version 2.3 release notes

Grafana Labs is excited to announce version 2.3 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Note: If you are upgrading from Grafana Mimir 2.2, review the list of important changes that follow.

This release contains 370 PRs from 39 authors. Thank you!

Features and enhancements

  • Ingest metrics in OpenTelemetry format:
    This release of Grafana Mimir introduces experimental support for ingesting metrics from the OpenTelemetry Collector's otlphttp exporter. This adds a second ingestion option for users of the OTel Collector; Mimir was already compatible with the prometheusremotewrite exporter. For more information, please see Configure OTel Collector.

  • Tenant federation for metadata queries:
    Users with tenant federation enabled could already issue instant queries, range queries, and exemplar queries to multiple tenants at once and receive a single aggregated result. With Grafana Mimir 2.3, we've added tenant federation support to the /api/v1/metadata endpoint as well.

  • Simpler object storage configuration:
    Users can now configure block, alertmanager, and ruler storage all at once with the common YAML config option key (or -common.storage.* CLI flags). By centralizing your object storage configuration in one place, this enhancement makes configuration faster and less error prone. Users may still individually configure storage for each of these components if they desire. For more information, see the Common Configurations.

  • .deb and .rpm packages for Mimir:
    Starting with version 2.3, we're publishing .deb and .rpm files for Grafana Mimir, which will make installing and running it on Debian or Red Hat-based Linux systems much easier. Thank you to community contributor wilfriedroset for your work to implement this!

  • Import historic data:
    Users can now backfill time series data from their existing Prometheus or Cortex installation into Mimir using mimirtool, making it possible to migrate to Grafana Mimir without losing your existing metrics data. This support is still considered experimental and does not yet work for data stored in Thanos. To learn more about this feature, please see mimirtool backfill and Configure TSDB block upload.

  • Increased instant query performance:
    Grafana Mimir now supports splitting instant queries by time. This allows it to better parallelize execution of instant queries and therefore return results faster. At present, splitting is only supported for a subset of instant queries, which means not all instant queries will see a speedup. This feature is currently experimental and is disabled by default. It can be enabled with the split_instant_queries_by_interval YAML config option in the limits section (or the CLI flag -query-frontend.split-instant-queries-by-interval).
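To show why splitting instant queries by time can help, here is a minimal Python sketch of the underlying idea for a sum-style aggregation: the result over a long window equals the combination of results over shorter sub-windows, and each sub-window can be computed independently. The helper names, the half-open window convention `(start, end]`, and the sample data are assumptions for illustration. The sketch does not reflect how the query-frontend rewrites PromQL.

```python
def split_window(start, end, interval):
    """Split the half-open window (start, end] into pieces no longer than `interval`."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    pieces = []
    lo = start
    while lo < end:
        hi = min(lo + interval, end)
        pieces.append((lo, hi))
        lo = hi
    return pieces


def sum_over_window(samples, start, end):
    """Sum the values of samples whose timestamp t satisfies start < t <= end."""
    return sum(value for t, value in samples if start < t <= end)


DAY = 24 * 60 * 60
# Hypothetical series: one integer sample per hour for three days.
SAMPLES = [(t, t % 7) for t in range(3600, 3 * DAY + 1, 3600)]

if __name__ == "__main__":
    pieces = split_window(0, 3 * DAY, DAY)
    # Each partial sum could be evaluated by a different worker.
    partials = [sum_over_window(SAMPLES, lo, hi) for lo, hi in pieces]
    print(pieces)
    print(sum(partials) == sum_over_window(SAMPLES, 0, 3 * DAY))
```

The half-open windows matter: they ensure a sample on a boundary is counted in exactly one piece.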

Helm chart improvements

The Mimir Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.3 release, we’re also releasing version 3.1 of the Mimir Helm chart.

Notable enhancements follow. For the full list of changes, see the Helm chart changelog.

  • We've upgraded the MinIO subchart dependency from a deprecated chart to the supported one. This creates a breaking change in how the administrator password is set. However, as the built-in MinIO is not a recommended object store for production use cases, this change did not warrant a new major version of the Mimir Helm chart.
  • Query sharding is now enabled by default which should give you better performance on high cardinality metrics queries.
    • To compensate for the increased number of queries generated by query sharding, the query scheduler component is now enabled by default.
  • The backfill API endpoints for importing historic time series data are now exposed on the Nginx gateway.
  • Nginx now sets the value of the X-Scope-OrgID header equal to the value of Mimir's no_auth_tenant parameter by default. The previous release had set the value of X-Scope-OrgID to anonymous by default which complicated the process of migrating to Mimir.
  • Memberlist now uses DNS service-discovery by default, which decreases startup time for large Mimir clusters.

Important changes

In Grafana Mimir 2.3 we have removed the following previously deprecated configuration options:

  • The extend_writes parameter in the distributor YAML configuration and -distributor.extend-writes CLI flag have been removed.
  • The active_series_custom_trackers parameter has been removed from the YAML configuration. It had already been moved to the runtime configuration. See #1188 for details.
  • The blocks-storage.tsdb.isolation-enabled parameter in the YAML configuration and -blocks-storage.tsdb.isolation-enabled CLI flag have been removed.

With Grafana Mimir 2.3 we have also updated the default value for the CLI flag -distributor.ha-tracker.max-clusters to 100 to provide Denial-of-Service protection. Previously -distributor.ha-tracker.max-clusters was unlimited by default which could allow a tenant with HA Dedupe enabled to overload the HA tracker with __cluster__ label values that could cause the HA Dedupe database to fail.
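The effect of a per-tenant cap on tracked clusters can be sketched in a few lines of Python. This is a hypothetical illustration of the protection described above, not the HA tracker's actual code: the class name `ClusterLimiter`, its `observe` method, and the convention that 0 means unlimited are all assumptions.

```python
class ClusterLimiter:
    """Caps how many distinct cluster values are tracked per tenant.

    Hypothetical illustration of a max-clusters limit. A limit of 0 means
    unlimited, which is an assumption made for this sketch.
    """

    def __init__(self, max_clusters=100):
        self.max_clusters = max_clusters
        self._clusters = {}  # tenant -> set of cluster names seen so far

    def observe(self, tenant, cluster):
        """Return True if the cluster is tracked, False if it is rejected."""
        seen = self._clusters.setdefault(tenant, set())
        if cluster in seen:
            return True
        if self.max_clusters and len(seen) >= self.max_clusters:
            return False  # limit reached: do not grow the tracked set
        seen.add(cluster)
        return True


if __name__ == "__main__":
    limiter = ClusterLimiter(max_clusters=2)
    for name in ["a", "b", "c", "a"]:
        print(name, limiter.observe("tenant-1", name))
```

With a cap in place, a tenant that sends an unbounded number of distinct cluster values can no longer grow the tracked state without limit, while clusters that are already tracked keep working.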

Also, as noted above, the administrator password for Helm chart deployments using the built-in MinIO is now set differently.

Bug fixes

  • PR 2447: Fix incorrect mapping of http status codes 429 to 500 when the request queue is full in the query-frontend. This corrects behavior in the query-frontend where a retryable 429 "Too Many Outstanding Requests" error from a querier was incorrectly returned as an unretryable 500 system error.
  • PR 2505: The Memberlist key-value (KV) store now tries to "fast-join" the cluster to avoid serving an empty KV store. This fix addresses the confusing "empty ring" error response and the error log message "ring doesn't exist in KV store yet" emitted by services when there are other members present in the ring when a service starts. Those using other key-value store options (e.g., consul, etcd) are not impacted by this bug.
  • PR 2289: The "List Prometheus rules" API endpoint of the Mimir Ruler component is no longer blocked while rules are being synced. This means users can now list rules while syncing larger rule sets.

Changelog

2.3.0

Grafana Mimir

  • [CHANGE] Ingester: Added user label to ingester metric cortex_ingester_tsdb_out_of_order_samples_appended_total. On multitenant clusters this helps us find the rate of appended out-of-order samples for a specific tenant. #2493
  • [CHANGE] Compactor: delete source and output blocks from local disk when compaction fails, to reduce the likelihood that subsequent compactions fail because of no space left on disk. #2261
  • [CHANGE] Ruler: Remove unused CLI flags -ruler.search-pending-for and -ruler.flush-period (and their respective YAML config options). #2288
  • [CHANGE] Successful gRPC requests are no longer logged (only affects internal API calls). #2309
  • [CHANGE] Add new -*.consul.cas-retry-delay flags. They have a default value of 1s, while previously there was no delay between retries. #2309
  • [CHANGE] Store-gateway: Remove the experimental ability to run requests in a dedicated OS thread pool and associated CLI flag -store-gateway.thread-pool-size. #2423
  • [CHANGE] Memberlist: disabled TCP-based ping fallback, because Mimir already uses a custom transport based on TCP. #2456
  • [CHANGE] Change default value for -distributor.ha-tracker.max-clusters to 100 to provide a DoS protection. #2465
  • [CHANGE] Experimental block upload API exposed by compactor has changed: Previous /api/v1/upload/block/{block} endpoint for starting block upload is now /api/v1/upload/block/{block}/start, and previous endpoint /api/v1/upload/block/{block}?uploadComplete=true for finishing block upload is now /api/v1/upload/block/{block}/finish. New API endpoint has been added: /api/v1/upload/block/{block}/check. #2486 #2548
  • [CHANGE] Compactor: changed -compactor.max-compaction-time default from 0s (disabled) to 1h. When compacting blocks for a tenant, the compactor will move to compact blocks of another tenant or re-plan blocks to compact at least every 1h. #2514
  • [CHANGE] Distributor: removed previously deprecated extend_writes (see #1856) YAML key and -distributor.extend-writes CLI flag from the distributor config. #2551
  • [CHANGE] Ingester: removed previously deprecated active_series_custom_trackers (see #1188) YAML key from the ingester config. #2552
  • [CHANGE] The tenant ID __mimir_cluster is reserved by Mimir and not allowed to store metrics. #2643
  • [CHANGE] Purger: removed the purger component and moved its API endpoints /purger/delete_tenant and /purger/delete_tenant_status to the compactor at /compactor/delete_tenant and /compactor/delete_tenant_status. The new endpoints on the compactor are stable. #2644
  • [CHANGE] Memberlist: Change the leave timeout (-memberlist.leave-timeout) from 5s to 20s and the connection timeout (-memberlist.packet-dial-timeout) from 5s to 2s. This makes the leave timeout 10x the connection timeout, so that we can communicate the leave to at least 1 node if the first 9 we try to contact time out. #2669
  • [CHANGE] Alertmanager: return status code 412 Precondition Failed and log info message when alertmanager isn't configured for a tenant. #2635
  • [CHANGE] Distributor: if forwarding rules are used to forward samples, exemplars are now removed from the request. #2710, #2725
  • [CHANGE] Limits: change the default value of max_global_series_per_metric limit to 0 (disabled). Setting this limit by default does not provide much benefit because series are sharded by all labels. #2714
  • [CHANGE] Ingester: experimental -blocks-storage.tsdb.new-chunk-disk-mapper has been removed, new chunk disk mapper is now always used, and is no longer marked experimental. Default value of -blocks-storage.tsdb.head-chunks-write-queue-size has changed to 1000000, this enables async chunk queue by default, which leads to improved latency on the write path when new chunks are created in ingesters. #2762
  • [CHANGE] Ingester: removed deprecated -blocks-storage.tsdb.isolation-enabled option. TSDB-level isolation is now always disabled in Mimir. #2782
  • [CHANGE] Compactor: -compactor.partial-block-deletion-delay must either be set to 0 (to disable partial blocks deletion) or a value higher than 4h. #2787
  • [CHANGE] Query-frontend: CLI flag -query-frontend.align-querier-with-step has been deprecated. Please use -query-frontend.align-queries-with-step instead. #2840
  • [FEATURE] Compactor: Adds the ability to delete partial blocks after a configurable delay. This option can be configured per tenant. #2285
    • -compactor.partial-block-deletion-delay, as a duration string, allows you to set the delay since a partial block has been modified before marking it for deletion. A value of 0, the default, disables this feature.
    • The metric cortex_compactor_blocks_marked_for_deletion_total has a new value for the reason label reason="partial", when a block deletion marker is triggered by the partial block deletion delay.
  • [FEATURE] Querier: enabled support for queries with negative offsets, which are not cached in the query results cache. #2429
  • [FEATURE] EXPERIMENTAL: OpenTelemetry Metrics ingestion path on /otlp/v1/metrics. #695 #2436 #2461
  • [FEATURE] Querier: Added support for tenant federation to metric metadata endpoint. #2467
  • [FEATURE] Query-frontend: introduced experimental support to split instant queries by time. The instant query splitting can be enabled setting -query-frontend.split-instant-queries-by-interval. #2469 #2564 #2565 #2570 #2571 #2572 #2573 #2574 #2575 #2576 #2581 #2582 #2601 #2632 #2633 #2634 #2641 #2642 #2766
  • [FEATURE] Introduced experimental anonymous usage statistics tracking (disabled by default), to help Mimir maintainers make better decisions to support the open source community. The tracking system anonymously collects non-sensitive, non-personally identifiable information about the running Mimir cluster. #2643 #2662 #2685 #2732 #2733 #2735
  • [FEATURE] Introduced an experimental deployment mode called read-write, which runs a fully featured Mimir cluster with three components: write, read, and backend. The read-write deployment mode is a trade-off between the monolithic mode (only one component, no isolation) and the microservices mode (many components, high isolation). #2754 #2838
  • [ENHANCEMENT] Distributor: Decreased distributor tests execution time. #2562
  • [ENHANCEMENT] Alertmanager: Allow the HTTP proxy_url configuration option in the receiver's configuration. #2317
  • [ENHANCEMENT] ring: optimize shuffle-shard computation when lookback is used and all instances have a registered timestamp within the lookback window. In that case we can immediately return the original ring, because we would select all instances anyway. #2309
  • [ENHANCEMENT] Memberlist: added experimental memberlist cluster label support via -memberlist.cluster-label and -memberlist.cluster-label-verification-disabled CLI flags (and their respective YAML config options). #2354
  • [ENHANCEMENT] Object storage can now be configured for all components using the common YAML config option key (or -common.storage.* CLI flags). #2330 #2347
  • [ENHANCEMENT] Go: updated to go 1.18.4. #2400
  • [ENHANCEMENT] Store-gateway, listblocks: list of blocks now includes stats from meta.json file: number of series, samples and chunks. #2425
  • [ENHANCEMENT] Added more buckets to cortex_ingester_client_request_duration_seconds histogram metric, to correctly track requests taking longer than 1s (up until 16s). #2445
  • [ENHANCEMENT] Azure client: Improve memory usage for large object storage downloads. #2408
  • [ENHANCEMENT] Distributor: Add -distributor.instance-limits.max-inflight-push-requests-bytes. This limit protects the distributor against multiple large requests that together may cause an OOM, but are only a few, so do not trigger the max-inflight-push-requests limit. #2413
  • [ENHANCEMENT] Distributor: Drop exemplars in distributor for tenants where exemplars are disabled. #2504
  • [ENHANCEMENT] Runtime Config: Allow operator to specify multiple comma-separated yaml files in -runtime-config.file that will be merged in left to right order. #2583
  • [ENHANCEMENT] Query sharding: shard binary operations only if it doesn't lead to non-shardable vector selectors in one of the operands. #2696
  • [ENHANCEMENT] Add packaging for both debian based deb file and redhat based rpm file using FPM. #1803
  • [ENHANCEMENT] Distributor: Add cortex_distributor_query_ingester_chunks_deduped_total and cortex_distributor_query_ingester_chunks_total metrics for determining how effective ingester chunk deduplication at query time is. #2713
  • [ENHANCEMENT] Upgrade Docker base images to alpine:3.16.2. #2729
  • [ENHANCEMENT] Ruler: Add <prometheus-http-prefix>/api/v1/status/buildinfo endpoint. #2724
  • [ENHANCEMENT] Querier: Ensure all queries pulled from query-frontend or query-scheduler are immediately executed. The maximum workers concurrency in each querier is configured by -querier.max-concurrent. #2598
  • [ENHANCEMENT] Distributor: Add cortex_distributor_received_requests_total and cortex_distributor_requests_in_total metrics to provide visibility into appropriate per-tenant request limits. #2770
  • [ENHANCEMENT] Distributor: Add single forwarding remote-write endpoint for a tenant (forwarding_endpoint), instead of using per-rule endpoints. This takes precedence over per-rule endpoints. #2801
  • [ENHANCEMENT] Added err-mimir-distributor-max-write-message-size to the errors catalog. #2470
  • [ENHANCEMENT] Add sanity check at startup to ensure the configured filesystem directories don't overlap for different components. #2828
  • [BUGFIX] TSDB: Fixed a bug on the experimental out-of-order implementation that led to wrong query results. #2701
  • [BUGFIX] Compactor: log the actual error when compaction fails. #2261
  • [BUGFIX] Alertmanager: restore state from storage even when running a single replica. #2293
  • [BUGFIX] Ruler: do not block "List Prometheus rules" API endpoint while syncing rules. #2289
  • [BUGFIX] Ruler: return proper *status.Status error when running in remote operational mode. #2417
  • [BUGFIX] Alertmanager: ensure the configured -alertmanager.web.external-url is either a path starting with /, or a full URL including the scheme and hostname. #2381 #2542
  • [BUGFIX] Memberlist: fix problem with loss of some packets, typically ring updates when instances were removed from the ring during shutdown. #2418
  • [BUGFIX] Ingester: fix misfiring MimirIngesterHasUnshippedBlocks and stale cortex_ingester_oldest_unshipped_block_timestamp_seconds when some block uploads fail. #2435
  • [BUGFIX] Query-frontend: fix incorrect mapping of http status codes 429 to 500 when request queue is full. #2447
  • [BUGFIX] Memberlist: Fix problem with ring being empty right after startup. Memberlist KV store now tries to "fast-join" the cluster to avoid serving empty KV store. #2505
  • [BUGFIX] Compactor: Fix bug when using -compactor.partial-block-deletion-delay: compactor didn't correctly check for modification time of all block files. #2559
  • [BUGFIX] Query-frontend: fix wrong query sharding results for queries with boolean result like 1 < bool 0. #2558
  • [BUGFIX] Fixed error messages related to per-instance limits incorrectly reporting they can be set on a per-tenant basis. #2610
  • [BUGFIX] Perform HA-deduplication before forwarding samples according to forwarding rules in the distributor. #2603 #2709
  • [BUGFIX] Fix reporting of tracing spans from PromQL engine. #2707
  • [BUGFIX] Apply relabel and drop_label rules before forwarding rules in the distributor. #2703
  • [BUGFIX] Distributor: Register cortex_discarded_requests_total metric, which previously was not registered and therefore not exported. #2712
  • [BUGFIX] Ruler: fix not restoring alerts' state at startup. #2648
  • [BUGFIX] Ingester: Fix disk filling up after restarting ingesters with out-of-order support disabled while it was enabled before. #2799
  • [BUGFIX] Memberlist: retry joining memberlist cluster on startup when no nodes are resolved. #2837
  • [BUGFIX] Query-frontend: fix incorrect mapping of http status codes 413 to 500 when request is too large. #2819
  • [BUGFIX] Alertmanager: revert upstream alertmanager to v0.24.0 to fix panic when unmarshalling email headers. #2924 #2925
  • [BUGFIX] Fix sanity check done on configured filesystem directories when running Alertmanager in microservices mode. #2947
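
The startup sanity check mentioned above (#2828, #2947) rejects configurations where the filesystem directories of different components overlap. The following is a minimal sketch of what such an overlap test could look like, not Mimir's actual implementation. The component names and paths are hypothetical, and the sketch compares paths lexically (it does not resolve symlinks or `..` segments).

```python
from itertools import combinations
from pathlib import PurePosixPath


def overlaps(a: str, b: str) -> bool:
    """Return True if one directory equals, or is nested inside, the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_overlaps(dirs: dict) -> list:
    """Return every pair of component names whose directories overlap."""
    return [
        (name_a, name_b)
        for (name_a, dir_a), (name_b, dir_b) in combinations(dirs.items(), 2)
        if overlaps(dir_a, dir_b)
    ]


# Hypothetical layout: names and paths are illustrative only.
configured = {
    "ingester-tsdb": "/data/tsdb",
    "compactor": "/data/compactor",
    "alertmanager": "/data/tsdb/alertmanager",  # nested inside the TSDB dir
}
conflicts = find_overlaps(configured)
print(conflicts)  # [('ingester-tsdb', 'alertmanager')]
```

Comparing whole path components (rather than string prefixes) avoids false positives such as `/data/a` versus `/data/ab`.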

Mixin

  • [CHANGE] Dashboards: "Slow Queries" dashboard no longer works with versions older than Grafana 9.0. #2223
  • [CHANGE] Alerts: use RSS memory instead of working set memory in the MimirAllocatingTooMuchMemory alert for ingesters. #2480
  • [CHANGE] Dashboards: remove the "Cache - Latency (old)" panel from the "Mimir / Queries" dashboard. #2796
  • [FEATURE] Dashboards: added support for the experimental read-write deployment mode. #2780
  • [ENHANCEMENT] Dashboards: added missed rule evaluations to the "Evaluations per second" panel in the "Mimir / Ruler" dashboard. #2314
  • [ENHANCEMENT] Dashboards: add k8s resource requests to CPU and memory panels. #2346
  • [ENHANCEMENT] Dashboards: add RSS memory utilization panel for ingesters, store-gateways and compactors. #2479
  • [ENHANCEMENT] Dashboards: allow configuring the graph tooltip. #2647
  • [ENHANCEMENT] Alerts: MimirFrontendQueriesStuck and MimirSchedulerQueriesStuck alerts are more reliable now as they consider all the intermediate samples in the minute prior to the evaluation. #2630
  • [ENHANCEMENT] Alerts: added RolloutOperatorNotReconciling alert, firing if the optional rollout-operator is not successfully reconciling. #2700
  • [ENHANCEMENT] Dashboards: added support for query-tee in front of ruler-query-frontend in the "Remote ruler reads" dashboard. #2761
  • [ENHANCEMENT] Dashboards: Introduce support for baremetal deployment, setting deployment_type: 'baremetal' in the mixin _config. #2657
  • [ENHANCEMENT] Dashboards: use timeseries panel to show exemplars. #2800
  • [BUGFIX] Dashboards: fixed unit of latency panels in the "Mimir / Ruler" dashboard. #2312
  • [BUGFIX] Dashboards: fixed "Intervals per query" panel in the "Mimir / Queries" dashboard. #2308
  • [BUGFIX] Dashboards: Make "Slow Queries" dashboard work with Grafana 9.0. #2223
  • [BUGFIX] Dashboards: add missing API routes to Ruler dashboard. #2412
  • [BUGFIX] Dashboards: stop setting 'interval' in dashboards; it should be set on your datasource. #2802

Jsonnet

  • [CHANGE] query-scheduler is enabled by default. We advise deploying the query-scheduler to improve the scalability of the query-frontend. #2431
  • [CHANGE] Replaced anti-affinity rules with pod topology spread constraints for distributor, query-frontend, querier and ruler. #2517
    • The following configuration options have been removed:
      • distributor_allow_multiple_replicas_on_same_node
      • query_frontend_allow_multiple_replicas_on_same_node
      • querier_allow_multiple_replicas_on_same_node
      • ruler_allow_multiple_replicas_on_same_node
    • The following configuration options have been added:
      • distributor_topology_spread_max_skew
      • query_frontend_topology_spread_max_skew
      • querier_topology_spread_max_skew
      • ruler_topology_spread_max_skew
  • [CHANGE] Change max_global_series_per_metric to 0 in all plans, and as a default value. #2669
  • [FEATURE] Memberlist: added support for experimental memberlist cluster label, through the jsonnet configuration options memberlist_cluster_label and memberlist_cluster_label_verification_disabled. #2349
  • [FEATURE] Added ruler-querier autoscaling support. It requires KEDA installed in the Kubernetes cluster. Ruler-querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2545
    • autoscaling_ruler_querier_enabled: true to enable autoscaling.
    • autoscaling_ruler_querier_min_replicas: minimum number of ruler-querier replicas.
    • autoscaling_ruler_querier_max_replicas: maximum number of ruler-querier replicas.
    • autoscaling_prometheus_url: Prometheus base URL from which to scrape Mimir metrics (e.g. http://prometheus.default:9090/prometheus).
  • [ENHANCEMENT] Memberlist now uses DNS service-discovery by default. #2549
  • [ENHANCEMENT] Upgrade memcached image tag to memcached:1.6.16-alpine. #2740
  • [ENHANCEMENT] Added $._config.configmaps and $._config.runtime_config_files to make it easy to add new configmaps or runtime config file to all components. #2748

Mimirtool

  • [ENHANCEMENT] Added mimirtool backfill command to upload Prometheus blocks using API available in the compactor. #1822
  • [ENHANCEMENT] mimirtool bucket-validation: Verify existing objects can be overwritten by subsequent uploads. #2491
  • [ENHANCEMENT] mimirtool config convert: Now supports migrating to the current version of Mimir. #2629
  • [BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors by using custom parsing. #2386
  • [BUGFIX] Version checking no longer prompts for updating when already on latest version. #2723

Mimir Continuous Test

  • [ENHANCEMENT] Added basic authentication and bearer token support for when Mimir is behind a gateway authenticating the calls. #2717

Query-tee

  • [CHANGE] Renamed CLI flag -server.service-port to -server.http-service-port. #2683
  • [CHANGE] Renamed metric cortex_querytee_request_duration_seconds to cortex_querytee_backend_request_duration_seconds. Metric cortex_querytee_request_duration_seconds is now reported without label backend. #2683
  • [ENHANCEMENT] Added HTTP over gRPC support to query-tee to allow testing gRPC requests to Mimir instances. #2683

Documentation

  • [ENHANCEMENT] Referenced mimirtool commands in the HTTP API documentation. #2516
  • [ENHANCEMENT] Improved DNS service discovery documentation. #2513

Tools

  • [ENHANCEMENT] markblocks now processes multiple blocks concurrently. #2677

Full Changelog: https://github.com/grafana/mimir/compare/mimir-2.2.0...mimir-2.3.0

mimir - mimir-2.3.0-rc.2

Published by treid314 about 2 years ago

Changes since 2.3.0-rc.0

This release contains 33 contributions from 9 authors. Thank you!

Note: We tagged a 2.3.0-rc.1 but found a panic in the alertmanager before publishing the 2.3.0-rc.1 pre-release. With 2.3.0-rc.2 we have included the fix for the alertmanager and created a new tag and release candidate.


2.3.0-rc.2

Grafana Mimir

  • [BUGFIX] Alertmanager: revert upstream alertmanager to v0.24.0 to fix panic when unmarshalling email headers. #2924 #2925

2.3.0-rc.1

Grafana Mimir

  • [CHANGE] Distributor: if forwarding rules are used to forward samples, exemplars are now removed from the request #2725
  • [CHANGE] Ingester: experimental -blocks-storage.tsdb.new-chunk-disk-mapper has been removed, new chunk disk mapper is now always used, and is no longer marked experimental. Default value of -blocks-storage.tsdb.head-chunks-write-queue-size has changed to 1000000, this enables async chunk queue by default, which leads to improved latency on the write path when new chunks are created in ingesters. #2762
  • [CHANGE] Ingester: removed deprecated -blocks-storage.tsdb.isolation-enabled option. TSDB-level isolation is now always disabled in Mimir. #2782
  • [CHANGE] Compactor: -compactor.partial-block-deletion-delay must either be set to 0 (to disable partial blocks deletion) or a value higher than 4h. #2787
  • [CHANGE] Query-frontend: CLI flag -query-frontend.align-querier-with-step has been deprecated. Please use -query-frontend.align-queries-with-step instead. #2840
  • [CHANGE] Distributor: change the default value of -distributor.remote-timeout to 2s from 20s and -distributor.forwarding.request-timeout to 2s from 10s to improve distributor resource usage when ingesters crash. #2728
  • [FEATURE] Introduced experimental anonymous usage statistics tracking to help Mimir maintainers make better decisions to support the open source community. The tracking system anonymously collects non-sensitive, non-personally identifiable information about the running Mimir cluster, and is disabled by default. #2643 #2662 #2685 #2732 #2733 #2735
  • [FEATURE] Introduced an experimental deployment mode called read-write, which runs a fully featured Mimir cluster with three components: write, read and backend. The read-write deployment mode is a trade-off between the monolithic mode (only one component, no isolation) and the microservices mode (many components, high isolation). #2754 #2838
  • [ENHANCEMENT] Distributor: Add cortex_distributor_query_ingester_chunks_deduped_total and cortex_distributor_query_ingester_chunks_total metrics for determining how effective ingester chunk deduplication at query time is. #2713
  • [ENHANCEMENT] Upgrade Docker base images to alpine:3.16.2. #2729
  • [ENHANCEMENT] Ruler: Add <prometheus-http-prefix>/api/v1/status/buildinfo endpoint. #2724
  • [ENHANCEMENT] Querier: Ensure all queries pulled from query-frontend or query-scheduler are immediately executed. The maximum workers concurrency in each querier is configured by -querier.max-concurrent. #2598
  • [ENHANCEMENT] Distributor: Add cortex_distributor_received_requests_total and cortex_distributor_requests_in_total metrics to provide visibility into appropriate per-tenant request limits. #2770
  • [ENHANCEMENT] Distributor: Add single forwarding remote-write endpoint for a tenant (forwarding_endpoint), instead of using per-rule endpoints. This takes precedence over per-rule endpoints. #2801
  • [ENHANCEMENT] Added err-mimir-distributor-max-write-message-size to the errors catalog. #2470
  • [ENHANCEMENT] Add sanity check at startup to ensure the configured filesystem directories don't overlap for different components. #2828
  • [ENHANCEMENT] Go: updated to go 1.19.1. #2637
  • [BUGFIX] Ruler: fix not restoring alerts' state at startup. #2648
  • [BUGFIX] Ingester: Fix disk filling up after restarting ingesters with out-of-order support disabled while it was enabled before. #2799
  • [BUGFIX] Memberlist: retry joining memberlist cluster on startup when no nodes are resolved. #2837
  • [BUGFIX] Query-frontend: fix incorrect mapping of http status codes 413 to 500 when request is too large. #2819
  • [BUGFIX] Ruler: fix panic when ruler.external_url is explicitly set to an empty string ("") in YAML. #2915
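
The -compactor.partial-block-deletion-delay constraint listed above (#2787) can be expressed as a small validation function. This is a hedged sketch, not Mimir's code: the function name is hypothetical, the entry does not say how Mimir reacts to an invalid value (raising an error here is an assumption), and "higher than 4h" is read strictly, so exactly 4h is rejected in this sketch.

```python
from datetime import timedelta

# Threshold taken from the changelog entry; the strict comparison is an assumption.
MIN_DELAY = timedelta(hours=4)


def validate_partial_block_deletion_delay(delay: timedelta) -> None:
    """Raise ValueError unless delay is 0 (feature disabled) or above 4h."""
    if delay == timedelta(0):
        return  # 0 disables partial block deletion
    if delay <= MIN_DELAY:
        raise ValueError(
            f"partial block deletion delay must be 0 or higher than {MIN_DELAY}, got {delay}"
        )


validate_partial_block_deletion_delay(timedelta(0))         # accepted: disabled
validate_partial_block_deletion_delay(timedelta(hours=24))  # accepted: above 4h
try:
    validate_partial_block_deletion_delay(timedelta(hours=1))
except ValueError as err:
    print("rejected:", err)
```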

Mixin

  • [CHANGE] Dashboards: remove the "Cache - Latency (old)" panel from the "Mimir / Queries" dashboard. #2796
  • [FEATURE] Dashboards: added support for the experimental read-write deployment mode. #2780
  • [ENHANCEMENT] Dashboards: Updated the Writes dashboard to account for samples ingested via the new OTLP ingestion endpoint. #2919
  • [ENHANCEMENT] Dashboards: added support for query-tee in front of ruler-query-frontend in the "Remote ruler reads" dashboard. #2761
  • [ENHANCEMENT] Dashboards: Introduce support for baremetal deployment, setting deployment_type: 'baremetal' in the mixin _config. #2657
  • [ENHANCEMENT] Dashboards: use timeseries panel to show exemplars. #2800
  • [ENHANCEMENT] Dashboards: Include per-tenant request rate in "Tenants" dashboard. #2874
  • [ENHANCEMENT] Dashboards: Include inflight object store requests in "Reads" dashboard. #2914
  • [BUGFIX] Dashboards: stop setting 'interval' in dashboards; it should be set on your datasource. #2802

Jsonnet

  • [ENHANCEMENT] Upgrade memcached image tag to memcached:1.6.16-alpine. #2740
  • [ENHANCEMENT] Added $._config.configmaps and $._config.runtime_config_files to make it easy to add new configmaps or runtime config file to all components. #2748

Mimirtool

  • [BUGFIX] Version checking no longer prompts for updating when already on latest version. #2723

Query-tee

  • [CHANGE] Renamed CLI flag -server.service-port to -server.http-service-port. #2683
  • [CHANGE] Renamed metric cortex_querytee_request_duration_seconds to cortex_querytee_backend_request_duration_seconds. Metric cortex_querytee_request_duration_seconds is now reported without label backend. #2683
  • [ENHANCEMENT] Added HTTP over gRPC support to query-tee to allow testing gRPC requests to Mimir instances. #2683

Mimir Continuous Test

  • [ENHANCEMENT] Added basic authentication and bearer token support for when Mimir is behind a gateway authenticating the calls. #2717

Full Changelog: https://github.com/grafana/mimir/compare/mimir-2.3.0-rc0...mimir-2.3.0-rc.2

mimir - mimir-2.3.0-rc0

Published by treid314 about 2 years ago

This release contains 333 PRs from 39 authors. Thank you!

Grafana Mimir version 2.3 release notes

Grafana Labs is excited to announce version 2.3 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

The highlights that follow include the top features, enhancements, and bugfixes in this release. If you are upgrading from Grafana Mimir 2.2, there is upgrade-related information as well.
For the complete list of changes, see the Changelog.

Features and enhancements

  • Ingest metrics in OpenTelemetry format:
    This release of Grafana Mimir introduces experimental support for ingesting metrics from the OpenTelemetry Collector's otlphttp exporter. This adds a second ingestion option for users of the OTel Collector; Mimir was already compatible with the prometheusremotewrite exporter. For more information, please see Configure OTel Collector.

  • Increased instant query performance:
    Grafana Mimir now supports splitting instant queries by time. This allows it to better parallelize execution of instant queries and therefore return results faster. At present, splitting is only supported for a subset of instant queries, which means not all instant queries will see a speedup. This feature is being released as experimental and is disabled by default. It can be enabled by setting -query-frontend.split-instant-queries-by-interval.

  • Tenant federation for metadata queries:
    Users with tenant federation enabled could previously issue instant queries, range queries, and exemplar queries to multiple tenants at once and receive a single aggregated result. With Grafana Mimir 2.3, we've added tenant federation support to the /api/v1/metadata endpoint as well.

  • Simpler object storage configuration:
    Users can now configure block, alertmanager, and ruler storage all at once with the common YAML config option key (or -common.storage.* CLI flags). By centralizing your object storage configuration in one place, this enhancement makes configuration faster and less error prone. Users can still individually configure storage for each of these components if they desire. For more information, see the Common Configurations.

  • DEB and RPM packages for Mimir:
    Starting with version 2.3, we're publishing deb and rpm files for Grafana Mimir, which will make installing and running it on Debian or Red Hat-based Linux systems much easier. Thank you to community contributor wilfriedroset for your work to implement this!

  • Import historic data to Grafana Mimir:
    Users can now backfill time series data from their existing Prometheus or Cortex installation into Mimir using mimirtool, making it possible to migrate to Grafana Mimir without losing your existing metrics data. This support is still considered experimental and does not work for data stored in Thanos yet. To learn more about this feature, please see mimirtool backfill and Configure TSDB block upload.

  • New Helm chart minor release: The Mimir Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.3 release, we’re also releasing version 3.1 of the Mimir Helm chart. Notable enhancements follow. For the full list of changes, see the Helm chart changelog.

    • We've upgraded the MinIO subchart dependency from a deprecated chart to the supported one. This creates a breaking change in how the administrator password is set. However, as the built-in MinIO is not a recommended object store for production use cases, this change did not warrant a new major version of the Mimir Helm chart.
    • The backfill API endpoints for importing historic time series data are now exposed on the Nginx gateway.
    • Nginx now sets the value of the X-Scope-OrgID header equal to the value of Mimir's no_auth_tenant parameter by default. The previous release had set the value of X-Scope-OrgID to anonymous by default which complicated the process of migrating to Mimir.
    • Memberlist now uses DNS service-discovery by default, which should decrease startup time for large Mimir clusters.
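
To illustrate the idea behind splitting instant queries by time (the "Increased instant query performance" item above), here is a minimal sketch under stated assumptions. It is not Mimir's implementation, which operates on PromQL and supports only a subset of queries. The sketch only shows why a sum over a long lookback window can be computed as independent partial sums over shorter windows. The function names, the half-open `(start, stop]` windows, and the sample data are all hypothetical.

```python
def split_window(end: int, lookback: int, interval: int) -> list:
    """Split the window (end - lookback, end] into consecutive (start, stop] pieces."""
    start = end - lookback
    pieces = []
    while start < end:
        stop = min(start + interval, end)
        pieces.append((start, stop))
        start = stop
    return pieces


def sum_over(samples: list, start: int, stop: int) -> float:
    """Sum the values of samples whose timestamp falls in (start, stop]."""
    return sum(value for ts, value in samples if start < ts <= stop)


# Hypothetical series: one sample per minute for three hours, each with value 1.0.
samples = [(ts, 1.0) for ts in range(0, 10800 + 1, 60)]
end, lookback, interval = 10800, 10800, 3600  # 3h lookback split into 1h pieces

pieces = split_window(end, lookback, interval)
# Each partial sum is independent, so the pieces could be evaluated in parallel.
partials = [sum_over(samples, start, stop) for start, stop in pieces]

print(pieces)         # [(0, 3600), (3600, 7200), (7200, 10800)]
print(sum(partials))  # 180.0, same as the unsplit sum over (0, 10800]
```

Only aggregations that can be recombined from partial results (such as sums or counts) split this cleanly, which is consistent with the note above that not all instant queries benefit.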

Upgrade considerations

In Grafana Mimir 2.3 we have removed the following previously deprecated configuration options:

  • The extend_writes parameter in the distributor YAML configuration and -distributor.extend-writes CLI flag have been removed.
  • The active_series_custom_trackers parameter has been removed from the YAML configuration. It had already been moved to the runtime configuration. See #1188 for details.

With Grafana Mimir 2.3 we have also updated the default value for -distributor.ha-tracker.max-clusters to 100 to provide Denial-of-Service protection. Previously -distributor.ha-tracker.max-clusters was unlimited by default which could allow a tenant with HA Dedupe enabled to overload the HA tracker with __cluster__ label values that could cause the HA Dedupe database to fail.
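
The protection described above can be pictured as a cap on how many distinct cluster values a tracker will remember. The sketch below is illustrative only and is not how the Mimir HA tracker is written: the class and exception names are hypothetical, applying the limit per tenant is an assumption, and treating 0 as "unlimited" is an assumption modeled on the previous unlimited default.

```python
class TooManyClustersError(Exception):
    """Raised when a tenant tries to register more clusters than allowed."""


class ClusterTracker:
    """Tracks distinct cluster label values per tenant, up to a fixed limit."""

    def __init__(self, max_clusters: int = 100):
        # 100 mirrors the new default; 0 means unlimited in this sketch.
        self.max_clusters = max_clusters
        self._clusters = {}

    def observe(self, tenant: str, cluster: str) -> None:
        seen = self._clusters.setdefault(tenant, set())
        if cluster in seen:
            return  # already tracked, nothing to do
        if self.max_clusters and len(seen) >= self.max_clusters:
            raise TooManyClustersError(
                f"tenant {tenant!r} exceeded the limit of {self.max_clusters} clusters"
            )
        seen.add(cluster)


tracker = ClusterTracker(max_clusters=2)
tracker.observe("tenant-a", "cluster-1")
tracker.observe("tenant-a", "cluster-2")
tracker.observe("tenant-a", "cluster-1")  # known cluster, still accepted
try:
    tracker.observe("tenant-a", "cluster-3")
except TooManyClustersError as err:
    print("rejected:", err)
```

With a bounded set per tenant, a flood of new label values costs a rejected request rather than unbounded tracker state.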

Bug fixes

  • PR 2447: Fix incorrect mapping of http status codes 429 to 500 when the request queue is full in the query-frontend. This corrects behavior in the query-frontend where a 429 "Too Many Outstanding Requests" error (a retriable error) from a querier was incorrectly returned as a 500 system error (an unretriable error).
  • PR 2505: The Memberlist key-value (KV) store now tries to "fast-join" the cluster to avoid serving an empty KV store. This fix addresses the confusing "empty ring" error response and the error log message "ring doesn't exist in KV store yet" emitted by services when there are other members present in the ring when a service starts. Those using other key-value store options (e.g., consul, etcd) are not impacted by this bug.
  • PR 2289: The "List Prometheus rules" API endpoint of the Mimir Ruler component is no longer blocked while rules are being synced. This means users can now list rules while syncing larger rule sets.
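
The practical effect of the 429-to-500 mapping fix (PR 2447) is easiest to see from the perspective of a client that retries only retriable errors. The following sketch uses a hypothetical client policy and simulated responses. It follows the classification given above (429 retriable, 500 not) and makes no claim about how any particular Mimir client behaves.

```python
from typing import Callable, Tuple


def is_retriable(status: int) -> bool:
    """Classify status codes as in the entry above: 429 is retriable, 500 is not."""
    return status == 429


def send_with_retries(send: Callable[[], int], max_attempts: int = 3) -> Tuple[int, int]:
    """Call send() until it returns a non-retriable status or attempts run out.

    Returns the final status code and the number of attempts made.
    """
    status = send()
    attempts = 1
    while is_retriable(status) and attempts < max_attempts:
        status = send()
        attempts += 1
    return status, attempts


# Simulated responses: the queue is full twice, then the request succeeds.
correct = iter([429, 429, 200])
print(send_with_retries(lambda: next(correct)))    # (200, 3)

# Same situation, but with the queue-full error wrongly reported as 500.
mismapped = iter([500, 500, 200])
print(send_with_retries(lambda: next(mismapped)))  # (500, 1)
```

In the second case the client gives up after one attempt, even though a retry would have succeeded.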

Changelog since 2.2

2.3.0-rc.0

Grafana Mimir

  • [CHANGE] Ingester: Added user label to ingester metric cortex_ingester_tsdb_out_of_order_samples_appended_total. On multitenant clusters this helps us find the rate of appended out-of-order samples for a specific tenant. #2493
  • [CHANGE] Compactor: delete source and output blocks from local disk when compaction fails, to reduce the likelihood that subsequent compactions fail because of no space left on disk. #2261
  • [CHANGE] Ruler: Remove unused CLI flags -ruler.search-pending-for and -ruler.flush-period (and their respective YAML config options). #2288
  • [CHANGE] Successful gRPC requests are no longer logged (only affects internal API calls). #2309
  • [CHANGE] Add new -*.consul.cas-retry-delay flags. They have a default value of 1s, while previously there was no delay between retries. #2309
  • [CHANGE] Store-gateway: Remove the experimental ability to run requests in a dedicated OS thread pool and associated CLI flag -store-gateway.thread-pool-size. #2423
  • [CHANGE] Memberlist: disabled TCP-based ping fallback, because Mimir already uses a custom transport based on TCP. #2456
  • [CHANGE] Change default value for -distributor.ha-tracker.max-clusters to 100 to provide a DoS protection. #2465
  • [CHANGE] Experimental block upload API exposed by compactor has changed: Previous /api/v1/upload/block/{block} endpoint for starting block upload is now /api/v1/upload/block/{block}/start, and previous endpoint /api/v1/upload/block/{block}?uploadComplete=true for finishing block upload is now /api/v1/upload/block/{block}/finish. New API endpoint has been added: /api/v1/upload/block/{block}/check. #2486 #2548
  • [CHANGE] Compactor: changed -compactor.max-compaction-time default from 0s (disabled) to 1h. When compacting blocks for a tenant, the compactor will move to compact blocks of another tenant or re-plan blocks to compact at least every 1h. #2514
  • [CHANGE] Distributor: removed previously deprecated extend_writes (see #1856) YAML key and -distributor.extend-writes CLI flag from the distributor config. #2551
  • [CHANGE] Ingester: removed previously deprecated active_series_custom_trackers (see #1188) YAML key from the ingester config. #2552
  • [CHANGE] The tenant ID __mimir_cluster is reserved by Mimir and not allowed to store metrics. #2643
  • [CHANGE] Purger: removed the purger component and moved its API endpoints /purger/delete_tenant and /purger/delete_tenant_status to the compactor at /compactor/delete_tenant and /compactor/delete_tenant_status. The new endpoints on the compactor are stable. #2644
  • [CHANGE] Memberlist: Change the leave timeout duration (-memberlist.leave-timeout duration) from 5s to 20s and connection timeout (-memberlist.packet-dial-timeout) from 5s to 2s. This makes leave timeout 10x the connection timeout, so that we can communicate the leave to at least 1 node, if the first 9 we try to contact time out. #2669
  • [CHANGE] Alertmanager: return status code 412 Precondition Failed and log info message when alertmanager isn't configured for a tenant. #2635
  • [CHANGE] Distributor: if forwarding rules are used to forward samples, exemplars are now removed from the request. #2710
  • [CHANGE] Limits: change the default value of max_global_series_per_metric limit to 0 (disabled). Setting this limit by default does not provide much benefit because series are sharded by all labels. #2714
  • [FEATURE] Compactor: Adds the ability to delete partial blocks after a configurable delay. This option can be configured per tenant. #2285
    • -compactor.partial-block-deletion-delay, as a duration string, allows you to set the delay since a partial block has been modified before marking it for deletion. A value of 0, the default, disables this feature.
    • The metric cortex_compactor_blocks_marked_for_deletion_total has a new value for the reason label reason="partial", when a block deletion marker is triggered by the partial block deletion delay.
  • [FEATURE] Querier: enabled support for queries with negative offsets, which are not cached in the query results cache. #2429
  • [FEATURE] EXPERIMENTAL: OpenTelemetry Metrics ingestion path on /otlp/v1/metrics. #695 #2436 #2461
  • [FEATURE] Querier: Added support for tenant federation to metric metadata endpoint. #2467
  • [FEATURE] Query-frontend: introduced experimental support to split instant queries by time. The instant query splitting can be enabled setting -query-frontend.split-instant-queries-by-interval. #2469 #2564 #2565 #2570 #2571 #2572 #2573 #2574 #2575 #2576 #2581 #2582 #2601 #2632 #2633 #2634 #2641 #2642 #2766
  • [ENHANCEMENT] Distributor: Decreased distributor tests execution time. #2562
  • [ENHANCEMENT] Alertmanager: Allow the HTTP proxy_url configuration option in the receiver's configuration. #2317
  • [ENHANCEMENT] ring: optimize shuffle-shard computation when lookback is used, and all instances have registered timestamp within the lookback window. In that case we can immediately return the original ring, because we would select all instances anyway. #2309
  • [ENHANCEMENT] Memberlist: added experimental memberlist cluster label support via -memberlist.cluster-label and -memberlist.cluster-label-verification-disabled CLI flags (and their respective YAML config options). #2354
  • [ENHANCEMENT] Object storage can now be configured for all components using the common YAML config option key (or -common.storage.* CLI flags). #2330 #2347
  • [ENHANCEMENT] Go: updated to go 1.18.4. #2400
  • [ENHANCEMENT] Store-gateway, listblocks: list of blocks now includes stats from meta.json file: number of series, samples and chunks. #2425
  • [ENHANCEMENT] Added more buckets to cortex_ingester_client_request_duration_seconds histogram metric, to correctly track requests taking longer than 1s (up until 16s). #2445
  • [ENHANCEMENT] Azure client: Improve memory usage for large object storage downloads. #2408
  • [ENHANCEMENT] Distributor: Add -distributor.instance-limits.max-inflight-push-requests-bytes. This limit protects the distributor against multiple large requests that together may cause an OOM, but are only a few, so do not trigger the max-inflight-push-requests limit. #2413
  • [ENHANCEMENT] Distributor: Drop exemplars in distributor for tenants where exemplars are disabled. #2504
  • [ENHANCEMENT] Runtime Config: Allow operator to specify multiple comma-separated yaml files in -runtime-config.file that will be merged in left to right order. #2583
  • [ENHANCEMENT] Query sharding: shard binary operations only if it doesn't lead to non-shardable vector selectors in one of the operands. #2696
  • [ENHANCEMENT] Add packaging for both Debian-based deb file and Red Hat-based rpm file using FPM. #1803
  • [BUGFIX] TSDB: Fixed a bug on the experimental out-of-order implementation that led to wrong query results. #2701
  • [BUGFIX] Compactor: log the actual error when compaction fails. #2261
  • [BUGFIX] Alertmanager: restore state from storage even when running a single replica. #2293
  • [BUGFIX] Ruler: do not block "List Prometheus rules" API endpoint while syncing rules. #2289
  • [BUGFIX] Ruler: return proper *status.Status error when running in remote operational mode. #2417
  • [BUGFIX] Alertmanager: ensure the configured -alertmanager.web.external-url is either a path starting with /, or a full URL including the scheme and hostname. #2381 #2542
  • [BUGFIX] Memberlist: fix problem with loss of some packets, typically ring updates when instances were removed from the ring during shutdown. #2418
  • [BUGFIX] Ingester: fix misfiring MimirIngesterHasUnshippedBlocks and stale cortex_ingester_oldest_unshipped_block_timestamp_seconds when some block uploads fail. #2435
  • [BUGFIX] Query-frontend: fix incorrect mapping of http status codes 429 to 500 when request queue is full. #2447
  • [BUGFIX] Memberlist: Fix problem with ring being empty right after startup. Memberlist KV store now tries to "fast-join" the cluster to avoid serving empty KV store. #2505
  • [BUGFIX] Compactor: Fix bug when using -compactor.partial-block-deletion-delay: compactor didn't correctly check for modification time of all block files. #2559
  • [BUGFIX] Query-frontend: fix wrong query sharding results for queries with boolean result like 1 < bool 0. #2558
  • [BUGFIX] Fixed error messages related to per-instance limits incorrectly reporting they can be set on a per-tenant basis. #2610
  • [BUGFIX] Perform HA-deduplication before forwarding samples according to forwarding rules in the distributor. #2603 #2709
  • [BUGFIX] Fix reporting of tracing spans from PromQL engine. #2707
  • [BUGFIX] Apply relabel and drop_label rules before forwarding rules in the distributor. #2703
  • [BUGFIX] Distributor: Register cortex_discarded_requests_total metric, which previously was not registered and therefore not exported. #2712
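
The runtime config enhancement above (#2583) says that multiple files passed to -runtime-config.file are merged in left to right order. A minimal sketch of one plausible reading follows: later files win on conflicting keys, and nested maps are merged recursively. Both points are assumptions, since the entry does not spell out the merge semantics. Plain dictionaries stand in for parsed YAML files, and the tenant names and values are illustrative.

```python
def merge(base: dict, override: dict) -> dict:
    """Return a new dict where override's values win; nested dicts merge recursively."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


def merge_left_to_right(configs: list) -> dict:
    """Fold a list of configs from left to right, so later entries take precedence."""
    merged = {}
    for config in configs:
        merged = merge(merged, config)
    return merged


# Stand-ins for two parsed YAML files, in the order they would be listed.
first = {"overrides": {"tenant-a": {"ingestion_rate": 10000,
                                    "max_global_series_per_metric": 0}}}
second = {"overrides": {"tenant-a": {"ingestion_rate": 50000},
                        "tenant-b": {"ingestion_rate": 20000}}}

merged = merge_left_to_right([first, second])
print(merged["overrides"]["tenant-a"])
# {'ingestion_rate': 50000, 'max_global_series_per_metric': 0}
```

The inputs are left unmodified, which keeps the order of application easy to reason about when more than two files are involved.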

Mixin

  • [CHANGE] Dashboards: "Slow Queries" dashboard no longer works with versions older than Grafana 9.0. #2223
  • [CHANGE] Alerts: use RSS memory instead of working set memory in the MimirAllocatingTooMuchMemory alert for ingesters. #2480
  • [ENHANCEMENT] Dashboards: added missed rule evaluations to the "Evaluations per second" panel in the "Mimir / Ruler" dashboard. #2314
  • [ENHANCEMENT] Dashboards: add k8s resource requests to CPU and memory panels. #2346
  • [ENHANCEMENT] Dashboards: add RSS memory utilization panel for ingesters, store-gateways and compactors. #2479
  • [ENHANCEMENT] Dashboards: allow configuring the graph tooltip. #2647
  • [ENHANCEMENT] Alerts: MimirFrontendQueriesStuck and MimirSchedulerQueriesStuck alerts are more reliable now as they consider all the intermediate samples in the minute prior to the evaluation. #2630
  • [ENHANCEMENT] Alerts: added RolloutOperatorNotReconciling alert, firing if the optional rollout-operator is not successfully reconciling. #2700
  • [BUGFIX] Dashboards: fixed unit of latency panels in the "Mimir / Ruler" dashboard. #2312
  • [BUGFIX] Dashboards: fixed "Intervals per query" panel in the "Mimir / Queries" dashboard. #2308
  • [BUGFIX] Dashboards: Make "Slow Queries" dashboard work with Grafana 9.0. #2223
  • [BUGFIX] Dashboards: add missing API routes to Ruler dashboard. #2412

Jsonnet

  • [CHANGE] query-scheduler is enabled by default. We advise deploying the query-scheduler to improve the scalability of the query-frontend. #2431
  • [CHANGE] Replaced anti-affinity rules with pod topology spread constraints for distributor, query-frontend, querier and ruler. #2517
    • The following configuration options have been removed:
      • distributor_allow_multiple_replicas_on_same_node
      • query_frontend_allow_multiple_replicas_on_same_node
      • querier_allow_multiple_replicas_on_same_node
      • ruler_allow_multiple_replicas_on_same_node
    • The following configuration options have been added:
      • distributor_topology_spread_max_skew
      • query_frontend_topology_spread_max_skew
      • querier_topology_spread_max_skew
      • ruler_topology_spread_max_skew
  • [CHANGE] Change max_global_series_per_metric to 0 in all plans, and as a default value. #2669
  • [FEATURE] Memberlist: added support for experimental memberlist cluster label, through the jsonnet configuration options memberlist_cluster_label and memberlist_cluster_label_verification_disabled. #2349
  • [FEATURE] Added ruler-querier autoscaling support. It requires KEDA installed in the Kubernetes cluster. Ruler-querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2545
    • autoscaling_ruler_querier_enabled: true to enable autoscaling.
    • autoscaling_ruler_querier_min_replicas: minimum number of ruler-querier replicas.
    • autoscaling_ruler_querier_max_replicas: maximum number of ruler-querier replicas.
    • autoscaling_prometheus_url: Prometheus base URL from which to scrape Mimir metrics (e.g. http://prometheus.default:9090/prometheus).
  • [ENHANCEMENT] Memberlist now uses DNS service-discovery by default. #2549

Mimirtool

  • [ENHANCEMENT] Added mimirtool backfill command to upload Prometheus blocks using API available in the compactor. #1822
  • [ENHANCEMENT] mimirtool bucket-validation: Verify existing objects can be overwritten by subsequent uploads. #2491
  • [ENHANCEMENT] mimirtool config convert: Now supports migrating to the current version of Mimir. #2629
  • [BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors by using custom parsing. #2386

Documentation

  • [ENHANCEMENT] Referenced mimirtool commands in the HTTP API documentation. #2516
  • [ENHANCEMENT] Improved DNS service discovery documentation. #2513

Tools

  • [ENHANCEMENT] markblocks now processes multiple blocks concurrently. #2677

Full Changelog: https://github.com/grafana/mimir/compare/mimir-2.2.0...mimir-2.3.0-rc0
