mimir - 2.2.0

Published by krajorama about 2 years ago

Grafana Labs is excited to announce version 2.2 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

The highlights that follow include the top features, enhancements, and bugfixes in this release. If you are upgrading from Grafana Mimir 2.1, there is upgrade-related information as well.
For the complete list of changes, see the Changelog.

This release contains 214 contributions from 32 authors. Thank you!

Features and enhancements

Support for ingesting out-of-order samples: Grafana Mimir includes new, experimental support for ingesting out-of-order samples.
This support is configurable, and it allows you to set how far out-of-order Mimir accepts samples on a per-tenant basis.
This feature still needs additional testing; we do not recommend using it in a production environment.
For more information, see Configuring out-of-order samples ingestion.
Improved error messages: The error messages that Mimir reports are more human readable, and the messages include error codes that are easily searchable.
For error descriptions, see the Grafana Mimir runbooks’ Errors catalog.
Configurable prefix for object storage: Mimir can now store block data, rules, and alerts in one bucket, with each under its own user-defined prefix, rather than requiring one bucket for each.
You can configure the storage prefix by using the -<storage>.storage-prefix option for the corresponding storage: ruler-storage, alertmanager-storage, or blocks-storage.
Store-gateway performance optimization
The store-gateway can now pre-populate the file system cache when memory-mapping index-header files.
This prevents the store-gateway from appearing to be stuck while loading index-headers.
This feature is experimental and disabled by default; enable it using the flag -blocks-storage.bucket-store.index-header.map-populate-enabled.
Faster ingester startup: Ingesters now replay their WALs (write ahead logs) about 50% faster, and they also re-join the ring sooner under some conditions.
Helm Chart improvements: The Mimir Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.2 release, we're also releasing version 3.0 of the Helm chart. Notable enhancements follow. For the full list of changes, see the Helm chart changelog.
- The Helm chart now supports OpenShift.
- The Helm chart can now easily deploy Grafana Agent in order to scrape metrics and logs from all Mimir pods, and ship them to a remote store, which makes it easier to monitor the health of your Mimir installation. For more information, see Collecting metrics and logs from Grafana Mimir.
- The Helm chart now enables multi-tenancy by default. This makes it easy for you to add tenants as you grow your cluster. You can take advantage of Mimir's per-tenant quality-of-service features, which improves stability and resilience at high scale. To learn more about how multi-tenancy in Mimir works, see Grafana Mimir authorization and authentication. This change is backwards-compatible. To read about how we implemented this, see #2117.
- We have significantly improved the configuration experience for the Helm chart, and here are a few of the most salient changes:
  - We've added an extraEnvFrom capability to all Mimir services to enable you to inject secrets via environment variables.
  - We've made it possible to globally set environment variables and inject secrets across all pods in the chart using global.extraEnv and global.extraEnvFrom. Note that the memcached and minio pods are not included.
  - We've switched the default storage of the Mimir configuration from a Secret to a ConfigMap, which makes it easier to quickly see the differences between your Mimir configurations between upgrades. We especially like the Helm diff plugin for this purpose.
  - We've added a structuredConfig option, which allows you to overwrite specific key-value pairs in the mimir.config template, which saves you from having to maintain the entire mimir.config in your own values.yaml file.
  - We've added the ability to create global pod annotations. This unlocks the ability to trigger a restart of all services in response to a single event, such as the update of the secret containing Mimir's storage credentials.
- We've set the chart to disable -ingester.ring.unregister-on-shutdown and -distributor.extend-writes, for a smoother upgrade experience. Rolling restarts of ingesters are now less likely to cause spikes in resource usage.
- We've improved the documentation for the Helm chart by adding a Getting started with Mimir using the Helm chart guide.
- We've added a smoke test for your Mimir cluster to help catch errors immediately after you install or upgrade Mimir via the Helm chart.
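The per-tenant out-of-order window described under "Support for ingesting out-of-order samples" can be pictured with a small sketch. This is a simplified model and not Mimir's implementation: it measures the window against the newest timestamp already seen for a single series, the function name and the "accepted-..." return values are hypothetical, and the discard reasons reuse the reason strings listed in the changelog entry for this feature.

```python
def classify_sample(sample_ts_ms: int, newest_ts_ms: int, window_ms: int) -> str:
    """Classify an incoming sample against an out-of-order time window.

    sample_ts_ms: timestamp of the incoming sample, in milliseconds.
    newest_ts_ms: newest timestamp already stored for the series (simplification).
    window_ms:    allowed out-of-order window; 0 means the feature is disabled.
    """
    # Samples at or after the newest stored timestamp are in order.
    # Duplicate-timestamp handling is out of scope for this sketch.
    if sample_ts_ms >= newest_ts_ms:
        return "accepted-in-order"
    # With a zero window, any older sample is rejected as out of order.
    if window_ms <= 0:
        return "discarded: sample-out-of-order"
    # With a non-zero window, older samples are accepted if they fall inside it.
    if sample_ts_ms >= newest_ts_ms - window_ms:
        return "accepted-out-of-order"
    # Anything older than the window is discarded as too old.
    return "discarded: sample-too-old"
```

For example, with a 200 ms window and a newest stored timestamp of 900, a sample at 800 is accepted as out of order, while a sample at 600 is discarded as too old.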

Upgrade considerations

All deprecated API endpoints that are under /api/v1/rules* and /prometheus/rules* have now been removed from the ruler component in favor of identical endpoints that use the prefix /prometheus/config/v1/rules*.
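If you have scripts or clients that still call the removed ruler endpoints, the path change can be sketched as a simple prefix rewrite. This is a minimal illustration under the assumption that only the prefix differs, as stated above; the function name and the example namespace are hypothetical.

```python
# Legacy ruler configuration prefixes removed in Mimir 2.2, per the note above.
LEGACY_PREFIXES = ("/api/v1/rules", "/prometheus/rules")
NEW_PREFIX = "/prometheus/config/v1/rules"


def migrate_rules_path(path: str) -> str:
    """Rewrite a legacy ruler configuration path to the new prefix.

    Paths that do not start with a legacy prefix are returned unchanged.
    """
    for legacy in LEGACY_PREFIXES:
        # Match the prefix exactly or followed by a sub-path, so that
        # unrelated paths such as /api/v1/rules_foo are left alone.
        if path == legacy or path.startswith(legacy + "/"):
            return NEW_PREFIX + path[len(legacy):]
    return path


# Hypothetical namespace "my-namespace" used only for illustration.
print(migrate_rules_path("/api/v1/rules/my-namespace"))
```

Paths that already use the new prefix pass through unchanged, so the rewrite is safe to apply more than once.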

In Grafana Mimir 2.2, we have updated default values and some parameters to give you a better out-of-the-box experience:

Message size limits for gRPC messages that are exchanged between internal Mimir components have increased to 100 MiB from 4 MiB.
This helps to avoid internal server errors when pushing or querying large data.
The -blocks-storage.bucket-store.ignore-blocks-within parameter changed from 0 to 10h.
The default value of -querier.query-store-after changed from 0 to 12h.
For most-recent data, both changes improve query performance by querying only the ingesters, rather than object storage.
The option -querier.shuffle-sharding-ingesters-lookback-period has been deprecated.
If you previously changed this option from its default of 0s, set -querier.shuffle-sharding-ingesters-enabled to true and specify the lookback period by setting the -querier.query-ingesters-within option.
The -memberlist.abort-if-join-fails parameter now defaults to false.
When Mimir is using memberlist as the backend store for its hash ring, and it fails to join the memberlist cluster, Mimir no longer aborts startup by default.
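To see why the new -querier.query-store-after default keeps recent queries on the ingesters, the following sketch models which backends a query time range would touch. It is an illustration only, not Mimir's code: the function name is hypothetical, the 13h value used for -querier.query-ingesters-within is an assumption that does not come from these notes, and a value of 0 is treated as "always query this backend".

```python
from datetime import datetime, timedelta, timezone


def backends_for_query(
    start: datetime,
    end: datetime,
    now: datetime,
    query_store_after: timedelta = timedelta(hours=12),
    query_ingesters_within: timedelta = timedelta(hours=13),  # assumed value
) -> list:
    """Return the backends a query for [start, end] would be sent to."""
    backends = []
    # Ingesters are consulted only if the range reaches into the recent window.
    if not query_ingesters_within or end >= now - query_ingesters_within:
        backends.append("ingesters")
    # Object storage is consulted only if the range starts before the cut-off.
    if not query_store_after or start <= now - query_store_after:
        backends.append("store-gateways")
    return backends


now = datetime(2022, 7, 1, 12, 0, tzinfo=timezone.utc)
# A query over the last hour stays on the ingesters.
print(backends_for_query(now - timedelta(hours=1), now, now))
# A query over the last day needs both backends.
print(backends_for_query(now - timedelta(hours=24), now, now))
```

With the previous default of 0 for -querier.query-store-after, the same one-hour query would also have been sent to the store-gateways.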

If you have used a previous version of the Mimir Helm chart, you must address some of the chart's breaking changes before upgrading to Helm chart version 3.0. For detailed information about how to do this, see Upgrade the Grafana Mimir Helm chart from version 2.1 to 3.0.

Bug fixes

PR 1883: Fixed a bug that caused the query-frontend and querier to crash when they received a user query with a special regular expression label matcher.
PR 1933: Fixed a bug in the ingester ring page, which showed incorrect status of entries in the ring.
PR 2090: Ruler in remote rule evaluation mode now applies the timeout correctly. Previously the ruler could get stuck forever, which halted rule evaluation.
PR 2036: Fixed panic at startup when Mimir is running in monolithic mode and query sharding is enabled.

Changelog

2.2.0

Grafana Mimir

[CHANGE] Increased default configuration for -server.grpc-max-recv-msg-size-bytes and -server.grpc-max-send-msg-size-bytes from 4MB to 100MB. #1884
[CHANGE] Default values have changed for the following settings. This improves query performance for recent data (within 12h) by only reading from ingesters: #1909 #1921
- -blocks-storage.bucket-store.ignore-blocks-within now defaults to 10h (previously 0)
- -querier.query-store-after now defaults to 12h (previously 0)
[CHANGE] Alertmanager: removed support for migrating local files from Cortex 1.8 or earlier. Related to original Cortex PR https://github.com/cortexproject/cortex/pull/3910. #2253
[CHANGE] The following settings are now classified as advanced because the defaults should work for most users and tuning them requires in-depth knowledge of how the read path works: #1929
- -querier.query-ingesters-within
- -querier.query-store-after
[CHANGE] Config flag category overrides can be set dynamically at runtime. #1934
[CHANGE] Ingester: deprecated -ingester.ring.join-after. Mimir now behaves as if this setting is always set to 0s. This configuration option will be removed in Mimir 2.4.0. #1965
[CHANGE] Blocks uploaded by ingester no longer contain __org_id__ label. Compactor now ignores this label and will compact blocks with and without this label together. mimirconvert tool will remove the label from blocks as "unknown" label. #1972
[CHANGE] Querier: deprecated -querier.shuffle-sharding-ingesters-lookback-period, instead adding -querier.shuffle-sharding-ingesters-enabled to enable or disable shuffle sharding on the read path. The value of -querier.query-ingesters-within is now used internally for shuffle sharding lookback. #2110
[CHANGE] Memberlist: -memberlist.abort-if-join-fails now defaults to false. Previously it defaulted to true. #2168
[CHANGE] Ruler: /api/v1/rules* and /prometheus/rules* configuration endpoints are removed. Use /prometheus/config/v1/rules*. #2182
[CHANGE] Ingester: -ingester.exemplars-update-period has been renamed to -ingester.tsdb-config-update-period. You can use it to update multiple, per-tenant TSDB configurations. #2187
[FEATURE] Ingester: (Experimental) Add the ability to ingest out-of-order samples up to an allowed limit. If you enable this feature, it requires additional memory and disk space. This feature also enables a write-behind log, which might lead to longer ingester-start replays. When this feature is disabled, there is no overhead on memory, disk space, or startup times. #2187
- -ingester.out-of-order-time-window, as duration string, allows you to set how far back in time a sample can be. The default is 0s, where s is seconds.
- cortex_ingester_tsdb_out_of_order_samples_appended_total metric tracks the total number of out-of-order samples ingested by the ingester.
- cortex_discarded_samples_total has a new label reason="sample-too-old", when the -ingester.out-of-order-time-window flag is greater than zero. The label tracks the number of samples that were discarded for being too old; they were out of order, but beyond the time window allowed. The labels reason="sample-out-of-order" and reason="sample-out-of-bounds" are not used when out-of-order ingestion is enabled.
[ENHANCEMENT] Distributor: Added limit to prevent tenants from sending excessive number of requests: #1843
- The following CLI flags (and their respective YAML config options) have been added:
  - -distributor.request-rate-limit
  - -distributor.request-burst-limit
- The following metric is exposed to tell how many requests have been rejected:
  - cortex_discarded_requests_total
[ENHANCEMENT] Store-gateway: Add the experimental ability to run requests in a dedicated OS thread pool. This feature can be configured using -store-gateway.thread-pool-size and is disabled by default. Replaces the ability to run index header operations in a dedicated thread pool. #1660 #1812
[ENHANCEMENT] Improved error messages to make them easier to understand; each now has a unique, global identifier that you can use to look up in the runbooks for more information. #1907 #1919 #1888 #1939 #1984 #2009 #2056 #2066 #2104 #2150 #2234
[ENHANCEMENT] Memberlist KV: incoming messages are now processed on a per-key goroutine. This may reduce loss of "maintenance" packets in busy memberlist installations, but uses more CPU. New memberlist_client_received_broadcasts_dropped_total counter tracks the number of dropped per-key messages. #1912
[ENHANCEMENT] Blocks Storage, Alertmanager, Ruler: add support for a prefix to the bucket store (*_storage.storage_prefix). This enables using the same bucket for the three components. #1686 #1951
[ENHANCEMENT] Upgrade Docker base images to alpine:3.16.0. #2028
[ENHANCEMENT] Store-gateway: Add experimental configuration option for the store-gateway to attempt to pre-populate the file system cache when memory-mapping index-header files. Enabled with -blocks-storage.bucket-store.index-header.map-populate-enabled=true. Note this flag only has an effect when running on Linux. #2019 #2054
[ENHANCEMENT] Chunk Mapper: reduce memory usage of async chunk mapper. #2043
[ENHANCEMENT] Ingester: reduce sleep time when reading WAL. #2098
[ENHANCEMENT] Compactor: Run sanity check on blocks storage configuration at startup. #2144
[ENHANCEMENT] Compactor: Add HTTP API for uploading TSDB blocks. Enabled with -compactor.block-upload-enabled. #1694 #2126
[ENHANCEMENT] Ingester: Enable querying overlapping blocks by default. #2187
[ENHANCEMENT] Distributor: Auto-forget unhealthy distributors after ten failed ring heartbeats. #2154
[ENHANCEMENT] Distributor: Add new metric cortex_distributor_forward_errors_total for error codes resulting from forwarding requests. #2077
[ENHANCEMENT] /ready endpoint now returns and logs detailed services information. #2055
[ENHANCEMENT] Memcached client: Reduce number of connections required to fetch cached keys from memcached. #1920
[ENHANCEMENT] Improved error message returned when -querier.query-store-after validation fails. #1914
[BUGFIX] Fix regexp parsing panic for regexp label matchers with start/end quantifiers. #1883
[BUGFIX] Ingester: fixed deceiving error log "failed to update cached shipped blocks after shipper initialisation", occurring for each new tenant in the ingester. #1893
[BUGFIX] Ring: fix bug where instances may appear unhealthy in the hash ring web UI even though they are not. #1933
[BUGFIX] API: gzip is now enforced when identity encoding is explicitly rejected. #1864
[BUGFIX] Fix panic at startup when Mimir is running in monolithic mode and query sharding is enabled. #2036
[BUGFIX] Ruler: report cortex_ruler_queries_failed_total metric for any remote query error except 4xx when remote operational mode is enabled. #2053 #2143
[BUGFIX] Ingester: fix slow rollout when using -ingester.ring.unregister-on-shutdown=false with long -ingester.ring.heartbeat-period. #2085
[BUGFIX] Ruler: add timeout for remote rule evaluation queries to prevent rule group evaluations getting stuck indefinitely. The duration is configurable with -querier.timeout (default 2m). #2090 #2222
[BUGFIX] Limits: Active series custom tracker configuration has been renamed back from active_series_custom_trackers_config to active_series_custom_trackers. For backwards compatibility, both versions will be supported until Mimir v2.4. When both fields are specified, active_series_custom_trackers_config takes precedence over active_series_custom_trackers. #2101
[BUGFIX] Ingester: fixed the order of labels applied when incrementing the cortex_discarded_metadata_total metric. #2096
[BUGFIX] Ingester: fixed bug where retrieving metadata for a metric with multiple metadata entries would return multiple copies of a single metadata entry rather than all available entries. #2096
[BUGFIX] Distributor: canceled requests are no longer accounted as internal errors. #2157
[BUGFIX] Memberlist: Fix typo in memberlist admin UI. #2202
[BUGFIX] Ruler: fixed typo in error message when ruler failed to decode a rule group. #2151
[BUGFIX] Active series custom tracker configuration is now displayed properly on /runtime_config page. #2065
[BUGFIX] Query-frontend: vector and time functions were sharded, which made expressions like vector(1) > 0 and vector(1) fail. #2355
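The request rate and burst limits added to the distributor (-distributor.request-rate-limit and -distributor.request-burst-limit) can be understood through a token-bucket model, which is a common way to implement this kind of limit. The sketch below assumes that model; the changelog does not describe the algorithm Mimir uses, and the class and attribute names are hypothetical. Time is passed in explicitly to keep the example deterministic.

```python
class RequestLimiter:
    """Token-bucket sketch: 'rate' tokens are added per second, up to 'burst'."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous call, in seconds
        self.rejected = 0           # hypothetical counter of rejected requests

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time 'now' is admitted."""
        # Refill tokens for the time elapsed since the previous call.
        elapsed = max(0.0, now - self.last)
        self.last = max(self.last, now)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        # Spend one token per admitted request.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.rejected += 1
        return False


limiter = RequestLimiter(rate_per_second=2, burst=3)
# Four requests at the same instant: the burst of 3 is admitted, the 4th is not.
print([limiter.allow(0.0) for _ in range(4)])
```

In this model the burst limit bounds how many requests can arrive at once, while the rate limit bounds the sustained request rate. The rejected counter plays the role that cortex_discarded_requests_total plays in the changelog entry.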

Mixin

[CHANGE] Split mimir_queries rules group into mimir_queries and mimir_ingester_queries to keep number of rules per group within the default per-tenant limit. #1885
[CHANGE] Dashboards: Expose full image tag in "Mimir / Rollout progress" dashboard's "Pod per version panel." #1932
[CHANGE] Dashboards: Disabled gateway panels by default, because most users don't have a gateway exposing the metrics expected by Mimir dashboards. You can re-enable them by setting gateway_enabled: true in the mixin config and recompiling the mixin by running make build-mixin. #1955
[CHANGE] Alerts: adapt MimirFrontendQueriesStuck and MimirSchedulerQueriesStuck to consider ruler query path components. #1949
[CHANGE] Alerts: Change MimirRulerTooManyFailedQueries severity to critical. #2165
[ENHANCEMENT] Dashboards: Add config option datasource_regex to customise the regular expression used to select valid datasources for Mimir dashboards. #1802
[ENHANCEMENT] Dashboards: Added "Mimir / Remote ruler reads" and "Mimir / Remote ruler reads resources" dashboards. #1911 #1937
[ENHANCEMENT] Dashboards: Make networking panels work for pods created by the mimir-distributed helm chart. #1927
[ENHANCEMENT] Alerts: Add MimirStoreGatewayNoSyncedTenants alert that fires when there is a store-gateway owning no tenants. #1882
[ENHANCEMENT] Rules: Make recording_rules_range_interval configurable for cases where Mimir metrics are scraped less often than every 30 seconds. #2118
[ENHANCEMENT] Added minimum Grafana version to mixin dashboards. #1943
[BUGFIX] Fix container_memory_usage_bytes:sum recording rule. #1865
[BUGFIX] Fix MimirGossipMembersMismatch alerts if Mimir alertmanager is activated. #1870
[BUGFIX] Fix MimirRulerMissedEvaluations to show % of missed alerts as a value between 0 and 100 instead of 0 and 1. #1895
[BUGFIX] Fix MimirCompactorHasNotUploadedBlocks alert false positive when Mimir is deployed in monolithic mode. #1902
[BUGFIX] Fix MimirGossipMembersMismatch to make it less sensitive during rollouts and fire one alert per installation, not per job. #1926
[BUGFIX] Do not trigger MimirAllocatingTooMuchMemory alerts if no container limits are supplied. #1905
[BUGFIX] Dashboards: Remove empty "Chunks per query" panel from Mimir / Queries dashboard. #1928
[BUGFIX] Dashboards: Use Grafana's $__rate_interval for rate queries in dashboards to support scrape intervals of >15s. #2011
[BUGFIX] Alerts: Make each version of MimirCompactorHasNotUploadedBlocks distinct to avoid rule evaluation failures due to duplicate series being generated. #2197
[BUGFIX] Fix MimirGossipMembersMismatch alert when using remote ruler evaluation. #2159

Jsonnet

[CHANGE] Remove use of -querier.query-store-after, -querier.shuffle-sharding-ingesters-lookback-period, -blocks-storage.bucket-store.ignore-blocks-within, and -blocks-storage.tsdb.close-idle-tsdb-timeout CLI flags since the values now match defaults. #1915 #1921
[CHANGE] Change default value for -blocks-storage.bucket-store.chunks-cache.memcached.timeout to 450ms to increase use of cached data. #2035
[CHANGE] The memberlist_ring_enabled configuration now applies to Alertmanager. #2102 #2103 #2107
[CHANGE] Default value for memberlist_ring_enabled is now true. It means that all hash rings use Memberlist as default KV store instead of Consul (previous default). #2161
[CHANGE] Configure -ingester.max-global-metadata-per-user to correspond to 20% of the configured max number of series per tenant. #2250
[CHANGE] Configure -ingester.max-global-metadata-per-metric to be 10. #2250
[CHANGE] Change _config.multi_zone_ingester_max_unavailable to 25. #2251
[FEATURE] Added querier autoscaling support. It requires KEDA installed in the Kubernetes cluster and query-scheduler enabled in the Mimir cluster. Querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2013 #2023
- autoscaling_querier_enabled: true to enable autoscaling.
- autoscaling_querier_min_replicas: minimum number of querier replicas.
- autoscaling_querier_max_replicas: maximum number of querier replicas.
- autoscaling_prometheus_url: Prometheus base URL from which to scrape Mimir metrics (e.g. http://prometheus.default:9090/prometheus).
[FEATURE] Jsonnet: Add support for ruler remote evaluation mode (ruler_remote_evaluation_enabled), which deploys and uses a dedicated query path for rule evaluation. This enables the benefits of the query-frontend for rule evaluation, such as query sharding. #2073
[ENHANCEMENT] Added compactor service, that can be used to route requests directly to compactor (e.g. admin UI). #2063
[ENHANCEMENT] Added a consul_enabled configuration option to provide the ability to disable consul. It is automatically set to false when memberlist_ring_enabled is true and multikv_migration_enabled (used for migration from Consul to memberlist) is not set. #2093 #2152
[BUGFIX] Querier: Fix disabling shuffle sharding on the read path whilst keeping it enabled on write path. #2164

Mimirtool

[CHANGE] mimirtool rules: --use-legacy-routes now toggles between using /prometheus/config/v1/rules (default) and /api/v1/rules (legacy) endpoints. #2182
[FEATURE] Added bearer token support for when Mimir is behind a gateway authenticating by bearer token. #2146
[BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors (#1840). #1973
[BUGFIX] Make mimirtool build for Windows work again. #2273

Mimir Continuous Test

[ENHANCEMENT] Added the -tests.smoke-test flag to run the mimir-continuous-test suite once and immediately exit. #2047 #2094

Documentation

[ENHANCEMENT] Published Grafana Mimir runbooks as part of documentation. #1970
[ENHANCEMENT] Improved ruler's "remote operational mode" documentation. #1906
[ENHANCEMENT] Recommend fast disks for ingesters and store-gateways in production tips. #1903
[ENHANCEMENT] Explain the runtime override of active series matchers. #1868
[ENHANCEMENT] Clarify "Set rule group" API specification. #1869
[ENHANCEMENT] Published Mimir jsonnet documentation. #2024
[ENHANCEMENT] Documented required scrape interval for using alerting and recording rules from Mimir jsonnet. #2147
[ENHANCEMENT] Runbooks: Mention memberlist as possible source of problems for various alerts. #2158
[ENHANCEMENT] Added step-by-step article about migrating from Consul to Memberlist KV store using jsonnet without downtime. #2166
[ENHANCEMENT] Documented /memberlist admin page. #2166
[ENHANCEMENT] Documented how to configure Grafana Mimir's ruler with Jsonnet. #2127
[ENHANCEMENT] Documented how to configure queriers’ autoscaling with Jsonnet. #2128
[ENHANCEMENT] Updated mixin building instructions in "Installing Grafana Mimir dashboards and alerts" article. #2015 #2163
[ENHANCEMENT] Fix location of "Monitoring Grafana Mimir" article in the documentation hierarchy. #2130
[ENHANCEMENT] Runbook for MimirRequestLatency was expanded with more practical advice. #1967
[BUGFIX] Fixed ruler configuration used in the getting started guide. #2052
[BUGFIX] Fixed Mimir Alertmanager datasource in Grafana used by "Play with Grafana Mimir" tutorial. #2115
[BUGFIX] Fixed typos in "Scaling out Grafana Mimir" article. #2170
[BUGFIX] Added missing ring endpoint exposed by Ingesters. #1918

New Contributors

@pdf made their first contribution in https://github.com/grafana/mimir/pull/1865
@secustor made their first contribution in https://github.com/grafana/mimir/pull/1870
@zenador made their first contribution in https://github.com/grafana/mimir/pull/1930
@pr00se made their first contribution in https://github.com/grafana/mimir/pull/1934
@hjet made their first contribution in https://github.com/grafana/mimir/pull/1973
@williamzelesny made their first contribution in https://github.com/grafana/mimir/pull/2028
@javad-hajiani made their first contribution in https://github.com/grafana/mimir/pull/2146
@rojas-diego made their first contribution in https://github.com/grafana/mimir/pull/2147
@jhesketh made their first contribution in https://github.com/grafana/mimir/pull/2163
@gonzalez made their first contribution in https://github.com/grafana/mimir/pull/2112
@Eve832 made their first contribution in https://github.com/grafana/mimir/pull/2170

Full Changelog: https://github.com/grafana/mimir/compare/mimir-2.1.0...mimir-2.2.0

mimir - Mimir 2.2.0-rc.1

Published by colega over 2 years ago

This release contains 26 contributions from 6 authors. Thank you!

Changes since 2.2.0-rc.0

Grafana Mimir

[BUGFIX] Query-frontend: vector and time functions were sharded, which made expressions like vector(1) > 0 and vector(1) fail. #2355

Mimirtool

[BUGFIX] Make mimirtool build for Windows work again. #2273

Full Changelog: https://github.com/grafana/mimir/compare/mimir-2.2.0-rc.0...mimir-2.2.0-rc.1

mimir - Mimir 2.2.0-rc.0

Published by pstibrany over 2 years ago

2.2.0-rc.0

This release contains 214 contributions from 32 authors. Thank you!

Grafana Labs is excited to announce version 2.2 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

Highlights include the top features, enhancements, and bugfixes in this release. If you are upgrading from Grafana Mimir 2.1, there is migration-related information as well.
For the complete list of changes, see the Changelog.

Features and enhancements

Support for ingesting out-of-order samples: Grafana Mimir includes new, experimental support for ingesting out-of-order samples.
This support is configurable, with users able to set how far out-of-order Mimir will accept samples on a per-tenant basis.
Note that this feature still needs heavy testing, and is not production-ready yet.
Error messages: The error messages that Mimir reports are more human readable, and the messages include error codes that are easily searchable.
Configurable prefix for object storage: Mimir can now store block data, rules, and alerts in one bucket, each under its own user-defined prefix, rather than requiring one bucket for each.
You can configure the storage prefix by using the -<storage>.storage-prefix option for the corresponding storage: ruler-storage, alertmanager-storage, or blocks-storage.
Helm Chart update: TBD
The store-gateway can now optionally pre-populate the file system cache when memory-mapping index-header files.
This can help prevent the store-gateway from appearing stuck while loading index-headers.
You can enable this feature with the new experimental flag -blocks-storage.bucket-store.index-header.map-populate-enabled.
Faster ingester startup: Ingesters now replay their write-ahead logs (WALs) about 50% faster, and they also re-join the ring sooner under some conditions.

Upgrade considerations

We have updated default values and some parameters in Grafana Mimir 2.2 to give you a better out-of-the-box experience:

Message size limits for gRPC messages exchanged between internal Mimir components increased to 100 MiB from the previous 4 MiB.
This helps to avoid internal server errors when pushing or querying large data.
The -blocks-storage.bucket-store.ignore-blocks-within parameter changed from 0 to 10h.
The default value of -querier.query-store-after changed from 0 to 12h.
Both changes improve query performance for most-recent data by querying only the ingesters, rather than object storage.
The option -querier.shuffle-sharding-ingesters-lookback-period has been deprecated.
If you previously changed this option from its default of 0s, set -querier.shuffle-sharding-ingesters-enabled to true and specify the lookback period by setting the -querier.query-ingesters-within option.
The -memberlist.abort-if-join-fails parameter now defaults to false.
When Mimir is using memberlist as a backend store for hash ring, and it fails to join the memberlist cluster, Mimir no longer aborts startup by default.

Bug fixes

PR 1883: Fixed a bug that caused the query-frontend and querier to crash when they received a user query with a special regular expression label matcher.
PR 1933: Fixed a bug in the ingester ring page, which showed incorrect status of entries in the ring.
PR 2090: Ruler in remote rule evaluation mode now applies the timeout correctly. Previously the ruler could get stuck forever, which halted rule evaluation.
PR 2036: Fixed panic at startup when Mimir is running in monolithic mode and query sharding is enabled.

CHANGELOG

Grafana Mimir

[CHANGE] Increased default configuration for -server.grpc-max-recv-msg-size-bytes and -server.grpc-max-send-msg-size-bytes from 4MB to 100MB. #1884
[CHANGE] Default values have changed for the following settings. This improves query performance for recent data (within 12h) by only reading from ingesters: #1909 #1921
- -blocks-storage.bucket-store.ignore-blocks-within now defaults to 10h (previously 0)
- -querier.query-store-after now defaults to 12h (previously 0)
[CHANGE] Alertmanager: removed support for migrating local files from Cortex 1.8 or earlier. Related to original Cortex PR https://github.com/cortexproject/cortex/pull/3910. #2253
[CHANGE] The following settings are now classified as advanced because the defaults should work for most users and tuning them requires in-depth knowledge of how the read path works: #1929
- -querier.query-ingesters-within
- -querier.query-store-after
[CHANGE] Config flag category overrides can be set dynamically at runtime. #1934
[CHANGE] Ingester: deprecated -ingester.ring.join-after. Mimir now behaves as if this setting is always set to 0s. This configuration option will be removed in Mimir 2.4.0. #1965
[CHANGE] Blocks uploaded by ingester no longer contain __org_id__ label. Compactor now ignores this label and will compact blocks with and without this label together. mimirconvert tool will remove the label from blocks as "unknown" label. #1972
[CHANGE] Querier: deprecated -querier.shuffle-sharding-ingesters-lookback-period, instead adding -querier.shuffle-sharding-ingesters-enabled to enable or disable shuffle sharding on the read path. The value of -querier.query-ingesters-within is now used internally for shuffle sharding lookback. #2110
[CHANGE] Memberlist: -memberlist.abort-if-join-fails now defaults to false. Previously it defaulted to true. #2168
[CHANGE] Ruler: /api/v1/rules* and /prometheus/rules* configuration endpoints are removed. Use /prometheus/config/v1/rules*. #2182
[CHANGE] Ingester: -ingester.exemplars-update-period has been renamed to -ingester.tsdb-config-update-period. You can use it to update multiple, per-tenant TSDB configurations. #2187
[FEATURE] Ingester: (Experimental) Add the ability to ingest out-of-order samples up to an allowed limit. If you enable this feature, it requires additional memory and disk space. This feature also enables a write-behind log, which might lead to longer ingester-start replays. When this feature is disabled, there is no overhead on memory, disk space, or startup times. #2187
- -ingester.out-of-order-time-window, as duration string, allows you to set how far back in time a sample can be. The default is 0s, where s is seconds.
- cortex_ingester_tsdb_out_of_order_samples_appended_total metric tracks the total number of out-of-order samples ingested by the ingester.
- cortex_discarded_samples_total has a new label reason="sample-too-old", when the -ingester.out-of-order-time-window flag is greater than zero. The label tracks the number of samples that were discarded for being too old; they were out of order, but beyond the time window allowed.
[ENHANCEMENT] Distributor: Added limit to prevent tenants from sending excessive number of requests: #1843
- The following CLI flags (and their respective YAML config options) have been added:
  - -distributor.request-rate-limit
  - -distributor.request-burst-limit
- The following metric is exposed to tell how many requests have been rejected:
  - cortex_discarded_requests_total
[ENHANCEMENT] Store-gateway: Add the experimental ability to run requests in a dedicated OS thread pool. This feature can be configured using -store-gateway.thread-pool-size and is disabled by default. Replaces the ability to run index header operations in a dedicated thread pool. #1660 #1812
[ENHANCEMENT] Improved error messages to make them easier to understand; each now has a unique, global identifier that you can use to look up in the runbooks for more information. #1907 #1919 #1888 #1939 #1984 #2009 #2056 #2066 #2104 #2150 #2234
[ENHANCEMENT] Memberlist KV: incoming messages are now processed on a per-key goroutine. This may reduce loss of "maintenance" packets in busy memberlist installations, but uses more CPU. New memberlist_client_received_broadcasts_dropped_total counter tracks the number of dropped per-key messages. #1912
[ENHANCEMENT] Blocks Storage, Alertmanager, Ruler: add support for a prefix to the bucket store (*_storage.storage_prefix). This enables using the same bucket for the three components. #1686 #1951
[ENHANCEMENT] Upgrade Docker base images to alpine:3.16.0. #2028
[ENHANCEMENT] Store-gateway: Add experimental configuration option for the store-gateway to attempt to pre-populate the file system cache when memory-mapping index-header files. Enabled with -blocks-storage.bucket-store.index-header.map-populate-enabled=true. Note this flag only has an effect when running on Linux. #2019 #2054
[ENHANCEMENT] Chunk Mapper: reduce memory usage of async chunk mapper. #2043
[ENHANCEMENT] Ingester: reduce sleep time when reading WAL. #2098
[ENHANCEMENT] Compactor: Run sanity check on blocks storage configuration at startup. #2144
[ENHANCEMENT] Compactor: Add HTTP API for uploading TSDB blocks. Enabled with -compactor.block-upload-enabled. #1694 #2126
[ENHANCEMENT] Ingester: Enable querying overlapping blocks by default. #2187
[ENHANCEMENT] Distributor: Auto-forget unhealthy distributors after ten failed ring heartbeats. #2154
[ENHANCEMENT] Distributor: Add new metric cortex_distributor_forward_errors_total for error codes resulting from forwarding requests. #2077
[ENHANCEMENT] /ready endpoint now returns and logs detailed services information. #2055
[ENHANCEMENT] Memcached client: Reduce number of connections required to fetch cached keys from memcached. #1920
[ENHANCEMENT] Improved error message returned when -querier.query-store-after validation fails. #1914
[BUGFIX] Fix regexp parsing panic for regexp label matchers with start/end quantifiers. #1883
[BUGFIX] Ingester: fixed deceiving error log "failed to update cached shipped blocks after shipper initialisation", occurring for each new tenant in the ingester. #1893
[BUGFIX] Ring: fix bug where instances may appear unhealthy in the hash ring web UI even though they are not. #1933
[BUGFIX] API: gzip is now enforced when identity encoding is explicitly rejected. #1864
[BUGFIX] Fix panic at startup when Mimir is running in monolithic mode and query sharding is enabled. #2036
[BUGFIX] Ruler: report cortex_ruler_queries_failed_total metric for any remote query error except 4xx when remote operational mode is enabled. #2053 #2143
[BUGFIX] Ingester: fix slow rollout when using -ingester.ring.unregister-on-shutdown=false with long -ingester.ring.heartbeat-period. #2085
[BUGFIX] Ruler: add timeout for remote rule evaluation queries to prevent rule group evaluations getting stuck indefinitely. The duration is configurable with -querier.timeout (default 2m). #2090 #2222
[BUGFIX] Limits: Active series custom tracker configuration has been renamed back from active_series_custom_trackers_config to active_series_custom_trackers. For backwards compatibility, both versions will be supported until Mimir v2.4. When both fields are specified, active_series_custom_trackers_config takes precedence over active_series_custom_trackers. #2101
[BUGFIX] Ingester: fixed the order of labels applied when incrementing the cortex_discarded_metadata_total metric. #2096
[BUGFIX] Ingester: fixed bug where retrieving metadata for a metric with multiple metadata entries would return multiple copies of a single metadata entry rather than all available entries. #2096
[BUGFIX] Distributor: canceled requests are no longer accounted as internal errors. #2157
[BUGFIX] Memberlist: Fix typo in memberlist admin UI. #2202
[BUGFIX] Ruler: fixed typo in error message when ruler failed to decode a rule group. #2151
[BUGFIX] Active series custom tracker configuration is now displayed properly on /runtime_config page. #2065

Mixin

[CHANGE] Split mimir_queries rules group into mimir_queries and mimir_ingester_queries to keep number of rules per group within the default per-tenant limit. #1885
[CHANGE] Dashboards: Expose full image tag in "Mimir / Rollout progress" dashboard's "Pod per version panel." #1932
[CHANGE] Dashboards: Disabled gateway panels by default, because most users don't have a gateway exposing the metrics expected by Mimir dashboards. You can re-enable them by setting gateway_enabled: true in the mixin config and recompiling the mixin by running make build-mixin. #1955
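
As a minimal sketch, re-enabling the gateway panels could look like the following override in the mixin config. Only the gateway_enabled key comes from the entry above; the `_config+:` wrapper mirrors the Jsonnet example elsewhere in these notes and is an assumption about how your mixin config is laid out.

```jsonnet
// Sketch only: where this override lives depends on how you vendor the mixin.
_config+: {
  // Show the gateway panels again (disabled by default since this release).
  gateway_enabled: true,
},
```

After changing the config, the mixin needs to be recompiled (make build-mixin) for the dashboards to pick up the change.
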
[CHANGE] Alerts: adapt MimirFrontendQueriesStuck and MimirSchedulerQueriesStuck to consider ruler query path components. #1949
[CHANGE] Alerts: Change MimirRulerTooManyFailedQueries severity to critical. #2165
[ENHANCEMENT] Dashboards: Add config option datasource_regex to customise the regular expression used to select valid datasources for Mimir dashboards. #1802
[ENHANCEMENT] Dashboards: Added "Mimir / Remote ruler reads" and "Mimir / Remote ruler reads resources" dashboards. #1911 #1937
[ENHANCEMENT] Dashboards: Make networking panels work for pods created by the mimir-distributed helm chart. #1927
[ENHANCEMENT] Alerts: Add MimirStoreGatewayNoSyncedTenants alert that fires when there is a store-gateway owning no tenants. #1882
[ENHANCEMENT] Rules: Make recording_rules_range_interval configurable for cases where Mimir metrics are scraped less often than every 30 seconds. #2118
[ENHANCEMENT] Added minimum Grafana version to mixin dashboards. #1943
[BUGFIX] Fix container_memory_usage_bytes:sum recording rule. #1865
[BUGFIX] Fix MimirGossipMembersMismatch alerts if Mimir alertmanager is activated. #1870
[BUGFIX] Fix MimirRulerMissedEvaluations to show % of missed alerts as a value between 0 and 100 instead of 0 and 1. #1895
[BUGFIX] Fix MimirCompactorHasNotUploadedBlocks alert false positive when Mimir is deployed in monolithic mode. #1902
[BUGFIX] Fix MimirGossipMembersMismatch to make it less sensitive during rollouts and fire one alert per installation, not per job. #1926
[BUGFIX] Do not trigger MimirAllocatingTooMuchMemory alerts if no container limits are supplied. #1905
[BUGFIX] Dashboards: Remove empty "Chunks per query" panel from Mimir / Queries dashboard. #1928
[BUGFIX] Dashboards: Use Grafana's $__rate_interval for rate queries in dashboards to support scrape intervals of >15s. #2011
[BUGFIX] Alerts: Make each version of MimirCompactorHasNotUploadedBlocks distinct to avoid rule evaluation failures due to duplicate series being generated. #2197
[BUGFIX] Fix MimirGossipMembersMismatch alert when using remote ruler evaluation. #2159

Jsonnet

[CHANGE] Remove use of -querier.query-store-after, -querier.shuffle-sharding-ingesters-lookback-period, -blocks-storage.bucket-store.ignore-blocks-within, and -blocks-storage.tsdb.close-idle-tsdb-timeout CLI flags since the values now match defaults. #1915 #1921
[CHANGE] Change default value for -blocks-storage.bucket-store.chunks-cache.memcached.timeout to 450ms to increase use of cached data. #2035
[CHANGE] The memberlist_ring_enabled configuration now applies to Alertmanager. #2102 #2103 #2107
[CHANGE] Default value for memberlist_ring_enabled is now true. It means that all hash rings use Memberlist as default KV store instead of Consul (previous default). #2161
[CHANGE] Configure -ingester.max-global-metadata-per-user to correspond to 20% of the configured max number of series per tenant. #2250
[CHANGE] Configure -ingester.max-global-metadata-per-metric to be 10. #2250
[CHANGE] Change _config.multi_zone_ingester_max_unavailable to 25. #2251
[FEATURE] Added querier autoscaling support. It requires KEDA installed in the Kubernetes cluster and query-scheduler enabled in the Mimir cluster. Querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2013 #2023
- autoscaling_querier_enabled: true to enable autoscaling.
- autoscaling_querier_min_replicas: minimum number of querier replicas.
- autoscaling_querier_max_replicas: maximum number of querier replicas.
- autoscaling_prometheus_url: Prometheus base URL from which to scrape Mimir metrics (e.g. http://prometheus.default:9090/prometheus).
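
The four options above might be combined as in the sketch below. The option names and the example Prometheus URL are taken from the list; the replica counts are placeholder values chosen for illustration, not recommendations.

```jsonnet
_config+: {
  // Requires KEDA in the Kubernetes cluster and the query-scheduler enabled.
  autoscaling_querier_enabled: true,

  // Placeholder bounds (assumed values); size these for your own query load.
  autoscaling_querier_min_replicas: 2,
  autoscaling_querier_max_replicas: 10,

  // Prometheus base URL from which to scrape Mimir metrics.
  autoscaling_prometheus_url: 'http://prometheus.default:9090/prometheus',
},
```
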
[FEATURE] Jsonnet: Add support for ruler remote evaluation mode (ruler_remote_evaluation_enabled), which deploys and uses a dedicated query path for rule evaluation. This enables the benefits of the query-frontend for rule evaluation, such as query sharding. #2073
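
A minimal sketch of turning on ruler remote evaluation, assuming the option is a boolean set in the same `_config` object as the other Jsonnet options in these notes:

```jsonnet
_config+: {
  // Deploys a dedicated query path that the ruler uses for rule evaluation.
  ruler_remote_evaluation_enabled: true,
},
```
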
[ENHANCEMENT] Added compactor service, that can be used to route requests directly to compactor (e.g. admin UI). #2063
[ENHANCEMENT] Added a consul_enabled configuration option to provide the ability to disable consul. It is automatically set to false when memberlist_ring_enabled is true and multikv_migration_enabled (used for migration from Consul to memberlist) is not set. #2093 #2152
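
Because memberlist_ring_enabled now defaults to true, clusters that want to stay on Consul would need to opt out explicitly. The sketch below is an assumption built from the option names in this section; the notes do not state that this exact combination is the supported way to keep Consul, so verify it against the jsonnet documentation before relying on it.

```jsonnet
_config+: {
  // Assumption: restore the previous default of Consul as the KV store.
  memberlist_ring_enabled: false,
  // Set explicitly for clarity; expected to be the effective value when
  // memberlist is disabled.
  consul_enabled: true,
},
```
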
[BUGFIX] Querier: Fix disabling shuffle sharding on the read path whilst keeping it enabled on write path. #2164

Mimirtool

[CHANGE] mimirtool rules: --use-legacy-routes now toggles between using /prometheus/config/v1/rules (default) and /api/v1/rules (legacy) endpoints. #2182
[FEATURE] Added bearer token support for when Mimir is behind a gateway authenticating by bearer token. #2146
[BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors (#1840). #1973

Mimir Continuous Test

[ENHANCEMENT] Added the -tests.smoke-test flag to run the mimir-continuous-test suite once and immediately exit. #2047 #2094

Documentation

[ENHANCEMENT] Published Grafana Mimir runbooks as part of documentation. #1970
[ENHANCEMENT] Improved ruler's "remote operational mode" documentation. #1906
[ENHANCEMENT] Recommend fast disks for ingesters and store-gateways in production tips. #1903
[ENHANCEMENT] Explain the runtime override of active series matchers. #1868
[ENHANCEMENT] Clarify "Set rule group" API specification. #1869
[ENHANCEMENT] Published Mimir jsonnet documentation. #2024
[ENHANCEMENT] Documented required scrape interval for using alerting and recording rules from Mimir jsonnet. #2147
[ENHANCEMENT] Runbooks: Mention memberlist as possible source of problems for various alerts. #2158
[ENHANCEMENT] Added step-by-step article about migrating from Consul to Memberlist KV store using jsonnet without downtime. #2166
[ENHANCEMENT] Documented /memberlist admin page. #2166
[ENHANCEMENT] Documented how to configure Grafana Mimir's ruler with Jsonnet. #2127
[ENHANCEMENT] Documented how to configure queriers’ autoscaling with Jsonnet. #2128
[ENHANCEMENT] Updated mixin building instructions in "Installing Grafana Mimir dashboards and alerts" article. #2015 #2163
[ENHANCEMENT] Fix location of "Monitoring Grafana Mimir" article in the documentation hierarchy. #2130
[ENHANCEMENT] Runbook for MimirRequestLatency was expanded with more practical advice. #1967
[BUGFIX] Fixed ruler configuration used in the getting started guide. #2052
[BUGFIX] Fixed Mimir Alertmanager datasource in Grafana used by "Play with Grafana Mimir" tutorial. #2115
[BUGFIX] Fixed typos in "Scaling out Grafana Mimir" article. #2170
[BUGFIX] Added missing ring endpoint exposed by Ingesters. #1918

New Contributors

@pdf made their first contribution in https://github.com/grafana/mimir/pull/1865
@secustor made their first contribution in https://github.com/grafana/mimir/pull/1870
@zenador made their first contribution in https://github.com/grafana/mimir/pull/1930
@pr00se made their first contribution in https://github.com/grafana/mimir/pull/1934
@hjet made their first contribution in https://github.com/grafana/mimir/pull/1973
@williamzelesny made their first contribution in https://github.com/grafana/mimir/pull/2028
@javad-hajiani made their first contribution in https://github.com/grafana/mimir/pull/2146
@rojas-diego made their first contribution in https://github.com/grafana/mimir/pull/2147
@jhesketh made their first contribution in https://github.com/grafana/mimir/pull/2163
@gonzalez made their first contribution in https://github.com/grafana/mimir/pull/2112
@Eve832 made their first contribution in https://github.com/grafana/mimir/pull/2170

Full Changelog: https://github.com/grafana/mimir/compare/mimir-2.1.0...mimir-2.2.0-rc.0

mimir - 2.1.0

Published by johannaratliff over 2 years ago

Grafana Labs is excited to announce version 2.1 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

Below we highlight the top features, enhancements and bugfixes in this release, as well as relevant callouts for those upgrading from Grafana Mimir 2.0. The complete list of changes is recorded in the Changelog.

Features and enhancements

Mimir on ARM: We now publish Docker images for both amd64 and arm64, making it easier for those on arm-based machines to develop and run Mimir. Multiplatform images are available from the Mimir docker registry. Note that our existing integration test suite only uses the amd64 images, which means we cannot make any functional or performance guarantees about the arm64 images.
Remote ruler mode for improved rule evaluation performance: We've added a remote mode for the Grafana Mimir ruler, in which the ruler delegates rule evaluation to the query-frontend rather than evaluating rules directly within the ruler process itself. This allows recording and alerting rules to benefit from the query parallelization techniques implemented in the query-frontend (like query sharding). Remote mode is considered experimental and is off by default. To enable, see remote ruler.
Per-tenant custom trackers for monitoring cardinality: In Grafana Mimir 2.0, we introduced a custom tracker feature that allows you to track the count of active series over time that match a specific label matcher. In Grafana Mimir 2.1, we've made it possible to configure custom trackers via the runtime configuration file. This means you can now define different trackers for each tenant in your cluster and modify those trackers without an ingester restart.
Reduce cardinality of Grafana Mimir's /metrics endpoint: While Grafana Mimir does a good job of exposing a relatively small number of series about its own state, this number can tick up when running Grafana Mimir clusters with high tenant counts or high active series counts. To reduce this number (and the accompanying cost of scraping and storing these time series), we made several optimizations which decreased series count on the /metrics endpoint by more than 10%.

Upgrade considerations

We've updated the default values for 2 parameters in Grafana Mimir to give users better out-of-the-box performance:

We've changed the default for -blocks-storage.tsdb.isolation-enabled from true to false. We've marked this flag as deprecated and will remove it completely in 2 releases. TSDB isolation is a feature inherited from Prometheus that didn't provide any benefit given Grafana Mimir's distributed architecture and in our 1 billion series load test we found it actually hurt performance. Disabling it reduced our ingester 99th percentile latency by 90%.
The store-gateway attributes cache is now enabled by default (achieved by updating the default for -blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items from 0 to 50000). This in-memory cache makes it faster to look up object attributes for chunk data. We've been running this optional cache internally for a while and upon a recent configuration audit, realized it made sense to do the same for all users. The increase in store-gateway memory utilization from enabling this cache is negligible and easily justified given the performance gains.

Bug fixes

2.1.0 bug fixes

PR 1704: Fixed a bug that previously caused Grafana Mimir to crash on startup when trying to run in monolithic mode with the results cache enabled due to duplicate metric names.
PR 1835: Fixed a bug that caused Grafana Mimir to crash when an invalid Alertmanager configuration was set even though the Alertmanager component was disabled. After this fix, the Alertmanager configuration is only validated if the Alertmanager component is loaded.
PR 1836: The ability to run Alertmanager with local storage broke in Grafana Mimir 2.0 when we removed the ability to run the Alertmanager without sharding. With this bugfix, we've made it possible to again run Alertmanager with local storage. However, for production use, we still recommend using external store since this is needed to persist Alertmanager state (e.g. silences) between replicas.
PR 1715: Restored Grafana Mimir's ability to use CNAME DNS records to reach memcached servers. The bug was inherited from an upstream change to Thanos; we contributed a fix to Thanos and subsequently updated our Thanos version.

CHANGELOG

Grafana Mimir

[CHANGE] Compactor: No longer upload debug meta files to object storage. #1257
[CHANGE] Default values have changed for the following settings: #1547
- -alertmanager.alertmanager-client.grpc-max-recv-msg-size now defaults to 100 MiB (previously was not configurable and set to 16 MiB)
- -alertmanager.alertmanager-client.grpc-max-send-msg-size now defaults to 100 MiB (previously was not configurable and set to 4 MiB)
- -alertmanager.max-recv-msg-size now defaults to 100 MiB (previously was 16 MiB)
[CHANGE] Ingester: Add user label to metrics cortex_ingester_ingested_samples_total and cortex_ingester_ingested_samples_failures_total. #1533
[CHANGE] Ingester: Changed -blocks-storage.tsdb.isolation-enabled default from true to false. The config option has also been deprecated and will be removed in 2 minor versions. #1655
[CHANGE] Query-frontend: results cache keys are now versioned, this will cause cache to be re-filled when rolling out this version. #1631
[CHANGE] Store-gateway: enabled attributes in-memory cache by default. New default configuration is -blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items=50000. #1727
[CHANGE] Compactor: Removed the metric cortex_compactor_garbage_collected_blocks_total since it duplicates cortex_compactor_blocks_marked_for_deletion_total. #1728
[CHANGE] All: Logs that used the org_id label now use the user label. #1634 #1758
[CHANGE] Alertmanager: the following metrics are not exported for a given user and integration when the metric value is zero: #1783
- cortex_alertmanager_notifications_total
- cortex_alertmanager_notifications_failed_total
- cortex_alertmanager_notification_requests_total
- cortex_alertmanager_notification_requests_failed_total
- cortex_alertmanager_notification_rate_limited_total
[CHANGE] Removed the following metrics exposed by the Mimir hash rings: #1791
- cortex_member_ring_tokens_owned
- cortex_member_ring_tokens_to_own
- cortex_ring_tokens_owned
- cortex_ring_member_ownership_percent
[CHANGE] Querier / Ruler: removed the following metrics tracking the number of query requests sent to each ingester. You can use cortex_request_duration_seconds_count{route=~"/cortex.Ingester/(QueryStream|QueryExemplars)"} instead. #1797
- cortex_distributor_ingester_queries_total
- cortex_distributor_ingester_query_failures_total
[CHANGE] Distributor: removed the following metrics tracking the number of requests from a distributor to ingesters: #1799
- cortex_distributor_ingester_appends_total
- cortex_distributor_ingester_append_failures_total
[CHANGE] Distributor / Ruler: deprecated -distributor.extend-writes. Now Mimir always behaves as if this setting was set to false, which we expect to be safe for every Mimir cluster setup. #1856
[FEATURE] Querier: Added support for streaming remote read. Note that the benefits of chunking the response are partial here, since in a typical query-frontend setup responses are buffered until they have been completed. #1735
[FEATURE] Ruler: Allow setting evaluation_delay for each rule group via rules group configuration file. #1474
[FEATURE] Ruler: Added support for expression remote evaluation. #1536 #1818
- The following CLI flags (and their respective YAML config options) have been added:
  - -ruler.query-frontend.address
  - -ruler.query-frontend.grpc-client-config.grpc-max-recv-msg-size
  - -ruler.query-frontend.grpc-client-config.grpc-max-send-msg-size
  - -ruler.query-frontend.grpc-client-config.grpc-compression
  - -ruler.query-frontend.grpc-client-config.grpc-client-rate-limit
  - -ruler.query-frontend.grpc-client-config.grpc-client-rate-limit-burst
  - -ruler.query-frontend.grpc-client-config.backoff-on-ratelimits
  - -ruler.query-frontend.grpc-client-config.backoff-min-period
  - -ruler.query-frontend.grpc-client-config.backoff-max-period
  - -ruler.query-frontend.grpc-client-config.backoff-retries
  - -ruler.query-frontend.grpc-client-config.tls-enabled
  - -ruler.query-frontend.grpc-client-config.tls-ca-path
  - -ruler.query-frontend.grpc-client-config.tls-cert-path
  - -ruler.query-frontend.grpc-client-config.tls-key-path
  - -ruler.query-frontend.grpc-client-config.tls-server-name
  - -ruler.query-frontend.grpc-client-config.tls-insecure-skip-verify
[FEATURE] Distributor: Added the ability to forward specific metrics to alternative remote_write API endpoints. #1052
[FEATURE] Ingester: Active series custom trackers now supports runtime tenant-specific overrides. The configuration has been moved to limit config, the ingester config has been deprecated. #1188
[ENHANCEMENT] Alertmanager API: Concurrency limit for GET requests is now configurable using -alertmanager.max-concurrent-get-requests-per-tenant. #1547
[ENHANCEMENT] Alertmanager: Added the ability to configure additional gRPC client settings for the Alertmanager distributor #1547
- -alertmanager.alertmanager-client.backoff-max-period
- -alertmanager.alertmanager-client.backoff-min-period
- -alertmanager.alertmanager-client.backoff-on-ratelimits
- -alertmanager.alertmanager-client.backoff-retries
- -alertmanager.alertmanager-client.grpc-client-rate-limit
- -alertmanager.alertmanager-client.grpc-client-rate-limit-burst
- -alertmanager.alertmanager-client.grpc-compression
- -alertmanager.alertmanager-client.grpc-max-recv-msg-size
- -alertmanager.alertmanager-client.grpc-max-send-msg-size
[ENHANCEMENT] Ruler: Add more detailed query information to ruler query stats logging. #1411
[ENHANCEMENT] Admin: Admin API now has some styling. #1482 #1549 #1821 #1824
[ENHANCEMENT] Alertmanager: added insight=true field to alertmanager dispatch logs. #1379
[ENHANCEMENT] Store-gateway: Add the experimental ability to run index header operations in a dedicated thread pool. This feature can be configured using -blocks-storage.bucket-store.index-header-thread-pool-size and is disabled by default. #1660
[ENHANCEMENT] Store-gateway: don't drop all blocks if instance finds itself as unhealthy or missing in the ring. #1806 #1823
[ENHANCEMENT] Querier: wait until inflight queries are completed when shutting down queriers. #1756 #1767
[BUGFIX] Query-frontend: do not shard queries with a subquery unless the subquery is inside a shardable aggregation function call. #1542
[BUGFIX] Query-frontend: added component=query-frontend label to results cache memcached metrics to fix a panic when Mimir is running in single binary mode and results cache is enabled. #1704
[BUGFIX] Mimir: services' status content-type is now correctly set to text/html. #1575
[BUGFIX] Multikv: Fix panic when using runtime config to set primary KV store used by multi KV. #1587
[BUGFIX] Multikv: Fix watching for runtime config changes in multi KV store in ruler and querier. #1665
[BUGFIX] Memcached: allow to use CNAME DNS records for the memcached backend addresses. #1654
[BUGFIX] Querier: fixed temporary partial query results when shuffle sharding is enabled and hash ring backend storage is flushed / reset. #1829
[BUGFIX] Alertmanager: prevent more file traversal cases related to template names. #1833
[BUGFIX] Alertmanager: Allow usage with -alertmanager-storage.backend=local. Note that when using this storage type, the Alertmanager is not able to persist state remotely, so it is not recommended for production use. #1836
[BUGFIX] Alertmanager: Do not validate alertmanager configuration if it's not running. #1835

Mixin

[CHANGE] Dashboards: Remove per-user series legends from Tenants dashboard. #1605
[CHANGE] Dashboards: Show in-memory series and the per-user series limit on Tenants dashboard. #1613
[CHANGE] Dashboards: Slow-queries dashboard now uses user label from logs instead of org_id. #1634
[CHANGE] Dashboards: changed all Grafana dashboards UIDs to not conflict with Cortex ones, to let people install both while migrating from Cortex to Mimir: #1801 #1808
- Alertmanager from a76bee5913c97c918d9e56a3cc88cc28 to b0d38d318bbddd80476246d4930f9e55
- Alertmanager Resources from 68b66aed90ccab448009089544a8d6c6 to a6883fb22799ac74479c7db872451092
- Compactor from 9c408e1d55681ecb8a22c9fab46875cc to 1b3443aea86db629e6efdb7d05c53823
- Compactor Resources from df9added6f1f4332f95848cca48ebd99 to 09a5c49e9cdb2f2b24c6d184574a07fd
- Config from 61bb048ced9817b2d3e07677fb1c6290 to 5d9d0b4724c0f80d68467088ec61e003
- Object Store from d5a3a4489d57c733b5677fb55370a723 to e1324ee2a434f4158c00a9ee279d3292
- Overrides from b5c95fee2e5e7c4b5930826ff6e89a12 to 1e2c358600ac53f09faea133f811b5bb
- Queries from d9931b1054053c8b972d320774bb8f1d to b3abe8d5c040395cc36615cb4334c92d
- Reads from 8d6ba60eccc4b6eedfa329b24b1bd339 to e327503188913dc38ad571c647eef643
- Reads Networking from c0464f0d8bd026f776c9006b05910000 to 54b2a0a4748b3bd1aefa92ce5559a1c2
- Reads Resources from 2fd2cda9eea8d8af9fbc0a5960425120 to cc86fd5aa9301c6528986572ad974db9
- Rollout Progress from 7544a3a62b1be6ffd919fc990ab8ba8f to 7f0b5567d543a1698e695b530eb7f5de
- Ruler from 44d12bcb1f95661c6ab6bc946dfc3473 to 631e15d5d85afb2ca8e35d62984eeaa0
- Scaling from 88c041017b96856c9176e07cf557bdcf to 64bbad83507b7289b514725658e10352
- Slow queries from e6f3091e29d2636e3b8393447e925668 to 6089e1ce1e678788f46312a0a1e647e6
- Tenants from 35fa247ce651ba189debf33d7ae41611 to 35fa247ce651ba189debf33d7ae41611
- Top Tenants from bc6e12d4fe540e4a1785b9d3ca0ffdd9 to bc6e12d4fe540e4a1785b9d3ca0ffdd9
- Writes from 0156f6d15aa234d452a33a4f13c838e3 to 8280707b8f16e7b87b840fc1cc92d4c5
- Writes Networking from 681cd62b680b7154811fe73af55dcfd4 to 978c1cb452585c96697a238eaac7fe2d
- Writes Resources from c0464f0d8bd026f776c9006b0591bb0b to bc9160e50b52e89e0e49c840fea3d379
[FEATURE] Alerts: added the following alerts on mimir-continuous-test tool: #1676
- MimirContinuousTestNotRunningOnWrites
- MimirContinuousTestNotRunningOnReads
- MimirContinuousTestFailed
[ENHANCEMENT] Added per_cluster_label support to allow to change the label name used to differentiate between Kubernetes clusters. #1651
[ENHANCEMENT] Dashboards: Show QPS and latency of the Alertmanager Distributor. #1696
[ENHANCEMENT] Playbooks: Add Alertmanager suggestions for MimirRequestErrors and MimirRequestLatency #1702
[ENHANCEMENT] Dashboards: Allow custom datasources. #1749
[ENHANCEMENT] Dashboards: Add config option gateway_enabled (defaults to true) to disable gateway panels from dashboards. #1761
[ENHANCEMENT] Dashboards: Extend Top tenants dashboard with queries for tenants with highest sample rate, discard rate, and discard rate growth. #1842
[ENHANCEMENT] Dashboards: Show ingestion rate limit and rule group limit on Tenants dashboard. #1845
[ENHANCEMENT] Dashboards: Add "last successful run" panel to compactor dashboard. #1628
[BUGFIX] Dashboards: Fix "Failed evaluation rate" panel on Tenants dashboard. #1629
[BUGFIX] Honor the configured per_instance_label in all dashboards and alerts. #1697

Jsonnet

[FEATURE] Added support for mimir-continuous-test. To deploy mimir-continuous-test you can use the following configuration: #1675 #1850

_config+: {
  continuous_test_enabled: true,
  continuous_test_tenant_id: 'type-tenant-id',
  continuous_test_write_endpoint: 'http://type-write-path-hostname',
  continuous_test_read_endpoint: 'http://type-read-path-hostname/prometheus',
},

[ENHANCEMENT] Ingester anti-affinity can now be disabled by using ingester_allow_multiple_replicas_on_same_node configuration key. #1581
[ENHANCEMENT] Added node_selector configuration option to select Kubernetes nodes where Mimir should run. #1596
[ENHANCEMENT] Alertmanager: Added a PodDisruptionBudget of withMaxUnavailable = 1, to ensure we maintain quorum during rollouts. #1683
[ENHANCEMENT] Store-gateway anti-affinity can now be enabled/disabled using store_gateway_allow_multiple_replicas_on_same_node configuration key. #1730
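
As an illustration, the two anti-affinity keys mentioned above (for ingesters and store-gateways) might be set as follows. The key names come from these entries; treating them as booleans in `_config` is an assumption, and allowing several replicas on one node is typically only sensible in small test clusters.

```jsonnet
_config+: {
  // Assumed boolean: true disables ingester anti-affinity.
  ingester_allow_multiple_replicas_on_same_node: true,
  // Assumed boolean: true disables store-gateway anti-affinity.
  store_gateway_allow_multiple_replicas_on_same_node: true,
},
```
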
[ENHANCEMENT] Added store_gateway_zone_a_args, store_gateway_zone_b_args and store_gateway_zone_c_args configuration options. #1807
[BUGFIX] Pass primary and secondary multikv stores via CLI flags. Introduced new multikv_switch_primary_secondary config option to flip primary and secondary in runtime config.

Mimirtool

[BUGFIX] config convert: Retain Cortex defaults for blocks_storage.backend, ruler_storage.backend, alertmanager_storage.backend, auth.type, activity_tracker.filepath, alertmanager.data_dir, blocks_storage.filesystem.dir, compactor.data_dir, ruler.rule_path, ruler_storage.filesystem.dir, and graphite.querier.schemas.backend. #1626 #1762

Tools

[FEATURE] Added a markblocks tool that creates no-compact and delete marks for the blocks. #1551
[FEATURE] Added mimir-continuous-test tool to continuously run smoke tests on live Mimir clusters. #1535 #1540 #1653 #1603 #1630 #1691 #1675 #1676 #1692 #1706 #1709 #1775 #1777 #1778 #1795
[FEATURE] Added mimir-rules-action GitHub action, located at operations/mimir-rules-action/, used to lint, prepare, verify, diff, and sync rules to a Mimir cluster. #1723

mimir - 2.1.0-rc.1

Published by johannaratliff over 2 years ago

CHANGELOG since mimir-2.1.0-rc.0

[CHANGE] Distributor / Ruler: deprecated -distributor.extend-writes. Now Mimir always behaves as if this setting was set to false, which we expect to be safe for every Mimir cluster setup. #1856

mimir - 2.1.0-rc.0

Published by johannaratliff over 2 years ago

Grafana Mimir version 2.1 release notes

Grafana Labs is excited to announce version 2.1 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

Features and enhancements

Mimir on ARM: We now publish Docker images for both amd64 and arm64, making it easier for those on arm-based machines to develop and run Mimir. Multiplatform images are available from the Mimir docker registry. Note that our existing integration test suite only uses the amd64 images, which means we cannot make any functional or performance guarantees about the arm64 images.
Remote ruler mode for improved rule evaluation performance: We've added a remote mode for the Grafana Mimir ruler, in which the ruler delegates rule evaluation to the query-frontend rather than evaluating rules directly within the ruler process itself. This allows recording and alerting rules to benefit from the query parallelization techniques implemented in the query-frontend (like query sharding). Remote mode is considered experimental and is off by default. To enable, see remote ruler.
Per-tenant custom trackers for monitoring cardinality: In Grafana Mimir 2.0, we introduced a custom tracker feature that allows you to track the count of active series over time that match a specific label matcher. In Grafana Mimir 2.1, we've made it possible to configure custom trackers via the runtime configuration file. This means you can now define different trackers for each tenant in your cluster and modify those trackers without an ingester restart.
Reduce cardinality of Grafana Mimir's /metrics endpoint: While Grafana Mimir does a good job of exposing a relatively small number of series about its own state, this number can tick up when running Grafana Mimir clusters with high tenant counts or high active series counts. To reduce this number (and the accompanying cost of scraping and storing these time series), we made several optimizations which decreased series count on the /metrics endpoint by more than 10%.

Upgrade considerations

We've updated the default values for 2 parameters in Grafana Mimir to give users better out-of-the-box performance:

We've changed the default for -blocks-storage.tsdb.isolation-enabled from true to false. We've also marked this flag as deprecated and will remove it completely in 2 releases. TSDB isolation is a feature inherited from Prometheus that provides no benefit given Grafana Mimir's distributed architecture, and in our 1 billion series load test we found that it actually hurt performance. Disabling it reduced our ingester 99th percentile latency by 90%.
The store-gateway attributes cache is now enabled by default (achieved by updating the default for -blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items from 0 to 50000). This in-memory cache makes it faster to look up object attributes for chunk data. We've been running this optional cache internally for a while and, after a recent configuration audit, realized it made sense to enable it for all users. The increase in store-gateway memory utilization from enabling this cache is negligible and easily justified given the performance gains.
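
If you want to compare behavior before and after the upgrade, one option is to pass the previous defaults explicitly. The sketch below only renders the two flags named above with their old values; it assumes the usual single-dash `-name=value` flag form, and the helper name `flags_for` is hypothetical. Since the isolation flag is deprecated, treat this as a short-term aid rather than a long-term configuration.

```python
# Previous default values, taken from the upgrade notes above.
PREVIOUS_DEFAULTS = {
    "blocks-storage.tsdb.isolation-enabled": "true",
    "blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items": "0",
}


def flags_for(overrides):
    """Render overrides as single-dash `-name=value` command-line arguments."""
    return [f"-{name}={value}" for name, value in overrides.items()]


if __name__ == "__main__":
    # Print the arguments on one line so they can be appended to a command.
    print(" ".join(flags_for(PREVIOUS_DEFAULTS)))
```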

Bug fixes

2.1.0 bug fixes

PR 1704: Fixed a bug that caused Grafana Mimir to crash on startup, due to duplicate metric names, when running in monolithic mode with the results cache enabled.
PR 1835: Fixed a bug that caused Grafana Mimir to crash when an invalid Alertmanager configuration was set even though the Alertmanager component was disabled. After this fix, the Alertmanager configuration is only validated if the Alertmanager component is loaded.
PR 1836: The ability to run Alertmanager with local storage broke in Grafana Mimir 2.0 when we removed the ability to run the Alertmanager without sharding. With this bugfix, it is again possible to run Alertmanager with local storage. However, for production use, we still recommend using an external store, since this is needed to persist Alertmanager state (e.g. silences) between replicas.
PR 1715: Restored Grafana Mimir's ability to use CNAME DNS records to reach memcached servers. The bug was inherited from an upstream change to Thanos; we contributed a fix to Thanos and subsequently updated our Thanos version.

mimir - 2.0.0

Published by pracucci over 2 years ago

Grafana Labs is excited to announce the first release of Grafana Mimir, the most scalable, most performant open source time series database in the world. In customer tests, we’ve shown that a single cluster can support more than 1 billion active time series.

Besides massive scale, Grafana Mimir offers a host of other benefits, including easy deployment, native multi-tenancy, high availability, durable long-term storage, and exceptional query performance on even the highest cardinality queries.

We’re launching Grafana Mimir with a 2.0 version number to signal our respect for Cortex, the project from which Grafana Mimir was forked. The choice of 2.0 also represents our conviction that Grafana Mimir is real-world-tested, production-ready software. It has served as the backbone of our Grafana Cloud Metrics and Grafana Enterprise Metrics products since their inception.

The complete list of changes is recorded in the Changelog.

mimir - 2.0.0-rc.4

Published by pracucci over 2 years ago

mimir - 2.0.0-rc.3

Published by pracucci over 2 years ago

mimir - 2.0.0-rc.2

Published by pracucci over 2 years ago

mimir - 2.0.0-rc.1

Published by pracucci over 2 years ago

mimir - 2.0.0-rc.0

Published by pracucci over 2 years ago
