thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.

APACHE-2.0 License

thanos - v0.12.0

Published by squat over 4 years ago

Fixed

  • #2288 Ruler: fixes issue #2281, a bug causing incorrect parsing of query address with path prefix.
  • #2238 Ruler: fixed issue #2204, where a bug in alert queue signaling filled up the queue and alerts were dropped.
  • #2231 Bucket Web: sort chunks by thanos.downsample.resolution for better grouping.
  • #2254 Bucket: fix issue where metrics were registered multiple times in bucket replicate.
  • #2271 Bucket Web: fixed issue #2260, where the bucket passes null when storage is empty.
  • #2339 Query: fix a bug where --store.unhealthy-timeout was never respected.
  • #2208 Query and Rule: fix handling of web.route-prefix to correctly handle / and prefixes that do not begin with a /.
  • #2311 Receive: ensure receive component serves TLS when TLS configuration is provided.
  • #2319 Query: fixed inconsistent naming of metrics.
  • #2390 Store: fixed bug that was causing all posting offsets to be used instead of only 1/32 as intended; added hidden flag to control this behavior.
  • #2393 Store: fixed bug causing certain not-existing label values queried to fail with "invalid-size" error from binary header.
  • #2382 Store: fixed bug causing partial writes of index-header.
  • #2383 Store: handle expected errors correctly, e.g. do not increment failure counters.
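
The route-prefix fix (#2208) can be pictured with a small normalization helper. This is only a sketch of one plausible rule (always one leading slash, no trailing slash); the function name `normalize_route_prefix` is hypothetical, and the exact behaviour Thanos implements is not stated in these notes.

```python
def normalize_route_prefix(prefix: str) -> str:
    """Return a prefix with exactly one leading slash and no trailing slash.

    Assumption for illustration: "" and "/" both mean "serve at the root".
    """
    # Drop any slashes at both ends, then add back a single leading one.
    stripped = prefix.strip("/")
    return "/" + stripped


if __name__ == "__main__":
    for raw in ["", "/", "thanos", "/thanos/", "a/b"]:
        print(repr(raw), "->", normalize_route_prefix(raw))
```

Normalizing once at start-up, rather than at every route registration, keeps the handling of `/` and slash-less prefixes in one place.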

Added

  • #2252 Query: add new --store-strict flag. More information available here.
  • #2265 Compact: add --wait-interval to specify compaction wait interval between consecutive compact runs when --wait is enabled.
  • #2250 Compact: enable vertical compaction for offline deduplication (experimental). Uses --deduplication.replica-label flag to specify the replica label on which to deduplicate (hidden). Please note that this uses a NAIVE algorithm for merging (no smart replica deduplication, just chaining samples together). This works well for deduplication of blocks with precisely the same samples like those produced by Receiver replication. We plan to add a smarter algorithm in the following weeks.
  • #1714 Compact: the compact component now exposes the bucket web UI when it is run as a long-lived process.
  • #2304 Store: added max_item_size configuration option to memcached-based index cache. This should be set to the max item size configured in memcached (-I flag) in order to not waste network round-trips to cache items larger than the limit configured in memcached.
  • #2297 Store: add --experimental.enable-index-cache-postings-compression flag to enable re-encoding and compressing postings before storing them into the cache. Compressed postings take about 10% of the original size.
  • #2357 Compact and Store: the compact and store components now serve the bucket UI on :<http-port>/loaded, which shows exactly the blocks that are currently seen by compactor and the store gateway. The compactor also serves a different bucket UI on :<http-port>/global, which shows the status of object storage without any filters.
  • #2166 Bucket Web: improve the tooltip for the bucket UI; it was reconstructed and now exposes much more information about blocks.
  • #2172 Store: add support for sharding the store component based on the label hash.
  • #2113 Bucket: added thanos bucket replicate command to replicate blocks from one bucket to another.
  • #1922 Docs: create a new document to explain sharding in Thanos.
  • #2230 Store: optimize conversion of labels.
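
To illustrate the idea behind label-hash sharding (#2172), here is a generic hash-modulo sketch. The choice of MD5, the use of the last 8 bytes of the digest, and the label values are assumptions made for the example; they are not taken from these notes and may differ from what Thanos does internally.

```python
import hashlib
from collections import defaultdict


def shard_for(label_value: str, total_shards: int) -> int:
    """Map a label value to a shard index in [0, total_shards)."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    digest = hashlib.md5(label_value.encode("utf-8")).digest()
    # Interpret the last 8 bytes as an unsigned integer and reduce modulo N.
    return int.from_bytes(digest[-8:], "big") % total_shards


def assign(label_values, total_shards):
    """Group label values by the shard that would serve them."""
    shards = defaultdict(list)
    for value in label_values:
        shards[shard_for(value, total_shards)].append(value)
    return dict(shards)


if __name__ == "__main__":
    # Hypothetical external label values, one per block stream.
    print(assign(["cluster-a", "cluster-b", "cluster-c", "cluster-d"], 3))
```

Because the mapping depends only on the label value and the shard count, each store instance can decide independently which blocks it owns, without coordinating with the others.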

Changed

  • #2136 breaking Store, Compact, Bucket: schedule block deletion by adding deletion-mark.json. This adds a consistent way for multiple readers and writers to access object storage.
    Since there are no consistency guarantees provided by some Object Storage providers, this PR adds a consistent lock-free way of dealing with Object Storage irrespective of the choice of object storage. In order to achieve this co-ordination, blocks are not deleted directly. Instead, blocks are marked for deletion by uploading the deletion-mark.json file for the block that was chosen to be deleted. This file contains Unix time of when the block was marked for deletion. If you want to keep existing behavior, you should add --delete-delay=0s as a flag.
  • #2090 breaking Downsample command: the downsample command has moved and is now a sub-command of the thanos bucket sub-command; it cannot be called via thanos downsample any more.
  • #2294 Store: optimizations for fetching postings. Queries using =~".*" matchers or negation matchers (!=... or !~...) benefit the most.
  • #2301 Ruler: exit with an error when initialization fails.
  • #2310 Query: report timespan 0 to 0 when discovering no stores.
  • #2330 Store: index-header is no longer experimental. It is enabled by default for store Gateway. You can disable it with new hidden flag: --store.disable-index-header. The --experimental.enable-index-header flag was removed.
  • #1848 Ruler: allow returning error messages when a reload is triggered via HTTP.
  • #2270 All: Thanos components will now print stack traces when they error out.
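
The deletion-mark mechanism from #2136 can be sketched as follows. The notes only say that deletion-mark.json contains the Unix time at which the block was marked; the field names (`id`, `deletion_time`, `version`), the helper name `is_deletable`, and the 48h delay used below are assumptions for illustration.

```python
import json
import time
from typing import Optional


def is_deletable(mark_json: str, delete_delay_seconds: float,
                 now: Optional[float] = None) -> bool:
    """Return True once the delete delay has elapsed since the block was marked."""
    mark = json.loads(mark_json)
    marked_at = mark["deletion_time"]  # assumed field name, Unix seconds
    if now is None:
        now = time.time()
    return now - marked_at >= delete_delay_seconds


# Hypothetical content of a deletion-mark.json file.
example_mark = json.dumps({
    "id": "01E0EXAMPLEBLOCKULID000000",
    "deletion_time": 1585000000,
    "version": 1,
})

if __name__ == "__main__":
    one_hour_later = 1585000000 + 3600
    print(is_deletable(example_mark, 48 * 3600, now=one_hour_later))  # delay not elapsed
    print(is_deletable(example_mark, 0, now=one_hour_later))          # like --delete-delay=0s
```

With a delay of zero the block becomes deletable as soon as it is marked, which corresponds to keeping the previous behaviour via --delete-delay=0s.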
thanos - v0.12.0-rc.1

Published by squat over 4 years ago

Fixed

  • #2288 Ruler: fixes issue #2281, a bug causing incorrect parsing of query address with path prefix.
  • #2238 Ruler: fixed issue #2204, where a bug in alert queue signaling filled up the queue and alerts were dropped.
  • #2231 Bucket Web: sort chunks by thanos.downsample.resolution for better grouping.
  • #2254 Bucket: fix issue where metrics were registered multiple times in bucket replicate.
  • #2271 Bucket Web: fixed issue #2260, where the bucket passes null when storage is empty.
  • #2339 Query: fix a bug where --store.unhealthy-timeout was never respected.
  • #2208 Query and Rule: fix handling of web.route-prefix to correctly handle / and prefixes that do not begin with a /.
  • #2311 Receive: ensure receive component serves TLS when TLS configuration is provided.
  • #2319 Query: fixed inconsistent naming of metrics.
  • #2390 Store: fixed bug that was causing all posting offsets to be used instead of only 1/32 as intended; added hidden flag to control this behavior.
  • #2393 Store: fixed bug causing certain not-existing label values queried to fail with "invalid-size" error from binary header.
  • #2382 Store: fixed bug causing partial writes of index-header.
  • #2383 Store: handle expected errors correctly, e.g. do not increment failure counters.

Added

  • #2252 Query: add new --store-strict flag. More information available here.
  • #2265 Compact: add --wait-interval to specify compaction wait interval between consecutive compact runs when --wait is enabled.
  • #2250 Compact: enable vertical compaction for offline deduplication (experimental). Uses --deduplication.replica-label flag to specify the replica label on which to deduplicate (hidden). Please note that this uses a NAIVE algorithm for merging (no smart replica deduplication, just chaining samples together). This works well for deduplication of blocks with precisely the same samples like those produced by Receiver replication. We plan to add a smarter algorithm in the following weeks.
  • #1714 Compact: the compact component now exposes the bucket web UI when it is run as a long-lived process.
  • #2304 Store: added max_item_size configuration option to memcached-based index cache. This should be set to the max item size configured in memcached (-I flag) in order to not waste network round-trips to cache items larger than the limit configured in memcached.
  • #2297 Store: add --experimental.enable-index-cache-postings-compression flag to enable re-encoding and compressing postings before storing them into the cache. Compressed postings take about 10% of the original size.
  • #2357 Compact and Store: the compact and store components now serve the bucket UI on :<http-port>/loaded, which shows exactly the blocks that are currently seen by compactor and the store gateway. The compactor also serves a different bucket UI on :<http-port>/global, which shows the status of object storage without any filters.
  • #2166 Bucket Web: improve the tooltip for the bucket UI; it was reconstructed and now exposes much more information about blocks.
  • #2172 Store: add support for sharding the store component based on the label hash.
  • #2113 Bucket: added thanos bucket replicate command to replicate blocks from one bucket to another.
  • #1922 Docs: create a new document to explain sharding in Thanos.
  • #2230 Store: optimize conversion of labels.

Changed

  • #2136 breaking Store, Compact, Bucket: schedule block deletion by adding deletion-mark.json. This adds a consistent way for multiple readers and writers to access object storage.
    Since there are no consistency guarantees provided by some Object Storage providers, this PR adds a consistent lock-free way of dealing with Object Storage irrespective of the choice of object storage. In order to achieve this co-ordination, blocks are not deleted directly. Instead, blocks are marked for deletion by uploading the deletion-mark.json file for the block that was chosen to be deleted. This file contains Unix time of when the block was marked for deletion. If you want to keep existing behavior, you should add --delete-delay=0s as a flag.
  • #2090 breaking Downsample command: the downsample command has moved and is now a sub-command of the thanos bucket sub-command; it cannot be called via thanos downsample any more.
  • #2294 Store: optimizations for fetching postings. Queries using =~".*" matchers or negation matchers (!=... or !~...) benefit the most.
  • #2301 Ruler: exit with an error when initialization fails.
  • #2310 Query: report timespan 0 to 0 when discovering no stores.
  • #2330 Store: index-header is no longer experimental. It is enabled by default for store Gateway. You can disable it with new hidden flag: --store.disable-index-header. The --experimental.enable-index-header flag was removed.
  • #1848 Ruler: allow returning error messages when a reload is triggered via HTTP.
  • #2270 All: Thanos components will now print stack traces when they error out.
thanos - v0.12.0-rc.0

Published by squat over 4 years ago

Fixed

  • #2288 Ruler: fixes issue #2281, a bug causing incorrect parsing of query address with path prefix.
  • #2238 Ruler: fixed issue #2204, where a bug in alert queue signaling filled up the queue and alerts were dropped.
  • #2231 Bucket Web: sort chunks by thanos.downsample.resolution for better grouping.
  • #2254 Bucket: fix issue where metrics were registered multiple times in bucket replicate.
  • #2271 Bucket Web: fixed issue #2260, where the bucket passes null when storage is empty.
  • #2339 Query: fix a bug where --store.unhealthy-timeout was never respected.
  • #2208 Query and Rule: fix handling of web.route-prefix to correctly handle / and prefixes that do not begin with a /.
  • #2311 Receive: ensure receive component serves TLS when TLS configuration is provided.
  • #2319 Query: fixed inconsistent naming of metrics.

Added

  • #2252 Query: add new --store-strict flag. More information available here.
  • #2265 Compact: add --wait-interval to specify compaction wait interval between consecutive compact runs when --wait is enabled.
  • #2250 Compact: enable vertical compaction for offline deduplication (experimental). Uses --deduplication.replica-label flag to specify the replica label on which to deduplicate (hidden). Please note that this uses a NAIVE algorithm for merging (no smart replica deduplication, just chaining samples together). This works well for deduplication of blocks with precisely the same samples like those produced by Receiver replication. We plan to add a smarter algorithm in the following weeks.
  • #1714 Compact: the compact component now exposes the bucket web UI when it is run as a long-lived process.
  • #2304 Store: added max_item_size configuration option to memcached-based index cache. This should be set to the max item size configured in memcached (-I flag) in order to not waste network round-trips to cache items larger than the limit configured in memcached.
  • #2297 Store: add --experimental.enable-index-cache-postings-compression flag to enable re-encoding and compressing postings before storing them into the cache. Compressed postings take about 10% of the original size.
  • #2357 Compact and Store: the compact and store components now serve the bucket UI on :<http-port>/loaded, which shows exactly the blocks that are currently seen by compactor and the store gateway. The compactor also serves a different bucket UI on :<http-port>/global, which shows the status of object storage without any filters.
  • #2166 Bucket Web: improve the tooltip for the bucket UI; it was reconstructed and now exposes much more information about blocks.
  • #2172 Store: add support for sharding the store component based on the label hash.
  • #2113 Bucket: added thanos bucket replicate command to replicate blocks from one bucket to another.
  • #1922 Docs: create a new document to explain sharding in Thanos.
  • #2230 Store: optimize conversion of labels.

Changed

  • #2136 breaking Store, Compact, Bucket: schedule block deletion by adding deletion-mark.json. This adds a consistent way for multiple readers and writers to access object storage.
    Since there are no consistency guarantees provided by some Object Storage providers, this PR adds a consistent lock-free way of dealing with Object Storage irrespective of the choice of object storage. In order to achieve this co-ordination, blocks are not deleted directly. Instead, blocks are marked for deletion by uploading the deletion-mark.json file for the block that was chosen to be deleted. This file contains Unix time of when the block was marked for deletion. If you want to keep existing behavior, you should add --delete-delay=0s as a flag.
  • #2090 breaking Downsample command: the downsample command has moved and is now a sub-command of the thanos bucket sub-command; it cannot be called via thanos downsample any more.
  • #2294 Store: optimizations for fetching postings. Queries using =~".*" matchers or negation matchers (!=... or !~...) benefit the most.
  • #2301 Ruler: exit with an error when initialization fails.
  • #2310 Query: report timespan 0 to 0 when discovering no stores.
  • #2330 Store: index-header is no longer experimental. It is enabled by default for store Gateway. You can disable it with new hidden flag: --store.disable-index-header. The --experimental.enable-index-header flag was removed.
  • #1848 Ruler: allow returning error messages when a reload is triggered via HTTP.
  • #2270 All: Thanos components will now print stack traces when they error out.
thanos - v0.11.0

Published by metalmatze over 4 years ago

Fixed

  • #2033 Minio-go: Fixed Issue #1494 support Web Identity providers for IAM credentials for AWS EKS.
  • #1985 Store Gateway: Fixed case where series entry is larger than 64KB in index.
  • #2051 Ruler: Fixed issue where ruler does not expose shipper metrics.
  • #2101 Ruler: Fixed bug where thanos_alert_sender_errors_total was not registered.
  • #1789 Store Gateway: Improve timeouts.
  • #2139 Properly handle SIGHUP for reloading.
  • #2040 UI: Fix URL of alerts in Ruler
  • #2033 Ruler: Fix tracing in Thanos Ruler

Added

  • #2003 Query: Support downsampling for /series.
  • #1952 Store Gateway: Implemented binary index header. This significantly reduces resource consumption (memory, CPU, net bandwidth) for startup and data loading processes as well as baseline memory. This means that adding more blocks into object storage, without querying them will use almost no resources. This, however, still means that querying large amounts of data will result in high spikes of memory and CPU use as before, due to simply fetching large amounts of metrics data. Since we fixed baseline, we are now focusing on query performance optimizations in separate initiatives. To enable experimental index-header mode run store with hidden experimental.enable-index-header flag.
  • #2009 Store Gateway: Added a minimum age that blocks must reach before they are read. Set it to a safe value (e.g. 30m) if your object storage is eventually consistent. GCS and S3 are (roughly) strongly consistent.
  • #1963 Mixin: Add Thanos Ruler alerts.
  • #1984 Query: Add cache-control header to not cache on error.
  • #1870 UI: Persist settings in query.
  • #1969 Sidecar: allow setting http connection pool size via flags.
  • #1967 Receive: Allow local TSDB compaction.
  • #1939 Ruler: Add TLS and authentication support for query endpoints with the --query.config and --query.config-file CLI flags. See documentation for further information.
  • #1982 Ruler: Add support for Alertmanager v2 API endpoints.
  • #2030 Query: Add thanos_proxy_store_empty_stream_responses_total metric for number of empty responses from stores.
  • #2049 Tracing: Support sampling on Elastic APM with new sample_rate setting.
  • #2008 Querier, Receiver, Sidecar, Store: Add gRPC health check endpoints.
  • #2145 Tracing: track query sent to prometheus via remote read api.

Changed

  • #1970 breaking Receive: Use gRPC for forwarding requests between peers. Note that existing values for the --receive.local-endpoint flag and the endpoints in the hashring configuration file must now specify the receive gRPC port and must be updated to be a simple host:port combination, e.g. 127.0.0.1:10901, rather than a full HTTP URL, e.g. http://127.0.0.1:10902/api/v1/receive.
  • #1933 Add a flag --tsdb.wal-compression to configure whether to enable tsdb wal compression in ruler and receiver.
  • #2021 Rename metric thanos_query_duplicated_store_address to thanos_query_duplicated_store_addresses_total and thanos_rule_duplicated_query_address to thanos_rule_duplicated_query_addresses_total.
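
For the breaking Receive change (#1970), existing endpoint values have to be rewritten from full HTTP URLs to host:port pairs that use the gRPC port. A minimal migration sketch follows; the helper name `to_grpc_endpoint` is hypothetical, and the gRPC port must be supplied by you, since it cannot be derived from the old URL.

```python
from urllib.parse import urlsplit


def to_grpc_endpoint(http_url: str, grpc_port: int) -> str:
    """Turn an old-style HTTP receive URL into a host:port gRPC endpoint."""
    host = urlsplit(http_url).hostname
    if host is None:
        raise ValueError(f"not a full URL: {http_url!r}")
    if ":" in host:
        # IPv6 literals need brackets in host:port notation.
        host = f"[{host}]"
    return f"{host}:{grpc_port}"


if __name__ == "__main__":
    old = "http://127.0.0.1:10902/api/v1/receive"
    print(to_grpc_endpoint(old, 10901))  # 127.0.0.1:10901
```

The same conversion applies to both the --receive.local-endpoint value and every endpoint listed in the hashring configuration file.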
thanos - v0.11.0-rc.1

Published by metalmatze over 4 years ago

Fixed

  • #2189 minio-go: Fixed Issue #2181, unable to use IAM metadata credentials
  • #2033 Minio-go: Fixed Issue #1494 support Web Identity providers for IAM credentials for AWS EKS.
  • #1985 Store Gateway: Fixed case where series entry is larger than 64KB in index.
  • #2051 Ruler: Fixed issue where ruler does not expose shipper metrics.
  • #2101 Ruler: Fixed bug where thanos_alert_sender_errors_total was not registered.
  • #1789 Store Gateway: Improve timeouts.
  • #2139 Properly handle SIGHUP for reloading.
  • #2040 UI: Fix URL of alerts in Ruler
  • #2033 Ruler: Fix tracing in Thanos Ruler

Added

  • #2003 Query: Support downsampling for /series.
  • #1952 Store Gateway: Implemented binary index header. This significantly reduces resource consumption (memory, CPU, net bandwidth) for startup and data loading processes as well as baseline memory. This means that adding more blocks into object storage, without querying them will use almost no resources. This, however, still means that querying large amounts of data will result in high spikes of memory and CPU use as before, due to simply fetching large amounts of metrics data. Since we fixed baseline, we are now focusing on query performance optimizations in separate initiatives. To enable experimental index-header mode run store with hidden experimental.enable-index-header flag.
  • #2009 Store Gateway: Added a minimum age that blocks must reach before they are read. Set it to a safe value (e.g. 30m) if your object storage is eventually consistent. GCS and S3 are (roughly) strongly consistent.
  • #1963 Mixin: Add Thanos Ruler alerts.
  • #1984 Query: Add cache-control header to not cache on error.
  • #1870 UI: Persist settings in query.
  • #1969 Sidecar: allow setting http connection pool size via flags.
  • #1967 Receive: Allow local TSDB compaction.
  • #1939 Ruler: Add TLS and authentication support for query endpoints with the --query.config and --query.config-file CLI flags. See documentation for further information.
  • #1982 Ruler: Add support for Alertmanager v2 API endpoints.
  • #2030 Query: Add thanos_proxy_store_empty_stream_responses_total metric for number of empty responses from stores.
  • #2049 Tracing: Support sampling on Elastic APM with new sample_rate setting.
  • #2008 Querier, Receiver, Sidecar, Store: Add gRPC health check endpoints.
  • #2145 Tracing: track query sent to prometheus via remote read api.

Changed

  • #1970 breaking Receive: Use gRPC for forwarding requests between peers. Note that existing values for the --receive.local-endpoint flag and the endpoints in the hashring configuration file must now specify the receive gRPC port and must be updated to be a simple host:port combination, e.g. 127.0.0.1:10901, rather than a full HTTP URL, e.g. http://127.0.0.1:10902/api/v1/receive.
  • #1933 Add a flag --tsdb.wal-compression to configure whether to enable tsdb wal compression in ruler and receiver.
  • #2021 Rename metric thanos_query_duplicated_store_address to thanos_query_duplicated_store_addresses_total and thanos_rule_duplicated_query_address to thanos_rule_duplicated_query_addresses_total.
thanos - v0.11.0-rc.0

Published by metalmatze over 4 years ago

Fixed

  • #2033 Minio-go: Fixed Issue #1494 support Web Identity providers for IAM credentials for AWS EKS.
  • #1985 Store Gateway: Fixed case where series entry is larger than 64KB in index.
  • #2051 Ruler: Fixed issue where ruler does not expose shipper metrics.
  • #2101 Ruler: Fixed bug where thanos_alert_sender_errors_total was not registered.
  • #1789 Store Gateway: Improve timeouts.
  • #2139 Properly handle SIGHUP for reloading.
  • #2040 UI: Fix URL of alerts in Ruler
  • #2033 Ruler: Fix tracing in Thanos Ruler

Added

  • #2003 Query: Support downsampling for /series.
  • #1952 Store Gateway: Implemented binary index header. This significantly reduces resource consumption (memory, CPU, net bandwidth) for startup and data loading processes as well as baseline memory. This means that adding more blocks into object storage, without querying them will use almost no resources. This, however, still means that querying large amounts of data will result in high spikes of memory and CPU use as before, due to simply fetching large amounts of metrics data. Since we fixed baseline, we are now focusing on query performance optimizations in separate initiatives. To enable experimental index-header mode run store with hidden experimental.enable-index-header flag.
  • #2009 Store Gateway: Added a minimum age that blocks must reach before they are read. Set it to a safe value (e.g. 30m) if your object storage is eventually consistent. GCS and S3 are (roughly) strongly consistent.
  • #1963 Mixin: Add Thanos Ruler alerts.
  • #1984 Query: Add cache-control header to not cache on error.
  • #1870 UI: Persist settings in query.
  • #1969 Sidecar: allow setting http connection pool size via flags.
  • #1967 Receive: Allow local TSDB compaction.
  • #1939 Ruler: Add TLS and authentication support for query endpoints with the --query.config and --query.config-file CLI flags. See documentation for further information.
  • #1982 Ruler: Add support for Alertmanager v2 API endpoints.
  • #2030 Query: Add thanos_proxy_store_empty_stream_responses_total metric for number of empty responses from stores.
  • #2049 Tracing: Support sampling on Elastic APM with new sample_rate setting.
  • #2008 Querier, Receiver, Sidecar, Store: Add gRPC health check endpoints.
  • #2145 Tracing: track query sent to prometheus via remote read api.

Changed

  • #1970 breaking Receive: Use gRPC for forwarding requests between peers. Note that existing values for the --receive.local-endpoint flag and the endpoints in the hashring configuration file must now specify the receive gRPC port and must be updated to be a simple host:port combination, e.g. 127.0.0.1:10901, rather than a full HTTP URL, e.g. http://127.0.0.1:10902/api/v1/receive.
  • #1933 Add a flag --tsdb.wal-compression to configure whether to enable tsdb wal compression in ruler and receiver.
  • #2021 Rename metric thanos_query_duplicated_store_address to thanos_query_duplicated_store_addresses_total and thanos_rule_duplicated_query_address to thanos_rule_duplicated_query_addresses_total.
thanos - v0.10.1

Published by bwplotka over 4 years ago

Patch release fixing /api/v1/series.

See details in the CHANGELOG

thanos - v0.10.0

Published by GiedriusS almost 5 years ago

Thanks to all contributors! ❤️

Highlights: Store now supports memcached; StoreAPI has a new skip-chunks option which is used to greatly speed-up the /api/v1/series end-point; Store/Compactor has improved synchronization of meta JSON files; Ruler supports TLS and authentication; fixed a potential data loss when uploading older blocks or when the upload is taking a long time while the Compactor is running; Compaction process should take significantly less RAM but a longer time.

❗ memcached support is marked experimental for now ❗

As always, here is the detailed changelog:

Fixed

  • #1919 Compactor: Fixed potential data loss when uploading older blocks, or when an upload takes a long time while the compactor is running.

  • #1937 Compactor: Improved synchronization of meta JSON files.
    Compactor now properly handles partial block uploads for all operations, such as applying retention, downsampling, and compaction. Additionally:

    • Removed thanos_compact_sync_meta_* metrics. Use thanos_blocks_meta_* metrics instead.
    • Added thanos_consistency_delay_seconds and thanos_compactor_aborted_partial_uploads_deletion_attempts_total metrics.
  • #1936 Store: Improved synchronization of meta JSON files. Store now properly handles corrupted disk cache. Added meta.json sync metrics.

  • #1856 Receive: close DBReadOnly after flushing to fix a memory leak.

  • #1882 Receive: upload to object storage as 'receive' rather than 'sidecar'.

  • #1907 Store: Fixed the duration unit for the metric thanos_bucket_store_series_gate_duration_seconds.

  • #1931 Compact: Fixed the compactor exiting successfully when an error had actually occurred while compacting a block group.

  • #1872 Ruler: /api/v1/rules now shows a properly formatted value

  • #1945 master container images are now built with Go 1.13

  • #1956 Ruler: now properly ignores duplicated query addresses

  • #1975 Store Gateway: fixed panic caused by memcached servers selector when there's 1 memcached node

Added

  • #1852 Add support for AWS_CONTAINER_CREDENTIALS_FULL_URI by upgrading to minio-go v6.0.44
  • #1854 Update Rule UI to support alerts count displaying and filtering.
  • #1838 Ruler: Add TLS and authentication support for Alertmanager with the --alertmanagers.config and --alertmanagers.config-file CLI flags. See documentation for further information.
  • #1838 Ruler: Add a new --alertmanagers.sd-dns-interval CLI option to specify the interval between DNS resolutions of Alertmanager hosts.
  • #1881 Store Gateway: memcached support for index cache. See documentation for further information.
  • #1904 Add a skip-chunks option in Store Series API to improve the response time of /api/v1/series endpoint.
  • #1910 Query: /api/v1/labels now understands POST - useful for sending bigger requests

Changed

  • #1947 Upgraded Prometheus dependencies to v2.15.2. This includes:

    • Compactor: Significant reduction of memory footprint for compaction and downsampling process.
    • Querier: Accepting spaces between time range and square bracket. e.g [ 5m]
    • Querier: Improved PromQL parser performance.
  • #1833 --shipper.upload-compacted flag has been promoted to non hidden, non experimental state. More info available here.

  • #1867 Ruler: now sets a Thanos/$version User-Agent in requests

  • #1887 Service discovery now deduplicates targets between different target groups

thanos - v0.10.0-rc.1

Published by GiedriusS almost 5 years ago

thanos - v0.10.0-rc.0

Published by GiedriusS almost 5 years ago

thanos - v0.9.0

Published by bwplotka almost 5 years ago

Thanks to all contributors!

Noteworthy changes: support for AlibabaCloud object storage; LightStep tracing; Ruler fixes; a fixed Store UI page; and new Store Gateway metrics for the startup cycle, plus optimizations.

Added

  • #1678 Add Lightstep as a tracing provider.
  • #1687 Add a new --grpc-grace-period CLI option to components which serve gRPC to set how long to wait until gRPC Server shuts down.
  • #1660 Sidecar: Add a new --prometheus.ready_timeout CLI option to the sidecar to set how long to wait until Prometheus starts up.
  • #1573 AliYun OSS object storage, see documents for further information.
  • #1680 Add a new --http-grace-period CLI option to components which serve HTTP to set how long to wait until HTTP Server shuts down.
  • #1712 Bucket: Rename flag on bucket web component from --listen to --http-address to match other components.
  • #1733 Compactor: New metric thanos_compactor_iterations_total on Thanos Compactor which shows the number of successful iterations.
  • #1758 Bucket: thanos bucket web now supports --web.external-prefix for proxying on a subpath.
  • #1770 Bucket: Add --web.prefix-header flags to allow for bucket UI to be accessible behind a reverse proxy.
  • #1668 Receiver: Added TLS options for both server and client remote write.

Fixed

  • #1656 Store Gateway: Store now starts metric and status probe HTTP server earlier in its start-up sequence. /-/healthy endpoint now starts to respond with success earlier. /metrics endpoint starts serving metrics earlier as well. Make sure to point your readiness probes to the /-/ready endpoint rather than /metrics.
  • #1669 Store Gateway: Fixed store sharding. Now it does not load excluded meta.jsons and load/fetch index-cache.json files.
  • #1670 Sidecar: Fixed un-ordered blocks upload. Sidecar now uploads the oldest blocks first.
  • #1568 Store Gateway: Store now retains the first raw value of a chunk during downsampling to avoid losing some counter resets that occur on an aggregation boundary.
  • #1751 Querier: Fixed labels for StoreUI
  • #1773 Ruler: Fixed the /api/v1/rules endpoint that returned 500 status code with failed to assert type of rule ... message.
  • #1770 Querier: Fixed --web.external-prefix 404s for static resources.
  • #1785 Ruler: The /api/v1/rules endpoints now returns the original rule filenames.
  • #1791 Ruler: Ruler now supports identical rule filenames in different directories.
  • #1562 Querier: Downsampling option now carries through URL.
  • #1675 Querier: Reduced resource usage while using certain queries like offset.
  • #1725 & #1718 Store Gateway: Per request memory improvements.

Changed

  • #1666 Compact: thanos_compact_group_compactions_total now counts block compactions, that is, operations that resulted in a compacted block. The old behaviour is now exposed by two new metrics, thanos_compact_group_compaction_runs_started_total and thanos_compact_group_compaction_runs_completed_total, which count compaction runs overall.
  • #1748 Updated all dependencies.
  • #1694 The prober_ready and prober_healthy metrics are removed in favour of status. The status metric now exposes the same information with a check label, whose value is "healthy" or "ready" depending on the probe.
  • #1790 Ruler: Fixes subqueries support for ruler.
  • #1769 & #1545 Adjusted most of the metrics histogram buckets.
thanos - v0.9.0-rc.0

Published by bwplotka almost 5 years ago

RC release for v0.9.0

See changes here

thanos - v0.8.1

Published by bwplotka about 5 years ago

Fixed

  • #1632 Removes the duplicated external labels detection on Thanos Querier. It is warning only now. Also made Store Gateway compatible with older Querier versions.
    NOTE: thanos_store_nodes_grpc_connections metric is now per external_labels and store_type. It is a recommended metric for Querier storeAPIs. thanos_store_node_info is marked as obsolete and will be removed in next release.
    NOTE2: Store Gateway is now advertising artificial: "@thanos_compatibility_store_type=store" label. This is to have the current Store Gateway compatible with Querier pre v0.8.0.
    This label can be disabled by hidden debug.advertise-compatibility-label=false flag on Store Gateway.

See full CHANGELOG here

thanos - v0.8.0

Published by bwplotka about 5 years ago

Lots of improvements this release! Outstanding items:

  • First Katacoda tutorial! 🐱
  • Fixed deletion order that caused Compactor to produce unneeded 👻 blocks with random files missing.
  • Store GW memory improvements (more to come!).
  • Querier allows multiple deduplication labels.
  • Both Compactor and Store Gateway can be sharded within the same bucket using relabelling!
  • Data exposed by Sidecar from Prometheus can now be limited to a given min-time (e.g. 3h only).
  • Numerous Thanos Receive improvements.

Make sure you check out Prometheus 2.13.0 as well. New release drastically improves usage and resource consumption of both Prometheus and sidecar with Thanos: https://prometheus.io/blog/2019/10/10/remote-read-meets-streaming/

Added

  • #1619 Thanos sidecar can now limit the min time range of the data it exposes from Prometheus.
  • #1583 Thanos sharding:
    • Add relabel config (--selector.relabel-config-file and --selector.relabel-config) into Thanos Store and Compact components.
      Selecting blocks to serve depends on the result of block labels relabeling.
    • For store gateway, advertise labels from "approved" blocks.
  • #1540 Thanos Downsample added /-/ready and /-/healthy endpoints.
  • #1538 Thanos Rule added /-/ready and /-/healthy endpoints.
  • #1537 Thanos Receive added /-/ready and /-/healthy endpoints.
  • #1460 Thanos Store added /-/ready and /-/healthy endpoints.
  • #1534 Thanos Query added /-/ready and /-/healthy endpoints.
  • #1533 Thanos inspect now supports the timeout flag.
  • #1496 Thanos Receive now supports setting block duration.
  • #1362 Optional replicaLabels param for /query and
    /query_range querier endpoints. When provided, it overrides the query.replica-label CLI flags.
  • #1482 Thanos now supports Elastic APM as tracing provider.
  • #1612 Thanos Rule added resendDelay flag.
  • #1480 Thanos Receive flushes storage on hashring change.
  • #1613 Thanos Receive now traces forwarded requests.
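
To make the sharding entry (#1583) more concrete, here is a minimal sketch of how a single relabel rule with a keep action might decide which blocks a shard serves. It is a simplified model and not Thanos's implementation: the helper keep_blocks, the block names, and the cluster label are hypothetical, and the only behaviour borrowed from Prometheus-style relabeling is that the regex must match the whole label value.

```python
import re


def keep_blocks(blocks, source_label, regex):
    """Model one relabel rule with action "keep" applied to block labels.

    A block is kept only if the value of `source_label` fully matches
    `regex`; a missing label is treated as an empty string.
    """
    pattern = re.compile(regex)
    kept = []
    for block in blocks:
        value = block["labels"].get(source_label, "")
        if pattern.fullmatch(value):
            kept.append(block)
    return kept


# Hypothetical blocks with their external labels.
blocks = [
    {"id": "block-a", "labels": {"cluster": "eu-1"}},
    {"id": "block-b", "labels": {"cluster": "us-1"}},
    {"id": "block-c", "labels": {"cluster": "eu-2"}},
]

shard = keep_blocks(blocks, "cluster", r"eu-.*")
print([b["id"] for b in shard])  # ['block-a', 'block-c']
```

Running a second Store Gateway or Compactor with the complementary rule would then cover the remaining blocks in the same bucket.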

Changed

  • #1362 query.replica-label configuration can be provided more than
    once for multiple deduplication labels like: --query.replica-label=prometheus_replica --query.replica-label=service.
  • #1581 Thanos Store now can use smaller buffer sizes for Bytes pool; reducing memory for some requests.
  • #1622 & #1590 Updated to Go 1.13.1
  • #1498 Thanos Receive changed flag labels to label to be consistent with other commands.
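
As an illustration of the per-request replica labels from #1362, the sketch below builds a query URL that repeats the replicaLabels parameter once per label. The base URL, the /api/v1/query path, the helper name build_query_url, and the exact parameter spelling are assumptions; check the API of your Querier version before relying on them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url, promql, replica_labels=()):
    """Build an instant-query URL with optional per-request replica labels."""
    params = [("query", promql)]
    # Repeat the parameter once per deduplication label.
    params += [("replicaLabels", label) for label in replica_labels]
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Hypothetical Querier address; the labels mirror the flag example above.
url = build_query_url(
    "http://localhost:10902", "up", ["prometheus_replica", "service"]
)
print(url)
print(parse_qs(urlsplit(url).query)["replicaLabels"])
```

When the list is left empty, no replicaLabels parameter is sent, so the labels configured through --query.replica-label would apply.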

Fixed

  • #1525 Thanos now deletes a block's files in the correct order, allowing partial blocks to be detected without problems.
  • #1505 Thanos Store now removes invalid local cache blocks.
  • #1587 Thanos Sidecar cleans up all cache dirs after each compaction run.
  • #1582 Thanos Rule correctly parses the Alertmanager URL if it contains more than one +.
  • #1544 Iterating over object store is resilient to the edge case for some providers.
  • #1469 Fixed potential Azure failures (EOF) when requesting more data than the blob has.
  • #1512 Thanos Store fixed memory leak for chunk pool.
  • #1488 Thanos Rule now correctly links to query URL from rules and alerts.

See full CHANGELOG here

thanos - v0.7.0

Published by domgreen about 5 years ago

v0.7.0

Accepted into CNCF:

Added

  • #1378 Thanos Receive now exposes thanos_receive_config_hash, thanos_receive_config_last_reload_successful and thanos_receive_config_last_reload_success_timestamp_seconds metrics to track latest configuration change
  • #1268 Thanos Sidecar added support for newest Prometheus streaming remote read added here. This massively reduces the memory required by a single
    request for both Prometheus and sidecar. A single request should now take a constant amount of memory on sidecar, so resource consumption is straightforward to predict. This is used if you have Prometheus 2.13 or 2.12-master.
  • #1358 Added part_size configuration option for HTTP multipart requests minimum part size for S3 storage type
  • #1363 Thanos Receive now exposes thanos_receive_hashring_nodes and thanos_receive_hashring_tenants metrics to monitor status of hash-rings
  • #1395 Thanos Sidecar added /-/ready and /-/healthy endpoints to Thanos sidecar.
  • #1297 Thanos Compact added /-/ready and /-/healthy endpoints to Thanos compact.
  • #1431 Thanos Query added hidden flag to allow the use of downsampled resolution data for instant queries.
  • #1408 Thanos Store Gateway can now be limited to the time ranges it will serve (time sharding). Flags: min-time & max-time
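
The time sharding from #1408 can be pictured as a filter over block time ranges. The sketch below is a simplified model under stated assumptions, not the Store Gateway's code: the Block type, the block names, and the choice to keep any block that overlaps the [min_time, max_time] window are all illustrative.

```python
from dataclasses import dataclass

HOUR_MS = 3_600_000  # timestamps are modelled in milliseconds


@dataclass
class Block:
    name: str
    min_time: int
    max_time: int


def served_blocks(blocks, min_time, max_time):
    """Keep blocks whose time range overlaps the [min_time, max_time] window."""
    return [
        b for b in blocks
        if b.max_time >= min_time and b.min_time <= max_time
    ]


# Three hypothetical two-hour blocks.
blocks = [
    Block("old", 0, 2 * HOUR_MS),
    Block("mid", 2 * HOUR_MS, 4 * HOUR_MS),
    Block("new", 4 * HOUR_MS, 6 * HOUR_MS),
]

selected = served_blocks(blocks, min_time=3 * HOUR_MS, max_time=5 * HOUR_MS)
print([b.name for b in selected])  # ['mid', 'new']
```

Several Store Gateways with adjacent windows could then split one bucket by time between them.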

Changed

  • #1414 Upgraded important dependencies: Prometheus to 2.12-rc.0. TSDB is now part of Prometheus.
  • #1380 Upgraded important dependencies: Prometheus to 2.11.1 and TSDB to 0.9.1. Some changes affecting Querier:
    • [ENHANCEMENT] Query performance improvement: Efficient iteration and search in HashForLabels and HashWithoutLabels. #5707
    • [ENHANCEMENT] Optimize queries using regexp for set lookups. tsdb#602
    • [BUGFIX] prometheus_tsdb_compactions_failed_total is now incremented on any compaction failure. tsdb#613
    • [BUGFIX] PromQL: Correctly display {__name__="a"}.
  • #1338 Thanos Query still warns on store API duplicates, but allows a single one from the duplicated set. This is to gracefully warn about the problematic logic without disrupting immediately.
  • #1385 Thanos Compact exposes the downsampling.disable flag to disable downsampling.

Fixed

  • #1327 Thanos Query /series API end-point now properly returns an empty array just like Prometheus if there are no results
  • #1302 Thanos now efficiently reuses HTTP keep-alive connections
  • #1371 Thanos Receive fixed race condition in hashring
  • #1430 Thanos fixed value of GOMAXPROCS inside container.

Deprecated

  • #1458 Thanos Query and Receive now use common instrumentation middleware. As a result, in favour of http_requests_total and http_request_duration_seconds_bucket, Thanos Query no longer exposes the thanos_query_api_instant_query_duration_seconds, thanos_query_api_range_query_duration_second metrics and Thanos Receive no longer exposes thanos_http_request_duration_seconds, thanos_http_requests_total, thanos_http_response_size_bytes.
  • #1423 Thanos Bench deprecated.
thanos - v0.7.0-rc.0

Published by domgreen about 5 years ago

TL;DR: Move to CNCF, added streaming between Prometheus and Sidecar, allowed time sharding on Store Gateway, and many bug fixes.

More detailed information on the release can be found here https://github.com/thanos-io/thanos/blob/master/CHANGELOG.md

thanos - v0.6.1

Published by bwplotka about 5 years ago

thanos - v0.6.0

Published by GiedriusS over 5 years ago

Added

TL;DR: Jaeger tracing support (tracing flag changed), various observability improvements, Thanos receiver improvements, improved external label propagation, including federated Queriers (!) and other fixes.

NOTE: Thanks to improved external labels propagation, if you have duplicate queries in your Querier configuration with hierarchical federation of multiple Queriers, Thanos will now detect this case and block all duplicates. New releases (potentially in v0.6.1) will just warn and block all but one.

  • #1097 Added thanos check rules linter for Thanos rule rules files.

  • #1253 Add support for specifying a maximum amount of retries when using Azure Blob storage (default: no retries).

  • #1244 Thanos Compact now exposes new metrics thanos_compact_downsample_total and thanos_compact_downsample_failures_total which are useful to catch when errors happen

  • #1260 Thanos Query/Rule now exposes metrics thanos_querier_store_apis_dns_provider_results and thanos_ruler_query_apis_dns_provider_results which tell how many addresses were configured and how many were actually discovered respectively

  • #1248 Add a web UI to show the state of remote storage.

  • #1217 Thanos Receive gained basic hashring support

  • #1262 Thanos Receive got a new metric thanos_http_requests_total which shows how many requests were handled by it

  • #1243 Thanos Receive gained the ability to forward time series data between nodes. Now you can pass the hashring configuration via --receive.hashrings-file; the refresh interval via --receive.hashrings-file-refresh-interval; the local node's name via --receive.local-endpoint; and finally the name of the header used to determine the tenant via --receive.tenant-header.

  • #1147 Support for the Jaeger tracer has been added!

breaking New common flags were added for configuring tracing: --tracing.config-file and --tracing.config. You can either pass a file to Thanos with the tracing configuration or pass it in the command line itself. Old --gcloudtrace.* flags were removed ⚠️

To migrate over the old --gcloudtrace.* configuration, your tracing configuration should look like this:

---
type: STACKDRIVER
config:
- service_name: 'foo'
  project_id: '123'
  sample_factor: 123

The other type you can use is JAEGER now. The config keys and values are Jaeger specific and you can find all of the information here.
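
As a rough illustration of the two tracing flags described above, the following sketch turns the migration config into a command-line argument, either by writing it to a temporary file for --tracing.config-file or by passing it inline via --tracing.config. The helper name tracing_args and the use of a temporary file are assumptions made for the example; the config text is copied from the snippet above.

```python
import os
import tempfile

# Tracing configuration copied from the migration example above.
TRACING_CONFIG = """\
---
type: STACKDRIVER
config:
- service_name: 'foo'
  project_id: '123'
  sample_factor: 123
"""


def tracing_args(config_text, directory=None, inline=False):
    """Return the CLI argument list that hands `config_text` to Thanos."""
    if inline:
        # Pass the whole configuration on the command line itself.
        return [f"--tracing.config={config_text}"]
    # Otherwise write the configuration to a file and point the flag at it.
    fd, path = tempfile.mkstemp(suffix=".yaml", dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(config_text)
    return [f"--tracing.config-file={path}"]


file_args = tracing_args(TRACING_CONFIG)
inline_args = tracing_args(TRACING_CONFIG, inline=True)
print(file_args[0])
```

Either list can be appended to the arguments of the Thanos component being started; only one of the two flags is needed.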

Changed

  • #1284 Add support for multiple label-sets in Info gRPC service. This deprecates the single Labels slice of the InfoResponse. In a future release, backward compatible handling for the single set of Labels will be removed. Upgrading to v0.6.0 or higher is advised.

  • #1314 Removes http_request_duration_microseconds (Summary) and adds http_request_duration_seconds (Histogram) from http server instrumentation used in Thanos APIs and UIs.

  • #1287 Sidecar now waits on Prometheus' external labels before starting the uploading process

  • #1261 Thanos Receive now exposes metrics thanos_http_request_duration_seconds and thanos_http_response_size_bytes properly for each handler

  • #1274 Iteration limit has been lifted from the LRU cache so there should be no more spam of error messages as they were harmless

  • #1321 Thanos Query now fails early on a query which only uses external labels - this improves clarity in certain situations

Fixed

  • #1227 Some context handling issues were fixed in Thanos Compact; some unnecessary memory allocations were removed in the hot path of Thanos Store.

  • #1183 Compactor now correctly propagates retriable/haltable errors which means that it will not unnecessarily restart if such an error occurs

  • #1231 Receive now correctly handles SIGINT and closes without deadlocking

  • #1278 Fixed inflated values problem with sum() on Thanos Query

  • #1280 Fixed a problem with concurrent writes to a map in Thanos Query while rendering the UI

  • #1311 Fixed occasional panics in Compact and Store when using Azure Blob cloud storage caused by lack of error checking in client library.

  • #1322 Removed duplicated closing of the gRPC listener - this gets rid of harmless messages like store gRPC listener: close tcp 0.0.0.0:10901: use of closed network connection when those programs are being closed

Deprecated

  • #1216 the old "Command-line flags" has been removed from Thanos Query UI since it was not populated and because we are striving for consistency
thanos - v0.6.0-rc.0

Published by GiedriusS over 5 years ago

thanos - v0.5.0

Published by bwplotka over 5 years ago

TL;DR: Store LRU cache is no longer leaking, Upgraded Thanos UI to Prometheus 2.9, Fixed auto-downsampling, Moved to Go 1.12.5 and more.

This version moved tarballs to Golang 1.12.5 from 1.11 as well, so same warning applies if you use container_memory_usage_bytes from cadvisor. Use container_memory_working_set_bytes instead.

breaking As announced a couple of times, this release also removes gossip with all configuration flags (--cluster.*).

Fixed

  • #1142 fixed major leak on store LRU cache for index items (postings and series).
  • #1163 sidecar is no longer blocking for custom Prometheus versions/builds. It only checks if flags return non 404, then it performs optional checks.
  • #1146 store/bucket: make getFor() work with interleaved resolutions.
  • #1157 querier correctly handles duplicated stores when some store changes external labels in place.

Added

  • #1094 Allow configuring the response header timeout for the S3 client.

Changed

  • #1118 breaking swift: Added support for cross-domain authentication by introducing userDomainID, userDomainName, projectDomainID, projectDomainName.
    The outdated terms tenantID, tenantName are deprecated and have been replaced by projectID, projectName.

  • #1066 Upgrade Thanos ui to Prometheus v2.9.1.

    Changes from the upstream:

    • query:
      • [ENHANCEMENT] Update moment.js and moment-timezone.js PR #4679
      • [ENHANCEMENT] Support to query elements by a specific time PR #4764
      • [ENHANCEMENT] Update to Bootstrap 4.1.3 PR #5192
      • [BUGFIX] Limit number of metrics in prometheus UI PR #5139
      • [BUGFIX] Web interface Quality of Life improvements PR #5201
    • rule:
      • [ENHANCEMENT] Improve rule views by wrapping lines PR #4702
      • [ENHANCEMENT] Show rule evaluation errors on rules page PR #4457
  • #1156 Moved CI and docker multistage to Golang 1.12.5 for latest mem alloc improvements.

  • #1103 Updated go-cos deps. (COS bucket client).

  • #1149 Updated google Golang API deps (GCS bucket client).

  • #1190 Updated minio deps (S3 bucket client). This fixes minio retries.

  • #1133 Use prometheus v2.9.2, common v0.4.0 & tsdb v0.8.0.

    Changes from the upstreams:

    • store gateway:
      • [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
    • store gateway & compactor:
      • [BUGFIX] Fix fd and vm_area leak on error path in chunks.NewDirReader.
      • [BUGFIX] Fix fd and vm_area leak on error path in index.NewFileReader.
    • query:
      • [BUGFIX] Make sure subquery range is taken into account for selection #5467
      • [ENHANCEMENT] Check for cancellation on every step of a range evaluation. #5131
      • [BUGFIX] Exponentiation operator to drop metric name in result of operation. #5329
      • [BUGFIX] Fix output sample values for scalar-to-vector comparison operations. #5454
    • rule:
      • [BUGFIX] Reload rules: copy state on both name and labels. #5368

Deprecated

  • #1008 breaking Removed Gossip implementation. All --cluster.* flags removed and Thanos will error out if any is provided.

See full CHANGELOG here