Apache Druid: a high performance real-time analytics database.
Apache-2.0 License
Druid 0.9.1.1 contains only one change since Druid 0.9.1, #3204, which addresses a bug with the Coordinator web console. The full list of changes for the Druid 0.9.1 line is here: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed
Query time lookup (QTL) functionality has been substantially reworked in this release. Most users will need to update their configurations and queries.
The druid-namespace-lookup extension is now deprecated, and will be removed in a future version of Druid. Users should migrate to the new druid-lookups-cached-global extension. Both extensions can be loaded simultaneously to simplify migration. For details about migrating, see Transitioning to lookups-cached-global in the documentation.
Aside from the QTL changes, please note the following changes:
/druid/coordinator/v1/datasources/{dataSourceName}?kill=true&interval={myISO8601Interval}
REST endpoint is now deprecated. The new /druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}?kill=true
REST endpoint can be used instead. The standard Druid update process described by http://druid.io/docs/0.9.1.1/operations/rolling-updates.html should be followed for rolling updates.
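As a sketch of the migration (the coordinator address, datasource, and interval below are illustrative placeholders, not values from this release):

```shell
# Old vs. new kill endpoints; host, datasource, and interval are placeholders.
COORDINATOR="http://localhost:8081"
DATASOURCE="wikipedia"

# Deprecated form: interval passed as a query parameter.
OLD_URL="$COORDINATOR/druid/coordinator/v1/datasources/$DATASOURCE?kill=true&interval=2016-06-01/2016-06-02"

# New form: the interval moves into the path, with '/' replaced by '_'.
NEW_URL="$COORDINATOR/druid/coordinator/v1/datasources/$DATASOURCE/intervals/2016-06-01_2016-06-02?kill=true"

# Issue the kill with, e.g.: curl -XDELETE "$NEW_URL"
echo "$NEW_URL"
```

The '_' interval separator in the path is an assumption here; check the coordinator documentation for your version before scripting against it.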
Druid 0.9.1 is the first version to include the experimental Kafka indexing service, utilizing a new Kafka-type indexing task and a supervisor that runs within the Druid overlord. The Kafka indexing service provides an exactly-once ingestion guarantee and does not have the restriction of events requiring timestamps which fall within a window period. More details about this feature are available in the documentation: http://druid.io/docs/0.9.1.1/development/extensions-core/kafka-ingestion.html.
Note: The Kafka indexing service uses the Java Kafka consumer that was introduced in Kafka 0.9. As there were protocol changes made in this version, Kafka 0.9 consumers are not compatible with older brokers and you will need to ensure that your Kafka brokers are version 0.9 or better. Details on upgrading to the latest version of Kafka can be found here: http://kafka.apache.org/documentation.html#upgrade
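A supervisor is created by POSTing a spec to the overlord. The following is only a skeletal sketch — the datasource name, topic, broker address, and field values are placeholders; see the kafka-ingestion documentation linked above for the full spec:

```shell
# Write a skeletal Kafka supervisor spec (all values are illustrative only).
cat > kafka-supervisor.json <<'EOF'
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "metrics-kafka",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": [] }
      }
    },
    "metricsSpec": [ { "type": "count", "name": "count" } ],
    "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
  },
  "tuningConfig": { "type": "kafka" },
  "ioConfig": {
    "topic": "metrics",
    "consumerProperties": { "bootstrap.servers": "kafkabroker.example:9092" },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H"
  }
}
EOF
# Submit to the overlord (assumed here to be at localhost:8090):
# curl -XPOST -H 'Content-Type: application/json' -d @kafka-supervisor.json \
#   http://localhost:8090/druid/indexer/v1/supervisor
echo "wrote kafka-supervisor.json"
```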
#2656 Supervisor for KafkaIndexTask
#2602 implement special distinctcount
#2220 Appenderators, DataSource metadata, KafkaIndexTask
#2424 Enabling datasource level authorization in Druid
#2410 statsd-emitter
#1576 [QTL] Query time lookup cluster wide config
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed+label%3AFeature
#2972 Improved Segment Distribution (new cost function)
#2931 Optimize filter for timeseries, search, and select queries
#2753 More consistent empty-set filtering behavior on multi-value columns
#2727 BoundFilter optimizations, and related interface changes.
#2711 All Filters should work with FilteredAggregators
#2690 Allow filters to use extraction functions
#2577 Implement native in filter
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed+label%3AImprovement
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed+label%3ABug
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed+label%3ADocumentation
Thanks to everyone who contributed to this release!
@acslk
@b-slim
@binlijin
@bjozet
@dclim
@drcrallen
@du00cs
@erikdubbelboer
@fjy
@gaodayue
@gianm
@guobingkun
@harshjain2
@himanshug
@jaehc
@javasoze
@jisookim0513
@jon-wei
@JonStrabala
@kilida
@lizhanhui
@michaelschiff
@mrijke
@navis
@nishantmonu51
@pdeva
@pjain1
@rasahner
@sascha-coenen
@se7entyse7en
@shekhargulati
@sirpkt
@skilledmonster
@spektom
@xvrl
@yuppie-flu
Druid 0.9.1 contains hundreds of performance improvements, stability improvements, and bug fixes from over 30 contributors. Major new features include an experimental Kafka Supervisor to support exactly-once consumption from Apache Kafka, support for cluster-wide query-time lookups (QTL), and an improved segment balancing algorithm.
The full list of changes is here: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed
Query time lookup (QTL) functionality has been substantially reworked in this release. Most users will need to update their configurations and queries.
The druid-namespace-lookup extension is now deprecated, and will be removed in a future version of Druid. Users should migrate to the new druid-lookups-cached-global extension. Both extensions can be loaded simultaneously to simplify migration. For details about migrating, see Transitioning to lookups-cached-global in the documentation.
Aside from the QTL changes, please note the following changes:
/druid/coordinator/v1/datasources/{dataSourceName}?kill=true&interval={myISO8601Interval}
REST endpoint is now deprecated. The new /druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}?kill=true
REST endpoint can be used instead. The standard Druid update process described by http://druid.io/docs/0.9.1/operations/rolling-updates.html should be followed for rolling updates.
Druid 0.9.1 is the first version to include the experimental Kafka indexing service, utilizing a new Kafka-type indexing task and a supervisor that runs within the Druid overlord. The Kafka indexing service provides an exactly-once ingestion guarantee and does not have the restriction of events requiring timestamps which fall within a window period. More details about this feature are available in the documentation: http://druid.io/docs/0.9.1/development/extensions-core/kafka-ingestion.html.
Note: The Kafka indexing service uses the Java Kafka consumer that was introduced in Kafka 0.9. As there were protocol changes made in this version, Kafka 0.9 consumers are not compatible with older brokers and you will need to ensure that your Kafka brokers are version 0.9 or better. Details on upgrading to the latest version of Kafka can be found here: http://kafka.apache.org/documentation.html#upgrade
#2656 Supervisor for KafkaIndexTask
#2602 implement special distinctcount
#2220 Appenderators, DataSource metadata, KafkaIndexTask
#2424 Enabling datasource level authorization in Druid
#2410 statsd-emitter
#1576 [QTL] Query time lookup cluster wide config
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed+label%3AFeature
#2972 Improved Segment Distribution (new cost function)
#2931 Optimize filter for timeseries, search, and select queries
#2753 More consistent empty-set filtering behavior on multi-value columns
#2727 BoundFilter optimizations, and related interface changes.
#2711 All Filters should work with FilteredAggregators
#2690 Allow filters to use extraction functions
#2577 Implement native in filter
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed+label%3AImprovement
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed+label%3ABug
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.1+is%3Aclosed+label%3ADocumentation
Thanks to everyone who contributed to this release!
@acslk
@b-slim
@binlijin
@bjozet
@dclim
@drcrallen
@du00cs
@erikdubbelboer
@fjy
@gaodayue
@gianm
@guobingkun
@harshjain2
@himanshug
@jaehc
@javasoze
@jisookim0513
@jon-wei
@JonStrabala
@kilida
@lizhanhui
@michaelschiff
@mrijke
@navis
@nishantmonu51
@pdeva
@pjain1
@rasahner
@sascha-coenen
@se7entyse7en
@shekhargulati
@sirpkt
@skilledmonster
@spektom
@xvrl
@yuppie-flu
Published by gianm over 8 years ago
Druid 0.9.0 introduces an update to the extension system that requires configuration changes. There were additionally over 400 pull requests from 0.8.3 to 0.9.0. Below we highlight the more important changes in this release.
Full list of changes is here: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.0+is%3Aclosed
In Druid 0.9, we have refactored the extension loading mechanism. The main reason behind this change is to make Druid load extensions from the local file system without having to download dependencies from the internet at runtime.
To learn all about the new extension loading mechanism, see Include extensions and Include Hadoop Dependencies. If you are impatient, here is the summary.
The following properties have been deprecated:
druid.extensions.coordinates
druid.extensions.remoteRepositories
druid.extensions.localRepository
druid.extensions.defaultVersion
Instead, specify druid.extensions.loadList, druid.extensions.directory and druid.extensions.hadoopDependenciesDir.
druid.extensions.loadList specifies the list of extensions that will be loaded by Druid at runtime. An example would be druid.extensions.loadList=["druid-datasketches", "mysql-metadata-storage"].
druid.extensions.directory specifies the directory where all the extensions live. An example would be druid.extensions.directory=/xxx/extensions.
Note that the mysql-metadata-storage extension is not packaged in the Druid distribution due to licensing issues. You will have to manually download it from druid.io, decompress it, and then put it in the specified extensions directory.
druid.extensions.hadoopDependenciesDir specifies the directory where all the Hadoop dependencies live. An example would be druid.extensions.hadoopDependenciesDir=/xxx/hadoop-dependencies. Note: We didn't change the way of specifying which Hadoop version to use, so you just need to make sure the Hadoop version you want to use exists underneath /xxx/hadoop-dependencies.
You might now wonder whether you have to manually put extensions inside /xxx/extensions and /xxx/hadoop-dependencies. The answer is no; we have already created them for you. Download the latest Druid tarball at http://druid.io/downloads.html, unpack it, and you will see extensions and hadoop-dependencies folders there. Simply copy them to /xxx/extensions and /xxx/hadoop-dependencies respectively, and you are all set!
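Putting the summary together, the relevant lines of common.runtime.properties might look like this (the /xxx directory paths are placeholders, as above):

```properties
druid.extensions.directory=/xxx/extensions
druid.extensions.hadoopDependenciesDir=/xxx/hadoop-dependencies
druid.extensions.loadList=["druid-datasketches", "mysql-metadata-storage"]
```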
If the extension or the Hadoop dependency you want to load is not included in the core extension, you can use pull-deps to download it to your extension directory.
If you want to load your own extension, you can first run mvn install to install it into your local repository, and then use pull-deps to download it to your extension directory.
Please feel free to leave any questions regarding the migration.
Extensions have now also been refactored in core and contrib extensions. Core extensions will be maintained by Druid committers and are packaged as part of the download tarball. Contrib extensions are community maintained and can be installed as needed. For more information, please see here.
Until Druid 0.8.x the order of dimensions given at indexing time did not affect the way data gets indexed. Rows would be ordered first by timestamp, then by dimension values, in lexicographical order of dimension names.
As of Druid 0.9.0, Druid respects the given dimension order and will order rows first by timestamp, then by dimension values, in the given dimension order.
This means segments may now vary in size depending on the order in which dimensions are given. Specifying a dimension with many unique values first may result in worse compression than specifying dimensions with repeating values first.
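For example, given a hypothetical dimensionsSpec, listing low-cardinality dimensions before high-cardinality ones will generally compress better:

```json
"dimensionsSpec": {
  "dimensions": ["country", "city", "user_id"]
}
```

Here country (few distinct values, many repeats) sorts ahead of user_id (nearly unique per row); the reverse order would tend to produce larger segments.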
As indicated in the 0.8.3 release notes, min/max aggregators have been removed in favor of doubleMin, doubleMax, longMin, and longMax aggregators.
If you have any issues starting up because of this, please see https://github.com/druid-io/druid/issues/2749
druid.indexer.task.baseDir and druid.indexer.task.baseTaskDir now default to using the standard Java temporary directory specified by the java.io.tmpdir system property, instead of /tmp.
Other issues to be aware of: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.0+is%3Aclosed+label%3A%22Release+Notes%22 and https://github.com/druid-io/druid/issues?q=milestone%3A0.9.0+is%3Aclosed+label%3AIncompatible
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.0+is%3Aclosed+label%3AFeature
#1719 Add Rackspace Cloud Files Deep Storage Extension
#1858 Support avro ingestion for realtime & hadoop batch indexing
#1873 add ability to express CONCAT as an extractionFn
#1921 Add docs and benchmark for JSON flattening parser
#1936 adding Upper/Lower Bound Filter
#1978 Graphite emitter
#1986 Preserve dimension order across indexes during ingestion
#2008 Regex search query
#2014 Support descending time ordering for time series query
#2043 Add dimension selector support for groupby/having filter
#2076 adding lower and upper extraction fn
#2209 support cascade execution of extraction filters in extraction dimension spec
#2221 Allow change minTopNThreshold per topN query
#2264 Adding custom mapper for json processing exception
#2271 time-descending result of select queries
#2258 acl for zookeeper is added
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.0+is%3Aclosed+label%3AImprovement
#984 Use thread priorities (aka set nice values for background-like tasks)
#1638 Remove Maven client at runtime + Provide a way to load Druid extensions through local file system
#1728 Store AggregatorFactory[] in segment metadata
#1988 support multiple intervals in dataSource inputSpec
#2006 Preserve dimension order across indexes during ingestion
#2047 optimize InputRowSerde
#2075 Configurable value replacement on match failure for RegexExtractionFn
#2079 reduce bytearray copy to minimal optimize VSizeIndexedWriter
#2084 minor optimize IndexMerger's MMappedIndexRowIterable
#2094 Simplifying dimension merging
#2107 More efficient SegmentMetadataQuery
#2111 optimize create inverted indexes
#2138 build v9 directly
#2228 Improve heap usage for IncrementalIndex
#2261 Prioritize loading of segments based on segment interval
#2306 More specific null/empty str handling in IndexMerger
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.0+is%3Aclosed+label%3ABug
Full list: https://github.com/druid-io/druid/issues?q=milestone%3A0.9.0+is%3Aclosed+label%3ADocumentation
#2100 doc update to make it easy to find how to do re-indexing or delta ingestion
#2186 Add intro developer docs
#2279 Some more multitenancy docs
#2364 Add more docs around timezone handling
#2216 Completely rework the Druid getting started process
Thanks to everyone who contributed to this release!
@fjy
@xvrl
@drcrallen
@pjain1
@chtefi
@liubin
@salsakran
@jaebinyo
@erikdubbelboer
@gianm
@bjozet
@navis
@AlexanderSaydakov
@himanshug
@guobingkun
@abbondanza
@binlijin
@rasahner
@jon-wei
@CHOIJAEHONG1
@loganlinn
@michaelschiff
@himank
@nishantmonu51
@sirpkt
@duilio
@pdeva
@KurtYoung
@mangesh-pardeshi
@dclim
@desaianuj
@stevemns
@b-slim
@cheddar
@jkukul
@AdrieanKhisbe
@liuqiyun
@codingwhatever
@clintropolis
@zhxiaogg
@rohitkochar
@itsmee
@Angelmmiguel
@noddi
@se7entyse7en
@zhaown
@genevien
Published by xvrl over 8 years ago
Set druid.selectors.coordinator.serviceName to your Coordinator's druid.service value (defaults to druid/coordinator) in the common.runtime.properties of all nodes. Realtime handoff will only work if this config is properly set. (See #2015)
druid.indexer.runner.javaOpts is now passed to the Peon as a property.
Thanks to all the contributors to this release!
@b-slim
@binlijin
@dclim
@drcrallen
@fjy
@gianm
@guobingkun
@himanshug
@nishantmonu51
@pjain1
@xvrl
Published by xvrl almost 9 years ago
If you are using union queries, please make sure to update broker nodes prior to updating any historical nodes, realtime nodes, or indexing service.
Otherwise, you can follow standard rolling update procedures.
Thanks to all the contributors to this release!
@anwenxu
@cheddar
@dclim
@drcrallen
@fjy
@gianm
@guobingkun
@Hailei
@himanshug
@jon-wei
@nishantmonu51
@pjain1
@potto007
@qix
@rasahner
@xvrl
Published by xvrl about 9 years ago
There should be no update concerns, and standard updating procedures can be followed for rolling updates.
Improved test coverage for indexing service, ingestion, and coordinator endpoints
The full list of changes can be found here
Special thanks to everyone that contributed (code, docs, etc.) to this release!
@drcrallen
@davideanastasia
@guobingkun
@himanshug
@michaelschiff
@fjy
@krismolendyke
@nishantmonu51
@rasahner
@xvrl
@gianm
@pjain1
@samjhecht
@solimant
@sherry-q
@ubercow
@zhaown
@mvfast
@mistercrunch
@pdeva
@KurtYoung
@onlychoice
@b-slim
@cheddar
@MarConSchneid
Published by drcrallen over 9 years ago
We recently introduced a backwards incompatible change to the schema Druid uses when it emits metrics. If you are not emitting Druid metrics to an http endpoint, the update procedure should be straightforward.
io.druid.server.metrics.ServerMonitor has been renamed to io.druid.server.metrics.HistoricalMetricsMonitor. You will need to update any configs that reference this class.
Published by fjy over 9 years ago
This release mainly introduces dimension compression and reworks the Druid documentation. There are no update concerns with this version of Druid.
Published by xvrl over 9 years ago
Group results by day of week, hour of day, etc.
We added support for time extraction functions where you can group by results based on anything DateTimeFormatter supports. For more details, see http://druid.io/docs/latest/DimensionSpecs.html#time-format-extraction-function .
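For example, a dimensionSpec grouping on day of week might look like the following sketch (the output name is arbitrary, and the format string can be any pattern DateTimeFormatter accepts):

```json
{
  "type": "extraction",
  "dimension": "__time",
  "outputName": "dayOfWeek",
  "extractionFn": { "type": "timeFormat", "format": "EEEE" }
}
```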
Audit rule and dynamic configuration changes
Druid now provides support for remembering why a rule or configuration change was made, and who made the change. Note that you must provide the author and comment fields yourself. The IP which issued the configuration change will be recorded by default. For more details, see headers "X-Druid-Author" and "X-Druid-Comment" on http://druid.io/docs/latest/Coordinator.html
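For example, a rule change carrying audit metadata might be issued as follows (the datasource and header values are illustrative):

```
POST /druid/coordinator/v1/rules/wikipedia HTTP/1.1
Content-Type: application/json
X-Druid-Author: alice
X-Druid-Comment: load 2 replicas in the hot tier
```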
Provide support for a password provider for the metadata store
This enables people to write a module extension which implements the logic for getting a password to the metadata store.
Enable servlet filters on Druid nodes
This enables people to write authentication filters for Druid requests.
Query parallelization on the broker for long interval queries
We’ve added the ability to break up a long interval query into multiple shorter interval queries that can be run in parallel. This should improve the performance of more expensive groupBys. For more details, see "chunkPeriod" on http://druid.io/docs/latest/Querying.html#query-context
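For example, a hedged sketch of a query context entry that splits a long interval into day-sized chunks to be run in parallel:

```json
"context": {
  "chunkPeriod": "P1D"
}
```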
Better schema exploration
The broker can now return the dimensions and metrics for a datasource broken down by interval.
Improved code coverage
We’ve added numerous unit tests to improve code coverage and will be tracking coverage in the future with Coveralls.
Additional ingestion metrics
Added additional metrics for failed persists and failed handoffs.
Configurable InputFormat for batch ingestion (#1177)
pagingSpec in a select query generates an obscure NPE (#1165). Thanks to friedhardware!
ignoreInvalidRows in reducer for Hadoop indexing
timeBoundary query on union datasources (#1243)
DruidSecondaryModule (#1245)
Published by xvrl over 9 years ago
New ingestion spec
Druid 0.7.0 requires a new ingestion spec format. Druid 0.6.172 supports both the old and new formats of ingestion and has scripts to convert from the old to the new format. This script can be run with 'tools convertSpec' using the same Main used to run Druid nodes. You can update your Druid cluster to 0.6.172, update your ingestion specs to the new format, and then update to Druid 0.7.0. If you update your cluster to Druid 0.7.0 directly, make sure your real-time ingestion pipeline understands the new spec.
MySQL is no longer the default metadata storage
Druid now defaults to embedding Apache Derby, which was chosen mainly for testability purposes. However, we do not recommend using Derby in production. For anything other than testing, please use MySQL or PostgreSQL metadata storage.
Configuration parameters for metadata storage were renamed from druid.db to druid.metadata.storage, and an additional druid.metadata.storage.type=<mysql|postgresql> is required to use anything other than Derby.
The convertProps tool can assist you in converting all 0.6.x properties to 0.7 properties.
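As an illustration of the rename (connection details below are placeholders):

```properties
# 0.6.x:
#   druid.db.connector.connectURI=jdbc:mysql://dbhost:3306/druid
# 0.7.x equivalent:
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://dbhost:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd
```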
Druid is now case-sensitive
Druid column names are now case-sensitive. We previously tried to be case-insensitive for queries and case-preserving for data, but we decided to make this change as there were numerous bugs related to various casing problems.
If you are upgrading from version 0.6.x:
You can use the jsonLowerCase parseSpec to lower-case the data for you at ingestion time and maintain backwards compatibility. For all other parse specs, you will need to lower-case the metric/aggregator names if you were using mixed case before.
Batch segment announcement is now the default
Druid now uses batch segment announcement by default for all nodes. If you are already using batch segment announcement, you should be all set.
If you have not yet updated to using batch segments announcement, please read this guide in the forum on how to update your current 0.6.x cluster to use batch announcement first.
Kafka 0.7.x removed in favor of Kafka 0.8.x
If you are using Kafka 0.7, you will have to build the kafka-seven extension manually. It is commented out in the build, because Kafka 0.7 is not available in Maven Central. The Kafka 0.8 (kafka-eight) extension is unaffected.
Coordinator endpoint changes
Numerous coordinator endpoints have changed. Please refer to the coordinator documentation for what they are.
In particular:
/info on the coordinator has been removed.
/health on historical nodes has been removed.
Separate jar required for com.metamx.metrics.SysMonitor
If you currently have com.metamx.metrics.SysMonitor as part of your druid.monitoring.monitors configuration and would like to keep it, you will have to add the SIGAR library jar to your classpath.
Alternatively, you can simply remove com.metamx.metrics.SysMonitor if you do not rely on the sys/.* metrics.
We had to remove the direct dependency on SIGAR in order to move Druid artifacts to Maven Central, since SIGAR is currently not available there.
Update Procedure
If you are running a version of Druid older than 0.6.172, please upgrade to 0.6.172 first. See the 0.6.172 release notes for instructions.
In order to ensure a smooth rolling upgrade without downtime, nodes must be updated in the following order:
Long metric column support
Until now Druid stored all metrics as single precision floating point values, which could introduce rounding errors and unexpected results with queries using longSum aggregators, especially for groupBy queries.
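For example, an aggregator that sums a long-typed column exactly (the column names here are hypothetical):

```json
{ "type": "longSum", "name": "total_events", "fieldName": "events" }
```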
Pluggable metadata storage
MySQL, PostgreSQL, and Derby (for testing) are now supported out of the box. Derby supports only a single master and should not be used for high-availability production; use MySQL or PostgreSQL with failover for that.
Simplified data ingestion API
We have completely reworked Druid’s data ingestion API.
Switch compression for metric columns from LZF to LZ4
Initial performance tests show it may be between 15% and 25% faster, and results in segments about 3-5% smaller on typical data sets.
Configurable inverted bitmap indexes
Druid now supports Roaring Bitmaps in addition to the default Concise Bitmaps. Initial performance tests show Roaring may be up to 20% faster for certain types of queries, at the expense of segments being 20% larger on average.
Integration tests
We have added a set of integration tests that use Docker to spin up a Druid cluster to run a series of indexing and query tests.
New Druid Coordinator console
We introduced a new Druid console that should hopefully provide a better overview of the status of your cluster and be a bit more scalable if you have hundreds of thousands of segments. We plan to expand this console to provide more information about the current state of a Druid cluster.
Query Result Context
Result contexts can report errors during queries in the query headers. We are currently using this feature for internal retries, but hope to expand it to report more information back to clients.
Faster query speeds
Lots of speed improvements thanks to faster compression format, small optimizations in column structure, and optimizations of queries with multiple aggregations, as well as numerous groupBy query performance improvements. Overall, some queries can be up to twice as fast using the new index format.
Druid artifacts in Maven Central
Druid artifacts are now available in Maven Central to make your own builds and deployments easier.
Common Configuration File
Druid now has a common.runtime.properties where you can declare all global properties as well as all of your external dependencies. This avoids repeated configuration across multiple nodes and will hopefully make setting up a Druid cluster a little less painful.
Default host names, port and service names
Default host names, ports, and service names for all nodes means a lot less configuration is required upfront if you are happy with the defaults. It also means you can run all node types on a single machine without fiddling with port conflicts.
Druid column names are now case sensitive
Death to casing bugs. Be aware of the dangers of updating to 0.7.0 if you have mixed case columns and are using 0.6.x. See above for more details.
Query Retries
Druid will now automatically retry queries for certain classes of failures.
Background caching
For certain types of queries, especially those that involve distinct (hyperloglog) counts, this can improve performance by over 20%. Background caching is disabled by default.
Reduced coordinator memory usage
Reduced coordinator memory usage (by up to 50%). This fixes a problem where a coordinator would sometimes lose leadership due to frequent GCs.
Metrics can now be emitted to SSL endpoints
Additional AWS credentials support. Thanks @gnethercutt!
Additional persist and throttle metrics for real-time ingestion
This should help diagnose when real-time ingestion is being throttled and how long persists are taking. These metrics provide a good indication of when it is time to scale up real-time ingestion.
Broker initialization endpoint
Brokers now provide a status endpoint at /druid/broker/v1/loadstatus to indicate whether they are ready to be queried, making rolling upgrades / restarts easier.
druid.host should now support IPv6 addresses as well.
Set druid.coordinator.merge.on to false; 'false' is the default value of the config.
Published by xvrl over 9 years ago
Druid 0.6.172 fixes a few bugs to make the upgrade path towards Druid 0.7.0 seamless:
If you are not already running 0.6.171, please see the 0.6.171 release notes for important notes on the upgrade procedure.
Published by fjy over 9 years ago
Druid 0.6.171 is a stable bug-fix release, mainly meant to enable a less painful update to Druid 0.7.0. Going forward, we will be backporting fixes to 0.6.x as required for the community and continuing to develop major features on 0.7.x.
http://static.druid.io/artifacts/releases/druid-services-0.6.171-bin.tar.gz
Both this version and 0.7.0-RC1 provide much better out of the box support for PostgreSQL as a metadata store. In order to provide this functionality, we had to make some small changes to the way data is stored in metadata storage for MySQL setups.
Before updating to 0.6.171, please make sure that:
All Druid MySQL metadata tables are using UTF-8 encoding for all string/text columns,
The default character set for the Druid MySQL database has been changed to UTF-8.
Druid Coordinator and Overlord will refuse to start if the database default character set is not UTF-8.
To check column character encoding, use
SHOW CREATE TABLE <table>;
.
If the default table encoding is not UTF-8, or if any columns are encoded using anything other than UTF-8, you will need to convert those tables.
To check the database default encoding, use
SHOW VARIABLES LIKE 'character_set_database';
If you are not already using UTF-8 encoding for your columns, you can convert your tables and change the database default using the following commands. Please keep in mind that table conversion can take a while (order of minutes) and segment loading / handoff will be interrupted for the duration of the upgrade.
Make a backup of your database before performing the upgrade!
ALTER TABLE druid_config CONVERT TO CHARSET utf8;
ALTER TABLE druid_rules CONVERT TO CHARSET utf8;
ALTER TABLE druid_segments CONVERT TO CHARSET utf8;
ALTER TABLE druid_tasks CONVERT TO CHARSET utf8;
ALTER TABLE druid_tasklogs CONVERT TO CHARSET utf8;
ALTER TABLE druid_tasklocks CONVERT TO CHARSET utf8;
-- replace druid with your Druid database name here
ALTER DATABASE druid DEFAULT CHARACTER SET utf8;
We introduced several query optimizations, mainly for topNs and HLLs
The overlord can now optionally choose what worker to send tasks to #904
Improved retry logic for realtime plumbers when handoffs fail during the final merge step
Published by fjy almost 10 years ago
Published by fjy about 10 years ago
Published by fjy over 10 years ago
Published by fjy over 10 years ago
We are pleased to announce a new Druid stable, version 0.6.73. New features include:
A production tested dimension cardinality estimation module
We recently open sourced our HyperLogLog module, described in bit.ly/1fIEpjM and bit.ly/1ebLnNI. Documentation has been added on how to use this module as an aggregator and as part of post aggregators.
Hash-based partitioning
We recently introduced a new sharding format for batch indexing. We use the HyperLogLog module to estimate the size of a data set and create partitions based on this size. In our tests, partitioning via this hash based method is both faster and leads to more evenly partitioned segments.
Cross-tier replication
We can now replicate segments across different tiers. This means that you can create a “hot” tier that loads a single copy of the data on more powerful hardware and a “cold” tier that loads another copy of the data on less powerful hardware. This can lead to significant reductions in infrastructure costs.
Nested GroupBy Queries
Thanks to an awesome contribution from Yuval Oren et al., we can now do multi-level aggregation with groupBys. More info here: https://groups.google.com/forum/#!topic/druid-development/8oL28iuC4Gw
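A nested groupBy wraps an inner groupBy as a query-type dataSource; the following is a minimal sketch with placeholder datasource and column names:

```json
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "wikipedia",
      "granularity": "all",
      "dimensions": ["country", "city"],
      "aggregations": [ { "type": "longSum", "name": "edits", "fieldName": "count" } ],
      "intervals": ["2014-01-01/2014-02-01"]
    }
  },
  "granularity": "all",
  "dimensions": ["country"],
  "aggregations": [ { "type": "longSum", "name": "edits", "fieldName": "edits" } ],
  "intervals": ["2014-01-01/2014-02-01"]
}
```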
GroupBy memory improvements
We’ve made improvements as to how multi-threaded groupBy queries utilize memory. This should help reduce memory pressure on nodes with concurrent, expensive groupBy queries.
Real-time ingestion stability improvements
We’ve seen some stability issues with real-time ingestion with a high number of concurrent persists and have added smarter throttling to handle this type of workload.
Additional features
Things on our plate
Published by xvrl over 10 years ago
This is a small release with mainly stability and performance updates.
Published by xvrl over 10 years ago
When updating Druid with no downtime, we highly recommend updating historical nodes and real-time nodes before updating the broker layer. Changes in queries are typically compatible with an old broker version and a new historical node version, but not vice versa. Our recommended rolling update process is: