Druid – what is it?
Druid is a distributed, column-oriented data store designed to run BI/OLAP-style queries on massive volumes of data. It is built for quick and efficient querying, aggregation and analysis of time series, that is, series of timestamped data points. Its architecture allows very low-latency queries to run against very large datasets.
Data storage and partitioning
Druid partitions data into segments based on the data timestamp. Segment file size affects performance, so sizing segments is usually part of system optimization; the Druid documentation recommends segment files between 300 and 700 MB. Druid allows multiple segments for the same time interval, in which case the segments form a block for that interval.
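The partitioning described above can be illustrated with a minimal sketch. It assumes daily intervals and a row-count limit per segment; both are choices made for this example only (the recommendation above is expressed in file size, not rows), and the function names are hypothetical rather than part of Druid.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def day_interval(ts):
    """Return the (start, end) day interval that contains the timestamp."""
    start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, start + timedelta(days=1)


def partition_into_segments(events, max_rows_per_segment):
    """Group events by day interval, then split each interval into
    numbered partitions of at most max_rows_per_segment rows."""
    by_interval = defaultdict(list)
    for event in events:
        by_interval[day_interval(event["timestamp"])].append(event)

    segments = {}
    for interval in sorted(by_interval):
        rows = by_interval[interval]
        for offset in range(0, len(rows), max_rows_per_segment):
            partition = offset // max_rows_per_segment
            segments[(interval, partition)] = rows[offset:offset + max_rows_per_segment]
    return segments


# Illustrative events: three on 1 January, one on 2 January.
events = [
    {"timestamp": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), "country": "PL"},
    {"timestamp": datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc), "country": "DE"},
    {"timestamp": datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc), "country": "PL"},
    {"timestamp": datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc), "country": "US"},
]
segments = partition_into_segments(events, max_rows_per_segment=2)
for (interval, partition), rows in segments.items():
    print(interval[0].date(), "partition", partition, "rows", len(rows))
```

In this sketch the two partitions for 1 January play the role of multiple segments that share one interval.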
Each data object in Druid can be divided into three separate parts:
• Timestamp column – the event time, used to partition data into segments
• Dimension columns – attributes describing the context of the data, such as country or product
• Metric columns – numerical values that quantify an event and are the subject of aggregation and analysis
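To make the three parts concrete, here is a small sketch that splits a row into its timestamp, dimensions and metrics, then sums the metrics for each combination of dimension values. The column names (country, product, clicks, revenue) and the helper functions are assumptions made for illustration, not a Druid schema or API.

```python
from collections import defaultdict

# Hypothetical column layout for this example.
TIMESTAMP = "timestamp"
DIMENSIONS = ("country", "product")
METRICS = ("clicks", "revenue")


def split_row(row):
    """Split a row into its timestamp, dimension values and metric values."""
    dims = tuple(row[name] for name in DIMENSIONS)
    metrics = tuple(row[name] for name in METRICS)
    return row[TIMESTAMP], dims, metrics


def sum_metrics_by_dimensions(rows):
    """Sum every metric for each distinct combination of dimension values."""
    totals = defaultdict(lambda: [0] * len(METRICS))
    for row in rows:
        _, dims, metrics = split_row(row)
        for index, value in enumerate(metrics):
            totals[dims][index] += value
    return {dims: dict(zip(METRICS, values)) for dims, values in totals.items()}


rows = [
    {"timestamp": "2024-01-01T09:00:00Z", "country": "PL", "product": "book", "clicks": 3, "revenue": 30},
    {"timestamp": "2024-01-01T09:05:00Z", "country": "PL", "product": "book", "clicks": 2, "revenue": 20},
    {"timestamp": "2024-01-01T09:07:00Z", "country": "DE", "product": "book", "clicks": 1, "revenue": 10},
]
result = sum_metrics_by_dimensions(rows)
print(result)
```

Dimensions act as grouping keys and metrics as the values being aggregated, which is why the two kinds of column are kept apart.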
Data aggregation and querying
Druid employs both exact and approximate calculation algorithms. The approximate ones include:
• HyperLogLog – distinct count approximation
• Theta sketches – approximating the results of set operations (union, intersection etc.)
• TopN – fast approximate ranking of the top results
Approximate algorithms significantly reduce calculation time while keeping results good (around 98% accuracy), which is acceptable in many applications.
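The trade-off between speed and accuracy can be illustrated with a K-minimum-values (KMV) sketch, a simple technique related to the idea behind theta sketches. This is a teaching sketch under stated assumptions, not the algorithm or code Druid itself runs: the function names and the choice of SHA-256 as the hash are made up for this example.

```python
import hashlib


def hash01(value):
    """Map a value to a pseudo-uniform float between 0 and 1."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(values, k):
    """Return the k smallest distinct hash values, sorted ascending.
    Simplification: all hashes are materialised first; a streaming
    version would hold only k values at a time."""
    return sorted({hash01(value) for value in values})[:k]


def estimate_distinct(sketch, k):
    """Estimate the distinct count from a KMV sketch."""
    if len(sketch) < k:
        return len(sketch)          # fewer than k distinct values: exact
    return (k - 1) / sketch[-1]     # k-th smallest hash drives the estimate


def union(sketch_a, sketch_b, k):
    """Sketch of the union: the k smallest hashes across both sketches."""
    return sorted(set(sketch_a) | set(sketch_b))[:k]


k = 1024
sketch_a = kmv_sketch(range(0, 60_000), k)
sketch_b = kmv_sketch(range(40_000, 100_000), k)
merged = union(sketch_a, sketch_b, k)
print("estimated distinct values in the union:", round(estimate_distinct(merged, k)))
```

The sketch keeps at most k numbers regardless of how many values are seen, and two sketches can be merged without revisiting the raw data. A larger k gives a tighter estimate at the cost of more memory.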
Druid's native query language is JSON over HTTP, which makes it difficult for end users to query and analyze the data effectively. Using a third-party query tool, such as Apache Superset or Pivot, is highly recommended.
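To show what a JSON-over-HTTP query looks like, the sketch below builds a native timeseries query and the HTTP request that would carry it. The datasource name page_events, the metric clicks and the address http://localhost:8082 are assumptions for illustration; the request is only constructed, not sent.

```python
import json
import urllib.request


def build_timeseries_query(datasource, interval, metric):
    """Build a native timeseries query that sums one metric per day."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "day",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": f"total_{metric}", "fieldName": metric}
        ],
    }


def build_request(base_url, query):
    """Wrap the query in a POST request to the native query endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/druid/v2",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical datasource, metric and host.
query = build_timeseries_query("page_events", "2024-01-01/2024-01-08", "clicks")
request = build_request("http://localhost:8082", query)
print(json.dumps(query, indent=2))
# With a running cluster, urllib.request.urlopen(request) would submit the query.
```

Even a simple daily sum needs a nested JSON document, which illustrates why a visual query tool is easier for end users.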
Where to use?
Druid is well regarded in areas such as network activity analysis, cloud security and IoT sensor data analysis. Apache Druid is the tool of choice for:
• Highly efficient aggregation and analysis of time-series data
• Real-time data analytics
• Extremely large data volumes (hundreds of millions of events)
• High availability
Thanks to its performance, Druid was quickly adopted by many companies, including Netflix, Alibaba, Airbnb, eBay, Cisco, PayPal and Yahoo.