
Time-Series Database Requirements

Original article

Definition of Data Type

For my purposes, time-series data can be defined as follows:

A series is identified by a source name/ID (such as a hostID) and a metric name/ID.

A series consists of {timestamp, value} measurements, ordered by timestamp, where the timestamp is a high-precision Unix timestamp represented as a float.
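
A minimal sketch of this data model in Python (the `SeriesKey` and `Measurement` names are mine, purely for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeriesKey:
    source_id: str   # e.g. a host ID
    metric_id: str   # e.g. "cpu.user"

@dataclass(frozen=True)
class Measurement:
    timestamp: float  # high-precision Unix timestamp
    value: float

# A series maps a key to measurements kept in timestamp order.
series: dict[SeriesKey, list[Measurement]] = {
    SeriesKey("host42", "cpu.user"): [
        Measurement(1700000000.001, 12.5),
        Measurement(1700000001.002, 13.0),
    ],
}
```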

Workload Characteristics

Time-series data is not general-purpose, and its workload has definite patterns.

A time-series database should be optimized for the following:

For writes:

Writes dominate: roughly 95% to 99% or more of all operations are writes.

Writes are almost entirely sequential appends; data arrives almost entirely in ascending time order (with one caveat, covered under reads below).

Writes of past data are very rare. Most measurements are written within seconds, or at worst minutes, of being observed.

Updates to existing data are rare.

Deletes of large amounts of data happen in contiguous blocks; deleting individual measurements or data at random positions is rare. Efficient bulk deletion is important and should be close to zero-cost; non-bulk deletion need not be optimized.

Based on the above, an immutable storage format is probably a good thing. As a further consequence of immutable storage, a predefined or fixed schema may be a problem long-term.
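
One way to realize both the cheap contiguous deletes and the immutable storage format is to store each series as a sequence of time-partitioned, append-then-seal segments; dropping old data then means discarding whole segments. A rough sketch, where the segment span and API are my own assumptions:

```python
class Segment:
    """An immutable chunk of one series covering [start_ts, end_ts).
    Once sealed, it is never modified again."""
    def __init__(self, start_ts: float, end_ts: float):
        self.start_ts, self.end_ts = start_ts, end_ts
        self.points: list[tuple[float, float]] = []  # (timestamp, value)
        self.sealed = False

    def append(self, ts: float, value: float) -> None:
        assert not self.sealed and self.start_ts <= ts < self.end_ts
        self.points.append((ts, value))

class SegmentedSeries:
    def __init__(self, segment_span: float = 3600.0):
        self.span = segment_span          # one segment per hour, an arbitrary choice
        self.segments: list[Segment] = []

    def append(self, ts: float, value: float) -> None:
        start = ts - (ts % self.span)
        if not self.segments or self.segments[-1].start_ts != start:
            if self.segments:
                self.segments[-1].sealed = True  # writes arrive in time order
            self.segments.append(Segment(start, start + self.span))
        self.segments[-1].append(ts, value)

    def drop_before(self, cutoff_ts: float) -> None:
        """Bulk delete: discard whole segments, never touch individual points."""
        self.segments = [s for s in self.segments if s.end_ts > cutoff_ts]
```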

For reads, the following usually holds:

Data is much bigger than memory and is rarely read, so caching is of little use; systems are often IO-bound.

Reads are typically logically sequential per-series, ascending or descending.

Concurrent reads and reads of multiple series at once are reasonably common.

The caveat to “writes arrive in sequential order” is that measurements typically arrive ordered by {timestamp, series_id}, but reads are typically done in {series_id, timestamp} order. Reads need to be fast, even though they are rare. There are generally two approaches to dealing with this. The first is to write efficiently, so the data isn’t read-optimized per-series on disk, and deploy massive amounts of compute power in parallel for reads, scanning through all the data linearly. The second is to pay a penalty on writes, so the data is tightly packed by series and optimized for sequential reads of a series.
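
Here is a minimal sketch of the second approach: buffer measurements as they arrive in {timestamp, series_id} order, and pay a sort at flush time so they land on disk in {series_id, timestamp} order. The buffer size and flush policy are illustrative assumptions, not any particular database's design:

```python
class WriteBuffer:
    """Accepts points in arrival order ({timestamp, series_id}) and flushes
    them sorted into {series_id, timestamp} order, so data lands on disk
    packed per series and sequential for per-series reads."""
    def __init__(self, max_points: int = 4):
        self.max_points = max_points
        self.buffer: list[tuple[str, float, float]] = []  # (series_id, ts, value)

    def write(self, series_id: str, ts: float, value: float):
        self.buffer.append((series_id, ts, value))
        if len(self.buffer) >= self.max_points:
            return self.flush()
        return None

    def flush(self):
        chunk = sorted(self.buffer)  # tuple sort: (series_id, ts)
        self.buffer = []
        return chunk  # a real engine would write this chunk to disk

# Points arrive interleaved across series, ordered by time...
buf = WriteBuffer()
buf.write("cpu", 1.0, 10.0)
buf.write("mem", 1.0, 50.0)
buf.write("cpu", 2.0, 11.0)
print(buf.write("mem", 2.0, 51.0))
# ...but each flushed chunk is packed per series:
# [('cpu', 1.0, 10.0), ('cpu', 2.0, 11.0), ('mem', 1.0, 50.0), ('mem', 2.0, 51.0)]
```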

Performance and Scaling Characteristics

A time-series database should be:

Distributed by design — no bolt-on clustering or sharding. Automatic data distribution, automatic query distribution. Fault-tolerant and highly available, with built-in replication and automatic failover. I think by this point we should all understand what it means for a database to be natively distributed. There are several good examples of databases that do it sensibly, and little of this should need to be novel.
Send the query to the data, don’t bring the data to the query. This is a restatement of “automatic query distribution.” Queries may touch many gigabytes or terabytes of data, so moving it across the network is not scalable. (A toy sketch follows this list.)
Efficient per-node so it is capable of running at large scale without requiring thousands of servers.
Able to take advantage of powerful hardware: PCIe flash storage, lots of RAM, many CPU cores. This rules out single-writer systems.
Fast and consistent. No spikes or stalls; no checkpoint freezes; no compaction lock-ups.
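
To make “send the query to the data” concrete, a toy scatter-gather sketch: the coordinator asks each shard for a small partial aggregate computed against its local data, and only those partials cross the network. The shard layout and partial-aggregate shape here are my own illustration:

```python
# Each shard holds its own slice of the data and aggregates locally.
shards = [
    {"cpu": [10.0, 11.0]},          # shard 1's local data
    {"cpu": [12.0], "mem": [50.0]}, # shard 2's local data
]

def shard_partial_mean(local_data: dict, series_id: str) -> tuple[float, int]:
    """Runs on the shard: returns (sum, count), a few bytes rather than raw points."""
    points = local_data.get(series_id, [])
    return (sum(points), len(points))

def coordinator_mean(series_id: str) -> float:
    # Only tiny partial results travel over the network.
    partials = [shard_partial_mean(s, series_id) for s in shards]
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count

print(coordinator_mean("cpu"))  # 11.0
```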

Operational Requirements

I do not specifically need ACID, but I need the database to quickly recover to a consistent state after events like a power failure. For my purposes, time-series data is not subject to the same durability constraints as financial data.
Non-blocking backups are a must. Incremental backups are a very good thing.
It needs to be possible to scale the cluster up or down without downtime or locking.
Compressed storage. Time-series data is big, but highly compressible. (A sketch follows this list.)
The database should be well instrumented.
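
To illustrate why time-series data compresses so well, here is a minimal delta-of-delta encoding of timestamps, in the spirit of (but much simpler than) formats like Facebook's Gorilla; regular sampling intervals collapse into runs of near-zero values:

```python
def delta_of_delta(timestamps: list[int]) -> list[int]:
    """Encode timestamps as second-order differences. Regular sampling
    intervals become runs of (near-)zero values, which varint or
    general-purpose compression shrinks dramatically."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not deltas:
        return []
    return [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]

# Millisecond timestamps sampled roughly every 10 seconds:
ts = [1_700_000_000_000, 1_700_000_010_000, 1_700_000_020_000,
      1_700_000_030_001, 1_700_000_039_999]
print(delta_of_delta(ts))  # [10000, 0, 1, -3]
```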

Language and/or API Design

I’ve spoken to many people who have built large-scale time-series databases for big companies. Most of them have told me that the lack of a high-level way to access and query the database was the long-term millstone around their neck.

I would be happy with something that looks like SQL, as InfluxDB’s query language does. Crucially, it needs to avoid a few of the legacy limitations of SQL. The way I think about it is that SQL tables are fixed-width and grow downwards by adding rows. A natural outcome of that is that each column in SQL statements is known in advance and explicitly named, and expressions naturally work within a single row or in aggregates over groups of rows, but cannot span rows otherwise without doing a JOIN.

However, in time-series databases, rows are series identified by the “primary key.” Rows grow sideways as new measurements are added, tables grow downwards as new series are added, and columns are timestamps. Thus, tables are sparse matrices. Expressions must operate in aggregates over rectangular sections of the sparse matrix, not just rows or columns, and the language must permit a GROUP BY functionality in both directions. You could say that both rows and columns must be addressable by keys instead of by literal identifiers, and ideally by pattern matching in addition to strict equality and ranges.
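
A toy model of that sparse matrix, showing GROUP BY in both directions (series keys as rows, timestamps as columns; all names are illustrative):

```python
# The table as a sparse matrix: keyed by (series_id, timestamp).
cells = {
    ("host1.cpu", 100.0): 10.0,
    ("host1.cpu", 200.0): 12.0,
    ("host2.cpu", 100.0): 20.0,
    ("host2.cpu", 200.0): 24.0,
}

def group_by_series(pattern: str) -> dict[str, float]:
    """GROUP BY along rows: one aggregate per matching series."""
    out: dict[str, float] = {}
    for (sid, _), v in cells.items():
        if pattern in sid:
            out[sid] = out.get(sid, 0.0) + v
    return out

def group_by_time(pattern: str) -> dict[float, float]:
    """GROUP BY along columns: one aggregate per timestamp, across series."""
    out: dict[float, float] = {}
    for (sid, ts), v in cells.items():
        if pattern in sid:
            out[ts] = out.get(ts, 0.0) + v
    return out

print(group_by_series("cpu"))  # {'host1.cpu': 22.0, 'host2.cpu': 44.0}
print(group_by_time("cpu"))    # {100.0: 30.0, 200.0: 36.0}
```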

Ideally, the language and database should support server-side processing of at least the following, and probably much more:

Expressions such as arithmetic and string operations.
Aggregate functions.
Resampling into time resolutions different from the storage resolution.
Expressions and operators that refer to different series, e.g. to sum series, or divide one by another, and to combine such expressions, e.g. to sum up all series whose identifiers match a pattern, then divide the result by the sum of another group of series.
Ordering, ranking, and limiting.
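
A sketch of two of these operations, resampling and cross-series arithmetic, as plain functions; a real query engine would run the equivalent server-side (the bucket-mean resampling and pointwise-join semantics here are my own simplifications):

```python
from collections import defaultdict

Series = list[tuple[float, float]]  # (timestamp, value), sorted by timestamp

def resample_mean(series: Series, resolution: float) -> Series:
    """Resample into a coarser time resolution, averaging within each bucket."""
    buckets: dict[float, list[float]] = defaultdict(list)
    for ts, v in series:
        buckets[ts - (ts % resolution)].append(v)
    return sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())

def combine(a: Series, b: Series, op) -> Series:
    """Apply a binary operator pointwise across two series, e.g. divide one by another."""
    b_by_ts = dict(b)
    return [(ts, op(v, b_by_ts[ts])) for ts, v in a if ts in b_by_ts]

hits  = [(0.0, 90.0), (5.0, 80.0), (10.0, 70.0), (15.0, 60.0)]
total = [(0.0, 100.0), (5.0, 100.0), (10.0, 100.0), (15.0, 100.0)]

hit_rate = combine(hits, total, lambda x, y: x / y)
print(resample_mean(hit_rate, 10.0))  # [(0.0, 0.85), (10.0, 0.65)] up to float rounding
```
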
Another way to say the above is that the language and database should be designed for analytics, not just for drawing strip charts. Many open-source time-series databases such as RRDTool are far too tightly coupled with their expected use case, and this is a serious limitation.

There should be an efficient binary protocol that supports bulk inserts.
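
As a sketch of what such a protocol might look like, here is a fixed-width binary frame for bulk inserts. This wire layout is entirely my own assumption, not any existing protocol:

```python
import struct

# Frame: series_id_len (u16) | series_id bytes | point count (u32) | N * (f64 ts, f64 value)
def encode_bulk(series_id: str, points: list[tuple[float, float]]) -> bytes:
    sid = series_id.encode("utf-8")
    out = struct.pack("!H", len(sid)) + sid + struct.pack("!I", len(points))
    for ts, v in points:
        out += struct.pack("!dd", ts, v)
    return out

def decode_bulk(frame: bytes) -> tuple[str, list[tuple[float, float]]]:
    (sid_len,) = struct.unpack_from("!H", frame, 0)
    sid = frame[2:2 + sid_len].decode("utf-8")
    (count,) = struct.unpack_from("!I", frame, 2 + sid_len)
    offset = 2 + sid_len + 4
    points = [struct.unpack_from("!dd", frame, offset + 16 * i) for i in range(count)]
    return sid, points

frame = encode_bulk("host42.cpu.user", [(1700000000.0, 12.5), (1700000001.0, 13.0)])
print(decode_bulk(frame))  # ('host42.cpu.user', [(1700000000.0, 12.5), (1700000001.0, 13.0)])
```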

Non-Requirements

I’d like a database that does one thing well. I do not think I need any of the following, and I regard them as neutral, or in some cases even as drawbacks:

Access control — authentication and authorization.
Ability to visualize data, draw graphs, etc.
Support for multiple measurements at the same timestamp. The measurement’s primary key is {series, timestamp}, and it does not make sense to allow multiple values with the same timestamp.
Multi-dimensionality. Multiple dimensions for a series can be stored as multiple series, and multiple series can be combined in expressions with the query language I specified, so the atom of “series” already provides for the use case of multi-dimensionality.
“Tagging” measurements or series with additional ad-hoc key-value pairs. (Update: I now see the use case for tagging.)
Joins from time-series data to relational data.

Bonus and Additional Features

The preceding sections describe a good general-purpose time-series database, from my point of view. Nice-to-have features might include:

Support for retention policies.
Support for storing data in multiple resolutions (materialized views) and selecting the appropriate resolution to access for a given request.
Support for maintaining downsampled data in coarser resolutions, automatically building these materialized views as high-resolution data arrives (automatic rollup; a sketch follows this list).
Support for query priorities or admission control to prevent starvation and DOS from large queries.
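
A rough sketch of automatic rollup as mentioned above: each arriving high-resolution point incrementally updates a coarser materialized view. The one-minute resolution and running-mean aggregate are my assumptions:

```python
class AutoRollup:
    """Maintains a coarser mean incrementally as high-resolution points arrive."""
    def __init__(self, resolution: float = 60.0):
        self.resolution = resolution
        self.buckets: dict[float, tuple[float, int]] = {}  # bucket_ts -> (sum, count)

    def ingest(self, ts: float, value: float) -> None:
        bucket = ts - (ts % self.resolution)
        s, c = self.buckets.get(bucket, (0.0, 0))
        self.buckets[bucket] = (s + value, c + 1)

    def view(self) -> list[tuple[float, float]]:
        return sorted((ts, s / c) for ts, (s, c) in self.buckets.items())

rollup = AutoRollup(resolution=60.0)
for ts, v in [(0.0, 10.0), (30.0, 20.0), (65.0, 30.0)]:
    rollup.ingest(ts, v)
print(rollup.view())  # [(0.0, 15.0), (60.0, 30.0)]
```
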
For my particular uses, I also need support for:

Many series per server in my cluster; far more, for example, than the practical limit on the number of files in a directory.
Although some series are long-lived, many are not. Many are sparse, with measurements only once in a long while. Series are dynamic and are not predefined; new series may appear at any moment. Due to this requirement, I need efficient support for discovering which series exist during any given time range. (A toy index is sketched after this list.)
Multi-tenancy at the physical level. This is partially by demand; some customers want to know that their data is separate from other customers’ data. It is partially pragmatic, to support features such as separate retention policies per customer.
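
For the series-discovery requirement above, a toy catalog that tracks each series’ observed time span and intersects it with the queried range; a real system would need something more scalable, such as per-time-partition series lists:

```python
class SeriesCatalog:
    """Tracks, per series, the time span over which it has data, so
    'which series exist in [start, end]?' is answerable without scanning points."""
    def __init__(self):
        self.spans: dict[str, tuple[float, float]] = {}  # series_id -> (first_ts, last_ts)

    def observe(self, series_id: str, ts: float) -> None:
        first, last = self.spans.get(series_id, (ts, ts))
        self.spans[series_id] = (min(first, ts), max(last, ts))

    def series_in_range(self, start: float, end: float) -> list[str]:
        return [sid for sid, (first, last) in self.spans.items()
                if first <= end and last >= start]

catalog = SeriesCatalog()
catalog.observe("host1.cpu", 100.0)
catalog.observe("host1.cpu", 500.0)
catalog.observe("job_tmp.latency", 900.0)  # a short-lived, sparse series
print(catalog.series_in_range(0.0, 600.0))  # ['host1.cpu']
```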

Conclusion

The future of “big data” is mostly time-series. Someone who creates a good time-series database for such use cases will probably do quite well. I’m sure my requirements aren’t the most general-purpose or complete, but I hope it’s useful to share them anyway.
