How to use time-series with Sqlite, with fast time-range queries?(如何在 Sqlite 中使用时间序列和快速时间范围查询?)
问题描述
假设我们用 Unix 时间戳列 ts
在 Sqlite 数据库中记录事件:
Let's say we log events in a Sqlite database with Unix timestamp column ts
:
CREATE TABLE data(ts INTEGER, text TEXT); -- more columns in reality
我们希望快速查找日期时间范围,例如:
and that we want fast lookup for datetime ranges, for example:
SELECT text FROM data WHERE ts BETWEEN 1608710000 and 1608718654;
像这样,EXPLAIN QUERY PLAN
给出了不好的 SCAN TABLE data
,所以一个明显的解决方案是 create an index with CREATE INDEX dt_idx ON data(ts)
.
Like this, EXPLAIN QUERY PLAN
gives SCAN TABLE data
which is bad, so one obvious solution is to create an index with CREATE INDEX dt_idx ON data(ts)
.
那么问题就解决了,但是必须为我们已经增加的序列/已经排序的列 ts
维护一个索引是一个糟糕的解决方案可以直接在 O(log n) 中使用 B-tree 搜索.在内部这将是索引:
Then the problem is solved, but it's rather a poor solution to have to maintain an index for an already-increasing sequence / already-sorted column ts
for which we could use a B-tree search in O(log n) directly. Internally this will be the index:
ts rowid
1608000001 1
1608000002 2
1608000012 3
1608000077 4
这是对数据库空间的浪费(当查询必须首先查看索引时会浪费 CPU).
which is a waste of DB space (and CPU when a query has to look in the index first).
为了避免这种情况:
(1) 我们可以使用
ts
作为INTEGER PRIMARY KEY
,所以ts
将是rowid代码>本身.但这失败了,因为
ts
不是唯一的:2 个事件可以在同一秒(甚至在同一毫秒)发生.
(1) we could use
ts
asINTEGER PRIMARY KEY
, sots
would be therowid
itself. But this fails becausets
is not unique: 2 events can happen at the same second (or even at the same millisecond).
例如查看 SQLite Autoincrement 中给出的信息.
See for example the info given in SQLite Autoincrement.
(2) 我们可以使用 rowid
作为时间戳 ts
与递增的数字连接.示例:
(2) we could use rowid
as timestamp ts
concatenated with an increasing number. Example:
16087186540001
16087186540002
[--------][--]
ts increasing number
那么 rowid
是唯一的并且严格递增(假设每秒事件少于 10k),并且不需要索引.查询 WHERE ts BETWEEN a AND b
将简单地变为 WHERE rowid BETWEEN a*10000 AND b*10000+9999
.
Then rowid
is unique and strictly increasing (provided there are less than 10k events per second), and no index would be required. A query WHERE ts BETWEEN a AND b
would simply become WHERE rowid BETWEEN a*10000 AND b*10000+9999
.
但是有没有一种简单的方法可以让 Sqlite INSERT
rowid
大于或等于给定值的项目?假设当前时间戳为 1608718654
并出现两个事件:
But is there an easy way to ask Sqlite to INSERT
an item with a rowid
greater than or equal to a given value? Let's say the current timestamp is 1608718654
and two events appear:
CREATE TABLE data(ts_and_incr INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT);
INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello") #16087186540001
INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello") #16087186540002
更一般地说,如何使用 Sqlite 以最佳方式创建时间序列,以实现快速查询WHERE timestamp BETWEEN a AND b
?
推荐答案
第一个解决方案
问题中详述的方法(2)似乎效果很好.在基准测试中,我获得了:
First solution
The method (2) detailed in the question seems to work well. In a benchmark, I obtained:
- 简单的方法,没有索引:18 MB 数据库,86 ms 查询时间
- 简单的方法,索引:32 MB 数据库,12 毫秒查询时间
- 方法 (2):18 MB 数据库,12 ms 查询时间
这里的关键是使用 dt
作为 INTEGER PRIMARY KEY
,所以 它将是行 id 本身(另见 是主键需要的索引吗SQLite?),使用 B-tree,并且将 not 有另一个隐藏的 rowid
列.因此,我们避免了一个额外的索引,它会产生对应的 dt =>rowid
:这里dt
是行id.
The key point is here to use dt
as an INTEGER PRIMARY KEY
, so it will be the row id itself (see also Is an index needed for a primary key in SQLite?), using a B-tree, and there will not be another hidden rowid
column. Thus we avoid an extra index which would make a correspondance dt => rowid
: here dt
is the row id.
我们还使用 AUTOINCREMENT
,它在内部创建一个 sqlite_sequence
表,用于跟踪最后添加的 ID.这在插入时很有用:因为两个事件可能具有相同的时间戳(以秒为单位)(即使使用毫秒或微秒的时间戳也是可能的,操作系统可能会截断精度),我们使用 timestamp*10000 之间的最大值
和 last_added_ID + 1
以确保它是唯一的:
We also use AUTOINCREMENT
which internally creates a sqlite_sequence
table, which keeps track of the last added ID. This is useful when inserting: since it is possible that two events have the same timestamp in seconds (it would be possible even with milliseconds or microseconds timestamps, the OS could truncate the precision), we use the maximum between timestamp*10000
and last_added_ID + 1
to make sure it's unique:
MAX(?, (SELECT seq FROM sqlite_sequence) + 1)
代码:
import sqlite3, random, time
db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT);")
t = 1600000000
for i in range(1000*1000):
if random.randint(0, 100) == 0: # timestamp increases of 1 second with probability 1%
t += 1
db.execute("INSERT INTO data(dt, label) VALUES (MAX(?, (SELECT seq FROM sqlite_sequence) + 1), 'hello');", (t*10000, ))
db.commit()
# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000*10000, 1600005100*10000 # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
pass
print(time.time()-start)
使用 WITHOUT ROWID
表
这是另一种带有 WITHOUT ROWID
的方法,它给出了 8 毫秒 查询时间.我们必须自己实现一个自动递增的 id,因为使用 WITHOUT ROWID
时 AUTOINCREMENT 不可用.
当我们想要使用 PRIMARY KEY(dt, another_column1, another_column2, id)
并避免有额外的 rowid
列时,WITHOUT ROWID
很有用.rowid
有一个 B-tree,(dt, another_column1, ...)
有一个 B-tree,我们只需要一个.
Using a WITHOUT ROWID
table
Here is another method with WITHOUT ROWID
which gives a 8 ms query time. We have to implement an auto-incrementing id ourself, since AUTOINCREMENT is not available when using WITHOUT ROWID
.
WITHOUT ROWID
is useful when we want to use a PRIMARY KEY(dt, another_column1, another_column2, id)
and avoid to have an extra rowid
column. Instead of having one B-tree for rowid
and one B-tree for (dt, another_column1, ...)
, we'll have just one.
db.executescript("""
CREATE TABLE autoinc(num INTEGER); INSERT INTO autoinc(num) VALUES(0);
CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID;
CREATE TRIGGER insert_trigger BEFORE INSERT ON data BEGIN UPDATE autoinc SET num=num+1; END;
""")
t = 1600000000
for i in range(1000*1000):
if random.randint(0, 100) == 0: # timestamp increases of 1 second with probabibly 1%
t += 1
db.execute("INSERT INTO data(dt, id, label) VALUES (?, (SELECT num FROM autoinc), ?);", (t, 'hello'))
db.commit()
# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000, 1600005100 # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
pass
print(time.time()-start)
粗略排序的 UUID
更一般地说,该问题与粗略排序"的 ID 有关.按日期时间.更多信息:
Roughly-sorted UUID
More generally, the problem is linked to having IDs that are "roughly-sorted" by datetime. More about this:
- ULID(通用唯一字典排序标识符)
- 雪花
- MongoDB ObjectId
- ULID (Universally Unique Lexicographically Sortable Identifier)
- Snowflake
- MongoDB ObjectId
所有这些方法都使用一个 ID:
All these methods use an ID which is:
[---- timestamp ----][---- random and/or incremental ----]
这篇关于如何在 Sqlite 中使用时间序列和快速时间范围查询?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:如何在 Sqlite 中使用时间序列和快速时间范围查询?
基础教程推荐
- Sql Server 字符串到日期的转换 2021-01-01
- SQL Server 2016更改对象所有者 2022-01-01
- 在 VB.NET 中更新 SQL Server DateTime 列 2021-01-01
- ERROR 2006 (HY000): MySQL 服务器已经消失 2021-01-01
- SQL Server 中单行 MERGE/upsert 的语法 2021-01-01
- 使用pyodbc“不安全"的Python多处理和数据库访问? 2022-01-01
- SQL Server:只有 GROUP BY 中的最后一个条目 2021-01-01
- 如何在 SQL Server 的嵌套过程中处理事务? 2021-01-01
- 将数据从 MS SQL 迁移到 PostgreSQL? 2022-01-01
- 无法在 ubuntu 中启动 mysql 服务器 2021-01-01