Querying Large Dataset on Join (15+ million rows)(在连接时查询大型数据集(15+ 百万行))
问题描述
我正在尝试加入两个表,products
和 products_markets
.products
不到一百万条记录,而 product_markets
接近 2000 万条记录.数据已更改,因此架构创建表中可能有一两个错字:
I am trying to join two tables, products
and products_markets
. While products
is under a million records, product_markets
is closer to 20 million records. The data has been changed so there might be a typo or two in the schema create tables:
CREATE TABLE `products_markets` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`product_id` int(10) unsigned NOT NULL,
`country_code_id` int(10) unsigned NOT NULL,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_index` (`product_id`,`country_code_id`)
) ENGINE=InnoDB AUTO_INCREMENT=21052102 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
CREATE TABLE `products` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`manufacturer_id` int(10) unsigned NOT NULL,
`department_id` int(10) unsigned NOT NULL,
`code` varchar(100) COLLATE utf8mb4_unicode_ci NOT NULL,
`popularity` int(11) DEFAULT NULL,
`name` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`value` bigint(20) unsigned NOT NULL,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `products_code_unique` (`code`),
KEY `products_department_id_foreign` (`department_id`),
KEY `products_manufacturer_id_foreign` (`manufacturer_id`),
CONSTRAINT `products_department_id_foreign`
FOREIGN KEY (`department_id`) REFERENCES `departments` (`id`),
CONSTRAINT `products_manufacturer_id_foreign`
FOREIGN KEY (`manufacturer_id`) REFERENCES `manufacturers` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=731563 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
我正在尝试返回 50 条特定国家/地区最受欢迎产品的记录,我遇到的时间约为 50 秒,这似乎比预期的要长.
I am trying to return 50 records of the most popular products available in a specific country and I am running into times around ~50 seconds which seems higher than expected.
我尝试了几个不同的查询,但没有成功:
I've tried a few different queries with no success:
select `products_markets`.`product_id`
from products_markets
left join
( SELECT products.id, products.popularity
from products
) p ON p.id = products_markets.product_id
where products_markets.country_code_id = 121
order by `popularity` desc, `p`.`id` asc
limit 50
和
select `products`.*
from `products`
where products.id in (
SELECT product_id
from products_markets
where products_markets.country_code_id = 121
)
group by `products`.`name`, `products`.`manufacturer_id`
order by `popularity` desc, `products`.`id` asc
limit 50
这个查询的解释是:
id select_type table type possible_keys key key_len refs rows extra
1 PRIMARY products ALL PRIMARY NULL NULL NULL 623848 Using temporary; Using filesort
1 PRIMARY products_markets ref unique_index unique_index 4 main.products.id 14 Using where; Using index; FirstMatch(products)
我喜欢的一个选项是将 products_markets 拆分为每个国家/地区的单独表以减少查询.我尝试向服务器添加更多内存但没有取得多大成功.任何人都可以识别出数据库设计/查询有什么明显错误吗?
One option I am entertaining is splitting up products_markets into individual tables for each country to lessen the query. I've tried adding more memory to the server without much success. Can anyone identify anything glaringly wrong with the database design/query?
还有哪些其他选项可以使这个查询只占当前约 50 秒的一小部分?
What other options are available to make this query a fraction of its current ~50 seconds?
推荐答案
去掉products_markets
中的id
并添加
PRIMARY KEY(country_code_id, product_id)
然后去掉 UNIQUE
键,除非其他查询需要它.
Then get rid of the UNIQUE
key unless it is needed for some other query.
这将显着缩小该大表的磁盘占用空间,从而可能加快所有涉及它的查询.
This will shrink the disk footprint of that large table significantly, thereby potentially speeding up all queries touching it.
这将有助于 Hamaza 建议的重新制定.
And it will help with Hamaza's suggested reformulation.
这篇关于在连接时查询大型数据集(15+ 百万行)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:在连接时查询大型数据集(15+ 百万行)
基础教程推荐
- SQL Server 2016更改对象所有者 2022-01-01
- Sql Server 字符串到日期的转换 2021-01-01
- 将数据从 MS SQL 迁移到 PostgreSQL? 2022-01-01
- ERROR 2006 (HY000): MySQL 服务器已经消失 2021-01-01
- 使用pyodbc“不安全"的Python多处理和数据库访问? 2022-01-01
- SQL Server:只有 GROUP BY 中的最后一个条目 2021-01-01
- 无法在 ubuntu 中启动 mysql 服务器 2021-01-01
- 如何在 SQL Server 的嵌套过程中处理事务? 2021-01-01
- 在 VB.NET 中更新 SQL Server DateTime 列 2021-01-01
- SQL Server 中单行 MERGE/upsert 的语法 2021-01-01