代码之家 › 专栏 › 技术社区 › DMI

聚合具有优先权的SQL行

aggregation group-by mysql sql

DMI · 技术社区 · 15 年前

我有一张桌子,上面摆满了来自不同来源的物品。有些来源可能有相同的位置(在我的例子中,不同的BBC新闻源可能是不同的来源,但它们都来自BBC)。每个项目都有一个“唯一”的ID,可以用来在同一位置的其他项目中识别它。这意味着,与同一个网站上的同一新闻故事相关但在不同源下发布的项目将具有相同的“唯一ID”,但这不一定是全局唯一的。

问题是,我希望在显示时消除重复,以便(取决于您看到的提要)每个报道最多只能获得一个版本,即使其中两个或三个提要可能包含指向它的链接。

我有一个 sources 包含每个源的信息的表,以及 location_id 和 location_precedence 领域。然后我有一个 items 包含每个项的表,其 unique_id , source_id 和 content .相同的项目 单桅帆船 来源 位置标识 至多出现一次,具有最高的源 位置优先 获胜。

我本以为是这样的:

SELECT `sources`.`name` AS `source`,
       `items`.`content`,
       `items`.`published`
FROM `items` INNER JOIN `sources`
  ON `items`.`source_id` = `sources`.`id` AND `sources`.`active` = 1
GROUP BY `items`.`unique_id`, `sources`.`location_id`
ORDER BY `sources`.`location_priority` DESC

可以做到,但这似乎忽略了位置优先字段。我错过了什么?

实例数据:

CREATE TABLE IF NOT EXISTS `sources` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `location_id` int(10) unsigned NOT NULL,
  `location_priority` int(11) NOT NULL,
  `active` tinyint(1) unsigned NOT NULL default '1',
  `name` varchar(150) NOT NULL,
  `url` text NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `active` (`active`)
);

INSERT INTO `sources` (`id`, `location_id`, `location_priority`, `active`, `name`, `url`) VALUES
(1, 1, 25, 1, 'BBC News Front Page', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml'),
(2, 1, 10, 1, 'BBC News England', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/england/rss.xml'),
(3, 1, 15, 1, 'BBC Technology News', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/technology/rss.xml'),
(4, 2, 0, 1, 'Slashdot', 'http://rss.slashdot.org/Slashdot/slashdot'),
(5, 3, 0, 1, 'The Daily WTF', 'http://syndication.thedailywtf.com/TheDailyWtf');

CREATE TABLE IF NOT EXISTS `items` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `source_id` int(10) unsigned NOT NULL,
  `published` datetime NOT NULL,
  `content` text NOT NULL,
  `unique_id` varchar(255) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `unique_id` (`unique_id`,`source_id`),
  KEY `published` (`published`),
  KEY `source_id` (`source_id`)
);

INSERT INTO `items` (`id`, `source_id`, `published`, `content`, `unique_id`) VALUES
(1,  1, '2009-12-01 16:25:53', 'Story about Subject One',                     'abc'),
(2,  2, '2009-12-01 16:21:31', 'Subject One in story',                        'abc'),
(3,  3, '2009-12-01 16:17:20', 'Techy goodness',                              'def'),
(4,  2, '2009-12-01 16:05:57', 'Further updates on Foo case',                 'ghi'),
(5,  3, '2009-12-01 15:53:39', 'Foo, Bar and Quux in court battle',           'ghi'),
(6,  2, '2009-12-01 15:52:02', 'Anti-Fubar protests cause disquiet',          'mno'),
(7,  4, '2009-12-01 15:39:00', 'Microsoft Bleh meets lukewarm reception',     'pqr'),
(8,  5, '2009-12-01 15:13:45', 'Ever thought about doing it in VB?',          'pqr'),
(9,  1, '2009-12-01 15:13:15', 'Celebrity has &#039;new friend&#039;',        'pqr'),
(10, 1, '2009-12-01 15:09:57', 'Microsoft launches Bleh worldwide',           'stu'),
(11, 2, '2009-12-01 14:57:22', 'Microsoft launches Bleh in UK',               'stu'),
(12, 3, '2009-12-01 14:57:22', 'Microsoft launches Bleh',                     'stu'),
(13, 3, '2009-12-01 14:42:15', 'Tech round-up',                               'vwx'),
(14, 2, '2009-12-01 14:36:26', 'Estates &#039;old news&#039; say government', 'yza'),
(15, 1, '2009-12-01 14:15:21', 'Iranian doctor &#039;was poisoned&#039;',     'bcd'),
(16, 4, '2009-12-01 14:14:02', 'Apple fans overjoyed by iBlah',               'axf');

查询后需要的内容:

关于第一主题的故事
技术善
在法庭上的对决中
反Fubar抗议引起不安
Microsoft Bleh遇到冷淡的接待
有没有想过用VB做?
名人有“新朋友”
微软在世界范围内推出Bleh
技术综述
地产“老消息”说政府
伊朗医生“中毒”
苹果迷对伊布拉欣喜若狂

我尝试了Andomar解决方案的一个变体,并取得了一些成功:

SELECT      s.`name` AS `source`,
            i.`content`,
            i.`published`
FROM        `items` i
INNER JOIN  `sources` s
ON          i.`source_id` = s.`id`
AND         s.`active` = 1
INNER JOIN (
  SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
  FROM `items` i
  INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1
  GROUP BY `location_id`, `unique_id`
) `filter`
ON          i.`unique_id` = `filter`.`unique_id`
AND         s.`location_priority` = `filter`.`prio`
ORDER BY    i.`published` DESC
LIMIT 50

用 AND s.location_priority = filter.prio 一切都按我的意愿进行。由于项目可以来自具有相同优先级的多个源,因此可以重复项目。在这种情况下,额外的 GROUP BY i.unique_id 在外部查询中,它完成了任务,如果优先级相等,那么哪个源“获胜”并不重要。

我试过了 AND i.source_id = filter.source_id 相反,它几乎起作用(即消除多余的 GROUP BY )但不能给出正确来源的结果。在上面的例子中,它给了我“关于foo案件的进一步更新”(来源“BBC新闻英格兰”),而不是“法庭上的foo、bar和quux”(来源“BBC技术新闻”)。查看内部查询的结果,我得到:

unique_id: 'ghi'
source_id: 2
prio: 15

请注意,源ID不正确(应为:3)。

3 回复 | 直到 15 年前

Andomar 15 年前

Order by 只是命令行,而不是从中挑选。

筛选行的方法之一 location_priority 是使用 inner join 作为过滤器:

SELECT     s.name, i.content, i.published
FROM       items i 
INNER JOIN sources s
ON         i.source_id = s.id
AND        s.active = 1
INNER JOIN (
    SELECT unique_id, max(location_priority) as prio
    FROM items i
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
    GROUP BY unique_id) filter
ON         i.unique_id = filter.unique_id
AND        s.location_priority = filter.prio;

另一种选择是 where ... in <subquery> 子句,例如:

SELECT     s.name, i.content, i.published
FROM       items i 
INNER JOIN sources s
ON         i.source_id = s.id
AND        s.active = 1
WHERE      (i.unique_id, s.location_priority) IN (
    SELECT unique_id, max(location_priority)
    FROM items i
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
    GROUP BY unique_id
);

这个问题也被称为“选择最大值在整个组内的记录”。 nice article 在上面。

编辑:以相同优先级中断与多个源的关系的一种方法是 WHERE 带有子查询的子句。这个例子打破了 i.id DESC :

SELECT     s.name, i.unique_id, i.content, i.published
FROM       (
           SELECT unique_id, min(location_priority) as prio
           FROM items i
           INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
           GROUP BY unique_id
           ) filter
JOIN       items i
JOIN       sources s
ON         s.id = i.source_id 
           AND s.active = 1
WHERE      i.id =
           (
           SELECT   i.id
           FROM     items i
           JOIN     sources s 
           ON       s.id = i.source_id 
                    AND s.active = 1
           WHERE    i.unique_id = filter.unique_id
           AND      s.location_priority = filter.prio
           ORDER BY i.id DESC
           LIMIT 1
           )

Quassnoi也有一篇关于 selecting records holding group-wise maximum (resolving ties) :)

Paul Creasey 15 年前

像这样对派生表执行自联接

select max(location_priority) from table where ...

outis 15 年前

我错过了什么?

这个 ORDER BY 发生在 GROUP BY 已将每个组缩减为一行。保罗给出了一个决议。

关于查询的问题:

SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
FROM `items` i
INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1
GROUP BY `location_id`, `unique_id`

source_id 既不是聚合也不是分组。因此,您得到的值是不确定的。