Database model for storing expressions and their occurrences in text

database-design postgresql mysql

lisak · 15 years ago

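I'm working on a statistical research application. I need to store words keyed by their first two letters, which gives 676 combinations, and each word has its number of occurrences in the text (minimum, maximum, average). I'm not sure what the model/schema should look like. There will be a lot of checks on whether a keyword has already been persisted. I'd appreciate your suggestions.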

1 answer | as of 15 years ago

araqnid · 15 years ago

Unless you have millions of words, storing just their prefixes seems like a bad plan.

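To add new data to the table, you can simply write the incoming words to a temporary table, then aggregate and merge them in one go at the end of the import run. That is, something like: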

BEGIN;
-- staging table for the raw incoming words, dropped automatically at commit
CREATE TEMP TABLE word_stage(word text) ON COMMIT DROP;
COPY word_stage FROM stdin;
-- use PQputCopyData to send all the words to the db...
SET work_mem = '256MB'; -- use lots of memory for this aggregate..
CREATE TEMP TABLE word_count_stage AS
    SELECT word, count(*) AS occurrences
    FROM word_stage
    GROUP BY word;
-- word should be unique, check that and maybe use this index for merging
ALTER TABLE word_count_stage ADD PRIMARY KEY(word);
-- this UPDATE/INSERT pair is not safe against concurrent modification
LOCK TABLE word_count IN SHARE ROW EXCLUSIVE MODE;
-- now update the existing words in the main table
-- (SET targets must be unqualified column names in PostgreSQL;
--  min/max are compared against the stored min/max, not the running total)
UPDATE word_count
SET occurrences = word_count.occurrences + word_count_stage.occurrences,
    min_occurrences = least(word_count.min_occurrences, word_count_stage.occurrences),
    max_occurrences = greatest(word_count.max_occurrences, word_count_stage.occurrences)
FROM word_count_stage
WHERE word_count_stage.word = word_count.word;
-- and add the new words, if any
INSERT INTO word_count(word, occurrences, min_occurrences, max_occurrences)
  SELECT word, occurrences, occurrences, occurrences
  FROM word_count_stage
  WHERE NOT EXISTS (SELECT 1 FROM word_count WHERE word_count.word = word_count_stage.word);
END;

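So this aggregates a batch of words and then applies them to the word count table. There are indexes on word_stage(word) and word_count(word) to support the merge. (Although it is worth specifying a lowish fill factor, say 60 or so, on word_count.)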
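For reference, here is a minimal sketch of what the main table might look like. The column names are taken from the load script above; the column types and the exact fill factor value are assumptions, not something the question specifies.

```sql
-- Hypothetical definition of the main table. Column names follow the
-- load script; the types are assumptions.
CREATE TABLE word_count (
    word            text   PRIMARY KEY,  -- whole word as key; provides the index on word_count(word)
    occurrences     bigint NOT NULL,     -- running total across all import runs
    min_occurrences bigint NOT NULL,     -- smallest per-batch count seen so far
    max_occurrences bigint NOT NULL      -- largest per-batch count seen so far
) WITH (fillfactor = 60);                -- leave free space in each page for updated rows
```

A fill factor below 100 makes PostgreSQL leave part of each heap page empty on insert, so that later updates of a row have a chance of fitting on the same page. That suits a table where almost every import run updates existing rows.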

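If your input is actually word/occurrences pairs rather than just words (your text isn't very clear), then you can drop the initial word_stage table and COPY straight into word_count_stage.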
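A sketch of that variant, assuming each input line carries one word and its count, and that no word appears twice within a batch (otherwise the primary key would reject the load):

```sql
BEGIN;
-- Assumed input format: one (word, occurrences) pair per line,
-- tab-separated, which is COPY's default text format.
CREATE TEMP TABLE word_count_stage (
    word        text   PRIMARY KEY,
    occurrences bigint NOT NULL
) ON COMMIT DROP;
COPY word_count_stage(word, occurrences) FROM stdin;
-- ...send the pairs with PQputCopyData, then run the same
-- LOCK TABLE / UPDATE / INSERT steps against word_count as in the
-- script above...
END;
```

Since the pairs arrive already aggregated, the GROUP BY step and the larger work_mem setting are no longer needed here.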

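Seriously, I'd try using the whole word as a key, at least to start with: the numbers you quote are well within the bounds of usability. Also note that the loading approach outlined above can easily be modified to truncate each word to its first two characters (or to transform it into a key in any other way), either as the data is moved into word_count_stage, or at the very end by putting the transformation into the UPDATE/INSERT statements (although that way you may lose the benefit of having an index on the temp table).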
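As an illustration of the first option, the aggregation step could be rewritten roughly as follows. This is only a sketch: folding to lower case is an added assumption so that 'The' and 'the' end up under the same prefix, and words shorter than two characters or containing non-letters will produce keys outside the 676 letter pairs.

```sql
-- Variant of the aggregation step: key on the first two characters
-- instead of the whole word. lower() is an assumption, not part of
-- the original script.
CREATE TEMP TABLE word_count_stage AS
    SELECT substr(lower(word), 1, 2) AS word, count(*) AS occurrences
    FROM word_stage
    -- group by the expression itself: a bare "GROUP BY word" would
    -- resolve to the input column and group by the full word again
    GROUP BY substr(lower(word), 1, 2);
-- the prefix is unique after grouping, so the index can still be built
ALTER TABLE word_count_stage ADD PRIMARY KEY(word);
```

Because the transformation happens before the temp table is indexed, the later UPDATE and INSERT statements can stay exactly as they are.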
