I'm running MySQL 5.6 and building an application in Go. I've run into a stubborn query that I can't optimize, so I've tried to break it down into its simplest components. The root issue is that an index on the GROUP BY column gives a constant-factor runtime improvement, rather than the logarithmic improvement I expected.
Here is an example benchmark. Given this data.go:
package data

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

type GroupCount struct {
	Cohort string
	Cnt    uint
}

func GroupByQuery(db *sql.DB) ([]GroupCount, error) {
	var counts []GroupCount
	res, err := db.Query(`
		SELECT cohort, COUNT(cohort) AS cnt
		FROM test_table
		GROUP BY cohort
	`)
	// check the error before deferring Close: a failed Query returns nil rows
	if err != nil {
		log.Println(err)
		return []GroupCount{}, err
	}
	defer res.Close()
	for res.Next() {
		var gc GroupCount
		err := res.Scan(&gc.Cohort, &gc.Cnt)
		if err != nil {
			return []GroupCount{}, err
		}
		counts = append(counts, gc)
	}
	return counts, nil
}
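For reference, this is roughly how I call the helper outside the benchmark; the import path and DSN below are placeholders for my local setup:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"

	"example.com/myapp/data" // placeholder module path for the data package above
)

func main() {
	// Placeholder DSN, same shape as the one used in the benchmark below.
	db, err := sql.Open("mysql", "dbuser:dbpass@tcp(localhost:3306)/testdb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	counts, err := data.GroupByQuery(db)
	if err != nil {
		log.Fatal(err)
	}
	for _, gc := range counts {
		fmt.Printf("%s: %d\n", gc.Cohort, gc.Cnt)
	}
}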
and this data_test.go:
package data

import (
	"database/sql"
	"fmt"
	"math/rand"
	"testing"
	"time"
)

func BenchmarkGroupByQuery(b *testing.B) {
	// local db connection
	db, err := sql.Open("mysql", "dbuser:dbpass@tcp(localhost:3306)/testdb")
	if err != nil {
		b.Fatal(err)
	}
	defer db.Close()
	// declare db table
	db.Exec("DROP TABLE test_table")
	_, err = db.Exec("CREATE TABLE test_table (id INT, cohort VARCHAR(255))")
	if err != nil {
		b.Fatal(err)
	}
	// comment in or out to test index
	// _, err = db.Exec("CREATE INDEX idx_cohort ON test_table (cohort)")
	// if err != nil {
	// 	b.Fatal(err)
	// }
	// insert some data into the table
	n := 100000
	stmt := "INSERT INTO test_table VALUES (%d, %s)"
	rand.Seed(time.Now().UnixNano())
	for i := 0; i < n; i++ {
		j := rand.Intn(5)
		insertStmt := fmt.Sprintf(stmt, i, fmt.Sprintf("\"group%d\"", j))
		_, err := db.Exec(insertStmt)
		if err != nil {
			b.Error(err)
		}
	}
	b.ResetTimer()
	// run the query being benchmarked
	for i := 0; i < b.N; i++ {
		GroupByQuery(db)
	}
}
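As an aside, the row-by-row inserts dominate the setup time for large n. A batched variant along these lines (the helper name and batch size are arbitrary, and it needs the standard strings package on top of the imports above) sends one multi-row INSERT with placeholders per chunk. It only speeds up building the fixture and doesn't touch the timed section after b.ResetTimer():

// insertRows is a hypothetical replacement for the insert loop above: it
// loads n rows in chunks, issuing one multi-row INSERT with placeholders
// per chunk instead of one statement per row.
// Requires "strings" in addition to the imports already in data_test.go.
func insertRows(b *testing.B, db *sql.DB, n, batchSize int) {
	for start := 0; start < n; start += batchSize {
		end := start + batchSize
		if end > n {
			end = n
		}
		placeholders := make([]string, 0, end-start)
		args := make([]interface{}, 0, 2*(end-start))
		for i := start; i < end; i++ {
			placeholders = append(placeholders, "(?, ?)")
			args = append(args, i, fmt.Sprintf("group%d", rand.Intn(5)))
		}
		query := "INSERT INTO test_table VALUES " + strings.Join(placeholders, ", ")
		if _, err := db.Exec(query, args...); err != nil {
			b.Fatal(err)
		}
	}
}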
To reproduce this, you'll need an appropriate database set up in MySQL with user access. Running the benchmark above (go test . --bench BenchmarkGroupByQuery) while varying n, I get the following results:
unindexed
n = 100    : 0.5 ms/op
n = 1000   : 2.3 ms/op
n = 10000  : 20.1 ms/op
n = 100000 : 215.6 ms/op

indexed
n = 100    : 0.2 ms/op
n = 1000   : 0.7 ms/op
n = 10000  : 6.0 ms/op
n = 100000 : 59.6 ms/op
I can verify that, given a large dataset split into 5 groups, the query produces different execution plans with and without the index, showing whether the index is used.
Without the index:
mysql> EXPLAIN SELECT cohort, COUNT(cohort) AS cnt FROM test_table GROUP BY cohort;
+----+-------------+------------+------+---------------+------+---------+------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+--------+---------------------------------+
| 1 | SIMPLE | test_table | ALL | NULL | NULL | NULL | NULL | 100256 | Using temporary; Using filesort |
+----+-------------+------------+------+---------------+------+---------+------+--------+---------------------------------+
1 row in set (0.00 sec)
With the index:
mysql> EXPLAIN SELECT cohort, COUNT(cohort) AS cnt FROM test_table GROUP BY cohort;
+----+-------------+------------+-------+---------------+------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+------------+---------+------+--------+-------------+
| 1 | SIMPLE | test_table | index | idx_cohort | idx_cohort | 768 | NULL | 100256 | Using index |
+----+-------------+------------+-------+---------------+------------+---------+------+--------+-------------+
1 row in set (0.00 sec)
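For completeness, the same plan check can be run from Go over the exact connection the benchmark uses. This is only a sketch (the helper name is arbitrary, and it assumes the database/sql and fmt imports already shown) that scans whatever columns EXPLAIN returns as raw bytes and prints them:

// explainQuery is a hypothetical helper that prints MySQL's plan for the
// benchmark query, scanning each EXPLAIN column as raw bytes so it works
// regardless of the exact columns the server returns.
func explainQuery(db *sql.DB) error {
	rows, err := db.Query("EXPLAIN SELECT cohort, COUNT(cohort) AS cnt FROM test_table GROUP BY cohort")
	if err != nil {
		return err
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		return err
	}
	values := make([]sql.RawBytes, len(cols))
	dest := make([]interface{}, len(cols))
	for i := range values {
		dest[i] = &values[i]
	}
	for rows.Next() {
		if err := rows.Scan(dest...); err != nil {
			return err
		}
		for i, col := range cols {
			fmt.Printf("%s=%s ", col, values[i])
		}
		fmt.Println()
	}
	return rows.Err()
}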
Finally, here is the result of the query itself (somewhat random, given the setup in the benchmark):
mysql> SELECT cohort, COUNT(cohort) AS cnt FROM test_table GROUP BY cohort;
+--------+-------+
| cohort | cnt |
+--------+-------+
| group0 | 19928 |
| group1 | 19791 |
| group2 | 19916 |
| group3 | 20282 |
| group4 | 20083 |
+--------+-------+
5 rows in set (0.07 sec)
These results really surprised me. Essentially, adding an index on the column we group by gives a roughly constant runtime improvement factor of about 0.3. I don't understand why it provides neither a logarithmic runtime improvement nor no improvement at all.