代码之家 › 专栏 › 技术社区 › Nick Silvestri

基于另一个表中列的名称在mysql中创建表

sql mysql python

Nick Silvestri · 技术社区 · 7 年前

我有一张133m行16列的桌子。我想在同一服务器上的另一个数据库上为每一列3-16创建14个表(列1和列2是 `id` 和 `timestamp` 它也将出现在最后的14个表中,但是没有自己的表),其中每个表都有原始列的名称。这是否可以只使用sql脚本?在我看来,这似乎是一种更可取、更快捷的方法。

目前,我有一个python脚本,它通过解析原始表的csv转储(使用50行进行测试)、创建新表和添加相关值来“工作”,但速度非常慢(我估计将近1年的时间可以传输所有133M行,这显然是不可接受的)。这是我第一次以任何方式使用sql,我确信我的代码可以加速,但我不确定如何使用,因为我对sql不熟悉。中间的大sql字符串命令是从代码库中的其他代码复制的。我试过使用如下所示的事务,但它似乎对速度没有任何显著影响。

import re
import mysql.connector
import time

# option flags
debug = False  # prints out information during runtime
timing = True  # times the execution time of the program

# save start time for timing. won't be used later if timing is false
start_time = time.time()

# open file for reading
path = 'test_vaisala_sql.csv'
file = open(path, 'r')

# read in column values
column_str = file.readline().strip()
columns = re.split(',vaisala_|,', column_str)  # parse columns with regex to remove commas and vasiala_
if debug:
    print(columns)

# open connection to MySQL server
cnx = mysql.connector.connect(user='root', password='<redacted>',
                              host='127.0.0.1',
                              database='measurements')
cursor = cnx.cursor()

# create the table in the MySQL database if it doesn't already exist
for i in range(2, len(columns)):
    table_name = 'vaisala2_' + columns[i]
    sql_command = "CREATE TABLE IF NOT EXISTS " + \
                  table_name + "(`id` BIGINT(20) NOT NULL AUTO_INCREMENT, " \
                               "`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, " \
                               "`milliseconds` BIGINT(20) NOT NULL DEFAULT '0', " \
                               "`value` varchar(255) DEFAULT NULL, " \
                               "PRIMARY KEY (`id`), " \
                               "UNIQUE KEY `milliseconds` (`milliseconds`)" \
                               "COMMENT 'Eliminates duplicate millisecond values', " \
                               "KEY `timestamp` (`timestamp`)) " \
                               "ENGINE=InnoDB DEFAULT CHARSET=utf8;"

    if debug:
        print("Creating table", table_name, "in database")

    cursor.execute(sql_command)

# read in rest of lines in CSV file
for line in file.readlines():
    cursor.execute("START TRANSACTION;")
    line = line.strip()
    values = re.split(',"|",|,', line)  # regex split along commas, or commas and quotes
    if debug:
        print(values)

    # iterate of each data column. Starts at 2 to eliminate `id` and `timestamp`
    for i in range(2, len(columns)):
        table_name = "vaisala2_" + columns[i]
        timestamp = values[1]

        # translate timestamp back to epoch time
        try:
            pattern = '%Y-%m-%d %H:%M:%S'
            epoch = int(time.mktime(time.strptime(timestamp, pattern)))
            milliseconds = epoch * 1000  # convert seconds to ms
        except ValueError:  # errors default to 0
            milliseconds = 0

        value = values[i]

        # generate SQL command to insert data into destination table
        sql_command = "INSERT IGNORE INTO {} VALUES (NULL,'{}',{},'{}');".format(table_name, timestamp,
                                                                                 milliseconds, value)
        if debug:
            print(sql_command)

        cursor.execute(sql_command)
cnx.commit()  # commits changes in destination MySQL server

# print total execution time
if timing:
    print("Completed in %s seconds" % (time.time() - start_time))

这不需要进行难以置信的优化;如果机器必须运行几天才能完成,这是完全可以接受的。但是一年太长了。

2 回复 | 直到 7 年前

sticky bit 7 年前

您可以从 SELECT 像:

CREATE TABLE <other database name>.<column name>
             AS
             SELECT <column name>
                    FROM <original database name>.<table name>;

(更换 <...> 使用实际的对象名或用其他列或 WHERE 条款或……)

这也会将查询中的数据插入到新表中。这可能是最快的方法。

您可以使用动态sql和目录中的信息(即 information_schema.columns )创建 CREATE 语句或手动创建它们,这很烦人,但我猜对于14列来说是可以接受的。

aydow 7 年前

当使用脚本与数据库对话时,您希望在每条消息产生进一步的执行时间延迟时尽量减少发送的消息数量。目前,看起来好像您正在发送(根据您的近似值)1.33亿条消息,因此,将脚本的速度减慢了1.33亿次。一个简单的优化方法是解析电子表格并将数据拆分到表中(在内存中或保存到磁盘中),然后将数据发送到新的数据库。

正如您所暗示的,编写一个sql脚本来重新分发数据要快得多。