我有一张133m行16列的桌子。我想在同一服务器上的另一个数据库上为每一列3-16创建14个表(列1和列2是
`id`
和
`timestamp`
它也将出现在最后的14个表中,但是没有自己的表),其中每个表都有原始列的名称。这是否可以只使用sql脚本?在我看来,这似乎是一种更可取、更快捷的方法。
目前,我有一个python脚本,它通过解析原始表的csv转储(使用50行进行测试)、创建新表和添加相关值来“工作”,但速度非常慢(我估计将近1年的时间可以传输所有133M行,这显然是不可接受的)。这是我第一次以任何方式使用sql,我确信我的代码可以加速,但我不确定如何使用,因为我对sql不熟悉。中间的大sql字符串命令是从代码库中的其他代码复制的。我试过使用如下所示的事务,但它似乎对速度没有任何显著影响。
import re
import mysql.connector
import time
# option flags
debug = False # prints out information during runtime
timing = True # times the execution time of the program
# save start time for timing. won't be used later if timing is false
start_time = time.time()
# open file for reading
path = 'test_vaisala_sql.csv'
file = open(path, 'r')
# read in column values
column_str = file.readline().strip()
columns = re.split(',vaisala_|,', column_str) # parse columns with regex to remove commas and vasiala_
if debug:
print(columns)
# open connection to MySQL server
cnx = mysql.connector.connect(user='root', password='<redacted>',
host='127.0.0.1',
database='measurements')
cursor = cnx.cursor()
# create the table in the MySQL database if it doesn't already exist
for i in range(2, len(columns)):
table_name = 'vaisala2_' + columns[i]
sql_command = "CREATE TABLE IF NOT EXISTS " + \
table_name + "(`id` BIGINT(20) NOT NULL AUTO_INCREMENT, " \
"`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, " \
"`milliseconds` BIGINT(20) NOT NULL DEFAULT '0', " \
"`value` varchar(255) DEFAULT NULL, " \
"PRIMARY KEY (`id`), " \
"UNIQUE KEY `milliseconds` (`milliseconds`)" \
"COMMENT 'Eliminates duplicate millisecond values', " \
"KEY `timestamp` (`timestamp`)) " \
"ENGINE=InnoDB DEFAULT CHARSET=utf8;"
if debug:
print("Creating table", table_name, "in database")
cursor.execute(sql_command)
# read in rest of lines in CSV file
for line in file.readlines():
cursor.execute("START TRANSACTION;")
line = line.strip()
values = re.split(',"|",|,', line) # regex split along commas, or commas and quotes
if debug:
print(values)
# iterate of each data column. Starts at 2 to eliminate `id` and `timestamp`
for i in range(2, len(columns)):
table_name = "vaisala2_" + columns[i]
timestamp = values[1]
# translate timestamp back to epoch time
try:
pattern = '%Y-%m-%d %H:%M:%S'
epoch = int(time.mktime(time.strptime(timestamp, pattern)))
milliseconds = epoch * 1000 # convert seconds to ms
except ValueError: # errors default to 0
milliseconds = 0
value = values[i]
# generate SQL command to insert data into destination table
sql_command = "INSERT IGNORE INTO {} VALUES (NULL,'{}',{},'{}');".format(table_name, timestamp,
milliseconds, value)
if debug:
print(sql_command)
cursor.execute(sql_command)
cnx.commit() # commits changes in destination MySQL server
# print total execution time
if timing:
print("Completed in %s seconds" % (time.time() - start_time))
这不需要进行难以置信的优化;如果机器必须运行几天才能完成,这是完全可以接受的。但是一年太长了。