
CSV parsing returns "Unquoted fields do not allow \r or \n", but I can't find the error in the source file?

error-handling parsing csv ruby ruby-on-rails

sulleh · 7 years ago

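I'm using Ruby's built-in CSV library in my Rails application. I fetch a URL (via HTTParty), parse the response, and try to save the results to my database.

The problem is that I get the error Unquoted fields do not allow \r or \n. This usually indicates a problem with the input data, but when I inspect the data I can't find anything wrong.

Here is how I retrieve the data: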

response = HTTParty.get("http://" + "weather.com/ads.txt", limit: 100, follow_redirects: true, timeout: 10)

Then I try to parse the data, applying a regex to ignore lines starting with #, skip blank lines, and so on.

    if response.code == 200 && !response.body.match(/<.*html>/)
      active_policies = []

      CSV.parse(response.body, skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/) do |row|
        begin
          # print out the individual ads.txt records
          puts ""
          print row[0].downcase.strip + " " + row[1].strip + " " +
                row[2].split("#").first.strip
          active_policies.push(
            publisher.policies.find_or_create_by(ad_partner: row[0].downcase.strip, external_seller_id: row[1].strip, seller_relationship: row[2].split("#").first.strip) do |policy|
              policy.deactivated_at = nil
            end
          )
        rescue => save
          # Add error event to the new sync status model
          puts "we are in the loop"
          puts save.message, row.inspect, save.backtrace
          next
        end
      end
      #else
        #puts "Too many policies.  Skipping " + publisher.name
      #end
      # now we are going to run a check to see if we have any policies that are outdated, and if so, flag them as such.
      deactivated_policies = publisher.policies.where.not(id: active_policies.map(&:id)).where(deactivated_at: nil)
      deactivated_policies.update_all(deactivated_at: Time.now)
      deactivated_policies.each do |deactivated_policy|
        puts "Deactivating Policy for " + deactivated_policy.publisher.name
      end

    elsif response.code == 404
      print
      print response.code.to_s + " GET, " + response.body.size.to_s + " body, "
      puts response.headers.size.to_s + " headers for " + publisher.name

    elsif response.code == 302
      print response.code.to_s + " GET, " + publisher.name
    else
      puts response.code.to_s + " GET ads txt not found on " + publisher.name
    end

    publisher.update(last_scan: Time.now)

  rescue => ex
    puts ex.message, ex.backtrace, "error pulling #{publisher.name} ..."
    #publisher.update_columns(active: "false")
  end
end

I have a few ideas/findings:

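1. I tried isolating the offending lines with CSV.parse(response.body.lines[140..400].join("\n"), skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/), but this doesn't help: even if I pin down line 134 as the offending line, I don't know how to detect or handle it.
2. Adding response.body.force_encoding("UTF-8") still throws the error.
3. I added next to the rescue block so that, even when it hits an error, it would move on to the next row in the CSV. That doesn't happen: it just errors out and stops parsing, so I get the first 130 entries but none of the rest.
4. As for the page type, I'm not sure whether the page being served as HTML rather than as a text file could be causing a problem here.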

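I'd love to know how to detect and handle this error, so any ideas here are very welcome!

For reference, #PBS appears to be line 134, the line in the source file that is giving me trouble, but I'm not entirely convinced that this is the problem.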

#canada

google.com, pub-0942427266003794, DIRECT, f08c47fec0942fa0
indexexchange.com, 184315, DIRECT
indexexchange.com, 184601, DIRECT
indexexchange.com, 182960, DIRECT
openx.com, 539462051, DIRECT, 6a698e2ec38604c6

#spain

#PBS
google.com, pub-8750086020675820, DIRECT, f08c47fec0942fa0
google.com, pub-1072712229542583, DIRECT, f08c47fec0942fa0
appnexus.com, 3872, DIRECT
rubiconproject.com, 9778, DIRECT, 0bfd66d529a55807
openx.com, 539967419, DIRECT, 6a698e2ec38604c6
openx.com, 539726051, DIRECT, 6a698e2ec38604c6
google.com, pub-7442858011436823, DIRECT, f08c47fec0942fa0

2 answers | as of 7 years ago

John Skiles Skinner · 7 years ago

That text has inconsistent line endings, and the CSV parser is tripping over them. A quick fix is to remove all the \r characters:

response.body.gsub!("\r", '')
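A minimal sketch of why this helps, using a made-up sample string rather than the real weather.com file. By default Ruby's CSV guesses the row separator from the first line ending it sees, so once a file mixes \r\n with bare \n, the other kind can end up inside an unquoted field. The helper name normalize_line_endings is hypothetical; unlike deleting \r outright, it also turns a lone \r into a line break.

```ruby
require "csv"

# Hypothetical helper: convert "\r\n" and lone "\r" endings to "\n".
def normalize_line_endings(text)
  text.gsub(/\r\n?/, "\n")
end

# Made-up sample: the first line ends in "\r\n", the others in "\n".
mixed = "google.com, pub-1, DIRECT\r\nopenx.com, 42, DIRECT\nappnexus.com, 7, DIRECT\n"

# After normalizing, every row uses the same separator and parsing succeeds.
rows = CSV.parse(normalize_line_endings(mixed))
rows.each { |row| puts row.map(&:strip).join(" | ") }
```

CSV keeps the space after each comma, which is why the fields are stripped before printing, as the question's code already does.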

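To see the line endings for yourself, dump the characters of the response to a file:

response = HTTParty.get("http://" + "weather.com/ads.txt", limit: 100, follow_redirects: true, timeout: 10)
# inspect makes control characters such as \r and \n visible
characters = response.body.chars.inspect
# the block form closes the file automatically
File.open("outputfile.txt", "w") { |output| output << characters }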

Open outputfile.txt and look for the \r and \n characters.
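The same check can be done in code instead of by eye. This is only a sketch, assuming the body is already in a string; the helper name line_ending_counts and the sample input are made up for illustration.

```ruby
# Hypothetical helper: tally each kind of line ending in a string.
def line_ending_counts(text)
  {
    crlf: text.scan(/\r\n/).size,      # "\r\n" pairs
    lf:   text.scan(/(?<!\r)\n/).size, # "\n" not preceded by "\r"
    cr:   text.scan(/\r(?!\n)/).size   # "\r" not followed by "\n"
  }
end

# Made-up sample containing one ending of each kind.
counts = line_ending_counts("a,1\r\nb,2\nc,3\rd,4")
puts counts.inspect
puts "mixed line endings" if counts.values.count(&:positive?) > 1
```

More than one non-zero count means the text mixes line-ending conventions, which is the situation described in this answer.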

Billy Kimble · 7 years ago

You can fix this by pre-processing the file before CSV gets to it and removing the \r characters:

Change:

CSV.parse(response.body, skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/) do |row|

To:

CSV.parse(response.body.tr("\r", ''), skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/) do |row|
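On the question's point that next in the rescue block never skips the bad row: CSV.parse raises while it reads the next row, outside the block's own begin/rescue, so the exception lands in the outer rescue and parsing stops. One possible alternative, sketched here on the assumption that ads.txt records never span multiple lines, is to normalize the line endings and then parse one line at a time with CSV.parse_line, so that a malformed line is recorded and skipped. The method name parse_ads_txt, the simplified skip pattern, and the sample body are illustrative and not taken from the original code.

```ruby
require "csv"

# Simplified, illustrative skip pattern: comments, blank lines,
# and contact=/subdomain= lines (case-insensitive).
SKIP_LINE = /\A\s*(?:#|\z|contact=|subdomain=)/i

# Hypothetical helper: returns [rows, errors], where errors holds
# [line_number, message] pairs for lines CSV could not parse.
def parse_ads_txt(body)
  rows = []
  errors = []
  body.gsub(/\r\n?/, "\n").each_line.with_index(1) do |line, number|
    next if line.match?(SKIP_LINE)
    begin
      row = CSV.parse_line(line)
      rows << row.map { |field| field.to_s.strip } if row
    rescue CSV::MalformedCSVError => e
      errors << [number, e.message] # record the bad line and keep going
    end
  end
  [rows, errors]
end

# Made-up sample with mixed line endings and one malformed line.
body = "#PBS\r\ngoogle.com, pub-1, DIRECT\nbad \"quote\", x, DIRECT\n\nopenx.com, 42, DIRECT\r\n"
rows, errors = parse_ads_txt(body)
rows.each { |row| puts row.join(" ") }
errors.each { |number, message| warn "line #{number}: #{message}" }
```

The trade-off is that quoted fields containing line breaks would no longer parse, which is why the single-line assumption matters.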