Replace a comma (,) only if it is inside quotes ("") in Pig

  •  0
  • OneUser  · 7 years ago

    I have data like this:

    1,234,"john, lee", john@xyz.com
    

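    I want to use Pig to remove the comma (,) that appears inside the double quotes (""), and drop the quotes themselves, so that my data looks like this: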

    1,234,john lee, john@xyz.com
    

    I am stuck on this. Any help would be greatly appreciated. Thanks.

    3 replies  |  as of 7 years ago
        1
  •  1
  •   Taha Naqvi    7 years ago

    The following commands will help:

    -- PigStorage(',') splits on every comma, so the quoted name arrives as two fields: $2 ("john) and $3 ( lee").
    csvFile = load '/path/to/file' using PigStorage(',');
    result = foreach csvFile generate $0 as (field1:chararray), $1 as (field2:chararray),
             CONCAT(REPLACE($2, '\\"', ''), REPLACE($3, '\\"', '')) as field3, $4 as (field4:chararray);
    

    Output:

    (1,234,john lee, john@xyz.com)
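
    To see why this answer concatenates $2 and $3, here is a minimal Python sketch of the same steps. It is only a stand-in for the Pig script, assuming that PigStorage(',') behaves like a plain split on every comma with no awareness of quotes; the variable names are illustrative.

```python
# Stand-in for PigStorage(','): split on every comma, quotes are not special.
line = '1,234,"john, lee", john@xyz.com'
fields = line.split(',')

# The quoted name has been cut in two: fields[2] is '"john', fields[3] is ' lee"'.
# Strip the quote from each half and glue them back together (CONCAT + REPLACE).
name = fields[2].replace('"', '') + fields[3].replace('"', '')

row = (fields[0], fields[1], name, fields[4])
print(fields)
print(row)
```

    The sketch also shows the limits of the approach: it only works when the quoted value sits in the third column and contains exactly one comma.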

        2
  •  0
  •   nobody    7 years ago

    Load each line into a single field, then use STRSPLIT and REPLACE:

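    A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
    -- Split on the double quote into at most 3 parts; FLATTEN turns the resulting tuple into separate fields.
    B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\"', 3)) AS (head:chararray, quoted:chararray, tail:chararray);
    -- Remove the comma from the quoted part only, keeping the other two parts.
    C = FOREACH B GENERATE head, REPLACE(quoted, ',', '') AS quoted, tail;
    D = FOREACH C GENERATE CONCAT(CONCAT(head, quoted), tail) AS line; -- You can further use STRSPLIT to get individual fields or just CONCAT
    E = FOREACH D GENERATE FLATTEN(STRSPLIT(line, ',', 4));
    DUMP E;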
    

    The relation at each step (each field shown in its own parentheses):

    A

    1,234,"john, lee", john@xyz.com
    

    B

    (1,234,)(john, lee)(, john@xyz.com)
    

    C

    (1,234,)(john lee)(, john@xyz.com)
    

    D

    (1,234,john lee, john@xyz.com)
    

    E

    (1),(234),(john lee),(john@xyz.com)
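
    The steps A through E can be traced with a short Python sketch. This is an illustration of the logic rather than Pig code, assuming each line contains exactly one quoted value; the names head, quoted and tail are made up for the example.

```python
line = '1,234,"john, lee", john@xyz.com'            # A: the raw line

# B: split on the double quote into at most three parts.
head, quoted, tail = line.split('"', 2)

# C: remove the comma from the quoted part only.
quoted_clean = quoted.replace(',', '')

# D: glue the three parts back into one line.
joined = head + quoted_clean + tail

# E: split on commas into at most four fields.
fields = joined.split(',', 3)
print(fields)
```

    Note that the last field keeps its leading space, because the input has a space after the final comma.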
    
        3
  •  0
  •   OneUser    7 years ago

    I found a way that works well. A very general solution is as follows:

    -- Assumption: the data contains no tab characters, so with the tab delimiter each whole line loads into record.
    -- (With ',' as the delimiter, record would hold only the first column.)
    data = LOAD 'data.csv' using PigStorage('\t','-tagFile') AS (filename:chararray, record:chararray);
    
    /* remove a comma (,) when it appears inside a quoted column value */
    replaceComma = FOREACH data GENERATE filename, REPLACE(record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '') AS record;
    
    /* remove the double quotes ("") that CSV places around a column value containing a comma */
    replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE(record, '"', '') AS record;
    

    See my blog for a detailed use case.
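
    The regular expression in this answer matches a comma only when the rest of the line does not contain an even number of double quotes, which means the comma sits inside an open quoted value. The Python sketch below exercises the same pattern with the `re` module as a stand-in for Pig's REPLACE; the function name clean_line is hypothetical, and the sketch assumes quotes are balanced and never escaped inside a value.

```python
import re

# A comma followed by an odd number of quotes up to end of line is inside quotes.
# The lookahead describes "an even number of quotes remain"; negating it selects
# exactly the commas that are enclosed in a quoted value.
COMMA_INSIDE_QUOTES = re.compile(r',(?!(([^"]*"){2})*[^"]*$)')

def clean_line(line):
    """Drop commas inside quoted values, then drop the quotes themselves."""
    without_inner_commas = COMMA_INSIDE_QUOTES.sub('', line)
    return without_inner_commas.replace('"', '')

print(clean_line('1,234,"john, lee", john@xyz.com'))
print(clean_line('a,"b,c,d",e,"f,g"'))
```

    Unlike the two earlier answers, this handles any number of quoted values per line and any number of commas inside each one.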