
How to optimize a CSV loader using pprof?

  • Michael  · 6 years ago

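    I am trying to optimize a CSV loading process that essentially runs a regex search over a large CSV file (4 GB+, 31033993 records in my experiment). I managed to build multiprocessing logic to read the CSV, but when I profile it with pprof, the regex search looks like the slow part.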


    Here is my code so far:

    package main
    
    import (
        "bufio"
        "flag"
        "fmt"
        "log"
        "os"
        "regexp"
        "runtime"
        "runtime/pprof"
        "strings"
        "sync"
    )
    
    func processFile(path string) [][]string {
        file, err := os.Open(path)
        if err != nil {
            log.Println("Error:", err)
            return nil // nothing to scan if the file cannot be opened
        }
        defer file.Close()
        var pattern = regexp.MustCompile(`^.*foo.*$`)
        numCPU := runtime.NumCPU()
        jobs := make(chan string, numCPU+1)
    
        fmt.Printf("Strategy: Parallel, %d Workers ...\n", numCPU)
    
        results := make(chan []string)
        wg := new(sync.WaitGroup)
        for w := 1; w <= numCPU; w++ {
            wg.Add(1)
            go parseRecord(jobs, results, wg, pattern)
        }
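        go func() {
            scanner := bufio.NewScanner(file)
            for scanner.Scan() {
                jobs <- scanner.Text()
            }
            // Report read errors (for example, a line longer than the scanner buffer).
            if err := scanner.Err(); err != nil {
                log.Println("Error:", err)
            }
            close(jobs)
        }()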
    
        go func() {
            wg.Wait()
            close(results)
        }()
    
        lines := [][]string{}
        for line := range results {
            lines = append(lines, line)
        }
    
        return lines
    }
    
    func parseRecord(jobs <-chan string, results chan<- []string, wg *sync.WaitGroup, pattern *regexp.Regexp) {
        defer wg.Done()
        for j := range jobs {
            if pattern.MatchString(j) {
                // j is already a string holding one line, so x has a single element.
                x := strings.Split(j, "\n")
                results <- x
            }
    
        }
    }
    
    func split(r rune) bool {
        return r == ','
    }
    
    func main() {
        f, err := os.Create("perf.data")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()
    
        pathFlag := flag.String("file", "", `The CSV file to operate on.`)
        flag.Parse()
        lines := processFile(*pathFlag)
        fmt.Println("loaded", len(lines), "records")
    }
    

    When I process the file without any regex constraint, I get a reasonable running time (I simply load the parsed strings into a 2D array, without any pattern.MatchString() call):

    Strategy: Parallel, 8 Workers ...
    loaded 31033993 records
    2018/10/09 11:46:38 readLines took 30.611246035s

    With the regex constraint:

    Strategy: Parallel, 8 Workers ...
    loaded 143090 records
    2018/10/09 12:04:32 readLines took 1m24.029830907s

    1 answer

  • Vorsprung  · 6 years ago

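    MatchString looks for any match within the string, so you can drop the anchors and the wildcards. Wildcards at both ends of a pattern are typically slow in a regexp engine. The benchmark below compares the two patterns: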

    package reggie
    
    import (
            "regexp"
            "testing"
    )
    
    var pattern = regexp.MustCompile(`^.*foo.*$`)
    var pattern2 = regexp.MustCompile(`foo`)
    
    func BenchmarkRegexp(b *testing.B) {
            for i := 0; i < b.N; i++ {
                    pattern.MatchString("youfathairyfoobar")
            }
    }
    
    func BenchmarkRegexp2(b *testing.B) {
            for i := 0; i < b.N; i++ {
                    pattern2.MatchString("youfathairyfoobar")
            }
    }
    
    $ go test -bench=.
    goos: darwin
    goarch: amd64
    BenchmarkRegexp-4        3000000           471 ns/op
    BenchmarkRegexp2-4      20000000           101 ns/op
    PASS
    ok      _/Users/jsandrew/wip/src/reg    4.031s
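
    As a rough sketch of the point above: for single-line input (which is what bufio.Scanner hands to the workers), the anchored pattern and the bare one should accept exactly the same lines, so the shorter pattern can replace the longer one in the worker. The helper names matchAnchored, matchBare and matchLiteral are hypothetical. strings.Contains is an extra option for a purely literal search term; it is an assumption of this sketch and was not part of the benchmark above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Pattern from the question: anchors plus wildcards on both sides.
	anchored = regexp.MustCompile(`^.*foo.*$`)
	// Simplified pattern: MatchString already searches anywhere in the input.
	bare = regexp.MustCompile(`foo`)
)

// matchAnchored reports whether line matches the original pattern.
func matchAnchored(line string) bool { return anchored.MatchString(line) }

// matchBare reports whether line matches the simplified pattern.
func matchBare(line string) bool { return bare.MatchString(line) }

// matchLiteral avoids the regexp engine entirely; only valid for a literal term.
func matchLiteral(line string) bool { return strings.Contains(line, "foo") }

func main() {
	// Sample single-line records (no embedded newlines).
	lines := []string{"youfathairyfoobar", "a,b,foo", "no match here", ""}
	for _, l := range lines {
		fmt.Printf("%q anchored=%t bare=%t literal=%t\n",
			l, matchAnchored(l), matchBare(l), matchLiteral(l))
	}
}
```

    The three helpers agree only because each input is a single line: without the s flag, `.` does not match a newline in Go's regexp package, so the anchored pattern would behave differently on multi-line input.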