
Generating N-grams from a sentence

  •  28
  • Preetam Purbia  · Tech Community  · 14 years ago

    String Input="This is my car."
    

    Input Ngram size = 3
    

    The output should be:

    This
    is
    my
    car
    
    This is
    is my
    my car
    
    This is my
    is my car
    

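    Please suggest how to implement this in Java, or whether there is a library that already does it.

    I tried to use this NGramTokenizer, but it produces n-grams of character sequences, and I want n-grams of word sequences.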

    7 answers  |  as of 12 years ago
        1
  •  26
  •   Shashikant Kore    12 years ago

    You are looking for ShingleFilter.
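
    ShingleFilter is part of Apache Lucene and builds "shingles", that is, word-level n-grams, from a token stream. To illustrate the idea without pulling in Lucene, here is a minimal library-free sketch. The class and method names (ShingleSketch, shingles) are hypothetical and this is not Lucene's API; it only mimics the concept of emitting every run of minSize to maxSize consecutive words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Library-free illustration of word "shingles"; this is NOT Lucene's API.
class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive words, grouped by start position.
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || minSize < 1) {
            return out; // nothing to emit for empty input or a non-positive size
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            // Grow the window from minSize up to maxSize, stopping at the end of the sentence.
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("This is my car", 1, 3));
    }
}
```

    Note that this sketch groups the n-grams by start position rather than by size, so the order differs from the one requested in the question.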

        2
  •  43
  •   aioobe    14 years ago

    I believe this will do what you want:

    import java.util.*;
    
    public class Test {
    
        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i+n));
            return ngrams;
        }
    
        public static String concat(String[] words, int start, int end) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < end; i++)
                sb.append((i > start ? " " : "") + words[i]);
            return sb.toString();
        }
    
        public static void main(String[] args) {
            for (int n = 1; n <= 3; n++) {
                for (String ngram : ngrams(n, "This is my car."))
                    System.out.println(ngram);
                System.out.println();
            }
        }
    }
    

    Output:

    This
    is
    my
    car.
    
    This is
    is my
    my car.
    
    This is my
    is my car.
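
    The output above keeps the trailing period ("car."), whereas the question asks for "car". One possible way to handle this is to normalize punctuation before splitting. The sketch below assumes that a word is a run of letters, digits or apostrophes; the class name CleanNgrams and the helper tokenize are hypothetical, and real text may need a proper tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CleanNgrams {

    // Assumption: anything that is not a letter, digit or apostrophe separates words.
    static String[] tokenize(String str) {
        String cleaned = str.replaceAll("[^\\p{L}\\p{N}']+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    // Returns all n-grams of exactly n words, in sentence order.
    static List<String> ngrams(int n, String str) {
        List<String> result = new ArrayList<>();
        if (n < 1) {
            return result; // no n-grams for a non-positive size
        }
        String[] words = tokenize(str);
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(ngrams(n, "This is my car."));
        }
    }
}
```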
    

    An "on-demand" solution implemented as an Iterator:

    class NgramIterator implements Iterator<String> {
    
        String[] words;
        int pos = 0, n;
    
        public NgramIterator(int n, String str) {
            this.n = n;
            words = str.split(" ");
        }
    
        public boolean hasNext() {
            return pos < words.length - n + 1;
        }
    
        public String next() {
            StringBuilder sb = new StringBuilder();
            for (int i = pos; i < pos + n; i++)
                sb.append((i > pos ? " " : "") + words[i]);
            pos++;
            return sb.toString();
        }
    
        public void remove() {
            throw new UnsupportedOperationException();
        }
    }
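
    If the goal is to use the lazy version in a for-each loop, the same idea can be wrapped in an Iterable. This is only a sketch under a few assumptions: Java 8 or later (so remove() can be left to the default implementation), a hypothetical class name WordNgrams, and splitting on single spaces as in the code above. It also throws NoSuchElementException when next() is called past the end, as the Iterator contract expects.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class WordNgrams implements Iterable<String> {

    private final String[] words;
    private final int n;

    WordNgrams(int n, String str) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
        this.words = str.split(" ");
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int pos = 0; // index of the first word of the next n-gram

            @Override
            public boolean hasNext() {
                return pos + n <= words.length;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                StringBuilder sb = new StringBuilder(words[pos]);
                for (int i = pos + 1; i < pos + n; i++) {
                    sb.append(' ').append(words[i]);
                }
                pos++;
                return sb.toString();
            }
        };
    }

    public static void main(String[] args) {
        // Each n-gram is built only when the loop asks for it.
        for (String ngram : new WordNgrams(2, "This is my car.")) {
            System.out.println(ngram);
        }
    }
}
```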
    
        3
  •  6
  •   Landei    14 years ago

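    public static String[] ngrams(String s, int len) {
        String[] parts = s.split(" ");
        // Guard: a window longer than the sentence yields no n-grams
        // (otherwise the array size below would be negative).
        if (len < 1 || len > parts.length) return new String[0];
        String[] result = new String[parts.length - len + 1];
        for(int i = 0; i < parts.length - len + 1; i++) {
           // Join parts[i .. i+len-1] with single spaces
           StringBuilder sb = new StringBuilder();
           for(int k = 0; k < len; k++) {
               if(k > 0) sb.append(' ');
               sb.append(parts[i+k]);
           }
           result[i] = sb.toString();
        }
        return result;
    }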
    

    For example:

    System.out.println(Arrays.toString(ngrams("This is my car", 2)));
    //--> [This is, is my, my car]
    System.out.println(Arrays.toString(ngrams("This is my car", 3)));
    //--> [This is my, is my car] 
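
    On Java 8 or later, the same sliding window can also be expressed with streams. The following is a sketch rather than a drop-in replacement; the class name StreamNgrams is hypothetical and len is assumed to be at least 1.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class StreamNgrams {

    static String[] ngrams(String s, int len) {
        String[] parts = s.split(" ");
        // IntStream.range is empty when the upper bound is <= 0, so a window
        // longer than the sentence gives an empty array instead of an exception.
        return IntStream.range(0, parts.length - len + 1)
                .mapToObj(i -> String.join(" ", Arrays.copyOfRange(parts, i, i + len)))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ngrams("This is my car", 2)));
        System.out.println(Arrays.toString(ngrams("This is my car", 3)));
    }
}
```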
    
        4
  •  1
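  •   tozCSS    12 years ago

    /**
     * Generates all contiguous word n-grams of size 1 up to maxGramSize.
     *
     * @param str the sentence; should contain at least one word
     * @param maxGramSize the largest n-gram size; should be at least 1
     * @return list of contiguous word n-grams up to maxGramSize from the sentence
     */
    public static List<String> generateNgramsUpto(String str, int maxGramSize) {
    
        // Split on runs of non-word characters; "[\\W+]" would split on every
        // single such character and leave empty tokens between them.
        List<String> sentence = Arrays.asList(str.split("\\W+"));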
    
        List<String> ngrams = new ArrayList<String>();
        int ngramSize = 0;
        StringBuilder sb = null;
    
        //sentence becomes ngrams
        for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
            String word = (String) it.next();
    
            //1- add the word itself
            sb = new StringBuilder(word);
            ngrams.add(word);
            ngramSize=1;
            it.previous();
    
            //2- insert prevs of the word and add those too
            while(it.hasPrevious() && ngramSize<maxGramSize){
                sb.insert(0,' ');
                sb.insert(0,it.previous());
                ngrams.add(sb.toString());
                ngramSize++;
            }
    
            //go back to initial position
            while(ngramSize>0){
                ngramSize--;
                it.next();
            }                   
        }
        return ngrams;
    }
    

    Call:

    long startTime = System.currentTimeMillis();
    List<String> ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
    long stopTime = System.currentTimeMillis();
    System.out.println("My time = "+(stopTime-startTime)+" ms with ngramsize = "+ngrams.size());
    System.out.println(ngrams.toString());
    

    The last line prints:

    [This, is, This is, my, is my, This is my, car, my car, is my car]

        5
  •  1
  •   Dung TQ    11 years ago

    public static void CreateNgram(ArrayList<String> list, int cutoff) {
        try
        {
            NGramModel ngramModel = new NGramModel();
            POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);
            perfMon.start();
            for(int i = 0; i<list.size(); i++)
            {
                String inputString = list.get(i);
                ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));
                String line;
                while ((line = lineStream.read()) != null) 
                {
                    String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
                    String[] tags = tagger.tag(whitespaceTokenizerLine);
    
                    POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
    
                    perfMon.incrementCounter();
    
                    String words[] = sample.getSentence();
    
                    if(words.length > 0)
                    {
                        for(int k = 2; k< 4; k++)
                        {
                            ngramModel.add(new StringList(words), k, k);
                        }
                    }
                }
            }
            ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
            Iterator<StringList> it = ngramModel.iterator();
            while(it.hasNext())
            {
                StringList strList = it.next();
                System.out.println(strList.toString());
            }
            perfMon.stopAndPrintFinalResult();
        }catch(Exception e)
        {
            System.out.println(e.toString());
        }
    }
    

        6
  •  0
  •   M Sach    7 years ago

    public static void main(String[] args) {
    
        String[] words = "This is my car.".split(" ");
        for (int n = 0; n < 3; n++) {
    
            List<String> list = ngrams(n, words);
            for (String ngram : list) {
                System.out.println(ngram);
            }
            System.out.println();
    
        }
    }
    
    // stepSize is the n-gram size minus one, so stepSize = 0 yields single words.
    public static List<String> ngrams(int stepSize, String[] words) {
        List<String> ngrams = new ArrayList<String>();
        for (int i = 0; i < words.length-stepSize; i++) {
    
            // Join words[i .. i+stepSize] with single spaces (no leading space)
            StringBuilder ngram = new StringBuilder(words[i]);
            for (int j = i + 1; j <= i + stepSize; j++) {
                ngram.append(' ').append(words[j]);
            }
            ngrams.add(ngram.toString());
    
        }
        return ngrams;
    }
    
        7
  •  0
  •   Jagesh Maharjan    5 years ago

    Take a look at this:

    public static void main(String[] args) {
        NGram nGram = new NGram();
        String[] tokens = "this is my car".split(" ");
        int i = tokens.length;
        List<String> ngrams = new ArrayList<>();
        while (i >= 1){
            ngrams.addAll(nGram.getNGram(tokens, i, new ArrayList<>()));
            i--;
        }
        System.out.println(ngrams);
    }
    
    private List<String> getNGram(String[] tokens, int n, List<String> ngrams) {
        StringBuilder strbldr = new StringBuilder();
        if (tokens.length < n) {
            return ngrams;
        }else {
            for (int i=0; i<n; i++){
                strbldr.append(tokens[i]).append(" ");
            }
            ngrams.add(strbldr.toString().trim());
            String[] newTokens = Arrays.copyOfRange(tokens, 1, tokens.length);
            return getNGram(newTokens, n, ngrams);
        }
    }
    

    A simple recursive function with a better running time.