
Split a vector by upper and lower case

  •  1
  • s__  · 6 years ago

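    I want to split each string into its upper-case words and its remaining words. I have tried this and this, but I could not get them to work with my data.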

        # here is my data
        data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                                    ,"OTHER UPPER CASES   And other words"
                                    , "Some lower cases        AND UPPER CASES"
                                    ,"ONLY UPPER CASES"
                                    ,"Only lower cases, maybe"
                                    ,"UPPER lower UPPER!"))
        data
                                             text
        1 SOME UPPERCASES     And some Lower Cases
        2      OTHER UPPER CASES   And other words
        3  Some lower cases        AND UPPER CASES
        4                         ONLY UPPER CASES
        5                  Only lower cases, maybe
        6                        UPPER lower UPPER!
    

    The desired result should look like this:

           V1                  V2
    1      SOME UPPERCASES     And some Lower Cases
    2      OTHER UPPER CASES   And other words
    3      AND UPPER CASES     Some lower cases        
    4      ONLY UPPER CASES    NA
    5      NA                  Only lower cases, maybe
    6      UPPER UPPER!        lower
    

    strsplit(x = data$text[1], split = "[[:upper:]]")   # error
    gsub('([[:upper:]])', ' \\1', data$text[1])         # not the result I want
    
    library(reshape)
    # no good result either
    transform(data, FOO = colsplit(data$text[1], split = "[[:upper:]]", names = c('a', 'b')))
    
    3 replies  |  as of 6 years ago
        1
  •  1
  •   Andre Elrico    6 years ago

    Data:

    data <- data.frame(text = c("SOME UPPERCASES     And some Lower Cases"
                                ,"OTHER UPPER CASES   And other words"
                                , "Some lower cases        AND UPPER CASES"
                                ,"ONLY UPPER CASES"
                                ,"Only lower cases, maybe"
                                ,"UPPER lower UPPER!"))
    

    library(magrittr)
    
    # words consisting only of upper-case letters
    UpperCol    <- regmatches(data$text, gregexpr("\\b[A-Z]+\\b", data$text)) %>%
      lapply(paste, collapse = " ") %>% unlist
    # all other words: the negative lookahead skips all-upper-case words
    notUpperCol <- regmatches(data$text, gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b", data$text, perl = TRUE)) %>%
      lapply(paste, collapse = " ") %>% unlist
    
    result <- data.frame(I(UpperCol), I(notUpperCol))
    result[result == ""] <- NA
    

    Result:

    #           UpperCol            notUpperCol
    #1   SOME UPPERCASES   And some Lower Cases
    #2 OTHER UPPER CASES        And other words
    #3   AND UPPER CASES       Some lower cases
    #4  ONLY UPPER CASES                   <NA>
    #5              <NA> Only lower cases maybe
    #6       UPPER UPPER                  lower
    

    • regex
    • Thanks to Wiktor Stribiżew for some optimizations.
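
To make it easier to see what the two patterns in this answer match, here is a minimal sketch that applies them to a single string from the question. The helper `extract_words` is a hypothetical name introduced only for this illustration; it is not part of the answer.

```r
# Hypothetical helper (not from the answer): extract all matches of
# `pattern` from one string and join them with single spaces.
extract_words <- function(x, pattern) {
  matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  paste(matches, collapse = " ")
}

x <- "Some lower cases        AND UPPER CASES"

# \b[A-Z]+\b matches whole words made up only of the letters A-Z.
upper <- extract_words(x, "\\b[A-Z]+\\b")

# The negative lookahead (?![A-Z]+\b) rejects a word at its start if the
# whole word is upper case, so only the remaining words are matched.
not_upper <- extract_words(x, "\\b(?![A-Z]+\\b)[a-zA-Z]+\\b")

print(upper)
print(not_upper)
```

Neither pattern includes punctuation, which is why the `!` and the comma are missing from the result shown in this answer.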
        2
  •  1
  •   Jaap    6 years ago

    With the stringi package:

    library(stringi)
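    # words consisting only of upper-case letters (NA where there are none)
    l1 <- stri_extract_all_regex(data$text, "\\b[A-Z]+\\b")
    # all words minus the upper-case ones; SIMPLIFY = FALSE always returns a list
    l2 <- mapply(setdiff, stri_extract_all_words(data$text), l1, SIMPLIFY = FALSE)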
    
    res <- data.frame(all_upper = sapply(l1, paste, collapse = " "),
                      not_all_upper = sapply(l2, paste, collapse = " "),
                      stringsAsFactors = FALSE)
    res[res == "NA"] <- NA
    res[res == ""] <- NA
    

    which gives:

    > res
              all_upper          not_all_upper
    1   SOME UPPERCASES   And some Lower Cases
    2 OTHER UPPER CASES        And other words
    3   AND UPPER CASES       Some lower cases
    4  ONLY UPPER CASES                   <NA>
    5              <NA> Only lower cases maybe
    6       UPPER UPPER                  lower
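
One thing that may matter with other data: `setdiff()` treats its inputs as sets, so a word that occurs twice in the non-upper-case part is returned only once. A minimal base R sketch of the difference, using made-up example values (the vectors `words` and `upper` are assumptions for illustration, not data from the question):

```r
# Words from one string, already split (hypothetical example values).
words <- c("some", "BIG", "words", "some")
upper <- "BIG"

# setdiff() works on sets, so the repeated "some" is collapsed.
via_setdiff <- setdiff(words, upper)

# Logical indexing with %in% keeps every occurrence in its original order.
via_in <- words[!words %in% upper]

print(via_setdiff)
print(via_in)
```

If repeated words need to be preserved, indexing with `%in%` could be used in place of `setdiff` inside the `mapply` call.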
    
        3
  •  1
  •   s_baldur    6 years ago
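    separate <- function(x) {
      # split the string into whitespace-separated tokens
      x <- unlist(strsplit(as.character(x), "\\s+"))
      # TRUE for tokens that contain at least one lower-case letter
      with_lower <- grepl("\\p{Ll}", x, perl = TRUE)
      list(paste(x[!with_lower], collapse = " "),  paste(x[with_lower], collapse = " "))
    }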
    
    
    do.call(rbind, lapply(data$text, separate))
    
         [,1]                [,2]                     
    [1,] "SOME UPPERCASES"   "And some Lower Cases"   
    [2,] "OTHER UPPER CASES" "And other words"        
    [3,] "AND UPPER CASES"   "Some lower cases"       
    [4,] "ONLY UPPER CASES"  ""                       
    [5,] ""                  "Only lower cases, maybe"
    [6,] "UPPER UPPER!"      "lower"
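
The matrix above contains empty strings where the desired result in the question has `NA`. A minimal sketch of the same token-based idea that returns a two-column data frame instead, assuming the column names `V1` and `V2` from the question; the function name `split_case` and its inner helper `pick` are hypothetical:

```r
# Sketch: token-based split returning a data frame with NA for empty cells.
# `split_case` and `pick` are hypothetical names used for illustration.
split_case <- function(text) {
  # trim outer whitespace, then split each string on runs of whitespace
  tokens <- strsplit(trimws(as.character(text)), "\\s+")
  # join the tokens that do (or do not) contain a lower-case letter
  pick <- function(tok, keep_lower) {
    has_lower <- grepl("\\p{Ll}", tok, perl = TRUE)
    paste(tok[has_lower == keep_lower], collapse = " ")
  }
  out <- data.frame(
    V1 = vapply(tokens, pick, character(1), keep_lower = FALSE),
    V2 = vapply(tokens, pick, character(1), keep_lower = TRUE),
    stringsAsFactors = FALSE
  )
  # empty strings mean "no tokens of this kind"; report them as NA
  out[out == ""] <- NA
  out
}

res <- split_case(c("ONLY UPPER CASES", "UPPER lower UPPER!"))
print(res)
```

Because the split is on whitespace only, punctuation stays attached to its token, so `UPPER!` ends up in the upper-case column as in the desired output.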