
Multi-byte safe wordwrap() function for UTF-8

  •  12
  • philfreo  · 14 years ago

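    PHP's wordwrap() function does not work correctly on multi-byte strings such as UTF-8.

    The function should take the same parameters as wordwrap().

    In particular, make sure that it:

    • cuts mid-word when needed if $cut = true
    • does not insert extra spaces into words if $break = ' '
    • also works for $break = "\n"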
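
    To make the failure concrete, here is a minimal sketch. The sample string and the helper name isValidUtf8 are illustrative assumptions, not part of the question. Because wordwrap() counts bytes, forcing a cut at width 3 can split a two-byte character in half:

```php
<?php

// Hypothetical helper: the empty pattern with the /u modifier only matches
// when the subject is valid UTF-8 (preg_match returns false otherwise).
function isValidUtf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

$input = "äöü äöü";                           // 7 characters, 13 bytes
$wrapped = wordwrap($input, 3, "\n", true);   // width is measured in bytes

var_dump(isValidUtf8($input));    // the input itself is valid UTF-8
var_dump(isValidUtf8($wrapped));  // the cut lands inside "ö", so this is invalid
```

    Any replacement has to count characters rather than bytes, which is what the answers below attempt in different ways.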
    8 replies
        1
  •  21
  •   Fosfor    13 years ago

    I haven't found any working code that suits me, so here is what I wrote. It works for me, though it is probably not the fastest.

    function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false) {
        $lines = explode($break, $str);
        foreach ($lines as &$line) {
            $line = rtrim($line);
            if (mb_strlen($line) <= $width)
                continue;
            $words = explode(' ', $line);
            $line = '';
            $actual = '';
            foreach ($words as $word) {
                if (mb_strlen($actual.$word) <= $width)
                    $actual .= $word.' ';
                else {
                    if ($actual != '')
                        $line .= rtrim($actual).$break;
                    $actual = $word;
                    if ($cut) {
                        while (mb_strlen($actual) > $width) {
                            $line .= mb_substr($actual, 0, $width).$break;
                            $actual = mb_substr($actual, $width);
                        }
                    }
                    $actual .= ' ';
                }
            }
            $line .= trim($actual);
        }
        unset($line); // break the reference left behind by the foreach above
        return implode($break, $lines);
    }
    
        2
  •  5
  •   sacrebleu    13 years ago
    /**
     * wordwrap for utf8 encoded strings
     *
     * @param string $str
     * @param integer $width
     * @param string $break
     * @param boolean $cut
     * @return string
     * @author Milian Wolff <mail@milianw.de>
     */
    
    function utf8_wordwrap($str, $width, $break, $cut = false) {
        if (!$cut) {
            $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.',}\b#U';
        } else {
            $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.'}#';
        }
        if (function_exists('mb_strlen')) {
            $str_len = mb_strlen($str,'UTF-8');
        } else {
            $str_len = preg_match_all('/[\x00-\x7F\xC0-\xFD]/', $str, $var_empty);
        }
        $while_what = ceil($str_len / $width);
        $i = 1;
        $return = '';
        while ($i < $while_what) {
            // Stop if the pattern finds no break point, so $matches[0] is never undefined
            if (!preg_match($regexp, $str, $matches)) {
                break;
            }
            $string = $matches[0];
            $return .= $string.$break;
            $str = substr($str, strlen($string));
            $i++;
        }
        return $return.$str;
    }
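
    The byte-level pattern at the heart of this answer can be tried in isolation. A minimal sketch, assuming valid UTF-8 input; countUtf8Chars is a hypothetical helper name, not part of the answer:

```php
<?php

// Hypothetical helper: counts characters in a (valid) UTF-8 string without
// mbstring by matching either one ASCII byte or one lead byte followed by
// its continuation bytes, the same alternation the answer's pattern repeats.
function countUtf8Chars(string $text): int
{
    return (int) preg_match_all('/[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+/', $text);
}

echo countUtf8Chars("héllo"), "\n";   // 5 characters
echo strlen("héllo"), "\n";           // 6 bytes
```

    The pattern has no /u modifier, so PCRE works on raw bytes here; that is why the character boundaries have to be spelled out by hand.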
    

        3
  •  3
  •   Fleshgrinder    11 years ago

    Since no answer handles every use case, here is one that does. The code is based on Drupal’s AbstractStringWrapper::wordWrap

    <?php
    
    /**
     * Wraps any string to a given number of characters.
     *
     * This implementation is multi-byte aware and relies on {@link
     * http://www.php.net/manual/en/book.mbstring.php PHP's multibyte
     * string extension}.
     *
     * @see wordwrap()
     * @link https://api.drupal.org/api/drupal/core%21vendor%21zendframework%21zend-stdlib%21Zend%21Stdlib%21StringWrapper%21AbstractStringWrapper.php/function/AbstractStringWrapper%3A%3AwordWrap/8
     * @param string $string
     *   The input string.
     * @param int $width [optional]
     *   The number of characters at which <var>$string</var> will be
     *   wrapped. Defaults to <code>75</code>.
     * @param string $break [optional]
     *   The line is broken using the optional break parameter. Defaults
     *   to <code>"\n"</code>.
     * @param boolean $cut [optional]
     *   If the <var>$cut</var> is set to <code>TRUE</code>, the string is
     *   always wrapped at or before the specified <var>$width</var>. So if
     *   you have a word that is larger than the given <var>$width</var>, it
     *   is broken apart. Defaults to <code>FALSE</code>.
     * @return string
     *   Returns the given <var>$string</var> wrapped at the specified
     *   <var>$width</var>.
     */
    function mb_wordwrap($string, $width = 75, $break = "\n", $cut = false) {
      $string = (string) $string;
      if ($string === '') {
        return '';
      }
    
      $break = (string) $break;
      if ($break === '') {
        trigger_error('Break string cannot be empty', E_USER_ERROR);
      }
    
      $width = (int) $width;
      if ($width === 0 && $cut) {
        trigger_error('Cannot force cut when width is zero', E_USER_ERROR);
      }
    
      if (strlen($string) === mb_strlen($string)) {
        return wordwrap($string, $width, $break, $cut);
      }
    
      $stringWidth = mb_strlen($string);
      $breakWidth = mb_strlen($break);
    
      $result = '';
      $lastStart = $lastSpace = 0;
    
      for ($current = 0; $current < $stringWidth; $current++) {
        $char = mb_substr($string, $current, 1);
    
        $possibleBreak = $char;
        if ($breakWidth !== 1) {
          $possibleBreak = mb_substr($string, $current, $breakWidth);
        }
    
        if ($possibleBreak === $break) {
          $result .= mb_substr($string, $lastStart, $current - $lastStart + $breakWidth);
          $current += $breakWidth - 1;
          $lastStart = $lastSpace = $current + 1;
          continue;
        }
    
        if ($char === ' ') {
          if ($current - $lastStart >= $width) {
            $result .= mb_substr($string, $lastStart, $current - $lastStart) . $break;
            $lastStart = $current + 1;
          }
    
          $lastSpace = $current;
          continue;
        }
    
        if ($current - $lastStart >= $width && $cut && $lastStart >= $lastSpace) {
          $result .= mb_substr($string, $lastStart, $current - $lastStart) . $break;
          $lastStart = $lastSpace = $current;
          continue;
        }
    
        if ($current - $lastStart >= $width && $lastStart < $lastSpace) {
          $result .= mb_substr($string, $lastStart, $lastSpace - $lastStart) . $break;
          $lastStart = $lastSpace = $lastSpace + 1;
          continue;
        }
      }
    
      if ($lastStart !== $current) {
        $result .= mb_substr($string, $lastStart, $current - $lastStart);
      }
    
      return $result;
    }
    
    ?>
    
        4
  •  2
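  •   jchook    5 years ago

    Custom word boundaries

    Unicode text has many more potential word boundaries than 8-bit encodings, including 17 space separators and the full width comma. This solution lets you customize the list of word boundaries for your application.

    Better performance

    Have you ever benchmarked the mb_* functions? This solution walks the string with a custom nextCharUtf8() instead.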

    <?php
    
    function wordWrapUtf8(
      string $phrase,
      int $width = 75,
      string $break = "\n",
      bool $cut = false,
      array $seps = [' ', "\n", "\t", ','] // last entry is the full width comma (U+FF0C)
    ): string
    {
      $chunks = [];
      $chunk = '';
      $len = 0;
      $pointer = 0;
      while (!is_null($char = nextCharUtf8($phrase, $pointer))) {
        $chunk .= $char;
        $len++;
        if (in_array($char, $seps, true) || ($cut && $len === $width)) {
          $chunks[] = [$len, $chunk];
          $len = 0;
          $chunk = '';
        }
      }
      if ($chunk !== '') { // strict check: a trailing "0" chunk is falsy but must be kept
        $chunks[] = [$len, $chunk];
      }
      $line = '';
      $lines = [];
      $lineLen = 0;
      foreach ($chunks as [$len, $chunk]) {
        if ($lineLen + $len > $width) {
          if ($line !== '') { // strict check: a line holding only "0" is falsy
            $lines[] = $line;
            $lineLen = 0;
            $line = '';
          }
        }
        $line .= $chunk;
        $lineLen += $len;
      }
      if ($line !== '') { // strict check: a final line of "0" must not be dropped
        $lines[] = $line;
      }
      return implode($break, $lines);
    }
    
    function nextCharUtf8(&$string, &$pointer)
    {
      // EOF
      if (!isset($string[$pointer])) {
        return null;
      }
    
      // Get the byte value at the pointer
      $char = ord($string[$pointer]);
    
      // ASCII
      if ($char < 128) {
        return $string[$pointer++];
      }
    
      // UTF-8
      if ($char < 224) {
        $bytes = 2;
      } elseif ($char < 240) {
        $bytes = 3;
      } elseif ($char < 248) {
        $bytes = 4;
      } elseif ($char < 252) { // lead bytes 248-251 (legacy 5-byte form)
        $bytes = 5;
      } else {
        $bytes = 6;
      }
    
      // Get full multibyte char
      $str = substr($string, $pointer, $bytes);
    
      // Increment pointer according to length of char
      $pointer += $bytes;
    
      // Return mb char
      return $str;
    }
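
    The custom boundary list above has to spell out each separator by hand. As a sketch of one alternative (an assumption, not something this answer does), PCRE's \p{Zs} property covers the Unicode space separators, so a hypothetical isWordBoundary() helper could combine it with an application-specific list:

```php
<?php

// Hypothetical helper: true when $char is a Unicode space separator
// (category Zs), a tab or newline, or one of the extra separators supplied.
function isWordBoundary(string $char, array $extra = ["\u{FF0C}"]): bool
{
    if (preg_match('/\A[\p{Zs}\t\n]\z/u', $char) === 1) {
        return true;
    }
    return in_array($char, $extra, true);
}

var_dump(isWordBoundary(' '));          // ASCII space
var_dump(isWordBoundary("\u{3000}"));   // ideographic space
var_dump(isWordBoundary("\u{FF0C}"));   // full width comma, from $extra
var_dump(isWordBoundary('a'));          // an ordinary letter
```

    The regex call costs more per character than in_array() on a short list, so whether this trade is worthwhile depends on how many separators the application needs.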
    
        5
  •  1
  •   josevoid    6 years ago

    Just want to share some alternatives I found on the net.

    <?php
    if ( !function_exists('mb_str_split') ) {
        function mb_str_split($string, $split_length = 1)
        {
            mb_internal_encoding('UTF-8'); 
            mb_regex_encoding('UTF-8');  
    
            $split_length = ($split_length <= 0) ? 1 : $split_length;
    
            $mb_strlen = mb_strlen($string, 'utf-8');
    
            $array = array();
    
            for($i = 0; $i < $mb_strlen; $i += $split_length) {
                $array[] = mb_substr($string, $i, $split_length);
            }
    
            return $array;
        }
    }
    

    With mb_str_split, you can use join to combine the pieces with <br>.

    <?php
        $text = '<utf-8 content>';
    
        echo join('<br>', mb_str_split($text, 20));
    

    Finally, create your own helper, mb_textwrap

    <?php
    
    if( !function_exists('mb_textwrap') ) {
        function mb_textwrap($text, $length = 20, $concat = '<br>') 
        {
            return join($concat, mb_str_split($text, $length));
        }
    }
    
    $text = '<utf-8 content>';
    // so simply call
    echo mb_textwrap($text);
    

    See the screenshot demo: mb_textwrap demo
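
    Note that this helper splits at a fixed character count and never looks for a space to break at, so words are cut wherever the count runs out. A small sketch of the effect, using preg_match_all() with the /u modifier as a stand-in for mb_str_split() (chunkUtf8 is a hypothetical name):

```php
<?php

// Hypothetical stand-in for mb_str_split(): returns consecutive chunks of at
// most $length characters. The /u modifier makes "." match whole UTF-8
// characters and /s lets it match newlines too.
function chunkUtf8(string $text, int $length): array
{
    $length = max(1, $length);
    preg_match_all('/.{1,' . $length . '}/us', $text, $matches);
    return $matches[0];
}

// Words are cut wherever the character count runs out.
echo implode("\n", chunkUtf8("héllo wörld", 4)), "\n";
// héll
// o wö
// rld
```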

        6
  •  1
  •   user2788145    6 years ago
    function mb_wordwrap($str, $width = 74, $break = "\r\n", $cut = false)
    {
        return preg_replace(
            '~(?P<str>.{' . $width . ',}?' . ($cut ? '(?(?!.+\s+)\s*|\s+)' : '\s+') . ')(?=\S+)~mus',
            '$1' . $break,
            $str
        );
    }
    
        7
  •  0
  •   Guillaume V    11 years ago

    Here is the multi-byte wrap function I wrote, inspired by code from others on the internet.

    function mb_wordwrap($long_str, $width = 75, $break = "\n", $cut = false) {
        $long_str = html_entity_decode($long_str, ENT_COMPAT, 'UTF-8');
        $width -= mb_strlen($break);
        if ($cut) {
            $short_str = mb_substr($long_str, 0, $width);
            $short_str = trim($short_str);
        }
        else {
            // The /u modifier makes {1,N} count UTF-8 characters instead of bytes
            $short_str = preg_replace('/^(.{1,'.$width.'})(?:\s.*|$)/u', '$1', $long_str);
            if (mb_strlen($short_str) > $width) {
                $short_str = mb_substr($short_str, 0, $width);
            }
        }
        if (mb_strlen($long_str) != mb_strlen($short_str)) {
            $short_str .= $break;
        }
        return $short_str;
    }
    

    Remember to configure PHP for UTF-8 as well:

    ini_set('default_charset', 'UTF-8');
    mb_internal_encoding('UTF-8');
    mb_regex_encoding('UTF-8');
    

    I hope this helps. Guillaume

        8
  •  -1
  •   philfreo    14 years ago

    /**
     * Multi-byte safe version of wordwrap()
     * Seems to me like wordwrap() is only broken on UTF-8 strings when $cut = true
     * @return string
     */
    function wrap($str, $len = 75, $break = " ", $cut = true) { 
        $len = (int) $len;
    
        if (empty($str))
            return ""; 
    
        $pattern = "";
    
        if ($cut)
            $pattern = '/([^'.preg_quote($break, '/').']{'.$len.'})/u'; // also escape the "/" delimiter
        else
            return wordwrap($str, $len, $break);
    
        return preg_replace($pattern, "\${1}".$break, $str); 
    }
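
    One caveat to the docblock's remark, offered as a sketch rather than a rebuttal: even with $cut = false, wordwrap() measures $width in bytes, so multi-byte text can be wrapped earlier than its character count would suggest. The sample strings are assumptions chosen for illustration:

```php
<?php

$ascii = "aaaa aaaa";   // 9 characters, 9 bytes
$utf8  = "ääää ääää";   // 9 characters, 17 bytes

// Both strings fit in 9 characters, but only the ASCII one fits in 9 bytes.
var_dump(wordwrap($ascii, 9, "\n") === "aaaa aaaa");    // bool(true): left alone
var_dump(wordwrap($utf8, 9, "\n") === "ääää\nääää");    // bool(true): wrapped early
```

    The output stays valid UTF-8 in this mode because breaks only replace spaces, but the line lengths no longer match the requested width.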
    
        9
  •  -2
  •   philfreo    14 years ago

    This one seems to work well...

    function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false, $charset = null) {
        if ($charset === null) $charset = mb_internal_encoding();
    
        $pieces = explode($break, $str);
        $result = array();
        foreach ($pieces as $piece) {
          $current = $piece;
          while ($cut && mb_strlen($current, $charset) > $width) {
            $result[] = mb_substr($current, 0, $width, $charset);
            // a null length means "to the end of the string" (no 2048-character cap)
            $current = mb_substr($current, $width, null, $charset);
          }
          $result[] = $current;
        }
        return implode($break, $result);
    }