Trim a string to a length, ignoring HTML

tokenize truncate string html

steve_c · Tech Community · 15 years ago

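This problem is a challenging one. Our application allows users to post news on the homepage. That news is entered through a rich text editor that allows HTML. On the homepage, we only want to display a truncated summary of each news item.

For example, here is the full text we are displaying, including HTML:

In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.

We want to trim the news item to 250 characters, not counting HTML.

The method we currently use for trimming counts the HTML, so news posts that are heavy on HTML get truncated far too early.

For instance, if the above example contained a lot of HTML, it could end up looking like this:

In an attempt to make a bit more space in the office, kitchen, I've pulled...

This is not what we want.

Does anyone have a way of tokenizing the HTML tags so as to keep their positions in the string, performing a length check and/or trim on the string, and then restoring the HTML at its original locations?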

7 answers | last activity 10 years ago

Chad Birch 15 years ago

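Start at the first character of the post and step over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter reaches 250 is where you actually want to cut.

Note that this leaves another problem you will have to deal with: an HTML tag that was opened before the cutoff but not closed.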
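
The counting walk described above can be sketched in Java as follows. This is only an illustration: the class name CutPointFinder and the method findCutIndex are made up for the example, and it deliberately does nothing about tags that are still open at the cut.

```java
final class CutPointFinder {

    /**
     * Returns the index just past the character at which the count of
     * characters outside tags reaches limit, or html.length() if the
     * text has fewer visible characters than that.
     */
    static int findCutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        boolean inTag = false;
        int visible = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;          // stop counting until the tag ends
            } else if (c == '>' && inTag) {
                inTag = false;         // resume counting after the tag
            } else if (!inTag) {
                visible++;
                if (visible == limit) {
                    return i + 1;      // cut right after this character
                }
            }
        }
        return html.length();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        int cut = findCutIndex(html, 7);
        // prints "<p>Hello <b>w": seven visible characters, with <p> and <b> left open
        System.out.println(html.substring(0, cut));
    }
}
```

The output of the example shows the second problem directly: the cut lands inside both the p and the b element, so the result still needs its open tags closed.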

Matteo Pelucco 12 years ago

Following the 2-state finite state machine suggestion, I developed a simple HTML parser in Java.

http://pastebin.com/jCRqiwNH

Here is a test case:

http://pastebin.com/37gCS4tV

And here is the Java code:

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

    private int cutPoint = -1;
    private String htmlString = "";

    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){

        // reset 
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;

        String tag = "";

        if (cutPoint < 0){
            return htmlString;
        }

        if (null != htmlString){

            if (cutPoint == 0){
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++){

                String strC = htmlString.substring(i, i+1);


                if (strC.equals("<")){

                    // new tag or tag closure

                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";

                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){

                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);

                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){

                                // closure (ignored if no tag is open, to avoid an IndexOutOfBoundsException)
                                if (!isToSkip(tag) && !tags.isEmpty()){
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove((tags.size() - 1));
                                }

                            } else {

                                // new tag
                                sb.append(tagSb.toString());

                                if (!isToSkip(tag)){
                                    tags.add(tag);  
                                }

                            }

                            i = k;
                            break;
                        }

                    }

                } else {

                    sb.append(strC);
                    charCount++;

                }

                // cut check
                if (charCount >= cutPoint){

                    // close previously open tags
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                } 

            }

            return sb.toString();

        } else {
            return null;
        }

    }

    private boolean isToSkip(String tag) {

        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }

        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }

        return false;
    }

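    private String getTag(String tagString) {

        String name;
        if (tagString.contains(" ")){
            // tag with attributes
            name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(" "));
        } else {
            // simple tag
            name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
        }

        // a self-closing tag such as <br/> would otherwise yield "br/" and never match tagsToSkip
        if (name.length() > 1 && name.endsWith("/")){
            name = name.substring(0, name.length() - 1);
        }

        return name;

    }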

}

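Brian R. Bondy 15 years ago

If I understand the problem correctly, you want to keep the HTML formatting, but you don't want it to count toward the length of the string you are keeping.

You can accomplish this with code that implements a simple finite state machine.

2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag if a > character is encountered
- Goes to itself if any other character is encountered
OutOfTag:
- Goes to InTag if a < character is encountered
- Goes to itself if any other character is encountered

Your starting state will be OutOfTag.

You implement a finite state machine by processing 1 character at a time. Processing each character brings you to a new state.

As you run your text through the finite state machine, you also want to keep an output buffer and a variable holding the length encountered so far (so you know when to stop).

1. Increment your length variable each time you are in the OutOfTag state and you process another character. You can optionally not increment this variable for whitespace characters.
2. The algorithm ends when you have no more characters or when you have reached the desired length mentioned in 1.
3. In your output buffer, include the characters you encounter up to the length mentioned in 1.
4. Keep a stack of unclosed tags. When you reach the length, add an end tag for each element in the stack. As you run through the algorithm, you can know when you encounter a tag by keeping a current_tag variable. This current_tag variable is started when you enter the InTag state, and it is ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you have a start tag, you push it onto the stack. If you have an end tag, you pop it from the stack.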
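
A minimal Java sketch of the state machine and tag stack described above might look like this. The names HtmlTruncator, truncate and recordTag are hypothetical, the list of void elements is an assumed subset, whitespace is counted like any other character, and an end tag is simply assumed to match the innermost open tag. Comments or attribute values that contain a '>' character are not handled.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class HtmlTruncator {

    private enum State { OUT_OF_TAG, IN_TAG }

    // Elements that never take an end tag (assumed subset, extend as needed).
    private static final Set<String> VOID_TAGS =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "link", "meta"));

    static String truncate(String html, int maxVisible) {
        StringBuilder out = new StringBuilder();
        StringBuilder currentTag = new StringBuilder();
        Deque<String> openTags = new ArrayDeque<>();
        State state = State.OUT_OF_TAG;
        int length = 0;

        // length only grows in OUT_OF_TAG, so the limit is never hit inside a tag
        for (int i = 0; i < html.length() && length < maxVisible; i++) {
            char c = html.charAt(i);
            out.append(c);
            if (state == State.OUT_OF_TAG) {
                if (c == '<') {
                    state = State.IN_TAG;
                    currentTag.setLength(0);
                } else {
                    length++;
                }
            } else if (c == '>') {
                state = State.OUT_OF_TAG;
                recordTag(currentTag.toString().trim(), openTags);
            } else {
                currentTag.append(c);
            }
        }

        // close whatever is still open, innermost first
        for (String name : openTags) {
            out.append("</").append(name).append('>');
        }
        return out.toString();
    }

    private static void recordTag(String body, Deque<String> openTags) {
        if (body.isEmpty() || body.startsWith("!") || body.startsWith("?")) {
            return;                     // "<>", comments, doctype, processing instructions
        }
        if (body.startsWith("/")) {
            if (!openTags.isEmpty()) {
                openTags.pop();         // end tag: assumed to match the innermost open tag
            }
            return;
        }
        if (body.endsWith("/")) {
            return;                     // self-closing, e.g. <br/>
        }
        // the tag name ends at the first whitespace character
        int end = 0;
        while (end < body.length() && !Character.isWhitespace(body.charAt(end))) {
            end++;
        }
        String name = body.substring(0, end).toLowerCase(Locale.ROOT);
        if (!VOID_TAGS.contains(name)) {
            openTags.push(name);
        }
    }

    public static void main(String[] args) {
        // prints "<div><p>Hello <b>br</b></p></div>"
        System.out.println(truncate("<div><p>Hello <b>brave</b> new world</p></div>", 8));
    }
}
```

Because the loop condition checks the length before reading the next character, the cut always falls right after a visible character, and the stack then supplies the end tags in innermost-first order.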

steve_c 15 years ago

Here's the implementation I came up with, in C#:

public static string TrimToLength(string input, int length)
{
  if (string.IsNullOrEmpty(input))
    return string.Empty;

  if (input.Length <= length)
    return input;

  bool inTag = false;
  int targetLength = 0;

  for (int i = 0; i < input.Length; i++)
  {
    char c = input[i];

    if (c == '>')
    {
      inTag = false;
      continue;
    }

    if (c == '<')
    {
      inTag = true;
      continue;
    }

    if (inTag || char.IsWhiteSpace(c))
    {
      continue;
    }

    targetLength++;

    if (targetLength == length)
    {
      return ConvertToXhtml(input.Substring(0, i + 1));
    }
  }

  return input;
}

And a few of the unit tests I used while developing it with TDD:

[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
  Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
  Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                  "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                  "<br/>" +
                  "In an attempt to make a bit more space in the office, kitchen, I";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
             "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
             "<br/>" +
             "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
             "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
             "</div>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
  string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                         "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                         "<br/>" +
                         "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                         "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
              "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
              "<br/>" +
              "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

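Highstead 15 years ago

I'm aware this is quite a while after the posted date, but I had a similar issue and this is how I ended up solving it. My concern is the speed of regex versus iterating through an array.

Also, this does not handle the case where there is a space before an HTML tag and another one after it.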

private string HtmlTrimmer(string input, int len)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= len)
        return input;

    // this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
    string inputCopy;
    string tag;

    string result = "";
    int strLen = 0;
    int strMarker = 0;
    int inputLength = input.Length;     

    Stack stack = new Stack(10);
    Regex text = new Regex("^[^<&]+");                
    Regex singleUseTag = new Regex("^<[^>]*?/>");            
    Regex specChar = new Regex("^&[^;]*?;");
    Regex htmlTag = new Regex("^<.*?>");

    while (strLen < len)
    {
        inputCopy = input.Substring(strMarker);
        //If the marker is at the end of the string OR 
        //the sum of the remaining characters and those analyzed is less than the max length
        if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
            break;

        //Match regular text
        result += text.Match(inputCopy,0,len-strLen);
        strLen += result.Length - strMarker;
        strMarker = result.Length;

        inputCopy = input.Substring(strMarker);
        if (singleUseTag.IsMatch(inputCopy))
            result += singleUseTag.Match(inputCopy);
        else if (specChar.IsMatch(inputCopy))
        {
            //think of &nbsp; as 1 character instead of 5
            result += specChar.Match(inputCopy);
            ++strLen;
        }
        else if (htmlTag.IsMatch(inputCopy))
        {
            tag = htmlTag.Match(inputCopy).ToString();
            //This only works if this is valid Markup...
            if(tag[1]=='/')         //Closing tag
                stack.Pop();
            else                    //not a closing tag
                stack.Push(tag);
            result += tag;
        }
        else    //Bad syntax
            result += input[strMarker];

        strMarker = result.Length;
    }

    while (stack.Count > 0)
    {
        tag = stack.Pop().ToString();
        result += tag.Insert(1, "/");
    }
    if (strLen == len)
        result += "...";
    return result;
}
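
The idea in the code above of counting an entity such as &nbsp; as one character can also be combined with a plain character scan. The Java sketch below is an assumption-laden illustration, not a port of the C# method: the class EntityAwareCounter, the method visibleLength and the bound MAX_ENTITY_LENGTH are invented for the example, and anything of the form "&", letters, digits or '#', then ';' is treated as an entity without checking it against the real entity list.

```java
final class EntityAwareCounter {

    // Longest "&...;" run treated as an entity (an assumed bound for this sketch).
    private static final int MAX_ENTITY_LENGTH = 12;

    /** Counts characters outside tags, counting each entity as one character. */
    static int visibleLength(String html) {
        int count = 0;
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) {
                    break;                  // unterminated tag: ignore the rest
                }
                i = close + 1;              // skip the whole tag
            } else if (c == '&' && entityEnd(html, i) > 0) {
                count++;                    // the whole entity counts once
                i = entityEnd(html, i) + 1;
            } else {
                count++;
                i++;
            }
        }
        return count;
    }

    // Index of the ';' closing an entity that starts at amp, or -1 if there is none.
    private static int entityEnd(String html, int amp) {
        int limit = Math.min(html.length(), amp + MAX_ENTITY_LENGTH);
        for (int j = amp + 1; j < limit; j++) {
            char c = html.charAt(j);
            if (c == ';') {
                return j > amp + 1 ? j : -1;    // "&;" is not an entity
            }
            if (!Character.isLetterOrDigit(c) && c != '#') {
                return -1;                      // a bare ampersand, e.g. "a & b"
            }
        }
        return -1;
    }
}
```

With this sketch, visibleLength("a&nbsp;b") is 3 and visibleLength("a & b") is 5, since the bare ampersand in the second string is counted as an ordinary character.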

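Brankodd 10 years ago

You can try the following npm package:

trim-html

It cuts off the text inside the HTML tags once the limit is reached, preserves the original HTML structure, removes the HTML tags that come after the limit, and closes any tags left open.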

Phil.Wheeler 15 years ago

Wouldn't the fastest way be to use jQuery's text() method?

For example:

<ul>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ul>

var text = $('ul').text();

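This puts the value "OneTwoThree" (along with any whitespace between the list items) into the text variable, which lets you get the actual length of the text without the HTML included.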