If the string is undecorated (i.e., it contains no markup), either of these will work:
data = 'Main Idea, key term, key term, key term'
# example #1
/^(.+?, )(.+)/.match(data).captures.each_slice(2).map { |a,b| a << %Q{<span class="smaller_font">#{ b }</span>}}.first
# => "Main Idea, <span class=\"smaller_font\">key term, key term, key term</span>"
# example #2
data =~ /^(.+?, )(.+)/
$1 << %Q{<span class="smaller_font">#{ $2 }</span>}
# => "Main Idea, <span class=\"smaller_font\">key term, key term, key term</span>"
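Both one-liners above assume the string contains a comma followed by a space. If it does not, `match` returns `nil` (and `$1` is `nil`), so the call chain raises `NoMethodError`. As a minimal sketch of one way to guard against that, the same regex can be handed to `String#sub` with a block, which returns the text unchanged when nothing matches. The helper name `wrap_key_terms` is hypothetical and not part of the original answer:

```ruby
# Hypothetical helper (an assumption for illustration): wraps everything after
# the first ", " in a span, and returns the text unchanged when there is no match.
def wrap_key_terms(text)
  text.sub(/^(.+?, )(.+)/) do
    %Q{#{$1}<span class="smaller_font">#{$2}</span>}
  end
end

wrap_key_terms('Main Idea, key term, key term, key term')
# => "Main Idea, <span class=\"smaller_font\">key term, key term, key term</span>"
wrap_key_terms('Main Idea')
# => "Main Idea"
```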
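If the string does contain markup, using a regex to process HTML or XML is discouraged because it breaks too easily. Very simple uses against HTML you control are reasonably safe, but if the content or its formatting changes, the regex can break your code.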
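To make that fragility concrete, here is a small sketch. The pattern `LINK_PATTERN` and the two altered variants of the tag are assumptions invented for this illustration. They show how a regex written against one exact serialization stops matching after harmless changes to the markup:

```ruby
# A pattern tied to one exact way of writing the tag (illustrative assumption).
LINK_PATTERN = %r{<a href="stupidreqexquestion">(.+?), (.+)</a>}

as_written   = '<a href="stupidreqexquestion">Main Idea, key term, key term, key term</a>'
single_quote = "<a href='stupidreqexquestion'>Main Idea, key term, key term, key term</a>"
extra_attr   = '<a class="nav" href="stupidreqexquestion">Main Idea, key term, key term, key term</a>'

puts LINK_PATTERN.match?(as_written)   # true
puts LINK_PATTERN.match?(single_quote) # false: quote style changed
puts LINK_PATTERN.match?(extra_attr)   # false: an attribute was added before href
```

All three strings describe the same link to a browser or a parser, but only the first one satisfies the regex.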
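An HTML parser is the usually recommended solution because it keeps working if the content or its formatting changes. This is how I'd do it using Nokogiri. I'm being deliberately verbose to explain what is happening: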
require 'nokogiri'
# build a sample document
html = '<a href="stupidreqexquestion">Main Idea, key term, key term, key term</a>'
doc = Nokogiri::HTML(html)
puts doc.to_s, ''
# find the link
a_tag = doc.at_css('a[href=stupidreqexquestion]')
# break down the tag content
a_text = a_tag.content
main_idea, key_terms = a_text.split(/,\s+/, 2) # => ["Main Idea", "key term, key term, key term"]
a_tag.content = main_idea
# create a new node
span = Nokogiri::XML::Node.new('span', doc)
span['class'] = 'smaller_font'
span.content = key_terms
puts span.to_s, ''
# add it to the old node
a_tag.add_child(span)
puts doc.to_s
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><a href="stupidreqexquestion">Main Idea, key term, key term, key term</a></body></html>
# >>
# >> <span class="smaller_font">key term, key term, key term</span>
# >>
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><a href="stupidreqexquestion">Main Idea<span class="smaller_font">key term, key term, key term</span></a></body></html>
In the output above you can see how Nokogiri built the sample document, the span that was added, and the resulting document.
It can be simplified to:
require 'nokogiri'
doc = Nokogiri::HTML('<a href="stupidreqexquestion">Main Idea, key term, key term, key term</a>')
a_tag = doc.at_css('a[href=stupidreqexquestion]')
main_idea, key_terms = a_tag.content.split(/,\s+/, 2)
a_tag.content = main_idea
a_tag.add_child("<span class='smaller_font'>#{ key_terms }</span>")
puts doc.to_s
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><a href="stupidreqexquestion">Main Idea<span class="smaller_font">key term, key term, key term</span></a></body></html>