How to make Django slugify work properly with Unicode strings?

slug django-templates unicode django python

Imran · Tech Community · 16 years ago

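What can I do to prevent slugify from filtering out non-ASCII alphanumeric characters? (I'm using Django 1.0.2.)

cnprog.com has Chinese characters in its question URLs, so I looked at their code. They are not using slugify in templates; instead, they call this method on the Question model to get permalinks: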

def get_absolute_url(self):
    return '%s%s' % (reverse('question', args=[self.id]), self.title)

Are they slugifying the URLs or not?

8 answers

Evgeny 14 years ago

There is a Python package called unidecode that I've adopted for the Askbot Q&A forum. It works well for Latin-based alphabets and even looks reasonable for Greek:

>>> import unidecode
>>> from unidecode import unidecode
>>> unidecode(u'διακριτικός')
'diakritikos'

It does something odd with Asian languages:

>>> unidecode(u'影師嗎')
'Ying Shi Ma '
>>>

Does this make sense?

In Askbot we compute slugs like so:

from unidecode import unidecode
from django.template import defaultfilters
slug = defaultfilters.slugify(unidecode(input_text))

Open SEO 13 years ago

The Mozilla website team has been working on an implementation: https://github.com/mozilla/unicode-slugify with sample code at http://davedash.com/2011/03/24/how-we-slug-at-mozilla/

Arthur Hebert-Ryan 14 years ago

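Also, the Django version of slugify doesn't use the re.UNICODE flag, so it wouldn't even attempt to understand the meaning of \w\s as it pertains to non-ASCII characters.

This custom version is working well for me: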

import re

def u_slugify(txt):
    """A custom version of slugify that retains non-ASCII characters. The purpose of this
    function in the application is to make URLs more readable in a browser, so there are
    some added heuristics to retain as much of the title meaning as possible while
    excluding characters that are troublesome to read in URLs. For example, question marks
    will be seen in the browser URL as %3F and are therefore unreadable. Although non-ASCII
    characters will also be hex-encoded in the raw URL, most browsers will display them
    as human-readable glyphs in the address bar -- those should be kept in the slug."""
    txt = txt.strip()  # remove leading and trailing whitespace
    # The fourth positional argument of re.sub is count, so flags are passed by keyword.
    txt = re.sub(r'\s*-\s*', '-', txt, flags=re.UNICODE)  # remove spaces before and after dashes
    txt = re.sub(r'[\s/]', '_', txt, flags=re.UNICODE)  # replace remaining spaces and slashes with underscores
    txt = re.sub(r'(\d):(\d)', r'\1-\2', txt, flags=re.UNICODE)  # replace colons between digits with dashes
    txt = re.sub('"', "'", txt, flags=re.UNICODE)  # replace double quotes with single quotes
    txt = re.sub(r'[?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', txt, flags=re.UNICODE)  # remove some characters altogether
    return txt

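Note the last regex substitution. It is a workaround for a problem with the more robust expression r'\W', which seems to either strip out some non-ASCII characters or incorrectly re-encode them, as illustrated in the Python interpreter session below: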

Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> # Paste in a non-ascii string (simplified Chinese), taken from http://globallives.org/wiki/152/
>>> str = '您認識對全球社區感興趣的中國攝影師嗎'
>>> str
'\xe6\x82\xa8\xe8\xaa\x8d\xe8\xad\x98\xe5\xb0\x8d\xe5\x85\xa8\xe7\x90\x83\xe7\xa4\xbe\xe5\x8d\x80\xe6\x84\x9f\xe8\x88\x88\xe8\xb6\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print str
您認識對全球社區感興趣的中國攝影師嗎
>>> # Substitute all non-word characters with X
>>> re_str = re.sub('\W', 'X', str, re.UNICODE)
>>> re_str
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print re_str
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX?çä¸åæå½±å¸«å
>>> # Notice above that it retained the last 7 glyphs, ostensibly because they are word characters
>>> # And where did that question mark come from?
>>> 
>>> 
>>> # Now do the same with only the last three glyphs of the string
>>> str = '影師嗎'
>>> print str
影師嗎
>>> str
'\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> re.sub('\W','X',str,re.U)
'XXXXXXXXX'
>>> re.sub('\W','X',str)
'XXXXXXXXX'
>>> # Huh, now it seems to think those same characters are NOT word characters

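I am unsure what the problem is above, but I'm guessing that it stems from "whatever is classified as alphanumeric in the Unicode character properties database" and how that is implemented. I have heard that Python 3.x gives high priority to better Unicode handling, so this may be fixed already. Or maybe this is correct Python behavior, and I am misusing Unicode and/or the Chinese language.

For now, a workaround is to avoid character classes and make substitutions based on explicitly defined character sets.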
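
One possible reading of the session above, based on the documented signature of re.sub rather than on anything the answer's author confirmed: the fourth positional parameter of re.sub is count, not flags, and re.UNICODE has the integer value 32. A flag passed in that position would cap the substitution at 32 matches, which lines up with the 32 X bytes and the orphaned \xa3 byte in the output. The sketch below reproduces that arithmetic in Python 3 on a UTF-8 byte string; the variable names are illustrative.

```python
import re

text = "您認識對全球社區感興趣的中國攝影師嗎"
data = text.encode("utf-8")  # 18 glyphs, 3 bytes each = 54 bytes

# re.sub(pattern, repl, string, count=0, flags=0): a flag passed as the
# fourth positional argument is interpreted as count.
limit = int(re.UNICODE)  # 32

# With a bytes pattern, \W matches every non-ASCII byte, so the first
# 32 bytes are replaced and the remaining 22 are left untouched.
capped = re.sub(rb"\W", b"X", data, count=limit)

# On decoded text, with the flag passed by keyword, CJK ideographs count
# as word characters and nothing is replaced.
unicode_aware = re.sub(r"\W", "X", text, flags=re.UNICODE)

print(capped[:limit], capped[limit:limit + 1])
print(unicode_aware == text)
```

The single leftover byte is the tail of a three-byte character whose first two bytes were replaced, which would explain the stray question mark when the result is printed. The shorter nine-byte string stays under the cap, so every byte becomes X.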

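Jarret Hardie 16 years ago

I'm afraid Django's definition of a slug means ASCII, though the Django docs don't explicitly state this. This is the source of the default slugify filter. You can see that the value is converted to ASCII, with the 'ignore' option in case of errors: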

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
return mark_safe(re.sub('[-\s]+', '-', value))

Based on that, I'd guess that cnprog.com is not using the official slugify function. You may wish to adapt the Django snippet above if you want different behaviour.
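
To see what that conversion does to different scripts, here is a small Python 3 sketch of just the normalize-and-encode step. ascii_fold is a hypothetical helper name, not part of Django. Accented Latin letters keep their base letter, while Greek and Chinese characters have no ASCII decomposition and are dropped.

```python
import unicodedata

def ascii_fold(value):
    """Decompose characters, then drop whatever cannot be encoded as ASCII."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Compare a Latin, a Greek, and a mixed Chinese/English sample.
for sample in ("Café déjà vu", "διακριτικός", "你好 World"):
    print(repr(sample), "->", repr(ascii_fold(sample)))
```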

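Having said that, the RFC for URLs does state that non-US-ASCII characters (or, more specifically, anything other than the alphanumerics and $-_.+!*'()) should be encoded using %hex notation. If you look at the actual raw GET request that your browser sends (say, using Firebug), you'll see that the Chinese characters are in fact encoded before being sent. The browser just makes them look pretty in the display. I suspect this is why slugify insists on ASCII only, fwiw.

Antoine Pinsard 6 years ago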
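
As a rough illustration of that encoding step, the Python 3 standard library can percent-encode a path the way a browser would before sending it. The path below is a made-up example, not a real cnprog.com URL.

```python
from urllib.parse import quote, unquote

# Hypothetical permalink path containing Chinese characters.
path = "/question/1/你好"

# quote() encodes the text as UTF-8 and percent-escapes each byte;
# "/" is treated as safe by default and is left as-is.
encoded = quote(path)
print(encoded)

# unquote() reverses the escaping, which is roughly what the address
# bar does when it shows readable glyphs.
print(unquote(encoded))
```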

With Django >= 1.9, django.utils.text.slugify has an allow_unicode parameter:

>>> slugify("你好 World", allow_unicode=True)
"你好-world"

If you use Django <= 1.8 (which you should not since April 2018), you can pick up the code from Django 1.9.

Ondrej Slinták 12 years ago
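
For readers who cannot import Django's version, here is a minimal standalone Python 3 sketch in the same spirit. It is modeled on the approach described in this thread and is not a verbatim copy of Django's source; slugify_sketch is a hypothetical name, and Django's mark_safe wrapping is left out.

```python
import re
import unicodedata

def slugify_sketch(value, allow_unicode=False):
    """Lowercase, drop punctuation, and join words with hyphens."""
    value = str(value)
    if allow_unicode:
        # Keep non-ASCII letters, but normalize compatibility forms.
        value = unicodedata.normalize("NFKC", value)
    else:
        # Decompose accents, then discard anything that is not ASCII.
        value = unicodedata.normalize("NFKD", value)
        value = value.encode("ascii", "ignore").decode("ascii")
    # In Python 3, \w on str patterns is Unicode-aware by default.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)

print(slugify_sketch("你好 World", allow_unicode=True))
print(slugify_sketch("你好 World"))
```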

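You might want to look at: https://github.com/un33k/django-uuslug

It takes care of both "U"s for you: the U in unique and the U in Unicode.

It will do the job for you hassle-free.

mhl666 14 years ago

This is what I use: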

http://trac.django-fr.org/browser/site/trunk/djangofr/links/slughifi.py

SlugHiFi is a wrapper for the regular slugify, with the difference that it replaces national characters with their English alphabet counterparts.

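So instead of "Ą" you get "A", instead of "Ł" you get "L", and so on.

raratiru 6 years ago

I am interested in allowing only ASCII characters in the slug. This is why I benchmarked some of the available tools on the same string: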

Unicode Slugify :

In [5]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o', only_ascii=True)
37.8 µs ± 86.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-glo-la-fdo'

Django Uuslug :

In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
35.3 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-g-lo-la-fd-o'

Awesome Slugify :

In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
47.1 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'Paizo-trekho-kai-g-lo-la-fd-o'

Python Slugify :

In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
24.6 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-g-lo-la-fd-o'

django.utils.text.slugify with Unidecode :

In [15]: %timeit slugify(unidecode('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o'))
36.5 µs ± 89.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-glo-la-fdo'