Python: An Example of Extracting Words from Text and Counting Word Frequency

Updated: 2020-06-17 07:06:01  Author: startmvc
These text operations come up often, so I am collecting them here. More will be added over time.

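Operations:

strip_html(cls, text): remove HTML tags from the text

separate_words(cls, text, min_lenth=3): split the text into lowercase words, keeping only those longer than min_lenth characters

get_words_frequency(cls, words_list): count how often each word occurs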
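To make the word-extraction step concrete, here is a minimal sketch of the splitting and filtering that separate_words performs. The sample sentence is made up for illustration and does not come from the original post. Note that the length check is strict, so with the default of 3 a three-letter word such as "the" is dropped.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "The quick brown fox, the lazy dog!"

# Split on runs of non-word characters. This can leave empty strings at
# the edges of the list; the length filter below discards them as well.
tokens = re.split(r"\W+", sample)

# Keep tokens strictly longer than 3 characters, lower-cased.
words = [t.lower() for t in tokens if len(t) > 3]
print(words)
```

If three-letter words should be kept, passing min_lenth=2 is one way to get that behavior without changing the method.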

Source code:


import re


class DocProcess(object):

    @classmethod
    def strip_html(cls, text):
        """
        Delete HTML tags in text.
        text is a string.
        """
        new_text = " "
        is_html = False  # True while we are between "<" and ">"
        for character in text:
            if character == "<":
                is_html = True
            elif character == ">":
                is_html = False
                # Put a space where the tag was so adjacent words stay apart.
                new_text += " "
            elif not is_html:
                new_text += character
        return new_text

    @classmethod
    def separate_words(cls, text, min_lenth=3):
        """
        Separate text into a list of lowercase words.
        Only words strictly longer than min_lenth characters are kept.
        """
        # Split on runs of non-word characters (anything except letters,
        # digits and underscore).
        splitter = re.compile(r"\W+")
        return [s.lower() for s in splitter.split(text) if len(s) > min_lenth]

    @classmethod
    def get_words_frequency(cls, words_list):
        """
        Get the frequency of each word in words_list.
        Returns a dict mapping word to count.
        """
        num_words = {}
        for word in words_list:
            num_words[word] = num_words.get(word, 0) + 1
        return num_words
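For comparison, the same three steps can be sketched with standard library helpers. This is not the code above, only an illustrative alternative: strip_tags and top_words are hypothetical names, the sample HTML is made up, and the regex-based tag removal is a simplification (for example, it leaves the text after an unclosed "<" in place, whereas strip_html discards it). collections.Counter replaces the hand-written dict counting and also gives a ranked result.

```python
import re
from collections import Counter


def strip_tags(text):
    # Simplified stand-in for strip_html: replace each <...> run with a space.
    return re.sub(r"<[^>]*>", " ", text)


def top_words(html, min_len=3, n=2):
    # Hypothetical helper: strip tags, extract words, return the n most common.
    text = strip_tags(html)
    words = [w.lower() for w in re.split(r"\W+", text) if len(w) > min_len]
    return Counter(words).most_common(n)


# Made-up input for illustration.
html = "<p>Python makes text processing simple.</p><p>Python text tools.</p>"
print(top_words(html))  # [('python', 2), ('text', 2)]
```

Counter.most_common lists words with equal counts in the order they were first seen, which is why "python" comes before "text" here.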

