jieba.NET

The .NET version of the jieba Chinese word segmenter (supports .NET Framework and .NET Core)

MIT License


jieba.NET is the .NET version of the jieba Chinese word segmenter, implemented in C#.

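The current version is 0.42.2, based on jieba 0.42. It provides most of jieba's features and interfaces, with the exception of paddle mode. For background on how jieba is implemented, see the material referenced in the jieba wiki.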

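A KeywordProcessor, implemented with reference to FlashText, is also provided. KeywordProcessor makes it easier to extract dictionary keywords from text.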

If you run into segmentation-related needs or problems during development, please submit an Issue. I see u :)


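Features

  • Supports three segmentation modes:
    • Exact mode tries to cut the sentence as precisely as possible and is suited to text analysis.
    • Full mode scans out every word that can be formed in the sentence. It is very fast but cannot resolve ambiguity: segmentation neither uses word frequencies to find the maximum-probability path nor applies the HMM.
    • Search-engine mode builds on exact mode and cuts long words again to improve recall, which suits search-engine indexing.
  • MIT license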

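Algorithm

  • Efficient word-graph scanning based on a prefix dictionary builds a directed acyclic graph (DAG) of every way the characters in a sentence can form words.
  • Dynamic programming finds the maximum-probability path, giving the best segmentation based on word frequency.
  • For out-of-vocabulary words, an HMM model based on the word-forming capability of Chinese characters is used, decoded with the Viterbi algorithm.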
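
The DAG and dynamic-programming steps can be sketched with nothing but the standard library. This is a minimal illustration, not jieba.NET's actual implementation: the frequency table, the counts in it, and the helper names BuildDag and Segment are all made up for the example, and the HMM step for unknown words is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy frequency dictionary (hypothetical counts, not taken from dict.txt).
var freq = new Dictionary<string, int>
{
    ["我"] = 50, ["来到"] = 30, ["北京"] = 40, ["清华"] = 10,
    ["清华大学"] = 20, ["华大"] = 2, ["大学"] = 25
};
double total = freq.Values.Sum();

// Step 1: build the DAG. dag[i] lists every end index j such that
// sentence[i..j] (inclusive) is a dictionary word; a lone character is the fallback.
Dictionary<int, List<int>> BuildDag(string sentence)
{
    var dag = new Dictionary<int, List<int>>();
    for (int i = 0; i < sentence.Length; i++)
    {
        var ends = new List<int>();
        for (int j = i; j < sentence.Length; j++)
        {
            if (freq.ContainsKey(sentence.Substring(i, j - i + 1))) ends.Add(j);
        }
        if (ends.Count == 0) ends.Add(i);
        dag[i] = ends;
    }
    return dag;
}

// Step 2: dynamic programming from right to left over log-probabilities.
List<string> Segment(string sentence)
{
    var dag = BuildDag(sentence);
    int n = sentence.Length;
    var best = new double[n + 1]; // best[i]: max log-probability of sentence[i..]
    var next = new int[n];        // next[i]: end index of the word chosen at i
    for (int i = n - 1; i >= 0; i--)
    {
        best[i] = double.NegativeInfinity;
        foreach (int j in dag[i])
        {
            string word = sentence.Substring(i, j - i + 1);
            // A character missing from the toy table is given a count of 1.
            int count = freq.TryGetValue(word, out var c) ? c : 1;
            double score = Math.Log(count / total) + best[j + 1];
            if (score > best[i]) { best[i] = score; next[i] = j; }
        }
    }

    // Step 3: walk the chosen path from the start of the sentence.
    var words = new List<string>();
    for (int i = 0; i < n; i = next[i] + 1)
    {
        words.Add(sentence.Substring(i, next[i] - i + 1));
    }
    return words;
}

Console.WriteLine(string.Join("/ ", Segment("我来到北京清华大学")));
// 我/ 来到/ 北京/ 清华大学
```

With these toy counts the single word 清华大学 scores higher than 清华 followed by 大学, so the longer word wins. Changing the counts changes the path, which is why word frequencies in a custom dictionary matter.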

Installation and configuration

The current version targets net40, net45, and netstandard2.0 and can be installed through NuGet:

PM> Install-Package jieba.NET

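After installation, the packages\jieba.NET directory contains a Resources directory with the dictionaries and other data files that jieba.NET needs at run time. The simplest setup is to copy the whole Resources directory next to your assembly, in which case jieba.NET uses its built-in defaults. To keep these files elsewhere, add the following setting to app.config or web.config: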

<appSettings>
    <add key="JiebaConfigFileDir" value="C:\jiebanet\config" />
</appSettings>

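The path may be absolute or relative. A relative path is resolved by jieba.NET against the BaseDirectory of the current application domain.

  • Absolute path: if the setting is C:\jiebanet\config, the main dictionary path becomes C:\jiebanet\config\dict.txt.
  • Relative path (or no setting at all, in which case the default relative path Resources is used): if the setting is ..\config (.. can be used to move up a level) and BaseDirectory is C:\myapp\bin\, the main dictionary path becomes C:\myapp\config\dict.txt.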
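
The path rule can be pictured with a small helper. This is a sketch under assumptions: ResolveDictPath is a hypothetical name for illustration only, not part of jieba.NET, and the library's own resolution code may differ in detail.

```csharp
using System;
using System.IO;

// Hypothetical helper illustrating the rule; not jieba.NET's own code.
string ResolveDictPath(string baseDirectory, string configDir)
{
    // An absolute setting is used as-is; a relative one is anchored at baseDirectory.
    string dir = Path.IsPathRooted(configDir)
        ? configDir
        : Path.Combine(baseDirectory, configDir);

    // GetFullPath collapses any ".." segments in the combined path.
    return Path.GetFullPath(Path.Combine(dir, "dict.txt"));
}

// No setting present: fall back to the default relative directory "Resources".
Console.WriteLine(ResolveDictPath(AppContext.BaseDirectory, "Resources"));

// A relative setting that climbs one level above the base directory.
Console.WriteLine(ResolveDictPath(AppContext.BaseDirectory, Path.Combine("..", "config")));
```

On Windows, calling the helper with C:\myapp\bin\ and ..\config yields C:\myapp\config\dict.txt, the same result as the relative-path case in the list.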

The config directory can also be set in code:

JiebaNet.Segmenter.ConfigManager.ConfigFileBaseDir = @"C:\jiebanet\config";

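1. Segmentation

  • JiebaSegmenter.Cut takes three parameters: text is the string to segment, cutAll selects full mode, and hmm controls whether the HMM model is used to recognize out-of-vocabulary words. It returns IEnumerable<string>.
  • JiebaSegmenter.CutForSearch takes two parameters: text is the string to segment, and hmm controls whether the HMM model is used. It returns IEnumerable<string>.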
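
var segmenter = new JiebaSegmenter();
var segments = segmenter.Cut("我来到北京清华大学", cutAll: true);
Console.WriteLine("[Full mode]: {0}", string.Join("/ ", segments));

segments = segmenter.Cut("我来到北京清华大学");  // exact mode is the default
Console.WriteLine("[Exact mode]: {0}", string.Join("/ ", segments));

segments = segmenter.Cut("他来到了网易杭研大厦");  // exact mode by default, with the HMM model enabled
Console.WriteLine("[New word recognition]: {0}", string.Join("/ ", segments));

segments = segmenter.CutForSearch("小明硕士毕业于中国科学院计算所，后在日本京都大学深造"); // search-engine mode
Console.WriteLine("[Search-engine mode]: {0}", string.Join("/ ", segments));

segments = segmenter.Cut("结过婚的和尚未结过婚的");
Console.WriteLine("[Disambiguation]: {0}", string.Join("/ ", segments));

Output:

[Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Exact mode]: 我/ 来到/ 北京/ 清华大学
[New word recognition]: 他/ 来到/ 了/ 网易/ 杭研/ 大厦
[Search-engine mode]: 小明/ 硕士/ 毕业/ 于/ 中国/ 科学/ 学院/ 科学院/ 中国科学院/ 计算/ 计算所/ ，/ 后/ 在/ 日本/ 京都/ 大学/ 日本京都大学/ 深造
[Disambiguation]: 结过婚/ 的/ 和/ 尚未/ 结过婚/ 的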
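
Search-engine mode can be pictured as a second pass over the exact-mode result that also emits the shorter dictionary words found inside each long word. The sketch below illustrates that idea only: ExpandForSearch and the toy dictionary are hypothetical, and the 2- and 3-character windows are an assumption about how the re-cut is done rather than a statement of jieba.NET's code.

```csharp
using System;
using System.Collections.Generic;

// Toy dictionary (an assumption for this sketch; the real segmenter consults dict.txt).
var dict = new HashSet<string> { "中国", "科学", "学院", "科学院", "中国科学院" };

// Hypothetical helper: for each exact-mode word, first yield the 2- and
// 3-character dictionary words it contains, then the word itself.
IEnumerable<string> ExpandForSearch(IEnumerable<string> exactWords)
{
    foreach (var word in exactWords)
    {
        foreach (int size in new[] { 2, 3 })
        {
            if (word.Length <= size) continue; // nothing shorter to find
            for (int i = 0; i + size <= word.Length; i++)
            {
                string gram = word.Substring(i, size);
                if (dict.Contains(gram)) yield return gram;
            }
        }
        yield return word;
    }
}

Console.WriteLine(string.Join("/ ", ExpandForSearch(new[] { "毕业", "于", "中国科学院" })));
// 毕业/ 于/ 中国/ 科学/ 学院/ 科学院/ 中国科学院
```

Emitting the sub-words before the long word raises recall for a search index: a query for 科学院 can match a document that only contains 中国科学院.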

2. Adding a custom dictionary

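  • Developers can supply a custom dictionary containing words that are missing from the jieba dictionary. jieba can recognize new words on its own, but adding them explicitly gives higher accuracy.
  • Usage: JiebaSegmenter.LoadUserDict("user_dict_file_path")
  • The format matches the main dictionary: one entry per line, consisting of the word, its frequency (optional), and its part-of-speech tag (optional), separated by spaces. For example:

创新办 3 i
云计算 5
凱特琳 nz
台中
机器学习 3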

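  • JiebaSegmenter.AddWord(word, freq=0, tag=null) adds a new word or adjusts the frequency of a known one. If `freq` is not a positive integer, an automatically computed frequency is used, which ensures the word can be segmented out.
  • JiebaSegmenter.DeleteWord(word) removes a word.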

3. Keyword extraction

TF-IDF

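  • JiebaNet.Analyser.TfidfExtractor.ExtractTags(string text, int count = 20, IEnumerable<string> allowPos = null) extracts keywords from the given text.
  • JiebaNet.Analyser.TfidfExtractor.ExtractTagsWithWeight(string text, int count = 20, IEnumerable<string> allowPos = null) returns the keywords together with their weights.
  • Extraction relies on an IDF (inverse document frequency) dictionary.
  • The stop-word list includes the English stop words from NLTK.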

TextRank

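  • JiebaNet.Analyser.TextRankExtractor exposes the same interface as TfidfExtractor but is implemented with the TextRank algorithm.
  • Words are linked by co-occurrence within a fixed window (5 by default, adjustable through the Span property) to build the graph that is then ranked.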
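
The windowed co-occurrence step can be sketched as follows. This is an illustration under assumptions, not TextRankExtractor's code: the input words are placeholders standing in for segmented and filtered tokens, the edge table is a plain dictionary, and the ranking iteration that would follow is omitted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical input: words already segmented and filtered.
var words = new[] { "a", "b", "c", "d", "e", "f", "g" };
int span = 5; // window size, mirroring the default described above

// Count how often each pair of words appears within the same window.
var edges = new Dictionary<(string, string), int>();
for (int i = 0; i < words.Length; i++)
{
    for (int j = i + 1; j < i + span && j < words.Length; j++)
    {
        var key = (words[i], words[j]);
        edges[key] = edges.TryGetValue(key, out int w) ? w + 1 : 1;
    }
}

Console.WriteLine(edges.ContainsKey(("a", "e"))); // True: four positions apart, inside the window
Console.WriteLine(edges.ContainsKey(("a", "f"))); // False: five positions apart, outside the window
```

A larger window links more distant words and produces a denser graph; a smaller one keeps only close neighbours.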

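4. Part-of-speech tagging

  • The JiebaNet.Segmenter.PosSeg.PosSegmenter class tags each word with its part of speech while segmenting.
  • Tags follow an ictclas-compatible notation, the same one jieba uses.

var posSeg = new PosSegmenter();
var s = "一团硕大无朋的高能离子云，在遥远而神秘的太空中迅疾地飘移";

var tokens = posSeg.Cut(s);
Console.WriteLine(string.Join(" ", tokens.Select(token => string.Format("{0}/{1}", token.Word, token.Flag))));

Output:

一团/m 硕大无朋/i 的/uj 高能/n 离子/n 云/ns ，/x 在/p 遥远/a 而/c 神秘/a 的/uj 太空/n 中/f 迅疾/z 地/uv 飘移/v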

5. Tokenize: returning the start and end positions of words in the source text

Default mode:

var segmenter = new JiebaSegmenter();
var s = "永和服装饰品有限公司";
var tokens = segmenter.Tokenize(s);
foreach (var token in tokens)
{
    Console.WriteLine("word {0,-12} start: {1,-3} end: {2,-3}", token.Word, token.StartIndex, token.EndIndex);
}

Output:

word 永和           start: 0   end: 2
word 服装           start: 2   end: 4
word 饰品           start: 4   end: 6
word 有限公司         start: 6   end: 10

Search mode:

var segmenter = new JiebaSegmenter();
var s = "永和服装饰品有限公司";
var tokens = segmenter.Tokenize(s, TokenizerMode.Search);
foreach (var token in tokens)
{
    Console.WriteLine("word {0,-12} start: {1,-3} end: {2,-3}", token.Word, token.StartIndex, token.EndIndex);
}

Output:

word 永和           start: 0   end: 2
word 服装           start: 2   end: 4
word 饰品           start: 4   end: 6
word 有限           start: 6   end: 8
word 公司           start: 8   end: 10
word 有限公司         start: 6   end: 10

6. Parallel segmentation

  • For segmentation, use JiebaSegmenter.CutInParallel() and JiebaSegmenter.CutForSearchInParallel().
  • For part-of-speech tagging, use PosSegmenter.CutInParallel().

7. Integration with Lucene.NET

The jiebaForLuceneNet project provides a simple integration with Lucene.NET; see jiebaForLuceneNet for details.

8. Other dictionaries

The jieba project also provides other dictionary files.

9. Segmentation speed

  • Full mode: 2.5 MB/s
  • Exact mode: 1.1 MB/s
  • Test environment: Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz; test file 围城.txt (734KB)

10. Command-line segmentation

Building the Segmenter.Cli project produces jiebanet.exe. Its options and sample usages are:

-f       --file          the file name (required).
-d       --delimiter     the delimiter between tokens, default: / .
-a       --cut-all       use cut_all mode.
-n       --no-hmm        don't use HMM.
-p       --pos           enable POS tagging.
-v       --version       show version info.
-h       --help          show help details.

sample usages:
$ jiebanet -f input.txt > output.txt
$ jiebanet -d "|" -f input.txt > output.txt
$ jiebanet -p -f input.txt > output.txt

11. Word frequency statistics

The Counter class counts word frequencies; its interface is similar to Python's Counter.

var s = "algorithm";
var seg = new JiebaSegmenter();
var freqs = new Counter<string>(seg.Cut(s));
foreach (var pair in freqs.MostCommon(5))
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}
: 4
: 3
: 3
: 3
: 3

A Counter can be updated through its Add, Subtract, and Union methods, and MostCommon returns the most frequent entries.

12. KeywordProcessor

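KeywordProcessor extracts keywords from text, but it works differently from the KeywordExtractor classes above: KeywordProcessor finds the keywords that were explicitly added to it.

Where jieba's segmenter falls short, for example with keywords that contain spaces, KeywordProcessor can be used instead.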

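var kp = new KeywordProcessor();
kp.AddKeywords(new []{".NET Core", "Java", "C语言", "字典 tree", "CET-4", "网络 编程"});

var keywords = kp.ExtractKeywords("你需要通过cet-4考试，学习c语言、.NET core、网络 编程、JavaScript，掌握字典 tree的用法");

// keywords is:
// new List<string> { "CET-4", "C语言", ".NET Core", "网络 编程", "字典 tree"}

// The returned words are the keywords as they were added, which may differ in
// case from the text in the sentence. To get the text as it appears in the
// sentence, use the `raw` parameter.

keywords = kp.ExtractKeywords("你需要通过cet-4考试，学习c语言、.NET core、网络 编程、JavaScript，掌握字典 tree的用法", raw: true);

// keywords is:
// new List<string> { "cet-4", "c语言", ".NET core", "网络 编程", "字典 tree"}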