Compare HTML similarity using structural and style metrics
MIT License
This package provides a set of functions to measure the similarity between HTML documents.
Note: This is a fork of html-similarity.
pip install niteru
Structural similarity uses sequence comparison of the HTML tags to compute the similarity.
Similarity based on tree edit distance is not implemented because it is slower than sequence comparison.
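To make the idea of comparing tag sequences concrete, here is a minimal sketch using only the standard library. It is not niteru's implementation: `TagCollector`, `tag_sequence`, and `structural_similarity_sketch` are hypothetical names, and the use of `html.parser` and `difflib.SequenceMatcher` is an assumption made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start-tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of start-tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity_sketch(html1, html2):
    """Compare two documents by their tag sequences (illustrative only)."""
    # ratio() is 2 * matches / (len(a) + len(b)), so it lies in [0.0, 1.0].
    matcher = SequenceMatcher(None, tag_sequence(html1), tag_sequence(html2))
    return matcher.ratio()
```

For the tag sequences `h1, ul, li, li` and `h1, ul, li`, three tags match out of seven in total, so the ratio is 2 * 3 / 7. Text content and attributes are ignored; only the order of tags matters.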
Style similarity extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
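The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below illustrates this on CSS class names. The names `ClassCollector`, `css_classes`, `jaccard`, and `style_similarity_sketch` are hypothetical, and returning 1.0 when neither document has any classes is an assumption rather than documented niteru behavior.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: gathers every CSS class name in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names used in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """Jaccard similarity: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # assumption: two documents without classes count as identical
    return len(a & b) / len(a | b)


def style_similarity_sketch(html1, html2):
    """Compare two documents by their sets of CSS classes (illustrative only)."""
    return jaccard(css_classes(html1), css_classes(html2))
```

Because sets are compared, how often a class appears and where it appears do not affect the result.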
The joint similarity metric is calculated as::
k * structural_similarity(html1, html2) + (1 - k) * style_similarity(html1, html2)
All the similarity metrics take values between 0.0 and 1.0.
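The weighted combination can be sketched as a small function. Here `joint_similarity_sketch` is a hypothetical name, and both the default of `k=0.5` and the range check on `k` are assumptions made for illustration.

```python
def joint_similarity_sketch(structural, style, k=0.5):
    """Weighted mean of two similarity scores (illustrative only).

    k is the weight of the structural score; (1 - k) is the weight of
    the style score. The default value of k here is an assumption.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0.0 and 1.0")
    return k * structural + (1 - k) * style
```

Since the two weights sum to 1 and both inputs lie between 0.0 and 1.0, the result also lies between 0.0 and 1.0. A smaller `k` shifts the weight toward the style score.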
Using k=0.3 gives better results, because the style similarity is more informative about the overall similarity than the structural similarity.
Here is an example:
html1 = '''
<h1 class="title">First Document</h1>
<ul class="menu">
<li class="active">Documents</li>
<li>Extra</li>
</ul>
'''
html2 = '''
<h1 class="title">Second document Document</h1>
<ul class="menu">
<li class="active">Extra Documents</li>
</ul>
'''
from niteru import style_similarity, structural_similarity, similarity
style_similarity(html1, html2) # => 1.0
structural_similarity(html1, html2) # => 0.8571428571428571
similarity(html1, html2) # => 0.9285714285714286