djparsing

microframework for parsing sites, a simple interface and flexibility will help you quickly start parsing sites.Convenient parsing, can be used as an application Django or independently

MIT License

Downloads
382
Stars
1
Committers
1

djparsing

Convenient parsing, can be used as an application Django or independently

parser of the first data block (by date this is new) and saving in the specified table

Requirements

  • python (3.4, 3.5, 3.6б 3.7)
  • django (1.8, 1.9, 1.10, 1.11)
  • lxml (4.1.1)
  • cssselect (1.0.1)
  • Pillow

Quick start

Install:
    pip install djparsing
Using:
class MyModel(models.Model):
    title = models.CharField(max_length=256)
    text = HTMLField(blank=True)
    source = models.URLField(max_length=255, blank=True)
    create_date = models.DateTimeField(auto_now_add=True)
    img = models.ImageField(blank=True, null=True)
    flag = models.BooleanField(default=False)
    
from djparsing.core import Parser, init
from djparsing import data

@init(model='MyModel', app='my_app')
class MyParserClass(Parser):
    body = data.BodyCSSSelect()
    text = data.TextContentCSSSelect()
    source = data.AttrCSSSelect(attr_data='href') #Or set the first argument AttrCSSSelect('href')
    title = data.TextCSSSelect()
    img = data.ImgCSSSelect('src') #The default is src, so the argument is optional. can ImgCSSSelect()
    
    class Meta:
        coincidence = ['Python', 'Django', 'Питон', 'ML'] # a list of words for the condition that the data fit
        field_coincidence = 'title' # field to which a list of words is used
    
pars_obj = MyParserClass(
        body='.content-list__item',
        text='.post__body_crop > .post__text',
        source='a.post__title_link',
        title='a.post__title_link',
        img='.post__body_crop > .post__text img',
        url='http://site/'
        )
pars_obj.run()

Note: a model for saving data can be specified in Meta class

class Meta:
    model = MyModel # decorator @init is not needed

Inheritance:

class MyChildParserClass(MyParserClass):
    my_field = data.TextCSSSelect()

Note: fields from the base class, and also the Meta class is inherited. You can override

If you need to install an additional field in the database:

pars_obj.add_field['flag'] = True
pars_obj.run()  #if you do not need to save to the database and print the data to the log, 
                # add the argument log -> run(log=True) and redefine the method log_output(self, result):
Example:
@init(model='MyModel', app='my_app')
class MyParserClass(Parser):
    body = data.BodyCSSSelect()
    text = data.TextContentCSSSelect()
    
    def log_output(self, result): # if you do not override the method, the result will be output to the terminal
        pass # and work further with the result

If you do not want to write data to the database or output to the log, use:

data = pars_obj.run(create=False)

Note: Also a must create=False, when you are not working with django and base

Attributs

start_url
# initialize the path to the URL with the data block.
# This is needed when the list of objects is on the page, and the data is on another page 
BodyCSSSelect(start_url='div.description.float-right > a')

Note: in the attribute with the URL should be href

add_domain
# if the URL in the attribute does not have a domain
# set add_domain=True, by default False

BodyCSSSelect(start_url='div.description.float-right > a', add_domain=True)
save_start_url

when you need to save additional data in the field, such as the start URLs of objects, add the ExtraDataField field (save_start_url = True)

body_count

how many objects are parsing

Example:
class MyParserClass(Parser):
    start = BodyCssSelect(start_url='ul.quest-tiles > li.quest-tile-1 > div.item-box > div.item-box-desc h4  a',
                          add_domain=True,
                          body_count=4)
    source = ExtraDataField(save_start_url=True)

It works on this site, all this on the channel