Parsing algorithm basics

To understand the parser settings better, you should first learn the basics of the parsing algorithm.

1. The user enters the URL(s) of the website(s) to parse content from. As soon as the user clicks the Parse button, all of these URLs are put into the parser queue for processing.

2. The parser starts the first pass. It takes the first user-provided URL and harvests all of its internal links (by default). Newly collected URLs are placed at the end of the queue for processing.

3. The parser then searches the page for the content block, using the boundaries specified in the «Input regular expression to find the content block beginning» and «Input strings to search the bottom boundary of the content block (one per line)» fields.

IMPORTANT: The content block is searched for only when the page satisfies the user-defined restrictions. Otherwise, the page is immediately added to the processed pages database and the parser jumps to step 8.

4. The parser searches the page body for the header of the future post.

5. The parser processes the content block.

6. The processed content block (with its header) is sent to the Google Translate service for translation from the language selected in «Select language to translate from» to the language selected in «Select language to translate to».

IMPORTANT: If you select the same language in both list boxes, step 6 is skipped and the content block keeps its original language.

7. The content block, together with the header found earlier, is added to the blog as a new post, and the page just parsed is added to the processed pages database.

8. If the queue of pages to process is not empty, steps 2 through 8 are repeated until the queue becomes empty or the parser has processed as many pages as specified in the «Maximal number of pages to parse per pass» field.

9. When the maximum number of pages per pass has been processed but the queue of pages to process is not empty, the user is asked the «Continue processing of unprocessed pages» question.

IMPORTANT: If you do not want to answer the continuation question between passes, uncheck the «Ask about parsing process continuation» checkbox. In this case, a new pass starts automatically.

10. If the user clicks OK, the next pass starts (from step 2). This continues until the queue of pages to process is empty or the user checks the «Stop parser as soon as possible» checkbox.
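
The steps above can be pictured as a queue-driven loop. The Python sketch below is only an illustration and is not the parser's actual code. All names in it (find_content_block, run_pass, run_parser, and the fetch, get_links, is_allowed, handle_page and confirm callbacks) are hypothetical. It also assumes that the content block starts right after the text matched by the regular expression, that a page without a bottom boundary yields no block, and that URLs already in the processed pages database are not queued or parsed again. Header search, translation and posting (steps 4 to 7) are left to the handle_page callback.

```python
import re
from collections import deque


def find_content_block(html, start_pattern, end_markers):
    """Return the text between the start regex and the nearest end marker, or None."""
    start = re.search(start_pattern, html)
    if start is None:
        return None
    rest = html[start.end():]  # assumption: the block begins after the match
    # Cut at the earliest bottom-boundary string that occurs after the start.
    cuts = [rest.find(m) for m in end_markers if m and m in rest]
    return rest[:min(cuts)] if cuts else None


def run_pass(queue, processed, fetch, get_links, is_allowed, handle_page, max_pages):
    """Run one pass: handle at most max_pages URLs taken from the front of the queue."""
    handled = 0
    while queue and handled < max_pages:
        url = queue.popleft()
        if url in processed:
            continue  # assumption: pages already in the database are skipped
        html = fetch(url)
        # Step 2: newly collected links go to the end of the queue.
        for link in get_links(url):
            if link not in processed and link not in queue:
                queue.append(link)
        # Steps 3-7 apply only to pages that satisfy the user restrictions.
        if is_allowed(url):
            handle_page(url, html)
        processed.add(url)
        handled += 1
    return handled


def run_parser(start_urls, fetch, get_links, is_allowed, handle_page, max_pages, confirm):
    """Repeat passes until the queue is empty or confirm() returns False."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    queue = deque(start_urls)  # step 1: all user URLs are queued at once
    processed = set()          # stands in for the processed pages database
    while queue:
        run_pass(queue, processed, fetch, get_links, is_allowed, handle_page, max_pages)
        # Steps 9-10: ask before the next pass only if pages are still waiting.
        if queue and not confirm():
            break
    return processed


# A tiny in-memory "site": URL -> (page source, internal links).
pages = {
    "/": ("<h1>Home</h1><div id='post'>Hello</div><!-- end -->", ["/a", "/b"]),
    "/a": ("<div id='post'>Page A</div><footer>", ["/"]),
    "/b": ("no block here", []),
}
posts = {}


def save_post(url, html):
    block = find_content_block(html, r"<div id='post'>", ["</div>", "<footer>"])
    if block is not None:
        posts[url] = block


done = run_parser(["/"], lambda u: pages[u][0], lambda u: pages[u][1],
                  lambda u: True, save_post, max_pages=2, confirm=lambda: True)
```

In this sketch, passing confirm=lambda: True corresponds to unchecking «Ask about parsing process continuation», so every new pass starts automatically, while a confirm callback that returns False stops the parser after the current pass.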
