Tutorial



The following is a detailed tutorial to help you use CrawlForMe.

Getting started

Before you can start to use our powerful tool, you must be registered.

Follow this link and you will see a form asking for the URL of the website to crawl and the email address where results will be sent. CrawlForMe will check your website and send you a report as soon as the job is completed. The email contains a quick overview of the results, along with a direct link to the report. You must log in to consult the report; you can do so using the login and password provided in the email you received after you registered. You may also access the report through the platform via ‘Dashboard’ → ‘Results’.

Trial mode offers you two crawls of a maximum of 5000 unique resources. If your website contains more resources than this, you will still be able to access a partial report that gives you a preview of what the full report looks like.


Just by filling in a few fields, you can use our tool.

Set up your crawl

Add websites, seeds and scheduled crawls

From your dashboard, you can add as many websites as you want. Each of them is associated with a configuration and can have as many different seeds as you want. Finally, schedule a task and choose its frequency: once, daily, weekly, monthly, or yearly.


Your dashboard, the heart of CrawlForMe.


Creating a new website is as simple as this.

Configure your crawler

CrawlForMe provides a wide variety of options for full customization of your crawler. Although we provide a suitable configuration for most cases, each option can be tweaked to suit your taste or your needs.

Website options

Depth
Limits how far the crawler goes. Each resource has a depth: the distance between that particular resource and the seed. The depth increases with each new level of resources crawled. From a user's point of view, the depth of a resource is the minimum number of clicks separating the main page from the resource.

For instance, if you intend to crawl only the seed and its immediate children, then the maximum depth will be 1, since the seed's depth is 0. This is an effective way to control the number of resources crawled as well as the time required to crawl your website.
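
As an illustration only (this is not CrawlForMe's actual code), a depth-limited crawl boils down to a breadth-first traversal: the seed starts at depth 0 and every link found on a page at depth d is queued at depth d + 1, up to the configured maximum. A minimal Python sketch, with deliberately naive link extraction:

    from collections import deque
    from urllib.parse import urljoin
    import re
    import urllib.request

    HREF_RE = re.compile(r'href="([^"]+)"')  # naive link extraction, for illustration only

    def crawl(seed, max_depth=1):
        """Breadth-first crawl: the seed is depth 0, its links depth 1, and so on."""
        seen = {seed}
        queue = deque([(seed, 0)])
        while queue:
            url, depth = queue.popleft()
            print(f"depth {depth}: {url}")
            if depth >= max_depth:
                continue  # links found on this page would exceed the maximum depth
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue  # unreachable resources would end up in the Error tab
            for link in HREF_RE.findall(html):
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))

    crawl("http://www.crawlforme.com/", max_depth=1)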

Follow redirection links
Selected by default, this option allows redirections to be followed until a resource is found, ensuring that the redirection is neither broken nor unreachable. Moreover, redirection is a basic concept of the HTTP protocol, and disallowing it will substantially reduce the number of resources found. The analysed links will be shown in the Redirected or Error tabs of your report and will be charged.
Check external links
External links are resources outside of the context defined by the seed. If this option is selected, the complete information about the link (weight, MIME type, charset, …) will be retrieved and shown in the Successful or Error tab of your report. Otherwise, external resources will be shown in the Ignored tab of your report and will not be charged.
Handle “nofollow” directive
Allows links marked as nofollow to be skipped. More information about this directive can be found on this page.
Protocol consistency
Ensures the consistency of the protocol used in all unique links, from the seed to the last resource found. Where the protocol is not consistent, CrawlForMe spots the weakness and stores it in the Error tab of your report. This option is especially useful when you want to verify the integrity of a website using the secure HTTPS protocol, but other use cases are also possible.
Max resource age
Excludes all resources older than the provided date. This mechanism is handled by your HTTP server and might reduce the amount of transaction overhead inherent to a crawl.
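
Concretely, this kind of age check relies on standard HTTP features such as conditional requests; the exact headers CrawlForMe relies on are an assumption here, but the idea can be sketched with an If-Modified-Since request, which lets the server skip the body of resources older than the cut-off date:

    import urllib.error
    import urllib.request
    from datetime import datetime, timezone
    from email.utils import format_datetime

    # Hypothetical cut-off date: anything not modified since then is excluded.
    cutoff = datetime(2015, 1, 1, tzinfo=timezone.utc)

    req = urllib.request.Request(
        "http://www.crawlforme.com/blog/",
        headers={"If-Modified-Since": format_datetime(cutoff, usegmt=True)},
    )
    try:
        resp = urllib.request.urlopen(req, timeout=10)
        print("Resource modified after the cut-off:", resp.status)
    except urllib.error.HTTPError as err:
        if err.code == 304:  # 304 Not Modified: the server did not send the body at all
            print("Resource older than the cut-off; no content transferred.")
        else:
            raise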

Request options

Parallel tasks
Maximum number of tasks working together. There is a direct correlation between the speed of a crawl and the induced transaction overhead. Reducing the number of tasks slows down the crawl and thus decreases the overhead.
Niceness
HTTP requests work in two phases: first, a HEAD request determines whether the resource is reachable; then a GET request gathers the resource's content for further analysis. The niceness value is the time in milliseconds between these two requests. Like the number of parallel tasks, this option is useful to control the transaction overhead induced by a crawl (see the sketch after this list).
Max attempt
Number of times to retry a resource in error.
Ignore robots.txt
Although this is not recommended, you may ignore the robots.txt file. This file allows the webmaster to define which pages a robot may crawl and which pages are forbidden. The rules defined in this file are closely related to the user-agent.
User agent and masquerade
A user-agent is the client application that remotely accesses another computer or server over the network. In HTTP, the requester identifies itself by specifying a user-agent with each request. Since this header field is widely acknowledged as unreliable, the user-agent is mainly used for statistics and for robots.txt matching. We can also send no user-agent at all.
Allow compression
Allow or disable page content compression during transfers. Since compression can lower the size by up to 30% (for HTML content), we strongly suggest you leave this option activated. If your server doesn't support such compression, it's no big deal: we will automatically fall back to uncompressed transfers.
Accept-language
The Accept-Language is not necessarily a language available on your website; it is a parameter of the HTTP request which gives the server a hint about which language the user prefers. It is up to you to determine whether or not this parameter is relevant.

Valid format: a language abbreviation (en, fr, nl, etc.) that may be followed by a country code (US, FR, BE, etc.). Therefore, a valid accept-language value looks like en-US, fr-BE, fr-FR, …
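
To make these options more concrete, here is a rough, hypothetical sketch of the request cycle they control: respect robots.txt, identify with a user-agent, announce a preferred language and allow compression, probe with a HEAD request, wait for the niceness delay, then fetch with GET. The user-agent string and values are assumptions, not CrawlForMe's real ones:

    import time
    import urllib.request
    import urllib.robotparser

    SEED = "http://www.crawlforme.com/"
    USER_AGENT = "CrawlForMeBot/1.0"   # hypothetical user-agent string
    NICENESS_MS = 500                  # delay between the HEAD and the GET requests
    HEADERS = {
        "User-Agent": USER_AGENT,
        "Accept-Language": "en-US",    # hint about the preferred language
        "Accept-Encoding": "gzip",     # allow compressed transfers
    }

    # Respect the rules published by the webmaster, unless robots.txt is ignored.
    robots = urllib.robotparser.RobotFileParser(SEED + "robots.txt")
    robots.read()

    def check_then_fetch(url):
        if not robots.can_fetch(USER_AGENT, url):
            return None  # the resource would end up in the Ignored tab
        # Phase 1: a HEAD request tells us whether the resource is reachable.
        head = urllib.request.Request(url, headers=HEADERS, method="HEAD")
        urllib.request.urlopen(head, timeout=10)
        # Niceness: wait a little before asking for the full content.
        time.sleep(NICENESS_MS / 1000)
        # Phase 2: a GET request retrieves the content for further analysis.
        # Note: with Accept-Encoding: gzip the body may need gzip.decompress().
        get = urllib.request.Request(url, headers=HEADERS)
        return urllib.request.urlopen(get, timeout=10).read()

    body = check_then_fetch(SEED)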

Resource options

This section allows you to filter the type of resource that will be inspected by our crawler. The term inspect has a subtle connotation. Indeed, you may not only want to check for the presence of the resource, but you may also want us to analyse its content in order to find more links. That’s why we differentiate between checking a link and analysing its content.

JavaScript & CSS

Checking the link
These are by nature external files. If you decide to check the link, its content will remain unanalysed, but we will be able to tell you whether or not the file exists. All related information will be available, such as the size of the file, the encoding, …

The resource will be displayed in the Successful or Error tab and will be charged.

Embedded content
Embedded content is JavaScript or CSS defined within an HTML page. We can analyse it to find additional resources, such as images. Due to the programmatic nature of JavaScript, partial links may be found which don't reflect the links produced when the code is actually executed.
External content
Analyse the content of an external file found when the related link has been checked. This option is, of course, only available if you allow the crawler to check the link prior to trying to analyse it.
Image, Flash & other
These resources are, by nature, files. We can check for their presence and gather additional information, but no additional links will result from the analysis.
IFrame
IFrames are a means to display external content within your website. Although this tag is less commonly used nowadays, we can check the external link as well as gather additional information such as size, MIME type, encoding, …


Customisation of your crawler

Configure your report

Show custom logo
You may add a custom logo to the report. It will be displayed at the top right of the report. Check this option, then upload your logo. Currently, the expected width/height ratio of the image to upload is 2.5 (200×80px).
Show brand logo
You may override the CrawlForMe logo of the report if you want to brand it under your own name. This option is not available for all plans.
Show ignored links
Decide whether or not to display the Ignored tab in your report. Just a reminder: ignored resources are not charged.
Show redirected links
Similar to the previous option but for redirected links. Be careful: this is only a display option; depending on the options in the crawler configuration, the crawler still analyses redirection links, which are charged.
Show configuration
Decide whether or not to display a tab containing the whole configuration of your report. It is pretty handy to have, since the content of the report is directly related to its configuration.
Show password
Display passwords added via the Protected pages as plain text, or replace the letters with stars.
Send report by mail
If you do not want to be notified when a report is released, just uncheck this option.
Addresses
List of addresses that will receive a notification whenever a new report is available.


Even the report has its options.

Add ignored links

You can exclude specific resources, or a set of resources whose name matches a specific pattern (simple pattern matching using the common * and ? wildcards, or regular expressions). There are real advantages to using this powerful feature, such as getting quicker results by avoiding irrelevant content.

The pattern-matching mechanism is a common feature in our platform. You will find it in the Ignored links and the Protected pages. That’s why there is a specific section containing examples at the end of this tutorial. Due to the complex and powerful nature of pattern-matching, you may not grasp the entire potential right away but if you’re having a hard time getting what you want, remember that our staff is always there to provide you with further help.

Name
Choose a name, simply a word of your choice, so you can quickly retrieve this configuration in the list of all ignored links.
Pattern and Type
See the specific section


Form to add an ignored pattern.

Add protected pages

The HTTP protocol provides an authentication mechanism, which we support in order to find and keep checking additional resources. If you specify multiple global passwords, each of them will be tested whenever an authentication is encountered. The result of these attempts will be displayed in the report.

Global
Checked by default, this flag means that the login/password may be used anywhere on your website when an AUTH request is required. If you uncheck this flag, you will have to tell us where to use it.
Use pattern
See the specific section
Username and Password
The login/password combination required for an authentication. We encourage you to create a specific account for our crawler if possible so that you can deactivate it once the crawl is done, to make sure you’re not opening a security hole in your website.
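
To give an idea of what happens behind the scenes (the real behaviour may differ), a global credential amounts to retrying a request with HTTP Basic authentication whenever a 401 response is met. A hypothetical sketch with made-up credentials:

    import base64
    import urllib.error
    import urllib.request

    # Hypothetical global credentials; each pair is tried when a 401 is returned.
    CREDENTIALS = [("crawler", "secret"), ("guest", "guest")]

    def fetch_protected(url):
        try:
            return urllib.request.urlopen(url, timeout=10).read()
        except urllib.error.HTTPError as err:
            if err.code != 401:
                raise
        for user, password in CREDENTIALS:
            token = base64.b64encode(f"{user}:{password}".encode()).decode()
            req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
            try:
                return urllib.request.urlopen(req, timeout=10).read()
            except urllib.error.HTTPError as err:
                if err.code != 401:
                    raise
        return None  # every attempt failed; the resource would appear in the Error tab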


Get past any authentication to find more resources.

Add cookies

Cookies are a simple way to store data pushed by the server to the client side. They are widely used for a variety of purposes. For instance, cookies can be used to automatically authenticate a user returning to your website. We support the push mechanism, but you may want to pre-store some data.

As you can see, you cannot specify a domain for the cookie. The domain used is the one extracted from the seed. If multiple seeds are defined with different domains, the cookie will be duplicated for each domain found.
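
If you are wondering what pre-storing a cookie amounts to, it is nothing more than attaching a Cookie header to every request sent to the seed's domain. A tiny illustration with made-up values:

    import urllib.request

    # Hypothetical pre-stored cookies: the name/value pairs you define in the form.
    COOKIES = {"session": "abc123", "lang": "en"}

    cookie_header = "; ".join(f"{name}={value}" for name, value in COOKIES.items())
    req = urllib.request.Request(
        "http://www.crawlforme.com/",       # the domain comes from the seed
        headers={"Cookie": cookie_header},
    )
    response = urllib.request.urlopen(req, timeout=10)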

Name
Name of the cookie.
Value
Value of the cookie.


A key/value pair is all you need to define a cookie.

Add a form

Another specific feature of CrawlForMe is the ability to handle forms and check the response for success or failure. In other words, you can validate that each form accepts certain values and redirects the user to a specific page.
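
Conceptually, the check boils down to submitting the form's fields and verifying where the response lands. The sketch below is a hypothetical simplification, not CrawlForMe's code; the URL, field names and expected landing page are all made up:

    import urllib.parse
    import urllib.request

    FORM_URL = "http://www.crawlforme.com/contact"              # page containing the form
    FIELDS = {"name": "Jane", "email": "jane@example.com"}       # values to submit
    EXPECTED = "http://www.crawlforme.com/contact/thank-you"     # expected success page

    data = urllib.parse.urlencode(FIELDS).encode()
    response = urllib.request.urlopen(FORM_URL, data=data, timeout=10)  # redirects are followed
    if response.geturl() == EXPECTED:
        print("The form accepted the values and redirected to the expected page.")
    else:
        print("Unexpected landing page:", response.geturl())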

Create a form

First step, create the form.

Form
Name your form so you can spot it easily in the report. A dedicated tab will be present in the report if you use this feature.
Address
This is the complete URL of the page containing the form. If the form is behind Protected pages or if Cookies are needed, don't forget to add them before adding your form.


Specify where the form should be found; the URL must be exact.

Define the Input/Values

All the forms found on the page mentioned in the previous step will be available through a dropdown menu. Choose the one you want, and the form will be extracted from your page and displayed. You can now use it like a regular form by typing the desired values in the available fields.


View the form and fill it in with values like a regular form.
As you can see, a password was used to reach this form. It was defined in the protected pages tab before configuring the form.

The reports

Dashboard

This is without a doubt the most interesting part of CrawlForMe: the reports. They are all listed chronologically and have a status so that you can see which are new and which ones you’ve already viewed. If your report is complete, you can consult, share, delete or export it.

View
Access your report through a simple yet intuitive interface that can handle an unlimited number of resources with ease. See the complete overview of a report below.
Share
Grant access to external users to consult your report by generating an access key, which can be revoked at any time.
Export
Generate a CSV file containing all the errors and the resources pointing toward them. This is the kind of file that will help your IT department solve the issues.
Compare
Still under heavy development, this feature allows you to compare two different reports to see the resources specific to each of them, as well as those they share.


The results tab provides an overview of all your crawls.

Report overview

A report contains six tabs, each with its own purpose. Those displaying resources have features in common, like sorting, searching, pagination and browsing relations. Getting used to them is just a matter of minutes.


Lots of graphs are available to provide a graphic overview of the report.


You can easily retrieve references to resources with errors.


All resources that were checked with no problems.


All resources that return a redirection.


Resources that were ignored due to specific ignored patterns, robots.txt entries, …

Your profile

Overview

Your profile contains all your personal information as well as user management and your financial history. As a regular user you have access to all this information. If you create other users, you may restrict access to the user management section to prevent further users from being created.


Your account overview contains your current plan and its expiration date, as well as your financial history and invoices.


Most of the fields in the legal entity tab are required in order to buy a plan.


Create and manage additional users under the same legal entity.

Subscribe to a plan

In order to start a transaction, you must have filled in all the information for your user and legal entity. If any mandatory field is missing, you will be automatically redirected to the tab that requires your attention.

The basket and the selected plan description will update automatically as you choose your desired plan and period. Once you find the options that suit you best, just read the general conditions and proceed to the secure payment using PayPal.

Each transaction will generate a downloadable invoice available in your account overview.


Use this dynamic form to find the options that suit you best.

Advanced notions

Taking advantage of patterns

Patterns are a generic component of CrawlForMe. They allow you to use a single expression to designate a range of resources and act on them. Don't be afraid of their apparent complexity. Here are some simple examples that will allow you to master them in no time, but first, a quick explanation of the three fields that make up a pattern.

Pattern
The pattern itself is a string that reflects a resource.
Type
Two types are available:

Parsed value
The most common and simple case. By selecting this mode, CrawlForMe will convert a string containing the classic wildcards (*, ?, …) into a regular expression.
Regular expression
CrawlForMe brings you the full power of regular expressions. See below for more information.
Action
Only available for regular expressions, this allows you to tell us if your expression should match the entire resource path or just a substring of it. Parsed value will always try to find the expression within the resource path.

Now let's take a look at both Parsed values and Regular expressions.

Parsed values

Parsed values are the easiest regular expressions to use. In fact, you may already have been using them without noticing. See the table below for some self-explanatory examples.

Pattern | Resource                        | Found | Matching
blog    | http://www.crawlforme.com/blog/ | yes   | http://www.crawlforme.com/blog/
*blog*  | http://www.crawlforme.com/blog/ | yes   | http://www.crawlforme.com/blog/
bl?g    | http://www.crawlforme.com/blog/ | yes   | http://www.crawlforme.com/blog/

That's it! It's as simple as that! Of course you can build more advanced matches with this system, but as you can see there is an inherent limit to complexity.
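
Under the hood, such wildcard patterns are simply translated into regular expressions. The exact translation CrawlForMe performs is an assumption here, but Python's fnmatch module does the same kind of conversion and can reproduce the table above:

    import fnmatch
    import re

    resource = "http://www.crawlforme.com/blog/"

    for pattern in ("blog", "*blog*", "bl?g"):
        # fnmatch.translate turns the wildcard pattern into a regular expression.
        # Wrapping it in '*...*' mimics "find the pattern anywhere in the resource".
        regex = fnmatch.translate(f"*{pattern}*")
        found = re.match(regex, resource) is not None
        print(f"{pattern:8} found: {found}")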

Regular expressions

Examples
Pattern                           | Action | Resource                        | Found | Matching
[a-z]+                            | find   | http://www.crawlforme.com/blog/ | yes   | http://www.crawlforme.com/blog/
[a-z./]+                          | find   | http://www.crawlforme.com/blog/ | yes   | http://www.crawlforme.com/blog/
http://www\.crawlforme\.com/blog/ | match  | http://www.crawlforme.com/blog/ | yes   | http://www.crawlforme.com/blog/
List of modifiers
Modifier | Translation | Additional example
.        | Matches any character (the dot itself included). | To match a literal dot, escape it with a backslash: www\.crawlforme\.com
[a-z]    | Defines a range of characters. This is case sensitive and multiple ranges can be merged. | [a-mN-Z] will match either a to m in lower case or N to Z in upper case.
[0-9]    | Defines a range of digits to match. Numbers with multiple digits are forbidden; use a quantifier for this purpose. |
List of quantifiers
Quantifier | Translation | Additional example
?          | Matches the preceding pattern element zero or one time.   | [a-z]?, [0-9]? or even \n?
*          | Matches the preceding pattern element zero or more times. | \s*, …
+          | Matches the preceding pattern element one or more times.  | [a-zA-Z]+, …

The previous examples are just an overview of the most common and useful modifiers and quantifiers you may encounter. Since regular expressions are a vast and complex subject, we strongly suggest taking a look at this Wikipedia page if you want to know more about them.
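
In Python terms, the find action behaves like re.search (the expression may match any substring of the resource path), while the match action behaves like re.fullmatch (the expression must cover the whole path). A small, illustrative comparison:

    import re

    resource = "http://www.crawlforme.com/blog/"

    # "find": the expression only needs to occur somewhere in the resource path.
    print(bool(re.search(r"[a-z]+", resource)))     # True
    print(bool(re.search(r"[a-z./]+", resource)))   # True

    # "match": the expression must describe the resource path from start to end.
    print(bool(re.fullmatch(r"http://www\.crawlforme\.com/blog/", resource)))  # True
    print(bool(re.fullmatch(r"[a-z]+", resource)))  # False: ':' and '/' are not covered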

Escaping, what is it and when do I need it?

Escaping is placing a backslash \ before a special character such as . to prevent the parsing engine from interpreting it. You need to escape a character when you want to keep its literal meaning.
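
If you prefer not to escape by hand, Python's re.escape shows what the escaped form of a URL looks like; this only illustrates the principle:

    import re

    url = "http://www.crawlforme.com/blog/"

    # Without escaping, each dot matches any character, so the pattern would also
    # accept URLs such as http://wwwXcrawlformeYcom/blog/.
    print(re.escape(url))   # -> http://www\.crawlforme\.com/blog/

    # The escaped pattern now only matches the literal URL.
    print(bool(re.fullmatch(re.escape(url), url)))  # True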