Crawler

The crawler travels through your site, refreshing pages that have expired in the cache. This makes it less likely that your visitors will encounter uncached pages.

The crawler must be enabled at the server level or the virtual host level by a site admin. Please see: Enabling the Crawler at the Server or Virtual Host Level

Learn more about crawling on our blog.

Summary Tab

!LSCWP Crawler Section Summary Tab

Crawler Cron

See the progress of the various crawlers enabled for your site. You can monitor each one via the color-coded rectangles in the Status column.

Note

Crawlers cannot run concurrently. If you have multiple crawlers defined, they will run one at a time.

Use the Reset Position button to start the crawler at the beginning again.

Use the Manually run button to start the crawler without waiting for the cron job.

Watch Crawler Status

If you've opted to watch the crawler status, your screen will look something like the image below. The messages in the status window will vary from site to site.

!LSCWP Crawler Status Window

Here is an explanation of some of the terms:

  • Size: The number of URLs in the sitemap. This example has 181.
  • Crawler: Indicates which crawler number you are watching. It's number 1 in this example. There could be multiple crawlers working, depending on your settings.
  • Position: The URL number currently being fetched from the sitemap list.
  • Threads: Indicates the number of threads currently being used to fetch URLs. There may be multiple threads fetching. It is smart and will adjust based on your load settings.
  • Status: Indicates the current crawler status. In this example, Stopped due to reset meta position means that the site purged or the sitemap changed while it was crawling, and as such, the crawler will restart from the top.

Map Tab

!LSCWP Crawler Section Map Tab

Sitemap List

This page displays the URIs currently in the crawler map. If you don't see any listed, try pressing the Refresh Crawler Map button.


From here you can manually add URIs to the blacklist via the button next to each entry.

The Crawler Status column uses colored dots to give you the status of each URI. See below the table for the key.

To start from scratch with the crawler map, press the Clean Crawler Map button.

Blacklist Tab

!LSCWP Crawler Section Blacklist Tab

Blacklist

This page displays the URIs currently in the blacklist.

From here you can manually remove URIs from the blacklist via the button next to each entry.

The Status column uses colored dots to give you the status of each URI. See below the table for the key.

To start from scratch and clear out the blacklist, press the Empty Blacklist button.

General Settings Tab

!LSCWP Crawler Section Settings Tab

Crawler

OFF

Set this to ON to enable crawling for this site.

Delay

500

Set the delay in microseconds. This tells LSCache how long to wait before sending a new request to the server. You can increase this amount to lessen the load on the server; just be aware that this will make the entire crawling process take longer.

This setting may be limited at the server level. Learn more about limiting the crawler's impact on the server.

Run Duration

400

This is how long the crawler runs before taking a break. The default of 400 has the crawler run for 400 seconds, then it temporarily stops. After the break is over, the crawler will start back up exactly where it left off and run for another 400 seconds. This will continue until the entire site has been crawled.

Interval Between Runs

600

This setting determines the length of the break mentioned above. By default, the crawler rests for 600 seconds in between every 400-second run.

Crawl Interval

302400

This value determines how long to wait before re-initiating the entire crawling process; the default of 302400 seconds is 3.5 days. To keep your site regularly crawled, determine how long the crawler usually takes to run, and set this value to slightly longer than that.
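As a rough illustration of how these three timing settings interact, you can estimate the wall time of one full crawl and then choose a Crawl Interval slightly larger than it. The sketch below is a back-of-envelope assumption, not plugin code; the per-URL fetch time is a made-up number.

```python
import math

# Back-of-envelope estimate only; real crawl time depends on server load,
# thread count, and page weight. Assumed values are marked as such.
urls = 181              # sitemap size (from the Summary tab example above)
seconds_per_url = 2     # assumption: average fetch time per URL
run_duration = 400      # Run Duration setting
interval_between = 600  # Interval Between Runs setting

active = urls * seconds_per_url                  # time spent actually crawling
runs = math.ceil(active / run_duration)          # number of 400-second runs needed
total = active + (runs - 1) * interval_between   # add the breaks between runs
print(f"~{runs} run(s), roughly {total} seconds per full crawl")
# Set Crawl Interval a bit higher than `total` so a new crawl starts
# shortly after the previous one finishes.
```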

Threads

3

This is the number of separate crawling processes happening concurrently. The higher the number, the faster your site is crawled, but also the more load that is put on your server.

Timeout

30

The crawler has this many seconds to crawl a page before moving on to the next page. Value can range from 10 to 300 seconds.

Server IP

Empty string

As of v1.1.1, you can enter your site's IP address to simplify the crawling process and eliminate the overhead involved in DNS and Content Delivery Network (CDN) lookups. To understand why, let's look at a few scenarios.

This is how it works if you’re using a CDN:

  • The crawler gets http://yourserver.com/path from the sitemap
  • The crawler checks with the DNS to find yourserver.com’s IP address
  • The DNS returns the CDN's IP address to the crawler
  • The crawler goes to the CDN to ask for the page
  • The CDN grabs the page from yourserver.com
  • The CDN returns the page to the crawler

This is how it works if you’re not using a CDN:

  • The crawler gets http://yourserver.com/path from the sitemap
  • The crawler checks with the DNS to find yourserver.com’s IP address
  • The crawler grabs the page from yourserver.com

In both scenarios, there are lookups that occur, expending time and resources. These lookups can be eliminated by entering your site’s IP in this field.

When the crawler knows your IP, this is how it works:

  • The crawler gets http://yourserver.com/path from the sitemap
  • The crawler grabs the page directly from yourserver.com, because it already knows the IP address

The middlemen are eliminated, along with all of their overhead.
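As a conceptual sketch of what "skipping the lookup" means (this is not the plugin's actual code; the IP address and path are hypothetical placeholders), a DNS-free fetch amounts to connecting straight to the IP while sending the real hostname in the Host header:

```python
import http.client

# Hypothetical values: 203.0.113.10 stands in for the "Server IP" field,
# and yourserver.com/path mirrors the scenario above.
conn = http.client.HTTPConnection("203.0.113.10", 80, timeout=30)
conn.request("GET", "/path", headers={"Host": "yourserver.com"})
response = conn.getresponse()
print(response.status)   # the page is fetched with no DNS or CDN lookup
conn.close()
```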

Server Load Limit

1

This setting is a way to keep the crawler from monopolizing system resources. Once the server load reaches this limit, the crawler is terminated rather than being allowed to compromise server performance. This setting is based on Linux server load. (A completely idle computer has a load average of 0. Each running process either using or waiting for CPU resources adds 1 to the load average.)

This setting may be limited at the server level. Learn more about limiting the crawler's impact on the server.
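If you're not sure what a sensible limit is, you can check your server's current load average yourself. A minimal sketch, run on the server and independent of the plugin:

```python
import os

# os.getloadavg() returns the 1-, 5-, and 15-minute load averages (Linux/Unix).
one_min, five_min, fifteen_min = os.getloadavg()
print(f"1-minute load average: {one_min:.2f}")
# Per the description above, the crawler stops once the load
# reaches the Server Load Limit value.
```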

Simulation Settings Tab

!LSCWP Crawler Section Simulation Tab

Role Simulation

Empty list

By default, the crawler runs as a non-logged-in "guest" on your site. As such, the pages that are cached by the crawler are all for non-logged-in users. If you would like to also pre-cache logged-in views, you may do so here.

The crawler simulates a user account when it runs, so you need to specify user ID numbers that correspond to the roles you'd like to cache.

Example

To cache pages for users with the "Subscriber" role, choose one user that has the "Subscriber" role, and enter that user's ID in the box.

You may crawl multiple points-of-view by entering multiple user IDs in the box, one per line.

Note

Only one crawler may run at a time, so if you have specified one or more user IDs in the Role Simulation box, first the "Guest" crawler will run, and then the role-based crawlers will run, one after the other.

To crawl for a particular cookie, enter the cookie name, and the values you wish to crawl for. Values should be one per line, and can include a blank line. There will be one crawler created per cookie value, per simulated role. Press the + button to add additional cookies, but be aware the number of crawlers grows quickly with each new cookie, and can be a drain on system resources.

Example

If you crawl for Guest and Administrator roles, and you add testcookie1 with the values A and B, you have 4 crawlers:

  1. Guest, testcookie1=A
  2. Guest, testcookie1=B
  3. Administrator, testcookie1=A
  4. Administrator, testcookie1=B

Add testcookie2 with the values C, D, and a blank line, and you suddenly have 12 crawlers:

  1. Guest, testcookie1=A, testcookie2=C
  2. Guest, testcookie1=B, testcookie2=C
  3. Administrator, testcookie1=A, testcookie2=C
  4. Administrator, testcookie1=B, testcookie2=C
  5. Guest, testcookie1=A, testcookie2=D
  6. Guest, testcookie1=B, testcookie2=D
  7. Administrator, testcookie1=A, testcookie2=D
  8. Administrator, testcookie1=B, testcookie2=D
  9. Guest, testcookie1=A, testcookie2=
  10. Guest, testcookie1=B, testcookie2=
  11. Administrator, testcookie1=A, testcookie2=
  12. Administrator, testcookie1=B, testcookie2=
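The count is simply the product of the number of roles and the number of values for each cookie. Here is a minimal sketch of that multiplication (illustrative only; the plugin builds its crawler list internally):

```python
from itertools import product

roles = ["Guest", "Administrator"]
testcookie1 = ["A", "B"]
testcookie2 = ["C", "D", ""]   # the blank line counts as a value

crawlers = list(product(roles, testcookie1, testcookie2))
print(len(crawlers))           # 12
for role, c1, c2 in crawlers:
    print(f"{role}, testcookie1={c1}, testcookie2={c2}")
```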

There aren't many situations where you would need to simulate a cookie crawler, but it can be useful for sites that use a cookie to control multiple languages or currencies.

Example

WPML uses the _icl_current_language= cookie to differentiate between visitor languages. An English speaker's cookie would look like _icl_current_language=EN, while a Thai speaker's cookie would look like _icl_current_language=TH. To crawl your site for a particular language, use a Guest user, and the appropriate cookie value.
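To see what such a cookie-based crawl conceptually does, you can request a page yourself with the language cookie set. This is only an illustration (the plugin's crawler does not use this code, and the URL is a placeholder); where present, the X-LiteSpeed-Cache response header shows whether the page came from cache:

```python
import requests   # requires the third-party `requests` package

# Placeholder URL; substitute a real page from your site.
resp = requests.get(
    "https://yourserver.com/sample-page/",
    cookies={"_icl_current_language": "TH"},   # WPML's Thai-language cookie
)
print(resp.status_code, resp.headers.get("X-LiteSpeed-Cache"))
```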

Sitemap Settings Tab

!LSCWP Crawler Section Sitemap Tab

Custom SiteMap

Empty string

A sitemap tells the crawler which pages on your site should be crawled. By default, LSCache for WordPress generates its own sitemap. If, however, you already have a sitemap that you’d like to use, that is an option as of v1.1.1.

Enter the full URL to the sitemap in this field.

Note

The sitemap must be in Google XML Sitemap format.
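For reference, a minimal sitemap in that format looks something like this (the URL and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourserver.com/sample-page/</loc>
    <lastmod>2020-06-01</lastmod>
  </url>
</urlset>
```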

Drop Domain from Sitemap

ON

The crawler will parse the sitemap and save it into the database before crawling. When parsing the sitemap, dropping the domain can save DB storage.

Warning

If you are using multiple domains for one site, and you have multiple domains in the sitemap, please keep this option OFF. Otherwise, the crawler will only crawl one of the domains.

Sitemap Generation

Use these fields if you don't already have a custom sitemap to use.

Include Posts / Pages / Categories / Tags

ON

These four settings determine which content types will be included in the generated sitemap and crawled. By default, all of them are.

Exclude Custom Post Types

Empty string

By default, all custom post types are crawled. If you have some that should not be crawled, list them in this field, one per line.
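For example, assuming you have hypothetical custom post types registered with the slugs portfolio and testimonial that you don't want crawled, the field would contain:

```
portfolio
testimonial
```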

Order Links By

Date, descending

This field determines the order that the crawler will parse the sitemap. By default, priority is given to the newest content on your site. Set this value so that your most important content is crawled first, in the event the crawler is terminated before it completes the entire sitemap.


Last update: June 9, 2020