Troubleshooting Crawler Issues

Too Many Pages Added to Blacklist

Particular pages are being added to the blacklist after the first crawl, but when you check them manually (through the browser or through curl) you see the x-litespeed-cache header and a 200 OK status code. So, why are these URIs ending up in the blacklist?

By default, LSCWP's built-in crawler will add a URI to the blacklist if either of the following conditions is met:

  1. The page is not cacheable by design or by default. In other words, the page sends the response header x-litespeed-cache-control: no-cache.
  2. The page doesn't respond with one of the following status lines:

    HTTP/1.1 200 OK
    HTTP/1.1 201 Created
    HTTP/2 200
    HTTP/2 201
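
In rough terms, the crawler's decision looks like this sketch (illustrative PHP only; the variable names and the blacklist storage are hypothetical, and the plugin's real code is structured differently):

    // Sketch of the blacklist rule described above -- not the plugin's
    // actual code. $status and $cache_control are assumed to be parsed
    // from the crawler's curl response for the URI being checked.
    $no_cache  = ( stripos( (string) $cache_control, 'no-cache' ) !== false ) ;
    $status_ok = in_array( $status, array( 200, 201 ), true ) ;

    if ( $no_cache || ! $status_ok ) {
        $blacklist[] = $uri ; // hypothetical; the plugin persists this differently
    }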

Knowing these conditions can help us troubleshoot pages that are unexpectedly blacklisted.

Diagnosing the Cause

Upon checking the debug log, we find that the response header was never logged. To find out why, we need to make a modification to the crawler class.

Open the following file: litespeed-cache/lib/litespeed/litespeed-crawler.class.php

Add the following code at line 273, to allow us to log more information:

    LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ;

Now, when the crawler processes a URI, the $headers will be written to the debug log.

Run the crawler manually, then check the debug log with the following command:

    grep headers /path/to/wordpress/wp-content/debug.log

You should see something like this:

[Screenshot: debug log output from the grep command above]

So here is the problem: most of the logged headers show HTTP/1.1 200 OK, but a few of them are empty. It's the empty ones that are being added to the blacklist.

But why, if you manually run a curl on one of those pages, does it look fine?

    [root@test ~]# curl -I -XGET https://example.com/product-name-1
    HTTP/1.1 200 OK
    Date: Thu, 11 Jul 2019 20:57:54 GMT
    Content-Type: text/html; charset=UTF-8
    Transfer-Encoding: chunked
    Connection: keep-alive
    Set-Cookie: __cfduid=some-string-here; expires=Fri, 10-Jul-20 20:57:43 GMT; path=/; domain=.example.com; HttpOnly
    Cf-Railgun: direct (starting new WAN connection)
    Link: <https://example.com/wp-json/>; rel="https://api.w.org/"
    Link: </min/186a9.css>; rel=preload; as=style,</min/f7e97.css>; rel=preload; as=style,</wp-content/plugins/plugin/jquery.min.js>; rel=preload; as=script,</min/7f44e.js>; rel=preload; as=script,</min/a8512.js>; rel=preload; as=script,</wp-content/plugins/litespeed-cache/js/webfontloader.min.js>; rel=preload; as=script
    Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
    Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
    Vary: Accept-Encoding
    X-Litespeed-Cache: miss
    X-Litespeed-Cache-Control: public,max-age=604800
    X-Litespeed-Tag: 98f_WC_T.156,98f_WC_T.494,98f_WC_T.48,98f_product_cat,98f_URL.e3a528ab8c54fd1cf6bf060091288580,98f_T.156,98f_
    X-Powered-By: PHP/7.3.6
    Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    Server: cloudflare
    CF-RAY: 5f5db4fd1c234c56-AMS

This URI returns 200 OK and x-litespeed-cache-control: public, so why was the header empty during the earlier debugging?

To figure it out, we can mimic the exact options that the crawler's PHP curl request used, and see what's going on.

To log the curl options the crawler used, add another debug line to litespeed-cache/lib/litespeed/litespeed-crawler.class.php at line 627, directly before return $options ;, like so:

    $options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ;

    // New debug line: dump the full curl options array to the debug log
    LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ;
    return $options ;

Now, manually crawl it again to get all the options:

    07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{
    "19913":true,
    "42":true,
    "10036":"GET",
    "52":false,
    "10102":"gzip",
    "78":10,
    "13":10,
    "81":0,
    "64":false,
    "44":false,
    "10023":["Cache-Control: max-age=0","Host: example.com"],
    "84":2,
    "10018":"lscache_runner ",
    "10016":"http://example.com/wp-cron.php?doing_wp_cron=1234567890.12345678910111213141516","10022":"litespeed_hash=qwert"
    }'

The numbers you see are the integer values of PHP's CURLOPT_* constants, which come from libcurl. Two of them are particularly interesting: 78 is CURLOPT_CONNECTTIMEOUT and 13 is CURLOPT_TIMEOUT, which represent the curl connection timeout and the overall curl timeout. Both are set to 10 seconds here.
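
You can confirm the mapping yourself by printing the constants from PHP (a quick sanity check; the values are defined by libcurl):

    <?php
    // The numeric keys in the JSON dump above are libcurl option numbers.
    var_dump( CURLOPT_CONNECTTIMEOUT ) ; // int(78)
    var_dump( CURLOPT_TIMEOUT ) ;        // int(13)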

Let's apply these options to our curl command.

    [root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1
    curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received

So this confirms a timeout is the root cause of the problem. Without cache, the page takes more than ten seconds to load.

Let's do one more test to confirm it:

    [root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
    Establish Connection: 0.006s
    TTFB: 16.455s
    Total: 16.462s

So yes: the page takes more than 16 seconds to load without cache, which results in a curl timeout. That is why the debug log shows an empty header: the 200 status is never received by the crawler, and the URI is blacklisted.

Solution

We need to increase the timeout. As of LSCWP v3.0 there is a setting for this. Navigate to LiteSpeed Cache > Crawler > General Settings and set the timeout to something greater than 10 seconds (the v3.0 default is 30).
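
If you are running a version older than v3.0, there is no setting for this, but as a temporary workaround you can raise the hard-coded values yourself (a sketch only; it assumes the curl options array is assembled near line 627 of litespeed-crawler.class.php, as we saw above, and any manual edit will be lost on plugin update):

    // Workaround sketch for pre-3.0 versions: raise the two timeouts we
    // identified in the JSON dump (keys 78 and 13, both 10 seconds).
    $options[ CURLOPT_CONNECTTIMEOUT ] = 30 ;
    $options[ CURLOPT_TIMEOUT ] = 30 ;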

Sitemap File Not Generated

If you generate a sitemap but notice that the crawler Size is still 1 or 2, you can try the following steps to debug the issue.

[Screenshot: the crawler status table, showing an unexpectedly small Size value]

  1. Verify that the sitemap URL works (see the example commands after this list).
  2. Verify that the URLs within the sitemap are publicly accessible. If they are private, that can cause a sitemap generation issue.
  3. Verify that /wp-content/plugins/litespeed-cache/var/crawlermap*.data exists. If it doesn't, there may be a permission issue with one of the directories in the path.
  4. Verify that the home_url listed in the Environment Report is identical to the scheme and domain of the URLs in Step 2. Keep in mind that http://example.com/test.html and https://example.com/test.html are different! You can fix a mismatch by changing WP Dashboard > Settings > General > Site Address (URL) to match the value in the sitemap.
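
For Steps 1 and 3, a quick command-line check might look like this (the domain, sitemap URL, and paths are examples; adjust them to your install):

    [root@test ~]# curl -I https://example.com/sitemap.xml
    [root@test ~]# ls -l /path/to/wordpress/wp-content/plugins/litespeed-cache/var/crawlermap*.data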

Last update: April 17, 2020