Need help: 60k "pages" in Google Search Console that are getting crawled but not indexed!

That’s exactly what I’ve always suspected … but I just don’t have the technical understanding for it.

One possible culprit…

Try publishing a page containing Poster or Alloy with pretty urls turned on but no suitable entry in the htaccess file.

You get a page that looks very similar to our garbage pages. Perhaps such pages also dynamically produce the garbage URLs that Google picks up on. Then, as you say, from here it’s just a nasty, ever-growing loop that no longer needs that initial page.

I think the solution lies not in discovering the cause but in the htaccess file: to somehow make these garbage pages return a 404 or 410 error.

Yes. That also came to my mind. With some clever regex this should be possible.
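
For anyone wanting to try this, here is a rough sketch of what such a rule could look like. It is not a tested fix from this thread. It assumes the garbage URLs nest extra path segments below the blog folder (something like /blogpage/a/b/c/some-slug), that the blog folder is called blogpage, and that the htaccess file sits in the site root. Adjust all of that to your own setup.

```apache
RewriteEngine On

# Assumption: a genuine post URL has exactly one segment after blogpage/,
# and listing URLs start with author/, category/, date/, search/ or tag/.
# Anything deeper that is not a real file or folder is answered with a 404.
# This has to sit ABOVE the pretty-URL rules, otherwise they match first.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^blogpage/(?!(?:author|category|date|search|tag)/)[^/]+/.+ - [NC,R=404,L]
```

Swapping `R=404` for the `G` flag would send a 410 (gone) instead, if you are sure those URLs will never be valid.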

There is another nasty side-effect of this issue that I think it’s worth adding here.

When an Alloy page is served it will generate some PHP Warnings. I believe Adam was aware of this and had a solution planned, but it never got released.

If your Alloy page is getting a small amount of traffic this isn’t an issue. But if an Alloy page is part of the loop talked about above, then your logs can grow fast: 500MB a day on one of my other sites.

Poster 2 did the same, but as soon as I told Jannis about it he released a fix. But no such fix will be released for Alloy.

I’m not putting this here for a solution (with Adam out of the loop now there is no solution per se), but for people reading this with an Alloy error log issue, to help them understand why it’s happening.


Google’s crawlers also count toward this traffic, so they create error log entries on top of those from normal visitors.


That’s my point. If you’re only getting real traffic the logs are small. But if you’re getting crawler traffic repeatedly hitting garbage Alloy pages, you have a problem.

This is what ultimately brought me to the problem we’re talking about; one small site with an Alloy blog was producing 500MB of error logs a day, even though the stats were showing a tiny amount of “real” traffic (Matomo was set to not record crawler traffic).

Wow. Quite the thread!

@Steveb post your entire htaccess file


Interesting to see this. I was seeing bizarre URLs dynamically generated and Google attempting to index them. The problem was very small scale, a few dozen per Google report, and it has disappeared recently. I have no idea why.

@joeworkman

That’s the htaccess I have in my docs:

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule blogpage/author/(.*) /blogpage/index.php?author=$1 [L,NC,QSA]
RewriteRule blogpage/category/(.*) /blogpage/index.php?category=$1 [L,NC,QSA]
RewriteRule blogpage/date/(.*) /blogpage/index.php?date=$1 [L,NC,QSA]
RewriteRule blogpage/search/(.*) /blogpage/index.php?search=$1 [L,NC,QSA]
RewriteRule blogpage/tag/(.*) /blogpage/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule blogpage/(.*) /blogpage/index.php?post=$1 [L,NC,QSA]

The last rule sends any other request below blogpage/ to index.php.
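
One detail worth knowing about the snippet above: a RewriteCond only applies to the single RewriteRule that directly follows it. So in the first block only the author rule is protected by the "not a file, not a folder" checks; the category, date, search and tag rules run unconditionally. A common way around this is a guard rule at the top. The following is a sketch, assuming the htaccess file is in the site root and the patterns are kept exactly as posted:

```apache
RewriteEngine On

# Guard: if the request points at a real file or folder, stop rewriting.
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^ - [L]

# From here on, only requests for things that do not exist on disk arrive,
# so the individual rules no longer need their own RewriteCond lines.
RewriteRule blogpage/author/(.*) /blogpage/index.php?author=$1 [L,NC,QSA]
RewriteRule blogpage/category/(.*) /blogpage/index.php?category=$1 [L,NC,QSA]
RewriteRule blogpage/(.*) /blogpage/index.php?post=$1 [L,NC,QSA]
```

Keep in mind the guard also stops every later rewrite rule in the same file for existing files and folders, so things like an https redirect should stay above it.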

Here is the htaccess doc I had in place when using pretty urls on the La Novia site…

# redirects
RewriteOptions inherit
RewriteEngine on
RewriteCond %{HTTP_HOST} ^lanovia.co.uk [NC]
RewriteRule ^(.*)$ https://www.lanovia.co.uk/$1 [L,R=301,NC]

RewriteCond %{SERVER_PORT} 80 
RewriteRule ^(.*)$ https://www.lanovia.co.uk/$1 [R,L]

RewriteEngine on
ErrorDocument 404 https://www.lanovia.co.uk/

# sitemap rewrite
RewriteEngine On
RewriteRule ^sitemap\.xml$ /sitemap/ [R=301,L]

# To tell bots the folder archive no longer exists with a 410 code
RewriteEngine On
RewriteRule ^repo/archive/ - [G,L]



# Pretty Urls journal

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule journal/author/(.*) /journal/index.php?author=$1 [L,NC,QSA]
RewriteRule journal/category/(.*) /journal/index.php?category=$1 [L,NC,QSA]
RewriteRule journal/date/(.*) /journal/index.php?date=$1 [L,NC,QSA]
RewriteRule journal/search/(.*) /journal/index.php?search=$1 [L,NC,QSA]
RewriteRule journal/tag/(.*) /journal/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule journal/(.*) /journal/index.php?post=$1 [L,NC,QSA]


# Pretty Urls real-brides

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule real-brides/author/(.*) /real-brides/index.php?author=$1 [L,NC,QSA]
RewriteRule real-brides/category/(.*) /real-brides/index.php?category=$1 [L,NC,QSA]
RewriteRule real-brides/date/(.*) /real-brides/index.php?date=$1 [L,NC,QSA]
RewriteRule real-brides/search/(.*) /real-brides/index.php?search=$1 [L,NC,QSA]
RewriteRule real-brides/tag/(.*) /real-brides/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule real-brides/(.*) /real-brides/index.php?post=$1 [L,NC,QSA]


# Pretty Urls lilianadabic

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule lilianadabic/author/(.*) /lilianadabic/index.php?author=$1 [L,NC,QSA]
RewriteRule lilianadabic/category/(.*) /lilianadabic/index.php?category=$1 [L,NC,QSA]
RewriteRule lilianadabic/date/(.*) /lilianadabic/index.php?date=$1 [L,NC,QSA]
RewriteRule lilianadabic/search/(.*) /lilianadabic/index.php?search=$1 [L,NC,QSA]
RewriteRule lilianadabic/tag/(.*) /lilianadabic/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule lilianadabic/(.*) /lilianadabic/index.php?post=$1 [L,NC,QSA]

# Pretty Urls Oscar Lili

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule oscarlili/author/(.*) /oscarlili/index.php?author=$1 [L,NC,QSA]
RewriteRule oscarlili/category/(.*) /oscarlili/index.php?category=$1 [L,NC,QSA]
RewriteRule oscarlili/date/(.*) /oscarlili/index.php?date=$1 [L,NC,QSA]
RewriteRule oscarlili/search/(.*) /oscarlili/index.php?search=$1 [L,NC,QSA]
RewriteRule oscarlili/tag/(.*) /oscarlili/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule oscarlili/(.*) /oscarlili/index.php?post=$1 [L,NC,QSA]


# Pretty Urls casablanca

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule casablanca/author/(.*) /casablanca/index.php?author=$1 [L,NC,QSA]
RewriteRule casablanca/category/(.*) /casablanca/index.php?category=$1 [L,NC,QSA]
RewriteRule casablanca/date/(.*) /casablanca/index.php?date=$1 [L,NC,QSA]
RewriteRule casablanca/search/(.*) /casablanca/index.php?search=$1 [L,NC,QSA]
RewriteRule casablanca/tag/(.*) /casablanca/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule casablanca/(.*) /casablanca/index.php?post=$1 [L,NC,QSA]

# Pretty Urls Beloved

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule beloved/author/(.*) /beloved/index.php?author=$1 [L,NC,QSA]
RewriteRule beloved/category/(.*) /beloved/index.php?category=$1 [L,NC,QSA]
RewriteRule beloved/date/(.*) /beloved/index.php?date=$1 [L,NC,QSA]
RewriteRule beloved/search/(.*) /beloved/index.php?search=$1 [L,NC,QSA]
RewriteRule beloved/tag/(.*) /beloved/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule beloved/(.*) /beloved/index.php?post=$1 [L,NC,QSA]

# Pretty Urls elysee

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule elysee/author/(.*) /elysee/index.php?author=$1 [L,NC,QSA]
RewriteRule elysee/category/(.*) /elysee/index.php?category=$1 [L,NC,QSA]
RewriteRule elysee/date/(.*) /elysee/index.php?date=$1 [L,NC,QSA]
RewriteRule elysee/search/(.*) /elysee/index.php?search=$1 [L,NC,QSA]
RewriteRule elysee/tag/(.*) /elysee/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule elysee/(.*) /elysee/index.php?post=$1 [L,NC,QSA]

# Pretty Urls eliis bridals

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ellis-bridals/author/(.*) /ellis-bridals/index.php?author=$1 [L,NC,QSA]
RewriteRule ellis-bridals/category/(.*) /ellis-bridals/index.php?category=$1 [L,NC,QSA]
RewriteRule ellis-bridals/date/(.*) /ellis-bridals/index.php?date=$1 [L,NC,QSA]
RewriteRule ellis-bridals/search/(.*) /ellis-bridals/index.php?search=$1 [L,NC,QSA]
RewriteRule ellis-bridals/tag/(.*) /ellis-bridals/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ellis-bridals/(.*) /ellis-bridals/index.php?post=$1 [L,NC,QSA]

# Pretty Urls samples

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule samplesale/author/(.*) /samplesale/index.php?author=$1 [L,NC,QSA]
RewriteRule samplesale/category/(.*) /samplesale/index.php?category=$1 [L,NC,QSA]
RewriteRule samplesale/date/(.*) /samplesale/index.php?date=$1 [L,NC,QSA]
RewriteRule samplesale/search/(.*) /samplesale/index.php?search=$1 [L,NC,QSA]
RewriteRule samplesale/tag/(.*) /samplesale/index.php?tag=$1 [L,NC,QSA]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule samplesale/(.*) /samplesale/index.php?post=$1 [L,NC,QSA]

There’s your problem… Your htaccess rules allow for WAY too much to be caught! The .* captures anything, forever. You need to limit the regular expressions so that they only capture what you need.

If you need help with that, I can help when I get back to the office this afternoon.


Thanks!

I think I’m on it 💻 🤓

@Steveb @Jannis or @joeworkman

I would be pleased if you could let me know what the final solution looks like,
specifically what to do to avoid the problem of messy URLs in the future.

Thank you all for your time and the intensive troubleshooting.


I hope @Steveb will test my proposed solution. I will keep you updated.


I’m pretty good with .htaccess rules. But, I gotta admit…this got way over my head! I’m really glad Jannis jumped in and has a solution that will hopefully get Steve fixed-up. Also, glad to see Joe offered his unwavering support!


We already have a solution of sorts, but I need to test it on different web servers. Stay tuned.


This works for me:

RewriteEngine On
# Rewrite blogpage URLs
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule blogpage/author/([^/]+)/?$ /blogpage/index.php?author=$1 [L,NC,QSA]
RewriteRule blogpage/category/([^/]+)/?$ /blogpage/index.php?category=$1 [L,NC,QSA]
RewriteRule blogpage/date/([^/]+)/?$ /blogpage/index.php?date=$1 [L,NC,QSA]
RewriteRule blogpage/search/([^/]+)/?$ /blogpage/index.php?search=$1 [L,NC,QSA]
RewriteRule blogpage/tag/([^/]+)/?$ /blogpage/index.php?tag=$1 [L,NC,QSA]

# Catch-all rule for individual blog posts
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule blogpage/([^/]+)/?$ /blogpage/index.php?post=$1 [L,NC,QSA]
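
A possible further tightening, offered as a sketch rather than as the tested fix: the patterns above end with `$` but have no `^` at the start, so they can still match in the middle of a longer path such as a/b/blogpage/tag/x. Anchoring them closes that gap. This assumes the htaccess file lives in the site root (so the path seen by the rule starts with blogpage/) and that the query parameter names are the lowercase ones used above.

```apache
RewriteEngine On

# ^ forces the match to begin at "blogpage/". The five listing rules are
# folded into one; NC is dropped here on purpose so that $1 is always the
# lowercase parameter name that index.php is assumed to expect.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^blogpage/(author|category|date|search|tag)/([^/]+)/?$ /blogpage/index.php?$1=$2 [L,QSA]

# Individual posts: exactly one path segment after blogpage/.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^blogpage/([^/]+)/?$ /blogpage/index.php?post=$1 [L,NC,QSA]
```

The two RewriteCond lines are repeated because each pair only covers the one rule directly below it. Dropping NC on the first rule means a URL like /blogpage/Tag/foo no longer matches, which may or may not be what you want.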

OK, ChatGPT helped me understand that the new htaccess rule is more specific and only matches structurally valid ‘pretty URLs’. The old rule accepted any kind of URL starting with the blogpage folder and passed everything after it on as the slug. More specific is certainly good.

However, in order to understand how that could have created the runaway process of pseudo-existing webpages, I think we would need to understand how, e.g., Poster 2 handles the slug request. What happens when a post is requested that doesn’t exist?

With the new htaccess rule one can still create something like

https://example.com/blogpage/index.php?post=non-exisiting-slug

With the old htaccess rule one could end up with something like

https://example.com/blogpage/index.php?post=a/b/c/some-slug

Is Poster 2 handling the slugs differently now? Does a call to a non-existing post return a 404 error? Did the previously possible slug (with subfolders) create a false blog page?

It’s more a question of why such a post is requested.

With the latest 2.8.4 update: yes, but only at the application (PHP) level, not at the web server (Apache/nginx) level. Therefore 404 redirects will not be caught by the web server.

It’s not Poster or Alloy creating the pages. The issue is that what we’ve called “garbage” pages can be created and landed on. I think these pages then produce more garbage pages, and so the loop begins. In time crawler bots land on them and attempt to index them. They figure out the pages are garbage, so they don’t index them, but they keep coming back to check whether the pages are still garbage, and so generate more garbage pages!

What created the first garbage page? My best guess is Poster or Alloy (or others) being published with pretty URLs turned on, but no associated entry in the htaccess file. This only has to happen for a split second; if a crawler bot is at that moment crawling your Poster/Alloy page… That’s it. The loop has begun.

The new htaccess entry (thanks to @joeworkman for the heads up on that one) fixes part of the issue, then it’s just down to a 404 error entry in the htaccess file to tell bots that page doesn’t exist.

Of course, a 410 error code is better, but then you have to differentiate between pages that are really gone forever, and those which are simply not working at the moment the bot lands. So personally, I’ve opted for a 404.
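
On the 404 entry itself, a minimal sketch, assuming static error pages named /404.html and /410.html (hypothetical file names, not something from this thread). As far as I understand the Apache documentation, pointing ErrorDocument at a full URL, like the `ErrorDocument 404 https://www.lanovia.co.uk/` line in the file posted earlier, makes Apache answer with a redirect rather than the 404 status, so bots never see the error code. A local path keeps the status intact:

```apache
# Local path: the visitor gets the custom page and the status stays 404.
# /404.html is a placeholder and has to exist in the site root.
ErrorDocument 404 /404.html

# Optional: a separate page for URLs deliberately marked as gone (410),
# for example by a RewriteRule using the [G] flag. Also a placeholder name.
ErrorDocument 410 /410.html
```

Checking the response headers of a known garbage URL afterwards (browser dev tools or any header checker) shows whether the bot really receives a 404 and not a 301/302 to the home page.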

For me, it’s all working. When I first picked up on this, Google had 61k crawled but not indexed pages in the search console. Now it’s down to 56k and falling. And the load on the server has gone from about 67MB served per minute for one single domain (almost all garbage pages) to about 500KB.

Plus, while working on this, Jannis has managed to tighten up error handling within Poster 2, which is great. I wish Adam would do the same for Alloy, as it’s still generating a lot of PHP warnings, but of course, he’s left, so that’s not going to happen. It’s a shame he can’t find a little bit of time to fire out an update of Alloy, if for no other reason than to support his old customers who supported his business. But that’s life!
