Need help: 60k "pages" in Google Console that are not indexed but getting crawled!

This one has had me and the IT guys who maintain my server going around in circles, and we’re stumped.

One of my client sites is added to my Google Console. In the pages tab there are 400 pages indexed (about right), but 60k that are not. The screenshot below gives the reasons…

These pages do not exist, per se: If I click one of the links it does land on a page, but it’s only the header and footer sections of a page, without any CSS styling.

Now this all sounds kinda OK: 60k “not real” pages that are not getting indexed… Fine. But they are all getting crawled by Google bots! To the tune of about 67MB per minute! Currently sucking up about 30% of my server’s resources.

I’m kinda lost as to where to go with this! If there is someone who can fix this I’m happy to pay them. Or, if there is even anyone who can begin to tell me how to fix it, I’m happy to do that too.

But I need help!

Hey,

Was the website created with Foundry 3 or Alloy?

Can you show which URLs are under “Crawled - currently not indexed”? (The path of the URL should be fine!)

I also had this problem, with over 100,000 URLs, although my project only has just over 1,300 pages.
The problem was caused by “alloy” + pretty URLs. I reported the problem to Adam at the time, but unfortunately he didn’t really look into it…

Last year I had almost 100,000 pages in May and June; in December I still had a good 30,000, and now I’m almost back to the normal number of pages.


Typically a page is left unindexed because its content does not have a clearly defined value to a reader. Is the content available to website visitors? Are there links leading in and out of the pages? Do the links create a relevant structure that brings more meaning to the overall value of the website?


Thanks for the replies. Answers, as best I can, below…

Was the website created with Foundry 3 or Alloy?

No. UiKit.

The problem was caused by “alloy” + pretty URLs.

However, the site in question does use a lot of pretty URLs: many pages use Poster 2 from @Jannis, and each instance of Poster 2 uses pretty URLs, with the appropriate entry in the htaccess file. So perhaps this has something to do with pretty URLs?

Last year I had almost 100,000 pages in May and June; in December I still had a good 30,000, and now I’m almost back to the normal number of pages.

How did you reduce the numbers?

Is the content available to website visitors?

Yes. If I copy and paste the URL into the browser, a page is rendered. Example: https://www.lanovia.co.uk/lilianadabic/wedding-dresses/accessories/ellis-bridals/faq/category/adele

The URL appears to be made up of page names all strung together. For instance, from the URL above, the following are actual pages, which appear in the sitemap etc.

https://www.lanovia.co.uk/lilianadabic/
https://www.lanovia.co.uk/wedding-dresses/
https://www.lanovia.co.uk/accessories/
https://www.lanovia.co.uk/faq/

And so on.

I am somewhat confused as to why that URL returns anything other than a 404 error.

Do the links create a relevant structure that brings more meaning to the overall value of the website?

No, see the example URL above.

Are there links leading in and out of the pages?

As you can see with the example URL, there are working links from it to other pages on the site.

As for referring links… If I inspect the URL above in Google Console, it says the following page is the referrer: https://www.lanovia.co.uk/lilianadabic/wedding-dresses/accessories/ellis-bridals/faq/

Hopefully that helps. Thanks for the comments, if anyone can help further, I’d appreciate it.

Well, if that’s the case, the wrong URLs Google thinks it has to crawl must still originate from somewhere.

Morning Jannis.

So, the pretty URL comment above got me thinking. And I can say, with a reasonably high level of confidence, that it is indeed pretty URLs that are the issue. I’ll do my best to explain (I might make a video on it later, as that will perhaps explain things better).

At the moment the lanovia.co.uk site has the following pages with Poster 2 and pretty URLs…

https://www.lanovia.co.uk/lilianadabic/
https://www.lanovia.co.uk/casablanca/
https://www.lanovia.co.uk/beloved/
https://www.lanovia.co.uk/elysee/

The following pages do not have Poster 2, and so do not have pretty URLs set in the htaccess file.

https://www.lanovia.co.uk/wedding-dresses/
https://www.lanovia.co.uk/the-boutique/
https://www.lanovia.co.uk/faq/
https://www.lanovia.co.uk/contact/

There are more, but these will do as an example.

If you make up a URL using any page folders that don’t have Poster 2 (and so no pretty URLs), like this…

https://www.lanovia.co.uk/contact/faq/about/contact/

The 404 rewrite in the htaccess for unknown URLs works, and you are sent to the home page. Incidentally, the htaccess entry for this looks like this…

RewriteEngine on
ErrorDocument 404 https://www.lanovia.co.uk/
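Worth noting, as a general Apache behaviour (not specific to this setup): when ErrorDocument is given a full URL like this, Apache answers with a 302 redirect to that URL rather than a true 404 status, so crawlers never actually see a 404 for these URLs. Pointing it at a local path instead serves the error page with a real 404 status. A sketch, assuming a /404.html file exists on the server:

RewriteEngine on
# A local path (no scheme/host) makes Apache keep the 404 status,
# instead of issuing a 302 redirect to the homepage
ErrorDocument 404 /404.html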

But…

If you make up a URL with non-Poster 2 pages (no pretty URL) but include at least one page with Poster 2 (and so with a pretty URL), you will not be returned to the homepage; instead you’ll land on a page, albeit one with garbage content. For example…

https://www.lanovia.co.uk/contact/about/lilianadabic/faq

Note, in that URL the page /lilianadabic/ uses P2.

To prove that the issue is the pretty URL, not Poster 2 itself, I’ve removed pretty URLs from one of the pages with Poster 2, and removed the pretty URL entry from the htaccess. The page is…

https://www.lanovia.co.uk/oscarlili/

So, if we take the above URL with a P2 page with pretty URLs, and switch out the page in question (lilianadabic) for the P2 page without pretty URLs (oscarlili), we don’t get this odd behaviour. Instead, the URL is returned to the homepage, as it should be.

https://www.lanovia.co.uk/contact/about/oscarlili/faq

So, from this, I think we can surmise that the problem here is not Poster 2 itself, but the pretty URL entry.

Make sense?

The big question is… How to fix it!

Anyone any ideas?
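One possible mechanism, purely as an illustration (the rules below are hypothetical, not the actual Poster 2 htaccess entries): if a pretty-URL rewrite pattern is not anchored to the start of the path with ^, it matches any request that merely contains the page folder, so a made-up URL like /contact/about/lilianadabic/faq still gets rewritten to the Poster 2 page, which then renders with an unknown post (header and footer only). An anchored pattern would let such URLs fall through to the normal 404 handling instead:

# Hypothetical unanchored rule: also matches /contact/about/lilianadabic/faq
RewriteRule lilianadabic/(.+)$ lilianadabic/index.php?post=$1 [L,QSA]

# Anchored version: only matches URLs that start with /lilianadabic/
RewriteRule ^lilianadabic/(.+)$ lilianadabic/index.php?post=$1 [L,QSA]

If the real entries look anything like this, adding the ^ anchor might be one way to attack option 2 below.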

(Is anyone still reading this?).

AFAIK, I don’t return a 404 for non-existing posts (or empty page results). I can add an option to return a 404 HTTP header in this case.

Still, Google thinks it must index these non-existing pages. Why?

Worth adding to this…

The site oscarlili.co.uk is the sister site to La Novia. It’s built in much the same way, although it has far fewer Poster 2 pages. In fact, it has only one, the blog page…

https://www.oscarlili.co.uk/journal/

It does the same thing, i.e. made-up URLs that include a Poster 2 page return a page instead of redirecting to the homepage.

I can’t answer that with any level of knowledge. Perhaps there was once an incorrectly written URL added to a page that included a Poster 2 page, so Google discovered one “garbage” page, and as these garbage pages contain links to lots of other garbage pages (as seen by Google), it has spiralled out from there to a total of 60k pages (so far!). I suspect that number is slowly growing too.

As far as I can see there are two solutions…

  1. Remove all pretty URLs.
  2. Do something, either in Poster 2 or (more likely) in the pretty URL htaccess entry, to stop the above happening.

I’ve grown to dislike pretty URLs a lot in recent times. Not because of P2, but because of how they mess things up when it comes to search results in general. So I’m tempted to go for option 1. But, in the case of La Novia, that is going to impact search results, as a link like this…

https://www.lanovia.co.uk/beloved/beloved-araya

Which is a link to one of the wedding dresses they sell, is going to stop working. Unless I put an awful lot of redirects in place!
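If it comes to that, the redirects themselves are straightforward: one rule per old pretty URL, mapping it back to the query-string form Poster 2 uses without pretty URLs. A sketch, with the path and the ?post= parameter name assumed for illustration:

# Map an old pretty URL to its non-pretty equivalent (illustrative)
RewriteRule ^beloved/beloved-araya/?$ /beloved/?post=beloved-araya [R=301,L]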

Edit: I don’t think this issue is unique to P2. I’m pretty sure Alloy suffers the same, as per the comment from @Pegasus above, and based on some weird stuff I’ve seen in error log files for sites with Alloy (a problem for another day!).

Good discussion, even if it’s happening in Alloy too.

I’ll send you a new version where you can turn on a 404 HTTP header for non-existing posts. I hope this helps.

This would be on your side. The htaccess example code I’m providing would have to be edited in that case.


Got it, and updated some pages. But what exactly will this do?

Poster 2 on this page now has the new option ticked.

https://www.lanovia.co.uk/oscarlili/

This is a URL to one of the items in P2 on that page: https://www.lanovia.co.uk/oscarlili/?post=oscar-lili-adele (pretty URLs now removed).

But, if I understand the new setting correctly, if someone lands on a URL for a P2 post on that page that doesn’t exist, a 404 error is returned. Is that correct?

If, for instance the url is Oscar Lili bridal seperates at La Novia Edinburgh it should return a 404 error?

But, I don’t think it is.

Exactly. But on the application/PHP level, which is most probably too late for the server-side htaccess 404 redirect.

But it might help for the Google crawler.


Ah OK, I understand.

This is good, but I don’t think it’s going to solve my issue, which is pretty URLs causing “fake” pages to exist in Google’s eyes.

I think the only solution for me is to turn off pretty URLs, and put redirects in place for the old prettified links to the poster items.

Unless, you can think of another solution?

Google doesn’t invent wrong pages / URLs / links. These must have an origin somewhere. I don’t know where they would come from.


I’ve turned off pretty URLs and removed all entries from the htaccess. Now, all those “fake” URLs return a 404 error.

I will look into whether it’s possible to redo the pretty URLs with htaccess entries that stop this issue. If I find a way, I’ll let you know. If not, I’ll just stop using pretty URLs. As said above, in the last year or so I’ve come to the conclusion that they are not worth the hassle they cause anyway, so I’m not too bothered about losing them. Other than the fact that a lot of old URLs will no longer work!

Would this post on Google Search support help?

https://support.google.com/webmasters/community-guide/288535911/tired-of-seeing-irrelevant-old-non-canonical-and-bogus-urls-in-the-page-indexing-report?hl=en


Thanks, but no. That really just covers filtering the data to only include pages in the sitemap. My issue is not that Google is showing non-indexed pages, but that it’s still crawling them.

The above solution worked though; non-indexed pages are already down from 60k to 58k.

OK, so it’s not an “alloy” problem. I never had these problems with Poster 2 or TCMS…

Then the problem is probably due to an internal “broken link”; at least, that’s how it was for me.

In my early days with RapidWeaver, I always used the “rw-page tool” for internal linking. Later, I only used direct URLs or paths, because I realised quite late that every time I deleted a page in RapidWeaver that was internally linked, it caused problems…

This also leads to the “messy” URL problem…

I solved this problem with the following htaccess entries:

## remove slash if not directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} /$
RewriteRule (.*)/ $1 [R=301,L]

## add .php to access file, but don't redirect
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteCond %{REQUEST_URI} !/$
RewriteRule (.*) $1\.php [L]


## redirect /index.php to /
RewriteCond %{THE_REQUEST} ^.*/index\.php
RewriteRule ^(.*)index\.php$ /$1 [R=301,L]


## remove double/more slashes in url (the rule was missing above; this is
## a common pattern for it - test before relying on it)
RewriteCond %{THE_REQUEST} \s[^?]*//
RewriteRule ^.*$ /$0 [R=301,NE,L]

Disabling pretty URLs is probably the best thing you can do. In terms of Google ranking, there is absolutely no downside to going without them.

The only issue that always frustrates me is the fact that you can’t 301 redirect non-pretty URLs to another page.
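For what it’s worth, at the pure htaccess level a query-string URL can be matched and redirected: RewriteRule itself can’t see the query string, but RewriteCond can inspect %{QUERY_STRING}. A sketch, with made-up paths and slugs:

# 301 redirect /journal/?post=old-slug to a new location
RewriteCond %{QUERY_STRING} ^post=old-slug$
RewriteRule ^journal/?$ /new-location/? [R=301,L]

(The trailing ? on the target drops the original query string from the redirect.)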


And I’ll investigate whether wrongly generated links appear when the “pretty URLs” checkbox is set.

Please let me know if the new option of sending a 404 when a non-existent post is called has any negative side effects. I would like to make this a standard, enabled-by-default option.

All I could do is add another option that sends not a 404 but a 301 redirect to a given URL.