Need help: 60k "pages" in Google Console that are not indexed but getting crawled!

I’ve now started looking at another site with the same issues, albeit on a much smaller scale. This site uses Alloy, not Poster, but the issue is identical.

When pretty urls are used, random urls that include the path of a page using pretty urls will return an actual page, although it’s an unformatted page with random content. Nonetheless, crawler bots try to index it, and constantly crawl it.

So, this is a much wider issue than Poster2. It, for sure, happens with Alloy, and it might be happening with other uses of pretty urls too. Perhaps Poster2 and Alloy do the whole pretty url thing in the same way?

I don’t know what the solution is. If anyone knows an expert on htaccess then perhaps we can start talking with them to see if the solution is a different way to do pretty urls. But for now, my solution is to turn off pretty urls.

I’m gonna tag @dave in this, as he seems to know his way around an htaccess file. Maybe he has some ideas.

Can you show the generated source code and wrong generated example URLs (for content generated by Poster)?

I doubt it.

One such URL is as follows…

https://www.lanovia.co.uk/oscarlili/lilianadabic/wedding-dresses/elysee/files/files/files/real-brides/

I’ve now turned off pretty urls for all pages in the la novia site, so if you click that link you will be sent to the homepage, with a 404 error being recorded. If you’d like to see what the page would look like with pretty urls on, tell me, and I’ll add them back for a page.

I’ll need to see the source where this code is included inside the generated HTML.

I think what you are saying, is you need to see the code for the page where this url is generated. Correct?

If so, I don’t know where that link is generated. All I know is that Google has it as a link, and so was trying to crawl it.

I have now added pretty urls back for the page: Elysee collection wedding dresses at La Novia Edinburgh and sure enough, that link in the last post, which shouldn’t land on a page, now does. Albeit a page that is garbage.

https://www.lanovia.co.uk/oscarlili/lilianadabic/wedding-dresses/elysee/files/files/files/real-brides/

This is the problem: Pretty urls are making random urls work, as far as a search bot is concerned.

I do understand your point: Where did that url come from in the first place? I don’t have an answer for that.

Perhaps the “garbage” pages somehow are creating urls that search bots can read?

Yes. Garbage pages might create wrong URLs, which I will prevent in future.

Still we need to find the origin of this first garbage link.

Btw, that first link doesn’t need to have been wrongly generated; it could have been set up by yourself by accident (no criticism).

In the garbage page above: https://www.lanovia.co.uk/oscarlili/lilianadabic/wedding-dresses/elysee/files/files/files/real-brides/ The following stands out to me…

<script type="text/javascript">
document.addEventListener("DOMContentLoaded", function(event) { 
    stacks.com_instacks_poster2_main.replaceHtml(".poster-archive-categories", ' <a href="./category/all"> All</a> <a href="./category/avignon">Avignon</a> <a href="./category/bernadotte">Bernadotte</a> <a href="./category/trianon">Trianon</a>');
    stacks.com_instacks_poster2_main.replaceHtml(".poster-archive-tags", '');
    stacks.com_instacks_poster2_main.replaceHtml(".poster-archive-date-year", ' <a href="./date/2020">2020</a>');
    stacks.com_instacks_poster2_main.replaceHtml(".poster-archive-date-month", ' <a href="./date/2020-01">2020-01</a>');
    stacks.com_instacks_poster2_main.replaceHtml(".poster-archive-authors", '');

	const url = window.location.href.match(/[^\/]+(?=\/$|$)/);
	const categorylinks = document.querySelectorAll(".poster-archive-categories a");
	[...categorylinks].forEach(function(link){
		if(link.href.match(/[^\/]+(?=\/$|$)/)[0] == url[0]){link.classList.add('active')}
	});
	const taglinks = document.querySelectorAll(".poster-archive-tags a");
	[...taglinks].forEach(function(link){
		if(link.href.match(/[^\/]+(?=\/$|$)/)[0] == url[0]){link.classList.add('active')}
	});
	const yearlinks = document.querySelectorAll(".poster-archive-date-year a");
	[...yearlinks].forEach(function(link){
		if(link.href.match(/[^\/]+(?=\/$|$)/)[0] == url[0]){link.classList.add('active')}
	});
	const monthlinks = document.querySelectorAll(".poster-archive-date-month a");
	[...monthlinks].forEach(function(link){
		if(link.href.match(/[^\/]+(?=\/$|$)/)[0] == url[0]){link.classList.add('active')}
	});
	const authorlinks = document.querySelectorAll(".poster-archive-authors a");
	[...authorlinks].forEach(function(link){
		if(link.href.match(/[^\/]+(?=\/$|$)/)[0] == url[0]){link.classList.add('active')}
	});
});
</script>

There seems to be a lot going on there when it comes to urls.
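
To make that script a bit easier to read, here is a minimal sketch of the two URL-related things it involves. The regex is copied from the quoted script; the helper names `lastSegment` and `resolveHref` are made up for illustration, and the resolution step assumes the page sets no `<base>` element. It suggests the regex only picks out the last path segment (to mark a link as active), while the relative hrefs such as `./category/all` resolve against whatever URL the page was served at, including a garbage one.

```javascript
// Pattern copied from the quoted script: the last non-empty path segment of a URL.
const LAST_SEGMENT = /[^\/]+(?=\/$|$)/;

// Hypothetical helper: returns the last path segment, or null if there is none.
function lastSegment(href) {
  const match = href.match(LAST_SEGMENT);
  return match ? match[0] : null;
}

// Hypothetical helper: resolve a relative href the way a browser or crawler
// would, using the standard URL API (no <base> element assumed).
function resolveHref(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

const garbageUrl =
  "https://www.lanovia.co.uk/oscarlili/lilianadabic/wedding-dresses/elysee/files/files/files/real-brides/";

// One of the relative links injected by the quoted script.
const categoryLink = resolveHref("./category/all", garbageUrl);

console.log(lastSegment(garbageUrl));   // last segment of the garbage URL
console.log(categoryLink);              // garbage URL with "category/all" appended
console.log(lastSegment(categoryLink)); // last segment of the resolved link
```

So the script itself does not seem to invent new paths, but any relative link on a garbage page becomes a new, longer garbage URL once a crawler resolves it.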

Plus…

When you look at the garbage url, that Google and other search bots are trying to crawl…

https://www.lanovia.co.uk/oscarlili/lilianadabic/wedding-dresses/elysee/files/files/files/real-brides/

It’s made up of lots of page folder names, all mashed together in one url.

For instance /oscarlili/ is a page, /lilianadabic/ is a page, /wedding-dresses/ is a page. And so on.

It “feels” like something is telling search bots to append page folder names to the end of working page urls. Perhaps this is how Google is getting the garbage urls?

For sure, there is no page anywhere on the site that actually has the url https://www.lanovia.co.uk/oscarlili/lilianadabic/wedding-dresses/elysee/files/files/files/real-brides/ on it. So these garbage urls haven’t come about from the page content. Somehow (I think) they are being created dynamically, and search bots are reading them.

Nah, there are just too many of them. There were 60k pages with such urls. It’s humanly impossible to make that many url errors. Even for me ;-)

They are somehow getting produced dynamically. Then, thanks to pretty urls, they were not being returned as a 404 error. This is the only explanation I can think of.

Something of further interest…

Since putting pretty urls back on for the page https://www.lanovia.co.uk/elysee/ the following php warnings have started to appear in the error log…

[16-Mar-2025 14:07:54 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:54 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:54 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:54 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:56 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:56 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:57 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:58 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:58 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:58 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:07:59 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:08:01 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:08:02 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272
[16-Mar-2025 14:08:02 UTC] PHP Warning:  Cannot modify header information - headers already sent by (output started at /home/caffeine/site-lanovia/elysee/index.php:1335) in /home/caffeine/site-lanovia/elysee/index.php on line 4272

I think they are being generated when the page is hit.

The lines around line 4247 in the index.php file on that page look like this…

$mustache_data = [];
if ($display_detail) {
    foreach ($poster_items as $key => $poster_item) {
        $last_page = floor( $item / $items_per_page ) + 1;
        $item++;
        if ($post == $poster_item->slug) {
            if ($key > 0) {
                $prev_slug = $poster_items[$key - 1]->slug;
                $prev_title = $poster_items[$key - 1]->title;
            }
            if ($key < count($poster_items) && isset($poster_items[$key + 1])) {
                $next_slug = $poster_items[$key + 1]->slug;
                $next_title = $poster_items[$key + 1]->title;
            }
            $poster_detail_item = $poster_item;

            if ($last_page == 1) {
                $poster_detail_item->back['link'] = './' . checkQueryParameters($query);
            } else {
                $poster_detail_item->back['link'] = './?page=' . $last_page . $query;
            }
            $mustache_data['post'] = $poster_detail_item;
            break;
        }
    }

Line 4247 looks like this…

foreach ($poster_items as $key => $poster_item) {

I can confirm that these URLs are created dynamically …

All these urls in the screenshot have never been created and make no sense in terms of structure.

So, same for you then!

What are you using on some of those pages: Poster2, Alloy, something else?

I am going to have to remove pretty urls for this page: https://www.lanovia.co.uk/elysee/ as having them on is causing a spike on the server again. If anyone needs them added back let me know.

His blog uses Alloy as far as I can see.

Would you be so kind as to generate the whole website content with pretty URLs activated, and send me the content, including the htaccess files, as a zip file? Then I can test everything locally on my MAMP server.

So, do you want me to set it all up on my server, pretty urls and all, then zip up the server contents and share it with you?

Or, do you want it some other way?

EDIT: Actually, I have a backup of the entire site as it was with pretty urls and all, so if that’s what you want a zip of, I can do that now.

If possible yes.

I’ll send you a download link of the backup via a message in a moment.

So far we know that the problem occurs with Alloy and Poster2.

It would be interesting to know if there is anyone who has the problem with tcms.

I use both Alloy and Poster2.
But I have only noticed the messy-URL problem in my “Alloy” projects. My little project with Poster2 + pretty urls looks good so far.

I’m not convinced that it’s a framework problem and therefore it is an issue that needs wider addressing.

So I used Google Gemini to explain why this may happen. Interesting results with links to deeper info.

Rather than spew out the whole generated response here, try for yourselves:

Prompt: Please explain why pretty urls or friendly urls result in inaccurate SEO results.

You can tweak the prompt ad nauseam to taste.

Not necessarily. It’s more the way the htaccess rules are set up, and that relative links are included in the HTML (generated from RW, Alloy/Poster, or manual links), combined with the PHP renderer.

These wrong links in Google Search Console can have their origin in one wrong (relative) link, either generated from RW, Alloy/Poster, or by a wrong manual link. This one wrong link will lead to a “garbage URL”, which in turn will generate many more “garbage URLs” in a round-robin, loop-like fashion. This explains the enormous amount of incorrect URLs.
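
To illustrate how quickly that loop can grow, here is a rough simulation rather than a reproduction of the real site. It assumes that every URL under a pretty-URL page returns a page (as observed in this thread), and that each returned page contains the same few relative links. The three link names in `relativeLinks` are hypothetical, borrowed from folder names seen in the garbage URL, and `crawl` is a made-up helper.

```javascript
// Hypothetical relative links assumed to appear on every page the server returns.
const relativeLinks = ["files/", "real-brides/", "wedding-dresses/"];

// Follow the relative links breadth-first for maxDepth rounds, assuming every
// resolved URL answers with a page that again contains the same relative links.
function crawl(startUrl, maxDepth) {
  const seen = new Set([startUrl]);
  let frontier = [startUrl];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next = [];
    for (const pageUrl of frontier) {
      for (const rel of relativeLinks) {
        // Standard relative URL resolution, as a crawler would do it.
        const resolved = new URL(rel, pageUrl).href;
        if (!seen.has(resolved)) {
          seen.add(resolved);
          next.push(resolved);
        }
      }
    }
    frontier = next;
  }
  return seen;
}

const found = crawl("https://www.lanovia.co.uk/elysee/", 3);
console.log(found.size); // 1 + 3 + 9 + 27 = 40 distinct URLs after three rounds
console.log(found.has("https://www.lanovia.co.uk/elysee/files/files/files/")); // true
```

With only three relative links the number of URLs triples with every round, so under these assumptions a crawler that keeps following them would pass 60k URLs after about ten rounds. If the server returned a 404 for the unknown paths, the loop would stop at the first round.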

We have to find out which wrong links are the first ones, leading to the other incorrect ones.
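
One possible way to narrow that search, sketched under assumptions: export the crawled URLs from Search Console and look at the shortest garbage URLs first, since in a loop like the one described above the shorter ones are likely to be closer to the origin. The helper names and the sample list are hypothetical; only the long garbage URL is taken from this thread. Repeated neighbouring segments such as `files/files` are used here as a simple heuristic for "garbage", not as a definitive test.

```javascript
// Hypothetical helper: split a URL's path into its non-empty segments.
function pathSegments(url) {
  return new URL(url).pathname.split("/").filter(Boolean);
}

// Heuristic: a segment directly repeated (e.g. "files/files") suggests a garbage URL.
function hasRepeatedSegment(url) {
  const segs = pathSegments(url);
  return segs.some((seg, i) => i > 0 && seg === segs[i - 1]);
}

// Sort a copy of the list so the URLs with the fewest path segments come first.
function shortestFirst(urls) {
  return [...urls].sort((a, b) => pathSegments(a).length - pathSegments(b).length);
}

// Sample input: the first URL is from this thread, the other two are made up.
const crawledUrls = [
  "https://www.lanovia.co.uk/oscarlili/lilianadabic/wedding-dresses/elysee/files/files/files/real-brides/",
  "https://www.lanovia.co.uk/elysee/files/files/",
  "https://www.lanovia.co.uk/elysee/",
];

const suspects = shortestFirst(crawledUrls.filter(hasRepeatedSegment));
console.log(suspects); // shortest suspicious URLs first
```

The pages that link to the shortest suspects would then be the first places to check for a wrong relative link.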
