Your understanding is mostly correct—a 410 status tells Google the pages are permanently gone, and over time they should drop the URLs from the index. However, to ensure the process works smoothly, consider these points:
• Ensure All URLs Return 410: Make sure every URL in the /archive/ folder consistently returns a 410 status code (a per-URL example follows this list).
• Avoid Blocking via Robots.txt: Don’t disallow the folder in robots.txt; if Google can’t crawl the pages, it won’t see the 410 status and may keep the URLs in its index.
• Use Google Search Console: If you need faster removal, use the URL removal tool in Google Search Console to expedite the de-indexing process.
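For reference, if only a handful of pages were involved, a per-URL 410 via mod_alias in the root .htaccess would look something like this (the filename here is just a made-up placeholder):
# Hypothetical example: return 410 Gone for one specific retired page
Redirect 410 /archive/old-page.html
Since 410 isn’t a redirect status, mod_alias takes no destination URL; the status code itself is the signal.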
Point 3 is a good one. There is a temporary removal tool, but from what I understand it’s only good for about six months. Still, adding the URLs to that tool on top of the rewrite should get a faster result. So great call, thanks.
Point 1 is a tough one. The /archive/ folder contains hundreds of markdown files, each full of links to a file system that no longer exists. Setting up a unique instruction for each URL isn’t feasible.
Is there a way to set up a “catch-all” instruction in the .htaccess?
Yes, you can use a catch-all rule in your .htaccess file. For example, using mod_alias you can add this line to return a 410 for any URL under the /archive/ folder:
RedirectMatch 410 ^/archive/.*$
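One small optional tweak: if the bare /archive URL (with no trailing slash) was ever indexed too, a slightly broader pattern should catch it as well:
# Also matches /archive and /archive/ themselves, not just pages beneath them
RedirectMatch 410 ^/archive(/.*)?$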
Alternatively, if you’re using mod_rewrite, you can add:
RewriteEngine On
RewriteRule ^archive/.*$ - [G,L]
Remember: make sure your robots.txt isn’t blocking these URLs, so search engines can actually see the 410 responses.
Since you want the 410 to apply only to the /repo/archive/ folder and its subdirectories, and you’re placing your .htaccess file at the document root, you can use a single rewrite rule that targets that specific path. For example:
RewriteEngine On
RewriteRule ^repo/archive/.*$ - [G,L]
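If you’d rather stick with the mod_alias approach from earlier, the equivalent single line in the same root .htaccess should be:
RedirectMatch 410 ^/repo/archive/.*$
Either way the response is the same 410; use whichever module your host has enabled.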
• HTTP Response:
The [G] flag tells Apache to return a 410 Gone status. This effectively informs browsers and search engines that the content is permanently unavailable, rather than simply preventing crawling (a complete sketch, including an optional custom 410 page, follows this list).
• .htaccess Location:
Since the .htaccess file is at the root, the rule must include repo/ in the pattern to correctly reference the full URL path.
• Regarding Robots.txt:
It’s important not to rely on a robots.txt file for this purpose. While robots.txt can prevent search engines from crawling certain directories, it doesn’t stop them from indexing the URLs if they’re linked elsewhere. With the .htaccess rule returning a 410 status, search engines will recognize that the content is permanently removed and are more likely to drop those URLs from their index.
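Putting it together, a minimal sketch of the root .htaccess (assuming mod_rewrite is available; the custom error text is optional and purely an example):
# Return a 410 Gone for everything under /repo/archive/
RewriteEngine On
RewriteRule ^repo/archive/.*$ - [G,L]
# Optional: serve a short custom message instead of Apache's default 410 page
ErrorDocument 410 "This archived content has been permanently removed."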
This should do the trick; give it a couple of weeks and let me know if you’re still seeing issues. BTW, Google just began a new core update yesterday, so it may be a few days before they get things sorted (ha; not that I’m counting on them ever getting search sorted again).