Critique of my method for finding all live URLs on a large site
Hello, can anyone please critique my method for locating every live page on a large website? I am working on a site with more than 3,000 pages, including product pages, and there has been little to no governance prior to this. There is no way to export a list of pages from our headless CMS, so I am pulling together every page GA found over the past year, every page GSC logged, and every page that shows up in a comprehensive Screaming Frog crawl using the GA4 and GSC APIs.

I am going section by section, pulling the list of pages in a given subfolder from each source into a spreadsheet, with each source's pages in its own column. I manually remove URLs with parameters, then use a script to identify the pages present in all three columns and the ones that are not (the exceptions).

I then run the list of exceptions through Screaming Frog to surface 404 and 500 errors. The ones that return a 200 status get added to the list of pages found in all three columns. At this stage I also spot-check the exception URLs for ones that would be correct once I manually fix typos. That is partly why I am going section by section rather than doing the entire site at once: the list stays small enough to spot-check, and I can keep an eye out for trends.

Once I am done, I can either spot-check against what is in our headless CMS or manually go folder by folder in the CMS to look for anything I missed. I imagine I will miss a few non-indexed or orphan pages, but this should get me close.

Any thoughts, suggestions, or constructive criticism appreciated! Thanks in advance.
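For reference, the comparison step works roughly like the sketch below. The file names and the one-URL-per-line export format are assumptions for illustration, not my exact script:

```python
# Rough sketch of the three-way comparison (illustrative only).
# Assumes each source is exported as one URL per line in a text file.
from urllib.parse import urlparse


def load_urls(path: str) -> set[str]:
    """Read one URL per line, skipping blanks and URLs with query parameters."""
    urls = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            if urlparse(url).query:  # drop parameterized URLs
                continue
            urls.add(url.rstrip("/").lower())  # light normalization
    return urls


ga_urls = load_urls("ga_subfolder.txt")        # pages GA found in the past year
gsc_urls = load_urls("gsc_subfolder.txt")      # pages GSC logged
crawl_urls = load_urls("crawl_subfolder.txt")  # Screaming Frog crawl export

# Pages present in all three sources
confirmed = ga_urls & gsc_urls & crawl_urls

# Exceptions: pages in at least one source but not all three
exceptions = (ga_urls | gsc_urls | crawl_urls) - confirmed

print(f"{len(confirmed)} confirmed, {len(exceptions)} exceptions")
for url in sorted(exceptions):
    print(url)
```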