Technical SEO Audit: Crawl, Indexing, Architecture, Core Web Vitals

Most "technical SEO audits" end up as 200-page PDFs filed in a Google Drive folder nobody opens. Engineering ignores them because they read like academic papers: issue inventories with no triage, no ticket templates, and no traffic estimates attached to the fixes.

This is the individual-contributor (IC) version. Five focused areas, named diagnoses, and a dev ticket queue your engineers will actually pick up in their next sprint. The audit's value isn't the deck. It's the merged PR.

If you walk away from a quarterly audit with a 47-tab spreadsheet but no commits, you ran a research project, not an audit.

The 5-Area Audit Framework

Every technical audit I run covers the same five areas, in this order, because the dependencies run downhill: there's no point optimizing Core Web Vitals (CWV) on pages Google can't crawl, and no point fixing schema on pages that aren't indexed. Work the funnel.

  1. Crawlability
  2. Index coverage
  3. Architecture and internal linking
  4. Core Web Vitals
  5. Structured data

You can finish the first pass in two focused days for a site under 50K URLs. Anything bigger, plan a week and lean harder on log sampling.

1. Crawlability

Tools: Screaming Frog or Sitebulb for the full crawl, server logs for what Googlebot actually fetches, the robots.txt report in GSC (it replaced the retired robots.txt tester).

Start with robots.txt. Read it line by line. Then check what's blocked at the directory level versus pattern level. The classic own-goals I see at least once a quarter:

  • Staging rules leaked to production. Someone copies Disallow: / from staging into the production robots.txt during a deploy. Whole site goes dark in 48 hours. The fix is two characters. The damage is six weeks.
  • /api/ routes accidentally disallowed when they serve hydrated content. Some Next.js and Nuxt apps fetch from /api/... for ISR or SSR data. If those endpoints are blocked, Googlebot sees half-rendered pages on the second pass.
  • CSS and JS blocked. Still happens. Googlebot needs to render the page. If it can't load your stylesheet, mobile-friendliness checks fail and rankings drop.
  • Disallow on faceted nav that's also where 60% of long-tail traffic lives. Common in e-commerce. Block parameter URLs in robots.txt and Googlebot can no longer fetch them, so it never sees the canonical tags that would have consolidated them.
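
A quick way to catch the first and third own-goals before they ship is to test a short list of must-crawl URLs against the robots.txt text in CI. This is a minimal sketch using Python's standard-library parser. The robots.txt contents and the URL list are hypothetical, and the stdlib parser does plain prefix matching, so it will not evaluate Google-style `*` and `$` wildcard rules the way Googlebot does.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the URLs that the given robots.txt text disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt bodies: one with the leaked staging rule, one sane.
LEAKED = "User-agent: *\nDisallow: /\n"
HEALTHY = "User-agent: *\nDisallow: /cart/\n"

# Hypothetical URLs that must always stay crawlable (pages plus render assets).
MUST_BE_CRAWLABLE = [
    "https://example.com/",
    "https://example.com/blog/any-post",
    "https://example.com/static/app.css",
]

print(blocked_urls(LEAKED, MUST_BE_CRAWLABLE))   # all three URLs come back
print(blocked_urls(HEALTHY, MUST_BE_CRAWLABLE))  # empty list
```

Failing the deploy when `blocked_urls` returns anything is cheaper than the six weeks of recovery.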

After robots.txt, run a full Screaming Frog crawl with JavaScript rendering enabled. Compare the rendered HTML against the raw HTML. If the rendered version has 3,000 more words than the raw, you have a JS rendering problem. (More on that below.)
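
If you export the raw and rendered HTML for a page, the comparison itself is simple to script. The sketch below counts visible words in each version and reports the difference. The two HTML strings are made-up samples, and "visible" here only means "not inside script or style tags", which is a rough approximation.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects words from text nodes, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.words: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def word_count(html: str) -> int:
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return len(extractor.words)


def rendering_gap(raw_html: str, rendered_html: str) -> int:
    """Words that only exist after JavaScript has run."""
    return word_count(rendered_html) - word_count(raw_html)


# Hypothetical sample: an empty SPA shell versus its rendered DOM.
RAW = '<html><body><div id="root"></div><script>var boot = 1;</script></body></html>'
RENDERED = '<html><body><div id="root"><h1>Pricing plans</h1><p>Three tiers.</p></div></body></html>'

print(rendering_gap(RAW, RENDERED))  # 4
```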

Then sample your server logs. Two weeks of access logs filtered to Googlebot user agents. Group by status code and URL pattern. You're looking for:

  • 5xx errors hitting Googlebot (server can't keep up with crawl rate)
  • Soft 404s (200 status with "page not found" content)
  • Crawl traps (infinite parameter combinations on faceted nav, calendar widgets that go to year 3024, session IDs in URLs)
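
The grouping step can be sketched in a few lines. This assumes access logs in the common "combined" format; the sample lines, IPs, and paths are invented for illustration, and the first-path-segment grouping is just one reasonable way to define a URL pattern.

```python
import re
from collections import Counter

# Combined Log Format: ip - - [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def url_pattern(path: str) -> str:
    """Collapse a path to its first segment and flag query strings."""
    base, _, query = path.partition("?")
    segments = [s for s in base.split("/") if s]
    if not segments:
        pattern = "/"
    elif len(segments) == 1:
        pattern = "/" + segments[0]
    else:
        pattern = "/" + segments[0] + "/*"
    return pattern + ("?params" if query else "")


def googlebot_summary(lines) -> Counter:
    """Count Googlebot hits per (status code, URL pattern)."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[(match.group("status"), url_pattern(match.group("path")))] += 1
    return counts


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Hypothetical log lines for illustration only.
SAMPLE = [
    f'66.249.66.1 - - [10/Mar/2025:10:00:00 +0000] "GET /blog/any-post HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Mar/2025:10:00:05 +0000] "GET /shop/shoes?color=red&size=9 HTTP/1.1" 503 0 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [10/Mar/2025:10:00:07 +0000] "GET /blog/any-post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

for (status, pattern), hits in sorted(googlebot_summary(SAMPLE).items()):
    print(status, pattern, hits)
```

User-agent strings can be spoofed, so for anything you plan to escalate, confirm the hits really came from Google with a reverse DNS lookup. Soft 404s will not show up here because they return 200; those need a content check.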

Crawl budget only matters on sites over 100K URLs. If you're a SaaS site with 800 pages, stop worrying about crawl budget and worry about why 12 of those 800 pages drive 90% of organic traffic.

2. Index Coverage

Tools: GSC Index Coverage (now Pages report), Screaming Frog comparing crawled vs indexed.

Open GSC Pages report. Filter to "Not indexed." Read every single status reason. The two that ship the most regression bugs:

  • "Discovered, currently not indexed." Google found the URL but hasn't fetched it. Usually means crawl priority is low (thin content, weak internal links) or the site is on a slow server.
  • "Crawled, currently not indexed." Google fetched it and decided not to index. This is a quality signal. Thin content, near-duplicates, and pages Google considers low value all land here. Fixing this is a content problem, not a technical one. Don't ticket the dev team for "Crawled, not indexed."

Now the classics:

  • Noindex on important pages. The canonical own-goal. Someone adds <meta name="robots" content="noindex"> to a layout file used by the whole /blog/ directory and traffic craters in 30 days. I've seen this ship from a dev who was testing a single article and forgot to revert the layout change. Always grep the codebase for noindex during an audit. Always.
  • Canonical mismatches. Page A has <link rel="canonical" href="page-b">, page B has <link rel="canonical" href="page-a">. Google picks one and ignores the other, but it's a coin flip which one. Fix: every page canonicalizes to itself unless you have a deliberate consolidation reason.
  • Parameter handling. UTM parameters, session IDs, sort orders, filter combinations. Each variant is a separate URL to Google unless you canonicalize. Default rule: every parameterized URL canonicals to the clean base URL. Override only for parameters that change content meaningfully (like product variants).
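  • Hreflang errors. If you run multi-language, hreflang return-tag errors cascade. GSC's International Targeting report was retired in 2022, so validate hreflang with the crawler instead (Screaming Frog and Sitebulb both report missing return tags).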
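
Two of the checks above are mechanical enough to script against a crawl export. The sketch below assumes you have a mapping of page URL to its rel=canonical href. It flags canonical loops (the A to B, B to A case) and, as an extra I added, chains where a canonical target itself canonicalizes elsewhere. It also computes the canonical a parameterized URL should declare under the default rule. The `CONTENT_PARAMS` set and all URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: parameters that change content meaningfully and may stay.
CONTENT_PARAMS = {"variant"}


def expected_canonical(url: str, content_params=CONTENT_PARAMS) -> str:
    """Strip every query parameter except the content-changing ones."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in content_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_issues(canonicals: dict[str, str]) -> dict[str, str]:
    """Map each problem URL to 'loop' or 'chain'."""
    issues = {}
    for url, target in canonicals.items():
        if target == url:
            continue  # self-canonical, nothing to check
        seen = {url}
        current, hops = target, 1
        while True:
            if current in seen:
                issues[url] = "loop"
                break
            following = canonicals.get(current, current)
            if following == current:  # reached a self-canonical or uncrawled page
                if hops > 1:
                    issues[url] = "chain"
                break
            seen.add(current)
            current, hops = following, hops + 1
    return issues


CRAWL = {
    "/page-a": "/page-b",
    "/page-b": "/page-a",
    "/shoes?sort=price": "/shoes",
    "/shoes": "/shoes",
    "/old": "/mid",
    "/mid": "/shoes",
}

print(canonical_issues(CRAWL))
print(expected_canonical("https://example.com/shoes?utm_source=news&sort=price"))
```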

Sanity check: run site:yourdomain.com in Google. The count it returns is a rough estimate, not an exact figure, so only large gaps against your GSC indexed count mean anything. If site: shows 12,000 and GSC shows 4,000, something's off, and the gap is usually parameter URLs Google indexed before you canonicalized them.

3. Architecture and Internal Linking

Tools: Sitebulb (best in class for visualizing site depth), Screaming Frog crawl depth report.

Five diagnostics, ranked by how often I see them in audits:

Depth from homepage. Anything more than four clicks from the homepage is a red flag. Run a Screaming Frog crawl, sort by crawl depth descending. Pages at depth 6+ are usually near-orphans stranded by bad navigation. If your highest-revenue product page is at depth 5, you have an architecture bug, not a content bug.
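
Crawl depth is just a breadth-first search over the internal link graph, so you can recompute it from any link export. A minimal sketch, assuming a dictionary of page to outgoing internal links; the sample graph is invented.

```python
from collections import deque


def crawl_depths(links: dict[str, list[str]], home: str = "/") -> dict[str, int]:
    """Minimum number of clicks from `home` to every reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links: dict[str, list[str]], home: str = "/", max_depth: int = 4) -> list[str]:
    """Pages more than `max_depth` clicks from the homepage."""
    return sorted(url for url, depth in crawl_depths(links, home).items() if depth > max_depth)


# Hypothetical link graph: the product page sits five clicks deep.
LINKS = {
    "/": ["/shop", "/blog"],
    "/shop": ["/shop/mens"],
    "/shop/mens": ["/shop/mens/shoes"],
    "/shop/mens/shoes": ["/shop/mens/shoes/running"],
    "/shop/mens/shoes/running": ["/product/best-seller"],
}

print(too_deep(LINKS))  # ['/product/best-seller']
```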

Orphan pages. Pages with zero internal inbound links. Sitebulb has a dedicated report. Common culprits: old landing pages from a campaign, blog posts the editorial team forgot to link from anywhere, product pages launched without nav updates. Either link to them from a relevant hub or noindex them. Don't let them rot.

Internal linking distribution. Run Screaming Frog's internal link count per URL. Plot the distribution. You're looking for the long tail of pages with 0-2 internal links. Those pages will struggle to rank no matter how good the content is. Hub-and-spoke structure (pillar page → cluster of related articles) is the cleanest fix: every cluster article links to the pillar, the pillar links back to every cluster article.
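
Orphans and the weakly linked tail fall out of the same inlink count. The sketch below assumes two inputs: the internal link graph from a crawl, and a separate list of known URLs (for example from the XML sitemap), because a crawler that follows links cannot discover a true orphan on its own. The URLs are hypothetical.

```python
from collections import Counter


def inlink_counts(links: dict[str, list[str]], known_urls: list[str]) -> Counter:
    """Count unique internal pages linking to each URL (self-links ignored)."""
    counts = Counter({url: 0 for url in known_urls})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def orphans(counts: Counter, home: str = "/") -> list[str]:
    return sorted(url for url, n in counts.items() if n == 0 and url != home)


def weakly_linked(counts: Counter, home: str = "/", max_links: int = 2) -> list[str]:
    """The 0-2 inlink tail, orphans included."""
    return sorted(url for url, n in counts.items() if n <= max_links and url != home)


# Hypothetical hub-and-spoke cluster plus one forgotten campaign page.
LINKS = {
    "/": ["/pillar", "/pricing"],
    "/pillar": ["/cluster-1", "/cluster-2"],
    "/cluster-1": ["/pillar"],
    "/cluster-2": ["/pillar", "/cluster-1"],
}
SITEMAP = ["/", "/pillar", "/pricing", "/cluster-1", "/cluster-2", "/old-campaign"]

counts = inlink_counts(LINKS, SITEMAP)
print(orphans(counts))        # ['/old-campaign']
print(weakly_linked(counts))
```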

Faceted nav traps. If you're e-commerce or marketplace, faceted navigation can generate millions of URL combinations. The decision tree:

  • Filters that change content meaningfully (category, brand) → indexable, canonical to themselves
  • Filters that just sort or paginate → noindex or canonical to base category
  • Combination filters (category + brand + price + size) → block in robots.txt or use AJAX so they don't generate crawlable URLs
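
Encoding the decision tree as a function makes it easy to run over a URL export and see how many URLs land in each bucket. This is a sketch under assumptions: the parameter names are hypothetical, and the tree above does not say what to do with a single non-content filter such as price alone, so I put it in the blocked bucket.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical parameter names; adjust to your own faceted nav.
CONTENT_FILTERS = {"category", "brand"}
SORT_OR_PAGE = {"sort", "page"}


def facet_policy(url: str) -> str:
    params = {key for key, _ in parse_qsl(urlsplit(url).query)}
    filters = params - SORT_OR_PAGE
    if len(filters) > 1:
        return "block"  # combination filters: robots.txt or non-crawlable AJAX
    if params & SORT_OR_PAGE:
        return "noindex-or-canonical-to-base"
    if filters <= CONTENT_FILTERS:
        return "index-self-canonical"  # clean URL or one meaningful filter
    return "block"  # assumption: a lone non-content filter is not worth indexing


for sample in [
    "/shoes",
    "/shoes?brand=acme",
    "/shoes?sort=price",
    "/shoes?brand=acme&price=50-100&size=9",
]:
    print(sample, "->", facet_policy(sample))
```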

Breadcrumbs. Every content page should have a visible breadcrumb component, marked up with BreadcrumbList schema. Breadcrumbs help users orient, give Google extra structural signal, and earn the breadcrumb display in SERPs.
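
For reference, BreadcrumbList markup is a short JSON-LD object with one ListItem per level. The sketch below builds it from the same trail your visible breadcrumb component renders; the names and URLs are placeholders.

```python
import json


def breadcrumb_jsonld(trail: list[tuple[str, str]]) -> dict:
    """Build BreadcrumbList JSON-LD from an ordered list of (name, url) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": position, "name": name, "item": url}
            for position, (name, url) in enumerate(trail, start=1)
        ],
    }


# Hypothetical trail for a blog post.
TRAIL = [
    ("Home", "https://example.com/"),
    ("Blog", "https://example.com/blog/"),
    ("Any post", "https://example.com/blog/any-post"),
]

# Embed the output in a <script type="application/ld+json"> tag in the page head.
print(json.dumps(breadcrumb_jsonld(TRAIL), indent=2))
```

Generating the markup from the same data as the visible breadcrumb keeps the two from drifting apart.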

4. Core Web Vitals

Tools: PageSpeed Insights for both lab and field data, Lighthouse for repeatable lab testing, Chrome DevTools Performance panel for diagnosis, GSC Core Web Vitals report for monitoring.

The thresholds matter. Memorize them:

| Metric | Good | Needs Improvement | Poor |
| --- | --- | --- | --- |
| LCP (Largest Contentful Paint) | ≤ 2.5s | 2.5s – 4.0s | > 4.0s |
| INP (Interaction to Next Paint) | ≤ 200ms | 200ms – 500ms | > 500ms |
| CLS (Cumulative Layout Shift) | ≤ 0.1 | 0.1 – 0.25 | > 0.25 |
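
If you pull field data into a script or a dashboard, the thresholds reduce to a lookup. A minimal sketch: it treats the boundary values as inclusive, which matches Google's published definitions, and it expects the 75th-percentile value for each metric, since that is the percentile Google assesses. LCP is in seconds, INP in milliseconds, CLS unitless.

```python
# (good upper bound, needs-improvement upper bound) per metric
THRESHOLDS = {
    "LCP": (2.5, 4.0),    # seconds
    "INP": (200, 500),    # milliseconds
    "CLS": (0.1, 0.25),   # unitless
}


def rate(metric: str, p75_value: float) -> str:
    """Classify a 75th-percentile field value for one Core Web Vital."""
    good, needs_improvement = THRESHOLDS[metric]
    if p75_value <= good:
        return "good"
    if p75_value <= needs_improvement:
        return "needs improvement"
    return "poor"


print(rate("LCP", 2.1))   # good
print(rate("INP", 340))   # needs improvement
print(rate("CLS", 0.31))  # poor
```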

Always measure on field data (CrUX, the real-user numbers in PageSpeed Insights and GSC), not lab. Lab numbers tell you a story; field numbers tell you the truth. Lab can be passing while CrUX is failing because real users are on slower networks and older devices.

Tier the fixes. Each metric has a clear ladder:

LCP fixes (in order of effort vs impact):

| Tier | Fix | Effort | Typical Impact |
| --- | --- | --- | --- |
| 1 | Compress and convert hero image to WebP/AVIF | Low | 0.3-0.8s |
| 2 | Serve via CDN with edge caching | Low-Med | 0.5-1.5s |
| 3 | Add fetchpriority="high" and preload for the LCP element | Low | 0.2-0.5s |
| 4 | Inline critical CSS, defer non-critical | Medium | 0.3-0.7s |
| 5 | Reduce server TTFB (caching, faster origin) | High | 0.5-2.0s+ |

INP fixes:

| Tier | Fix | Effort | Typical Impact |
| --- | --- | --- | --- |
| 1 | Code-split JS bundles, lazy-load below-fold scripts | Medium | 50-150ms |
| 2 | Replace heavy event handlers with debounced versions | Low | 30-80ms |
| 3 | Move expensive work off main thread (Web Workers) | High | 100-300ms |
| 4 | Audit and remove third-party scripts you don't need | Low-Med | 80-200ms |
| 5 | Replace blocking sync APIs with async equivalents | Medium | varies |

CLS fixes:

| Tier | Fix | Effort | Typical Impact |
| --- | --- | --- | --- |
| 1 | Add explicit width and height on every image and video | Low | 0.05-0.15 |
| 2 | Reserve space for ads and embeds with min-height | Low | 0.1-0.2 |
| 3 | Use font-display: swap and preload key fonts | Low | 0.05-0.1 |
| 4 | Stop injecting content above existing content | Medium | varies |

Run PageSpeed Insights on your top 10 templates (homepage, blog post, category, product, pricing, etc.), not on individual URLs. Templates are where you ship fixes once and they propagate.

5. Structured Data

Tools: GSC Rich Results report, schema.org validator, Screaming Frog's structured data extraction.

Coverage check. Every content type should have its appropriate schema:

  • Articles → Article or BlogPosting
  • Product pages → Product with Offer
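  • FAQ pages → FAQPage (Google now limits FAQ rich results to well-known government and health sites, so treat this as machine-readability markup, not a SERP feature)
  • Organization (homepage and footer) → Organization
  • Breadcrumbs (every page) → BreadcrumbList
  • How-to content → HowTo (Google retired HowTo rich results in 2023; the markup is still valid schema.org)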

Run Screaming Frog with structured data extraction enabled. Filter to URLs with errors or missing types. Cross-reference against GSC's Rich Results report. That's the source of truth for what Google validates.
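
For a spot check outside the crawler, you can extract the declared JSON-LD types from a page and diff them against what the template should carry. A rough sketch: it only reads top-level `@type` values in `application/ld+json` script tags, so it ignores `@graph` nesting, Microdata, and RDFa. The sample HTML and expected set are invented.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of every application/ld+json script tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def schema_types(html: str) -> set[str]:
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    types: set[str] = set()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # invalid JSON-LD is itself a finding; skipped here
        for node in data if isinstance(data, list) else [data]:
            declared = node.get("@type", []) if isinstance(node, dict) else []
            types.update([declared] if isinstance(declared, str) else declared)
    return types


def missing_types(html: str, expected: set[str]) -> set[str]:
    return expected - schema_types(html)


# Hypothetical blog post template that ships Article but forgot breadcrumbs.
PAGE = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Any post"}
</script></head><body><h1>Any post</h1></body></html>"""

print(missing_types(PAGE, {"Article", "BreadcrumbList"}))  # {'BreadcrumbList'}
```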

The AEO angle (Answer Engine Optimization) matters more every quarter. LLM-powered search systems appear to lean on structured data when deciding which content to cite. In my experience, a FAQ page with valid FAQPage schema and clean Q&A pairs gets cited by ChatGPT, Perplexity, and Google's AI Overviews more reliably than the same content without schema, though that is an observation, not a documented ranking factor. Schema is no longer just for rich snippets. It's how machines decide what your page is about.

The JS Rendering Reality

If your site is a SPA built on React, Vue, Svelte, or Angular without SSR or SSG, you're playing on hard mode.

Googlebot does a two-pass render. First pass: it sees your initial HTML response. For an unrendered SPA, that's a near-blank <div id="root"></div> with a script tag. Second pass: the page waits in Google's render queue until capacity frees up, then Googlebot runs the JS and indexes the rendered output. That wait is often short, but on large or low-priority sites it can stretch to days or weeks. The content gets indexed, eventually.

The "blank-page-in-Screaming-Frog" diagnosis: run Screaming Frog with JS rendering OFF. If your pages return 200 OK with empty <body>, that's what Googlebot sees on first pass. Now run with JS rendering ON. If the content appears, Google will get there eventually. If it doesn't, you have a deeper bug (auth wall, broken hydration, JS error blocking render).

When this matters:

  • News and trending content. If you depend on Googlebot indexing within 24 hours, the second-pass delay kills you. SSR or SSG, no exceptions.
  • Large catalogs. A 50K-product e-commerce site with client-side rendering will see crawl budget burn on the render queue. Google decides which pages are worth the second pass.
  • Pages with auth gating. If your hydration logic redirects unauthenticated users (which Googlebot is), Google sees a redirect, not your content.

The decision tree:

  • If you're a content site → SSG (Next.js, Astro, Eleventy)
  • If you're a product app → SSR for marketing pages, CSR for the app itself
  • If migrating is too expensive → prerendering (Prerender.io, or Rendertron if you accept that it is no longer maintained) as a stopgap

Ask the engineering team to log the User-Agent header on render failures. If you see Googlebot in there, you have proof to bring to a sprint planning session.

Packaging Findings as Dev Tickets

Here's where most audits die. The findings are real. The tickets never get written. Engineering picks the easiest tasks from the backlog because nobody attached business impact to your audit.

The deliverable is a prioritized ticket queue. My standard split for a quarterly audit:

  • 3 P0s (indexing-blocking issues). Site can't rank if these aren't fixed. Examples: noindex on important pages, robots.txt disallow on indexable directories, canonical tags pointing to 404s.
  • 8 P1s (Core Web Vitals failures and architecture problems). Examples: LCP > 4s on top template, no internal links to revenue pages, faceted nav generating crawl traps.
  • 20 P2s (cleanup and polish). Examples: schema gaps, hreflang cleanup, meta description rewrites, image alt text, minor architecture polish.

Every ticket needs the same fields:

Title: [SEO P0] Noindex meta tag on /blog/* layout

URL pattern: All URLs matching /blog/*
Repro:
  1. Visit https://example.com/blog/any-post
  2. View source
  3. See <meta name="robots" content="noindex"> in <head>
Expected: noindex tag should not be present on indexable blog posts
Evidence: Screenshot of view-source, GSC Pages report showing 247 URLs
  with "Excluded by 'noindex' tag" status
Estimated traffic impact: 247 URLs currently excluded. Top 20 of those
  ranked positions 4-15 before the noindex was added (Ahrefs data).
  Estimated recovery: 8K-12K monthly organic visits within 60 days
  of fix + reindex.
Acceptance: Layout file no longer renders noindex tag for /blog/*.
  GSC Pages report drops "Excluded by 'noindex' tag" count by ≥240.

Engineering will pick up tickets that read like that. They will not pick up tickets that read like "Fix indexing issues on blog."
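
If you file more than a handful of these per quarter, it helps to make the required fields impossible to skip. The sketch below is one hypothetical way to do that: a small dataclass that refuses to render a ticket with any empty field. The class and method names are mine, and the example values are copied from the template above.

```python
from dataclasses import dataclass


@dataclass
class SeoTicket:
    priority: str
    summary: str
    url_pattern: str
    repro: list[str]
    expected: str
    evidence: str
    traffic_impact: str
    acceptance: str

    def missing_fields(self) -> list[str]:
        """Names of fields left empty."""
        return [name for name, value in vars(self).items() if not value]

    def render(self) -> str:
        if self.missing_fields():
            raise ValueError(f"incomplete ticket: {self.missing_fields()}")
        steps = [f"  {i}. {step}" for i, step in enumerate(self.repro, start=1)]
        return "\n".join([
            f"Title: [SEO {self.priority}] {self.summary}",
            "",
            f"URL pattern: {self.url_pattern}",
            "Repro:",
            *steps,
            f"Expected: {self.expected}",
            f"Evidence: {self.evidence}",
            f"Estimated traffic impact: {self.traffic_impact}",
            f"Acceptance: {self.acceptance}",
        ])


ticket = SeoTicket(
    priority="P0",
    summary="Noindex meta tag on /blog/* layout",
    url_pattern="All URLs matching /blog/*",
    repro=["Visit https://example.com/blog/any-post", "View source",
           'See <meta name="robots" content="noindex"> in <head>'],
    expected="noindex tag should not be present on indexable blog posts",
    evidence="Screenshot of view-source, GSC Pages report showing 247 URLs",
    traffic_impact="247 URLs currently excluded",
    acceptance="Layout file no longer renders noindex tag for /blog/*",
)
print(ticket.render())

# The ticket nobody picks up: a title and nothing else.
vague = SeoTicket("P1", "Fix indexing issues on blog", "", [], "", "", "", "")
print(vague.missing_fields())
```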

The Audit Cadence

Quarterly deep audits cover all five areas. Two days of focused work, plus a week to write tickets and ride them through sprint planning. Every quarter.

Monthly index-coverage check. Open GSC Pages report on the first Monday of every month. Compare against the previous month's snapshot. Investigate any "Not indexed" category that grew by more than 10%.
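
The month-over-month comparison is simple arithmetic once the two snapshots are saved. A minimal sketch, assuming each snapshot is a dictionary of "Not indexed" reason to URL count copied from the Pages report; the counts below are made up.

```python
def grown_categories(previous: dict[str, int], current: dict[str, int],
                     threshold: float = 0.10) -> dict[str, float]:
    """Return reasons whose count grew by more than `threshold` (0.10 = 10%)."""
    flagged = {}
    for reason, now in current.items():
        before = previous.get(reason, 0)
        if before == 0:
            if now > 0:
                flagged[reason] = float("inf")  # new category this month
            continue
        growth = (now - before) / before
        if growth > threshold:
            flagged[reason] = round(growth, 3)
    return flagged


# Hypothetical snapshots taken on the first Monday of two consecutive months.
LAST_MONTH = {
    "Crawled - currently not indexed": 400,
    "Excluded by 'noindex' tag": 7,
    "Page with redirect": 120,
}
THIS_MONTH = {
    "Crawled - currently not indexed": 420,
    "Excluded by 'noindex' tag": 247,
    "Page with redirect": 118,
    "Soft 404": 5,
}

for reason, growth in grown_categories(LAST_MONTH, THIS_MONTH).items():
    print(f"{reason}: {growth:+.0%}" if growth != float("inf") else f"{reason}: new")
```

A 5% drift in "Crawled - currently not indexed" stays quiet; a noindex count jumping from 7 to 247 does not.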

Weekly CWV regression scan. GSC Core Web Vitals report. Any template that flips from "Good" to "Needs Improvement" gets investigated within the week. Catching a CWV regression at 1 week is a 1-hour fix. Catching it at 8 weeks after a release is a multi-sprint excavation.

Server log sampling: quarterly is fine for most sites, monthly for sites over 500K URLs.

The audit isn't a one-time deliverable. It's an operating cadence. The first one is a heavier lift because you're establishing baselines. By the third quarter, most of your time goes into the regression scans and the new-issue triage, not full re-audits.

Ship fixes, not findings. The audit is only as good as the merged PRs.
