When we started CODERCOPS, our blog had 12 posts. They lived in the same repository as the website code. Life was simple.

Today we have 130+ blog posts, 11 project case studies, and 5 team member profiles. The single-repo approach broke down around post 40. Content commits polluted the Git history, merge conflicts became frequent, and non-technical team members needed access to content files without risking changes to production code.

This post documents how we solved the problem by splitting content into a separate repository, connected via Git submodules. It is a pattern we now recommend to any content-heavy site.

Figure: Content repository architecture. Separating content from code is one of the best architectural decisions we have made.

The Problem with a Single Repository

What Broke at 40 Posts

| Issue | Impact |
| --- | --- |
| Git history noise | Content commits outnumbered code commits 5:1. Finding code changes in git log required filtering. |
| Merge conflicts | Two people editing different blog posts on different branches still conflicted on index files. |
| CI/CD waste | Every typo fix in a blog post triggered a full site rebuild with all linting, testing, and deployment steps. |
| Access control | We wanted content editors to push changes without access to API keys, deployment configs, or source code. |
| Repository size | 130+ MDX files with image references made the repo clone time noticeable. |

The tipping point was when a content update accidentally triggered a deployment that failed because of an unrelated linting issue in the codebase. A blog typo fix should never be blocked by a code problem.

The Two-Repository Architecture

Repository Structure

codercops-agency-website/          (Code repo)
├── src/
│   ├── pages/
│   ├── components/
│   ├── layouts/
│   └── styles/
├── public/
├── codercops-agency-content/      (Git submodule → Content repo)
│   ├── blog/
│   │   ├── post-one.mdx
│   │   ├── post-two.mdx
│   │   └── ... (130+ files)
│   ├── projects/
│   │   ├── the-venting-spot.md
│   │   ├── colleatz.md
│   │   └── ... (11 files)
│   └── team/
│       ├── anurag-verma.md
│       └── ... (5 files)
├── astro.config.mjs
└── package.json

codercops-agency-content/          (Content repo)
├── blog/
├── projects/
├── team/
├── .github/
│   └── workflows/
│       └── validate.yml
└── README.md

The content repo is included in the website repo as a Git submodule. This means:

  1. The website repo points to a specific commit of the content repo
  2. Content editors work in the content repo independently
  3. The website repo updates its submodule reference when content is ready to deploy
  4. Both repos have their own CI/CD pipelines

Why Git Submodules (Not a CMS)

We evaluated several alternatives before settling on submodules:

| Approach | Pros | Cons | Verdict |
| --- | --- | --- | --- |
| Headless CMS (Sanity, Strapi) | Visual editor, media management | Runtime dependency, API costs, vendor lock-in | We write in MDX with code blocks and tables. No CMS handles this well. |
| Git monorepo with CODEOWNERS | Simple setup | Still one repo, still polluted history | Does not solve the core problem. |
| npm package for content | Clean separation, versioning | Overhead of publishing, slower iteration | Too heavy for content that changes daily. |
| Git submodule | Clean separation, familiar Git workflow, no runtime dependency | Submodule learning curve | Best fit for our needs. |
| Copy files at build time (rsync) | Simple | Fragile, no version tracking | Too brittle for production. |

The key insight: our content is MDX files, not database records. MDX with frontmatter, code blocks, tables, and custom components does not fit neatly into any CMS. Git is the natural version control system for text files.

The Git Submodule Setup

Initial Setup

# In the website repo
git submodule add https://github.com/codercops/codercops-agency-content.git codercops-agency-content
git commit -m "Add content as submodule"

Updating Content

# Pull latest content changes
cd codercops-agency-content
git pull origin main
cd ..
git add codercops-agency-content
git commit -m "Update content submodule"

Astro Configuration

In astro.config.mjs, we point the content collections to the submodule directory:

// astro.config.mjs
import { defineConfig } from 'astro/config';

export default defineConfig({
  // Content collections read from the submodule
  // Astro's content directory is configured to include
  // the submodule path
});

The content collections in Astro 5 handle the MDX parsing, frontmatter validation, and type safety. The submodule is transparent to Astro — it just sees a directory of MDX files.
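In Astro 5, collections are normally declared in src/content.config.ts rather than in astro.config.mjs, with a loader that can read from any directory. The following is a minimal sketch, assuming the built-in glob loader and the submodule path from the tree above. The collection names and file patterns are assumptions inferred from that tree, not a verbatim copy of a production config.

```typescript
// src/content.config.ts (sketch; collection names and patterns are assumptions)
import { defineCollection } from 'astro:content';
import { glob } from 'astro/loaders';

// Each collection reads straight from the submodule directory.
// `base` is resolved relative to the project root.
const blog = defineCollection({
  loader: glob({ pattern: '**/*.mdx', base: './codercops-agency-content/blog' }),
});

const projects = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './codercops-agency-content/projects' }),
});

const team = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './codercops-agency-content/team' }),
});

export const collections = { blog, projects, team };
```

Because the loader only takes a path, nothing here knows the directory is a submodule. If the submodule has not been initialized, the directory is empty and the collections come back with no entries.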

Content Validation with GitHub Actions

The content repo has its own validation pipeline that runs on every push and pull request targeting main:

# .github/workflows/validate.yml
name: Validate Content
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Check frontmatter
        run: |
          # Validate required frontmatter fields
          for file in blog/*.mdx; do
            if ! grep -q "^title:" "$file"; then
              echo "Missing title in $file"
              exit 1
            fi
            if ! grep -q "^pubDate:" "$file"; then
              echo "Missing pubDate in $file"
              exit 1
            fi
            if ! grep -q "^author:" "$file"; then
              echo "Missing author in $file"
              exit 1
            fi
          done

      - name: Check for broken image references
        run: |
          # Fail the step if any image URL does not return HTTP 200
          failed=0
          while read -r url; do
            # "|| true" keeps the loop going when curl cannot connect (status 000)
            status=$(curl -o /dev/null -s -w "%{http_code}" "$url" || true)
            if [ "$status" -ne 200 ]; then
              echo "Broken image: $url (status: $status)"
              failed=1
            fi
          done < <(grep -rh "!\[" blog/ | grep -oP 'https?://[^\s)]+')
          exit "$failed"

      - name: Lint markdown
        uses: DavidAnson/markdownlint-cli2-action@v14
        with:
          globs: '**/*.mdx'

This catches problems before they reach the website build:

  • Missing required frontmatter fields (title, pubDate, author)
  • Broken image URLs
  • Markdown formatting issues
  • Invalid date formats
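The frontmatter step in the workflow checks that pubDate is present, not that it is well formed. The same checks can be sketched as a pure function that also validates the date. This assumes the frontmatter is a leading block delimited by --- lines with one key: value pair per line. The name validateFrontmatter and the strict YYYY-MM-DD rule are illustrative assumptions, not part of the workflow above.

```typescript
// Hypothetical validator: returns the problems found in one post's source text.
const REQUIRED_FIELDS = ['title', 'pubDate', 'author'];

function validateFrontmatter(source: string): string[] {
  const lines = source.split(/\r?\n/);
  if (lines[0] !== '---') return ['Missing frontmatter block'];
  const end = lines.indexOf('---', 1);
  if (end === -1) return ['Unterminated frontmatter block'];

  // Collect top-level "key: value" pairs, stripping surrounding quotes.
  const fields: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    const match = /^([A-Za-z]\w*):\s*(.*)$/.exec(line);
    if (match) fields[match[1]] = match[2].trim().replace(/^["']|["']$/g, '');
  }

  const errors: string[] = [];
  for (const name of REQUIRED_FIELDS) {
    if (!fields[name]) errors.push(`Missing ${name}`);
  }

  // pubDate must be YYYY-MM-DD and a real calendar date (2026-02-30 is rejected).
  const pubDate = fields['pubDate'];
  if (pubDate) {
    const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(pubDate);
    let valid = false;
    if (m) {
      const month = Number(m[2]);
      const day = Number(m[3]);
      const date = new Date(Date.UTC(Number(m[1]), month - 1, day));
      valid = date.getUTCMonth() === month - 1 && date.getUTCDate() === day;
    }
    if (!valid) errors.push(`Invalid pubDate "${pubDate}"`);
  }
  return errors;
}
```

Unlike the shell loop, this collects every problem in a file instead of stopping at the first one, which makes for a more useful pull request log.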

The Content Workflow

For Blog Posts

Writer creates post
    ↓
Push to content repo (branch)
    ↓
GitHub Actions validates frontmatter and formatting
    ↓
Pull request reviewed
    ↓
Merge to main
    ↓
Website repo updates submodule reference
    ↓
Vercel deploys updated site

For Project Case Studies

Project files follow the same flow but with additional frontmatter fields (client, tech stack, timeline, live URL).

Deployment Trigger

We have two deployment paths:

  1. Code changes trigger a full build from the website repo
  2. Content changes trigger a submodule update in the website repo, which triggers a Vercel deployment

This means a blog post typo fix deploys in under 2 minutes without touching any code.

What We Learned After 130+ Posts

File Naming Convention Matters

We settled on this pattern: {slug}-{year}.mdx

ai-chatbot-build-vs-buy-cost-breakdown-2026.mdx
vibe-coding-revolution-2026.mdx
web-development-costs-pricing-guide-2026.mdx

The year suffix prevents slug collisions when we update topics annually. "web-development-trends" could exist for 2025 and 2026 without conflict.
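A convention like this is easy to check mechanically. Here is a minimal sketch, assuming slugs are lowercase kebab-case (inferred from the examples above) and the year is always four digits. The helper name parsePostFilename is hypothetical.

```typescript
interface PostFilename {
  slug: string;
  year: number;
}

// Splits "web-development-trends-2026.mdx" into its slug and year.
// Returns null when the name does not follow {slug}-{year}.mdx.
function parsePostFilename(filename: string): PostFilename | null {
  const match = /^([a-z0-9]+(?:-[a-z0-9]+)*)-(\d{4})\.mdx$/.exec(filename);
  if (!match) return null;
  return { slug: match[1], year: Number(match[2]) };
}
```

Two files with the same slug and different years parse to distinct (slug, year) pairs, which is the property the year suffix is meant to guarantee.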

Frontmatter Schema Enforcement

Every blog post requires this frontmatter:

---
title: "String"           # Required
description: "String"     # Required, used for meta description and OG
pubDate: YYYY-MM-DD       # Required, ISO date
author: "String"          # Required
image: "URL"              # Required, Unsplash or custom
tags: ["Array"]           # Required, 3-6 tags
category: "String"        # Required (Web Development, AI Integration, etc.)
subcategory: "String"     # Required (Guide, Tutorial, News, etc.)
featured: boolean         # Required (true shows on homepage)
draft: boolean            # Optional (true hides from production)
---

Astro 5's content collections enforce this schema at build time. If a field is missing or the wrong type, the build fails with a clear error message.
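The frontmatter above could be expressed as a collection schema roughly as follows. This is a sketch, assuming Zod as re-exported by astro:content. The field rules mirror the comments in the frontmatter listing, and blogSchema is a hypothetical name.

```typescript
// Sketch of a schema for the blog frontmatter, for use in src/content.config.ts.
import { z } from 'astro:content';

export const blogSchema = z.object({
  title: z.string(),
  description: z.string(),
  pubDate: z.coerce.date(),                 // accepts a date string or Date, yields a Date
  author: z.string(),
  image: z.string().url(),
  tags: z.array(z.string()).min(3).max(6),  // 3-6 tags
  category: z.string(),
  subcategory: z.string(),
  featured: z.boolean(),
  draft: z.boolean().optional(),            // omitted means "not a draft"
});
```

Passing this as the schema option of defineCollection is what makes a missing or mistyped field fail the build.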

Content Organization by Topic, Not Date

We keep every post in one flat directory, named by topic slug rather than nested in date folders:

blog/
├── ai-chatbot-build-vs-buy-2026.mdx
├── ai-powered-web-development-2026.mdx
├── astro-5-agency-websites-2026.mdx
├── ... (alphabetical by slug)

Date-based folders (2026/03/post.mdx) create unnecessary nesting and make it harder to find posts. The pubDate frontmatter field handles chronological sorting at the application level.
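Sorting at the application level can be as small as the sketch below. It assumes each entry exposes a data.pubDate that has already been parsed into a Date, as a collection entry with a date-typed schema field would. The PostEntry shape and the function name are hypothetical.

```typescript
// Minimal shape of a collection entry; real entries carry more fields.
interface PostEntry {
  id: string;
  data: { pubDate: Date; draft?: boolean };
}

// Newest first, with drafts filtered out. Returns a new array and
// leaves the input untouched.
function sortPostsForListing(posts: PostEntry[]): PostEntry[] {
  return posts
    .filter((post) => post.data.draft !== true)
    .sort((a, b) => b.data.pubDate.getTime() - a.data.pubDate.getTime());
}
```

Since ordering comes entirely from frontmatter, renaming or moving a file never changes where a post appears in a listing.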

Image Strategy

We use Unsplash URLs with dimension parameters instead of storing images in the repo:

![Alt text](https://images.unsplash.com/photo-123?w=800&h=400&fit=crop)

Benefits:

  • Zero image storage in the repo
  • Unsplash CDN handles optimization and delivery
  • Consistent dimensions via URL parameters
  • No build-time image processing needed
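Consistent dimensions are easier to keep when the URL is built in one place. Below is a minimal sketch that uses only the w, h, and fit parameters from the example URL above. The helper name and the default sizes are assumptions.

```typescript
// Hypothetical helper: builds an Unsplash image URL with fixed crop dimensions.
// photoPath is the path segment from the image URL, e.g. "photo-123".
function unsplashImageUrl(photoPath: string, width = 800, height = 400): string {
  return `https://images.unsplash.com/${photoPath}?w=${width}&h=${height}&fit=crop`;
}

// Renders the Markdown image syntax used in posts.
function markdownImage(alt: string, photoPath: string): string {
  return `![${alt}](${unsplashImageUrl(photoPath)})`;
}
```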

For project screenshots and custom graphics, we use a separate asset hosting approach.

Performance Numbers

| Metric | Single Repo (Before) | Two Repos (After) |
| --- | --- | --- |
| Clone time | 45 seconds | 12 seconds (code) + 8 seconds (content) |
| Average build time | 3.2 minutes | 2.1 minutes |
| Content-only deploy | 3.2 minutes (full build) | 1.8 minutes |
| Git log noise | 80% content commits | Clean separation |
| Merge conflicts | Weekly | Rare |

The biggest win is not speed — it is cognitive clarity. When we look at the website repo's Git history, we see code changes. When we look at the content repo, we see content changes. Each history tells a coherent story.

When This Pattern Makes Sense

Use a Separate Content Repo When:

  • You have 20+ content files and growing
  • Multiple people contribute content
  • Content updates are more frequent than code changes
  • You want content validation independent of code builds
  • Your content is text-based (Markdown, MDX, YAML) not database records

Stick with a Single Repo When:

  • You have fewer than 20 content files
  • Only developers touch content
  • Content and code change at the same rate
  • You are using a headless CMS anyway

Use a Headless CMS When:

  • Non-technical users need a visual editor
  • Your content includes complex media management
  • You need role-based content workflows (draft, review, publish)
  • Content does not include code blocks or technical formatting

Common Pitfalls

  1. Forgetting to update the submodule. New content sits in the content repo but the website still points to the old commit. Solution: automate submodule updates via webhook or scheduled CI job.

  2. Submodule confusion for new developers. git clone does not automatically initialize submodules. New team members need to run git submodule update --init after cloning, or clone with git clone --recurse-submodules. We document this in the README.

  3. Branch divergence. If the content repo has branches, the website repo needs to know which branch to track. We keep it simple: always track main.

  4. Build failures from content changes. Even with validation in the content repo, a valid MDX file might break the Astro build (e.g., using an undefined component). We run a test build in the website repo's CI before deploying.
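The first two pitfalls can both be detected from the output of git submodule status, which prefixes each line with - when the submodule is uninitialized and + when the checked-out commit differs from the recorded one. The sketch below parses one such line and compares it with the content repo's branch head. It assumes the head SHA is obtained separately (for example from git ls-remote), and the type and function names are hypothetical.

```typescript
type SubmoduleState = 'ok' | 'uninitialized' | 'differs' | 'conflict';

interface SubmoduleStatus {
  state: SubmoduleState;
  sha: string;
  path: string;
}

// Parses one line of `git submodule status` output, e.g.
// "-<sha> codercops-agency-content" or " <sha> codercops-agency-content (heads/main)".
// Paths containing spaces are not handled in this sketch.
function parseSubmoduleStatus(line: string): SubmoduleStatus | null {
  const match = /^([ \-+U])([0-9a-f]{40,64}) (\S+)/.exec(line);
  if (!match) return null;
  const states: Record<string, SubmoduleState> = {
    ' ': 'ok',
    '-': 'uninitialized',
    '+': 'differs',
    'U': 'conflict',
  };
  return { state: states[match[1]], sha: match[2], path: match[3] };
}

// True when the recorded commit differs from the content repo's branch head,
// meaning the website repo's submodule reference needs updating.
function needsBump(status: SubmoduleStatus, remoteHeadSha: string): boolean {
  return status.state === 'ok' && status.sha !== remoteHeadSha;
}
```

A prebuild script could fail fast with a pointer to git submodule update --init when the state is uninitialized, and a scheduled job could open an update commit whenever needsBump returns true.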

The Numbers

Our content repo today:

| Metric | Count |
| --- | --- |
| Blog posts | 130+ |
| Project case studies | 11 |
| Team profiles | 5 |
| Total MDX/MD files | 146+ |
| Average post length | 1,800 words |
| Total content | ~260,000 words |
| Repo size | 2.1 MB (text only, no images) |

At 2.1 MB for 260,000 words of content, Git handles this effortlessly. We could 10x the content volume before needing to reconsider the architecture.


Running a content-heavy site and struggling with repository management? We have built this pattern for ourselves and for clients. Talk to us about your content architecture.
