Building a Markdown Knowledge Base from Web Data

Converting web content into a well-organized knowledge base can be challenging, especially when dealing with multiple sources and formats. In this tutorial, I’ll show you how to use DataFuel to transform website data into a clean, maintainable Markdown knowledge base. For a deeper dive into using this knowledge base with AI, check out our guide on RAG for Websites.

Why Markdown for Your Knowledge Base?

Feature         | Benefit
----------------|---------------------------------------------------
Human-readable  | Easy to read and write without special tools
Version control | Git-friendly format for tracking changes
Convertible     | Easily transforms to HTML, PDF, and other formats
LLM-friendly    | Clean text structure ideal for AI processing

Getting Started with DataFuel

  1. Get Your API Key: Sign up at datafuel.dev
  2. Install Dependencies: You’ll need Python with the requests library
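
Before running the examples below, install requests (pip install requests) and keep your API key out of source files. Here is a minimal setup sketch, assuming the key is exported as a DATAFUEL_API_KEY environment variable (the variable name is just the convention used in this post):

# Install the only third-party dependency first: pip install requests
import os

# Assumes DATAFUEL_API_KEY was exported in your shell beforehand
API_KEY = os.getenv('DATAFUEL_API_KEY')
if not API_KEY:
    raise RuntimeError('Set the DATAFUEL_API_KEY environment variable first')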

Basic Implementation

Here’s an example of how to scrape a website or knowledge base into clean markdown format using DataFuel:

import requests
import os
from datetime import datetime
import time
import json

API_KEY = os.getenv('DATAFUEL_API_KEY')  # read the key from the environment rather than hardcoding it
KNOWLEDGE_BASE_DIR = './kb'

def fetch_content(url, depth=5, limit=100):
    headers = {
        'Authorization': f'Bearer {API_KEY}',
        'Content-Type': 'application/json'
    }
    
    # Initial crawl request
    payload = {
        "depth": depth,
        "limit": limit,
        "url": url
    }
    
    response = requests.post(
        'https://api.datafuel.dev/crawl',
        json=payload,
        headers=headers
    )
    response.raise_for_status()
    
    job_id = response.json().get('job_id')
    if not job_id:
        raise Exception("No job_id received")
    
    # Poll until every page in the crawl job reports 'finished'
    for _ in range(120):  # give up after roughly 10 minutes
        results = get_crawl_results(job_id, headers)
        if results and all(r['job_status'] == 'finished' for r in results):
            return results
        time.sleep(5)  # Wait 5 seconds before polling again
    raise TimeoutError(f"Crawl job {job_id} did not finish in time")

def get_crawl_results(job_id, headers):
    response = requests.get(
        'https://api.datafuel.dev/list_scrapes',
        headers=headers,
        params={'job_id': job_id}
    )
    response.raise_for_status()
    return response.json()

def extract_markdown(signed_url):
    try:
        response = requests.get(signed_url)
        response.raise_for_status()
        data = response.json()
        
        if markdown := data.get("markdown"):
            return markdown
        return "No markdown field found in the data."
    except requests.exceptions.RequestException as e:
        return f"An error occurred while fetching the data: {e}"
    except json.JSONDecodeError:
        return "Error: The response is not valid JSON."

def save_to_markdown(content, url, tags=None):
    # Create a clean filename from the URL
    filename = url.replace('https://', '').replace('http://', '').replace('/', '_') + '.md'
    os.makedirs(KNOWLEDGE_BASE_DIR, exist_ok=True)  # make sure the target directory exists
    filepath = os.path.join(KNOWLEDGE_BASE_DIR, filename)
    
    # Get markdown content from signed URL
    markdown_content = extract_markdown(content['signed_url'])
    
    metadata = {
        'source': url,
        'date_added': datetime.now().strftime('%Y-%m-%d'),
        'tags': tags or [],
        'category': determine_category(url),
        'scrape_id': content.get('scrape_id'),
        'job_id': content.get('job_id')
    }
    
    # Convert metadata to YAML format
    yaml_metadata = '\n'.join([f'{k}: {v}' for k, v in metadata.items()])
    
    final_content = f"""---
{yaml_metadata}
---

{markdown_content}
"""
    
    with open(filepath, 'w') as f:
        f.write(final_content)

# Example usage
def process_website(url, depth=5, limit=100):
    try:
        crawl_results = fetch_content(url, depth, limit)
        for result in crawl_results:
            if result['scrape_status'] == 'success':
                save_to_markdown(result, result['scrape_url'])
    except Exception as e:
        print(f"Error processing website: {e}")

Organizing Your Knowledge Base

kb/
├── 📁 technical/
│   ├── guides/
│   ├── reference/
│   └── tutorials/
├── 📁 product/
│   ├── features/
│   └── use-cases/
└── 📁 research/
    ├── market/
    └── competitors/
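
The save_to_markdown function above calls determine_category, which the snippet leaves undefined. Here is a minimal sketch, assuming you route pages into the folders shown above based on keywords in the URL (the keyword map is hypothetical; adjust it to your own taxonomy):

def determine_category(url):
    # Hypothetical keyword-to-folder mapping; tune this to your own structure
    categories = {
        'docs': 'technical',
        'guide': 'technical',
        'tutorial': 'technical',
        'feature': 'product',
        'pricing': 'product',
        'blog': 'research',
    }
    for keyword, category in categories.items():
        if keyword in url.lower():
            return category
    return 'uncategorized'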

Making Your Knowledge Base LLM-Ready

Best Practices Checklist

Rich Metadata

  • 📅 Date added
  • 🏷️ Tags
  • 📁 Categories
  • 🔄 Last updated

📚 Clear Structure

  • 📌 Consistent headings
  • 🔍 Logical hierarchy

💻 Code Examples

  • ✨ Syntax highlighting
  • 💭 Clear comments

🔗 Cross-References

  • 📎 Internal links
  • 🤝 Related content

Enhanced Metadata Example

def save_to_markdown(content, url, tags=None):
    metadata = {
        'source': url,
        'date_added': datetime.now().strftime('%Y-%m-%d'),
        'tags': tags or [],
        'category': determine_category(url),
        'summary': content.get('summary', '')
    }
    
    # Convert metadata to a YAML front matter block
    yaml_metadata = '\n'.join([f'{k}: {v}' for k, v in metadata.items()])
    
    markdown_content = f"""---
{yaml_metadata}
---

# {content.get('title', url)}

{content.get('summary', '')}

{content.get('main_content', '')}
"""
    return markdown_content

Maintaining Your Knowledge Base

Regular Maintenance Tasks

Task              | Frequency | Purpose
------------------|-----------|------------------------------
Content Updates   | Monthly   | Keep information current
Link Verification | Weekly    | Ensure all links work
Duplicate Check   | Monthly   | Remove redundant content
Tag Review        | Quarterly | Maintain consistent taxonomy

Automated Health Check

def audit_knowledge_base():
    for root, _, files in os.walk(KNOWLEDGE_BASE_DIR):
        for file in files:
            if file.endswith('.md'):
                filepath = os.path.join(root, file)
                issues = check_file_health(filepath)
                if issues:
                    print(f"{filepath}: {', '.join(issues)}")

def check_file_health(filepath):
    # Read file content
    with open(filepath, 'r') as f:
        content = f.read()
    
    # Check for common issues
    issues = []
    if '](broken-link)' in content:
        issues.append('Contains broken links')
    if len(content.split('\n\n')) < 3:
        issues.append('Content might be too condensed')
    
    return issues
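
The maintenance table above also lists weekly link verification. One possible sketch, reusing requests to HEAD-check every external link found in a file (the pattern only catches inline Markdown links of the form [text](url), which covers what the scraper produces):

import re

LINK_PATTERN = re.compile(r'\[[^\]]*\]\((https?://[^)]+)\)')

def verify_links(filepath):
    with open(filepath, 'r') as f:
        content = f.read()
    
    dead_links = []
    for link in LINK_PATTERN.findall(content):
        try:
            response = requests.head(link, allow_redirects=True, timeout=10)
            if response.status_code >= 400:
                dead_links.append(link)
        except requests.exceptions.RequestException:
            dead_links.append(link)
    return dead_links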

Final Thoughts

Building a robust Markdown knowledge base requires:

  1. Consistent Structure

    • Clear organization
    • Regular formatting
    • Predictable patterns
  2. Quality Content

    • Up-to-date information
    • Well-documented code
    • Comprehensive metadata
  3. Regular Maintenance

    • Scheduled reviews
    • Automated checks
    • Content updates

Pro Tip: Start small and iterate. Your knowledge base should evolve with your needs and grow organically over time.


Need help getting started? Check out our documentation or join our community.

Try it yourself!

If you want all of that in a simple, reliable scraping tool, give DataFuel a try.