Building a Markdown Knowledge Base from Web Data
Converting web content into a well-organized knowledge base can be challenging, especially when dealing with multiple sources and formats. In this tutorial, I’ll show you how to use DataFuel to transform website data into a clean, maintainable Markdown knowledge base. For a deeper dive into using this knowledge base with AI, check out our guide on RAG for Websites.
Why Markdown for Your Knowledge Base?
| Feature | Benefit |
|---|---|
| Human-readable | Easy to read and write without special tools |
| Version control | Git-friendly format for tracking changes |
| Convertible | Easily transforms to HTML, PDF, and other formats |
| LLM-friendly | Clean text structure ideal for AI processing |
Getting Started with DataFuel
- Get Your API Key: Sign up at datafuel.dev
- Install Dependencies: You'll need Python with the `requests` library (install it with `pip install requests`)
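The examples in this post read the API key from an environment variable so it never ends up in source control. A quick sanity check at the top of your script might look like the sketch below; the variable name `DATAFUEL_API_KEY` is my own convention, not something DataFuel mandates.

```python
import os

# Assumed convention: export DATAFUEL_API_KEY in your shell before running the examples
API_KEY = os.getenv('DATAFUEL_API_KEY')
if not API_KEY:
    raise RuntimeError('Set the DATAFUEL_API_KEY environment variable first.')
```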
Basic Implementation
Here’s an example of how to scrape a website or knowledge base into clean markdown format using DataFuel:
```python
import requests
import os
from datetime import datetime
import time
import json

# Read the API key from the environment rather than hardcoding it in source files
API_KEY = os.getenv('DATAFUEL_API_KEY')
KNOWLEDGE_BASE_DIR = './kb'


def fetch_content(url, depth=5, limit=100):
    headers = {
        'Authorization': f'Bearer {API_KEY}',
        'Content-Type': 'application/json'
    }

    # Initial crawl request
    payload = {
        "depth": depth,
        "limit": limit,
        "url": url
    }
    response = requests.post(
        'https://api.datafuel.dev/crawl',
        json=payload,
        headers=headers
    )
    response.raise_for_status()

    job_id = response.json().get('job_id')
    if not job_id:
        raise Exception("No job_id received")

    # Poll for results until every scrape in the job has finished
    while True:
        results = get_crawl_results(job_id, headers)
        if results and all(r['job_status'] == 'finished' for r in results):
            return results
        time.sleep(5)  # Wait 5 seconds before polling again


def get_crawl_results(job_id, headers):
    response = requests.get(
        'https://api.datafuel.dev/list_scrapes',
        headers=headers,
        params={'job_id': job_id}
    )
    return response.json()


def extract_markdown(signed_url):
    try:
        response = requests.get(signed_url)
        response.raise_for_status()
        data = response.json()
        if markdown := data.get("markdown"):
            return markdown
        return "No markdown field found in the data."
    except requests.exceptions.RequestException as e:
        return f"An error occurred while fetching the data: {e}"
    except json.JSONDecodeError:
        return "Error: The response is not valid JSON."


def determine_category(url):
    # Minimal placeholder so the example runs end to end;
    # a fuller keyword-based version is sketched in the organizing section below
    return 'uncategorized'


def save_to_markdown(content, url, tags=None):
    # Create a clean filename from the URL
    filename = url.replace('https://', '').replace('http://', '').replace('/', '_') + '.md'
    filepath = os.path.join(KNOWLEDGE_BASE_DIR, filename)

    # Get the Markdown content from the signed URL
    markdown_content = extract_markdown(content['signed_url'])

    metadata = {
        'source': url,
        'date_added': datetime.now().strftime('%Y-%m-%d'),
        'tags': tags or [],
        'category': determine_category(url),
        'scrape_id': content.get('scrape_id'),
        'job_id': content.get('job_id')
    }

    # Convert metadata to a YAML front matter block
    yaml_metadata = '\n'.join([f'{k}: {v}' for k, v in metadata.items()])

    final_content = f"""---
{yaml_metadata}
---

{markdown_content}
"""

    os.makedirs(KNOWLEDGE_BASE_DIR, exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(final_content)


# Example usage
def process_website(url, depth=5, limit=100):
    try:
        crawl_results = fetch_content(url, depth, limit)
        for result in crawl_results:
            if result['scrape_status'] == 'success':
                save_to_markdown(result, result['scrape_url'])
    except Exception as e:
        print(f"Error processing website: {e}")
```
Organizing Your Knowledge Base
Recommended Structure
```
kb/
├── 📁 technical/
│   ├── guides/
│   ├── reference/
│   └── tutorials/
├── 📁 product/
│   ├── features/
│   └── use-cases/
└── 📁 research/
    ├── market/
    └── competitors/
```
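The `save_to_markdown` function above calls a `determine_category` helper to pick one of these top-level folders. Here is one minimal way it could work; the keyword-to-category mapping is an assumption you would adapt to the URLs you actually crawl.

```python
def determine_category(url):
    # Map URL keywords onto the top-level folders of the recommended structure.
    # The keyword lists are illustrative assumptions, not part of the DataFuel API.
    categories = {
        'technical': ['docs', 'guide', 'tutorial', 'reference', 'api'],
        'product': ['feature', 'pricing', 'use-case', 'changelog'],
        'research': ['market', 'competitor', 'report'],
    }
    lowered = url.lower()
    for category, keywords in categories.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return 'uncategorized'  # Fallback when no keyword matches
```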
Making Your Knowledge Base LLM-Ready
Best Practices Checklist
✨ Rich Metadata
- 📅 Date added
- 🏷️ Tags
- 📁 Categories
- 🔄 Last updated
📚 Clear Structure
- 📌 Consistent headings
- 🔍 Logical hierarchy
💻 Code Examples
- ✨ Syntax highlighting
- 💭 Clear comments
🔗 Cross-References
- 📎 Internal links
- 🤝 Related content
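For the cross-references item above, one lightweight option is to link documents that share tags. The sketch below assumes the front matter format produced by `save_to_markdown`; it is not a DataFuel feature, and a real tool would deduplicate before appending.

```python
import os
import re

def read_tags(filepath):
    # Parse the tags line from the YAML front matter written by save_to_markdown
    with open(filepath, 'r', encoding='utf-8') as f:
        text = f.read()
    match = re.search(r"^tags:\s*\[(.*)\]", text, flags=re.MULTILINE)
    if not match:
        return set()
    return {t.strip().strip("'\"") for t in match.group(1).split(',') if t.strip()}

def add_related_links(kb_dir='./kb'):
    # Append a 'Related content' section listing files that share at least one tag
    paths = [os.path.join(kb_dir, f) for f in os.listdir(kb_dir) if f.endswith('.md')]
    tags_by_path = {p: read_tags(p) for p in paths}
    for path, tags in tags_by_path.items():
        related = [os.path.basename(other) for other, other_tags in tags_by_path.items()
                   if other != path and tags & other_tags]
        if related:
            with open(path, 'a', encoding='utf-8') as f:
                f.write('\n## Related content\n')
                f.writelines(f'- [{name}]({name})\n' for name in related)
```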
Enhanced Metadata Example
```python
def save_to_markdown(content, url, tags=None):
    metadata = {
        'source': url,
        'date_added': datetime.now().strftime('%Y-%m-%d'),
        'tags': tags or [],
        'category': determine_category(url),
        'summary': content.get('summary', '')
    }

    # Convert metadata to a YAML front matter block
    yaml_metadata = '\n'.join([f'{k}: {v}' for k, v in metadata.items()])

    # 'title', 'summary' and 'main_content' are assumed to be available
    # on the content dict for this enriched variant
    markdown_content = f"""---
{yaml_metadata}
---

# {content['title']}

{content['summary']}

{content['main_content']}
"""

    # Write the enriched document using the same filename scheme as before
    filename = url.replace('https://', '').replace('http://', '').replace('/', '_') + '.md'
    with open(os.path.join(KNOWLEDGE_BASE_DIR, filename), 'w', encoding='utf-8') as f:
        f.write(markdown_content)
```
Maintaining Your Knowledge Base
Regular Maintenance Tasks
| Task | Frequency | Purpose |
|---|---|---|
| Content Updates | Monthly | Keep information current |
| Link Verification | Weekly | Ensure all links work |
| Duplicate Check | Monthly | Remove redundant content |
| Tag Review | Quarterly | Maintain consistent taxonomy |
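The duplicate check from the table can be approximated by hashing each file's body (front matter stripped) and flagging collisions. This is a rough sketch of that idea, not a built-in feature:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(kb_dir='./kb'):
    # Group Markdown files by a hash of their body text and report any collisions
    hashes = defaultdict(list)
    for root, _, files in os.walk(kb_dir):
        for name in files:
            if not name.endswith('.md'):
                continue
            path = os.path.join(root, name)
            with open(path, 'r', encoding='utf-8') as f:
                text = f.read()
            # Drop the YAML front matter so metadata differences don't hide duplicates
            parts = text.split('---', 2)
            body = parts[2] if len(parts) == 3 else text
            digest = hashlib.sha256(body.strip().encode('utf-8')).hexdigest()
            hashes[digest].append(path)
    return {digest: paths for digest, paths in hashes.items() if len(paths) > 1}
```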
Automated Health Check
```python
def audit_knowledge_base():
    # Walk the knowledge base and collect issues per file
    report = {}
    for root, _, files in os.walk(KNOWLEDGE_BASE_DIR):
        for file in files:
            if file.endswith('.md'):
                filepath = os.path.join(root, file)
                issues = check_file_health(filepath)
                if issues:
                    report[filepath] = issues
    return report


def check_file_health(filepath):
    # Read file content
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()

    # Check for common issues
    issues = []
    if '](broken-link)' in content:
        issues.append('Contains broken links')
    if len(content.split('\n\n')) < 3:
        issues.append('Content might be too condensed')
    return issues
```
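The weekly link verification can be automated in a similar spirit. The sketch below pulls external links out of a file with a simplified regex and sends a HEAD request to each; anything that errors or returns a 4xx/5xx status is reported.

```python
import re
import requests

LINK_PATTERN = re.compile(r'\[[^\]]*\]\((https?://[^)\s]+)\)')

def verify_links(filepath, timeout=10):
    # Return the external links in a Markdown file that fail or respond with an error status
    with open(filepath, 'r', encoding='utf-8') as f:
        urls = LINK_PATTERN.findall(f.read())
    broken = []
    for url in urls:
        try:
            response = requests.head(url, allow_redirects=True, timeout=timeout)
            if response.status_code >= 400:
                broken.append((url, response.status_code))
        except requests.exceptions.RequestException as e:
            broken.append((url, str(e)))
    return broken
```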
Final Thoughts
Building a robust Markdown knowledge base requires:
Consistent Structure
- Clear organization
- Regular formatting
- Predictable patterns
Quality Content
- Up-to-date information
- Well-documented code
- Comprehensive metadata
Regular Maintenance
- Scheduled reviews
- Automated checks
- Content updates
Pro Tip: Start small and iterate. Your knowledge base should evolve with your needs and grow organically over time.
Need help getting started? Check out our documentation or join our community.