RAG Construction
Build RAG systems for construction knowledge bases. Create searchable, AI-powered construction document systems.
# RAG Construction

## Overview

Based on DDC methodology (Chapter 2.3), this skill builds Retrieval-Augmented Generation (RAG) systems for construction knowledge bases, enabling semantic search and AI-powered question answering over construction documents.

Book Reference: "Pandas DataFrame и LLM ChatGPT" / "Pandas DataFrame and LLM ChatGPT"

## Quick Start

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Dict, Optional, Any, Tuple
from datetime import datetime
import hashlib
import re

class DocumentType(Enum):
    """Types of construction documents"""
    SPECIFICATION = "specification"
    DRAWING = "drawing"
    CONTRACT = "contract"
    RFI = "rfi"
    SUBMITTAL = "submittal"
    CHANGE_ORDER = "change_order"
    MEETING_MINUTES = "meeting_minutes"
    DAILY_REPORT = "daily_report"
    SAFETY_REPORT = "safety_report"
    INSPECTION = "inspection"
    MANUAL = "manual"
    STANDARD = "standard"


class ChunkingStrategy(Enum):
    """Text chunking strategies"""
    FIXED_SIZE = "fixed_size"
    PARAGRAPH = "paragraph"
    SECTION = "section"
    SEMANTIC = "semantic"
    SENTENCE = "sentence"

@dataclass
class DocumentChunk:
    """A chunk of document text"""
    id: str
    document_id: str
    content: str
    metadata: Dict[str, Any]
    embedding: Optional[List[float]] = None
    token_count: int = 0
    position: int = 0


@dataclass
class Document:
    """Construction document"""
    id: str
    title: str
    doc_type: DocumentType
    content: str
    source: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    chunks: List[DocumentChunk] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)


@dataclass
class SearchResult:
    """Search result from vector store"""
    chunk: DocumentChunk
    score: float
    document_title: str
    doc_type: DocumentType


@dataclass
class RAGResponse:
    """Response from RAG system"""
    query: str
    answer: str
    sources: List[SearchResult]
    confidence: float
    tokens_used: int

class TextChunker:
    """Split documents into chunks for embedding"""

    def __init__(
        self,
        strategy: ChunkingStrategy = ChunkingStrategy.PARAGRAPH,
        chunk_size: int = 500,
        chunk_overlap: int = 50
    ):
        self.strategy = strategy
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk_document(self, document: Document) -> List[DocumentChunk]:
        """Split document into chunks"""
        if self.strategy == ChunkingStrategy.FIXED_SIZE:
            return self._chunk_fixed_size(document)
        elif self.strategy == ChunkingStrategy.PARAGRAPH:
            return self._chunk_by_paragraph(document)
        elif self.strategy == ChunkingStrategy.SECTION:
            return self._chunk_by_section(document)
        elif self.strategy == ChunkingStrategy.SENTENCE:
            return self._chunk_by_sentence(document)
        else:
            return self._chunk_fixed_size(document)

    def _chunk_fixed_size(self, document: Document) -> List[DocumentChunk]:
        """Chunk by fixed character size with overlap"""
        chunks = []
        text = document.content
        start = 0
        position = 0
        while start < len(text):
            end = start + self.chunk_size
            # Back up to the nearest word boundary
            if end < len(text):
                while end > start and text[end] not in ' \n\t':
                    end -= 1
                # No whitespace in the window: fall back to a hard cut
                if end == start:
                    end = start + self.chunk_size
            chunk_text = text[start:end].strip()
            if chunk_text:
                chunk_id = self._generate_chunk_id(document.id, position)
                chunks.append(DocumentChunk(
                    id=chunk_id,
                    document_id=document.id,
                    content=chunk_text,
                    metadata={
                        "doc_type": document.doc_type.value,
                        "title": document.title,
                        **document.metadata
                    },
                    token_count=len(chunk_text.split()),
                    position=position
                ))
                position += 1
            # Advance with overlap, always moving forward to avoid looping
            start = max(end - self.chunk_overlap, start + 1)
        return chunks

    def _chunk_by_paragraph(self, document: Document) -> List[DocumentChunk]:
        """Chunk by paragraphs"""
        chunks = []
        paragraphs = document.content.split('\n\n')
        current_chunk = ""
        position = 0
        for para in paragraphs:
            para = para.strip()
            if not para:
                continue
            if len(current_chunk) + len(para) < self.chunk_size:
                current_chunk += "\n\n" + para if current_chunk else para
            else:
                if current_chunk:
                    chunk_id = self._generate_chunk_id(document.id, position)
                    chunks.append(DocumentChunk(
                        id=chunk_id,
                        document_id=document.id,
                        content=current_chunk,
                        metadata={
                            "doc_type": document.doc_type.value,
                            "title": document.title,
                            **document.metadata
                        },
                        token_count=len(current_chunk.split()),
                        position=position
                    ))
                    position += 1
                current_chunk = para
        # Add remaining content
        if current_chunk:
            chunk_id = self._generate_chunk_id(document.id, position)
            chunks.append(DocumentChunk(
                id=chunk_id,
                document_id=document.id,
                content=current_chunk,
                metadata={
                    "doc_type": document.doc_type.value,
                    "title": document.title,
                    **document.metadata
                },
                token_count=len(current_chunk.split()),
                position=position
            ))
        return chunks

    def _chunk_by_section(self, document: Document) -> List[DocumentChunk]:
        """Chunk by document sections (headers)"""
        # Split on common section header patterns
        section_pattern = r'\n(?=(?:\d+\.|\d+\s|SECTION|ARTICLE|PART)\s+[A-Z])'
        sections = re.split(section_pattern, document.content)
        chunks = []
        for position, section in enumerate(sections):
            section = section.strip()
            if section:
                # If the section is too large, split it further by paragraph
                if len(section) > self.chunk_size * 2:
                    sub_chunker = TextChunker(ChunkingStrategy.PARAGRAPH, self.chunk_size)
                    sub_doc = Document(
                        id=f"{document.id}_sec{position}",
                        title=document.title,
                        doc_type=document.doc_type,
                        content=section,
                        source=document.source,
                        metadata=document.metadata
                    )
                    sub_chunks = sub_chunker.chunk_document(sub_doc)
                    for i, chunk in enumerate(sub_chunks):
                        chunk.id = self._generate_chunk_id(document.id, position * 100 + i)
                        chunk.position = position * 100 + i
                    chunks.extend(sub_chunks)
                else:
                    chunk_id = self._generate_chunk_id(document.id, position)
                    chunks.append(DocumentChunk(
                        id=chunk_id,
                        document_id=document.id,
                        content=section,
                        metadata={
                            "doc_type": document.doc_type.value,
                            "title": document.title,
                            **document.metadata
                        },
                        token_count=len(section.split()),
                        position=position
                    ))
        return chunks

    def _chunk_by_sentence(self, document: Document) -> List[DocumentChunk]:
        """Chunk by sentences, grouping to meet size requirements"""
        # Simple sentence splitting
        sentences = re.split(r'(?<=[.!?])\s+', document.content)
        chunks = []
        current_chunk = ""
        position = 0
        for sentence in sentences:
            if len(current_chunk) + len(sentence) < self.chunk_size:
                current_chunk += " " + sentence if current_chunk else sentence
            else:
                if current_chunk:
                    chunk_id = self._generate_chunk_id(document.id, position)
                    chunks.append(DocumentChunk(
                        id=chunk_id,
                        document_id=document.id,
                        content=current_chunk.strip(),
                        metadata={
                            "doc_type": document.doc_type.value,
                            "title": document.title,
                            **document.metadata
                        },
                        token_count=len(current_chunk.split()),
                        position=position
                    ))
                    position += 1
                current_chunk = sentence
        if current_chunk:
            chunk_id = self._generate_chunk_id(document.id, position)
            chunks.append(DocumentChunk(
                id=chunk_id,
                document_id=document.id,
                content=current_chunk.strip(),
                metadata={
                    "doc_type": document.doc_type.value,
                    "title": document.title,
                    **document.metadata
                },
                token_count=len(current_chunk.split()),
                position=position
            ))
        return chunks

    def _generate_chunk_id(self, doc_id: str, position: int) -> str:
        """Generate a unique chunk ID"""
        return hashlib.md5(f"{doc_id}_{position}".encode()).hexdigest()[:12]

class VectorStore:
    """Simple in-memory vector store for RAG"""

    def __init__(self):
        self.chunks: Dict[str, DocumentChunk] = {}
        self.embeddings: Dict[str, List[float]] = {}

    def add_chunks(self, chunks: List[DocumentChunk]):
        """Add chunks to the store"""
        for chunk in chunks:
            self.chunks[chunk.id] = chunk
            if chunk.embedding:
                self.embeddings[chunk.id] = chunk.embedding

    def search(
        self,
        query_embedding: List[float],
        top_k: int = 5,
        filter_metadata: Optional[Dict] = None
    ) -> List[Tuple[DocumentChunk, float]]:
        """Search for similar chunks"""
        results = []
        for chunk_id, chunk in self.chunks.items():
            # Apply metadata filter
            if filter_metadata:
                match = all(
                    chunk.metadata.get(k) == v
                    for k, v in filter_metadata.items()
                )
                if not match:
                    continue
            # Score by cosine similarity against the query embedding
            if chunk_id in self.embeddings:
                score = self._cosine_similarity(query_embedding, self.embeddings[chunk_id])
                results.append((chunk, score))
        # Sort by score descending
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:top_k]

    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
        """Calculate cosine similarity between two vectors"""
        if len(a) != len(b):
            return 0.0
        dot_product = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot_product / (norm_a * norm_b)

    def get_stats(self) -> Dict:
        """Get store statistics"""
        doc_types = {}
        for chunk in self.chunks.values():
            doc_type = chunk.metadata.get("doc_type", "unknown")
            doc_types[doc_type] = doc_types.get(doc_type, 0) + 1
        return {
            "total_chunks": len(self.chunks),
            "chunks_with_embeddings": len(self.embeddings),
            "chunks_by_type": doc_types
        }

class EmbeddingModel:
    """Simulated embedding model (replace with an actual model in production)"""

    def __init__(self, model_name: str = "text-embedding-ada-002"):
        self.model_name = model_name
        self.dimension = 1536

    def embed(self, text: str) -> List[float]:
        """Generate embedding for text"""
        # Simulation: generate a deterministic embedding from the text hash
        text_hash = hashlib.sha256(text.encode()).digest()
        embedding = []
        for i in range(self.dimension):
            byte_idx = i % len(text_hash)
            embedding.append((text_hash[byte_idx] - 128) / 128.0)
        return embedding

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for multiple texts"""
        return [self.embed(text) for text in texts]

class ConstructionRAG:
    """
    RAG system for construction knowledge bases.
    Based on DDC methodology Chapter 2.3.
    """

    def __init__(
        self,
        embedding_model: Optional[EmbeddingModel] = None,
        chunking_strategy: ChunkingStrategy = ChunkingStrategy.PARAGRAPH,
        chunk_size: int = 500
    ):
        self.embedding_model = embedding_model or EmbeddingModel()
        self.chunker = TextChunker(chunking_strategy, chunk_size)
        self.vector_store = VectorStore()
        self.documents: Dict[str, Document] = {}

    def add_document(self, document: Document) -> int:
        """
        Add a document to the knowledge base.

        Args:
            document: Document to add

        Returns:
            Number of chunks created
        """
        # Store document
        self.documents[document.id] = document
        # Chunk document
        chunks = self.chunker.chunk_document(document)
        # Generate embeddings
        for chunk in chunks:
            chunk.embedding = self.embedding_model.embed(chunk.content)
        # Add to vector store
        self.vector_store.add_chunks(chunks)
        # Update document with chunks
        document.chunks = chunks
        return len(chunks)

    def add_documents(self, documents: List[Document]) -> Dict[str, int]:
        """Add multiple documents"""
        results = {}
        for doc in documents:
            results[doc.id] = self.add_document(doc)
        return results

    def search(
        self,
        query: str,
        top_k: int = 5,
        doc_type: Optional[DocumentType] = None
    ) -> List[SearchResult]:
        """
        Search the knowledge base.

        Args:
            query: Search query
            top_k: Number of results to return
            doc_type: Filter by document type

        Returns:
            List of search results
        """
        # Generate query embedding
        query_embedding = self.embedding_model.embed(query)
        # Build filter
        filter_metadata = None
        if doc_type:
            filter_metadata = {"doc_type": doc_type.value}
        # Search vector store
        results = self.vector_store.search(
            query_embedding,
            top_k=top_k,
            filter_metadata=filter_metadata
        )
        # Build search results
        search_results = []
        for chunk, score in results:
            doc = self.documents.get(chunk.document_id)
            search_results.append(SearchResult(
                chunk=chunk,
                score=score,
                document_title=doc.title if doc else "Unknown",
                doc_type=doc.doc_type if doc else DocumentType.MANUAL
            ))
        return search_results

    def query(
        self,
        question: str,
        top_k: int = 5,
        doc_type: Optional[DocumentType] = None
    ) -> RAGResponse:
        """
        Answer a question using RAG.

        Args:
            question: Question to answer
            top_k: Number of context chunks to use
            doc_type: Filter by document type

        Returns:
            RAG response with answer and sources
        """
        # Search for relevant context
        search_results = self.search(question, top_k=top_k, doc_type=doc_type)
        if not search_results:
            return RAGResponse(
                query=question,
                answer="I couldn't find relevant information to answer this question.",
                sources=[],
                confidence=0.0,
                tokens_used=0
            )
        # Build context from search results
        context_parts = []
        for i, result in enumerate(search_results):
            context_parts.append(
                f"[Source {i+1}: {result.document_title}]\n{result.chunk.content}"
            )
        context = "\n\n".join(context_parts)
        # Generate answer (simulated - in production, call an LLM)
        answer = self._generate_answer(question, context, search_results)
        # Calculate confidence
        avg_score = sum(r.score for r in search_results) / len(search_results)
        return RAGResponse(
            query=question,
            answer=answer,
            sources=search_results,
            confidence=avg_score,
            tokens_used=len(context.split()) + len(question.split())
        )

    def _generate_answer(
        self,
        question: str,
        context: str,
        sources: List[SearchResult]
    ) -> str:
        """
        Generate answer from context.
        In production, this would call an LLM API.
        """
        # Simulated answer generation
        answer_parts = [
            "Based on the available construction documentation:\n"
        ]
        # Extract key information from sources
        for source in sources[:3]:
            # Take the first sentence of each relevant chunk
            first_sentence = source.chunk.content.split('.')[0] + '.'
            answer_parts.append(f"- {first_sentence}")
        answer_parts.append(
            f"\n\nThis information comes from {len(sources)} source documents "
            f"including: {', '.join(set(s.document_title for s in sources[:3]))}."
        )
        return "\n".join(answer_parts)

    def get_document_summary(self, document_id: str) -> Optional[Dict]:
        """Get summary of a document"""
        doc = self.documents.get(document_id)
        if not doc:
            return None
        return {
            "id": doc.id,
            "title": doc.title,
            "type": doc.doc_type.value,
            "chunks": len(doc.chunks),
            "total_tokens": sum(c.token_count for c in doc.chunks),
            "source": doc.source,
            "created_at": doc.created_at.isoformat()
        }

    def get_stats(self) -> Dict:
        """Get system statistics"""
        return {
            "total_documents": len(self.documents),
            "vector_store": self.vector_store.get_stats(),
            "embedding_model": self.embedding_model.model_name,
            "chunking_strategy": self.chunker.strategy.value
        }

    def export_knowledge_base(self) -> Dict:
        """Export knowledge base for backup/transfer"""
        return {
            "documents": [
                {
                    "id": doc.id,
                    "title": doc.title,
                    "type": doc.doc_type.value,
                    "content": doc.content,
                    "source": doc.source,
                    "metadata": doc.metadata
                }
                for doc in self.documents.values()
            ],
            "stats": self.get_stats(),
            "exported_at": datetime.now().isoformat()
        }
```
## Common Use Cases

### Build Construction Knowledge Base

```python
rag = ConstructionRAG(
    chunking_strategy=ChunkingStrategy.SECTION,
    chunk_size=500
)

# Add specifications
spec_doc = Document(
    id="spec-03300",
    title="Cast-in-Place Concrete Specification",
    doc_type=DocumentType.SPECIFICATION,
    content="""
SECTION 03 30 00 - CAST-IN-PLACE CONCRETE
PART 1 - GENERAL
1.1 SUMMARY
A. Section includes cast-in-place concrete for foundations,
slabs, walls, and other structural elements.
1.2 RELATED SECTIONS
A. Section 03 10 00 - Concrete Forming
B. Section 03 20 00 - Concrete Reinforcing
PART 2 - PRODUCTS
2.1 CONCRETE MATERIALS
A. Portland Cement: ASTM C150, Type I or II
B. Aggregates: ASTM C33, graded
C. Water: Clean, potable
""",
    source="project_specs.pdf",
    metadata={"division": "03", "project": "Building A"}
)

chunks_created = rag.add_document(spec_doc)
print(f"Created {chunks_created} chunks")
```
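
### Compare Chunking Strategies

Chunk granularity drives what the retriever can return, so it is worth comparing strategies on a representative document before committing to one. A minimal sketch, reusing `spec_doc` from the example above; the `chunk_size` and `chunk_overlap` values are illustrative and should be tuned per document set:

```python
# Chunk the same specification with each strategy and compare the results
for strategy in (
    ChunkingStrategy.FIXED_SIZE,
    ChunkingStrategy.PARAGRAPH,
    ChunkingStrategy.SECTION,
    ChunkingStrategy.SENTENCE,
):
    chunker = TextChunker(strategy=strategy, chunk_size=500, chunk_overlap=50)
    chunks = chunker.chunk_document(spec_doc)
    avg_tokens = sum(c.token_count for c in chunks) / len(chunks) if chunks else 0
    print(f"{strategy.value:>10}: {len(chunks)} chunks, ~{avg_tokens:.0f} tokens each")
```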
### Search Knowledge Base

```python
# Search for concrete requirements
results = rag.search(
    query="concrete strength requirements",
    top_k=5,
    doc_type=DocumentType.SPECIFICATION
)

for result in results:
    print(f"Score: {result.score:.3f}")
    print(f"Document: {result.document_title}")
    print(f"Content: {result.chunk.content[:200]}...")
    print()
```
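
### Inspect the Knowledge Base

`get_stats()` and `get_document_summary()` report what has actually been indexed, which is a quick way to verify ingestion before relying on query results:

```python
# Overall system statistics
stats = rag.get_stats()
print(f"Documents: {stats['total_documents']}")
print(f"Chunks:    {stats['vector_store']['total_chunks']}")
print(f"By type:   {stats['vector_store']['chunks_by_type']}")

# Per-document summary (uses the document ID assigned at ingestion)
summary = rag.get_document_summary("spec-03300")
if summary:
    print(f"{summary['title']}: {summary['chunks']} chunks, "
          f"{summary['total_tokens']} tokens")
```

`export_knowledge_base()` returns the same statistics plus full document contents as a plain dictionary, suitable for serializing to JSON for backup or transfer.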
### Answer Questions with RAG

```python
response = rag.query(
    question="What type of cement should be used for foundations?",
    top_k=3
)

print(f"Answer: {response.answer}")
print(f"Confidence: {response.confidence:.0%}")
print(f"Sources: {len(response.sources)}")
```
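
### Swap in a Real Embedding Model

The bundled `EmbeddingModel` returns deterministic placeholder vectors, so retrieval quality is only meaningful once a real model is plugged in. A minimal sketch, assuming the `openai` Python SDK (v1+) is installed and `OPENAI_API_KEY` is set; adapt the client call to whichever embedding provider you actually use:

```python
from openai import OpenAI  # assumed dependency, not part of this skill


class OpenAIEmbeddingModel(EmbeddingModel):
    """Replace the simulated embeddings with a hosted embedding model."""

    def __init__(self, model_name: str = "text-embedding-3-small"):
        super().__init__(model_name=model_name)
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(self, text: str) -> List[float]:
        response = self.client.embeddings.create(
            model=self.model_name,
            input=text,
        )
        return response.data[0].embedding


rag = ConstructionRAG(
    embedding_model=OpenAIEmbeddingModel(),
    chunking_strategy=ChunkingStrategy.SECTION,
)
```

Answer generation is the other simulated piece: as the source comments note, `_generate_answer` would be replaced the same way, passing the retrieved context to a chat model instead of stitching together first sentences.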
## Quick Reference

| Component | Purpose |
|-----------|---------|
| `ConstructionRAG` | Main RAG system |
| `TextChunker` | Document chunking |
| `VectorStore` | Embedding storage |
| `EmbeddingModel` | Text embeddings |
| `DocumentChunk` | Chunk with metadata |
| `RAGResponse` | Query response |
## Resources

- Book: "Data-Driven Construction" by Artem Boiko, Chapter 2.3
- Website: https://datadrivenconstruction.io

## Next Steps

- Use llm-data-automation for automation
- Use vector-search for advanced search
- Use document-classification-nlp for classification
Use Cases
- Build retrieval-augmented generation (RAG) systems for knowledge-grounded AI
- Index and search document collections for relevant context retrieval
- Construct vector databases and embedding pipelines for semantic search
- Configure chunking, embedding, and retrieval strategies for RAG applications
- Integrate RAG capabilities into existing AI agent workflows
Pros & Cons
Pros
- Well-adopted, with 1,027+ downloads showing reliable real-world usage
- Leverages AI models for intelligent automation beyond simple rule-based tools
- Configurable parameters allow tuning for different quality and cost tradeoffs
Cons
- Depends on external AI model APIs, which may incur usage costs
- Output quality varies based on input specificity and model capabilities
FAQ
What does RAG Construction do?
Build RAG systems for construction knowledge bases. Create searchable, AI-powered construction document systems.
What platforms support RAG Construction?
RAG Construction is available on Claude Code, OpenClaw.
What are the use cases for RAG Construction?
Build retrieval-augmented generation (RAG) systems for knowledge-grounded AI. Index and search document collections for relevant context retrieval. Construct vector databases and embedding pipelines for semantic search.