Building a GraphRAG Agent From The Ground Up

As I’ve been diving more and more into Claude Code and AI the more I realize how much I have to learn around the systems and infrastructure that power them.

One piece of infrastructure that has always piqued my interest are graph databases and using them for RAG applications (AKA GraphRAG). Reason being is that not only do they allow for more advanced querying but, as an SEO, it’s a concept that has had quite the buzz since 2012 thanks to Google.

Claude Code Graph (CCGraph)

So what did I build?

Well, it all started thanks to a conversation with Noah Learner on discovering new and useful Claude Code repos. Something that can be both overwhelming (thanks to the +5,000 repos on GitHub that use the topic Claude Code) and is highly personalized based on your workflow.

After staring helplessly into the void for about 5 minutes on how to do this, I started poking around GitHub’s public API for repos with Claude Code as the topic and realized this was the perfect pet project to learn to build an agentic GraphRAG app end to end.

And 6 days of intermittent work later Claude Code Graph (CCGraph) was born!

The Journey

Now that we know what, let’s talk about how.

Prior to this my only experience working with Claude Code were 2 tiny test projects (a personal finance landing page quiz and a proof of concept transcription pipeline). Thankfully because of those two projects I already had a bit of a feel for how to approach development and previously created v1.0 of Claude Code Starter which I used as the starting point for CCGraph.

Milestone 0: Understanding Graph Databases and GraphRAG

Before even starting development I did a deep dive into really understanding what graph databases and GraphRAG is and how they work.

This consisted of having long in-depth multi-day conversations with both Gemini and Claude, watching countless YouTube videos, and going through articles and documentation.

Why do this rather than jumping in blindly?

To prevent brain rot by guiding strategy while Claude lead the execution
I actually want to understand the underlying technology (not enough people do this!)
Claude does some really stupid shit and will gaslight you into allowing it to do dumb things

If you’re looking to learn more about graph databases and graphRAG, these resources came in handy:

Neo4j’s graph academy, blog, resources, and YouTube channel
AI Engineer’s YouTube channel
Cole Medin’s YouTube channel

Deciding on Graph Structure

Based on the learnings from my deep dive and brainstorming with Claude on how users will interact with the app, I decided on the following structure:

4 types of nodes would be used which consisted of:

Author - The person who owns the repository
Repository - The repository itself
Topic - Any tagged topic that a repository may have
Section - A chunked readable and vector embedded version of the README.md and Claude.md

The edges (the way the nodes are connected to each other) would be:

References - Relationship between a repo and another repo when it is mentioned by name
Owned By - Relationship between the author and repo
Has Topic - Relationship between a topic and repo
Has Section - Relationship between the repo and the README.md/Claude.md
Similar To - Relationship between two README.md/Claude.md chunks
Related Topic - Relationship between two topics

With both the edges and nodes now mapped out, the LLM can now easily traverse through the graph allowing it to find repos and topics that are connected or similar with deep context for the user.

Milestone 1: Initial App & Data Ingestions

The first step was collaborating with Claude on the initial PRD.

I personally like taking a collaborative approach with the PRD development by:

Explaining in detail what I want to build, the technology that should be used for it, and what I want the users to learn/get value from
Go back and forth with Claude asking and answering questions about the backend logic, user experience, and any other items
Having Claude write an initial PRD and then creating a 2nd session within the same folder and asking it to “unapologetically and ruthlessly to find edge cases, potential bugs, and items that lack clarity”
Ruthless Claude typically finds 15-20 items that need to be improved which I’ll then go item by item with it to collaborate on finding a solution to
You can take this a step further by having multiple session of Ruthless Claude rereview the updated PRD or even have a separate LLM (ChatGPT or Gemini) review it
Once I’m happy with the solutions Ruthless Claude will then update the PRD and I’ll have the original Claude start the build out process turning the refined PRD into an epic and individual issues
After everything’s been pushed to GitHub I then let Claude Code start development in parallel and launching new agents whenever a new issue becomes unblocked

The initial PRD for CCGraph was focused on creating the foundational elements of the app which included:

Backend server settings and UI
Data ingestion pipeline
Loading the data into Neo4J
Outlining the chat proxy
Developing error handling, security, and testing

The initial infrastructure used:

Firebase for hosting
Neo4J as the graph database
Vue for the framework
Tailwind CSS to make it look pretty

Now that the easy part was done, it was time to start refining the chat responses.

Milestone 2: GraphRAG - Prompt Workflows

The first iteration of CCGraph used a prompt workflow which looked like:

User sends a query
Neo4j is queried based on user’s query
Response is then put into a system prompt
LLM responds back to the user’s query

Though it wasn’t perfect it kinda got the job done… not really…

What I found is that the responses lacked relevancy and context to what the user was actually trying to learn based on their query. The response was just spewing out repos willy nilly.

Three big things had to be changed:

The user’s query had to be categorized to the app could figure out how to use Neo4j
Initially Claude did this by regex but, we quickly migrated to using an LLM (re: Claude doing dumb things)
Structure for the Neo4j requests had to be implemented (e.g. sorting, filtering, aggregating)
A relevancy score was developed for repos based on what the user was asking blending section level and repo level cosine similarity

After a lot of back and forth and testing prompts the workflow then moved to:

User sends a query
LLM categorizes the query
Based on the category Neo4J is queried
Neo4j response is put into a system prompt
LLM responds back to the user’s query

Now we’re starting to get somewhere!

Responses started getting a heck of a lot closer to what you’d expect as a user but, I really wanted to understand what was happening behind the scenes so that I could continue testing and tweaking prompts.

And so came along…

Milestone 3: Tracing & Observability

Not only was adding observability and tracing important from an ops and refinement perspective but, knowing that down the road I plan on making production level AI apps this was something that I knew would be key to understand.

Eventually, I came across Langfuse and not only has it been heavily tested by others, has other key features (prompt management, A/B testing, evals, and a lot more) but, it’s open source and can easily be deployed on Elestio.

Implementation wise it was a super easy one prompt ask with Claude Code simply asking:

*“I want to implement Langfuse’s observability and tracing (https://langfuse.com/docs/observability/overview) into the app for all prompts using the JS SDK (https://github.com/langfuse/langfuse-js)*”

And off it went quickly implementing Langfuse in a couple of short minutes only asking me for the API key along the way.

As I continued to test the prompts more and more the responses still felt a bit off and seemed to lack deep context and understanding of what was being asked.

Digging more and more into the issue and going through resources I then naively thought “Let’s move this over to an agent, I’m sure it’ll be super easy…”

…It was not

Prompt Workflows vs Agents

In case you aren’t aware there’s a major difference between prompt workflows and agents.

On one hand, prompt workflows are like assembly lines and quite linear where it takes an input (user query) and then goes through a pre-determined sequence of steps until there is a response back to the user.

Agents on the other hand take a non-linear problem solving approach where you feed it the user’s query and provide it with tools (e.g. pre-built database queries or 3rd party API end points) to respond back to the user in a dynamic way.

If you want to learn more about the pros and cons of both, Confluent has a great quick guide on them.

Milestone 4: Agent Overhaul

Though, ultimately, it was the right decision (not just from a response quality standpoint but, for learning) it required an entire architectural overhaul and a long time refining the prompt and tools used by the agent.

Eventually, I landed on using LangGraph for the agent’s framework which (not so) coincidentally has a very tidy integration with LangFuse.

After a lot of going back and forth with Claude Code and some hand holding the migration over to an agent was complete but…

…The responses from it were hot flaming trash

Digging into the traces from Langfuse I realized two major things:

I wasn’t providing enough tools for the agent to properly traverse the graph
The agent prompt was garbage

Building Out a Toolset

Initially the agent had two tools available to it;

Vector search which used the previously implemented blended score system to help answer conceptual and broader questions.

Structured query that allowed the agent to search the graph for rankings, stats, filtering, and relationship traversal.

After going back and forth with Claude Code, “Rutheless Claude”, and Gemini we came up with a library of queries that a user may ask. Based on that we identified a whole slew of gaps within the current tools.

The structured query tool was expanded to allow for 4 additional intents for pulling from the graph database which included references, similar repos, and related topics.

README.md and Claude.md tools were added to allow users to do deep dives into how to use a specific repo and the workflows around it.

Improving The Agent Prompt

As embarrassingly shown above, the initial prompt had a LOT of issues…

Diving head first into the problem again with 3 of my favourite friends (Claude Code, “Rutheless Claude”, and Gemini) we slowly refined and tested the prompt to get to a place where the responses were much more useful.

A few key changes that were made included:

Providing a strategy (chain of thought) on how to approach a user’s query
Describing each tool in greater detail along with providing examples of questions that were labeled
Details on how to choose which tool to use and when
Guidelines on how to provide the responded to the user in a helpful way
Forcing no hallucinations
Important rules that should always be followed

After rolling out both the new and improved tools along with a shiny new prompt the agent was finally starting to be useful for users (at least for me)

Adding Response Feedback

Though I’m happy with how the agent is performing it’s still a long way from perfect.

To help with this and to get more experience with production ready AI apps, scoring and feedback was added. This allows users to either give the agent’s response a thumbs up or down along with a reason why the response was bad.

The results from the feedback are collected in LangFuse which can be used to both debug the agent and be used as evals for future tests.

Milestone 5: UX Overhaul & Graph View

The next step was polishing the UX of the app and making it look half decent.

Severely lacking any sense of design and creativity, I fed Pencil my personal website (both URL and screenshots) as a baseline and had it build out a design system that was practically copy and pasted into CCGraph.

Building the Graph Visualization

Now that I had a half decent look the next step was to build out a graph visualization to allow users to visualize repos, authors, and topics and explore them further.

Using D3 as the library, the first response from the LLM was integrated into the main chat interface as a side bar.

The first initial graphs contained a lot of orphan nodes regardless of type (repo, author, and tag) which required going back and forth with Claude on identifying the reason why and creating a set of conditions on how each node should be displayed and interacted with.

Ultimately, I ended up with a graph visualization where users can click into each node and find repos that fit within a certain topic or author, pre-populate a query about a specific repo, or even visit a repo’s GitHub page.

Milestone 6: Deployment

With the development done it was now time to deploy on Firebase and it could not have been easier.

Thankfully Claude Code makes it super simple to do it and takes you by the hand walking you step by step on what it’s doing and what it needs from you along the way.

Learnings and What’s Next

Though it’s not the sexiest and most useful of apps, unlike other ones that I’ve seen rolled out, it was the perfect excuse to learn new technology and get more comfortable with Claude Code.

With that being said, a few key learnings that I had from my journey include:

If you have garbage responses, it’s cause you have garbage prompts
Create a ./tmp directory that doesn’t get included in git (include it in .gitignore) to have Claude save mini-PRDs, prompts, and other items that you can then edit or feed into other LLMs/sessions for feedback on
Workflows and agents are not the same and the way that you prompt them and feed them context need to be approached differently
If you want to integrate a tool or library include links directly to the documentation and repos to make Claude’s life easier
Take time to research, learn, and plan what you want to do rather than blindly trusting in Claude
Observability is KEY when building AI powered applications
Graph databases and GraphRAG comes off as more intimidating than it actually is

As far as what’s next, CCGraph will stay live and occasional updates will be rolled out based on insights from Langfuse and any crazy scientist ideas I may get.

On my side it’s a few things:

Using the learnings on graph databases and graphRAG to build out production grade applications that drive enterprise value
Diving deeper into Claude Code by exploring tools like Beads, Claude Flow, Agentic Flywheel, and Gas Town to name a few

Have thoughts? Leave a comment below