ask.dalanmendonca.com is now live! Please message me if you want to play with it.
What is it? It’s an AI chatbot designed to mimic my voice. It uses a llama 3.1 8b model fine-tuned on my notes. However, since booting the model takes time, the first five responses are provided by a base llama 3.1 model.
Motivation
My two main motivations were curiosity and professional development.
AI is a new kind of material and the goal was to get my hands dirty. I wanted to experience the nuts and bolts of how AI works. As someone well versed with cloud infrastructure, I wanted to train and host this model from scratch to understand the technical side of things, rather than use some SaaS tool. That said; I was in maker mode and optimizing for speed of shipping. I wanted a demo-able v1 not a Mona Lisa, rough edges were ok.
The second motivation was professional development FOMO, and not getting left behind; I wanted to be up to date with what’s happening in the AI ecosystem.
I’ve been following Luke Wrobelski who has been writing about his AI explorations. His blog posts on AI are worth reading but note they are not generic tutorials, they are written from the PoV of a designer exploring AI. His Ask Luke bot inspired me to build my own.
Lessons
- It’s actually easy?! It’s very tedious and long, but it’s easy. Maybe not for everybody but for someone who’s been working on cloud infrastructure for a while, the concepts fit into my existing mental models and hoops were tolerable. Maybe I’m underestimating the problem given that I used Claude and I don’t have customers/users wanting accuracy?
- The main thing to understand is embeddings; the embedding is the word (“token”) to number mapping. The model does some fancy math and number juggling to predict the next token. The model only deals with numbers. It does not know what entity (word, pixel, sound) that number represents.
- Fine-tuning a model is feeding it desired Q&A pairs You don’t actually use the sentences or content from your corpus raw; you create Q&A pairs from them for fine-tuning.
- It’s actually a search and data pipeline problem First you realize you are running a masked search engine where results are processed and an answer is generated. Then you realize that a system doing fancy math on data is impacted by … the data! It’s structure, its quality, it’s topic diversity, its completeness, etc. Soon my thoughts went to how I can generate better data for my bot, cover topics not in my blogs, etc.
- A personal chatbot is about guardrails and fallback There are topics I wouldn’t want to talk about or have strong opinions on, it is important to setup guardrails via prompts or judges to make sure responses are in-line with expectations.
- Privacy fails by default While I was excited to throw my corpus of notes to build a good chatbot, I was (half) surprised to see the bot leaking personal conversations because they were present in the vector dataset!
- There are lots of knobs you can tweak to impact the results From vector search result count, to chunk size, to the fabled temperature, etc. there are so many parameters that can be tweaked to optimize results.
- GPUs are in short supply Even if you’re willing to pay for them, you might run into availability issues. Of course, this was also because I was using “on-demand” GPUs. I’m sure if I did some pod/VM reservations I might have a better chance.
- You can get far with a system prompt for personality This is my indirect way of saying that I’m not a unique snowflake and my whole life and personality can be distilled into a modest system prompt for Llama :D This is also my direct way of saying that the base Llama 3.1 provides good enough responses to most questions.
- This whole exercise was quite cheap I think I must’ve spent about $15-20 max. Of course, this discounts that I have a mini PC with an integrated GPU for basic local testing and had a $20/month Claude Pro subscription throughout. But still. While doing inference on Open source models like Llama and DeepSeek, I was shocked after doing my tests how my credits largely remained untouched. All companies are providing free resources and credits as well. I don’t think this is purely bubble math; even SaaS vendors who’ve had decades to track costs continue to provide some free resources/credits to hobbyists.
- It was fun :) It was really enjoyable doing this even if I was in “script kiddie” mode most of the time. It worked in the end and I’m so happy!
How it was built
I used Gemini to create a 7 phase plan that I largely stuck to. I prompted Gemini to give me a plan that would use my skills as a Cloud infrastructure PM, and be technical. Also prompted it to list key concepts I am learning/utilizing in every phase.
Below are the phase-wise notes, for those who want the gory details. This is not a tutorial you can follow step-by-step. Just my observations.
| Phase + Name | Goal | Notes & Learning |
|---|---|---|
| Phase 0: Exploratory data analysis | Make clusters out of notes to see what’s in there and get a sense of frequently mentioned and see their distribution in vector space | My original dataset of blogs and notes was 559 documents. We used K-means clustering to explore the notes. It produced some okayish visualizations but no major insights really. That’s likely more to do with my lack of interpretive knowledge. We used a simple scoring mechanism to rank notes and filter down to a dataset of 283 “bright” notes. I don’t know why Claude chose that word. The remaining notes were too old or too short. |
| Phase 1: Data Engineering | Transform personal archives into a structured, high-signal instruction dataset | So I imagined training meant like “boiling” my notes in a big pot so it gets absorbed in a larger model. It was nothing of that sort. What I actually did was use my notes to create Question + Answer pairs. These pairs were generated by an LLM but based on the notes. This technique is called “Self-instruct”. After deduplication, these generated Q&A pairs were put into the “ShareGPT” format that fine-tuning tools like Unsloth accept. |
| Phase 2: Fine-Tuning | Train a 7B-8B model to “speak Dalan” using Parameter-Efficient Fine-Tuning (PEFT) | Fine-tuned Llama 3.1 8B using QLoRA on a rented RunPod GPU. The output is a LoRA adapter (the “personality layer”) and a GGUF file for local testing. This phase is mostly done in a cloud Jupyter notebook attached to your RunPod Pod/GPU. This just cost $2 or so. Ran into storage issues and floating point precision issues a lot. |
| Phase 3: RAG & Evaluation | Give the model “perfect memory” of your notes and a way to measure its accuracy | Setup a local RAG pipeline using ChromaDB. Overall flow was: Query -> Embeddings -> Fetch related 3 notes -> Add to prompt -> Send to Fine-tuned model -> Get response. This concept of searching via embeddings vs the typical search via keyword was new to me; but seems sensible once you think about it. We setup evals using promptfoo to compare Base Llama vs Fine-tuned vs Fine-Tuned + RAG. Funnily, Base LLama was beating the others initially but this was because of bad rubric design that rewarded ignorance (just saying “I don’t know”). That and some other tuning eventually made RAG the best setup. Here too, retrieving 5 vs 3 notes can make a big difference. Adding a “Bio” page with a lot of quick facts about me, actually helped improve the responses as well; turns out I haven’t put such facts in my blog posts! |
| Phase 4: Web deployment | Launch the bot as a live “island” on dalanmendonca.com | Initially I thought this would be an “island” on my existing Astro-based website but for code-purity and not wanting to mix the innards of the two projects; I decided to separate it out. ask.dalanmendonca.com was chosen as the domain. It is a Next.JS app on Vercel. Pinecone is the vector DB. Inference was planned as a custom fine-tuned model running on FireworksAI but we ran into GPU shortages (amongst several other headaches). A big headache was that Fireworks only accepts the SafeTensors format while my scripts were producing a GGUF which is more suitable for local inference. For simplicity and moving on, I swallowed the bitter pill of using just base Llama on OpenRouter (with a system prompt for personality) as the base model to continue the work and figure out deploying the fine-tuned model later. I did make backend swappable so we can change backends easily later. Added Authentication using simple access tokens stored in Upstach . This is to prevent random bots from finding this endpoint and milking it for tokens. For looks I went with a Terminal-esque aesthetic that seems to be trending and fits text interaction well. Went with a green terminal inspired by The Matrix. |
| Phase 5: Observability | Use professional-grade tracing to monitor costs, latency, and quality | Used Langfuse for tracing application interactions. Claude actually tripped up a lot here unfortunately, though this was one of the most straightforward bits of the project. Once the Langfuse SDK was installed and wired up, you get a log line, a trace, and some metrics (latency, cost, etc.) for every interaction with the app. I also setup feedback, so the user can indicate whether the response was good or bad. Fun fact, these logs can then be used for further training, testing, and evals! |
| Phase 6: App Improvements | Polish and harden the live app - fine-tuned model, swappable inference backends, auth, | Here I discovered this beautiful product called Modal, which amongst other things allows you deploy fine-tuned/custom models. Despite the ease of setting the end-point on Modal, the cold start boot time for a on-demand model was 10-30 seconds which was way too high for an end user (and I didn’t want to permanently rent a GPU). So I decided to start with base Llama and pull up the fine-tuned model only if there are more than 5 messages. Also added a debug mode to see more of the innards of the app. |
| Phase 7: Voice Persona | Use ElevenLabs to give “Dalan” your actual voice and enable verbal interaction | I was satisfied enough with the project that I stopped. Left as a future todo. |
For reference the final tech stack is:
- Next.js app / frontend
- Pinecone for vector search
- Modal and OpenRouter for AI inference (Modal serves a fine-tuned llama-3.1-8b-instruct while OpenRouter serves the base model)
- Langfuse for tracing and feedback
- Upstach on Vercel for access token management
These were all suggestions by Claude code. I mostly said ok.