Ears to the ground…
Intrigued by the current AI trends across the industry, I was keen to try some things hands-on: to get a hold of the concepts, the terminology, and so on, and also to see if I could leverage “open source” to improve on something I had previously done.
..a simple RAG application
Let me talk about a very simple RAG1 application that I ended up building. A few years ago, I had played around with Strava2 APIs to fetch my activity data (from the prehistoric era, when I was active). The whole experience was fun, as I’d experimented with a simple service (using snips.ai, a Raspberry Pi, and an Xmini). The application would:
accept voice input like “how much did I run (/bike) last week (/month/year)”
summarise the totals of my running and biking activities for the given duration, and then,
respond to me via audio.
(Here’s a demo video from back then)
In this new Strava-RAG project, the aim was to cash in on that past experience with Strava APIs, and see how it could be made better. I had a simple project outline and its capabilities generated using ChatGPT. Admittedly, I have a soft corner for PostgreSQL, and I was also keen on utilising its vector database3 extension.
The overall system goal was very simple, and can be outlined as:
A system that can answer questions in plain English about activities, like “Which year was I the most active in?”, or something more complex like “Show me the activities where I performed like my best activity in 2019”, and so on…
The following main system operations were required:
Authenticate and fetch activities from Strava
Parse and store activities in the PostgreSQL
Enable natural language querying on the stored activities using LLM
Authenticate, Authorize, and Fetch data from Strava
Strava uses the OAuth protocol for authentication, which turned out to be a bit more involved4 than it sounds in theory, so let me lay it down simply.
Here’s the general flow:
You register your app with Strava and get a Client ID and Client Secret.
You direct the user (yourself, in this case) to a special authorization URL.
On approval, Strava sends back an authorization code.
You exchange this code for an access token, which is used to fetch actual activity data.
This boiled down to two basic functions:
# Get the token
get_token(auth_code) -> access_token
# Use token to get activities
get_activities(access_token) -> activities[]
Note that the tokens expire, so this also involves the additional quirk of adding refresh_token logic to the backend code.
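As a sketch, the functions above might look like this in Python using only the standard library. The endpoint URLs follow Strava’s OAuth documentation, but treat the details (scope name, response fields) as assumptions to verify against the docs:

```python
import json
import urllib.parse
import urllib.request

AUTH_URL = "https://www.strava.com/oauth/authorize"
TOKEN_URL = "https://www.strava.com/oauth/token"
ACTIVITIES_URL = "https://www.strava.com/api/v3/athlete/activities"


def build_authorize_url(client_id: str, redirect_uri: str) -> str:
    """Step 2: the URL the user visits to approve access."""
    params = urllib.parse.urlencode({
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": "activity:read_all",  # assumed scope for reading activities
    })
    return f"{AUTH_URL}?{params}"


def _post_token(fields: dict) -> dict:
    """POST form-encoded fields to the token endpoint; return the JSON reply."""
    data = urllib.parse.urlencode(fields).encode()
    req = urllib.request.Request(TOKEN_URL, data=data)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def get_token(client_id: str, client_secret: str, auth_code: str) -> dict:
    """Step 4: exchange the authorization code for access + refresh tokens."""
    return _post_token({
        "client_id": client_id,
        "client_secret": client_secret,
        "code": auth_code,
        "grant_type": "authorization_code",
    })


def refresh_token(client_id: str, client_secret: str, refresh: str) -> dict:
    """Access tokens expire; trade the refresh token for a fresh pair."""
    return _post_token({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh,
        "grant_type": "refresh_token",
    })


def get_activities(access_token: str) -> list:
    """Fetch the athlete's activities as a list of dicts."""
    req = urllib.request.Request(
        ACTIVITIES_URL, headers={"Authorization": f"Bearer {access_token}"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The `refresh_token` helper covers the expiry quirk: when a request starts failing with an auth error, the backend swaps the stored refresh token for a new access/refresh pair and retries.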
The output of get_activities is an array of activities, which is dumped to a JSON file.
Parse and store activities in the PostgreSQL
Once the list of activities has been successfully fetched, the next step is to push them to the database. While doing that, we also convert the relevant information to embeddings to enable intelligent querying.
The following was the format of the (would-be) embedded sentences:
{activity['name']} {activity['type']} {activity['distance']} meters in {activity['elapsed_time']} seconds
All the activities were converted to the above format, and then I utilised Python’s sentence-transformers library to do the string → embedding transformation. For example, a sentence formed by filling in the placeholders above would be:
morning cycling ride 10034 meters in 1934 seconds
which would then get converted via sentence-transformers to a 384-dimension vector like so:
[0.034342 0.0534622 0.0324592 0.0441267 …]
one such vector being stored with each activity record.
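This step can be sketched as follows. I’m assuming the all-MiniLM-L6-v2 model here (which does produce 384-dimensional vectors); the helper names are my own:

```python
def activity_sentence(activity: dict) -> str:
    """Render an activity record into the flat sentence format above."""
    return (f"{activity['name']} {activity['type']} "
            f"{activity['distance']} meters in {activity['elapsed_time']} seconds")


def embed_activities(activities: list) -> list:
    """Turn each activity into a 384-dimension embedding vector."""
    # Imported lazily so the formatting helper stays dependency-free.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim output
    sentences = [activity_sentence(a) for a in activities]
    return model.encode(sentences).tolist()
```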
It’s easy to write a simple loop that does the same for the entire set of activities and stores the results in a DB table. This table then becomes our source of truth, and is kept updated with new activities over time.
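A storage loop might look like the sketch below. The table and column names mirror the SQL shown later in this post, but the exact schema, the psycopg2-style connection, and the pgvector literal format are assumptions:

```python
def to_pgvector_literal(vec: list) -> str:
    """Format a float list as a pgvector literal string, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(f"{x:g}" for x in vec) + "]"


def store_activities(conn, activities: list, embeddings: list) -> None:
    """Insert each activity row plus its embedding (psycopg2-style connection)."""
    with conn.cursor() as cur:
        for act, emb in zip(activities, embeddings):
            cur.execute(
                """
                INSERT INTO activities
                    (activity_id, activity_type, distance, duration,
                     timestamp, embedding)
                VALUES (%s, %s, %s, %s, %s, %s::vector)
                ON CONFLICT (activity_id) DO NOTHING
                """,
                (act["id"], act["type"], act["distance"],
                 act["elapsed_time"], act["start_date"],
                 to_pgvector_literal(emb)),
            )
    conn.commit()
```

The `ON CONFLICT ... DO NOTHING` clause is what lets the same loop be re-run periodically to pick up new activities without duplicating old ones.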
With this, I was done with the heart of the application. What remains is enabling querying in natural language, which could then leverage these embeddings. So here we go.
Enable natural language querying on the stored activities using LLM
I must admit that I initially took this step for granted, naïvely assuming that the LLM would take care of everything. It wasn’t to be.
The platform of choice was Ollama, since I wanted something local and portable. Based on ChatGPT’s recommendation, I installed the Mistral LLM on Ollama. There followed many, many iterations of refining the prompt that was to be sent to Mistral. The main challenge was in perfecting the query syntax to be sent to PGSQL. Mistral, being a comparatively lightweight and generic model, often deviated from PGSQL syntax.
The deviation from what PGSQL expected was also due to the fact that similarity searches (cosine distance, <=>), brought in by the recently introduced PGVector extension, have a unique syntax. For example, for queries like:
Find activities similar to my last Run
The syntax expected by PGSQL would be something like:
...
SELECT activity_id, activity_type, distance, duration, timestamp,
1 - (embedding <=> (SELECT embedding FROM last_run)) AS similarity_score
FROM activities
...
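For reference, a complete version of such a query might look like the following. It’s sketched in Python so it can be parameterised; the column names follow the snippet above, but the definition of the last_run CTE (most recent activity of type 'Run') is my assumption:

```python
def similar_to_last_run_sql(limit: int = 5) -> str:
    """Build the full similarity query: latest Run first, ranked by cosine distance."""
    return f"""
WITH last_run AS (
    SELECT embedding
    FROM activities
    WHERE activity_type = 'Run'
    ORDER BY timestamp DESC
    LIMIT 1
)
SELECT activity_id, activity_type, distance, duration, timestamp,
       1 - (embedding <=> (SELECT embedding FROM last_run)) AS similarity_score
FROM activities
ORDER BY embedding <=> (SELECT embedding FROM last_run)
LIMIT {limit};
""".strip()
```

Note that the ORDER BY uses the raw `<=>` distance (smaller is closer), while the SELECT converts it to a similarity score for readability.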
Over several iterations of refining the prompt, ChatGPT suggested that I add the following to the query rules, in the prompt provided to Mistral:
Query rules:
- Use the `<=>` operator for cosine similarity between embeddings. It returns a FLOAT, not a BOOLEAN.
...
However, despite the rules, Mistral kept misfiring as far as PGSQL syntax was concerned. Another pain point for Mistral was the way it dealt with Date functions in SQL. To further make it PGSQL compatible, I had to add another example like so:
### Example: Find the year with the least number of activities
SELECT EXTRACT(YEAR FROM timestamp) AS activity_year, COUNT(*) AS total_activities
FROM activities
GROUP BY activity_year
ORDER BY total_activities ASC
LIMIT 1;
Despite the above, Mistral often used incorrect date & time functions to extract the year out of the activity record’s timestamp field. To handle this, I had to be explicit in the query rules provided in the prompt:
- PostgreSQL does **not support** YEAR(timestamp). Instead, use: EXTRACT(YEAR FROM "timestamp")
...
This mostly took care of the issue.
Several more iterations followed, each adding further query rules and examples to the prompt. After that, Mistral finally started generating consistently good syntax, with very few errors.
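The prompt-assembly and generation step can be sketched like this. The `/api/generate` endpoint and its `model`/`prompt`/`stream` fields follow Ollama’s REST API; the rules text is condensed from the snippets above, and the function names are my own:

```python
import json
import urllib.request

QUERY_RULES = """\
Query rules:
- Use the `<=>` operator for cosine similarity between embeddings. It returns a FLOAT, not a BOOLEAN.
- PostgreSQL does **not support** YEAR(timestamp). Instead, use: EXTRACT(YEAR FROM "timestamp")
"""


def build_prompt(question: str) -> str:
    """Combine the fixed query rules with the user's question into one prompt."""
    return (
        "You translate questions about Strava activities into PostgreSQL.\n\n"
        f"{QUERY_RULES}\n"
        f"Question: {question}\nSQL:"
    )


def generate_sql(question: str, model: str = "mistral",
                 host: str = "http://localhost:11434") -> str:
    """Send the prompt to a local Ollama server and return the raw response text."""
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(question),
        "stream": False,  # one complete JSON reply instead of a token stream
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

In practice each failed generation fed back into QUERY_RULES as a new rule or worked example, which is exactly the iteration loop described above.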
Then I went ahead and added a simple, browser-based UI to the application, because until then it had all been CLI-driven.
With this I was pretty satisfied with the overall arrangement.
Next steps..
In subsequent posts, I want to build up on this basic project to mimic a production system, as I think that’s a topic which deserves adequate attention. In other words, I want to focus on: how does all of this come together, and sustain over time, in a real-world system?
And, oh yes, you can find the above project here.
Retrieval Augmented Generation (RAG): A system that enhances (augments) a language model’s responses by retrieving relevant information from an external source — database, documents, etc.
Strava is a very popular activity-tracking app. It’s also a social network of sorts for its users, in that users can subscribe to each other’s “activity feed” and react to it.
Vector database: A kind of database that stores text as numbers (called embeddings), making it possible to run mathematical operations — like measuring how similar two pieces of text are, by comparing their number arrays.