Our Politics
Building an AI pipeline to track Canadian election platforms
4/20/2025
For the last decade I've used my addiction to national politics in an attempt to serve the greater good. In 2015, I launched ourpolitics.ca as a way to help fellow Canadians understand what parties were promising out on the campaign trail.
If you're unfamiliar with Canadian elections, they involve a short campaign of between 37 and 51 days, with a handful of parties (roughly six polling at 2% or more) vying for seats in Parliament and the chance to represent a share of Canada's 343 districts.
As a solo effort, keeping up with the campaigns over a very narrow span of time while juggling work and home life has always been a challenge. The first casualty has typically been coverage of minor parties. The second casualty is often the summaries that detail the nuances of each campaign promise.
I knew I needed to somehow overhaul my entire process if I was going to meet the lofty aspirations I have for this project.
Enter: AI.
The writ drops
On March 23, an election was officially called with the bare minimum of 37 days until election day to prepare my site. The election was also called the day before I went on a two-week trip abroad. Oof.
Luckily, I had done some prep work in anticipation of this. Knowing an election was due sometime in 2025, I spent a couple of weeks back in September prototyping ways to leverage AI on the site. Those efforts would eventually become the AI pipeline that has now produced >90% of the 2025 content on Our Politics.
The core of my system is a tool I named synapse.
$ bin/synapse
Usage: bin/synapse {sync, search, policies}
$ bin/synapse policies
Usage: bin/synapse policies {update, review, summarize}
The synapse tool is actually a wrapper around the core library, written in Clojure, as a sort of band-aid to abstract some of the warts. Behind the scenes, it chains a number of independent CLI calls into one cohesive system. The core library's CLI is a lot more comprehensive, but also requires many flags to be set and contains a number of sharp edges:
√ $ clj -M:run help --no-tui
Usage: clj -M:run <command> [options] [<args>]
Commands:
server Start the server
scrape <url> Scrape content from a URL or PDF
purge <url> Remove a URL from the cache
process Process articles from the references file
process-policies Generate policies based on context provided by STDIN
process-embeddings Generate embeddings for stored quads
reembed <model> Re-embed all quads with a different model
list-quads List all quads stored in the database
research <topic> Research a specific topic
summarize Summarize a set of quads
review-policies Review policies for summarization
blind-test [file] Run a blind test of quad extraction across different LLM models
help Display this help message
Run 'clj -M:run <command> --help' for command-specific options.
A handful of these commands strung together creates the full end-to-end pipeline to produce content for Our Politics.
Pulling, parsing, and extracting content
$ bin/synapse sync
The sync command is responsible for reading the list of RSS feeds and direct URLs that I provide, fetching the webpages, and extracting relevant content. The sources are defined in a JSON document and include content from the campaigns, news organizations, the government (e.g. Statistics Canada), and any one-off pages I include.
This lets the system absorb far more information than I ever could manually, and it also cuts across my personal media biases by letting content flow in from a variety of sources.
// data/sources/campaign_2025.json
[
{
"title": "CPC RSS Feed",
"publisher": "Conservative Party of Canada",
"url": "https://www.conservative.ca/feed/",
"type": "rss"
},
...
]
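To give a sense of how the sync step might expand these sources, here's a minimal sketch, assuming hypothetical helper names and file layout (the real synapse internals differ): RSS feeds are unrolled into their item links, while direct URLs pass through as-is.
;; Minimal sketch (not the actual synapse code): expand each source entry
;; into the list of URLs to hand off to the scraper.
(ns example.sync
  (:require [cheshire.core :as json]
            [clojure.xml :as xml]))
(defn load-sources
  "Reads the sources JSON file into a vector of maps."
  [path]
  (json/parse-string (slurp path) true))
(defn rss-item-links
  "Returns the <link> of every <item> in an RSS feed."
  [feed-url]
  (for [item  (xml-seq (xml/parse feed-url))
        :when (= :item (:tag item))
        child (:content item)
        :when (= :link (:tag child))]
    (first (:content child))))
(defn expand-source
  "Turns one source entry into the URLs that should be scraped."
  [{:keys [type url]}]
  (if (= type "rss")
    (rss-item-links url)
    [url]))
(comment
  ;; Every URL produced here would then flow into the scraper.
  (mapcat expand-source (load-sources "data/sources/campaign_2025.json")))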
After fetching a webpage, the core text needs to be extracted from the raw HTML. My first attempt here used an LLM, but that was a surprisingly challenging request and resulted in more mess than anything. Since I'm using Clojure, I ended up going for the Java library jsoup, and manually identifying selectors for the page content.
(def domain-config
{"blocquebecois.org" {:selector "article"}
"conservative.ca" {:selector ".post-content"}
"liberal.ca" {:selector ".post-content-container"}
"ndp.ca" {:selector "article"}
"www150.statcan.gc.ca" {:selector "main" :exclude [".rel-article" ".MathJax" "#moreinfo" ".modal-content" "#wb-glb-mn"]}
...})
(defn parse-html
"Parses HTML content into a Jsoup Document"
[html]
(Jsoup/parse html))
(defn extract-content
"Extracts content from the document using a CSS selector, optionally excluding nested content"
[^Document doc selector & [exclude-selectors]]
(let [selected-elements (.select doc selector)]
(if exclude-selectors
(do
(doseq [element selected-elements
exclude-selector (if (sequential? exclude-selectors)
exclude-selectors
[exclude-selectors])]
(.remove (.select element exclude-selector)))
(.text selected-elements))
(.text selected-elements))))
(defn scrape-webpage
"Scrapes content from a webpage given a URL, using domain-specific config"
[url-or-map]
(let [url (if (map? url-or-map) (:url url-or-map) url-or-map)]
(log (str "Scraping webpage: " url))
(let [domain (extract-domain url)
_ (log (str "Domain: " domain))
config (domain-config domain)
_ (log (str "Domain config: " config))]
(if config
(if (not (:ignore config))
(let [selector (:selector config)
exclude-selectors (:exclude config)
headers (:headers config)
content (cache/get-or-set url #(some-> url
(fetch-url headers)
parse-html
(extract-content selector exclude-selectors)))]
content)
nil)
(throw (ex-info "No configuration found for domain" {:domain domain :url url}))))))
The synapse internals define several types of details to extract from a given source text, with definitions such as policy_position and policy_evidence. Each of these data types has its own corresponding system prompt and user prompt, as well as a corresponding JSON schema. With these, I'm able to focus the LLM on extracting specific types of details from the source text.
I blind-tested a number of different LLMs, both hosted (Claude 3.7, DeepSeek V3, Gemini, Command-A) and local (Phi-4, QwQ, Gemma 3, Llama 3). I was also wary of using potentially censored models, as I didn't want them to unduly influence coverage of controversial policies (e.g. criticism of China being filtered).
For reasons of cost, quality, and trust, I landed on Cohere's Command-A, which did an exceptional job being very literal, terse, and sticking to the source material. It doesn't hurt that Cohere is Canadian to boot!
(let [{user-prompt :user system-prompt :system} (llm/load-prompt quad-type)
provider (llm/create-cohere-provider "command-a-03-2025")]
(llm/generate-json provider
(llm/response-format :json-object (llm/load-schema quad-type))
(str "<ArticleContent>\n" article-content "</ArticleContent>\n"
"<ArticleMetadata>\n" article-metadata "</ArticleMetadata>")))
The following is the system prompt for extracting policy_evidence-type data:
You are a precise policy evidence analyzer focused on extracting specific claims and evidence about policies from news articles.
Your task is to identify concrete details, statistics, expert analysis, and impact assessments that help understand or evaluate policy proposals.
EXTRACTION CRITERIA:
- Focus on specific, detailed claims about policies (costs, timelines, impacts)
- Extract both quantitative data (numbers, statistics) and qualitative evidence (expert analysis, implementation details)
- Always link evidence to specific policies mentioned in the article
- Assess the confidence level and type of each claim
- Identify the verification status of claims
CLAIM TYPES:
- cost_estimate: Financial projections, budgets, funding details
- timeline: Implementation schedules, deadlines, phases
- impact_projection: Expected effects or outcomes
- statistic: Numerical data or measurements
- expert_analysis: Professional assessments or expert opinions
- implementation_detail: Specific details about how a policy would work
VERIFICATION STATUS:
- verified: Claim is confirmed by official sources
- unverified: Claim needs verification
- disputed: Different sources disagree
- partially_verified: Some aspects confirmed, others not
CONFIDENCE LEVELS:
- high: Strong evidence, reputable sources, consistent reporting
- medium: Some uncertainty or limited corroboration
- low: Single source, preliminary reports, or significant uncertainty
RESPONSE FORMAT:
- claim: The specific evidence or claim about the policy
- claim_type: One of the defined claim types
- policy_area: Classify into ONE of the enumerated policy areas
- verification_status: Current verification status of the claim
- confidence_level: Assessment of claim reliability
- source: The news article URL
- date: Date of the claim (YYYY-MM-DD)
- context: Additional relevant context
CRITICAL: Focus on extracting concrete, specific evidence that helps understand or evaluate policies. Do not include vague statements or purely political rhetoric.
To be able to later look up relevant content, I take each object returned by the LLM, and persist it in a local RocksDB file, assigning each entry a new incremental ID. I then iterate through these entries, generating embedding vectors and saving the results to a separate RocksDB file.
I used a separate file for the embeddings so that I could test different embedding libraries without needing to regenerate the source data. These embeddings hold some metadata, like the ID of the embedded record, so that it can be looked up in the source database for access to the full contents.
After testing a number of different embedding libraries, I ended up using Mixedbread's mxbai-rerank-large-v2 locally from my laptop, which proved excellent at finding relevant snippets of data.
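To make that two-file split concrete, here's roughly the shape of what might end up in each store; the keys, fields, and values below are illustrative only, not the actual records:
;; Illustrative shape only, not the actual stored records.
;; The quad store keys each extracted object by its incremental ID.
(def quad-store-example
  {42 {:quad-type :policy_evidence
       :claim     "A specific, sourced claim about a policy"
       :source    "https://example.com/article"
       :date      "2025-04-01"}})
;; The embedding store holds the vector plus enough metadata to find the
;; source record again in the quad store.
(def embedding-store-example
  {"42" {:quad-id 42
         :model   "embedding-model-name"
         :vector  [0.013 -0.224 0.087]}})  ; truncated for illustration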
Updating the policy list
$ bin/synapse policies update
Now that we've "synced" all of the referenced content and stored it locally, we can have fun generating new content! The policies update step is actually a combination of two smaller steps.
First, we fetch all of the new policy statements generated since the last time we ran the command. I cap this at 150 entries, as Cohere's context window prevented me from passing anything beyond that without it blowing up spectacularly.
$ clj -M:run list-quads --db-path ./data/db/campaign_2025 \
--quad-types policy_statement --num-results 150 \
--filter since:<TIMESTAMP> --format json
After we collect that list, we use a template to combine it with the list of existing policies, instructing the LLM not to create duplicates. News articles frequently repeat policies announced earlier in the campaign, so we can't assume each policy appears only once, or in chronological order.
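The template itself boils down to pairing the existing policy list with the new statements and spelling out the no-duplicates rule. A minimal sketch of how that prompt might be assembled (the wording and helper are assumptions, not the real template):
;; Sketch only: the actual template wording differs.
(require '[cheshire.core :as json])
(defn build-update-prompt
  "Combines existing policies and new statements into a single prompt,
   asking the model not to re-create policies it has already seen."
  [existing-policies new-statements]
  (str "Identify campaign policies in <NewStatements>. "
       "Do NOT create a policy that duplicates one in <ExistingPolicies>.\n\n"
       "<ExistingPolicies>\n" (json/generate-string existing-policies) "\n</ExistingPolicies>\n"
       "<NewStatements>\n" (json/generate-string new-statements) "\n</NewStatements>"))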
$ ... | clj -M:run process-policies
The output of running this is concatenated into a temporary file called policies-gen.json, which is written in the same structure that powers the full Our Politics website. That looks something like this:
[
{
"topic": "foreign-policy",
"party": "Liberal",
"year": 2025,
"title": {
"EN": "Exceed NATO target for military spending of 2% GDP before 2030",
"FR": "Dépasser l'objectif de l'OTAN de 2 % du PIB pour les dépenses militaires avant 2030"
},
"references": [
{
"url": "https://www.cbc.ca/news/politics/liberal-leadership-contender-mark-carney-defence-spending-1.7450718",
"title": "Mark Carney commits to 2% NATO defence spending benchmark by 2030",
"publisher": "CBC News",
"date": "Feb 5, 2025"
},
{
"url": "https://liberal.ca/wp-content/uploads/sites/292/2025/04/Canada-Strong.pdf",
"title": "Canada Strong",
"publisher": "Liberal Party of Canada",
"date": "Apr 19, 2025"
}
]
}
]
Reviewing policies
$ bin/synapse policies review
After collecting all of the generated policies, it's critical that I manually review each extracted policy to evaluate whether the LLM accurately identified a true policy, misunderstood the context, or hallucinated a result. As well, I check whether the identified references are accurate and well-structured.
As each policy is accepted, it is added to the master policies.json file, and an id is assigned to the policy. This is the file that I then copy over to the host repository at jahfer/ourpolitics.
Generating summaries
$ bin/synapse policies summarize
Once the hard part of extracting and evaluating new policies is complete, the next step is to generate a summary of the policy, including all of that additional context extracted from the policy_evidence dataset. These summaries appear when you click into a given policy.
This will perform a cosine-similarity search of the local database, attempting to find related bits of information for a given policy across all sources over time. It then passes that context to an LLM (primarily Phi-4 running locally, but more recently Google's gemini-2.5-flash-preview-04-17).
(let [provider (llm/create-local-provider "unsloth-phi-4")]
(llm/send-message provider <insert prompt here>))
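The similarity search itself is conceptually simple. Here's a minimal version of the ranking step (the real synapse implementation likely differs in the details):
;; Minimal cosine-similarity ranking over stored embedding vectors.
(defn dot-product [a b]
  (reduce + (map * a b)))
(defn cosine-similarity [a b]
  (/ (dot-product a b)
     (* (Math/sqrt (dot-product a a))
        (Math/sqrt (dot-product b b)))))
(defn top-k-related
  "Scores every stored embedding against the query vector and keeps the best k."
  [query-vec embeddings k]
  (->> embeddings
       (map (fn [{:keys [quad-id vector]}]
              {:quad-id quad-id :score (cosine-similarity query-vec vector)}))
       (sort-by :score >)
       (take k)))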
This is the prompt template used:
Summarize this policy using the provided context: {{ POLICY METADATA }}
Rules:
1. Format the summary as a few (2-3) sentences of Markdown within one paragraph. Optionally include items like quoted text if relevant. Keep it concise.
2. Do NOT repeat the basic details of the policy already included in the title
3. Do NOT make up any information. Rely solely on the provided context.
4. You MUST write at a GRADE 8 reading level
5. Reference relevant statistics from the supporting research, if available, to clarify things such as the cost of a program or the scope of the problem
6. All statements MUST include the source URL of the relevant context. Do not duplicate URLs; instead reuse the existing footnote
7. List your sources by providing a list of links at the end of the summary
<context>
{{ CONTEXT }}
</context>
For each policy, the LLM output gets written to a <policy_id>.md file.
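In practice that last write is about a one-liner; a sketch, with the output directory assumed:
;; Writes one Markdown file per policy; the directory name is an assumption.
(defn write-summary! [policy-id summary-markdown]
  (spit (str "summaries/" policy-id ".md") summary-markdown))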
As with the policy generation step, this also includes fully-manual review of the resulting text, ensuring the stated information is accurate, unbiased, and properly sourced. As well, I often find myself tweaking formatting and moving any links to the parent policy record.
A local search engine
Outside of the core pipeline, synapse also has the ability to crawl over its databases (about 10 MB of extracted data) to fulfill any manual search request. This runs fully client-side thanks to the local embedding model and RocksDB files. Exposing a search facility allows me to verify individual claims, as well as find additional sources or context to enhance the summaries.
$ bin/synapse search "NDP Singh: Invest 16 billion to build 3 million homes by 2030"
There's even a simple web UI.
Summary
At the end of the day, it's paramount that the information presented on Our Politics is accurate and as unbiased as feasible, so I still spend a considerable amount of time reviewing the generated policies and summaries. Even with the manual fact-checking, this pipeline is a significant improvement over the fully-manual process of the past.
Building this pipeline was a great learning experience, not least in overcoming my tendency to manually intervene at every step. Being able to run a couple of commands and see the announced policies for the day is still a magical experience to me. Shifting my role from data gathering to data verification has opened up opportunities for the site beyond what I could do on my own.
My goal has always been to make politics more about the policies than the people. Leveraging AI makes focusing on policies more feasible than ever, and because the system isn't inherently tied to federal elections, it could potentially bring similar analysis to provincial campaigns.
As for next steps, I'd like to automate more of the manual process (syncing, fact-checking, consolidation), perhaps eventually enabling near real-time policy discussion via a chat interface in time for the next election.
Happy voting, Canadians 🇨🇦 🗳️