Simon Willison’s Weblog


Recent entries

Weeknotes: Embeddings, more embeddings and Datasette Cloud four days ago

Since my last weeknotes, a flurry of activity. LLM has embeddings support now, and Datasette Cloud has driven some major improvements to the wider Datasette ecosystem.

Embeddings in LLM

LLM gained embedding support in version 0.9, and then got binary embedding support (for CLIP) in version 0.10. I wrote about those releases in detail in two separate posts.

Embeddings are a fascinating tool. If you haven’t got your head around them yet, the first of those posts tries to explain why they are so interesting.

There’s a lot more I want to build on top of embeddings—most notably, LLM (or Datasette, or likely a combination of the two) will be growing support for Retrieval Augmented Generation on top of the LLM embedding mechanism.

Annotated releases

I always include a list of new releases in my weeknotes. This time I’m going to use those to illustrate the themes I’ve been working on.

The first group of releases relates to LLM and its embedding support. LLM 0.10 extended that support:

  • llm 0.10—2023-09-12
    Access large language models from the command-line

Embedding models can now be built as LLM plugins. I’ve released two of those so far: llm-sentence-transformers and llm-clip.

The CLIP one is particularly fun, because it genuinely allows you to build a sophisticated image search engine that runs entirely on your own computer!

  • symbex 1.4—2023-09-05
    Find the Python code for specified symbols

Symbex is my tool for extracting symbols—functions, methods and classes—from Python code. I introduced that in Symbex: search Python code for functions and classes, then pipe them into a LLM.

Symbex 1.4 adds a tiny but impactful feature: it can now output a list of symbols as JSON, CSV or TSV. These output formats are designed to be compatible with the new llm embed-multi command, which means you can easily create embeddings for all of your functions:

symbex '*' '*:*' --nl | \
  llm embed-multi symbols - \
  --format nl --database embeddings.db --store

I haven’t fully explored what this enables yet, but it should mean that both related functions and semantic function search (“Find me a function that downloads a CSV”) are now easy to build.

Yet another thing you can do with embeddings is use them to find clusters of related items, which is what my new llm-cluster plugin explores.

The neatest feature of llm-cluster is that you can ask it to generate names for these clusters by sending the names of the items in each cluster through another language model, something like this:

llm cluster issues 10 \
  -d issues.db \
  --summary \
  --prompt 'Short, concise title for this cluster of related documents'

One last embedding related project: datasette-llm-embed is a tiny plugin that adds a select llm_embed('sentence-transformers/all-mpnet-base-v2', 'This is some text') SQL function. I built it to support quickly prototyping embedding-related ideas in Datasette.

Spending time with embedding models has led me to spend more time with Hugging Face. I realized last week that the Hugging Face all models sorted by downloads page doubles as a list of the models that are most likely to be easy to use.

One of the models I tried out was Salesforce BLIP, an astonishing model that can genuinely produce usable captions for images.

It’s really easy to work with. I ended up building this tiny little CLI tool that wraps the model:

  • blip-caption 0.1—2023-09-10
    Generate captions for images with Salesforce BLIP

Releases driven by Datasette Cloud

Datasette Cloud continues to drive improvements to the wider Datasette ecosystem as a whole.

It runs on the latest Datasette 1.0 alpha series, taking advantage of the JSON write API.

This also means that it’s been highlighting breaking changes in 1.0 that have caused old plugins to break, either subtly or completely.

This has driven a bunch of new plugin releases. Some of these are compatible with both 0.x and 1.x—the ones that only work with the 1.x alphas are themselves marked as alpha releases.

Datasette Cloud’s API works using database-backed access tokens, to ensure users can revoke tokens if they need to (something that’s not easily done with purely signed tokens) and that each token can record when it was most recently used.

I’ve been building that into the existing datasette-auth-tokens plugin.

Alex Garcia has been working with me building out features for Datasette Cloud, generously sponsored by Fly.io.

We’re beginning to build out social features for Datasette Cloud—features that will help teams privately collaborate on data investigations together.

Alex has been building datasette-short-links as an experimental link shortener. In building that, we realized that we needed a mechanism for resolving actor IDs displayed in a list (e.g. this link created by X) to their actual names.

Datasette doesn’t dictate the shape of actor representations, and there’s no guarantee that actors would be represented in a predictable table.

So... we needed a new plugin hook. I released Datasette 1.0a6 with a new hook, actors_from_ids(actor_ids), which can be used to answer the question “who are the actors represented by these IDs”.
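
Here is a minimal sketch of what a plugin implementing that hook might look like. The in-memory dictionary is a stand-in for whatever storage a real plugin would use, and the exact hook signature should be checked against the Datasette plugin hook documentation:

from datasette import hookimpl

@hookimpl
def actors_from_ids(datasette, actor_ids):
    # Hypothetical lookup table; a real plugin might query a users table
    # in one of the attached databases instead
    directory = {
        "1": {"id": "1", "name": "Alex"},
        "2": {"id": "2", "name": "Simon"},
    }
    return {actor_id: directory.get(actor_id) for actor_id in actor_ids}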

Alex is using this in datasette-short-links, and I built two plugins to work with the new hook as well.

Datasette Cloud lets users insert, edit and delete rows from their tables, using the plugin Alex built called datasette-write-ui which he introduced on the Datasette Cloud blog.

This inspired me to finally put out a fresh release of datasette-edit-schema—the plugin which provides the ability to edit table schemas—adding and removing columns, changing column types, even altering the order columns are stored in the table.

datasette-edit-schema 0.6 is a major release, with three significant new features:

  • You can now create a brand new table from scratch!
  • You can edit the table’s primary key
  • You can modify the foreign key constraints on the table

Those last two became important when I realized that Datasette’s API is much more interesting if there are foreign key relationships to follow.

Combine that with datasette-write-ui and Datasette Cloud now has a full set of features for building, populating and editing tables—backed by a comprehensive JSON API.

  • sqlite-migrate 0.1a2—2023-09-03
    A simple database migration system for SQLite, based on sqlite-utils

sqlite-migrate is still marked as an alpha, but won’t be for much longer: it’s my attempt at a migration system for SQLite, inspired by Django migrations but with a less sophisticated set of features.

I’m using it in LLM now to manage the schema used to store embeddings, and it’s beginning to show up in some Datasette plugins as well. I’ll be promoting this to non-alpha status pretty soon.

  • sqlite-utils 3.35.1—2023-09-09
    Python CLI utility and library for manipulating SQLite databases

A tiny fix in this, which with hindsight was less impactful than I thought.

I spotted a bug on Datasette Cloud when I configured full-text search on a column, then edited the schema and found that searches no longer returned the correct results.

It turned out the rowid column in SQLite was being rewritten by calls to the sqlite-utils table.transform() method. FTS records are related to their underlying row by rowid, so this was breaking search!

I pushed out a fix for this in 3.35.1. But then... I learned that rowid in SQLite has always been unstable—they are rewritten any time someone VACUUMs a table!

I’ve been designing future features for Datasette that assume that rowid is a useful stable identifier for a row. This clearly isn’t going to work! I’m still thinking through the consequences of it, but I think there may be Datasette features (like the ability to comment on a row) that will only work for tables with a proper primary key.

sqlite-chronicle

  • sqlite-chronicle 0.1—2023-09-11
    Use triggers to track when rows in a SQLite table were updated or deleted

This is very early, but I’m excited about the direction it’s going in.

I keep on finding problems where I want to be able to synchronize various processes with the data in a table.

I built sqlite-history a few months ago, which uses SQLite triggers to create a full copy of the updated data every time a row in a table is edited.

That’s a pretty heavy-weight solution. What if there was something lighter that could achieve a lot of the same goals?

sqlite-chronicle uses triggers to instead create what I’m calling a “chronicle table”. This is a shadow table that records, for every row in the main table, four integer values:

  • added_ms—the timestamp in milliseconds when the row was added
  • updated_ms—the timestamp in milliseconds when the row was last updated
  • version—a constantly incrementing version number, global across the entire table
  • deleted—set to 1 if the row has been deleted

Just storing four integers (plus copies of the primary key) makes this a pretty tiny table, and hopefully one that’s cheap to update via triggers.

But... having this table enables some pretty interesting things—because external processes can track the last version number that they saw and use it to see just which rows have been inserted and updated since that point.
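
As a rough sketch of that pattern in Python: the chronicle table name and primary key column here are assumptions for illustration, so check the sqlite-chronicle documentation for the actual naming convention.

import sqlite3

def changed_since(db_path, last_seen_version):
    # Return (id, version, deleted) for every row that changed since the
    # version number this process last recorded, plus a new high-water mark
    db = sqlite3.connect(db_path)
    rows = db.execute(
        "select id, version, deleted from _chronicle_documents where version > ?",
        (last_seen_version,),
    ).fetchall()
    new_version = max((version for _, version, _ in rows), default=last_seen_version)
    return rows, new_version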

I gave a talk at DjangoCon a few years ago called the denormalized query engine pattern, describing the challenge of syncing an external search index like Elasticsearch with data held in a relational database.

These chronicle tables can solve that problem, and can be applied to a whole host of other problems too. So far I’m thinking about the following:

  • Publishing SQLite databases up to Datasette, sending only the rows that have changed since the last sync. I wrote a prototype that does this and it seems to work very well.
  • Copying a table from Datasette Cloud to other places—a desktop copy, or another instance, or even into an alternative database such as PostgreSQL or MySQL, in a way that only copies and deletes rows that have changed.
  • Saved search alerts: run a SQL query against just rows that were modified since the last time that query ran, then send alerts if any rows are matched.
  • Showing users a note that “34 rows in this table have changed since your last visit”, then displaying those rows.

I’m sure there are many more applications for this. I’m looking forward to finding out what they are!

I needed to fix a bug in Datasette Cloud by moving a table from one database to another... so I built a little plugin for sqlite-utils that adds a sqlite-utils move-tables origin.db destination.db tablename command. I love being able to build single-use features as plugins like this.

And some TILs

This was a fun TIL exercising the new embeddings feature in LLM. I used Django SQL Dashboard to break up my blog entries into paragraphs and exported those as CSV, which could then be piped into llm embed-multi. I then used that to build a CLI-driven semantic search engine for my blog.

llama-cpp has grammars now, which enable you to control the exact output format of the LLM. I’m optimistic that these could be used to implement an equivalent to OpenAI Functions on top of Llama 2 and similar models. So far I’ve just got them to output arrays of JSON objects.

I’m using this trick a lot at the moment. I have API access to Claude now, which has a 100,000 token context limit (GPT-4 is just 8,000 by default). That’s enough to summarize 100+ comment threads from Hacker News, for which I’m now using this prompt:

Summarize the themes of the opinions expressed here, including quotes (with author attribution) where appropriate.

The quotes part has been working really well—it turns out summaries of themes with illustrative quotes are much more interesting, and so far my spot checks haven’t found any that were hallucinated.

cr-sqlite adds full CRDTs to SQLite, which should enable multiple databases to accept writes independently and then seamlessly merge them together. It’s a very exciting capability!

It turns out Hugging Face Spaces offer free scale-to-zero hosting for demos that run in Docker containers on machines with a full 16GB of RAM! I’m used to optimizing Datasette for tiny 256MB containers, so having this much memory available is a real treat.


Build an image search engine with llm-clip, chat with models with llm chat eight days ago

LLM is my combination CLI tool and Python library for working with Large Language Models. I just released LLM 0.10 with two significant new features: embedding support for binary files and the llm chat command.

Image search by embedding images with CLIP

I wrote about LLM’s support for embeddings (including what those are and why they’re interesting) when I released 0.9 last week.

That initial release could only handle embeddings of text—great for things like building semantic search and finding related content, but not capable of handling other types of data.

It turns out there are some really interesting embedding models for working with binary data. Top of the list for me is CLIP, released by OpenAI in January 2021.

CLIP has a really impressive trick up its sleeve: it can embed both text and images into the same vector space.

This means you can create an index for a collection of photos, each placed somewhere in 512-dimensional space. Then you can take a text string—like “happy dog”—and embed that into the same space. The images that are closest to that location will be the ones that contain happy dogs!

My llm-clip plugin provides the CLIP model, loaded via SentenceTransformers. You can install and run it like this:

llm install llm-clip
llm embed-multi photos --files photos/ '*.jpg' --binary -m clip

This will install the llm-clip plugin, then use embed-multi to embed all of the JPEG files in the photos/ directory using the clip model.

The resulting embedding vectors are stored in an embedding collection called photos. This defaults to going in the embeddings.db SQLite database managed by LLM, or you can add -d photos.db to store it in a separate database instead.

Then you can run text similarity searches against that collection using llm similar:

llm similar photos -c 'raccoon'

I get back:

{"id": "IMG_4801.jpeg", "score": 0.28125139257127457, "content": null, "metadata": null}
{"id": "IMG_4656.jpeg", "score": 0.26626441704164294, "content": null, "metadata": null}
{"id": "IMG_2944.jpeg", "score": 0.2647445926996852, "content": null, "metadata": null}

And sure enough, IMG_4801.jpeg is this:

A night time blurry photo of a Raccoon in a trash can. A stencilled label on the bin below the raccoon's face says TRASH

(I was pretty excited to snap a photo of a trash panda in an actual trash can.)

CLIP is a pretty old model at this point, and there are plenty of interesting alternatives that are just waiting for someone to wrap them in a plugin. I’m particularly excited about Facebook’s ImageBind, which can embed images, text, audio, depth, thermal, and IMU data all in the same vector space!

Chat with models using llm chat

The other big feature in LLM 0.10 is the new llm chat command.

Prior to this release, the way to have an ongoing conversation with a model was through the -c/--continue flag. You could start a conversation like this:

llm -m gpt-4 'Help me understand generators in Python'

Then ask a follow-up question using -c like so:

llm -c 'Show me an example involving the cast of Riverdale'

This works well, and everything gets logged to SQLite so you can run llm logs -c to see a full log of your most recent conversation.

You could continue a past conversation as well, using llm --cid ID to specify the conversation ID, recorded in those logs.

There was one big problem though: LLM plugins allow you to install models on your own machine, but a conversation with them that spans multiple command invocations results in that model being loaded into memory and then unloaded again for every prompt.

The new llm chat solves that. You can now run llm chat -m model_id to start an interactive chat in your terminal with your model of choice.

Here’s an example using Llama 2 13B, installed via the llm-mlc plugin.

llm chat -m mlc-chat-Llama-2-13b-chat-hf-q4f16_1

You can set an alias for the model to make that easier to remember.

Here’s an example chat session with Llama 2:

Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
> Who are you?
Hello! I'm just an AI, here to assist you with any questions you may have.
My name is LLaMA, and I'm a large language model trained to provide helpful
and informative responses to a wide range of questions and topics. I'm here
to help you find the information you need, while ensuring a safe and
respectful conversation. Please feel free to ask me anything, and I'll do my
best to provide a helpful and accurate response.
> Tell me a joke about otters
Sure, here's a joke about otters:

Why did the otter go to the party?

Because he heard it was a "whale" of a time!

(Get it? Whale, like a big sea mammal, but also a "wild" or "fun" time.
Otters are known for their playful and social nature, so it's a lighthearted
and silly joke.)

I hope that brought a smile to your face! Do you have any other questions or
topics you'd like to discuss?
> exit

I like how Llama 2 enthusiastically explains its own jokes! This is a great demonstration of why techniques like RLHF are so important—you need to go way beyond a raw language model if you’re going to teach one not to be this corny.

Each line of your chat will be executed as soon as you hit <enter>. Sometimes you might need to enter a multi-line prompt, for example if you need to paste in an error message. You can do that using the !multi token, like this:

llm chat -m gpt-4
Chatting with gpt-4
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
> !multi custom-end
 Explain this error:

   File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

 !end custom-end

llm chat also supports system prompts and templates. If you want to chat with a sentient cheesecake, try this:

llm chat -m gpt-3.5-turbo --system '
You are a stereotypical sentient cheesecake with strong opinions
who always talks about cheesecake'

You can save those as templates too:

llm --system 'You are a stereotypical sentient cheesecake with
strong opinions who always talks about cheesecake' --save cheesecake -m gpt-4
llm chat -t cheesecake

For more options, see the llm chat documentation.

Get involved

My ambition for LLM is for it to provide the easiest way to try out new models, both full-sized Large Language Models and now embedding models such as CLIP.

I’m not going to write all of these plugins myself!

If you want to help out, please come and say hi in the #llm Discord channel.

LLM now provides tools for working with embeddings 16 days ago

LLM is my Python library and command-line tool for working with language models. I just released LLM 0.9 with a new set of features that extend LLM to provide tools for working with embeddings.

This is a long post with a lot of theory and background. If you already know what embeddings are, here’s a TLDR you can try out straight away:

# Install LLM
pip install llm

# If you already installed via Homebrew/pipx you can upgrade like this:
llm install -U llm

# Install the llm-sentence-transformers plugin
llm install llm-sentence-transformers

# Install the all-MiniLM-L6-v2 embedding model
llm sentence-transformers register all-MiniLM-L6-v2

# Generate and store embeddings for every README.md in your home directory, recursively
llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'
  # Add --store to store the text content as well

# Run a similarity search for "sqlite" against those embeddings
llm similar readmes -c sqlite

For everyone else, read on and the above example should hopefully all make sense.

Embeddings

Embeddings are a fascinating concept within the larger world of language models.

I explained embeddings in my recent talk, Making Large Language Models work for you. The relevant section of the slides and transcript is here, or you can jump to that section on YouTube.

An embedding model lets you take a string of text—a word, sentence, paragraph or even a whole document—and turn that into an array of floating point numbers called an embedding vector.

On the left is a text post from one of my sites: Storing and serving related documents with openai-to-sqlite and embeddings. An arrow points to a huge JSON array on the right, with the label 1536 floating point numbers.

A model will always produce the same length of array—1,536 numbers for the OpenAI embedding model, 384 for all-MiniLM-L6-v2—but the array itself is inscrutable. What are you meant to do with it?

The answer is that you can compare them. I like to think of an embedding vector as a location in 1,536-dimensional space. The distance between two vectors is a measure of how semantically similar they are in meaning, at least according to the model that produced them.

A location in 1,536-dimensional space: a 3D plot with 400 red dots arranged randomly across three axes.

“One happy dog” and “A playful hound” will end up close together, even though they don’t share any keywords. The embedding vector represents the language model’s interpretation of the meaning of the text.
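
Comparing two vectors usually means cosine similarity. Here is a minimal pure-Python sketch of that calculation (numpy or a vector index would do this far more efficiently, but the idea is the same):

import math

def cosine_similarity(a, b):
    # Scores near 1.0 mean the two vectors point in the same direction,
    # i.e. the two texts are semantically similar
    dot = sum(x * y for x, y in zip(a, b))
    magnitude = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / magnitude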

Things you can do with embeddings include:

  1. Find related items. I use this on my TIL site to display related articles, as described in Storing and serving related documents with openai-to-sqlite and embeddings.
  2. Build semantic search. As shown above, an embeddings-based search engine can find content relevant to the user’s search term even if none of the keywords match.
  3. Implement retrieval augmented generation—the trick where you take a user’s question, find relevant documentation in your own corpus and use that to get an LLM to spit out an answer. More on that here.
  4. Clustering: you can find clusters of nearby items and identify patterns in a corpus of documents.
  5. Classification: calculate the embedding of a piece of text and compare it to pre-calculated “average” embeddings for different categories.
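
That last classification trick can be sketched in a few lines. This assumes you already have embedding vectors for a handful of labelled examples, and reuses a cosine_similarity() function like the one sketched above:

def average(vectors):
    # Element-wise mean of a list of equal-length embedding vectors
    return [sum(values) / len(vectors) for values in zip(*vectors)]

def classify(vector, category_examples):
    # category_examples maps a label to a list of embedding vectors for that
    # label; return the label whose average vector is most similar
    averages = {label: average(vs) for label, vs in category_examples.items()}
    return max(averages, key=lambda label: cosine_similarity(vector, averages[label]))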

LLM’s new embedding features

My goal with LLM is to provide a plugin-driven abstraction around a growing collection of language models. I want to make installing, using and comparing these models as easy as possible.

The new release adds several command-line tools for working with embeddings, plus a new Python API for working with embeddings in your own code.

It also adds support for installing additional embedding models via plugins. I’ve released one plugin for this so far: llm-sentence-transformers, which adds support for new models based on the sentence-transformers library.

The example above shows how to use sentence-transformers. LLM also supports API-driven access to the OpenAI ada-002 model.

Here’s how to embed some text using ada-002, assuming you have installed LLM already:

# Set your OpenAI API key
llm keys set openai
# <paste key here>

# Embed some text
llm embed -m ada-002 -c "Hello world"

This will output a huge JSON list of floating point numbers to your terminal. You can add -f base64 (or -f hex) to get that back in a different format, though none of these outputs are instantly useful.

Embeddings are much more interesting when you store them.

LLM already uses SQLite to store prompts and responses. It was a natural fit to use SQLite to store embeddings as well.

Embedding collections

LLM 0.9 introduces the concept of a collection of embeddings. A collection has a name—like readmes—and contains a set of embeddings, each of which has an ID and an embedding vector.

All of the embeddings in a collection are generated by the same model, to ensure they can be compared with each other.

The llm embed command can store the vector in the database instead of returning it to the console. Pass it the name of an existing (or to-be-created) collection and the ID to use to store the embedding.

Here we’ll store the embedding for the phrase “Hello world” in a collection called phrases with the ID hello, using that ada-002 embedding model:

llm embed phrases hello -m ada-002 -c "Hello world"

Future phrases can be added without needing to specify the model again, since it is remembered by the collection:

llm embed phrases goodbye -c "Goodbye world"

The llm embed-db collections command shows a list of collections:

phrases: ada-002
  2 embeddings
readmes: sentence-transformers/all-MiniLM-L6-v2
  16796 embeddings

The data is stored in SQLite tables with the following schema:

CREATE TABLE [collections] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT,
   [model] TEXT
);
CREATE TABLE "embeddings" (
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT,
   [embedding] BLOB,
   [content] TEXT,
   [content_hash] BLOB,
   [metadata] TEXT,
   [updated] INTEGER,
   PRIMARY KEY ([collection_id], [id])
);

CREATE UNIQUE INDEX [idx_collections_name]
    ON [collections] ([name]);
CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);

By default this is the SQLite database at the location revealed by llm embed-db path, but you can pass --database my-embeddings.db to various LLM commands to use a different database.

Each embedding vector is stored as a binary BLOB in the embedding column, consisting of those floating point numbers packed together as 32 bit floats.
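
Packing and unpacking that format looks roughly like this. This is a sketch of the idea rather than LLM's exact implementation; in real code you would use the llm.encode() and llm.decode() helpers shown later in this post:

import struct

def pack_floats(vector):
    # Pack a list of Python floats into a BLOB of little-endian 32 bit floats
    return struct.pack(f"<{len(vector)}f", *vector)

def unpack_floats(blob):
    # Each 32 bit float occupies 4 bytes
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))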

The content_hash column contains an MD5 hash of the content. This helps avoid re-calculating the embedding (which can cost actual money for API-based embedding models like ada-002) unless the content has changed.

The content column is usually null, but can contain a copy of the original text content if you pass the --store option to the llm embed command.

metadata can contain a JSON object with metadata, if you pass --metadata '{"json": "goes here"}'.

You don’t have to pass content using -c—you can instead pass a file path using the -i/--input option:

llm embed docs llm-setup -m ada-002 -i llm/docs/setup.md

Or pipe things to standard input like this:

cat llm/docs/setup.md | llm embed docs llm-setup -m ada-002 -i -

Embedding similarity search

Once you’ve built a collection, you can search for similar embeddings using the llm similar command.

The -c "term" option will embed the text you pass in using the embedding model for the collection and use that as the comparison vector:

llm similar readmes -c sqlite

You can also pass the ID of an object in that collection to use that embedding instead. This gets you related documents, for example:

llm similar readmes sqlite-utils/README.md

The output from this command is currently newline-delimited JSON.

Embedding in bulk

The llm embed command embeds a single string at a time. llm embed-multi is much more powerful: you can feed it a CSV or JSON file, a SQLite database, or even have it read from a directory of files in order to embed multiple items at once.

Many embedding models are optimized for batch operations, so embedding multiple items at a time can provide a significant speed boost.

The embed-multi command is described in detail in the documentation. Here are a couple of fun things you can do with it.

First, I’m going to create embeddings for every single one of my Apple Notes.

My apple-notes-to-sqlite tool can export Apple Notes to a SQLite database. I’ll run that first:

apple-notes-to-sqlite notes.db

This took quite a while to run on my machine and generated an 828MB SQLite database containing 6,462 records!

Next, I’m going to embed the content of all of those notes using the sentence-transformers/all-MiniLM-L6-v2 model:

llm embed-multi notes \
  -d notes.db \
  --sql 'select id, title, body from notes' \
  -m sentence-transformers/all-MiniLM-L6-v2

This took around 15 minutes to run, and increased the size of my database by 13MB.

The --sql option here specifies a SQL query. The first column must be an id, then any subsequent columns will be concatenated together to form the content to embed.

In this case the embeddings are written back to the same notes.db database that the content came from.

And now I can run embedding similarity operations against all of my Apple notes!

llm similar notes -d notes.db -c 'ideas for blog posts'

Embedding files in a directory

Let’s revisit the example from the top of this post. In this case, I’m using the --files option to search for files on disk and embed each of them:

llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'

The --files option takes two arguments: a path to a directory and a pattern to match against filenames. In this case I’m searching my home directory recursively for any files named README.md.

Running this command gives me embeddings for all of my README.md files, which I can then search against like this:

llm similar readmes -c sqlite

Embeddings in Python

So far I’ve only covered the command-line tools. LLM 0.9 also introduces a new Python API for working with embeddings.

There are two aspects to this. If you just want to embed content and handle the resulting vectors yourself, you can use llm.get_embedding_model():

import llm

# This takes model IDs and aliases defined by plugins:
model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
vector = model.embed("This is text to embed")

vector will then be a Python list of floating point numbers.

You can serialize that to the same binary format that LLM uses like this:

binary_vector = llm.encode(vector)
# And to deserialize:
vector = llm.decode(binary_vector)

The second aspect of the Python API is the llm.Collection class, for working with collections of embeddings. This example code is quoted from the documentation:

import sqlite_utils
import llm

# This collection will use an in-memory database that will be
# discarded when the Python process exits
collection = llm.Collection("entries", model_id="ada-002")

# Or you can persist the database to disk like this:
db = sqlite_utils.Database("my-embeddings.db")
collection = llm.Collection("entries", db, model_id="ada-002")

# You can pass a model directly using model= instead of model_id=
embedding_model = llm.get_embedding_model("ada-002")
collection = llm.Collection("entries", db, model=embedding_model)

# Store a string in the collection with an ID:
collection.embed("hound", "my happy hound")

# Or to store content and extra metadata:
collection.embed(
    "hound",
    "my happy hound",
    metadata={"name": "Hound"},
    store=True
)

# Or embed things in bulk:
collection.embed_multi(
    [
        ("hound", "my happy hound"),
        ("cat", "my dissatisfied cat"),
    ],
    # Add this to store the strings in the content column:
    store=True,
)

As with everything else in LLM, the goal is that anything you can do with the CLI can be done with the Python API, and vice-versa.

Clustering with llm-cluster

Another interesting application of embeddings is that you can use them to cluster content—identifying patterns in a corpus of documents.

I’ve started exploring this area with a new plugin, called llm-cluster.

You can install it like this:

llm install llm-cluster

Let’s create a new collection using data pulled from GitHub. I’m going to import all of the LLM issues from the GitHub API, using my paginate-json tool:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --store

Running this gives me an issues.db SQLite database with 218 embeddings contained in a collection called llm-issues.

Now let’s try out the llm-cluster command, requesting ten clusters from that collection:

llm cluster llm-issues --database issues.db 10

The output from this command, truncated, looks like this:

[
  {
    "id": "0",
    "items": [
      {
        "id": "1784149135",
        "content": "Tests fail with pydantic 2"
      },
      {
        "id": "1837084995",
        "content": "Allow for use of Pydantic v1 as well as v2."
      },
      {
        "id": "1857942721",
        "content": "Get tests passing against Pydantic 1"
      }
    ]
  },
  {
    "id": "1",
    "items": [
      {
        "id": "1724577618",
        "content": "Better ways of storing and accessing API keys"
      },
      {
        "id": "1772024726",
        "content": "Support for `-o key value` options such as `temperature`"
      },
      {
        "id": "1784111239",
        "content": "`--key` should be used in place of the environment variable"
      }
    ]
  },
  {
    "id": "8",
    "items": [
      {
        "id": "1835739724",
        "content": "Bump the python-packages group with 1 update"
      },
      {
        "id": "1848143453",
        "content": "Python library support for adding aliases"
      },
      {
        "id": "1857268563",
        "content": "Bump the python-packages group with 1 update"
      }
    ]
  }
]

These look pretty good! But wouldn’t it be neat if we had a snappy title for each one?

The --summary option can provide exactly that, by piping the members of each cluster through a call to another LLM in order to generate a useful summary.

llm cluster llm-issues --database issues.db 10 --summary

This uses gpt-3.5-turbo to generate a summary for each cluster, with this default prompt:

Short, concise title for this cluster of related documents.

The results I got back are pretty good, including:

  • Template Storage and Management Improvements
  • Package and Dependency Updates and Improvements
  • Adding Conversation Mechanism and Tools

I tried the same thing using a Llama 2 model running on my own laptop, with a custom prompt:

llm cluster llm-issues --database issues.db 10 \
  --summary --model mlc-chat-Llama-2-13b-chat-hf-q4f16_1 \
  --prompt 'Concise title for this cluster of related documents, just return the title'

I didn’t quite get what I wanted! Llama 2 is proving a lot harder to prompt, so each cluster came back with something that looked like this:

Sure! Here’s a concise title for this cluster of related documents:

“Design Improvements for the Neat Prompt System”

This title captures the main theme of the documents, which is to improve the design of the Neat prompt system. It also highlights the focus on improving the system’s functionality and usability

llm-cluster only took a few hours to throw together, which I’m seeing as a positive indicator that the LLM library is developing in the right direction.

Future plans

The two future features I’m most excited about are indexing and chunking.

Indexing

The llm similar command and collection.similar() Python method currently use effectively the slowest brute force approach possible: calculate the cosine distance between the input vector and every other embedding in the collection, then sort the results.

This works fine for collections with a few hundred items, but will start to suffer for collections of 100,000 or more.

There are plenty of potential ways of speeding this up: you can run a vector index like FAISS or hnswlib, use a database extension like sqlite-vss or pgvector, or turn to a hosted vector database like Pinecone or Milvus.

With this many potential solutions, the obvious answer for LLM is to address this with plugins.

I’m still thinking through the details, but the core idea is that users should be able to define an index against one or more collections, and LLM will then coordinate updates to that index. These may not happen in real-time—some indexes can be expensive to rebuild, so there are benefits to applying updates in batches.

I experimented with FAISS earlier this year in datasette-faiss. That’s likely to be the base for my first implementation.

The embeddings table has an updated timestamp column to support this use-case—so indexers can run against just the items that have changed since the last indexing run.

Follow issue #216 for updates on this feature.

Chunking

When building an embeddings-based search engine, the hardest challenge is deciding how best to “chunk” the documents.

Users will type in short phrases or questions. The embedding for a four word question might not necessarily map closely to the embedding of a thousand word article, even if the article itself should be a good match for that query.

To maximize the chance of returning the most relevant content, we need to be smarter about what we embed.

I’m still trying to get a good feeling for the strategies that make sense here. Some that I’ve seen include:

  • Split a document up into fixed length shorter segments.
  • Split into segments, but include a ~10% overlap with the previous and next segments, to reduce problems caused by words and sentences being split in a way that disrupts their semantic meaning.
  • Split by sentence, using NLP techniques.
  • Split into higher level sections, based on things like document headings.

Then there are more exciting, LLM-driven approaches:

  • Generate an LLM summary of a document and embed that.
  • Ask an LLM “What questions are answered by the following text?” and then embed each of the resulting questions!

It’s possible to try out these different techniques using LLM already: write code that does the splitting, then feed the results to Collection.embed_multi() or llm embed-multi.
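
Here is a minimal sketch of the fixed-length-with-overlap strategy, feeding the chunks into a collection via embed_multi(). The file name, collection name, chunk size and overlap are arbitrary placeholders:

import llm
import sqlite_utils

def chunks(text, size=800, overlap=80):
    # Yield fixed-length character chunks with a small overlap between them
    step = size - overlap
    for start in range(0, len(text), step):
        yield text[start:start + size]

db = sqlite_utils.Database("chunks.db")
collection = llm.Collection("article-chunks", db, model_id="ada-002")
document = open("article.txt").read()
collection.embed_multi(
    [(f"article:{i}", chunk) for i, chunk in enumerate(chunks(document))],
    store=True,
)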

But... it would be really cool if LLM could split documents for you—with the splitting techniques themselves defined by plugins, to make it easy to try out new approaches.

Get involved

It should be clear by now that the potential scope of the LLM project is enormous. I’m trying to use plugins to tie together an enormous and rapidly growing ecosystem of models and techniques into something that’s as easy for people to work with and build on as possible.

There are plenty of ways you can help!

  • Join the #llm Discord to talk about the project.
  • Try out plugins and run different models with them. There are 12 plugins already, and several of those can be used to run dozens if not hundreds of models (llm-mlc, llm-gpt4all and llm-llama-cpp in particular). I’ve hardly scratched the surface of these myself, and I’m testing exclusively on Apple Silicon. I’m really keen to learn more about which models work well, which models don’t and which perform the best on different hardware.
  • Try building a plugin for a new model. My dream here is that every significant Large Language Model will have an LLM plugin that makes it easy to install and use.
  • Build stuff using LLM and let me know what you’ve built. Nothing fuels an open source project more than stories of cool things people have built with it.

Datasette 1.0a4 and 1.0a5, plus weeknotes 21 days ago

Two new alpha releases of Datasette, plus a keynote at WordCamp, a new LLM release, two new LLM plugins and a flurry of TILs.

Datasette 1.0a5

Released this morning, Datasette 1.0a5 has some exciting new changes driven by Datasette Cloud and the ongoing march towards Datasette 1.0.

Alex Garcia is working with me on Datasette Cloud and Datasette generally, generously sponsored by Fly.

Two of the changes in 1.0a5 were driven by Alex:

New datasette.yaml (or .json) configuration file, which can be specified using datasette -c path-to-file. The goal here is to consolidate settings, plugin configuration, permissions, canned queries, and other Datasette configuration into a single file, separate from metadata.yaml. The legacy settings.json config file used for Configuration directory mode has been removed, and datasette.yaml has a "settings" section where the same settings key/value pairs can be included. In a future alpha release, more configuration such as plugins/permissions/canned queries will be moved to the datasette.yaml file. See #2093 for more details.

Right from the very start of the project, Datasette has supported specifying metadata about databases—sources, licenses, etc.—as a metadata.json file that can be passed to Datasette like this:

datasette data.db -m metadata.json

Over time, the purpose and uses of that file have expanded in all kinds of different directions. It can be used for plugin settings, to set preferences for a table (default page size, default facets, etc.), and even to configure access permissions for who can view what.

The name metadata.json is entirely inappropriate for what the file actually does. It’s a mess.

I’ve always had a desire to fix this before Datasette 1.0, but it never quite got high enough up the priority list for me to spend time on it.

Alex expressed interest in fixing it, and has started to put a plan into motion for cleaning it up.

More details in the issue.

The Datasette _internal database has had some changes. It no longer shows up in the datasette.databases list by default, and is now instead available to plugins using the datasette.get_internal_database() method. Plugins are invited to use this as a private database to store configuration, settings and secrets that should not be made visible through the default Datasette interface. Users can pass the new --internal internal.db option to persist that internal database to disk. (#2157).

This was the other initiative driven by Alex. In working on Datasette Cloud we realized that it’s actually quite common for plugins to need somewhere to store data that shouldn’t necessarily be visible to regular users of a Datasette instance—things like tokens created by datasette-auth-tokens, or the progress bar mechanism used by datasette-upload-csvs.

Alex pointed out that the existing _internal database for Datasette could be expanded to cover these use-cases as well. #2157 has more details on how we agreed this should work.
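
As a rough sketch of how a plugin might use this, based on my reading of the new API (check #2157 and the plugin documentation for the authoritative version), a startup hook could create its own private table in the internal database:

from datasette import hookimpl

@hookimpl
def startup(datasette):
    async def create_tables():
        internal_db = datasette.get_internal_database()
        await internal_db.execute_write(
            "create table if not exists my_plugin_state (key text primary key, value text)"
        )
    return create_tables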

The other changes in 1.0a5 were driven by me:

When restrictions are applied to API tokens, those restrictions now behave slightly differently: applying the view-table restriction will imply the ability to view-database for the database containing that table, and both view-table and view-database will imply view-instance. Previously you needed to create a token with restrictions that explicitly listed view-instance and view-database and view-table in order to view a table without getting a permission denied error. (#2102)

I described finely-grained permissions for access tokens in my annotated release notes for 1.0a2.

They provide a mechanism for creating an API token that’s only allowed to perform a subset of actions on behalf of the user.

In trying these out for Datasette Cloud I came across a nasty usability flaw. You could create a token that was restricted to view-table access for a specific table... and it wouldn’t work. Because the access code for that view would check for view-instance and view-database permission first.

1.0a5 fixes that, by adding logic that says that if a token can view-table that implies it can view-database for the database containing that table, and view-instance for the overall instance.

This change took quite some time to develop, because any time I write code involving permissions I like to also include extremely comprehensive automated tests.

The -s/--setting option can now take dotted paths to nested settings. These will then be used to set or over-ride the same options as are present in the new configuration file. (#2156)

This is a fun little detail inspired by Alex’s configuration work.

I run a lot of different Datasette instances, often on an ad-hoc basis.

I sometimes find it frustrating that to use certain features I need to create a metadata.json (soon to be datasette.yml) configuration file, just to get something to work.

Wouldn’t it be neat if every possible setting for Datasette could be provided both in a configuration file or as command-line options?

That’s what the new --setting option aims to solve. Anything that can be represented as a JSON or YAML configuration can now also be represented as key/value pairs on the command-line.

Here’s an example from my initial issue comment:

datasette \
  -s settings.sql_time_limit_ms 1000 \
  -s plugins.datasette-auth-tokens.manage_tokens true \
  -s plugins.datasette-auth-tokens.manage_tokens_database tokens \
  -s plugins.datasette-ripgrep.path "/home/simon/code-to-search" \
  -s databases.mydatabase.tables.example_table.sort created \
  mydatabase.db tokens.db

Once this feature is complete, the above will behave the same as a datasette.yml file containing this:

plugins:
  datasette-auth-tokens:
    manage_tokens: true
    manage_tokens_database: tokens
  datasette-ripgrep:
    path: /home/simon/code-to-search
databases:
  mydatabase:
    tables:
      example_table:
        sort: created
settings:
  sql_time_limit_ms: 1000

I’ve experimented with ways of turning key/value pairs into nested JSON objects before, with my json-flatten library.

This time I took a slightly different approach. In particular, if you need to pass a nested JSON object (such as an array) which isn’t easily represented using key.nested notation, you can pass it like this instead:

datasette data.db \
  -s plugins.datasette-complex-plugin.configs \
  '{"foo": [1,2,3], "bar": "baz"}'

Which would convert to the following equivalent YAML:

plugins:
  datasette-complex-plugin:
    configs:
      foo:
        - 1
        - 2
        - 3
      bar: baz

These examples don’t quite work yet, because the plugin configuration hasn’t migrated to datasette.yml—but it should work for the next alpha.

New --actor '{"id": "json-goes-here"}' option for use with datasette --get to treat the simulated request as being made by a specific actor, see datasette --get. (#2153)

This is a fun little debug helper I built while working on restricted tokens.

The datasette --get /... option is a neat trick that can be used to simulate an HTTP request through the Datasette instance, without even starting a server running on a port.

I use it for things like generating social media card images for my TILs website.

The new --actor option lets you add a simulated actor to the request, which is useful for testing out things like configured authentication and permissions.

A security fix in Datasette 1.0a4

Datasette 1.0a4 has a security fix: I realized that the API explorer I added in the 1.0 alpha series was exposing the names of databases and tables (though not their actual content) to unauthenticated users, even for Datasette instances that were protected by authentication.

I issued a GitHub security advisory for this: Datasette 1.0 alpha series leaks names of databases and tables to unauthenticated users, which has since been issued a CVE, CVE-2023-40570—GitHub is a CVE Numbering Authority which means their security team are trusted to review such advisories and issue CVEs where necessary.

I expect the impact of this vulnerability to be very small: outside of Datasette Cloud very few people are running the Datasette 1.0 alphas on the public internet, and it’s possible that the set of those users who are also authenticating their instances to provide authenticated access to private data—especially where just the database and table names of that data is considered sensitive—is an empty set.

Datasette Cloud itself has detailed access logs primarily to help evaluate this kind of threat. I’m pleased to report that those logs showed no instances of an unauthenticated user accessing the pages in question prior to the bug being fixed.

A keynote at WordCamp US

Last Friday I gave a keynote at WordCamp US on the subject of Large Language Models.

I used MacWhisper and my annotated presentation tool to turn that into a detailed transcript, complete with additional links and context: Making Large Language Models work for you.

llm-openrouter and llm-anyscale-endpoints

I released two new plugins for LLM, which lets you run large language models either locally or via APIs, as both a CLI tool and a Python library.

Both plugins provide access to API-hosted models:

  • llm-openrouter provides access to models hosted by OpenRouter. Of particular interest here is Claude—I’m still on the waiting list for the official Claude API, but in the meantime I can pay for access to it via OpenRouter and it works just fine. Claude has a 100,000 token context, making it a really great option for working with larger documents.
  • llm-anyscale-endpoints is a similar plugin that instead works with Anyscale Endpoints. Anyscale provide Llama 2 and Code Llama at extremely low prices—between $0.25 and $1 per million tokens, depending on the model.

These plugins were very quick to develop.

Both OpenRouter and Anyscale Endpoints provide API endpoints that emulate the official OpenAI APIs, including the way they handle streaming tokens.

LLM already has code for talking to those endpoints via the openai Python library, which can be re-pointed to another backend using the officially supported api_base parameter.
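
To illustrate the pattern (this is not the plugins' actual code), here is roughly what re-pointing the pre-1.0 openai Python library at a compatible endpoint looks like. The base URL and model name are placeholders; real values come from the provider's documentation:

import openai

openai.api_base = "https://example-compatible-provider.com/v1"  # placeholder URL
openai.api_key = "YOUR-API-KEY"

response = openai.ChatCompletion.create(
    model="provider/some-llama-2-model",  # placeholder model ID
    messages=[{"role": "user", "content": "Three names for a pet pelican"}],
)
print(response["choices"][0]["message"]["content"])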

So the core code for the plugins ended up being less than 30 lines each: llm_openrouter.py and llm_anyscale_endpoints.py.

llm 0.8

I shipped LLM 0.8 a week and a half ago, with a bunch of small changes.

The most significant of these was a change to the default llm logs output, which shows the logs (recorded in SQLite) of the previous prompts and responses you have sent through the tool.

This output used to be JSON. It’s now Markdown, which is both easier to read and can be pasted into GitHub Issue comments or Gists or similar to share the results with other people.

The release notes for 0.8 describe all of the other improvements.

sqlite-utils 3.35

The 3.35 release of sqlite-utils was driven by LLM.

sqlite-utils has a mechanism for adding foreign keys to an existing table—something that’s not supported by SQLite out of the box.

That implementation used to work using a deeply gnarly hack: it would switch the sqlite_master table over to being writable (using PRAGMA writable_schema = 1), update that schema in place to reflect the new foreign keys and then toggle writable_schema = 0 back again.

It turns out there are Python installations out there—most notably the system Python on macOS—which completely disable the ability to write to that table, no matter what the status of the various pragmas.

I was getting bug reports from LLM users who were running into this. I realized that I had a solution for this mostly implemented already: the sqlite-utils transform() method, which can apply all sorts of complex schema changes by creating a brand new table, copying across the old data and then renaming it to replace the old one.

So I dropped the old writable_schema mechanism entirely in favour of .transform()—it’s slower, because it requires copying the entire table, but it doesn’t have weird edge-cases where it doesn’t work.
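
If you are using the Python library, adding a foreign key the new transform-based way looks something like this, using the same places/country example as the command below:

import sqlite_utils

db = sqlite_utils.Database("my_database.db")
# Rebuilds the places table by copying the data across, rather than
# editing sqlite_master in place
db["places"].add_foreign_key("country_id", "country", "id")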

Since sqlite-utils supports plugins now, I realized I could set a healthy precedent by making the removed feature available in a new plugin: sqlite-utils-fast-fks, which provides the following command for adding foreign keys the fast, old way (provided your installation supports it):

sqlite-utils install sqlite-utils-fast-fks
sqlite-utils fast-fks my_database.db places country_id country id

I’ve always admired how jQuery uses plugins to keep old features working on an opt-in basis after major version upgrades. I’m excited to be able to apply the same pattern for sqlite-utils.

paginate-json 1.0

paginate-json is a tiny tool I first released a few years ago to solve a very specific problem.

There’s a neat pattern in some JSON APIs where the HTTP link header is used to indicate subsequent pages of results.

The best example I know of this is the GitHub API. Run this to see what it looks like (here I’m using the events API):

curl -i \
  https://api.github.com/users/simonw/events

Here’s a truncated example of the output:

HTTP/2 200 
server: GitHub.com
content-type: application/json; charset=utf-8
link: <https://api.github.com/user/9599/events?page=2>; rel="next", <https://api.github.com/user/9599/events?page=9>; rel="last"

[
  {
    "id": "31467177730",
    "type": "PushEvent",

The link header there specifies a next and last URL that can be used for pagination.

To fetch all available items, you can follow the next link repeatedly until it runs out.
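
That loop is exactly what paginate-json automates. Following the header yourself in Python with the requests library looks roughly like this:

import requests

def fetch_all(url):
    items = []
    while url:
        response = requests.get(url)
        items.extend(response.json())
        # requests parses the link header into response.links for us
        url = response.links.get("next", {}).get("url")
    return items

events = fetch_all("https://api.github.com/users/simonw/events")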

My paginate-json tool can follow these links for you. If you run it like this:

paginate-json \
  https://api.github.com/users/simonw/events

It will output a single JSON array consisting of the results from every available page.

The 1.0 release adds a bunch of small features, but also marks my confidence in the stability of the design of the tool.

The Datasette JSON API has supported link pagination for a while—you can use paginate-json with Datasette like this, taking advantage of the new --key option to paginate over the array of objects returned in the "rows" key:

paginate-json \
  'https://datasette.io/content/pypi_releases.json?_labels=on' \
  --key rows \
  --nl

The --nl option here causes paginate-json to output the results as newline-delimited JSON, instead of bundling them together into a JSON array.

Here’s how to use sqlite-utils insert to insert that data directly into a fresh SQLite database:

paginate-json \
  'https://datasette.io/content/pypi_releases.json?_labels=on' \
  --key rows \
  --nl | \
    sqlite-utils insert data.db releases - \
      --nl --flatten


Making Large Language Models work for you 24 days ago

I gave an invited keynote at WordCamp 2023 in National Harbor, Maryland on Friday.

I was invited to provide a practical take on Large Language Models: what they are, how they work, what you can do with them and what kind of things you can build with them that could not be built before.

As a long-time fan of WordPress and the WordPress community, which I think represents the very best of open source values, I was delighted to participate.

You can watch my talk on YouTube here. Here are the slides and an annotated transcript, prepared using the custom tool I described in this post.

Datasette Cloud, Datasette 1.0a3, llm-mlc and more one month ago

Datasette Cloud is now a significant step closer to general availability. The Datasette 1.0a3 alpha release is out, with a mostly finalized JSON format for 1.0. Plus new plugins for LLM and sqlite-utils and a flurry of things I’ve learned.

Datasette Cloud

Yesterday morning we unveiled the new Datasette Cloud blog, and kicked things off there with two posts:

Here’s a screenshot of the interface for creating a new private space in Datasette Cloud:

Create a space: “A space is a private area where you can import, explore and analyze data and share it with invited collaborators.” The form asks for a space name, a subdomain (ending in .datasette.cloud) and a region, with a note that your data will be hosted in that region, so pick somewhere geographically close to you for optimal performance.

datasette-write-ui is particularly notable because it was written by Alex Garcia, who is now working with me to help get Datasette Cloud ready for general availability.

Alex’s work on the project is being supported by Fly.io, in a particularly exciting form of open source sponsorship. Datasette Cloud is already being built on Fly, but as part of Alex’s work we’ll be extensively documenting what we learn along the way about using Fly to build a multi-tenant SaaS platform.

Alex has some very cool work with Fly’s Litestream in the pipeline which we hope to talk more about shortly.

Since this is my first time building a blog from scratch in quite a while, I also put together a new TIL on Building a blog in Django.

The Datasette Cloud work has been driving a lot of improvements to other parts of the Datasette ecosystem, including improvements to datasette-upload-dbs and the other big news this week: Datasette 1.0a3.

Datasette 1.0a3

Datasette 1.0 is the first version of Datasette that will be marked as “stable”: if you build software on top of Datasette I want to guarantee as much as possible that it won’t break until Datasette 2.0, which I hope to avoid ever needing to release.

The three big aspects of this are:

The 1.0 alpha 3 release primarily focuses on the JSON support. There’s a new, much more intuitive default shape for both the table and the arbitrary query pages, which looks like this:

{
  "ok": true,
  "rows": [
    {
      "id": 3,
      "name": "Detroit"
    },
    {
      "id": 2,
      "name": "Los Angeles"
    },
    {
      "id": 4,
      "name": "Memnonia"
    },
    {
      "id": 1,
      "name": "San Francisco"
    }
  ],
  "truncated": false
}

This is a huge improvement on the old format, which featured a vibrant mess of top-level keys and served the rows up as an array-of-arrays, leaving the user to figure out which column was which by matching against "columns".

The new format is documented here. I wanted to get this in place as soon as possible for Datasette Cloud (which is running this alpha), since I don’t want to risk paying customers building integrations that would later break due to 1.0 API changes.
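Since the new shape is just "ok", "rows" and "truncated", consuming it from code is straightforward. Here's a rough Python sketch (the table URL is hypothetical):

import json
from urllib.request import urlopen

# Hypothetical table URL; a table .json endpoint on the new alpha returns this shape
url = "https://example.datasette.cloud/data/cities.json"
data = json.load(urlopen(url))
if data["ok"]:
    for row in data["rows"]:
        print(row["id"], row["name"])
    # data["truncated"] signals whether the row limit cut the results short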

llm-mlc

My LLM tool provides a CLI utility and Python library for running prompts through Large Language Models. I added plugin support to it a few weeks ago, so now it can support additional models through plugins—including a variety of models that can run directly on your own device.

For a while now I’ve been trying to work out the easiest recipe to get a Llama 2 model running on my M2 Mac with GPU acceleration.

I finally figured that out the other week, using the excellent MLC Python library.

I built a new plugin for LLM called llm-mlc. I think this may now be one of the easiest ways to run Llama 2 on an Apple Silicon Mac with GPU acceleration.

Here are the steps to try it out. First, install LLM—which is easiest with Homebrew:

brew install llm

If you have a Python 3 environment you can run pip install llm or pipx install llm instead.

Next, install the new plugin:

llm install llm-mlc

There’s an additional installation step which I’ve not yet been able to automate fully—on an M1/M2 Mac run the following:

llm mlc pip install --pre --force-reinstall \
  mlc-ai-nightly \
  mlc-chat-nightly \
  -f https://mlc.ai/wheels

Instructions for other platforms can be found here.

Now run this command to finish the setup (which configures git-lfs ready to download the models):

llm mlc setup

And finally, you can download the Llama 2 model using this command:

llm mlc download-model Llama-2-7b-chat --alias llama2

And run a prompt like this:

llm -m llama2 'five names for a cute pet ferret'

It’s still more steps than I’d like, but it seems to be working for people!
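Since LLM is also a Python library, once the plugin and model are installed you should be able to run the same prompt from Python too. A rough sketch using the llm.get_model() API:

import llm

# Load the model registered by the llm-mlc plugin, using the alias set above
model = llm.get_model("llama2")
response = model.prompt("five names for a cute pet ferret")
print(response.text())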

As always, my goal for LLM is to grow a community of enthusiasts who write plugins like this to help support new models as they are released. That’s why I put a lot of effort into building this tutorial about Writing a plugin to support a new model.

Also out now: llm 0.7, which mainly adds a new mechanism for adding custom aliases to existing models:

llm aliases set turbo gpt-3.5-turbo-16k
llm -m turbo 'An epic Greek-style saga about a cheesecake that builds a SQL database from scratch'

openai-to-sqlite and embeddings for related content

A smaller release this week: openai-to-sqlite 0.4, an update to my CLI tool for loading data from various OpenAI APIs into a SQLite database.

My inspiration for this release was a desire to add better related content to my TIL website.

Short version: I did exactly that! Each post on that site now includes a list of related posts, generated using OpenAI embeddings to find other posts that are semantically similar to it.

I wrote up a full TIL about how that all works: Storing and serving related documents with openai-to-sqlite and embeddings—scroll to the bottom of that post to see the new related content in action.
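The core trick behind the related content is cosine similarity: embed every post, then for a given post rank all of the others by how close their vectors are. A toy sketch of that calculation (the vectors and post IDs here are made up; real OpenAI embeddings have 1,536 dimensions):

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up embeddings keyed by post ID, purely for illustration
embeddings = {
    "related-documents": [0.1, 0.3, 0.5],
    "django-blog": [0.2, 0.1, 0.4],
    "bash-tricks": [0.9, 0.0, 0.1],
}
current = embeddings["related-documents"]
scores = sorted(
    ((cosine_similarity(current, vector), post_id)
     for post_id, vector in embeddings.items()
     if post_id != "related-documents"),
    reverse=True,
)
print(scores)  # highest similarity first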

I’m fascinated by embeddings. They’re not difficult to run using locally hosted models either—I hope to add a feature to LLM to help with that soon.

Getting creative with embeddings by Amelia Wattenberger is a great example of some of the more interesting applications they can be put to.

sqlite-utils-jq

A tiny new plugin for sqlite-utils, inspired by this Hacker News comment and written mainly as an excuse for me to exercise that new plugins framework a little more.

sqlite-utils-jq adds a new jq() function which can be used to execute jq programs as part of a SQL query.

Install it like this:

sqlite-utils install sqlite-utils-jq

Now you can do things like this:

sqlite-utils memory "select jq(:doc, :expr) as result" \
  -p doc '{"foo": "bar"}' \
  -p expr '.foo'
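Under the hood, a plugin like this just registers a custom SQL function when sqlite-utils opens a database connection. Here's a rough sketch of how that can work using the prepare_connection() plugin hook and the jq Python package (not necessarily the exact code in the plugin):

import json
import jq
from sqlite_utils import hookimpl

def _jq(doc, expression):
    # Run the jq program against the JSON document and return the result as JSON
    return json.dumps(jq.compile(expression).input(json.loads(doc)).first())

@hookimpl
def prepare_connection(conn):
    # Register jq() as a two-argument SQL function on this connection
    conn.create_function("jq", 2, _jq)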

You can also use it in combination with sqlite-utils-litecli to run that new function as part of an interactive shell:

sqlite-utils install sqlite-utils-litecli
sqlite-utils litecli data.db
# ...
Version: 1.9.0
Mail: https://groups.google.com/forum/#!forum/litecli-users
GitHub: https://github.com/dbcli/litecli
data.db> select jq('{"foo": "bar"}', '.foo')
+------------------------------+
| jq('{"foo": "bar"}', '.foo') |
+------------------------------+
| "bar"                        |
+------------------------------+
1 row in set
Time: 0.031s

Other entries this week

How I make annotated presentations describes the process I now use to create annotated presentations like this one for Catching up on the weird world of LLMs (now up to over 17,000 views on YouTube!) using a new custom annotation tool I put together with the help of GPT-4.

A couple of highlights from my TILs:

Releases this week

TIL this week

Elsewhere

19th September 2023

  • The WebAssembly Go Playground (via) Jeff Lindsay has a full Go 1.21.1 compiler running entirely in the browser. #19th September 2023, 7:53 pm
  • LLM 0.11. I released LLM 0.11 with support for the new gpt-3.5-turbo-instruct completion model from OpenAI.

    The most interesting feature of completion models is the option to request “log probabilities” from them, where each token returned is accompanied by up to 5 alternatives that were considered, along with their scores. #19th September 2023, 3:28 pm

18th September 2023

  • Note that there have been no breaking changes since the [SQLite] file format was designed in 2004. The changes shows in the version history above have all be one of (1) typo fixes, (2) clarifications, or (3) filling in the “reserved for future extensions” bits with descriptions of those extensions as they occurred.

    D. Richard Hipp # 18th September 2023, 6:02 pm

16th September 2023

  • Notes on using a single-person Mastodon server. Julia Evans’ experiences running a single-person Mastodon server (on masto.host—the same host I use for my own) pretty much exactly match what I’ve learned so far as well. The biggest disadvantage is the missing replies issue, where your server only shows replies to posts that come from people who you follow—so it’s easy to reply to something in a way that duplicates other replies that are invisible to you. #16th September 2023, 10:35 pm
  • How CPython Implements and Uses Bloom Filters for String Processing. Fascinating dive into Python string internals by Abhinav Upadhyay. It turns out CPython uses very simple bloom filters in several parts of the core string methods, to solve problems like splitting on newlines, where there are actually eight codepoints that could represent a newline: a tiny bloom filter can rule a character out in a single operation, and the full eight comparisons only run if that first check can’t rule it out. A minimal sketch of the idea is below. #16th September 2023, 10:32 pm
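
    To illustrate the trick, here is a toy Python sketch (not CPython's actual code, and the set of line-break characters is just illustrative):

    LINE_BREAK_CANDIDATES = "\n\r\x0b\x0c\x1c\x1d\x1e\x85"  # illustrative set
    BLOOM_WIDTH = 64  # bits in the mask

    LINE_BREAK_MASK = 0
    for ch in LINE_BREAK_CANDIDATES:
        LINE_BREAK_MASK |= 1 << (ord(ch) & (BLOOM_WIDTH - 1))

    def might_be_line_break(ch):
        # One bit test: False means "definitely not a line break"
        return bool(LINE_BREAK_MASK & (1 << (ord(ch) & (BLOOM_WIDTH - 1))))

    def is_line_break(ch):
        # Only fall back to the full comparisons when the cheap check passes
        return might_be_line_break(ch) and ch in LINE_BREAK_CANDIDATES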

14th September 2023

  • CAISO Grid Status (via) CAISO is the California Independent System Operator, a non-profit managing 80% of California’s electricity flow. This grid status page shows live data about the state of the grid and it’s fascinating: right now (2pm local time) California is running 71.4% on renewables, having peaked at 80% three hours ago. The current fuel mix is 52% solar, 31% natural gas, 7% each large hydro and nuclear and 2% wind. The charts on this page show how solar turns off overnight and then picks up and peaks during daylight hours. #14th September 2023, 9:08 pm

13th September 2023

  • Introducing datasette-litestream: easy replication for SQLite databases in Datasette. We use Litestream on Datasette Cloud for streaming backups of user data to S3. Alex Garcia extracted out our implementation into a standalone Datasette plugin, which bundles the Litestream Go binary (for the relevant platform) in the package you get when you run “datasette install datasette-litestream”—so now Datasette has a very robust answer to questions about SQLite disaster recovery beyond just the Datasette Cloud platform. #13th September 2023, 7:28 pm
  • Some notes on Local-First Development (via) Local-First is the name that has been coined by the community of people who are interested in building apps where data is manipulated in a client application first (mobile, desktop or web) and then continually synchronized with a server, rather than the other way round. This is a really useful review by Kyle Mathews of how the space is shaping up so far—lots of interesting threads to follow here. #13th September 2023, 3:48 am
  • In the long term, I suspect that LLMs will have a significant positive impact on higher education. Specifically, I believe they will elevate the importance of the humanities. [...] LLMs are deeply, inherently textual. And they are reliant on text in a way that is directly linked to the skills and methods that we emphasize in university humanities classes.

    Benjamin Breen # 13th September 2023, 3:40 am

  • Simulating History with ChatGPT (via) Absolutely fascinating new entry in the using-ChatGPT-to-teach genre. Benjamin Breen teaches history at UC Santa Cruz, and has been developing a sophisticated approach to using ChatGPT to play out role-playing scenarios involving different periods of history. His students are challenged to participate in them, then pick them apart—fact-checking details from the scenario and building critiques of the perspectives demonstrated by the language model. There are so many quotable snippets in here, I recommend reading the whole thing. #13th September 2023, 3:36 am

10th September 2023

  • All models on Hugging Face, sorted by downloads (via) I realized this morning that “sort by downloads” against the list of all of the models on Hugging Face can work as a reasonably good proxy for “which of these models are easiest to get running on your own computer”. #10th September 2023, 5:24 pm
  • The AI-assistant wars heat up with Claude Pro, a new ChatGPT Plus rival. I’m quoted in this piece about the new Claude Pro $20/month subscription from Anthropic:

    > Willison has also run into problems with Claude’s morality filter, which has caused him trouble by accident: “I tried to use it against a transcription of a podcast episode, and it processed most of the text before—right in front of my eyes—it deleted everything it had done! I eventually figured out that they had started talking about bomb threats against data centers towards the end of the episode, and Claude effectively got triggered by that and deleted the entire transcript.” #10th September 2023, 5:07 pm
  • promptfoo: How to benchmark Llama2 Uncensored vs. GPT-3.5 on your own inputs. promptfoo is a CLI and library for “evaluating LLM output quality”. This tutorial in their documentation about using it to compare Llama 2 to gpt-3.5-turbo is a good illustration of how it works: it uses YAML files to configure the prompts, and more YAML to define assertions such as “not-icontains: AI language model”. #10th September 2023, 4:19 pm

6th September 2023

  • hubcap.php (via) This PHP script by Dave Hulbert delights me. It’s 24 lines of code that takes a specified goal, then calls my LLM utility on a loop to request the next shell command to execute in order to reach that goal... and pipes the output straight into exec() after a 3s wait so the user can panic and hit Ctrl+C if it’s about to do something dangerous! #6th September 2023, 3:45 pm
  • Using ChatGPT Code Interpreter (aka "Advanced Data Analysis") to analyze your ChatGPT history. I posted a short thread showing how to upload your ChatGPT history to ChatGPT itself, then prompt it with “Build a dataframe of the id, title, create_time properties from the conversations.json JSON array of objects. Convert create_time to a date and plot it daily”. #6th September 2023, 3:42 pm
  • Perplexity: interactive LLM visualization (via) I linked to a video of Linus Lee’s GPT visualization tool the other day. Today he’s released a new version of it that people can actually play with: it runs entirely in a browser, powered by a 120MB version of the GPT-2 ONNX model loaded using the brilliant Transformers.js JavaScript library. #6th September 2023, 3:33 am

5th September 2023

  • Symbex 1.4. New release of my Symbex tool for finding symbols (functions, methods and classes) in a Python codebase. Symbex can now output matching symbols in JSON, CSV or TSV in addition to plain text.

    I designed this feature for compatibility with the new “llm embed-multi” command—so you can now use Symbex to find every Python function in a nested directory and then pipe them to LLM to calculate embeddings for every one of them.

    I tried it on my projects directory and embedded over 13,000 functions in just a few minutes! Next step is to figure out what kind of interesting things I can do with all of those embeddings. #5th September 2023, 5:29 pm
  • A token-wise likelihood visualizer for GPT-2. Linus Lee built a superb visualization to help demonstrate how Large Language Models work, in the form of a video essay where each word is coloured to show how “surprising” it is to the model. It’s worth carefully reading the text in the video as each term is highlighted to get the full effect. #5th September 2023, 3:39 am