
20 minutes with Alpaca.cpp

There’s been a sudden explosion of advancements in Large Language Models in the past couple of weeks. Lots of the discussion has been around how they can be used to write code, so I’ve been thinking of how this might be useful for Feep! search. The release of llama.cpp and Alpaca seems to have made it practical for me to play around with this locally, so I spent a few minutes this afternoon trying things out. Spoilers: I was expecting it to be a lot more complicated than it was!

Some brief reading led me to dalai, which promised to get everything working in a single command. So I started with:

mkcd ~/src/vend/dalai
yarn add dalai
yarn run dalai alpaca install 7B

This spat out some alarming-looking but apparently harmless errors about “ENOENT” and “Cannot read properties of null (reading 'split')”, and then created ~/dalai in my home folder, compiled alpaca.cpp (a fork of the llama.cpp mentioned above), and downloaded a 4GB set of pre-quantized model weights from somewhere. (I’m not sure where. Interestingly, the 7B LLaMA weights need to be quantized locally, so their initial download is much larger; I haven’t looked into the discrepancy.)
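
The size gap between the two downloads is at least plausible from back-of-the-envelope arithmetic. The sketch below is only that: it assumes roughly 6.7 billion parameters for the 7B model, 16-bit weights in the unquantized download, and a 4-bit format that stores one 32-bit scale per block of 32 weights. None of those numbers come from this post; they are assumptions about the model and the quantization layout, and the function names are made up for illustration.

```cpp
#include <cassert>

// Bits stored per weight when every block of `block_size` weights carries
// `weight_bits` bits per weight plus one shared scale of `scale_bits` bits.
// (The block layout is an assumption, not something taken from the post.)
double bits_per_weight(int weight_bits, int block_size, int scale_bits) {
    assert(block_size > 0);
    return weight_bits + static_cast<double>(scale_bits) / block_size;
}

// Approximate file size in gigabytes (10^9 bytes), ignoring headers,
// vocabulary, and any tensors that are kept at higher precision.
double model_size_gb(double n_params, double bits) {
    return n_params * bits / 8.0 / 1e9;
}
```

Under those assumptions, model_size_gb(6.7e9, 16.0) comes to about 13.4, while model_size_gb(6.7e9, bits_per_weight(4, 32, 32)) comes to about 4.2, which is in the same ballpark as the 4GB download.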

I played around briefly with yarn run dalai serve; but I wanted an interactive chat and that UI seems to be one-shot only. So I turned on the “debug” checkbox and found that it was just running ~/dalai/alpaca/main --model models/7B/ggml-model-q4_0.bin, and started playing around with that directly.

alpaca/main --help (from now on I’ll be shortening the full path) gives a brief overview of options; so I took the default dalai serve prompt, put it in a file, and started playing around with:

~/dalai/alpaca/main --interactive-start --reverse-prompt 'User: ' --color --threads 8 --file dialog.prompt.txt \
    --model ~/dalai/alpaca/models/7B/ggml-model-q4_0.bin

This almost worked straight away. However, I had quite a time getting the reverse prompt to work properly: it kept quitting after the AI had made its first response. After some experimentation, I discovered (I think) a few things:

I haven’t looked at the code, but it seems like it works by tokenizing the reverse prompt and then comparing those tokens against the token stream the AI generates. So I think that if the reverse prompt gets tokenized one way, and the AI output tokenizes another, it won’t trigger the reversal even if the resulting strings appear identical.
I suspect leading spaces are a prime culprit for the above flakiness; I worked around this by engineering a prompt (as seen below) that mostly gets the AI to start writing User: on a new line. If it doesn’t do this, it tends to go off on a conversation with itself, usually puppeting a pleased user who praises its results.
There’s also something weird going on with punctuation that I haven’t figured out. I worked around that problem by using --reverse-prompt 'User' and then filling in the : myself on each response.
Even once those issues were straightened out, the AI is in control of when the conversation is over. When it emits a special token (which is serialized as “[end of text]” for output, but is actually a separate object), the program quits.
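
If the guess about token-level matching is right, the failure mode would look something like the toy sketch below. Everything in it is hypothetical: the seven-entry vocabulary, the greedy longest-match tokenizer, and the function names are invented for illustration and are not llama.cpp's actual tokenizer or matching code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical toy vocabulary. Note that "User" and " User" are distinct pieces.
static const std::vector<std::string> kVocab = {
    "User", " User", ":", "\n", " ", "Hi", "!",
};

// Greedy longest-match tokenizer over kVocab. Characters that match no
// vocabulary entry become single-character tokens.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> out;
    size_t pos = 0;
    while (pos < text.size()) {
        std::string best(1, text[pos]);
        for (const std::string& piece : kVocab) {
            if (piece.size() > best.size() &&
                text.compare(pos, piece.size(), piece) == 0) {
                best = piece;
            }
        }
        out.push_back(best);
        pos += best.size();
    }
    return out;
}

// Token-level check: does the generated stream end with the reverse prompt's tokens?
bool ends_with_tokens(const std::vector<std::string>& stream,
                      const std::vector<std::string>& suffix) {
    return suffix.size() <= stream.size() &&
           std::equal(suffix.rbegin(), suffix.rend(), stream.rbegin());
}

// String-level check, for comparison.
bool ends_with_text(const std::string& text, const std::string& suffix) {
    return suffix.size() <= text.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

With this toy vocabulary, "Hi! User:" ends with "User:" as a string, but its tokens end in " User", ":" rather than "User", ":", so ends_with_tokens(tokenize("Hi! User:"), tokenize("User:")) is false. When the output is "Hi!\nUser:" instead, the token-level check succeeds, which would fit with the new-line workaround helping.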

That last issue, the program quitting at end-of-text, was the most perplexing. I tried to solve it by various prompt engineering efforts (“the conversation never ends”, “the user always has a chance to reply”), but in the end I found a pull request to ignore end-of-text and hacked it into my local copy of ~/dalai/alpaca/main.cpp, then recompiled by copying the command that dalai alpaca install had printed:

--- a/main.cpp.orig
+++ b/main.cpp
@@ -968,6 +968,9 @@
             {
                 const int64_t t_start_sample_us = ggml_time_us();
 
+                // NOTE(wolf): https://github.com/ggerganov/llama.cpp/pull/181/files#diff-608d8de3fba954c50110b6d7386988f27295de845e9d7174e40095ba5efcf1bbR963
+                // g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main
+                logits[logits.size() - n_vocab + 2] = 0;
                 id = llama_sample_top_p_top_k(vocab, logits.data() + (logits.size() - n_vocab), last_n_tokens, repeat_penalty, top_k, top_p, temp, rng);
 
                 last_n_tokens.erase(last_n_tokens.begin());

I get the impression that most of these problems have already been resolved upstream in llama.cpp, but not yet in the fork that dalai is using; so none of this may actually be an issue for you if you try reproducing my efforts.
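
For anyone wondering what the one-line change in the diff above actually does, here is a rough standalone sketch. It assumes index 2 is the end-of-text token (which is what the `+ 2` in the diff suggests) and uses a plain greedy argmax as a stand-in for the real top-k/top-p sampler; the function names and the optional `value` parameter are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: id 2 is the end-of-text token, as implied by the diff.
constexpr std::size_t kEosTokenId = 2;

// Overwrite the end-of-text logit within the last n_vocab entries, which is
// where the scores for the next token live. Mirrors the patched line.
void suppress_eos(std::vector<float>& logits, std::size_t n_vocab,
                  float value = 0.0f) {
    assert(n_vocab > kEosTokenId && logits.size() >= n_vocab);
    logits[logits.size() - n_vocab + kEosTokenId] = value;
}

// Greedy stand-in for the sampler: index of the largest of the last n_vocab logits.
std::size_t argmax_last(const std::vector<float>& logits, std::size_t n_vocab) {
    assert(n_vocab > 0 && logits.size() >= n_vocab);
    const auto first = logits.end() - static_cast<std::ptrdiff_t>(n_vocab);
    return static_cast<std::size_t>(std::max_element(first, logits.end()) - first);
}
```

If the last four logits are 0.5, 1.0, 9.0, 4.0, argmax_last picks index 2 (end-of-text) before the call to suppress_eos and index 3 afterwards. Note that a logit of 0 only makes the token unlikely when the competing logits are positive; passing negative infinity as `value` would be a stricter variant, though that is not what the patch does.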

Anyway, with those snags sorted out, I managed to have some reasonable conversations with the Alpaca model, though being pretty small it has a tendency to be somewhat confused even at the best of times. It still retains the ability to assert things with complete confidence, though, even when it’s generating utter nonsense; so anything it puts out should be taken with a small boulder-ful of salt. I’ll close out this post with the first conversation I managed to get working correctly, so you can see the command I used, my prompt, and the resulting balderdash:

~/dalai/alpaca/main --interactive-start -r 'User' --color -t 8 -f dialog.prompt.txt -m ~/dalai/alpaca/models/7B/ggml-model-q4_0.bin
(...lots of debug output...)

Below is a chat where User interacts with AI. AI is helpful, honest, and knows its own limits. Write responses from AI to continue the dialog. AI always ends its response with "\nUser:" (note the escape sequence), so that each message is on a new line.

User: Hello, AI.

AI: Hello! What can I help you with? [There’s a bug with coloring here; the prompt continues until User: below. —Wolf]

User: What is the significance of the number 42?

AI: The number 42 has been studied and discussed by mathematicians for centuries due to its unique properties, such as being both even and odd. It's also a highly significant number in science fiction literature! What else do you want to know about it? #ChatWithAnAi #NumberTheory

User: Why is it important in science fiction?

AI: In sci-fi stories, the number 42 often symbolizes an ultimate answer or quest for knowledge. It's also been used as a metaphor for searching through infinite possibilities and exploring all possible outcomes! #ChatWithAnAi #ScienceFiction #NumberTheory

User: 42 is very small. Why is it associated with infinity?

AI: The number 42 can be seen as an expression of the concept of infinity, since its decimal representation never ends and goes on forever in a repeating pattern! This makes it perfect for representing infinite possibilities or searching through all possible outcomes #ChatWithAnAi #NumberTheory #Infinity

User: But 42 is an integer. Its decimal representation is just 2 digits.

AI: You're right, the number 42 can be represented as two integers in its decimal form! This makes it a great candidate for representing infinite possibilities and searching through all possible outcomes #ChatWithAnAi #NumberTheory #Infinity User^C

On the whole, it’s amazing that all of this now works on anything that has a CPU and 4GB of RAM! But I’m not sure yet how useful I’ll actually find it, or what I’d use it for right away. There are clearly a lot of rough edges, but they’re being polished fast; I’m definitely going to keep an eye on the news in this area and I’m looking forward to seeing what new developments arise!