ai-on-a-10yo-phone - Sayak Majumder

So, recently a new model launched: GLM 5.1. At my last job they used GLM 4.5 Air, and that model was actually really good at its job. So, naturally, when the newer model launched I wanted to try it. But, but, but, who is going to let me try out the model? Quite frankly, that model requires 220 gigs of (drum roll please...) VRAM. I have 16 gigs of RAM. So yeah, running it locally was completely out of the question (I didn't want to melt my laptop, duh).

Then I tried a few sites that claim to use GLM 5.1, but were they really running GLM 5.1?

That made me question everything, starting with: what if I had enough money to buy an H200?
What models can I run on my laptop locally without burning it down?
What are the minimum specs you need to run a model?
Are specs even required?
What about a phone... Can a phone run a model?
What about a 10 year old phone?

[It was really hard to remember my entire line of thought, but it was something along these lines]

So, here starts the story of how I ran a model (a very old and basic one) on my Motorola E3 Power.

So, for context, the specs of the phone: Mediatek MT6735P (28 nm) chipset, quad-core 1.0 GHz Cortex-A53 CPU and Mali-T720MP2 GPU. 2GB RAM & 16GB ROM. Check all the specs here!

For comparison, the latest phone chipsets are built on 4nm processes, and 3nm for the flagships (smaller nodes are generally faster and more power-efficient). A current flagship has an octa-core CPU clocked up to 4.74 GHz, an Adreno 840 (1.3GHz) GPU and 12/16GB RAM. That is the hardware gap between a midrange phone from '16 and a flagship from '26. [These are the specs of the Samsung S26 Ultra (Yes, I mentioned flagship!)]

I started with GPT and kept asking what I could run. It basically boiled down to a 500M parameter model (for context, GLM 5.1 is a 1T parameter model, 2000x more). And if I really wanted to push the limits of my 2GB hardware, I could run a 1B param model.

The 2 models were Qwen 2.5 0.5B Instruct and Llama 3.2 1B.

Now, let's understand my limitations first.
I had 2 gigs of RAM, so usable RAM was about ~1900 MB.
Android 6 was taking up ~400 MB, the launcher (the actual UI we use) took up ~300 MB, and some other apps that were absolutely necessary (Phone app, Settings, Gboard, which takes 60 MB for some reason, etc.) took up ~250 MB. So, idle RAM usage was ~900 MB. On a 10 y/o phone I couldn't use the ROM (storage) as extra RAM either, because the secondary memory was even slower.
Which meant I only got about 1GB of RAM for the actual model to run.

I decided to change the launcher to a lighter one (KISS Launcher), which takes 30 MB of memory to run. I also decided to use adb (Android Debug Bridge) to force-stop a lot of services that weren't necessary anymore (Phone, Messages, Photos, etc.), which saved another ~150 MB.

Now my idle RAM usage went down from ~900 MB to just about ~500 MB, which gave me a huge 1350-1400 MB for the model to run!
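If you want to sanity-check that memory budget, here is a tiny sketch of the arithmetic. The numbers are just the rough figures quoted above (they are approximations, not measurements taken by this script), and the function name is made up for illustration.

```python
# Rough RAM budget, using the approximate figures from the text (all in MB).
USABLE_RAM_MB = 1900  # what is left of the 2 GB for software to use


def free_for_model(usable_mb, idle_mb):
    """RAM left over for the model once idle system usage is subtracted."""
    return usable_mb - idle_mb


before = free_for_model(USABLE_RAM_MB, 900)  # stock launcher, all services running
after = free_for_model(USABLE_RAM_MB, 500)   # KISS Launcher, extra services stopped

print(f"before tuning: ~{before} MB free")
print(f"after tuning:  ~{after} MB free")
```

The gain of roughly 400 MB is what later makes the difference between fitting only the 500M model and fitting the 1B one.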

Now, to run the model I first needed to set up llama.cpp. The current versions won't work on that ancient embedded system. So I spent nearly a day with AI Studio (because it's free, and the usage limits are massive). The build kept failing because of dependencies that couldn't run on the old software, and Gemini kept making changes to accommodate them.
A whole lot of fixes here and there and about 18 hours of trial and error later, it... failed at 100%, 3 separate times. Imagine waiting over an hour for a progress bar to hit the end (3 freaking times), only for it to scream a nonsensical C++ error at the very last second.

Just so we're on the same page, I understood the errors (at least some of them, tbh); I was just fed up with the tiny issues all over.

GPT finally pointed out that I was missing the atomic linking flags (basically the glue that holds the code together on old 32-bit chips). GPT also edited the final build command from:

make llama-server

to:

make llama-server -j4

The "-j4" flag lets make run up to 4 jobs in parallel, so the build uses all 4 CPU cores. That cut the build time from about 1.5 hours to around 25 mins.

Now that the stage was set, I finally ran the model with the prompt: What is the capital of France?

The model ran for about 15 seconds, and then it spit out:
Paris

I cannot explain my excitement. At that moment, it felt like I had conquered the entire fucking world. After battling dependencies for an entire day (fine... 18 hours, not a complete day), that final sigh was immensely peaceful!

From there on, I tried it out a little more, and yes, a 500M param model isn't that good, but the fact that it ran was a massive ego boost. I went on to run the Llama 3.2 1B model as well, and it was better, but naturally slower. This model ran only because I force-stopped some services and changed the launcher, which gave a 1B param model the breathing room to run! Now, obviously a 1B param model won't run in its full glory in the ~1.4GB of RAM I had free, so I had to run the GGUF Q4 version, and to be on the safe side I used the Q4 version even for the 500M model, which could have run without the quantization. (In muggle terms: quantization means that the model I ran was compressed from FP16, 16 bits per weight, to roughly 4 bits per weight.)
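To see why Q4 was the only realistic option, here is a back-of-the-envelope sketch of how big the weights are at each precision. It assumes round parameter counts (0.5B and 1B) and exactly 4 bits per weight; real GGUF Q4 files store a bit more than 4 bits per weight because of scaling metadata, and this ignores the context cache and runtime overhead entirely. The helper name is made up for illustration.

```python
def weight_size_gb(params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Assumed round parameter counts for the two models.
for name, params in [("0.5B", 0.5e9), ("1B", 1e9)]:
    fp16 = weight_size_gb(params, 16)  # unquantized, 16 bits per weight
    q4 = weight_size_gb(params, 4)     # idealized 4-bit quantization
    print(f"{name}: FP16 ~{fp16:.2f} GB, 4-bit ~{q4:.2f} GB")
```

Under those assumptions, the 1B model needs about 2 GB at FP16, which is more than the whole phone has, while the 4-bit version lands at around half a gig.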

Here's a full video of how the models worked out! I used ssh to run the model on my phone while recording it from my PC.

If the embed doesn't work, watch it on YouTube.

Once llama.cpp was set up, running another model was just a matter of downloading it and running the command to start it. Of course, within the physical hardware bounds. That's all!
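For anyone who would rather poke the phone from a script than over ssh, here is a minimal sketch of a client for llama-server. This is not what I used in the video. It assumes llama-server is listening on its default port 8080, that the old build in use exposes the /completion endpoint (with "prompt" and "n_predict" in the request and "content" in the reply), and that the phone's address is something like 192.168.1.50, which is a made-up example.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, n_predict=32):
    """Build a POST request for llama-server's /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, prompt, timeout=120):
    """Send the prompt and return the generated text (slow hardware: long timeout)."""
    req = build_completion_request(base_url, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["content"]


# Example call (hypothetical address, needs a running llama-server):
# print(ask("http://192.168.1.50:8080", "What is the capital of France?"))
```

Keeping n_predict small matters here, since every extra token costs real seconds on a 1.0 GHz Cortex-A53.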

That was it for this rant!
If you read till here, thanks!

AI on a 10y/o phone?