Impressions of Software 2.0
In Andrej Karpathy's concept of Software 2.0, trained neural networks for specific tasks are a new type of software. I had heard the idea a few times, but only had a very rough understanding of it. Recently, I got a chance to get some hands-on experience with it in one of my hobby projects. I feel I have learned a few things, and I would like to note them down and share.
The Background
It started with the idea "Can I use AI to play a video game?", and I got stuck at the first step: reading information from the screen. The game uses a special font that confuses most OCR libraries. After trying the three most popular libraries (suggested by AI), I realized a general OCR tool/model is not the optimal answer to the question. General OCR models are built to handle odd or blurry handwriting, various fonts, and sometimes even multiple languages, while my case is much simpler: a specific font (alphanumeric only), well printed (screenshots), uniform size, with only slight variation in background and rendering light. General models not only over-complicate the question, they also sacrifice speed (as bigger networks are required). And in my use case, analyzing video game frames, speed matters. So I ended up training a dedicated OCR model for it. I spent about a week's spare time on it, and it works surprisingly well (98%+ accuracy at 100+ fps).
The Solution
The whole process (for the OCR model) can be summarised as:
1 Generate synthetic training data
2 Design and train the model (a CNN+Transformer hybrid)
3 Put the model in use, and capture live data
4 Review and correct the captured data
5 Retrain with the reviewed data
Repeat 3-5 until satisfied.
I feel this could be a standard high-level process for any similar task. I spent about a week the first time, as there were a lot of new things (and thus a lot of back and forth) for me. I later created another model (for another use case) in 2 days with the same process.
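For step 2, a CNN+Transformer hybrid for fixed-size text chips can be sketched roughly like this in PyTorch. The sizes here (72x32 input, 37 classes, d_model=64, CTC-style per-column logits) are illustrative assumptions, not the exact model from the project:

```python
import torch
import torch.nn as nn

class CRNNOCR(nn.Module):
    # CNN front-end extracts per-column features; a Transformer encoder
    # models character order; a linear head emits per-position logits
    # (CTC-style decoding assumed). All sizes are illustrative.
    def __init__(self, num_classes=37, d_model=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x72 -> 16x36
            nn.Conv2d(32, d_model, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),      # collapse height, keep width
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                  # x: (N, 3, 32, 72)
        f = self.cnn(x)                    # (N, d_model, 1, 36)
        f = f.squeeze(2).permute(0, 2, 1)  # (N, 36, d_model): one step per column
        return self.head(self.encoder(f))  # (N, 36, num_classes)

x = torch.randn(2, 3, 32, 72)  # a batch of two 72x32 RGB chips
logits = CRNNOCR()(x)
print(tuple(logits.shape))     # (2, 36, 37)
```

A network this small is what makes 100+ fps plausible; a general-purpose OCR model would be far heavier.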
Apart from the high-level process, there were a few things/questions I found interesting.
Preprocessing or Not
By instinct, I felt I only needed the structure of the letters, especially for well-printed ones. So I started with heavy preprocessing: scaling, then a skeletonizer that turns letters into thin black/white strokes. The model was easy to train, but performed badly in game. What I noticed is that coded preprocessing can't always produce the ideal result, as it involves threshold values/parameters that I need to tweak, and I ended up busy hunting for impossible ideal values.
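To make the fragility concrete, here is a toy sketch (pure Python, made-up pixel values) of the threshold-based binarization that sits underneath this kind of preprocessing. A parameter tuned for one lighting condition breaks under another:

```python
def binarize(image, threshold):
    # Software 1.0 style preprocessing: a hand-picked threshold turns a
    # grayscale image into black/white strokes for the skeletonizer.
    return [[1 if px > threshold else 0 for px in row] for row in image]

# The same stroke under two lighting conditions (toy 1-row "images":
# background bright, stroke dark; pixel values are made up).
bright = [[200, 210, 90, 80, 205]]
dark   = [[120, 130, 60, 55, 125]]

t = 150  # tuned on the bright frame
print(binarize(bright, t))  # [[1, 1, 0, 0, 1]] -- stroke separated correctly
print(binarize(dark, t))    # [[0, 0, 0, 0, 0]] -- dim frame collapses entirely
```

A threshold that works on one frame fails on the next; the network, by contrast, learns its own notion of stroke vs background from data.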
So I ended up training with colored images, and it worked pretty well. It looks like a simple choice in the moment, but it was actually a battle between Software 1.0 and 2.0.
Preprocessing images with the OpenCV library is basically Software 1.0: it is based on a rational analysis of the question, which comes out as rules applied in code.
Feeding colored images without preprocessing basically relies on the data and the neural network to come up with the answer, which is essentially Karpathy's Software 2.0.
Although preprocessing was not needed in my case, I'm sure there are cases where it still is. But what are the criteria?
I think it all comes down to pros vs cons: preprocessing can simplify the task for the model, while neural networks are computationally heavy and do not guarantee consistent output.
Pros
1 Less computing power required (in general)
2 Can simplify the question by orders of magnitude
3 Precise and consistent
Cons
1 It might remove useful information
2 It requires coding effort
My feeling is that if the preprocessing rules are solid and simple, it's always good to apply them. Although, with improving hardware capability, the gain in speed may not be worth the effort and the risk of applying it.
But still, in my case, cropping regions of interest, 72x32-pixel chips, out of a 4K screenshot is definitely still needed, and it is also a form of preprocessing.
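That cropping is exactly the "solid and simple" kind of rule. A minimal sketch (pure Python slicing on a row-major pixel grid; the coordinates are made up):

```python
def crop_roi(frame, x, y, w=72, h=32):
    # Rule-solid preprocessing: slice a fixed-size chip out of a full
    # frame. frame is a row-major grid of pixels (rows of columns).
    return [row[x:x + w] for row in frame[y:y + h]]

# Toy stand-in for a screenshot; a real 4K frame would be 3840x2160.
frame = [[(r, c) for c in range(100)] for r in range(50)]
chip = crop_roi(frame, x=10, y=5)
print(len(chip), len(chip[0]))  # 32 72
```

Unlike a threshold, a crop has no fragile parameter to tune: the game's UI puts the text in the same place every frame.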
Another point I noticed is that the neural network (Software 2.0) is handling a task that is basically impossible (imagine rule-based OCR code...) for Software 1.0. In real life, there are more things that can't be clearly described by rules than things that can. And after more than half a century of Software 1.0 booming, the things that can be covered by rules have most likely already been covered, while there is great potential in the things that can't.
Self-Improving Process
My model trained on synthetic data only achieved 50% accuracy and 70% confidence in real game tests; it was basically useless. I spent most of my time tweaking the model and the synthetic data generation logic, and it was still struggling. Then I started capturing data while playing the game, and once I fed the reviewed and corrected real-life data into training, the accuracy and confidence improved dramatically to 98%+.
I've worked on a few machine learning POC projects before, and a lot of the time they failed because they didn't achieve acceptable accuracy. Looking back, if they had been put into a "real life test" even with low accuracy, and had started capturing data, the results could have been different.
And in my use case, by the time I hooked the model up to the game, the work required for capturing data was largely done: all I needed was to add data-saving logic before/after the model's classification logic. I feel this (little extra work needed to capture data once the model is hooked up to real data feeds) could be the same for other cases.
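That data-saving hook can be as small as this sketch (the function names and file layout are my illustration, not the actual project code):

```python
import json
import os
import time

def classify_and_capture(frame_bytes, model_predict, out_dir="captures"):
    # Hypothetical hook: run the model, then save the raw input next to
    # its prediction so a human can review/correct the label later and
    # feed it back into training (steps 3-5 of the process above).
    prediction, confidence = model_predict(frame_bytes)
    os.makedirs(out_dir, exist_ok=True)
    stamp = str(time.time_ns())  # unique per capture
    with open(os.path.join(out_dir, stamp + ".png"), "wb") as f:
        f.write(frame_bytes)
    record = {"image": stamp + ".png", "predicted": prediction,
              "confidence": confidence, "reviewed": False}
    with open(os.path.join(out_dir, stamp + ".json"), "w") as f:
        json.dump(record, f)
    return record
```

The review step then only has to flip "reviewed" and fix wrong labels; the model's own predictions pre-fill most of the labeling work.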
To summarize: the trained model is not the solution; the self-improving process is.
If I look over the process again, it is basically harvesting human knowledge. It distilled raw data and my labeling into a model capable of doing a specific task. This is probably what the big AI companies are doing now, although they are trying to harvest something much bigger, like coding in general, for example.
Learning is the joy
I started the project due to some sort of addiction to a game, and ended up learning deep learning.
It was quite a busy week: it was the first thing I did when I got home from work, and I kept crafting till late at night. I didn't even want to spend time on dinner, so I ordered pizza for a few days. I feel it was even more addictive than the game. Maybe trying/learning new things is itself one of the highest forms of joy.
As I mentioned earlier, AI companies are busy distilling human knowledge. Maybe in the future, our existing knowledge will be distilled into models, and we will enjoy (or be forced into) learning new things all the time.