A Steemit Curation Bot based on Natural Language Processing and Machine Learning
Steemit can be a tough place for minnows, as new users are often called. I had to learn this myself. Due to the incredible number of new posts published every minute, it is exceptionally difficult to stand apart from the crowd. Nice, well-researched, and well-crafted posts from minnows are often overlooked. Minnows do not benefit from influential followers to upvote their high-quality posts. Their contributions are lost long before any whale may notice them and turn these posts into trending topics.
User-based curation does have merrit and it is possible that posts receive the traction and recognition they deserve. I believe there is a way to support the Steemit content curators. A way in which high-quality content no longer goes unnoticed. I have developed a curation bot called
TrufflePig to do exactly this using Natural Language Processing and Machine Learning. The deployed bot can be found here: https://steemit.com/@trufflepig
The idea is to use well-received posts as training examples to teach a Machine Learning Regressor (MLR) what high-quality Steemit content looks like. Once trained, the Machine Learning Regressor is used to identify high-quality posts which were missed by the curation community. These posts which receive less payment than they deserved are dubbed truffles.
The general idea of the system is as follows:
I train a Machine Learning regressor (MLR) using Steemit posts as inputs and the corresponding Steem Dollar (SBD) reward and the number of votes as outputs.
The MLR learns to predict potential payouts for new Steemit posts.
I compare the predicted payout with the actual payout of these recent Steemit posts (between 2 and 26 hours old). When the Machine Learning model predicts a high reward, where such a reward was not actually assigned to the post, I classify this post as an overlooked truffle.
The Machine Learning Regression model is trained on posts older than 7 days which have already been paid. Features include spelling errors, post length, and readability scores. A post's content is modelled as a Latent Semantic Indexing projection. The final regressor is a multi-output Random Forest.
The bot uses the official Steem Python library to scrape data from the steemit blockchain and to post a toplist of the daily found truffles using its trained model.
The bot works as follows:
Older data is scraped from the blockchain (see
bchain.getdata.py) or loaded from disk if possible.
The scraped posts are filtered and preprocessed (see
A model is trained on the processed data if one does not yet exist (see
model.py) or is otherwise loaded from disk.
More recent data is scraped and checked for truffles using this trained model.
The bot publishes a toplist of truffles on which it both upvotes and comments (see
Installation and Execution
Clone the project directory:
$ git clone https://github.com/SmokinCaterpillar/TrufflePig.git
Add the project directory to your
$ export PYTHONPATH=$PYTHONPATH:<path_to_project>
Start the bot using the provided
You can manually set the time the bot considers as
now via the --now configuration flag.
By default, the bot will not post to the blockchain. To enable posting use the
The bot's account information requires you to populate environment variables
STEEM_PASSWORD is optional and used only to encrypt the wallet file. The password should not be your Steemit masterpassword.
Open Source Usage
The bot is open source and can be freely used for non-commercial (!) purposes. Please check the LICENSE file.
(The bot's avatar has been created using https://robohash.org/)