Mangaki data challenge 1st place solution
The Mangaki data challenge is an otaku-flavored data science competition. Its goal is to predict a user’s preference for an unwatched/unread anime/manga between two choices: wish to watch/read and don’t want to watch/read. The competition provides training data from https://mangaki.fr/, a site that lets users favorite anime/manga works. Three major training tables are provided, as follows:
- Wish table: about 10k rows
- Record table: for already watched/read anime/manga. There are four ratings here: love, like, neutral and dislike.
- Work table: detailed information on the available anime/manga. There are three categories: anime, manga and album. There is only one album in this table; all the others are anime (about 7k) and manga (about 2k).
For the testing data, one should predict, for 100k user/work pairs, whether the user wishes to watch/read the anime/manga or not. As you can see, the testing data is much larger than the training data. Besides, my analysis of this dataset showed that there is no guarantee that all users or works appearing in the test set are contained in the training set.
Building recommendation systems has long been studied, and there are various methods for solving this particular problem. I also tried to build a recommender for https://bgm.tv several years ago (you can read technical details here). The simplest solution is SVD (actually, an even simpler and more intuitive solution is KNN); then one can move on to RBM, FM, FFM and so on. One assumption that holds firm in all these methods is that users have an embedding vector capturing their preferences, and works have an embedding vector capturing their characteristics. Is it reasonable that we should be constrained to this embedding-dot-product model?
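For reference, the embedding dot-product model questioned above can be sketched in a few lines. This is only an illustration: the 3-d vectors are made up, and `predict_wish` is a hypothetical name, not something from the competition code.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def predict_wish(user_vec, item_vec, bias=0.0):
    """Sigmoid of the user/item dot product plus a bias, read as P(wish)."""
    return 1.0 / (1.0 + math.exp(-(dot(user_vec, item_vec) + bias)))

# Hypothetical 3-d embeddings (illustrative values only).
user = [0.9, -0.2, 0.4]
similar_item = [1.0, 0.0, 0.5]
dissimilar_item = [-0.8, 0.6, -0.3]

score_similar = predict_wish(user, similar_item)
score_dissimilar = predict_wish(user, dissimilar_item)
```

In this family of models, everything the model knows about a user or a work has to fit into its vector, and the only interaction between the two is the inner product.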
Recently, the common practice in Kaggle competitions has been to use GBDT for almost every problem except computer-vision-related ones. As long as a model can handle classification, regression and ranking problems very well, it can be applied to all supervised machine learning problems! And by ensembling models under the StackNet framework, one can combine the different characteristics of several models to achieve the best result.
In this competition, my solution is quite plain and straightforward: feature engineering to generate some embeddings, then GBDT/Random Forest/Factorization Machine models built from different combinations of features. Finally, I used a two-level stack net to ensemble them, in which level two is a logistic regression model.
Features from the wish table:
- Distribution of user’s preference on anime/manga (2d+2d)
- Distribution of item’s preference (2d)
- Word2vec embedding of user on wish-to-watch items (20d)
- Word2vec embedding of user on not-wish-to-watch items (10d)
- Word2vec embedding of item on wish-to-watch users (20d)
- Word2vec embedding of item on not-wish-to-watch users (10d)
- LSI embedding of user (20d)
- LSI embedding of item (20d)
Features from the record table:
- Distribution of user’s preference on anime/manga (4d+4d)
- Distribution of item’s preference (4d)
- Mean/StdErr of user’s rating (2d)
- Mean/StdErr of item’s rating (2d)
- Word2vec embedding of user on loved and liked items (32d)
- Word2vec embedding of user on disliked items (10d)
- Word2vec embedding of item on loved and liked users (32d)
- Word2vec embedding of item on disliked users (10d)
- LSI embedding of user (20d)
- LSI embedding of item (20d)
- LDA topic distribution of user on love, like and neutral items (20d)
- LDA topic distribution of item on love, like and neutral ratings (20d)
- Item category (1d, categorical feature)
- User Id (1d, only used in FM)
- Item Id (1d, only used in FM)
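As an illustration of the rating distribution and mean/standard-error features in the list above, here is a minimal sketch for the user side. The numeric scale in `RATING_VALUE` is an assumption (the post does not say how ratings were scored), and the function name and toy records are made up.

```python
import math
from collections import Counter, defaultdict

RATINGS = ["love", "like", "neutral", "dislike"]
# Assumed numeric scale; the original post does not specify one.
RATING_VALUE = {"love": 2.0, "like": 1.0, "neutral": 0.0, "dislike": -1.0}

def user_rating_features(records):
    """Map each user to a 4-d rating distribution followed by [mean, std_err].

    `records` is an iterable of (user_id, work_id, rating) tuples.
    """
    by_user = defaultdict(list)
    for user_id, _work_id, rating in records:
        by_user[user_id].append(rating)

    features = {}
    for user_id, ratings in by_user.items():
        n = len(ratings)
        counts = Counter(ratings)
        distribution = [counts[r] / n for r in RATINGS]
        values = [RATING_VALUE[r] for r in ratings]
        mean = sum(values) / n
        # The sample variance needs at least two ratings; use 0.0 otherwise.
        if n > 1:
            variance = sum((v - mean) ** 2 for v in values) / (n - 1)
            std_err = math.sqrt(variance / n)
        else:
            std_err = 0.0
        features[user_id] = distribution + [mean, std_err]
    return features

# Toy records (made up for illustration).
toy_records = [
    (1, 10, "love"), (1, 11, "like"), (1, 12, "like"), (1, 13, "dislike"),
    (2, 10, "neutral"),
]
toy_features = user_rating_features(toy_records)
```

The item-side features follow the same pattern with the records grouped by work instead of by user.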
The first layer of the stack net is a set of models that should each predict well but have different inductive biases. Here I tried just three models: GBDT, RF (both backed by lightGBM) and FM (backed by FastFM). I trained models on the record table features and the training (wish) table features separately, and one can further train different models using different combinations of features. For example, one can use all features (except user id and item id) from the record table. But since GBDT tends to focus on the most informative features when given all of them, it helps to split the features into several groups and train models on them separately. In this competition, I did not split much (simply because I did not have much time). I just removed the first four features (because I saw from the prediction results that they have a major effect on precision) and trained some other models.
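One way to organize such feature combinations is sketched below. The group names are hypothetical; the widths follow the record table feature list, and the three named sets are only one possible reading of the splits described above, not the exact configuration used in the competition.

```python
# Hypothetical group names; widths follow the record table feature list.
RECORD_FEATURE_GROUPS = [
    ("user_pref_dist", 8),      # anime + manga, 4d each
    ("item_pref_dist", 4),
    ("user_rating_stats", 2),   # mean, standard error
    ("item_rating_stats", 2),
    ("user_w2v_liked", 32),
    ("user_w2v_disliked", 10),
    ("item_w2v_liked", 32),
    ("item_w2v_disliked", 10),
    ("user_lsi", 20),
    ("item_lsi", 20),
    ("user_lda", 20),
    ("item_lda", 20),
    ("item_category", 1),
    ("user_id", 1),             # only used by FM
    ("item_id", 1),             # only used by FM
]

def column_names(groups, exclude=()):
    """Expand (name, width) groups into flat column names, skipping excluded groups."""
    return [
        f"{name}_{i}"
        for name, width in groups
        if name not in exclude
        for i in range(width)
    ]

ID_GROUPS = ("user_id", "item_id")
FIRST_FOUR = tuple(name for name, _width in RECORD_FEATURE_GROUPS[:4])

feature_sets = {
    "tree_all": column_names(RECORD_FEATURE_GROUPS, exclude=ID_GROUPS),
    "tree_without_first_four": column_names(
        RECORD_FEATURE_GROUPS, exclude=ID_GROUPS + FIRST_FOUR
    ),
    "fm_all": column_names(RECORD_FEATURE_GROUPS),
}
```

Each entry of `feature_sets` then selects the columns for one first-layer model, so adding another split only means adding another entry.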
The stack net feeds all predictions from the first layer as features to the second layer. The stacking technique requires one to do K-fold cross-validation at the beginning, predicting each fold with models trained on all the other folds; these out-of-fold predictions become the training data for the second level. Here is the most intuitive (as far as I think) description of the model stacking technique: http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/
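The out-of-fold procedure can be sketched in plain Python as follows. `ColumnMeanModel` is a toy stand-in for the real GBDT/RF/FM models, and all names and the toy data are assumptions made for illustration; this is not the code used in the competition.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds interleaved validation folds."""
    return [list(range(k, n_samples, n_folds)) for k in range(n_folds)]

def out_of_fold_predictions(make_model, X, y, n_folds=5):
    """Predict every row with a model that never saw that row during training."""
    oof = [None] * len(X)
    for val_idx in kfold_indices(len(X), n_folds):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = make_model()
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in val_idx])
        for i, p in zip(val_idx, preds):
            oof[i] = p
    return oof

class ColumnMeanModel:
    """Toy first-layer model: mean label per value of one feature column."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            key = row[self.column]
            sums[key] = sums.get(key, 0.0) + label
            counts[key] = counts.get(key, 0) + 1
        self.prior = sum(y) / len(y)  # fallback for unseen values
        self.table = {k: sums[k] / counts[k] for k in sums}
        return self

    def predict(self, X):
        return [self.table.get(row[self.column], self.prior) for row in X]

# Toy data: the label equals the first feature.
X = [[i % 2, i % 3] for i in range(12)]
y = [row[0] for row in X]

level_one = [lambda: ColumnMeanModel(0), lambda: ColumnMeanModel(1)]
oof_columns = [out_of_fold_predictions(m, X, y, n_folds=4) for m in level_one]
# One row per sample, one column per first-layer model.
level_two_features = [list(row) for row in zip(*oof_columns)]
```

The second-level logistic regression would then be fit on `level_two_features` against `y`. For test rows, a common choice is to refit each first-layer model on the full training set (or to average the fold models) before predicting.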
In this competition, a single GBDT with all the features from the record table reaches 0.85567 on the LB. By leveraging the model stacking technique, one can reach 0.86155, which is my final score.
Is this the best one can do? Definitely not. One can push the boundary much further:
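The post refers to AUC as the quantity to improve, so the leaderboard scores above are presumably AUC values (this is an assumption; the metric is not stated next to the scores). A minimal pure-Python sketch of AUC, useful for checking cross-validation results on small data:

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. This pairwise version is O(P * N), which is
    fine for small validation folds but slow for large ones.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Because AUC depends only on the ordering of the scores, it makes no difference whether a model outputs probabilities or raw margins.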
- I did not tune the embedding generation parameters very well. In fact, I generated those features using the default parameters gensim provides. The embedding dimensions were simply picked arbitrarily, no science involved. Maybe one can enlarge the sliding window of word2vec or use more embedding dimensions to achieve better results.
- I only used lightGBM to build GBDT. One can also use xgboost. Even though both provide GBDT, lightGBM grows trees leaf-wise, while xgboost grows them depth-wise by default. Even though both models are CART-based GBDT, they behave differently.
- I did not introduce any deep-model-generated features. GBDT is the kind of model that relies on heavy feature engineering, while a deep model learns features automatically. By combining them in a stacking model, one should be able to obtain a much higher AUC.
- I did not use more complex features. Sometimes, popularity ranking also affects a user’s behavior: a user may mark highly ranked anime as “wish to watch”. I did not try this idea out.
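The popularity idea in the last point could be turned into a feature roughly as follows. This is an untested sketch of an idea that was not tried in the competition; the function name and toy interactions are made up.

```python
from collections import Counter

def popularity_rank_feature(interactions):
    """Map each work_id to (interaction_count, dense_rank), 1 = most popular.

    `interactions` is an iterable of (user_id, work_id) pairs.
    """
    counts = Counter(work_id for _user_id, work_id in interactions)
    distinct_counts = sorted(set(counts.values()), reverse=True)
    rank_of_count = {c: r for r, c in enumerate(distinct_counts, start=1)}
    return {work: (c, rank_of_count[c]) for work, c in counts.items()}

# Toy interactions (made up for illustration).
toy_interactions = [
    (1, "A"), (2, "A"), (3, "A"),
    (1, "B"), (2, "B"),
    (3, "C"),
    (4, "D"), (5, "D"),
]
toy_popularity = popularity_rank_feature(toy_interactions)
```

When such a feature is used inside cross-validation, the counts should come from the training folds only, otherwise the validation labels leak into the feature.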
I must say this competition is very interesting, because I have seen no other competition that targets anime/manga prediction. Another good point of this competition is that the training data is very small, so I could do CV efficiently on my single workstation. Also, I had never tried a stack net before this competition. It gave me some experience in doing model stacking in an engineering-friendly way.
One regret is that too few competitors took part in this competition. Though I called for participants on Bangumi, it seems that still not many people joined. The competition organizers should make their website more popular before holding the next data challenge!
One more thing: one may be interested in the code. I put all my code here, but it is not arranged in an organized way. I think the most important files are “FeatureExtraction.ipynb” and “aggregation.py”, which cover how to do feature engineering and how to partition features. “CV.ipynb” gives some intuition on how to train models.