You tell me I'm wrong. Then you'd better prove you're right.

### 2017-10-02 Mangaki data challenge 1st place solution

Mangaki data challenge is an otaku-flavored data science competition. Its goal is to predict a user's preference for an unwatched/unread anime/manga between two choices: wish to watch/read, or don't wish to watch/read. The competition provides training data from https://mangaki.fr/, a site that lets users mark their favorite anime/manga works. Three major training tables are provided, described as follows:

1. Wish table: about 10k rows

   | User_id | Work_id | Wish |
   | --- | --- | --- |
   | 0 | 233 | 1 |
2. Record table: for already watched/read anime/manga. There are four ratings: love, like, neutral and dislike.

   | User_id | Work_id | Rate |
   | --- | --- | --- |
   | 0 | 22 | like |
   | 2 | 33 | dislike |
3. Work table: detailed information on the available anime/manga. There are three categories: anime, manga and album. There is only one album in this table; all the others are anime (about 7k) and manga (about 2k).

   | Work_id | Title | Category |
   | --- | --- | --- |
   | 0 | Some_anime | anime |
   | 1 | Some_manga | manga |

For the testing data, one should predict, for 100k user/work pairs, whether the user wishes to watch/read the anime/manga. As you can see, the testing data is much larger than the training data. Moreover, during my analysis of this dataset I found it is not even guaranteed that all users or works appearing in the test set are contained in the training set.

## Traditional recommendation system methods (that I know)

Recommendation system building has long been studied, and there are various methods for solving this particular problem. I myself tried to build a recommender for https://bgm.tv several years ago (you can read the technical details here). The simplest solution is SVD (an even simpler and more intuitive solution is KNN); from there one can move on to RBM, FM, FFM and so on. One assumption that holds firm in all these methods is that each user has an embedding vector capturing their preferences, and each work has an embedding vector capturing its characteristics. But is it reasonable that we should be constrained to this embedding-dot-product model?
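To make that shared assumption concrete, here is a minimal sketch (random toy embeddings, not trained ones) of the embedding-dot-product form all these methods reduce to: every prediction is a dot product between a user vector and a work vector, so the whole score matrix can never exceed rank `dim`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_works, dim = 5, 8, 4

# Each user and each work is represented by a single embedding vector.
user_emb = rng.normal(size=(n_users, dim))
work_emb = rng.normal(size=(n_works, dim))

def predict(u, w):
    """Predicted preference is just the dot product of the two embeddings."""
    return float(user_emb[u] @ work_emb[w])

# The full score matrix every such model is limited to: rank <= dim.
scores = user_emb @ work_emb.T
print(scores.shape)                            # (5, 8)
print(np.linalg.matrix_rank(scores) <= dim)    # True
```

This low-rank constraint is exactly what the rest of the post tries to escape by mixing embeddings with tree models.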

Recently, the common practice in Kaggle competitions has been to use GBDT to solve (almost all, except computer-vision-related) problems. As long as a model handles classification, regression and ranking problems well, it can be applied to any supervised machine learning problem! And by using model ensembling under the stacknet framework, one can combine the different characteristics of these models to achieve the best result.

In this competition, my solution is quite plain and straightforward: feature engineering to generate some embeddings, then GBDT/Random Forest/Factorization Machine models built from different combinations of features. Finally, I used a two-level stack net to ensemble them, in which level two is a logistic regression model.

## Feature Engineering

### From wish table:

• Distribution of user’s preference on anime/manga (2d+2d)
• Distribution of item’s preference (2d)
• Word2vec embedding of user on wish-to-watch items (20d)
• Word2vec embedding of user on not-wish-to-watch items (10d)
• Word2vec embedding of item on wish-to-watch users (20d)
• Word2vec embedding of item on not-wish-to-watch users (10d)
• LSI embedding of user (20d)
• LSI embedding of item (20d)

### From record table:

• Distribution of user’s preference on anime/manga (4d+4d)
• Distribution of item’s preference (4d)
• Mean/StdErr of user’s rating (2d)
• Mean/StdErr of item’s rating (2d)
• Word2vec embedding of user on loved and liked items (32d)
• Word2vec embedding of user on disliked items (10d)
• Word2vec embedding of item on loved and liked users (32d)
• Word2vec embedding of item on disliked users (10d)
• LSI embedding of user (20d)
• LSI embedding of item (20d)
• LDA topic distribution of user on love, like and neutral items (20d)
• LDA topic distribution of item on love, like and neutral ratings (20d)
• Item category (1d, categorical feature)
• User Id (1d, only used in FM)
• Item Id (1d, only used in FM)

## Model ensembling

The first layer of the stack net is a set of models that have good predictive capability but different inductive biases. Here I tried just three models: GBDT, RF (both backed by lightGBM) and FM (backed by fastFM). I trained models on the record-table features and the wish-table features separately, and one can further train different models using different combinations of features. For example, one can use all the features (except user id and item id) from the record table. But since GBDT will focus on the most informative features if all of them are given, it is helpful to split the features into several groups and train models separately. In this competition, I did not split them much (simply because I did not have much time); I just removed the first four features (because I saw from the prediction results that they were having a major effect on precision) and trained some additional models.

## Model stacking

The stack net requires one to feed all the prediction results from the first layer as features to the second layer. The stacking technique requires doing K-fold cross-validation at the beginning, then predicting each fold with models trained on all the other folds, so that the second level only ever sees out-of-fold predictions. Here is the most intuitive description (as far as I can tell) of the model stacking technique: http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/
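The out-of-fold procedure can be sketched as follows (sklearn stand-ins for the actual GBDT/RF/FM first layer, synthetic data):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

base_models = [GradientBoostingClassifier(random_state=0),
               RandomForestClassifier(random_state=0)]

# Out-of-fold predictions: every row is predicted by a model that never saw it.
oof = np.zeros((len(X), len(base_models)))
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for j, model in enumerate(base_models):
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx, j] = model.predict_proba(X[valid_idx])[:, 1]

# Level two: a logistic regression over the first layer's predictions.
stacker = LogisticRegression().fit(oof, y)
print(stacker.predict_proba(oof).shape)  # (300, 2)
```

Because the level-two model trains only on out-of-fold columns, it cannot simply reward whichever base model memorized the training set best.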

In this competition, a single GBDT with all the features from the record table reaches 0.85567 on the LB. By leveraging the model stacking technique, one can reach 0.86155, which was my final score.

## Is this the ultimate ceiling?

Definitely not. One can push the boundary much further:

1. I did not tune the embedding-generation parameters very well. In fact, I generated those features using the default parameters gensim provides. The embedding dimensions were chosen on a whim, with no science involved. Maybe one can enlarge word2vec's sliding window or use more embedding dimensions to achieve better results.
2. I only used lightGBM to build GBDT models. One can also use xgboost. Even though both provide GBDT, lightGBM grows trees leaf-wise while xgboost grows them depth-wise; even though both are CART-based GBDT implementations, they behave differently.
3. I did not introduce any features generated by deep models. GBDT is the kind of model that relies on heavy feature engineering, while a deep model learns features automatically. By combining them in a stacking model one could surely obtain a much higher AUC.
4. I did not use more complex features. Sometimes popularity ranking also affects user behavior: a user tends to mark highly ranked anime as "wish to watch". I did not try this idea out.

## Conclusion

I must say this competition was very interesting, as I have seen no other competition targeting anime/manga preference prediction. Another good point is that the training data is very small, so I could do CV efficiently on my single workstation. Also, I had never tried a stack net before this competition; it gave me some experience in doing model stacking in an engineering-friendly way.

One regret is that too few competitors took part. Though I tried to call for participants on Bangumi, it seems not many people joined. The organizers should make their website more popular before holding the next data challenge!

One more thing: some may be interested in the code. I put all my code here, but it is not arranged in an organized way. The most important files are "FeatureExtraction.ipynb" and "aggregation.py", which show how to do the feature engineering and how to partition the features; "CV.ipynb" gives some intuition on how to train the models.

### 2017-04-14 Console as a SQL interface for quick text file processing

uid    name    nickname    joindate    activedate
7    7    lorien.    2008-07-14    2010-06-05
2    2    陈永仁    2008-07-14    2017-02-17
8    8    堂堂    2008-07-14    2008-07-14
9    9    lxl711    2008-07-14    2008-07-14
name    iid    typ    state    adddate    rate    tags
2    189708    real    dropped    2016-10-06
2    76371    real    dropped    2015-11-07
2    119224    real    dropped    2015-03-04
2    100734    real    dropped    2014-10-09
subjectid    authenticid    subjectname    subjecttype    rank    date    votenum    favnum    tags
1    1    第一次的親密接觸    book    1069    1999-11-01    57    [7, 84, 0, 3, 2]    小説:1;NN:1;1999:1;国:1;台湾:4;网络:2;三次元:5;轻舞飞扬:9;国产:2;爱情:9;经典:5;少女系:1;蔡智恒:8;小说:5;痞子蔡:20;书籍:1
2    2    坟场    music    272        421    [108, 538, 50, 18, 20]    陈老师:1;银魂:1;冷泉夜月:1;中配:1;银魂中配:1;治愈系:1;银他妈:1;神还原:1;恶搞:1;陈绮贞:9
4    4    合金弹头7    game    2396    2008-07-17    120    [14, 164, 6, 3, 2]    STG:1;结束:1;暴力:1;动作:1;SNK:10;汉化:1;2008:1;六星:1;合金弹头:26;ACT:10;NDS:38;Metal_Slug_7:6;诚意不足:2;移植:2
6    6    军团要塞2    game    895    2007-10-10    107    [15, 108, 23, 9, 7]    抓好社会主义精神文明建设:3;团队要塞:3;帽子:5;出门杀:1;半条命2:5;Valve:31;PC:13;军团要塞:7;军团要塞2:24;FPS:26;经典:6;tf:1;枪枪枪:4;2007:2;STEAM:25;TF2:15


1. Not real-time. By "real-time" I don't mean that today is April 16 while the data is from February; I mean that I cannot guarantee the data is a snapshot at a single point in time. For the user data, one crawl takes two days, during which users may change their nicknames or usernames without the change being reflected in the crawled data. A more serious problem: for the collection data, users may perform collection operations while the crawl is running, producing duplicated or missing rows. And since the user data and the collection data are crawled separately, I cannot guarantee that the two tables can be joined one-to-one by username.
2. Not ordered, as you can see from the data preview.
3. Crawler defects. Since I did not handle Bangumi's 500 errors, some data is missing.

## 1. SELECT … WHERE … ORDER BY …

### Filtering the winter 2017 anime season

90


85 anime_selection.tsv
122772    122772    六心公主    anime        2016-12-30    26    [19, 41, 1, 1, 4]    17冬:1;原创:1;PONCOTAN:4;2016年:2;广桥凉:1;TVSP:1;池赖宏:1;原优子:1;mebae:1;TV:4;日本动画:1;片山慎三:1;Studio:1;STUDIOPONCOTAN:4;2016:5;TVA:1;短片:2;上田繁:1;搞笑:4;中川大地:2;岛津裕之:2;种崎敦美:1;2017年1月:1;テレビアニメ:1;オリジナル:1;SP:1;6HP:2;村上隆:10;未确定:1
125900    125900    锁链战记～赫克瑟塔斯之光～    anime    3065    2017-01-07    88    [66, 24, 216, 20, 60]    山下大辉:3;17冬:1;原创:1;游戏改:47;CC:1;花泽香菜:7;TV:22;未确定:2;グラフィニカ:2;佐仓绫音:4;2017年1月:61;锁链战记:1;2017:10;锁链战记～Haecceitas的闪光～:15;热血:2;チェインクロ:1;石田彰:22;声优:2;2017年:4;Telecom_Animation_Film:1;十文字:1;柳田淳一:1;战斗:2;内田真礼:2;剧场版:1;奇幻:17;2017·01·07:1;工藤昌史:3;2015年10月:1;TelecomAnimationFilm:9
126185    126185    POPIN Q    anime        2016-12-23    10    [134, 11, 3, 3, 0]    荒井修子:1;黒星紅白:4;原创:3;黑星红白:1;2016年:5;_Q:1;日本动画:1;2016年12月:2;未确定:1;小泽亚李:1;2017:2;2016:5;动画电影:1;2017年:5;Q:3;东映动画:1;种崎敦美:1;2017年1月:1;宫原直树:1;POPIN:6;東映アニメーション:12;剧场版:24;东映:4;萌系画风:1;濑户麻沙美:5
131901    131901    神怒之日    anime        2017-10-01    0    [79, 1, 0, 3, 1]    GENCO:3;2017年10月:2;TV:4;未确定:2;2017年:2;GAL改:4;游戏改:4;LIGHT:2;2017:3;エロゲ改:3;2017年1月:1


### Extracting the tag list

122772    六心公主    村上隆    10
122772    六心公主    2016    5
122772    六心公主    PONCOTAN    4
122772    六心公主    STUDIOPONCOTAN    4
122772    六心公主    TV    4
122772    六心公主    搞笑    4
122772    六心公主    2016年    2
122772    六心公主    6HP    2
122772    六心公主    中川大地    2
122772    六心公主    岛津裕之    2
122772    六心公主    短片    2
122772    六心公主    テレビアニメ    1
122772    六心公主    オリジナル    1
122772    六心公主    17冬    1
122772    六心公主    2017年1月    1
122772    六心公主    mebae    1
122772    六心公主    SP    1
122772    六心公主    Studio    1
122772    六心公主    TVA    1
122772    六心公主    TVSP    1
sort: write failed: standard output: Broken pipe
sort: write error


## Cosine distance

After estimating the ratings of all collected-but-unrated works, I still use cosine distance to compute the sync rate between users. As mentioned above, this way works a user has not collected do not affect the sync rate. At this stage, I don't want a user to see that someone with a very high sync rate watches completely different anime, even if those anime share the same underlying concepts.

In past feedback about the sync rate, I kept hearing complaints that users who had collected only a handful of works made it into people's top-ten lists. But as long as the sync rate keeps this conservative definition, namely cosine distance in work space, this phenomenon can never be eradicated.

Imagine that the works user A has watched are a subset of yours, while user B has watched more: everything A has watched, plus many other works you have not. Who will score higher? The answer is without doubt A, even though you might want B to rank higher in your sync-rate list. In other words, all else being equal, the cosine-distance definition makes the system inherently favor users with few collections.
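A small numerical check of this bias, using hypothetical binary collection vectors (1 means the user collected that work):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two collection vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 8 works in total; "you" collected the first 4.
you = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
# A collected a strict subset of your works.
a = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=float)
# B collected everything A did, plus works you never collected.
b = np.array([1, 1, 0, 0, 1, 1, 1, 1], dtype=float)

print(cosine(you, a))                   # ~0.707: the small subset user wins
print(cosine(you, b))                   # ~0.408
print(cosine(you, a) > cosine(you, b))  # True
```

Both users share the same two works with you, but B's extra collections inflate the norm in the denominator, so the near-empty profile always wins.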

Given this fact, I think v0.3 will be the last version of the sync-rate toy. If the goal is to find like-minded people, the outdated concept of a "sync rate" has to be abandoned.

## What's next?

With this redesign, I think the sync rate has reached a milestone, and it also completes my hands-on practice with matrix factorization. But the ultimate goal of finding like-minded people has not been reached. What comes next will probably not be something a model as intuitive as cosine distance can capture. I already know what I am going to do, so look forward to the next, better toy.