Digging deep into NetEase Cloud Music's song recommendation algorithm: how does it make birds of a feather flock together?
The Internet is full of detailed explanations of NetEase Cloud Music's personalized recommendation algorithm, but the company itself has never weighed in! To satisfy users' curiosity about the algorithm behind the daily recommended song list, we barged into the product and technology departments at NetEase Cloud Music headquarters, took the technical experts hostage, and poured out all our questions.
This article is reposted from: The Power of Machines | Utoda
Folks, do you know what day it is today?
We know we should wish everyone a happy holiday, but let's not forget the singles shivering in their cold kennels, waiting for the state to assign them a partner (as if I weren't talking about myself).
This morning, when I found that the "dog food" scattered across my friends' feeds was enough to last a whole year, I still opened NetEase Cloud Music, hoping to find my own kind in the lively comment section: another pack of single dogs.
Unexpectedly, apart from a "special interview for the first year of singles", the first song in my daily recommendations turned out to be:
Best Day of My Life.
……
Well, with lyrics like "I won't give up on myself, don't wake me up, this is the best day of my life", I seriously suspect that NetEase Cloud Music's personalized recommendation has seeped into the everyday life of every single-dog user:
"Don't hang around the comment section all the time. Please find your own happiness before you come back next year. Otherwise, as soon as the Year of the Dog is over, your title will become 'single pig'."
Happy Valentine’s Day and Happy Year of the Dog.
On Zhihu, two questions, "What is the song list recommendation algorithm of Netease Cloud Music" and "What's good about Netease Cloud Music", rank third and eighth respectively among the hot questions under the NetEase Cloud Music topic.
To a large extent, the answer to the first question is what makes the second one worth asking.
There may be many reasons why NetEase Cloud Music is praised almost unanimously on Zhihu. (Some people say it hired a large army of paid posters. If so, that must have been a heavy investment. I won't tell you that the communities of the two platforms overlap greatly.) However, the solid quality of its song lists, and personalized recommendations that are more accurate than those of its domestic competitors, are among the key reasons some users have become die-hard NetEase Cloud Music fans.
The first question also explains why the comment sections of many songs are full of remarks such as "No. 1 in my daily recommendations", "No. 2 in my daily recommendations" and "recommended by both my daily recommendations and FM".
However, it is not quite accurate to call NetEase Cloud Music "a paradise for lovers of independent and niche music", as some people do. Bringing songs that the public has overlooked back in front of you is often the work of the technology behind the scenes.
Suppose you download a Jay Chou song today. The next day, should the system push you another popular Jay Chou song in a similar style, or an obscure song in a similar style that will feel fresher to you?
There is some truth in this answer.
Somewhat surprisingly, though, NetEase Cloud Music has never officially disclosed its recommendation algorithm or the details of how it is applied in the product. That has not stopped the public from being curious about how its technology and product fit together.
As a result, the algorithm models and AI applications inside NetEase Cloud Music have been more or less picked apart by Zhihu users.
You can find excellent answers and speculation under the Zhihu topic "What is the song list recommendation algorithm of Netease Cloud Music?" (the most upvoted answers there are far clearer than the media reports and easy to understand).
So we visited Xu Jia, a data mining engineer and product manager at NetEase Cloud Music, not so much to uncover the algorithm's secrets as to verify the various earlier guesses (including those on the Internet) and to help users resolve the questions that come up when they use NetEase Cloud Music.
Basic algorithm: people group with their own kind
In fact, NetEase Cloud Music's personalized recommendation algorithm is similar to the basic recommendation algorithms used by Toutiao, Bilibili and many O2O and e-commerce platforms. Xu Jia confirmed this. It is the basic algorithm we are all familiar with, collaborative filtering:
This algorithm is credited to Amazon engineers: if a customer buys one item, he or she may also buy another.
Simply put, this algorithm bases its predictions on similar consumption patterns between people. For example, if I like two songs and both are also in your song list, then your song list may contain other songs I would like.
That statement is simplified for ease of understanding. In fact, collaborative filtering algorithms fall into two categories: user-based and item-based (here, song-based).
1. User-based: My song collection is highly similar to Xiao Ming's, so, judging that our tastes are similar, the system can recommend to Xiao Ming the songs in my collection that Xiao Ming has not yet collected.
Image source: data mining workers
2. Item-based (song-based): Users' preferences for a song are treated as a vector and used to calculate the similarity between songs. Once the similarities are known, a song is recommended to a user according to that user's historical preferences.
For example, Xiao Xin downloaded two songs, Courage and Little Love Song; a second listener downloaded Courage, Darkness and Little Love Song; and Xiao Yi downloaded Courage …
From these users' historical preferences, NetEase Cloud Music can judge that Courage is similar to Little Love Song and that people who like Courage may also like Little Love Song, so it can recommend Little Love Song to Xiao Yi.
Image source: data mining workers
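The item-based idea can be sketched in a few lines of Python. This is only a minimal illustration, not NetEase Cloud Music's actual implementation: the download histories loosely mirror the example above, the name user_2 and the functions item_similarity and recommend are hypothetical, and cosine similarity over binary "who downloaded it" vectors is an assumed choice of measure.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical download histories, loosely following the example in the text.
# "user_2" is an illustrative placeholder name.
downloads = {
    "Xiao Xin": {"Courage", "Little Love Song"},
    "user_2": {"Courage", "Darkness", "Little Love Song"},
    "Xiao Yi": {"Courage"},
}


def item_similarity(histories):
    """Cosine similarity between songs over binary 'who downloaded it' vectors."""
    listeners = defaultdict(set)
    for user, songs in histories.items():
        for song in songs:
            listeners[song].add(user)
    sim = defaultdict(dict)
    for a, b in combinations(sorted(listeners), 2):
        common = len(listeners[a] & listeners[b])
        score = common / sqrt(len(listeners[a]) * len(listeners[b]))
        sim[a][b] = sim[b][a] = score
    return sim


def recommend(user, histories, sim, top_n=1):
    """Rank songs the user does not have by similarity to songs the user has."""
    owned = histories[user]
    scores = defaultdict(float)
    for song in owned:
        for other, s in sim[song].items():
            if other not in owned:
                scores[other] += s
    return sorted(scores, key=lambda k: (-scores[k], k))[:top_n]


sim = item_similarity(downloads)
print(recommend("Xiao Yi", downloads, sim))
```

With these toy histories, Little Love Song shares more listeners with Courage than Darkness does, so it ranks first for Xiao Yi.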
In short, if you still find "collaborative filtering" hard to understand, just remember one phrase: people group with their own kind.
A quick digression: it is precisely this recommendation model, which is essentially based on the similarity of users' preferences, that quietly turns users into a community of people "chatting with each other" while listening to music.
So Shen Bowen does not treat "it may develop into the largest dating website in China in the future" as a mere joke. Instead, he thinks that this social trend based on music preferences is more reliable than current dating platforms:
Curiosity Daily once ran a survey asking what people mainly look at when searching for a soul mate. The answer: taste in music.
"Birds of a feather flock together" under the neural network model
As you can see, this recommendation algorithm cannot work without users' historical data. Collaborative filtering is very powerful when the data set is huge and clean enough.
But what if I am a new user, or I use NetEase Cloud Music very rarely? In other words, when data is scarce, how can NetEase Cloud Music know my taste?
This cold start problem means that different algorithm models inevitably have to be combined. The second category of algorithms, described next, may remove this obstacle to some extent:
Content-based recommendation algorithms.
This is a recommendation approach that focuses on the essential content of each song, and it can be thought of as "birds of a feather flock together".
Sander Dieleman, one of the builders of the content recommendation model at Spotify, the world-famous music streaming platform (he is now a DeepMind research scientist), wrote a blog post titled "The Application of Convolutional Neural Network in Music Recommendation" in which he specifically explained the shortcomings of using collaborative filtering alone:
1. Because this algorithm uses only information about users and consumption patterns, it has no information about the recommended song itself. Popular music is therefore recommended more readily than niche music, because there is more data about it. Recommendations of this kind rarely surprise anyone.
2. Item-based (song-based) collaborative filtering has a further problem: heterogeneous content that shares similar usage patterns.
For example, you may listen to every song on a new album, yet apart from the title track, the interludes, covers and remixes may not be typical of the artist's work. Collaborative filtering is then biased by this "noise".
Of course, its biggest problem is that "without data, nothing works".
Content-based recommendation therefore works more like a supplement that covers these shortcomings of collaborative filtering: if there is little user data, or if we want to hear niche songs, the answer can only come from the music itself.
The two experts, Xu Jia and Shen Bowen, made it clear that NetEase Cloud Music uses a complex content-based algorithm to solve these problems. Unfortunately, neither went into the specific details.
So our guess is that they use the same approach as streaming platforms such as Spotify and YouTube: using deep learning to build audio-based recommendation models.
First of all, if you want to compare how songs differ in content, there are too many dimensions: artist and album information, lyrics, the melody and rhythm of the music itself, chatter in the comment section, whether the song is a VIP download, whether it is paid, and so on.
As you can imagine, that is a huge amount of computation. But brute-force computation over everything is one approach …
Therefore, these many features should be mapped into a low-dimensional latent variable space through feature embedding and dimensionality reduction (as in the picture below).
You can imagine that in this space each song has a coordinate, and the coordinate values encode several kinds of information, including audio characteristics and user preferences.
Then, if we can directly predict a song's position in this low-dimensional space, we also obtain a representation of that song (including user preference information).
In this way, the song can be recommended to the right audience even without historical usage data.
So the route many streaming services are taking is to learn the latent features mapped from large amounts of song data and user behavior data, then build a neural network prediction model based on audio features and train the network on short audio clips. (For the specific method, see Sander Dieleman's paper. If you understand it, please give us a lesson!)
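The cold-start idea described here, predicting where a brand-new song sits in the latent space from its audio alone, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the numbers are invented, the two-dimensional vectors and the names predict_latent and score are hypothetical, and a k-nearest-neighbour average stands in for the trained neural network, which is far more capable.

```python
from math import dist

# All numbers are made-up illustrations, not real platform data.
# Latent vectors that a collaborative filtering model might have learned for
# songs that already have listening history (2 dimensions for readability).
latent = {
    "song_a": (0.9, 0.1),
    "song_b": (0.8, 0.2),
    "song_c": (0.1, 0.9),
}
# Hypothetical audio features for the same songs (e.g. tempo and energy, 0..1).
audio = {
    "song_a": (0.70, 0.80),
    "song_b": (0.65, 0.75),
    "song_c": (0.20, 0.30),
}


def predict_latent(new_audio, k=2):
    """Estimate a latent vector for a song with no history from its audio.

    A k-nearest-neighbour average stands in for the trained neural network.
    """
    nearest = sorted(audio, key=lambda s: dist(audio[s], new_audio))[:k]
    dims = len(next(iter(latent.values())))
    return tuple(sum(latent[s][d] for s in nearest) / k for d in range(dims))


def score(user_vec, song_vec):
    """Dot product between a user's latent vector and a song's latent vector."""
    return sum(u * s for u, s in zip(user_vec, song_vec))


new_song = predict_latent((0.68, 0.78))   # a new song that sounds like a and b
fan_of_a = (1.0, 0.0)                     # hypothetical user vectors
fan_of_c = (0.0, 1.0)
print(new_song, score(fan_of_a, new_song), score(fan_of_c, new_song))
```

The new song has no plays at all, yet it scores higher for the listener whose taste lies near song_a and song_b, which is the behaviour the text describes.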
Of course, while training the network, engineers also use methods such as dropout to reduce the deviation between the latent representation from the collaborative filtering model and the prediction from audio (so that the data are not too scattered), mainly to reduce the influence of song popularity on the recommendation system.
That's right, this is why you receive more niche song recommendations.
Of course, every one of the algorithms above is ultimately based on some notion of "similarity".
For example, NetEase Cloud Music also applies a machine-learned ranking model, which is likewise based on user behavior data and similarity (this is also a very common model).
In the product, put simply, the first song in your "Daily Recommended" list is usually the one the system thinks best matches your preferences. "Many people shout 'No. 1 in my daily recommendations' in the comment section, and that really does mean something."
On Zhihu, a prominent user mentioned a recommendation model called the "latent factor" model, but Xu Jia thinks it is outdated and that few people use it now.
Calculation method: how is the similarity of our song lists calculated?
According to Xu Jia, NetEase Cloud Music mainly uses two measures:
Euclidean distance and cosine similarity.
A blog post by a technical expert on CSDN explains the difference between them very clearly (below):
From the CSDN technology blog of a user named Ying.
The former treats the two items as points in a coordinate system and calculates the distance between them.
For example, when data A and B in the figure above are treated as points in the coordinate system, their similarity is measured by the absolute distance dist(A, B) between the two points.
The latter treats them as two vectors in the coordinate system and calculates the angle between the two vectors.
For example, with cosθ in the figure, the smaller the angle, the higher the similarity.
You will find that, in the same picture, if B stays fixed and A moves farther away from the origin along the line from the origin through A, the angle between A and B never changes, but the absolute distance between the two points does.
Because of this difference, the two measures suit different data analysis models.
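The observation about moving A along its own direction is easy to check numerically. The sketch below uses arbitrary example coordinates (they are not taken from the figure) and a small helper named cosine that is introduced only for this illustration.

```python
from math import dist, hypot


def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (hypot(*u) * hypot(*v))


a = (1.0, 2.0)                     # arbitrary example point A
b = (3.0, 1.0)                     # arbitrary example point B
a_far = tuple(3 * x for x in a)    # A pushed three times farther from the origin

# Euclidean distance changes when A moves outward ...
print(dist(a, b), dist(a_far, b))
# ... while the cosine of the angle between A and B stays the same.
print(cosine(a, b), cosine(a_far, b))
```

Euclidean distance reacts to how large the values are, while cosine similarity only reacts to the direction of the vector.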
Xu Jia explained that Euclidean distance, which highlights absolute differences in numerical values, is often used to calculate the similarity between songs themselves.
For example, 10,000 people like song A and 20,000 people like song B. Because the sample is large enough, every user's preference for a song can be treated as having the same intensity, so Euclidean distance can be used directly.
With Euclidean distance, users' preferences for songs can all be treated as the same score, which simplifies the calculation of song similarity.
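As a rough sketch of what "the same score for every preference" could look like, each song below becomes a 0/1 vector over users and songs are compared by Euclidean distance. The user IDs, the likes table and the conversion 1 / (1 + distance) from distance to similarity are all assumptions made for illustration, not details confirmed by NetEase Cloud Music.

```python
from math import dist

# Hypothetical data: which users like which song.
users = ["u1", "u2", "u3", "u4"]
likes = {
    "song_a": {"u1", "u2", "u3"},
    "song_b": {"u1", "u2", "u4"},
    "song_c": {"u4"},
}


def song_vector(song):
    """Every 'like' counts as the same score: 1.0 if the user likes it, else 0.0."""
    return [1.0 if u in likes[song] else 0.0 for u in users]


def euclidean_similarity(x, y):
    """Map Euclidean distance into (0, 1]; identical songs get 1.0 (assumed form)."""
    return 1.0 / (1.0 + dist(song_vector(x), song_vector(y)))


print(euclidean_similarity("song_a", "song_b"))
print(euclidean_similarity("song_a", "song_c"))
```

song_a and song_b share two of their three fans, so they come out closer to each other than song_a and song_c do.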
Cosine similarity is more about distinguishing differences in the direction of users' preferences.
For example, NetEase Cloud Music can apply it to users' ratings of content (behaviors such as download, collection, search and "not interested" carry different scoring weights) to measure how similar users' interests are.
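A small sketch of how behavior-weighted scores might feed a cosine similarity between users follows. Only the ordering of the weights (download highest, then collection, search and sharing) comes from this article. The numeric values, the negative weight for "not interested", and the names WEIGHTS, preference_vector, alice, bob and carol are illustrative assumptions.

```python
from math import sqrt

# Assumed weights: only the ordering download > collect > search > share is
# from the article; the numbers and the negative weight are illustrative.
WEIGHTS = {
    "download": 4.0,
    "collect": 3.0,
    "search": 2.0,
    "share": 1.0,
    "not_interested": -3.0,
}


def preference_vector(events, songs):
    """Turn a list of (song, action) events into one score per song."""
    scores = dict.fromkeys(songs, 0.0)
    for song, action in events:
        scores[song] += WEIGHTS[action]
    return [scores[s] for s in songs]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


songs = ["s1", "s2", "s3"]
alice = preference_vector([("s1", "download"), ("s2", "collect")], songs)
bob = preference_vector([("s1", "collect"), ("s2", "search")], songs)
carol = preference_vector([("s3", "download"), ("s1", "not_interested")], songs)

print(cosine(alice, bob), cosine(alice, carol))
```

alice and bob act on the same songs with different intensity, so their vectors point in nearly the same direction. carol's interests point elsewhere, and her "not interested" on s1 pushes the similarity with alice below zero.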
In short, by combining the algorithms and measures above, NetEase Cloud Music's personalized recommendation has earned a good reputation with the public.
But if this kind of "not bad" could be achieved by technology alone, Amazon's business in China probably would not be doing so badly (I'm really griping about its interface).
Frankly speaking, even the best algorithm has its shortcomings.
For any music product, the user experience score is made up of editorial and project collaboration, interface design preferences, richness of music copyrights, music preference prediction, technical ability and speed of response to problem feedback.
This is why some people cheer wildly for NetEase Cloud Music's recommended song lists, while many others say that "people who listen to all kinds of music may find NetEase Cloud Music really tiring to use".
For example, suppose you are a loyal fan of European and American music but recently happened to download one Chinese song.
Then I can be sure a Chinese song will appear in your recommended song list the next day. After that, all you can do is mark the Chinese songs that show up in the list as "not interested".
The recommendation algorithm assigns weights to different user behaviors: "download" carries the highest weight, followed by collection, search and sharing. You can also click "not interested", which may keep such songs away.
Beyond algorithmic recommendation, a streaming platform also takes on a good deal of manual filtering: it sets manual rules from the product and operations perspective and filters out unqualified options.
Shen Bowen told us that they do not rely on the algorithm alone, but hope to make up for some of its shortcomings with human effort.
Therefore, in addition to a separate algorithm team, Netease Cloud Music also has a strong editorial team.
On the one hand, they screen the recommended content up front, pick out the high-quality content, and keep the whole recommendation library healthy.
On the other hand, they also need to solve some convergence problems of the algorithm.
"If we relied on algorithmic recommendation alone, we might be slow to react to some new content, so we also use manual editing to find content that we think is very high quality and recommend it to everyone," Shen Bowen said.
In addition, even though the customer service system relies on AI to some extent, the "human feedback team" made up of NetEase Cloud Music's customer service and technical departments is an important reason users think well of NetEase Cloud Music.
Many instant-reply technical fixes have also led users to joke that "so the NetEase Cloud Music editors really are alive".
In the platform's early days, when there was not enough data to build a recommendation model, things worked as an Internet writer named Shaq described on Zhihu:
The reason you get such classy recommendations is probably that they first came from a song list named "High-class Small Fresh", recommended by a professional editor.
They effectively guided users with similar interests to discover this music. Most people with tastes similar to yours had listened to it and liked it. After some fancy algorithms let the data "settle" and "ferment", good similarity scores emerged, producing such excellent recommendations and pushing them to you.
In the end, everyone was amazed and more new users joined. Perfect.