Volume 5 Issue 3
Winter 2009
ISSN 1937-7266

Combining Text and Audio for Mood Classification in Music Digital Libraries

Xiao Hu

Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
501 E. Daniel St. Champaign, IL, 61820

Abstract
Music mood is a newly emerged metadata type and access point for music information. However, most existing music digital libraries and online repositories do not support categorizing and retrieving music by the mood it expresses. In fact, music mood, due to its subjectivity, has been far from well studied in information science. This dissertation research aims to: 1) identify mood categories that are frequently used by real-world music listeners; and 2) advance the state of the art in automatic music mood classification by combining text analysis and audio processing. The results of this research will introduce a new way of organizing and accessing music in digital libraries.

Categories and Subject Descriptors

H.3.7 [Information Storage and Retrieval]: Digital Libraries – system issues, user issues. H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – linguistic processing. H.3.4 [Information Storage and Retrieval]: Systems and software – performance evaluation. J.5 [Computer Applications]: Arts and Humanities – music

General Terms:

Design, Experimentation, Human Factors

Keywords:
music mood classification, text affective analysis, multi-modal classification, social tags, lyrics.

1. Introduction
“Some sort of emotional experience is probably the main reason behind most people’s engagement with music.” --Juslin and Sloboda [1]

Nowadays music digital libraries (MDL) face the challenge of providing users with natural and diversified access to music. The affective aspect of music (popularly known as music mood) has been identified by recent music information behavior studies as an important criterion in music seeking and organization (e.g., [2][3]). However, most existing music repositories do not support access to music by mood. In fact, music mood has been far from well studied in information science. A suitable set of mood categories that reflects the reality of music listening and can be widely adopted in the MDL and music information retrieval (MIR) communities has yet to be developed.

As the Internet and computer technologies enable people to access and share information on an unprecedentedly large scale, there is growing demand for automatic tools for music classification and recommendation. However, only a few music mood classification systems exist, their performance leaves considerable room for improvement, and most of them are based solely on the audio content of the music. Music-related text, as a resource independent of and complementary to audio content [4], will be exploited in this dissertation to improve the state of the art in automatic music mood classification.

2. Related Work
2.1 Mood as a novel metadata type of music

Traditionally, music information is organized and accessed by bibliographic metadata such as title, composer, performer and release date. In recent years, as MDL and online music repositories have grown popular, new non-traditional metadata types have emerged for organizing music information. [5] explored the 179 mood labels on AllMusicGuide (AMG)1 and analyzed the relationships of mood with genre and artist. The results showed that music mood, as a new type of music metadata, appears to be largely independent of genre and artist. Therefore, mood, as a new access point to music information, is a necessary complement to established ones.

2.2 Music Mood Classification

This section reviews important issues and representative studies on music mood classification.

2.2.1 Music mood categories

Unlike music genre classification, where there is a list of genres generally accepted by the MIR/MDL community and the general public, the terms used to describe music mood are not standardized and their application can be highly idiosyncratic. Experiments on automatic music mood classification have used different mood categories [6][7][8][9], making it difficult to draw meaningful comparisons between the approaches.

In these classification experiments, two main approaches have been used to set up a music mood taxonomy. The first is to adapt a mood model from music psychology. For example, [6] and [8] adopted Russell's two-dimensional arousal-valence mood model [10]. This approach has sound roots in music psychology, but as [11] pointed out, psychological models take a musician's standpoint and ignore the social context of music listening.

The second approach is to derive mood categories from professionally assigned mood labels published on popular music websites or software (e.g., AMG, MoodLogic). For instance, by clustering mood labels on AMG, [5] derived a set of five mood categories which were used for evaluating music mood classification systems in the Audio Mood Classification (AMC) task of the Music Information Retrieval Evaluation eXchange (MIREX) 2007, a community-based framework for the formal evaluation of algorithms and techniques related to MIR [12]. This approach can take advantage of the real-life music listening context reflected by real-life music information services. However, mood labels provided by professionals may not represent users' perspectives.

With the birth of Web 2.0, the general public can now post text tags on music pieces, and the large quantity of social tags has become a unique resource for discovering users' perspectives. [13] derived a set of three mood categories by clustering social tags posted on last.fm, the most popular tagging site for Western music. Unfortunately, this set was not adopted in experimental studies, as three categories were indeed an oversimplification of reality. This thesis research will propose a new method that combines the strengths of social tags, linguistic resources and human expertise to derive more realistic and user-centric mood categories from social tags.

2.2.2 Ground truth datasets

Besides mood categories, a dataset with ground truth labels is necessary for a scientific evaluation of the task. Datasets used in previous experiments usually consisted of several hundred music pieces (e.g., [14][6][7]). The limitation on dataset sizes is mainly due to the high cost of human assessment of music.

Just as in most tasks in information retrieval, humans are the ultimate judges of ground truth labels. Hence, in all existing music mood classification experiments surveyed in this paper, ground truth datasets were built by recruiting human assessors to manually label music pieces and then selecting pieces with (near) consensus among the judgments. However, judgments on music are very subjective, and it is hard to achieve agreement across assessors. An analysis of the ground truth dataset used in the MIREX 2007 AMC task showed that human assessors made discrepant judgments on about 30% of the audio excerpts [15]. A study focusing on building a ground truth set for music mood classification [16] also reported that only one third of the studied mood labels had the minimum "acceptable consistency" across assessors. To obtain reliable ground truth labels, [16] called for more assessors and more candidate music pieces, which unfortunately would require too much human labor to be realistic for most projects.
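
The agreement problem above can be quantified with a simple pairwise agreement measure, as in the sketch below. Everything here is illustrative: the assessors, excerpts and labels are hypothetical, and studies such as [16] use more elaborate consistency criteria.

```python
from itertools import combinations

def pairwise_agreement(labels_by_assessor):
    """Fraction of (assessor pair, item) judgments that coincide."""
    agree = total = 0
    for a, b in combinations(labels_by_assessor, 2):
        for x, y in zip(a, b):
            agree += (x == y)
            total += 1
    return agree / total

# Hypothetical mood judgments from three assessors on four excerpts.
a1 = ["calm", "sad", "angry", "happy"]
a2 = ["calm", "sad", "happy", "happy"]
a3 = ["calm", "happy", "angry", "happy"]
print(round(pairwise_agreement([a1, a2, a3]), 3))  # 8 of 12 paired judgments agree
```

Even in this tiny example, one-third of the paired judgments disagree, which mirrors the roughly 30% discrepancy observed in the MIREX 2007 AMC data.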

This situation makes it clear that to advance the development and evaluation of music mood classification, and of MIR/MDL research in general, a sound method is much needed for building ground truth sets of reliable quality in an efficient manner.

2.2.3 Audio-based music mood classification

Most existing work on automatic music mood classification is based exclusively on audio features extracted from .mp3 and/or .wav music files using audio processing technology. Among all audio features, timbral and rhythmic features are the most popular. Based on the audio signal spectrum, timbral features model the overtone structure of a music piece, while rhythmic features capture its beat and tempo information [17].
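
As a concrete illustration of a spectrum-based timbral feature, the sketch below computes the spectral centroid (the magnitude-weighted mean frequency of the spectrum) of a synthetic signal with NumPy. Real systems extract many such features frame by frame from actual recordings; the pure tone here is only a sanity check.

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Spectral centroid in Hz: magnitude-weighted mean of FFT bin frequencies."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

# For a pure 440 Hz tone, the centroid should sit at 440 Hz.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
print(round(spectral_centroid(tone, sr)))
```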

In 2007 and 2008, 22 audio-based systems were evaluated in the AMC tasks in MIREX. The best systems achieved accuracies of around 60% on a five-class categorization task. As pointed out in the seminal work of Aucouturier and Pachet [18], there appears to be a "glass ceiling" for audio-based approaches, because some high-level music features with semantic meanings may be too difficult to derive from audio using current technology.

2.2.4 Multi-modal music mood classification

To break the "glass ceiling" of audio-based approaches, researchers have started exploring multi-modal approaches combining audio and text, and have achieved improvements in genre classification and artist identification (e.g., [19][20]). However, to date there have been very few multi-modal systems for mood classification [21][22].

Yang and Lee [21] performed early work on supplementing audio mood classification with text analysis of lyrics. This study used seven mood categories adapted from Watson's model [23]. The lyrics features explored included bag-of-words (BOW) and 182 psychological features provided by the General Inquirer [24]. These text features were used to disambiguate categories that audio-based classifiers found confusing, and the overall classification accuracy was thus improved by 2.1%. However, the dataset used in this study was too small (145 songs) to support reliable conclusions.

Laurier et al. [22] also combined audio and lyrics. Using four mood categories adapted from Russell's model [10], the authors conducted classification experiments on a dataset of 1,000 songs. The lyrics features were BOW, BOW with dimensionality reduction by Latent Semantic Analysis, and discriminative terms corresponding to each category. The third feature set performed much better than the others, suggesting the usefulness of a special vocabulary customized to the mood domain. This corresponds to the lexicon-based approach in text affect detection [25].

Both studies have several limitations. First, the sets of categories may be oversimplified: in real-world music listening, there may be 15 or more distinguishable mood types (see Section 5.2). Second, the experiment datasets were too small for good generalizability. Third, very few text feature types were considered, and their performances were not carefully compared.

3. Research Questions
The overarching research question is: to what extent can text information help in categorizing music with regard to mood? It can be split into the following specific questions.

3.1 Identifying mood categories and building a ground truth dataset

How can social tags be used as the basis for building a ground truth data set with identifiable mood categories? This thesis research will identify mood categories and build a ground truth dataset at the same time using a new method exploiting the rich resource of social tags. This method should be efficient and easily generalizable to any audio set accessible to a researcher, so as to help overcome or alleviate the aforementioned difficulty in building ground truth sets.

The desired characteristics of a ground truth set for mood classification include:

  1. there should be several thousand pieces of music in the ground truth set;
  2. the mood categories cover most of the moods expressed by the collection of music being studied;
  3. each of the mood categories should represent a distinctive meaning;
  4. one music piece can be labeled with multiple mood categories;
  5. each assignment of a mood label to a music piece is validated by multiple human judges; the more judges, the better.

To further understand the resultant mood categories and the distribution of ground truth labels, the results will be compared to mood models proposed in music psychology (see Section 5.2). The social tagging environment of Web 2.0 is very different from the laboratory settings in which the music psychology studies were conducted. Thus, such a comparison will disclose whether the models are supported by empirical data and how large the differences between them are. In particular, the following questions will be addressed:

  1. Is there any correspondence between the resultant categories and the categories in psychological models?
  2. How do music pieces distribute across the mood categories?
  3. If the distances between identified mood categories can be calculated (e.g., by track co-occurrences), how do such distances differ from the relative distances of categories in the psychological models?

3.2 Most helpful features extracted from lyrics

Which types of features of lyrics are more useful than others in classifying music by mood?

Lyrics contain very rich information from which many kinds of features can be extracted, but no prior work has systematically compared the usefulness of these features for music mood classification. This research will compare the following seven feature types that have been shown useful in related text analysis tasks:

  1. Bag-of-words (BOW) features. These are collections of unordered unigrams (single words), bigrams (pairs of words) or trigrams (triples of words). BOW features are widely used in topic-based text classification (e.g., [26]) and often serve as a baseline in other text classification tasks.
  2. Part-of-speech (POS) features. Similar to BOW, POS features can be unigrams, bigrams and trigrams of POS tags. They have been found useful in text genre classification and text style classification (e.g., [25]).
  3. Function words. These have little lexical meaning and are often referred to as stop words in information retrieval. Function words have proven effective in text stylistic analysis (e.g., [27]).
  4. Lexical features. In contrast to function words, lexical features refer to words with specific meanings in a certain domain. Studies in text semantic orientation classification (e.g., [28]) and affect analysis (e.g., [29]) have used lexical features.
  5. Simple syntactic and orthographic features. These include special punctuation marks and unigrams and bigrams of interjection words. Similar features have been used in a study on emotions in children's stories [29].
  6. Text statistics. Common text statistics include average line length, average number of unique words per line, etc. They have been used in text stylometric analysis as well as in music genre classification [19] and artist style detection [30].
  7. ANEW scores. ANEW, the Affective Norms for English Words [31], is an influential general-purpose affect lexicon. It provides scores in three emotion dimensions for each affect-related word. ANEW scores have been used in text affect analysis [29][32]. It is interesting to see how well a general-purpose resource can work in the music domain.
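
To make two of these feature types concrete, the sketch below extracts unigram BOW counts (type 1) and simple text statistics (type 6) from an invented lyrics fragment. It is a minimal illustration, not the feature extraction pipeline that will actually be used in the experiments.

```python
from collections import Counter

def bow_unigrams(lyrics):
    """Feature type 1: unordered unigram counts."""
    return Counter(lyrics.lower().split())

def text_statistics(lyrics):
    """Feature type 6: simple per-line statistics."""
    lines = [ln for ln in lyrics.splitlines() if ln.strip()]
    words = [ln.lower().split() for ln in lines]
    return {
        "avg_line_length": sum(len(w) for w in words) / len(lines),
        "avg_unique_words": sum(len(set(w)) for w in words) / len(lines),
    }

lyrics = "the rain falls down\nand the night is long"  # invented fragment
print(bow_unigrams(lyrics)["the"], text_statistics(lyrics)["avg_line_length"])
```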

As there is not enough evidence to hypothesize which feature types or combinations would be most useful, all feature types and their combinations will be investigated in the mood classification task, and their classification performances will be compared using Friedman's ANOVA and the Tukey-Kramer "Honestly Significant Difference" (HSD) technique (see Section 5.5).
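
A minimal sketch of the Friedman test with SciPy follows, on invented per-category accuracies for three hypothetical systems; a post-hoc Tukey-Kramer HSD analysis would follow only when the overall test is significant.

```python
from scipy.stats import friedmanchisquare

# Hypothetical accuracies of three systems over six mood categories
# (values are paired by category, as Friedman's test requires).
audio  = [0.61, 0.58, 0.64, 0.55, 0.60, 0.57]
lyrics = [0.63, 0.66, 0.60, 0.62, 0.65, 0.61]
hybrid = [0.70, 0.71, 0.69, 0.68, 0.72, 0.70]

stat, p = friedmanchisquare(audio, lyrics, hybrid)
print(p < 0.05)  # if True, proceed to pairwise Tukey-Kramer HSD
```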

3.3 Lyrics-based vs. audio-based systems

Are there significant differences between lyrics-based and audio-based systems in music mood classification, given that both systems use the Support Vector Machines (SVMs) classification model?

To answer this question, the following two classification systems will be built and their performances compared: 1) a classification system built on the best-performing lyrics features; and 2) the best-performing audio-based system evaluated in the AMC tasks of MIREX 2007 and MIREX 20082.

Friedman's ANOVA will be applied to determine whether there are significant differences between the performances of the two. As the performance of the same system might vary across mood categories, performances in individual mood categories will also be compared.

This thesis research will use SVMs as the classification model for two reasons: 1) the selected audio-based system uses SVMs; and 2) SVMs have achieved the best results in MIR/MDL experiments in general [9][14][15][17].

3.4 Combining lyrics and audio

Are systems combining audio and lyrics significantly better than audio-based or lyrics-based systems?

In machine learning, it is well established that multiple independent sources of features are likely to compensate for one another, resulting in better performance than approaches using any single source [33]. Thus, the hypothesis is that hybrid systems combining audio and text will outperform systems using either source alone. To test this hypothesis, system performances will be compared using Friedman's ANOVA and Tukey-Kramer HSD.

There are two major approaches to assembling hybrid systems [33]: feature combination (e.g., weighting different sets of features) and classification result combination (e.g., voting). Both approaches will be implemented and compared.
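
The two combination strategies can be sketched as follows with scikit-learn, using synthetic data in place of the real audio and lyrics feature sets; everything here is illustrative, and the late-fusion rule (averaging decision scores) is just one of several possible result-combination schemes.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
# Two synthetic feature "sources", each weakly correlated with the label.
audio_X  = y[:, None] + rng.normal(0.0, 1.0, (n, 4))
lyrics_X = y[:, None] + rng.normal(0.0, 1.0, (n, 4))

# 1) Feature combination: concatenate the sources, train one classifier.
early = SVC().fit(np.hstack([audio_X, lyrics_X]), y)

# 2) Result combination: train one classifier per source, then
#    combine their decision scores (here by simple averaging).
clf_a = SVC().fit(audio_X, y)
clf_l = SVC().fit(lyrics_X, y)
fused = (clf_a.decision_function(audio_X) + clf_l.decision_function(lyrics_X)) / 2
late_pred = (fused > 0).astype(int)

print(early.score(np.hstack([audio_X, lyrics_X]), y) > 0.5,
      float(np.mean(late_pred == y)) > 0.5)
```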

The effect of the amount of training data (i.e., the learning curve) will also be investigated. Experiments will be run with increasing amounts of training data. The results will show empirically how much training data systems based on different sources need to achieve comparable performances, and whether adding lyrics can reduce the amount of training data needed by audio-based systems.

3.5 30-second clips vs. complete tracks

Can lyrics help classify 30-second clips truncated from original tracks and achieve results comparable to systems using complete tracks?

Due to the time complexity of audio processing, MIR/MDL systems often process the middle 30-second audio clip truncated from each original track instead of the complete track. As text processing is much faster than audio processing, it is of practical value to find out whether combining complete lyrics with short audio excerpts can help compensate for the (possibly significant) information loss caused by approximating complete tracks with short clips.

To answer this question, the performances of the hybrid systems on two datasets will be compared: 1) lyrics + complete audio tracks; and 2) lyrics + 30-second audio clips. Their performances will also be compared to those of the audio-based system selected in Section 3.3.

These research questions are closely related, and each builds upon the previous one. Figure 1 illustrates the connections among the questions, which together answer the overarching question.

Figure 1. Research questions and experiment flow

4. Expected Contributions
4.1 Contributions to methodology

Mood categories have been a much-debated topic in MIR/MDL. This thesis research will, for the first time, identify mood categories that have been used by real users in a real-life music listening environment. The statistics of the empirical ground truth dataset will also serve as a concrete record of aggregated real-life music listening behaviors, which can be used as a reference for studies in music psychology. In general, the comparison of mood categories derived from social tags to those in psychological models will open up new directions, helping researchers refine or adapt theories and models to better fit the reality of users' information behaviors.

Text sentiment analysis has been a hot topic in text mining in recent years [25], but it has only recently been applied to the music domain. This research will exploit a variety of linguistic features and push forward the state of the art in sentiment analysis in the music domain.

4.2 Contributions on evaluation

An effective and scalable evaluation approach is much needed in the music domain. The proposed method for deriving ground truth from social tags will help reduce the cost of human assessment and clear the way for large-scale experiments. The ground truth dataset built in this thesis research will be the largest to date with mood labels given by real-life users, and it will be made available to new rounds of MIREX for comparing systems developed in the MIR/MDL community.

4.3 Contributions on applications

Music mood classification and recommendation systems are direct applications of this thesis research. Based on its findings, one can plug in existing tools for text categorization, audio feature extraction and ensemble modeling to build a system that combines audio and text in an optimized way. Moody is an online prototype of such an application [34]. Moody classifies songs into the five mood categories proposed in [5], recommends music in a similar mood, and classifies users' songs on the fly. Upon completion of the thesis, Moody will be updated according to its conclusions.

5. Methodology
5.1 Data Collection

The dataset used in this research will contain three types of information: music audio, corresponding lyrics and social tags. Starting from accessible audio tracks, the author will collect lyrics and social tags from the Internet using focused crawling techniques.

Specifically, the audio is hosted in the International Music Information Retrieval System Evaluation Laboratory (IMIRSEL), where MIREX is run each year. The IMIRSEL has collected more than 31,000 audio tracks in a variety of genres, among which about 15,000 tracks are vocal music in English.

Lyrics will be obtained from two online databases with high-quality lyrics: Lyricwiki and The Music Lyrics Database (MLDB).

Social tags will be collected from last.fm, arguably the largest music tagging site, with 2.5 million tags directly related to mood [35]. Only one tagging site will be used in this research, because combining tags from multiple sites would distort the tag distributions, which are important in building a ground truth dataset.

5.2 Mood categories and music psychological models

A new method is proposed to identify mood categories and build a ground truth dataset from social tags. It uses WordNet-Affect, an affective extension of WordNet [36], to solve the vocabulary and synonym problems that are often seen in social tags. Expert opinions are solicited to disambiguate tags. A detailed description of this method and an experiment on a dataset of about 8,800 songs are presented in [37]. Table 1 presents the major categories identified as well as the song distributions among them.

Table 1: Major mood categories derived from social tags

Here we present an example of how a song was labeled with the identified mood categories. Figure 2 shows the most popular social tags on the Beatles' song "Think for Yourself", as published on last.fm on June 14th, 2009. Among these tags, affect-related terms were first identified; judgmental terms (e.g., "awesome", "favorite") were then removed, as they represent users' preferences for the song rather than the mood it conveys. After removing other ambiguous terms such as "chill", six terms remained that matched the category terms listed in Table 1, shown circled in Figure 2. These terms belong to five categories in Table 1, and thus this particular song was labeled with those five categories.
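
The filtering steps in this example can be sketched as follows. The word lists below are invented stand-ins for the actual WordNet-Affect resources, expert judgments and Table 1 categories, chosen only to show the mechanics.

```python
# Invented stand-ins for the real judgmental/ambiguous lists and categories.
JUDGMENTAL = {"awesome", "favorite", "love"}   # preference, not mood
AMBIGUOUS  = {"chill"}                          # flagged by experts
CATEGORY_TERMS = {
    "calm":     {"calm", "mellow", "soothing"},
    "cheerful": {"happy", "cheerful", "fun"},
}

def mood_labels(tags):
    """Map a song's social tags to mood categories, dropping non-mood terms."""
    kept = {t.lower() for t in tags} - JUDGMENTAL - AMBIGUOUS
    return sorted(cat for cat, terms in CATEGORY_TERMS.items() if kept & terms)

print(mood_labels(["awesome", "chill", "happy", "mellow", "favorite"]))
```

After filtering, only "happy" and "mellow" survive, so this hypothetical song would be labeled with the two categories they match.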

Figure 2. Example of labeling a song using social tags

The identified mood categories and song distributions will be compared to two influential mood models in music psychology. The earliest and still best-known music mood model is Hevner's adjective circle [38]. As shown in Figure 3, the adjectives within each cluster are close in meaning, and the differences in meaning between clusters are reflected by the distances between them.

Besides categorical models like Hevner's, there are many dimensional models in which emotions are positioned in a continuous multidimensional space. As shown in Figure 4, Russell's two-dimensional model [10] has been adopted in a few MIR/MDL experiments, but it has not been verified whether this model fits the social context of music listening.

Figure 3. Hevner’s adjective circle

Figure 4. Russell's model with two dimensions: arousal and valence3

5.3 Evaluation Tasks

Most of the research questions will be answered by comparing performances of music mood classification systems based on different information sources (audio, lyrics and hybrid) and features.

The classification systems will be evaluated in two tasks: binary classification and track retrieval. In the binary classification task (Figure 5), a classification model will be built for each mood category; once trained, the model will give a binary output for each track: whether or not it belongs to that category.

Figure 5. Binary classification task

In the track retrieval task (Figure 6), a category label serves as the query, the document pool contains all tracks in the ground truth dataset, and the tracks classified into the category are returned in a ranked list.

Figure 6. Track retrieval task
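
Both tasks can be sketched with per-category SVMs on synthetic data: one binary model per mood category, whose decision scores also induce the ranked list for retrieval. The features, labels and "learnable" category below are invented purely for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
n, d, n_cats = 120, 10, 3
X = rng.normal(size=(n, d))
Y = rng.integers(0, 2, (n, n_cats))     # multi-label ground truth
Y[:, 0] = (X[:, 0] > 0).astype(int)     # make category 0 learnable

# Binary classification: one independent model per mood category.
models = [LinearSVC().fit(X, Y[:, c]) for c in range(n_cats)]

# Track retrieval: treat category 0 as the query and rank all tracks
# by that model's decision score, best-matching tracks first.
scores = models[0].decision_function(X)
ranked = np.argsort(-scores)
print(int(Y[ranked[0], 0]))
```

Training one model per category also accommodates the multi-label requirement in Section 3.1, since each track is scored by every category's model independently.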

5.4 Performance measures

Commonly used performance measures for classification problems include accuracy, precision, recall and F-measures. Accuracy has been used extensively in binary classification evaluations in text classification and in MIR, especially MIREX, and thus will be the main classification performance measure in this research.

As suggested in [39], in combining performance across multiple categories, both micro-average and macro-average measures will be reported and compared.

One purpose of formulating a track retrieval task is to enable use of the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) measure. Unlike micro-averaged or macro-averaged measures, AUC is not affected by the prior probability of a category, and thus will not be dominated by either large or small categories.
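
These measures can be computed with scikit-learn; the multi-label ground truth, binary predictions and ranking scores below are invented solely to demonstrate the calls.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Invented multi-label truth, predictions and scores
# for four tracks over three mood categories.
y_true  = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1]])
y_pred  = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.1], [0.8, 0.4, 0.3],
                    [0.5, 0.7, 0.9], [0.4, 0.1, 0.8]])

micro = f1_score(y_true, y_pred, average="micro")   # pools all decisions
macro = f1_score(y_true, y_pred, average="macro")   # averages per-category F1
auc   = roc_auc_score(y_true, y_score, average="macro")
print(round(micro, 3), round(macro, 3), round(auc, 3))
```

Note how the micro average exceeds the macro average here: the category with the weakest F1 drags the macro average down, while pooled counts mask it, which is why Section 5.4 reports both.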

5.5 Comparisons on classification performances

Throughout the experiments, Friedman's ANOVA will be used as a general tool to determine whether significant differences are present among systems. If a Friedman's overall test shows a statistically significant difference (at p < 0.05) somewhere among the systems, Tukey-Kramer Honestly Significant Difference (TK-HSD) analyses [40] will be used to determine where the significant differences actually lie. TK-HSD is chosen because it properly controls the experiment-wise Type I error rate, which is important when many pairwise comparisons are made.

References
[1] Juslin, P. N. and Sloboda, J. A. Music and emotion: introduction. In Music and Emotion: Theory and Research, P. N. Juslin and J. A. Sloboda, Ed. New York: Oxford University Press 2001.
[2] Cunningham, S. J., Jones, M., and Jones, S. Organizing digital music for use: an examination of personal music collections. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR’04), 2004.
[3] Cunningham, S. J., Bainbridge, D., and Falconer, A. More of an art than a science: supporting the creation of playlists and mixes. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR’06), 2006.
[4] Logan, B., Kositsky, A. and Moreno, P. Semantic analysis of song lyrics. In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo (ICME'04), 2004.
[5] Hu, X. and Downie, J. S. Exploring mood metadata: relationships with genre, artist and usage metadata. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR’07), 2007.
[6] Lu, L., Liu, D. and Zhang, H. Automatic mood detection and tracking of music audio signals. IEEE Transactions On Audio, Speech, And Language Processing, 14(1): 5-18, 2006.
[7] Li, T. and Ogihara, M. Detecting emotion in music. In Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR’03), 2003.
[8] Feng, Y., Zhuang, Y. and Pan, Y. Popular music retrieval by detecting mood. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’03), 2003.
[9] Mandel, M., Poliner, G. and Ellis, D. Support vector machine active learning for music retrieval. Multimedia Systems, 12 (1): 3-13, 2006.
[10] Russell, J. A. A circumplex model of affect. Journal of Personality and Social Psychology, 39: 1161-1178, 1980.
[11] Juslin, P. N. and Laukka, P. Expression, perception, and induction of musical emotions: a review and a questionnaire study of everyday listening. Journal of New Music Research, 33(3): 217-238, 2004.
[12] Downie, J. S. The Music Information Retrieval Evaluation eXchange (2005-2007): a window into music information retrieval research. Acoustical Science and Technology, 29(4): 247-255, 2008.
[13] Hu, X., Bay, M. and Downie, J. S. Creating a simplified music mood classification groundtruth set. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR’07), 2007.
[14] Trohidis, K., Tsoumakas, G., Kalliris, G. and Vlahavas, I. Multi-label classification of music into emotions. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR’08), 2008.
[15] Hu, X., Downie, J. S., Laurier, C., Bay, M. and Ehmann, A. The 2007 MIREX Audio Music Classification task: lessons learned. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR’08), 2008.
[16] Skowronek, J., McKinney, M. F. and van de Par, S. Ground truth for automatic music mood classification. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR’06), 2006.
[17] Tzanetakis, G. and Cook, P. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, 10(5): 293-302, 2002.
[18] Aucouturier, J-J. and Pachet, F. Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1 (1), 2004. Available at http://www.csl.sony.fr/downloads/papers/uploads/aucouturier-04b.pdf.
[19] Neumayer, R. and Rauber, A. Integration of text and audio features for genre classification in music information retrieval. In Proceedings of the 29th European Conference on Information Retrieval (ECIR'07), 2007.
[20] Aucouturier, J-J., Pachet, F., Roy, P. and Beurivé, A. Signal + Context = Better Classification. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR’07), 2007.
[21] Yang, D., and Lee, W. Disambiguating music emotion using software agents. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR'04), 2004.
[22] Laurier, C., Grivolla, J. and Herrera, P. Multimodal music mood classification using audio and lyrics. In Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA'08), 2008.
[23] Watson, D., Tellegen, A. Toward a consensual structure of mood. Psychological Bulletin, 98: 219-235, 1985.
[24] Stone, P. J. General Inquirer: a Computer Approach to Content Analysis. Cambridge: M.I.T. Press, 1966.
[25] Pang, B. and Lee, L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2): 1–135, 2008.
[26] Sebastiani, F. Machine learning in automated text categorization. ACM Computing Surveys, 34(1): 1-47, 2002.
[27] Argamon, S., Saric, M. and Stein, S. S. Style mining of electronic messages for multiple authorship discrimination: first results. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), 2003.
[28] Turney, P. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL), 2002.
[29] Alm, E. C. O. Affect in Text and Speech. Doctoral Dissertation. University of Illinois at Urbana-Champaign, 2008.
[30] Li, T. and Ogihara, M. Semi-supervised learning from different information sources. Knowledge and Information Systems, 7(3): 289-309, 2004.
[31] Bradley, M. M., and Lang, P. J. Affective Norms for English Words (ANEW): Stimuli, Instruction Manual and Affective Ratings. (Technical report C-1). Gainesville, FL. The Center for Research in Psychophysiology, University of Florida, 1999.
[32] Mihalcea, R. and Liu, H. A corpus-based approach to finding happiness. AAAI Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW), 2006.
[33] Dietterich, T. G. Ensemble methods in machine learning. In First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, F. Roli, Ed. New York: Springer Verlag, 2000.
[34] Hu, X., Sanghvi, V., Vong, B., On, P. J., Leong, C. and Angelica, J. Moody: a web-based music mood classification and recommendation system. Presented at the 9th International Conference on Music Information Retrieval (ISMIR’08), 2008. Abstract available at http://ismir2008.ismir.net/latebreak/hu.pdf.
[35] Lamere, P. Social tagging and music information retrieval. Journal of New Music Research, 37(2):101-104, 2008.
[36] Strapparava, C. and Valitutti, A. WordNet-Affect: an affective extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC’04), 2004.
[37] Hu, X. Categorizing Music Mood in Social Context. Accepted by the American Society for Information Science and Technology Conference (ASIS&T 2009), Vancouver, BC, Canada, 2009.
[38] Hevner, K. Experimental studies of the elements of expression in music. American Journal of Psychology, 48: 246-268, 1936.
[39] Yang, Y. and Liu, X. A re-evaluation of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’99), 1999.
[40] Berenson, M. L., Levine, D. M. and Goldstein, M. Intermediate Statistical Methods and Applications: A Computer Package Approach. Prentice-Hall, Englewood Cliffs, NJ, 1983.


1 Allmusic.com, a popular metadata service that reviews and categorizes albums, songs and artists.
2 The AMC tasks in MIREX07 and MIREX08 used the same dataset and thus results are comparable.
3 picture credit: imagine-it.org/gamessurvey/