Sentiment polarity datasets
polarity dataset v2.0 (3.0Mb) (includes README v2.0): 1000 positive and 1000 negative processed reviews. Introduced in Pang/Lee ACL 2004. Released June 2004.
Pool of 27886 unprocessed html files (81.1Mb) from which the polarity dataset v2.0 was derived. (This file is identical to movie.zip from data release v1.0.)
sentence polarity dataset v1.0 (includes sentence polarity dataset README v1.0): 5331 positive and 5331 negative processed sentences / snippets. Introduced in Pang/Lee ACL 2005. Released July 2005.
archive:
- polarity dataset v1.0 (2.8Mb) (includes README): 700 positive and 700 negative processed reviews. Released July 2002.
- polarity dataset v1.1 (2.2Mb) (includes README.1.1): approximately 700 positive and 700 negative processed reviews. Released November 2002. This alternative version was created by Nathan Treloar, who removed a few non-English/incomplete reviews and changed some of the labels (judging some polarities to be different from the original author's rating). The complete list of changes made to v1.1 can be found in diff.txt.
- polarity dataset v0.9 (2.8Mb) (includes a README): 700 positive and 700 negative processed reviews. Introduced in Pang/Lee/Vaithyanathan EMNLP 2002. Released July 2002. Please read the "Rating Information - WARNING" section of the README.
- movie.zip (81.1Mb): all html files we collected from the IMDb archive.
Sentiment scale datasets
- scale dataset v1.0 (includes scale data README v1.0): a collection of documents whose labels come from a rating scale. Introduced in Pang/Lee ACL 2005. Released July 2005.
- Sep 30, 2009: Yanir Seroussi points out that due to some misformatting in the raw html files, six reviews are misattributed to Dennis Schwartz (29411 should be Max Messier, 29412 should be Norm Schrager, 29418 should be Steve Rhodes, 29419 should be Blake French, 29420 should be Pete Croatto, 29422 should be Rachel Gordon) and one (23982) is blank.
- original reviews for scale dataset v1.0 (includes scale data README v1.0): the original reviews from which the subjective extracts in scale dataset v1.0 were taken.
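The Sep 30, 2009 note above lists six misattributed reviews and one blank review in scale dataset v1.0. A minimal sketch of applying those corrections follows, assuming the reviews have already been loaded as (review id, author) pairs. The names `CORRECTED_AUTHORS`, `BLANK_REVIEW_IDS`, and `fix_attribution` are hypothetical and are not part of the dataset or its README.

```python
# Corrections reported by Yanir Seroussi (Sep 30, 2009) for scale dataset v1.0.
# Keys are review ids; values are the authors the reviews should be attributed to
# (all six were misattributed to Dennis Schwartz).
CORRECTED_AUTHORS = {
    29411: "Max Messier",
    29412: "Norm Schrager",
    29418: "Steve Rhodes",
    29419: "Blake French",
    29420: "Pete Croatto",
    29422: "Rachel Gordon",
}

# Review id reported as blank.
BLANK_REVIEW_IDS = {23982}


def fix_attribution(review_id, author):
    """Return the corrected author for a review, or None if the review is blank.

    Hypothetical helper: how ids and authors are loaded is left to the caller.
    """
    if review_id in BLANK_REVIEW_IDS:
        return None
    # Fall back to the author as given when no correction is listed.
    return CORRECTED_AUTHORS.get(review_id, author)
```

A caller could, for example, drop every pair for which `fix_attribution` returns `None` and replace the author with the returned value otherwise.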
Subjectivity datasets
- subjectivity dataset v1.0 (508K) (includes subjectivity README v1.0): 5000 subjective and 5000 objective processed sentences. Introduced in Pang/Lee ACL 2004. Released June 2004.
- Pool of unprocessed source documents (9.3Mb) from which the sentences in the subjectivity dataset v1.0 were extracted. Note: On April 2, 2012, we replaced the original gzipped tarball with one in which the subjective files are in the correct directory, so that the subjectivity directory is no longer empty. In the original tarball the subjective files were mistakenly placed in the wrong directory, although they were distinguishable by their different naming scheme.