统一设置ggplot2的绘图风格

library(ggplot2)
theme_set(theme_bw(base_family = "STKaiti"))

10.1:文本数据预处理

数据网址:http://www.cs.cornell.edu/people/pabo/movie-review-data/

英文文本预处理

该数据集为电影评论数据集,一共有1000条正向的评论和1000条负向的评论

数据读取,从文件夹中批量读取数据

library(stringr)

## 从文件夹中批量读取数据
# 查看工作文件夹下的文件名
flodername <- "data/chap10/review_polarity/txt_sentoken/neg"
filenames <- dir(flodername)
length(filenames)  ## 100个数据文件
## [1] 1000
head(filenames)
## [1] "cv000_29416.txt" "cv001_19502.txt" "cv002_17424.txt" "cv003_12683.txt"
## [5] "cv004_12641.txt" "cv005_29357.txt"
## 连接每个评论文件的工作目录
filenames <- str_c(flodername,filenames,sep = "/")
head(filenames)
## [1] "data/chap10/review_polarity/txt_sentoken/neg/cv000_29416.txt"
## [2] "data/chap10/review_polarity/txt_sentoken/neg/cv001_19502.txt"
## [3] "data/chap10/review_polarity/txt_sentoken/neg/cv002_17424.txt"
## [4] "data/chap10/review_polarity/txt_sentoken/neg/cv003_12683.txt"
## [5] "data/chap10/review_polarity/txt_sentoken/neg/cv004_12641.txt"
## [6] "data/chap10/review_polarity/txt_sentoken/neg/cv005_29357.txt"
## 定义读取单个文本的函数
readonetxt <- function(fn){
  return(readChar(fn,file.info(fn)$size))
}
## 读取所有文件
neg_text <- sapply(filenames, readonetxt)
length(neg_text)  
## [1] 1000
neg_text[[1]]
## [1] "plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat's the deal ? \nwatch the movie and \" sorta \" find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it's simply too jumbled . \nit starts off \" normal \" but then downshifts into this \" fantasy \" world in which you , as an audience member , have no idea what's going on . \nthere are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \nnow i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . \nit's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \nand do they make things entertaining , thrilling or even engaging , in the meantime ? \nnot really . \nthe sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . \ni guess the bottom line with movies like this is that you should always make sure that the audience is \" into it \" even before they are given the secret password to enter your world of understanding . \ni mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \nokay , we get it . . . there \nare people chasing her and we don't know who they are . \ndo we really need to see it over and over again ? \nhow about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \napparently , the studio took this film away from its director and chopped it up themselves , and it shows . \nthere might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess \" the suits \" decided that turning it into a music video with little edge , would make more sense . \nthe actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . \nbut my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . \noverall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . \noh , and by the way , this is not a horror or teen slasher flick . . . it's \njust packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . \nit also wrapped production two years ago and has been sitting on the shelves ever since . \nwhatever . . . skip \nit ! \nwhere's joblo coming from ? \na nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) \n"
## 以相同的方法读取pos文件夹里的数据
flodername <- "data/chap10/review_polarity/txt_sentoken/pos"
filenames <- dir(flodername)
length(filenames)  ## 100个数据文件
## [1] 1000
head(filenames)
## [1] "cv000_29590.txt" "cv001_18431.txt" "cv002_15918.txt" "cv003_11664.txt"
## [5] "cv004_11636.txt" "cv005_29443.txt"
## 连接每个评论文件的工作目录
filenames <- str_c(flodername,filenames,sep = "/")
head(filenames)
## [1] "data/chap10/review_polarity/txt_sentoken/pos/cv000_29590.txt"
## [2] "data/chap10/review_polarity/txt_sentoken/pos/cv001_18431.txt"
## [3] "data/chap10/review_polarity/txt_sentoken/pos/cv002_15918.txt"
## [4] "data/chap10/review_polarity/txt_sentoken/pos/cv003_11664.txt"
## [5] "data/chap10/review_polarity/txt_sentoken/pos/cv004_11636.txt"
## [6] "data/chap10/review_polarity/txt_sentoken/pos/cv005_29443.txt"
## 读取所有文件
pos_text <- sapply(filenames, readonetxt)
length(pos_text)  
## [1] 1000
pos_text[[1]]
## [1] "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or \" graphic novel , \" if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in 1888 london's east end . \nit's a filthy , sooty place where the whores ( called \" unfortunates \" ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . \nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( \" who holds back the electric car/who made steve guttenberg a star ? \" ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . \neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n2 : 00 - r for strong violence/gore , sexuality , language and drug content \n"
## 连接两个list文本文件
polarity <- append(pos_text,neg_text)
length(polarity)
## [1] 2000
## 删除列表的名称
names(polarity) <- NULL
length(polarity[[1]])
## [1] 1
## 探索这2000条文档中,每个文档包含多少个字符
charnum <- as.vector(sapply(polarity,nchar))
hist(charnum,breaks = 20)

对文本数据进行预处理和探索

library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 创建一个语料库,前1000条为正向内容,后1000条为负向内容
pol_cp <- Corpus(VectorSource(polarity))
## 输出语料库的内容
print(pol_cp)
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 2000
## 查看语料库中的内容
inspect(pol_cp[1])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in 1888 london's east end . \nit's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . \nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . \neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n2 : 00 - r for strong violence/gore , sexuality , language and drug content \n
pol_cp[[1]]$meta
##   author       : character(0)
##   datetimestamp: 2019-12-26 12:32:01
##   description  : character(0)
##   heading      : character(0)
##   id           : 1
##   language     : en
##   origin       : character(0)
##  tm_map() 作用于语料库上的转换函数,和lapply函数的用法相似,都是并行运算

## 去处语料库中的所有数字
pol_clearn <- tm_map(pol_cp,removeNumbers)
## Warning in tm_map.SimpleCorpus(pol_cp, removeNumbers): transformation drops
## documents
inspect(pol_clearn[1])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid 's with a -part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or " graphic novel , " if you will ) is over  pages long and includes nearly  more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in  london's east end . \nit's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . \nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow  ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . \neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson (  dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n :  - r for strong violence/gore , sexuality , language and drug content \n
## 从文本文档中删除标点符号
pol_clearn <- tm_map(pol_clearn,removePunctuation)
## Warning in tm_map.SimpleCorpus(pol_clearn, removePunctuation): transformation
## drops documents
inspect(pol_clearn[1])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] films adapted from comic books have had plenty of success  whether theyre about superheroes  batman  superman  spawn   or geared toward kids  casper  or the arthouse crowd  ghost world   but theres never really been a comic book like from hell before  \nfor starters  it was created by alan moore  and eddie campbell   who brought the medium to a whole new level in the mid s with a part series called the watchmen  \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd  \nthe book  or  graphic novel   if you will  is over  pages long and includes nearly  more that consist of nothing but footnotes  \nin other words  dont dismiss this film because of its source  \nif you can get past the whole comic book thing  you might find another stumbling block in from hells directors  albert and allen hughes  \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in  well  anything  but riddle me this  who better to direct a film thats set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society  \nthe ghetto in question is  of course  whitechapel in  londons east end  \nits a filthy  sooty place where the whores  called  unfortunates   are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision  \nwhen the first stiff turns up  copper peter godley  robbie coltrane  the world is not enough  calls in inspector frederick abberline  johnny depp  blow  to crack the case  \nabberline  a widower  has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium  \nupon arriving in whitechapel  he befriends an unfortunate named mary kelly  heather graham  say it isnt so  and proceeds to investigate the horribly gruesome crimes that even the police surgeon cant stomach  \ni dont think anyone needs to be briefed on jack the ripper  so i wont go into the particulars here  other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay  \nin the comic  they dont bother cloaking the identity of the ripper  but screenwriters terry hayes  vertical limit  and rafael yglesias  les mis  rables  do a good job of keeping him hidden from viewers until the very end  \nits funny to watch the locals blindly point the finger of blame at jews and indians because  after all  an englishman could never be capable of committing such ghastly acts  \nand from hells ending had me whistling the stonecutters song from the simpsons for days   who holds back the electric carwho made steve guttenberg a star     \ndont worry  itll all make sense when you see it  \nnow onto from hells appearance  its certainly dark and bleak enough  and its surprising to see how much more it looks like a tim burton film than planet of the apes did  at times  it seems like sleepy hollow    \nthe print i saw wasnt completely finished  both color and music had not been finalized  so no comments about marilyn manson   but cinematographer peter deming  dont say a word  ably captures the dreariness of victorianera london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks  even though the violence in the film pales in comparison to that in the blackandwhite comic  \noscar winner martin childs  shakespeare in love  production design turns the original prague surroundings into one creepy place  \neven the acting in from hell is solid  with the dreamy depp turning in a typically strong performance and deftly handling a british accent  \nians holm  joe goulds secret  and richardson   dalmatians  log in great supporting roles  but the big surprise here is graham  \ni cringed the first time she opened her mouth  imagining her attempt at an irish accent  but it actually wasnt half bad  \nthe film  however  is all good  \n    r for strong violencegore  sexuality  language and drug content \n
## 将所有的字母均转化为小写
pol_clearn <- tm_map(pol_clearn,tolower)
## Warning in tm_map.SimpleCorpus(pol_clearn, tolower): transformation drops
## documents
inspect(pol_clearn[1])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] films adapted from comic books have had plenty of success  whether theyre about superheroes  batman  superman  spawn   or geared toward kids  casper  or the arthouse crowd  ghost world   but theres never really been a comic book like from hell before  \nfor starters  it was created by alan moore  and eddie campbell   who brought the medium to a whole new level in the mid s with a part series called the watchmen  \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd  \nthe book  or  graphic novel   if you will  is over  pages long and includes nearly  more that consist of nothing but footnotes  \nin other words  dont dismiss this film because of its source  \nif you can get past the whole comic book thing  you might find another stumbling block in from hells directors  albert and allen hughes  \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in  well  anything  but riddle me this  who better to direct a film thats set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society  \nthe ghetto in question is  of course  whitechapel in  londons east end  \nits a filthy  sooty place where the whores  called  unfortunates   are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision  \nwhen the first stiff turns up  copper peter godley  robbie coltrane  the world is not enough  calls in inspector frederick abberline  johnny depp  blow  to crack the case  \nabberline  a widower  has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium  \nupon arriving in whitechapel  he befriends an unfortunate named mary kelly  heather graham  say it isnt so  and proceeds to investigate the horribly gruesome crimes that even the police surgeon cant stomach  \ni dont think anyone needs to be briefed on jack the ripper  so i wont go into the particulars here  other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay  \nin the comic  they dont bother cloaking the identity of the ripper  but screenwriters terry hayes  vertical limit  and rafael yglesias  les mis  rables  do a good job of keeping him hidden from viewers until the very end  \nits funny to watch the locals blindly point the finger of blame at jews and indians because  after all  an englishman could never be capable of committing such ghastly acts  \nand from hells ending had me whistling the stonecutters song from the simpsons for days   who holds back the electric carwho made steve guttenberg a star     \ndont worry  itll all make sense when you see it  \nnow onto from hells appearance  its certainly dark and bleak enough  and its surprising to see how much more it looks like a tim burton film than planet of the apes did  at times  it seems like sleepy hollow    \nthe print i saw wasnt completely finished  both color and music had not been finalized  so no comments about marilyn manson   but cinematographer peter deming  dont say a word  ably captures the dreariness of victorianera london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks  even though the violence in the film pales in comparison to that in the blackandwhite comic  \noscar winner martin childs  shakespeare in love  production design turns the original prague surroundings into one creepy place  \neven the acting in from hell is solid  with the dreamy depp turning in a typically strong performance and deftly handling a british accent  \nians holm  joe goulds secret  and richardson   dalmatians  log in great supporting roles  but the big surprise here is graham  \ni cringed the first time she opened her mouth  imagining her attempt at an irish accent  but it actually wasnt half bad  \nthe film  however  is all good  \n    r for strong violencegore  sexuality  language and drug content \n
## 去除停用词
pol_clearn <- tm_map(pol_clearn,removeWords,stopwords())
## Warning in tm_map.SimpleCorpus(pol_clearn, removeWords, stopwords()):
## transformation drops documents
inspect(pol_clearn[1])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] films adapted  comic books   plenty  success  whether theyre  superheroes  batman  superman  spawn    geared toward kids  casper    arthouse crowd  ghost world    theres never really   comic book like  hell   \n starters    created  alan moore   eddie campbell    brought  medium   whole new level   mid s   part series called  watchmen  \n say moore  campbell thoroughly researched  subject  jack  ripper   like saying michael jackson  starting  look  little odd  \n book    graphic novel     will     pages long  includes nearly    consist  nothing  footnotes  \n  words  dont dismiss  film    source  \n  can get past  whole comic book thing   might find another stumbling block   hells directors  albert  allen hughes  \ngetting  hughes brothers  direct  seems almost  ludicrous  casting carrot top   well  anything   riddle     better  direct  film thats set   ghetto  features really violent street crime   mad geniuses behind menace ii society  \n ghetto  question    course  whitechapel   londons east end  \n  filthy  sooty place   whores  called  unfortunates    starting  get  little nervous   mysterious psychopath    carving   profession  surgical precision  \n  first stiff turns   copper peter godley  robbie coltrane   world   enough  calls  inspector frederick abberline  johnny depp  blow   crack  case  \nabberline   widower   prophetic dreams  unsuccessfully tries  quell  copious amounts  absinthe  opium  \nupon arriving  whitechapel   befriends  unfortunate named mary kelly  heather graham  say  isnt    proceeds  investigate  horribly gruesome crimes  even  police surgeon cant stomach  \n dont think anyone needs   briefed  jack  ripper    wont go   particulars      say moore  campbell   unique  interesting theory    identity   killer   reasons  chooses  slay  \n  comic   dont bother cloaking  identity   ripper   screenwriters terry hayes  vertical limit   rafael yglesias  les mis  rables    good job  keeping  hidden  viewers    end  \n funny  watch  locals blindly point  finger  blame  jews  indians       englishman  never  capable  committing  ghastly acts  \n  hells ending   whistling  stonecutters song   simpsons  days    holds back  electric carwho made steve guttenberg  star     \ndont worry  itll  make sense   see   \nnow onto  hells appearance   certainly dark  bleak enough    surprising  see  much   looks like  tim burton film  planet   apes    times   seems like sleepy hollow    \n print  saw wasnt completely finished   color  music    finalized    comments  marilyn manson    cinematographer peter deming  dont say  word  ably captures  dreariness  victorianera london  helped make  flashy killing scenes remind    crazy flashbacks  twin peaks  even though  violence   film pales  comparison     blackandwhite comic  \noscar winner martin childs  shakespeare  love  production design turns  original prague surroundings  one creepy place  \neven  acting   hell  solid    dreamy depp turning   typically strong performance  deftly handling  british accent  \nians holm  joe goulds secret   richardson   dalmatians  log  great supporting roles    big surprise   graham  \n cringed  first time  opened  mouth  imagining  attempt   irish accent    actually wasnt half bad  \n film  however    good  \n    r  strong violencegore  sexuality  language  drug content \n
head(stopwords())
## [1] "i"      "me"     "my"     "myself" "we"     "our"
## 去除额外的空格
pol_clearn <- tm_map(pol_clearn,stripWhitespace)
## Warning in tm_map.SimpleCorpus(pol_clearn, stripWhitespace): transformation
## drops documents
inspect(pol_clearn[1])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] films adapted comic books plenty success whether theyre superheroes batman superman spawn geared toward kids casper arthouse crowd ghost world theres never really comic book like hell starters created alan moore eddie campbell brought medium whole new level mid s part series called watchmen say moore campbell thoroughly researched subject jack ripper like saying michael jackson starting look little odd book graphic novel will pages long includes nearly consist nothing footnotes words dont dismiss film source can get past whole comic book thing might find another stumbling block hells directors albert allen hughes getting hughes brothers direct seems almost ludicrous casting carrot top well anything riddle better direct film thats set ghetto features really violent street crime mad geniuses behind menace ii society ghetto question course whitechapel londons east end filthy sooty place whores called unfortunates starting get little nervous mysterious psychopath carving profession surgical precision first stiff turns copper peter godley robbie coltrane world enough calls inspector frederick abberline johnny depp blow crack case abberline widower prophetic dreams unsuccessfully tries quell copious amounts absinthe opium upon arriving whitechapel befriends unfortunate named mary kelly heather graham say isnt proceeds investigate horribly gruesome crimes even police surgeon cant stomach dont think anyone needs briefed jack ripper wont go particulars say moore campbell unique interesting theory identity killer reasons chooses slay comic dont bother cloaking identity ripper screenwriters terry hayes vertical limit rafael yglesias les mis rables good job keeping hidden viewers end funny watch locals blindly point finger blame jews indians englishman never capable committing ghastly acts hells ending whistling stonecutters song simpsons days holds back electric carwho made steve guttenberg star dont worry itll make sense see now onto hells appearance certainly dark bleak enough surprising see much looks like tim burton film planet apes times seems like sleepy hollow print saw wasnt completely finished color music finalized comments marilyn manson cinematographer peter deming dont say word ably captures dreariness victorianera london helped make flashy killing scenes remind crazy flashbacks twin peaks even though violence film pales comparison blackandwhite comic oscar winner martin childs shakespeare love production design turns original prague surroundings one creepy place even acting hell solid dreamy depp turning typically strong performance deftly handling british accent ians holm joe goulds secret richardson dalmatians log great supporting roles big surprise graham cringed first time opened mouth imagining attempt irish accent actually wasnt half bad film however good r strong violencegore sexuality language drug content
## 清洗语料库

inspect(pol_cp[c(1,1001)])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 2
## 
## [1] films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \nthe book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . \nin other words , don't dismiss this film because of its source . \nif you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . \ngetting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? \nthe ghetto in question is , of course , whitechapel in 1888 london's east end . \nit's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision . \nwhen the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . \nabberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . \nupon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach . \ni don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay . \nin the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . \nit's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts . \nand from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) . \ndon't worry - it'll all make sense when you see it . \nnow onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) . \nthe print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . \noscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . \neven the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . \nians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . \ni cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . \nthe film , however , is all good . \n2 : 00 - r for strong violence/gore , sexuality , language and drug content \n
## [2] plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat's the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it's simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea what's going on . \nthere are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \nnow i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . \nit's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \nand do they make things entertaining , thrilling or even engaging , in the meantime ? \nnot really . \nthe sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . \ni guess the bottom line with movies like this is that you should always make sure that the audience is " into it " even before they are given the secret password to enter your world of understanding . \ni mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \nokay , we get it . . . there \nare people chasing her and we don't know who they are . \ndo we really need to see it over and over again ? \nhow about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \napparently , the studio took this film away from its director and chopped it up themselves , and it shows . \nthere might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess " the suits " decided that turning it into a music video with little edge , would make more sense . \nthe actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . \nbut my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . \noverall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . \noh , and by the way , this is not a horror or teen slasher flick . . . it's \njust packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . \nit also wrapped production two years ago and has been sitting on the shelves ever since . \nwhatever . . . skip \nit ! \nwhere's joblo coming from ? \na nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) \n
inspect(pol_clearn[c(1,1001)])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 2
## 
## [1] films adapted comic books plenty success whether theyre superheroes batman superman spawn geared toward kids casper arthouse crowd ghost world theres never really comic book like hell starters created alan moore eddie campbell brought medium whole new level mid s part series called watchmen say moore campbell thoroughly researched subject jack ripper like saying michael jackson starting look little odd book graphic novel will pages long includes nearly consist nothing footnotes words dont dismiss film source can get past whole comic book thing might find another stumbling block hells directors albert allen hughes getting hughes brothers direct seems almost ludicrous casting carrot top well anything riddle better direct film thats set ghetto features really violent street crime mad geniuses behind menace ii society ghetto question course whitechapel londons east end filthy sooty place whores called unfortunates starting get little nervous mysterious psychopath carving profession surgical precision first stiff turns copper peter godley robbie coltrane world enough calls inspector frederick abberline johnny depp blow crack case abberline widower prophetic dreams unsuccessfully tries quell copious amounts absinthe opium upon arriving whitechapel befriends unfortunate named mary kelly heather graham say isnt proceeds investigate horribly gruesome crimes even police surgeon cant stomach dont think anyone needs briefed jack ripper wont go particulars say moore campbell unique interesting theory identity killer reasons chooses slay comic dont bother cloaking identity ripper screenwriters terry hayes vertical limit rafael yglesias les mis rables good job keeping hidden viewers end funny watch locals blindly point finger blame jews indians englishman never capable committing ghastly acts hells ending whistling stonecutters song simpsons days holds back electric carwho made steve guttenberg star dont worry itll make sense see now onto hells appearance certainly dark bleak enough surprising see much looks like tim burton film planet apes times seems like sleepy hollow print saw wasnt completely finished color music finalized comments marilyn manson cinematographer peter deming dont say word ably captures dreariness victorianera london helped make flashy killing scenes remind crazy flashbacks twin peaks even though violence film pales comparison blackandwhite comic oscar winner martin childs shakespeare love production design turns original prague surroundings one creepy place even acting hell solid dreamy depp turning typically strong performance deftly handling british accent ians holm joe goulds secret richardson dalmatians log great supporting roles big surprise graham cringed first time opened mouth imagining attempt irish accent actually wasnt half bad film however good r strong violencegore sexuality language drug content 
## [2] plot two teen couples go church party drink drive get accident one guys dies girlfriend continues see life nightmares whats deal watch movie sorta find critique mindfuck movie teen generation touches cool idea presents bad package makes review even harder one write since generally applaud films attempt break mold mess head lost highway memento good bad ways making types films folks just didnt snag one correctly seem taken pretty neat concept executed terribly problems movie well main problem simply jumbled starts normal downshifts fantasy world audience member idea whats going dreams characters coming back dead others look like dead strange apparitions disappearances looooot chase scenes tons weird things happen simply explained now personally dont mind trying unravel film every now give clue get kind fed films biggest problem obviously got big secret hide seems want hide completely final five minutes make things entertaining thrilling even engaging meantime really sad part arrow dig flicks like actually figured halfway point strangeness start make little bit sense still didnt make film entertaining guess bottom line movies like always make sure audience even given secret password enter world understanding mean showing melissa sagemiller running away visions minutes throughout movie just plain lazy okay get people chasing dont know really need see giving us different scenes offering insight strangeness going movie apparently studio took film away director chopped shows mightve pretty decent teen mindfuck movie somewhere guess suits decided turning music video little edge make sense actors pretty good part although wes bentley just seemed playing exact character american beauty new neighborhood biggest kudos go sagemiller holds throughout entire film actually feeling characters unraveling overall film doesnt stick doesnt entertain confusing rarely excites feels pretty redundant runtime despite pretty cool ending explanation craziness came oh way horror teen slasher flick just packaged look way someone apparently assuming genre still hot kids also wrapped production two years ago sitting shelves ever since whatever skip wheres joblo coming nightmare elm street blair witch crow crow salvation lost highway memento others stir echoes

文本数据可视化

library(RColorBrewer)
library(wordcloud)
library(wordcloud2)

## 将清洗好的语料库转化为文档-词项矩阵
pol_dtm <- DocumentTermMatrix(pol_clearn)
dim(pol_dtm)
## [1]  2000 46460
## 查看每个词语的词频
freq.terms <- sort(colSums(as.matrix(pol_dtm)),decreasing = TRUE)
freq.terms <- data.frame(name = names(freq.terms),
                         fre = freq.terms,row.names = NULL)
head(freq.terms,10)
##     name  fre
## 1   film 8861
## 2    one 5521
## 3  movie 5440
## 4   like 3554
## 5   just 2900
## 6   even 2555
## 7   good 2321
## 8   time 2283
## 9    can 2232
## 10  will 2194
length(which(freq.terms$fre>500))
## [1] 175
## 可视化所有文档的词云
set.seed(375) # to make it reproducible
wordcloud(words=freq.terms$name, freq=freq.terms$fre,
          min.freq = 500,random.color = FALSE,
          scale=c(4,0.5), colors=brewer.pal(8,"Dark2"))

##  对比分析两种不同情感的评论的词云,使用可交互的词云
## 计算不同类型数据的词频
freq.pos <- sort(colSums(as.matrix(pol_dtm[1:1000,])),decreasing = TRUE)
freq.pos <- data.frame(name = names(freq.pos),fre = freq.pos,
                       row.names = NULL)
freq.neg <- sort(colSums(as.matrix(pol_dtm[1001:2000,])),decreasing = TRUE)
freq.neg <- data.frame(name = names(freq.neg),fre = freq.neg,
                       row.names = NULL)

dim(freq.neg[freq.neg$fre>300,])
## [1] 125   2
set.seed(375)
freq.neg[freq.neg$fre>200,] %>%
  wordcloud2()
freq.pos[freq.pos$fre>200,] %>%
  wordcloud2()

将文本数据从语料库中转化为dataframe

## 将文本数据从语料库中转化为dataframe
polaritydf <- data.frame(text=sapply(pol_clearn, identity), stringsAsFactors=F)
polaritydf$label <- rep(c("pos","neg"),each = 1000)
head(polaritydf,2)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   text
## 1 films adapted comic books plenty success whether theyre superheroes batman superman spawn geared toward kids casper arthouse crowd ghost world theres never really comic book like hell starters created alan moore eddie campbell brought medium whole new level mid s part series called watchmen say moore campbell thoroughly researched subject jack ripper like saying michael jackson starting look little odd book graphic novel will pages long includes nearly consist nothing footnotes words dont dismiss film source can get past whole comic book thing might find another stumbling block hells directors albert allen hughes getting hughes brothers direct seems almost ludicrous casting carrot top well anything riddle better direct film thats set ghetto features really violent street crime mad geniuses behind menace ii society ghetto question course whitechapel londons east end filthy sooty place whores called unfortunates starting get little nervous mysterious psychopath carving profession surgical precision first stiff turns copper peter godley robbie coltrane world enough calls inspector frederick abberline johnny depp blow crack case abberline widower prophetic dreams unsuccessfully tries quell copious amounts absinthe opium upon arriving whitechapel befriends unfortunate named mary kelly heather graham say isnt proceeds investigate horribly gruesome crimes even police surgeon cant stomach dont think anyone needs briefed jack ripper wont go particulars say moore campbell unique interesting theory identity killer reasons chooses slay comic dont bother cloaking identity ripper screenwriters terry hayes vertical limit rafael yglesias les mis rables good job keeping hidden viewers end funny watch locals blindly point finger blame jews indians englishman never capable committing ghastly acts hells ending whistling stonecutters song simpsons days holds back electric carwho made steve guttenberg star dont worry itll make sense see now onto hells appearance certainly dark bleak enough surprising see much looks like tim burton film planet apes times seems like sleepy hollow print saw wasnt completely finished color music finalized comments marilyn manson cinematographer peter deming dont say word ably captures dreariness victorianera london helped make flashy killing scenes remind crazy flashbacks twin peaks even though violence film pales comparison blackandwhite comic oscar winner martin childs shakespeare love production design turns original prague surroundings one creepy place even acting hell solid dreamy depp turning typically strong performance deftly handling british accent ians holm joe goulds secret richardson dalmatians log great supporting roles big surprise graham cringed first time opened mouth imagining attempt irish accent actually wasnt half bad film however good r strong violencegore sexuality language drug content 
## 2                                                                                                                                                                                                     every now movie comes along suspect studio every indication will stinker everybodys surprise perhaps even studio film becomes critical darling mtv films election high school comedy starring matthew broderick reese witherspoon current example anybody know film existed week opened plot deceptively simple george washington carver high school student elections tracy flick reese witherspoon overachiever hand raised nearly every question way way high mr m matthew broderick sick megalomaniac student encourages paul popularbutslow jock run pauls nihilistic sister jumps race well personal reasons dark side sleeper success expectations low going fact quality stuff made reviews even enthusiastic right cant help going baggage glowing reviews contrast negative baggage reviewers likely election good film live hype makes election disappointing contains significant plot details lifted directly rushmore released months earlier similarities staggering tracy flick election president extraordinary number clubs involved school play max fischer rushmore president extraordinary number clubs involved school play significant tension election potential relationship teacher student significant tension rushmore potential relationship teacher student tracy flick single parent home contributed drive max fischer single parent home contributed drive male bumbling adult election matthew broderick pursues extramarital affair gets caught whole life ruined even gets bee sting male bumbling adult rushmore bill murray pursues extramarital affair gets caught whole life ruined gets several bee stings happened individual screenplay rushmore novel election contain many significant plot points yet films probably even aware made two different studios genre high school geeks revenge movie hadnt fully formed yet even strengths election rely upon fantastic performances broderick witherspoon newcomer jessica campbell pauls antisocial sister tammy broderick playing mr rooney role ferris bueller seems fun hes since witherspoon revelation early year comedy teenagers little clout money witherspoon deserves oscar nomination campbells character gets going like fantastic speech gymnasium youre won one thing thats bothering since ive seen extraordinary amount sexuality film suppose coming mtv films expect less film starts light airy like sitcom screws tighten tensions mount alexander payne decides add elements frankly distract story bad enough mr m doesnt like tracys determination win costs throw studentteacher relationship even theres logical reason mr m affair theres lot like election plot similarities rushmore tonal nosedive takes gets explicitly sexdriven mark disappointment 
##   label
## 1   pos
## 2   pos
## 对比分析两种情感评论的用词长度
wordlist <- str_split(polaritydf$text,"[[:space:]]+")
wordlen <- unlist(lapply(wordlist,length))
polaritydf$wordlen <- wordlen

ggplot(polaritydf,aes(x = label,y = wordlen))+
  geom_violin(fill = "red",alpha = 0.5)

t.test(wordlen~label,data = polaritydf)
## 
##  Welch Two Sample t-test
## 
## data:  wordlen by label
## t = -5.9188, df = 1937, p-value = 3.827e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -53.66942 -26.95458
## sample estimates:
## mean in group neg mean in group pos 
##           337.027           377.339
## mean in group neg mean in group pos 
##           337.027           377.339 
## 说明人们愿意写更多的字来表达正向的情感。

## 将处理好的数据表格保存
write.csv(polaritydf,"data/chap10/review_polarity.csv",row.names = FALSE)

中文文本预处理

红楼梦数据集预处理,使用jiebar分词

library(jiebaR)
## Loading required package: jiebaRD
library(stringr)
library(parallel)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggExtra)
library(tm)
library(wordcloud2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 读取停用词
filename <- "data/chap10/Red_dream/红楼梦停用词.txt"
mystopwords <- readLines(filename)
## 读取红楼梦文本
filename <-"data/chap10/Red_dream/红楼梦文本_UTF8.txt"
Red_dream <- readLines(filename,encoding='UTF-8')
head(Red_dream)
## [1] "第1卷"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [2] "第一回 甄士隐梦幻识通灵 贾雨村风尘怀闺秀"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [3] "  此开卷第一回也。作者自云:因曾历过一番梦幻之后,故将真事隐去,而借“通灵”之说,撰此<<石头记>>一书也。故曰“甄士隐”云云。但书中所记何事何人?自又云:“今风尘碌碌,一事无成,忽念及当日所有之女子,一一细考较去,觉其行止见识,皆出于我之上。何我堂堂须眉,诚不若彼裙钗哉?实愧则有余,悔又无益之大无可如何之日也!当此,则自欲将已往所赖天恩祖德,锦衣纨э之时,饫甘餍肥之日,背父兄教育之恩,负师友规谈之德,以至今日一技无成,半生潦倒之罪,编述一集,以告天下人:我之罪固不免,然闺阁中本自历历有人,万不可因我之不肖,自护己短,一并使其泯灭也。虽今日之茅椽蓬牖,瓦灶绳床,其晨夕风露,阶柳庭花,亦未有妨我之襟怀笔墨者。虽我未学,下笔无文,又何妨用假语村言,敷演出一段故事来,亦可使闺阁昭传,复可悦世之目,破人愁闷,不亦宜乎?”故曰“贾雨村”云云。"
## [4] "  此回中凡用“梦”用“幻”等字,是提醒阅者眼目,亦是此书立意本旨。"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [5] "  列位看官:你道此书从何而来?说起根由虽近荒唐,细按则深有趣味。待在下将此来历注明,方使阅者了然不惑。"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [6] "  原来女娲氏炼石补天之时,于大荒山无稽崖练成高经十二丈,方经二十四丈顽石三万六千五百零一块。娲皇氏只用了三万六千五百块,只单单剩了一块未用,便弃在此山青埂峰下。谁知此石自经煅炼之后,灵性已通,因见众石俱得补天,独自己无材不堪入选,遂自怨自叹,日夜悲号惭愧。"
## 提取读入的文本的每个章节数据
## 去除空白行
Red_dream <- Red_dream[!is.na(Red_dream)]

## 删除卷数据
juan <- grep(Red_dream,pattern = "^第+.+卷")
Red_dream <- Red_dream[(-juan)]
## 找出每一章节的头部行数和尾部行数
## 每一章节的名字
nameindex <- grep(Red_dream,pattern = "^第+.+回")
Red_dreamname <- data.frame(name = Red_dream[nameindex],
                            chapter = 1:120)
head(Red_dreamname)
##                                       name chapter
## 1 第一回 甄士隐梦幻识通灵 贾雨村风尘怀闺秀       1
## 2 第二回 贾夫人仙逝扬州城 冷子兴演说荣国府       2
## 3 第三回 贾雨村夤缘复旧职 林黛玉抛父进京都       3
## 4 第四回 薄命女偏逢薄命郎 葫芦僧乱判葫芦案       4
## 5 第五回 游幻境指迷十二钗 饮仙醪曲演红楼梦       5
## 6 第六回 贾宝玉初试云雨情 刘姥姥一进荣国府       6
## 处理章节名,切分字符串
names <- data.frame(str_split(Red_dreamname$name,pattern = " ",simplify =TRUE))
Red_dreamname$chapter2 <- names$X1
## 连接字符串
Red_dreamname$Name <- apply(names[,2:3],1,str_c,collapse = ",")
## 每章的开始行数
Red_dreamname$chapbegin<- grep(Red_dream,pattern = "^第+.+回")
## 每章的结束行数
Red_dreamname$chapend <- c((Red_dreamname$chapbegin-1)[-1],length(Red_dream))
## 每章的段落长度
Red_dreamname$chaplen <- Red_dreamname$chapend - Red_dreamname$chapbegin
## 每章的内容
for (ii in 1:nrow(Red_dreamname)){
  ## 将一章的所有段落连接起来连接
  chapstrs <- str_c(Red_dream[(Red_dreamname$chapbegin[ii]+1):Red_dreamname$chapend[ii]],collapse = "")
  ## 剔除不必要的空格
  Red_dreamname$content[ii] <- str_replace_all(chapstrs,pattern = "[[:blank:]]",replacement = "")
}
## 每章节的内容
content <- Red_dreamname$content
Red_dreamname$content <- NULL
## 计算每章有多少个字
Red_dreamname$numchars <- nchar(content)
head(Red_dreamname)
##                                       name chapter chapter2
## 1 第一回 甄士隐梦幻识通灵 贾雨村风尘怀闺秀       1   第一回
## 2 第二回 贾夫人仙逝扬州城 冷子兴演说荣国府       2   第二回
## 3 第三回 贾雨村夤缘复旧职 林黛玉抛父进京都       3   第三回
## 4 第四回 薄命女偏逢薄命郎 葫芦僧乱判葫芦案       4   第四回
## 5 第五回 游幻境指迷十二钗 饮仙醪曲演红楼梦       5   第五回
## 6 第六回 贾宝玉初试云雨情 刘姥姥一进荣国府       6   第六回
##                                Name chapbegin chapend chaplen numchars
## 1 甄士隐梦幻识通灵,贾雨村风尘怀闺秀         1      50      49     7775
## 2 贾夫人仙逝扬州城,冷子兴演说荣国府        51      80      29     5882
## 3 贾雨村夤缘复旧职,林黛玉抛父进京都        81     119      38     8481
## 4 薄命女偏逢薄命郎,葫芦僧乱判葫芦案       120     149      29     5898
## 5 游幻境指迷十二钗,饮仙醪曲演红楼梦       150     236      86     7417
## 6 贾宝玉初试云雨情,刘姥姥一进荣国府       237     263      26     7274
## save(content,file = "data/chap10/Readdream_content.RData")

分词等相关的操作

##-----------------------------------------------------------------------
## 对红楼梦进行分词####
## 使用自定义词典的分词器,分词方式为"mix":最大概率分割模型
Red_fen <- jiebaR::worker(type = "mix",
                          user = "data/chap10/Red_dream/红楼梦词典.txt")
Fen_red <- apply_list(as.list(content),Red_fen)
length(Fen_red)
## [1] 120
lapply(Fen_red[1:5],head)
## [[1]]
## [1] "此"     "开卷"   "第一回" "也"     "作者"   "自云"  
## 
## [[2]]
## [1] "诗云"   "一局"   "输赢"   "料不真" "香销"   "茶"    
## 
## [[3]]
## [1] "却说" "雨村" "忙"   "回头" "看时" "不是"
## 
## [[4]]
## [1] "却说" "黛玉" "同"   "姊妹" "们"   "至"  
## 
## [[5]]
## [1] "第四回" "中"     "既"     "将"     "薛家"   "母子"
# 第10章的分词结果的抽样
Fen_red[[10]][20:40]
##  [1] "不"     "吵闹"   "了"     "大家"   "散"     "了"     "学"     "金荣"  
##  [9] "回到"   "家中"   "越想"   "越气"   "说"     "秦钟"   "不过"   "是"    
## [17] "贾蓉"   "的"     "小舅子" "又"     "不是"
## 去除停用词,使用并行的方法

cl <- makeCluster(4)
Fen_red <- parLapply(cl = cl,Fen_red, filter_segment,filter_words=mystopwords)
stopCluster(cl)
# Fen_red <- lapply(Fen_red, filter_segment,filter_words=mystopwords)
## 每章节最终有多少个词
Red_dreamname$wordlen <- unlist(lapply(Fen_red,length))
## 添加分组变量,前80章为1组,后40章为2组
Red_dreamname$Group <- factor(rep(c(1,2),times = c(80,40)),
                              labels = c("前80章","后40章"))
Red_dreamname$name <- NULL
head(Red_dreamname)
##   chapter chapter2                              Name chapbegin chapend chaplen
## 1       1   第一回 甄士隐梦幻识通灵,贾雨村风尘怀闺秀         1      50      49
## 2       2   第二回 贾夫人仙逝扬州城,冷子兴演说荣国府        51      80      29
## 3       3   第三回 贾雨村夤缘复旧职,林黛玉抛父进京都        81     119      38
## 4       4   第四回 薄命女偏逢薄命郎,葫芦僧乱判葫芦案       120     149      29
## 5       5   第五回 游幻境指迷十二钗,饮仙醪曲演红楼梦       150     236      86
## 6       6   第六回 贾宝玉初试云雨情,刘姥姥一进荣国府       237     263      26
##   numchars wordlen  Group
## 1     7775    1939 前80章
## 2     5882    1488 前80章
## 3     8481    2228 前80章
## 4     5898    1533 前80章
## 5     7417    1727 前80章
## 6     7274    1586 前80章
## 对每章的内容进行探索分析####
## 对相关章节进行分析
## 每章节的段落长度
p1 <- ggplot(Red_dreamname,aes(x = chapter,y = chaplen)) +
  theme_bw(base_family = "STKaiti",base_size = 10) +
  geom_point(colour = "red",size = 1) +geom_line() +
  geom_text(aes(x = 25,y = 0.9*max(Red_dreamname$chaplen)),
            label="前80章",family = "STKaiti",colour = "Red") +
  geom_text(aes(x = 100,y = 0.9*max(Red_dreamname$chaplen)),
            label="后40章",family = "STKaiti",colour = "Red") +
  geom_vline(xintercept = 80.5,colour = "blue") +
  labs(x = "章节",y = "段数",title = "《红楼梦》每章段数")
## 每章节的字数
p2 <- ggplot(Red_dreamname,aes(x = chapter,y = numchars)) +
  theme_bw(base_family = "STKaiti",base_size = 10) +
  geom_point(colour = "red",size = 1) +geom_line() +
  geom_text(aes(x = 25,y = 0.9*max(Red_dreamname$numchars)),
            label="前80章",family = "STKaiti",colour = "Red") +
  geom_text(aes(x = 100,y = 0.9*max(Red_dreamname$numchars)),
            label="后40章",family = "STKaiti",colour = "Red") +
  geom_vline(xintercept = 80.5,colour = "blue") +
  labs(x = "章节",y = "字数",title = "《红楼梦》每章字数")

p3 <- ggplot(Red_dreamname,aes(x = chapter,y = wordlen)) +
  theme_bw(base_family = "STKaiti",base_size = 10) +
  geom_point(colour = "red",size = 1) +geom_line() +
  geom_text(aes(x = 25,y = 0.9*max(Red_dreamname$wordlen)),
            label="前80章",family = "STKaiti",colour = "Red") +
  geom_text(aes(x = 100,y = 0.9*max(Red_dreamname$wordlen)),
            label="后40章",family = "STKaiti",colour = "Red") +
  geom_vline(xintercept = 80.5,colour = "blue") +
  labs(x = "章节",y = "词数",title = "《红楼梦》每章词数")
## 绘制每一章节的平行坐标图
p4 <- ggparcoord(Red_dreamname,columns = 6:8,scale = "center",
                 groupColumn = "Group",showPoints = TRUE,
                 title = "《红楼梦》") +
  theme_bw(base_family = "STKaiti",base_size = 10) +
  theme(legend.position =  "bottom",axis.title.x = element_blank()) +
  scale_x_discrete(labels = c("断落数","字数","词数")) +
  ylab("中心化数据大小")
  
gridExtra::grid.arrange(p1,p2,p3,p4,ncol = 2)

## 针对前80章和后40章在段数、字数和词数是否有差异进行检验
colnames(Red_dreamname)
## [1] "chapter"   "chapter2"  "Name"      "chapbegin" "chapend"   "chaplen"  
## [7] "numchars"  "wordlen"   "Group"
t.test(chaplen~Group,data = Red_dreamname)
## 
##  Welch Two Sample t-test
## 
## data:  chaplen by Group
## t = 3.1423, df = 109.35, p-value = 0.002158
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.178734 9.621266
## sample estimates:
## mean in group 前80章 mean in group 后40章 
##                 26.4                 20.5
t.test(numchars~Group,data = Red_dreamname)
## 
##  Welch Two Sample t-test
## 
## data:  numchars by Group
## t = 1.6919, df = 106.2, p-value = 0.09359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -74.40985 940.85985
## sample estimates:
## mean in group 前80章 mean in group 后40章 
##             7319.400             6886.175
t.test(wordlen~Group,data = Red_dreamname)
## 
##  Welch Two Sample t-test
## 
## data:  wordlen by Group
## t = 1.594, df = 108.46, p-value = 0.1138
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -21.96746 202.41746
## sample estimates:
## mean in group 前80章 mean in group 后40章 
##             1618.250             1528.025
## 从t检验结果可以发现,前80章与后40章的段落数差异较大,
## 但字数、词数的差异在显著性水平为0.05下是不显著的

统计词频并可视化为词云

## 去除分词中的无效字符
Fen_red2 <- Fen_red
for(ii in 1:length(Fen_red)){
  Fenone <- str_c(Fen_red[ii],sep = " ") 
  Fenone <- str_remove_all(Fenone,pattern = "[:punct:]")
  Fenone <- str_remove_all(Fenone,pattern = "\n")
  Fenone <- str_remove_all(Fenone,pattern = "c")
  Fen_red2[ii] <- Fenone
}
## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing

## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## argument is not an atomic vector; coercing
## 词频统计##-----------------------------------------------------------
## 1:构建文档-词项频数矩阵
#corpus <- Corpus(VectorSource(Fen_red))
corpus <- Corpus(VectorSource(Fen_red2))
Red_dtm <- DocumentTermMatrix(corpus,control = list(wordLengths=c(1,Inf)))
Red_dtm
## <<DocumentTermMatrix (documents: 120, terms: 40342)>>
## Non-/sparse entries: 117554/4723486
## Sparsity           : 98%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## 2:词频统计
word_freq <- sort(colSums(as.matrix(Red_dtm)),decreasing = TRUE)
word_freq <- data.frame(word = names(word_freq),freq=word_freq,row.names = NULL)
head(word_freq)
##   word freq
## 1 宝玉 3907
## 2 笑道 1955
## 3 贾母 1686
## 4 一个 1440
## 5 凤姐 1228
## 6 袭人 1152
dim(word_freq[word_freq$freq >= 300,])
## [1] 52  2
## 绘制词频图
nn = 300
word_freq[word_freq$freq >= nn,] %>%
  ggplot(aes(x = reorder(word,-freq),y = freq)) +
  theme_bw(base_size = 12,base_family = "STKaiti") +
  geom_bar(stat = "identity",fill= "red",colour = "lightblue",alpha = 0.6) +
  scale_x_discrete() +
  theme(axis.text.x = element_text(angle = 75,hjust = 1,size = 8),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "词项",y = "频数",title = "《红楼梦》词频图")

##  可交互的红楼梦词云

## 动态词云
set.seed(123)
word_freq[word_freq$freq>60,] %>%
  wordcloud2(color = 'random-dark',backgroundColor = "white",
             shape = 'star' )

10.2:文本主题模型挖掘

电影评论数据的LDA主题模型

library(lda)
library(LDAvis)
library(stringr)
library(servr)
## 读取数据
polarity <- read.csv("data/chap10/review_polarity.csv")
textlist <- str_split(polarity$text,"[[:space:]]+")
## 计算词的频率

# 获取模型所需要使用的词库
term_table <- table(unlist(textlist))
term_table <- sort(term_table, decreasing = TRUE)
term_table[1:20]
## 
##       film        one      movie       like                  just       even 
##       8861       5521       5440       3554       3024       2900       2555 
##       good       time        can       will      story      films       much 
##       2321       2283       2232       2194       2121       2103       2024 
##       also characters        get  character        two      first 
##       1965       1947       1921       1906       1825       1768
dim(term_table)
## [1] 46703
## 剔除空的数据
del <- (names(term_table)) %in% c("")
term_table <- term_table[!del]
dim(term_table)
## [1] 46702
# 删除出现次数小于10次的词语
del <- term_table < 10
term_table <- term_table[!del]
vocab <- names(term_table)

# 将数据整理为lda包可使用的形式
get_terms <- function(x) {
  index <- match(x, vocab)
  index <- index[!is.na(index)]
  rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
documents <- lapply(textlist, get_terms)

## documents[[i]]:表示第i个文档
## documents[[i]][1,j] : 表示文档i中第j个单词对应在词库中的(索引-1)
## documents[[i]][2,j] : 表示文档i中第j个单词出现的次数

## 使用lda模型
alpha <-  0.02
eta <- 0.02
set.seed(357)
t1 <- Sys.time()
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = 4, 
                                   vocab = vocab, 
                                   num.iterations = 50, alpha = alpha, 
                                   eta = eta, initial = NULL, burnin = 0,
                                   compute.log.likelihood = TRUE)
t2 <- Sys.time()
t2 - t1  # about 24 minutes on laptop
## Time difference of 2.947711 secs
dim(fit$topics)
## [1]    4 9144
## 可视化lda模型
## 对模型进行可视化

theta <- t(apply(fit$document_sums + alpha, 2, function(x) x/sum(x)))
phi <- t(apply(t(fit$topics) + eta, 2, function(x) x/sum(x)))
doc_length <- sapply(documents, function(x) sum(x[2, ]))
term_frequency <- as.vector(term_table)
pol_ldavis <- list(phi = phi,theta = theta,doc_length = doc_length,
                   vocab = vocab,term_frequency = term_frequency)

# create the JSON object to feed the visualization:
json <- createJSON(phi = pol_ldavis$phi, 
                   theta = pol_ldavis$theta, 
                   doc.length = pol_ldavis$doc_length, 
                   vocab = pol_ldavis$vocab, 
                   term.frequency = pol_ldavis$term_frequency)

serVis(json, out.dir = "data/chap10/vis")
## Warning in dir.create(out.dir): 'data/chap10/vis'已存在

使用topicmodel包进行LDA主题模型分析

使用《红楼梦》数据集的文档-词项矩阵Red_dtm

library(topicmodels)
## 
## Attaching package: 'topicmodels'
## The following object is masked _by_ '.GlobalEnv':
## 
##     get_terms
library(tidytext)
library(tidyr)
Red_dtm
## <<DocumentTermMatrix (documents: 120, terms: 40342)>>
## Non-/sparse entries: 117554/4723486
## Sparsity           : 98%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## 减少文档-词项矩阵Red_dtm的稀疏性
Red_dtm2 <- removeSparseTerms(Red_dtm,0.98)
Red_dtm2
## <<DocumentTermMatrix (documents: 120, terms: 8373)>>
## Non-/sparse entries: 80202/924558
## Sparsity           : 92%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## 词项从原来的4万多减少到8000多

## LDA主题模型
lda <- LDA(Red_dtm2,k = 2,method= "Gibbs",
                control = list(seed = 1234))

lda
## A LDA_Gibbs topic model with 2 topics.
## 查看每个主题都有哪些词,每个主题的前10个词
terms(lda,10)
##       Topic 1  Topic 2 
##  [1,] "宝玉"   "宝玉"  
##  [2,] "贾政"   "笑道"  
##  [3,] "王夫人" "贾母"  
##  [4,] "凤姐"   "一个"  
##  [5,] "贾琏"   "众人"  
##  [6,] "听见"   "袭人"  
##  [7,] "老太太" "两个"  
##  [8,] "贾母"   "宝钗"  
##  [9,] "太太"   "黛玉"  
## [10,] "一个"   "凤姐儿"
## 每个文档最有可能的主题
d_topic <- topics(lda,1)
table(d_topic)
## d_topic
##  1  2 
## 44 76
## 计算每个词项属于每个主题的可能性,词项-主题矩阵
term_topics <- tidy(lda, matrix = "beta")
## 数据有三列,分别为主题,词项和对应的取值大小
head(term_topics,20)
## # A tibble: 20 x 3
##    topic term         beta
##    <int> <chr>       <dbl>
##  1     1 一一   0.000632  
##  2     2 一一   0.000447  
##  3     1 一个   0.00836   
##  4     2 一个   0.0106    
##  5     1 一事   0.00000137
##  6     2 一事   0.000459  
##  7     1 一二   0.0000288 
##  8     2 一二   0.000192  
##  9     1 一二年 0.00000137
## 10     2 一二年 0.000116  
## 11     1 一件   0.00143   
## 12     2 一件   0.00144   
## 13     1 一半   0.000248  
## 14     2 一半   0.000485  
## 15     1 一去   0.000179  
## 16     2 一去   0.00000127
## 17     1 一口   0.000316  
## 18     2 一口   0.000930  
## 19     1 一味   0.000138  
## 20     2 一味   0.000281
## 将3个变量的数组转化词项-主题矩阵
term_topicsm <- spread(term_topics,key = "topic",value = "beta")
colna <- colnames(term_topicsm)
colnames(term_topicsm) <- c(colna[1],paste("topic",
                                         colna[-1],sep = ""))

## 可视化每个主题的常用词
topicplotdata <- term_topics %>% group_by(topic) %>%
  ## 前30个词
  top_n(30,beta)%>%ungroup()%>%
  arrange(topic,desc(beta))%>%
  mutate(term = reorder(term, beta))
head(topicplotdata)
## # A tibble: 6 x 3
##   topic term      beta
##   <int> <fct>    <dbl>
## 1     1 宝玉   0.0200 
## 2     1 贾政   0.0130 
## 3     1 王夫人 0.0117 
## 4     1 凤姐   0.0115 
## 5     1 贾琏   0.0106 
## 6     1 听见   0.00951
## 直方图
ggplot(topicplotdata,aes(x = term,y = beta))+
  geom_bar(aes(fill = factor(topic)),stat = "identity",
           show.legend = FALSE)+
  facet_wrap(~topic,scales = "free",ncol = 2)+
  coord_flip()+
  theme(axis.text.x = element_text(size = 6))+
  labs(x = "",y = "")

## 文档主题矩阵,即:评论-主题矩阵
doc_topics <- tidy(lda, matrix = "gamma")
doc_topics$document <- as.integer(doc_topics$document)
doc_topics$topic <- factor(doc_topics$topic)
## 数据有三列,分别为文档,词项和对应的取值大小
head(doc_topics,20)
## # A tibble: 20 x 3
##    document topic gamma
##       <int> <fct> <dbl>
##  1        1 1     0.414
##  2        2 1     0.438
##  3        3 1     0.273
##  4        4 1     0.492
##  5        5 1     0.26 
##  6        6 1     0.352
##  7        7 1     0.300
##  8        8 1     0.288
##  9        9 1     0.310
## 10       10 1     0.516
## 11       11 1     0.481
## 12       12 1     0.391
## 13       13 1     0.320
## 14       14 1     0.349
## 15       15 1     0.367
## 16       16 1     0.445
## 17       17 1     0.261
## 18       18 1     0.216
## 19       19 1     0.295
## 20       20 1     0.261
## 直方图可视化属于相应主题的可能性
ggplot(doc_topics,aes(x = document,y = gamma,fill = topic))+
  geom_bar(stat = "identity")+
  geom_hline(yintercept = 0.5)+
  geom_vline(xintercept = 80)+
  labs(x = "章节",y = "可能性",
       title = "《红楼梦》每章表示的主题")+
  theme(legend.position = "top",
        plot.title = element_text(hjust = 0.5))+
  scale_x_continuous(seq(0,120,by = 10),seq(0,120,by = 10))

10.3:文本情感分析

情感分析(Sentiment analysis),又称倾向性分析,意见抽取(Opinion extraction),意见挖掘(Opinion mining),情感挖掘(Sentiment mining),主观分析(Subjectivity analysis)等,它是对带有情感色彩的主观性文本进行分析、处理、归纳和推理的过程,如从评论文本中分析用户对“数码相机”的“变焦、价格、大小、重量、闪光、易用性”等属性的情感倾向。

使用n-gramt特征的TF-IDF矩阵和glmnet包lasso分类模型进行情感分类。

library(text2vec)
## 
## Attaching package: 'text2vec'
## The following object is masked from 'package:topicmodels':
## 
##     perplexity
library(dplyr)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 3.0-2
library(Metrics)
##  使用无监督的方式来识别评论的情感
polarity <- read.csv("data/chap10/review_polarity.csv",
                     stringsAsFactors = FALSE)
polarity$label2 <- as.factor(ifelse(polarity$label == "neg", 0,1))
## 切分训练集和测试集,70%训练,30%测试
set.seed(123)
index <- sample(nrow(polarity),round(nrow(polarity)*0.7))
pol_train <- polarity[index,]
pol_test <- polarity[-index,]
## 对训练集和测试集进行处理,获取分类所需的N-grams分类特征
## 数据集迭代器
pol_train_it <- pol_train$text%>%word_tokenizer()%>%itoken()
## 获取N-grams得到的词,并对词库进行修剪
vocab = create_vocabulary(pol_train_it, ngram = c(1L, 2L))%>%
  prune_vocabulary(term_count_min=15,doc_proportion_max=0.5)
## 得到每个词项的哈希矢量化
vocab_vect <- vocab_vectorizer(vocab)
## 计算TF——IDF矩阵
pol_train_tfidf <- create_dtm(pol_train_it,vocab_vect)%>%
  fit_transform(TfIdf$new())
dim(pol_train_tfidf)
## [1] 1400 6100
## 获取测试集特征
pol_test_it <- pol_test$text%>%word_tokenizer()%>%itoken()
pol_test_tfidf <- create_dtm(pol_test_it,vocab_vect)%>%
  fit_transform(TfIdf$new())
dim(pol_test_tfidf)
## [1]  600 6100
## 使用Lasso分类器进行二分类模型
set.seed(123)
lasso_cv = cv.glmnet(x = pol_train_tfidf, y = pol_train$label2,alpha = 1,
                     family = 'binomial',type.measure = "auc",
                     nfolds = 3,maxit = 1e3)

## 可视化不同参数下训练集的精度
plot(lasso_cv)

## 使用最优的lambda参数训练分类器
lasso_cv$lambda.min
## [1] 0.009759684
best_lam <- lasso_cv$lambda.min
lasso_cla <- glmnet(x = pol_train_tfidf, y = pol_train$label2,
                    family = 'binomial',alpha = 1,
                    lambda = best_lam,maxit = 1e3)
summary(lasso_cla)
##            Length Class     Mode     
## a0            1   -none-    numeric  
## beta       6100   dgCMatrix S4       
## df            1   -none-    numeric  
## dim           2   -none-    numeric  
## lambda        1   -none-    numeric  
## dev.ratio     1   -none-    numeric  
## nulldev       1   -none-    numeric  
## npasses       1   -none-    numeric  
## jerr          1   -none-    numeric  
## offset        1   -none-    logical  
## classnames    2   -none-    character
## call          7   -none-    call     
## nobs          1   -none-    numeric
lasso_cla
## 
## Call:  glmnet(x = pol_train_tfidf, y = pol_train$label2, family = "binomial",      alpha = 1, lambda = best_lam, maxit = 1000) 
## 
##    Df   %Dev  Lambda
## 1 532 0.7526 0.00976
## 输出在测试集上的识别精度
test_pre <- predict(lasso_cla,pol_test_tfidf)
test_pre <- ifelse(test_pre > 0.5,1,0)
sprintf("在测试集上的情感识别精度为: %f",accuracy(pol_test$label2,test_pre))
## [1] "在测试集上的情感识别精度为: 0.851667"
table(pol_test$label2,test_pre)
##    test_pre
##       0   1
##   0 274  28
##   1  61 237

这里主要使用无监督的情感分析方法

评论数据的情感分类

library(gmodels)
library(SentimentAnalysis)
## 
## Attaching package: 'SentimentAnalysis'
## The following object is masked from 'package:base':
## 
##     write
library(tm)
##  使用无监督的方式来识别评论的情感
polarity <- read.csv("data/chap10/review_polarity.csv",
                     stringsAsFactors = FALSE)
polarity$label2 <- ifelse(polarity$label == "neg", "negative","positive")
pol_cp <- Corpus(VectorSource(polarity$text))
pol_dtm <- DocumentTermMatrix(pol_cp)
result <- analyzeSentiment(pol_dtm,removeStopwords = FALSE)
head(result)
##   WordCount  SentimentGI NegativityGI PositivityGI  SentimentHE NegativityHE
## 1       402 -0.014925373   0.07711443   0.06218905  0.017412935  0.000000000
## 2       355 -0.002816901   0.08450704   0.08169014  0.011267606  0.005633803
## 3       210  0.071428571   0.06190476   0.13333333  0.014285714  0.004761905
## 4       558 -0.012544803   0.07526882   0.06272401  0.003584229  0.001792115
## 5       366 -0.008196721   0.07377049   0.06557377 -0.002732240  0.005464481
## 6       488  0.045081967   0.03483607   0.07991803  0.016393443  0.006147541
##   PositivityHE  SentimentLM NegativityLM PositivityLM RatioUncertaintyLM
## 1  0.017412935 -0.002487562   0.02238806  0.019900498        0.007462687
## 2  0.016901408 -0.011267606   0.01971831  0.008450704        0.000000000
## 3  0.019047619  0.000000000   0.02380952  0.023809524        0.004761905
## 4  0.005376344  0.001792115   0.01254480  0.014336918        0.005376344
## 5  0.002732240 -0.008196721   0.01639344  0.008196721        0.005464481
## 6  0.022540984  0.008196721   0.01024590  0.018442623        0.002049180
##   SentimentQDAP NegativityQDAP PositivityQDAP
## 1  -0.017412935     0.08208955     0.06467662
## 2   0.028169014     0.03943662     0.06760563
## 3   0.066666667     0.03809524     0.10476190
## 4   0.007168459     0.05376344     0.06093190
## 5   0.010928962     0.05464481     0.06557377
## 6   0.040983607     0.02049180     0.06147541
## 可视化情感得分
plotSentiment(result$SentimentLM)

## LM: Loughran-McDonald Financial dictionary
## GI: Harvard-IV dictionary as used in the General Inquirer software
## HE: Henry’s Financial dictionary
## QDAP:qdap package dictionary

## 将情感得分转化为两种情感
result01 <- convertToBinaryResponse(result)
head(result01)
##   WordCount SentimentGI NegativityGI PositivityGI SentimentHE NegativityHE
## 1       402    negative   0.07711443   0.06218905    positive  0.000000000
## 2       355    negative   0.08450704   0.08169014    positive  0.005633803
## 3       210    positive   0.06190476   0.13333333    positive  0.004761905
## 4       558    negative   0.07526882   0.06272401    positive  0.001792115
## 5       366    negative   0.07377049   0.06557377    negative  0.005464481
## 6       488    positive   0.03483607   0.07991803    positive  0.006147541
##   PositivityHE SentimentLM NegativityLM PositivityLM RatioUncertaintyLM
## 1  0.017412935    negative   0.02238806  0.019900498        0.007462687
## 2  0.016901408    negative   0.01971831  0.008450704        0.000000000
## 3  0.019047619    positive   0.02380952  0.023809524        0.004761905
## 4  0.005376344    positive   0.01254480  0.014336918        0.005376344
## 5  0.002732240    negative   0.01639344  0.008196721        0.005464481
## 6  0.022540984    positive   0.01024590  0.018442623        0.002049180
##   SentimentQDAP NegativityQDAP PositivityQDAP
## 1      negative     0.08208955     0.06467662
## 2      positive     0.03943662     0.06760563
## 3      positive     0.03809524     0.10476190
## 4      positive     0.05376344     0.06093190
## 5      positive     0.05464481     0.06557377
## 6      positive     0.02049180     0.06147541
summary(result01[,c("SentimentGI","SentimentHE","SentimentLM","SentimentQDAP")])
##    SentimentGI     SentimentHE     SentimentLM    SentimentQDAP 
##  negative: 341   negative: 241   negative:1261   negative: 277  
##  positive:1659   positive:1759   positive: 739   positive:1723
## 检查不同词典和实际的情感是否识别一致
table(result01$SentimentGI,polarity$label2)
##           
##            negative positive
##   negative      233      108
##   positive      767      892
CrossTable(result01$SentimentGI,polarity$label2,prop.r=F, prop.c=T,
           prop.t=F, prop.chisq=F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                      | polarity$label2 
## result01$SentimentGI |  negative |  positive | Row Total | 
## ---------------------|-----------|-----------|-----------|
##             negative |       233 |       108 |       341 | 
##                      |     0.233 |     0.108 |           | 
## ---------------------|-----------|-----------|-----------|
##             positive |       767 |       892 |      1659 | 
##                      |     0.767 |     0.892 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |      1000 |      1000 |      2000 | 
##                      |     0.500 |     0.500 |           | 
## ---------------------|-----------|-----------|-----------|
## 
## 
CrossTable(result01$SentimentHE,polarity$label2,prop.r=F, prop.c=T,
           prop.t=F, prop.chisq=F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                      | polarity$label2 
## result01$SentimentHE |  negative |  positive | Row Total | 
## ---------------------|-----------|-----------|-----------|
##             negative |       166 |        75 |       241 | 
##                      |     0.166 |     0.075 |           | 
## ---------------------|-----------|-----------|-----------|
##             positive |       834 |       925 |      1759 | 
##                      |     0.834 |     0.925 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |      1000 |      1000 |      2000 | 
##                      |     0.500 |     0.500 |           | 
## ---------------------|-----------|-----------|-----------|
## 
## 
CrossTable(result01$SentimentLM,polarity$label2,prop.r=F, prop.c=T,
           prop.t=F, prop.chisq=F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                      | polarity$label2 
## result01$SentimentLM |  negative |  positive | Row Total | 
## ---------------------|-----------|-----------|-----------|
##             negative |       721 |       540 |      1261 | 
##                      |     0.721 |     0.540 |           | 
## ---------------------|-----------|-----------|-----------|
##             positive |       279 |       460 |       739 | 
##                      |     0.279 |     0.460 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |      1000 |      1000 |      2000 | 
##                      |     0.500 |     0.500 |           | 
## ---------------------|-----------|-----------|-----------|
## 
## 
CrossTable(result01$SentimentQDAP,polarity$label2,prop.r=F, prop.c=T,
           prop.t=F, prop.chisq=F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                        | polarity$label2 
## result01$SentimentQDAP |  negative |  positive | Row Total | 
## -----------------------|-----------|-----------|-----------|
##               negative |       192 |        85 |       277 | 
##                        |     0.192 |     0.085 |           | 
## -----------------------|-----------|-----------|-----------|
##               positive |       808 |       915 |      1723 | 
##                        |     0.808 |     0.915 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |      1000 |      1000 |      2000 | 
##                        |     0.500 |     0.500 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
## 
table(result01$SentimentGI,polarity$label2)
##           
##            negative positive
##   negative      233      108
##   positive      767      892
table(result01$SentimentHE,polarity$label2)
##           
##            negative positive
##   negative      166       75
##   positive      834      925
table(result01$SentimentLM,polarity$label2)
##           
##            negative positive
##   negative      721      540
##   positive      279      460
table(result01$SentimentQDAP,polarity$label2)
##           
##            negative positive
##   negative      192       85
##   positive      808      915

红楼梦数据的无监督情感走向

library(jiebaR)
library(ggplot2)
library(dplyr)
library(tidyr)
##  读取数据
load("data/chap10/Readdream_content.RData")
## 读取情感辞典
ntusd_neg <- readLines("data/chap10/ntusd-negative.txt",warn = F)
ntusd_pos <- readLines("data/chap10/ntusd-positive.txt",warn = F)
length(ntusd_neg)
## [1] 8276
length(ntusd_pos)
## [1] 2810
## 将两个情感词典也作为分词的词典进行分词
Red_fen <- jiebaR::worker(type = "mix",
                          user = "data/chap10/Red_dream/红楼梦词典.txt")

new_user_word(Red_fen,ntusd_neg)
## [1] TRUE
new_user_word(Red_fen,ntusd_pos)
## [1] TRUE
Fen_red <- apply_list(as.list(content),Red_fen)
## 去除停用词
mystopwords <- readLines("data/chap10/Red_dream/红楼梦停用词.txt")
Fen_red <- lapply(Fen_red, filter_segment,filter_words=mystopwords)
length(Fen_red)
## [1] 120
lapply(Fen_red[1:5],head)
## [[1]]
## [1] "开卷"   "第一回" "作者"   "自云"   "一番"   "梦幻"  
## 
## [[2]]
## [1] "诗云"   "一局"   "输赢"   "料不真" "香销"   "尽尚"  
## 
## [[3]]
## [1] "却说" "雨村" "回头" "看时" "乃是" "当日"
## 
## [[4]]
## [1] "却说"   "黛玉"   "姊妹"   "王夫人" "王夫人" "兄嫂"  
## 
## [[5]]
## [1] "第四回" "薛家"   "母子"   "荣府"   "寄居"   "事略"
## 简单的方式,就是计算分词后的文本中,有多少negative和positive的词语
neg_score <- unlist(as.vector(lapply(Fen_red, function(x){sum(x %in% ntusd_neg)})))
pos_score <- unlist(as.vector(lapply(Fen_red, function(x){sum(x %in% ntusd_pos)})))
red_pol <- data.frame(neg_score=-neg_score,pos_score=pos_score,
                      chap = 1:120)
head(red_pol)
##   neg_score pos_score chap
## 1       -87        46    1
## 2       -91        34    2
## 3       -93        55    3
## 4       -93        29    4
## 5       -68        34    5
## 6       -70        31    6
## 可视化每个章节积极词汇和消极词汇使用的数量
red_pol%>%gather(key = "pol",value = "score",-chap)%>%
  ggplot(aes(x =chap,fill =pol))+
  geom_bar(aes(y = score),stat = "identity")+
  theme(legend.position = "top")+
  labs(x = "章节",y = "得分",title = "《红楼梦》各章情感")+
  theme(plot.title = element_text(hjust = 0.5))

10.4:网络数据可视化及描述

## 社交网络数据可视化
library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:text2vec':
## 
##     normalize
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(igraphdata)
library(ggplot2)
## 读取并查看数据
data("karate")
karate
## IGRAPH 4b458a1 UNW- 34 78 -- Zachary's karate club network
## + attr: name (g/c), Citation (g/c), Author (g/c), Faction (v/n), name
## | (v/c), label (v/c), color (v/n), weight (e/n)
## + edges from 4b458a1 (vertex names):
##  [1] Mr Hi  --Actor 2  Mr Hi  --Actor 3  Mr Hi  --Actor 4  Mr Hi  --Actor 5 
##  [5] Mr Hi  --Actor 6  Mr Hi  --Actor 7  Mr Hi  --Actor 8  Mr Hi  --Actor 9 
##  [9] Mr Hi  --Actor 11 Mr Hi  --Actor 12 Mr Hi  --Actor 13 Mr Hi  --Actor 14
## [13] Mr Hi  --Actor 18 Mr Hi  --Actor 20 Mr Hi  --Actor 22 Mr Hi  --Actor 32
## [17] Actor 2--Actor 3  Actor 2--Actor 4  Actor 2--Actor 8  Actor 2--Actor 14
## [21] Actor 2--Actor 18 Actor 2--Actor 20 Actor 2--Actor 22 Actor 2--Actor 31
## [25] Actor 3--Actor 4  Actor 3--Actor 8  Actor 3--Actor 9  Actor 3--Actor 10
## + ... omitted several edges
## 因为网络图形式的数据并不容易理解,转化为数据表格的形式
## 将网络图数据转化为数据表
karate_df <- as_data_frame(karate,what = "both")
head(karate_df$vertices)
##         Faction    name label color
## Mr Hi         1   Mr Hi     H     1
## Actor 2       1 Actor 2     2     1
## Actor 3       1 Actor 3     3     1
## Actor 4       1 Actor 4     4     1
## Actor 5       1 Actor 5     5     1
## Actor 6       1 Actor 6     6     1
head(karate_df$edges)
##    from      to weight
## 1 Mr Hi Actor 2      4
## 2 Mr Hi Actor 3      5
## 3 Mr Hi Actor 4      3
## 4 Mr Hi Actor 5      3
## 5 Mr Hi Actor 6      3
## 6 Mr Hi Actor 7      3
## 弹簧模型图网络图可视化
set.seed(12)
plot(karate,layout = layout.fruchterman.reingold)

karate2 <- karate
## 可视化调整后的图像
# 生成节点的颜色和形状
colrs <- c("pink","tomato", "gold")
V(karate2)$color <- colrs[V(karate2)$color]
## 单独设置主管和教练节点的颜色
V(karate2)[V(karate2)$label %in% c("H","A")]$color <- "lightblue"
## 调整节点的形状
V(karate2)$shape <- "circle"
V(karate2)[V(karate2)$label %in% c("H","A")]$shape <- "rectangle"
## 设置节点的大小
V(karate2)$size <- 12+degree(karate2)
## 设置边的粗细
E(karate2)$width <- E(karate2)$weight
## 可视化调整后的网络图
plot(karate2,layout = layout.fruchterman.reingold)

图的相关描述

## 图的描述
## 查看每个节点的度
degree(karate)
##    Mr Hi  Actor 2  Actor 3  Actor 4  Actor 5  Actor 6  Actor 7  Actor 8 
##       16        9       10        6        3        4        4        4 
##  Actor 9 Actor 10 Actor 11 Actor 12 Actor 13 Actor 14 Actor 15 Actor 16 
##        5        2        3        1        2        5        2        2 
## Actor 17 Actor 18 Actor 19 Actor 20 Actor 21 Actor 22 Actor 23 Actor 24 
##        2        2        2        3        2        2        2        5 
## Actor 25 Actor 26 Actor 27 Actor 28 Actor 29 Actor 30 Actor 31 Actor 32 
##        3        3        2        4        3        4        4        6 
## Actor 33   John A 
##       12       17
## 判断图是否为简单图
is.simple(karate)
## [1] TRUE
## 查看图中某个节点的邻居节点
neighbors(karate,"Mr Hi")
## + 16/34 vertices, named, from 4b458a1:
##  [1] Actor 2  Actor 3  Actor 4  Actor 5  Actor 6  Actor 7  Actor 8  Actor 9 
##  [9] Actor 11 Actor 12 Actor 13 Actor 14 Actor 18 Actor 20 Actor 22 Actor 32
## 判断图是否为连通图
is.connected(karate)
## [1] TRUE
## 计算图的直径
diameter(karate,directed = F)
## [1] 13
## 判断图是否为有向无环图
is.dag(karate)
## [1] FALSE
## 计算边的数目
ecount(karate)
## [1] 78
## 计算节点的数目
vcount(karate)
## [1] 34
## 计算节点邻居平均度
graph.knn(karate,c("Mr Hi","John A"))$knn
##  Mr Hi John A 
## 5.0000 4.0625
## 节点的接近中心性:如果一个节点和很多其它节点接近,那么该节点处于网络中心位置
clos <- closeness(karate)
data.frame(vex = names(clos),weight = clos)%>%
  ggplot(aes(x=reorder(vex,weight),y = weight))+
  geom_bar(stat = "identity",fill = "tomato")+coord_flip()+
  labs(x="节点",y = "节点中心性")

## 计算极大团的尺寸和数量
table(unlist(lapply(max_cliques(karate),length)))
## 
##  2  3  4  5 
## 11 21  2  2
## 图的密度:实际出现的边与可能的边的比值
graph.density(karate)
## [1] 0.1390374

10.5:网络图的分割

## 使用聚类分割关系网络图
kclu1 <- cluster_fast_greedy(karate2)
sizes(kclu1)
## Community sizes
##  1  2  3 
## 18 11  5
## 可视化聚类结果
set.seed(12)
plot(kclu1,karate2)

## 可视化系统聚类树
par(cex=0.8)
dendPlot(kclu1,mode = "hclust")

## 模块化的多级优化算法对网络结构进行聚类
kclu2 <- cluster_louvain(karate2)
sizes(kclu2)
## Community sizes
##  1  2  3  4 
##  5 11  6 12
plot(kclu2,karate2)