MSBA7012

极客清韵

MSBA7012 Individual Assignment
Deadline: Sunday, February 28, 2021 11:59pm
Datasets:
• FBPosts.csv contains all posts submitted by the official Facebook page accounts for 182
movies released in the United States in 2012. The content of each post is stored in the
“message_and_description” column.
• Bing Liu’s Opinion Lexicon: negative-words.txt and positive-words.txt
Questions:

  1. Zipf’s law states that the frequency of a word appearing in a large text corpus is inversely
    proportional to its rank. Make a plot in Python to illustrate Zipf’s law using the words in
    all Facebook posts in the FBPosts.csv file. The x-axis of the plot is the rank of a word and the
    y-axis is the frequency of a word. Word frequency is defined as the number of times a word
    appears in all posts. The number of distinct words may be large; you can consider the top 1,000
    words only. Based on the plot you create, discuss whether Zipf’s law is supported by this
    dataset and explain why it is supported or not. Limit your answer to 100 words. (5 marks)
  2. In Python, visualize the top 15 words with the highest tf-idf score for each of the following 4
    movies: Avengers (imdb_id= tt0848228), The Dark Knight Rises (tt1345836), The Hunger
    Games (tt1392170), and The Twilight Saga (tt1673434). Briefly summarize the insights you
    gain from this analysis. Limit your answer to 100 words. (5 marks)
  3. In Python, visualize the top 15 bigrams with the highest tf-idf score for each of the same 4
    movies in Question #2. Compare the results you obtain for Questions #2 and #3 and comment
    on what additional insights you have gained from analyzing the bigrams in addition to the
    unigrams. Limit your answer to 100 words. (5 marks)
  4. Identify the top 20 most common positive and negative words based on Bing Liu’s opinion
    lexicon in all page posts and visualize the word frequencies in a bar chart (one for top 20
    positive words and one for top 20 negative words). (5 marks)
  5. Does the sentiment of Facebook page posts help predict the opening box office revenue?
    Interpret the economic significance of your result and explain why the sentiment of Facebook
    page posts helps or does not help predict the opening box office revenue. You may define
    sentiment in the following three ways: (1) fraction of positive words, (2) fraction of negative
    words, and (3) fraction of positive words - fraction of negative words. Feel free to use any
    analytics techniques (e.g., visualization, regression, machine learning, etc.) to provide an
    answer to this question. Since it is a prediction problem, you should only utilize the posts
    created before each movie’s release date. Limit your answer to one A4 page, including any
    text summary, figures, or tables. (10 marks)
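
The computations behind Questions 1 to 3 can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not a model answer. The helper names (`tokenize`, `ngrams`, `rank_frequency`, `zipf_slope`, `top_tfidf`) are hypothetical, and the tokenizer is a simple regular expression. The tf-idf weighting is the textbook tf × log(N/df), which the assignment does not prescribe and which differs from the smoothed variants used by some libraries. Treating each movie as one document made of all its posts is also an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank_frequency(posts, top=1000):
    """Return (rank, word, count) rows for the `top` most frequent words."""
    counts = Counter(tok for post in posts for tok in tokenize(post))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(top), start=1)]


def zipf_slope(rows):
    """Least-squares slope of log(count) against log(rank).

    If frequency is inversely proportional to rank, the slope is close to -1.
    Needs at least two rows.
    """
    xs = [math.log(rank) for rank, _, _ in rows]
    ys = [math.log(count) for _, _, count in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def top_tfidf(docs, n=1, k=15):
    """Top-k n-grams per document, scored as tf * log(N / df).

    `docs` maps a document id (assumed here: one imdb_id per movie) to the
    concatenated text of its posts. Ties are broken alphabetically.
    """
    term_counts = {doc_id: Counter(ngrams(tokenize(text), n))
                   for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(docs)
    result = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        scores = {term: (c / total) * math.log(n_docs / df[term])
                  for term, c in counts.items()}
        result[doc_id] = sorted(scores.items(),
                                key=lambda kv: (-kv[1], kv[0]))[:k]
    return result
```

Plotting the `(rank, count)` rows from `rank_frequency` on log-log axes gives the figure for Question 1; a roughly straight line with slope near -1 would be consistent with Zipf’s law. Calling `top_tfidf` with `n=1` and then `n=2` gives the unigram and bigram rankings for Questions 2 and 3. Note that with this weighting a term that appears in every document scores zero.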
Deliverables:
• A Word document (.docx) containing all the answers, including plots or figures for the first 4
questions and a one-page write-up for your answer to the last question.
• Source code of your programs for all questions in one file (either .py or .ipynb). Add comments
to your code to improve readability. Make sure the grader can easily identify the source code
for each of the five questions.
• A readme.txt file describing the package/environment requirements to run your programs.
• Compress the above three files into a zip file named with your student ID, e.g., 123456.zip.
• You should not make any modifications to the three input files: FBPosts.csv, negative-words.txt,
and positive-words.txt. They are the raw data input to your programs. Also, DO
NOT include these three files in your zip file.
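
Because the lexicon files must be read as-is, any header lines have to be skipped in code rather than edited out. The sketch below shows one way to do that, along with the word counts for Question 4 and the three sentiment definitions from Question 5. It rests on assumptions: the function names are hypothetical, lines beginning with `;` are taken to be header comments, and the tokenizer is a simple regular expression that will miss lexicon entries containing other symbols.

```python
import re
from collections import Counter


def load_lexicon(lines):
    """Parse opinion-lexicon lines into a set of lowercase words.

    Assumption: blank lines and lines starting with ';' are header
    comments and carry no lexicon entries.
    """
    words = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith(";"):
            words.add(line.lower())
    return words


def tokenize(text):
    """Lowercase text and keep words, allowing inner hyphens or apostrophes."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())


def lexicon_counts(posts, lexicon):
    """Count how often each lexicon word occurs across all posts."""
    return Counter(tok for post in posts
                   for tok in tokenize(post) if tok in lexicon)


def sentiment_scores(text, positive, negative):
    """Return the three sentiment definitions from Question 5 for one text."""
    tokens = tokenize(text)
    if not tokens:
        return {"pos_frac": 0.0, "neg_frac": 0.0, "net": 0.0}
    pos = sum(tok in positive for tok in tokens) / len(tokens)
    neg = sum(tok in negative for tok in tokens) / len(tokens)
    return {"pos_frac": pos, "neg_frac": neg, "net": pos - neg}


# Hypothetical usage with the unmodified files; the latin-1 encoding is an
# assumption about how the lexicon files are stored:
#
# with open("positive-words.txt", encoding="latin-1") as f:
#     positive = load_lexicon(f)
```

`lexicon_counts(posts, positive).most_common(20)` then yields the data for the positive-word bar chart in Question 4, and likewise for the negative list. For Question 5, `sentiment_scores` would be applied only to the text of posts created before each movie’s release date.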