This repository contains the coursework for Web Science (H).
The objective of this coursework is to develop a Twitter crawler that collects data in English and to conduct social media analytics. Python is recommended, with MongoDB for data storage.
The code and report need to be submitted on or before the specified deadline, along with a sample of the collected data set.
The coursework is marked out of 100 and carries 20% of the final mark. As is usual practice across the School, numerical marks will be appropriately converted into bands.
Tweets posted in the United Kingdom are of main interest and should be collected for one hour on any day. In addition, sample multimedia content for tweets with media objects should be downloaded.
Develop a crawler to access as much Twitter data as possible and group the tweets based on similarity. Use important activity-specific data to crawl additional data. During this process, the Twitter data access APIs and their access constraints will be identified.
Use the Twitter Streaming API for collecting data, applying a United Kingdom geographical filter along with selected words. Count the amount of data collected, considering all collected data, including retweets and quotes. [10 marks]
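A minimal sketch of how the retweet and quote counts could be produced, assuming collected tweets are stored as Twitter v1.1-style JSON dicts (where a retweet carries a `retweeted_status` object and a quote sets `is_quote_status`); adjust the field names if your stored schema differs:

```python
def count_tweet_types(tweets):
    """Count totals, retweets and quotes in a list of v1.1-style tweet dicts."""
    return {
        "total": len(tweets),
        "retweets": sum(1 for t in tweets if "retweeted_status" in t),
        "quotes": sum(1 for t in tweets if t.get("is_quote_status")),
    }

# Illustrative sample data (not real tweets)
sample = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "RT ...", "retweeted_status": {"id": 1}},
    {"id": 3, "text": "look at this", "is_quote_status": True},
]
print(count_tweet_types(sample))
```

The same counters can be run as a MongoDB aggregation instead, but a pure function like this is easy to unit-test against stored documents.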
Group the tweets based on similarity: count the number of groups; count the elements in each group; identify prominent groups; prioritise terms within each group; identify entities in each group. Use this information to develop a REST-API-based crawler for activity-specific data. [20 marks]
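One simple way the grouping step could be sketched: a greedy pass where each tweet joins the first existing group whose token set is similar enough (Jaccard similarity), otherwise it starts a new group. The tokeniser, the 0.5 threshold, and the `top_terms` helper are all assumptions for illustration, not a prescribed method:

```python
from collections import Counter

def tokens(text):
    """Very rough tokeniser: lowercase words longer than two characters."""
    return {w.lower().strip(".,!?#@") for w in text.split() if len(w) > 2}

def group_tweets(texts, threshold=0.5):
    groups = []  # each group: {"rep": representative token set, "members": [texts]}
    for text in texts:
        toks = tokens(text)
        for g in groups:
            union = len(toks | g["rep"]) or 1
            if len(toks & g["rep"]) / union >= threshold:
                g["members"].append(text)
                g["rep"] |= toks  # grow the representative token set
                break
        else:
            groups.append({"rep": set(toks), "members": [text]})
    return groups

def top_terms(group, n=3):
    """Prioritise terms in a group by frequency across its members."""
    counts = Counter(w for m in group["members"] for w in tokens(m))
    return [w for w, _ in counts.most_common(n)]

sample = ["Rain in Glasgow today", "Heavy rain in Glasgow today",
          "Football match tonight"]
groups = group_tweets(sample)
```

Group count, min/max/average size, and prominent groups then fall out of `len(groups)` and `len(g["members"])`; the prioritised terms feed the REST crawler in the next part.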
Enhance the crawling using the hybrid architecture of Twitter Streaming & REST APIs. [20 marks]
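The hybrid step could be organised as a scheduler that queues prominent terms harvested from the streamed groups and drains them through the REST search endpoint in rate-limited batches. The 180-requests-per-15-minute figure is an assumption (it matches the standard search limit for user auth, but verify against current documentation), and `search_fn` is a placeholder for your actual REST call (e.g. a tweepy wrapper), injected here so the scheduling logic stays testable offline:

```python
import time
from collections import deque

def run_hybrid(terms, search_fn, limit_per_window=180, window_secs=900,
               sleep=time.sleep):
    """Drain `terms` through `search_fn`, pausing between rate-limit windows."""
    queue = deque(terms)
    results = []
    while queue:
        for _ in range(min(limit_per_window, len(queue))):
            results.extend(search_fn(queue.popleft()))
        if queue:  # window exhausted; wait for the next rate-limit window
            sleep(window_secs)
    return results

# Offline demo with a stub search function and recorded sleeps
def _demo_search(term):
    return [{"term": term}]

demo_sleeps = []
demo_results = run_hybrid([f"term{i}" for i in range(200)], _demo_search,
                          limit_per_window=180, sleep=demo_sleeps.append)
```

Injecting `sleep` keeps the scheduler deterministic in tests while defaulting to a real pause in production.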
Analyse geo-tagged data for the UK for the period. Count the amount of geo-tagged data collected from the UK. Measure whether there is any overlap between the REST and Streaming APIs. [10 marks]
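A sketch of the geo analysis, assuming v1.1 payloads where a geo-tagged tweet carries either exact `coordinates` or a `place` object, and measuring REST/Streaming overlap by intersecting tweet IDs:

```python
def geotagged(tweets):
    """Tweets carrying either exact coordinates or a place object."""
    return [t for t in tweets if t.get("coordinates") or t.get("place")]

def overlap(stream_tweets, rest_tweets):
    """Return (shared tweet count, total distinct tweet count)."""
    stream_ids = {t["id"] for t in stream_tweets}
    rest_ids = {t["id"] for t in rest_tweets}
    return len(stream_ids & rest_ids), len(stream_ids | rest_ids)

# Illustrative sample data (not real tweets)
stream_sample = [
    {"id": 1, "coordinates": {"type": "Point", "coordinates": [-4.25, 55.86]}},
    {"id": 2, "coordinates": None, "place": None},
    {"id": 3, "place": {"country_code": "GB"}},
]
rest_sample = [{"id": 3, "place": {"country_code": "GB"}}, {"id": 4}]
```

Filtering on `place.country_code == "GB"` would additionally restrict the count to UK-tagged tweets.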
Download multimedia content, including videos and pictures, for tweets with media objects. Provide a basic analysis of the collected data. [10 marks]
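Extracting the media URLs could look like the sketch below, assuming v1.1 payloads where photos and videos sit under `extended_entities.media` and video variants under `video_info`; the URLs in the sample are illustrative, and the actual download (e.g. with `requests`) is a separate step:

```python
def media_urls(tweet):
    """Collect downloadable media URLs from one v1.1-style tweet dict."""
    urls = []
    for m in tweet.get("extended_entities", {}).get("media", []):
        if m.get("type") == "photo":
            urls.append(m.get("media_url_https"))
        else:  # video / animated_gif: URLs live in the variant list
            variants = m.get("video_info", {}).get("variants", [])
            urls.extend(v["url"] for v in variants if "url" in v)
    return urls

# Illustrative sample (field layout per v1.1; URLs are placeholders)
sample_tweet = {
    "extended_entities": {"media": [
        {"type": "photo",
         "media_url_https": "https://pbs.twimg.com/media/example.jpg"},
        {"type": "video", "video_info": {"variants": [
            {"bitrate": 832000, "url": "https://video.twimg.com/example.mp4"},
        ]}},
    ]}
}
```

Counting photo vs. video URLs per group gives a starting point for the basic analysis asked for above.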
Discuss the data access strategies implemented. Clearly specify the Twitter-API-specific restrictions encountered and how they were addressed in order to collect as much Twitter data as possible.
The report should be written in an 11pt font with a maximum length of 10 pages. It should be organised as follows:
Section 1: Introduction
Section 2: Data Crawl
| Total | Streaming API | Retweets | Quotes | Images | Verified Count | Geotagged Data Count | Location Count |
|---|---|---|---|---|---|---|---|
| Total | Groups | Min Size | Max Size | Avg Size | … |
|---|---|---|---|---|---|
| Total | Streaming API | REST API | Redundant | Quotes | Retweets | Geotagged Data Count | Media Count |
|---|---|---|---|---|---|---|---|
Section 3: Scheduler/Ranker
Clone the repository, create a virtual environment and install the required packages, then set your API keys in src/__init__.py or via environment variables. All key names are listed in .env.example. After setup, simply run the directory with Python (a __main__.py is present).
```sh
# Cloning
$ git clone https://github.com/ineshbose/UofG_Web_Science_H
$ cd UofG_Web_Science_H

# Creating a virtual environment
$ python -m venv env
$ source env/bin/activate    # Unix or macOS
# for Windows, use `env\scripts\activate`
$ pip install -r requirements.txt

# Setting an environment variable
$ export API_KEY=""          # example; use `set` instead of `export` on Windows

# Running
$ python .
```