0% found this document useful (0 votes)
2K views

Data Extraction and Text Analysis Blackcoffer Consulting: Objective

The document provides instructions for extracting textual data and performing text analysis on SEC/EDGAR financial reports. It outlines 15 variables that need to be computed for each report, including positive/negative word counts and scores, average sentence length, percentages of complex/uncertainty/constraining words, and word counts. Links to the financial reports are provided in a spreadsheet, and the output should add 15 columns to each row with the computed variables for that report. The analysis is to be done on the text portions of the reports only, removing HTML/XML codes and other unnecessary data.

Uploaded by

Vikas Gupta
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2K views

Data Extraction and Text Analysis Blackcoffer Consulting: Objective

The document provides instructions for extracting textual data and performing text analysis on SEC/EDGAR financial reports. It outlines 15 variables that need to be computed for each report, including positive/negative word counts and scores, average sentence length, percentages of complex/uncertainty/constraining words, and word counts. Links to the financial reports are provided in a spreadsheet, and the output should add 15 columns to each row with the computed variables for that report. The analysis is to be done on the text portions of the reports only, removing HTML/XML codes and other unnecessary data.

Uploaded by

Vikas Gupta
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 4

Data Extraction and Text Analysis

Blackcoffer Consulting

Objective

Objective of this assignment is to extract textual data from SEC / EDGAR financial reports and
perform text analysis to compute variables those are explained below.

Data Source

Link to SEC / EDGAR financial reports are given in excel spreadsheet “cik_list.xlsx”.

Please add https://www.sec.gov/Archives/ to every cells of column F (cik_list.xlsx) to access


link to the financial report.

Example: Row 2, column F contains edgar/data/3662/0000950170-98-000413.txt

Add https://www.sec.gov/Archives/ to form financial report link i.e.

https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt

1 Variables:
“Text Analysis.docx” you need to compute following:
Section 1.1: Positive score, negative score, polarity score
Section 2: Average Sentence Length, percentage of complex words, fog index
Section 4: Complex word count
Section 5: Word count
 
In addition to these eight variables, compute two more items: “uncertainty” and “constraining”.
These variables are calculated similar to the ones in Section 1.1 or Section 4. Attached the lists
of words that are classified as uncertain or constraining.
For uncertainty: “uncertainty_dictionary.xlsx”
For constraining: “constraining_dictionary.xlsx”

4 more variables:

For the variables: positive word proportion, negative word proportion, uncertainty word
proportion, and constraining word proportion:
The absolute values of “Positive/Negative Scores” are equal to the number of positive/negative
words in each report of 10-Q/K; so the (Loughran-McDonald) positive/negative word proportion
can be simply calculated as “Positive/Negative Scores divided by Word Count – compute these
measure in addition to Polarity Score.  And, the “uncertainty score” and “constraining score” will
be also just equal to the number of corresponding words and you can calculate the portion of
these words as the same as above.  

1 more variables:
For the variable Constraining words for whole report
Add one variable to the mix, which will be calculated for the whole report. It’s the number of
“constraining” words over the whole report.

That means you need to collect/compute 15 variables in total.

2 Data
For each report (financial reports, links available in excel, cik list), we would like these 15
variables calculated for the whole report.

You need to read, access and clean the financial report and KEEP TEXT ONLY from the annual
report url. Clean and remove html / xml, etc. codes, syntaxes, and best of your knowledge. You
can keep text (paragraphs, sections, titles, readable text, tables, etc. whichever are the part of the
financial report in the given url). Remove all noise and unwanted data from the annual report
urls.
Attached is the spreadsheet “cik_list.xlsx”, which also contains the links to reports. It would be
ideal if you could add 15 columns to each row, so that we would have the # rows unchanged
after your data collection.

3 Output Data Structure


Output Variables:

1. All input variables in “cik_list.xlsx”


2. positive_score
3. negative_score
4. polarity_score
5. average_sentence_length
6. percentage_of_complex_words
7. fog_index
8. complex_word_count
9. word_count
10. uncertainty_score
11. constraining_score
12. positive_word_proportion
13. negative_word_proportion
14. uncertainty_word_proportion
15. constraining_word_proportion
16. constraining_words_whole_report

Checkout output data structure spreadsheet for the format of your output.

Timeline

6 days, sooner is better.


Where to submit

To submit your solution, please fill this google sheet and upload your article to google drive, and share
the drive url in the google sheet: https://forms.gle/nvWAgrCBdq1JkKou8

Make sure your submission contains:


a) .py file

b) output in csv or excel file as given in the output structure

c) instructions

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy