Yahoo Finance Web Scraping with R

Published in Level Up Coding, Feb 22, 2021

The financial health of a company revolves around three core elements: (1) well-functioning day-to-day systems and processes, (2) resilience (the ability to weather shocks), and (3) the pursuit of longer-term goals. The financial performance of a company also reflects the effectiveness of its management. Yahoo Finance is a good source of financial data on publicly traded stocks, and I was interested in building an understanding of the financial health of a long list of companies.

A lot of websites are dynamically created. This could be through the way the content or the data is loaded, generally using JavaScript and frameworks such as React. Alternatively, depending on the school of thought, the template or architecture of the website could be created dynamically instead of using the more traditional pre-defined nesting of tags. Yahoo Finance is one such dynamically created website: the values of the class and data-reactid attributes are generated dynamically, and the data is loaded as JSON.

Python has the BeautifulSoup and Selenium packages to help with web scraping. In this post, we're going to look at web scraping with R, specifically of the Yahoo Finance website. The data is populated through React, so we can get it in JSON format from the root.app.main component in the page source. Let's get started with our setup!

root.app.main component contains the data for the page.

Setup

Apart from some general-purpose R libraries, you'll need the four libraries below, which deal with reading and parsing HTML:

xml2: for reading HTML or XML
XML: for parsing an XML or HTML file or string
rvest: for extracting nodes and pieces of the HTML document, including via CSS selectors
httr: for handling HTTP requests with GET(), PUT(), POST(), PATCH(), HEAD(), and DELETE()

library(data.table)
library(tidyr)
library(tidyverse)
library(XML)
library(xml2)
library(rvest)
library(httr)

What are we going to scrape?

The company financials on the Yahoo Finance website can be reported on an annual or a quarterly basis. Either way, there is a wide variety of metrics for which we can extract data. These metrics can be seen below:

Financial statement metrics reported by listed companies

For my purposes, I was interested in the following metrics reported on a quarterly basis:

Total Revenue
Earnings
Cost of Revenue
Gross Profit

Code

First, we have to read the HTML source of the page and start parsing it.

# symbol must already hold the ticker of the stock of interest as a string
url <- paste0('https://finance.yahoo.com/quote/', symbol, '/financials?p=', symbol)
html_M <- read_html(url) %>% html_node('body') %>% html_text() %>% toString()
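
The one-liner above relies on read_html() fetching the URL directly. As a hedged sketch, the same step can be split into a URL builder and a fetch helper that goes through httr, which makes failed requests easier to detect. The function names build_financials_url and fetch_page_text, the user-agent string, and the example tickers are hypothetical and not part of the original script; the page layout is assumed to be the one described in this post.

```r
# Hypothetical helper: build the financials URL for one ticker symbol.
build_financials_url <- function(symbol) {
  paste0("https://finance.yahoo.com/quote/", symbol, "/financials?p=", symbol)
}

# Hypothetical helper: download the page and return the text of <body>.
# Needs the httr, xml2 and rvest packages; nothing is fetched until it is called.
fetch_page_text <- function(symbol) {
  resp <- httr::GET(build_financials_url(symbol),
                    httr::user_agent("Mozilla/5.0 (illustrative user agent)"))
  httr::stop_for_status(resp)  # raise an R error on 4xx/5xx responses
  page <- xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
  rvest::html_text(rvest::html_node(page, "body"))
}

# Example tickers, chosen only for illustration.
urls <- vapply(c("AAPL", "MSFT"), build_financials_url, character(1))
print(urls)
```

Calling fetch_page_text("AAPL") would then play the role of html_M, assuming the site still serves the page in the same form.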

I want to standardize every stock's financial data in USD for easy comparison. Therefore, it is important to extract the currency in which the financials have been reported. The reporting currency is passed through the root.app.main component in the following form: ,"financialCurrency":"JPY"}, . Below is the script I used to extract and strip the string down to the currency code, "JPY" in this case. Later, I will use the currency conversion rate to convert the revenue numbers to USD.

# Keep the text that follows the "financialCurrency": key
fin_cur <- sub(".*\"financialCurrency\":*(.*?) *[\n|\r\n|\r]{2}", "\\1", html_M)
# Take everything up to the first closing brace, e.g. "JPY" with its quotes
fin_cur <- head(stringr::str_match_all(fin_cur, "(.*?)\\}")[[1]][, 2], 1)
# Strip the surrounding double quotes
fin_cur <- gsub("\"", "", fin_cur, fixed = TRUE)
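
To make the intent of the three lines above concrete, here is a minimal sketch that pulls the currency code out of a toy string with a single base R regular expression and then applies a conversion rate. The sample string, the helper names extract_currency and to_usd, and the exchange rate are all made up for illustration; real rates would have to come from a source of your choice.

```r
# Toy stand-in for html_M; the real page text is far longer.
sample_text <- 'x,"financialCurrency":"JPY"},"y":1'

# Hypothetical helper: return the three-letter code after "financialCurrency",
# or NA if the key is absent.
extract_currency <- function(text) {
  m <- regmatches(text, regexpr('"financialCurrency":"[A-Z]{3}"', text))
  if (length(m) == 0) return(NA_character_)
  sub('.*:"([A-Z]{3})"', "\\1", m)
}

# Hypothetical helper: convert amounts to USD with a named vector of rates
# (USD per one unit of the reporting currency). The rate below is invented.
to_usd <- function(amount, currency, rates) {
  amount * rates[[currency]]
}

fin_cur <- extract_currency(sample_text)
rates <- c(USD = 1, JPY = 0.01)
print(fin_cur)
print(to_usd(5000, fin_cur, rates))
```

Returning NA when the key is missing lets a calling loop skip a stock instead of silently carrying forward an unrelated piece of text.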

Total Revenue and Earnings on a quarterly basis

We want to extract the quarterly total revenue and earnings data and save it in a usable format. To do so, we extract the quarterly data, parse it into a readable format, and then iteratively capture the date alongside the raw numbers, as in the code snippet below. At line 5 of the snippet, the splitQ variable holds the output shown in the screenshot; lines 6 to 18 are then needed to extract the raw data for total revenue and the date. Similar code can be used to extract the quarterly earnings.

Q_results <- sub(".*\"quarterly\":*(.*?) *[\n|\r\n|\r]{2}", "\\1", html_M)
Q_results <- head(stringr::str_match_all(Q_results, "\\[(.*?)\\]")[[1]][, 2],1)
splitQ <- str_split(Q_results, "\\{\"date\":")
splitQ <- splitQ[[1]]
splitQ <- paste("\\{\"date\":", splitQ, sep="")
if(length(splitQ)>0){
   tot_rev_df <- data.frame(curr = fin_cur,
      key=str_extract(splitQ, "\"date\":\"*(.*?) *\""),
      value=str_extract(splitQ, "\"revenue\":\\{\"raw\":*(.*?) *,"))
   tot_rev_df <- tot_rev_df[complete.cases(tot_rev_df), ]
   tot_rev_df <- data.frame(lapply(tot_rev_df, as.character), stringsAsFactors=FALSE)
   tot_rev_df <- tot_rev_df %>%
      separate(key, c("first", "key"), sep=":") %>% 
      select(-first)
   tot_rev_df <- tot_rev_df %>%
      separate(value, c("first", "second", "value"), sep=":") %>%
      select(-first, -second)
   tot_rev_df <- tot_rev_df %>%
      mutate(key=gsub("\"", "", key, fixed=T),
         value=gsub(",", "", value, fixed=T))
}

Screenshot of the result captured in splitQ
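
For readers who want to follow the logic without the screenshot, the sketch below runs the same idea on a small, made-up string in the format this post describes. It uses only base R, and the sample values and the helper name parse_quarterly are assumptions for illustration, not data from Yahoo Finance.

```r
# Made-up sample in the shape described above: a "quarterly" array of entries.
sample_text <- paste0(
  '"quarterly":[',
  '{"date":"1Q2020","revenue":{"raw":100,"fmt":"100"},"earnings":{"raw":10,"fmt":"10"}},',
  '{"date":"2Q2020","revenue":{"raw":120,"fmt":"120"},"earnings":{"raw":-5,"fmt":"-5"}}',
  ']'
)

# Hypothetical helper: capture date, revenue and earnings for every entry.
parse_quarterly <- function(text) {
  pattern <- paste0('\\{"date":"([^"]+)","revenue":\\{"raw":(-?[0-9.]+)[^}]*\\},',
                    '"earnings":\\{"raw":(-?[0-9.]+)')
  hits <- regmatches(text, gregexpr(pattern, text))[[1]]
  if (length(hits) == 0) return(data.frame())
  # regexec() exposes the capture groups of each matched entry.
  parts <- regmatches(hits, regexec(pattern, hits))
  data.frame(
    key      = vapply(parts, `[`, character(1), 2),
    revenue  = as.numeric(vapply(parts, `[`, character(1), 3)),
    earnings = as.numeric(vapply(parts, `[`, character(1), 4)),
    stringsAsFactors = FALSE
  )
}

quarterly_df <- parse_quarterly(sample_text)
print(quarterly_df)
```

Capturing the numbers with regular-expression groups avoids the separate() and gsub() clean-up steps, at the cost of a pattern that assumes revenue always precedes earnings within an entry.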

Cost of Revenue and Gross Profit

The raw data belonging to the Cost of Revenue and Gross Profit metrics can be extracted with the below code. The first half of the character vector extracted from the ex_between function contains quarterly data and the second half contains annual data.

# Cost of Revenue: keep the first (quarterly) half and strip the JSON wrapper
cost_rev <- qdapRegex::ex_between(html_M, "\"costOfRevenue\":", "\"fmt\"")[[1]]
cost_rev <- cost_rev[1:(length(cost_rev)/2)]
cost_rev <- gsub("{\"raw\":", "", cost_rev, fixed = TRUE)
cost_rev <- gsub(",", "", cost_rev, fixed = TRUE)

# Gross Profit: same steps with a different key
gp <- qdapRegex::ex_between(html_M, "\"grossProfit\":", "\"fmt\"")[[1]]
gp <- gp[1:(length(gp)/2)]
gp <- gsub("{\"raw\":", "", gp, fixed = TRUE)
gp <- gsub(",", "", gp, fixed = TRUE)
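
The values returned above are still character strings. As a rough sketch, the snippet below reproduces the extract-then-halve logic on an invented string using base R instead of qdapRegex, and converts the result to numbers. The helper name extract_raw_values and the sample figures are hypothetical, and the assumption that the quarterly values come first is taken from the description above.

```r
# Invented sample: two quarterly values followed by two annual values.
sample_text <- paste0(
  '"costOfRevenue":{"raw":60,"fmt":"60"},"costOfRevenue":{"raw":70,"fmt":"70"},',
  '"costOfRevenue":{"raw":250,"fmt":"250"},"costOfRevenue":{"raw":300,"fmt":"300"}'
)

# Hypothetical helper: collect every raw value stored under a given metric name.
extract_raw_values <- function(text, metric) {
  pattern <- paste0('"', metric, '":\\{"raw":(-?[0-9.]+)')
  hits <- regmatches(text, gregexpr(pattern, text))[[1]]
  as.numeric(sub('.*"raw":', "", hits))
}

all_values <- extract_raw_values(sample_text, "costOfRevenue")
# Keep the first half, assumed to be the quarterly figures.
quarterly_cost <- all_values[seq_len(length(all_values) %/% 2)]
print(quarterly_cost)
```

Using seq_len() rather than 1:(n/2) keeps the subsetting well-behaved when nothing is found, since seq_len(0) yields an empty index.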

Conclusion

The if(length(splitQ)>0){} condition guards against cases where a stock is not listed or its financial reporting has not been completed. Once you've compiled the code snippets for the metrics of interest and sorted the results into a data frame, you can wrap the script in a for loop and iteratively scrape the different stocks you're interested in.
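
The loop described above might look like the following sketch. Here scrape_financials is a hypothetical stand-in for the snippets in this post wrapped into one function; it returns canned data and fails for one symbol so that the example runs without network access. The ticker names are placeholders.

```r
# Hypothetical stand-in for the scraping code above: returns a data frame for
# known symbols and raises an error otherwise (e.g. an unlisted stock).
scrape_financials <- function(symbol) {
  if (symbol == "BAD") stop("no financial data found for ", symbol)
  data.frame(symbol = symbol, key = c("1Q2020", "2Q2020"),
             value = c(100, 120), stringsAsFactors = FALSE)
}

symbols <- c("AAA", "BAD", "CCC")  # placeholder tickers
results <- list()
failed <- character(0)

for (symbol in symbols) {
  out <- tryCatch(scrape_financials(symbol), error = function(e) NULL)
  if (is.null(out)) {
    failed <- c(failed, symbol)  # remember symbols that could not be scraped
    next
  }
  results[[symbol]] <- out
  # With real requests, a pause such as Sys.sleep(1) here avoids hammering the site.
}

all_results <- do.call(rbind, results)
print(all_results)
print(failed)
```

Wrapping each call in tryCatch() means one failing stock does not stop the whole run, and the failed vector records which symbols to revisit.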

Additional Resources

If you’re interested in scraping Yahoo Finance using Python, you can check out: