Big Data and the Dot Com Bubble

I develop a big-data model that predicts dot-com market behavior. The model also predicts the dot-com collapse three months before it occurs. It differs in three ways from earlier models that fail to explain the dot-com market. First, it uses an objective, machine-driven methodology to analyze media news stories. Second, it treats news articles as complex, multi-thematic constructs. Third, it requires that a news story mention the firm in its headline. I submit that these three factors enable my model to explain dot-com market behavior where other models fail.


Introduction
The literature provides mixed evidence on the relationship between media coverage and bubble formation and collapse. Research immediately following the dot-com bubble collapse suggests that media coverage exacerbates bubble formation and collapse. Shiller (2000) proposes that positive media feedback leads investors to overvalue internet stocks prior to the bubble and to undervalue stocks after it. Thus media forces drive prices too high prior to the bubble and too low after it. Subsequent research supports Shiller's finding that media coverage influences investor behavior. In a series of papers, Tetlock (2007) shows that negative media coverage drives stock prices down. Loughran and McDonald (2011) and Hanley and Hoberg (2009) show that investors respond to how the firm presents information in SEC filings, which the media mine for information on the firm.
Recently, several papers have appeared that call into question the results of the earlier research stream outlined above. Bhattacharya et al. (2009) read and classify media news items appearing between 1996 and 2000. They subjectively place the news articles into 'positive', 'neutral' and 'bad' news categories. They then test whether the news articles explain bubble stock overpricing or underpricing. They fail to find such evidence and conclude that "media hype is unable to explain the stock bubble". Campbell, Turner and Walker (2012) use Bhattacharya's methodology to analyze the Railway bubble, which occurred in England in the mid 1840s. They conclude that the media did not exacerbate bubble development. Rather, their evidence leads them to find that the media provides a "fair and unbiased" information resource to investors, which helps investors make rational buying and selling decisions. Thus, inexplicably, the later literature on the media's impact on bubbles is inconsistent with the earlier literature. This inconsistency provides me with an opportunity to contribute to the literature.
[...] information to the reader. Thus, when Bhattacharya et al. force a single subjective label onto a news story, they confound the subtle content that the media is communicating to the investing public.
Lastly, Bhattacharya et al. (2009) use every news article in which the firm is mentioned anywhere in the text. I propose that most readers focus on the headline of a news article and perhaps the first paragraph. Thus, using news articles that discuss the firm only in the latter part of the article body, but not in the headline or first paragraph, introduces considerable noise into the study, which may weaken the results.
I avoid these three shortcomings. First, I do not read the news stories; instead I use a computer to read them. To do this, each news story is saved as a text file to the computer's hard drive. These text files are then read into computer memory, where a matrix is created for each file. The rows of this matrix represent each unique word found in the text file, and the columns of this matrix represent the counts of the words. Thus, a vector of matrices is created, with each matrix representing one news story and the entire vector representing the entire corpus being investigated. This approach has been used by many researchers in the finance literature (Loughran & McDonald, 2011; Hanley & Hoberg, 2009; Brockman & Cicon, 2013; Cicon, Clarke, Ferris, & Jayaraman, 2013). The primary advantage of this approach is that it removes the subjectivity of classifying documents that are read at different times (days or years apart) and in different circumstances (home, office or campus). The computer reads all of the documents within a few minutes and categorizes them in a way that is 100% replicable with regard to time and location.
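The word-count representation described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the tokenization rule, the function names `word_counts` and `build_corpus`, and the two sample stories are all assumptions made for illustration, and the stories are held in memory rather than read from text files.

```python
import re
from collections import Counter

def word_counts(text):
    """Map each unique lowercase word in one news story to its count."""
    # Assumed tokenization: runs of letters, with optional internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)

def build_corpus(stories):
    """One word-count table per story; the list stands for the whole corpus."""
    return [word_counts(story) for story in stories]

# Hypothetical stories, for illustration only.
stories = [
    "Shares rise as the firm posts strong gains.",
    "The firm warns of losses; shares fall.",
]
corpus = build_corpus(stories)
```

Because the counts depend only on the text and the tokenization rule, rerunning the sketch on the same files always yields the same tables, which is the replicability property the paragraph above relies on.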
I solve the second shortcoming by using a methodology similar to that used in Cicon, Clarke, Ferris and Jayaraman (2013). In that paper the authors break each press release into composite themes, treating each theme as incrementally informative. I use the six themes defined by Loughran and McDonald (2011): positivity, negativity, litigiousness, uncertainty, modal strong and modal weak. Each of these themes is defined by a list of words that capture that theme's semantic context in the English language. For example, some of the words in the positivity dictionary are able, abundance, achieve, beautiful and charitable. The positivity dictionary consists of over 350 words. The more of these words that the document contains, the higher the 'score' the document receives for positivity. I create scores for each document for each of the six L&M dictionaries. This differs from Bhattacharya's approach, which treats each news article as a single token of information, an approach which weakens their study and may explain why they conclude that "media hype is unable to explain the stock bubble".
I resolve the last shortcoming by limiting the breadth of news coverage in my study. I require that the firm's name and/or ticker appear in the headline of the article for the news article to be relevant to the firms in my study. Researchers provide evidence that readers searching for information about a firm do not read all available news articles about all available firms in order to divine information relevant to a particular firm. Rather, readers search headlines only for information about specific firms and read only those articles that mention the firm of interest in the headline. This phenomenon is substantiated by findings from the Media Insight Project, an initiative of the American Press Institute (Note 1).
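The theme scoring and the headline requirement can be sketched as follows. This is a hedged illustration, not the study's implementation. The word lists are tiny stand-ins: only the five positive words come from the text above, and the negative entries are placeholders. Dividing the hit count by document length is an assumption, since the text says only that more dictionary words give a higher score. The names `theme_scores` and `firm_in_headline` are hypothetical.

```python
import re

# Tiny illustrative word lists. The positive words are quoted in the text;
# the negative entries are placeholders, not the actual dictionary.
THEMES = {
    "positive": {"able", "abundance", "achieve", "beautiful", "charitable"},
    "negative": {"loss", "decline"},
}

def tokenize(text):
    """Lowercase the text and split it into runs of letters (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

def theme_scores(text, themes=THEMES):
    """Score each theme as the share of words found in that theme's list.

    Normalizing by document length is an assumption of this sketch.
    """
    tokens = tokenize(text)
    total = len(tokens)
    scores = {}
    for name, words in themes.items():
        hits = sum(1 for token in tokens if token in words)
        scores[name] = hits / total if total else 0.0
    return scores

def firm_in_headline(headline, name, ticker):
    """Keep an article only if the firm's name or ticker is in the headline."""
    if name.lower() in headline.lower():
        return True
    # Match the ticker as a whole word so it cannot hit inside another word.
    return ticker.lower() in tokenize(headline)

scores = theme_scores("We achieve abundance despite a loss")
```

Each article thus receives one score per theme rather than a single good or bad label, and articles that fail `firm_in_headline` would be dropped before scoring.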

Data and Methodology
I begin building my sample of firms by searching SDC for all IPOs issued between 1996 and 2001. I exclude IPOs that are unit offerings, rights offerings, closed-end mutual funds, REITs and American depository receipts (ADRs). This search yields 2,706 firms. I download these firms and match them to the list of internet firms provided by Loughran and Ritter (2004). This step leaves me with a sample of 442 firms. I next create a matching sample of non-internet firms based on offer size and date. I match without replacement. Throughout this process I cross-check each match with CRSP and COMPUSTAT. If the 'match' does not exist in both of these databases, I drop it and seek a match which does. At the end of this procedure I have a sample of 884 firms, 442 of which are 'internet' firms and 442 of which are 'non-internet' firms.
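Matching without replacement can be illustrated with a short sketch. The text does not state the matching rule, so everything specific here is an assumption: the greedy order, the 90-day date window (`max_days`), the use of absolute offer-size difference as the distance, and the record layout. Candidates are assumed to have already passed the CRSP and COMPUSTAT cross-check.

```python
from datetime import date

def match_without_replacement(internet, candidates, max_days=90):
    """Greedy one-to-one matching on offer date and offer size.

    For each internet firm, pick the unused candidate with the closest
    offer size among those whose IPO date lies within max_days.
    The window and the distance measure are assumptions of this sketch.
    """
    available = list(candidates)
    matches = {}
    for firm in internet:
        eligible = [c for c in available
                    if abs((c["date"] - firm["date"]).days) <= max_days]
        if not eligible:
            continue  # no acceptable match for this firm
        best = min(eligible, key=lambda c: abs(c["size"] - firm["size"]))
        matches[firm["id"]] = best["id"]
        available.remove(best)  # without replacement: used at most once
    return matches

# Hypothetical firms, for illustration only (offer size in $ millions).
internet = [
    {"id": "I1", "size": 50.0, "date": date(1999, 3, 1)},
    {"id": "I2", "size": 52.0, "date": date(1999, 3, 15)},
]
candidates = [
    {"id": "N1", "size": 51.0, "date": date(1999, 3, 10)},
    {"id": "N2", "size": 80.0, "date": date(1999, 4, 1)},
    {"id": "N3", "size": 52.0, "date": date(1998, 1, 1)},
]
matches = match_without_replacement(internet, candidates)
```

Removing each chosen candidate from `available` is what makes the procedure "without replacement": no non-internet firm can serve as the match for two internet firms.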
I next create my database of news articles for these firms. I begin by searching LexisNexis Academic for all news articles concerning my sample of 884 firms. I identify and download a total of 134,990 articles. The average news article length is 577 words. The total number of words in my article database is 77,883,240. I write C++ code to serialize each text file and extract the company name, ticker symbol, the company's industry, the news headline and the news body. I then implement the methodology of Loughran and McDonald (2009) to compute soft scores for each of their semantic constructs: litigiousness, modal strong, modal weak, negative tone, positive tone, and uncertainty.
Panel B reports results for the non-internet firms. I draw attention to my observation that the results in Panels A and B are orthogonal. Whereas lmPositive and lmModalStrong are significant in Panel A, they are not significant in Panel B. Conversely, the three factors significant in Panel B are not significant in Panel A. Those three factors are lmLitigious, lmModalWeak and lmUncertainty, with coefficients of 0.009554, -0.007273 and 0.010414 respectively. Thus the internet stocks and the non-internet stocks are being driven by different soft factors.
I next create and test a more parsimonious model (Brau, Cicon, & Ferris, 2014). In their regressions of first-day returns, Loughran and McDonald (2011) report that their word lists take the following signs: uncertainty (+), weak modal (+), negative (+), positive (+), legal (-) and strong modal (-). Based on these results, I combine all six of the Loughran and McDonald scores into a single big data factor:
bigDataFactor = (uncertainty + modalweak + negative + positive) − (litigious + modalstrong)    (3)
I then use this factor to create my parsimonious model, Equation 4, in which all variables are defined in Table 1. I report results in Table 3. In Panel A, I report results for the internet firms. The intercept and the bigDataFactor have about equal magnitude; I read this as my big data factor explaining half of the total variance in the Fama French returns (0.011590) and the intercept explaining the other half (0.011590). Both variables are significant at better than 1%. Panel B, on the other hand, reports that the non-internet firms are not explained by the bigDataFactor, and all variance is captured by the intercept. This finding suggests that, over this period, only the internet firms are being driven by media forces, not the non-internet firms.
[...] Loughran and McDonald (2009) parameters with a more parsimonious model. In Panel A, I report the results for only the Internet Firms, and in Panel B, I report the results for only the Non-internet Firms. I accomplish this by fitting the model below: [...] where all variables are defined in Table 1. Statistical significance is denoted as follows: '***' denotes 0% significance, '**' denotes 0.1% significance, '*' denotes 1% significance, '.' denotes 5% significance, and ' ' denotes no significance.
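The construction of the big data factor in Equation 3 can be sketched in Python as follows. The helper `fit_line` assumes that the parsimonious model is a one-variable ordinary least squares regression of abnormal returns on the factor with an intercept. That form is inferred from the intercept and bigDataFactor terms reported in Table 3, not taken from the model equation itself. The dictionary keys, function names and numeric inputs are hypothetical.

```python
def big_data_factor(scores):
    """Equation 3: sum of the four (+) themes minus the two (-) themes."""
    plus = (scores["uncertainty"] + scores["modal_weak"]
            + scores["negative"] + scores["positive"])
    minus = scores["litigious"] + scores["modal_strong"]
    return plus - minus

def fit_line(x, y):
    """Closed-form OLS for y = intercept + slope * x (assumed model form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical theme scores for one document, for illustration only.
example = {
    "uncertainty": 0.02, "modal_weak": 0.01, "negative": 0.03,
    "positive": 0.04, "litigious": 0.01, "modal_strong": 0.02,
}
factor = big_data_factor(example)

# Hypothetical (factor, abnormal return) observations.
intercept, slope = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

Collapsing six regressors into one signed sum fixes the relative weights at plus or minus one, so the regression estimates only an intercept and a single slope.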
The difference between the Bhattacharya et al. (2009) study and my study is best appreciated by analyzing Figure 2. This figure first plots, as a red line, the abnormal returns that are not explained by the Fama French three-factor model. The blue line plots the results of Equation 4 (scaled to the same magnitude as the red line). Unlike in Bhattacharya, my model (blue line) closely follows the Fama French abnormal returns, with the exception of a deviation about three months prior to market collapse. This implies that my model may have the power to predict a collapse.
The potential power of my model to predict an imminent market collapse is the most interesting part of this paper. It is not a finding that I expected. However, close inspection of Figure 3 shows that my model closely follows the dot-com market for most of the dot-com period. It is only about three months prior to the collapse of the dot-com market that my model deviates. I propose that up until this point media hype was driving the market. However, on or about December 7, 1999, the media started to draw back. At this point the market itself continued to overvalue internet stocks, ignoring media that recommended prudence, and carried on the inertia of its previous exuberance.

Conclusion
In this paper I develop a big data model of the dot-com market. I show that my model explains dot-com market performance, despite the fact that Bhattacharya et al. (2009) assert that "media hype is unable to explain the stock bubble". I propose that my model succeeds where Bhattacharya et al. (2009) fail, based on three factors. First, I remove subjectivity from the research methodology by using a machine to read the news articles, rather than relying on humans to read them over a period of years. Second, I do not treat news as merely 'good' or 'bad'. Instead I extract measures of positivity, negativity, uncertainty, commitment and litigiousness. Lastly, I do not accept all news articles that mention the firm anywhere within the article. Instead, I limit my sample to news articles in which the firm is mentioned in the article headline. Surprisingly, my model also predicts the dot-com collapse: it provides evidence that the market should have started falling on or about December 7, 1999. However, the market continued to expand beyond that point for another three months. I propose that the media was indeed driving dot-com market sentiment, but only until December 7, 1999. At that point the media began souring on dot-com stocks, yet market momentum continued to carry stock prices for another three months, well past the point where a prudent media observer would have pulled out.
Additional work needs to be performed on this study. First, the control variables used by Bhattacharya et al. (2009) should be introduced. I do not expect that this will change my results materially, because I add my big data variable at the same point where Bhattacharya et al. add their controls. They find that the controls do not change their model or their results, and I expect that the controls will not change mine either. Second, my model should be applied to other bubbles to test whether it has predictive power for them too, or whether it merely reports an artifact of the dot-com era.
[Figure captions garbled in extraction. Recoverable content: the figures plot the Fama French abnormal returns as a red line and the model's prediction as a blue line.]

Table 1. Variable definitions

Table 2. Fama French abnormal returns
Note. This table tests the explanatory power of the Loughran and McDonald word dictionaries against the Fama French abnormal returns for the dot-com internet stocks. In Panel A, I report results for only the Internet Firms, and in Panel B, I report results for only the Non-internet Firms. Results are based on the model below:

Table 3. Fama French abnormal returns: the parsimonious model
This table repeats the analysis in Table 2, but it replaces the six