Multimodal deep learning for multi-horizon corporate revenue forecasting
Authors
Wu, Qiping
Abstract
Corporate revenue forecasting matters for valuation, portfolio management, and capital allocation. However,
it is difficult because financial statements mainly reflect the past, while investors and firms often need forecasts
from the next quarter to a rolling one-year horizon. This challenge becomes even greater over longer horizons,
especially in fast-changing industries. This thesis addresses the problem by building a forecasting framework
that starts with a broad quantitative baseline and then extends to a multimodal approach.
First, this thesis develops a Temporal Fusion Transformer (TFT) baseline for next-quarter revenue
forecasting across 155 continuously listed S&P 500 firms. Under a strict chronological evaluation protocol,
the TFT model achieves a test Mean Absolute Percentage Error (MAPE) of 9.31%, a Root Mean Squared
Error (RMSE) of 1,973 million USD, and a Mean Absolute Error (MAE) of 1,790 million USD. Controlled
ablation analysis further shows that accurate short-horizon forecasting depends not only on autoregressive
revenue history but also on structured firm context, including sector identity, year-over-year growth, and
firm-scale variables such as total assets and equity.
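The three evaluation metrics reported above can be stated concretely. The sketch below implements MAPE, RMSE, and MAE in plain Python on illustrative quarterly revenue figures (the values are invented for demonstration and are not thesis data):

```python
import math

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root Mean Squared Error, in the units of the target."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean Absolute Error, in the units of the target."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Illustrative quarterly revenues in millions USD (not thesis data).
actual = [20000.0, 22000.0, 21000.0, 25000.0]
forecast = [19000.0, 23500.0, 20500.0, 24000.0]

print(round(mape(actual, forecast), 2))  # percent
print(round(rmse(actual, forecast), 1))  # millions USD
print(round(mae(actual, forecast), 1))   # millions USD
```

Note that MAPE is scale-free while RMSE and MAE inherit the units of the target, which is why the thesis reports the latter two in millions of USD.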
Second, the framework is extended from one-quarter-ahead to four-quarter-ahead forecasting. The results
show that forecast accuracy deteriorates as the horizon expands, with MAPE rising from 9.31% at one quarter
ahead (𝑡 + 1) to 12.07% at four quarters ahead (𝑡 + 4). A comparison with an LSTM baseline under the
same chronological setting further suggests that this deterioration is not specific to a single model, but
reflects a broader limitation of purely financial forecasting approaches. The effect is especially pronounced
in technology-oriented firms, highlighting the limits of relying only on lagged financial data in non-linear
growth environments.
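The multi-horizon setup described above can be sketched as a data-preparation step: for each quarter t, the targets are the revenues realised at t+1 through t+4, and the train/test boundary is purely chronological so no future information leaks into training. The function names and the toy series below are illustrative assumptions, not the thesis implementation:

```python
def multi_horizon_targets(revenue, horizons=(1, 2, 3, 4)):
    """For each quarter t, pair index t (features available) with the
    revenues realised at t+h for each horizon h (no look-ahead)."""
    max_h = max(horizons)
    return [(t, {h: revenue[t + h] for h in horizons})
            for t in range(len(revenue) - max_h)]

def chronological_split(samples, train_frac=0.8):
    """Earlier samples train, later samples test -- never shuffled."""
    cut = int(len(samples) * train_frac)
    return samples[:cut], samples[cut:]

# Toy revenue series (millions USD), one value per quarter.
revenue = [100.0, 110.0, 105.0, 120.0, 130.0, 125.0, 140.0, 150.0, 145.0, 160.0]
samples = multi_horizon_targets(revenue)
train, test = chronological_split(samples)
print(len(samples), len(train), len(test))  # 6 4 2
```

Because every test sample lies strictly after every training sample in calendar time, horizon-wise metrics such as the t+1 versus t+4 MAPE comparison remain free of temporal leakage.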
Third, the work proposes a multimodal TFT framework that integrates earnings-call-derived textual
signals into the forecasting pipeline. Focusing on the Mega-Cap 5 companies, the framework uses both
Financial Bidirectional Encoder Representations from Transformers (FinBERT) and a locally deployed
Llama-3 8B model to extract finance-domain sentiment and richer generative narrative features from quarterly
earnings call transcripts. The results show that transcript-based narrative features improve long-horizon
forecasting, with the Llama-3 representation delivering the largest gain: the pure TFT records a MAPE of
53.85%, while the FinBERT+TFT and Llama-3+TFT hybrids reduce it to 48.70% and 43.01%, respectively.
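One way to picture how transcript-derived signals enter such a pipeline is as per-quarter aggregate features computed from per-sentence sentiment scores. The sketch below is a simplified assumption about the interface, not the thesis code: the scoring function is pluggable, so a real pipeline would pass a FinBERT or local Llama-3 scorer where the trivial keyword stub is used here.

```python
def narrative_features(transcript, score_fn):
    """Aggregate per-sentence sentiment scores in [-1, 1] into simple
    per-quarter features fed alongside the financial inputs."""
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    scores = [score_fn(s) for s in sentences]
    return {
        "mean_sentiment": sum(scores) / len(scores),
        "positive_share": sum(1 for s in scores if s > 0) / len(scores),
    }

def toy_score(sentence):
    """Stand-in scorer: a real system would call FinBERT or a local
    Llama-3 model here; this keyword rule is purely illustrative."""
    s = sentence.lower()
    if "growth" in s or "record" in s:
        return 1.0
    if "decline" in s or "headwind" in s:
        return -1.0
    return 0.0

transcript = ("We delivered record revenue this quarter. "
              "Margins saw some decline. Guidance is unchanged.")
print(narrative_features(transcript, toy_score))
```

Keeping the scorer behind a simple callable interface is what makes the FinBERT-versus-Llama-3 comparison in the thesis possible: the downstream TFT sees the same feature schema regardless of which language model produced the scores.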
Overall, this thesis presents a practically deployable multimodal forecasting framework that bridges
the gap between backward-looking financial fundamentals and forward-looking managerial narratives in
corporate revenue forecasting.
