<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.1 20151215//EN" "JATS-archivearticle1.dtd">
<article xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">UCL Open Environ</journal-id>
<journal-id journal-id-type="publisher-id">UCLOE</journal-id>
<journal-title-group>
<journal-title>UCL Open Environment</journal-title>
<abbrev-journal-title>UCL Open Environ</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2632-0886</issn>
<publisher>
<publisher-name>UCL Press</publisher-name>
<publisher-loc>UK</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.14324/111.444/ucloe.000063</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Public opinion evaluation on social media platforms: a case study of High Speed 2 (HS2) rail infrastructure project</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id authenticated="false" contrib-id-type="orcid">https://orcid.org/0000-0002-2596-5031</contrib-id>
<name name-style="western">
<surname>Yao</surname>
<given-names>Ruiqiu</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
<contrib contrib-type="author">
<name name-style="western">
<surname>Gillen</surname>
<given-names>Andrew</given-names>
</name>
<xref ref-type="aff" rid="aff2">2</xref>
</contrib>
<aff id="aff1">
<label>1</label>Civil, Environmental and Geomatic Engineering, University College London, London, UK</aff>
<aff id="aff2">
<label>2</label>Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, USA</aff>
</contrib-group>
<author-notes>
<corresp id="cor1">*Corresponding author: E-mail: <email>ruiqiu.yao.19@ucl.ac.uk</email>
</corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" publication-format="electronic">
<day>08</day>
<month>09</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="epub" date-type="collection" publication-format="electronic">
<year>2023</year>
</pub-date>
<volume>5</volume>
<elocation-id>e063</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>06</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>06</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>© 2023 The Authors.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>The Authors.</copyright-holder>
<license>
<ali:license_ref>https://creativecommons.org/licenses/by/4.0/</ali:license_ref>
<license-p>This is an open access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution Licence (CC BY) 4.0</ext-link>, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<abstract>
<p>Public opinion evaluation is becoming increasingly significant in infrastructure project assessment. The inefficiencies of conventional evaluation approaches can be mitigated with social media analysis: posts about infrastructure projects on social media provide a large amount of data for assessing public opinion. This study proposes a hybrid model that combines a pre-trained RoBERTa and gated recurrent units for sentiment analysis. We selected the United Kingdom railway project High Speed 2 (HS2) as the case study. The sentiment analysis showed that the proposed hybrid model performs well in classifying social media sentiment. Furthermore, the study applies latent Dirichlet allocation topic modelling to identify key themes within the tweet corpus, providing deeper insights into the prominent topics surrounding the HS2 project. The findings from this case study serve as the basis for a comprehensive public opinion evaluation framework driven by social media data. This framework offers policymakers a valuable tool to assess and analyse public sentiment effectively.</p>
</abstract>
<kwd-group>
<kwd>public opinion evaluation</kwd>
<kwd>civil infrastructure projects</kwd>
<kwd>machine learning</kwd>
<kwd>sentiment analysis</kwd>
<kwd>topic modelling</kwd>
</kwd-group>
<funding-group>
<funding-statement>Not applicable to this article.</funding-statement>
</funding-group>
<counts>
<fig-count count="7"/>
<table-count count="4"/>
<ref-count count="50"/>
<page-count count="15"/>
</counts>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>Infrastructure systems lay the foundation of a nation’s economy by providing the public with primary transportation links, dependable energy systems and water management systems. In the United Kingdom (UK), the National Infrastructure Strategy 2020 reveals the determination of the UK government to deliver new infrastructure and upgrade existing infrastructure across the country to boost growth and productivity and achieve a net-zero objective by 2050 [<xref ref-type="bibr" rid="r1">1</xref>]. Although infrastructure projects positively affect the national economy, they can negatively impact the environment and society. For instance, they may disrupt the natural habitat of wildlife by filling in wetlands, forcing wildlife to migrate to other regions and disturbing the ecology of those regions [<xref ref-type="bibr" rid="r2">2</xref>].</p>
<p>Environmental impact assessments (EIA) are a critical part of the planning and delivery of large infrastructure projects. In EIA research, public participation schemes are becoming increasingly popular. O’Faircheallaigh [<xref ref-type="bibr" rid="r3">3</xref>] emphasised the importance of public participation in EIA decision-making processes. Social media platforms are becoming increasingly ubiquitous and offer an emerging channel for the public to participate in decision-making processes and raise environmental concerns. Thus, the research objective of this study is to evaluate the feasibility of using social media data to perform public participation analysis.</p>
<sec id="s1a">
<title>Conventional approaches to public opinion evaluations</title>
<p>Public hearings and public opinion polling are the two most widely adopted public consultation approaches. Checkoway [<xref ref-type="bibr" rid="r4">4</xref>] identified several drawbacks of public hearings: technical terms are hard for the public to understand, and participants often do not represent the actual population. As for polling, Heberlein [<xref ref-type="bibr" rid="r5">5</xref>] revealed that conducting a poll can take months or even years. As civil infrastructure projects typically have tight timelines, there is a need for a more efficient public opinion evaluation method.</p>
<p>Moreover, Ding [<xref ref-type="bibr" rid="r6">6</xref>] argued that the data collection process is costly for conventional opinion polling. A typical 1000-participant telephone interview can cost tens of thousands of US dollars to carry out [<xref ref-type="bibr" rid="r7">7</xref>]. Besides the cost of conducting surveys, costs associated with data input and data analysis should also be considered [<xref ref-type="bibr" rid="r6">6</xref>].</p>
<p>Public hearings and polling are therefore not ideal for obtaining public opinions on infrastructure projects: they can be costly, invasive and time-consuming. Researchers have consequently turned their attention to developing alternative methods for obtaining and assessing public opinion. A new opportunity for acquiring and evaluating public opinion has emerged with the growing popularity of social media platforms [<xref ref-type="bibr" rid="r8">8</xref>]. User-generated content on these platforms provides a huge amount of data for text mining, offering an alternative resource for evaluating opinion towards civil infrastructure projects.</p>
</sec>
<sec id="s1b">
<title>Related work on public opinion evaluation with social media analysis</title>
<p>Kaplan and Haenlein [<xref ref-type="bibr" rid="r9">9</xref>] defined social media platforms as Internet-based applications adopting Web 2.0 (the participative Web). Given the number of active users on Facebook and Twitter, the massive amount of user-generated content provides valuable opportunities for researchers to study various social topics [<xref ref-type="bibr" rid="r10">10</xref>]. Moreover, with machine learning and natural language processing, researchers can apply advanced, automated algorithms to social media posts, such as sentiment analysis and topic modelling. Sentiment analysis can categorise the textual data in social media into different emotional orientations, providing an indicator of public opinion. Recent research on infrastructure project evaluation with social media analysis has demonstrated the feasibility of using social media analysis as an alternative public opinion evaluation method.</p>
<p>Aldahawi [<xref ref-type="bibr" rid="r11">11</xref>] investigated social networking and public opinion on controversial oil companies by sentiment analysis of Twitter data. Kim and Kim [<xref ref-type="bibr" rid="r12">12</xref>] adopted lexicon-based sentiment analysis for public opinion sensing and trend analysis on nuclear power in Korea. Lexicon-based sentiment analysis with domain-specific dictionaries and topic modelling has also been used on public opinion data for California High-Speed Rail and the Three Gorges Project [<xref ref-type="bibr" rid="r6">6</xref>,<xref ref-type="bibr" rid="r8">8</xref>]. Lexicon-based sentiment analysis calculates the sentiment of a document from the polarity of its words [<xref ref-type="bibr" rid="r13">13</xref>], assuming that words have inherent sentiment polarity independent of their context. To build a lexicon-based classifier, a user must establish dictionaries containing words with sentiment polarity. The polarity of a document is then calculated in three phases: establishing word-polarity value pairs, replacing words in the document with their polarity values and calculating the sentiment polarity for the document. Ding [<xref ref-type="bibr" rid="r6">6</xref>] tailor-made a dictionary by removing unrelated words from a positive word list. Jiang et al. [<xref ref-type="bibr" rid="r8">8</xref>] built a dictionary for hydro projects by integrating the National Taiwan University Sentiment Dictionary [<xref ref-type="bibr" rid="r14">14</xref>], HowNet (a Chinese/English bilingual lexicon database) [<xref ref-type="bibr" rid="r15">15</xref>] and a hydro project-related word list. This research showed the practicality of implementing lexicon-based sentiment analysis for public opinion evaluation on civil projects, and recent developments in deep learning show a promising future for public opinion evaluation.</p>
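<p>The three-phase procedure described above can be illustrated with a minimal sketch. The word list and polarity values below are illustrative placeholders, not entries from the dictionaries cited:</p>

```python
# Minimal lexicon-based sentiment scorer: each word carries a fixed
# polarity value, and a document's score is the sum over its tokens.
POLARITY = {
    "good": 1.0, "great": 1.0, "efficient": 0.5,      # illustrative entries
    "bad": -1.0, "costly": -0.5, "disruptive": -1.0,
}

def lexicon_sentiment(document):
    """Return (score, label) for a whitespace-tokenised document."""
    tokens = document.lower().split()
    # Replace each word with its polarity value (0 if absent) and sum.
    score = sum(POLARITY.get(tok, 0.0) for tok in tokens)
    if score > 0:
        label = "positive"
    elif score == 0:
        label = "neutral"
    else:
        label = "negative"
    return score, label

print(lexicon_sentiment("great project but costly and disruptive"))
# prints (-0.5, 'negative')
```

<p>In practice, the toy POLARITY table would be replaced by a domain-tailored dictionary, as in the studies by Ding [<xref ref-type="bibr" rid="r6">6</xref>] and Jiang et al. [<xref ref-type="bibr" rid="r8">8</xref>].</p>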
</sec>
<sec id="s1c">
<title>Recent development of natural language processing</title>
<p>In 2014, Bahdanau et al. [<xref ref-type="bibr" rid="r16">16</xref>] introduced a novel neural network component known as the attention mechanism. Attention mechanisms are designed to mimic cognitive attention: they compute attention weights over the input sequence so that some parts of the input receive more attention than the rest. In 2017, Vaswani et al. [<xref ref-type="bibr" rid="r17">17</xref>] published their ground-breaking research paper ‘Attention is all you need’, in which they proposed an influential neural network architecture named the transformer. The transformer architecture leverages self-attention and multi-head attention to enable parallel computation. Using multiple attention heads and a self-attention mechanism, the transformer can capture different aspects of the input data by learning different functions. As a result, the transformer architecture can handle increased model and data sizes. Kaplan et al. [<xref ref-type="bibr" rid="r18">18</xref>] demonstrated that transformer models have remarkable scaling behaviour, as model performance increases with training size and model parameters. Hence, natural language processing can benefit from large language models, such as the generative pre-trained transformer (GPT) [<xref ref-type="bibr" rid="r19">19</xref>,<xref ref-type="bibr" rid="r20">20</xref>] and Bidirectional Encoder Representations from Transformers (BERT) [<xref ref-type="bibr" rid="r21">21</xref>].</p>
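<p>The core computation of self-attention can be sketched in a few lines of pure Python. This is a toy illustration of scaled dot-product attention, softmax(QKᵀ/√d)V, with arbitrary example values; it is not an excerpt from any of the cited models:</p>

```python
import math

def matmul(A, B):
    # Plain nested-list matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy 3-token sequence with 2-dimensional embeddings; in self-attention
# over a single input, Q, K and V are projections of the same sequence.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X)
# Each row of w sums to 1: the attention distribution for that token.
```

<p>Multi-head attention runs several such computations in parallel, each with its own learned projections of Q, K and V, which is what allows the transformer to attend to different aspects of the input at once.</p>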
</sec>
<sec id="s1d">
<title>Research question and main contributions</title>
<p>The recent developments in deep learning research motivated this study to assess how state-of-the-art machine learning algorithms can help public opinion evaluation on infrastructure projects. The main contributions of this study include:</p>
<list list-type="order">
<list-item>
<p>This study proposed a hybrid transformer-recurrent neural network model for sentiment analysis, which combines the pre-trained Robustly optimised BERT approach (RoBERTa) [<xref ref-type="bibr" rid="r22">22</xref>] and bidirectional gated recurrent neural networks [<xref ref-type="bibr" rid="r23">23</xref>].</p>
</list-item>
<list-item>
<p>This study employed tweet data about High Speed 2 (HS2) as a case study, utilising it to compare the performance of the proposed RoBERTa–bidirectional gated recurrent unit (BiGRU) model with baseline classifiers. Moreover, this study applied topic modelling with latent Dirichlet allocation (LDA) to the tweet corpus.</p>
</list-item>
<list-item>
<p>Based on the insights from the case study results, the study proposes a public opinion evaluation framework that leverages social media data with RoBERTa–BiGRU and topic modelling. This framework provides a valuable tool for policymakers to evaluate public opinion effectively.</p>
</list-item>
</list>
<p>The rest of this article is organised as follows: the machine learning models section provides a detailed exposition of the machine learning algorithms used in this study. The case study with the High Speed 2 project section delves into the specific details and findings. This is followed by a discussion of the limitations of this research and potential avenues for future work. Finally, the conclusion summarises the main findings and contributions.</p>
</sec>
</sec>
<sec id="s2">
<title>Machine learning models</title>
<p>This section provides a comprehensive overview of implementing machine learning algorithms for public opinion evaluation. The formulation of the multinomial naïve Bayes (MNB) classifier is presented. The proposed RoBERTa–BiGRU model is then introduced, highlighting its essential components and architecture. Finally, the topic modelling technique using LDA is discussed.</p>
<sec id="s2a">
<title>Sentiment analysis with an MNB classifier</title>
<p>The naïve Bayes classifier is a family of probabilistic classification models based on Bayes’ theorem [<xref ref-type="bibr" rid="r24">24</xref>]. The term ‘naïve’ refers to the naïve assumption of independence between each pair of features (attributes) given the class variable [<xref ref-type="bibr" rid="r25">25</xref>]. More specifically, the classifier treats the text as a bag-of-words, ignoring relationships among words, such as their order, and considering only word frequencies in the document. Bayes’ theorem, <xref ref-type="disp-formula" rid="ucloe-05-063_eq_001">Eq. (1)</xref>, states that given <italic>n</italic> feature vectors <italic>x</italic>
<sub>1</sub>,…,<italic>x<sub>n</sub>
</italic> and class variable <italic>y</italic>, the probability distribution of <italic>y</italic> is:</p>
<p>
<disp-formula id="ucloe-05-063_eq_001">
<label>(1)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_001" overflow="scroll">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mn>(</mml:mn>
<mml:mi>y</mml:mi>
<mml:mtext>|</mml:mtext>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo> </mml:mo>
<mml:mn>(</mml:mn>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>P</mml:mi>
<mml:mn>(</mml:mn>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mtext>|</mml:mtext>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mn>(</mml:mn>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e001.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>Because the probability distribution of feature vectors <italic>P</italic>(<italic>x</italic>
<sub>1</sub>,…,<italic>x<sub>n</sub>
</italic>) is constant for a given model input, the following classification rules <xref ref-type="disp-formula" rid="ucloe-05-063_eq_002">Eq. (2)</xref> and <xref ref-type="disp-formula" rid="ucloe-05-063_eq_003">Eq. (3)</xref> can be obtained [<xref ref-type="bibr" rid="r26">26</xref>]:</p>
<p>
<disp-formula id="ucloe-05-063_eq_002">
<label>(2)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_002" overflow="scroll">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mn>(</mml:mn>
<mml:mi>y</mml:mi>
<mml:mtext>|</mml:mtext>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>∝</mml:mo>
<mml:mi>P</mml:mi>
<mml:mn>(</mml:mn>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:munderover>
<mml:mstyle displaystyle="true" mathsize="140%">
<mml:mo>∏</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:munderover>
<mml:mi>P</mml:mi>
<mml:mn>(</mml:mn>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mtext>|</mml:mtext>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo> </mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e002.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="ucloe-05-063_eq_003">
<label>(3)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_003" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mi>arg</mml:mi>
<mml:mi>max</mml:mi>
</mml:mrow>
<mml:mi>y</mml:mi>
</mml:munder>
<mml:mi>P</mml:mi>
<mml:mn>(</mml:mn>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:munderover>
<mml:mstyle displaystyle="true" mathsize="140%">
<mml:mo>∏</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:munderover>
<mml:mi>P</mml:mi>
<mml:mn>(</mml:mn>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mtext>|</mml:mtext>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo> </mml:mo>
<mml:mtext>,</mml:mtext>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e003.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>where <italic>P</italic>(<italic>y</italic>) is the frequency distribution of <italic>y</italic> in the training dataset and <italic>P</italic>(<italic>x<sub>i</sub>
</italic>|<italic>y</italic>) is determined by the naïve Bayes classifier assumptions. For example, the Gaussian naïve Bayes classifier assumes <italic>P</italic>(<italic>x<sub>i</sub>
</italic>|<italic>y</italic>) follows a Gaussian distribution.</p>
<p>In the case of the MNB classifier, the multinomial distribution is parameterised by (θ<sub>
<italic>y</italic>1</sub>,…, θ<italic>
<sub>yn</sub>
</italic>) vectors for each <italic>y</italic> with <italic>n</italic> features. θ<italic>
<sub>yi</sub>
</italic> indicates the probability distribution of <italic>x<sub>i</sub>
</italic> under class <italic>y</italic> in the training set. In other words, θ<italic>
<sub>yi</sub>
</italic> = <italic>P</italic>(<italic>x<sub>i</sub>
</italic>|<italic>y</italic>). Then, smoothed maximum likelihood estimation [<xref ref-type="bibr" rid="r27">27</xref>] can be used to estimate θ<italic>
<sub>yi</sub>
</italic>:</p>
<p>
<disp-formula id="ucloe-05-063_eq_004">
<label>(4)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_004" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>θ</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo> </mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mi>y</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>α</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mtext>,</mml:mtext>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e004.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>where <italic>N<sub>yi</sub>
</italic> is the number of occurrences of feature <italic>i</italic> for sentiment class <italic>y</italic>; <italic>N<sub>y</sub>
</italic> is the number of occurrences of all features for <italic>y</italic>; and α is the smoothing prior, which is a hyperparameter to be tuned.</p>
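<p>Equations (1)–(4) translate directly into code. The sketch below trains an MNB classifier on a toy corpus with Laplace smoothing (α = 1) and classifies in log space for numerical stability; the documents and labels are invented for illustration only:</p>

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log P(y) and the smoothed log theta_yi of Eq. (4)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    # Prior: frequency distribution of y in the training data.
    log_prior = {y: math.log(c / len(labels)) for y, c in class_counts.items()}
    log_theta = {}
    for y in class_counts:
        N_y = sum(word_counts[y].values())  # occurrences of all features for y
        log_theta[y] = {
            tok: math.log((word_counts[y][tok] + alpha) / (N_y + alpha * len(vocab)))
            for tok in vocab
        }
    return log_prior, log_theta

def predict(doc, log_prior, log_theta):
    """Classification rule of Eq. (3), evaluated in log space so the
    product of per-word probabilities becomes a sum of logs."""
    scores = {
        y: log_prior[y] + sum(log_theta[y].get(tok, 0.0) for tok in doc)
        for y in log_prior
    }
    return max(scores, key=scores.get)

docs = [["great", "project"], ["huge", "waste"],
        ["great", "benefit"], ["waste", "of", "money"]]
labels = ["pos", "neg", "pos", "neg"]
prior, theta = train_mnb(docs, labels)
print(predict(["great", "benefit"], prior, theta))
# prints pos
```

<p>Working in log space avoids the numerical underflow that the raw product in Eq. (3) would cause on long documents.</p>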
</sec>
<sec id="s2b">
<title>Sentiment analysis with RoBERTa–BiGRU</title>
<p>As noted in the section on recent developments in natural language processing, transformer architectures have remarkable scaling ability and can handle large training data sizes and model parameters. As a result, researchers have proposed fine-tuning a pre-trained large-scale transformer model for specific downstream natural language processing tasks. This approach is referred to as transfer learning, as it leverages knowledge learned from a large-scale database for downstream tasks [<xref ref-type="bibr" rid="r28">28</xref>]. The Bidirectional Encoder Representations from Transformers (BERT) [<xref ref-type="bibr" rid="r21">21</xref>] model is a large language model that achieves state-of-the-art natural language processing performance. BERT encodes text bidirectionally, processing text tokens in both left-to-right and right-to-left directions. This study used a variant of the BERT model named the Robustly optimised BERT approach (RoBERTa) [<xref ref-type="bibr" rid="r22">22</xref>], because RoBERTa is pre-trained on a much larger scale of text data than BERT.</p>
<p>Details of fine-tuning the RoBERTa model for sentiment analysis are shown in <xref ref-type="fig" rid="fg001">Fig. 1</xref>. RoBERTa uses a similar transformer architecture to BERT. The input token sequence is passed to multiple self-attention heads, followed by a layer normalisation [<xref ref-type="bibr" rid="r29">29</xref>]. The normalised data is subsequently sent to feed-forward networks and a second layer normalisation. <xref ref-type="fig" rid="fg001">Figure 1</xref> shows the transformer architecture of a single encoder layer; the RoBERTa model stacks multiple such encoders, depending on the model configuration. A RoBERTa encoder’s hidden states can then be fed into a classifier for classification tasks. Notably, the ‘&lt;cls&gt;’ token provides the global representation of the input text [<xref ref-type="bibr" rid="r28">28</xref>].</p>
<fig fig-type="figure" id="fg001" orientation="portrait" position="float">
<label>Figure 1</label>
<caption>
<p>Fine-tuning RoBERTa for sentiment analysis.</p>
</caption>
<graphic xlink:href="ucloe-05-063-g001.png" orientation="portrait" position="float"/>
</fig>
<p>The classifier can take different neural network architectures, such as a feedforward neural network (FNN) or a recurrent neural network (RNN). The long short-term memory (LSTM) architecture is a prevalent choice for the classifier [<xref ref-type="bibr" rid="r30">30</xref>]. The LSTM adds internal states and gates to the RNN to process information in sequenced data [<xref ref-type="bibr" rid="r31">31</xref>]. The gated recurrent unit (GRU) architecture, proposed by Cho et al. [<xref ref-type="bibr" rid="r23">23</xref>] in 2014, is a streamlined adaptation of the LSTM architecture that retains internal states and gating mechanisms. This study adopted the GRU architecture as the classifier on RoBERTa outputs because the GRU has a faster computation speed than the LSTM with comparable performance [<xref ref-type="bibr" rid="r32">32</xref>].</p>
<p>The GRU model consists of two internal gates: a reset gate and an update gate. The reset gate determines the extent to which information from the previous state is retained, while the update gate controls the proportion of the new state that replicates the old state. The mathematical formulations of the reset gate and update gate are:</p>
<p>
<disp-formula id="ucloe-05-063_eq_005">
<label>(5)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_005" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">R</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>σ</mml:mi>
<mml:mn>(</mml:mn>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">X</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo> </mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e005.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="ucloe-05-063_eq_006">
<label>(6)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_006" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>σ</mml:mi>
<mml:mn>(</mml:mn>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">X</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo> </mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e006.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="ucloe-05-063_eq_007">
<label>(7)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_007" overflow="scroll">
<mml:mrow>
<mml:mi>σ</mml:mi>
<mml:mn>(</mml:mn>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mn>(</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e007.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>where <inline-formula id="ucloe-05-063_eq_015">
<alternatives>
<mml:math id="ucloe-05-063_math_015" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">X</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i001.png"/>
</alternatives>
</inline-formula> is a minibatch input of a memory cell (<italic>n</italic> is the number of samples and <italic>d</italic> is the dimension of the features); <inline-formula id="ucloe-05-063_eq_016">
<alternatives>
<mml:math id="ucloe-05-063_math_016" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i002.png"/>
</alternatives>
</inline-formula> is the hidden state of the previous step (<italic>h</italic> is the number of hidden units of a GRU memory cell); <bold>
<italic>W</italic>
</bold>
<italic>
<sub>ir</sub>
</italic>, <inline-formula id="ucloe-05-063_eq_017">
<alternatives>
<mml:math id="ucloe-05-063_math_017" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i003.png"/>
</alternatives>
</inline-formula> and <bold>
<italic>W</italic>
</bold>
<italic>
<sub>iz</sub>
</italic>, <inline-formula id="ucloe-05-063_eq_018">
<alternatives>
<mml:math id="ucloe-05-063_math_018" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo> </mml:mo>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i004.png"/>
</alternatives>
</inline-formula> are model weights; and <italic>b<sub>ir</sub>
</italic>, <italic>b<sub>hr</sub>
</italic>, <italic>b<sub>iz</sub>
</italic>, and <italic>b<sub>hz</sub>
</italic> are model bias parameters. The reset gate <inline-formula id="ucloe-05-063_eq_019">
<alternatives>
<mml:math id="ucloe-05-063_math_019" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">R</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i005.png"/>
</alternatives>
</inline-formula> and update gate <inline-formula id="ucloe-05-063_eq_020">
<alternatives>
<mml:math id="ucloe-05-063_math_020" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i006.png"/>
</alternatives>
</inline-formula> are computed based on <xref ref-type="disp-formula" rid="ucloe-05-063_eq_005">Eq. (5)</xref> and <xref ref-type="disp-formula" rid="ucloe-05-063_eq_006">Eq. (6)</xref>. In other words, the two gates are fully connected layers with the sigmoid activation function <xref ref-type="disp-formula" rid="ucloe-05-063_eq_007">Eq. (7)</xref>.</p>
<p>The reset gate is designed to yield a candidate hidden state <inline-formula id="ucloe-05-063_eq_021">
<alternatives>
<mml:math id="ucloe-05-063_math_021" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">N</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i007.png"/>
</alternatives>
</inline-formula> with <xref ref-type="disp-formula" rid="ucloe-05-063_eq_008">Eq. (8)</xref> and the tanh activation function <xref ref-type="disp-formula" rid="ucloe-05-063_eq_009">Eq. (9)</xref>. The influence of the previous information <bold>
<italic>H</italic>
</bold>
<sub>
<italic>t</italic>−1</sub> in <xref ref-type="disp-formula" rid="ucloe-05-063_eq_008">Eq. (8)</xref> is reduced by the Hadamard product of <bold>
<italic>R</italic>
</bold>
<italic>
<sub>t</sub>
</italic> and <bold>
<italic>H</italic>
</bold>
<sub>
<italic>t</italic>−1</sub>. The candidate hidden state <bold>
<italic>N</italic>
</bold>
<sub>
<italic>t</italic>
</sub> is then passed to <xref ref-type="disp-formula" rid="ucloe-05-063_eq_010">Eq. (10)</xref> to calculate the new hidden state <bold>
<italic>H</italic>
</bold>
<italic>
<sub>t</sub>
</italic>, in which the update gate <bold>
<italic>Z</italic>
</bold>
<italic>
<sub>t</sub>
</italic> controls the degree to which <bold>
<italic>H</italic>
</bold>
<italic>
<sub>t</sub>
</italic> resembles <italic>
<bold>N</bold>
</italic>
<sub>
<italic>t</italic>
</sub>.</p>
<p>
<disp-formula id="ucloe-05-063_eq_008">
<label>(8)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_008" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">N</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>tanh</mml:mi>
<mml:mn>(</mml:mn>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">X</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">R</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>⊙</mml:mo>
<mml:mn>(</mml:mn>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e008.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="ucloe-05-063_eq_009">
<label>(9)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_009" overflow="scroll">
<mml:mrow>
<mml:mi>tanh</mml:mi>
<mml:mn>(</mml:mn>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mn>(</mml:mn>
<mml:mo>−</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mn>(</mml:mn>
<mml:mo>−</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo> </mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e009.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="ucloe-05-063_eq_010">
<label>(10)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_010" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>(1</mml:mn>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>⊙</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">N</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>⊙</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e010.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>where <inline-formula id="ucloe-05-063_eq_022">
<alternatives>
<mml:math id="ucloe-05-063_math_022" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo> </mml:mo>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i008.png"/>
</alternatives>
</inline-formula> and <inline-formula id="ucloe-05-063_eq_023">
<alternatives>
<mml:math id="ucloe-05-063_math_023" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i009.png"/>
</alternatives>
</inline-formula> are model weights; <italic>b<sub>in</sub>
</italic> and <italic>b<sub>hn</sub>
</italic> are bias parameters; and ⊙ is the Hadamard product, which is also referred to as the element-wise product.</p>
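<p>As an illustration, the gate and state computations of Eqs (5)–(10) can be sketched as follows. This is a minimal NumPy sketch with arbitrary illustrative dimensions and random parameters, not the trained model; the weights are applied as <italic>XW</italic> to match the stated dimensions.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X_t, H_prev, params):
    """One GRU step: reset/update gates, candidate state and hidden-state update."""
    W_ir, W_hr, b_ir, b_hr = params["r"]  # reset-gate parameters
    W_iz, W_hz, b_iz, b_hz = params["z"]  # update-gate parameters
    W_in, W_hn, b_in, b_hn = params["n"]  # candidate-state parameters

    R_t = sigmoid(X_t @ W_ir + b_ir + H_prev @ W_hr + b_hr)          # Eq. (5)
    Z_t = sigmoid(X_t @ W_iz + b_iz + H_prev @ W_hz + b_hz)          # Eq. (6)
    N_t = np.tanh(X_t @ W_in + b_in + R_t * (H_prev @ W_hn + b_hn))  # Eq. (8); * is the Hadamard product
    H_t = (1 - Z_t) * N_t + Z_t * H_prev                             # Eq. (10)
    return H_t

# Illustrative sizes: n examples, d-dimensional inputs, h hidden units
n, d, h = 4, 8, 16
rng = np.random.default_rng(0)
params = {gate: (rng.normal(size=(d, h)), rng.normal(size=(h, h)),
                 np.zeros(h), np.zeros(h))
          for gate in ("r", "z", "n")}
H = gru_step(rng.normal(size=(n, d)), np.zeros((n, h)), params)
print(H.shape)  # (4, 16), i.e. n x h
```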
<p>Similar to the bidirectional setting of BERT, a two-layer GRU can also process text data bidirectionally with a forward layer and a backward layer, as shown in <xref ref-type="fig" rid="fg002">Fig. 2</xref>. The hidden states of the forward and backward layers are denoted as <inline-formula id="ucloe-05-063_eq_024">
<alternatives>
<mml:math id="ucloe-05-063_math_024" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">→</mml:mo>
</mml:mover>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i010.png"/>
</alternatives>
</inline-formula> and <inline-formula id="ucloe-05-063_eq_025">
<alternatives>
<mml:math id="ucloe-05-063_math_025" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">←</mml:mo>
</mml:mover>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi>ℝ</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mtext>.</mml:mtext>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i011.png"/>
</alternatives>
</inline-formula> The forward layer hidden states <inline-formula id="ucloe-05-063_eq_026">
<alternatives>
<mml:math id="ucloe-05-063_math_026" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">→</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i012.png"/>
</alternatives>
</inline-formula> are then multiplied element-wise by a dropout mask, a Bernoulli random variable that equals 0 with probability δ (the dropout rate). The output of the GRU is the concatenation of <inline-formula id="ucloe-05-063_eq_027">
<alternatives>
<mml:math id="ucloe-05-063_math_027" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>δ</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">→</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i013.png"/>
</alternatives>
</inline-formula> and <inline-formula id="ucloe-05-063_eq_028">
<alternatives>
<mml:math id="ucloe-05-063_math_028" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">H</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">←</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="ucloe-05-063-i014.png"/>
</alternatives>
</inline-formula> with dimension <italic>n</italic> × 2<italic>h</italic>.</p>
<fig fig-type="figure" id="fg002" orientation="portrait" position="float">
<label>Figure 2</label>
<caption>
<p>Bidirectional GRU model.</p>
</caption>
<graphic xlink:href="ucloe-05-063-g002.png" orientation="portrait" position="float"/>
</fig>
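<p>The dimension bookkeeping of the bidirectional output can be illustrated with a short sketch; the sizes are illustrative, and the hidden states here are random placeholders standing in for real GRU outputs.</p>

```python
import numpy as np

n, h, delta = 4, 16, 0.5           # sequence length, hidden units, dropout rate
rng = np.random.default_rng(1)

H_fwd = rng.normal(size=(n, h))    # forward-layer hidden states
H_bwd = rng.normal(size=(n, h))    # backward-layer hidden states

# Dropout: a Bernoulli mask zeroes each forward-state entry with probability delta
mask = (rng.random(size=(n, h)) >= delta).astype(float)
H_fwd_dropped = H_fwd * mask

# Concatenating forward and backward states yields an n x 2h output
out = np.concatenate([H_fwd_dropped, H_bwd], axis=1)
print(out.shape)  # (4, 32), i.e. n x 2h
```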
<p>The RoBERTa model can be fine-tuned by connecting the output of the above-mentioned bidirectional GRU to a fully connected layer and optimising the loss function of the combined model.</p>
<p>The loss function optimised is the cross-entropy function [<xref ref-type="bibr" rid="r33">33</xref>], and the fully connected layer uses the softmax activation function <xref ref-type="disp-formula" rid="ucloe-05-063_eq_011">Eq. (11)</xref>:</p>
<p>
<disp-formula id="ucloe-05-063_eq_011">
<label>(11)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_011" overflow="scroll">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mn>(</mml:mn>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle displaystyle="true" mathsize="140%">
<mml:mo>∑</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e011.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>where <italic>n</italic> is the number of sentiment classes. The fully connected layer converts the hidden states of the bidirectional GRU to the probability of each sentiment class.</p>
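<p>A minimal sketch of the softmax computation in Eq. (11), using illustrative scores for a two-class case:</p>

```python
import math

def softmax(scores):
    """Softmax of Eq. (11), with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative fully-connected-layer scores for two sentiment classes
probs = softmax([2.0, -1.0])
print(round(sum(probs), 6))  # 1.0: the outputs form a probability distribution
```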
<p>
<xref ref-type="fig" rid="fg003">Figure 3</xref> demonstrates the complete structure of the RoBERTa–BiGRU model. First, tweets are tokenised with the RoBERTa tokeniser. Then, the tokens are passed through 12 encoders with multiple self-attention heads to obtain 768-dimensional hidden representations of the tweets. These hidden representations are then allocated to sentiment classes through a bidirectional GRU and a fully connected layer.</p>
<fig fig-type="figure" id="fg003" orientation="portrait" position="float">
<label>Figure 3</label>
<caption>
<p>Structure of RoBERTa–BiGRU for sentiment analysis.</p>
</caption>
<graphic xlink:href="ucloe-05-063-g003.png" orientation="portrait" position="float"/>
</fig>
</sec>
<sec id="s2c">
<title>Topic modelling with LDA</title>
<p>Deerwester et al. [<xref ref-type="bibr" rid="r34">34</xref>] proposed a latent semantic indexing method for topic modelling, applying singular value decomposition (SVD) to derive the latent semantic structure model from the matrix of terms from documents. SVD is a linear algebra technique to decompose an arbitrary matrix to its singular values and singular vectors [<xref ref-type="bibr" rid="r35">35</xref>]. Blei et al. [<xref ref-type="bibr" rid="r36">36</xref>] introduced LDA, which is a general probabilistic model of a discrete dataset (text corpus).</p>
<p>LDA is a Bayesian model that represents each document as a finite mixture of topics, and each topic as a probability distribution over words. For example, an article about the structural design of a building complex may cover various topics, including ‘structural layout’ and ‘material’. The topic ‘structural layout’ may have high-frequency words related to structural design, such as ‘beam’, ‘column’, ‘slab’ and ‘resistance’, while the ‘material’ topic may have the words ‘concrete’, ‘steel’, ‘grade’ and ‘yield’. In short, a document has a probabilistic distribution over topics, and each topic has a probabilistic distribution over words. Human supervision is not required in LDA topic modelling, as LDA only needs the number of topics to be specified to perform an analysis.</p>
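<p>The generative view described above can be illustrated with a toy sketch; the topic–word and document–topic distributions below are hypothetical, chosen to mirror the building-design example rather than fitted from any data.</p>

```python
import random

random.seed(42)

# Hypothetical topic-word distributions mirroring the building-design example
topics = {
    "structural layout": {"beam": 0.4, "column": 0.3, "slab": 0.2, "resistance": 0.1},
    "material":          {"concrete": 0.4, "steel": 0.3, "grade": 0.2, "yield": 0.1},
}
# Hypothetical document-topic distribution for one article
doc_topic_mix = {"structural layout": 0.7, "material": 0.3}

def sample_document(n_words):
    """LDA's generative view: pick a topic per word, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = random.choices(list(doc_topic_mix),
                               weights=list(doc_topic_mix.values()))[0]
        word_dist = topics[topic]
        words.append(random.choices(list(word_dist),
                                    weights=list(word_dist.values()))[0])
    return words

doc = sample_document(10)
print(doc)  # a 10-word document drawn from the two topics
```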
<p>Topic modelling with LDA has a wide range of applications in research. Xiao et al. [<xref ref-type="bibr" rid="r37">37</xref>] used an LDA-variant topic model to uncover the probabilistic relationships between adverse drug reaction topics, finding that it achieved higher accuracy than alternative methods. Jiang et al. [<xref ref-type="bibr" rid="r8">8</xref>] showed the feasibility of LDA topic modelling for extracting topics about the Three Gorges Project on a Chinese social media platform. Beyond extracting terms from a textual corpus, topic modelling can also serve as a trend-finding tool, since it reveals relationships between topics. Chuang et al. [<xref ref-type="bibr" rid="r38">38</xref>] proposed a method to visualise topics as circles in a two-dimensional plane, whose centres are determined by the calculated distance between topics. The distance is calculated using Jensen–Shannon divergence, and principal components analysis determines the size of each circle [<xref ref-type="bibr" rid="r39">39</xref>].</p>
</sec>
</sec>
<sec id="s3">
<title>Case study with the HS2 project</title>
<p>This section provides implementation details of the sentiment classifiers and topic modelling methods for the HS2 case study. First, the background of the HS2 project is presented, offering insights into the rail infrastructure project. This is followed by an explanation of the data collection and processing, detailing the methods employed to gather social media data related to HS2. The evaluation metrics used to assess the performance of sentiment classifiers are then presented, enabling a thorough examination of the sentiment classification models. The following two sections show the results of sentiment analysis and topic modelling, respectively. Finally, a framework for evaluating public opinion based on social media data is introduced.</p>
<sec id="s3a">
<title>Background on the HS2 project</title>
<p>The transportation demand for the UK railway network has grown steadily over the past decades. According to the Department for Transport [<xref ref-type="bibr" rid="r40">40</xref>], rail demand has doubled since 1994–1995, rising at 3% every year. The HS2 programme was therefore proposed to construct a new high-speed, high-capacity railway, aiming to boost the UK economy, improve connectivity by shortening journey times, provide sufficient capacity to meet future railway network demand and reduce carbon emissions by cutting long-distance driving. <xref ref-type="fig" rid="fg004">Figure 4</xref> shows that HS2 will connect London, Leeds, Birmingham and Manchester, joining the existing railway infrastructure to allow passengers to travel to Glasgow, Newcastle and Liverpool [<xref ref-type="bibr" rid="r41">41</xref>].</p>
<fig fig-type="figure" id="fg004" orientation="portrait" position="float">
<label>Figure 4</label>
<caption>
<p>HS2 infrastructure map [<xref ref-type="bibr" rid="r41">41</xref>].</p>
</caption>
<graphic xlink:href="ucloe-05-063-g004.png" orientation="portrait" position="float"/>
</fig>
</sec>
<sec id="s3b">
<title>Data preparation</title>
<p>HS2-related tweets were collected using the Twitter application programming interface (API). Specifically, tweets containing the hashtags ‘#HS2’ and ‘#HighSpeed2’ were collected. However, the number of collectable tweets is constrained by the Twitter API, which restricts the collection to under 10,000 tweets; the total number of tweets collected was 8623. The tweets were sampled over a 4-year period from 2017 to 2020, distributed as follows: 2017 (1544 tweets), 2018 (1130 tweets), 2019 (2909 tweets) and 2020 (3040 tweets). Notably, the tweets were collected in extended mode, allowing retrieval of the complete text beyond the 140-character limit.</p>
<p>Data preprocessing involves cleaning and preparing data to increase the accuracy and performance of text-mining tasks, such as sentiment analysis and topic modelling. Tweet text tends to contain uninformative content, such as URL links, Twitter usernames and email addresses. For the MNB and lexicon-based classifiers, stop words also need to be removed. Stop words are words with no sentiment orientation, such as ‘me’, ‘you’, ‘is’, ‘our’, ‘him’ and ‘her’. As each word in the text is treated as a dimension, keeping stop words and uninformative text complicates text mining by turning it into a high-dimensional problem [<xref ref-type="bibr" rid="r42">42</xref>]. Other preprocessing techniques for the MNB and lexicon-based classifiers include lowercasing and stemming. Notably, transformer architectures do not require stop-word removal, lowercasing or stemming, as transformers are able to exploit the information implied by stop words.</p>
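<p>A minimal sketch of such a cleaning step is given below; the regular expressions and the stop-word list are illustrative samples, not the exact ones used in this study, and hashtags are stripped here purely for simplicity.</p>

```python
import re

# Illustrative stop-word sample, not the full list used in the study
STOP_WORDS = {"me", "you", "is", "our", "him", "her", "the", "a", "to"}

def clean_tweet(text):
    text = re.sub(r"https?://\S+", " ", text)      # strip URL links
    text = re.sub(r"[@#]\w+", " ", text)           # strip usernames and hashtags
    text = re.sub(r"\S+@\S+", " ", text)           # strip email addresses
    tokens = re.findall(r"[a-z']+", text.lower())  # lowercase and tokenise
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_tweet("@user1 The #HS2 budget is huge https://t.co/xyz"))
# → ['budget', 'huge']
```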
<p>Upon conducting a manual inspection of collected tweets, the number of tweets expressing positive sentiment was significantly lower than those with negative or neutral sentiment. The sentiment classification task is set to binary to address the imbalance issue. The task was designed to classify tweets as either having negative sentiment or non-negative sentiment (including neutral and positive sentiments). A set of 1400 tweets was carefully annotated to train classifiers in this case study. Within this annotated dataset, 700 tweets were labelled as negative sentiment, while the remaining 700 tweets were labelled as non-negative sentiment. To access the annotated training tweets, a GitHub link is provided in the Open data and materials availability statement, facilitating transparency and reproducibility of this study. The annotated tweets were split into 70% training dataset (980 tweets) and 30% validation dataset (420 tweets).</p>
</sec>
<sec id="s3c">
<title>Sentiment analysis results</title>
<p>Three sentiment classifiers were used in this case study: (1) VADER [<xref ref-type="bibr" rid="r43">43</xref>], a rule-based lexicon sentiment classifier; (2) an MNB classifier, built as described earlier; and (3) a RoBERTa–BiGRU model developed from the architecture presented earlier. The model details of each classifier are shown in <xref ref-type="table" rid="tb001">Table 1</xref>. The hyperparameters of MNB and RoBERTa–BiGRU, such as the smoothing prior <italic>α</italic>, batch size, hidden units and dropout rate, were tuned by a grid search. The RoBERTa–BiGRU model was trained on a Tesla T4 GPU on Google Colab, with a total training time of 2421.23 s for 100 epochs.</p>
<table-wrap id="tb001" orientation="portrait" position="float">
<label>Table 1.</label>
<caption>
<p>Model details of each classifier</p>
</caption>
<table>
<thead>
<tr>
<th align="left" colspan="1" rowspan="1" valign="top">Name</th>
<th align="left" colspan="1" rowspan="1" valign="top">Model parameters</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">VADER</td>
<td align="left" colspan="1" rowspan="1" valign="top">Rules specified in [<xref ref-type="bibr" rid="r43">43</xref>]</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">MNB</td>
<td align="left" colspan="1" rowspan="1" valign="top">Smoothing priors: <italic>α</italic> = 0.1</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">RoBERTa–BiGRU</td>
<td align="left" colspan="1" rowspan="1" valign="top">Batch size: 16<break/>Hidden units: 256<break/>Dropout rate: 0.5<break/>Optimiser: AdamW<break/>Learning rate: 2 × 10<sup>−6</sup>
<break/>Epochs: 100</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The performances of the three classifiers were evaluated with accuracy and the receiver operating characteristic (ROC) curve. Accuracy, as shown in <xref ref-type="disp-formula" rid="ucloe-05-063_eq_012">Eq. (12)</xref>, measures the proportion of all cases that a classifier identifies correctly. A ROC curve plots the true positive rate, as shown in <xref ref-type="disp-formula" rid="ucloe-05-063_eq_013">Eq. (13)</xref>, along the <italic>y</italic> axis against the false positive rate, as shown in <xref ref-type="disp-formula" rid="ucloe-05-063_eq_014">Eq. (14)</xref>, along the <italic>x</italic> axis, giving a graphical interpretation of gain (true positive rate) against loss (false positive rate) [<xref ref-type="bibr" rid="r44">44</xref>]. The area under the curve (AUC) score is the total area under the ROC curve; it quantitatively evaluates the performance of a classifier and represents the probability that a random positive datapoint ranks higher than a random negative datapoint [<xref ref-type="bibr" rid="r45">45</xref>].</p>
<p>
<disp-formula id="ucloe-05-063_eq_012">
<label>(12)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_012" overflow="scroll">
<mml:mrow>
<mml:mtext>accuracy </mml:mtext>
<mml:mo>=</mml:mo>
<mml:mtext> </mml:mtext>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e012.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="ucloe-05-063_eq_013">
<label>(13)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_013" overflow="scroll">
<mml:mrow>
<mml:mtext>true positive rate (recall) </mml:mtext>
<mml:mo>=</mml:mo>
<mml:mtext> </mml:mtext>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e013.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>
<disp-formula id="ucloe-05-063_eq_014">
<label>(14)</label>
<alternatives>
<mml:math display="block" id="ucloe-05-063_math_014" overflow="scroll">
<mml:mrow>
<mml:mtext>false positive rate </mml:mtext>
<mml:mo>=</mml:mo>
<mml:mtext> </mml:mtext>
<mml:mfrac>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<graphic xlink:href="ucloe-05-063-e014.png" orientation="portrait" position="float"/>
</alternatives>
</disp-formula>
</p>
<p>where <italic>TP</italic> = true positive, <italic>TN</italic> = true negative, <italic>FP</italic> = false positive and <italic>FN</italic> = false negative.</p>
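<p>Eqs (12)–(14) and the rank interpretation of the AUC score can be sketched directly; the confusion counts and classifier scores below are illustrative values, not this study’s results.</p>

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (12): proportion of correctly classified cases."""
    return (tp + tn) / (tp + tn + fp + fn)

def true_positive_rate(tp, fn):
    """Eq. (13): recall."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Eq. (14)."""
    return fp / (fp + tn)

def auc_by_rank(pos_scores, neg_scores):
    """AUC as the probability that a random positive datapoint ranks above a
    random negative one (ties count as half)."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative confusion counts and classifier scores
print(accuracy(40, 45, 5, 10))     # 0.85
print(true_positive_rate(40, 10))  # 0.8
print(false_positive_rate(5, 45))  # 0.1
print(auc_by_rank([0.9, 0.8, 0.4], [0.7, 0.3]))  # 5/6 ≈ 0.83
```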
<p>
<xref ref-type="table" rid="tb002">Table 2</xref> shows the accuracy of each sentiment classifier. The lexicon-based VADER has the lowest accuracy (70.24%) of the three classifiers. MNB and RoBERTa–BiGRU perform better, with accuracies higher by 12.38 and 19.28 percentage points, respectively. MNB and RoBERTa–BiGRU are then compared with respect to their AUC scores: MNB has an AUC score of 0.9023, while RoBERTa–BiGRU has a slightly lower AUC score of 0.8904. Both scores are around 0.9, indicating that both models have a high ability to classify tweet sentiment. Notably, the ROC curve in <xref ref-type="fig" rid="fg005">Fig. 5(b)</xref> is much steeper, meaning that RoBERTa–BiGRU can achieve high recall at a low false positive rate, which is desirable behaviour in sentiment analysis. As a result, RoBERTa–BiGRU has the best performance in terms of both accuracy and the ROC curve, and was therefore used for sentiment analysis on all collected tweets.</p>
<fig fig-type="figure" id="fg005" orientation="portrait" position="float">
<label>Figure 5</label>
<caption>
<p>(a) ROC curve for MNB classifier.</p>
<p>(b) ROC curve for RoBERTa–BiGRU.</p>
</caption>
<graphic xlink:href="ucloe-05-063-g005.png" orientation="portrait" position="float"/>
</fig>
<table-wrap id="tb002" orientation="portrait" position="float">
<label>Table 2.</label>
<caption>
<p>Model accuracy performance</p>
</caption>
<table>
<thead>
<tr>
<th align="left" colspan="1" rowspan="1" valign="top">Name</th>
<th align="left" colspan="1" rowspan="1" valign="top">Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">VADER</td>
<td align="left" colspan="1" rowspan="1" valign="top">70.24%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">MNB</td>
<td align="left" colspan="1" rowspan="1" valign="top">82.62%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">RoBERTa–BiGRU</td>
<td align="left" colspan="1" rowspan="1" valign="top">89.52%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>
<xref ref-type="fig" rid="fg006">Figure 6</xref> shows the sentiment distribution of HS2-related tweets from 2017 to 2020. Notably, there was a substantial increase in the number of tweets in 2019, indicating a heightened presence of the HS2 project in social media discussions during and after that year. Moreover, it is worth mentioning that the majority of tweets collected across all time periods exhibited a negative sentiment. Specifically, negative tweets accounted for 57.77% in 2017, 53.32% in 2018, 60.64% in 2019 and 65.19% in 2020.</p>
<fig fig-type="figure" id="fg006" orientation="portrait" position="float">
<label>Figure 6</label>
<caption>
<p>Sentiment analysis results for 2017–2020.</p>
</caption>
<graphic xlink:href="ucloe-05-063-g006.png" orientation="portrait" position="float"/>
</fig>
<p>The substantial proportion of negative tweets in all periods indicates a prevailing negative sentiment among the public regarding HS2, highlighting the importance for policymakers and decision-makers of taking this sentiment into consideration. However, these findings should be approached with caution. While the high percentage of negative tweets may raise concerns, this alone does not necessarily imply a public relations crisis for HS2. Certain Twitter users might repeatedly express their negative sentiment towards HS2 [<xref ref-type="bibr" rid="r46">46</xref>], potentially skewing the overall sentiment distribution. Given the sentiment analysis results, it is important to uncover the key topics within the tweet discussions, necessitating the application of topic modelling.</p>
</sec>
<sec id="s3d">
<title>Topic modelling results</title>
<p>The tweet dataset was then classified by the RoBERTa–BiGRU model into two collections: a negative corpus and a non-negative corpus. Topic modelling and visualisation were performed on each collection individually. Topic modelling with LDA was carried out with gensim, a collection of Python scripts developed by Rehurek and Sojka [<xref ref-type="bibr" rid="r47">47</xref>]. We used pyLDAvis, a Python library for visualising topics, to determine the most suitable number of topics. Several models were constructed with the number of topics ranging from 3 to 20, and five topics were selected through manual inspection of term distributions and topic relevance.</p>
<sec id="s3d1">
<title>Negative tweets corpus</title>
<p>
<xref ref-type="table" rid="tb003">Table 3</xref> shows the major topics in the negative corpus. Topic 1 is the largest topic, accounting for 35.3% of the negative corpus. It contains words such as ‘need’, ‘money’, ‘nhs’, ‘badly’ and ‘billion’, which express negative sentiment about HS2 budget spending: these tweets criticise the over-spending on HS2 and argue that the money should be invested in the National Health Service (NHS) instead. Topic 2 and Topic 4 have a similar focus. Topic 2 has words such as ‘government’, ‘protester’ and ‘social’, and Topic 4 includes words such as ‘stophs2’, ‘petition’, ‘media’ and ‘political’; both discuss the campaign to stop the HS2 project by petition. Topic 3 and Topic 5 are also related. Topic 3 contains ‘stop’, ‘please’, ‘trees’, ‘contractors’, ‘changed’ and ‘essential’, raising environmental concerns about construction work on woodlands, and Topic 5 also discusses environmental issues with the words ‘construction’, ‘damage’ and ‘destroy’.</p>
<table-wrap id="tb003" orientation="portrait" position="float">
<label>Table 3.</label>
<caption>
<p>Topics in negative corpus</p>
</caption>
<table>
<thead>
<tr>
<th align="left" colspan="1" rowspan="1" valign="top">Topic number</th>
<th align="left" colspan="1" rowspan="1" valign="top">Terms</th>
<th align="left" colspan="1" rowspan="1" valign="top">Topic percentage</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">1</td>
<td align="left" colspan="1" rowspan="1" valign="top">Borisjohnson, hs2, work, time, need, money, say, nhs, use, uk, course, amp, transport, nt, cancel, even local, badly, billion, ancient, public, needed, boris, way, think, country, rishisunak, trains, know</td>
<td align="left" colspan="1" rowspan="1" valign="top">35.3%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">2</td>
<td align="left" colspan="1" rowspan="1" valign="top">Rail, government, going, still, protesters, like, news, case, go, social, could, economic, train, people, home, London, times, business, ltd, working, travel, back, road, north, sense, says, dont</td>
<td align="left" colspan="1" rowspan="1" valign="top">24.2%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">3</td>
<td align="left" colspan="1" rowspan="1" valign="top">Stop, post, mps, please, another, anti, away, seems, trees, make, already, without, contractors, may, changed, control, steeple, long, big, bill, sign, essential, protest, claydon, likely, means, yet, billions, station, caught</td>
<td align="left" colspan="1" rowspan="1" valign="top">13.9%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">4</td>
<td align="left" colspan="1" rowspan="1" valign="top">Stophs2, workers, petition, sites, via, take, destruction, ever, change, media, track, year, ukparliament, least, investment, everyone, account, despite, find, continue, political, wants, white, along, british, longer, evidence, called, massive, elephant</td>
<td align="left" colspan="1" rowspan="1" valign="top">13.6%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">5</td>
<td align="left" colspan="1" rowspan="1" valign="top">Report, scrap, construction, costs, last, end, law, latest, true, tax, first, damage, full, job, trident, nesting, figures, wonder, share, read, unnecessary, questions, destroy, failed, coming, vital</td>
<td align="left" colspan="1" rowspan="1" valign="top">13.1%</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s3d2">
<title>Non-negative tweets corpus</title>
<p>
<xref ref-type="table" rid="tb004">Table 4</xref> shows the topics in the non-negative corpus. Topic 1 includes words such as ‘new’, ‘railway’, ‘good’, ‘midlands’ and ‘important’; these tweets express positive sentiment on HS2 by mentioning its positive effect on the Midlands. A similar result can be found in Topic 3, which includes words such as ‘planning’, ‘Manchester’, ‘airport’, ‘benefit’ and ‘better’, highlighting that transportation infrastructure in Manchester could benefit from the HS2 project. Topic 2 discusses the business case of HS2 with words such as ‘project’, ‘business’, ‘build’, ‘network’ and ‘industry’. Topics 4 and 5 both discuss potential improvements in airport accessibility, with words such as ‘heathrow’, ‘airports’ and ‘opportunities’. Overall, LDA topic modelling performed well in extracting key topics from the tweet corpus.</p>
<table-wrap id="tb004" orientation="portrait" position="float">
<label>Table 4.</label>
<caption>
<p>Topics in non-negative corpus</p>
</caption>
<table>
<thead>
<tr>
<th align="left" colspan="1" rowspan="1" valign="top">Topic number</th>
<th align="left" colspan="1" rowspan="1" valign="top">Terms</th>
<th align="left" colspan="1" rowspan="1" valign="top">Topic percentage</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">1</td>
<td align="left" colspan="1" rowspan="1" valign="top">Work, new, project, one, railway, station, first, time, may, ever, people, plans, common, good, midlands, find, watch, still, well, way, may, could, largest, part, back, important, day</td>
<td align="left" colspan="1" rowspan="1" valign="top">35.4%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">2</td>
<td align="left" colspan="1" rowspan="1" valign="top">Construction, hs2ltd, rail, post, projects, train, business, build, track, road, read, network, phase, industry, latest, leaders, think, green, big, please, works, air, know, local, year, along</td>
<td align="left" colspan="1" rowspan="1" valign="top">24.4%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">3</td>
<td align="left" colspan="1" rowspan="1" valign="top">High, speed, need, old, north, planning, would, capacity, built, engineering, course, Manchester, building, another, plan, recent, airport, must, benefit, needs, evidence, better, needed, chief, funding</td>
<td align="left" colspan="1" rowspan="1" valign="top">15.9%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">4</td>
<td align="left" colspan="1" rowspan="1" valign="top">Government, news, trains, us, would, home, two, heathrow, cost, start, railways, service, suppliers, roads, update, every, keep, seems, question, longer, join, money</td>
<td align="left" colspan="1" rowspan="1" valign="top">13.3%</td>
</tr>
<tr>
<td align="left" colspan="1" rowspan="1" valign="top">5</td>
<td align="left" colspan="1" rowspan="1" valign="top">Stations, use, lake, community, following, scheme, economic, really, opportunities, spending, committee, supply, benefits, due, chain, role, early, daily, fund, freight, article, essential, airports</td>
<td align="left" colspan="1" rowspan="1" valign="top">11.1%</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s3e">
<title>Proposed public opinion evaluation framework using social media data</title>
<p>The case study results showed that the RoBERTa–BiGRU model and LDA topic modelling perform well in evaluating public opinion on HS2 with tweet data. Hence, a public opinion evaluation framework using social media data is proposed to facilitate the decision-making of policymakers.</p>
<p>
<xref ref-type="fig" rid="fg007">Figure 7</xref> presents the comprehensive public opinion evaluation framework that utilises social media data. The process begins by collecting social media data, such as tweets, and storing them in a database. Subsequently, the social media data is processed through sentiment annotation, which involves labelling the data to create training sets. These training sets are then utilised for training a sentiment classifier called RoBERTa–BiGRU. Once the RoBERTa–BiGRU sentiment classifier is trained, it is employed to categorise social media tweets into predefined sentiment labels. Additionally, leveraging LDA topic modelling, the framework extracts key topics from the social media data. Policymakers can subsequently utilise the sentiment analysis results and key topics to evaluate public opinion regarding infrastructure projects.</p>
<fig fig-type="figure" id="fg007" orientation="portrait" position="float">
<label>Figure 7</label>
<caption>
<p>Public opinion evaluation framework.</p>
</caption>
<graphic xlink:href="ucloe-05-063-g007.png" orientation="portrait" position="float"/>
</fig>
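The framework's stages can be sketched schematically. In the sketch below a trivial keyword rule stands in for the trained RoBERTa–BiGRU classifier, and all function and variable names are illustrative, not taken from the released code.

```python
# Schematic sketch of the evaluation framework's flow:
# classify each tweet's sentiment, then group tweets into
# per-sentiment corpora for downstream topic modelling.
# The keyword rule is a stand-in for the trained model.

NEGATIVE_CUES = {"scrap", "waste", "stophs2", "destroy"}

def classify_sentiment(tweet: str) -> str:
    """Stand-in for the RoBERTa-BiGRU sentiment classifier."""
    words = set(tweet.lower().replace("#", "").split())
    return "negative" if words & NEGATIVE_CUES else "non-negative"

def build_sentiment_corpora(tweets):
    """Group tweets by predicted sentiment for topic modelling."""
    corpora = {"negative": [], "non-negative": []}
    for t in tweets:
        corpora[classify_sentiment(t)].append(t)
    return corpora

corpora = build_sentiment_corpora([
    "Scrap HS2 and fund the NHS instead",
    "HS2 will improve capacity in the Midlands",
])
print({k: len(v) for k, v in corpora.items()})
```

In the full framework, each resulting corpus would then be passed to LDA, and the sentiment proportions and extracted topics reported to policymakers.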
</sec>
</sec>
<sec id="s4">
<title>Limitation and future research direction</title>
<sec id="s4a">
<title>Human factors in annotating tweets sentiments</title>
<p>Researchers usually assign multiple annotators (three to five) to tag sentiment orientation, to minimise the influence of individual human annotators [<xref ref-type="bibr" rid="r48">48</xref>]. In our study, however, all the training data was tagged by a single annotator. As a result, human factors may have affected the accuracy of the sentiment classifier. Future applications of fine-tuned sentiment classifiers could benefit from using multiple annotators.</p>
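With multiple annotators, inter-annotator agreement can be quantified, for example with Cohen's kappa. The sketch below uses hypothetical labels (1 = negative, 0 = non-negative) purely for illustration, not data from this study.

```python
# Cohen's kappa for two hypothetical annotators' sentiment labels:
# chance-corrected agreement = (observed - expected) / (1 - expected).
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    # Proportion of items on which the annotators agree
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each annotator's label rates
    ca, cb = Counter(a), Counter(b)
    expected = sum((ca[c] / n) * (cb[c] / n) for c in set(a) | set(b))
    return (observed - expected) / (1 - expected)

annotator_1 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.583
```

Kappa values well below 1 would signal that the labelling guidelines need refinement before training the classifier.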
<p>Human factors can also lead to differing sentiment interpretations. For example, the following tweet may be tagged with different sentiment orientations: ‘#HS2 is a £100bn scheme to have slightly shorter journey times from Manchester and Birmingham to London, thereby solving Britain’s biggest ever problem.’ One annotator might argue that there are positive sentiment signs (shorter journey times and solving problems). In contrast, another annotator could argue that the tweet uses a sarcastic tone to express negative sentiment towards HS2’s budget overruns.</p>
</sec>
<sec id="s4b">
<title>Topic modelling challenges</title>
<p>Text documents are combinations of probabilistic distributions of topics, and each topic is a probabilistic distribution of words. However, tweets are short microblogs with a character limit (280 characters) and usually contain only one topic. Therefore, LDA may have problems in estimating the probabilistic distribution of topics in tweet data. The performance of tweet topic modelling could be improved with neural topic models, which leverage deep generative models [<xref ref-type="bibr" rid="r49">49</xref>]. Future research on public opinion evaluation with social media data could use Bayesian networks; in particular, gamma-belief networks have shown promising results in yielding structured topics [<xref ref-type="bibr" rid="r50">50</xref>].</p>
</sec>
</sec>
<sec id="s5">
<title>Conclusion</title>
<p>This study utilised tweet data from the HS2 project as a case study. The tweet data were used to compare the performance of the proposed RoBERTa–BiGRU model with MNB and VADER; RoBERTa–BiGRU showed the best performance in terms of accuracy and ROC curves. Additionally, the study employed LDA to uncover key topics within the tweet corpus, enhancing understanding of the prominent themes surrounding the HS2 project. The insights derived from the HS2 case study lay the foundation for a public opinion evaluation framework. This framework, driven by social media data, is a valuable tool for policymakers to evaluate public sentiment effectively. Overall, this study contributes to the field of public opinion evaluation by introducing a hybrid model, presenting a comprehensive case study analysis and proposing a practical framework for public opinion evaluation.</p>
</sec>
</body>
<back>
<sec id="s6">
<title>Authorship contribution</title>
<p>This research was initially conducted as part of the requirements for the MSc in Civil Engineering at University College London. Mr Ruiqiu Yao was supervised by Dr Andrew Gillen for his MSc dissertation. The general topic and use of social media data were proposed by Dr Gillen, and they met regularly to discuss the research process. Mr Yao conducted the literature review as well as the data collection and analysis, identifying relevant sources of data and analytical tools. Mr Yao drafted the manuscript and Dr Gillen provided feedback on drafts.</p>
</sec>
<sec id="s7" sec-type="data-availability">
<title>Open data and materials availability statement</title>
<p>The datasets generated during and/or analysed during the current study are available in the repository: <ext-link ext-link-type="uri" xlink:href="https://github.com/RY7415/OpinionAnalysisSocialMedia">https://github.com/RY7415/OpinionAnalysisSocialMedia</ext-link>. This includes the collected data (anonymised) and the Python source code.</p>
</sec>
<sec id="s8">
<title>Declarations and conflicts of interest</title>
<sec id="s8a">
<title>Research ethics statement</title>
<p>The authors conducted the research reported in this article in accordance with UCL Research Ethics standards.</p>
</sec>
<sec id="s8b">
<title>Consent for publication statement</title>
<p>The authors declare that research participants’ informed consent to publication of findings – including photos, videos and any personal or identifiable information – was secured prior to publication.</p>
</sec>
<sec id="s8c" sec-type="COI-statement">
<title>Conflicts of interest statement</title>
<p>The authors declare no conflicts of interest with this work.</p>
</sec>
</sec>
<ref-list>
<title>References</title>
<ref id="r1">
<label>[1]</label>
<element-citation publication-type="book">
<collab>HM Treasury</collab>
<source>National infrastructure strategy</source>
<year>2020</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="https://www.gov.uk/government/publications/national-infrastructure-strategy">https://www.gov.uk/government/publications/national-infrastructure-strategy</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r2">
<label>[2]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Hayes</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Addressing the environmental impacts of large infrastructure projects: making ‘mitigation’ matter</article-title>
<source>Environ Law Rep</source>
<year>2014</year>
<volume>44</volume>
<elocation-id>10016</elocation-id>
<comment>
<ext-link ext-link-type="uri" xlink:href="https://heinonline.org/HOL/Page?handle=hein.journals/elrna44&amp;id=18&amp;collection=journals&amp;index=#">https://heinonline.org/HOL/Page?handle=hein.journals/elrna44&amp;id=18&amp;collection=journals&amp;index=#</ext-link>
</comment>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
</element-citation>
</ref>
<ref id="r3">
<label>[3]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>O’Faircheallaigh</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Public participation and environmental impact assessment: purposes, implications, and lessons for public policy making</article-title>
<source>Environ Impact Assess Rev</source>
<year>2010</year>
<volume>30</volume>
<issue>1</issue>
<fpage>19</fpage>
<lpage>27</lpage>
<pub-id pub-id-type="doi">10.1016/j.eiar.2009.05.001</pub-id>
</element-citation>
</ref>
<ref id="r4">
<label>[4]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Checkoway</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>The politics of public hearings</article-title>
<source>J Appl Behav Sci</source>
<year>1981</year>
<volume>17</volume>
<issue>4</issue>
<fpage>566</fpage>
<lpage>582</lpage>
<pub-id pub-id-type="doi">10.1177/002188638101700411</pub-id>
</element-citation>
</ref>
<ref id="r5">
<label>[5]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Heberlein</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Some observations on alternative mechanisms for public involvement: the hearing, public opinion poll, the workshop and the quasi-experiment</article-title>
<source>Nat Resour J</source>
<year>1976</year>
<volume>16</volume>
<issue>1</issue>
<fpage>197</fpage>
<lpage>212</lpage>
</element-citation>
</ref>
<ref id="r6">
<label>[6]</label>
<element-citation publication-type="thesis">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ding</surname>
<given-names>Q</given-names>
</name>
</person-group>
<source>Using social media to evaluate public acceptance of infrastructure projects. Thesis</source>
<publisher-name>University of Maryland</publisher-name>
<year>2018</year>
<pub-id pub-id-type="doi">10.13016/M27M0437D</pub-id>
</element-citation>
</ref>
<ref id="r7">
<label>[7]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>O’Connor</surname>
<given-names>B</given-names>
</name>
<name name-style="western">
<surname>Balasubramanyan</surname>
<given-names>R</given-names>
</name>
<name name-style="western">
<surname>Routledge</surname>
<given-names>BR</given-names>
</name>
<name name-style="western">
<surname>Smith</surname>
<given-names>NA</given-names>
</name>
</person-group>
<chapter-title>From tweets to polls: linking text sentiment to public opinion time series</chapter-title>
<conf-name>Proceedings of the fourth international AAAI conference on weblogs and social media</conf-name>
<conf-date>23–26 May 2010</conf-date>
<conf-loc>Washington, USA</conf-loc>
<year>2010</year>
</element-citation>
</ref>
<ref id="r8">
<label>[8]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Jiang</surname>
<given-names>H</given-names>
</name>
<name name-style="western">
<surname>Qiang</surname>
<given-names>M</given-names>
</name>
<name name-style="western">
<surname>Lin</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Assessment of online public opinions on large infrastructure projects: a case study of the Three Gorges Project in China</article-title>
<source>Environ Impact Assess Rev</source>
<year>2016</year>
<volume>61</volume>
<fpage>38</fpage>
<lpage>51</lpage>
<pub-id pub-id-type="doi">10.1016/j.eiar.2016.06.004</pub-id>
</element-citation>
</ref>
<ref id="r9">
<label>[9]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kaplan</surname>
<given-names>AM</given-names>
</name>
<name name-style="western">
<surname>Haenlein</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Users of the world, unite! The challenges and opportunities of social media</article-title>
<source>Bus Horiz</source>
<year>2010</year>
<volume>53</volume>
<issue>1</issue>
<fpage>59</fpage>
<lpage>68</lpage>
<pub-id pub-id-type="doi">10.1016/j.bushor.2009.09.003</pub-id>
</element-citation>
</ref>
<ref id="r10">
<label>[10]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Park</surname>
<given-names>SB</given-names>
</name>
<name name-style="western">
<surname>Ok</surname>
<given-names>CM</given-names>
</name>
<name name-style="western">
<surname>Chae</surname>
<given-names>BK</given-names>
</name>
</person-group>
<article-title>Using Twitter data for cruise tourism marketing and research</article-title>
<source>J Travel Tour Mark</source>
<year>2016</year>
<volume>33</volume>
<issue>6</issue>
<fpage>885</fpage>
<lpage>898</lpage>
<pub-id pub-id-type="doi">10.1080/10548408.2015.1071688</pub-id>
</element-citation>
</ref>
<ref id="r11">
<label>[11]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Aldahawi</surname>
<given-names>HA</given-names>
</name>
</person-group>
<source>Mining and analysing social network in the oil business: twitter sentiment analysis and prediction approaches</source>
<publisher-name>Cardiff University</publisher-name>
<year>2015</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>
<ext-link ext-link-type="uri" xlink:href="https://orca.cardiff.ac.uk/id/eprint/85006/1/2015aldahawihphd.pdf.pdf">https://orca.cardiff.ac.uk/id/eprint/85006/1/2015aldahawihphd.pdf.pdf</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r12">
<label>[12]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kim</surname>
<given-names>DS</given-names>
</name>
<name name-style="western">
<surname>Kim</surname>
<given-names>JW</given-names>
</name>
</person-group>
<article-title>Public opinion sensing and trend analysis on social media: a study on nuclear power on Twitter 1</article-title>
<source>Int J Multimedia Ubiquitous Eng</source>
<year>2014</year>
<volume>9</volume>
<issue>11</issue>
<fpage>373</fpage>
<lpage>384</lpage>
<pub-id pub-id-type="doi">10.14257/ijmue.2014.9.11.36</pub-id>
</element-citation>
</ref>
<ref id="r13">
<label>[13]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Taboada</surname>
<given-names>M</given-names>
</name>
<name name-style="western">
<surname>Brooke</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Tofiloski</surname>
<given-names>M</given-names>
</name>
<name name-style="western">
<surname>Voll</surname>
<given-names>K</given-names>
</name>
<name name-style="western">
<surname>Stede</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Lexicon-based methods for sentiment analysis</article-title>
<source>Comput Linguist</source>
<year>2011</year>
<volume>37</volume>
<issue>2</issue>
<fpage>267</fpage>
<lpage>307</lpage>
<pub-id pub-id-type="doi">10.1162/COLI_a_00049</pub-id>
</element-citation>
</ref>
<ref id="r14">
<label>[14]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ku</surname>
<given-names>LW</given-names>
</name>
<name name-style="western">
<surname>Liang</surname>
<given-names>YT</given-names>
</name>
<name name-style="western">
<surname>Chen</surname>
<given-names>HH</given-names>
</name>
</person-group>
<source>Opinion extraction, summarization and tracking in news and Blog Corpora</source>
<year>2006</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>
<ext-link ext-link-type="uri" xlink:href="https://aaai.org/papers/0020-opinion-extraction-summarization-and-tracking-in-news-and-blog-corpora/">https://aaai.org/papers/0020-opinion-extraction-summarization-and-tracking-in-news-and-blog-corpora/</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r15">
<label>[15]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Dong</surname>
<given-names>Z</given-names>
</name>
<name name-style="western">
<surname>Dong</surname>
<given-names>Q</given-names>
</name>
</person-group>
<chapter-title>HowNet – a hybrid language and knowledge resource</chapter-title>
<conf-name>NLP-KE 2003 – 2003 international conference on natural language processing and knowledge engineering, proceedings</conf-name>
<publisher-name>Institute of Electrical and Electronics Engineers Inc</publisher-name>
<year>2003</year>
<fpage>820</fpage>
<lpage>824</lpage>
<pub-id pub-id-type="doi">10.1109/NLPKE.2003.1276017</pub-id>
</element-citation>
</ref>
<ref id="r16">
<label>[16]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Bahdanau</surname>
<given-names>D</given-names>
</name>
<name name-style="western">
<surname>Cho</surname>
<given-names>K</given-names>
</name>
<name name-style="western">
<surname>Bengio</surname>
<given-names>Y</given-names>
</name>
</person-group>
<source>Neural machine translation by jointly learning to align and translate. [online]</source>
<month>September</month>
<year>2014</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1409.0473">http://arxiv.org/abs/1409.0473</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r17">
<label>[17]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Vaswani</surname>
<given-names>A</given-names>
</name>
<name name-style="western">
<surname>Shazeer</surname>
<given-names>N</given-names>
</name>
<name name-style="western">
<surname>Parmar</surname>
<given-names>N</given-names>
</name>
<name name-style="western">
<surname>Uszkoreit</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Jones</surname>
<given-names>L</given-names>
</name>
<name name-style="western">
<surname>Gomez</surname>
<given-names>AN</given-names>
</name>
<etal/>
</person-group>
<source>Attention is all you need. [online]</source>
<month>June</month>
<year>2017</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1706.03762">http://arxiv.org/abs/1706.03762</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r18">
<label>[18]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kaplan</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>McCandlish</surname>
<given-names>S</given-names>
</name>
<name name-style="western">
<surname>Henighan</surname>
<given-names>T</given-names>
</name>
<name name-style="western">
<surname>Brown</surname>
<given-names>TB</given-names>
</name>
<name name-style="western">
<surname>Chess</surname>
<given-names>B</given-names>
</name>
<name name-style="western">
<surname>Child</surname>
<given-names>R</given-names>
</name>
<etal/>
</person-group>
<source>Scaling laws for neural language models. [online]</source>
<month>January</month>
<year>2020</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2001.08361">http://arxiv.org/abs/2001.08361</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r19">
<label>[19]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Brown</surname>
<given-names>TB</given-names>
</name>
<name name-style="western">
<surname>Mann</surname>
<given-names>B</given-names>
</name>
<name name-style="western">
<surname>Ryder</surname>
<given-names>N</given-names>
</name>
<name name-style="western">
<surname>Subbiah</surname>
<given-names>M</given-names>
</name>
<name name-style="western">
<surname>Kaplan</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Dhariwal</surname>
<given-names>P</given-names>
</name>
<etal/>
</person-group>
<source>Language models are few-shot learners. [online]</source>
<month>May</month>
<year>2020</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2005.14165">http://arxiv.org/abs/2005.14165</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r20">
<label>[20]</label>
<element-citation publication-type="book">
<source>OpenAI. GPT-4 technical report. [online]</source>
<month>March</month>
<year>2023</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation> 
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2303.08774">http://arxiv.org/abs/2303.08774</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r21">
<label>[21]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Devlin</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Chang</surname>
<given-names>MW</given-names>
</name>
<name name-style="western">
<surname>Lee</surname>
<given-names>K</given-names>
</name>
<name name-style="western">
<surname>Toutanova</surname>
<given-names>K</given-names>
</name>
</person-group>
<source>BERT: pre-training of deep bidirectional transformers for language understanding. [online]</source>
<month>October</month>
<year>2018</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r22">
<label>[22]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
<name name-style="western">
<surname>Ott</surname>
<given-names>M</given-names>
</name>
<name name-style="western">
<surname>Goyal</surname>
<given-names>N</given-names>
</name>
<name name-style="western">
<surname>Du</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Joshi</surname>
<given-names>M</given-names>
</name>
<name name-style="western">
<surname>Chen</surname>
<given-names>D</given-names>
</name>
<etal/>
</person-group>
<source>RoBERTa: a robustly optimized BERT pretraining approach. [online]</source>
<month>July</month>
<year>2019</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1907.11692">http://arxiv.org/abs/1907.11692</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r23">
<label>[23]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Cho</surname>
<given-names>K</given-names>
</name>
<name name-style="western">
<surname>van Merrienboer</surname>
<given-names>B</given-names>
</name>
<name name-style="western">
<surname>Gulcehre</surname>
<given-names>C</given-names>
</name>
<name name-style="western">
<surname>Bahdanau</surname>
<given-names>D</given-names>
</name>
<name name-style="western">
<surname>Bougares</surname>
<given-names>F</given-names>
</name>
<name name-style="western">
<surname>Schwenk</surname>
<given-names>H</given-names>
</name>
<etal/>
</person-group>
<source>Learning phrase representations using RNN encoder–decoder for statistical machine translation. [online]</source>
<month>June</month>
<year>2014</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation> 
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1406.1078">http://arxiv.org/abs/1406.1078</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r24">
<label>[24]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Bayes</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S</article-title>
<source>Philos Trans R Soc Lond</source>
<year>1763</year>
<volume>53</volume>
<fpage>370</fpage>
<lpage>418</lpage>
<pub-id pub-id-type="doi">10.1098/rstl.1763.0053</pub-id>
</element-citation>
</ref>
<ref id="r25">
<label>[25]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Jiang</surname>
<given-names>L</given-names>
</name>
<name name-style="western">
<surname>Cai</surname>
<given-names>Z</given-names>
</name>
<name name-style="western">
<surname>Zhang</surname>
<given-names>H</given-names>
</name>
<name name-style="western">
<surname>Wang</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Naive Bayes text classifiers: a locally weighted learning approach</article-title>
<source>J Exp Theor Artif Intell</source>
<year>2013</year>
<volume>25</volume>
<issue>2</issue>
<fpage>273</fpage>
<lpage>286</lpage>
<pub-id pub-id-type="doi">10.1080/0952813X.2012.721010</pub-id>
</element-citation>
</ref>
<ref id="r26">
<label>[26]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Zhang</surname>
<given-names>H</given-names>
</name>
</person-group>
<chapter-title>The optimality of naive Bayes</chapter-title>
<conf-name>Proceedings of the seventeenth international Florida Artificial Intelligence Research Society conference. [online]</conf-name>
<publisher-name>AAAI Press</publisher-name>
<year>2004</year>
<fpage>562</fpage>
<lpage>567</lpage>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
</element-citation>
</ref>
<ref id="r27">
<label>[27]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Manning</surname>
<given-names>CD</given-names>
</name>
<name name-style="western">
<surname>Raghavan</surname>
<given-names>P</given-names>
</name>
<name name-style="western">
<surname>Schütze</surname>
<given-names>H</given-names>
</name>
</person-group>
<source>Introduction to information retrieval</source>
<publisher-loc>Cambridge</publisher-loc>
<publisher-name>Cambridge University Press</publisher-name>
<year>2008</year>
<pub-id pub-id-type="doi">10.1017/CBO9780511809071</pub-id>
</element-citation>
</ref>
<ref id="r28">
<label>[28]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Zhang</surname>
<given-names>A</given-names>
</name>
<name name-style="western">
<surname>Lipton</surname>
<given-names>ZC</given-names>
</name>
<name name-style="western">
<surname>Li</surname>
<given-names>M</given-names>
</name>
<name name-style="western">
<surname>Smola</surname>
<given-names>AJ</given-names>
</name>
</person-group>
<source>Dive into deep learning</source>
<publisher-loc>Cambridge</publisher-loc>
<publisher-name>Cambridge University Press</publisher-name>
<year>2021</year>
</element-citation>
</ref>
<ref id="r29">
<label>[29]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Ba</surname>
<given-names>JL</given-names>
</name>
<name name-style="western">
<surname>Kiros</surname>
<given-names>JR</given-names>
</name>
<name name-style="western">
<surname>Hinton</surname>
<given-names>GE</given-names>
</name>
</person-group>
<source>Layer normalization</source>
<comment>[online]</comment>
<month>July</month> 
<year>2016</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1607.06450">http://arxiv.org/abs/1607.06450</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r30">
<label>[30]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Tan</surname>
<given-names>KL</given-names>
</name>
<name name-style="western">
<surname>Lee</surname>
<given-names>CP</given-names>
</name>
<name name-style="western">
<surname>Anbananthen</surname>
<given-names>KSM</given-names>
</name>
<name name-style="western">
<surname>Lim</surname>
<given-names>KM</given-names>
</name>
</person-group>
<article-title>RoBERTa-LSTM: a hybrid model for sentiment analysis with transformer and recurrent neural network</article-title>
<source>IEEE Access</source>
<year>2022</year>
<volume>10</volume>
<fpage>21517</fpage>
<lpage>21525</lpage>
<pub-id pub-id-type="doi">10.1109/ACCESS.2022.3152828</pub-id>
</element-citation>
</ref>
<ref id="r31">
<label>[31]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Hochreiter</surname>
<given-names>S</given-names>
</name>
<name name-style="western">
<surname>Schmidhuber</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Long short-term memory</article-title>
<source>Neural Comput</source>
<year>1997</year>
<volume>9</volume>
<issue>8</issue>
<fpage>1735</fpage>
<lpage>1780</lpage>
<pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id>
</element-citation>
</ref>
<ref id="r32">
<label>[32]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chung</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Gulcehre</surname>
<given-names>C</given-names>
</name>
<name name-style="western">
<surname>Cho</surname>
<given-names>K</given-names>
</name>
<name name-style="western">
<surname>Bengio</surname>
<given-names>Y</given-names>
</name>
</person-group>
<chapter-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</chapter-title>
<conf-name>NIPS 2014 Workshop on Deep Learning</conf-name>
<conf-date>13 December 2014</conf-date>
<conf-loc>Montreal, Canada</conf-loc>
<year>2014</year>
</element-citation>
</ref>
<ref id="r33">
<label>[33]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Gneiting</surname>
<given-names>T</given-names>
</name>
<name name-style="western">
<surname>Raftery</surname>
<given-names>AE</given-names>
</name>
</person-group>
<article-title>Strictly proper scoring rules, prediction, and estimation</article-title>
<source>J Am Stat Assoc</source>
<year>2007</year>
<volume>102</volume>
<issue>477</issue>
<fpage>359</fpage>
<lpage>378</lpage>
<pub-id pub-id-type="doi">10.1198/016214506000001437</pub-id>
</element-citation>
</ref>
<ref id="r34">
<label>[34]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Deerwester</surname>
<given-names>S</given-names>
</name>
<name name-style="western">
<surname>Dumais</surname>
<given-names>ST</given-names>
</name>
<name name-style="western">
<surname>Furnas</surname>
<given-names>GW</given-names>
</name>
<name name-style="western">
<surname>Landauer</surname>
<given-names>TK</given-names>
</name>
<name name-style="western">
<surname>Harshman</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Indexing by latent semantic analysis</article-title>
<source>J Am Soc Inf Sci</source>
<year>1990</year>
<volume>41</volume>
<issue>6</issue>
<fpage>391</fpage>
<lpage>407</lpage>
<pub-id pub-id-type="doi">10.1002/(SICI)1097-4571(199009)41:6&lt;391::AID-ASI1&gt;3.0.CO;2-9</pub-id>
</element-citation>
</ref>
<ref id="r35">
<label>[35]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Kanatani</surname>
<given-names>K</given-names>
</name>
</person-group>
<source>Linear algebra for pattern processing: projection, singular value decomposition, and pseudoinverse</source>
<publisher-loc>San Rafael, CA</publisher-loc>
<publisher-name>Morgan &amp; Claypool</publisher-name>
<year>2021</year>
</element-citation>
</ref>
<ref id="r36">
<label>[36]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Blei</surname>
<given-names>DM</given-names>
</name>
<name name-style="western">
<surname>Ng</surname>
<given-names>AY</given-names>
</name>
<name name-style="western">
<surname>Jordan</surname>
<given-names>MI</given-names>
</name>
</person-group>
<article-title>Latent Dirichlet allocation</article-title>
<source>J Mach Learn Res</source>
<year>2003</year>
<volume>3</volume>
<fpage>993</fpage>
<lpage>1022</lpage>
</element-citation>
</ref>
<ref id="r37">
<label>[37]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Xiao</surname>
<given-names>C</given-names>
</name>
<name name-style="western">
<surname>Zhang</surname>
<given-names>P</given-names>
</name>
<name name-style="western">
<surname>Chaovalitwongse</surname>
<given-names>WA</given-names>
</name>
<name name-style="western">
<surname>Hu</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Wang</surname>
<given-names>F</given-names>
</name>
</person-group>
<chapter-title>Adverse drug reaction prediction with symbolic latent Dirichlet allocation</chapter-title>
<conf-name>Proceedings of the 31st AAAI conference on artificial intelligence</conf-name>
<conf-loc>San Francisco, CA</conf-loc>
</element-citation>
</ref>
<ref id="r38">
<label>[38]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Chuang</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Ramage</surname>
<given-names>D</given-names>
</name>
<name name-style="western">
<surname>Manning</surname>
<given-names>CD</given-names>
</name>
<name name-style="western">
<surname>Heer</surname>
<given-names>J</given-names>
</name>
</person-group>
<chapter-title>Interpretation and trust: designing model-driven visualizations for text analysis</chapter-title>
<conf-name>Proceedings of the SIGCHI conference on human factors in computing systems</conf-name>
<year>2012</year>
<fpage>443</fpage>
<lpage>452</lpage>
<pub-id pub-id-type="doi">10.1145/2207676.2207738</pub-id>
</element-citation>
</ref>
<ref id="r39">
<label>[39]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Sievert</surname>
<given-names>C</given-names>
</name>
<name name-style="western">
<surname>Shirley</surname>
<given-names>K</given-names>
</name>
</person-group>
<chapter-title>LDAvis: a method for visualizing and interpreting topics</chapter-title>
<conf-name>Proceedings of the workshop on interactive language learning, visualization, and interfaces</conf-name>
<publisher-name>Association for Computational Linguistics (ACL)</publisher-name>
<month>June</month>
<year>2014</year>
<conf-loc>Maryland, USA</conf-loc>
<fpage>63</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="doi">10.3115/v1/w14-3110</pub-id>
</element-citation>
</ref>
<ref id="r40">
<label>[40]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<collab>Department for Transport</collab>
</person-group>
<source>Rail factsheet 2019</source>
<year>2019</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="https://www.gov.uk/government/statistics/rail-factsheet-2019">https://www.gov.uk/government/statistics/rail-factsheet-2019</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r41">
<label>[41]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<collab>HS2 Ltd</collab>
</person-group>
<source>High-speed network map</source>
<year>2023</year>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
<comment>Available from: <ext-link ext-link-type="uri" xlink:href="https://www.hs2.org.uk/the-route/high-speed-network-map/">https://www.hs2.org.uk/the-route/high-speed-network-map/</ext-link>
</comment>
</element-citation>
</ref>
<ref id="r42">
<label>[42]</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name name-style="western">
<surname>Haddi</surname>
<given-names>E</given-names>
</name>
<name name-style="western">
<surname>Liu</surname>
<given-names>X</given-names>
</name>
<name name-style="western">
<surname>Shi</surname>
<given-names>Y</given-names>
</name>
</person-group>
<chapter-title>The role of text pre-processing in sentiment analysis</chapter-title>
<source>Procedia Computer Science</source>
<publisher-name>Elsevier B.V.</publisher-name>
<month>January</month>
<year>2013</year>
<fpage>26</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="doi">10.1016/j.procs.2013.05.005</pub-id>
</element-citation>
</ref>
<ref id="r43">
<label>[43]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Hutto</surname>
<given-names>CJ</given-names>
</name>
<name name-style="western">
<surname>Gilbert</surname>
<given-names>E</given-names>
</name>
</person-group>
<chapter-title>VADER: a parsimonious rule-based model for sentiment analysis of social media text</chapter-title>
<conf-name>Proceedings of the eighth international AAAI conference on weblogs and social media</conf-name>
<conf-date>1–4 June 2014</conf-date>
<conf-loc>Michigan, USA</conf-loc>
<comment>[online]</comment>
<year>2014</year>
<fpage>216</fpage>
<lpage>225</lpage>
<date-in-citation content-type="access-date">Accessed 31 May 2023</date-in-citation>
</element-citation>
</ref>
<ref id="r44">
<label>[44]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Davis</surname>
<given-names>J</given-names>
</name>
<name name-style="western">
<surname>Goadrich</surname>
<given-names>M</given-names>
</name>
</person-group>
<chapter-title>The relationship between precision-recall and ROC curves</chapter-title>
<conf-name>Proceedings of the 23rd international conference on machine learning</conf-name>
<year>2006</year>
</element-citation>
</ref>
<ref id="r45">
<label>[45]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Fawcett</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>An introduction to ROC analysis</article-title>
<source>Pattern Recognit Lett</source>
<year>2006</year>
<volume>27</volume>
<issue>8</issue>
<fpage>861</fpage>
<lpage>874</lpage>
<pub-id pub-id-type="doi">10.1016/j.patrec.2005.10.010</pub-id>
</element-citation>
</ref>
<ref id="r46">
<label>[46]</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name name-style="western">
<surname>Rozema</surname>
<given-names>JG</given-names>
</name>
<name name-style="western">
<surname>Bond</surname>
<given-names>AJ</given-names>
</name>
</person-group>
<article-title>Framing effectiveness in impact assessment: discourse accommodation in controversial infrastructure development</article-title>
<source>Environ Impact Assess Rev</source>
<year>2015</year>
<volume>50</volume>
<fpage>66</fpage>
<lpage>73</lpage>
<pub-id pub-id-type="doi">10.1016/j.eiar.2014.08.001</pub-id>
</element-citation>
</ref>
<ref id="r47">
<label>[47]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Rehurek</surname>
<given-names>R</given-names>
</name>
<name name-style="western">
<surname>Sojka</surname>
<given-names>P</given-names>
</name>
</person-group>
<chapter-title>Software framework for topic modelling with large corpora</chapter-title>
<conf-name>Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks</conf-name>
<conf-date>May 2010</conf-date>
<conf-loc>Malta</conf-loc>
<year>2010</year>
<fpage>46</fpage>
<lpage>50</lpage>
</element-citation>
</ref>
<ref id="r48">
<label>[48]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Callison-Burch</surname>
<given-names>C</given-names>
</name>
</person-group>
<chapter-title>Fast, cheap, and creative: evaluating translation quality using Amazon’s Mechanical Turk</chapter-title>
<conf-name>EMNLP 2009 – Proceedings of the 2009 conference on empirical methods in natural language processing: a meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009</conf-name>
<conf-date>August 2009</conf-date>
<conf-loc>Singapore</conf-loc>
<fpage>286</fpage>
<lpage>295</lpage>
</element-citation>
</ref>
<ref id="r49">
<label>[49]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Zhao</surname>
<given-names>H</given-names>
</name>
<name name-style="western">
<surname>Phung</surname>
<given-names>D</given-names>
</name>
<name name-style="western">
<surname>Huynh</surname>
<given-names>V</given-names>
</name>
<name name-style="western">
<surname>Jin</surname>
<given-names>Y</given-names>
</name>
<name name-style="western">
<surname>Du</surname>
<given-names>L</given-names>
</name>
<name name-style="western">
<surname>Buntine</surname>
<given-names>W</given-names>
</name>
</person-group>
<chapter-title>Topic modelling meets deep neural networks: a survey</chapter-title>
<conf-name>Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence</conf-name>
<conf-date>19–27 August 2021</conf-date>
<conf-loc>Montreal, Canada</conf-loc>
<year>2021</year>
</element-citation>
</ref>
<ref id="r50">
<label>[50]</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name name-style="western">
<surname>Zhang</surname>
<given-names>H</given-names>
</name>
<name name-style="western">
<surname>Chen</surname>
<given-names>B</given-names>
</name>
<name name-style="western">
<surname>Guo</surname>
<given-names>D</given-names>
</name>
<name name-style="western">
<surname>Zhou</surname>
<given-names>M</given-names>
</name>
</person-group>
<chapter-title>WHAI: Weibull hybrid autoencoding inference for deep topic modeling</chapter-title>
<conf-name>Proceedings of 6th International Conference on Learning Representations</conf-name>
<conf-date>3 May 2018</conf-date>
<conf-loc>Vancouver, Canada</conf-loc>
<year>2018</year>
</element-citation>
</ref>
</ref-list>
</back>
</article>
