The main goal of this stage is to adapt the input text so that it is as close as possible to what the model expects. Additionally, the input text is separated into sentences. This separation may seem trivial, but it involves a series of difficulties that we will explain shortly.
Broadly speaking, the pre-processing stage is further divided into the following steps:
1. Remove carriage returns, tab stops (`\t`), and extra spaces between words (e.g., "two  spaces" → "two spaces").
2. Add a space at the beginning of intermediate sentences (e.g., "How’s it going?Great!" → "How’s it going? Great!").
3. Divide the text into sentences. This is important since some models (like T5) have a maximum input size, determined by the length of the sequences used to train them. Two common strategies to work around this limitation are (a) truncating the input text, which can lead to a significant loss of information, or (b) splitting the text into smaller fragments, summarizing each fragment, and then concatenating the partial summaries to form the final one. In our case, we follow the second strategy. Furthermore, when fragmenting the text, we make sure that no sentence is split across fragments, in order to preserve the cohesion of the text. That is why we first divide the text into sentences.
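The steps above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the character limit, function names, and the whitespace/punctuation rules are assumptions made for the example (a real limit would be expressed in model tokens, not characters).

```python
import re

# Hypothetical fragment size; a real limit depends on the model's tokenizer.
MAX_CHARS = 2000

def clean_text(text):
    # Remove carriage returns and tab stops.
    text = text.replace("\r", " ").replace("\t", " ")
    # Collapse extra spaces between words into a single space.
    text = re.sub(r" {2,}", " ", text)
    # Add a missing space at the start of an intermediate sentence,
    # e.g. "going?Great!" -> "going? Great!".
    text = re.sub(r"([.!?])([A-Z])", r"\1 \2", text)
    return text.strip()

def chunk_sentences(sentences, max_chars=MAX_CHARS):
    """Group sentences into fragments under max_chars without splitting any sentence."""
    fragments, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            fragments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        fragments.append(current)
    return fragments
```

Each fragment returned by `chunk_sentences` would then be summarized independently, and the partial summaries concatenated.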
Currently, we use Microsoft’s Bling Fire library to split the text into sentences, which gives good results overall.