A Systematic Approach to Map the Research Articles' Sections to IMRAD

Article in IEEE Access · July 2020

DOI: 10.1109/ACCESS.2020.3009021


2 authors:

Ibrar Ahmed Muhammad Tanvir Afzal

Capital University of Science and Technology Islamabad Pakistan
Capital University of Science and Technology Islamabad Pakistan
Corresponding author: Ibrar Ahmed

ABSTRACT The number of scientific publications is believed to double every five years. These
publications are stored by citation indexes and digital libraries as complete PDF files and/or as
terms extracted from these documents. This indexing behavior poses several challenges for the scientific
community as well as for digital repositories in terms of handling the advanced requirements of a user.
For instance, a query like “Give me those papers that contain the term ‘Pagerank’ in their
result section” cannot be answered unless the papers are indexed section-wise. This issue has been
addressed by researchers and by prestigious international challenges at top venues, such as the Semantic
Publishing Challenge at ESWC. One important piece of metadata to extract from research papers is the
section information, such as IMRAD (Introduction, Methodology, Results, and Discussion). Researchers
have presented different approaches to identify and map section headings to IMRAD sections. The
existing studies have employed parameters like dictionary terms, the template of a paper, and in-text citation
frequency to map section headings onto logical sections. A critical analysis of the state of the art revealed
that some features with considerable potential have been ignored, features that might yield a more accurate
mapping. In this study, we propose a novel approach that employs new features along with previously
well-known features to map section headings to IMRAD. The newly proposed features are: (1) a variant of
in-text citation count, (2) figure counts, (3) table counts, and (4) implicit subheading mapping. The employed
data set contains 5000 research papers collected from CiteSeer. The evaluation of the proposed approach and
its comparison with three state-of-the-art approaches revealed an improvement of 18.96%, 21.77%, and 9.50%
in average precision over Ding et al., Shahid et al., and Habib et al., respectively. This research has
significant implications for citation indexes and digital libraries.

INDEX TERMS IMRAD, Relations Database, Section Mapping, Scientific Document Classification.

I. INTRODUCTION
Communication in science is realized through scientific publications. Due to the latest inventions in
science, a tremendous increase has been reported in the number of publications on the WWW. The number is
believed to double every five years [1]. This body of scientific work is disseminated on the web through
different venues such as conferences, journals, and workshops. These venues publish their research corpora
on the web, where they are indexed by digital repositories like CiteSeer, Google Scholar, and Scopus. A user
exploits information retrieval (IR) systems like search engines, citation indexes, or digital libraries to
extract desired information. A user poses a query with the intention of obtaining the maximum relevant
information. For instance, a research scholar finding papers to conduct a literature survey will always wish
to retrieve a maximum number of strongly relevant research papers on the topic posed in the query. However,
the existing IR systems index the data in a semi-structured format. PDF is one of the most widely employed
semi-structured formats; it was developed under the Camelot Project to share documents that include text and
images. Due to improper indexing of PDF files, the existing IR systems are unable to handle advanced queries
of a user. Examples of advanced queries are: (1) find all the research papers containing the term “Data
Science” in the Methodology section, or (2) find all those papers that have calculated “F-measure” in the
“Results” section. Queries of this nature can only be addressed if research papers are indexed section-wise.
The community has proposed solutions in the form of defining the semantic structure of PDF documents and
performing their section-wise mapping [2].
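The two advanced queries above become simple lookups once every section of a paper is stored with its logical label. The following is a minimal sketch of that idea, not the system described in this paper: the table `sections`, its columns, the sample rows, and the helper `papers_with_term` are all hypothetical and chosen only for illustration.

```python
import sqlite3
from typing import List

# Hypothetical schema: one row per logical section of a paper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sections (paper_id TEXT, imrad_label TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO sections VALUES (?, ?, ?)",
    [
        ("P1", "Introduction", "Data Science is a growing field."),
        ("P1", "Results", "We report precision and F-measure for each class."),
        ("P2", "Methodology", "Our Data Science pipeline has three stages."),
        ("P2", "Results", "Accuracy improved by five points."),
    ],
)


def papers_with_term(term: str, imrad_label: str) -> List[str]:
    """Return ids of papers whose section `imrad_label` mentions `term`."""
    rows = conn.execute(
        "SELECT DISTINCT paper_id FROM sections "
        "WHERE imrad_label = ? AND body LIKE '%' || ? || '%' "
        "ORDER BY paper_id",
        (imrad_label, term),
    )
    return [paper_id for (paper_id,) in rows]


# Query (1): "Data Science" in the Methodology section.
print(papers_with_term("Data Science", "Methodology"))
# Query (2): "F-measure" in the Results section.
print(papers_with_term("F-measure", "Results"))
```

Without the `imrad_label` column, the same search would also return paper P1 for “Data Science”, although the term only occurs in its Introduction.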

Initially, research papers were written in the letter-style format. In the early 20th century, a standard
format was presented in the form of IMRAD (Introduction, Methods, Results, and Discussion) [3]. The origin
of IMRAD is vague. However, according to Gitanjali Batmanabane [4], Louis Pasteur was the first person to
use that format, and it was later used by Sir Austin Bradford Hill. The structure states that a research
paper should comprise logical sections like Introduction, Methods, Results, and Discussion. The majority of
the scientific community currently favors the IMRAD structure when preparing research papers. Identifying
the logical sections of a research paper has already been addressed by prestigious international challenges
at top venues, such as the Semantic Publishing Challenge at ESWC [5], and by the research community [2],
[6]–[8].

It should be noted that the names given by authors to section headings are usually not identical to the
names used by the IMRAD structure (i.e., Introduction, Methods, Results, and Discussion). For example, the
state-of-the-art study [2], which performed experiments on 1,833 section headings, concluded that none of
the methodology sections was named “Methods/Methodology” by the authors of the respective papers. Only 1%
of the papers used the term “Result” for the “Result” section.

Over the years, researchers have presented different section identification techniques [2], [6], [8]. Ding
et al. [8] performed a study to identify the distribution of in-text citations across sections. For this
task, the researchers identified the section headings and mapped those headings onto logical sections. They
used very extensive dictionary terms to identify the sections, applied their technique to 866 full-text
articles containing 6866 sections, and achieved 81% accuracy. A. Shahid and M. T. Afzal [2] extended the
technique of Ding et al. [8] with different dictionary terms along with research paper templates and layout
to identify section headings and map them to the IMRAD structure. They applied the technique to 1200 papers
containing 12,180 sections and obtained 0.78 precision and 0.79 recall. Raja Habib and M. T. Afzal [6] used
the frequency of in-text citations to identify the sections, applied that technique to 5000
bibliographically coupled papers, and achieved 90% accuracy.

However, on critical investigation, we identified two key areas for improvement. In some scenarios, the
contemporary approaches are unable to differentiate between sections and subsections of a research paper.
For example, a logical section “1. Introduction” contains a subsection named “1.2 Background”. The existing
approaches treat both of them as independent headings and map them individually to the IMRAD structure,
which may increase the chances of inaccurate mapping. Some studies also consider subheadings and rely on
heading tags <h1... hn>, which can fail in some cases; we explain the issue in detail in section 2 of this
paper. Other than subheadings, there exists a list of potential section identifier parameters that have
been ignored by the existing state-of-the-art.

In this paper, we present a comprehensive approach that intelligently maps research articles to IMRAD. The
proposed approach takes advantage of accurately identifying the subsections and mapping them to IMRAD
headings based on their main section mapping to achieve better results. Furthermore, the proposed approach
also exploits novel features which have great potential to improve the performance of IR systems. The
features include: (1) In-Text Citations Count, (2) Figures Count, and (3) Tables Count. These features have
been chosen based on the assumption that their frequency may hint at the association of a heading with a
specific logical section. The proposed methodology is evaluated on the logical sections of 5000 research
papers in PDF taken from CiteSeer, having 39420 section headings.

We have compared the proposed approach with all three section-mapping techniques [2], [6], [8] on the same
dataset of 5000 papers. Our proposed technique outperformed all three. The proposed approach gained 18.96%,
21.77%, and 9.50% improvement in average precision over all sections compared with Ding et al. [8],
A. Shahid and M. T. Afzal [2], and Raja Habib and M. T. Afzal [6], respectively.

The rest of the paper is organized as follows: Section II presents a critical analysis of state-of-the-art
systems. Section III highlights the lessons learned from empirical experiments on research papers. Section
IV contains the information about the dataset, followed by Section V, which elaborates the methodology
proposed to map sections onto the IMRAD structure. Section VI discusses the results and analysis. Finally,
the performance evaluation is demonstrated in Section VII, followed by the conclusion in Section VIII.

II. LITERATURE REVIEW
Since 1665, the letter-style format has been followed by the research community to prepare research
documents [9]. In the early 20th century, a standard structure designed specifically for writing research
papers was presented in the form of IMRAD (Introduction, Methodology, Results, and Discussion). Gradually,
the popularity of the structure increased and most of the research community adopted it for preparing
research documents [9]. The community believed that if the structure of research papers is defined by
mapping their logical sections to IMRAD, then the performance of IR systems can be significantly enhanced.
In this regard, the scientific community has made various efforts, from defining the structure of PDF files
to mapping them to IMRAD. The studies have focused on adding semantic structure to the content of PDF files
to retrieve information in an intelligent manner. A semantically defined structure of a document can be
used for different applications pertaining to information retrieval, such as generating summaries or
processing advanced queries. The studies focusing on defining the structure of a research paper’s content
are referred to as discourse analysis. The scientific community has presented section-based studies of
research papers in two dimensions: some have employed the logical sections of research papers to identify
different aspects pertaining to bibliometric analysis, and others have mapped logical sections to the IMRAD
structure.

A. TECHNIQUES USING LOGICAL SECTIONS
The study by Teufel and Moens [10] presented an Argumentative Zoning (AZ) system using elements of
scientific argumentation [10]. These scientific arguments are referred to as “own work”, “others’ work”,
and “contrast” [11]. Another similar scheme, Core Scientific Concepts (CoreSC), extracts generic concepts
like Hypothesis, Model, and Experiments from research papers [12]. Besides defining the structure, IMRAD
has also proven useful in various other contexts. For instance, Teufel et al. [13] stated that
investigating a citation count with respect to its location in a logical section can produce good results
in discerning the sentiment of the citing author. Similarly, another study [14] performed citation analysis
by exploiting in-text citations in different sections (introduction, methods, results, and discussion) of
a research paper. The importance of logical sections has also been delineated by A. Shahid and M. T. Afzal
[2] to discover the semantic relationships between research articles. Another approach [15] presented an
ontology (DEo) to define the logical structures of scientific documents. The semantic indexing of research
documents holds great potential in identifying implicit knowledge. In the study [16], the authors exploited
in-text citation frequencies and their patterns from the logical sections of research papers. The
evaluation was done on a data set of scientific papers taken from CiteSeer. Kafkas et al. [17] presented a
study wherein section-based search functionality was provided for the research papers published in the
Journal of Biomedical. The approach presented by A. Shahid and M. T. Afzal [2] is closest to Kafkas et al.
[17]. The difference is that the approach of Kafkas et al. [17] manually extracts the sections by utilizing
designed rules. The contemporary IR systems are unable to semantically index the structure of PDF files,
due to which advanced queries cannot be processed. Examples of advanced queries are: (1) find all the
research papers containing the term “Data Science” in the Methodology section, or (2) find all those papers
that have calculated “F-measure” in the “Results” section. Addressing such queries is a dire need of the
current era, especially given the huge amount of research available on the web. However, defining a
semantic structure that is able to address the aforementioned queries is a challenging process. This is
because people use different vocabulary or semantic terms to name the same logical sections. For instance,
some authors use the term “Literature Review” while some use “Related work” for a section representing
state-of-the-art studies. In such scenarios, it becomes difficult to semantically distinguish the terms.
Such issues have also been reported by A. Shahid and M. T. Afzal [2], wherein 329 papers were manually
assessed and it was revealed that none of the papers used the term “Methodology” for the methodology
section and only 1% of the papers used the term “Result” for the “Result” section. Besides these, some
other researchers have also used logical sections to identify important citations, for example Nazir et al.
[18], Hasan et al. [19], and Pride et al. [20].

B. TECHNIQUES MAPPING LOGICAL SECTIONS
Ding et al. [8] used extensive dictionary terms to identify and map the logical sections to IMRAD. This
technique was tested on 866 full-text articles containing 6866 sections and achieved 81% accuracy.
A. Shahid and M. T. Afzal [2] extended the study of Ding et al. [8] and used different dictionary terms and
templates of the papers to identify and map the sections to IMRAD. The study of A. Shahid and M. T. Afzal
[2] mapped research articles from the domain of Computer Science and automatically extracted the sections
using the DEo ontology. The study then performed a rigorous comparative analysis with various ML techniques
and, unlike the study presented by Kafkas et al. [17], formally presented the proposed algorithm. The
approach of A. Shahid and M. T. Afzal [2] mapped the sections of research papers into six logical sections
of IMRAD (“Introduction”, “Related Work”, “Methods”, “Results”, “Discussion”, and “Conclusion”). The
approach [21] harnessed two main features, layout information of scientific documents and dictionary terms,
to form heuristics that map research articles section-wise onto the IMRAD structure. The evaluation was
done on 329 papers from the domain of Computer Science published in the Journal of Universal Computer
Science (J.UCS). Raja Habib and M. T. Afzal [6] used in-text citation frequency to identify the sections.
This technique achieved 90% accuracy when tested on 5000 papers.

A critical analysis of the studies presented above illustrates that the contemporary state-of-the-art has
proposed several methods to semantically index research documents according to some predefined structure.
A few efforts have been made to define the structure of research documents from the domain of Computer
Science. The approach proposed by A. Shahid and M. T. Afzal [2] is the most recent approach that performs
section-wise mapping of research articles from the domain of Computer Science by utilizing heuristics
formed using layout information and content information. However, the approach has various deficiencies
which can adversely impact the performance of IR systems. We have critically scrutinized the approach by
manual investigation and identified existing gaps, which are the focus of the proposed study. The following
section, “Lessons Learned”, presents an in-depth analysis of the identified issues with the help of
examples taken from a real data set.

III. LESSONS LEARNED
As explained earlier, our work is closest to the work presented by A. Shahid and M. T. Afzal [2]. The
approach presented by A. Shahid and M. T. Afzal [2] maps logical
sections of research papers to IMRAD by using template information and dictionary-based rules. We have
implemented the approach of A. Shahid and M. T. Afzal [2] and discovered some patterns that led us to
formulate the proposed research questions. Let us look into the patterns with the help of real examples
taken from the PDF files of the employed data set.

A. SUBHEADINGS MAPPING
In the PDF file shown in Figure 1, the Introduction section of a research paper, “1. Introduction”,
contains three subsections: “1.1. Biology needs computation”, “1.2. Genes and cells”, and “1.3. GemCell”.
In the XML file of this PDF, shown in Figure 2, the main heading (i.e., 1. Introduction) is represented
with the tag <h1> and all the sub-headings are represented with the tag <h2>. The content inside the
opening and closing bracket of the heading tag is considered as the name of an independent logical section.
All the remaining subheadings are also represented with the same tag <h1>. This manual inspection revealed
that the approach [2] is unable to differentiate between the main section and the subsections of a paper;
rather, it treats all of them as independent headings. This deficiency causes another major issue,
explained below.

Ding et al. [8] already used the headings <h1> and <h2>. However, as Figure 3 shows, there is a sub-section
named “1.3 Related Work”. As per the IMRAD structure, the section will be considered an independent section
and will get mapped to the “Related Work” section of IMRAD. However, in reality, the section does not
belong to the literature review section of IMRAD. Such issues result in false mapping, further compromising
the performance of IR systems. On careful examination of the heading content, we identified that most of
the headings start with a bullet number. For instance, the main heading starts with “1.” followed by the
sub-headings with “1.1”, “1.2”, “1.3”, and “1.4”. We argue that the inability of XML to differentiate the
logical structure can be addressed with the help of regular expressions that are able to intelligently
differentiate between the sections on the basis of the mentioned patterns.

To recapitulate, the examples discussed above show that the contemporary state-of-the-art on logical
section mapping fails to intelligently map the subsections. Treating all the sub-sections as independent
sections and explicitly mapping them to the IMRAD structure can have an adverse impact on the overall
precision of IR systems. Our study overcomes the mentioned issues by utilizing regex to implicitly map a
subsection to the same section as its main section. As explained earlier, the proposed study employs some
potential features such as In-Text Citations count, Figures count, and Tables count to determine the
association of a section with specific logical sections of IMRAD. The justification for harnessing these
parameters is given below.

B. FIGURES AND TABLES COUNT
A scientific article illustrates results in the form of a figure or table. There may be a high probability
that the frequency of figures or tables hints at the association of the section with a particular logical
section of IMRAD. For instance, a “Methodology” or “Result” section contains a comparatively higher “Figure
count” or “Table count” than other sections. To the best of our knowledge, the contemporary approach [2]
has not given due importance to the potential parameters “Figure count” and “Table count”. In this study,
we consider both parameters, “Figure count” and “Table count”, based on the assumption that the count of
figures links to a specific section of the IMRAD structure. In the XML files, figures and tables are
represented in the form of objects, as shown in Figure 4 and Figure 5.

C. IN-TEXT CITATIONS FREQUENCY
Similar to the aforementioned assumption followed for the “Figure Count” and “Table Count” parameters, the
“In-Text Citation count” can also serve as an important indicator in determining association with a
specific logical section. Ding et al. [8] have employed the frequency of in-text citations in all the
logical sections. We have manually investigated research papers of the employed data set and identified
that the number of in-text citations is not equal in all the sections. For instance, the “Literature
Review” section contains a higher number of in-text citations than other sections. Considering this aspect,
our approach maps the logical section having the highest number of in-text citations to the Literature
Review section. Raja Habib and M. T. Afzal [6] have considered this parameter for mapping logical sections
to the IMRAD structure using the Ding et al. [8] study. Similar to figures and tables, citations are also
represented in the form of objects in XML files, as shown in Figure 6.

The important aspect overlooked by contemporary studies is that they have not given adequate importance to
potential features like “In-Text Citation count”, “Figure count”, and “Table count”, although, as explained
above, these parameters could be potential contributors to section identification. In this study, our focus
is to overcome the stated deficiencies to improve the performance of IR systems.

IV. DATA COLLECTION
Given that an appropriate data set plays a crucial role in determining the significance of a proposed
study, we have collected the data set in a way that ensures the validity of the proposed study on papers
published in diverse domains.

For the verification of our approach, we require a comprehensive dataset from diverse domains, covering
different authors and different journals. We found a dataset with the required characteristics from Raja
Habib and M. T. Afzal [6]. This data set is freely available. We used the 17 different queries listed in
Table 1, which are adapted from Raja Habib and M. T. Afzal [6], to collect the data from CiteSeer. CiteSeer
indexes a vast number of research papers in diversified disciplines of Computer Science. The employed
4 VOLUME 4, 2016

FIGURE 1. Subheading example in form of PDF file.

FIGURE 2. Subheading Example 1 (XML)

FIGURE 3. Subheading Example 2 (XML)

FIGURE 4. Figures in Research Publications

FIGURE 5. Tables in Research Publications.

FIGURE 6. Example of a figure caption.
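The three count features can be read directly off such XML objects. The sketch below is illustrative only: the sample document is a hand-written simplification, and its element names, the `class` values, and the function `count_features` are assumptions rather than the exact converter output or the authors' implementation.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; the real converter output is richer.
SAMPLE = """
<article>
  <section>
    <h1>2. Related Work</h1>
    <region class="DoCO:TextChunk">Earlier work <xref ref-type="bibr">[1]</xref>,
      <xref ref-type="bibr">[2]</xref> and <xref ref-type="bibr">[3]</xref>.</region>
  </section>
  <section>
    <h1>4. Evaluation</h1>
    <region class="DoCO:TextChunk">Compared with <xref ref-type="bibr">[2]</xref>.</region>
    <region class="DoCO:FigureBox">Figure 1</region>
    <region class="DoCO:TableBox">Table 1</region>
    <region class="DoCO:TableBox">Table 2</region>
  </section>
</article>
"""


def count_features(xml_text):
    """Return {heading: {"citations": n, "figures": n, "tables": n}}.

    Assumes flat (non-nested) <section> elements, each with one <h1> title.
    """
    counts = {}
    for section in ET.fromstring(xml_text).iter("section"):
        heading = section.findtext("h1", default="").strip()
        regions = list(section.iter("region"))
        counts[heading] = {
            "citations": sum(
                1 for x in section.iter("xref") if x.get("ref-type") == "bibr"
            ),
            "figures": sum(1 for r in regions if r.get("class") == "DoCO:FigureBox"),
            "tables": sum(1 for r in regions if r.get("class") == "DoCO:TableBox"),
        }
    return counts


features = count_features(SAMPLE)
# Heuristic from the text: the most-cited section is a Literature Review candidate.
most_cited = max(features, key=lambda h: features[h]["citations"])
print(features)
print(most_cited)
```

In this toy input, the section with the most in-text citations is the related-work section, while the figure and table objects concentrate in the evaluation section, which is the pattern the count features are meant to capture.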

dataset contains 5000 papers containing 39420 sections of different journals.

V. METHODOLOGY
This section presents the details of the proposed methodology. It works in four modules to map the logical
sections of research articles to the IMRAD structure. The modules include (1) Schema Generation Engine
(SGE), (2) Data Extraction Engine (DEE), (3) Data Mapping Engine (DME), and (4) Mapping View Engine (MVE).
First of all, the data set containing PDF files of research papers is collected from a digital library
named CiteSeer. The PDF files are converted into XML using the PDFX [22] tool. The Schema Generation Engine
(SGE) is used to generate the schema of the XML files, which is maintained in PostgreSQL to parse and insert
the XML data. Thereafter, the Data Extraction Engine (DEE) is used to extract headings and subheadings from
the sections of research papers, along with other objects like citations, figures, and tables. The Mapping
SQL Engine (MSE) maps the extracted headings and subheadings to IMRAD with the help of the devised
Algorithm 1. The last module, the Mapping View Engine (MVE), is used to visualize the resultant mapping
using XPath/XQuery expressions. In the end, the mapped sections are evaluated by using the benchmark data
set that contains section annotations formed with the help of a user study. The
overall structure of the proposed methodology is shown in Figure 7 and Figure 8. The proposed algorithm is
formally represented below in the form of different modules 1, 2, 4, 5, 6, 7. The detailed explanation of
all the modules implemented in the proposed methodology is given in the following sections.

TABLE 1. Queries used to collect dataset [6]
Number  Query
1       Social network
2       Information retrieval
3       Bayesian networks
4       Feature selection
5       Collaborative recommendation
6       Recommendation system
7       Content based filtering
8       Black box testing
9       Automatic generation
10      Regression testing
11      Query processing
12      Sensor networks
13      Wireless communications
14      Opinion mining
15      Subjectivity analysis
16      Online marketing
17      Graph theory

A. SCHEMA GENERATION ENGINE (SGE)
As explained earlier, the research papers in the employed data set were in PDF form and were converted
into XML format. The requisite parameters from the XML files need to be stored in some meaningful format so
that extensive queries can be processed on them. In this regard, we developed the Schema Generation Engine
(SGE), wherein we stored that information in different tables of the database and created links between
them. The XML publications were inserted into PostgreSQL, a renowned relational database, using the
generated schema. The schema is the true mapping of research publications to the relational database. Since
PostgreSQL is a very powerful XML engine used to manipulate XML data, we picked it to parse and insert the
XML data.

B. DATA EXTRACTION ENGINE (DEE)
The Data Extraction Engine (DEE) is used to pre-process the data. While preprocessing, all the noise from
the data is removed. By the term “noise”, we mean those papers that do not follow the paragraph and heading
syntax of research papers. In XML files, the tags of sub-headings are not properly nested within the tags
of main headings; rather, all are represented as independent headings. We extracted all the tags from the
XML using DEE to determine their position (i.e., main heading or subheading) in the paper. The DEE
extracted the headings and subheadings using the XML heading structures and populated the database
accordingly. Afterward, DEE identified citations and objects like figures and tables from the XML files.
Algorithm 2 shows the process of data extraction from XML.

C. DATA MAPPING ENGINE (DME)
The previous module, the Data Extraction Engine, extracts the data from XML and, after some processing,
inserts the data in a relational format into the database. The relational database format of the dataset
makes it easy to manipulate the data and extract information, but there is still a need to find the
sections and map them to IMRAD. The Data Mapping Engine (DME) uses XQuery/XPath/SQL queries to extract each
section and map it to IMRAD. By carefully examining the structure of the XML files, we have designed some
rules to map the sections to IMRAD. Algorithms 5, 6, and 7 show the process of mapping sections to IMRAD.

1) Subheadings Mappings
Research publications consist of headings and subheadings. It is really important to segregate the headings
and subheadings. While mapping headings to sections, some studies take subheadings into consideration, but
some studies totally ignore them. For instance, if a heading <h1> contains two sub-headings <h2> and <h3>,
then all three tags are considered as independent sections by the existing studies. Due to this negligence,
an IMRAD heading can be mistakenly mapped to a subheading.

Our proposed approach distinguishes between the main heading and subheading with the help of regular
expressions (3). Unlike other studies, our approach does not map sub-headings independently to the IMRAD
structure. The sub-headings are mapped to the section of IMRAD to which their parent heading in the
hierarchy is mapped. The proposed approach handles the following two cases that have been overlooked by the
scientific community.
1 - Section subheadings are properly distinguishable in the XML format by the XML heading tags <h2>, <h3>,
etc.
2 - Section subheadings are not distinguishable in the XML format, and a regular expression (3) needs to be
run to distinguish headings by bullets.

The following regular expression is used to detect the headings and subheadings in the scenarios wherein
separate tags for headings were not present in XML.

We observed that in some cases headings and sub-headings are not explicitly defined in XML. In such
scenarios, it becomes difficult to detect the start and end of the main headings <h1>. We have critically
analyzed such XML files in the data set and found that the XML element “region class="DoCO:TextChunk"” can
be used where headers are missing in XML files.

2) Sections Sequences
In recent years, the scholarly community has followed IMRAD-based rules while preparing research documents.
They follow the rule of sequence. If the rules are properly followed, then it becomes much more feasible
for IR systems to correctly identify the sections. However, we have critically analyzed PDF files from the
employed data set and observed that some of the research papers have not followed the rule of sequence. The
FIGURE 7. System Design and Component
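The bullet-number idea used for subheading mapping can be illustrated with a short sketch. The pattern `BULLET` and the helpers `heading_level` and `inherit_parent_labels` are hypothetical; the authors' own regular expression is not reproduced in the text, so this is one plausible reading of the technique, not their implementation.

```python
import re

# Leading bullet number such as "1.", "1.2" or "1.3." followed by the title.
BULLET = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*\S)\s*$")


def heading_level(heading):
    """Return (level, number, title); level 1 is a main heading, 0 means no bullet."""
    match = BULLET.match(heading)
    if not match:
        return 0, "", heading.strip()
    number, title = match.groups()
    return number.count(".") + 1, number, title


def inherit_parent_labels(headings, main_labels):
    """Give every bulleted heading the IMRAD label of its top-level parent.

    `main_labels` maps a top-level bullet ("1", "2", ...) to an IMRAD label.
    """
    mapped = {}
    for heading in headings:
        level, number, _ = heading_level(heading)
        if level == 0:
            continue  # no bullet number: another strategy is needed
        parent = number.split(".")[0]
        mapped[heading] = main_labels.get(parent, "Unknown")
    return mapped


headings = [
    "1. Introduction",
    "1.1. Biology needs computation",
    "1.3 Related Work",
    "2. Methods",
]
labels = inherit_parent_labels(headings, {"1": "Introduction", "2": "Methodology"})
print(labels)
```

With this mapping, the subsection “1.3 Related Work” inherits the label of “1. Introduction” instead of being matched on its own title, which is the false mapping the subheading analysis warns about. A heading whose title happens to start with a number, such as a year, would be misread as a bullet; that limitation is left open here.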

alternate methods are required to identify the sections from those papers that have not been structured
according to the rules.

3) Sections Known Names
Typically, the scholarly community uses the names listed below for the sections of research papers. The
presence of these names makes it easier to identify the logical sections. The names include:
• Abstract
• Keywords
• Introduction
• Literature Review / History / Related work
• Methodology
• Results / Discussion
• Conclusions and future work
• References
In this study, we have also identified the synonyms of these section names. Although the scientific
community follows the IMRAD structure, it does not strictly adhere to the names mentioned above. Usually, a
researcher picks synonyms of the names as per his/her choice. For instance, as identified in PDF files of
the employed data set, some authors have named the “Results/Discussion” section “Evaluation” while some
have named it “Analysis”. We have extracted synonyms from all the 5000 research articles. Wherever the
exact section name is not found, the proposed methodology matches the synonyms for section mapping.

4) Citation Count
Citation is one of the powerful bibliometric indices, used to reveal facts regarding various aspects of
scientific literature. A citation count is the number of times a citation appears in the body of a
document, also known as in-text citations. We have considered “citation count” as a measure to detect
logical sections. This measure has been chosen based on the assumption that a certain count of citations
identified in a logical section may serve as a strong relevance clue. For instance, the “Literature Review”
section typically contains a comparatively greater number of in-text citations than other sections. Such a
section can be mapped to the “Literature Review/Related Work” section.

The XML files contain elements like Xref, ref, section, etc. The Xref element with ref-type = ’bibr’
denotes an in-text citation. This can be linked to the ’ref’ tags via an attribute. After extracting all
these tags, we computed the count of citations. Thereafter, the tag anchors were detected using the CAD
[23].

5) The occurrence of objects
Most research articles contain multiple figures and tables. The XML files represent these figures and
tables in the form of objects. We have picked these objects as a measure to identify logical sections based
on the same notion
8 VOLUME 4, 2016

of harnessing citation count measure. The results/discussion section contains a comparatively greater number of objects than other sections. These objects are detected using XML tags like “TableBox” and “FigureBox”, as shown in Figure 4 and Figure 5.

FIGURE 8. System Flow

Input: XML Publications Dataset
Result: IMRAD Mapped Sections
for each xml ∈ XML do
    lh = extractHeaders(xml_i)
    ml = mapSectionUsingDictionary(lh, ml)
    ml = mapSectionUsingTemplate(lh, ml)
    ml = mapSectionUsingCitationFreq(lh, ml)
    ml = mapSectionUsingFigureFreq(lh, ml)
    ml = mapSectionUsingTableFreq(lh, ml)
end for
Algorithm 1: Section Mapping to IMRAD

Input: XML File: xml
Output: Header/Section List hl
T ← xml // extract text from xml
for each heading ∈ T do
    hl <= xPathQuery(getH1) // get heading <h1>
    hl <= xPathQuery(getH2) // get heading <h2>
    hl <= xPathQuery(getH3) // get heading <h3>
    bl <= extBullets(hl) // get bullets
    hl <= getSectionClass(“DoCO:SectionTitle”)
end for
Algorithm 2: extractHeaders

The Mapping View Engine (MVE) is designed to visualize the resultant mapping for analysis. As discussed earlier, we mapped all the sections to IMRAD; we then need views to retrieve that information from the database. Retrieving the data requires running SQL/XPath queries with regular expressions, which is not always easy; therefore, MVE provides different views of the results.

VI. RESULTS AND EVALUATION
Once all the modules of the proposed methodology have been implemented, the next step is the evaluation using the benchmark data set. The outcomes are evaluated in two phases: (1) we identified a one-layer hierarchy of sub-headings and then mapped the sections to IMRAD; (2) afterwards, we determined the collective contribution of all the parameters in a hybrid manner by combining the parameters “citation count”, “figures count”
Input: Header/Section List hl
Output: Bullet List: bl
\?:^|
return bl // Bullet List
Algorithm 3: extractBullet

Input: XML file: xml, Header List: lh
Result: Section Mapping List: sml, Header List: lh
for each l ∈ lh do
    for each imrad ∈ IMRAD do
        if l_j != imrad_i then
            return sml // Section Mapping List
        end if
    end for
end for
Algorithm 4: mapSectionUsingDictionary

Input: XML file: xml, Section Mapping List: sml
Result: Section Mapping List: sml
for each l ∈ lh do
    (CT_i, CTP_i) = String-Citation_Anchor_Detection(t) // CAD [23]
    (CN_i, CNP_i) = Numeric-Citation_Anchor_Detection(t) // CAD [23]
    for each imrad ∈ IMRAD do
        if l_j != imrad_i then
            if CF_i ≥ 50% then
                sml_discussion <= true // a Discussion section
                return sml // Section Mapping List
            end if
        else
            if CF_i ≥ 30% then
                sml_intro <= true // an Introduction section
                return sml // Section Mapping List
            end if
        end if
    end for
end for
Algorithm 5: mapSectionUsingCitationFrequency

Input: XML file: xml, Section Mapping List: sml
Result: Section Mapping List: sml
for each l ∈ lh do
    FP_i = getRegionClass(“DoCO:FigureBox”)
    for each imrad ∈ IMRAD do
        if l_j != imrad_i then
            if FP_i ≥ 60% then
                sml_results <= true // a Results section
                return sml // Section Mapping List
            end if
        else
            if FP_i ≥ 30% then
                sml_meth <= true // a Methodology section
                return sml // Section Mapping List
            end if
        end if
    end for
end for
Algorithm 6: mapSectionUsingFigureFreq

Input: XML file: xml, Section Mapping List: sml
Result: Section Mapping List: sml
for each l ∈ lh do
    TP_i = getRegionClass(“DoCO:TableBox”)
    for each imrad ∈ IMRAD do
        if l_j != imrad_i then
            if TP_i ≥ 70% then
                sml_results <= true // a Results section
                return sml // Section Mapping List
            end if
        else
            if TP_i ≥ 20% then
                sml_meth <= true // a Methodology section
                return sml // Section Mapping List
            end if
        end if
    end for
end for
Algorithm 7: mapSectionUsingTableFrequency
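The heading extraction of Algorithm 2 and the dictionary lookup of Algorithm 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The XML layout (h1/h2/h3 elements carrying class "DoCO:SectionTitle"), the SYNONYMS table, and the helper names normalize, extract_headers, and map_with_dictionary are assumptions made for illustration; only the synonyms "Evaluation" and "Analysis" are taken from the text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical synonym dictionary. Only "evaluation" and "analysis" are
# mentioned in the paper; the remaining entries are illustrative assumptions.
SYNONYMS = {
    "Introduction": {"introduction", "background"},
    "Methodology": {"methodology", "methods", "proposed approach"},
    "Results/Discussion": {"results", "discussion", "evaluation", "analysis"},
}

def normalize(title):
    """Lower-case a heading and strip leading numbering such as 'VI.' or '3)'."""
    title = re.sub(r"^\s*(?:[ivxlc]+|\d+)[.)]\s*", "", title.strip(),
                   flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", title).lower()

def extract_headers(xml_text):
    """Collect the text of h1/h2/h3 elements marked as section titles."""
    root = ET.fromstring(xml_text)
    return [
        "".join(el.itertext()).strip()
        for el in root.iter()
        if el.tag in ("h1", "h2", "h3") and el.get("class") == "DoCO:SectionTitle"
    ]

def map_with_dictionary(headers):
    """Map each heading to a logical section, or None when no synonym matches."""
    mapping = {}
    for header in headers:
        key = normalize(header)
        mapping[header] = next(
            (label for label, names in SYNONYMS.items() if key in names), None
        )
    return mapping

# Assumed, simplified stand-in for a PDFX output file.
SAMPLE = """<pdfx><article><body>
<section class="DoCO:Section"><h1 class="DoCO:SectionTitle">I. Introduction</h1></section>
<section class="DoCO:Section"><h1 class="DoCO:SectionTitle">IV. Evaluation</h1></section>
<section class="DoCO:Section"><h1 class="DoCO:SectionTitle">V. Threats</h1></section>
</body></article></pdfx>"""

mapping = map_with_dictionary(extract_headers(SAMPLE))
```

Headings that remain unmapped (None) are the ones that the template and frequency-based steps of Algorithm 1 would still have to resolve.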

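The frequency-based steps of Algorithms 5 to 7 can be sketched in the same spirit. The thresholds below are the ones printed in the algorithms. Everything else is an assumption: the percentage is read as a section's share of the document-wide total for that object type, sections are taken to be flat children of the root element, the first feature that fires decides the label, and the names RULES, count_objects, and map_by_frequency are hypothetical. The branch on the dictionary result (l_j != imrad_i) is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Thresholds as printed in Algorithms 5-7: (higher share, label), (lower share, label).
RULES = {
    "citations": ((0.50, "Discussion"), (0.30, "Introduction")),
    "figures": ((0.60, "Results"), (0.30, "Methodology")),
    "tables": ((0.70, "Results"), (0.20, "Methodology")),
}

def count_objects(section):
    """Count in-text citations, figure boxes and table boxes inside one section."""
    counts = {"citations": 0, "figures": 0, "tables": 0}
    for el in section.iter():
        if el.tag == "xref" and el.get("ref-type") == "bibr":
            counts["citations"] += 1
        elif el.get("class") == "DoCO:FigureBox":
            counts["figures"] += 1
        elif el.get("class") == "DoCO:TableBox":
            counts["tables"] += 1
    return counts

def map_by_frequency(xml_text):
    """Label sections by their share of citations, figures and tables."""
    root = ET.fromstring(xml_text)
    per_section = {}
    for section in root.findall("section"):  # assumption: flat, top-level sections
        title = "".join(section.find("h1").itertext()).strip()
        per_section[title] = count_objects(section)
    totals = {f: sum(c[f] for c in per_section.values()) for f in RULES}
    mapping = {}
    for title, counts in per_section.items():
        for feature, rules in RULES.items():
            if totals[feature] == 0:
                continue
            share = counts[feature] / totals[feature]
            label = next((lab for threshold, lab in rules if share >= threshold), None)
            if label is not None:
                mapping.setdefault(title, label)  # assumption: first match wins
    return mapping

# Assumed, simplified stand-in for a PDFX output file.
CITE = '<xref ref-type="bibr"/>'
FIG = '<region class="DoCO:FigureBox"/>'
TAB = '<region class="DoCO:TableBox"/>'
SAMPLE = (
    "<body>"
    f"<section><h1>Overview</h1>{CITE * 2}</section>"
    f"<section><h1>Approach</h1>{FIG}</section>"
    f"<section><h1>Experiments</h1>{FIG * 2}{TAB * 3}</section>"
    f"<section><h1>Comparison with prior work</h1>{CITE * 3}</section>"
    "</body>"
)
mapping = map_by_frequency(SAMPLE)
```

In this toy document, "Experiments" holds two of the three figures (about 67%), which exceeds the 60% figure threshold, while "Approach" holds one of three (about 33%) and falls into the lower band.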
and “tables count”. The standard evaluation measures, including precision, recall, and F-measure, have been employed. We calculated the precision, recall, and F-measure of each section separately for both the aforementioned phases. Figure 12 shows the combined average score of precision, recall, and F-measure for all the sections, and Figures 9, 10, and 11 represent the individual scores for each section. The formal definitions of precision, recall, and F-measure are shown below.

$$P_x = \frac{TP_x}{TP_x + FP_x}, \qquad R_x = \frac{TP_x}{TP_x + FN_x}$$

where $x_0$ = Introduction, $x_1$ = Methods, $x_2$ = Results, and $x_3$ = Discussion.

$$P_{avg} = \frac{1}{4}\sum_{i=0}^{3} P_{x_i} \quad \text{(average precision)}, \qquad R_{avg} = \frac{1}{4}\sum_{i=0}^{3} R_{x_i} \quad \text{(average recall)}$$

FIGURE 10. Section wise comparison of recall

FIGURE 11. Section wise comparison of F-Measure
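As a worked illustration of these measures, the following Python sketch computes per-section precision and recall and their averages over the four IMRAD sections. The counts are invented for the example and are not results from the paper. The F-measure is taken to be the usual harmonic mean of precision and recall, which is an assumption because the section spells out only the precision and recall formulas.

```python
from dataclasses import dataclass

@dataclass
class Counts:
    tp: int  # sections correctly mapped to this label
    fp: int  # sections wrongly mapped to this label
    fn: int  # sections of this label that were missed

def precision(c):
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0

def recall(c):
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0

def f_measure(p, r):
    # Assumption: standard harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical confusion counts, for illustration only.
counts = {
    "Introduction": Counts(tp=90, fp=10, fn=0),
    "Methods": Counts(tp=80, fp=20, fn=20),
    "Results": Counts(tp=45, fp=5, fn=15),
    "Discussion": Counts(tp=30, fp=10, fn=10),
}

per_section = {
    name: (precision(c), recall(c), f_measure(precision(c), recall(c)))
    for name, c in counts.items()
}

# Averages over the four sections, matching P_avg and R_avg.
p_avg = sum(p for p, _, _ in per_section.values()) / len(per_section)
r_avg = sum(r for _, r, _ in per_section.values()) / len(per_section)
```

With these made-up counts the per-section precisions are 0.90, 0.80, 0.90, and 0.75, so the average precision is 3.35 / 4 = 0.8375.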

FIGURE 9. Section wise comparison of precision

Figure 9 shows the comparison of the precision of our approach with state-of-the-art approaches. It is quite evident from the graph that all three approaches mapped the introduction with high accuracy but failed to accurately map the other sections. On the other hand, our approach performed better and achieved almost similar precision in the mapping of all the sections. Further, the other approaches were unable to identify the Result or Discussion section. Similarly, Figure 10 shows the comparison of recall of our proposed approach with all three state-of-the-art approaches.

Since the proposed approach has an application in information retrieval (IR) systems, which are specifically concerned with returning most of the correct responses against a query, we have drawn comparisons here on the basis of the precision score. The evaluation results of our approach illustrate the increased value of precision, and when all the features were assessed in a hybrid manner, the proposed approach clearly outperformed the approaches [2], [6], [8]. We believe that this improvement is due to the consideration of objects in XML files.

The analysis discussed above revealed that our study has outperformed all three existing state-of-the-art studies on section mapping, achieving 8.96%, 21.77%, and 9.50% improvement over Ding et al. [8], Raja Habib and M. T. Afzal [6], and A. Shahid and M. T. Afzal [2], respectively. This signifies the potential of all the novel parameters like In-Text Citation count, Figure count, and Table count. We argue that the worth of XML objects must not be overlooked in section detection. The collective contribution of all the employed parameters and the potential of the designed regular expressions can adequately overcome the deficiencies in the existing state-of-the-art in section mapping.

FIGURE 12. Comparison of average precision of combined sections

VII. PERFORMANCE EVALUATION COMPARISON
We have evaluated the performance of our approach against state-of-the-art techniques. All experiments were performed in the same environment on AWS medium instances. We performed the experiments in multiple iterations and report the median values. Table 2 shows the time in seconds. The approach of Ding et al. [8] consumed comparatively less time than all other approaches. This is because Ding et al. [8] have only used dictionary terms to identify sections and map them to IMRAD. A. Shahid and M. T. Afzal [2] took a bit longer than Ding et al. [8] because Shahid’s approach also uses templates along with the dictionary terms. Raja Habib and M. T. Afzal [6] took longer than the previous two approaches because counting the citations and parsing the whole text take a lot of time. Our proposed approach consumed the most time because it applies multiple techniques for section identification and mapping. This is to be expected, because our approach uses all the features employed by the state-of-the-art techniques [2], [6], [8] along with the novel proposed features. The whole process was done offline, and we have maintained all the information in the database. Although the proposed technique takes more time than state-of-the-art techniques, this is not important because all of this processing will be done on the backend and all of these results will be precompiled before they are made available to be used in the online system. This indicates that when a user performs such a query, there is no difference between the time taken by any of the approaches. Instead, all of the approaches will service the user based on the processed data.

TABLE 2. Performance comparison with state-of-the-art techniques

Technique                          Time in seconds
Ding et al. [8]                    10
A. Shahid and M. T. Afzal [2]      14
Habib et al. [6]                   21
Proposed                           34

This paper has presented a novel method to map logical sections of PDF formatted research papers to the IMRAD structure. Unlike contemporary state-of-the-art, the approach has been designed after a critical manual investigation of one of the most recent approaches of logical sections mapping. Our study has identified a set of novel features and justified their potential by evaluating them on a benchmark data set. These features include: (1) In-Text Citation count, (2) Figure count, and (3) Table count. The study generated XML files from PDF files using PDFX [22] and exploited XML headers and objects with the help of regular expressions to detect logical sections and sub-sections and to map them to their corresponding section of IMRAD. The study has outperformed contemporary studies by attaining a precision of 0.97 (vs. 0.85) and a recall of 0.98 (vs. 0.86). This improved precision and recall is due to the incorporation of the features that have been overlooked by the contemporary state-of-the-art. The tool PDFX was unable to process approximately 10% of the PDF files into XML files. In the future, attention may be paid to methods of section identification in the scenarios wherein PDFX [22] fails.

ACKNOWLEDGMENT
The authors acknowledge Ms. Faiza Qayyum for helping us in the linguistic review of the paper and giving feedback to make it readily understandable.

REFERENCES
[1] P. O. Larsen and M. von Ins, “The rate of growth in scientific publication and the decline in coverage provided by science citation index,” Scientometrics, vol. 84, no. 3, pp. 575–603, Mar. 2010. [Online]. Available: https://doi.org/10.1007/s11192-010-0202-z
[2] A. Shahid and M. T. Afzal, “Section-wise indexing and retrieval of research articles,” Cluster Computing, vol. 21, no. 1, pp. 481–492, May 2017. [Online]. Available: https://doi.org/10.1007/s10586-017-0914-4
[3] L. B. Sollaci and M. G. Pereira, “The introduction, methods, results, and discussion (IMRAD) structure: a fifty-year survey,” J Med Libr Assoc, vol. 92, no. 3, pp. 364–367, Jul 2004.
[4] G. Batmanabane, “The IMRAD structure,” in Reporting and Publishing Research in the Biomedical Sciences. Springer Singapore, Dec. 2017, pp. 1–4. [Online]. Available: https://doi.org/10.1007/978-981-10-7062-4_1
[5] C. Lange and A. Di Iorio, “Semantic publishing challenge – assessing the quality of scientific output,” vol. 475, 08 2014.

[6] R. Habib and M. T. Afzal, “Sections-based bibliographic coupling for research paper recommendation,” Scientometrics, vol. 119, no. 2, pp. 643–656, Mar. 2019. [Online]. Available: https://doi.org/10.1007/s11192-019-03053-8
[7] A. Khan, A. Shahid, and M. Afzal, “Extending co-citation using sections of research articles,” Turkish Journal of Electrical Engineering and Computer Sciences, vol. 26, pp. 3346–3356, 11
[8] Y. Ding, X. Liu, C. Guo, and B. Cronin, “The distribution of references across texts: Some implications for citation analysis,” Journal of Informetrics, vol. 7, no. 3, pp. 583–592, Jul. 2013. [Online]. Available: https://doi.org/10.1016/j.joi.2013.03.003
[9] B. Fecher and S. Friesike, Open Science: One Term, Five Schools of Thought. Cham: Springer International Publishing, 2014, pp. 17–47.
[10] S. Teufel, A. Siddharthan, and C. Batchelor, “Towards domain-independent argumentative zoning: Evidence from chemistry and computational linguistics,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, Aug. 2009, pp. 1493–1502. [Online]. Available: https://www.aclweb.org/anthology/D09-1155
[11] S. Teufel and M. Moens, “Summarizing scientific articles: Experiments with relevance and rhetorical status,” Computational Linguistics, vol. 28, no. 4, pp. 409–445, Dec. 2002. [Online]. Available:
[12] M. Liakata, S. Teufel, A. Siddharthan, and C. Batchelor, “Corpora for the conceptualisation and zoning of scientific papers,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). Valletta, Malta: European Language Resources Association (ELRA), May 2010.
[13] S. Teufel, “Citations and sentiment. in: Workshop on text mining for scholarly communications and repositories, university of manchester, uk (2009),” 2009.
[14] S. Maričić, J. Spaventi, L. Pavičić, and G. Pifat-Mrzljak, “Citation context versus the frequency counts of citation histories,” Journal of the American Society for Information Science, vol. 49, no. 6, pp. 530–540, 1998. [Online]. Available: https://doi.org/10.1002/(sici)1097-4571(19980501)49:6<530::aid-asi5>3.0.co;2-8
[15] S. Peroni, D. Shotton, and F. Vitali, “Faceted documents,” in Proceedings of the 2012 ACM symposium on Document engineering - DocEng '12. ACM Press, 2012. [Online]. Available: https://doi.org/10.1145/2361354.2361396
[16] Y. Guo, A. Korhonen, M. Liakata, I. Silins, J. Hogberg, and U. Stenius, “A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment,” BMC bioinformatics, vol. 12, p. 69, 03 2011.
[17] Ş. Kafkas, X. Pi, N. Marinos, F. Talo’, A. Morrison, and J. R. McEntyre, “Section level search functionality in europe PMC,” Journal of Biomedical Semantics, vol. 6, no. 1, p. 7, 2015. [Online]. Available: https://doi.org/10.1186/s13326-015-0003-7
[18] S. Nazir, M. Asif, S. Ahmad, F. Bukhari, M. T. Afzal, and H. Aljuaid, “Important citation identification by exploiting content and section-wise in-text citation count,” PLOS ONE, vol. 15, no. 3, p. e0228885, Mar. 2020. [Online]. Available: https://doi.org/10.1371/journal.pone.0228885
[19] S.-U. Hassan, M. Imran, S. Iqbal, N. R. Aljohani, and R. Nawaz, “Deep context of citations using machine-learning models in scholarly full-text articles,” Scientometrics, vol. 117, no. 3, pp. 1645–1662, Oct. 2018. [Online]. Available: https://doi.org/10.1007/s11192-018-2944-y
[20] D. Pride and P. Knoth, “Incidental or influential? - challenges in automatically detecting citation importance using publication full texts,” in Research and Advanced Technology for Digital Libraries. Springer International Publishing, 2017, pp. 572–578.
[21] J. Beel and B. Gipp, “Google scholar’s ranking algorithm: An introductory overview,” 07 2009.
[22] A. Constantin, S. Pettifer, and A. Voronkov, “PDFX,” in Proceedings of the 2013 ACM symposium on Document engineering - DocEng '13. ACM Press, 2013. [Online]. Available:
[23] R. Ahmad and M. T. Afzal, “CAD: an algorithm for citation-anchors detection in research papers,” Scientometrics, vol. 117, no. 3, pp. 1405–1423, Sep. 2018. [Online]. Available: https://doi.org/10.1007/s11192-018-

IBRAR AHMED earned his master’s in computer science from International Islamic University Islamabad. Later he earned his MS in Computer Engineering Degree from the University of Science and Technology Taxila. He is working in a research and development organization as a Database Specialist. Currently, he is pursuing his Ph.D. in computer science. He authored multiple books in the database field. His research interest includes Databases, digital libraries, and Data Science.

MUHAMMAD TANVIR AFZAL received the Ph.D. degree with highest distinction in Computer Science from the Graz University of Technology, Austria, and secured a Gold medal in his M.Sc Computer Science from Quaid-i-Azam University, Islamabad, Pakistan. He has been associated with academia and industry at various levels for the last 20 years, and currently, he is serving as Professor in the Department of Computer Science at Capital University of Science and Technology, Islamabad. He is also serving as editor-in-chief for a reputed impact factor journal known as: Journal of Universal Computer Science. Dr. Afzal authored more than 100 research papers in leading peer-reviewed journals and conferences in the field of Data Science, Information retrieval and visualization, Semantics, Digital Libraries, and Scientometrics. He has authored two books and has edited two books in Computer Science. His cumulative impact factor is 60+, with citations over 500. He played a pivotal role in making collaborations between MAJU-JUCS, MAJU-IICM, and TUG-UNIMAS. He served as a Ph.D. symposium chair, session chair, finance chair, committee member, and editor of several IEEE, ACM, Springer, Elsevier international conferences, and journals. Dr. Afzal conducted more than 100 curricular, co-curricular, and extra-curricular activities in the last 5 years including seminars, workshops, national competitions (ExcITeCup), and invited international and national speakers from Google, Oracle, IICM, IFIS, SEGA Europe, etc. Under his supervision, more than 60 post-grad students (MS and Ph.D.) have defended their research theses successfully and a number of Ph.D. and MS students are pursuing their research with him.

