Deep Learning For Technical Document Classification: IEEE Transactions On Engineering Management March 2022
All content following this page was uploaded by Shuo Jiang on 10 March 2022.
Abstract—In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and automated document classification. Prior studies have only focused on processing text for classification, whereas technical documents often contain multimodal information. To leverage multimodal information for document classification to improve the model performance, this article presents a novel multimodal deep learning architecture, i.e., TechDoc, which utilizes three types of information, including natural language texts and descriptive images within documents and the associations among the documents. The architecture synthesizes the convolutional neural network, recurrent neural network, and graph neural network through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model for classifying documents based on the hierarchical International Patent Classification system. Our results show that TechDoc presents a greater classification accuracy than the unimodal methods and other state-of-the-art benchmarks. The trained model can potentially be scaled to millions of real-world multimodal technical documents, which is useful for data and knowledge management in large technology companies and organizations.

Index Terms—Artificial intelligence, deep learning, document classification, neural networks, technology management.

Manuscript received December 10, 2021; revised January 27, 2022; accepted February 14, 2022. This work was supported in part by the SUTD-MIT International Design Center and SUTD Data-Driven Innovation Laboratory, in part by the National Natural Science Foundation of China under Grant 52035007 and Grant 51975360, in part by the Special Program for Innovation Method of the Ministry of Science and Technology, China under Grant 2018IM020100, in part by the National Social Science Foundation of China under Grant 17ZDA020, and in part by the China Scholarship Council. Review of this manuscript was arranged by Department Editor K.-K. R. Choo. (Corresponding author: Shuo Jiang.)
Shuo Jiang and Jie Hu are with the School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: jsmech@sjtu.edu.cn; hujie@sjtu.edu.cn).
Christopher L. Magee is with the Institute for Data, Systems and Society, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: cmagee@mit.edu).
Jianxi Luo is with the Engineering Product Development, Singapore University of Technology and Design, Singapore 487372 (e-mail: luo@sutd.edu.sg).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TEM.2022.3152216.
Digital Object Identifier 10.1109/TEM.2022.3152216

I. INTRODUCTION
ENGINEERING processes involve significant technical and organizational knowledge and comprise a sequence of activities, such as design, analysis, and manufacturing [1]. During these engineering activities, a large amount of data and knowledge is generated and stored in various types of technical documents, such as technical reports, emails, papers, and patents [2], [3]. Prior studies have reported that engineers spend two-thirds of their time communicating to obtain a related document input for their work and make decisions based on such materials [4]. It is widely believed that 20% of engineering information can be extracted from a database comprising numeric data only, and the remaining 80% is hidden in the documents [5]–[7]. Feldman et al. [8] similarly asserted that 80% of explicit knowledge in companies can be found in their documents. With an increase in the scale and complexity of engineering activities, technical documents are being created at a greater pace than before [9].

A well-organized technical document classification enables engineers to retrieve and reuse documents more easily. However, a continuously increasing volume of technical documents requires engineers to spend much more time managing them than before. The label assignment and categorization of technical documents are human labor-intensive, expensive, and time-consuming. Because these documents are usually lengthy and full of complicated technical terminologies, it is also difficult to find specific experts to handle them. Specialized individual experts with limited knowledge and cognitive capacity might not be able to accurately determine the labels or categories of specific documents in a wide (many diverse classes) and deep (multilevel hierarchy) classification system. Therefore, we turn to artificial intelligence for reducing the time and cost and ensuring an accurate classification of technical documents.

Several prior studies have explored the use of machine learning algorithms to automatically classify technical documents for knowledge management [10]. Among them, several traditional methods, such as the K-nearest neighbors (KNN) and support vector machine (SVM), are not scalable and are incapable of classifying documents that large engineering companies, such as Boeing or General Motors, need to manage. Recent deep learning-based approaches have demonstrated the ability of a scalable classification on large engineering document datasets [11]. They focused on textual information and applied various natural language processing (NLP) techniques to develop automated classifiers, such as recurrent neural networks (RNN), long short-term memory (LSTM) networks, and specific pretrained models. However, the performance of current systems is insufficiently reliable for real-world applications. For example, Li et al. [12] reported that their convolutional neural network (CNN)-based classifier achieved a precision of 73% and an F1 score of 42% on their curated dataset of two million patent documents.

Technical documents normally contain both text and images [13]. Ullman et al. studied the importance of technical drawings
0018-9391 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on March 09,2022 at 06:48:15 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
in the engineering design process [14]. In a technical document, visual information often plays an important role in presenting its novelty [15], [16]. In recent years, the data science community has explored multimodal deep learning that can utilize, process, and relate information from multiple modalities, and reported the superiority of a multimodal model over unimodal models for various tasks [17], [18], which presents new opportunities to improve the technical document classification. Also, technical documents are often associated with one another, via interdocument references or citations, indicating their coupling and embeddedness in a greater nearly-decomposable knowledge system [19]. Recent advanced graph neural networks (GNNs) enable us to learn such relational information and classify individual nodes into several predefined categories [20]. In addition, the technology knowledge space is a natural complex system and constitutes many knowledge categories and subcategories corresponding to different technology fields [21]. When considering the classification of a large number and diversity of documents, a hierarchical classification system is required to assign documents into multilayer categories, which has not been supported by current technical document classifiers yet.

In this article, we propose a multimodal deep learning-based model, i.e., TechDoc, for the accurate hierarchical classification of technical documents. Our intended contribution is to the engineering management community, with a focus on engineering document management, especially for large engineering companies. Engineering documents are normally multimodal, and their classification needs to be hierarchical. Relevant automated classification methods that specifically address such requirements do not exist. Thus, this work is expected to bridge the gap between the up-to-date multimodal deep learning techniques and engineering document management.

Our TechDoc model utilizes three types of information (i.e., text, image, and network) of engineering documents for the automated classification by synthesizing the CNN, RNN, and GNN. To illustrate the proposed method, we applied it to a benchmark patent dataset and trained the model to classify technical documents based on the hierarchical International Patent Classification (IPC) system as the evaluation case study, where it outperforms other existing classification methods. In addition, as far as we know, this study is the first effort to utilize and synthesize intrinsic information within technical documents and the associations among the documents to automate hierarchical technical document classification.

Taken together, this research contributes to the growing literature on engineering knowledge management [22]–[24], patent analysis [25]–[29], and data-driven engineering applications [30]–[33].

This article is organized as follows. In Section II, we briefly review the relevant literature about document classification and multimodal deep learning. Section III introduces the proposed TechDoc model in detail. Section IV presents a case study on a patent document dataset followed by a discussion on applications of the TechDoc in Section V. Finally, Section VI concludes this article and discusses the limitations and opportunities for future research.

II. LITERATURE REVIEW AND BACKGROUND

A. Document Classification
Document classification is a fundamental task in NLP and text mining, and to date, a wide variety of algorithms have exhibited significant progress. Traditional document classification approaches represent text with sparse lexical features, such as term frequency-inverse document frequency (TF-IDF) and N-grams, and then use a linear model (e.g., logistic regression) or kernel methods (e.g., SVM) based on these representations [34], [35]. In recent years, the development of high-performance computing has enabled us to take advantage of various deep learning methods and end-to-end training and learning, including CNNs [36], RNNs [37], capsule neural networks [30], and transformers [39]. For example, Joulin et al. [40] proposed a simple but efficient model, called FastText, which views the text as a bag of words and then passes them through one or more multilayer perceptrons for classification. Lai et al. [41] proposed a recurrent CNN (RCNN) for text classification without human-designed features. This model applied a recurrent structure to extract long-range contextual dependence when learning representations. In addition, Yang et al. [42] proposed a hierarchical attention network for document classification. In this model, the hierarchical structure mirrors the natural hierarchical structure of documents, and attention mechanisms are applied at both word- and sentence-level structures, enabling it to differentially attend to less and more important content when learning document representations. However, these models only use natural language data as the representation of documents, and they are usually trained on a general document corpus that often involves a wide range of nonengineering topics.

A few studies on technical document classification already exist in the engineering field. For example, Caldas and Soibelman [10] described a document classification method based on a hierarchical structure from the Construction Specifications Institute. They used TF-IDF to represent the text and trained an SVM classifier to categorize the documents. The experiments were conducted using a dataset of 3030 documents. Similarly, Chagheri et al. [43] used an SVM algorithm to train a classifier that helped Continew Company classify and manage technical documents. Their model was trained and evaluated on a small set of 800 documents. These initial studies used traditional nonscalable machine learning techniques and were illustrated with small document sets.

Patent documents represent typical and complex engineering design documents. Prior studies have focused on patent document classification utilizing various machine learning and NLP techniques. For example, Fall et al. [44] and Tikk et al. [45] separately presented several basic classifiers on the World Intellectual Property Organization (WIPO) dataset, including naive Bayes (NB), KNN, and SVM. The CLEF-IP tracks included a patent classification task [46], [47], which provided a dataset of more than 1 million patents as the training set to classify 3000 patents into their IPC subclasses. All winning models were based on the Winnow classifier, and triplet features were used as the input [48]. Later, Li et al. [49] proposed a forward ANN-based
model and employed the Levenberg–Marquardt algorithm to train the model on a small dataset. Wu et al. [50] proposed a hybrid genetic algorithm with an SVM to classify 234 patents into 2 sets. Similar to the trend of general document classification, these early machine learning-based studies used manually selected features or statistic-based features as the representation of patents, which may lead to a loss of information.

Several recent deep learning-based approaches have been applied to patent classification research. Grawe et al. [51] proposed an approach that integrates Word2Vec and LSTM to classify patents into 50 categories. Likewise, Shalaby et al. [52] represented patent documents as fixed hierarchy vectors and used an LSTM-based architecture to classify them. Risch et al. [53] proposed domain-specific word embeddings and designed a gated recurrent unit (GRU) network based model for the patent classification task. Li et al. [12] presented the DeepPatent algorithm based on CNNs and word vectors. Recent studies have also employed transfer learning on large pretrained language models, including ULMFiT and bidirectional encoder representations from transformers (BERT). Hepburn et al. [54] proposed a patent classification framework based on the SVM and ULMFiT techniques. Kang et al. [55] and Lee and Hsiang [56] fine-tuned the BERT pretrained model to address the patent prior art search task. Abdelgawad et al. [57] applied state-of-the-art hyperparameter optimization techniques to the patent classification problem and presented their effects on the accuracy.

Although deep learning models have achieved a better performance than traditional machine learning-based methods, several limitations remain. First, the performances of current state-of-the-art systems are not sufficiently reliable for real-world large-scale complex technical document management systems [11]. Because existing classification approaches solely use text as the model input and disregard the figures in technical documents, new opportunities exist to improve the classification performance when using multimodal deep learning techniques. Second, all existing approaches are aimed at assigning labels at a single level. To develop a scalable and fine-grained classification system for technical documents, a hierarchical classification is desired [58]. Furthermore, prior studies have used inconsistent datasets and classification schemes for model training and testing, which makes benchmarking and comparisons difficult. A golden standard dataset, such as the patent dataset and the IPC system, may provide a common ground for model training and performance benchmarking of different models.

B. Multimodal Deep Learning
Multimodal deep learning aims to design and train models that can utilize, process, and relate information from multiple modalities [59]. Most related reviews claim the superiority of multimodal over unimodal approaches for a series of tasks, including retrieval, matching, and classification [17], [18]. The most common multimodal sources are text, images, videos, and audio. Various multimodal deep architectures have been proposed to leverage the advantages of multiple modalities. For example, Ngiam et al. [60] proposed the learning of shared representations over multiple modalities using multimodal deep learning. Specifically, they concatenated high-level representations and trained two restricted Boltzmann machines to reconstruct the original representations of audio and video. Srivastava and Salakhutdinov [61] proposed a similar approach to modify the feature learning and reconstruction process using deep Boltzmann machines.

Furthermore, various neural network architectures have been used to construct multimodal representations [62]–[64]. Each modality starts with some individual neural layers, followed by a specific hidden layer that projects multiple modalities into a joint latent space. The joint representation is then fed into multiple hidden layers or directly followed by a final supervised layer for downstream tasks. Audebert et al. [65] proposed a deep learning-based infused multimodal classifier for documental image classification, utilizing both visual pixel information and the textual content in the images. Their model adopted the MobileNet-v2 model to learn visual features and a simple LSTM network with the FastText representation model [40] to process text data. Despite these promising applications of multimodal deep learning to multimodal data, multimodal technical documents remain mostly unexplored.

III. METHOD
This section proposes TechDoc, a novel deep learning architecture for multimodal technical document classification. As depicted in Fig. 1, the entire workflow consists of three steps:
1) data preprocessing;
2) image and text fusion learning;
3) network feature fusion learning and document classification.

In the first step, several text preprocessing methods, such as tokenization, phrasing, denoising, lemmatization, and stop-word removal, are applied to convert documents into a suitable representation for the classification model. Compound images are separated into individual ones based on a pretrained CNN model. A network of the documents based on their associations is constructed. In the second step, image and text feature vectors are jointly trained via neural networks and fused via stepwise concatenation operations. In the third step, the fused features derived from the second step are used as the input document vectors, together with the interdocument association network information, for network fusion learning and final document classification.

Fig. 2 shows the architecture of TechDoc. It consists of three major modules: text feature learning with RNN, image feature learning with CNN, and network feature learning with GNN. The image and text feature learning modules (based on CNN and RNN, respectively) are fused first to represent the intrinsic features for individual documents, and then fused with the network learning module (based on GNN) for document classification. In the following, we will describe in detail these modules and how they are synthesized together in our architecture.

A. Image Learning Module
To extract visual features from the technical images, a pretrained CNN model, i.e., VGG-19 [66], is utilized as the image
information encoder. It is a robust CNN model and has been widely used in many computer vision applications. It has 19 weight layers (16 convolutional and 3 fully connected), together with max-pooling and dropout layers. The VGG-19 model we used is pretrained on ImageNet [54]. Then, transfer learning techniques are used to fine-tune the image encoder. Specifically, the final prediction layer of the model is removed and replaced by a new fully connected layer, a dense layer, and an output layer on the top. The modified model is aimed at classifying each image into predefined categories to learn corresponding knowledge. Following the training process, the second-to-last fully connected layer of the model is used to extract high-dimensional vectors as image features ($v_{img}$).

B. Text Learning Module
In this part, our model aims to learn word-level, sentence-level, and document-level information from textual information. The word encoder is built on the bidirectional RNN [67], which enables the utilization of the flexible length of contexts before and after the current word position. We used the GRU [68] to track the state of the input sequences without using separate memory cells, which is well suited for extracting long-range dependencies on different time scales. There are two types of gates in the GRU: reset gate $r_t$ and update gate $z_t$. Both aim to control the update of information to the state. At time $t$, the GRU computes its new state as

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t. \tag{1}$$

This is a linear interpolation between the old state $h_{t-1}$ and the candidate state $\tilde{h}_t$ obtained using the new sequence information. The update gate $z_t$ controls how much previous information will remain and how much new information will be added. Here, $z_t$ is computed as

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \tag{2}$$

where $x_t$ indicates the embedding vector at time $t$, and $W$, $U$, and $b$ denote the appropriately sized matrices of the weights and biases, respectively. The symbol $\sigma$ is a sigmoid activation function, and the operator $\odot$ represents an elementwise multiplication. The candidate state $\tilde{h}_t$ is computed as

$$\tilde{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h) \tag{3}$$

where the reset gate $r_t$ determines how much information from the old state is added to the current state.
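To make the update in (1)-(3) concrete, the following is a minimal pure-Python sketch of one GRU step. It is an illustration under stated assumptions, not the authors' implementation: the text describes the reset gate only in words, so the standard form $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$ is assumed here, and the function names `gru_step` and `zero_params` and the parameter dictionary layout are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function sigma(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gru_step(x_t, h_prev, p):
    """One GRU update following Eqs. (1)-(3).

    p maps names such as "Wz", "Uz", "bz" to weights. The reset-gate
    parameters (Wr, Ur, br) are an assumption: the standard form
    r_t = sigma(Wr x_t + Ur h_{t-1} + br) is used.
    """
    # Eq. (2): update gate z_t
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev), p["bz"])]
    # Assumed reset gate r_t (same form as the update gate)
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev), p["br"])]
    # Eq. (3): candidate state, with the reset gate applied to U_h h_{t-1}
    uh = matvec(p["Uh"], h_prev)
    h_cand = [math.tanh(a + ri * u + c) for a, ri, u, c in
              zip(matvec(p["Wh"], x_t), r, uh, p["bh"])]
    # Eq. (1): interpolate between the old state and the candidate state
    return [(1.0 - zi) * hp + zi * hc
            for zi, hp, hc in zip(z, h_prev, h_cand)]


def zero_params(n_in, n_hid):
    """Hypothetical helper: all-zero weights, handy for sanity checks."""
    p = {}
    for gate in ("z", "r", "h"):
        p["W" + gate] = [[0.0] * n_in for _ in range(n_hid)]
        p["U" + gate] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b" + gate] = [0.0] * n_hid
    return p
```

With all-zero weights, $z_t = 0.5$ and the candidate state is zero, so `gru_step` returns half of the previous state; this is a quick check that (1) behaves as the interpolation described in the text.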
Similar to the unidirectional GRU, the bidirectional GRU processes the input data in two directions with both the forward and backward hidden layers. The computational results of both directions are then concatenated as the output. Let $\overrightarrow{h_t}$ be the forward output of the bidirectional GRU and $\overleftarrow{h_t}$ be the backward output. The final output is the stepwise concatenation of both forward and backward outputs

$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]. \tag{4}$$

Then, the sentence encoder utilizes the word-level representation as the input to build sentence-level vectors using the embedding layer and bidirectional GRU layers. After that, the sentence-level vectors are converted into document-level vectors using different bidirectional GRU layers. Note that not all words and sentences contribute equally to the vector representation. Accordingly, we introduce the attention mechanism [68] to identify essential items for the model.

Assume that the input text has $M$ sentences, and each sentence contains $T_i$ words. Let $w_{it}$ with $t \in [1, T]$ represent the words in sentence $i$. Given a word $w_{it}$, the embedding layer and bidirectional GRU layer convert it into the hidden state $h_{it}$ as

$$\overrightarrow{h_{it}} = \overrightarrow{\mathrm{GRU}}(W_e w_{it}), \quad t \in [1, T]$$
$$\overleftarrow{h_{it}} = \overleftarrow{\mathrm{GRU}}(W_e w_{it}), \quad t \in [T, 1]$$
$$h_{it} = [\overrightarrow{h_{it}}, \overleftarrow{h_{it}}] \tag{5}$$

where $W_e$ indicates the embedding layer matrix, and $\overrightarrow{\mathrm{GRU}}$ and $\overleftarrow{\mathrm{GRU}}$ represent the operations mentioned in the previous section. Then, the attention weights of words $\alpha_{it}$ and sentence vectors $s_i$ can be computed as follows:

$$u_{it} = \tanh(W_w h_{it} + b_w)$$
$$\alpha_{it} = \frac{\exp(u_{it}^{T} u_w)}{\sum_t \exp(u_{it}^{T} u_w)}$$
$$s_i = \sum_t \alpha_{it} h_{it} \tag{6}$$

where the context vector $u_w$ can be viewed as a high-level representation of a fixed input over words [69], [70], and is randomly initialized and updated jointly during the training process. Then, another bidirectional GRU layer is used to convert the sentence vectors $s_i$ into hidden state $h_i$ as

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(s_i), \quad i \in [1, M]$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(s_i), \quad i \in [M, 1]$$
$$h_i = [\overrightarrow{h_i}, \overleftarrow{h_i}]. \tag{7}$$

The attention weights of sentences $\alpha_i$ and document vectors $v$ can be computed as

$$u_i = \tanh(W_s h_i + b_s)$$
$$\alpha_i = \frac{\exp(u_i^{T} u_s)}{\sum_i \exp(u_i^{T} u_s)}$$
$$v = \sum_i \alpha_i h_i \tag{8}$$

where $u_s$ represents the sentence-level context vector, and is randomly initialized and updated, similar to $u_w$.

Through the above training process, the derived document vector $v$ contains hierarchical semantic information from both word-level and sentence-level structures in a technical document. Thus, we call it $v_{txt}$ in the following sections.

C. Image and Text Feature Fusion Learning
In this part, the text feature $v_{txt}$ and image feature $v_{img}$ are fused and jointly trained via several fully connected layers and stepwise concatenation operations. The fully connected and concatenated layers at the end of the model are designed to form a hierarchical learning structure to enable hierarchical classification. For the first-level classification, the fusion process is computed as

$$v_{txt}^{l1} = \sigma(W_{txt}^{l1} v_{txt} + b_{txt}^{l1})$$
$$v_{img}^{l1} = \sigma(W_{img}^{l1} v_{img} + b_{img}^{l1})$$
$$v_{pat}^{l1} = [v_{txt}^{l1}, v_{img}^{l1}]$$
$$p^{l1} = \mathrm{softmax}(W_{pat}^{l1} v_{pat}^{l1} + b_{pat}^{l1}) \tag{9}$$

where $\mathrm{softmax}(\cdot)$ is the softmax activation function, and $p^{l1}$ is the predicted probability vector for the first-level category (not the final classification result). The second-level and third-level classifications are computed analogously.

Categorical cross-entropy is used as the training loss: $L = -\sum y \log \hat{y}$, where $y$ and $\hat{y}$ denote the ground truth and the predicted label, respectively. Because each of the three tasks has an independent loss, the overall loss for our model is

$$L_{overall} = \sum_i \zeta_i L_i, \quad i = 1, 2, 3 \tag{10}$$

where $\zeta_i$ is the loss weight, and $\sum_i \zeta_i = 1$. Since lower-level classification is the main task in the case study, we set 0.05, 0.1, and 0.85 for the three loss weights.

The fully connected layer for the lower-level classification task learns information from the higher-level task through backpropagation. The numbers of neurons in different layers were tuned for our evaluation case study, and the values are shown in Fig. 2. Finally, we can generate holistic feature vectors that contain both image and text information for individual technical documents from three concatenated layers, corresponding to the hierarchical classification task.

D. Image, Text, and Network Fusion Learning
Training through the previous steps yields fused feature vectors (containing both image and text information) for individual technical documents. These vectors are then used
TABLE I
EXAMPLE OF THE IPC HIERARCHICAL STRUCTURE
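The hierarchical levels of an IPC symbol are prefixes of the symbol itself, which is what makes a three-level label straightforward to derive for each document. The following is a minimal sketch, assuming that the three classification levels correspond to the IPC section, class, and subclass, and that symbols are written like "B60L 11/18"; the function name `ipc_levels` is hypothetical and not taken from the paper.

```python
def ipc_levels(symbol):
    """Split an IPC symbol into (section, class, subclass).

    Assumes the usual layout: one section letter A-H, two class digits,
    one subclass letter, optionally followed by group information that
    is ignored here.
    """
    s = symbol.strip().upper()
    valid = (
        len(s) >= 4
        and s[0] in "ABCDEFGH"   # the IPC has eight sections, A to H
        and s[1:3].isdigit()     # two-digit class number
        and s[3].isalpha()       # subclass letter
    )
    if not valid:
        raise ValueError(f"not an IPC symbol: {symbol!r}")
    return s[0], s[:3], s[:4]
```

For example, `ipc_levels("B60L 11/18")` yields the section `"B"`, the class `"B60"`, and the subclass `"B60L"`, one label per level of the hierarchy.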
Fig. 6. Informative regions of five random patent images for the image encoder identified via CAM technique [86].

documents in are English, most of the words can be separated from each other by white spaces. Next, tokens are standardized and cleaned by a denoising step, which includes converting every term into lower case and removing numbers, punctuation, and other special characters. The third step is stop-word removal, which aims to drop frequently used stop-words and filler words that add no value to further analysis. We use a widely used list [83] and a USPTO patent stop-word list² to identify and remove stop words from the obtained tokens. In the last step, all tokens are converted into their regularized forms to avoid multiple forms of the same word and thus reduce the index size. This operation is achieved by first utilizing a POS tagger [84] to identify the types of tokens in a sentence and accordingly lemmatize them. For example, if the word "studying" is tagged as a VERB by the POS tagger, it would be converted into "study," but remain as "studying" when tagged as a NOUN. All text preprocessing steps were applied using the natural language toolkit (NLTK) [85], which is a suite of text processing libraries using the Python programming language.

c) Image and text feature extraction: As described in Section III, we use a VGG19 network as the image encoder [66]. Each individual technical image is represented as a 1024-dimensional vector. We utilize a class activation mapping (CAM) technique [86] to highlight the importance of the image region for the neural network. Fig. 6 illustrates the informative regions of five random patent images.

To encode the textual information, we used the TechNet pretrained word embedding vectors to represent every single token. TechNet is a semantic network consisting of words and phrases contained in patent titles and abstracts from the USPTO patent database, and also provides embedding vectors for technical terms [87]. Using TechNet, each word is converted into a 150-dimension vector as the input of the multimodal deep learning model.

B. Experiment Setup
To evaluate our proposed TechDoc trimodal deep learning model (text+image+network), we first compared it against three unimodal models (text-only, image-only, network-only) and two dual-modal models (text+image, text+network). All these mod-

original values for text and zero values for images. As for the network-only model, we adopt the basic majority voting strategy based on the forward citations of any given patent to make predictions. Moreover, we implemented five different patent classification algorithms [12], [52], [53], [56], [88], three general document classification algorithms [40]–[42] and a multimodal classification model [65] on the same dataset for comparison with TechDoc.

One challenge in training TechDoc is tuning the preset hyperparameters. Table II lists the hyperparameter space and the range of potential values or settings. We used a benchmark method to randomly search in the hyperparameter space to identify the configuration that leads to high performance [89]. In our study, 50 neural network models with different configurations were trained. The final selected settings for the text processing are shown in Table II, which represents the best combination in the potential hyperparameter setting space. For the neural network parameters, we set the dimension of the GRU to 128. In this case, as a combination of forward and backward GRUs, the dimensions of both word and sentence feature vectors are 256. We set the dimension of the fully connected layer to 256 with He's uniform initialization [90]. As for the training process, we set the batch size to 64. We set 25 as the maximum number of sentences in a document, and 10 as the maximum number of words in a sentence. In the image and text fusion learning module, we applied the Adam optimizer [91] with the learning rate selected by a grid search on the validation set. The iteration time was set to 10, which was sufficient for the convergence of the model's loss function. In the network learning module, we built a two-layer GraphSAGE model with 1024 nodes in each layer. As for the number of sampled neighbors, we set the sizes of 1-hop and 2-hop neighbor samples to be 5 and 2. The numbers of nodes in the final fully connected layers correspond to the specific tasks (8, 122, or 622). We stacked the GraphSAGE layers and fully connected layer in the model, and defined the categorical cross-entropy as the loss function. The Adam optimizer was used again for the GNN training. We set the iteration time of GNN training to 10. The aggregate functions and other parameters were set as suggested in the original GraphSAGE paper [20]. All experiments were conducted on a machine with an Nvidia Titan X 16 GB GPU and 64 GB of RAM.

We used the top-1, top-5, and top-10 accuracies and reciprocal average rank (RAR) measures to evaluate performances of different models. The top-K accuracy is the percentage of test documents whose correct label is among the K highest-scoring predictions. The RAR measures how far down the ranking the correct label is. It can be calculated as follows:

$$\mathrm{RAR} = \frac{1}{N} \cdots \tag{11}$$
els are originated from one or more modules of the total TechDoc 1/N × i=1 rank(yj )
architecture and trained with the same experimental settings. For
example, in the text-only unimodal model, the input data have
where N indicates the number of documents, and rank(yj )
2 [Online]. Available: http://patft.uspto.gov/netahtml/PTO/help/stopword. represents the ranking position of the ground-truth label in the
htm predicted score list.
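As a minimal sketch of the two evaluation measures described above, the following Python snippet computes the top-K accuracy and the RAR (the reciprocal of the mean rank of the ground-truth label) from per-document score lists. The function names, the toy scores, and the tie-handling rule are illustrative assumptions and are not taken from the TechDoc implementation.

```python
from typing import Sequence


def rank_of_true_label(scores: Sequence[float], true_label: int) -> int:
    """Return the 1-based rank of the ground-truth label by descending score."""
    true_score = scores[true_label]
    # Assumption: ties are resolved in favour of the ground-truth label.
    return 1 + sum(1 for s in scores if s > true_score)


def top_k_accuracy(all_scores: Sequence[Sequence[float]],
                   true_labels: Sequence[int], k: int) -> float:
    """Fraction of documents whose true label is among the k best-scored labels."""
    hits = sum(
        1 for scores, y in zip(all_scores, true_labels)
        if rank_of_true_label(scores, y) <= k
    )
    return hits / len(true_labels)


def reciprocal_average_rank(all_scores: Sequence[Sequence[float]],
                            true_labels: Sequence[int]) -> float:
    """RAR = 1 / (mean rank of the ground-truth label over all documents)."""
    ranks = [rank_of_true_label(s, y) for s, y in zip(all_scores, true_labels)]
    return 1.0 / (sum(ranks) / len(ranks))


# Toy data: four documents, three candidate labels (hypothetical scores).
toy_scores = [
    [0.7, 0.2, 0.1],  # true label 0 is ranked 1st
    [0.1, 0.3, 0.6],  # true label 1 is ranked 2nd
    [0.5, 0.4, 0.1],  # true label 2 is ranked 3rd
    [0.2, 0.7, 0.1],  # true label 1 is ranked 1st
]
toy_labels = [0, 1, 2, 1]

print("top-1:", top_k_accuracy(toy_scores, toy_labels, 1))
print("top-2:", top_k_accuracy(toy_scores, toy_labels, 2))
print("RAR:", reciprocal_average_rank(toy_scores, toy_labels))
```

In this toy case the ranks are 1, 2, 3, and 1, so the mean rank is 1.75 and the RAR is 4/7. An RAR of 1 would mean that the ground-truth label is always ranked first.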
Authorized licensed use limited to: Shanghai Jiaotong University. Downloaded on March 09,2022 at 06:48:15 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE II
PRESETTING HYPERPARAMETER SPACE
TABLE III
SECTION (IPC 1-DIGIT) CLASSIFICATION RESULTS
TABLE IV
CLASS (IPC 3-DIGIT) CLASSIFICATION RESULTS
C. Experimental Results
To understand which modality is more critical for document classification, we conducted an ablation study to analyze the experimental results on the benchmark patent dataset for six models, including three unimodal models (image-only, text-only, network-only), two dual-modal models (text+image, text+network), and the trimodal TechDoc model (text+image+network). All the models were run ten times. The bold entries in Tables III–VI and VIII represent the best performances. The results (mean ± standard deviation) of all metrics are reported in Tables III–V. First, we can see that the TechDoc model, which fuses text, image, and network information, significantly outperforms the other models on all tasks and metrics based on Student’s t-test (p < 0.05) [95]. Looking at the three unimodal models, we find that the text-only and network-only models achieve reasonably good performance, which is much better than that of the image-only model. In addition, both dual-modal models outperform all three unimodal models. It is not surprising that removing the network learning module has more impact on the model performance than removing the image module. These findings reveal that the text and network information of technical documents is more important for classification, while involving technical visual information additionally brings a modest and consistent advantage for the model.
TABLE V
SUBCLASS (IPC 4-DIGIT) CLASSIFICATION RESULTS
TechDoc was then compared with nine prior relevant deep learning models, including five patent document classification models, three general text classification models, and one multimodal document classification model, on the same dataset. All used titles and abstracts as textual inputs and aimed at predicting the IPC 4-digit subclass labels. Specifically, for the fine-tuned BERT model, we leveraged the released BERT-Base pretrained model (Uncased: 12-layer, 768-hidden, 12-heads, 110 million parameters) [96], as the authors reported in [56]. Table VI shows the performance of each model. TechDoc outperforms the baselines significantly based on the t-test (p < 0.05) for all indicators.
TABLE VI
SUBCLASS (IPC 4-DIGIT) CLASSIFICATION RESULTS WITH DIFFERENT MODELS
There might be three reasons why our model performs better than the others. First, compared to the models that only process text data, adding image and network information brings additional predictive power (also shown in Tables III–V). Second, compared to the other dual-modal method (model (i)) that uses both text and image information for classification, the dual-modal model based on the TechDoc architecture (model (j)) also shows a performance advantage. This is because TechDoc is explicitly designed for technical document classification. In contrast, model (i) is designed for document image classification by utilizing visual pixel information and the textual content in the images. Third, our model aims to classify text documents into a given hierarchy, which enables the model to share technical knowledge at different levels, whereas other methods regard the predictions at different levels as separate tasks.
To evaluate the computing efficiency of the different models, we report the training time (ten epochs for each model) in Table VI. We can see that TechDoc consumes more training time (model (l), 95.3 min) than the unimodal methods that only process texts, but significantly less time than the second most accurate model, fine-tuned BERT (model (e), 1290.3 min). It is important to note that multimodal data fusion naturally requires more computation and training time than unimodal learning [97]. In our case, processing and fusing additional image and network information with texts is expected to incur extra computation and training time. We also find that the dual-modal method that fuses text and network information based on the TechDoc architecture substantially reduces training time while achieving better performance than the other baselines. When training efficiency is the top priority for users, they can choose the dual-modal method (text+network) to build their own document classifier.
We then conducted a time-complexity analysis [98] on the training time increase per unit of accuracy improvement of the models, using the fastest model, FastText, as the baseline. Δt denotes the difference between the training times of a model and FastText. Δa denotes the difference between the top-1 accuracy values of the model and FastText. Then, Δt/Δa indicates how much additional training time the model requires to obtain a unit accuracy increase. As reported in Table VII, our model took an additional 5.4 min for each 1% improvement in top-1 accuracy over FastText, which is better than the other multimodal model (model (i)) and the second most accurate model (model (e)). When using DeepPatent as the baseline, our model took an additional 21.4 min for each 1% improvement in top-1 accuracy.
TABLE VII
TIME-COMPLEXITY ANALYSIS WITH DIFFERENT MODELS
In sum, our model presents statistically significant accuracy advantages over all prior models, despite requiring additional training time to process and fuse additional image and network information. A user may employ TechDoc when classification performance is the top priority and the training time is affordable for the specific user. In the case study, training TechDoc took 95 min for around 0.6 million documents with only one GPU. In real-world applications with a much larger training dataset, users may consider training TechDoc on more powerful computing infrastructures, such as GPU clusters and cloud-computing platforms, to limit training time.
D. Comparison of Unimodal and Multimodal Models
Fig. 7 reports the confusion matrices of the three unimodal models and the multimodal model (i.e., TechDoc) for patent section classification.³ These matrices were built based on the predicted results of the test set. In Fig. 7, we can observe that TechDoc improves the classification of sections C, D, E, F, G, and H when compared to the best performance of the unimodal models. The increase in accuracy ranges from 0.02 to 0.08. The text-only model outperforms the image-only model in every section, reflecting the natural difficulty in classifying patents using only images, even for human experts. For example, some technical images only present a partial view of a specific design, which cannot represent the entire product. In Fig. 7(a) and (b), we find that both models achieve better accuracy in sections A, B, G, and H than in the other sections. The imbalanced patent dataset has a more significant impact on the image-only model than on the others. The image-only model even ignores section D, which is quite a small group. In Fig. 7(c), we find that the network-only model performs well for small groups, which shows the value of involving relational information. TechDoc [see Fig. 7(d)] outperforms all the unimodal models in section D. This may be because TechDoc can learn some internal relationships among the three modalities, and the superposition of information is nonlinear for the model. Moreover, in the second column of Fig. 7(b) and (c), we can see that most misclassifications occur in section B. The same situation is also shown in Fig. 7(d), although TechDoc alleviates this problem to a certain extent.
3 The brief meaning of eight sections (Section A–H) is shown in Fig. 4. The detailed descriptions of eight sections can be viewed at the WIPO website: https://www.wipo.int/classifications/ipc/ipcpub
Fig. 7. Confusion matrices of four models on the patent section (IPC 1-digit) classification task. (a) Image-only. (b) Text-only. (c) Network-only. (d) TechDoc (Image+Text+Network).
E. Analysis of the Performance Across Subclasses
In this section, we explore the classification performance across subclasses. We selected the 50 biggest subclasses based on their size (number of patents) and computed their precision, recall, and F1-score. The results are presented in Fig. 8. The subclasses are in descending order from left to right according to size.⁴
4 The detailed descriptions of the 50 subclasses can be viewed at the WIPO website: https://www.wipo.int/classifications/ipc/ipcpub
Fig. 8. Categorical classification performance (precision, recall, and F1-score) using TechDoc for the biggest 50 subclasses.
Fig. 8 shows that the performances of the 50 subclasses are very uneven. We should note that an imbalanced dataset might lead to bad performance for some small groups. Although the F1-score ranges from 0.27 to 0.88, most of the subclasses (43/50) achieve a score higher than 0.50. This result and the overall trends of the three curves reveal that there is only a modest correlation between performance and size for these 50 subclasses.
Among all subclasses, eight groups have an F1-score higher than 0.8. The best five groups are: G03G (electrography, electrophotography, and magnetography), A63B (apparatus for physical training), G11B (information storage based on relative movement between record carrier and transducer), H01L (semiconductor devices), and E21B (earth or rock drilling apparatus). By checking the brief descriptions of these subclasses, we find that most of them are related to devices, apparatus, or specific technical objects. Patents from these groups usually contain distinct domain-related information from both text and images, which may at least partially explain the better accuracy.
In addition, we find three subclasses whose F1-score is lower than 0.4: B32B (layered products), C07K (peptides), and H04J (multiplex communication). For the first subclass, B32B, the reason may be the abstract description of the domains themselves. Patent documents that belong to layered products can come from entirely different disciplines. According to official documents from the WIPO,⁵ both layered honeycomb and layered cellular are covered by this subclass. For the other two subclasses, C07K and H04J, some similar categories also exist. For example, our TechDoc model misclassified a patent in the test set named “transmission line monitoring system” (US12491709) into the subclass H04J. The ground-truth label of this patent is H04B (transmission), which shares similar technical content with H04J. Thus, when predicting such types of subclasses, the top-5 returned labels from our model can be more meaningful.
5 [Online]. Available: https://www.wipo.int/classifications/ipc/en/
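The per-subclass precision, recall, and F1-score discussed in the subclass analysis can be computed from the true and predicted labels of the test set. The snippet below is a minimal sketch under the assumption that each document has exactly one true label and one top-1 prediction; the function name and the toy label lists are hypothetical and only reuse IPC subclass codes mentioned in the text.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_prf(y_true: Sequence[str],
                  y_pred: Sequence[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, F1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label receives a false positive
            fn[truth] += 1  # true label receives a false negative

    results = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results


# Toy example (hypothetical predictions, not results from the paper).
toy_true = ["H04B", "H04B", "H04J", "G03G", "G03G", "G03G"]
toy_pred = ["H04J", "H04B", "H04J", "G03G", "G03G", "H04B"]

for label, (p, r, f1) in per_class_prf(toy_true, toy_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In the toy data, one H04B document is predicted as H04J, which lowers the precision of H04J and the recall of H04B at the same time. This is the kind of confusion between closely related subclasses that the text describes.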
TABLE VIII
SUBCLASS (IPC 4-DIGIT) CLASSIFICATION RESULT UNDER DIFFERENT NUMBER OF INPUT IMAGES
F. Discussion on the Influence of Data Preprocessing
As described in the previous section, the application of the proposed model requires a series of automatic data preprocessing steps, which may introduce biases. For the image preprocessing, we leveraged a fine-tuned CNN model to separate compound images with an accuracy of 92%, which means that a few compound images are still used as the input of the visual end during model training. A cleaner image dataset would further improve the current performance of our proposed model. In addition, the text-image pairs used in our experiments only contain one image (the image shown on the first page) per document. We tested some alternative settings (using five or all images per document as input images) and found that using one image per document achieves the best result (see Table VIII). There are several reasons. First and foremost, the first image (usually shown on the front page of the patent) is often the most important and representative one for a patent. Second, a patent may have several similar images, which contain redundant information for the model. Third, some types of images, such as flowcharts and tables, do not provide much technical visual information. Involving these kinds of images may even harm the performance of our model. We should note that the method we use to combine all images in the experiments is a simple form of early fusion. There is still room for future studies to explore ways to utilize all images of technical documents.
For the text preprocessing, we applied the general steps to clean the original technical text for further model training, including tokenization, denoising, stop-word removal, and lemmatization. It is noteworthy that all these automatic preprocessing steps may involve biases in real-world applications. Prior fundamental research [99] studied different combinations of preprocessing methods and their influence on model performance, and pointed out that the best text preprocessing strategy might differ across datasets. In future work, we plan to conduct a more detailed experiment to find the best text preprocessing strategy for technical document classification.
V. DISCUSSION ON APPLICATIONS OF THE TECHDOC
Thus far, we have presented a deep learning model for hierarchical classification of multimodal technical documents, i.e., TechDoc, which combines three types of information in training, and demonstrated and tested it with patent document classification. The application of TechDoc is not limited to patent data and IPC. In practice, it can also be trained on nonpatent technical documents for technological labels at different levels of a well-defined hierarchy. TechDoc can be used to help companies automatically classify and manage their technical documents, particularly newly generated ones, in several practical ways.
Table IX shows the taxonomy of neural network training strategies. First of all, companies that already have sufficient technical documents classified in their own document categorization system (which most large established companies have) can use their classified/labeled document data to train a model based on the TechDoc architecture and the workflow we introduced above and apply the trained model to automatically classify documents generated later. This is strategy #1 in the taxonomy in Table IX. Alternatively, they may directly use the model trained on patent data and IPC to automatically classify documents generated later, without requiring new categorical labels, i.e., strategy #2 in Table IX.
TABLE IX
TAXONOMY OF NEURAL NETWORK TRAINING STRATEGIES
Some companies (e.g., small enterprises or new startups) might not have enough documents of their own to train the TechDoc model from scratch. As illustrated in the case study, TechDoc can be trained using the multimodal engineering dataset that we created based on patent data. The trained model can automatically classify newly generated nonpatent documents into the respective categories in the IPC hierarchy. This is again strategy #2 in Table IX. Furthermore, the trained model can be used to extract the features of multimodal documents from the hidden layers. Using unsupervised clustering methods, such features enable the document owners to identify clusters or categories that are different from the initial predefined patent classification scheme, and guide them to define their own categories.
Companies that prefer their own document categorization system but have insufficient in-house labeled documents for high-performance neural network training may consider a transfer learning strategy. That is, they first utilize the large patent database to pretrain a neural network for the IPC task (like the one we trained in this research), and then further retrain/fine-tune the network with the relatively smaller set of proprietary documents and their categorical labels. This is strategy #3 in Table IX. For the transfer, one may freeze the parameters of the pretrained network, remove the topmost layers, add several layers on top, including a final fully connected output layer whose dimension matches the in-house categorization system, and randomly initialize the new parameters added to the structure, for retraining with the proprietary data and categorical labels.
In addition to the above scenarios, the TechDoc architecture can also be applied for classification when documents only involve one or two types of information. The results of the ablation study (see Tables III–V) show that the text-only or the network-only model can obtain reasonably good performance. Compared to alternative methods, Table VI shows that the two dual-modal models based on the TechDoc architecture (text+image, text+network) and the trimodal TechDoc model (text+image+network) outperform other existing models. This finding suggests that our model is more suitable and competitive for classifying multimodal engineering documents comprising two or three modalities. It is noteworthy that the training efficiency of the dual-modal (text+network) model is better than that of the trimodal TechDoc model, according to Table VI. In this case, when training efficiency is the top priority for users, they can choose the dual-modal model (text+network) to build their own document classifier. When classification performance is the top priority and the training time is affordable for the specific users, they can employ the TechDoc model that combines three types of information.
VI. CONCLUDING REMARKS
The engineering design, analysis, and manufacturing processes generate many diverse technical documents that describe technologies, products, processes, and systems. For large engineering companies, the number of technical documents that need to be managed and organized for retrieval and reuse has grown dramatically and demands more scalable and accurate document classification. Therefore, automated approaches increase their potential value in reducing the burden of experts and supporting diverse analytical reasons for classification. Herein, we propose a multimodal deep learning architecture (i.e., TechDoc) for technical document classification that can take advantage of three types of information (images and texts of documents, and relational network among documents) and assign documents into hierarchical categories.
TechDoc synthesizes the CNN, RNN, and GNN through an integrated training process. To illustrate the proposed method, we applied it to a large multimodal technical document database of about 0.8 million patents and trained the model to classify technical documents based on the hierarchical IPC system. We demonstrated that the multimodal fusion model outperforms the unimodal models and baseline models significantly. There is still much room for improvement. First, how to identify an effective way to utilize the information in document images is still unclear. In this study, we only utilized the first images of patents and leveraged a fine-tuned CNN network to extract visual features. Determining how to take advantage of all patent images remains a challenge. Second, some patents have more than one IPC code, which makes such a classification task a multilabel classification problem. Although some prior research has presented certain achievements [12], [52], [56], it remains challenging to determine the exact number of categories. Third, the current training workflow is computing-intensive because of the large number of free weights in the multimodal deep learning model. In the future, we plan to further improve the model by exploring alternative and more efficient ways to mine multimodal information, especially visual information. Besides, further research can also explore whether other types of technical document datasets (especially the technical documents generated in large engineering companies) and other technical-related classification systems (e.g., USPC and CPC) are more generally amenable to higher accuracies for the TechDoc model. In addition to classification, some prior studies have also utilized text mining techniques to automatically analyze and manage technical documents, including topic modeling [100], [101], and subject-action-object semantic structure extraction [102], [103]. Researchers may combine these AI techniques, in conjunction with document classification methods, to develop more powerful technology management systems for real-world applications.
REFERENCES
[1] K. T. Ulrich, S. D. Eppinger, and M. C. Yang, Product Design and Development, 7th ed. New York, NY, USA: McGraw-Hill, 2020.
[2] G. J. Hahm, J. H. Lee, and H. W. Suh, “Semantic relation based personalized ranking approach for engineering document retrieval,” Adv. Eng. Inform., vol. 29, no. 3, pp. 366–379, 2015.
[3] H. Chen, X. Wang, S. Pan, and F. Xiong, “Identify topic relations in scientific literature using topic modeling,” IEEE Trans. Eng. Manage., vol. 68, no. 5, pp. 1232–1244, Oct. 2021.
[4] M. Hertzum and A. M. Pejtersen, “The information-seeking practices of engineers: Searching for documents as well as for people,” Inf. Process. Manage., vol. 36, no. 5, pp. 761–778, 2000.
[5] X. Liu, “A multi-agent-based architecture for enterprise customer and supplier cooperation context-aware information systems,” in Proc. Int. Conf. Autonomic Auton. Syst., 2007, pp. 58–58.
[6] F. S. C. Tseng, “Design of a multi-dimensional query expression for document warehouses,” Inf. Sci., vol. 174, no. 1/2, pp. 55–79, 2005.
[7] F. S. C. Tseng and A. Y. H. Chou, “The concept of document warehousing for multi-dimensional modeling of textual-based business intelligence,” Decis. Support Syst., vol. 42, no. 2, pp. 727–744, 2006.
[8] R. Feldman et al., The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge, U.K.: Cambridge Univ. Press, 2007.
[9] S. Allard, K. J. Levine, and C. Tenopir, “Design engineers and technical professionals at work: Observing information usage in the workplace,” J. Amer. Soc. Inf. Sci. Technol., vol. 60, no. 3, pp. 443–454, 2009.
[10] C. H. Caldas and L. Soibelman, “Automating hierarchical document [32] A. Brem, F. Giones, and M. Werle, “The AI digital revolution in inno-
classification for construction management information systems,” Autom. vation: A conceptual framework of artificial intelligence technologies
Construction, vol. 12, no. 4, pp. 395–406, 2003. for the management of innovation,” IEEE Trans. Eng. Manage., to be
[11] L. Aristodemou and F. Tietze, “The state-of-the-art on intellectual prop- published, doi: 10.1109/TEM.2021.3109983.
erty analytics (IPA): A literature review on artificial intelligence, machine [33] K. Kaur, S. Garg, G. Kaddoum, E. Bou-Harb, and K.-K. R. Choo, “A
learning and deep learning methods for analysing intellectual property big data-enabled consolidated framework for energy efficient software
(IP) data,” World Pat. Inf., vol. 55, pp. 37–51, 2018. defined data centers in IoT setups,” IEEE Trans. Ind. Informat., vol. 16,
[12] S. Li, J. Hu, Y. Cui, and J. Hu, “DeepPatent: Patent classification with no. 4, pp. 2687–2697, Apr. 2019.
convolutional neural networks and word embedding,” Scientometrics, [34] S. Wang and C. D. Manning, “Baselines and bigrams: Simple, good
vol. 117, no. 2, pp. 721–744, 2018. sentiment and topic classification,” in Proc. 50th Annu. Meeting Assoc.
[13] J. S. Linsey, K. L. Wood, and A. B. Markman, “Modality and represen- Comput. Linguistics, 2012, pp. 90–94.
tation in analogy,” Artif. Intell. Eng. Des. Anal. Manuf., vol. 22, no. 2, [35] T. Joachims, “Text categorization with support vector machines: Learning
pp. 85–100, 2008. with many relevant features,” in Proc. Eur. Conf. Mach. Learn., 1998,
[14] D. G. Ullman, S. Wood, and D. Craig, “The importance of drawing in the pp. 137–142.
mechanical design process,” Comput. Graph., vol. 14, no. 2, pp. 263–274, [36] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional
1990. neural network for modelling sentences,” in Proc. Annu. Meeting Assoc.
[15] S. Jiang, J. Luo, G. Ruiz-pava, J. Hu, and C. L. Magee, “Deriving design Comput. Linguistics, 2014, pp. 655–665.
feature vectors for patent images using convolutional neural networks,” [37] P. Liu, X. Qiu, and X. Huang, “Recurrent neural network for text clas-
ASME J. Mech. Des., vol. 143, no. 6, p. 061405, 2021. sification with multi-task learning,” in Proc. 25th Int. Joint Conf. Artif.
[16] Z. Zhang and Y. Jin, “An unsupervised deep learning model to dis- Intell., 2016, pp. 2873–2879.
cover visual similarity between sketches for visual analogy support,” [38] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-
in Proc. ASME Int. Des. Eng. Tech. Conf. Comput. Inf. Eng. Conf., 2020, encoders,” in Proc. Int. Conf. Artif. Neural Netw., 2011, pp. 44–51.
p. V008T08A003. [39] A. Vaswani et al., “Attention is all you need,” in Proc. 31st Conf. Neural
[17] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, Inf. Process. Syst., 2017, pp. 5998–6008.
“Multimodal fusion for multimedia analysis: A survey,” Multimedia Syst., [40] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for
vol. 16, no. 6, pp. 345–379, 2010. efficient text classification,” in Proc. 15th Conf. Eur. Chapter Assoc.
[18] C. A. Bhatt and M. S. Kankanhalli, “Multimedia data mining: State of Comput. Linguistics, 2017, pp. 427–431.
the art and challenges,” Multimedia Tools Appl., vol. 51, no. 1, pp. 35–76, [41] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural
2011. networks for text classification,” in Proc. 29th Conf. Artif. Intell., 2015,
pp. 2267–2273.
[19] B. Song and J. Luo, “Mining patent precedents for data-driven design: The case of spherical rolling robots,” J. Mech. Des., vol. 139, no. 11, p. 111420, 2017.
[20] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 1025–1035.
[21] J. Luo, S. Sarica, and K. L. Wood, “Guiding data-driven design ideation by knowledge distance,” Knowl.-Based Syst., vol. 218, p. 106873, 2021.
[22] G. Elia, A. Margherita, and G. Passiante, “Management engineering: A new perspective on the integration of engineering and management knowledge,” IEEE Trans. Eng. Manage., vol. 68, no. 3, pp. 881–893, Jun. 2021.
[23] M. F. Manesh, M. M. Pellegrini, G. Marzi, and M. Dabic, “Knowledge management in the fourth industrial revolution: Mapping the literature and scoping future avenues,” IEEE Trans. Eng. Manage., vol. 68, no. 1, pp. 289–300, Feb. 2021.
[24] J. Sofiyabadi, C. Valmohammadi, and A. Sabet ghadam, “Impact of knowledge management practices on innovation performance,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2020.3032233.
[25] A. J. C. Trappey, C. V. Trappey, U. H. Govindarajan, and J. J. H. Sun, “Patent value analysis using deep learning models-the case of IoT technology mining for the manufacturing industry,” IEEE Trans. Eng. Manage., vol. 68, no. 5, pp. 1334–1346, Oct. 2021.
[26] A. Rodriguez et al., “Patent clustering and outlier ranking methodologies for attributed patent citation networks for technology opportunity discovery,” IEEE Trans. Eng. Manage., vol. 63, no. 4, pp. 426–437, Nov. 2016.
[27] Z. Qiu and Z. Wang, “Technology forecasting based on semantic and citation analysis of patents: A case of robotics domain,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2020.2978849.
[28] P. Chandra and A. Dong, “Knowledge network robustness: A new perspective on the appropriation of knowledge from patents,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2020.3016278.
[29] G. Zanella, C. Z. Liu, and K.-K. R. Choo, “Understanding the trends in blockchain domain through an unsupervised systematic patent analysis,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2021.3074310.
[30] J. Luo, “Data-Driven innovation: What is it?,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2022.3145231.
[31] S. Jiang, J. Hu, K. L. Wood, and J. Luo, “Data-driven design-by-analogy: State-of-the-art and future directions,” ASME J. Mech. Des., vol. 144, no. 2, p. 020801, 2022.
[42] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., 2016, pp. 1480–1489.
[43] S. Chagheri, C. Roussey, S. Calabretto, and C. Dumoulin, “Technical documents classification,” in Proc. 15th Int. Conf. Comput. Supported Cooperative Work Des., 2011, pp. 808–812.
[44] C. J. Fall, A. Törcsvári, P. Fiévet, and G. Karetka, “Automated categorization of German-language patent documents,” Expert Syst. Appl., vol. 26, no. 2, pp. 269–277, 2004.
[45] D. Tikk, G. Biró, and J. D. Yang, “Experiment with a hierarchical text categorization method on WIPO patent collections,” in Proc. Appl. Res. Uncertainty Model. Anal., 2005, pp. 283–302.
[46] F. Piroi, M. Lupu, A. Hanbury, A. P. Sexton, W. Magdy, and I. V. Filippov, “CLEF-IP 2010: Retrieval experiments in the intellectual property domain,” in Proc. Workshop Proc. Cross-Lang. Eval. Forum Eur. Languages, 2010, pp. 1–12.
[47] F. Piroi, M. Lupu, A. Hanbury, and V. Zenz, “CLEF-IP 2011: Retrieval in the intellectual property domain,” in Proc. Workshop Proc. Cross-Lang. Eval. Forum, 2011, pp. 1–16.
[48] S. Verberne and E. D’hondt, “Patent classification experiments with the linguistic classification system LCS in CLEF-IP 2011,” in Proc. Workshop Proc. Cross-Lang. Eval. Forum Eur. Languages, 2011, pp. 1–9.
[49] Z. Li, D. Tate, C. Lane, and C. Adams, “A framework for automatic TRIZ level of invention estimation of patents using natural language processing, knowledge-transfer and patent citation metrics,” Comput. Des., vol. 44, no. 10, pp. 987–1010, 2012.
[50] C.-H. Wu, Y. Ken, and T. Huang, “Patent classification system using a new hybrid genetic algorithm support vector machine,” Appl. Soft Comput., vol. 10, no. 4, pp. 1164–1177, 2010.
[51] M. F. Grawe, C. A. Martins, and A. G. Bonfante, “Automated patent classification using word embedding,” in Proc. Int. Conf. Mach. Learn. Appl., 2017, pp. 408–411.
[52] M. Shalaby, J. Stutzki, M. Schubert, and S. Günnemann, “An LSTM approach to patent classification based on fixed hierarchy vectors,” in Proc. SIAM Int. Conf. Data Mining, 2018, pp. 495–503.
[53] J. Risch and R. Krestel, “Domain-specific word embeddings for patent classification,” Data Technol. Appl., vol. 53, no. 1, pp. 108–122, 2019.
[54] J. Hepburn, “Universal language model fine-tuning for patent classification,” in Proc. Australas. Lang. Technol. Assoc. Workshop, 2018, pp. 93–96.
[55] D. M. Kang, C. C. Lee, S. Lee, and W. Lee, “Patent prior art search using deep learning language model,” in Proc. Symp. Int. Database Eng. Appl., 2020, pp. 1–5.
[56] J. S. Lee and J. Hsiang, “Patent classification by fine-tuning BERT language model,” World Pat. Inf., vol. 61, no. 1, p. 101965, 2020.
[57] L. Abdelgawad, P. Kluegl, E. Genc, S. Falkner, and F. Hutter, “Optimizing neural networks for patent classification,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discov. Databases, 2020, pp. 688–703.
[58] L. A. Tomei, Taxonomy for the Technology Domain. Calgary, AB, Canada: Idea Group Inc., 2005.
[59] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019.
[60] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 689–696.
[61] N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 2222–2230.
[62] Y. Kang, S. Kim, and S. Choi, “Deep learning to hash with multiple representations,” in Proc. IEEE 12th Int. Conf. Data Mining, 2012, pp. 930–935.
[63] Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue, “Exploring inter-feature and inter-class relationships with deep neural networks for video classification,” in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 167–176.
[64] Y. Mroueh, E. Marcheret, and V. Goel, “Deep multimodal learning for audio-visual speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 2130–2134.
[65] N. Audebert, C. Herold, K. Slimani, and C. Vidal, “Multimodal deep networks for text and image-based document classification,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discov. Databases, 2019, pp. 427–443.
[66] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. 3rd Int. Conf. Learn. Representations, 2015.
[67] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[68] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. 3rd Int. Conf. Learn. Representations, 2015.
[69] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “End-to-end memory networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2440–2448.
[70] A. Kumar et al., “Ask me anything: Dynamic memory networks for natural language processing,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 1378–1387.
[71] K. Fu, J. Murphy, M. Yang, K. Otto, D. Jensen, and K. Wood, “Design-by-analogy: Experimental evaluation of a functional analogy search methodology for concept generation improvement,” Res. Eng. Des., vol. 26, no. 1, pp. 77–95, 2015.
[72] B. Song, V. Srinivasan, and J. Luo, “Patent stimuli search and its influence on ideation outcomes,” Des. Sci., vol. 3, no. e25, pp. 1–25, 2017.
[73] J. Luo, B. Yan, and K. Wood, “InnoGPS for data-driven exploration of design opportunities and directions: The case of Google Driverless Car Project,” J. Mech. Des., vol. 139, no. 11, p. 111416, 2017.
[74] S. Altuntas and M. Sezer, “A novel technology intelligence tool based on utility mining,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2021.3101582.
[75] G. Xu, F. Dong, and J. Feng, “Mapping the technological landscape of emerging industry value chain through a patent lens: An integrated framework with deep learning,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2020.3041933.
[76] V. Giordano, F. Chiarello, N. Melluso, G. Fantoni, and A. Bonaccorsi, “Text and dynamic network analysis for measuring technological convergence: A case study on defense patent data,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2021.3078231.
[77] L. Siddharth, L. T. M. Blessing, K. L. Wood, and J. Luo, “Engineering knowledge graph from patent database,” J. Comput. Inf. Sci. Eng., vol. 22, no. 2, pp. 1–36, 2022.
[78] J. C. Gomez and M.-F. Moens, “A survey of automated hierarchical classification of patents,” in Professional Search in the Modern World. Berlin, Germany: Springer, 2014, pp. 215–249.
[79] C. L. Benson and C. L. Magee, “A hybrid keyword and patent class methodology for selecting relevant sets of patents for a technological field,” Scientometrics, vol. 96, no. 1, pp. 69–82, 2013.
[80] S. Tsutsui and D. J. Crandall, “A data driven approach for compound figure separation using convolutional neural networks,” in Proc. 14th IAPR Int. Conf. Document Anal. Recognit., 2017, pp. 533–540.
[81] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767.
[82] A. K. Uysal and S. Gunal, “The impact of preprocessing on text classification,” Inf. Process. Manage., vol. 50, no. 1, pp. 104–112, 2014.
[83] R. M. Hayes, “The SMART retrieval system: Experiments in automatic document processing,” IEEE Trans. Prof. Commun., vol. PC-15, no. 1, Mar. 1972.
[84] K. Toutanova and C. D. Manning, “Enriching the knowledge sources used in a maximum entropy part-of-speech tagger,” in Proc. Joint SIGDAT Conf. Empirical Methods Natural Lang. Process., 2000, pp. 63–70.
[85] S. Bird and E. Loper, “NLTK: The natural language toolkit,” in Proc. 42nd Annu. Meeting Assoc. Comput. Linguistics, 2004, pp. 214–217.
[86] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2921–2929.
[87] S. Sarica, J. Luo, and K. L. Wood, “TechNet: Technology semantic network based on patent data,” Expert Syst. Appl., vol. 142, p. 112995, 2020.
[88] L. Xiao, G. Wang, and Y. Zuo, “Research on patent text classification based on Word2Vec and LSTM,” in Proc. 11th Int. Symp. Comput. Intell. Des., 2018, pp. 71–74.
[89] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res., vol. 13, no. 2, pp. 281–305, 2012.
[90] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1026–1034.
[91] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. 3rd Int. Conf. Learn. Representations, 2015.
[92] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2014, pp. 1532–1543.
[93] R. Speer, J. Chin, and C. Havasi, “ConceptNet 5.5: An open multilingual graph of general knowledge,” in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 4444–4451.
[94] I. Yamada, A. Asai, H. Shindo, H. Takeda, and Y. Takefuji, “Wikipedia2Vec: An optimized tool for learning embeddings of words and entities from Wikipedia,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2020, pp. 23–30.
[95] Student, “The probable error of a mean,” Biometrika, vol. 6, no. 1, pp. 1–25, 1908.
[96] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., 2019, pp. 4171–4186.
[97] J. Gao, P. Li, Z. Chen, and J. Zhang, “A survey on deep learning for multimodal data fusion,” Neural Comput., vol. 32, no. 5, pp. 829–864, 2020.
[98] R. Lee and I.-Y. Chen, “The time complexity analysis of neural network model configurations,” in Proc. Int. Conf. Math. Comput. Sci. Eng., 2020, pp. 178–183.
[99] Y. HaCohen-Kerner, D. Miller, and Y. Yigal, “The influence of preprocessing on text classification using a bag-of-words representation,” PLoS One, vol. 15, no. 5, pp. 1–22, 2020.
[100] X. Wang, Y. Qiao, Y. Hou, S. Zhang, and X. Han, “Measuring technology complementarity between enterprises with an hLDA topic model,” IEEE Trans. Eng. Manage., vol. 68, no. 5, pp. 1309–1320, Oct. 2021.
[101] C. Wei, L. Chaoran, L. Chuanyun, K. Lingkai, and Y. Zaoli, “Tracing the evolution of 3-D printing technology in China using LDA-based patent abstract mining,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2020.2975988.
[102] R. Li, X. Wang, Y. Liu, and S. Zhang, “Improved technology similarity measurement in the medical field based on subject-action-object semantic structure: A case study of Alzheimer’s disease,” IEEE Trans. Eng. Manage., to be published, doi: 10.1109/TEM.2020.3047370.
[103] X. Han, D. Zhu, X. Wang, J. Li, and Y. Qiao, “Technology opportunity analysis: Combining SAO networks and link prediction,” IEEE Trans. Eng. Manage., vol. 68, no. 5, pp. 1288–1298, Oct. 2021.
Shuo Jiang received the B.S. degree in mechanical engineering from the School of Mechanical Engineering, East China University of Science and Technology, Shanghai, China, in 2016. He is currently working toward the Ph.D. degree with the School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China.
He was a visiting Ph.D. student at the Institute for Data, Systems, and Society (Design and Invention Group), Massachusetts Institute of Technology for one year sponsored by the Chinese Scholarship Council, and also a visiting Ph.D. student at the Data-Driven Innovation Lab, Singapore University of Technology and Design. His research interests include data-driven designs, machine learning-based engineering designs, and computational design methods.

Jie Hu received the Ph.D. degree in mechanical engineering from Zhejiang University, Hangzhou, China, in 2001.
He is currently a tenured Full Professor with the School of Mechanical Engineering, Shanghai Jiao Tong University (SJTU), Shanghai, China. Prior to joining SJTU, he was a postdoctoral researcher at Tsinghua University. His research interests include innovative designs, design theory, artificial intelligence, and computer-aided designs.

Christopher L. Magee received the B.S., M.S., and Ph.D. degrees in metallurgy and materials science from the Carnegie Institute of Technology (now Carnegie Mellon University), Pittsburgh, PA, USA, in 1963 and 1966, respectively, and the M.B.A. degree from Michigan State University, East Lansing, MI, USA, in 1979.
He is currently a Professor with the Engineering Systems Division and Mechanical Engineering Department, Massachusetts Institute of Technology (MIT), and the Codirector with the International Design Center, associated with the Singapore University of Technology and Design being codeveloped by MIT and Singapore. His recent research has emphasized innovation and technology development in complex systems.
Dr. Magee was elected to the National Academy of Engineering (1997) while working with the Ford Motor Company for contributions to advanced vehicle development. He was a Ford technical fellow (1996), and is a fellow of the American Society for Metals.

Jianxi Luo received the B.E. and M.S. degrees in engineering from Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the S.M. degree in technology policy and the Ph.D. degree in engineering systems (technology management and policy track) from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2006 and 2010, respectively.
He is currently the Founder and the Director of Data-Driven Innovation Lab, Singapore University of Technology and Design (SUTD), Singapore. He was the Director of SUTD Technology Entrepreneurship Programme. He teaches topics on engineering entrepreneurship, design, and innovation. His research interests include data-driven innovation and artificial intelligence for design.
Dr. Luo is currently the Department Editor for the IEEE TRANSACTIONS ON ENGINEERING MANAGEMENT, an Associate Editor for Design Science and Artificial Intelligence for Engineering Design, Analysis and Manufacturing, and an Editorial Board Member of Research in Engineering Design. He was the Chair of INFORMS Technology Innovation Management and Entrepreneurship Section.