A Statistical Text Mining Method for Patent Analysis
Sunghae Jun

Department of Statistics, Cheongju University, shjun@cju.ac.kr
Abstract
Most text data from diverse document databases are unsuitable for analytical methods based on
statistics and machine learning algorithms. Patent documents are also compiled into text datasets.
Similar to other document datasets, we therefore need to transform patent documents into structured
data for a statistical analysis. This transformation is performed using the preprocessing of text mining
techniques. We can analyze the patent documents after their preprocessing. For a patent analysis, two
phases, preprocessing and analysis, are required. In this paper, we try to combine the two phases into
one. We propose a statistical text mining method to improve the performance of a patent data analysis.
Our proposed method carries out text mining and a statistical analysis at the same time. To show the
contribution of our study, we illustrate how it can be applied in a real domain using a target
technology.
Keywords: Statistical text mining, Preprocessing, Statistical analysis, Patent system, Technology
forecasting, Patent analysis
1. Introduction
A patent is a type of intellectual property (IP). Unlike other forms of IP such as trademarks, copyrights, and trade
secrets, a patent contains diverse information on technological development [1]. In addition, the patent
system protects the exclusive rights of inventors regarding their developed technology for limited
periods of time [2]. Many companies hope to analyze patent documents to ascertain certain technology
trends. However, because patent data include text, numbers, dates, and pictures, patent data are
unsuitable for analysis methods such as statistics and machine learning algorithms [1], which require
structured data consisting of numbers or frequencies. To solve this problem, we should preprocess
patent documents using text mining techniques [3]. Text mining is a data mining process used to
manage and analyze texts or document data [4]. We therefore use text mining methods to preprocess
patent data to build structured data for a statistical analysis. In this paper, we combine statistics and text
mining for a patent analysis. This type of combination is called statistical text mining (STM) and has
been used in the medical and biological sectors [5][6]. This research provides another STM approach for patent
analysis, and differs from previous studies. In our STM, we used descriptive statistics and a
multivariate analysis based on statistics. In addition, using text mining techniques, we import data,
create a corpus, and eliminate whitespace and stop words. In other words, we apply basic and advanced
analyses to structured patent data after the preprocessing of patent documents. The STM results can be
used for R&D planning, technology management, or new product development. To show how our
STM can be used in a real domain, we perform a case study using retrieved patent documents from the
Korea Intellectual Property Rights Information Service (KIPRIS) database. We then apply our STM
process to this case study in a step-by-step manner.
2. Statistics and text mining

Statistics and text mining are both data mining tools [4]. Statistics is a learning method for prediction
and forecasting [7]. Most statistical methods are based on a probability distribution such as a binomial
or normal distribution. Text mining is an analytic process for discovering novel information from a
large amount of text data [4]. Unlike general data mining, text mining requires some natural
language processing techniques because text data must be preprocessed before it can be analyzed.
International Journal of Advancements in Computing Technology (IJACT)
Volume 5, Number 12, August 2013
3. Statistical text mining for a patent analysis

Statistics plays an important role in big data analyses. Patent documents are typical examples of big
data. A patent document contains diverse data types such as text, numbers, dates, and pictures [1].
Patents include a lot of information regarding the results of technological developments. Most
companies have therefore tried to analyze patent data for their technology management such as R&D
planning and the development of new products. Many statistical methods have been used for patent
analyses [8][9]. However, certain problems exist in patent analyses using statistical methods. One such
problem is that patent documents are not suitable for statistical analyses because traditional statistics
requires numerical data. Several studies have therefore been conducted to overcome this problem
[8][10]. Text mining is a good approach to overcoming this obstacle [3]. We therefore combine
statistical analysis and text mining, a combination called statistical text mining, for patent analysis. Figure 1 shows
the scope of the proposed statistical text mining method.
Figure 1. Interdisciplinary statistical text mining

We researched STM for use in a big data analysis, and analyzed patent documents as examples of
big data. Statistics and text mining are different fields used in big data analyses. Statistics is a natural
science based on probability for estimation, prediction, or forecasting. Text mining is an applied
information science based on computer science for text data analyses. Table 1 shows a comparison
between statistics, text mining, and statistical text mining based on their respective scopes [4][7].
Table 1. Scopes of statistics, statistical text mining, and text mining
Statistics:
  Descriptive statistics
  - sample statistic
  - tables & graphs
  Probability
  - support and confidence
  - Bayes rule
  Random variable
  - discrete & continuous
  Probability distribution
  - binomial & Poisson
  - normal
  Inference
  - estimation
  - hypothesis testing

Statistical text mining:
  Structured data construction
  - importing text data
  - creating corpus
  - building document term matrix
  Preprocessing
  - eliminating whitespace
  - removing stop and common words
  Basic analysis
  - summary statistics
  - meta analysis
  Advanced analysis
  - finding association rules
  - regression & path analysis
  - text or document clustering

Text mining:
  Information retrieval
  - precision & recall
  - document selection
  Text indexing
  - latent semantic indexing
  - locality processing indexing
  - probabilistic indexing
  Query processing
  Document classification
  - document ranking
  Web mining
  - web usage mining
  - web structure mining
  - web recommendation system

Our STM focuses on patent documents. In Table 1, the scope of statistics is far removed from that of text
mining. Statistics is used to summarize and visualize numeric data, and to make inferences about a population through
estimation and hypothesis testing. In comparison, text mining is used to process text data through
retrieval and indexing. STM, however, produces structured data suitable for a statistical analysis. The
structured data construction and preprocessing of STM are based on text mining techniques. In addition,
the basic and advanced analyses of STM are based on statistical methods. Figure 2 shows the step-by-step process of our proposed STM method.
Figure 2. Proposed STM process

We retrieve the patent documents of a given technology; however, raw patent data are not suitable for statistical
analysis. We therefore transform the documents into structured data in the form of a patent keyword
matrix. The rows of the matrix consist of searched patents. All keywords extracted from the patents are
in the columns of the structured matrix. In addition, each value of the matrix is the occurrence
frequency of the keyword in the patent, which is a numerical data type. We can therefore analyze this
matrix through statistics. The results of a statistical analysis may reveal novel knowledge and hidden
patterns, and can be used for the R&D planning and new product development of a company after
applying the opinions of domain experts. In the next section, we describe a case study suggesting how
the proposed STM can be applied to a real-world problem.
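
The patent keyword matrix described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the two abstracts are invented, and the function names `tokenize` and `build_term_matrix` are hypothetical helpers, not part of the paper or of any library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_term_matrix(documents):
    """Return (terms, matrix) with one row per document and one column per term.

    Each cell holds the occurrence frequency of the term in the document.
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts)) if counts else []
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


# Hypothetical patent abstracts, invented for illustration only.
patents = [
    "A text mining system for extracting information from text",
    "Method for query search using structured information",
]
terms, matrix = build_term_matrix(patents)

for row in matrix:
    print(dict(zip(terms, row)))
```

Rows correspond to patents and columns to terms, so each row can be treated as a numerical observation in a later statistical analysis.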
4. Experimental results
For our case study, we searched for patent documents from the KIPRIS patent database [11]. The
keyword equation is as follows: title = text * mining; in other words, we retrieved patent documents
related to text mining technology. In addition, we obtained patents from the United States and Europe
on July 5, 2013. The patent data consisted of 67 documents in total. Among them, 60 were US patents,
and the remaining seven were European patents. Next, we illustrate the case study for STM in a step-by-step
manner.
4.1. Structured data construction

In this paper, we applied text mining techniques to construct structured data for a statistical
analysis. To preprocess the patent data used in our research, we used the text mining package (tm) of the R
project [12][13]. We downloaded the patent data as an Excel file and imported it into R. We
created a corpus of 67 text documents. From this corpus, we constructed a patent term matrix of 67
documents by 1439 terms. We found the following highly frequent (over 10) terms in the structured
data: also, analysis, analyzed, and, apparatus, are, associated, based, between,
business, can, candidate, characteristic, coincident, computer, condition, confidence,
content, corresponding, data, definition, degree, device, disclosed, document, each,
element, entities, extracted, extracting, extraction, extract, feature, for, frequency,
from, generating, generation, has, include, including, information, inherent, input,
into, knowledge, language, least, list, may, means, method, mining, more,
obtained, one, phrase, piece, plurality, portion, predetermined, present, processing,
program, provided, query, referring, result, said, search, section, set, structured,
such, system, term, text, textual, that, the, this, topic, unit, used, user,
using, value, video, web, which, with, within, word, and words.
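
As a rough illustration of the frequency filter, the sketch below keeps the terms whose total count exceeds a threshold. It assumes that "over 10" refers to the total frequency across all documents; the toy matrix and the function name `frequent_terms` are hypothetical.

```python
def frequent_terms(terms, matrix, min_count=10):
    """Return the terms whose total frequency over all documents exceeds min_count."""
    totals = [sum(row[j] for row in matrix) for j in range(len(terms))]
    return [t for t, total in zip(terms, totals) if total > min_count]


# Toy term matrix with hypothetical counts: 3 documents x 4 terms.
terms = ["data", "mining", "text", "video"]
matrix = [
    [5, 4, 6, 0],
    [4, 3, 5, 1],
    [3, 2, 4, 0],
]

print(frequent_terms(terms, matrix))  # prints ['data', 'text']
```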
To build the patent keyword matrix, we selected keywords from the frequent terms by eliminating
stop and common words. In this way, we determined the following candidate keywords: analysis,
associated, business, candidate, characteristic, coincident, computer, condition,
confidence, content, data, definition, degree, device, disclosed, document, element,
entities, extract, feature, frequency, generation, including, information, inherent,
input, knowledge, language, list, means, method, mining, phrase, piece, plurality,
portion, predetermined, present, processing, program, query, referring, search,
section, set, structured, system, term, text, topic, unit, used, user, value,
video, web, and word.
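
The selection of candidate keywords can be sketched as a simple filter. The stop and common word set below is a small illustrative assumption, not the list used in the case study, and `candidate_keywords` is a hypothetical helper name.

```python
# Illustrative stop and common words only; a real list would be much longer.
STOP_WORDS = {"also", "and", "apparatus", "are", "for", "from", "the", "this", "with"}


def candidate_keywords(frequent, stop_words=STOP_WORDS):
    """Drop stop and common words, keeping the remaining terms in their original order."""
    return [t for t in frequent if t not in stop_words]


# First few frequent terms from the list above.
frequent = ["also", "analysis", "and", "apparatus", "are", "associated"]
print(candidate_keywords(frequent))  # prints ['analysis', 'associated']
```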
The final keywords can be determined based on the knowledge of experts in the text mining field. In
this paper, we determined the top ten keywords from the candidates other than text and mining.
Table 2. Top ten keywords for text mining patents

Keywords
Business, confidence, content, disclosed, extract, feature, information, query, search, structured
We therefore constructed a patent keyword matrix using 67 patents and 10 keywords. Using this
matrix data, we performed basic and advanced analyses, the results of which were used for practical
application.
4.2. Basic analysis

STM provides a summary and visualization of the searched patent data in the basic analysis step.
The following figure shows the number of patents by year.
Figure 3. Number of applied patents

The first patent related to text mining was filed in 1999. We can therefore see that the
technological development of text mining started comparatively recently. The development of text
mining technology has also decreased recently. In addition, we summarized the International Patent
Classification (IPC) codes of the searched patent data. IPC is a patent classification system from the
World Intellectual Property Organization [14]. The following table shows the IPC codes used in text
mining patents and their representative technology.
Table 3. Number of IPC codes

IPC: C12Q, G06F, G06K, G06N, G06Q, G06T, G10L
Number of IPCs: 58 [only this count survives; its IPC code and the remaining counts are missing]
We found that the text mining patents covered seven IPC codes. Most of these
patents are classified under the IPC code G06F, which covers the technology of electric digital
data processing [14]. The IPC code C12Q represents the technology of measuring or testing
processes involving enzymes or micro-organisms, and the remaining IPC codes are related to
information technologies. We therefore can see that text mining technology is an interdisciplinary area
for developing technologies. Next, we attempted to find more detailed knowledge through an advanced
analysis.
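
The summaries used in the basic analysis (patents per year and IPC code counts) amount to simple frequency tables. The following sketch uses invented records, not the data of the case study.

```python
from collections import Counter

# Hypothetical records of (application year, IPC code); not the paper's data.
records = [
    (1999, "G06F"), (2003, "G06F"), (2003, "G06Q"),
    (2007, "G06F"), (2007, "C12Q"), (2007, "G06F"),
]

# Frequency table of patents by application year.
patents_by_year = Counter(year for year, _ in records)
# Frequency table of IPC codes across all patents.
ipc_counts = Counter(ipc for _, ipc in records)

for year in sorted(patents_by_year):
    print(year, patents_by_year[year])
print(ipc_counts.most_common())
```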
4.3. Advanced analysis

In this section, we applied a statistical analysis to the structured data. First, we performed the
following correlation analysis:
Figure 4. Correlation coefficient matrix between keywords
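
A correlation coefficient matrix of this kind can be computed column by column from the patent keyword matrix. The sketch below assumes the Pearson coefficient, which the paper does not state explicitly, and uses invented keyword counts; `pearson` and `correlation_matrix` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined for a constant column
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical keyword frequencies over five patents.
columns = {
    "text": [2, 4, 1, 3, 0],
    "confidence": [1, 2, 0, 2, 0],
    "search": [0, 1, 3, 1, 2],
}
corr = correlation_matrix(columns)
for name, row in corr.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```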

We found that confidence, feature, and structured have relatively large correlation
coefficients with text. In addition, mining has a relatively large correlation coefficient with
confidence, and we therefore concluded that confidence technology is needed
for the development of text mining technology. In addition, we should consider feature-based and structured approaches for
text mining technology development. We performed another STM analysis as follows.
Table 4. Influence of keywords on text mining

                      Text                       Mining
Keywords       estimate  significance     estimate  significance
Business          0.11       0.83            0.40       0.22
Confidence        2.65       0.00            1.08       0.00
Content           0.06       0.84           -0.20       0.32
Disclosed         0.50       0.59            0.46       0.46
Extract           0.93       0.50            0.71       0.44
Feature           0.56       0.09            0.28       0.19
Information      -0.05       0.84            0.13       0.41
Query            -0.09       0.77           -0.05       0.81
Search           -0.00       0.99           -0.06       0.69
Structured       -0.88       0.29           -0.56       0.32

We carried out a multiple regression analysis, in which the estimate and significance columns are the regression parameter and
probability value (p-value), respectively. The estimate gives the relative influence of a keyword on text or
mining, and a keyword is statistically significant when its significance is less than 0.05. Similar
to the results of the correlation analysis, we found that the confidence technology is important for the
development of text mining technology. We next applied our results to a practical problem.
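
To make the regression estimates concrete, the sketch below fits a multiple regression by ordinary least squares using the normal equations. It is an illustration under stated assumptions, not the procedure used in the case study: the keyword counts are invented, `solve` and `ols` are hypothetical helper names, and the p-values are not computed here because they would additionally require the t distribution from a statistics library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ols(rows, y):
    """Least-squares estimates [intercept, b1, ..., bk] for y regressed on rows."""
    design = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical keyword counts per patent: columns are (confidence, search).
X = [(0, 1), (1, 0), (2, 2), (3, 1), (1, 3)]
# Response constructed as 1 + 2 * confidence + 0 * search, so the fit is exact.
y = [1, 3, 5, 7, 3]

coef = ols(X, y)
print([round(c, 6) for c in coef])
```

In this toy example the estimate for the first predictor is large and the estimate for the second is close to zero, which mirrors how the estimate column of Table 4 is read.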
4.4. Practical application

Using ten extracted keywords, we found that the disclosure, extraction, query, and search of
information and structured content are important for business problems. In addition, based on the
results of an IPC code summary, we can see that text mining technology can be applied to bio-systems.
The basic analysis results show that the technologies related to text mining are comparatively recent.
Finally, based on the results of an advanced analysis, we obtained the statistical significance of
confidence technology on text mining. Therefore, to build an R&D plan for text mining technology, we
recommend using the results of our step-by-step STM process.
5. Conclusion and future work

In this paper, we proposed an STM methodology for a patent analysis. The proposed model consists
of a patent document search related to the target technology, structured data construction, basic and
advanced analyses, and practical application. To illustrate how our STM can be applied to a real
domain, we performed a case study using the patent data related to text mining technology. Our
research can contribute to the effective R&D planning or technology management of a company. For
future work, we will study the development of the STM process for a patent analysis in more detail.
Acknowledgement
This work was supported by a research grant from Cheongju University in 2012.
References
[1] Roper, A. T., Cunningham, S. W., Porter, A. L., Mason, T. W., Rossini, F. A., Banks, J.,
Forecasting and Management of Technology, Wiley, 2011.
[2] Hunt, D., Nguyen, L., Rodgers, M., Patent Searching Tools & Techniques, Wiley, 2007.
[3] Tseng, Y., Lin, C., Lin, Y., Text mining techniques for patent analysis, Information Processing
and Management, vol. 43, no. 5, pp. 1216-1247, 2007.
[4] Han, J., Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2005.
[5] Luther, S., Berndt, D., Finch, D., Richardson, M., Hickling, E., Hickam, D., Using statistical
text mining to supplement the development of an ontology, Journal of Biomedical Informatics,
vol. 44, supp. 1, pp. 86-93, 2011.
[6] McCart, J., Luther, S., An Introductory Look at Statistical Text Mining for Health Services
Researchers, Seminar of Consortium for Healthcare Information Research, Health Service
Research & Development Service, Department of Veterans Affairs, the United States, 2012.
[7] Ross, S. M., Introduction to Probability and Statistics for Engineers and Scientists, Elsevier, 2009.
[8] Jun, S., Uhm, D., Patent and Statistics, What's the connection?, Communications of the Korea
Statistical Society, vol. 17, no. 2, pp. 205-222, 2010.
[9] Jun, S., Park, S., Jang, D., Technology Forecasting using Matrix Map and Patent Clustering,
Industrial Management & Data Systems, vol. 112, iss. 5, pp. 786-807, 2012.
[10] Feinerer, I., Hornik, K., Meyer, D., Text mining infrastructure in R, Journal of Statistical
Software, vol. 25, no. 5, pp. 1-54, 2008.
[11] KIPRIS, Korea Intellectual Property Rights Information Service, www.kipris.or.kr, 2013.
[12] Feinerer, I., Hornik, K., Meyer, D., Text mining infrastructure in R, Journal of Statistical
Software, vol. 25, no. 5, pp. 1-54, 2008.
[13] Feinerer, I., Hornik, K., Text Mining Package, Package tm, cran.r-project.org, 2013.
[14] WIPO, World Intellectual Property Organization, www.wipo.org, 2013.