Patent Analysis - 2 Good PDF
Sunghae Jun
Abstract
Most text data from diverse document databases are unsuitable for analytical methods based on
statistics and machine learning algorithms. Patent documents are also compiled into text datasets.
As with other document datasets, patent documents must therefore be transformed into structured
data before statistical analysis. This transformation is performed with the preprocessing techniques
of text mining, and the documents are analyzed only afterward. A patent analysis thus requires two
phases, preprocessing and analysis. In this paper, we combine the two phases into
one. We propose a statistical text mining method to improve the performance of patent data analysis.
Our proposed method carries out text mining and statistical analysis at the same time. To show the
contribution of our study, we illustrate how it can be applied in a real domain using a target
technology.
Keywords: Statistical text mining, Preprocessing, Statistical analysis, Patent system, Technology
forecasting, Patent analysis
1. Introduction
A patent is a type of intellectual property (IP). Unlike other forms of IP such as trademarks, copyrights, and trade
secrets, a patent contains diverse information on technological development [1]. In addition, the patent
system protects the exclusive rights of inventors regarding their developed technology for limited
periods of time [2]. Many companies hope to analyze patent documents to ascertain certain technology
trends. However, because patent data include text, numbers, dates, and pictures, patent data are
unsuitable for analysis methods such as statistics and machine learning algorithms [1], which require
structured data consisting of numbers or frequencies. To solve this problem, we should preprocess
patent documents using text mining techniques [3]. Text mining is a data mining process used to
manage and analyze texts or document data [4]. We therefore use text mining methods to preprocess
patent data to build structured data for a statistical analysis. In this paper, we combine statistics and text
mining for a patent analysis. This type of combination is called statistical text mining (STM) and has
been used in the medical and biological sectors [5][6]. This research provides an STM approach for patent
analysis that differs from those studies. Our STM uses descriptive statistics and
multivariate statistical analysis. On the text mining side, we import the data,
create a corpus, and eliminate whitespace and stop words. In other words, we apply basic and advanced
analyses to structured patent data after preprocessing the patent documents. The STM results can be
used for R&D planning, technology management, or new product development. To show how our
STM can be used in a real domain, we perform a case study using retrieved patent documents from the
Korea Intellectual Property Rights Information Service (KIPRIS) database. We then apply our STM
process to this case study in a step-by-step manner.
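The preprocessing steps named above (importing text, building a corpus, and removing extra whitespace and stop words) can be sketched as follows. This is a minimal illustration in Python rather than the implementation used in this study; the function name `preprocess`, the tiny `STOP_WORDS` set, and the sample abstracts are assumptions made for the example.

```python
import re

# Hypothetical, deliberately small stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "from", "in", "is", "of", "the", "to"}

def preprocess(document):
    """Lowercase a document, collapse whitespace, and drop stop words."""
    text = re.sub(r"\s+", " ", document).strip().lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

# A toy corpus standing in for imported patent abstracts (assumed sample text).
corpus = [
    "A method   for mining text from documents",
    "System and method of   query processing in a search device",
]
cleaned = [preprocess(doc) for doc in corpus]
print(cleaned)
```

Each cleaned document is a list of tokens, which is the form needed before term frequencies can be counted.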
Information retrieval
- precision & recall
- document selection
Text indexing
- latent semantic indexing
- locality preserving indexing
- probabilistic indexing
Query processing
Document classification
- document ranking
Web mining
- web usage mining
- web structure mining
- web recommendation system
Our STM is focused on patent documents. As Table 1 suggests, statistics and text mining are quite
distinct fields. Statistics is used to summarize and visualize numeric data, and to make inferences about a population through
estimation and hypothesis testing. In comparison, text mining is used to process text data through
retrieval and indexing. STM bridges the two by producing structured data suitable for statistical analysis. The
structured data construction and preprocessing of STM are based on text mining techniques. In addition,
the basic and advanced analyses of STM are based on statistical methods. Figure 2 shows the step-by-step process of our proposed STM method.
4. Experimental results
For our case study, we searched for patent documents from the KIPRIS patent database [11]. The
keyword equation was as follows: title = text * mining; in other words, we retrieved patent documents
related to text mining technology. The patents, from the United States and Europe, were retrieved
on July 5, 2013. The patent data consisted of 67 documents in total: 60 were US patents,
and the remaining 7 were European patents. Next, we illustrate the case study for STM in a step-by-step
manner.
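The search condition can be read as a title filter. The sketch below assumes that `*` in the keyword equation acts as a logical AND, so a title matches only if it contains both words; the record fields (`title`, `office`) and the sample records are hypothetical and do not come from KIPRIS.

```python
from collections import Counter

def title_matches(title, required=("text", "mining")):
    """Return True if every required word occurs in the title (AND semantics)."""
    words = set(title.lower().split())
    return all(word in words for word in required)

# Hypothetical records; real ones would come from a patent database export.
records = [
    {"title": "Text mining system and method", "office": "US"},
    {"title": "Data mining apparatus", "office": "US"},
    {"title": "Method for mining text documents", "office": "EP"},
]
selected = [r for r in records if title_matches(r["title"])]
per_office = Counter(r["office"] for r in selected)
print(len(selected), dict(per_office))
```

Counting the selected records per patent office gives the kind of breakdown reported above for US and European patents.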
To build the patent keyword matrix, we selected keywords from the frequent terms by eliminating
stop and common words. In this way, we determined the following candidate keywords: analysis,
associated, business, candidate, characteristic, coincident, computer, condition,
confidence, content, data, definition, degree, device, disclosed, document, element,
entities, extract, feature, frequency, generation, including, information, inherent,
input, knowledge, language, list, means, method, mining, phrase, piece, plurality,
portion, predetermined, present, processing, program, query, referring, search,
section, set, structured, system, term, text, topic, unit, used, user, value,
video, web, and word.
The final keywords can be determined based on the knowledge of experts in the text mining field. In
this paper, we determined the top ten keywords from the candidates other than text and mining.
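A document-by-keyword frequency matrix of this kind can be sketched as follows. The example is illustrative only: the tokenized documents, the `COMMON_WORDS` set, and the helper `keyword_matrix` are assumptions, and the top-k cut-off by total frequency stands in for the expert judgment described above.

```python
from collections import Counter

# Assumed words to exclude: the search terms themselves plus generic patent wording.
COMMON_WORDS = {"text", "mining", "method", "system"}

def keyword_matrix(token_docs, k):
    """Pick the k most frequent remaining terms and count them per document."""
    totals = Counter(t for doc in token_docs for t in doc if t not in COMMON_WORDS)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    keywords = [word for word, _ in ranked[:k]]
    rows = [[doc.count(word) for word in keywords] for doc in token_docs]
    return keywords, rows

# Hypothetical tokenized patent documents.
token_docs = [
    ["text", "mining", "query", "search", "query"],
    ["system", "search", "confidence", "text"],
    ["query", "confidence", "search", "mining"],
]
keywords, matrix = keyword_matrix(token_docs, k=2)
print(keywords)
print(matrix)
```

Each row of `matrix` corresponds to one patent and each column to one selected keyword, which is the structured form that the later statistical analyses operate on.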
[Figure: number of IPCs by IPC code; recovered axis labels include C12Q, G06T, and G10L]
We found that the text mining patents were classified under seven IPC codes. Most of these
patents fall under the IPC code G06F, which covers electric digital
data processing [14]. The IPC code C12Q covers measuring or testing
processes involving enzymes or micro-organisms, and the remaining IPC codes are related to
information technologies. We can therefore see that text mining technology is an interdisciplinary area
of technology development. Next, we sought more detailed knowledge through an advanced
analysis.
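A tally of this kind can be produced by truncating each classification symbol to its four-character subclass and counting. In this sketch the sample symbols and the helper `ipc_subclass` are made up for illustration; only the subclass codes G06F, C12Q, G06T, and G10L are taken from the text.

```python
from collections import Counter

def ipc_subclass(symbol):
    """Return the four-character subclass of an IPC symbol, e.g. 'G06F 17/30' -> 'G06F'."""
    return symbol.strip().upper()[:4]

# Hypothetical classification symbols, one per patent.
symbols = ["G06F 17/30", "G06F 17/27", "C12Q 1/68", "G10L 15/00", "g06f 7/00", "G06T 1/00"]
counts = Counter(ipc_subclass(s) for s in symbols)
for code, n in counts.most_common():
    print(code, n)
```

Sorting the counts in descending order makes the dominant subclass stand out immediately.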
Keyword        estimate   significance   estimate   significance
Business          0.11        0.83          0.40        0.22
Confidence        2.65        0.00          1.08        0.00
Content           0.06        0.84         -0.20        0.32
Disclosed         0.50        0.59          0.46        0.46
Extract           0.93        0.50          0.71        0.44
Feature           0.56        0.09          0.28        0.19
Information      -0.05        0.84          0.13        0.41
Query            -0.09        0.77         -0.05        0.81
Search           -0.00        0.99         -0.06        0.69
Structured       -0.88        0.29         -0.56        0.32
We carried out a multiple regression analysis, in which estimate is the regression parameter and
significance is the probability value (p-value). The estimated value indicates the relative influence of
a keyword on text or mining, and a keyword is considered statistically significant when its significance
is less than 0.05. Consistent with the results of the correlation analysis, we found that the confidence
technology is important for the development of text mining technology. We next applied our results to a practical problem.
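For reference, the reported quantities can be written out under the usual multiple linear regression assumptions. This is a generic textbook formulation, not necessarily the exact model specification used in this study; the symbols $y_i$ (frequency of text or mining in patent $i$), $x_{ij}$ (frequency of the $j$-th keyword in patent $i$), $n$ (number of patents), and $p$ (number of keywords) are notation introduced here for illustration.

```latex
\[
  y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
  \qquad i = 1, \dots, n
\]
\[
  \hat{\boldsymbol{\beta}} = (X^{\top} X)^{-1} X^{\top} \mathbf{y}
\]
\[
  t_j = \frac{\hat{\beta}_j}{\operatorname{se}(\hat{\beta}_j)},
  \qquad
  p_j = 2\, P\bigl(T_{n-p-1} \ge |t_j|\bigr)
\]
```

In this notation, estimate corresponds to $\hat{\beta}_j$ and significance to $p_j$, where $T_{n-p-1}$ follows a t-distribution with $n-p-1$ degrees of freedom.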
Acknowledgement
This work was supported by a research grant from Cheongju University in 2012.
References
[1] Roper, A. T., Cunningham, S. W., Porter, A. L., Mason, T. W., Rossini, F. A., Banks, J.,
Forecasting and Management of Technology, Wiley, 2011.
[2] Hunt, D., Nguyen, L., Rodgers, M., Patent Searching Tools & Techniques, Wiley, 2007.
[3] Tseng, Y., Lin, C., Lin, Y., Text mining techniques for patent analysis, Information Processing
and Management, vol. 43, no. 5, pp. 1216-1247, 2007.
[4] Han, J., Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2005.
[5] Luther, S., Berndt, D., Finch, D., Richardson, M., Hickling, E., Hickam, D., Using statistical
text mining to supplement the development of an ontology, Journal of Biomedical Informatics,
vol. 44, supp. 1, pp. 86-93, 2011.
[6] McCart, J., Luther, S., An Introductory Look at Statistical Text Mining for Health Services
Researchers, Seminar of Consortium for Healthcare Information Research, Health Service
Research & Development Service, Department of Veterans Affairs, the United States, 2012.
[7] Ross, S. M., Introduction to Probability and Statistics for Engineers and Scientists, Elsevier, 2009.
[8] Jun, S., Uhm, D., Patent and Statistics, What's the connection?, Communications of the Korea
Statistical Society, vol. 17, no. 2, pp. 205-222, 2010.
[9] Jun, S., Park, S., Jang, D., Technology Forecasting using Matrix Map and Patent Clustering,
Industrial Management & Data Systems, vol. 112, iss. 5, pp. 786-807, 2012.
[10] Feinerer, I., Hornik, K., Meyer, D., Text mining infrastructure in R, Journal of Statistical
Software, vol. 25, no. 5, pp. 1-54, 2008.
[11] KIPRIS, Korea Intellectual Property Rights Information Service, www.kipris.or.kr, 2013.
[12] Feinerer, I., Hornik, K., Meyer, D., Text mining infrastructure in R, Journal of Statistical
Software, vol. 25, no. 5, pp. 1-54, 2008.
[13] Feinerer, I., Hornik, K., Text Mining Package, Package tm, cran.r-project.org, 2013.
[14] WIPO, World Intellectual Property Organization, www.wipo.org, 2013.