Role of MATLAB in Crop Yield Estimation

Raorane A.A. is with the Department of Computer Science, Vivekanand College, Tarabai Park, Kolhapur, India, and Kulkarni R.V. is Head of the Department, Chh. Shahu Institute of Business Education and Research Centre, Kolhapur 416006, India.

Abstract: Agriculture provides food security and a strong economy for any country. According to the Food and Agriculture Organization, “Food security exists when all people have access to sufficient, safe and nutritious food to meet their dietary needs and food preferences for an active and healthy life.” Food security is a major determinant of national security and self-sufficiency. MATLAB plays an important role in crop yield estimation. In this paper an attempt has been made to estimate crop yield for selected cash crops (Rice, Groundnut, Soybean, Ragi) from sample data collected from twelve talukas of Kolhapur district, Maharashtra state. Analysis and estimation of yield was done using a process tool constructed in MATLAB. The model can be used for estimating the yield of any crop. Advances in computing and information storage have provided major means for calculating and assessing the results. The challenge has been to extract knowledge from the raw data; this has led to new methods and techniques, such as data mining, that can link the knowledge in the data to crop yield estimation. This research aimed to assess these new techniques and apply them to the various variables in the database to establish whether meaningful relationships can be found.
Keywords- Yield estimation, process model, regression analysis, crop cutting experiments, SQL Server, MATLAB

Introduction-
India today ranks second worldwide in average annual agricultural output. Agriculture and associated sectors like agro-forestry and fisheries accounted for 16.6% of the GDP in 2009, along with 50% of the total workforce. The economic contribution of agriculture to India's GDP is steadily declining with the country's broad-based economic growth. Agriculture is demographically the broadest economic sector and plays a major role in weaving the socio-economic fabric of India.
India is the world's largest producer of many fresh fruits, vegetables, milk, spices, fresh meats, fiber-yielding crops such as jute, millets and oil seeds, etc., as per FAO world agriculture statistics (2010)(ref.).

Current scenario-
Today, precision agriculture refers to the application of modern GPS technology in connection with small-scale, sensor-based treatment of the crop. This introduces large amounts of data, which are collected and stored for later use. Appropriate use of this data often leads to a gain in efficiency and thereby an economic advantage. However, the amount of data poses a problem, which should be solved using appropriate data mining techniques. One of the tasks that remains incomplete is yield prediction based on available data. This can be formulated and treated as a multi-dimensional regression task. This paper deals with appropriate regression techniques and evaluates them on selected agricultural data of Kolhapur District in Maharashtra state, India.

Kolhapur District Agricultural
The fertility of soils is determined by the various macro and micro nutrients available in the soil. The Panchganga Basin, a well watered and agriculturally developed region, covers an area of 45752.2 and supports a population of 26,11,547 (2.6 percent of the state). The index values of N, P and K were collected at village level from the government soil survey and soil testing laboratory, Kolhapur. These index values of N, P and K are grouped into six categories, and the tahsil-wise area in percentage in each category is computed. To recognize the fertility level of the soils, a composite index is computed from the NPK values and grouped into five categories. In Kolhapur district there is large variation in the distribution of soil macronutrients. It is observed that most areas of the study region are fertile in nature. Low and very low soil fertility is noted in some pockets only. The physiography, climate and agricultural activities have greatly influenced the nutrient status of the soil. Specific fertilizers and the addition of organic matter are recommended for nutrient-deficient areas, which will help to keep the balance of nutrients and restore the fertility of soils. Moreover, it was observed during the fieldwork that anthropogenic influences are degrading the soils in the region, which needs further investigation.

Technological Changes-
The role of technology in increasing agricultural and food production in the country is well known. However, adequate and convincing evidence on the impact of improved technologies and policies followed during different periods since 1951 in reducing variation in production and the resulting risk has been lacking. The issue of instability attracted a lot of attention from researchers in the early phase of adoption of green revolution technology, who found that adoption of new technology had increased instability in food grains and agricultural production in India. This conclusion was based on the period when improved technology had reached only a very small area. This study shows that when a slightly longer period
is taken into consideration, which witnessed the spread of improved technology to a large area, the inference of increased instability due to adoption of new technology is totally refuted. Yield variability in food grain crops as well as in non-food-grain crops was much lower in the first phase of the green revolution, extending up to 1988, as compared to the pre-green-revolution period. Volatility in yield, away from trend, witnessed a further decline during 1989-2007. Production of non-food grains shows an increase in instability during the last two decades, but production of food grains and of the total crop sector was much more stable in the recent period compared to the pre-green-revolution period and the first two decades of the green revolution in the country. This indicates that Indian agriculture has developed resilience to absorb various shocks in supply caused by climatic and other factors. Food grain production remained more unstable as compared to production of the group of non-food-grain crops. Instability in the yield of cereals and pulses declined over time. However, the opposite holds true for oilseeds. Oilseed production is also found to be more risky as compared to cereals and pulses. Among individual crops, wheat, paddy and sugarcane are found to be least risky, whereas bajra, groundnut, rapeseed/mustard, jowar and gram involve high risk. The pattern of area, yield and production instability of food grains differs widely across states. Yield instability was the major source of instability in food grain production in most of the states. Production was most stable in the state of Punjab, followed by Kerala. Haryana, Uttar Pradesh and West Bengal have brought down instability in food grain production sharply. Food grain production is highly unstable in the states of Maharashtra, Tamil Nadu, Orissa, Madhya Pradesh, Rajasthan and Gujarat.

Researcher’s own experience and situation of Agriculture in Kolhapur district
The yield of a crop depends on the geo-climatic conditions of an area. Farming in Kolhapur district is largely based on the past experience of the producer, in addition to the guidance given by government departments. Yield estimates are predicted annually by the statistical department of the government. After analyzing the outcome or the yield, a farmer understands the problems in the process of cultivation, progress of crops, cropping pattern, effect of rainfall, soil parameters and effects of fertilizers and pesticides on the crop. The important questions are whether the producers acquire this scientific data from the various agriculture departments and, if it is available, whether they receive it on time, and whether they have the capacity to analyze this information.

From the researcher’s point of view, three factors are important for estimation of yield, viz., area under cultivation, rainfall and soil fertility. After collecting this information, the relation between these variables is established to predict crop production.

Methodology:

A literature survey was carried out and personal interviews were conducted with various people concerned with agriculture. In the literature survey, books, PhD theses, research papers, articles and conference papers were studied. Throughout, discussions were held from time to time with researchers, agriculturists and government agricultural statisticians in order to arrive at the final model.

A researcher aiming to build a model should keep in mind that it must be easy for the end user to handle. There should be a standard input form for entering data from the desktop. The data is later processed with appropriate statistical formulae to give results in the form of tables, graphs, etc.

After the data has been entered through the input screen and stored in the database tables, it is processed using various SQL queries. This data is then connected to MATLAB for data mining. MATLAB simplifies the calculations through its various libraries of statistical functions.

To check the validity of the data before building the model in MATLAB, the data is analyzed in an Excel worksheet using its statistical functions.

DATA MINING PROCESS MODEL

In the course of this project, real data sets, primarily agricultural data sets provided by various agriculture organizations, were analyzed. From this experience a process model was developed for applying data mining techniques to data, with the goal of incorporating the induced domain information into a software module (Figure 1). The key points of this model (Garner et al., 1995) are:
• A two-way interaction between the provider of the data and the data mining expert. Both work together to transform the raw data into the final data set(s) input to the machine learning algorithm(s), with the domain expert providing information about data semantics and ‘legal’ transformations that can be applied to the data, and the data mining expert guiding the process so as to improve the intelligibility and accuracy of the results.
• An iterative approach. Machine learning is an exploratory process; it generally takes several cycles through the process model to find a good “fit” between a representation of the data and a data mining algorithm. In addition, distinct attribute combinations run through different schemes can produce wildly different data models, even though the predictive accuracy of the results may be equivalent. These alternative views may provide valuable insights into patterns covering different subsets of the data.

In the model presented in the following Figure, activity flows in a clockwise direction. In the pre-processing stage, the raw data is represented as a single table, as required by the data mining algorithms. This table is translated into the ARFF format, an attribute/value table representation that includes header information on the attributes' data types. The data may also require considerable ‘cleansing’ to remove outliers, handle missing values, detect erroneous values, and so forth.
At this point the data provider (domain expert) and the data mining expert collaborate to transform the cleansed data into a form that will produce a readable, accurate data model when processed by a data mining algorithm. These two analysts may, for example, hypothesize that one or more attributes are irrelevant, and set aside these extraneous columns. Attributes may be manipulated mathematically, for example to convert all columns containing productivity index of crop measurements to a common scale, to normalize values in a given column, or to combine two or more columns into a single derived attribute.
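The cleansing and attribute transformations described for the pre-processing stage can be illustrated with a minimal MATLAB sketch. This is not the authors' code: the table values are made up, the column layout (productivity index, rainfall, soil fertility index) is an assumption, and min-max scaling and the product-based derived attribute are only one possible choice of "common scale" and "derived attribute". Outlier handling is not shown.

```matlab
% Hypothetical raw table (made-up numbers): one row per year, columns are
% [productivity index, rainfall, soil fertility index].
raw = [ 92  1810  1.9;
       105  2040  2.1;
       NaN  1955  2.0;    % row with a missing productivity value
        98  1720  1.8;
       101  1990  2.2];

% Cleansing: drop every row that contains a missing value.
clean = raw(~any(isnan(raw), 2), :);

% Normalization: min-max scaling maps each column onto [0, 1] so that
% attributes measured in different units share a common scale.
% repmat is used so the code also runs on older MATLAB releases.
nRows  = size(clean, 1);
colMin = min(clean, [], 1);
colMax = max(clean, [], 1);
scaled = (clean - repmat(colMin, nRows, 1)) ./ repmat(colMax - colMin, nRows, 1);

% Derived attribute (illustrative assumption): combine scaled rainfall and
% scaled soil fertility into a single column.
derived = scaled(:, 2) .* scaled(:, 3);

disp(scaled);
disp(derived);
```

Dropping incomplete rows is the simplest treatment of missing values; with only a few years of data per taluka, imputing the value may be preferable.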

Figure- Process model for a machine learning application (data flow diagram)

One or more versions of the cleansed data are then processed by the data mining schemes. The domain expert determines which portions of the output are sufficiently novel or interesting to deserve further exploration, and which portions represent common knowledge for that field. The data mining expert interprets the algorithm’s output and gives advice on further experiments that could be run with this data.

Project Aim

The aim is to provide not only an analysis of selected data mining tools available within MATLAB and a synthesis of these tools, but more importantly a means to analyze and synthesize further data mining tools, thus providing an increasingly holistic view of the data mining capabilities of MATLAB.
Essentially, the researcher wishes to discover the extent to which each of a number of MATLAB data mining tools is capable of carrying out the different stages of the data mining process. He wishes to integrate these tools in order to bring greater clarity to the potential of MATLAB in the data mining arena and, in doing so, to clearly define the methodology used in carrying out this work, so that it might be used in future work in this area.
In summary, the researcher's aim is to create a means for obtaining a holistic view of the data mining capabilities of MATLAB. This is accomplished by setting forth the methodology of this process and by demonstrating it through the investigation and synthesis of several data mining tools available for MATLAB.

Project Motivation
MATLAB is a powerful and flexible tool for performing data mining. It is clear that MATLAB has not been given appropriate attention in this area. MATLAB is not yet in the league of packages such as Clementine, Weka and even Excel. In addition, though MATLAB is chosen more frequently than Oracle, it is generally used in conjunction with other tools. Whereas Oracle is implemented as a standalone tool over 50% of the time, MATLAB is used on its own just over 12% of the time.

Data Mining Model building [Regression analysis using

The objective of the present work is to formulate a sound model for agricultural yield estimation. With this objective in mind, a vast amount of data related to yield estimation was collected. As not all of the data is immediately relevant for model making, the data first has to be cleaned and rearranged to suit the selected statistical method. Different experimental models are then worked out, and if a model does not behave as expected, the researcher rearranges the model structure or checks the viability of the data for the model. This procedure is repeated until the final model works out. Consider, for example, the fitting of the regression model for estimation or prediction of crop yield. Initially there was only one model for the whole district, but the crop yields estimated by that model did not match the actual crop yields. The model was therefore rearranged to work at taluka level, and taluka-wise data was collected and analyzed.

Development of System
The above models have been used for analysis of the yields of four crops, viz. Paddy, Soybean, Groundnut and Ragi, because they are all rain-fed crops. With twelve tehsils and four crops, 48 models were designed. As some crops are not cultivated in some talukas, no data is available for them; e.g. Ragi is not cultivated in Shirol, Soybean is not cultivated in Gaganbawada, etc.

The above tasks were carried out using Excel. However, when complex analysis has to be carried out on a large amount of data, it is difficult to arrange the data and to remember the various steps carried out earlier. To overcome this difficulty, the researcher examined various options and arrived at the final strategy described below.

1) Consider the data as a database.
2) Select a suitable RDBMS for the data.
3) Complete the table design in the database.
4) Access the data in the data mining tool [MATLAB] for the final calculation, using the various predefined tools it provides.

The average yield, standard deviation, and partial and multiple correlation coefficients have been calculated and fitted in multiple regression planes. Finally, the principal component of the variables was calculated.

Multiple Regression: Some statistical methods serve as forecasting (or estimation) techniques. One such technique is regression analysis. Linear regression and correlation of two variables are familiar; these are called simple correlation and regression.
However, in practice, the variable under study is usually influenced by two or more variables, so two variables are not sufficient to describe it. For example, national income is based on several variables such as agricultural yield, industrial production, import, export, production of minerals, marine wealth, etc. The variable whose numerical value is to be predicted is taken as the dependent variable or response variable (national income), and all remaining variables are treated as independent variables or explanatory variables.

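Regression analysis based on one dependent variable and two or more independent variables is referred to as multiple regression.

In the formulae below, n is the number of observations, \sigma_i is the standard deviation of X_i, and x_i = X_i - \bar{X}_i is the deviation of X_i from its mean.

The correlation coefficient between X1 and X2 is

\[ r_{12} = \frac{\operatorname{Cov}(X_1, X_2)}{\sigma_1 \sigma_2} = \frac{\sum x_1 x_2}{n \sigma_1 \sigma_2} \quad \text{or} \quad \sum x_1 x_2 = n \sigma_1 \sigma_2 r_{12} \]

The correlation coefficient between X2 and X3 is

\[ r_{23} = \frac{\operatorname{Cov}(X_2, X_3)}{\sigma_2 \sigma_3} = \frac{\sum x_2 x_3}{n \sigma_2 \sigma_3} \quad \text{or} \quad \sum x_2 x_3 = n \sigma_2 \sigma_3 r_{23} \]

The correlation coefficient between X1 and X3 is

\[ r_{13} = \frac{\operatorname{Cov}(X_1, X_3)}{\sigma_1 \sigma_3} = \frac{\sum x_1 x_3}{n \sigma_1 \sigma_3} \quad \text{or} \quad \sum x_1 x_3 = n \sigma_1 \sigma_3 r_{13} \]

r12, r13, r23 are called total correlation coefficients.

Let R denote the matrix of these coefficients. Note that R is a symmetric matrix. The determinant of R, denoted |R|, is given by

\[ |R| = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{12} & 1 & r_{23} \\ r_{13} & r_{23} & 1 \end{vmatrix} = 1 \begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix} - r_{12} \begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix} + r_{13} \begin{vmatrix} r_{12} & 1 \\ r_{13} & r_{23} \end{vmatrix} \]

\[ |R| = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23} \]

Co-factors of the elements in the first row are given by

\[ R_{11} = (-1)^{1+1} \begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix} = 1 - r_{23}^2 \]

\[ R_{12} = (-1)^{1+2} \begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix} = r_{13} r_{23} - r_{12} \]

\[ R_{13} = (-1)^{1+3} \begin{vmatrix} r_{12} & 1 \\ r_{13} & r_{23} \end{vmatrix} = r_{12} r_{23} - r_{13} \]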
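As a numerical check of the determinant and co-factor expressions, the following MATLAB sketch may help. The observations are made up for illustration and the variable names are assumptions; `corrcoef` and `det` are standard MATLAB functions. The co-factors follow the usual definition, so that expanding |R| along the first row gives R11 + r12 R12 + r13 R13.

```matlab
% Hypothetical observations (made-up numbers): columns are X1, X2, X3.
X = [ 92 1810 1.9;
     105 2040 2.1;
      98 1720 1.8;
     101 1990 2.2;
      95 1880 2.0];

% 3x3 symmetric matrix of total correlation coefficients.
R   = corrcoef(X);
r12 = R(1,2);  r13 = R(1,3);  r23 = R(2,3);

% Determinant from the closed-form expansion and from det().
detClosed  = 1 - r12^2 - r13^2 - r23^2 + 2*r12*r13*r23;
detNumeric = det(R);

% Co-factors of the elements in the first row.
R11 = 1 - r23^2;
R12 = r13*r23 - r12;
R13 = r12*r23 - r13;

% Expansion of |R| along the first row using the co-factors.
detFromCofactors = R11 + r12*R12 + r13*R13;

fprintf('closed form: %.6f   det(R): %.6f   co-factor expansion: %.6f\n', ...
        detClosed, detNumeric, detFromCofactors);
```

All three printed values should agree up to rounding error.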

Fitting of Regression Planes:

In the researcher's regression equation, X1 is the productivity index, X2 the rainfall and X3 the soil fertility index.
The equation of the regression plane of X1 on X2 and X3 is given by

\[ X_1 = a + b_{12.3} X_2 + b_{13.2} X_3 \]

where a, b12.3 and b13.2 are constants to be determined by the method of least squares.
The required equation of the regression plane of X1 on X2 and X3 is given by

\[ \frac{R_{11}}{\sigma_1} (X_1 - \bar{X}_1) + \frac{R_{12}}{\sigma_2} (X_2 - \bar{X}_2) + \frac{R_{13}}{\sigma_3} (X_3 - \bar{X}_3) = 0 \]

where R11, R12 and R13 are the co-factors of the elements in the first row of R, and \sigma_i is the standard deviation of X_i.

It is used to estimate X1 (productivity index) when X2 (rainfall) and X3 (soil fertility index) for the crop are known for a given year. The value of X1 obtained is the estimated productivity index for that year. It is compared with the actual, i.e. observed, value of X1.

Stages in data analysis for crop estimation

The crop estimation involved the following important steps:

1] Finalization of variables

2] Finalization of the formulation

3] Model fitting using Excel

4] Summarization of results from the Excel analysis

Data Mining
From the analyzed datasets, the researcher extracts the knowledge, i.e. the estimation of crop yield based on historical data.

Database design

Arrange the data in a database for better data transformation.

For this process the researcher selected Sql Server version 2.5.

1] Establish E-R diagram.

2] Normalize E-R diagram.

3] Draw table description diagram.

4] Convert this into tables with primary key, foreign key, and other constraints.
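The model-fitting stage, i.e. fitting the regression plane of X1 on X2 and X3, can also be sketched in MATLAB. This is a minimal illustration and not the authors' implementation: the yearly values are made up, and the least-squares fit uses the backslash operator. The sketch additionally computes the slopes from the co-factor form of the plane, using b12.3 = -(σ1/σ2)(R12/R11) and b13.2 = -(σ1/σ3)(R13/R11), which follow from solving that equation for X1.

```matlab
% Hypothetical yearly data for one crop in one taluka (made-up numbers).
X1 = [  92;  105;   98;  101;   95];   % productivity index (response)
X2 = [1810; 2040; 1720; 1990; 1880];   % rainfall
X3 = [ 1.9;  2.1;  1.8;  2.2;  2.0];   % soil fertility index

% Least-squares fit of X1 = a + b12.3*X2 + b13.2*X3.
A    = [ones(size(X1)) X2 X3];
coef = A \ X1;                         % coef = [a; b12.3; b13.2]

% Slopes from the co-factor form of the regression plane.
R   = corrcoef([X1 X2 X3]);
s   = std([X1 X2 X3], 1);              % standard deviations, normalized by n
R11 = 1 - R(2,3)^2;
R12 = R(1,3)*R(2,3) - R(1,2);
R13 = R(1,2)*R(2,3) - R(1,3);
b12_3 = -(s(1)/s(2)) * (R12/R11);
b13_2 = -(s(1)/s(3)) * (R13/R11);

% Estimate X1 for a hypothetical year with known rainfall and fertility.
x2new = 1900;
x3new = 2.0;
x1hat = coef(1) + coef(2)*x2new + coef(3)*x3new;

fprintf('least squares: b12.3 = %.6f, b13.2 = %.4f\n', coef(2), coef(3));
fprintf('co-factors:    b12.3 = %.6f, b13.2 = %.4f\n', b12_3, b13_2);
fprintf('estimated X1 for the new year: %.2f\n', x1hat);
```

Both routes should print the same slopes up to rounding. For the 48 taluka-and-crop models, the same fit would simply be repeated on each subset of the data.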

Fig. Data Base

Using SQL Server
The database design includes factors such as productivity index, rainfall and soil fertility index for four crops (Rice, Groundnut, Soybean and Ragi) across 12 talukas. The researcher concentrated mainly on designing the database in such a way that the user can easily enter the data of any variable associated with crop estimation into the developed software. Due care has been taken that this data is analyzed properly by the software, so as to give correct estimation results.
For data connectivity to SQL Server, the researcher used ODBC (Open Database Connectivity) with a user DSN (data source name). An ODBC user data source stores information about how to connect to the indicated data provider.
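A minimal sketch of how such a user DSN might be consumed from MATLAB is given below. It assumes the Database Toolbox of the R2010 era (`database`, `exec`, `fetch` and `close`); later releases superseded this cursor-based workflow. The DSN name, login, table name and column names are all hypothetical and are not taken from the paper's database.

```matlab
% All names below are hypothetical: a user DSN 'YieldDSN', a login
% 'user'/'secret', and a table 'yield_data' with the columns shown.
conn = database('YieldDSN', 'user', 'secret');

% An empty Message property indicates that the connection succeeded.
if ~isempty(conn.Message)
    error('Connection failed: %s', conn.Message);
end

% Run a query; exec returns a cursor object.
sqlText = ['SELECT productivity_index, rainfall, soil_fertility ' ...
           'FROM yield_data ' ...
           'WHERE taluka = ''Shirol'' AND crop = ''Rice'''];
curs = exec(conn, sqlText);

% Import the result set into the cursor and read it out.
curs = fetch(curs);
data = curs.Data;          % a cell array by default (see setdbprefs)

% Release the cursor and the connection.
close(curs);
close(conn);
```

The rows in `data` would then be converted to a numeric matrix before any statistical analysis.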

Using MATLAB

The researcher connected this database to MATLAB for statistical analysis and developed an algorithm suited to the estimation model. MATLAB version R2010 was used for the software development.

MATLAB code for connectivity
    conn = database('data source name', 'user name', 'password')
This connects a MATLAB session to a database via an ODBC driver and assigns the returned connection object to conn. The arguments passed are as follows:
1) Data source name - The data source to which you connect.
2) User name and password - The user name and password used to connect to the database.

Sql query execution in MATLAB
The syntax for executing the query written for yield estimation in MATLAB is as follows:
    curs = exec(conn, 'sql query')
    e.g. curs = exec(conn, 'use yield Estimation')
A database object called a cursor is used. Running exec returns the cursor object to the variable curs, together with additional information about the cursor object.

2] The code is compact because of the predefined modules used in the code.
3] Less time is required to test a complex procedure.
4] The user can write his own functions, which can be reused whenever required for data analysis.
5] Very large databases can be handled.
6] Open, extensible system (many toolbox functions are distributed as readable source files, although MATLAB itself is proprietary).
After completing these stages, the researcher processed the data in the data mining model, which is constructed using MATLAB.

Summary

This paper has introduced the basic technologies of data mining and given a basic description of how a data mining architecture can be developed to deliver the value of data mining to the end user.

One task that remains incomplete is yield prediction based on available data. The author approached it from a data mining perspective, formulating and treating it as a multidimensional regression task. This paper deals with appropriate regression techniques and evaluates them on selected agricultural data.
