Role of MATLAB in Crop Yield Estimation: Raorane A.A. and Kulkarni R.V
International Journal of Scientific Research in Computer Science (IJSRCS) Vol. 2, Issue. 2, April. 2014
Raorane A.A. and Kulkarni R.V. 2
is taken into consideration, which witnessed the spread of improved technology to a large area, the inference of increased instability due to the adoption of new technology is totally refuted. Yield variability in food grain crops as well as in non-food grain crops was much lower in the first phase of the green revolution, extending up to 1988, than in the pre-green revolution period. Volatility in yield, away from trend, declined further during 1989-2007. Production of non-food grains shows an increase in instability during the last two decades, but production of food grains and of the total crop sector was much more stable in the recent period than in the pre-green revolution period and the first two decades of the green revolution in the country. This indicates that Indian agriculture has developed resilience to absorb various shocks in supply caused by climatic and other factors. Food grain production remained more unstable than production of the group of non-food grain crops. Instability in the yield of cereals and pulses declined over time. However, the opposite holds true for oilseeds. Oilseed production is also found to be more risky than that of cereals and pulses. Among individual crops, wheat, paddy and sugarcane are found least risky, whereas bajra, groundnut, rapeseed/mustard, jowar and gram involve high risk. The pattern in area, yield and production instability of food grains differs widely across states. Yield instability was the major source of instability in food grain production in most of the states. Production was most stable in the state of Punjab, followed by Kerala. Haryana, Uttar Pradesh and West Bengal have brought down instability in food grain production sharply. Food grain production is highly unstable in the states of Maharashtra, Tamil Nadu, Orissa, Madhya Pradesh, Rajasthan and Gujarat.

Researcher’s own experience and situation of Agriculture in Kolhapur district

Yield of a crop is dependent on the geo-climatic conditions of an area. Farming in Kolhapur district is largely based on the past experience of the producer, in addition to guidance from government departments. Yield estimates are predicted annually by the statistical department of the government. After analyzing the outcome or the yield, a farmer understands the problems in the process of cultivation, progress of crops, cropping pattern, effect of rainfall, soil parameters and effects of fertilizers and pesticides on the crop. The important questions are whether producers obtain this scientific data from the various agriculture departments and, if it is available, whether they obtain it on time, and whether they have the capacity to analyze this information.

From the researcher’s point of view, three factors are important for the estimation of yield, viz., area under cultivation, rainfall and soil fertility. After collecting this information, the relation between these variables is established to predict crop production.

Methodology:

A literature survey and personal interviews with various persons concerned with agriculture were conducted. In the literature survey, books, PhD theses, research papers, articles and conference papers were studied. In doing so, dialogue with researchers, agriculturists and government agriculture statisticians was carried out from time to time to arrive at the final model.

The researcher aiming to build a model should keep in mind that it must be easy for the end user to handle. There should be a standard input form to enter data through the desktop. The data is then processed with appropriate statistical formulae to produce results in the form of tables, graphs, etc.
After the data is entered through the input screen and stored in the database tables, it is processed using SQL queries. This data is then passed to MATLAB for data mining. MATLAB simplifies the calculations through its libraries of statistical functions.

To check the validity of the data before building the model in MATLAB, the data is analyzed in an Excel worksheet using its statistical functions.

DATA MINING PROCESS MODEL

In the course of this project, real data sets, primarily agricultural data sets provided by various agriculture organizations, were analyzed. From this experience a process model was developed for applying data mining techniques to data, with the goal of incorporating the induced domain information into a software module (Figure 1). The key points of this model (Garner et al, 1995) are:
• A two-way interaction between the provider of the data and the data mining expert. Both work together to transform the raw data into the final data set(s) input to the machine learning algorithms, with the domain expert providing information about data semantics and ‘legal’ transformations that can be applied to the data, and the data mining expert guiding the process so as to improve the intelligibility and accuracy of the results.
• An iterative approach. Machine learning is an exploratory process; it generally takes several cycles through the process model to find a good “fit” between a representation of the data and a data mining algorithm. In addition, distinct attribute combinations running through different schemes can produce wildly different data models, even though the predictive accuracy of the results may be equivalent. These alternative views may provide valuable insights into patterns covering different subsets of the data.

In the model presented in the following Figure, activity flows in a clockwise direction. In the pre-processing stage, the raw data is represented as a single table, as required by the data mining algorithms. This table is translated into the ARFF format, an attribute/value table representation that includes header information on the attributes’ data types. The data may also require considerable ‘cleansing’ to remove outliers, handle missing values, detect erroneous values, and so forth.

At this point the data provider (domain expert) and the data mining expert collaborate to transform the cleansed data into a form that will produce a readable, accurate data model when processed by a data mining algorithm. These two analysts may, for example, hypothesize that one or more attributes are irrelevant, and set aside these extraneous columns. Attributes may be manipulated mathematically, for example to convert all columns containing productivity index of crop measurements to a common scale, to normalize values in a given column, or to combine two or more columns into a single derived attribute.
Figure- Process model for a machine learning application (data flow diagram)
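The cleansing and attribute manipulation steps of the pre-processing stage can be illustrated with a minimal sketch. It is written in Python rather than MATLAB, and the column names (`rainfall_mm`, `soil_index`, `yield_kg_ha`), the values, and the helper functions are all hypothetical; it only handles missing and plainly erroneous values and does not attempt statistical outlier detection.

```python
from statistics import median

# Hypothetical raw records (made-up values); None marks a missing value.
raw_rows = [
    {"rainfall_mm": 1800.0, "soil_index": 0.62, "yield_kg_ha": 2400.0},
    {"rainfall_mm": None,   "soil_index": 0.58, "yield_kg_ha": 2250.0},
    {"rainfall_mm": 2100.0, "soil_index": 0.66, "yield_kg_ha": 2600.0},
    {"rainfall_mm": 1950.0, "soil_index": 0.60, "yield_kg_ha": -5.0},  # erroneous: negative yield
    {"rainfall_mm": 2000.0, "soil_index": 0.64, "yield_kg_ha": 2500.0},
]

def drop_erroneous(rows):
    """Remove rows whose yield is missing or not a positive number."""
    return [r for r in rows if r["yield_kg_ha"] is not None and r["yield_kg_ha"] > 0]

def fill_missing(rows, column):
    """Replace missing values in `column` with the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def add_derived(rows):
    """Combine two columns into a single derived attribute."""
    return [{**r, "rain_soil": r["rainfall_mm"] * r["soil_index"]} for r in rows]

def min_max_normalize(rows, column):
    """Rescale `column` to the range [0, 1]."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [{**r, column: 0.0 if span == 0 else (r[column] - lo) / span} for r in rows]

cleaned = drop_erroneous(raw_rows)
cleaned = fill_missing(cleaned, "rainfall_mm")
cleaned = add_derived(cleaned)  # derive before scaling so it uses raw units
cleaned = min_max_normalize(cleaned, "rainfall_mm")
```

Each step returns a new list of rows, so the domain expert and the data mining expert can inspect the intermediate tables and repeat or reorder the steps in later cycles.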
One or more versions of the cleansed data are then processed by the data mining schemes. The domain expert determines which portions of the output are sufficiently novel or interesting to deserve further exploration, and which portions represent common knowledge for that field. The data mining expert interprets the algorithm’s output and gives advice on further experiments that could be run with this data.

Project Aim

Hence, the aim is to provide not only an analysis of selected data mining tools available within MATLAB and a synthesis of these tools but, more importantly, a means to analyze and synthesize further data mining tools, thus providing an increasingly holistic view of the data mining capabilities of MATLAB.
Essentially, then, the researcher wishes to discover the extent to which each of a number of MATLAB data mining tools is capable of carrying out the different stages of the data mining process. He wishes to integrate these tools in order to bring greater clarity to the potential of MATLAB in the data mining arena and, in doing so, to clearly define the methodology used in carrying out this work, so that it might be used in future work in this area.
In summary, the researcher’s aim is to create a means of obtaining a holistic view of the data mining capabilities of MATLAB. This is accomplished by setting forth the methodology of this process and by demonstrating it through the investigation and synthesis of several data mining tools available for MATLAB.

Project Motivation

MATLAB is a powerful and flexible tool for performing data mining. It is clear that MATLAB has not been given appropriate attention in this area. MATLAB is not yet in the league of packages such as Clementine, Weka and even Excel. In addition, though MATLAB is chosen more frequently than Oracle, it is generally used in conjunction with other tools. Whereas Oracle is implemented as a standalone tool over 50% of the time, MATLAB is used on its own just over 12% of the time.

Data Mining Model building [Regression analysis using MATLAB]

The objective of the present work is to formulate a reliable model for agricultural yield estimation. With this objective in mind, vast data was collected in relation to yield estimation. But, as not all of the data is immediately relevant for model making, the data first has to be cleaned and rearranged to suit the selected statistical method. Then different experimental models are worked out, and if a model does not perform as expected, the researcher rearranges the model structure or checks the viability of the data for the model. This procedure is repeated until the final model works out. For example, consider fitting the regression model for estimation or prediction of crop yield. Initially there was only one model for the district, but the crop yields estimated by the model did not match the actual crop yields. So the model was rearranged to work at taluka level, and taluka-wise data was again collected and analyzed.

Development of System

The above models have been used for analysis of yields of four crops, viz. Paddy, Soybean, Groundnut and Ragi, because they are all rain-fed crops. With twelve tehsils (talukas) and four crops, 48 models were designed. As some crops were not cultivated in some talukas, no data is available for them, e.g. Ragi is not cultivated in Shirol, Soybean is not cultivated in Gaganbawada, etc.

The above tasks are carried out using Excel. But when complex analysis of a huge amount of data has to be carried out, it is difficult to arrange the data and to remember the various steps carried out earlier. To overcome this difficulty the researcher examined various options and arrived at the final strategy described below:

1) Consider the data as a database.
2) Select a suitable RDBMS for the data.
3) Complete the table design in the database.
4) Access the data in the data mining tool [MATLAB] for final calculation using the various predefined tools available in MATLAB.

The average yield, standard deviation, and partial and multiple correlation coefficients have been calculated and fitted in multiple regression planes. In the end, the principal component of the variables was calculated.

Multiple Regression: Some statistical methods serve as forecasting (or estimation) techniques. One such technique is regression analysis. We are familiar with linear regression and correlation of two variables, which is called simple correlation and regression.
However, in practice, we observe that the variable under study is influenced by two or more variables. Hence, two variables are not sufficient to describe it. For example, national income is based on several variables such as agricultural yield, industrial production, import, export, production of minerals, marine wealth, etc. A variable whose numerical value is to be predicted is taken as the dependent variable or response variable (national income) and all the remaining variables are treated as independent variables or explanatory variables.
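The per-taluka, per-crop summary statistics mentioned under Development of System can be sketched as follows. This is an illustrative Python sketch, not the authors' MATLAB or Excel procedure: the yield figures are made up, only two of the twelve talukas are shown, and `summarize` is a hypothetical helper. The population standard deviation (divisor n) is used here as an assumption.

```python
from statistics import mean, pstdev

TALUKAS = ["Shirol", "Gaganbawada"]  # two of the twelve talukas, for brevity
CROPS = ["Paddy", "Soybean", "Groundnut", "Ragi"]

# Hypothetical yields in kg/ha keyed by (taluka, crop); all values are made up.
# Combinations that are not cultivated simply have no entry.
yields = {
    ("Shirol", "Paddy"): [2400.0, 2550.0, 2300.0, 2650.0],
    ("Shirol", "Soybean"): [1800.0, 1750.0, 1900.0],
    ("Shirol", "Groundnut"): [1500.0, 1450.0, 1600.0],
    ("Gaganbawada", "Paddy"): [2100.0, 2200.0, 2000.0],
    ("Gaganbawada", "Groundnut"): [1300.0, 1350.0],
    ("Gaganbawada", "Ragi"): [1100.0, 1150.0, 1050.0],
}

def summarize(yields, talukas, crops):
    """Return ({(taluka, crop): (mean, std)}, [combinations with no data])."""
    summary, skipped = {}, []
    for taluka in talukas:
        for crop in crops:
            series = yields.get((taluka, crop), [])
            if not series:
                skipped.append((taluka, crop))  # crop not cultivated here
                continue
            summary[(taluka, crop)] = (mean(series), pstdev(series))
    return summary, skipped

summary, skipped = summarize(yields, TALUKAS, CROPS)
```

Keeping the skipped combinations in a separate list makes it explicit which of the taluka and crop models cannot be fitted for lack of data.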
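The regression analysis based on the dependent variable and two or more independent variables is referred to as multiple regression.

With each variable measured from its mean,

$$\sum X_1 X_2 = n\,\sigma_1 \sigma_2\, r_{12}$$

$$\sum X_2 X_3 = n\,\sigma_2 \sigma_3\, r_{23}$$

The correlation coefficient between X1 and X3 is

$$r_{13} = \frac{\mathrm{Cov}(X_1, X_3)}{\sigma_{X_1}\,\sigma_{X_3}} \quad \text{or} \quad \sum X_1 X_3 = n\,\sigma_1 \sigma_3\, r_{13}$$

r12, r13, r23 are called total correlation coefficients.

Let R be the matrix of total correlation coefficients,

$$R = \begin{pmatrix} 1 & r_{12} & r_{13} \\ r_{12} & 1 & r_{23} \\ r_{13} & r_{23} & 1 \end{pmatrix}$$

Note that R is a symmetric matrix and |R| denotes its determinant. The cofactors of the first-row elements of R are given by

$$R_{11} = (-1)^{1+1} \begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix} = 1 - r_{23}^2$$

$$R_{12} = (-1)^{1+2} \begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix} = r_{13}\, r_{23} - r_{12}$$

$$R_{13} = (-1)^{1+3} \begin{vmatrix} r_{12} & 1 \\ r_{13} & r_{23} \end{vmatrix} = r_{12}\, r_{23} - r_{13}$$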
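As a short worked step, added here for illustration and based on the standard three-variable results rather than on the source, expanding the determinant of the correlation matrix R along its first row connects the cofactors R11, R12, R13 to |R|:

$$|R| = R_{11} + r_{12} R_{12} + r_{13} R_{13} = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2\, r_{12}\, r_{13}\, r_{23}$$

Under the same standard definitions, the multiple correlation coefficient of X1 on X2 and X3, and the partial correlation coefficient between X1 and X2 with X3 held fixed, can be written in terms of these quantities (with R22 = 1 - r13^2 the cofactor of the second diagonal element):

$$R_{1.23}^2 = 1 - \frac{|R|}{R_{11}}, \qquad r_{12.3} = -\frac{R_{12}}{\sqrt{R_{11} R_{22}}} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$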
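The equation of the regression plane of X1 on X2 and X3 is

$$\frac{R_{11}}{\sigma_1}\,(X_1 - \bar{X}_1) + \frac{R_{12}}{\sigma_2}\,(X_2 - \bar{X}_2) + \frac{R_{13}}{\sigma_3}\,(X_3 - \bar{X}_3) = 0$$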
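The regression plane can be solved for X1 and evaluated numerically. The sketch below is in Python rather than MATLAB and rests on stated assumptions: X1 is taken as the response (for example a productivity index) and X2, X3 as explanatory variables (for example rainfall and a soil fertility index), the data are synthetic, and `regression_plane` is a hypothetical helper. It uses the first-row cofactors of the 3x3 correlation matrix, R11 = 1 - r23^2, R12 = r13 r23 - r12 and R13 = r12 r23 - r13, with population standard deviations.

```python
from math import sqrt

def pmean(xs):
    """Arithmetic mean."""
    return sum(xs) / len(xs)

def pstd(xs):
    """Population standard deviation (divisor n)."""
    m = pmean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def corr(xs, ys):
    """Total (Pearson) correlation coefficient between two series."""
    mx, my = pmean(xs), pmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstd(xs) * pstd(ys))

def regression_plane(x1, x2, x3):
    """Return (a, b2, b3) such that X1 = a + b2*X2 + b3*X3 on the fitted plane."""
    r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
    # Cofactors of the first row of the correlation matrix R.
    R11 = 1 - r23 ** 2
    R12 = r13 * r23 - r12
    R13 = r12 * r23 - r13
    s1, s2, s3 = pstd(x1), pstd(x2), pstd(x3)
    # Solve (R11/s1)(X1-m1) + (R12/s2)(X2-m2) + (R13/s3)(X3-m3) = 0 for X1.
    b2 = -(s1 / s2) * (R12 / R11)
    b3 = -(s1 / s3) * (R13 / R11)
    a = pmean(x1) - b2 * pmean(x2) - b3 * pmean(x3)
    return a, b2, b3

# Synthetic check: x1 is built exactly as 2 + 3*x2 - 1.5*x3.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0]
x1 = [2.0 + 3.0 * u - 1.5 * v for u, v in zip(x2, x3)]
a, b2, b3 = regression_plane(x1, x2, x3)
```

Because the synthetic response lies exactly on a plane, the recovered intercept and slopes should match the generating values up to floating-point error, which gives a simple sanity check before the same routine is applied to real taluka-level data.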
Database design
1] Finalization of variables
Using SQL Server

The database design includes factors like productivity index, rainfall and soil fertility index for four crops (Rice, Groundnut, Soybean and Ragi) for 12 talukas. The researcher concentrated mainly on designing the database in such a way that the user can easily input the data of any variable associated with crop estimation into the developed software. The researcher has taken due care that this data is analyzed properly through the software, to get correct estimation results.

For data connectivity to SQL Server, the researcher used ODBC (Open Database Connectivity) with a user DSN (data source name). An ODBC user data source stores information about how to connect to the indicated data provider.
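The kind of table design and query described above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database from the Python standard library stands in for the SQL Server database reached through the ODBC user DSN, and the table name `crop_observation`, its columns, the helper `load_series`, and all values are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the SQL Server database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE crop_observation (
        taluka               TEXT    NOT NULL,
        crop                 TEXT    NOT NULL,
        year                 INTEGER NOT NULL,
        productivity_index   REAL,
        rainfall_mm          REAL,
        soil_fertility_index REAL,
        PRIMARY KEY (taluka, crop, year)
    )
    """
)

# Made-up rows, for illustration only.
rows = [
    ("Shirol", "Rice", 2010, 1.05, 610.0, 0.58),
    ("Shirol", "Rice", 2011, 0.98, 540.0, 0.57),
    ("Shirol", "Groundnut", 2010, 1.10, 610.0, 0.58),
    ("Gaganbawada", "Rice", 2010, 0.92, 5200.0, 0.66),
]
conn.executemany("INSERT INTO crop_observation VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

def load_series(conn, taluka, crop):
    """Fetch the yearly observations for one taluka and crop, ordered by year."""
    cur = conn.execute(
        "SELECT year, productivity_index, rainfall_mm, soil_fertility_index "
        "FROM crop_observation WHERE taluka = ? AND crop = ? ORDER BY year",
        (taluka, crop),
    )
    return cur.fetchall()

shirol_rice = load_series(conn, "Shirol", "Rice")
```

Keying the table on taluka, crop and year keeps one row per observation, so a single parameterized query returns the series needed for each taluka-level model, and a combination that is not cultivated simply returns no rows.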