Mini Project Surya

An Industry-Oriented Mini Project Report
On
“MALWARE DETECTION USING MACHINE
LEARNING AND PERFORMANCE EVALUATION”
Submitted in Partial Fulfillment of the

Academic Requirement for the Award of
Degree of
BACHELOR OF TECHNOLOGY
in
Computer Science and Engineering
(Artificial Intelligence & Machine Learning)
Submitted By:
B.MANI SURYA (20R01A66C6)
Under the esteemed guidance of
Mr. P. Vijay Kumar
Asst. Prof., Dept. of CSE (AI&ML)
CMR INSTITUTE OF TECHNOLOGY

(UGC AUTONOMOUS)
Approved by AICTE, Permanent Affiliation to JNTUH, Accredited by NBA and NAAC
Kandlakoya (V), Medchal Dist-501 401
www.cmrithyderabad.edu.in
2023-24
CERTIFICATE
This is to certify that the Industry-Oriented Mini Project entitled “MALWARE DETECTION
USING MACHINE LEARNING AND PERFORMANCE EVALUATION” is being submitted
by:

CH.NIKHIL REDDY (20R01A66D0)
K.MANIKANTA (20R01A66E6)
A.MANISH (20R01A66C1)
to JNTUH, Hyderabad, in partial fulfillment of the requirement for the award of the degree of B.Tech
in CSE (AI&ML), and is a record of bonafide work carried out under our guidance and
supervision. The results in this project have been verified and are found to be satisfactory. The
results embodied in this work have not been submitted to any other University for the award of any
other degree or diploma.
Signature of Guide Signature of Project Coordinator Signature of HOD

Mr. P.Vijay Kumar Dr.S.Dhanalakshmi Prof. P. Pavan Kumar
ACKNOWLEDGEMENT
We are extremely grateful to Dr. M. Janga Reddy, Director, Dr. B. Satyanarayana, Principal, and
Prof. P. Pavan Kumar, Head of Department, Dept. of Computer Science and Engineering (AI & ML),
CMR Institute of Technology, for their inspiration and valuable guidance during the entire project.
We are extremely thankful to Dr. S. Dhanalakshmi, Project Coordinator, and our internal guide Mr. P.
Vijay Kumar, Dept. of Computer Science and Engineering (AI & ML), CMR Institute of Technology,
for their constant guidance, encouragement and moral support throughout the project.
We would be failing in our duty if we did not gratefully acknowledge the authors of the
references and other literature referred to in this Project.
We express our thanks to all staff members and friends for all the help and coordination extended in
bringing out this Project successfully and on time.
Finally, we are very thankful to our parents and relatives, who guided us directly or indirectly at
every step towards success.

CH.NIKHIL REDDY (20R01A66D0)
K.MANIKANTA (20R01A66E6)
A.MANISH (20R01A66C1)
ABSTRACT
Owing to its open-source nature and Google's backing, the Android platform has the largest global
market share. As the world's most popular operating system, it has drawn the attention of
cyber criminals, who operate particularly through wide distribution of malicious applications.
This work proposes an effective machine-learning based approach for Android malware
detection that uses an evolutionary Genetic Algorithm for discriminatory feature selection.
The features selected by the Genetic Algorithm are used to train machine learning classifiers, and
their capability to identify malware before and after feature selection is compared.
The experimental results validate that the Genetic Algorithm gives an optimized feature
subset, reducing the feature dimension to less than half of the original feature set.
A classification accuracy of more than 94% is maintained after feature selection for the machine
learning based classifiers while working on a much reduced feature dimension, thereby having
a positive impact on the computational complexity of the learning classifiers.
INDEX
ACKNOWLEDGEMENT
ABSTRACT
INDEX
LIST OF FIGURES
LIST OF SCREENSHOTS
1. INTRODUCTION
1.1 ABOUT PROJECT
1.2 EXISTING SYSTEM
1.3 PROPOSED SYSTEM
2. REQUIREMENT SPECIFICATIONS
2.1 REQUIREMENT ANALYSIS
2.2 SPECIFICATION PRINCIPLES
3. SYSTEM DESIGN
3.1 ARCHITECTURE DIAGRAM
3.2 UML DIAGRAMS
4. IMPLEMENTATION
4.1 PROJECT MODULES
4.2 ALGORITHMS
4.3 SAMPLE CODE
5. TESTING
5.1 TESTING METHODS
6. RESULTS
7. CONCLUSION
8. REFERENCES
LIST OF FIGURES
Figure No.   Particulars
1.3          Proposed Methodology
3.1          Architecture Diagram
3.1          Algorithm Diagram
3.2.1        Use Case Diagram
3.2.2        Sequence Diagram
3.2.3        Class Diagram
3.2.4        Component Diagram
3.2.5        Activity Diagram
Malware Detection using Machine Learning and Performance Evaluation
LIST OF SCREENSHOTS
Screenshot No.   Particulars
6.1              Upload Dataset
6.2              Select Dataset
6.3              Generate Train & Test Model
6.4              Run SVM Algorithm
6.5              Run SVM with Genetic Algorithm
6.6              Result
6.7              Run Neural Network Algorithm
6.8              Run Neural Network with Genetic Algorithm
6.9              NN with Genetic Result
6.10             Accuracy Graph 1
6.11             Accuracy Graph 2
1. INTRODUCTION
1.1 ABOUT PROJECT:
Android apps are freely available for users to download from Google Play Store, the official
Android app store, as well as from third-party app stores. Owing to the platform's open-source
nature and popularity, malware writers are increasingly focusing on developing malicious
applications for the Android operating system. In spite of various attempts by Google Play
Store to protect against malicious apps, they still find their way to the mass market and
harm users by leaking personal information, such as phone book entries, mail accounts and
GPS location, to third parties, or by taking control of the phones remotely. Therefore, there
is a need to perform malware analysis or reverse engineering of such malicious applications,
which pose a serious threat to Android platforms.
Broadly speaking, Android malware analysis is of two types: static analysis and dynamic
analysis. Static analysis involves analyzing the code structure without executing it, while
dynamic analysis examines the runtime behavior of Android apps in a constrained
environment. Given the ever-increasing variants of Android malware posing zero-day
threats, an efficient detection mechanism is required that, unlike the signature-based
approach, does not depend on regular updates of a signature database.
1.2 EXISTING SYSTEM:

The main contribution of the work is the reduction of the feature dimension to less than half of
the original feature set using a Genetic Algorithm, so that the reduced set can be fed as input to
machine learning classifiers for training with reduced complexity while maintaining their accuracy
in malware classification. In contrast to the exhaustive method of feature selection, which requires
testing 2^N different combinations, where N is the number of features, the Genetic Algorithm,
a heuristic search approach based on a fitness function, has been used for feature selection.
The optimized feature set obtained using the Genetic Algorithm is used to train two machine
learning classifiers: Support Vector Machine and Neural Network.
1.3 PROPOSED SYSTEM:
Two sets of Android apps (APKs), malware and goodware, are reverse engineered to extract
features such as permissions and counts of app components such as Activities, Services,
Content Providers, etc. These features form the feature vector, with the class labels Malware
and Goodware represented by 0 and 1 respectively, stored in CSV format. To reduce the
dimensionality of the feature set, the CSV is fed to the Genetic Algorithm to select the most
optimized set of features. The optimized set of features obtained is used for training two
machine learning classifiers: Support Vector Machine and Neural Network. In the proposed
methodology, static features are obtained from AndroidManifest.xml, which contains all the
important information needed by any Android platform about the apps. The Androguard tool has
been used for disassembling the APKs and getting the static features.
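The feature extraction step described above can be illustrated with a minimal sketch. It assumes the manifest has already been decoded to plain-text XML (inside an APK the manifest is stored in a binary form, which a tool such as Androguard decodes), and it uses only the Python standard library. The function name `manifest_features`, the sample manifest and the three-entry permission vocabulary are hypothetical and chosen purely for illustration; they are not the project's actual extraction code.

```python
import xml.etree.ElementTree as ET

# Namespace that prefixes "android:" attributes in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical decoded manifest, used only for illustration.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.app">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <application>
    <activity android:name=".Main"/>
    <activity android:name=".Settings"/>
    <service android:name=".Sync"/>
  </application>
</manifest>"""

# Hypothetical permission vocabulary: one binary feature per entry.
PERMISSION_VOCAB = [
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
    "android.permission.SEND_SMS",
]


def manifest_features(manifest_xml, permission_vocab):
    """Return [permission flags..., n_activity, n_service, n_provider, n_receiver]."""
    root = ET.fromstring(manifest_xml)
    requested = {el.get(ANDROID_NS + "name") for el in root.iter("uses-permission")}
    app = root.find("application")
    component_counts = [
        len(app.findall(tag)) if app is not None else 0
        for tag in ("activity", "service", "provider", "receiver")
    ]
    flags = [1 if perm in requested else 0 for perm in permission_vocab]
    return flags + component_counts


print(manifest_features(SAMPLE_MANIFEST, PERMISSION_VOCAB))
# [1, 0, 1, 2, 1, 0, 0]
```

Each app yields one such row; appending the class label to it gives one line of the CSV that is later passed to feature selection.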
2. REQUIREMENT SPECIFICATIONS
2.1 REQUIREMENT ANALYSIS
2.1.1 HARDWARE REQUIREMENTS
For Developing the application the following are the hardware requirements
• PROCESSOR - Pentium–III
• SPEED – 2.4GHz
• RAM - 512 MB(min)
• HARD DISK - 20 GB
• FLOPPY DRIVE - 1.44MB
• KEY BOARD - Standard Keyboard
• MONITOR – 15 VGA Colour
2.1.2 SOFTWARE REQUIREMENTS:
The following software is required to develop the application:
• Python
• Pip (to install Python packages)
Operating systems supported:
• Windows
Technologies and packages used for development:
• TensorFlow
• NumPy
• Pandas
• Matplotlib
2.2 SPECIFICATION PRINCIPLES:

When developing a malware detection system using machine learning, adhering to certain
specification principles is crucial for its effectiveness and reliability. First and foremost, a
comprehensive feature set should be identified, encompassing various aspects of file behavior
and structure. This involves selecting relevant features that distinguish between benign and
malicious entities, such as API calls, file size, and entropy. Moreover, the dataset used for
training must be representative of real-world scenarios, ensuring diversity in malware types
and sources.
Additionally, the model's architecture and hyperparameters need careful consideration. A
well-structured model, possibly utilizing deep learning techniques for intricate pattern
recognition, should be chosen. Regularization techniques and proper validation strategies
must be employed to prevent overfitting. Interpretability of the model is essential, allowing
security analysts to understand the rationale behind flagged instances.
Continuous updating of the system is imperative to address evolving malware tactics. Regular
retraining on fresh datasets and incorporating feedback from real-world incidents ensures the
model remains resilient against emerging threats. Finally, the system should integrate
seamlessly into existing cybersecurity frameworks, promoting interoperability and facilitating
efficient deployment in diverse environments. Adhering to these specification principles
enhances the robustness and adaptability of machine learning-based malware detection
systems.
2.2.1 SOFTWARE DESCRIPTION:
Python:
Python is an interpreted high-level programming language for general-purpose programming.
Created by Guido van Rossum and first released in 1991, Python has a design philosophy that
emphasizes code readability, notably using significant whitespace.
Python features a dynamic type system and automatic memory management. It supports
multiple programming paradigms, including object-oriented, imperative, functional and
procedural, and has a large and comprehensive standard library.
Python is Interpreted − Python is processed at runtime by the interpreter. You do not need
to compile your program before executing it. This is similar to PERL and PHP.
Python is Interactive − you can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
Python also acknowledges that speed of development is important. Readable and terse code is
part of this, and so is access to powerful constructs that avoid tedious repetition of code.
Maintainability also ties into this: lines of code may be an all but useless metric, but it does say
something about how much code you have to scan, read and/or understand to troubleshoot problems
or tweak behaviors. This speed of development, the ease with which a programmer of other
languages can pick up basic Python skills, and the huge standard library are key to another area
where Python excels. All its tools have been quick to implement, saved a lot of time, and
several of them have later been patched and updated by people with no Python background,
without breaking.
Machine Learning :
Before we take a look at the details of various machine learning methods, let's start by
looking at what machine learning is, and what it isn't. Machine learning is often categorized as
a subfield of artificial intelligence. The study of machine learning certainly arose from research
in this context, but in the data science application of machine learning methods, it's more helpful
to think of machine learning as a means of building models of data.
Fundamentally, machine learning involves building mathematical models to help understand
data. "Learning" enters the fray when we give these models tunable parameters that can be
adapted to observed data; in this way the program can be considered to be "learning" from the
data. Once these models have been fit to previously seen data, they can be used to predict and
understand aspects of newly observed data. We leave to the reader the more philosophical
digression regarding the extent to which this type of mathematical, model-based "learning" is
similar to the "learning" exhibited by the human brain. Understanding the problem setting in
machine learning is essential to using these tools effectively, and so we will start with some
broad categorizations of the types of approaches we'll discuss here.
PACKAGES USED IN PROJECT :
TensorFlow :
TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for
machine learning applications such as neural networks. It is used for both research and
production at Google.
Numpy :
NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays.
It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:
• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data-types can be defined using NumPy, which allows
NumPy to seamlessly and speedily integrate with a wide variety of databases.
Pandas :
Pandas is an open-source Python library providing high-performance data manipulation and
analysis tools built on its powerful data structures. Before Pandas, Python was used mainly for
data munging and preparation and contributed very little to data analysis itself; Pandas solved
this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis
of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of academic and commercial domains, including
finance, economics, statistics and analytics.
Matplotlib :
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats and interactive environments across platforms. Matplotlib can be
used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web application
servers, and four graphical user interface toolkits. Matplotlib tries to make easy things easy
and hard things possible. You can generate plots, histograms, power spectra, bar charts, error
charts, scatter plots, etc., with just a few lines of code.
For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when
combined with IPython. For the power user, there is full control of line styles, font
properties, axes properties, etc., via an object-oriented interface or via a set of functions
familiar to MATLAB users.
Scikit-learn :
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a
consistent interface in Python. It is licensed under a permissive simplified BSD license and is
distributed under many Linux distributions, encouraging academic and commercial use.
3. SYSTEM DESIGN
3.1 ARCHITECTURE/BLOCK DIAGRAM:
Feature selection is an important part of machine learning because it reduces data
dimensionality, and extensive research has been carried out to find reliable feature selection
methods. Two families of methods have been used for feature selection: filter methods and
wrapper methods. In a filter method, features are selected on the basis of their scores in various
statistical tests that measure the relevance of features by their correlation with the dependent
(outcome) variable. A wrapper method finds a subset of features by measuring the usefulness of
the subset with respect to the dependent variable. Hence filter methods are independent of any
machine learning algorithm, whereas in a wrapper method the best feature subset selected depends
on the machine learning algorithm used to train the model. In a wrapper method, a subset
evaluator considers all possible subsets and uses a classification algorithm to induce a
classifier from the features in each subset. The subset of features with which the classification
algorithm performs best is selected.
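As an illustration of the filter approach described above, the following sketch ranks features by the absolute Pearson correlation between each feature column and the class label and keeps the top k. It is a minimal example under stated assumptions: the function names `pearson` and `filter_select` and the four-row toy dataset are hypothetical, and correlation is only one of several statistical scores a filter method could use.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def filter_select(rows, labels, k):
    """Score every feature column against the labels and return the k best column indices."""
    n_features = len(rows[0])
    scores = [abs(pearson([row[j] for row in rows], labels)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data (hypothetical): column 0 tracks the label, column 1 is unrelated,
# column 2 is constant and therefore carries no information.
rows = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 1]]
labels = [1, 1, 0, 0]

selected, scores = filter_select(rows, labels, k=1)
print(selected, scores)  # [0] [1.0, 0.0, 0.0]
```

Note that no classifier is trained here, which is what makes the method a filter; a wrapper method would instead train and evaluate a classifier for each candidate subset.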
ALGORITHMS USED IN THIS PROJECT:
Steps involved in the genetic algorithm to choose important attributes:
Step 1: Initialize the algorithm using feature subsets which are binary encoded, such that a
feature is represented by 1 in the chromosome if it is included and by 0 if it is excluded.
Step 2: Start the algorithm by defining an initial population generated randomly.
Step 3: Assign each chromosome a fitness score calculated by the fitness function defined for
the genetic algorithm.
Step 4: Selection of parents: chromosomes with good fitness scores are given preference
over others to produce the next generation of offspring.
Step 5: Perform crossover and mutation operations on the selected parents, with the given
probabilities of crossover and mutation, to generate offspring.
Repeat Steps 3 to 5 iteratively until convergence is reached; the fittest chromosome in the
population is the resulting optimal feature subset.
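The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation: the tournament selection, single-point crossover, bit-flip mutation, fixed generation count and all parameter values are assumptions, and `toy_fitness` is a hypothetical stand-in that rewards three "informative" feature positions and penalizes subset size. In the setting of this report, a natural fitness function would instead be the accuracy of a classifier trained on the selected features, which is also an assumption.

```python
import random


def toy_fitness(chromosome):
    """Hypothetical fitness: reward features 0, 3 and 5, penalize every selected feature."""
    informative = (0, 3, 5)
    hits = sum(chromosome[i] for i in informative)
    return hits - 0.1 * sum(chromosome)


def run_ga(n_features, fitness, pop_size=20, generations=30,
           p_crossover=0.8, p_mutation=0.05, seed=0):
    """Return (best chromosome, best-so-far fitness per generation). Needs n_features >= 2."""
    rng = random.Random(seed)
    # Steps 1-2: random initial population of binary chromosomes (1 = feature kept).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        # Step 3: score every chromosome.
        scores = [fitness(c) for c in population]

        # Step 4: tournament selection favours fitter chromosomes.
        def pick_parent():
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        children = []
        while len(children) < pop_size:
            p1, p2 = pick_parent()[:], pick_parent()[:]
            # Step 5a: single-point crossover with probability p_crossover.
            if rng.random() < p_crossover:
                cut = rng.randrange(1, n_features)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # Step 5b: bit-flip mutation with probability p_mutation per gene.
            for child in (p1, p2):
                for i in range(n_features):
                    if rng.random() < p_mutation:
                        child[i] = 1 - child[i]
                children.append(child)
        population = children[:pop_size]
        # Keep the best chromosome seen so far, so the reported best never gets worse.
        best = max(population + [best], key=fitness)
        history.append(fitness(best))
    return best, history


best, history = run_ga(10, toy_fitness)
selected = [i for i, bit in enumerate(best) if bit == 1]
print("selected feature indices:", selected)
```

The sketch stops after a fixed number of generations rather than testing for convergence; a convergence check (for example, no improvement over several generations) could replace the fixed loop.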
3.2 UML DIAGRAMS:
3.2.1 USE CASE DIAGRAM:
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors
in the system can be depicted.
3.2.2 SEQUENCE DIAGRAM:
A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are sometimes called event
diagrams, event scenarios, and timing diagrams.
3.2.3 CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language (UML) is
a type of static structure diagram that describes the structure of a system by showing the
system's classes, their attributes, operations (or methods), and the relationships among the
classes. It explains which class contains information.
3.2.4 COMPONENT DIAGRAM:
A component diagram is a special kind of diagram in UML. Its purpose is also
different from all other diagrams discussed so far. It does not describe the functionality of the
system; it describes the components used to provide that functionality. Thus, from that
point of view, component diagrams are used to visualize the physical components in a
system. These components are libraries, packages, files, etc.
3.2.5 ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise
activities and actions with support for choice, iteration and concurrency. In the Unified
Modeling Language, activity diagrams can be used to describe the business and operational
step-by-step workflows of components in a system. An activity diagram shows the overall
flow of control.
4. IMPLEMENTATION
4.1 PROJECT MODULES:
SERVICE PROVIDER
In this module, the Service Provider has to log in using a valid user name and password.
After a successful login, the Service Provider can perform operations such as Login, Browse
Water Data Sets and Train & Test, View Trained and Tested Water Data Sets Accuracy in Bar Chart,
View Trained and Tested Water Data Sets Accuracy Results, View Predicted Water Quality
Detection Type, Find Water Quality Detection Type Ratio, Download Predicted Data Sets,
View Water Quality Detection Ratio Results, View All Remote Users.
VIEW AND AUTHORIZE USERS
In this module, the admin can view the list of all registered users. The admin can
view each user's details, such as user name, email and address, and authorize the users.
REMOTE USER
In this module, there are n users. A user must register before performing
any operations. Once a user registers, their details are stored in the database. After
successful registration, the user has to log in using the authorized user name and password. Once
the login is successful, the user can perform operations such as REGISTER AND LOGIN, PREDICT
WATER QUALITY DETECTION TYPE, VIEW YOUR PROFILE.
4.2 ALGORITHMS:
Decision tree classifiers:
Decision tree classifiers are used successfully in many diverse areas. Their most important
feature is the capability of capturing descriptive decision-making knowledge from the
supplied data. A decision tree can be generated from training sets. The procedure for such
generation, based on a set of objects (S), each belonging to one of the classes C1, C2, …, Ck,
is as follows:
Step 1. If all the objects in S belong to the same class, for example Ci, the decision tree for S
consists of a leaf labeled with this class.
Step 2. Otherwise, let T be some test with possible outcomes O1, O2, …, On. Each object in
S has one outcome for T, so the test partitions S into subsets S1, S2, …, Sn, where each object in
Si has outcome Oi for T. T becomes the root of the decision tree, and for each outcome Oi we
build a subsidiary decision tree by invoking the same procedure recursively on the set Si.
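The two-step procedure above maps directly onto a short recursive function. The sketch below is a simplified illustration: the names `build_tree` and `classify`, the dictionary-based tree representation and the toy objects are hypothetical, and the test T is simply taken as the next feature in a given list. Practical algorithms such as ID3 or C4.5 choose T by a criterion such as information gain instead.

```python
from collections import Counter


def build_tree(objects, tests):
    """objects: list of (feature_dict, class_label); tests: feature names still available."""
    labels = [label for _, label in objects]
    # Step 1: all objects share one class (or no test is left): return a leaf.
    if len(set(labels)) == 1 or not tests:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick a test T, partition S by its outcomes, and recurse on each subset Si.
    test, remaining = tests[0], tests[1:]
    partitions = {}
    for features, label in objects:
        partitions.setdefault(features[test], []).append((features, label))
    return {
        "test": test,
        "branches": {outcome: build_tree(subset, remaining)
                     for outcome, subset in partitions.items()},
    }


def classify(tree, features):
    """Walk from the root to a leaf, following the outcome of each test."""
    while isinstance(tree, dict):
        tree = tree["branches"][features[tree["test"]]]
    return tree


# Toy objects (hypothetical features and labels, for illustration only).
objects = [
    ({"sms": 1, "net": 1}, "malware"),
    ({"sms": 1, "net": 0}, "malware"),
    ({"sms": 0, "net": 1}, "goodware"),
    ({"sms": 0, "net": 0}, "goodware"),
]
tree = build_tree(objects, ["sms", "net"])
print(tree)
print(classify(tree, {"sms": 1, "net": 0}))  # malware
```

When the available tests run out before a subset becomes pure, the sketch falls back to the majority class, which is a common convention rather than part of the procedure stated above.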
Gradient boosting
Gradient boosting is a machine learning technique used in regression and classification tasks,
among others. It gives a prediction model in the form of an ensemble of weak prediction
models, which are typically decision trees.[1][2] When a decision tree is the weak learner, the
resulting algorithm is called gradient-boosted trees; it usually outperforms random forest. A
gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but
it generalizes the other methods by allowing optimization of an arbitrary differentiable loss
function.
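To make the stage-wise construction concrete, the sketch below boosts one-split regression trees ("stumps") on a one-dimensional toy problem with squared-error loss. It is an illustration under stated assumptions, not a production implementation: for squared error the negative gradient is simply the residual, so each stage fits a stump to the current residuals. The names `fit_stump` and `gradient_boost`, the learning rate, the number of stages and the data are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (xs needs >= 2 distinct values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Build an additive model: mean prediction plus a sum of shrunken stumps."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_stages):
        # Negative gradient of 0.5 * (y - prediction)^2 with respect to the prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
baseline = sum(ys) / len(ys)
print("baseline MSE:", mse(lambda x: baseline, xs, ys))
print("boosted MSE: ", mse(model, xs, ys))
```

Swapping in a different differentiable loss only changes how the residual-like targets are computed at each stage, which is the generalization the paragraph above refers to.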
K-Nearest Neighbors (KNN)
• A simple but very powerful classification algorithm
• Classifies based on a similarity measure
• Non-parametric
• Lazy learning: it does not “learn” until the test example is given
• Whenever there is new data to classify, its K nearest neighbors are found in the training
data
• The prediction is made from these k closest training examples in feature space
• Feature space means the space spanned by the categorization variables (non-metric variables)
• Learning is instance-based and therefore lazy: all computation is deferred until a test or
prediction input arrives and is compared with the stored training instances.
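The points above can be illustrated with a minimal sketch of the prediction step. It assumes numeric feature vectors compared by Euclidean distance and a majority vote among the k nearest training examples; the function name `knn_predict` and the toy points are hypothetical. Note that there is no training phase at all: the "model" is the stored training set, which is what lazy learning means here.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label). Return the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical two-dimensional features, for illustration only).
train = [
    ((0, 0), "goodware"), ((0, 1), "goodware"), ((1, 0), "goodware"),
    ((5, 5), "malware"), ((5, 6), "malware"), ((6, 5), "malware"),
]

print(knn_predict(train, (5.2, 5.1)))  # malware
print(knn_predict(train, (0.2, 0.1)))  # goodware
```

An odd k avoids tied votes in two-class problems such as this one.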
4.3 SAMPLE CODE:

INDEX.HTML
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import collections
from datetime import datetime
import hashlib
import os.path
import random
import re
import sys
import tarfile

import numpy as np
from six.moves import urllib
import tensorflow as tf

from tensorflow.python.framework import graph_util
from tensorflow.python.framework import tensor_shape
from tensorflow.python.platform import gfile
from tensorflow.python.util import compat

FLAGS = None

MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # ~134M


def create_image_lists(image_dir, testing_percentage, validation_percentage):
    # Build per-label lists of training, testing and validation images.
    if not gfile.Exists(image_dir):
        tf.logging.error("Image directory '" + image_dir + "' not found.")
        return None
    result = collections.OrderedDict()
    sub_dirs = [
        os.path.join(image_dir, item)
        for item in gfile.ListDirectory(image_dir)]
    sub_dirs = sorted(item for item in sub_dirs if gfile.IsDirectory(item))
    for sub_dir in sub_dirs:
        extensions = ['jpg', 'jpeg', 'JPG', 'JPEG']
        file_list = []
        dir_name = os.path.basename(sub_dir)
        if dir_name == image_dir:
            continue
        tf.logging.info("Looking for images in '" + dir_name + "'")
        for extension in extensions:
            file_glob = os.path.join(image_dir, dir_name, '*.' + extension)
            file_list.extend(gfile.Glob(file_glob))
        if not file_list:
            tf.logging.warning('No files found')
            continue
        if len(file_list) < 20:
            tf.logging.warning(
                'WARNING: Folder has less than 20 images, which may cause issues.')
        elif len(file_list) > MAX_NUM_IMAGES_PER_CLASS:
            tf.logging.warning(
                'WARNING: Folder {} has more than {} images. Some images will '
                'never be selected.'.format(dir_name, MAX_NUM_IMAGES_PER_CLASS))
        label_name = re.sub(r'[^a-z0-9]+', ' ', dir_name.lower())
        training_images = []
        testing_images = []
        validation_images = []
        for file_name in file_list:
            base_name = os.path.basename(file_name)
            # Hash the file name so each image always lands in the same split.
            hash_name = re.sub(r'_nohash_.*$', '', file_name)
            hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
            percentage_hash = ((int(hash_name_hashed, 16) %
                                (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                               (100.0 / MAX_NUM_IMAGES_PER_CLASS))
            if percentage_hash < validation_percentage:
                validation_images.append(base_name)
            elif percentage_hash < (testing_percentage + validation_percentage):
                testing_images.append(base_name)
            else:
                training_images.append(base_name)
        result[label_name] = {}
ground_truths = []
if how_many >= 0:
    # Retrieve a random sample of bottlenecks.
    for unused_i in range(how_many):
        label_index = random.randrange(class_count)
        label_name = list(image_lists.keys())[label_index]
        image_index = random.randrange(MAX_NUM_IMAGES_PER_CLASS + 1)
        image_name = get_image_path(image_lists, label_name, image_index,
                                    image_dir, category)
        bottleneck = get_or_create_bottleneck(
            sess, image_lists, label_name, image_index, image_dir, category,
            bottleneck_dir, jpeg_data_tensor, decoded_image_tensor,
            resized_input_tensor, bottleneck_tensor, architecture)
        ground_truth = np.zeros(class_count, dtype=np.float32)
        ground_truth[label_index] = 1.0
        bottlenecks.append(bottleneck)
        ground_truths.append(ground_truth)
        filenames.append(image_name)
else:
    # Retrieve all bottlenecks.
    for label_index, label_name in enumerate(image_lists.keys()):
        for image_index, image_name in enumerate(
                image_lists[label_name][category]):
            image_name = get_image_path(image_lists, label_name, image_index,
                                        image_dir, category)
            bottleneck = get_or_create_bottleneck(
                sess, image_lists, label_name, image_index, image_dir, category,
                bottleneck_dir, jpeg_data_tensor, decoded_image_tensor,
                resized_input_tensor, bottleneck_tensor, architecture)
            ground_truth = np.zeros(class_count, dtype=np.float32)
            ground_truth[label_index] = 1.0

ground_truths = []
label_index = random.randrange(class_count)
label_name = list(image_lists.keys())[label_index]
image_index = random.randrange(MAX_NUM_IMAGES_PER_CLASS + 1)
image_path = get_image_path(image_lists, label_name, image_index, image_dir,
                            category)
if not gfile.Exists(image_path):

resize_scale = 1.0 + (random_scale / 100.0)
margin_scale_value = tf.constant(margin_scale)
resize_scale_value = tf.random_uniform(tensor_shape.scalar(), minval=1.0,
                                       maxval=resize_scale)
scale_value = tf.multiply(margin_scale_value, resize_scale_value)
precrop_width = tf.multiply(scale_value, input_width)
precrop_height = tf.multiply(scale_value, input_height)
precrop_shape = tf.stack([precrop_height, precrop_width])
precrop_shape_as_int = tf.cast(precrop_shape, dtype=tf.int32)
layer_biases = tf.Variable(tf.zeros([class_count]), name='final_biases')

cross_entropy_mean = tf.reduce_mean(cross_entropy)
tf.summary.scalar('cross_entropy', cross_entropy_mean)
with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(FLAGS.learning_rate)


def prepare_file_system():
    # Start from a clean summaries directory.
    if tf.gfile.Exists(FLAGS.summaries_dir):
        tf.gfile.DeleteRecursively(FLAGS.summaries_dir)
    tf.gfile.MakeDirs(FLAGS.summaries_dir)
    if FLAGS.intermediate_store_frequency > 0:

    bottleneck_tensor_name = 'pool_3/_reshape:0'
    bottleneck_tensor_size = 2048
    input_width = 299
    input_height = 299
    input_depth = 3
    resized_input_tensor_name = 'Mul:0'
    model_file_name = 'classify_image_graph_def.pb'
    input_mean = 128
    input_std = 128
elif architecture.startswith('mobilenet_'):
    parts = architecture.split('_')
    if len(parts) != 3 and len(parts) != 4:
        tf.logging.error("Couldn't understand architecture name '%s'",
                         architecture)
        return None
    version_string = parts[1]
    if (version_string != '1.0' and version_string != '0.75' and
            version_string != '0.50' and version_string != '0.25'):
        tf.logging.error(
            """The Mobilenet version should be '1.0', '0.75', '0.50', or '0.25', but found '%s' for
            architecture '%s'""",
            version_string, architecture)
        return None
    size_string = parts[2]
    if (size_string != '224' and size_string != '192' and
            size_string != '160' and size_string != '128'):
        tf.logging.error(
            """The Mobilenet input size should be '224', '192', '160', or '128', but found '%s' for
            architecture '%s'""",
            size_string, architecture)
        return None
    if len(parts) == 3:
        is_quantized = False
    else:
        if parts[3] != 'quantized':
            tf.logging.error
# Look at the folder structure, and create lists of all the images.
image_lists = create_image_lists(FLAGS.image_dir, FLAGS.testing_percentage,
                                 FLAGS.validation_percentage)
class_count = len(image_lists.keys())
if class_count == 0:
    tf.logging.error('No valid folders of images found at ' + FLAGS.image_dir)
    return -1
if class_count == 1:
    tf.logging.error('Only one valid folder of images found at ' +
                     FLAGS.image_dir +
                     ' - multiple classes are needed for classification.')
    return -1

with tf.Session(graph=graph) as sess:
    # Set up the image decoding sub-graph.
    jpeg_data_tensor, decoded_image_tensor = add_jpeg_decoding(
        model_info['input_width'], model_info['input_height'],
        model_info['input_depth'], model_info['input_mean'],
        model_info['input_std'])

    else:
        # We'll make sure we've calculated the 'bottleneck' image summaries and
        # cached them on disk.
        cache_bottlenecks(sess, image_lists, FLAGS.image_dir, FLAGS.bottleneck_dir,
                          jpeg_data_tensor,
5. TESTING:
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, subassemblies, assemblies and/or a finished product. It is the process of
exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner.
5.1 UNIT TESTING:

Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.
Test Objectives
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.
Features to be Tested
• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
5.2 INTEGRATION TESTING
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects.
The task of the integration test is to check that components or software applications, e.g.
components in a software system or – one step up – software applications at the company
level – interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
5.3 ACCEPTANCE TESTING
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
Test Results: All the test cases mentioned above passed successfully.
6. RESULTS
6.1 UPLOAD DATASET
In the above screen, click the ‘Upload Android Malware Dataset’ button to upload the dataset.
6.2 SELECT DATASET
In the above screen the ‘AndroidDataset.csv’ file is selected; after the upload completes, the
screen below is displayed.
6.3 GENERATE TRAIN & TEST MODEL
Now click the ‘Generate Train & Test Model’ button to split the dataset into training and test
parts. Every machine learning algorithm uses 80% of the dataset for training and the remaining
20% to test the accuracy of the trained model. Clicking the button produces the train and test sets.
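The 80/20 split can be sketched as follows. This is a minimal illustration, not the project's code: the function name `split_train_test`, the seed, and the stand-in rows are assumptions. The row count of 3799 is the one reported in section 6.4. scikit-learn's `train_test_split` is the usual tool for this step and is expected to give the same sizes with `test_size=0.2`; the standard library is used here only to keep the sketch self-contained.

```python
import math
import random


def split_train_test(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and return (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # The test size is rounded up; the training set takes the remainder.
    n_test = math.ceil(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 3799 dataset rows; the real rows come from the CSV file.
rows = list(range(3799))
train, test = split_train_test(rows)
print(len(train), len(test))
```

With 3799 rows this gives 3039 training records and 760 test records, which matches the counts reported in section 6.4.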
6.4 RUN SVM ALGORITHM
In the above screen we can see that the dataset contains 3799 Android app records in total, of
which the application uses 3039 for training and 760 for testing. With the train and test sets
ready, click the ‘Run SVM Algorithm’ button to build an SVM model on the training set and
measure its accuracy on the test set.
6.5 RUN SVM WITH GENETIC ALGORITHM
In the above screen SVM achieved 98% accuracy. Now click the ‘Run SVM with Genetic
Algorithm’ button to select an optimized feature subset and then run SVM on the selected
features to get its accuracy.
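The feature-selection step can be sketched as a small genetic algorithm over 0/1 feature masks. This is an illustrative sketch under stated assumptions, not the project's implementation: the population size, mutation rate, single-point crossover, and elitism are choices made here, and `toy_fitness` is a hypothetical stand-in. In the project the fitness would presumably be the classifier's accuracy when trained on the selected columns only.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximises fitness(mask)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]        # selection: keep the fitter half
        children = [scored[0][:]]               # elitism: carry over the best mask
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation: each bit flips with probability mutation_rate.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best


# Hypothetical fitness for demonstration: reward a known set of "useful"
# features and penalise every extra feature that is switched on.
INFORMATIVE = {0, 3, 4, 7}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    extras = sum(mask) - hits
    return hits - 0.5 * extras


best_mask = genetic_feature_selection(n_features=10, fitness=toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit]
print(selected)
```

Because each fitness evaluation in the real setting means training a classifier, the population size and number of generations directly control how long the selection step takes.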
6.6 RESULT
In the above screen SVM with the Genetic Algorithm achieved 93% accuracy. This is lower than
plain SVM, but its execution time is also lower, as the comparison graph shows.
(Note: running the genetic algorithm opens 4 empty windows; close all 4 of them and let the
main window continue running.)
6.7 RUN NEURAL NETWORK ALGORITHM
In the above console we can see that the genetic algorithm chose 40 features from all the
dataset features. Now click the ‘Run Neural Network Algorithm’ button to test the neural
network's accuracy.
6.8 RUN NEURAL NETWORK WITH GENETIC ALGORITHM
In the above screen the neural network achieved 98.64% accuracy. Now click the ‘Run Neural
Network with Genetic Algorithm’ button to get the neural network's accuracy with the genetic
algorithm.
6.9 NN WITH GENETIC RESULT
In the above screen NN with the genetic algorithm achieved 98.02% accuracy. Now click the
‘Accuracy Graph’ button to see the accuracy of all the algorithms in a graph.
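The accuracy figures quoted in this section are presumably the fraction of test records that each model classifies correctly. A minimal sketch of that calculation is given below; the label encoding (1 for malware, 0 for benign) and the sample values are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = malware, 0 = benign.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
print(f"{accuracy(y_true, y_pred):.2%}")  # 6 of 8 correct
```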
6.10 EXECUTION TIME GRAPH 1
In the above graph the x-axis represents the algorithm name and the y-axis represents accuracy;
across all the algorithms, SVM shows high accuracy. Now click the ‘Execution Time Graph’
button to get the execution time of all the algorithms.
6.11 EXECUTION TIME GRAPH 2
In the above graph the x-axis represents the algorithm name and the y-axis represents execution
time. From this graph we can conclude that machine learning algorithms take less time to build
a model when combined with the genetic algorithm.
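Execution time of the kind plotted in this graph can be measured with a small timing helper. The sketch below is an assumption about how such a measurement could be taken, not the project's code: `train_stub` is a hypothetical stand-in whose cost grows with the number of features, and the figure of 100 total features is invented for the demonstration (only the 40 selected features are reported above).

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def train_stub(n_features):
    """Stand-in for model training; the work grows with the feature count."""
    return sum(i * i for i in range(n_features * 10_000))


_, t_all = timed(train_stub, 100)      # hypothetical full feature set
_, t_selected = timed(train_stub, 40)  # 40 features chosen by the genetic algorithm
print(f"all features: {t_all:.4f}s, selected features: {t_selected:.4f}s")
```

Measured times vary from run to run, so repeating each measurement several times and averaging gives a more reliable comparison.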
7. CONCLUSIONS
7.1 CONCLUSION
As the number of threats to Android platforms increases day by day, spreading mainly through
malicious applications (malware), it is very important to design a framework that can detect
such malware accurately. Where the signature-based approach fails to detect new malware
variants that pose zero-day threats, machine learning based approaches are being used. The
proposed methodology uses an evolutionary Genetic Algorithm to obtain an optimized feature
subset that can be used to train machine learning algorithms efficiently.
From the experiments, it can be seen that a decent classification accuracy of more than 94%
is maintained using Support Vector Machine and Neural Network classifiers while working
on a lower-dimensional feature set, thereby reducing the training complexity of the classifiers.
Future work can use larger datasets for improved results and analyze the effect on other
machine learning algorithms when used in conjunction with the Genetic Algorithm.
8. REFERENCES
[1] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, and K. Rieck, “Drebin: Effective and
Explainable Detection of Android Malware in Your Pocket,” in Proceedings 2014 Network
and Distributed System Security Symposium, 2014.
[2] N. Milosevic, A. Dehghantanha, and K. K. R. Choo, “Machine learning aided Android
malware classification,” Comput. Electr. Eng., vol. 61, pp. 266–274, 2017.
[3] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-An, and H. Ye, “Significant Permission
Identification for Machine-Learning-Based Android Malware Detection,” IEEE Trans. Ind.
Informatics, vol. 14, no. 7, pp. 3216–3225, 2018.
[4] A. Saracino, D. Sgandurra, G. Dini, and F. Martinelli, “MADAM: Effective and
Efficient Behavior-based Android Malware Detection and Prevention,” IEEE Trans.
Dependable Secur. Comput., vol. 15, no. 1, pp. 83–97, 2018.
[5] S. Arshad, M. A. Shah, A. Wahid, A. Mehmood, H. Song, and H. Yu,
“SAMADroid: A Novel 3-Level Hybrid Malware Detection Model for Android Operating
System,” IEEE Access, vol. 6, pp. 4321–4339, 2018.
