TDS Notes Jan22 Term
The first thing to do in any data science exercise is to discover the problem.
– 1) Know the audience (person/role) [because a different audience means
different questions on the same data]
– 2) Situation
– 3) Problem
– 4) Action
– 5) Impact
Eg. John, the marketing head, (person/role)
must create a region-wise budget, (situation)
but doesn’t know the region-wise ROI. (problem)
By prioritizing the regions, (action)
he can maximize the ROI. (impact)
Week 2
Get the data – by downloading, querying (using an API), or scraping (from web
pages or PDFs).
tsv – tab-separated values
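As a small illustration of the tsv format, the sketch below parses tab-separated text with Python's built-in csv module. The region names and figures are made-up sample data, not taken from the course.

```python
import csv
import io

# Hypothetical sample data: columns are separated by tabs, rows by newlines.
tsv_text = "region\tspend\troi\nNorth\t1000\t1.8\nSouth\t1500\t1.2\n"

# delimiter="\t" tells the csv reader to split on tabs instead of commas.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

# Every value is read as a string; convert to numbers where needed.
for row in rows:
    print(row["region"], int(row["spend"]), float(row["roi"]))
```

With pandas, the same kind of file can be read with pd.read_csv(path, sep="\t").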
Libraries–
BeautifulSoup - to parse HTML code when scraping data from a web page
geocoder API using Nominatim - to extract location information
requests - to get a web page
json - to convert an API response into a Python dictionary object
urlencode - to build the query string of a long URL from its parameters
pandas - to manipulate the data
wikipedia - to extract data from Wikipedia (wk below is the imported module)
wk.search() - search for page titles matching a query
wk.summary() - get the summary of a page
wk.summary(.., sentences= limit number) - limit the summary to a number of sentences
wk.page() - get the full page
tabula - to convert a PDF table to a CSV file
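To make the roles of urlencode and json concrete, here is a minimal sketch using only the standard library. The endpoint, the parameters, and the response text are all hypothetical stand-ins; in a real query the response text would come from something like requests.get(url).text.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, used only to show the URL structure.
base_url = "https://example.com/search"
params = {"q": "Chennai, India", "format": "json", "limit": 1}

# urlencode escapes spaces and commas and joins the key=value pairs with "&".
url = base_url + "?" + urlencode(params)
print(url)

# Made-up response body standing in for what an API might return.
response_text = '[{"display_name": "Chennai, Tamil Nadu, India", "lat": "13.08", "lon": "80.27"}]'

# json.loads turns the JSON text into Python lists and dictionaries.
results = json.loads(response_text)
first = results[0]
print(first["display_name"], float(first["lat"]), float(first["lon"]))
```

Building the query string from a dictionary keeps long URLs readable and avoids escaping mistakes that creep in when the string is assembled by hand.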
Week 4 & 5
The PyCaret library helps in building end-to-end machine learning pipelines.
EXCEL
Correlation - answers how often there is an effect
Regression - answers how much of an effect there is
Outliers - answers when this fails
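The three questions above can be sketched numerically. The notes do this in Excel (where CORREL and SLOPE compute the same quantities); the Python below just spells out the arithmetic. The spend and sales figures are made-up sample data.

```python
import math

# Made-up sample data: advertising spend (x) and sales (y) for five regions.
x = [10, 20, 30, 40, 50]
y = [25, 41, 62, 79, 103]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and cross-deviations from the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Correlation: how consistently y moves with x (always between -1 and 1).
r = sxy / math.sqrt(sxx * syy)

# Regression: how much y changes per unit of x (slope), plus the intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Outliers: points with a large residual are where the fitted line fails.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(round(r, 4), round(slope, 2), round(intercept, 2))
print([round(res, 2) for res in residuals])
```

A point whose residual is much larger than the others is a candidate outlier and worth inspecting before trusting the slope.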
PYTHON
Classification - answers which group a given data point belongs to
Forecasting - answers what is likely to happen next (prediction)
Clustering - answers how things group together (using multiple variables)
For categorical variables where an order can be specified - Ordinal encoder is used
to convert them into numerical values
For categorical variables where an order can’t be specified - Label encoder is used.
Week 6 & 7
Design the output– communicating the message to the audience through visuals
General-purpose tools – Excel, Google Data Studio, Power BI, Tableau
Specialized tools – Excel (VBA), Flourish Studio (for better animation),
Kumu (network visualization), QGIS (geographic visualization)
Classification report - compares TextBlob sentiment predictions against human labels
The two apply functions:
1) TextBlob_subjectivity - score lies between 0 (objective) and 1 (subjective)
2) TextBlob_polarity - score lies between -1 and 1. Around 0 means neutral.
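To compare polarity scores with human labels, the scores first have to be turned into categories. The sketch below shows one way to do that; the function name and the 0.1 neutral band are assumptions for illustration, and the scores are made-up stand-ins for values such as TextBlob(text).sentiment.polarity.

```python
def polarity_to_label(polarity, neutral_band=0.1):
    """Map a polarity score in [-1, 1] to a sentiment label.

    neutral_band is a hypothetical threshold: scores within it count as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie between -1 and 1")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Made-up scores standing in for real polarity values.
scores = [0.8, 0.05, -0.6]
labels = [polarity_to_label(s) for s in scores]
print(labels)  # ['positive', 'neutral', 'negative']
```

The resulting labels can then be set against the human labels in a classification report; the width of the neutral band is a judgment call and changes the result.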
Flourish
Line chart duration (movement from one data point to another) is in
milliseconds.
The line, bar and pie template animates in terms of both drawing and morphing.
Survey - animation duration and stagger settings (in milliseconds)
Heatmap - fade and flip animation settings (in seconds)
Spider - animation duration for both ‘draw’ and ‘morph’ (in seconds)
Hierarchy - animation duration, simple speed settings (in seconds)
Google Charts
1. Extensive library of plots / charts
a. Useful to browse around for inspiration on how to tell a story using
your data (e.g., Sankey / alluvial charts)
2. User supplies information; Google Charts returns graphical charts
3. Easy interface via R (googleVis)
Tableau: one of the most widely used enterprise software tools for data visualization.
Tableau Prep: ETL tool for data engineering.
Tableau Desktop: development tool for building interactive dashboards.
Tableau Server: a hosting server for large real-time dashboards.
Week 8
Tools to narrate a story: Excel, Power BI, Tableau, Google Sheets & Comicgen
Quill is used for narrating stories. It is an extension used in Tableau. It transforms
visualizations into narratives.
Power BI:
1) updates narrative automatically when data selection is modified.
2) allows custom narratives with dynamic values
3) use option “Summarise” to generate automatic narratives.
Glitch
- User friendly browser text editor
- Difficulty in scaling.
- Low cache memory limit in runtime
- Github integration
Netlify
- Used for Static Web Hosting
- Global Network
- Instant Cache Invalidation
- Continuous deployments
- html/css/js backend
- Github integration
- Recommended for static websites
Vercel
- Used for Static Web Hosting
- Low cache memory limit in runtime
- Continuous deployments
- Github integration