《Bad Data Handbook》电子书下载

Bad Data Handbooktxt,chm,pdf,epub,mobi下载
作者: Q. Ethan McCallum
出版社: O'Reilly Media
副标题: Cleaning Up The Data So You Can Get Back To Work
出版年: 2012-11-21
页数: 264
定价: USD 39.99
装帧: Paperback
ISBN: 9781449321888

内容简介  · · · · · ·

What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they’ve recovered from nasty data problems.

From cranky storage to poor representation to misguided policy, there...




作者简介  · · · · · ·

Q Ethan McCallum is a consultant, writer, and technology enthusiast, though perhaps not in that order. His work has appeared online on The O’Reilly Network and Java.net, and also in print publications such as C/C++ Users Journal, Doctor Dobb’s Journal, and Linux Magazine. In his professional roles, he helps companies to make smart decisions about data and technology.




目录  · · · · · ·

Chapter 1 Setting the Pace: What Is Bad Data?
Chapter 2 Is It Just Me, or Does This Data Smell Funny?
Understand the Data Structure
Field Validation
Value Validation
Physical Interpretation of Simple Statistics
· · · · · · ()
Chapter 1 Setting the Pace: What Is Bad Data?
Chapter 2 Is It Just Me, or Does This Data Smell Funny?
Understand the Data Structure
Field Validation
Value Validation
Physical Interpretation of Simple Statistics
Visualization
Keyword PPC Example
Search Referral Example
Recommendation Analysis
Time Series Data
Conclusion
Chapter 3 Data Intended for Human Consumption, Not Machine Consumption
The Data
The Problem: Data Formatted for Human Consumption
The Solution: Writing Code
Postscript
Other Formats
Summary
Chapter 4 Bad Data Lurking in Plain Text
Which Plain Text Encoding?
Guessing Text Encoding
Normalizing Text
Problem: Application-Specific Characters Leaking into Plain Text
Text Processing with Python
Exercises
Chapter 5 (Re)Organizing the Web’s Data
Can You Get That?
General Workflow Example
The Real Difficulties
The Dark Side
Conclusion
Chapter 6 Detecting Liars and the Confused in Contradictory Online Reviews
Weotta
Getting Reviews
Sentiment Classification
Polarized Language
Corpus Creation
Training a Classifier
Validating the Classifier
Designing with Data
Lessons Learned
Summary
Resources
Chapter 7 Will the Bad Data Please Stand Up?
Example 1: Defect Reduction in Manufacturing
Example 2: Who’s Calling?
Example 3: When “Typical” Does Not Mean “Average”
Lessons Learned
Will This Be on the Test?
Chapter 8 Blood, Sweat, and Urine
A Very Nerdy Body Swap Comedy
How Chemists Make Up Numbers
All Your Database Are Belong to Us
Check, Please
Live Fast, Die Young, and Leave a Good-Looking Corpse Code Repository
Rehab for Chemists (and Other Spreadsheet Abusers)
tl;dr
Chapter 9 When Data and Reality Don’t Match
Whose Ticker Is It Anyway?
Splits, Dividends, and Rescaling
Bad Reality
Conclusion
Chapter 10 Subtle Sources of Bias and Error
Imputation Bias: General Issues
Reporting Errors: General Issues
Other Sources of Bias
Conclusions
References
Chapter 11 Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
But First, Let’s Reflect on Graduate School …
Moving On to the Professional World
Moving into Government Work
Government Data Is Very Real
Service Call Data as an Applied Example
Moving Forward
Lessons Learned and Looking Ahead
Chapter 12 When Databases Attack: A Guide for When to Stick to Files
History
Consider Files as Your Datastore
File Concepts
A Web Framework Backed by Files
Reflections
Chapter 13 Crouching Table, Hidden Network
A Relational Cost Allocations Model
The Delicate Sound of a Combinatorial Explosion…
The Hidden Network Emerges
Storing the Graph
Navigating the Graph with Gremlin
Finding Value in Network Properties
Think in Terms of Multiple Data Models and Use the Right Tool for the Job
Acknowledgments
Chapter 14 Myths of Cloud Computing
Introduction to the Cloud
What Is “The Cloud”?
The Cloud and Big Data
Introducing Fred
At First Everything Is Great
They Put 100% of Their Infrastructure in the Cloud
As Things Grow, They Scale Easily at First
Then Things Start Having Trouble
They Need to Improve Performance
Higher IO Becomes Critical
A Major Regional Outage Causes Massive Downtime
Higher IO Comes with a Cost
Data Sizes Increase
Geo Redundancy Becomes a Priority
Horizontal Scale Isn’t as Easy as They Hoped
Costs Increase Dramatically
Fred’s Follies
Myth 1: Cloud Is a Great Solution for All Infrastructure Components
Myth 2: Cloud Will Save Us Money
Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID
Myth 4: Cloud Computing Makes Horizontal Scaling Easy
Conclusion and Recommendations
Chapter 15 The Dark Side of Data Science
Avoid These Pitfalls
Know Nothing About Thy Data
Thou Shalt Provide Your Data Scientists with a Single Tool for All Tasks
Thou Shalt Analyze for Analysis’ Sake Only
Thou Shalt Compartmentalize Learnings
Thou Shalt Expect Omnipotence from Data Scientists
Final Thoughts
Chapter 16 How to Feed and Care for Your Machine-Learning Experts
Define the Problem
Fake It Before You Make It
Create a Training Set
Pick the Features
Encode the Data
Split Into Training, Test, and Solution Sets
Describe the Problem
Respond to Questions
Integrate the Solutions
Conclusion
Chapter 17 Data Traceability
Why?
Personal Experience
Immutability: Borrowing an Idea from Functional Programming
An Example
Conclusion
Chapter 18 Social Media: Erasable Ink?
Social Media: Whose Data Is This Anyway?
Control
Commercial Resyndication
Expectations Around Communication and Expression
Technical Implications of New End User Expectations
What Does the Industry Do?
What Should End Users Do?
How Do We Work Together?
Chapter 19 Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
Framework Introduction: The Four Cs of Data Quality Analysis
Complete
Coherent
Correct
aCcountable
Conclusion
· · · · · · ()

下载地址

发布者:Amor颖

文件说明:zip / 解压密码:yiquhai.com

迅雷下载:您需要先后,才能查看

网盘下载:您需要先后,才能查看

关于内容:内容自于互联网,如果发现有违规内容请联系管理员删除!

作者: Amor颖

该用户很懒,还没有介绍自己。

19 条评论

发表评论

  1. 起点无限极 起点无限极说道:
    1#

    回转曲折,坎坷不平

  2. 叶宇昕 叶宇昕说道:
    2#

    可谓字字珠玑

  3. 黄桃tao罐头 黄桃tao罐头说道:
    3#

    一种宝贵的积累!

  4. 老衲特别冲动 老衲特别冲动说道:
    4#

    喜欢

  5. 显示更多