I switched themes

2017-05-15

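Why I switched

I was using the next theme before, which is also a very popular one. It isn't ugly; it's a minimalist, easy-to-use theme. But everyone gets tired of looking at the same thing eventually, and I'm no exception, so I switched. There is also a very practical reason: the next theme keeps the content too narrowly centered, leaving large blank margins on both sides, so even slightly long code snippets get cut off and the space is wasted.

What I switched to

Yes, it's yilia, designed and built by litten. Thanks to the author.

Its description on GitHub:
A simple and elegant theme for hexo.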

Working with cookies

2017-05-12

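Getting cookies from the server

// Read the cookies from the response headers
NSDictionary *headers = [((NSHTTPURLResponse *)resp) allHeaderFields];
NSLog(@"headers:%@",headers);
// cookiesWithResponseHeaderFields:forURL: returns an array of NSHTTPCookie objects
NSArray<NSHTTPCookie *> *cookies = [NSHTTPCookie cookiesWithResponseHeaderFields:headers forURL:[NSURL URLWithString:@"http://localhost/"]];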

for (NSHTTPCookie *cookie in cookies) {
    NSLog(@"cookie:%@",cookie);
    if ([[cookie name] isEqualToString:@"JSESSIONID"]) {
        NSLog(@"session id is %@",[cookie value]);
    }
}

Cocoa basics

2017-05-11

Example of using a category

#import "Roommate.h"
@interface Roommate (Roommate_Say)
-(void) introduceSelf;
//-(void)hello:(NSString *) str;
@end

#import "Roommate+Roommate_Say.h"
@implementation Roommate (Roommate_Say)
-(void) introduceSelf{   
    NSLog(@"my name is liuyanwei");
}
@end

Example of using a protocol

Studying the iOS-tips project on GitHub

2017-05-10

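Lately I've only been reading project source code, which feels a bit like forcing my way through, so I decided to go over the basics again: as I read, I'll look for the corresponding usage in my projects and start using whatever I haven't used before.
I picked this project purely by chance. Which one I chose doesn't matter; what matters is sticking with it until I've worked through the whole thing.
I'll type out the code myself and upload it to GitHub.

iOS-tips is a collection of problems commonly encountered in iOS development and their solutions, covering both iOS and Swift.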

Common #pragma directives

2017-05-10

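1. Marking sections of code

#pragma mark - [comment text]
Purpose: labels a section of code.
#pragma mark is a technique every iOS programmer should know. It splits the code into labeled sections, and good annotation is where good code starts.
Note from linking: in Xcode, hover for a moment over the last item of the navigation bar at the top of the code editor and the section hierarchy shows up clearly. Tested.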

"Python for Data Analysis" reading notes (8): Chapter 8, Plotting and Visualization

2017-04-20

A quick introduction to the matplotlib API

Figure and Subplot

"Python for Data Analysis" reading notes (7): Chapter 7, Data Wrangling: Clean, Transform, Merge, Reshape

2017-04-14

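Combining data sets

Data held in pandas objects can be combined in several built-in ways:

pandas.merge connects rows of different DataFrames based on one or more keys; it implements database-style join operations;
pandas.concat stacks multiple objects together along an axis;
the instance method combine_first splices overlapping data together, filling missing values in one object with values from another.

Database-style DataFrame merges

Merge and join operations combine data sets by linking rows using keys.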

df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'c'],
                 'data2': range(3)})
# print df1
#    data1 key
# 0      0   b
# 1      1   b
# 2      2   a
# 3      3   c
# 4      4   a
# 5      5   a
# 6      6   b

# print df2
#    data2 key
# 0      0   a
# 1      1   b
# 2      2   c

# print pd.merge(df1, df2)
#    data1 key  data2
# 0      0   b      1
# 1      1   b      1
# 2      6   b      1
# 3      2   a      0
# 4      4   a      0
# 5      5   a      0
# 6      3   c      2

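# By default the overlapping column names are used as keys; use the on parameter to name the join column explicitly
pd.merge(df1, df2, on='key')
# The left and right keys can also be given separately with left_on and right_on
# merge does an 'inner' join by default, so the keys in the result are the intersection
# The how parameter: 'inner' gives the intersection, 'outer' the union

# Handling overlapping column names
# merge's suffixes parameter gives the strings to append to overlapping column names in the left and right DataFrame objects

In an outer join, rows whose key appears on both sides carry data from both; rows whose key appears on only one side carry that side's data, and the missing values are filled with NaN.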
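As a small sketch of the how and suffixes parameters described above (Python 3 syntax; the frames left and right and their values are made up for illustration and are not from the book):

```python
import pandas as pd

# Two hypothetical frames that share the key column and a 'value' column.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'value': [20, 30, 40]})

# Inner join (the default): only keys present on both sides survive.
# suffixes disambiguates the overlapping 'value' column.
inner = pd.merge(left, right, on='key', suffixes=('_left', '_right'))

# Outer join: the union of keys; the side without a match gets NaN.
outer = pd.merge(left, right, on='key', how='outer',
                 suffixes=('_left', '_right'))

print(inner)
print(outer)
```

With these inputs the inner result keeps only the keys b and c, while the outer result has one row per key from either side.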

Merging on index

When the merge key of a DataFrame is in its index, pass left_index=True or right_index=True (or both) to say so.

left1 = DataFrame({'key': ['a', 'b', 'c', 'a', 'b', 'a'],
                   'value': range(6)})
right1 = DataFrame({'group_value': [3.5, 7]}, index=['a', 'b'])
# print pd.merge(left1, right1, left_on='key', right_index=True)
#     key  value  group_value
# 0   a      0          3.5
# 3   a      3          3.5
# 5   a      5          3.5
# 1   b      1          7.0
# 4   b      4          7.0

# For the union, use how='outer'; values missing on one side are filled with NaN
# print pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
#    key  value  group_value
# 0   a      0          3.5
# 3   a      3          3.5
# 5   a      5          3.5
# 1   b      1          7.0
# 4   b      4          7.0
# 2   c      2          NaN

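# Hierarchical indexes: skipping them for now, will study them when needed;
# with a hierarchical index you just give the merge keys as a list of columns, e.g. left_on=['a','b']

# DataFrame's join method merges by index; it is convenient for frames with the same or similar index and non-overlapping columns
# left.join([right, another])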
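A minimal sketch of join on index (Python 3 syntax; the frames left2, right2, and another are hypothetical, chosen with non-overlapping column names so no suffixes are needed):

```python
import pandas as pd

left2 = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
right2 = pd.DataFrame({'y': [10.0, 30.0]}, index=['a', 'c'])
another = pd.DataFrame({'z': [100, 200]}, index=['b', 'c'])

# join is a left join on the index by default:
# every row of left2 is kept, missing matches become NaN.
joined = left2.join(right2)

# A list of frames can be joined in one call.
joined_all = left2.join([right2, another])

print(joined)
print(joined_all)
```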

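For material like this, it's enough to get a rough idea and know it exists, so that it comes to mind later and I can look it up;
there's no need to go this deep on a first read. It's hard to follow, and not understanding it just spoils the mood.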

Concatenating along an axis

arr = np.arange(12).reshape((3, 4))
# print arr
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# print np.concatenate([arr,arr],axis=1)
# [[ 0  1  2  3  0  1  2  3]
#  [ 4  5  6  7  4  5  6  7]
#  [ 8  9 10 11  8  9 10 11]]

s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = Series([5, 6], index=['f', 'g'])
# print pd.concat([s1,s2,s3])
# a    0
# b    1
# c    2
# d    3
# e    4
# f    5
# g    6
# dtype: int64

# concat works along axis=0 by default and produces a new Series; with axis=1 (the columns) it produces a DataFrame
# print pd.concat([s1, s2, s3], axis=1)
#      0    1    2
# a  0.0  NaN  NaN
# b  1.0  NaN  NaN
# c  NaN  2.0  NaN
# d  NaN  3.0  NaN
# e  NaN  4.0  NaN
# f  NaN  NaN  5.0
# g  NaN  NaN  6.0

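Combining data with overlap

NumPy's where function can combine two data sets with overlapping parts; Series's combine_first method does the same thing and also aligns the data on index.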
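The difference between the two can be sketched as follows (Python 3 syntax; the Series a and b and their values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5], index=['d', 'c', 'b', 'a'])
b = pd.Series([0.0, 1.0, 2.0, 3.0], index=['d', 'c', 'b', 'a'])

# np.where works position by position: take b wherever a is missing.
# The result is a plain NumPy array without index labels.
patched_where = np.where(pd.isnull(a), b, a)

# combine_first patches the same holes but matches values by label,
# so it still works when the other Series is in a different order.
b_shuffled = b[['a', 'b', 'c', 'd']]
patched = a.combine_first(b_shuffled)

print(patched_where)
print(patched)
```

np.where would give wrong answers if it were handed b_shuffled, because it ignores labels; that is the practical reason to prefer combine_first on labeled data.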

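Reshaping and pivoting

The basic operations for rearranging tabular data are called reshape or pivot operations.

Reshaping with hierarchical indexing

stack: rotates the columns into rows;
unstack: rotates the rows into columns.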
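A short sketch of the round trip (Python 3 syntax; the frame and its labels are assumptions made for the example):

```python
import pandas as pd

frame = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5]],
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'))

# stack: the column labels become the inner level of a row MultiIndex,
# turning the 2x3 frame into a Series of 6 values.
stacked = frame.stack()

# unstack: the inner row level goes back to being the columns.
restored = stacked.unstack()

print(stacked)
print(restored)
```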

Data transformation

Removing duplicates

data = DataFrame({'k1':['one']*3 + ['two']*4,
                  'k2':[1,1,2,3,4,4,5]})
# print data
#     k1  k2
# 0  one   1
# 1  one   1
# 2  one   2
# 3  two   3
# 4  two   4
# 5  two   4
# 6  two   5

## Detect duplicates
# print data.duplicated()
# 0    False
# 1     True
# 2    False
# 3    False
# 4    False
# 5     True
# 6    False
# dtype: bool

## Remove duplicates
# print data.drop_duplicates()
#     k1  k2
# 0  one   1
# 2  one   2
# 3  two   3
# 4  two   4
# 6  two   5

## Specify which columns to check for duplicates; e.g. with one more column of values, filter duplicates by k1 only
data['v1'] = range(7)
# print data.drop_duplicates(['k1'])
#     k1  k2  v1
# 0  one   1   0
# 3  two   3   3

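Knowing that this feature exists and what it can do is enough; when I need it later, I just have to remember it.

Transforming data using a function or mapping

Series's map method accepts a function or a dict-like object containing a mapping.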

## Meats
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'coned', 'Bacon', 'pastrami', 'honey', 'nova'],
                  'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
# print data
#           food  ounces
# 0        bacon     4.0
# 1  pulled pork     3.0
# 2        bacon    12.0
# 3     Pastrami     6.0
# 4        coned     7.5
# 5        Bacon     8.0
# 6     pastrami     3.0
# 7        honey     5.0
# 8         nova     6.0

## Add the animal each meat comes from; the keys must be lowercase
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'coned': 'cow',
    'honey': 'pig',
    'nova': 'salmon'
}
## Lowercase the food names first so that 'Bacon' and 'Pastrami' match the keys
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
# print data
#           food  ounces  animal
# 0        bacon     4.0     pig
# 1  pulled pork     3.0     pig
# 2        bacon    12.0     pig
# 3     Pastrami     6.0     cow
# 4        coned     7.5     cow
# 5        Bacon     8.0     pig
# 6     pastrami     3.0     cow
# 7        honey     5.0     pig
# 8         nova     6.0  salmon

## map is a convenient way to perform element-wise transformations and other data cleaning work.

Replacing values

replace offers a simple and flexible way to substitute values.

data = Series([1.,-999.,2.,-999.,-1000.,3.])
# print data
# 0       1.0
# 1    -999.0
# 2       2.0
# 3    -999.0
# 4   -1000.0
# 5       3.0
# dtype: float64

## -999 may be a sentinel for missing data; replace it with pandas's NA
# print data.replace(-999, np.nan)
# 0       1.0
# 1       NaN
# 2       2.0
# 3       NaN
# 4   -1000.0
# 5       3.0
# dtype: float64

## Replace several values with one value at once
# print data.replace([-999,-1000], np.nan)
# 0    1.0
# 1    NaN
# 2    2.0
# 3    NaN
# 4    NaN
# 5    3.0
# dtype: float64

## Replace each value with a different one
# print data.replace([-999,-1000], [np.nan,0])
# 0    1.0
# 1    NaN
# 2    2.0
# 3    NaN
# 4    0.0
# 5    3.0
# dtype: float64

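String manipulation

String object methods

1. Joining
join concatenates a list or tuple of strings

# print pieces
# ['a','b','c']
# print '::'.join(pieces)
# 'a::b::c'

2. Locating a substring with the in keyword

# print 'guido' in val
# True
## index and find can also locate a substring
val.index(',')
val.find(':')

3. count returns the number of occurrences of a substring

val.count(',')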
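The notes never define val, so here is a self-contained sketch with an assumed sample string (Python 3 syntax). It also shows the one practical difference between find and index: find returns -1 when the substring is absent, while index raises ValueError.

```python
# Assumed sample input; any comma-separated string would do.
val = 'a,b,  guido'

# Split on commas and trim the surrounding whitespace of each piece.
pieces = [x.strip() for x in val.split(',')]
joined = '::'.join(pieces)

has_guido = 'guido' in val      # substring test
first_comma = val.index(',')    # position of the first comma
found = val.find(':')           # -1 because ':' does not occur in val

try:
    val.index(':')
    index_raised = False
except ValueError:
    index_raised = True         # index raises instead of returning -1

comma_count = val.count(',')
print(pieces, joined, has_guido, first_comma, found, comma_count)
```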

Regular expressions

The re module functions fall into three categories: pattern matching, substitution, and splitting.

import re

# Whitespace characters (tabs, spaces, newlines, etc.)
# text = "foo bar\t baz \tqux"
# print re.split('\s+', text)
# ['foo', 'bar', 'baz', 'qux']

# re.compile compiles the regex into a reusable regex object
regex = re.compile('\s+')
# print regex.split(text)
# ['foo', 'bar', 'baz', 'qux']
# print regex.findall(text)
# [' ', '\t ', ' \t']
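# Creating the regex object with re.compile is recommended when the pattern is reused; it saves CPU cycles

# re.IGNORECASE makes the regex case-insensitive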

text1 = """Dave lin@qq.com
lin king@163.com
king wu@sina.com
ryle ryal@gmail.com
"""

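# Note: the quantifier needs an ASCII comma, {2,4}; with a full-width comma the pattern matches nothing
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
# print regex.findall(text1)
# ['lin@qq.com', 'king@163.com', 'wu@sina.com', 'ryal@gmail.com']

# search returns a match object for the first address only
m = regex.search(text1)
# print text1[m.start():m.end()]
# lin@qq.com

# regex.sub replaces every match with the given string
# print regex.sub('REDACTED', text1)

# To split an email address into user name, domain name, and domain suffix, wrap each part in ()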
pattern1 = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex1 = re.compile(pattern1, flags=re.IGNORECASE)
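What the grouped pattern buys you can be sketched like this (Python 3 syntax; the sample addresses and the variable names are made up for the example):

```python
import re

# Hypothetical sample text with one address per line.
text = """Dave dave@google.com
Steve steve@gmail.com
"""

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# groups() returns the three captured parts of a single match.
m = regex.match('wesm@bright.net')
parts = m.groups()

# With groups in the pattern, findall returns a list of tuples.
all_parts = regex.findall(text)

# sub can refer back to the groups with \1, \2, \3.
rewritten = regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)

print(parts)
print(all_parts)
print(rewritten)
```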

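Vectorized string functions in pandas

Cleaning up messy data involves a lot of string normalization.
With data.map, any string or regular expression method can be applied to each value (by passing a lambda or another function), but it fails as soon as an NA value is present. Series's str attribute gives access to the same methods while skipping NA values. For example, str.contains checks whether each value contains 'gmail'.

data.str.contains('gmail')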
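A self-contained sketch of the NA behaviour described above (Python 3 syntax; the emails Series is an assumed example, and na=False is passed explicitly so that missing entries count as "no match"):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing address.
emails = pd.Series({'Dave': 'dave@google.com',
                    'Steve': 'steve@gmail.com',
                    'Wes': np.nan})

# A plain Python function applied with map breaks on the missing value.
try:
    emails.map(lambda s: 'gmail' in s)
    map_failed = False
except TypeError:
    map_failed = True

# The str accessor skips missing values instead of raising.
is_gmail = emails.str.contains('gmail', na=False)
upper = emails.str.upper()

print(map_failed)
print(is_gmail)
print(upper)
```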

Example: the USDA food database

import json

db = json.load(open('../pydata-book-master/ch07/foods-2011-10-03.json'))

# print len(db)
# 6636

# print db[0].keys()
# [u'portions', u'description', u'tags', u'nutrients', u'group', u'id', u'manufacturer']

# print db[0]['nutrients'][0]
# {u'units': u'g', u'group': u'Composition', u'description': u'Protein', u'value': 25.18}

# nutrients = DataFrame(db[0]['nutrients'])
# print nutrients[:7]
#                    description        group units    value
# 0                      Protein  Composition     g    25.18
# 1            Total lipid (fat)  Composition     g    29.20
# 2  Carbohydrate, by difference  Composition     g     3.06
# 3                          Ash        Other     g     3.28
# 4                       Energy       Energy  kcal   376.00
# 5                        Water  Composition     g    39.28
# 6                       Energy       Energy    kJ  1573.00

# DataFrame can extract just a subset of the fields, such as the food's name, group, id, and manufacturer
info_key = ['description', 'group', 'id', 'manufacturer']
info = DataFrame(db, columns=info_key)
# print info[:5]
#                           description                   group    id  manufacturer
# 0                     Cheese, caraway  Dairy and Egg Products  1008
# 1                     Cheese, cheddar  Dairy and Egg Products  1009
# 2                        Cheese, edam  Dairy and Egg Products  1018
# 3                        Cheese, feta  Dairy and Egg Products  1019
# 4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028

# Look at the distribution of food groups
# print pd.value_counts(info.group)[:10]
# Vegetables and Vegetable Products    812
# Beef Products                        618
# Baked Products                       496
# Breakfast Cereals                    403
# Legumes and Legume Products          365
# Fast Foods                           365
# Lamb, Veal, and Game Products        345
# Sweets                               341
# Fruits and Fruit Juices              328
# Pork Products                        328
# Name: group, dtype: int64

nutrients = []
for rec in db:
    fnuts = DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)
# print  nutrients
# Drop duplicates
# print nutrients.duplicated().sum()
# 14179
# nutrients = nutrients.drop_duplicates()
# print nutrients

# Rename the columns explicitly
col_mapping = {'description': 'food',
               'group': 'fgroup'}

# info = info.rename(columns=col_mapping, copy=False)
# print info

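Duplicated, scrambled, and flickering images when a ListView nests a GridView showing multiple images

2017-03-29

The scenario

To let users review several products in one order at the same time, the list shows five rating levels (very satisfied, satisfied, average, poor, dissatisfied). It is a single ListView whose items each contain a GridView holding up to five images. When the ListView is scrolled up and down, the images inside the items are duplicated, scrambled, or flicker.

My mistake

First, my mistake. After hitting the problem I Googled it and read a few references, all of which said to call setTag() in the outer ListView's adapter, and then call setTag() in the inner GridView's adapter, both on the item view as a whole and on the ImageView that needs to be marked for reuse. I did that and it still went wrong. Later a colleague pointed out the real cause: when setting the gridAdapter in the outer adapter, I only called gridView.setVisibility(View.GONE) when there was no data, and forgot to call gridView.setVisibility(View.VISIBLE) when there was data, so a recycled item no longer had a visible GridView to reuse.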

"Python for Data Analysis" reading notes (6): Chapter 6, Data Loading, Storage, and File Formats

2017-03-25

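Input and output fall into a few categories:

reading text files
other, more efficient on-disk storage formats
loading data from databases
interacting with network resources through Web APIs

Reading data in text format

pandas provides functions that read tabular data into a DataFrame object; read_csv and read_table are the two most used:

Function	Description
read_csv	Load delimited data from a file, URL, or file-like object; the default delimiter is a comma
read_table	Load delimited data from a file, URL, or file-like object; the default delimiter is a tab ("\t")

Type inference is one of the most important features of these functions: you don't have to specify which columns are numeric, integer, boolean, or string.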
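Type inference can be observed without any file on disk. The sketch below (Python 3 syntax) feeds read_csv an in-memory string; the CSV content and variable names are assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical CSV text: column c has a decimal and a missing value.
csv_text = "a,b,c,d,message\n1,2,3.5,4,hello\n5,6,,8,world\n"

# read_csv accepts any file-like object, so StringIO stands in for a file.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)   # a, b, d are inferred as integers, c as float

# Inference can be overridden per column with the dtype parameter.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'a': str})
print(as_text['a'].tolist())
```

The empty field in column c becomes NaN, which is why that column is inferred as floating point.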

$ cat ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
# The fields above are separated by commas

import pandas as pd
import numpy as np
from pandas import Series, DataFrame

df = pd.read_csv('../pydata-book-master/ch06/ex1.csv')
# print df
#  a   b   c   d message
# 0  1   2   3   4   hello
# 1  5   6   7   8   world
# 2  9  10  11  12     foo
pt = pd.read_table('../pydata-book-master/ch06/ex1.csv', sep=',')
# print pt
# out: same as above

Files without a header row

$ cat ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

# Let pandas assign default column names

# pc = pd.read_csv('../pydata-book-master/ch06/ex2.csv', header=None)
# print pc
#    0   1   2   3      4
# 0  1   2   3   4  hello
# 1  5   6   7   8  world
# 2  9  10  11  12    foo
# Specify the column names yourself
# pc2 = pd.read_csv('../pydata-book-master/ch06/ex2.csv', names=['a','b','c','d','msg'])
# print pc2
#    a   b   c   d    msg
# 0  1   2   3   4  hello
# 1  5   6   7   8  world
# 2  9  10  11  12    foo

Specifying the index

# Use the msg column as the index
names = ['a','b','c','d','msg']
pc3 = pd.read_csv('../pydata-book-master/ch06/ex2.csv', names=names, index_col='msg')
# print pc3
#        a   b   c   d
# msg
# hello  1   2   3   4
# world  5   6   7   8
# foo    9  10  11  12
# Hierarchical index
parsed = pd.read_csv('../pydata-book-master/ch06/csv_mindex.csv', index_col=['key1', 'key2'])
# print parsed
#            value1  value2
# key1 key2
# one  a          1       2
#      b          3       4
#      c          5       6
#      d          7       8
# two  a          9      10
#      b         11      12
#      c         13      14
#      d         15      16

# No fixed delimiter: a regular expression is needed
# print list(open('../pydata-book-master/ch06/ex3.csv'))
# ['            A         B         C\n',
#  'aaa -0.264438 -1.026059 -0.619500\n',
#  'bbb  0.927272  0.302904 -0.032399\n',
#  'ccc -0.264273 -0.386314 -0.217601\n',
#  'ddd -0.871858 -0.348382  1.100491']
# The fields in this file are separated by a variable amount of whitespace, which the regex `\s+` expresses:
result = pd.read_table('../pydata-book-master/ch06/ex3.csv', sep='\s+')
# print result
#             A         B         C
# aaa -0.264438 -1.026059 -0.619500
# bbb  0.927272  0.302904 -0.032399
# ccc -0.264273 -0.386314 -0.217601
# ddd -0.871858 -0.348382  1.100491

Skipping abnormal rows (with the skiprows parameter)

$ cat ch06/ex4.csv 
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Rows 1, 3, and 4 can be treated as abnormal data (indexes 0, 2, and 3 when counting from zero)

print pd.read_csv('../pydata-book-master/ch06/ex4.csv', skiprows=[0,2,3])
#    a   b   c   d message
# 0  1   2   3   4   hello
# 1  5   6   7   8   world
# 2  9  10  11  12     foo

Reading text files in pieces

For a large file, you may want to read only a small part of it or iterate over it in chunks

result = pd.read_csv('../pydata-book-master/ch06/ex6.csv', nrows=5)
# print result
#         one       two     three      four key
# 0  0.467976 -0.038649 -0.295344 -1.824726   L
# 1 -0.358893  1.404453  0.704965 -0.200638   B
# 2 -0.501840  0.659254 -0.421691 -0.057688   G
# 3  0.204886  1.074134  1.388361 -0.982404   R
# 4  0.354628 -0.133116  0.283763 -0.837063   Q

Reading in chunks uses chunksize (a number of rows)

# chunker = pd.read_csv('../pydata-book-master/ch06/ex6.csv', chunksize=100)
# print chunker
# <pandas.io.parsers.TextFileReader object at 0x10f6175d0>
# The TextFileReader object lets you iterate over the file chunk by chunk
chunker = pd.read_csv('../pydata-book-master/ch06/ex6.csv', chunksize=1000)
tot = Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
print tot[:10]
# E    368.0
# X    364.0
# L    346.0
# O    343.0
# Q    340.0
# M    338.0
# J    337.0
# F    335.0
# K    334.0
# H    330.0
# dtype: float64
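The same chunked counting can be tried without the book's ex6.csv. This is a sketch under assumptions (Python 3 syntax): the CSV text is generated in memory, and a collections.Counter is swapped in for the Series.add accumulation used above.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical data: 10 rows whose key cycles through A, B, C.
csv_text = "key,value\n" + "".join(
    "%s,%d\n" % ("ABC"[i % 3], i) for i in range(10))

counts = Counter()
n_chunks = 0
# chunksize=4 yields DataFrames of at most 4 rows each.
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    n_chunks += 1
    # Add this chunk's per-key counts to the running totals.
    counts.update(piece['key'].value_counts().to_dict())

print(n_chunks, dict(counts))
```

Only one chunk is held in memory at a time, which is the point of chunksize for files that do not fit in RAM.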

Writing data out to text format

data = pd.read_csv('../pydata-book-master/ch06/ex5.csv')
# print data
#   something  a   b     c   d message
# 0       one  1   2   3.0   4     NaN
# 1       two  5   6   NaN   8   world
# 2     three  9  10  11.0  12     foo

pandas.read_csv produces a DataFrame object;

DataFrame's to_csv method writes the data out to a comma-separated file

data.to_csv('./ch06_out.csv')
# $ cat ./ch06_out.csv
# ,something,a,b,c,d,message
# 0,one,1,2,3.0,4,
# 1,two,5,6,,8,world
# 2,three,9,10,11.0,12,foo

# A different delimiter can be given, e.g. |; sys.stdout is used here only to print the result for display.
import sys
data.to_csv(sys.stdout, sep='|')
# |something|a|b|c|d|message
# 0|one|1|2|3.0|4|
# 1|two|5|6||8|world
# 2|three|9|10|11.0|12|foo

Replacing missing values with another sentinel

data.to_csv(sys.stdout, na_rep='NULL')
# ,something,a,b,c,d,message
# 0,one,1,2,3.0,4,NULL
# 1,two,5,6,NULL,8,world
# 2,three,9,10,11.0,12,foo

Writing only certain columns, in a specified order

data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
# a,b,c
# 1,2,3.0
# 5,6,
# 9,10,11.0

Series also has a to_csv() method

datas = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7), index=datas)
ts.to_csv('./ch06tsseries.csv')
# 2000-01-01,0
# 2000-01-02,1
# 2000-01-03,2
# 2000-01-04,3
# 2000-01-05,4
# 2000-01-06,5
# 2000-01-07,6

A simpler way to read it back: from_csv

# print Series.from_csv('./ch06tsseries.csv', parse_dates=True)
# 2000-01-01    0
# 2000-01-02    1
# 2000-01-03    2
# 2000-01-04    3
# 2000-01-05    4
# 2000-01-06    5
# 2000-01-07    6
# dtype: int64

Manually working with delimited formats

Scenario: pandas.read_table fails to load tabular data because of malformed lines

# $ cat ch06/ex7.csv
# "a","b","c"
# "1","2","3"
# "1","2","3","4"

import csv

lines = list(csv.reader(open('../pydata-book-master/ch06/ex7.csv')))
headers, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(headers, zip(*values))}
# print data_dict
# {'a': ('1', '1'), 'c': ('3', '3'), 'b': ('2', '2')}

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_ALL

# delimiter sets the field separator

f = open('../pydata-book-master/ch06/ex7.csv')
reader = csv.reader(f, dialect=my_dialect)
# print reader
# <_csv.reader object at 0x10a5bb440>
# The attributes of the class above can also be passed individually to csv.reader
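Passing the dialect options as keyword arguments can be sketched as follows (Python 3 syntax; the pipe-delimited sample text is an assumption so the example runs without the book's files):

```python
import csv
import io

# Hypothetical pipe-delimited data; the last field contains the
# delimiter itself, so it is quoted.
raw = 'a|b|c\n1|2|3\n4|5|"6|7"\n'

# No Dialect subclass needed: delimiter and quotechar are given directly.
reader = csv.reader(io.StringIO(raw), delimiter='|', quotechar='"')
rows = list(reader)

header, values = rows[0], rows[1:]
print(header)
print(values)
```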

Writing delimited files manually with csv.writer

with open('mydata.csv', 'w') as fw:
    writer = csv.writer(fw, my_dialect)
    writer.writerow(("one", "two", "three"))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
# out: mydata.csv
# "one";"two";"three"
# "1";"2";"3"
# "4";"5";"6"
# "7";"8";"9"

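tip: for some things, reading the source code is still the most direct way

JSON data

JSON has become one of the standard formats for sending data via HTTP requests between web browsers and other applications.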

obj1 = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet":null,
"siblings":[{"name":"Scott", "age":25, "pet":"Zuko"}, {"name":"Katie", "age":22, "pet":"Cisco"}]
}
"""
import json
# json.loads converts a JSON string to Python form
result = json.loads(obj1)
# print result
# {u'pet': None,
# u'siblings': [{u'pet': u'Zuko', u'age': 25, u'name': u'Scott'},
#               {u'pet': u'Cisco', u'age': 22, u'name': u'Katie'}],
# u'name': u'Wes',
# u'places_lived': [u'United States', u'Spain', u'Germany']}
# Conversely, json.dumps converts a Python object back to JSON
asjson = json.dumps(result)
# print asjson
# {"pet": null, "siblings": [{"pet": "Zuko", "age": 25, "name": "Scott"},
#                            {"pet": "Cisco", "age": 22, "name": "Katie"}],
#  "name": "Wes", "places_lived": ["United States", "Spain", "Germany"]}
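A compact Python 3 sketch of the same round trip, showing how the parsed result is just nested dicts and lists (the shortened JSON string and the variable names are assumptions made for the example):

```python
import json

# A trimmed-down version of the object above.
obj = ('{"name": "Wes", "pet": null, '
       '"siblings": [{"name": "Scott", "age": 25}, '
       '{"name": "Katie", "age": 22}]}')

result = json.loads(obj)      # JSON null becomes Python None

# The parsed structure can be traversed like any dict or list.
sibling_names = [s['name'] for s in result['siblings']]

# dumps followed by loads gives back an equal structure.
round_trip = json.loads(json.dumps(result))

print(result['pet'], sibling_names)
```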

XML and HTML: Web scraping

# http://finance.yahoo.com/quote/AAPL/options?ltr=1
from lxml.html import parse
from urllib2 import urlopen
# The original URL returns 404, so I picked an Alibaba page on Yahoo at random
# parsed = parse(urlopen('http://finance.yahoo.com/quote/BABA/options?p=BABA'))
# doc = parsed.getroot()
# With the object above, specific tags can be retrieved
# e.g. get all the links: <a href="#">word</a>
# links = doc.findall('.//a')
# print links[:5]
# [<Element a at 0x1109ace10>,
# <Element a at 0x1109ace68>,
# <Element a at 0x1109acec0>,
# <Element a at 0x1109acf18>,
# <Element a at 0x1109acf70>]
# lnk = links[7]
# print lnk
# <Element a at 0x1040d3628>
# Extract the href from the object above
# print len(links)
# 50
# print lnk.get('href')
# https://www.yahoo.com/celebrity/
# print lnk.text_content()
# Celebrity

# Build a list comprehension to get all the URLs
# urls = [lnk1.get('href') for lnk1 in doc.findall('.//a')]
# print urls[-10:]
# ['//finance.yahoo.com/broker-comparison?bypass=true',
# 'https://help.yahoo.com/kb/index?page=content&y=PROD_MAIL_ML&locale=en_US&id=SLN2310&actp=productlink',
# 'http://help.yahoo.com/l/us/yahoo/finance/',
# 'https://yahoo.uservoice.com/forums/382977',
# 'http://info.yahoo.com/privacy/us/yahoo/',
# 'http://info.yahoo.com/relevantads/',
# 'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html',
# 'http://twitter.com/YahooFinance',
# 'http://facebook.com/yahoofinance',
# 'http://yahoofinance.tumblr.com']

# Finding the right table in the document takes trial and error
# parsed1 = parse(urlopen('http://finance.yahoo.com/quote/BABA/options?p=BABA'))
# doc1 = parsed1.getroot()
# tables = doc1.findall('.//table')
# calls = tables[0]
# puts = tables[0]
# rows = calls.findall('.//tr')
# print rows[0].text_content()
# Search
# def unpack(row, kind='td'):
#     elts = row.findall('.//%s' % kind)
#     return [val.text_content() for val in elts]

# print unpack(rows[0], kind='th')
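# I'm not familiar with how to get an XPath in Safari and need to look into it later
# The original page has changed, so the book's code no longer runs; it has to be understood properly and rewritten
# Just typing along with the book achieves very little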

Parsing XML with lxml.objectify

from lxml import objectify
# path = '../pydata-book-master/ch06/mta_perf/Performance_MNR.xml'
# parsed2 = objectify.parse(open(path))
# root = parsed2.getroot()
# print root

# Parsing an HTML tag
from StringIO import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
# print root.get('href')
# http://www.google.com
# print root.text
# Google

Binary data formats

# Save to disk
frame = pd.read_csv('../pydata-book-master/ch06/ex1.csv')
# print  frame
#    a   b   c   d message
# 0  1   2   3   4   hello
# 1  5   6   7   8   world
# 2  9  10  11  12     foo
frame.to_pickle('../pydata-book-master/ch06/frame_pickle')
# Read it back from disk into a Python object
print pd.read_pickle('../pydata-book-master/ch06/frame_pickle')

Reading Microsoft Excel files (the pandas ExcelFile class)

xls_file = pd.ExcelFile('../pydata-book-master/ch06/ex1.xlsx')
# Read the data of a worksheet into a DataFrame with parse
table = xls_file.parse('Sheet1')
# print table
#    a   b   c   d message
# 0  1   2   3   4   hello
# 1  5   6   7   8   world
# 2  9  10  11  12     foo

Interacting with HTML and Web APIs

import requests
url = 'https://www.google.com.hk/search?safe=strict&site=&source=hp&q=test&btnK=Google+搜索'
resp = requests.get(url)
# print resp
# <Response [200]>
# Web APIs usually return a JSON string
import json
data = json.loads(resp.text)
# print data.keys()
# ValueError: No JSON object could be decoded
# The Twitter link from the book no longer opens

# Pick out the tweet fields of interest, then pass the results list to DataFrame
tweet_field = ['created_at', 'from_user', 'id', 'text']
tweets = DataFrame(data['results'], columns=tweet_field)
# print tweets
# Each row of the DataFrame now holds the data from one tweet
tweets.iloc[7]

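Interacting with databases

# Criteria for choosing a database: performance, data integrity, and the application's scalability needs
# Use Python's built-in sqlite3 driver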
import sqlite3
query = """
CREATE TABLE test (a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
# Insert a few rows
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Wangjiang', 'Maoan', 2.5, 5),
        ('Hefei', 'Gaoxing', 5.5, 9)]
stmt = "INSERT INTO test VALUES(?,?,?,?)"
con.executemany(stmt, data)
con.commit()
# Select data from the table
cursor = con.execute('select * from test')
rows = cursor.fetchall()
# print rows
# [(u'Atlanta', u'Georgia', 1.25, 6),
#  (u'Wangjiang', u'Maoan', 2.5, 5),
#  (u'Hefei', u'Gaoxing', 5.5, 9)]
# The column names are in the cursor's description attribute
# print cursor.description
# (('a', None, None, None, None, None, None),
# ('b', None, None, None, None, None, None),
# ('c', None, None, None, None, None, None),
# ('d', None, None, None, None, None, None))
# Pass the rows to DataFrame
# print DataFrame(rows, columns=zip(*cursor.description)[0])
#            a        b     c  d
# 0    Atlanta  Georgia  1.25  6
# 1  Wangjiang    Maoan  2.50  5
# 2      Hefei  Gaoxing  5.50  9

# pandas has a read_frame function that simplifies this process
import pandas.io.sql as sql
# The book's read_frame no longer works; read_sql is used instead
# print sql.read_sql('select * from test', con)
#            a        b     c  d
# 0    Atlanta  Georgia  1.25  6
# 1  Wangjiang    Maoan  2.50  5
# 2      Hefei  Gaoxing  5.50  9
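The expression zip(*cursor.description)[0] above relies on Python 2, where zip returns a list. A Python 3 sketch of the same idea using only the standard library (the two sample rows are reused from above; the variable names are assumptions):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), '
            'c REAL, d INTEGER)')
con.executemany('INSERT INTO test VALUES (?, ?, ?, ?)',
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Hefei', 'Gaoxing', 5.5, 9)])
con.commit()

cursor = con.execute('SELECT * FROM test')

# Each entry of cursor.description is a 7-tuple; the name comes first.
columns = [col[0] for col in cursor.description]

# Pair every row with the column names.
records = [dict(zip(columns, row)) for row in cursor.fetchall()]
con.close()

print(columns)
print(records)
```

The columns list can be passed as the columns argument of DataFrame in place of the Python 2 zip expression.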

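Storing and loading data in MongoDB

# MongoDB needs to be installed on the machine;
# connect on the default port using the pymongo driver
# Documents stored in MongoDB are organized into collections inside databases (think of them as tables)
# I don't have it installed, so I'm just noting how it would be written, for reference
import pymongo
# MongoClient replaces the Connection class used in the book
con = pymongo.MongoClient('localhost', port=27017)
# Each running instance of the MongoDB server can have multiple databases, and each database can have multiple collections.
# Access the tweets collection obtained earlier
tweets = con.db.tweets
# Load the data and store it in the collection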
import requests, json

url = 'https://www.google.com.hk/search?safe=strict&site=&source=hp&q=test&btnK=Google+搜索'
data = json.loads(requests.get(url).text)
for tweet in data['results']:
    tweets.save(tweet)

# Fetch from the collection
cursor = tweets.find({'from_user': 'linking'})
# Convert to a DataFrame
tweet_fields = ['created_at', 'from_user', 'id', 'text']
result = DataFrame(list(cursor), columns=tweet_fields)

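Having gone through Chapter 6, my impression is that the tools are powerful and can do a lot of data analysis work. My knowledge of data structures is still lacking, and I still can't properly understand some of the more complex material; I need to strengthen that, and reading it several more times is one way.

As for time: I get off work rather late, so I only get to read at around 9 pm, which is late, and I'm too tired by then. Maybe I can leave work a bit earlier from now on; cooking and dinner happen too late. I'm also seriously short on exercise and have put on some weight; my knee still hasn't recovered and needs targeted training.

A note: the blog's comment plugin, the Duoshuo social comment system, has shut down; I need to either remove Duoshuo or switch to something else.

………………………………………………2017-03-25 22:57 Sat, Wanshui Apartment, Hefei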

"Python for Data Analysis" reading notes: Chapter 5, Getting Started with pandas

2017-03-19

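Introduction to pandas data structures

Series

One of Series's most important features: in arithmetic operations it automatically aligns data with different indexes by label (matching keys).

Both the Series object itself and its index have a name attribute.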
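Both points can be sketched in a few lines (Python 3 syntax; the Series s1 and s2, their values, and the names 'population' and 'state' are made up for illustration):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='population')
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition matches values by index label, not by position.
# Labels present on only one side ('a' and 'd') produce NaN.
total = s1 + s2

# The Series and its index each carry their own name.
s1.index.name = 'state'

print(total)
print(s1.name, s1.index.name)
```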
