
Python For Web Scraping - Week 1

Andrew Hall October 6, 2010

1 Environment

Two easy ways to execute Python code:

1. Open the Terminal (in OS X) or the command prompt (in Windows), and type python. This will bring you into the interactive Python environment; you can type in commands and press enter to execute them.
2. Put your code in a text file, save it with the extension .py, and execute it by typing python myfile.py in the Terminal.

2 Hello, World

print "hello, world"

Or...

string = "hello, world"
print string

Or...

string = "hello, world"
print string[0:5] + string[5:]

Or...

string = "hello, world"
for char in string:
    print char

Andrew Hall, Department of Government, Harvard University

3 Variables

Like most (all?) programming languages, Python lets (requires?) you store values as variables. In Python, unlike in many other languages you might have seen, you do not have to tell it what kind of variable you are creating:
>>> num = 9
>>> type(num)
<type 'int'>
>>> num
9
>>> num + 2
11
>>> num + "pwn"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Variables can refer to each other or themselves, and this is important for writing programs. For example, you may want to set up a counter that keeps track of how many times you have carried out an operation (e.g. how many items you have added to a list).
>>> count = 0
>>> count = count + 1
>>> count
1
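As a small sketch of the counter idea, the snippet below adds items to a list and bumps a counter each time. The names items and count and the sample values are made up for illustration, and print is written with parentheses so that the snippet also runs under Python 3 (the handout's own sessions use Python 2).

```python
# Hypothetical example: count how many items get added to a list.
items = []
count = 0

for value in ["a", "b", "c"]:
    items.append(value)   # carry out the operation
    count = count + 1     # the variable refers to itself to move up by one

print(count)
```

When the loop finishes, count matches len(items), which is a quick way to check that the counter was updated on every pass.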

It's important to keep track of what type of variable Python has created for you, though. In Python 2, dividing one integer by another throws away the remainder; convert the values to floats to keep it.

>>> numerator = 3
>>> denominator = 2
>>> numerator/denominator
1
>>> numerator = float(3)
>>> denominator = float(2)
>>> numerator/denominator
1.5
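The division result shown above is specific to Python 2. As a hedged sketch, assuming Python 3 (not the version used in this handout), the same calculation behaves differently: / always gives a float, and // is the operator that truncates. The variable names true_quotient and floor_quotient are only for illustration.

```python
# Assumes Python 3; the handout's own sessions use Python 2.
numerator = 3
denominator = 2

true_quotient = numerator / denominator     # float division in Python 3
floor_quotient = numerator // denominator   # floor division, drops the remainder

print(true_quotient)
print(floor_quotient)
```

Either way, the lesson is the same: check which kind of number you have before you divide.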

Please send corrections to hall@fas.harvard.edu

4 Strings

Working with strings is essential for web-scraping. This is arguably the most important concept to understand. A string is a collection of characters, like:

1. "string"
2. "a string"
3.
4. "12345"

As we saw before, we can save a string to a variable.

>>> string = "length"
>>> type(string)
<type 'str'>
>>> len(string)
6

By and large, the data you pull off the web will be formatted as strings. So you want to know how to manipulate, analyze, and store them.

4.1 Slicing Strings

>>> string = "hello, world"
>>> print string[0]
h
>>> print string[11]
d
>>> print string[12]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> print string[-1]
d

A common challenge with web-scraping is that you get a string containing a date and value you want, like "Jul 4 2009, 20".

1. How do you get the month?

>>> line = "Jul 4 2009, 20"
>>> line[0:4]
'Jul '

Note that line[0:4] includes the trailing space; line[0:3] gives 'Jul' on its own.

2. How do you get the day?

>>> line[4]
'4'

3. How do you get the year?

>>> line[6:10]
'2009'

4. How do you get the value?

>>> line[-2:]
'20'
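Fixed positions like these stop working as soon as the day has two digits (for example "Jul 14 2009, 20"). One possible alternative, sketched here as an assumption rather than as the handout's method, is to cut the line apart with the string method split. The function name parse_line is hypothetical, and the sketch assumes every line has exactly the form "month day year, value".

```python
def parse_line(line):
    """Split a line like 'Jul 4 2009, 20' into (month, day, year, value)."""
    # Everything before the comma is the date, everything after is the value.
    date_part, value_part = line.split(",")
    # With no argument, split() cuts on runs of whitespace.
    month, day, year = date_part.split()
    return month, int(day), int(year), int(value_part.strip())

print(parse_line("Jul 4 2009, 20"))
print(parse_line("Jul 14 2009, 20"))
```

Because the pieces are found by their separators instead of by position, one-digit and two-digit days are handled the same way.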

Note, finally, that strings are immutable, meaning that you can't modify one you've created:

>>> string = "test"
>>> string[2] = "p"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
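Since a string cannot be changed in place, the usual workaround is to build a new string out of slices of the old one. A minimal sketch (the variable name patched is made up for illustration; print uses parentheses so it also runs under Python 3):

```python
string = "test"

# Take everything before index 2, add the new character, then everything after index 2.
patched = string[:2] + "p" + string[3:]

print(patched)
print(string)   # the original string is untouched
```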

4.2 String Methods

Python provides a whole suite of really helpful functions for working with strings. You should check out the Python documentation online to find all of them. I've picked a couple to show here as a preview.

4.2.1 Convert to all upper or lower case

>>> string = "This is a String"
>>> string.upper()
'THIS IS A STRING'
>>> string.lower()
'this is a string'

This is particularly useful if you're trying to match values; say for example you have two different datasets with country names. In one data set they might say "Afghanistan" and in the other they might have "AFGHANISTAN". If you try to match the two data sets as they are, you won't find the match - you need to convert to all upper-case (or all lower-case) before comparing.

4.2.2 Strip out whitespace at beginning and end of strings

>>> string = "  annoying spaces  "
>>> string
'  annoying spaces  '
>>> string.strip()
'annoying spaces'

This comes up a lot. Oftentimes when you scrape data, it's weirdly formatted. Spaces at the beginning and end crop up, and they mess up comparisons and other stuff.
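Case conversion and stripping are often combined into a single clean-up step before comparing values. The sketch below shows one way to do that; the helper name normalize and the sample values are hypothetical.

```python
def normalize(name):
    """Strip surrounding whitespace and convert to upper case."""
    return name.strip().upper()

first = "Afghanistan"
second = "  AFGHANISTAN "

print(first == second)                         # raw comparison fails
print(normalize(first) == normalize(second))   # cleaned comparison succeeds
```

Running every value through the same helper before matching two data sets keeps the comparison rule in one place.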

4.2.3 Searching for a substring

>>> string = "what to look for"
>>> string.find("what")
0
>>> string.find("t")
3
>>> string.find("z")
-1
>>> string.find("or")
14
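find returns the position where the substring first appears, or -1 if it is not there, so it pairs naturally with slicing. The following is a rough sketch of pulling out the text between two markers; the function name between and the sample html string are made up for illustration, and real pages are usually messier than this.

```python
def between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start = start + len(start_marker)     # move past the opening marker
    end = text.find(end_marker, start)    # search only after the opening marker
    if end == -1:
        return None
    return text[start:end]

html = "<td>Jul 4 2009</td>"   # hypothetical scraped fragment
print(between(html, "<td>", "</td>"))
```

Checking for -1 matters: slicing with -1 as an index would silently return the wrong piece of the string instead of raising an error.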

4.2.4 Dealing with numbers that have been read in as strings

>>> number = "8"
>>> number.isdigit()
True
>>> number = "eight"
>>> number.isdigit()
False
>>> number + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'int' objects
>>> int(number) + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'eight'
>>> number = "8"
>>> int(number) + 2
10
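When numbers arrive as scraped strings, it can help to check them before converting so that one bad value does not stop the whole program. A minimal sketch, assuming only non-negative whole numbers are expected; the helper name to_int and its default parameter are hypothetical.

```python
def to_int(text, default=None):
    """Convert a string of digits to an int, or return default if it is not one."""
    cleaned = text.strip()
    if cleaned.isdigit():
        return int(cleaned)
    return default

print(to_int("8") + 2)
print(to_int("eight"))
print(to_int(" 42 "))
```

Note that isdigit is a strict test: strings like "-3" or "1.5" are rejected too, so this sketch would need extending for negative numbers or decimals.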

5 Lists

Lists are a crucial data type. They let you store groups of values for later use.

>>> list = [1,2,3,4]
>>> list
[1, 2, 3, 4]
>>> list[0]
1
>>> list[3]
4
>>> list[4]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> list[-1]
4

Lists are a natural way to store the data you read in from the web. For example, suppose you are reading in the names of the senators that voted for a bill; for each Senator on the web-page you are reading, you add the senator to the list.
>>> senators = []
>>> senators.append("Daniel Webster")
>>> senators.append("Hillary Clinton")
>>> senators
['Daniel Webster', 'Hillary Clinton']

You don't have to append things onto the end, though - you can insert them wherever you please:

>>> senators.insert(0, "Tom Coburn")
>>> senators
['Tom Coburn', 'Daniel Webster', 'Hillary Clinton']
>>> senators.insert(1, "Joe Lieberman")
>>> senators
['Tom Coburn', 'Joe Lieberman', 'Daniel Webster', 'Hillary Clinton']

There are tons of other important things to do with lists, so I encourage you to check out the Python documentation.
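As a small preview of those other list operations, here is a sketch using a few that come up constantly: len, the in test, remove, and sorted. It starts from the senators list built above and uses print with parentheses so that it also runs under Python 3.

```python
senators = ["Tom Coburn", "Joe Lieberman", "Daniel Webster", "Hillary Clinton"]

print(len(senators))                  # how many items are in the list
print("Daniel Webster" in senators)   # membership test

senators.remove("Daniel Webster")     # delete the first matching item
alphabetical = sorted(senators)       # sorted() returns a new, ordered list

print(alphabetical)
```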

6 For Loops

For Loops are a must for basically any programming, and certainly for web scraping. Suppose you have a set of URLs you want to scrape; you need a way to tell your program to iterate over each of the URLs. This is one of a million situations in which a For Loop gets the job done. The general idea with a For Loop in any language is to take a variable and a range of values, and set that variable to each of the values in the given range, one by one. So if I say (in pseudocode), "for i in (1,2,3,4)", I mean "first, set i = 1 and do something, then set i = 2 and do the same thing over again, and keep doing this until after you do it for i = 4."

>>> for i in range(0,10): print i
...
0
1
2
3
4
5
6
7
8
9

Python lets you abstract away your For Loops much more than most languages. For example, suppose you have a list of baseball teams:
>>> teams = ["Red Sox", "Yankees", "Rays", "Blue Jays"]
>>> for team in teams: print team
...
Red Sox
Yankees
Rays
Blue Jays

Python magically knows (thanks to the in keyword) that you want it to loop through each of the elements in the list you give it.
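The same pattern covers the URL situation mentioned earlier. The sketch below only loops over a list of addresses and records them; it does not download anything. The example.com addresses and the names urls and visited are placeholders, not real pages from this course.

```python
# Hypothetical list of pages to scrape.
urls = [
    "http://example.com/page1",
    "http://example.com/page2",
    "http://example.com/page3",
]

visited = []
for url in urls:
    # A real scraper would download and parse the page here.
    visited.append(url)
    print("Would scrape: " + url)
```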

7 Logical Tests

Just as important as For Loops are If Statements. We use If Statements to check values, e.g. to see if a certain Senator is in a list of nay votes.

>>> nays = ["Coburn", "Specter", "DeMint"]
>>> if "Coburn" in nays: print "No!"
...
No!

We can check all sorts of things. Maybe we want to know whether the bill passed, given the list of nay voters.
>>> if len(nays) > 50: print "Bill does not pass"
...
>>> if len(nays) <= 50: print "Bill passes!"
...
Bill passes!
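Two tests that are exact opposites, like the pair above, can usually be folded into one If Statement with an else branch. A minimal sketch, reusing the nays list; the variable name outcome is made up for illustration.

```python
nays = ["Coburn", "Specter", "DeMint"]

if len(nays) > 50:
    outcome = "Bill does not pass"
else:
    # Runs whenever the test above is false, i.e. len(nays) <= 50.
    outcome = "Bill passes!"

print(outcome)
```

With else, exactly one of the two branches runs, so there is no risk of the two conditions drifting out of sync.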

8 Scripting

So far, everything we've done has been in the interactive Python environment. But for any web scraping, you're not going to want to input each line of code manually through the command line! So we need to write a script. A script is just a list of commands for Python to execute, which you save in a text file. It's like giving the computer a list and letting it do the command-line entries for you. For example, take a look at this script. Note that lines that start with the pound sign are comment lines - these are not executed by Python, and are little notes we can leave for ourselves so we understand what we were thinking when we wrote the code.

#FirstScript.py
#Our first script!
list = [3, 7, 3, 5, 1, 2]
max = 0
for num in list:
    if num > max: max = num
print "Max is: " + str(max)

To run this script, we open the Terminal (in OS X), make sure we are in the same directory as the script file (or write out the entire path of the file name), and go:

Tue Oct 05 17:41:12 559 $ python FirstScript.py
Max is: 7

It's important to note Python's rules of syntax, which did not come up much when we were using the interactive interpreter, but are unavoidable when writing a script. Note, for example, the colon at the end of the For Loop. This tells Python that it is going inside of a For Loop. Note also that the next line is indented exactly 4 spaces. Each line inside the For Loop must be indented by the same amount (four spaces is the convention), so that Python knows where the loop continues and where it ends (it ends when it gets to a line that is indented less than the loop body). Likewise, there is a colon at the end of the If Statement. In this case, I have left the result of the If Statement on the same line as the If Statement, meaning I don't have to indent. This works as long as the result of the If Statement is only one line. If I needed more than one thing to execute, I'd need to put them on new lines and indent them four more spaces. Fortunately, any good text editor will know you're writing a Python file, and thus will make the Tab key indent exactly four spaces.
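To illustrate the multi-line case, here is a sketch of a variation on the script in which the If Statement needs to do two things, so its body moves onto new, further-indented lines. The file name SecondScript.py and the variable names are hypothetical; the names numbers and largest are used instead of list and max so that Python's built-ins of the same name stay available, and print uses parentheses so the script also runs under Python 3.

```python
# SecondScript.py (hypothetical file name)
# Find the largest number and remember where it was found.
numbers = [3, 7, 3, 5, 1, 2]
largest = numbers[0]
position = 0

for index in range(len(numbers)):
    if numbers[index] > largest:
        # Two statements in the If body, each indented one level further.
        largest = numbers[index]
        position = index

print("Max is: " + str(largest))
print("Found at index: " + str(position))
```

The two print lines are not indented at all, which is how Python knows they run once, after the loop has finished.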

9 Next Topics

1. Functions
2. Modules
3. Regular Expressions
4. Other stuff?
