
SMEE Documentation

Lewis Cawthorne 08/17/2011


The purpose of SMEE is to identify sorting motifs in documents retrieved from PubMedCentral. It currently consists of 2218 lines of Python code, multiple Django templates, some CSS, and a shell script. This document attempts to cover the major parts of the program. For a broad overview of the current state of the project, refer to the diagram below.

A decent amount of work has been done on optimization. Originally, a run could have taken weeks (if I had ever let one complete). That was quickly shaved down to days, and then lingered for a while at several hours for a relatively small document set. Current performance is much better: a UniProt import (which must occur about once a month) takes about 20 minutes, processing 100 files takes about 3 minutes, and processing 2000 files takes about 10 to 15 minutes.

There are two major parts to the project, and they work together cleanly. The processing parts are written in Python and use Biopython and the NLTK (Natural Language Toolkit). The web-accessible parts are built with the Django Python web framework, so they can freely interact with the processing code and the database. All database models are declared in the Django models.py file.

The main webpage is accessed from the webserver root, or from the /smee path on the webserver (either works the same). The pages for curation, the approved listing, the complete database listing (including uncleared rejects and unapproved entries), and listings sorted in a table ordered by confidence are all found there. There is also a direct address for the admin page: server.address.com/admin. This gives you convenient direct web access to the raw database.
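To give a concrete picture of how those database definitions are declared, here is a minimal, hypothetical sketch in the style of smeedb/models.py. The field names and types are illustrative; only the fact that all models live in models.py comes from the project itself.

    # Hypothetical excerpt in the spirit of smeedb/models.py.
    # Field names and types are illustrative, not the project's actual schema.
    from django.db import models

    class ParseRule(models.Model):
        # One text rule used during match analysis.
        pattern = models.CharField(max_length=200)

    class Match(models.Model):
        # One candidate sorting motif pulled from a PubMedCentral document.
        pmcid = models.CharField(max_length=20)        # source document
        motif = models.CharField(max_length=100)       # candidate motif text
        confidence = models.FloatField()               # assigned during analysis
        approved = models.BooleanField(default=False)  # set during manual curation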

A normal development cycle involves modifying a number of files. If no changes have been made to the ParseRules, I clear the database from the smee directory with:

    ./manage.py dbshell < clear_leave_initial.txt

This avoids a UniProt re-parse. If I have modified the ParseRules, I instead run:

    ./manage.py dbshell < clear_db.txt

After either one, run ./manage.py syncdb to reinitialize the tables. Then ./run-smee.sh starts up the webserver, and ./process.py runs the primary processing script. When DEBUG=True in settings.py, process.py will import rules from smee/loaddb/initial.

What various files do


All files currently reside in /srv/django/smee on my workstation. The only places this path is hardcoded are where Django makes it necessary: settings.py and urls.py, where it is marked with a comment saying HARD_LOCATION. The other setting that varies by machine is in the run-smee.sh shell script, a convenient one-line script that starts the Django server. The IP address and port to run the server on are given as a required command-line parameter inside it; change them to match the machine where it is running.

You will note that a lot of the code is well documented. This habit became less consistent as the project went on and time became more of a constraint, but I attempted to attach at least a brief comment to every function and to keep comments elsewhere up to date. You will also notice comments marking sections as 'brittle' or 'temp'. These are code sections I was never quite happy with and planned to replace or generalize later; there are only a few of them. One particular part of process.py is roped off with an if False block: the protein-finding code, which was never fully implemented. There is also cleanup code at the bottom of process.py that works correctly but got in the way of how I was running my tests. The cleanup code moves every analyzed document from the working directory (xml_docs) to the storage directory (stored_docs). That way the program keeps what it has analyzed, ensures it never fetches the same document twice, and keeps processed files out of its analysis directory.

All other paths are relative to the smee directory. We'll start with the main logic of the program: process.py mainly serves to call methods in other Python modules, and it is an excellent starting point to see how the pieces of SMEE processing work together. In the section after that, I briefly cover the main files of interest in the Django web project, then give a brief run-down of the other files in the project directories.
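As a point of reference, the cleanup step described above amounts to something like the following. This is a minimal sketch of the idea, assuming the xml_docs and stored_docs directory names mentioned above; the real code at the bottom of process.py may differ in detail.

    # Minimal sketch of the cleanup idea; the real code at the bottom of
    # process.py may differ in detail.
    import os
    import shutil

    def archive_processed(working_dir='xml_docs', storage_dir='stored_docs'):
        """Move every analyzed document out of the working directory so the
        next run neither re-analyzes nor re-fetches it."""
        if not os.path.isdir(storage_dir):
            os.makedirs(storage_dir)
        for name in os.listdir(working_dir):
            shutil.move(os.path.join(working_dir, name),
                        os.path.join(storage_dir, name))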

process.py

process.py is the wheel around which all other scripts revolve, and reading it top to bottom is the quickest way to learn the order of the major function calls. It is suitable for being run nightly from a cronjob to scan PubMedCentral for new documents matching the search terms, pull them down, analyze them, and insert them into the database for manual approval. It is best to run process.py at night, hence its design for cron compatibility; for small record sets you can run it anytime, but PubMedCentral prefers that large requests be run at off-peak hours.

First and foremost, the sget.GetCorpus call takes a mandatory parameter of your email address. This way, if you hit the PubMedCentral servers too hard, they can warn you before blocking your access, so it is best to keep this current. It is now set to Doctor Hu's email address, as I am retiring from the project. Also of special interest is the pmc_corpus.search_terms variable. It is currently set up to search for ["'sorting motif'", "'sorting signal'", "'sorting sequence'", "'signaling peptide'", "'signaling sequence'", "'targeting signal'"]. Note the single quotes on the inside and double quotes on the outside of each term; this keeps each one a complete-term match. If you simply want to AND several words together, put them in double quotes without the inner set of single quotes. I am using a very restricted set when testing. Ultimately, you will want to broaden this set and let the parsing rules drill down to the good data.

When DEBUG is set, search_terms is initialized to the set of PMCIDs in the pmcids_known.txt file. These are 118 known positives that I use for testing purposes. If DEBUG is not true, it will use the previously mentioned search terms. The search_all method of the corpus object is currently not called at all when DEBUG is set to True, since this stops it from re-retrieving the same files over and over (they are not moved off to the storage_dir in DEBUG mode; normally, anything already stored would not be fetched again). If DEBUG is set, it will also initialize the tables with the initial ParseRules found in smeedb/loaddb/initial. If you want ParseRule changes to survive a database dump and reprocess during testing, add the rule to the appropriate file in this directory. It then sets the working_dir and storage_dir variables directly afterwards.

The following other modules are imported within process.py, and I will use their abbreviated names to refer to their functions here:

    import loaddb.getcorpus as sget
    import loaddb.preprocess as spre
    import loaddb.findmatch as sfind
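Pieced together from the description above, the top of process.py does roughly the following. Only the names mentioned in this document (GetCorpus, search_terms, search_all, pmcids_known.txt, DEBUG) come from the project; the email address and the exact control flow here are illustrative.

    # Rough outline of the setup described above; the email address and
    # exact control flow are illustrative.
    import loaddb.getcorpus as sget

    pmc_corpus = sget.GetCorpus('maintainer@example.edu')  # keep this address current

    DEBUG = True  # normally read from settings.py
    if DEBUG:
        # Testing mode: use the 118 known-positive PMCIDs instead of searching,
        # and skip search_all so the same files are not re-fetched every run.
        pmc_corpus.search_terms = [line.strip() for line in open('pmcids_known.txt')]
    else:
        # Double quotes outside, single quotes inside keep each phrase a
        # complete-term match; drop the inner quotes to AND the words instead.
        pmc_corpus.search_terms = ["'sorting motif'", "'sorting signal'",
                                   "'sorting sequence'", "'signaling peptide'",
                                   "'signaling sequence'", "'targeting signal'"]
        pmc_corpus.search_all()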

sget.GetCorpus creates a corpus object; it requires that your email address be passed to the constructor. This object also houses the previously mentioned search_terms variable, which you can set. The search_all() method of the corpus object searches for and retrieves results. Once this is complete, process.py outputs 'Retrieval Complete'.

spre contains several preprocessing functions that are called from process.py: scan_files, remove_oldfiles, and scan_files(*, *, '.txt') are called in order. Together these process all the retrieved XML files to extract some information and remove tags, remove the old 'working_data.txt' file, and then concatenate all of the retrieved text files, with appropriate separators, into a new 'working_data.txt' file that is used for further analysis. Documentation of these two important processing modules follows.

loaddb/getcorpus.py
Contains code for PubMedCentral interaction.

Exported Classes:

GetCorpus - Encapsulates data and methods for PubMed Central document retrieval

This file is designed to be imported and not run as a script.

loaddb/preprocess.py
Custom code for preprocessing PubMed Central XML into Unicode text files.

Exported Methods:
- scan_files - Recursively finds all files matching an extension and executes a function on them.
- pp_file - Extracts text from XML and converts it to UTF-8 format.
- strip_spec - Strips unnecessary tags and characters from an XML string.
- pp_string - Calls a series of functions on a string to prepare it for Natural Language Processing.
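Based on the calling order described earlier, the preprocessing stage of process.py amounts to roughly the following. This is a sketch only: it assumes scan_files takes a directory, a per-file handler, and an extension, and the handler name used for the concatenation pass is a placeholder.

    # Sketch of the preprocessing sequence; the scan_files signature and the
    # concat_file handler name are assumptions.
    spre.scan_files(working_dir, spre.pp_file, '.xml')    # strip XML, write UTF-8 text
    spre.remove_oldfiles(working_dir)                     # drop the stale working_data.txt
    spre.scan_files(working_dir, concat_file, '.txt')     # stitch texts into working_data.txt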

After getting and preprocessing the records, we pass the working file to sfind.find_matches. find_matches is the heart of the analysis; I will go into detail on it next. For each match returned from find_matches, we call sfind.insert_matches to put it into the database. insert_matches also calls out to other functions in the module to assign a confidence and test the match before insertion. The actual code that inserts matches into the database is in the insertmatch.py module; it is the only part of the processing stages to talk directly to the database, and it uses the Django models defined in smeedb/models.py for this access.
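The hand-off between the two stages therefore boils down to a loop along these lines. This is a sketch; the real calls in process.py may carry additional arguments.

    # Sketch of the hand-off described above; the real calls may take
    # additional arguments.
    matches = sfind.find_matches('working_data.txt')
    for match in matches:
        # insert_matches assigns a confidence and tests the match internally,
        # and only writes to the database (via insertmatch.py and
        # smeedb/models.py) when the test passes.
        sfind.insert_matches(match)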

SMEE's web access is provided through Django, and it is controlled like any other Django project. If you have never worked with Django, you can find thorough documentation here: https://docs.djangoproject.com/en/1.2/ A brief run-down of the important locations is provided in the file-by-file list below. It is important to recognize that smee/manage.py controls syncing the database definitions, starting up the webserver, entering a Python shell that has direct access to the database and functions, and entering a dbshell to interact with the database as the smee user. It is also important to note that smee/smeedb/models.py controls all the database definitions (fields, field types, etc.); this is the one place to look for all things relating to the database. smee/urls.py controls the URLs the server responds to, via regular expressions, and smee/smeedb/views.py contains the functions called from urls.py to display the various webpages. Those functions render responses via the djhtml templates stored in the smee/templates directory.

Finally, smeedb/loaddb/findmatch.py is the core logic for match analysis. Right at the top is the motif_reg expression we use to identify possible motif candidates. Beyond that are a number of methods that work together for analysis. Of special note are the assign_conf and test_match functions: assign_conf assigns a confidence rating to the match, and test_match is the final True/False function that determines whether the match makes it into the database. Currently it only tests against a confidence-rating cutoff, but this could easily be expanded.
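Since test_match currently reduces to a confidence cutoff, its behavior can be pictured roughly as follows. The cutoff value and the exact signatures are placeholders; the real functions live in loaddb/findmatch.py.

    # Illustrative only: the real assign_conf and test_match in
    # loaddb/findmatch.py are more involved, and the cutoff value here
    # is a placeholder.
    CONFIDENCE_CUTOFF = 0.5

    def test_match(match):
        """Return True only if the match's confidence clears the cutoff,
        i.e. it is allowed to reach the database."""
        return assign_conf(match) >= CONFIDENCE_CUTOFF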

File-by-file descriptions. Any file ending in .pyc is simply a compiled version of the matching .py file.

smee/
- clear_db.txt - import with ./manage.py dbshell < clear_db.txt to clear out the database.
- clear_leave_initial.txt - use the same way to clear out the database but leave the ParseRules and UniProt import intact.
- confirmed_matches.txt - the PMCIDs of items found in the last SMEE processing run that match the IDs of known positives (made by hand / manual scripts).
- manage.py - controls various Django functionality.
- pmcids_known.txt - known-positive PMCIDs from the hand-curated database.
- process.py - main script to run from cron or by hand.
- run-smee.sh - launches the Django webserver for SMEE.
- settings.py - controls Django settings.
- urls.py - Django URL patterns.
- working_data.txt - the working data from the last SMEE run.

smee/loaddb/ - contains the various modules already discussed, plus other modules and data they use.
- findmatch.py, getcorpus.py, insertmatch.py, preprocess.py - previously discussed.
- geniaproc.py - no longer used.
- initdata.py - parses the initial data text files and imports the initial data into the database.
- inituniprot.py - part of a failed experiment to parse and store all of UniProt; not currently used.
- pmcftp.py - handles retrieval of OpenAccess archives from the PubMedCentral FTP servers. Called by getcorpus.py.
- process_uniprot.py - imports a formatted list of UniProt names that match our motif-matching regex, so we can avoid protein names that the system mistakes for motifs.
- initial/ - stores the initial data text files. This is also where you would store the uniprot_trembl.dat file for import; it currently must be downloaded and decompressed manually, and it updates about once a month. The other files correspond to database tables one-to-one: indicators, levs, locs, nonmotifs, and verbs.txt.
- initial/format_uniprot.sh - formats a uniprot_trembl.dat file so the UniProt import routines will work with it.

smee/OA_archs - stores OpenAccess archives when they are available and the corresponding option is turned on during the call to the corpus object's search_all method. These are tar files with PDFs, XML, text, pictures, and extended material relating to papers; not available for every document, but a wealth of information when they are.
smee/xml_docs - where documents currently being analyzed, or stored for the next run's analysis, are held. Controlled by a variable in process.py. Currently holds the 118 known positives.
smee/xml_old - a set of over 2000 mixed documents for testing slightly larger analyses.
smee/stored_docs - stores documents that have already been processed.
smee/smeedb - contains the Django database-related files: models.py, views.py, tests.py (unused), and admin.py.
smee/static - contains static files accessed by Django (usually pictures and CSS files).
smee/templates - stores the Django djhtml templates for the webpages.
smee/testing - some old test code that is no longer necessary; I keep it around for reference.

Suggested Items to Complete (in no particular order):

1) Increase accuracy by adjusting the rules (contained in smee/loaddb/findmatch.py) used by the test_match function. Current accuracy allows us to identify motifs in 58 of the 118 known positives. This is slightly under half, so there is much room for improvement.
2) Create some new user-side forms to display and query the database, and make the existing forms look a bit better.
3) Set it up to download its own UniProt updates, or at least to do so via a manual click on a weblink (a rough sketch of what this could look like follows this list). I have optimized the import procedure so that it takes about 30 minutes each time the data is updated; UniProt releases about once a month. The results need to be formatted, imported, and then the huge UniProt dat file should be removed. I did the download and removal by hand. In the smee/loaddb/initial/ directory, there is a file named format_uniprot.sh that will format a UniProt database for import. The import routines are in process_uniprot.py in the smee/loaddb/ directory; the import is currently only called from process.py.
4) Extend it to import from PubMed (where full text is not available but abstracts are) and possibly Google Scholar. It is already set up in an attempt to make it easy to extend in this fashion.
5) More general polish.
6) Debugging. I did what I could, but no one is perfect.
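For item 3, the monthly refresh described above could be automated along these lines. This is a hypothetical sketch only: the download URL, local file names, and the process_uniprot entry point (import_names) are placeholders; only format_uniprot.sh and process_uniprot.py are real parts of the project.

    # Hypothetical sketch of automating the monthly UniProt refresh (item 3).
    # The URL, file names, and import_names entry point are placeholders.
    import os
    import subprocess
    import urllib

    from loaddb import process_uniprot

    UNIPROT_URL = 'ftp://example.org/pub/uniprot_trembl.dat.gz'       # placeholder URL
    LOCAL_DAT = 'loaddb/initial/uniprot_trembl.dat'

    urllib.urlretrieve(UNIPROT_URL, LOCAL_DAT + '.gz')                # download
    subprocess.call(['gunzip', LOCAL_DAT + '.gz'])                    # decompress
    subprocess.call(['loaddb/initial/format_uniprot.sh', LOCAL_DAT])  # format for import
    process_uniprot.import_names(LOCAL_DAT)                           # hypothetical entry point
    os.remove(LOCAL_DAT)                                              # drop the huge dat file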
