Oscar: training data, models, etc
Oscar uses a Maximum Entropy Markov Model (MEMM) based on n-grams. Peter Corbett has written this up (doi:10.1186/1471-2105-9-S11-S4). So, it is basically statistics once more. If you really want a proper bioinformatics education, do your PhD at a (proteo)chemometrics department.
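To make "based on n-grams" a bit more concrete, here is a minimal sketch of extracting character n-grams from a token, the kind of feature a statistical model can be trained on. It assumes character-level n-grams with `^` and `$` as padding symbols; the function name `char_ngrams` is made up for this illustration and is not Oscar's API or its actual feature set.

```python
def char_ngrams(token, n=3):
    """Return the character n-grams of a token.

    Assumption for this sketch: the token is padded with '^' at the start
    and '$' at the end, so n-grams at word boundaries are distinguishable.
    """
    padded = "^" * (n - 1) + token + "$" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    # Suffix n-grams such as "nol" and "ol$" hint at an alcohol name.
    print(char_ngrams("ethanol", n=3))
```

A model like an MEMM would combine counts of such features with the label of the previous token; that part is left out here.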
Status update on BJOC analysis with Oscar and ChemicalTagger #3
The two earlier posts in this series showed screenshots of Oscar results, but the title also promised results from Lezan’s ChemicalTagger. Sam helped with getting the HTML pages online via the Cambridge Hudson installation. Where Oscar finds named entities (chemical compounds, processes, etc), ChemicalTagger finds roles, like solvent, acid, base, catalyst. Roles are properties of chemical compounds in certain situations. Ethanol is not always a solvent, sometimes it is a Xmas present. The current output is not entirely where I want to go yet, but makes it easy to see which solvents are frequently found in the BJOC corpus:
Supramolecular chemistry
Some smart software developer once said not to optimize your code too early. However, not caring about it at all does not help either. Some basic knowledge of memory management can keep you going. That is, I just ran into the limits of Oscar and ChemicalTagger. As I blogged earlier today, I am analyzing the BJOC literature, but Lezan and I are running into a reproducible out-of-memory exception. At first I thought it was a memory leak, as it was the 95th paper it fell over on, but after we optimized our code a bit by reusing classes, the problem remained. It turned out not to be in recreating objects (though the code is significantly faster now), but in a single BJOC paper being too large.
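One possible way around a single oversized document, sketched here under assumptions and not necessarily what was done for the BJOC analysis, is to create the tagger once and feed it the paper in bounded chunks. The names `iter_chunks` and `tag_document` are hypothetical, and `tagger` stands in for any callable that takes text and returns a list of results.

```python
from typing import Callable, Iterable, Iterator, List


def iter_chunks(paragraphs: Iterable[str], max_chars: int) -> Iterator[str]:
    """Group paragraphs into chunks of roughly at most max_chars characters.

    A single paragraph longer than max_chars still becomes its own chunk.
    """
    buffer: List[str] = []
    size = 0
    for paragraph in paragraphs:
        if buffer and size + len(paragraph) > max_chars:
            yield "\n".join(buffer)
            buffer, size = [], 0
        buffer.append(paragraph)
        size += len(paragraph)
    if buffer:
        yield "\n".join(buffer)


def tag_document(paragraphs: Iterable[str],
                 tagger: Callable[[str], list],
                 max_chars: int = 10_000) -> list:
    """Run one reused tagger over a document, chunk by chunk."""
    results = []
    for chunk in iter_chunks(paragraphs, max_chars):
        results.extend(tagger(chunk))
    return results


if __name__ == "__main__":
    # str.split is only a stand-in for a real tagger.
    print(tag_document(["ethanol was added", "the mixture was stirred"], str.split, max_chars=20))
```

The trade-off is that anything spanning a chunk boundary is lost, so splitting on paragraph boundaries is the safer choice.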
Adding a new dictionary to Oscar
Say, you have your own dictionary of chemical compounds. For example, your company’s list of yet-unpublished internal research codes. Still, you want to index your local listserv to make it easier for your employees to search for particular chemistry you are working on, perhaps related to something done at other company sites. This is what Oscar is for.
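As a rough illustration of what a custom dictionary buys you, the sketch below finds whole-word, case-insensitive hits of dictionary terms in a piece of text. It is plain Python and deliberately does not use Oscar's API; the function name `find_dictionary_terms` and the research codes in the example are invented for this illustration.

```python
import re


def find_dictionary_terms(text, dictionary):
    """Return sorted (start, end, term) tuples for each dictionary hit.

    Matching is case-insensitive and requires that the hit is not
    embedded in a longer word (so "XY-9" does not match inside "XY-90").
    """
    hits = []
    for term in dictionary:
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), term))
    return sorted(hits)


if __name__ == "__main__":
    codes = {"ABC-123", "XY-9"}  # hypothetical internal research codes
    message = "We tested abc-123 against XY-9 and XY-90."
    for start, end, term in find_dictionary_terms(message, codes):
        print(start, end, term)
```

Overlapping terms are all reported, so a real indexer would still need a rule for picking the longest or most specific match.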