J2ME Example of Scraping Web Pages
The following simple example demonstrates how a web page can be retrieved. You can add your own scraping algorithms to process the data and extract just the information you want.

Scrape Source Code
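As a rough illustration of the two steps this article describes (pull the page into a String in chunks, then pick out a piece with indexOf()), here is a minimal sketch. It is not the original listing: the class name ScrapeSketch and the methods readAll() and extractBetween() are hypothetical names chosen for this example, and the page is read from a plain InputStream so the code also runs on a desktop JVM. On a handset, the stream would typically come from a StreamConnection opened with Connector.open(url).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ScrapeSketch {

    // Reads the whole stream into a String, up to bufferSize bytes at a time.
    // Assumption: on a phone the stream would come from something like
    // ((StreamConnection) Connector.open(url)).openInputStream().
    public static String readAll(InputStream in, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        StringBuffer page = new StringBuffer();
        int count;
        // read() fills at most bufferSize bytes and returns -1 at end of stream.
        while ((count = in.read(buffer)) != -1) {
            // Decodes with the platform default encoding. A multi-byte
            // character split across two chunks would be decoded incorrectly,
            // so this is only safe for single-byte encodings.
            page.append(new String(buffer, 0, count));
        }
        return page.toString();
    }

    // Returns the text between the first occurrence of start and the next
    // occurrence of end, or null if either marker is missing.
    public static String extractBetween(String page, String start, String end) {
        int from = page.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = page.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return page.substring(from, to);
    }

    public static void main(String[] args) throws IOException {
        // An in-memory page stands in for a real network connection.
        String html = "<html><head><title>Example Page</title></head></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes());
        String page = readAll(in, 200);
        System.out.println(extractBetween(page, "<title>", "</title>"));
    }
}
```

Run on a desktop JVM, this prints Example Page. The buffer size is passed as a parameter here only to make it easy to experiment with.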
All the magic happens in the getViaStreamConnection() function. Give it a URL and it returns a String containing the entire page.

Read in Chunks

I've experimented with several methods for reading websites and converting them into String objects. In this example, the data arrives as an array of bytes. A byte is a primitive type in Java and awkward to process directly, so I convert the byte array into a more useful String object that has methods like indexOf() for locating substrings. The conversion is done by reading large chunks of bytes and appending each chunk to a String. This is not a technique you'll often find in the literature. Typical examples read one byte at a time and append each byte to a String as the page is read. I found that horribly inefficient. By declaring a largish buffer, my loop runs much faster than if I read one byte at a time.

Avoid Multi-threading

Reading web pages is often used to demonstrate another cool feature in Java: multi-threading. You could create a separate thread to read your page in the background while the user interface is still doing something entertaining, such as displaying an animated "loading" screen. I've tried doing this too, not because I wanted a flashy application but because I wanted to give the user a chance to abort a read operation if it starts to run long. Unfortunately, multi-threading is slower than watching paint dry. I quickly decided that reading the page in less than a week is more important than giving the user a chance to abort. Sure, a cancel option is nice, but multi-threading makes every page retrieval seem like the phone has locked up.

Don't Do This Other Stuff Either

There are lots of ugly things about my technique. First of all, I have declared the size of my byte array using a magic number (200). It is indeed just an arbitrarily chosen number. You can play around with it and you'll see some changes in the performance of this function.
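One way to get a feel for why the buffer size matters is to count how many read calls the loop makes for a given page size. The sketch below is only an illustration under stated assumptions: BufferSizeDemo, CountingStream and countReads() are hypothetical names, and the "page" is a 1000-byte in-memory array rather than a real network connection. A real connection may return fewer bytes per call than requested, so actual counts on a phone would differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferSizeDemo {

    // Wraps another stream and counts every read call made on it.
    static class CountingStream extends InputStream {
        private final InputStream source;
        int calls = 0;

        CountingStream(InputStream source) {
            this.source = source;
        }

        public int read() throws IOException {
            calls++;
            return source.read();
        }

        // InputStream.read(byte[]) delegates to this method, so each
        // chunked read is counted exactly once.
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return source.read(b, off, len);
        }
    }

    // Runs the chunked read loop over data and returns the number of read
    // calls it took, including the final call that returns -1.
    public static int countReads(byte[] data, int bufferSize) throws IOException {
        CountingStream in = new CountingStream(new ByteArrayInputStream(data));
        byte[] buffer = new byte[bufferSize];
        StringBuffer page = new StringBuffer();
        int count;
        while ((count = in.read(buffer)) != -1) {
            page.append(new String(buffer, 0, count));
        }
        return in.calls;
    }

    public static void main(String[] args) throws IOException {
        // A fake 1000-byte page filled with the letter x.
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) 'x';
        }
        int[] sizes = {1, 50, 200, 1000};
        for (int i = 0; i < sizes.length; i++) {
            System.out.println("buffer " + sizes[i] + ": "
                    + countReads(data, sizes[i]) + " read calls");
        }
    }
}
```

With this in-memory page, a buffer of 1 takes 1001 read calls and a buffer of 200 takes 6. Each call also costs one String allocation and one append, which is where a slow handset is likely to feel the difference.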
You may want to tune the buffer size based on the speed of the remote server you are reading from and the size of the page you are reading. The speed of your device probably plays a major role as well. I've tested this routine on the Sanyo 4900, which is perhaps the slowest Java implementation on the Sprint network. Newer phones might be able to get away with reading pages one byte at a time, but older phones will benefit from using a large buffer.

As I already noted, you can't cleanly abort this read operation once it's started. If the server (or the network) responds slowly, this program can easily hang your phone. That's a crummy wart to live with, but multi-threading just isn't a practical solution for most cases.

Try It

If you want to test it out yourself, this program is available for download from my other website. Point your phone's browser to http://apgap.com and look for Scrape in the Applications section.

Basic Example

You might want to start with a Simple J2ME MIDlet Example. On the other hand, if you'd like to try some tricks very specific to Sprint handsets, learn to Launch MIDlets From a Webpage.