smb:// URL Support in pysmb

Starting from pysmb version 1.0.4, application developers can use smb:// URLs with pysmb and Python's urllib2 package. One benefit is that if your existing application already supports downloads via other protocols like http or ftp, you can "upgrade" it to support downloads from Windows shared folders or from Samba servers with only small modifications to your existing code.

pysmb tries to follow the smb URL scheme proposed in the IETF draft, which is not much different from other standard URL schemes like http and ftp. In short, your smb URL should follow this format - smb://[<user>:<password>@]<host>[:<port>][/<share>/<path>/<file>]
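To make the parts of such a URL concrete, here is a minimal sketch that splits an smb:// URL with the Python 3 standard library (where urlparse lives in urllib.parse). The split_smb_url helper and its dictionary keys are my own names for illustration. This is not how SMBHandler parses URLs internally.

```python
from urllib.parse import urlsplit

def split_smb_url(url):
    """Break an smb:// URL into its components (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "smb":
        raise ValueError("not an smb:// URL: %r" % url)
    # The first path segment is the share; the rest is the path inside it.
    share, _, path = parts.path.lstrip("/").partition("/")
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL has no explicit port
        "share": share,
        "path": path,
    }

info = split_smb_url(
    "smb://myuserID:mypassword@192.168.1.1/sharedfolder/docs/rfc1001.txt")
```

Here info["share"] is the shared folder name and info["path"] is the file's location inside that share.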

Sample File Retrieving Code

The following code illustrates a simple file retrieval operation from the remote server at 192.168.1.1. The userID and password are embedded in the URL and will be properly parsed by the SMBHandler for authentication with the remote server.

import urllib2
from smb.SMBHandler import SMBHandler 

director = urllib2.build_opener(SMBHandler)
fh = director.open('smb://myuserID:mypassword@192.168.1.1/sharedfolder/rfc1001.txt') 

# Process fh like a file-like object and then close it.
fh.close()

The most important step is registering the SMBHandler class with the urllib2 library via the urllib2.build_opener() method. Note that there are other ways to register additional handlers, but this is the most straightforward method (in my opinion).
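As a rough illustration of one of those other ways, the sketch below installs an opener globally with install_opener() so that a plain urlopen() call also understands the extra scheme. It is written for Python 3, where urllib2 became urllib.request. DemoHandler and the demo:// scheme are made up so the example runs without a server; with pysmb you would pass SMBHandler instead.

```python
import io
import urllib.request
from email.message import Message
from urllib.response import addinfourl

class DemoHandler(urllib.request.BaseHandler):
    """Hypothetical handler for a made-up demo:// scheme (stand-in for SMBHandler)."""

    def demo_open(self, req):
        # urllib dispatches to a handler method named "<scheme>_open".
        body = io.BytesIO(b"hello from " + req.full_url.encode("ascii"))
        return addinfourl(body, Message(), req.full_url)

# Option 1: a private opener, as in the sample code above.
director = urllib.request.build_opener(DemoHandler)
first = director.open("demo://example/file.txt").read()

# Option 2: install the opener globally so urlopen() uses it as well.
urllib.request.install_opener(director)
second = urllib.request.urlopen("demo://example/file.txt").read()
```

The private opener keeps the change local to your own code, while the global install affects every urlopen() call in the process.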

File Uploading to Remote Servers

In addition to file retrieval/download, pysmb also supports uploads to the remote servers. The following code example helps to illustrate a simple file upload operation.

import urllib2
from smb.SMBHandler import SMBHandler

file_fh = open('local_file.dat', 'rb')

director = urllib2.build_opener(SMBHandler)
fh = director.open('smb://myuserID:mypassword@192.168.1.1/sharedfolder/upload_file.dat',
                   data = file_fh)

# Reading from fh will only return an empty string
fh.close()

The URL used is similar to that used in the file retrieval operation. However, observe that the open() method accepts a data parameter pointing to an opened file object. With this parameter, pysmb will perform a file upload operation instead of the usual file retrieval operation.

I hope the above 2 examples provide a quick insight on how to use pysmb and urllib2 in your application. If you need more advanced operations like folder creation and file deletion, pysmb already has built-in support for these via the SMBConnection class (synchronous I/O implementation) and SMBProtocol class (Twisted protocol implementation).

EGD: Entropy Gathering Daemon Client Protocol

The protocol is inferred from reading the source code of egd version 0.9. It is a simple binary protocol with 5 commands. When the client connects to EGD via Unix sockets or TCP sockets, the client initiates the communication by sending a request. The format of the request depends on the nature of the request; the first byte is the command code (0x00 to 0x04), which tells the service what is being requested.

Get Entropy Level

Client sends 1-byte value of 0x00.

EGD replies with a 4-byte value, in big-endian order, indicating the current size of the entropy pool. This tells the client how much random data can be read without blocking.
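A minimal sketch of decoding that reply with Python's struct module follows. The names GET_ENTROPY_LEVEL and parse_entropy_level are my own, and the sample reply bytes are invented for the demonstration.

```python
import struct

GET_ENTROPY_LEVEL = b"\x00"  # the whole request is this single command byte

def parse_entropy_level(reply):
    """Decode the 4-byte big-endian reply to a 'get entropy level' request."""
    if len(reply) != 4:
        raise ValueError("expected 4 bytes, got %d" % len(reply))
    (level,) = struct.unpack(">I", reply)  # ">I" = big-endian unsigned 32-bit
    return level

level = parse_entropy_level(b"\x00\x00\x10\x00")  # made-up reply bytes
```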

Read Entropy (Non-Blocking Mode)

Client sends 1-byte value of 0x01 and 1-byte value indicating the number of bytes to request from the service. Up to 255 bytes of random data can be requested.

EGD replies (without delay) with a 1-byte value indicating how many bytes of random data follow this first byte, and then the random data itself.
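The request and reply for the non-blocking read can be sketched as two small helpers. The function names are hypothetical, and the lower bound of 1 byte is my assumption; the text above only states the upper bound of 255.

```python
def build_nonblocking_read(count):
    """Build the 2-byte request: command 0x01 followed by the byte count."""
    if not 1 <= count <= 255:  # lower bound of 1 is an assumption
        raise ValueError("count must fit in one byte (1..255)")
    return bytes([0x01, count])

def parse_nonblocking_reply(reply):
    """Split the reply into its length byte and the random data that follows."""
    if not reply:
        raise ValueError("empty reply")
    available = reply[0]  # number of data bytes EGD says will follow
    data = reply[1:1 + available]
    if len(data) != available:
        raise ValueError("reply is shorter than its length byte claims")
    return data

request = build_nonblocking_read(16)
data = parse_nonblocking_reply(b"\x03abc")  # made-up reply: 3 bytes of data
```

Because the reply carries its own length, the caller should trust the length byte rather than the count it asked for.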

Read Entropy (Blocking Mode)

Client sends 1-byte value of 0x02 and 1-byte value indicating the number of bytes to request from the service. Up to 255 bytes of random data can be requested.

EGD replies (possibly with delay) with the requested random data. Note that there is no 1-byte value to indicate the number of returned random data as opposed to the non-blocking read request.
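Since the blocking reply has no length prefix, the client has to keep reading until it has collected exactly the number of bytes it asked for. Here is a minimal sketch of that loop; read_entropy_blocking is a name I made up, and the demonstration uses a socketpair as a fake peer rather than a real EGD.

```python
import socket

def read_entropy_blocking(sock, count):
    """Send command 0x02 and read exactly `count` bytes of random data."""
    if not 1 <= count <= 255:
        raise ValueError("count must fit in one byte (1..255)")
    sock.sendall(bytes([0x02, count]))
    chunks = []
    remaining = count
    while remaining:
        chunk = sock.recv(remaining)  # may return fewer bytes than asked
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Demonstration against a fake peer, not a real EGD service.
client, fake_egd = socket.socketpair()
fake_egd.sendall(b"\xde\xad\xbe\xef")  # pre-load the "random" bytes
data = read_entropy_blocking(client, 4)
request_seen = fake_egd.recv(2)        # what the fake peer received
client.close()
fake_egd.close()
```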

Write Entropy

Client sends a 1-byte value of 0x03, a 2-byte value giving the number of entropy bits supplied in this request, a 1-byte value giving the number of bytes of random data (for the entropy bits), followed by the random data itself. So the length of the request is 4+N bytes, where N is the value of the 4th byte of the request.

EGD does not reply to this request.
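The 4+N byte layout can be sketched with struct as below. The text above does not state the byte order of the 2-byte entropy-bits field; I have assumed big-endian to match the entropy level reply, so check egd.pl before relying on it. The function name is hypothetical.

```python
import struct

def build_write_entropy(entropy_bits, data):
    """Build a 'write entropy' request: 0x03, bit count, data length, data."""
    if len(data) > 255:
        raise ValueError("at most 255 bytes of data fit in one request")
    if not 0 <= entropy_bits <= 0xFFFF:
        raise ValueError("entropy_bits must fit in two bytes")
    # ">BHB" = command byte, 2-byte bit count (big-endian is an assumption),
    # 1-byte data length. The ">" prefix also disables padding.
    header = struct.pack(">BHB", 0x03, entropy_bits, len(data))
    return header + data

request = build_write_entropy(16, b"\xaa\xbb")
```

The request built here is 6 bytes long: the 4-byte header plus N = 2 bytes of data.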

Report PID

Client sends 1-byte value of 0x04.

EGD replies with a 1-byte value giving the length of the pid string, followed by the pid string itself. Note that the pid string is not null-terminated.
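A small sketch of decoding that length-prefixed reply follows. parse_pid_reply is a hypothetical name, the sample bytes are invented, and decoding the string as ASCII is my assumption.

```python
REPORT_PID = b"\x04"  # the whole request is this single command byte

def parse_pid_reply(reply):
    """Extract the length-prefixed, non-null-terminated pid string."""
    if not reply:
        raise ValueError("empty reply")
    length = reply[0]
    pid = reply[1:1 + length]
    if len(pid) != length:
        raise ValueError("reply is shorter than its length byte claims")
    return pid.decode("ascii")  # ASCII is an assumption

pid = parse_pid_reply(b"\x041234")  # made-up reply: length 4, then "1234"
```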

That's all for this EGD client protocol. More in-depth knowledge of the workings of the EGD can be found in egd.pl. If you have some experience with Perl, the code is actually quite self-documenting. :-P

Reuters NewsML Syndication Integration

The manual offered by Reuters explains how to write Java code to inform your web application of new stories from Reuters and how to extract the information from the NewsML files for your web site. This blog post will illustrate another way of integration by scanning the syndication folder for the NewsML files. The alternate method can be implemented in various programming languages such as Python, Perl, PHP or Ruby.

Introduction

Reuters Online Syndication offers a suite of ready-to-publish news stories for subscription by corporate customers. These news stories are categorized by geographical region and by interest, such as financial and sports. The syndication is not the same as the RSS feeds that you see on reuters.com; Reuters Online Syndication will push the latest breaking news, associated graphics and award-winning pictures to your site.

The Syndication Software

Your Reuters representative will pass you a URL or CD which contains the syndication software and your license key. The syndication software should be in a zipped or jar file, and will resemble something like this.

1. Unzip the zipped/jar file. Then create a license.key file in the same folder as the siclone.config. The first line of the license.key will contain the license key.

2. Now create a shell script in the same folder as the license.key

#!/bin/sh
PATH=.:$PATH
finish-upgrade.sh
launch-program.sh

With this shell script, the syndication software can be run from the command line or as a service without the need for a graphical user interface. You can adapt the shell script into a batch file for MS-DOS/Windows.

3. Execute the shell script and when the syndication software is running, you should see a bunch of folders being created. Examples of these new folders are OLGBWORLD_iptc and OLUSPOLITIC_iptc. You will also see some XML and JPG files in these new folders. The XML files are the NewsML files which contain the Reuters stories while the JPG are the pictures and photos for some of the stories.

The folders or news channels are controlled by Reuters, so if you need more news channels, you need to call your Reuters representative. Each folder corresponds to one news channel. When the syndication software is running, it will write its status into siclone.config.

NewsML Files

Reuters has a good showcase for the NewsML format. The showcase contains some example NewsML packages which you can download to get a feel for the key attributes in the NewsML files. You can also try out the NewsML toolkit from SourceForge.

By now, you can start coding your own application in your favourite programming language to scan the news folders at regular intervals for NewsML files. Once the syndication software has finished writing a NewsML file and closed it, the file can be deleted from its folder without affecting the syndication software. Occasionally, you may come across files with filenames ending in .tmp. These are temporary files that the syndication software is using to dump the NewsML contents. Your application should ignore these temporary files.
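As a starting point, one pass of such a folder scan could look like the sketch below. It assumes each news channel is a direct subfolder of the syndication folder and that completed NewsML files end in .xml; find_newsml_files and the file names in the demonstration are made up.

```python
import os
import tempfile

def find_newsml_files(root):
    """Return sorted paths of the .xml files in each channel folder."""
    found = []
    for channel in sorted(os.listdir(root)):
        channel_dir = os.path.join(root, channel)
        if not os.path.isdir(channel_dir):
            continue
        for name in sorted(os.listdir(channel_dir)):
            # Only .xml files are picked up, so .tmp files that are still
            # being written (and the .jpg pictures) are ignored.
            if name.lower().endswith(".xml"):
                found.append(os.path.join(channel_dir, name))
    return found

# Demonstration on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as root:
    channel = os.path.join(root, "OLGBWORLD_iptc")
    os.mkdir(channel)
    for name in ("story1.xml", "story2.tmp", "photo1.jpg"):
        open(os.path.join(channel, name), "w").close()
    names = [os.path.basename(p) for p in find_newsml_files(root)]
```

Your application would call this at regular intervals, process each file returned, and then delete or move it so that it is not processed twice.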

The NewsML file contains numerous attributes, which can be quite daunting initially. For starters, you can use an XML parser library that is available in your programming language to parse out these attributes first:

  • DateTime
  • Role
  • USN
  • Slug
  • Headline
  • body

If you need to understand the classification for the story, you may also want to parse out the IPTCSubjectCodes and N2000 attributes as well. For more information on these attributes, call your Reuters representative :-)
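As a rough sketch of that first parsing step, the helper below looks up an element by name anywhere in the document, ignoring namespaces and letter case. The SAMPLE document is a made-up stand-in, not a real Reuters NewsML file; real files nest these values more deeply and the exact element names may differ, so treat the lookup names as assumptions to check against your own files.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def first_text(root, name):
    """Return the text of the first element called `name`, or None."""
    wanted = name.lower()
    for elem in root.iter():
        if local_name(elem.tag).lower() == wanted:
            # itertext() also collects text from nested child elements.
            return "".join(elem.itertext()).strip()
    return None

# Made-up stand-in document, NOT the real NewsML structure.
SAMPLE = """<Story>
  <DateTime>20080101T120000Z</DateTime>
  <Slug>DEMO-STORY</Slug>
  <Headline>Demo headline</Headline>
  <body><p>First paragraph.</p></body>
</Story>"""

root = ET.fromstring(SAMPLE)
fields = {name: first_text(root, name)
          for name in ("DateTime", "USN", "Slug", "Headline", "body")}
```

Names that are not found, such as USN in this sample, come back as None, which makes it easy to spot which lookups need adjusting.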

All the contents (nodes) within the <body>...</body> tag-pair will be the story that can be used to display to your audience. It is a well-formed XHTML document, so it can be included directly in any web page. In some stories, you will also find some RICs that are embedded within the body. You can replace these RICs using some regular expressions or convert them to valid URLs that link to the pages displaying the financial information for these codes.

This blog post has illustrated an alternative method of integrating with the Reuters syndication software which allows you to write your own NewsML processing routines in a non-Java language. It has also highlighted some NewsML attributes which you can look at first to start setting up your news site.

Use of Software Platforms (or Frameworks)

The software industry has been moving towards the adoption of software platforms in its products over the years. I don't mean the platforms that come with SDKs, like .NET, the C++ STL, or the JDK. These are "essentials" to the languages that they support. Without them, these languages would be lying in software graveyards by now. What I'm referring to are platforms that bring benefits beyond language-based support for common coding tasks. These platforms generally provide an abstraction layer on top of the underlying programming interface to give developers a more usable interface to work with. Examples:

  • Qt from Trolltech for cross-platform development on Windows, Linux, etc
  • Django, TurboGears, Pylons, Flex (and many more) for rapid web 2.0 application development
  • Microsoft Foundation Classes (MFC), Cocoa (OS X), KDE, Gnome for GUI development

In general, coding on these platforms usually leads to shorter development cycles (once you have learnt how to use them effectively). The more "utilities" these platforms provide, the less code you have to write, and the more you can concentrate on your application logic. However, there are certain trade-offs that come with the use of these platforms.

Performance

From my experience, any time saved during development will cost you more time at runtime. O-R mapping for database objects, cross-platform coding, etc. save you from having to deal with low-level issues directly. However, the platform will consume extra milliseconds doing all this work for you, and eventually these milliseconds will add up and snowball into a significant performance penalty.

Now, the trick is to find out at which load this snowballed penalty becomes so severe that it starts affecting the usability of your application. If you are lucky, your users might just put it down to network connectivity issues and wait patiently. Otherwise, you can always argue that it's a happy problem, that the performance issue is due to wide public acceptance, and request more funding to buy more servers or more hardware.

Testing

When you adopt a software platform for use, you also assume the responsibilities to test the platform as well. In cases of bugs, you may also need to find workarounds or to work with the platform developers to fix these bugs. You should also consider situations when the platform developers may not be responsive or simply ignore your pleas for the bugs to be fixed and the platform source code may not be available for you to perform the debugging.

Installation

My advice: do your users a favor and choose a platform that is easy to install. You will end up giving yourself fewer headaches from troubleshooting with your users why your application installs but gives runtime errors due to failed dependencies. (This is a common issue in Python unless you are using setuptools/easy_install to auto-install the dependencies.)

For open-source platforms, take extra care to mirror your dependencies on your own server. Good open-source projects release frequently, but sometimes new releases deprecate functions or change their signatures, and end up breaking your application after an upgrade. Another issue to take note of is that open-source projects can die out due to unforeseen circumstances, and the software simply disappears from the Internet.

In times of Bad Luck...

What if you have made a bad choice and selected the wrong platform, only to discover that you need a more advanced feature? Sometimes, requirements can change drastically and end up invalidating the fundamental assumptions you made when you decided that the selected platform was the best in terms of development cost and time.

This happened to me once when I used Django for a project, only to discover later that I needed to use vertical partitioning to re-model my data models because not all the data resided in a single database. Pylons or TurboGears with SQLAlchemy can resolve this issue, but not Django, as its O-R engine is built on the assumption that you need only one database for all your models.

Conclusion

To conclude, software platforms are useful and handy to work with, but there are cons which you need to evaluate carefully before deciding on their use. Otherwise, you may find yourself in sticky situations that are really difficult to resolve without a full rewrite.

Why Software Development Has Always Been Difficult?

I had this interesting debate with one of my NS buddies who is now working as a software project manager. He's a "born" programmer who has a good number of years in software development. Both of us lamented over the fact that good software developers are hard to find nowadays, and outsourcing is not a healthy long-term solution for our local IT environment.

As the conversation progressed, we slowly touched on the question of why software development has always been difficult. The following summarizes some of our arguments.

1. Required Skill Set

Programmers are often faced with this question: "What do I need to know to develop software?" My answer is to learn what you need to know in order to develop the software that you are going to develop. There are far too many dimensions to consider.

If we look at technical skills, for programming languages, the commonly used languages are C, C++, Java, Delphi (Pascal), Python, Perl, C#, Ruby, Tcl, ActionScript, JavaScript. Wikipedia has a good list of programming languages. Next, you look at frameworks, which may also depend on what platform you are going to develop for. In Windows, Visual .Net is popular; for cross-platform development, you have Qt, wxWindows, J2EE, etc. For web applications, there are even more: ASP.NET, ColdFusion, AJAX, Django, TurboGears, etc.

Another skill to consider is how deeply you are familiar with a software development methodology. Most academics will advise against waterfall, and promote iterative models. Once again, Wikipedia has listed the common ones. Personally, I would prefer an architecture-centric approach. A good IT practitioner must have gone through at least one full SDLC, which usually consists of these phases: requirements analysis, design, development, testing and deployment. Every phase by itself will require some level of expertise which (I'm confident to say) can only be acquired through experience and exposure.

2. Software Writing and Design is a Creative Process

In my opinion, software development is a creative process. A system architect can dictate the high-level architecture design and how the components interface with each other. But when it comes to coding, different programmers will produce different code that fulfills the same requirements, but may work and break at different points. Furthermore, it is also difficult to determine how correct each piece of code is; we can easily verify that the code is working, but it's often harder to verify where the code breaks and throws exceptions. And the quality of the testing process depends on how skillful the tester is.

3. Software Development Discipline is still a New Field

Software development only began sixty years ago. Compare this to building architecture, which has almost a thousand years behind it. The software development industry has few proven precedents and little real-world experience to draw on and to pattern itself after.

4. Difficult to Model Human/Business Process into Software

Even though we have 4GLs (fourth-generation languages) which closely resemble English, it is still fundamentally difficult to program a computer to do useful things. For an application with a decent scope, you are looking at converting a good number of business processes and many human interactions into highly specific machine instructions. And most humans don't think in 1s and 0s; there are always a good many nested levels of "if we have this condition then we do this else we do that." Most business processes require some level of human intervention to apply in-house rules and modifications before proceeding to the next stage.

5. Programmers are often not Domain Experts

Most schools train their students to do programming and to acquire a basic understanding of how systems work. Once these students graduate and step into the working world, they can be expected to undertake programming jobs which may involve a body of knowledge that they are totally ignorant of. You can engage a domain expert to help out in the analysis, to straighten out all these issues and pen them down in the specifications, that is, if you know how to ask the intelligent questions that will bring out these issues in the first place. Most domain experts assume that their domain knowledge is "common sense" and expect you to be aware of it as well.

Conclusion

Software programming is an extremely difficult, demanding and torturous process that most people would rather not get involved in. We have been talking about the need to improve and guarantee software quality for more than 40 years, and we still can't find a solution. For now, software programming will remain a mind-boggling process that will continue to plague the industry for a good number of years.