Less Known Solaris Features
Contents
I. Introduction
1. The genesis of LKSF
    1.1. How it started
    1.2. The scope
    1.3. The disclaimer
    1.4. Credits and Kudos
    1.5. Credits
2. The guide to LKSF
    2.1. Solaris Administration
        2.1.1. Liveupgrade
        2.1.2. Boot environments based on ZFS snapshots
        2.1.3. Working with the Service Management Facility
        2.1.4. Solaris Resource Manager
        2.1.5. /home? /export/home? AutoFS?
        2.1.6. lockfs
    2.2. Solaris Security
        2.2.1. Role Based Access Control and Least Privileges
        2.2.2. The Solaris Security Toolkit
        2.2.3. Auditing
        2.2.4. Basic Audit Reporting Tool
        2.2.5. IPsec
        2.2.6. On Passwords
        2.2.7. Signed binaries
    2.3. Networking
        2.3.1. Crossbow
        2.3.2. IPMP
        2.3.3. kssl
    2.4. Storage
        2.4.1. fssnap - snapshots for UFS
        2.4.2. iSCSI
        2.4.3. Remote Mirroring with the Availability Suite
        2.4.4. Point-in-Time Copy with the Availability Suite
        2.4.5. SamFS - the Storage Archive Manager File System
    2.5. Solaris Administrators Toolbox
        2.5.1. fuser
        2.5.2. pfiles
        2.5.3. Installing Solaris Packages directly via Web
        2.5.4. About crashes and cores
    2.6. Nontechnical feature
        2.6.1. Long support cycles
4. Boot environments based on ZFS snapshots
    4.1. Using snapshots for boot environments
    4.2. A practical example
    4.3. Conclusion
5. Working with the Service Management Facility
    5.1. Introduction
        5.1.1. init.d
        5.1.2. Service Management Facility
    5.2. The foundations of SMF
        5.2.1. Service and Service Instance
        5.2.2. Milestone
        5.2.3. Fault Manager Resource Identifier
        5.2.4. Service Model
        5.2.5. Transient service
        5.2.6. Standalone model
        5.2.7. Contract service
        5.2.8. A short digression: Contracts
        5.2.9. Service State
        5.2.10. Service Configuration Repository
        5.2.11. Dependencies
        5.2.12. Master Restarter Daemon and Delegated Restarter
        5.2.13. Delegated Restarter for inetd services
        5.2.14. Enough theory
    5.3. Working with SMF
        5.3.1. What's running on the system
        5.3.2. Starting and stopping a service
        5.3.3. Automatic restarting of a service
        5.3.4. Obtaining the configuration of a service
        5.3.5. Dependencies
    5.4. Developing for SMF
        5.4.1. Prerequisites
        5.4.2. Preparing the server
        5.4.3. Preparing the client
        5.4.4. Before working with SMF itself
        5.4.5. The Manifest
        5.4.6. The exec methods script - general considerations
        5.4.7. Implementing an exec method script
        5.4.8. Installation of the new service
        5.4.9. Testing it
    5.5. Conclusion
    5.6. Do you want to learn more?
6. Solaris Resource Manager
    6.1. Why do you need Resource Management?
    6.2. Definitions
    6.3. The basic idea of Solaris Resource Management
    6.4. How to work with projects and tasks
    6.5. A practical example
    6.6. Why do I need all this stuff?
    6.7. Limiting operating environment resources
    6.8. Limiting CPU resources
        6.8.1. Without Resource Management
        6.8.2. Using the Fair Share Scheduler
        6.8.3. Shares
        6.8.4. Behavior of processes with Resource Management
    6.9. Limiting memory resources
        6.9.1. Without memory resource management
        6.9.2. With memory resource management
    6.10. Resource Management and SMF
        6.10.1. Assigning a project to an already running service
        6.10.2. Configuring the project in an SMF manifest
    6.11. Conclusion
    6.12. Do you want to learn more?
7. /home? /export/home? AutoFS?
    7.1. History
    7.2. The use case
    7.3. Prerequisites
    7.4. Creating users and home directories
    7.5. Configuring the automounter
    7.6. Testing the configuration
    7.7. Explanation for the separated /home and /export/home
    7.8. The /net directory
    7.9. Do you want to learn more?
8. lockfs
    8.1. Types of Locks
    8.2. Write Lock
    8.3. Delete lock
    8.4. Conclusion
    8.5. Do you want to learn more?
9. CacheFS
    9.1. Introduction
    9.2. History of the feature
    9.3. CacheFS in theory
    9.4. A basic example
        9.4.1. Preparations
        9.4.2. Mounting a filesystem via CacheFS
        9.4.3. Statistics about the cache
    9.5. The cache
    9.6. On-demand consistency checking with CacheFS
    9.7. A practical usecase
    9.8. The CacheFS feature in future Solaris Development
    9.9. Conclusion
        9.9.1. Do you want to learn more?
10. The curious case of /tmp in Solaris
    10.1. tmpfs and its usage
    10.2. Configuring the maximum size of the tmpfs
    10.3. Conclusion
    10.4. Do you want to learn more?
    13.6. More auditing
    13.7. Want to learn more?
14. Basic Audit Reporting Tool
    14.1. Usage
    14.2. Want to learn more?
15. IPsec
    15.1. The secrets of root
    15.2. Foundations
    15.3. IPsec in Solaris
    15.4. Example
    15.5. Prepare the installation
    15.6. Configuration of IPsec
    15.7. Activate the configuration
    15.8. Check the configuration
    15.9. Do you want to learn more?
16. Signed binaries
17. On passwords
    17.1. Using stronger password hashing
        17.1.1. Changing the default hash mechanism
    17.2. Password policies
        17.2.1. Specifying a password policy
        17.2.2. Using wordlists
    17.3. Conclusion
        17.3.1. Do you want to learn more?
18. pfexec
    18.1. Delegating Administration Tasks
    18.2. Granting Root Capabilities to Regular Users
    18.3. An important piece of advice
    18.4. Conclusion
IV. Networking
19. Crossbow
    19.1. Introduction
    19.2. Virtualisation
        19.2.1. A simple network
    19.3. Bandwidth Limiting
        19.3.1. Demo environment
        19.3.2. The rationale for bandwidth limiting
        19.3.3. Configuring bandwidth limiting
    19.4. Accounting
20. IP Multipathing
    20.1. The bridges at SuperUser Castle
    20.2. Introduction
        20.2.1. Where should I start?
        20.2.2. Basic Concept of IP Multipathing
        20.2.3. Link based vs. probe based failure/repair detection
        20.2.4. Failure/Repair detection time
        20.2.5. IPMP vs. Link aggregation
    20.3. Loadspreading
        20.3.1. Classic IPMP vs. new IPMP
    20.4. in.mpathd
    20.5. Prerequisites
    20.6. New IPMP
        20.6.1. Link based failure detection
        20.6.2. Probe based failure detection
        20.6.3. Making the configuration boot persistent
        20.6.4. Using IPMP and Link Aggregation
        20.6.5. Monitoring the actions of IPMP in your logfiles
    20.7. Classic IPMP
        20.7.1. Prerequisites
        20.7.2. Link based classic IPMP
        20.7.3. Probe based classic IPMP
        20.7.4. Making the configuration boot persistent
    20.8. Classic and new IPMP compared
    20.9. Tips, Tricks and other comments
        20.9.1. Reducing the address sprawl of probe based failure detection
        20.9.2. Explicitly configuring target systems
        20.9.3. Migration of the classic IPMP configuration
        20.9.4. Setting a shorter or longer Failure detection time
    20.10. Conclusion
    20.11. Do you want to learn more?
21. Boot persistent routes
    21.1. Introduction
    21.2. Configuration
    21.3. Do you want to learn more?
22. kssl - an in-kernel SSL proxy
    22.1. The reasons for SSL in the kernel
    22.2. Configuration
    22.3. Conclusion
    22.4. Do you want to learn more?
V. Storage
23. fssnap - snapshots for UFS
    23.1. fssnap
    23.2. A practical example
    23.3. Conclusion
    23.4. Do you want to learn more?
24. Legacy userland iSCSI Target
    24.1. Introduction
    24.2. The jargon of iSCSI
    24.3. The architecture of iSCSI
    24.4. Simple iSCSI
        24.4.1. Environment
        24.4.2. Prerequisites
        24.4.3. Configuring the iSCSI Target
        24.4.4. Configuring the iSCSI initiator
        24.4.5. Using the iSCSI device
    24.5. Bidirectional authenticated iSCSI
        24.5.1. Prerequisites
        24.5.2. Configuring the initiator
        24.5.3. Configuring the target
        24.5.4. Configuration of bidirectional configuration
        24.5.5. Reactivation of the zpool
    24.6. Alternative backing stores for iSCSI volumes
        24.6.1. File based iSCSI target
        24.6.2. Thin-provisioned target backing store
    24.7. Conclusion
    24.8. Do you want to learn more?
25. COMSTAR iSCSI Target
    25.1. Why does COMSTAR need a different administrative model?
    25.2. Prerequisites
    25.3. Preparing the target system
    25.4. Configuring an iSCSI target
    25.5. Configuring the initiator without authentication
    25.6. Configuring the initiator with authentication
    25.7. Conclusion
    25.8. Do you want to learn more?
26. Remote Mirroring with the Availability Suite
    26.1. Introduction
    26.2. Implementation of the replication
    26.3. Wording
    26.4. Synchronous Replication
    26.5. Asynchronous Replication
    26.6. Choosing the correct mode
    26.7. Synchronization
    26.8. Logging
    26.9. Prerequisites for this tutorial
        26.9.1. Layout of the disks
        26.9.2. Size for the bitmap volume
        26.9.3. Usage of the devices in our example
    26.10. Setting up a synchronous replication
    26.11. Testing the replication
        26.11.1. Disaster test
    26.12. Asynchronous replication and replication groups
        26.12.1. The problem
        26.12.2. Replication Group
        26.12.3. How to set up a replication group?
    26.13. Deleting the replication configuration
    26.14. Truck based replication
        26.14.1. The math behind the phrase
        26.14.2. Truck based replication with AVS
        26.14.3. On our old server
        26.14.4. On our new server
        26.14.5. Testing the migration
    26.15. Conclusion
    26.16. Do you want to learn more?
27. Point-in-Time Copy with the Availability Suite
    27.1. Introduction
    27.2. Basics
        27.2.1. Availability Suite
        27.2.2. The jargon of Point in Time Copies with AVS
        27.2.3. Types of copies
    27.3. Independent copy
        27.3.1. Deeper dive
        27.3.2. Advantages and Disadvantages
    27.4. Dependent Copy
        27.4.1. Deeper dive
    27.5. Advantages and Disadvantages
    27.6. Compact dependent copy
        27.6.1. Deeper dive
        27.6.2. Advantages and Disadvantages
    27.7. Preparation of the test environment
        27.7.1. Disklayout
        27.7.2. Calculation of the bitmap volume size for independent and dependent shadows
        27.7.3. Calculation of the bitmap volume size for compact dependent shadows
        27.7.4. Preparing the disks
    27.8. Starting a Point-in-time copy
        27.8.1. Common prerequisite
        27.8.2. Create an independent copy
        27.8.3. Create an independent copy
        27.8.4. Create a compact independent copy
    27.9. Working with point-in-time copies
    27.10. Disaster Recovery with Point-in-time copies
    27.11. Administration
        27.11.1. Deleting a point-in-time copy configuration
        27.11.2. Forcing a full copy resync of a point-in-time copy
        27.11.3. Grouping point-in-time copies
    27.12. Conclusion
    27.13. Do you want to learn more?
28. SamFS - the Storage Archive Manager FileSystem
    28.1. Introduction
    28.2. The theory of Hierarchical Storage Management
        28.2.1. First Observation: Data access pattern
        28.2.2. Second observation: The price of storage
        28.2.3. Third observation: Capacity
        28.2.4. Hierarchical Storage Management
        28.2.5. An analogy in computer hardware
        28.2.6. SamFS
    28.3. The jargon of SamFS
        28.3.1. Lifecycle
        28.3.2. Policies
        28.3.3. Archiving
        28.3.4. Releasing
        28.3.5. Staging
        28.3.6. Recycling
        28.3.7. The circle of life
        28.3.8. Watermarks
        28.3.9. The SamFS filesystem: Archive media
    28.4. Installation of SamFS
        28.4.1. Obtaining the binaries
        28.4.2. Installing the SamFS packages
        28.4.3. Installing the SamFS Filesystem Manager
        28.4.4. Modifying the profile
    28.5. The first Sam filesystem
        28.5.1. Prerequisites
        28.5.2. The configuration itself
    28.6. Using disk archiving
        28.6.1. Prerequisites
        28.6.2. Configuring the archiver
    28.7. Working with SamFS
        28.7.1. Looking up SamFS specific metadata
        28.7.2. Manually forcing the release
        28.7.3. Manually forcing the staging of a file
    28.8. Usecases and future directions
        28.8.1. Unconventional Usecases
        28.8.2. Future directions and ideas
    28.9. Conclusion
    28.10. Do you want to learn more?
32. About crashes and cores
    32.1. A plea for the panic
    32.2. Difference between Crash Dumps and Cores
    32.3. Forcing dumps
        32.3.1. Forcing a core dump
        32.3.2. Forcing a crash dump
    32.4. Controlling the behaviour of the dump facilities
        32.4.1. Crash dumps
        32.4.2. Core dumps
        32.4.3. Core dump configuration for the normal user
    32.5. Crashdump analysis for beginners
        32.5.1. Basic analysis of a crash dump with mdb
        32.5.2. A practical usecase
    32.6. Conclusion
    32.7. Do you want to learn more?
33.Jumpstart Enterprise Toolkit 33.1. Automated Installation . . . . . . . . . . . . . . . . . . . . 33.2. About Jumpstart . . . . . . . . . . . . . . . . . . . . . . . 33.2.1. The Jumpstart mechanism for PXE based x86 . . . 33.3. Jumpstart Server . . . . . . . . . . . . . . . . . . . . . . . 33.3.1. Development . . . . . . . . . . . . . . . . . . . . . 33.4. Control Files for the automatic installation . . . . . . . . . 33.4.1. rules . . . . . . . . . . . . . . . . . . . . . . . . . . 33.4.2. prole . . . . . . . . . . . . . . . . . . . . . . . . . 33.4.3. The sysidcfg le . . . . . . . . . . . . . . . . . . . . 33.5. Jumpstart FLASH . . . . . . . . . . . . . . . . . . . . . . 33.5.1. Full Flash Archives . . . . . . . . . . . . . . . . . . 33.5.2. Dierential Flash Archives . . . . . . . . . . . . . . 33.5.3. Challenges of Jumpstart Flash for System Recovery 33.6. About the Jumpstart Enterprise Toolkit . . . . . . . . . . 33.6.1. The basic idea behind JET . . . . . . . . . . . . . . 33.6.2. Additional features of JET . . . . . . . . . . . . . . 33.7. Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . 33.7.1. Systems . . . . . . . . . . . . . . . . . . . . . . . . 33.8. Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.9. Installation of JET . . . . . . . . . . . . . . . . . . . . . . 33.9.1. Preparation of the system . . . . . . . . . . . . . . 33.9.2. The installation . . . . . . . . . . . . . . . . . . . . 33.10. reparations for out rst installation . . . . . . . . . . . . P 33.10.1.From a mounted DVD media . . . . . . . . . . . . 33.10.2.From a .iso le . . . . . . . . . . . . . . . . . . . . 33.10.3.Looking up the existing Solaris versions . . . . . . . 33.11. basic automated installation . . . . . . . . . . . . . . . . A 33.11.1.The template for the install . . . . . . . . . . . . .
33.11.2. The generated Jumpstart configuration files
33.11.3. The installation boot
33.12. A basic automated installation - more polished
33.12.1. Adding the recommended patch cluster
33.12.2. Adding custom packages
33.12.3. Extending the template
33.12.4. The installation
33.12.5. Effects of the new modules
33.13. Automatic mirroring of harddisks
33.13.1. Configuration in the template
33.13.2. Effects of the configuration
33.14. Automatic hardening
33.14.1. Preparing the Jumpstart for installation
33.14.2. Configuring the template
33.14.3. After Jumpstarting
33.15. Deep Dive to the installation with JET
33.15.1. Post installation scripts
33.15.2. An example for boot levels and postinstall scripts
33.15.3. The end of the post installation
33.16. Using Jumpstart Flash
33.16.1. Creating a flash archive
33.16.2. Preparing the template
33.16.3. While Jumpstarting
33.17. Using Jumpstart Flash for System Recovery
33.17.1. The basic trick
33.17.2. Using an augmented Flash archive
33.18. Conclusion
33.18.1. Do you want to learn more?
VII. Nontechnical feature
35. Long support cycles
35.1. The support cycle
35.2. An example: Solaris 8
35.3. Sidenote
35.4. Do you want to learn more
Part I. Introduction
2.1.6. lockfs
Sometimes you have to ensure that a file system doesn't change while you're working on it. The lockfs command lets you lock a file system for exactly this purpose. You will learn more about this function in section 8 on page 87.
2.2.3. Auditing
What happens on your system? When did a user use which command? When did a user delete a particular file? You need log files to answer these questions. The auditing functionality in Solaris generates them and reports on a vast number of actions happening on your system. The configuration of this feature is explained in section 13 on page 135.
2.2.5. IPsec
Secure communication between hosts is getting more and more important. Secure communication does not only mean encrypted traffic. It also includes the authentication of your communication partner. Solaris has had an IPsec implementation for a number of versions. The configuration of IPsec is described in section 15 on page 143.
2.2.6. On Passwords
It is important to have good and secure passwords. All other security systems are rendered worthless without good keys to the systems. Solaris has some features that help the administrator enforce good passwords. Section ?? on page ?? describes this feature.
2.3. Networking
2.3.1. Crossbow
Project Crossbow resulted in a new IP stack for Solaris. It solves challenges like the question of how to load network interfaces in the 10GbE age, and it introduces an integrated layer for network virtualization. Some interesting features of Crossbow and their configuration are described in section 19 on page 167.
2.3.2. IPMP
Solaris provides a mature mechanism to ensure the availability of the network connection. This feature is called IP Multipathing (IPMP for short). It has been in Solaris for several versions now and it's easy to use. A description of the configuration of new and classic IPMP is available in section 20 on page 187.
2.3.3. kssl
Solaris 10 got an interesting feature that enables SSL for any service by adding a transparent SSL proxy in front of it. This proxy runs completely in kernel space and yields better performance than a solution in user space. Section 22 on page 227 explains how you enable kssl.
2.4. Storage
2.4.1. fssnap - snapshots for UFS
File system backups can be faulty. When they take a long time, the file system has different content at the beginning and at the end of the backup, thus the backups are inconsistent. A solution to this problem is freezing the file system. fssnap delivers this capability to UFS. Section 23 describes this feature. The tutorial starts on page 232.
2.4.2. iSCSI
With the increasing transfer speed of Ethernet it becomes more and more feasible to use this medium to connect block devices such as disks to a server. Since Update 4, Solaris 10 has built-in functionality to act as an iSCSI initiator and target. The configuration of iSCSI is the topic of section 25 on page 248.
2.5.2. pfiles
Many people install lsof on their system as they know it from Linux. But you have a similar tool in Solaris. In section 30 on page 334 you will find a short tip for its usage.
3. Liveupgrade
Solaris 10/Opensolaris
environment and an empty slice or disk (the symbol with the thick lines is the active boot environment).
Figure 3.1.: Live Upgrade: Situation before start
Now you create an alternate boot environment. It's a copy of the current boot environment. The system still runs on the current environment.
Figure 3.2.: Live Upgrade: Creating the alternate boot environment
The trick is: the update and patch processes don't work on the current boot environment; they use this alternate but inactive boot environment. The running boot environment isn't touched at all. After the update completes you have a still-running boot environment and a fully patched and updated alternate boot environment. Now the boot environments swap their roles with a single command and a single reboot. After the role swap the old system stays untouched. So, whatever happens with your new installation, you can fall back to your old system. In case you see problems with your new configuration, you switch back the boot environments and you run with your old operating environment.
Figure 3.4.: Live Upgrade: Switching alternate and actual boot environment
separate those filesystems without a longer service interruption. Bringing the system down, moving files around and booting it up is not an option for a production system. Moving while running isn't a good idea. Live Upgrade is a nice but simple solution for this problem: Live Upgrade replicates the boot environments by doing a file system copy. The filesystem layout of the old boot environment and the new environment doesn't have to be the same. Thus you can create a filesystem layout with a bigger /, a smaller /export/home and a separate /var. And the best is: the system runs while doing these steps. In my example I will start with an operating system on a single partition. The partition is located on /dev/dsk/c0d0s0 and has a size of 15 GB.
/ on /dev/dsk/c0d0s0 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1980000 on Mon Feb 11 21:06:02 2008
At installation time I've created some additional slices, c0d0s3 to c0d0s6. Each of the slices has a size of 10 GB. Separating the single-slice install into multiple slices is nothing more than using Live Upgrade without upgrading. At first I create the alternate boot environment:
# lucreate -c "sx78" -m /:/dev/dsk/c0d0s3:ufs -m /usr:/dev/dsk/c0d0s4:ufs -m /var:/dev/dsk/c0d0s5:ufs -n "sx78_restructured"
Discovering physical storage devices
[..]
Populating contents of mount point </>.
Populating contents of mount point </usr>.
Populating contents of mount point </var>.
[..]
Creation of boot environment <sx78_restructured> successful.
We've successfully created a copy of the current boot environment. But we told the mechanism to put / on c0d0s3, /usr on c0d0s4 and /var on c0d0s5. As this was the first run of Live Upgrade on this system, the naming of the environment is more important than on later runs. Before this first run, the boot environment has no name. But you need a name to tell the process which environment should be activated, patched or updated. Okay, my current environment runs Solaris Express CE build 78, thus I've called it sx78. The lucreate command assigns this name to the current environment. My new boot environment has the name "sx78_restructured" for obvious reasons. Okay, now you have to activate the alternate boot environment.
# luactivate sx78_restructured
Saving latest GRUB loader.
Generating partition and slice information for ABE <sx78_restructured>
Boot menu exists.
Generating direct boot menu entries for ABE.
Generating direct boot menu entries for PBE.
[..]
Modifying boot archive service
GRUB menu is on device: </dev/dsk/c0d0s0>.
Filesystem type for menu device: <ufs>.
Activation of boot environment <sx78_restructured> successful.
Now we have to reboot the system. Just use init or shutdown. If you use any other command to reboot the system, Live Upgrade will not switch to the new environment:
# init 6
Okay, this takes a minute. But let's have a look at the mount table after the boot.
# mount
/ on /dev/dsk/c0d0s3 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1980003 on Tue Feb 12 05:52:50 2008
[...]
/usr on /dev/dsk/c0d0s4 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1980004 on Tue Feb 12 05:52:50 2008
[...]
/var on /dev/dsk/c0d0s5 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1980005 on Tue Feb 12 05:53:12 2008
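If you want to check such a mount table from a script, you can pull the device for a mount point out of the output. This is only a minimal sketch, assuming the "mount point on device options on date" line format shown above. mount_device is a hypothetical helper name, not a Solaris command, and the sample lines are shortened versions of the listing.

```shell
# mount_device MOUNTPOINT: read Solaris-style `mount` output on stdin and
# print the device that backs MOUNTPOINT (hypothetical helper).
mount_device() {
    awk -v mp="$1" '$1 == mp && $2 == "on" { print $3; exit }'
}

# Sample lines shortened from the listing above (mount options abbreviated).
sample='/ on /dev/dsk/c0d0s3 read/write/setuid on Tue Feb 12 05:52:50 2008
/usr on /dev/dsk/c0d0s4 read/write/setuid on Tue Feb 12 05:52:50 2008
/var on /dev/dsk/c0d0s5 read/write/setuid on Tue Feb 12 05:53:12 2008'

printf '%s\n' "$sample" | mount_device /usr
```

On a live system you would feed it the real output, for example with mount | mount_device /var.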
Mission accomplished. Okay, but we want to use Live Upgrade for upgrading later. Switch back to your old environment:
# luactivate sx78
Boot the system. And you are back on your old single-slice installation on c0d0s0:
/ on /dev/dsk/c0d0s0 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1980000 on Mon Feb 12 06:06:02 2008
You don't have to rename it; you could just use the old name. But why should you confuse your fellow admins by calling your Build 81 boot environment sx78_restructured? Okay, now start the upgrade. My installation DVD was mounted under /cdrom/sol_11_x86 by Solaris and I want to upgrade the sx81 boot environment. This will take a while. Do this overnight, go shopping or play with your children. Your system is still running and the process will not touch your running installation:
# luupgrade -u -n sx81 -s /cdrom/sol_11_x86
Copying failsafe kernel from media.
Uncompressing miniroot
[...]
The Solaris upgrade of the boot environment <sx81> is complete.
Installing failsafe
Failsafe install is complete.
Okay. Let's check /etc/release before booting into the new environment:
# cat /etc/release
Solaris Express Community Edition snv_78 X86
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 20 November 2007
# luactivate sx81
Saving latest GRUB loader.
Generating partition and slice information for ABE <sx81>
Boot menu exists.
Generating direct boot menu entries for ABE.
Generating direct boot menu entries for PBE.
[...]
Modifying boot archive service
GRUB menu is on device: </dev/dsk/c0d0s0>.
Filesystem type for menu device: <ufs>.
Activation of boot environment <sx81> successful.
Wait a minute, log in to the system and let's have a look at /etc/release again:
bash-3.2$ cat /etc/release
Solaris Express Community Edition snv_81 X86
Copyright 2008 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 15 January 2008
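Comparing the two /etc/release listings by eye works, but the check is easy to script. The following is a sketch under the assumption that the build tag always has the form snv_NN, as in the two listings above. release_build is a hypothetical helper name.

```shell
# release_build: read the text of /etc/release on stdin and print the first
# build tag of the form snv_NN (hypothetical helper).
release_build() {
    sed -n 's/.*\(snv_[0-9][0-9]*\).*/\1/p' | head -n 1
}

# First lines of the two listings shown above.
old='Solaris Express Community Edition snv_78 X86'
new='Solaris Express Community Edition snv_81 X86'

old_build=$(printf '%s\n' "$old" | release_build)
new_build=$(printf '%s\n' "$new" | release_build)

if [ "$old_build" != "$new_build" ]; then
    echo "build changed from $old_build to $new_build"
fi
```

On a running system you would call release_build < /etc/release instead of using the sample strings.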
By the way, the system runs on the three separate slices now:
/ on /dev/dsk/c0d0s3 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1980003 on Tue Feb 12 07:22:32 2008
[..]
/usr on /dev/dsk/c0d0s4 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1980004 on Tue Feb 12 07:22:32 2008
[..]
/var on /dev/dsk/c0d0s5 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1980005 on Tue Feb 12 07:22:54 2008
Neat, isn't it?
This mirrors the actual state in your ZFS pools. You will find filesystems named accordingly.
NAME USED AVAIL REFER MOUNTPOINT
rpool G 142G 56.5K /rpool
rpool@install K 55K
rpool/ROOT G 142G 18K /rpool/ROOT
rpool/ROOT@install 0 18K
rpool/ROOT/opensolaris M 142G 2.23G legacy
rpool/ROOT/opensolaris-1 G 142G 2.24G legacy
rpool/ROOT/opensolaris-1@install M - 2.22G
rpool/ROOT/opensolaris-1@static:-:2008-04-29-17:59:13 M - 2.23G
rpool/ROOT/opensolaris-1/opt M 142G 3.60M /opt
rpool/ROOT/opensolaris-1/opt@install 0 - 3.60M
rpool/ROOT/opensolaris-1/opt@static:-:2008-04-29-17:59:13 0 - 3.60M
rpool/ROOT/opensolaris/opt 0 142G 3.60M /opt
rpool/export M 142G 19K /export
rpool/export@install K 19K
rpool/export/home M 142G 18.9M /export/home
rpool/export/home@install K 21K -
18.9 15 18.9 18
After doing some configuration, you can create a boot environment called opensolaris-baseline. It's really easy. You just have to create a new boot environment:
# beadm create -e opensolaris-1 opensolaris-baseline
We will not work with this environment. We use it as a baseline, as a last resort when we destroy our running environment. To run the system we will create another boot environment:
# beadm create -e opensolaris-1 opensolaris-work
jmoekamp@glamdring:~# beadm list
BE Name               Active  Active on reboot  Mountpoint  Space Used
-------               ------  ----------------  ----------  ----------
opensolaris-baseline  no      no                -           53.5K
opensolaris-1         yes     yes               legacy      2.31G
opensolaris           no      no                -           62.72M
opensolaris-work      no      no                -           53.5K
You will see that the opensolaris-1 snapshot is still active, but that the opensolaris-work boot environment will be active at the next reboot. Okay, now reboot:
jmoekamp@glamdring:~# beadm list
BE Name               Active  Active on reboot  Mountpoint  Space Used
-------               ------  ----------------  ----------  ----------
opensolaris-baseline  no      no                -           53.5K
opensolaris-1         no      no                -           54.39M
opensolaris           no      no                -           62.72M
opensolaris-work      yes     yes               legacy      2.36G
Okay, you see that the boot environment opensolaris-work is now active and that it's activated for the next reboot (until you activate another boot environment). Now we can reboot the system again. GRUB comes up and it will default to the opensolaris-work environment. Please remember at which position you find opensolaris-baseline
in the boot menu. You need this position in a few moments. After a few seconds, you can log into the system and work with it. Now let's drop the atomic bomb of administrative mishaps on your system. Log in to your system, assume the root role and do the following stuff:
# cd /
# rm -rf *
You know what happens. Depending on how fast you are able to interrupt this run, you will end up somewhere between a slightly damaged system and a system fscked up beyond any recognition. Normally the system would send you to the tapes now. But remember: you have some alternate boot environments. Reboot the system and wait for GRUB. You may have garbled output, so it's hard to read the output from GRUB. Choose opensolaris-baseline. The system will boot up quite normally. You need a terminal window now. How you get such a terminal window depends on the damage incurred. The boot environment snapshots don't cover the home directories, so you may not have a home directory any more. I will assume this for this example: you can get a terminal window by clicking on Options, then Change Session, and choosing Failsafe Terminal there. Okay, log in via the graphical login manager; an xterm will appear. At first we delete the defunct boot environment:
# beadm destroy opensolaris-work
Are you sure you want to destroy opensolaris-work? This action cannot be undone (y/[n]): y
Okay, now we clone the opensolaris-baseline environment to form a new opensolaris-work environment.
# beadm create -e opensolaris-baseline opensolaris-work
Now, check if you still have a home directory for your user:
# ls -l /export/home/jmoekamp
/export/home/jmoekamp: No such file or directory
# mkdir -p /export/home/jmoekamp
# chown jmoekamp:staff /export/home/jmoekamp
Wait a few moments. The system starts up. GRUB defaults to opensolaris-work and the system starts up normally, without any problems, in the condition that the system had when you created the opensolaris-baseline boot environment.
# beadm list
BE Name               Active  Active on reboot  Mountpoint  Space Used
-------               ------  ----------------  ----------  ----------
opensolaris-baseline  no      no                -           3.18M
opensolaris-1         no      no                -           54.42M
opensolaris           no      no                -           62.72M
opensolaris-work      yes     yes               legacy      2.36G
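Before you reboot, it can be handy to ask a script which boot environment will come up next. This is a small sketch, assuming whitespace-separated columns in the order name, active, active on reboot, mountpoint and space, and boot environment names without spaces. next_boot_be is a hypothetical helper name, and the sample rows follow the listing above.

```shell
# next_boot_be: read `beadm list` output on stdin and print the name of the
# boot environment whose "Active on reboot" column says yes.
# Assumes the column order name, active, active-on-reboot, mountpoint, space.
next_boot_be() {
    awk '$3 == "yes" { print $1; exit }'
}

# Sample rows following the listing above (header lines left out).
sample='opensolaris-baseline no no - 3.18M
opensolaris-1 no no - 54.42M
opensolaris no no - 62.72M
opensolaris-work yes yes legacy 2.36G'

printf '%s\n' "$sample" | next_boot_be
```

On a live system this would be used as beadm list | next_boot_be.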
Obviously you may have to recover the directories holding your own data. It's best practice to make snapshots of these directories on a regular schedule, so that you can simply promote a snapshot to recover a good version of the directory.
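A regular snapshot of the home directories could be scripted roughly like this. It's a sketch only: the dataset name rpool/export/home is taken from the zfs listing earlier in this chapter, the timestamp format is my own choice, and snapshot_name is a hypothetical helper. The script only prints the zfs snapshot command instead of running it.

```shell
# snapshot_name DATASET STAMP: build the dataset@stamp name that
# `zfs snapshot` expects (hypothetical helper).
snapshot_name() {
    printf '%s@%s\n' "$1" "$2"
}

# The timestamp format is an assumption; any unique suffix would do.
stamp=$(date +%Y-%m-%d-%H:%M:%S)
name=$(snapshot_name rpool/export/home "$stamp")

# Dry run: only print the command. Remove the echo to really take the snapshot.
echo zfs snapshot "$name"
```

Called from cron, such a script gives you a series of snapshots. zfs rollback reverts a file system to one of them, discarding the changes made after that snapshot.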
4.3. Conclusion
You see, recovering from a disaster in a minute or two is a really neat feature. Snapshotting opens a completely new way to recover from errors. Unlike with Live Upgrade, you don't need extra disks or extra partitions and, as ZFS snapshots are really fast, creating alternate boot environments on ZFS is extremely fast as well. At the moment this feature is available on OpenSolaris 2008.05 only. With future updates it will find its way into Solaris as well.
5.1. Introduction
The Service Management Facility is a fairly new feature. But sometimes I have the impression that its most used feature is the capability to run old legacy init.d scripts. Once you use SMF with all its capabilities, however, you see an extremely powerful concept.
5.1.1. init.d
For a long time, the de facto standard for starting up services was the init.d construct. This concept is based on startup scripts. Depending on their parameters they start, stop or restart a service. The definition of runlevels (what has to be started at a certain stage of booting) and the sequencing is done by linking these startup scripts into a certain directory and by the naming of the links. This mechanism worked quite well, but it has some disadvantages. You can't define dependencies between the services. You emulate the dependencies by sorting the links, but that's more of a kludge than a solution. Furthermore the init.d scripts run only once. When the service stops, the system has no means to start it again (you have to log in to the system and restart it by using the init.d script directly or by using other automatic methods). With init.d a service (like httpd on port 80) is just a consequence of running scripts, not a configurable entity in itself.
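To make the init.d idea concrete, here is a sketch of such a startup script. The daemon name mydaemon and the function name service_ctl are made up for illustration, and the script only prints what it would do instead of starting anything.

```shell
#!/bin/sh
# Sketch of an init.d-style script: the action arrives as the first argument.
# "mydaemon" is a placeholder, not a real service.
service_ctl() {
    case "$1" in
        start)   echo "starting mydaemon" ;;
        stop)    echo "stopping mydaemon" ;;
        restart) service_ctl stop; service_ctl start ;;
        *)       echo "Usage: $0 {start|stop|restart}" >&2; return 1 ;;
    esac
}

# Default to "start" when no argument is given, so the sketch runs as is.
service_ctl "${1:-start}"
```

Note what is missing: nothing in the script says which other services have to run first, and nothing restarts the daemon when it dies. The ordering lives only in the names of the links.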
needed for other services? What is the status of a service? Should I restart another service (e.g. a database) to circumvent problems in another service (an old web application, for example)? Okay, an expert has the knowledge to do such tasks manually ... but do you want to wake up at night just to restart this fscking old application? The concepts of SMF enable the admin to put this knowledge into a machine-readable format, thus the machine can act accordingly. This knowledge about services makes SMF a powerful tool to manage services on your system. SMF enables the system to:
- start, restart and stop services according to their dependencies
- start up much faster as a result, as services are started in parallel when possible
- restart a service when it fails
- delegate tasks like starting, stopping and configuring services to non-root users
- and much more
The following tutorial wants to give you some insights into SMF. Have fun!
5.2.2. Milestone
A milestone is somewhat similar to the old notion of a runlevel. With milestones you can group certain services. Thus you don't have to list each service when configuring the dependencies; you can use a matching milestone containing all the needed services. Furthermore you can force the system to boot to a certain milestone. For example, booting a system into single user mode is implemented by defining a single-user milestone. When booting into single user mode, the system just starts the services of this milestone. The milestone itself is implemented as a special kind of service. It's an anchor point for dependencies and a simplification for the admin. Furthermore some of the milestones, including single-user, multi-user and multi-user-server, contain methods to execute the legacy scripts in rc*.d.
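As milestones are services, they have ordinary service names. A small sketch of how the short names used above map to full names and how you would hand one to svcadm milestone. The helper milestone_fmri is hypothetical, and the command is only printed here, because running it would change the state of the running system.

```shell
# milestone_fmri NAME: expand a short milestone name to its full service name
# (hypothetical helper).
milestone_fmri() {
    printf 'svc:/milestone/%s:default\n' "$1"
}

# Dry run: print the command instead of changing the running system.
echo svcadm milestone "$(milestone_fmri single-user)"
```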
Table 5.1.: Events of the contract subsystem
Event         Description
empty         the last process in the contract has exited
process exit  a process in the process contract has exited
core          a member process dumped core
signal        a member process received a fatal signal from outside the contract
hwerr         a member process has a fatal hardware error
794
With the -c option ptree prints the contract IDs of the processes. In our example, the sendmail processes run under the contract ID 107. With ctstat we can look up the contents of this contract:
# ctstat -vi 107
CTID    ZONEID  TYPE    STATE   HOLDER  EVENTS  QTIME   NTIME
107     0       process owned   7       0       -       -
        cookie:                 0x20
        informative event set:  none
        critical event set:     hwerr empty
        fatal event set:        none
        parameter set:          inherit regent
        member processes:       792 794
        inherited contracts:    none
Contract 107 runs in the global zone. It's a process contract and it was created by process number 7 (svc.startd). There weren't any events so far. The contract subsystem should only throw critical events when the processes terminate due to hardware errors and when no processes are left. At the moment there are two processes under the control of the contract subsystem (the two processes of the sendmail daemon). Let's play around with the contracts:
# ptree -c `pgrep sendmail`
[process contract 1]
  1 /sbin/init
    [process contract 4]
      7 /lib/svc/bin/svc.startd
        [process contract 99]
          705 /usr/lib/sendmail -bd -q15m -C /etc/mail/local.cf
          707 /usr/lib/sendmail -Ac -q15m
Okay, open a second terminal window to your system and kill both sendmail processes:
# kill 705 707
After we submit the kill, the contract subsystem reacts and sends an event saying that there are no processes left in the contract.
# ctwatch 99
CTID    EVID    CRIT    ACK     CTTYPE  SUMMARY
99      25      crit    no      process contract empty
Besides ctwatch there was another listener to the event: SMF. Let's look for the sendmail processes again.
# ptree -c `pgrep sendmail`
[process contract 1]
  1 /sbin/init
    [process contract 4]
      7 /lib/svc/bin/svc.startd
        [process contract 103]
          776 /usr/lib/sendmail -bd -q15m -C /etc/mail/local.cf
          777 /usr/lib/sendmail -Ac -q15m
Et voila, two new sendmail processes with different process IDs and a different process contract ID. SMF has done its job by restarting sendmail. To summarize: SMF uses the contracts to monitor the processes of a service. Based on the contract events SMF can take action. By default, SMF stops and restarts a service when any member of the contract dumps core, gets a signal or dies due to a hardware failure. Additionally SMF does the same when there's no member process left in the contract.
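The default reaction described above can be written down as a tiny decision function. This is only a toy model of the text, not how svc.startd is implemented. The function name smf_restarts_on is made up, and the event names are the short names from table 5.1.

```shell
# smf_restarts_on EVENT: succeed if the text above says that SMF restarts the
# service on this contract event by default (toy model, hypothetical name).
smf_restarts_on() {
    case "$1" in
        core|signal|hwerr|empty) return 0 ;;
        *) return 1 ;;
    esac
}

for ev in empty exit core; do
    if smf_restarts_on "$ev"; then
        echo "$ev: restart"
    else
        echo "$ev: no restart"
    fi
done
```

The exit of a single member process is not in the list. That is why sendmail was only restarted after both processes were gone and the contract was empty.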
disabled
online
offline
maintenance
legacy run
Each service under the control of SMF has a service state throughout its whole lifetime on the system.
5.2.11. Dependencies
The most important feature of SMF is its knowledge about dependencies. In SMF you can define two kinds of dependencies in a service:
- which services this service depends on
- the services that depend on this service
This second way to define a dependency has a big advantage. Let's assume you have a new service. You want to start it before another service. But you don't want to change the other service itself (perhaps you need this service only in one special configuration and the normal installation doesn't need your new service ... perhaps it's the authentication daemon for a hyper-special networking connection ;)). By defining that another service depends on your service, you don't have to change the other one. I will show you how to look up the dependencies in the practical part of this tutorial.
This is only a short snippet of the output. The output of this command is 105 lines long on my system. But you can see services in several service states in it. For example, I hadn't enabled xvm on my system (it makes no sense, as this Solaris is already virtualized), and the smb server is still online. Let's look at a certain service:
# svcs name-service-cache
STATE          STIME    FMRI
online         10:08:01 svc:/system/name-service-cache:default
The output is separated into three columns. The first shows the service state, the second the time of the last start of the service. The last one shows the exact name of the service.
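As the output has a fixed column layout, it's easy to use from scripts. A minimal sketch, assuming the three columns STATE, STIME and FMRI described above. service_state is a hypothetical helper name and the sample is the listing from above.

```shell
# service_state FMRI: read `svcs` output on stdin and print the state of the
# service with exactly this FMRI (hypothetical helper).
service_state() {
    awk -v fmri="$1" '$3 == fmri { print $1; exit }'
}

# Sample taken from the listing above.
sample='STATE          STIME    FMRI
online         10:08:01 svc:/system/name-service-cache:default'

printf '%s\n' "$sample" | service_state svc:/system/name-service-cache:default
```

On a live system you would pipe the output of svcs into the function. The svcs options -H (no header line) and -o (choose the columns) make such parsing even simpler.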
# svcs sendmail
STATE          STIME    FMRI
online         10:23:19 svc:/network/smtp:sendmail
Okay, a few days later we realize that we need the sendmail service on the system. No problem, we enable it again:
# svcadm enable sendmail
# svcs sendmail
STATE          STIME    FMRI
online         10:25:30 svc:/network/smtp:sendmail
The service runs again. Okay, we want to restart the service. This is quite simple, too:
# svcadm restart sendmail
# svcs sendmail
STATE          STIME    FMRI
online*        10:25:55 svc:/network/smtp:sendmail
# svcs sendmail
STATE          STIME    FMRI
online         10:26:04 svc:/network/smtp:sendmail
Did you notice the change in the STIME column? The service has restarted. By the way: STIME doesn't stand for start time. It's a short form of State Time. It shows when the current state of the service was entered. Okay, now let's do some damage to the system. We move the config file for sendmail, the glorious sendmail.cf, the source of many major depressions among sysadmins.
# mv /etc/mail/sendmail.cf /etc/mail/sendmail.cf.old
# svcadm restart sendmail
# svcs sendmail
STATE          STIME    FMRI
offline        10:27:09 svc:/network/smtp:sendmail
Okay, the service went into the offline state. Offline? At first, the maintenance state would look more sensible. But let's have a look at some diagnostic information. With svcs -x you can print out fault messages regarding services.
# svcs -x
svc:/network/smtp:sendmail (sendmail SMTP mail transfer agent)
 State: offline since Sun Feb 24 10:27:09 2008
Reason: Dependency file://localhost/etc/mail/sendmail.cf is absent.
   See: http://sun.com/msg/SMF-8000-E2
   See: sendmail(1M)
   See: /var/svc/log/network-smtp:sendmail.log
Impact: This service is not running.
SMF didn't even try to start the service. There is a dependency implicit to the service.
# svcprop sendmail
config/value_authorization astring solaris.smf.value.sendmail
config/local_only boolean true
config-file/entities fmri file://localhost/etc/mail/sendmail.cf
config-file/grouping astring require_all
config-file/restart_on astring refresh
config-file/type astring path
[..]
The service configuration for sendmail defines a dependency on the config file /etc/mail/sendmail.cf. Do you remember the definition of the service states? A service stays in the offline state until all dependencies are fulfilled. We renamed the file, so the dependency isn't fulfilled. The restart correctly leads to the offline state. Okay, we repair the damage:
# mv /etc/mail/sendmail.cf.old /etc/mail/sendmail.cf
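A file dependency like the one in the svcprop output can also be checked by hand. The following sketch turns the file://localhost/... name into a path and tests whether the file exists. Both function names are hypothetical, and the check is of course much simpler than what SMF does internally.

```shell
# fmri_to_path FMRI: strip the file://localhost prefix from a file dependency.
fmri_to_path() {
    printf '%s\n' "${1#file://localhost}"
}

# check_file_dep FMRI: report whether the file behind the dependency exists.
check_file_dep() {
    path=$(fmri_to_path "$1")
    if [ -e "$path" ]; then
        echo "present: $path"
    else
        echo "absent: $path"
    fi
}

check_file_dep file://localhost/etc/mail/sendmail.cf
```

On the system from this example the check reports the file as absent after the first mv and as present again after the repair.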
# svcs sendmail
STATE          STIME
online         10:33:54
# pkill "sendmail"
# svcs sendmail
STATE          STIME
online         10:38:24
SMF restarted the daemon automatically, as you can see from the STIME column.
5.3.5. Dependencies
But how do I find out the dependencies between services? The svcs command comes to help: the -d switch shows you all services on which the service depends. In this example we check this for the ssh daemon.
# svcs -d ssh
STATE          STIME    FMRI
disabled        8:58:07 svc:/network/physical:nwam
online          8:58:14 svc:/network/loopback:default
online          8:58:25 svc:/network/physical:default
online          8:59:32 svc:/system/cryptosvc:default
online          8:59:55 svc:/system/filesystem/local:default
online          9:00:12 svc:/system/utmp:default
online          9:00:12 svc:/system/filesystem/autofs:default
To check what services depend on ssh, you can use the -D switch:
# svcs -D ssh
STATE          STIME    FMRI
online          9:00:22 svc:/milestone/multi-user-server:default
There is no further service depending on ssh. But the milestone multi-user-server depends on ssh. As long as ssh can't be started successfully, the multi-user-server milestone can't be reached.
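When a service doesn't come up, the dependency list is a good place to look first. A small sketch that filters the output of svcs -d for dependencies that are not online. deps_not_online is a hypothetical helper name, and the sample lines are taken from the ssh listing above.

```shell
# deps_not_online: read `svcs -d` output on stdin (header line included) and
# print the FMRI of every dependency whose state is not "online".
deps_not_online() {
    awk 'NR > 1 && $1 != "online" { print $3 }'
}

# Sample lines taken from the ssh listing above.
sample='STATE          STIME    FMRI
disabled        8:58:07 svc:/network/physical:nwam
online          8:58:14 svc:/network/loopback:default
online          8:58:25 svc:/network/physical:default'

printf '%s\n' "$sample" | deps_not_online
```

A hit is not automatically a problem. As the listing shows, ssh was running although one of the listed services was disabled, so take the result as a hint where to look, not as a verdict.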
5.4.1. Prerequisites
A good source for this program is Blastwave (http://www.blastwave.org). Please install the packages tun and openvpn. I want to show a running example, thus we have to do some work. It will be just a simple static shared key configuration, as this is an SMF tutorial, not one for OpenVPN. We will use theoden and gandalf again. gandalf will be the server, theoden the client.
10.211.55.201 gandalf
10.211.55.200 theoden
http://www.blastwave.org
Okay, and now let's hack the startup script....? Wrong! SMF can do many tasks for you, but this needs careful planning. You should answer some questions for yourself:
1. What variables turn a generic description of a service into a specific server?
2. How do I start the process? How do I stop it? How can I force the process to reload its config?
3. Which services are my dependencies? Which services depend on my new service?
4. How should the service react in the case of a failed dependency?
5. What should happen in the case of a failure in the new service?
Okay, let's answer these questions for our OpenVPN service. The variables for our OpenVPN client are the hostname of the remote host and the local and remote IP of the VPN tunnel. Besides this, the filename of the secret key and the tunnel device may differ, thus it would be nice to keep them configurable. Starting openvpn is easy. We just have to start the openvpn daemon with some command line parameters. We stop the service by killing the process. And a refresh is done by stopping and starting the service.
We clearly need networking to use a VPN service. But networking isn't just bringing up the network cards. You need the name services, for example. To make things easier, the service doesn't check for every networking service to be up and running. We just define a dependency on the network milestone. As it makes no sense to connect to a server without a network, it looks like a sensible choice to stop the service in case of failed networking. Furthermore it seems a good choice to restart the service when the networking configuration has changed. Perhaps we modified the configuration of the name services and the name of the OpenVPN server resolves to a different IP. What should happen in the case of an exiting OpenVPN daemon? Of course it should be started again. Okay, now we can start with coding the scripts and XML files.
In this example, we define some simple dependencies. As I wrote before: without networking a VPN is quite useless, thus the OpenVPN service depends on the network milestone having been reached.
<dependency name='network' grouping='require_all' restart_on='none' type='service'>
    <service_fmri value='svc:/milestone/network:default'/>
</dependency>
In this part of the manifest we define the exec method to start the service. We use a script to start the service. The %m is a variable. It will be substituted with the name of the called action. In this example it would be expanded to /lib/svc/method/openvpn start.
<exec_method type='method' name='start' exec='/lib/svc/method/openvpn %m' timeout_seconds='2'/>
Okay, we can stop OpenVPN simply by sending a SIGTERM signal to it. Thus we can use an automagical exec method. If you use :kill, SMF will kill all processes in the current contract of the service.
<exec_method type='method' name='stop' exec=':kill' timeout_seconds='2'>
</exec_method>
Okay, thus far we've only defined the service. Let's define an instance of it. We call the instance theoden2gandalf for obvious reasons. The service should run with root privileges. After this we define the properties of this service instance, like the remote host or the file with the secret key.
<instance name='theoden2gandalf' enabled='false'>
    <method_context>
        <method_credential user='root' group='root'/>
    </method_context>
    <property_group name='openvpn' type='application'>
        <propval name='remotehost' type='astring' value='gandalf'/>
        <propval name='secret' type='astring' value='/etc/openvpn/static.key'/>
        <propval name='tunnel_local_ip' type='astring' value='172.16.1.2'/>
        <propval name='tunnel_remote_ip' type='astring' value='172.16.1.1'/>
        <propval name='tunneldevice' type='astring' value='tun'/>
    </property_group>
</instance>
<stability value='Evolving'/>
<template>
    <common_name>
        <loctext xml:lang='C'>OpenVPN</loctext>
    </common_name>
    <documentation>
        <manpage title='openvpn' section='1'/>
        <doc_link name='openvpn.org' uri='http://openvpn.org'/>
    </documentation>
</template>
</service>
</service_bundle>
Here the svcprop command comes to help:
# svcprop -p openvpn/remotehost svc:/application/network/openvpn:theoden2gandalf
gandalf
With a little bit of shell scripting we can use these properties to start our processes.
#!/bin/sh

. /lib/svc/share/smf_include.sh

# Read a single property of this service instance from the SMF repository
getproparg() {
    val=`svcprop -p $1 $SMF_FMRI`
    [ -n "$val" ] && echo $val
}

if [ -z "$SMF_FMRI" ]; then
    echo "SMF framework variables are not initialized."
    exit $SMF_EXIT_ERR
fi

OPENVPNBIN=/opt/csw/sbin/openvpn
REMOTEHOST=`getproparg openvpn/remotehost`
SECRET=`getproparg openvpn/secret`
TUN_LOCAL=`getproparg openvpn/tunnel_local_ip`
TUN_REMOTE=`getproparg openvpn/tunnel_remote_ip`
DEVICETYPE=`getproparg openvpn/tunneldevice`

# Refuse to start with an incomplete configuration
if [ -z "$REMOTEHOST" ]; then
    echo "openvpn/remotehost property not set"
    exit $SMF_EXIT_ERR_CONFIG
fi

if [ -z "$SECRET" ]; then
    echo "openvpn/secret property not set"
    exit $SMF_EXIT_ERR_CONFIG
fi

if [ -z "$TUN_LOCAL" ]; then
    echo "openvpn/tunnel_local_ip property not set"
    exit $SMF_EXIT_ERR_CONFIG
fi

if [ -z "$TUN_REMOTE" ]; then
    echo "openvpn/tunnel_remote_ip property not set"
    exit $SMF_EXIT_ERR_CONFIG
fi

if [ -z "$DEVICETYPE" ]; then
    echo "openvpn/tunneldevice property not set"
    exit $SMF_EXIT_ERR_CONFIG
fi

case "$1" in
'start')
    $OPENVPNBIN --daemon --remote $REMOTEHOST --secret $SECRET --ifconfig $TUN_LOCAL $TUN_REMOTE --dev $DEVICETYPE
    ;;
'stop')
    echo "not implemented"
    ;;
'refresh')
    echo "not implemented"
    ;;
*)
    echo "Usage: $0 { start | refresh }"
    exit 1
    ;;
esac

exit $SMF_EXIT_OK
After this step you have to import the manifest into the Service Configuration Repository:
# svccfg validate /export/home/jmoekamp/openvpn.xml
# svccfg import /home/jmoekamp/openvpn.xml
5.4.9. Testing it
Let's test our brand new service:
# ping 172.16.1.2
^C
The OpenVPN service isn't enabled. Thus there is no tunnel, and the ping doesn't get through. Now we enable the service and test it again.
# svcadm enable openvpn:theoden2gandalf
# ping 172.16.1.2
172.16.1.2 is alive
Voila ... SMF has started our brand new service. When we look into the list of services, we will find it:
# svcs openvpn:theoden2gandalf
STATE          STIME    FMRI
online         18:39:15 svc:/application/network/openvpn:theoden2gandalf
When we look into the process table, we will find the corresponding process:
# /usr/ucb/ps -auxwww | grep "openvpn" | grep -v "grep"
root 1588 0.0 0.5 4488 1488 ? S 18:39:15 0:00 /opt/csw/sbin/openvpn --daemon --remote gandalf --secret /etc/openvpn/static.key --ifconfig 172.16.1.2 172.16.1.1 --dev tun
Okay, we don't need the tunnel any longer after a few days, thus we disable it:
# svcadm disable openvpn:theoden2gandalf
# /usr/ucb/ps -auxwww | grep "openvpn" | grep -v "grep"
#
No process left.
5.5. Conclusion
Okay, I hope I was able to give you some insights into the Service Management Facility. It's a mighty tool and this article only scratched the surface of the topic. But there are several excellent resources out there.
Resource Management is a rather old feature in Solaris, although it gained more mind share when it became a really important part of Solaris Zones. But this tutorial doesn't focus on the usage of Resource Management in conjunction with the zone configuration. I want to go back to the basics, because in the end the zone configuration just uses these facilities to control the resource consumption of zones. Secondly, you can use this knowledge to limit resource usage within a zone itself. Resource Management was introduced to solve one important question: you can run multiple programs and services at once in a single instance of the operating system, but how do you limit the resource consumption of a single application? How do you prevent a single program from consuming all the resources, leaving nothing for the others? Resource Management in Solaris solves this class of problems.
6.2. Definitions
Okay, as usual, this technology has its own jargon. So I have to define some of it first:
Tasks: A task is a group of processes. For example, when you log into a system and do some work, all the steps you've done are one task until you log out or open a new task. Another example would be a webserver. It consists of a multitude of processes, but they are all part of the same task. The database server on the same machine may have a completely different task id.
Projects: A project is a group of tasks. For example, a project for a web service could consist of the processes of the database task and the webserver task.
Zones: From the perspective of resource management, a Solaris Zone is just a group of one or more projects.
# su jmoekamp
# id -p
uid=100(jmoekamp) gid=1(other) projid=3(default)
When you assume the privileges of the root user you work as a member of the user.root project.
$ su root
Password:
# id -p
uid=0(root) gid=0(root) projid=1(user.root)
Let's have another look at the already running processes of your system.
# ps -ef -o pid,user,zone,project,taskid,args
  PID USER     ZONE   PROJECT   TASKID COMMAND
    0 root     global system         0 sched
[...]
  126 daemon   global system        22 /usr/lib/crypto/kcfd
  646 jmoekamp global default       73 /usr/lib/ssh/sshd
[...]
  413 root     global user.root     72 -sh
  647 jmoekamp global default       73 -sh
[...]
  655 root     global user.root     74 ps -ef -o pid,user,zone,project,taskid,args
  651 root     global user.root     74 sh
As you see from the output of ps, you already use some projects that exist by default on every Solaris system. And even the concept of tasks is in use right now. Whenever a user logs into a Solaris system, a new task is opened. Furthermore you will see that all your services started by the Service Management Facility have their own task id. SMF starts every service as a new task.
# ps -ef -o pid,user,zone,project,taskid,args | grep " 74 "
  653 root     global user.root     74 bash
  656 root     global user.root     74 ps -ef -o pid,user,zone,project,taskid,args
  657 root     global user.root     74 bash
  651 root     global user.root     74 sh
The projects are stored in a file by default, but you can use LDAP or NIS for a common project database on all your systems:
# cat /etc/project
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
A freshly installed system already has these predefined projects:

Table 6.1.: Factory-configured projects in Solaris

Project       Description
system        The system project is used for all system processes and daemons.
user.root     All root processes run in the user.root project.
noproject     The noproject project is a special project for IP Quality of Service. You can safely ignore it for this tutorial.
default       When there isn't a matching group, this is the catch-all. A user without an explicitly defined project is a member of this project.
group.staff   The group.staff project is used for all users in the group staff.
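Each line of /etc/project is a colon-separated record. The following is only a minimal sketch of how such a record can be read in a shell script. It assumes the field order documented in project(4) (name, id, comment, users, groups, attributes); the helper name projid_of and the embedded sample data are made up for illustration and are not part of Solaris:

```shell
#!/bin/sh
# Sample entries in the format of /etc/project, copied from the listing above.
# Assumed field order (project(4)): name:id:comment:users:groups:attributes
PROJECTS='system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::'

# projid_of NAME
# Hypothetical helper: prints the project id of NAME and returns 0,
# or prints nothing and returns 1 if there is no such project.
projid_of() {
    while IFS=: read -r name id rest; do
        if [ "$name" = "$1" ]; then
            echo "$id"
            return 0
        fi
    done <<EOF
$PROJECTS
EOF
    return 1
}

projid_of group.staff    # id of the project for the group staff
projid_of default        # id of the catch-all project
```

On a real system you would rather ask the system itself, for example with id -p as shown before, because the project database may live in NIS or LDAP instead of the local file.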
Okay, but how do we create our own projects? It's really easy:
[root@theoden:~]$ projadd -p 1000 testproject
[root@theoden:~]$ projmod -c "Testserver project" testproject
[root@theoden:~]$ projdel testproject
We created the project testproject with the project id 1000. Then we modified it by adding information to the comment field. After this, we deleted the project.
# useradd laura
Looks familiar? I've omitted setting the password here. But of course you have to set one. Now we create some projects for two classes:
# projadd class2005
# projadd class2006
Okay. These projects have no users. We assign the projects to our users.
# usermod -K project=class2005 alice
# usermod -K project=class2005 bob
# usermod -K project=class2006 mike
# usermod -K project=class2006 laura
Okay, let's su to the user alice and check the project assignment:
bash-3.2$ su alice
Password:
$ id -p
uid=2005(alice) gid=1(other) projid=100(class2005)
As configured, the user alice is assigned to the project class2005. When we look into the process table, we can check the fourth column. All processes owned by alice are assigned to the correct project.
# ps -ef -o pid,user,zone,project,taskid,args | grep "class2005"
  752 alice    global class2005     76 sh
  758 alice    global class2005     76 sleep 10
Okay, obviously our teachers want their own projects. You can configure this in two ways: you can configure it by hand, or create a project whose name begins with user. and ends with the username. We use the second method in this example.
# useradd einstein
# projadd user.einstein
# passwd einstein
New Password:
Re-enter new Password:
passwd: password successfully changed for einstein
Et voila! We don't have to assign the project explicitly. It's done automagically by the system. The logic behind the automatic assignment of a project to a user is simple:
1. If a project is defined for the user with the project user attribute, use the assigned project as the default for the user.
2. If it's not defined for the user, look for a project beginning with user. and ending with the name of the user and use it as the default project. For example: user.root or user.jmoekamp
3. If there is no such project, search for a project beginning with group. and ending with the name of the group of the user and use it as the default project. For example: group.staff
4. If there's no project with this name, use the project default as the default project.
But how do you create a task? You don't have to configure tasks! A task is created automatically by the system on certain events. Those events are: login, cron, newtask, setproject and su. Okay, let's start a sleep in a new task:
$ newtask sleep 10 &
761
Okay, when you look at the process table now, you will see the different taskid in the fifth column:
# ps -ef -o pid,user,zone,project,taskid,args | grep "class2005"
  752 alice    global class2005     76 sh
  761 alice    global class2005     77 sleep 10
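To recap the lookup order for the default project of a user described earlier (project attribute, then user.NAME, then group.GROUP, then default), here is a minimal sketch in shell. It only mimics the decision logic on plain strings; the function names and the way the data is passed in are assumptions for illustration, not how Solaris implements it:

```shell
#!/bin/sh
# has_project NAME LIST
# Returns 0 if NAME is one of the space-separated project names in LIST.
has_project() {
    case " $2 " in
        *" $1 "*) return 0 ;;
    esac
    return 1
}

# default_project USER GROUP ATTR PROJECTS
#   USER      login name
#   GROUP     primary group of the user
#   ATTR      value of the "project" user attribute ("" if not set)
#   PROJECTS  space-separated list of existing project names
# Hypothetical helper: prints the name of the resulting default project.
default_project() {
    if [ -n "$3" ]; then
        echo "$3"                      # 1. explicitly assigned project
    elif has_project "user.$1" "$4"; then
        echo "user.$1"                 # 2. user.<username>
    elif has_project "group.$2" "$4"; then
        echo "group.$2"                # 3. group.<groupname>
    else
        echo "default"                 # 4. catch-all
    fi
}

EXISTING="system user.root noproject default group.staff class2005 user.einstein"
default_project alice other class2005 "$EXISTING"
default_project einstein other "" "$EXISTING"
default_project laura other "" "$EXISTING"
```

The three calls correspond to a user with an explicit project attribute, a user with a matching user. project, and a user who falls through to the catch-all.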
But you can use the newtask command to assign a different project, too. First we create another project. We work at the Large Hadron Collider project, thus we call it lhcproject.
$ newtask -p lhcproject sleep 100 &
802
$ newtask: user "einstein" is not a member of project "lhcproject"
Hey, not so fast! You have to be a member of the project to add a task to it. Without this hurdle it would be too easy to use the resources of a different project ;) Okay, let's add the user einstein to the project lhcproject.
# projmod -U einstein lhcproject
I've stored this little script at /opt/bombs/forkbomb.pl. A few seconds after starting such a script, the system is toast because of the hundreds of forked processes. Don't try this without resource management. Okay, but this year, you've migrated to Solaris. You can impose resource management. Okay, we have to modify our project configuration:
# projmod -K "task.max-lwps=(privileged,10,deny)" class2005
Now we have configured a resource limit. A single task in the project class2005 can't have more than 10 lightweight processes (threads); any attempt to create an eleventh is denied. Okay, do you remember the reasons why the system starts a new task? One of them is login. Thus every login of a user gives him 10 threads to work with. And this is exactly the behavior we want. Let's assume Alice starts her forkbomb:
# ps -ef | grep "alice"
   alice   685   682   0 14:58:12 ?           0:00 /usr/lib/ssh/sshd
   alice   693   686  14 14:58:42 pts/1       0:38 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice   686   685   0 14:58:12 pts/1       0:00 -sh
   alice   694   693  15 14:58:42 pts/1       0:38 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice   695   694  14 14:58:42 pts/1       0:37 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice   696   695  14 14:58:42 pts/1       0:37 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice   697   696  14 14:58:42 pts/1       0:37 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice   698   697  14 14:58:42 pts/1       [...] /opt/bombs/forkbomb.pl
   alice   699   698  14 14:58:42 pts/1       [...] /opt/bombs/forkbomb.pl
# ps -ef | grep "alice" | wc -l
[...]
After forking away 7 forkbomb.pl processes, any further fork is denied by the system. The load of the system goes up (as there are hundreds of denied forks), but the system stays usable. Alice sends her script to Bob. He tries it, too:
# ps -ef | grep "alice"
   alice   685   682   0 14:58:12 ?           0:00 /usr/lib/ssh/sshd
   alice 28520 28519   6 15:15:08 pts/1       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice   686   685   0 14:58:12 pts/1       0:00 -sh
   alice 28521 28520   6 15:15:08 pts/1       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice 28519   686   6 15:15:08 pts/1       0:02 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice 28522 28521   6 15:15:08 pts/1       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice 28524 28523   6 15:15:08 pts/1       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice 28523 28522   6 15:15:08 pts/1       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
   alice 28525 28524   6 15:15:08 pts/1       0:02 /usr/bin/perl /opt/bombs/forkbomb.pl
# ps -ef | grep "bob"
     bob 28514 28511   0 15:14:47 ?           0:00 /usr/lib/ssh/sshd
     bob 28515 28514   0 15:14:47 pts/3       0:00 -sh
     bob  2789  2502   6 15:15:10 pts/3       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
     bob  2791  2790   6 15:15:10 pts/3       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
     bob  2502 28515   6 15:15:10 pts/3       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
     bob  2790  2789   6 15:15:10 pts/3       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
     bob  2792  2791   6 15:15:10 pts/3       0:03 /usr/bin/perl /opt/bombs/forkbomb.pl
     bob  2793  2792   6 15:15:10 pts/3       [...] /opt/bombs/forkbomb.pl
     bob  2794  2793   6 15:15:10 pts/3       [...] /opt/bombs/forkbomb.pl
This is still no problem for the system. After a few forks of the forkbomb, the system denies further forks. And the system stays usable. The limitation of the number of processes is only one example. You can limit other resources, too. You can find a list of all controls in the man page resource_controls(5).
After a few moments the system will stabilize at approx. 50% CPU resources for both processes. Just look at the first column:
bash-3.2$ ps -o pcpu,project,args
%CPU PROJECT       COMMAND
 0.0 user.einstein -sh
 0.3 user.einstein bash
47.3 user.einstein /usr/bin/perl /opt/bombs/cpuhog.pl
48.0 user.einstein /usr/bin/perl /opt/bombs/cpuhog.pl
 0.2 user.einstein ps -o pcpu,project,args
There are ways to enable this scheduler on a running system, but it's easier to reboot the system now. When the system has started, we get root privileges by using su. First we create an additional project for the SHC project. We created the other project (lhcproject) before:
# projadd shcproject
# projmod -U einstein shcproject
We've used the resource control project.cpu-shares. With this control we can assign an amount of CPU power to a project. We've defined a privileged limit, thus only root can change this limit later on.
6.8.3. Shares
Okay, what is the meaning of the numbers 150 and 50 in the last commands? Where are the 25% and the 75%? Well, resource management isn't configured with percentages, it's configured in a unit called shares. It's like with stocks. The person with the most stocks owns most of the company. The project with the most shares owns most of the CPU. In our example we divided the CPU into 200 shares. Every share represents 1/200 of the CPU. Project shcproject owns 50 shares. Project lhcproject owns 150. I think you already saw it: 150 is 75% of 200 and 50 is 25% of 200. Here we find the partitioning of the CPU we planned before.
By the way: I deliberately chose 150/50 instead of 75/25 to show you that these share definitions are not scaled in percent. Okay, but what happens when you add a third project and give it 200 shares (for example because a new project gave money for buying another processor board)? Then the percentages are different. In total you have 400 shares on the system. The 200 shares of the new project are 50%, thus the project gets 50% of the compute power. The 150 shares of the lhcproject are 37.5 percent, so this project gets 37.5 percent of the computing power. And the 50 shares of the shcproject are now 12.5 percent, and thus the project gets this part of the CPU power.
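The arithmetic behind the shares can be sketched in a few lines of shell. This is only an illustration of the calculation from the text (own shares divided by the sum of all shares); the helper name share_percent is made up, and it uses integer arithmetic, so the result is truncated to one decimal place:

```shell
#!/bin/sh
# share_percent SHARES TOTAL
# Hypothetical helper: prints the CPU entitlement in percent
# with one decimal place, using integer arithmetic only.
share_percent() {
    tenths=$(( $1 * 1000 / $2 ))
    printf '%d.%d\n' $(( tenths / 10 )) $(( tenths % 10 ))
}

# Shares from the text: lhcproject, shcproject and the new third project
total=0
for shares in 150 50 200; do
    total=$(( total + shares ))
done

for shares in 150 50 200; do
    echo "$shares of $total shares: $(share_percent $shares $total) percent"
done
```

Keep in mind that the shares only matter when projects actually compete for the CPU. A project that is alone on the system can use all of the compute power, regardless of its number of shares.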
Wait ... the process gets 95.9 percent? An error? No. It makes no sense to slow down the process when there is no other process needing the compute power. Now we start the second process, this time as a task in the lhcproject:
bash-3.2$ newtask -p lhcproject /opt/bombs/cpuhog.pl &
[2] 784
Voila, each of our compute processes gets its configured part of the compute power. It isn't exactly 75%/25% all the time, but on average the distribution will be this way. A few days later, the Dean of the department comes into the office and tells you that we need the results of the SHC project earlier, as important persons want to see them soon to spend more money. So you have to change the ratio of the shares. We can do this at runtime without restarting the processes. But as we've defined the limits as privileged before, we have to log in as root:
# prctl -n project.cpu-shares -r -v 150 -i project shcproject
# prctl -n project.cpu-shares -r -v 50 -i project lhcproject
The ratio has changed to the new settings. It's important to know that only the settings in /etc/project are boot-persistent. Everything you set via prctl is lost at the next boot.
sleep(30);
for ($i = 0; $i < 10; $i++) {
    push @_, "x" x (1*1024*1024);
    sleep(5);
}
When we start this script, it will allocate memory by pushing blocks of 1 Megabyte of x chars onto a stack.
# ps -ef -o pid,user,vsz,rss,project,args | grep "bob" | grep -v "grep"
 1015 bob   8728   892 class2005 /usr/lib/ssh/sshd
 1362 bob  26148 24256 class2005 /usr/bin/perl ./memoryhog.pl
 1016 bob   1624   184 class2005 -sh
 1031 bob   3260  1336 class2005 bash
When you look at the fourth column you see the resident set size. The resident set is the amount of data of a process in real memory. After a short moment the process ./memoryhog.pl uses 24 MB of our precious memory (20 times 1 MB plus the perl interpreter minus some shared stuff).
The rcapd daemon enforces resource caps on a group of processes. At the moment rcapd supports caps on projects and zones. When the resident set size of a group of processes exceeds the defined cap, the daemon reduces the resource consumption of the process group by paging out infrequently used pages of memory to swap. For testing purposes we configure the daemon in such a way that it scans every second for new processes in projects with a resource cap. Furthermore we configure it to sample the resident set size every second, too. Additionally the pageout statistics of rcapd will be updated every second.
# rcapadm -i scan=1,sample=1,report=1
For testing purposes we define a new project called mmgntcourse and add the user bob to this project:
Now we set a resource cap of 5 megabytes on the resident set size for this project:
# projmod -K rcap.max-rss=5242880 mmgntcourse
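The value of the cap in the command above is given in bytes: 5 megabytes are 5 * 1024 * 1024 = 5242880 bytes. As a small sketch, the conversion can be done in the shell. The helper mb_to_bytes is made up for this illustration, and the script only prints the projmod call instead of executing it:

```shell
#!/bin/sh
# mb_to_bytes MEGABYTES
# Hypothetical helper: converts megabytes (1 MB = 1024 * 1024 bytes) to bytes.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

cap=$(mb_to_bytes 5)

# Only print the command, so the sketch is harmless on any system.
echo "projmod -K rcap.max-rss=$cap mmgntcourse"
```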
Okay, now get back to the window with the login of user bob. Let's start memoryhog.pl in the mmgntcourse project:
$ newtask -p mmgntcourse ./memoryhog.pl
Okay, switch to a different window and log in as root. With rcapstat you can observe the activities of rcapd. In our example we tell rcapstat to print the statistics every 5 seconds:
# rcapstat 5
    id project         nproc    vm   rss   cap    at avgat    pg avgpg
[...]
   105 mmgntcourse         1 5424K 6408K 5120K 3380K 1126K 3380K 1126K
[...]
   105 mmgntcourse         1 9520K 6144K 5120K 1024K 1024K 1024K 1024K
   105 mmgntcourse         1   10M 5120K 5120K 1024K 1024K 1024K 1024K
   105 mmgntcourse         1   11M 6144K 5120K 1024K 1024K 1024K 1024K
   105 mmgntcourse         1   11M 4096K 5120K 1024K 1024K 1024K 1024K
[...]
As you see, the resident set size stays at approx. 5 megabytes. The RSS may rise above the configured size, as the applications may allocate memory between the RSS sampling intervals. But at the next sampling interval rcapd starts to force the page-out of the excess pages. After a few seconds the cap on the resident set size is enforced again. You can observe this behaviour in column 5 of the rcapstat printout.
http://www.opensolaris.org/os/community/smf/faq#33
At the moment, sendmail runs in the system project:
# ps -o pid,project,args -ef | grep "sendmail" | grep -v "grep"
  648 system /usr/lib/sendmail -Ac -q15m
  647 system /usr/lib/sendmail -bd -q15m -C /etc/mail/local.cf
How do you start it as a part of a different project? Okay, check for an already configured project.
# svcprop -p start/project smtp:sendmail
svcprop: Couldn't find property start/project for instance svc:/network/smtp:sendmail.
Okay, nothing defined ... this makes it a little bit harder, because at the moment we can't set the project alone, as there is a bug in the restarter daemon. You need a fully populated start method. If the svcprop run delivers the name of a project, you can ignore the next block of commands:
svccfg -s sendmail setprop start/user = astring: root
svccfg -s sendmail setprop start/group = astring: :default
svccfg -s sendmail setprop start/working_directory = astring: :default
svccfg -s sendmail setprop start/resource_pool = astring: :default
svccfg -s sendmail setprop start/supp_groups = astring: :default
svccfg -s sendmail setprop start/privileges = astring: :default
svccfg -s sendmail setprop start/limit_privileges = astring: :default
svccfg -s sendmail setprop start/use_profile = boolean: false
Okay, now we can set the project property of the start method:
svccfg -s smtp/sendmail setprop start/project = astring: sendmail
Now we have to refresh the configuration of the service. After refreshing the service we can check the property:
# svcadm refresh sendmail
# svcprop -p start/project sendmail
sendmail
Okay, the new properties are active. Now restart the service:
That's all. The important part is the second row of the fragment. We define the project as a part of the startup method context. The rest is done as described in the SMF tutorial.
6.11. Conclusion
Solaris Resource Management is a powerful tool to partition the resources of a single instance of an operating system. By using Solaris Resource Management you can use a single operating system for a multitude of services while still ensuring the availability of resources to them. I hope this tutorial gave you a good insight into the basics of resource management.
http://www.c0t0d0s0.org/archives/4147-Solaris-Features-Service-Management-Facility-Part-4-Develop html
7.1. History
The ever recurring question to me at customer sites relatively new to Solaris is: "Okay, on Linux I had my home directories at /home. Why are they at /export/home on Solaris?" This is old hat for seasoned admins, but I get this question quite often. Well, the answer is relatively simple and it comes from the time when we started to use NIS and NFS, and it had something to do with our slogan "The network is the computer", because it involves directories distributed in the network.
Okay, we have to go back 20 years into the past. There was a time, long, long ago, when you worked at your workstation. The hard disk in your workstation was big, and it was a time when you didn't need 200 megabytes for your office package alone. So you and your working group used workstations for storing their data, but there were several workstations and even some big servers for big computational tasks. The users wanted to share the data, so Sun invented NFS to share the files between the systems. As it was a tedious task to distribute all of the user accounts on all of the systems, Sun invented NIS (later NIS+, but this is another story). But the users didn't want to mount their home directories manually on every system. They wanted to log in to a system and work with their home directory on every system. They didn't want to search in separate places depending on whether they were using their own machine or a different one. So Sun invented the automounter. It found its way into SunOS 4.0 in 1988.
The automounter mounts directories on a system based upon a ruleset. In Solaris 2.0 and later the automounter was implemented as a pseudo filesystem called autofs. autofs was developed to mount directories based on rules defined in so-called maps. There are two of them provided by default. First there is /etc/auto_master. To cite the manual: The auto_master map associates a directory with a map. The map is a master list that specifies all the maps that autofs should check.
On a freshly installed system the file looks like this:
[root@gandalf:/net/theoden/tools/solaris]$ cat /etc/auto_master
+auto_master
/net    -hosts      -nosuid,nobrowse
/home   auto_home   -nobrowse
The file /etc/auto_home is such a map referenced by the master map. To cite the manual again: An indirect map uses a substitution value of a key to establish the association between a mount point on the client and a directory on the server. Indirect maps are useful for accessing specific file systems, such as home directories. The auto_home map is an example of an indirect map. We will use this map later in this article.
7.3. Prerequisites
First we have to export the directories which store the real home directories on both hosts via NFS. We start on gandalf:
[root@gandalf:/etc]$ echo "share -F nfs -d \"Home Directories\" /export/home" >> /etc/dfs/dfstab
[root@gandalf:/etc]$ shareall
[root@gandalf:/etc]$ exportfs
/export/home   rw   "Home Directories"
[root@theoden:/export/home]$ echo "share -F nfs -d \"Home Directories\" /export/home" >> /etc/dfs/dfstab
[root@theoden:/export/home]$ shareall
[root@theoden:/export/home]$ exportfs
/export/home   rw   "Home Directories"
Okay, it's important that both hosts can resolve the hostname of the other system. I've added some lines to /etc/hosts in my test installation:
10.211.55.201 gandalf
10.211.55.200 theoden
Now I set the home directory of both users to the /home under the control of autofs:
[root@gandalf:~]$ usermod -d /home/statler statler
[root@gandalf:~]$ usermod -d /home/waldorf waldorf
Now I create the users for the other team, without the -m switch and directly with the correct home directory. The home directories come from the other system, so we don't have to create them:
[root@gandalf:~]$ useradd -u 2002 -d /home/gonzo gonzo
[root@gandalf:~]$ useradd -u 2003 -d /home/scooter scooter
Now we switch to Theoden. We do almost the same on this system. We create the accounts for Waldorf and Statler without creating a home directory. After this we create the local users together with their home directories, which we then set to be autofs controlled:
[root@theoden:~]$ useradd -u 2001 -d /home/statler statler
[root@theoden:~]$ useradd -u 2000 -d /home/waldorf waldorf
[root@theoden:~]$ useradd -u 2002 -d /export/home/gonzo -m gonzo
64 blocks
[root@theoden:~]$ [...] scooter
64 blocks
[root@theoden:~]$ [...]
[root@theoden:~]$ [...]
Here, the ampersand is a variable. It stands for the key in the table. So gonzo theoden:/export/home/& translates to theoden:/export/home/gonzo. Now start the autofs on both hosts:
[root@theoden:~]$ svcadm enable autofs
and
[root@gandalf:~]$ svcadm enable autofs
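The ampersand substitution in the auto_home map described above can be sketched in shell. This is only an illustration of the rule that the & in the location stands for the key of the entry; the function expand_entry is made up for this purpose and has nothing to do with how autofs is implemented:

```shell
#!/bin/sh
# expand_entry KEY LOCATION
# Hypothetical helper: prints LOCATION with every "&" replaced by KEY,
# the way the key of an indirect map entry is substituted.
expand_entry() {
    key=$1
    location=$2
    result=
    while :; do
        case "$location" in
            *'&'*)
                # copy everything up to the first "&", then insert the key
                result=$result${location%%'&'*}$key
                location=${location#*'&'}
                ;;
            *)
                # no "&" left: append the remainder and stop
                result=$result$location
                break
                ;;
        esac
    done
    echo "$result"
}

# The example from the text: gonzo theoden:/export/home/&
expand_entry gonzo 'theoden:/export/home/&'
```

With this rule a single line in the map can serve every user whose home directory follows the same naming scheme on the server.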
Now we try waldorf on theoden. Waldorf doesn't have his home directory on theoden, it's on gandalf.
$ ssh waldorf@10.211.55.200
Password:
Last login: Sun Feb 17 14:17:47 2008 from 10.211.55.2
Sun Microsystems Inc.   SunOS 5.11   snv_78   October 2007
$ /usr/sbin/mount
[...]
/home/waldorf on gandalf:/export/home/waldorf remote/read/write/setuid/devices/xattr/dev=4dc0001 on Sun Feb 17 14:17:48 2008
autofs has automatically mounted /export/home/waldorf to /home/waldorf, the directory we used when we created the user. Let's crosscheck. We log into gandalf as the user waldorf. On this system the user has a local home directory. It's a local mount this time.
$ ssh waldorf@10.211.55.201
Password:
Last login: Sat Feb 16 09:12:47 2008 from 10.211.55.2
Sun Microsystems Inc.   SunOS 5.11   snv_78   October 2007
$ /usr/sbin/mount
[...]
/home/waldorf on /export/home/waldorf read/write/setuid/devices/dev=1980000 on Sat Feb 16 09:12:47 2008
[root@theoden:/tools/solaris]$ ls -l /tools/solaris
total 0
-rw-r--r--   1 root     root           0 Feb 17 15:21 tool1
-rw-r--r--   1 root     root           0 Feb 17 15:21 tool2
-rw-r--r--   1 root     root           0 Feb 17 15:21 tool3
Now change to the other workstation. Look into the directory /net/theoden:
[root@gandalf:/]$ cd /net/theoden
[root@gandalf:/net/theoden]$ ls
export tools
You will notice all of the directories shared by theoden. Change into the tools/solaris directory:
[root@gandalf:/net/theoden]$ cd tools
[root@gandalf:/net/theoden/tools]$ ls
solaris
[root@gandalf:/net/theoden/tools]$ cd solaris
[root@gandalf:/net/theoden/tools/solaris]$ ls -l
total 0
-rw-r--r-- 1 root root 0 Feb 17 2008 tool1
-rw-r--r-- 1 root root 0 Feb 17 2008 tool2
-rw-r--r-- 1 root root 0 Feb 17 2008 tool3
[root@gandalf:/net/theoden/tools/solaris]$
[root@gandalf:/net/theoden/tools/solaris]$ mount
[..]
/net/theoden/tools/solaris on theoden:/tools/solaris remote/read/write/nosetuid/nodevices/xattr/dev=4dc0002 on Sat Feb 16 10:23:01 2008
Neat, isn't it? It's configured by default when you start autofs.
How Autofs Works
Task Overview for Autofs Administration
8. lockfs
Solaris 10/Opensolaris
Quite often the configuration of a feature or application mandates that the data on the disk doesn't change while you activate it. An easy way to achieve this would be simply unmounting the disk. That's possible, but then you can't access the data on the disk at all. You can't even read from the filesystem, even though reading doesn't change anything on the disk (okay, as long as you've mounted the disk with noatime).
So: how else can you ensure that the content of a filesystem doesn't change while you work with the disk? UFS has an interesting feature. It's called lockfs, and with it you can lock the filesystem. You can lock it to such an extent that you can only unmount and remount it to regain access to the data, but you can also lock out just a subset of the many ways in which one might try to access it.
No problem. Our test file found its way into the filesystem. Now we establish a write lock on our filesystem.
# lockfs -w /mnt
You set the lock with the lockfs command, and the switch -w tells lockfs to set a write lock. With a write lock, you can read from a filesystem, but you can't write to it. Okay, let's check the existing locks. You use the lockfs command without any further options.
# lockfs
Filesystem           Locktype   Comment
/mnt                 write
When we try to add an additional file, the write system call simply blocks.
# echo "test" > testfile2
^Cbash: testfile2: Interrupted system call
We have to interrupt the echo command with CTRL-C. Okay, now let's release the lock.
# lockfs -u /mnt
The -u switch tells lockfs to release the lock. When you list the existing locks, the lock on /mnt is gone.
# lockfs
The command returns instantly. When you check the filesystem, you will see both files.
# ls -l
total 20
drwx------ 2 root root 8192 Apr 25 18:10 lost+found
-rw-r--r-- 1 root root [...]
-rw-r--r-- 1 root root [...]
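The lock, work, unlock sequence shown above lends itself to a small wrapper. The following is a sketch under stated assumptions, not a tested administration tool: with_write_lock and the LOCKFS variable are names invented for this example, and only the lockfs switches -w and -u from the text are used.

```sh
#!/bin/sh
# Sketch: run a command while a UFS write lock is held, then release it.
# with_write_lock is a hypothetical helper. LOCKFS can point to a stand-in
# for dry runs; it defaults to the real lockfs command.
LOCKFS=${LOCKFS:-lockfs}

with_write_lock() {
    mountpoint=$1
    shift
    # Give up early if the lock cannot be established.
    "$LOCKFS" -w "$mountpoint" || return 1
    "$@"
    status=$?
    # Always release the lock, even if the command failed.
    "$LOCKFS" -u "$mountpoint"
    return "$status"
}
```

A call could look like with_write_lock /mnt ls -l /mnt. The wrapped command should only read from the locked filesystem, since any write to it would block just like the echo in the example above.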
No problem. Now we establish the delete lock. This time we also add a comment. You can use the comment to tell other administrators why you have established the lock.
# lockfs -c "no deletes today" -d /mnt
When you check for existing locks, you will see the delete lock on /mnt and the comment:
# lockfs
Filesystem           Locktype   Comment
/mnt                 delete     no deletes today
When you try to delete the file, the rm just blocks and you have to interrupt it with CTRL-C again:
# rm testfile2
^C
When you've delete-locked a filesystem, you can create new files, append data to existing files, and overwrite them:
# echo "test" > testfile3
# echo "test" >> testfile3
# echo "test" > testfile3
There is only one thing you can't do with this new file: delete it.
# rm testfile3
^C
8.4. Conclusion
lockfs is a really neat feature for denying certain kinds of access to your filesystem without unmounting it completely. Some locks are more useful for general use than others. For example, the write lock is really useful when you want to freeze the content of the filesystem while working with tools like AVS. Delete locks or name locks are useful when you need a stable directory, which is less of a day-to-day problem for administrators.
http://docs.sun.com/app/docs/doc/816-5166/6mbb1kq6a?l=en&a=view
9. CacheFS
Solaris10
9.1. Introduction
There is a hidden gem in the Solaris Operating Environment, solving a task that many admins solve with scripts. Imagine the following situation. You have a central fileserver and, let's say, 40 webservers. All of these webservers deliver static content, and this content is stored on the hard disk of the fileserver. Later you notice that your fileserver is heavily loaded by the webservers. Hard disks are cheap, thus most admins will start to use a recursive rcp or an rsync to put a copy of the data on the webserver disks.
Well... Solaris gives you a tool to solve this problem without scripting and without cron-based jobs, just by using NFSv3 and this hidden gem: CacheFS. CacheFS is a really nifty tool. It does exactly what the name says. It's a filesystem that caches data of another filesystem. You have to think about it like a layered cake. You mount a CacheFS filesystem with a parameter that tells CacheFS to mount another one in the background.
9.4.1. Preparations
Let's create an NFS server first. This is easy. Just share a directory on a Solaris server. We log in to theoden and execute the following commands with root privileges.
[root@theoden:/]# mkdir /export/files
[root@theoden:/]# share -o rw /export/files
[root@theoden:/]# share
-               /export/files   rw   ""
Okay, of course it would be nice to have some files to play around with in this directory. I will use some files of the Solaris environment.
[root@theoden:/]# cd /export/files
[root@theoden:/export/files]# cp -R /usr/share/doc/pcre/html/* .
[root@gandalf:/]# mkdir /files
[root@gandalf:/]# mount theoden:/export/files /files
[root@gandalf:/]# umount /files
Now you should be able to access the /export/files directory on theoden by accessing /files on gandalf. There should be no error messages.
Okay, first we have to create the location for our caching directories. Let's assume we want to place our cache at /var/cachefs/caches/cache1. At first we create the directories above the cache directory. You don't create the last part of the directory structure manually.
[root@gandalf:/]# mkdir -p /var/cachefs/caches
This directory will be the place where we store our caches for CacheFS. After this step we have to create the cache for CacheFS.
[root@gandalf:/files]# cfsadmin -c -o maxblocks=60,minblocks=40,threshblocks=50 /var/cachefs/caches/cache1
The directory cache1 is created automatically by the command. If the directory already exists, the command will quit and do nothing. With this command you have created the cache and specified some basic parameters to control its behavior. Citing the manpage of cfsadmin:
maxblocks: Maximum amount of storage space that CacheFS can use, expressed as a percentage of the total number of blocks in the front file system.
minblocks: Minimum amount of storage space, expressed as a percentage of the total number of blocks in the front file system, that CacheFS is always allowed to use without limitation by its internal control mechanisms.
threshblocks: A percentage of the total blocks in the front file system beyond which CacheFS cannot claim resources once its block usage has reached the level specified by minblocks.
Each of these parameters can be tuned to prevent CacheFS from eating away all of the storage available in a filesystem, a behavior that was quite common in early versions of this feature.
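Because all three limits are percentages of the front file system, it can help to translate them into block counts. The following sketch does only this arithmetic. The helper name blocks_for_percent and the block total used below are assumptions for illustration, not values taken from a real system.

```sh
#!/bin/sh
# Sketch: turn the percentage-based cfsadmin limits into block counts.
# blocks_for_percent is a hypothetical helper; the block total used in the
# usage note is an invented example value, not a measurement.
blocks_for_percent() {
    percent=$1
    total_blocks=$2
    # Integer arithmetic, rounded down.
    echo $(( total_blocks * percent / 100 ))
}
```

For an assumed front file system of 2000000 blocks, the settings from the cfsadmin call above work out to 1200000 blocks for maxblocks=60, 800000 blocks for minblocks=40 and 1000000 blocks for threshblocks=50.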
You may notice the parameter that sets the NFS version to 3. This is necessary as CacheFS isn't supported with NFSv4. Thus you can only use it with NFSv3 and below. The reason for this limitation lies in the different way NFSv4 handles inodes. Okay, now we mount the cache filesystem at the old location:
[root@gandalf:/files]# mount -F cachefs -o backfstype=nfs,backpath=/var/cachefs/backpaths/files,cachedir=/var/cachefs/caches/cache1 theoden:/export/files /files
The options of the mount command control some basic parameters of the mount:
backfstype specifies what type of filesystem is proxied by the CacheFS filesystem.
backpath specifies where this proxied filesystem is currently mounted.
cachedir specifies the cache directory for this instance of the cache. Multiple CacheFS mounts can use the same cache.
From now on every access to the /files directory will be cached by CacheFS. Let's have a quick look into /etc/mnttab. There are two important mounts for us:
[root@gandalf:/etc]# cat mnttab
[...]
theoden:/export/files /var/cachefs/backpaths/files nfs vers=3,xattr,dev=4f80001 1219049560
/var/cachefs/backpaths/files /files cachefs backfstype=nfs,backpath=/var/cachefs/backpaths/files,cachedir=/var/cachefs/caches/cache1,dev=4fc0001 1219049688
The first mount is our back file system; it's a normal NFS mountpoint. But the second mount is a special one. This one is the consequence of the mount with the -F cachefs option.
To ensure that multiple caches using a single cache directory at the same time aren't mixing up their data, they are divided at this place. At first a special directory is generated, and secondly a more friendly name is linked to it. It's pretty obvious how this name is generated: theoden:_export_files:_files can be easily translated to theoden:/export/files mounted at /files. Let's assume we've used the cache for another filesystem (e.g. /export/binaries on theoden mounted to /binaries):
[root@gandalf:/var/cachefs/cache1]# ls -l
total 10
drwxrwxrwx 5 root root 512 Aug 18 10:54 0000000000044e30
drwxrwxrwx 3 root root 512 Aug 18 11:18 0000000000044e41
drwx------ 2 root root 512 Aug 11 08:11 lost+found
lrwxrwxrwx 1 root root 16 Aug 18 11:18 theoden:_export_binaries:_binaries -> 0000000000044e41
lrwxrwxrwx 1 root root 16 Aug 11 08:18 theoden:_export_files:_files -> 0000000000044e30
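The translation of the friendly link names back into a filesystem and a mountpoint can be written down as a few lines of sh. This is only a sketch of the naming rule as described in the text: decode_cache_name is a hypothetical helper, and it assumes that every underscore stands for a slash, so paths that contain real underscores would be decoded incorrectly.

```sh
#!/bin/sh
# Sketch: decode a CacheFS cache name as shown in the listing.
# decode_cache_name is a hypothetical helper. It assumes that every "_"
# stands for a "/", so paths containing real underscores decode wrongly.
decode_cache_name() {
    decoded=$(printf '%s\n' "$1" | tr '_' '/')
    backfs=${decoded%:*}        # everything before the last colon
    mountpoint=${decoded##*:}   # everything after the last colon
    printf '%s mounted at %s\n' "$backfs" "$mountpoint"
}
```

decode_cache_name 'theoden:_export_files:_files' prints theoden:/export/files mounted at /files.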
With this mechanism, the caches are separated into their respective directories... no mixing up. When we dig a little bit deeper into the directories, we will see an additional layer of directories. This is necessary to prevent a situation where a single directory contains too many files and thus slows down.
[root@gandalf:/var/cachefs/cache1/0000000000044e30/0000000000044e00]# ls -l
total 62
-rw-rw-rw- 1 root root 0 Aug 18 10:54 0000000000044e66
-rw-rw-rw- 1 root root 1683 Aug 11 08:24 0000000000044eaa
-rw-rw-rw- 1 root root 29417 Aug 11 08:22 0000000000044eba
When you examine these files, you will see that they are just copies of the original files:
[root@gandalf:/var/cachefs/cache1/0000000000044e30/0000000000044e00]# cat 0000000000044eaa
[...]
This page is part of the PCRE HTML documentation. It was generated automatically from the original man page. If there is any nonsense in it, please consult the man page, in case the conversion went wrong.
[...]
[root@gandalf:/var/cachefs/cache1/0000000000044e30/0000000000044e00]#
When we go to the NFS client and access the directory, this new file is visible instantaneously. And when we access it, we see the content of the file.
[root@gandalf:/files]# cat test_with_consistency_check
Tue Aug 12 14:59:54 CEST 2008
Now we go back to the server and append additional data to the file:
[root@theoden:/export/files]# date >> test_with_consistency_check
[root@theoden:/export/files]# cat test_with_consistency_check
Tue Aug 12 14:59:54 CEST 2008
Tue Aug 12 15:00:11 CEST 2008
You may have noticed the demandconst option. This option changes everything. Let's assume you created another file on the NFS server:
[root@theoden:/export/files]# date >> test_with_ondemand_consistency_check
[root@theoden:/export/files]# cat test_with_ondemand_consistency_check
Tue Aug 12 15:00:57 CEST 2008
Back on the NFS client you will not even see this file:
[root@gandalf:/files]# ls
index.html
[...]
pcre_info.html
pcre_maketables.html
pcre_refcount.html
pcretest.html
test_with_consistency_check
Now we append a new line to the file by executing the following commands on the NFS server:
[root@theoden:/export/files]# date >> test_with_ondemand_consistency_check
[root@theoden:/export/files]# cat test_with_ondemand_consistency_check
Tue Aug 12 15:00:57 CEST 2008
Tue Aug 12 15:02:03 CEST 2008
When we check this file on our NFS client, we still see the cached version.
[root@gandalf:/files]# cat test_with_ondemand_consistency_check
Tue Aug 12 15:00:57 CEST 2008
[root@gandalf:/files]# cfsadmin -s all
Now we can look into the file again, and you will see the new version of the file.
[root@gandalf:/files]# cat test_with_ondemand_consistency_check
Tue Aug 12 15:00:57 CEST 2008
Tue Aug 12 15:02:03 CEST 2008
Okay, it's pretty obvious this isn't a feature for a filesystem that changes constantly and quickly. But it's really useful in situations where you have control over the changes. As long as a file is cached, the file server will not see a single access to it. Thus such a file access doesn't add to the load of the server.
There is an important fact here: cfsadmin -s doesn't tell CacheFS to check the files right at that moment. It just tells CacheFS to check them at the next access to the file. So you don't get a consistency check storm.
but no new features will find their way into this component. In recent days there was some discussion about declaring End-of-Feature status for CacheFS, which would lead to the announcement of its removal. While this isn't a problem for Solaris 10, I strongly disagree with the idea of removing this part of Solaris as long as there is no other boot-persistent, non-main-memory caching available for Solaris.
9.9. Conclusion
CacheFS is one of the features that even some experienced admins aren't aware of. But as soon as they try it, most of them can't live without it. You should give it a try.
This article isn't really about a feature; it's about a directory and its misuse. Furthermore it's an article about different default configurations that lead to misunderstandings. This is pretty old hat for experienced Solaris admins (many of them learned it the hard way), but it seems to be totally unknown to many admins new to the business or to people switching from Linux to Solaris, as many distributions are configured differently by default. A reader of my blog just found a 2 GB .iso in /tmp on a Solaris system, and that's not really a good idea. A few days ago, a user on Twitter had vast problems with memory usage on a system, which boiled down to a crowded /tmp.
10. The curious case of /tmp in Solaris
It's called tmpfs because everything you write into it is temporary: the next boot or unmount will kill all the files on it. When you look at the mount table of a Solaris system, you will recognize that the usual locations for such temporary files are mounted on tmpfs:
jmoekamp@a380:/var$ mount | grep "swap"
/etc/svc/volatile on swap read/write/setuid/devices/xattr/dev=4f00001 on Fri Jul 31 06:33:33 2009
/tmp on swap read/write/setuid/devices/xattr/dev=4f00002 on Fri Jul 31 06:34:14 2009
/var/run on swap read/write/setuid/devices/xattr/dev=4f00003 on Fri Jul 31 06:34:14 2009
Keeping these filesystems in virtual memory is a reasonable choice. The stuff in these directories is normally stale after a reboot. Most of the time the files are many but rather small, and putting them on disk would just eat away the IOPS budget of your boot disks. As the filesystem resides in memory, it's much faster, and that really helps with jobs that handle many small files. A good example is compiling software when you use /tmp as the TMPDIR.
All these advantages come with a big disadvantage when you are not aware of the nature of the /tmp directory. I assume you already know why using /tmp for storing ISOs is a bad idea: it eats away your memory and later on your swap. And for all the experienced admins: when someone has memory problems, ask about the /tmp directory first. We tend to forget about this, as we learned this lesson a long time ago and thus don't think about this problem.
When you really need a temporary place to store data for a while, you should use /var/tmp. This is a directory on a normal disk-based filesystem, and thus its content is boot persistent.
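When you suspect a crowded /tmp, it helps to look for the large files first. The following is a minimal sketch using only find. The helper name big_files and the threshold in the usage note are assumptions chosen for illustration.

```sh
#!/bin/sh
# Sketch: list regular files above a size threshold in a directory such as /tmp.
# big_files is a hypothetical helper. find counts -size in 512-byte blocks,
# so the threshold in kilobytes is doubled.
big_files() {
    dir=$1
    min_kb=$2
    find "$dir" -type f -size +$(( min_kb * 2 )) -print
}
```

For example, big_files /tmp 102400 would list every regular file in /tmp that is larger than 100 MB, which is a reasonable first thing to check before moving such files to /var/tmp.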
When you want to make a boot-persistent change to the maximum size of the /tmp directory, you have to configure this in /etc/vfstab:
swap - /tmp tmpfs - yes size=512m
10.3. Conclusion
I hope it's now clear why you shouldn't use /tmp as a place for storing big files. At least move them somewhere else right away when you use /tmp as the target directory to copy a file to a system via ssh. On the other side, a /tmp in virtual memory gives you some interesting capabilities to speed up your applications. As usual, everything has two sides: making it the default in Solaris gives you speedups by default. But when you are unaware of this default, your virtual memory may end up being used for a collection of videos ;) Obviously the same is valid when you configure your Linux system with a tmpfs-based /tmp and don't tell your fellow admins.
http://docs.sun.com/app/docs/doc/816-5177/tmpfs-7fs
http://docs.sun.com/app/docs/doc/816-5166/mount-tmpfs-1m
11.1. Introduction
11.1.1. The Story of root
And then there was root. And root was almighty. And that wasn't a good thing. root was able to control the world without any control. And root needed control. It was only a short chant between the mere mortals and root. Everybody with the knowledge of the magic chant was able to speak through root. But root wasn't alone. root had servants called daemons. Some of them needed divine powers to do their daily job. But root was an indivisible being. So the servants had to work with the powers of root. But the servants weren't as perfect as root: some of the servants started to do everything mere mortals said to them, if they only said more than a certain amount of prayers at once. One day, the world of root experienced a large disaster, the negation of being. Top became bottom, left became right, the monster of erem-ef annihilated much of the world. But it got even stranger. root destroyed its own world, and by the power of root the destruction was complete. Then there was a FLASH. The world restarted, root got a second attempt to reign over his world. But this time, it would be a different world.
11.1.2. Superuser
The old model of rights in a Unix system is based on a duality. There is the superuser and the normal user. The normal users have a restricted set of rights in the system; the superuser has an unrestricted set of rights. To modify the system, a normal user has to log in as root directly or assume the rights of root (by su -). But such a user has unrestricted access to the system. Often this isn't desirable. Why should you enable an operator to modify a system, when all he or she has to do on the system is create some users from time to time? You've trained him to do useradd or passwd, but he's a Windows admin who doesn't know anything about being a Unix admin. What do you do when he gets too curious? He needs root privileges to create a user or change a password. You need some mechanism to limit this operator.
But it gets more problematic. Programs have to modify the system to work. A webserver is a nice example. It uses port 80. Ports below port number 1024 have a special meaning. They are privileged ports. You need special rights to modify the structures of the system to listen on port 80. A normal user doesn't have these rights. So the webserver has to be started as root. The children of this process drop the rights of root by running as a normal user. But there is this single instance of the program with all the rights of root. This process has many more rights than needed, a possible attack vector for malicious users. This led to the development of different models of handling the rights of users in the system: Privileges and Role Based Access Control.
When they start to work in their job, they assume a role. From the privilege perspective it's not important who the person is, but what role the person has assumed. Lenny punches the clock and assumes the role of the plumbing janitor for the next 8 hours. And while he is doing his job, he uses the privileges inherent to the role. But he also has to do tasks in his office or in his workshop. It's his own room, so he doesn't need the special privileges there.
Role Based Access Control is quite similar. You log in to the system, and then you start work. You read your emails (no special privileges needed), and you find an email: "Create user xy45345. Your Boss." Okay, now you need special privileges. You assume the role of a User Administrator and create the user. Job done, you don't need the privileges anymore. You leave the role and write the "Job done" mail to your boss as your normal user. Role Based Access Control is all about this: defining roles, giving them privileges and assigning users to these roles.
11.1.5. Privileges
I've used the word quite often in the article so far. What is a privilege? A privilege is the right to do something, for example having the keys for the control panel of the heating system. Unix users are no different. Every user has privileges in a Unix system. A normal user has the privilege to open, close, read, write and delete files when he is allowed to do this (because he created the file, because he belongs to the same group as the creator of the file, or because the creator gave everybody the right to do it). This looks normal to you, but it's a privilege based on the login credentials you gave to the system. You don't have the privilege to read all files on the system or to use a port number below 1024. Everything done in the system is based on these privileges.
Solaris separated the tasks into many privilege sets. At the moment, there are 70 different privileges in the system. The difference between a normal user and root is that the normal user has only a basic set, while root has all of them. But it doesn't have to be this way. Privileges and users aren't connected with each other. You can give any user the power of the root user, and restrict the privileges of the root user. It's just our binary compatibility guarantee that mandates that the standard configuration of the system resembles the superuser model. There are applications out there which assume that only the root user or the uid 0 has unrestricted rights and which exit when they are started as a different user.
Let's use the standard example for RBAC: reboot the system. To do this task, you need to be root.
You are not allowed to do this. Okay, until now you would give the root account to all people who have to reboot the system. But why should someone be able to modify users, when all he or she should do is use the reboot command? Okay, at first you create a role. As mentioned before, it's a special user account.
# roleadd -m -d /export/home/reboot reboot
64 blocks
Okay, when you look into /etc/passwd, you see a quite normal user account.
# grep reboot /etc/passwd
reboot:x:101:1::/export/home/reboot:/bin/pfsh
There is one important difference. You use a special kind of shell. These shells are called profile shells and have special mechanisms to check executions against the RBAC databases. Okay, we've created the role, now we have to assign it to a user:
# usermod -R reboot jmoekamp
UX: usermod: jmoekamp is currently logged in, some changes may not take effect until next login.
But at the moment, this role isn't functional, as it has no assigned role profile. It's a role without rights and privileges. At first, let's create a REBOOT role profile. It's quite easy. Just a line at the end of prof_attr. This file stores all the attributes of
# echo "REBOOT:::profile to reboot:help=reboot.html" >> /etc/security/prof_attr
Okay, now assign the role profile REBOOT to the role reboot
The information about this assignment is stored in /etc/user_attr. Let's have a look into it:
# grep reboot /etc/user_attr
reboot::::type=role;profiles=REBOOT
jmoekamp::::type=normal;roles=reboot
But this isn't enough: the profile is empty. You have to assign some administrative commands to it.
# echo "REBOOT:suser:cmd:::/usr/sbin/reboot:euid=0" >> /etc/security/exec_attr
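The exec_attr line above packs several pieces of information into colon-separated fields. The sketch below splits such an entry so the parts are easier to see. The field order name:policy:type:res1:res2:id:attr follows the exec_attr(4) manual page; show_exec_attr itself is a helper invented for this illustration.

```sh
#!/bin/sh
# Sketch: split an exec_attr entry into its colon-separated fields.
# The field order name:policy:type:res1:res2:id:attr follows exec_attr(4);
# show_exec_attr is a hypothetical helper for illustration.
show_exec_attr() {
    printf '%s\n' "$1" | {
        # res1 and res2 are reserved fields and stay empty in this entry.
        IFS=: read -r name policy type res1 res2 id attr
        printf 'profile=%s policy=%s type=%s command=%s attributes=%s\n' \
            "$name" "$policy" "$type" "$id" "$attr"
    }
}
```

For the entry used in this chapter, show_exec_attr 'REBOOT:suser:cmd:::/usr/sbin/reboot:euid=0' prints profile=REBOOT policy=suser type=cmd command=/usr/sbin/reboot attributes=euid=0. In other words: commands run through the REBOOT profile get an effective uid of 0 for /usr/sbin/reboot and nothing else.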
11.5. Authorizations
But RBAC can do more for you. There is an additional concept in it: authorizations. Authorizations are a mechanism that needs the support of the application. The application checks whether the user has the necessary authorization to use it. Let's use the example of the janitor: rights give him access to the drilling machine. But this is a rather strange drilling machine. It checks whether the janitor has the permission to drill holes when he pulls the trigger.
The concept of authorizations is a fine-grained system. An application can check for a vast number of authorizations. For example, the application can check for the authorization to modify the configuration, to read the configuration or to print the status. A user can have all of these authorizations, none, or something in between. It's like the janitor's new power screwdriver. It checks whether the janitor has the permission to use it with anticlockwise rotation, the permission to use it with clockwise rotation and the permission to set different speeds of rotation.
Wouldn't it be nice to have an authorization that enables a regular user to restart it? Okay, no problem. Let's create one:
$ su root
# echo "solaris.smf.manage.apache/server:::Apache Server management::" >> /etc/security/auth_attr
That's all. Where is the definition of the permission that the authorization stands for? There is no definition. It's the job of the application to work with it. Now assign this authorization to the user:
# usermod -A solaris.smf.manage.apache/server jmoekamp
UX: usermod: jmoekamp is currently logged in, some changes may not take effect until next login.
Okay, but at the moment no one checks for this authorization, as no application is aware of it. We have to tell SMF to use this authorization. The authorizations for an SMF service are part of the general properties of the service. Let's have a look at the properties of this service.
# svcprop -p general apache2
general/enabled boolean false
general/entity_stability astring Evolving
No authorization configured. Okay... let's add the authorization we've defined before:
# svccfg -s apache2 setprop general/action_authorization=astring: solaris.smf.manage.apache/server
Okay, a short test. Exit your root shell and log in as the regular user to whom you have assigned the authorization.
bash-3.2$ svcs apache2
STATE          STIME    FMRI
disabled       22:49:51 svc:/network/http:apache2
Okay, I can view the status of the service. Now I try to start it.
bash-3.2$ /usr/sbin/svcadm enable apache2
svcadm: svc:/network/http:apache2: Permission denied.
What the hell...? No permission to start the service? Yes: enabling the service is not only a method (the startup script), it's also the value of a certain property. When you only have the action authorization, you can only do tasks that don't change the state of the service. You can restart it (no change of the service properties), but not enable or disable it (a change of the service properties). But this is not a problem. You have to log in as root again and assign the solaris.smf.manage.apache/server authorization to the value authorization.
# svccfg -s apache2 setprop general/value_authorization=astring: solaris.smf.manage.apache/server
With the value authorization, SMF allows you to change the state of the service. Try it again.
bash-3.2$ /usr/sbin/svcadm enable apache2
bash-3.2$
This role profile already has some predefined commands that need special security attributes to succeed:
Software Installation:solaris:act:::Open;*;JAVA_BYTE_CODE;*;*:uid=0;gid=2
Software Installation:suser:cmd:::/usr/bin/ln:euid=0
Software Installation:suser:cmd:::/usr/bin/pkginfo:uid=0
Software Installation:suser:cmd:::/usr/bin/pkgmk:uid=0
Software Installation:suser:cmd:::/usr/bin/pkgparam:uid=0
Software Installation:suser:cmd:::/usr/bin/pkgproto:uid=0
Software Installation:suser:cmd:::/usr/bin/pkgtrans:uid=0
Software Installation:suser:cmd:::/usr/bin/prodreg:uid=0
Software Installation:suser:cmd:::/usr/ccs/bin/make:euid=0
Software Installation:suser:cmd:::/usr/sbin/install:euid=0
Software Installation:suser:cmd:::/usr/sbin/patchadd:uid=0
Software Installation:suser:cmd:::/usr/sbin/patchrm:uid=0
Software Installation:suser:cmd:::/usr/sbin/pkgadd:uid=0;gid=bin
Software Installation:suser:cmd:::/usr/sbin/pkgask:uid=0
Software Installation:suser:cmd:::/usr/sbin/pkgchk:uid=0
Software Installation:suser:cmd:::/usr/sbin/pkgrm:uid=0;gid=bin
This is all you need to install software on your system. You can use these predefined role profiles at will. You don't have to define all this stuff on your own.
11.8. Privileges
We've talked a lot about RBAC, roles and role profiles. But what are privileges? Privileges are rights to do an operation in the kernel. These rights are enforced by the kernel. Whenever you do something within the kernel, the access is controlled by the privileges. At the moment, the rights to do something with the kernel are separated into 70 classes:
contract_event contract_observer cpc_cpu
dtrace_kernel dtrace_proc dtrace_user
file_chown file_chown_self file_dac_execute file_dac_read file_dac_search file_dac_write file_downgrade_sl file_flag_set file_link_any file_owner file_setid file_upgrade_sl
graphics_access graphics_map
ipc_dac_read ipc_dac_write ipc_owner
net_bindmlp net_icmpaccess net_mac_aware net_privaddr net_rawaccess
proc_audit proc_chroot proc_clock_highres proc_exec proc_fork proc_info proc_lock_memory proc_owner proc_priocntl proc_session proc_setid proc_taskid proc_zone
sys_acct sys_admin sys_audit sys_config sys_devices sys_ip_config sys_ipc_config sys_linkdir sys_mount sys_net_config sys_nfs sys_res_config sys_resource sys_smb sys_suser_compat sys_time sys_trans_label
win_colormap win_config win_dac_read win_dac_write win_devices win_dga win_downgrade_sl win_fontpath win_mac_read win_mac_write win_selection win_upgrade_sl
Every UNIX system performs the tasks hidden behind these privileges. There are many different privileges in the kernel. These privileges are not Solaris specific; what is specific is the way access to them is controlled.
$ ls -l / usr / sbin / traceroute -r - sr - xr - x 1 root bin / traceroute $ ls -l / usr / sbin / ping -r - sr - xr - x 1 root bin / ping
setuid is nothing else than a violation of the security policy. You need a special privilege to ping: The privilege to use access ICMP. On conventional system this right is reserved to the root user. Thus the ping program has to be executed with the rights of root. The problem: At the time of the execution of the program, the program has all rights of the user. Not only to access ICMP, the program is capable to do everything on the system, as deleting les in /etc. This may not a problem with ping or traceroute but think about larger programs. An exploit in a setuid program can lead to the escalation of the users privileges. Setuid root and you are toast. Lets have a look at the privileges of an ordinary user. There is a tool to get the privileges of any given process in the system, its called codeppriv/code.$$ is a shortcut for the actual process id (in this case the process id of the shell):
bash-3.2$ ppriv -v $$
646: bash
flags = <none>
E: file_link_any,proc_exec,proc_fork,proc_info,proc_session
I: file_link_any,proc_exec,proc_fork,proc_info,proc_session
P: file_link_any,proc_exec,proc_fork,proc_info,proc_session
L: contract_event,(..),win_upgrade_sl
Every process in the system has four sets of privileges that determine whether a process is allowed to use a privilege or not. The theory of privileges is quite complex. I would suggest reading the chapter "How Privileges Are Implemented" in the Security Services manual (http://docs.sun.com/app/docs/doc/8164557/prbactm-1?a=view) to learn how each set controls or is controlled by the other privilege sets. At this time, I only want to explain the meaning of the first letter:

E: effective privilege set
P: permitted privilege set
L: limit privilege set
I: inheritable privilege set
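To look at one of the four sets in isolation, the ppriv output can be filtered by its first letter. The following is only a minimal sketch: the function name priv_set is made up for this example, and the sample text is copied from the ppriv -v listing above. On Solaris 10 itself you may need nawk in place of awk.

```shell
#!/bin/sh
# priv_set LETTER: read ppriv output on stdin and print the privileges of
# one set (E, I, P or L), one privilege per line.
# (priv_set is a hypothetical helper, not part of Solaris.)
priv_set() {
    awk -v want="$1" '
        # Set lines look like "E: file_link_any,proc_exec,..." (maybe indented).
        $1 == (want ":") {
            sub(/^[ \t]*[EIPL]:[ \t]*/, "")   # drop the "E: " prefix
            gsub(/[ \t]/, "")                 # drop stray blanks
            n = split($0, privs, ",")
            for (i = 1; i <= n; i++) print privs[i]
        }'
}

# Sample input copied from the ppriv -v listing of the user shell.
sample='646: bash
flags = <none>
E: file_link_any,proc_exec,proc_fork,proc_info,proc_session
I: file_link_any,proc_exec,proc_fork,proc_info,proc_session
P: file_link_any,proc_exec,proc_fork,proc_info,proc_session
L: contract_event,(..),win_upgrade_sl'

printf '%s\n' "$sample" | priv_set E
```

On a live system the same filter could be fed directly, for example with ppriv -v $$ | priv_set E.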
You can think about the privilege sets as keyrings. The effective privilege set is the keys the janitor has on his keyring. The permitted privilege set is the keys the janitor is allowed to put on his keyring. The janitor can decide to remove some of the keys. Perhaps he thinks: "I work only in room 232 today. I don't need all the other keys. I'll leave them in my office." If he loses his keyring, he loses control over only this single room, not over the complete campus. The inheritable privilege set is not really a keyring. The janitor thinks about his new assistant: "Good worker, but I won't give him my key for the room with the expensive tools." The limit privilege set is the overarching order from the janitor's boss to his team leaders: "You are allowed to give your assistant the keys for normal rooms, but not for the rooms with all these blinking boxes from Sun."

At the moment the most interesting set is E:, the effective set. This is the set of privileges effectively available to the process. Compared to the full list of privileges mentioned above, the set is much smaller. But this matches your experience when you use a Unix system.
Okay, this example looks different from the one shown before. Nevertheless it has the same meaning. With the switch -v you can expand the aliases.
bash-3.2$ ppriv -v $$
815: bash
flags = <none>
E: file_link_any,proc_exec,proc_fork,proc_info,proc_session
I: file_link_any,proc_exec,proc_fork,proc_info,proc_session
P: file_link_any,proc_exec,proc_fork,proc_info,proc_session
L: contract_event,(..),win_upgrade_sl
Looks a little bit more familiar? Okay, now let's log in as root.
$ su root
Password:
# ppriv $$
819: sh
flags = <none>
E: all
I: basic
P: all
L: all
This user has many more privileges. The effective set is much broader: the user has all privileges in the system.
Exit to the login prompt and log in as the user you've assigned the privileges to.
$ ppriv $$
829: -sh
flags = <none>
E: basic,dtrace_kernel,dtrace_proc,dtrace_user
I: basic,dtrace_kernel,dtrace_proc,dtrace_user
P: basic,dtrace_kernel,dtrace_proc,dtrace_user
L: all
Simple ...
As you might have expected, the user itself doesn't have the privileges to use dtrace.
$ ppriv $$
883: -sh
flags = <none>
E: basic
I: basic
P: basic
L: all
This daemon doesn't even have the basic privileges of a regular user. It has only the bare minimum of privileges to do its job.
As expected for a root process, this process has the complete set of privileges of a root user. Okay, now one of its children.
# ppriv 1124
1124: /usr/apache2/2.2/bin/httpd -k start
flags = <none>
E: basic
I: basic
P: basic
L: all
Much better ... only basic privileges. Okay, there is a reason for this configuration. On Unix systems, you have two groups of ports: privileged ones from 1 to 1023 and unprivileged ones from 1024 up. You can only bind to a privileged port with the privilege to do it. A normal user doesn't have this privilege, but root has it. Thus there has to be one process running as root. Do you remember the list of privileges for the Apache process running as root? The process has all privileges but needs only one of them that isn't part of the basic privilege set.
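The port rule can be written down as a small check. This is only an illustration of the range described above; the function name is made up for this sketch, and the real check is of course done by the kernel when a process binds to the port.

```shell
#!/bin/sh
# is_privileged_port PORT: succeed if PORT lies in the privileged range
# 1-1023 described in the text (hypothetical helper name).
is_privileged_port() {
    [ "$1" -ge 1 ] && [ "$1" -le 1023 ]
}

for port in 80 443 8080; do
    if is_privileged_port "$port"; then
        echo "$port: binding needs a special privilege"
    else
        echo "$port: any user may bind"
    fi
done
```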
I won't explain the Service Management Facility here, but you can set certain properties in SMF to control the startup of a service.
# svccfg -s apache2
svc:/network/http:apache2> setprop start/user = astring: webservd
svc:/network/http:apache2> setprop start/group = astring: webservd
svc:/network/http:apache2> setprop start/privileges = astring: basic,!proc_session,!proc_info,!file_link_any,net_privaddr
svc:/network/http:apache2> setprop start/limit_privileges = astring: :default
svc:/network/http:apache2> setprop start/use_profile = boolean: false
svc:/network/http:apache2> setprop start/supp_groups = astring: :default
svc:/network/http:apache2> setprop start/working_directory = astring: :default
svc:/network/http:apache2> setprop start/project = astring: :default
svc:/network/http:apache2> setprop start/resource_pool = astring: :default
svc:/network/http:apache2> end
Lines 2 to 4 are the most interesting ones. Without any changes, the Apache daemon starts as root and forks off processes running as the webservd user. But we want to get rid of the root user for this configuration, so we start the daemon directly as the webservd user. The same applies to the group ID.
Now it gets interesting. Without the start/privileges line, the kernel would deny Apache the right to bind to port 80. webservd is a regular user without the privilege to use a privileged port. The property start/privileges sets the privileges used to start the service. At first, we give the service the basic privileges. Then we add the privilege to use a privileged port. The service would start up now. But wait, we can do more. A webserver shouldn't create any hard links. It doesn't send signals outside its session, and it doesn't look at processes other than those to which it can send signals. We don't need these privileges. proc_session, proc_info and file_link_any are part of the basic privilege set. We remove them by adding a ! in front of the privilege. Okay, now we have to notify SMF of the configuration changes:
# svcadm -v refresh apache2
Action refresh set for svc:/network/http:apache2.
Until now, the Apache daemon used root privileges, so the ownership of files and directories was unproblematic. The daemon was able to read and write any directory or file in the system. As we drop these privileges by using a regular user, we have to modify the ownership of some files and move some files.
# chown webservd:webservd /var/apache2/2.2/logs/access_log
# chown webservd:webservd /var/apache2/2.2/logs/error_log
# mkdir -p -m 755 /var/apache2/run
We need some configuration changes, too. We have to move the LockFile and the PidFile. Neither of the two configuration directives was in my config file, so I've simply appended them to the end of the file.
# echo "LockFile /var/apache2/2.2/logs/accept.lock" >> /etc/apache2/2.2/httpd.conf
# echo "PidFile /var/apache2/2.2/run/httpd.pid" >> /etc/apache2/2.2/httpd.conf
webservd  2235     1   1 19:29:53 ?   0:00 /usr/apache2/2.2/bin/httpd -k start
webservd  2238  2235   0 19:29:54 ?   0:00 /usr/apache2/2.2/bin/httpd -k start
webservd  2240  2235   0 19:29:54 ?   0:00 /usr/apache2/2.2/bin/httpd -k start
webservd  2242  2235   0 19:29:54 ?   0:00 /usr/apache2/2.2/bin/httpd -k start
webservd  2236  2235   0 19:29:54 ?   0:00 /usr/apache2/2.2/bin/httpd -k start
Do you notice the difference? There is no httpd running as root. All processes run with the user ID webservd. Mission accomplished. Let's check the privileges of the processes. At first the one that ran as root before:
# ppriv 2235
2235: /usr/apache2/2.2/bin/httpd -k start
flags = <none>
E: basic,!file_link_any,net_privaddr,!proc_info,!proc_session
I: basic,!file_link_any,net_privaddr,!proc_info,!proc_session
P: basic,!file_link_any,net_privaddr,!proc_info,!proc_session
L: all
Only the least privileges to do the job, no root privileges. And even the other processes are more secure now:
# ppriv 2238
2238: /usr/apache2/2.2/bin/httpd -k start
flags = <none>
E: basic,!file_link_any,net_privaddr,!proc_info,!proc_session
I: basic,!file_link_any,net_privaddr,!proc_info,!proc_session
P: basic,!file_link_any,net_privaddr,!proc_info,!proc_session
L: all
Before we changed the configuration of the webserver, these processes had the basic privileges of a regular user. Now we have limited even this set.
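To see what is left of a specification like basic,!file_link_any,net_privaddr,!proc_info,!proc_session, it can help to expand it by hand. The sketch below does this under two assumptions: the basic set consists of exactly the five privileges that ppriv -v listed for the user shell earlier in this chapter, and removals with ! apply regardless of their position in the list. The function name expand_privs is made up for this example; it is not how Solaris itself evaluates the string.

```shell
#!/bin/sh
# The basic set as listed by "ppriv -v" for an ordinary user in this chapter.
BASIC="file_link_any proc_exec proc_fork proc_info proc_session"

# expand_privs SPEC: expand "basic", drop every privilege prefixed with "!"
# and print what remains, sorted by name (hypothetical helper).
expand_privs() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -v basic="$BASIC" '
        $0 == "basic" {                      # expand the alias
            n = split(basic, b, " ")
            for (i = 1; i <= n; i++) have[b[i]] = 1
            next
        }
        /^!/ { drop[substr($0, 2)] = 1; next }   # remember removals
        NF   { have[$0] = 1 }                    # plain privilege name
        END  { for (p in have) if (!(p in drop)) print p }
    ' | sort
}

expand_privs 'basic,!proc_session,!proc_info,!file_link_any,net_privaddr'
```

Under these assumptions only proc_exec, proc_fork and net_privaddr remain for the webserver.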
When you want to place a system into a network, it's good practice to harden the system. Hardening is the configuration of a system to minimize the attack vectors for an intruder by closing down services, configuring stricter security policies and activating more verbose logging or auditing. But hardening is not a really simple task: you have to switch off as many services as possible and modify the configuration of many daemons. Furthermore, you have to know what your application needs to run; you can't close down a service that another service needs in order to execute. Those dependencies may be simple for a server with an Apache daemon, but hardening a Sun Cluster needs a little bit more knowledge. Furthermore, you have to keep the configuration in a state that's supported by Sun.
How to install the Solaris Security Toolkit?

Installation of the Toolkit is really easy. At first you have to get it from the Sun Download Center. Sorry, you need an account for it, but you can register for free. Before logging in as root, I've copied the file SUNWjass-4.2.0.pkg.tar.Z via scp to my freshly installed system with Solaris 10 Update 5:
# cd /tmp
# ls
SUNWjass-4.2.0.pkg.tar.Z  hsperfdata_root  typescript
hsperfdata_noaccess       ogl_select216
# bash
# uncompress SUNWjass-4.2.0.pkg.tar.Z
# tar xfv SUNWjass-4.2.0.pkg.tar
x SUNWjass, 0 bytes, 0 tape blocks
x SUNWjass/pkgmap, 33111 bytes, 65 tape blocks
[...]
x SUNWjass/install/preremove, 1090 bytes, 3 tape blocks
x SUNWjass/install/tsolinfo, 52 bytes, 1 tape blocks
Now I can use the Solaris Security Toolkit for system hardening. It's installed at /opt/SUNWjass/.
Now you should look into the desired drivers. An example: the hardening.driver contains a line to disable the nscd:

disable-nscd-caching.fin

But you want another behavior for some reason. You just have to add a # in front of the line:

# disable-nscd-caching.fin
Well, there is another behavior I don't want. By default, sshd is locked down via TCP wrappers and accepts connections only from the local host. But there is a better template at /opt/SUNWjass/Files/etc/hosts.allow that allows ssh access from all hosts. You can force SST to use it by adding another line to the hardening.driver. I've added a line to do so:
JASS_FILES="
/etc/hosts.allow
/etc/dt/config/Xaccess
/etc/init.d/set-tmp-permissions
/etc/issue
/etc/motd
/etc/rc2.d/S00set-tmp-permissions
/etc/rc2.d/S07set-tmp-permissions
/etc/syslog.conf
"
Now the Toolkit copies the file /opt/SUNWjass/Files/etc/hosts.allow to /etc/hosts.allow. As you may have noticed, the template has to be in the same directory as the file you want to substitute, with the difference that the directory of the template is relative to /opt/SUNWjass/Files/ and not to /. Okay, now that we have modified our driver, we can execute it:
# cd /opt/SUNWjass/bin
# ./jass-execute secure.driver
[NOTE] The following prompt can be disabled by setting JASS_NOVICE_USER to 0.
[WARN] Depending on how the Solaris Security Toolkit is configured, it
is both possible and likely that by default all remote shell and file transfer
access to this system will be disabled upon reboot effectively locking out any
user without console access to the system.
This warning is not a joke. Know what you are doing when you use this toolkit. Hardening means real hardening, and this process may leave you with a paranoid hosts.allow locking you out from accessing the sshd on your system. Without console access you would be toast now. But as we use the more sensible template for hosts.allow, we can proceed by answering with yes:
Executing driver, secure.driver
===========================================================================
===========================================================================
Toolkit Version: 4.2.0
Node name:       gondor
Zone name:       global
Host ID:         1911578e
Host address:    10.211.55.200
MAC address:     0:1c:42:24:51:b9
OS version:      5.10
Date:            Wed Apr 16 16:15:19 CEST 2008
===========================================================================
After a first status report, a long row of scripts will print log messages to the terminal. For example the finish script enable-coreadm.fin:
===========================================================================
Configuring coreadm to use pattern matching and logging.
[NOTE] Creating a new directory, /var/core.
[NOTE] Copying /etc/coreadm.conf to /etc/coreadm.conf.JASS.20080416161705
Many reports of this kind will scroll by, and at the end jass-execute prints out some diagnostics:
===========================================================================
===========================================================================
[SUMMARY] Results Summary for APPLY run of secure.driver
[SUMMARY] The run completed with a total of 97 scripts run.
[SUMMARY] There were Failures in 0 Scripts
[SUMMARY] There were Errors in 0 Scripts
[SUMMARY] There were Warnings in 3 Scripts
[SUMMARY] There were Notes in 81 Scripts
[SUMMARY] Warning Scripts listed in:
/var/opt/SUNWjass/run/20080416161504/jass-script-warnings.txt
[SUMMARY] Notes Scripts listed in:
/var/opt/SUNWjass/run/20080416161504/jass-script-notes.txt
===========================================================================
When you look around on your system, you will notice some new files. Every file in the system changed by the SST is backed up before the change is made. For example, you will find a file named vfstab.JASS.20080416161812 in /etc. JASS contains a finish script to limit the size of /tmp. As the /tmp filesystem resides in main memory, this is a sensible thing to do. Let's check for the differences:
# diff vfstab vfstab.JASS.20080416161812
11c11
< swap - /tmp tmpfs - yes size=512m
---
> swap - /tmp tmpfs - yes -
The script has done its job and added the size=512m option to the mount.
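The backup names seen so far (vfstab.JASS.20080416161812, coreadm.conf.JASS.20080416161705) appear to follow the pattern name.JASS.timestamp, with a timestamp of the form YYYYMMDDhhmmss. That pattern is inferred from the examples in this chapter, not taken from the toolkit documentation. Under this assumption, a small helper (the name jass_backup_info is made up) can split such a name into the original file and the time of the change. It uses parameter expansion as found in bash or ksh.

```shell
#!/bin/bash
# jass_backup_info FILE: print the original file name and the timestamp
# encoded in a backup name of the assumed form <name>.JASS.<YYYYMMDDhhmmss>.
jass_backup_info() {
    name=${1%.JASS.*}      # strip the ".JASS.<timestamp>" suffix
    stamp=${1##*.JASS.}    # keep only the timestamp
    when=$(printf '%s\n' "$stamp" |
        sed 's/^\(....\)\(..\)\(..\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:\5:\6/')
    printf '%s %s\n' "$name" "$when"
}

jass_backup_info vfstab.JASS.20080416161812
jass_backup_info coreadm.conf.JASS.20080416161705
```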
| and is advised that if such monitoring reveals possible         |
| evidence of criminal activity, system personnel may provide the |
| evidence of such monitoring to law enforcement officials.       |
|-----------------------------------------------------------------|
===========================================================================
You may have changed some files since using the toolkit, so the SST will ask you what it should do with those files. For example, I've changed the password of my account, so /etc/shadow has changed:
===========================================================================
undo.driver: Undoing Finish Script: set-flexible-crypt.fin
===========================================================================
[NOTE] Undoing operation COPY.
[WARN] Checksum of current file does not match the saved value.
[WARN] filename = /etc/shadow
[WARN] current = 8e27a3919334de7c1c5f690999c35be8
[WARN] saved = 86401b26a3cf38d001fdf6311496a48c

Select your course of action:
1. Backup   Save the current file, BEFORE restoring original.
2. Keep     Keep the current file, making NO changes.
3. Force    Ignore manual changes, and OVERWRITE current file.

NOTE: The following additional options are applied to this and ALL subsequent files:
4. ALWAYS Backup.
5. ALWAYS Keep.
6. ALWAYS Force.

Enter 1, 2, 3, 4, 5, or 6:
2
After this command the Solaris Security Toolkit has reverted all changes.
12.6. Conclusion
With the Solaris Security Toolkit you can deploy a certain baseline of security configurations to all your systems in an automatic manner. But it isn't limited to running once after the installation; you can run it as often as you want to ensure that you get back to a known secured state of your system after patching or reconfiguration. Doing this automatically gives you a big advantage: once you've developed your own driver for your site, nobody forgets to set a certain configuration, leaving open an attack vector you assumed to be closed. This tutorial demonstrated only a small subset of the capabilities of the toolkit. For example, you can integrate it into Jumpstart to automatically harden systems at installation, or you can use it to install a minimal patch cluster on each system where you execute the toolkit. So you should really dig into the documentation of this toolkit to explore all its capabilities.
13. Auditing
Solaris 10/Opensolaris
One of the less known features in Solaris is auditing. Auditing solves an important problem: what happens on my system, and whodunit? When something strange happens on your system, or you recognize that you are not the only one who owns your system, it's a good thing to have some logs for analysis. The nice thing about auditing in Solaris: it's quite simple to activate. In this article I will give you a short overview of how to enable and use auditing in Solaris. This feature is really old; it has been in Solaris since the last century, but nevertheless it's a less known Solaris feature.
Then go to /etc/security and edit the file /etc/security/audit_control. This file controls which classes of information are logged and where the log is written. For example, lo is the audit class for all events regarding logins and logoffs:
dir:/var/audit/aragorn-sol
flags:lo
minfree:20
naflags:lo
Okay, configuration is done. But let's have a look at another file, /etc/security/audit_startup. The commands in this script control the audit policies and thus the behavior of the logging and the amount of information in the log records:
/usr/bin/echo "Starting BSM services."
/usr/sbin/auditconfig -setpolicy +cnt
/usr/sbin/auditconfig -conf
/usr/sbin/auditconfig -aconf
The second line is the most interesting one. Without this line, the system would stop user interaction when it is unable to log. You would deactivate the cnt policy when logging is more important than system availability. For the moment we don't change this file.
bsmconv: INFO: initializing device allocation.
The Basic Security Module is ready.
If there were any errors, please fix them now.
Configure BSM by editing files located in /etc/security.
Reboot this system now to come up with BSM enabled.
# reboot
Okay, now you have completed the configuration. The system has started to write audit logs.
With this command the current file gets closed and a new one gets opened.
# cd /var/audit/aragorn-sol/
# ls -l
total 24
-rw-r----- 1 root root  684 Feb [...] 20080201223003.20080201225549.aragorn-sol
-rw-r----- 1 root root  571 Feb [...] 20080201225549.20080201230639.aragorn-sol
-rw-r----- 1 root root 2279 Feb [...] 20080201230834.20080201231010.aragorn-sol
-rw-r----- 1 root root  755 Feb [...] 20080201231010.20080201231245.aragorn-sol
-rw-r----- 1 root root 4274 Feb [...] 20080201231245.20080202073624.aragorn-sol
-rw-r----- 1 root root  200 Feb [...] 20080202073624.not_terminated.aragorn-sol
This sequence of commands translates all your audit logs into a human-readable form. I've cut out some of the lines as an example:
header,69,2,AUE_ssh,,localhost,2008-02-01 23:49:17.687 +01:00
subject,jmoekamp,jmoekamp,other,jmoekamp,other,720,3447782834,6969 5632 10.211.55.2
return,success,0
header,77,2,AUE_su,,localhost,2008-02-01 23:49:55.336 +01:00
subject,jmoekamp,root,other,jmoekamp,other,729,3447782834,6969 5632 10.211.55.2
text,root
return,failure,Authentication failed
header,69,2,AUE_su,,localhost,2008-02-01 23:50:11.311 +01:00
subject,jmoekamp,root,root,root,root,730,3447782834,6969 5632 10.211.55.2
return,success,0
What does this snippet tell you? I've logged into my system as the user jmoekamp, tried to assume root privileges, failed the first time (due to a wrong password), tried it again and succeeded.
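Reading such records by eye gets tedious quickly, so a filter can pick out the interesting ones. The sketch below prints every failed su attempt. It assumes the record layout of the excerpt above (comma-separated tokens, the event name in the fourth field of the header line, the user in the second field of the subject line); the function name failed_su is made up for this example.

```shell
#!/bin/sh
# failed_su: read audit records in the text form shown above on stdin and
# print time and user for every su attempt that ended in "return,failure".
failed_su() {
    awk -F, '
        $1 == "header"  { event = $4; when = $7 }   # a new record starts
        $1 == "subject" { user = $2 }               # user behind the event
        $1 == "return" && event == "AUE_su" && $2 == "failure" {
            print when ": failed su by " user
        }'
}

# Sample records copied from the excerpt above.
sample='header,69,2,AUE_ssh,,localhost,2008-02-01 23:49:17.687 +01:00
subject,jmoekamp,jmoekamp,other,jmoekamp,other,720,3447782834,6969 5632 10.211.55.2
return,success,0
header,77,2,AUE_su,,localhost,2008-02-01 23:49:55.336 +01:00
subject,jmoekamp,root,other,jmoekamp,other,729,3447782834,6969 5632 10.211.55.2
text,root
return,failure,Authentication failed
header,69,2,AUE_su,,localhost,2008-02-01 23:50:11.311 +01:00
subject,jmoekamp,root,root,root,root,730,3447782834,6969 5632 10.211.55.2
return,success,0'

printf '%s\n' "$sample" | failed_su
```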
The ex audit class matches all events in the system regarding the execution of a program. This tells the auditing subsystem to log all execve() system calls. But you have to signal this change to the audit subsystem to start the auditing of these events. With audit -s you notify the audit daemon to read the /etc/security/audit_control file again.
header,113,2,AUE_EXECVE,,localhost,2008-02-02 00:10:00.623 +01:00
path,/usr/bin/ls
attribute,100555,root,bin,26738688,1380,0
subject,jmoekamp,root,root,root,root,652,2040289354,12921 71168 10.211.55.2
return,success,0
But this configuration only logs the path of the command, not the command line parameters. You have to configure the logging of this information. You remember: the audit policy controls the kind of information in the audit logs. Thus we have to modify the audit policy. With the command auditconfig -setpolicy +argv you change the policy. You don't have to activate it; it's effective immediately:
header,124,2,AUE_EXECVE,,localhost,2008-02-02 00:12:49.560 +01:00
path,/usr/bin/ls
attribute,100555,root,bin,26738688,1380,0
exec_args,2,ls,-l
subject,jmoekamp,root,root,root,root,665,2040289354,12921 71168 10.211.55.2
return,success,0
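With the argv policy active, each execve record carries enough information to reconstruct who ran what. The following sketch condenses such a record into one line. It assumes the token order of the example above (path and exec_args before subject), and it will split arguments that themselves contain commas; the function name exec_log is made up for this example.

```shell
#!/bin/sh
# exec_log: read audit records in the text form shown above on stdin and
# print one line per AUE_EXECVE record: user, path and argument vector.
exec_log() {
    awk -F, '
        $1 == "header"    { event = $4; path = ""; args = "" }
        $1 == "path"      { path = $2 }
        # exec_args,<count>,<arg0>,<arg1>,... : collect everything after the count
        $1 == "exec_args" { for (i = 3; i <= NF; i++) args = args " " $i }
        $1 == "subject" && event == "AUE_EXECVE" {
            print $2 ": " path " [argv:" args "]"
        }'
}

# Sample record copied from the example above.
sample='header,124,2,AUE_EXECVE,,localhost,2008-02-02 00:12:49.560 +01:00
path,/usr/bin/ls
attribute,100555,root,bin,26738688,1380,0
exec_args,2,ls,-l
subject,jmoekamp,root,root,root,root,665,2040289354,12921 71168 10.211.55.2
return,success,0'

printf '%s\n' "$sample" | exec_log
```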
To make this behavior persistent, you have to add auditconfig -setpolicy +argv to the file
Apropos auditing: there is a small but cool tool in Solaris. It solves the problem of "No, I haven't changed anything on the system." It's called BART, the Basic Audit Reporting Tool. It's a really simple tool and it's really easy to use.
14.1. Usage
Okay, let's assume that after some days of work you have finally configured all components of your new system. Create a nice place to store the output of the bart tool. After this you start bart for the first time to create the first manifest of your system.
# mkdir /bart-files
# bart create -R /etc > /bart-files/etc.control.manifest
The manifest stores all information about the files. This is the entry for /etc/nsswitch.nisplus:
# cat etc.control.manifest | grep "/nsswitch.nisplus"
/nsswitch.nisplus F 2525 100644 user::rw-,group::r--,mask:r--,other:r-- 473976b5 0 3 79e8fd689a5221d1cd059e5077da71b8
Okay, enough changes. Let's create a new manifest of the changed /etc. Redirect it to a different file.
# bart create -R /etc > /bart-files/etc.check20080202.manifest
Now we can compare the baseline manifest with the current manifest.
# cd /bart-files
# bart compare etc.control.manifest etc.check20080202.manifest
This command prints all differences between the two manifests and thus the differences between the two states of the system:
/nsswitch.files:
  mode control:100644 test:100777
  acl control:user::rw-,group::r--,mask:r--,other:r-- test:user::rwx,group::rwx,mask:rwx,other:rwx
/nsswitch.nisplus:
  size control:2525 test:2538
  mtime control:473976b5 test:47a44862
  contents control:79e8fd689a5221d1cd059e5077da71b8 test:3f79176ec352441db11ec8a3d02ef67c
/thisisjustatest:
  add
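To make clear what kind of comparison is going on, here is a much simplified stand-in for bart compare. It is not how bart is implemented; it is a sketch that looks only at regular-file entries (type F) and assumes that the file name is the first field and the content hash the last field, as in the nsswitch.nisplus entry shown earlier. The function name manifest_diff is made up, and the attributes of /thisisjustatest in the sample are invented for the example.

```shell
#!/bin/sh
# manifest_diff CONTROL TEST: report regular files that were added, deleted
# or whose content hash changed between two manifests (simplified sketch).
manifest_diff() {
    awk '
        $2 != "F" { next }                          # skip headers, dirs, links
        FILENAME == ARGV[1] { control[$1] = $NF; next }   # baseline hashes
        { seen[$1] = 1 }
        !($1 in control)   { print $1 ": add"; next }
        control[$1] != $NF { print $1 ": contents changed" }
        END { for (f in control) if (!(f in seen)) print f ": delete" }
    ' "$1" "$2" | sort
}

dir=$(mktemp -d)
acl='user::rw-,group::r--,mask:r--,other:r--'
# Baseline entry taken from the manifest excerpt above.
echo "/nsswitch.nisplus F 2525 100644 $acl 473976b5 0 3 79e8fd689a5221d1cd059e5077da71b8" > "$dir/control"
# New state: size, mtime and hash from the compare output above;
# the line for /thisisjustatest uses made-up attributes.
{
    echo "/nsswitch.nisplus F 2538 100644 $acl 47a44862 0 3 3f79176ec352441db11ec8a3d02ef67c"
    echo "/thisisjustatest F 0 100644 $acl 47a44862 0 0 d41d8cd98f00b204e9800998ecf8427e"
} > "$dir/test"

manifest_diff "$dir/control" "$dir/test"
rm -rf "$dir"
```

Unlike the real tool, this sketch ignores mode, ACL, size and mtime, which is why a change like the one to /nsswitch.files above would not show up in it.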
15. IPsec
Solaris 10/Opensolaris
15.2. Foundations
This will be the only article in this series without an explanation of the technologies. I could write for days about them and still leave out important things. Instead, please look at the links section at the end of the article. Only some short words about this topic: Encryption is essential in networking. The Internet is an inherently insecure medium. You have to assume that you aren't talking to the right person as long as you haven't authenticated them, and you have to assume that the data will be read by someone as long as you don't encrypt it. IPsec solves these problems. I won't tell you that IPsec is an easy protocol. The stuff around IPsec is defined in several RFCs and the documentation is rather long. The encryption itself isn't the complex part. But you need a key to encrypt your data, and that is where the complexity starts. It's absolutely essential that this key stays secret. How do you distribute keys through an inherently insecure transport channel? And the next problem: how do you negotiate the encryption mechanism? And how do you ensure that you talk with the right system? Problems ... problems ... problems! IPsec solves these problems. IPsec isn't a single protocol. It's a suite of protocols that includes the Internet Security Association and Key Management Protocol (ISAKMP); the protocol for Internet Key Exchange (IKE) builds on it, and IKE in turn is based on the Oakley protocol. And there is a whole wad of further protocols, mechanisms and algorithms to ensure secure communication.
15.4. Example
The task for this example is to secure all traffic between two hosts. I've used two VMs with Solaris Express Build 78 for this configuration. (You need at least Solaris 10 Update 4 for this tutorial. In this update the ipsecconf command was introduced, making the configuration much easier.) theoden has the IP address 10.211.55.200. gandalf has the IP address 10.211.55.201. I don't want to use manual keying; instead the example will use self-signed certificates.
Look at the login prompt in the examples. It designates the system you have to work on. Okay ... at first you have to ensure that the names of the systems can be resolved. It's good practice to put the names of the systems into /etc/hosts:
::1 localhost loghost
127.0.0.1 localhost loghost
10.211.55.201 gandalf
10.211.55.200 theoden
Okay, we don't want manual keying or some stinking preshared keys. Thus we need to create keys. Log in to gandalf and assume the root role:
[root@gandalf:~]$ ikecert certlocal -ks -m 1024 -t rsa-md5 -D "C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=gandalf" -A IP=10.211.55.201
Creating private key.
Certificate added to database.
-----BEGIN X509 CERTIFICATE-----
MIICOzCCAaSgAwIBAgIFAJRpUUkwDQYJKoZIhvcNAQEEBQAwTzELMAkGA1UEBhMC
[ ... some lines omitted ... ]
oi4dO39J7cSnooqnekHjajn7ND7T187k+f+BVcFVbSenIzblq2P0u7FIgIjdlv0=
-----END X509 CERTIFICATE-----
You need the output of these commands later, so paste it into a text editor or save it in a safe place ...
[root@gandalf:~]$ echo "{laddr gandalf raddr theoden} ipsec {auth_algs any encr_algs any sa shared}" >> /etc/inet/ipsecinit.conf
This translates to: when I'm speaking to theoden, I have to encrypt the data and can use any negotiated and available encryption algorithm and any negotiated and available authentication algorithm. Such a rule is only valid in one direction. Thus we have to define the opposite direction on the other host to enable bidirectional traffic:
[root@theoden:~]$ echo "{laddr theoden raddr gandalf} ipsec {auth_algs any encr_algs any sa shared}" >> /etc/inet/ipsecinit.conf
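Because each rule covers only one direction, the two entries differ only in the order of the host names. A small helper makes this symmetry explicit; the function name ipsec_rule is made up for this sketch, and it only prints the entries, it doesn't touch /etc/inet/ipsecinit.conf.

```shell
#!/bin/sh
# ipsec_rule LOCAL REMOTE: print the policy entry for traffic from LOCAL to
# REMOTE in the form used in the two echo commands above.
ipsec_rule() {
    printf '{laddr %s raddr %s} ipsec {auth_algs any encr_algs any sa shared}\n' "$1" "$2"
}

ipsec_rule gandalf theoden    # entry for the file on gandalf
ipsec_rule theoden gandalf    # entry for the file on theoden
```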
Okay, the next configuration file is a little bit more complex. Go into the directory /etc/inet/ike and create a file config with the following content:
cert_trust "10.211.55.200"
cert_trust "10.211.55.201"

p1_xform
  { auth_method preshared oakley_group 5 auth_alg sha encr_alg des }
p2_pfs 5

{
  label "DE-theoden to DE-gandalf"
  local_id_type dn
  local_id "C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=theoden"
  remote_id "C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=gandalf"

  local_addr 10.211.55.200
  remote_addr 10.211.55.201

  p1_xform
    { auth_method rsa_sig oakley_group 2 auth_alg md5 encr_alg 3des }
}
This looks complex, but once you've understood it, it's quite easy:
cert_trust "10.211.55.200"
cert_trust "10.211.55.201"
We use self-signed certificates. The certificate isn't signed by an independent certification authority, so there is no automatic method to trust the certificate. You have to configure the iked explicitly to trust these certificates. These two lines tell the iked to trust the certificates with the alternate names 10.211.55.200 and 10.211.55.201. Where did these alternate names come from? You set them! Look at the command line for creating the certificate. You defined this name by using the -A switch.
label "DE-gandalf to DE-theoden"
local_id_type dn
local_id "C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=gandalf"
remote_id "C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=theoden"
Now you define a key exchange. You have to give each connection a unique name. After this you define what part of the certificate is used to authenticate the remote system. In this example we use the distinguished name. The local system identifies itself with the certificate named C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=gandalf to a remote system, and expects a trusted certificate with the distinguished name C=de, O=moellenkamp, OU=moellen
local_addr 10.211.55.200
remote_addr 10.211.55.201

Now the iked knows the IP addresses of the local and the remote host.

p1_xform
  { auth_method rsa_sig oakley_group 2 auth_alg md5 encr_alg 3des }
After defining the authentication credentials, we have to define how the systems should communicate. This line means: use the certificates to authenticate the other system. The key determination protocol is based on a prime number. We use md5 as the underlying algorithm for authentication and 3des for encryption. This is the part where you configure the methods for authentication and encryption. They have to be the same on both hosts; otherwise the hosts won't be able to negotiate a common denominator, and you won't be able to communicate between the two hosts at all. Now we do the same on the other system.
[root@gandalf:/etc/inet/ike]$ cd /etc/inet/ike
[root@gandalf:/etc/inet/ike]$ cat config
cert_trust "10.211.55.200"
cert_trust "10.211.55.201"

p1_xform
  { auth_method preshared oakley_group 5 auth_alg sha encr_alg des }
p2_pfs 5

{
  label "DE-gandalf to DE-theoden"
  local_id_type dn
  local_id "C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=gandalf"
  remote_id "C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=theoden"

  local_addr 10.211.55.201
  remote_addr 10.211.55.200

  p1_xform
    { auth_method rsa_sig oakley_group 2 auth_alg md5 encr_alg 3des }
}
Obviously you have to swap the addresses for the local and remote system, and you have to assign a unique label to the rule. Okay, we are almost done. But there is still a missing but very essential step when you want to use certificates: we have to distribute the certificates of the systems.
[root@gandalf:/etc/inet/ike]$ ikecert certdb -l
Certificate Slot Name: 0   Key Type: rsa
        (Private key in certlocal slot 0)
        Subject Name: <C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=gandalf>
        Key Size: 1024
        Public key hash: 28B08FB404268D144BE70DDD652CB874
At the beginning there is only the local key in the system. We have to import the key of the remote system. Do you remember the output beginning with -----BEGIN X509 CERTIFICATE----- and ending with -----END X509 CERTIFICATE-----? You need this output now. The next command won't come back after you hit return. You have to paste in the key. On gandalf you paste the output of the key generation on theoden. On theoden you paste the output of the key generation on gandalf. Let's import the key on gandalf:
    [root@gandalf:/etc/inet/ike]$ ikecert certdb -a
    -----BEGIN X509 CERTIFICATE-----
    MIICOzCCAaSgAwIBAgIFAIRuR5QwDQYJKoZIhvcNAQEEBQAwTzELMAkGA1UEBhMC
    [...]
    UHJ4P6Z0dtjnToQb37HNq9YWFRguSsPQvc/Lm+S9cJCLwINVg7NOXXgnSfY3k+Q=
    -----END X509 CERTIFICATE-----
After pasting, you have to hit Enter once, and after this you press Ctrl-D once. Now we check for the successful import. You will see two certificates now.
    [root@gandalf:/etc/inet/ike]$ ikecert certdb -l
    Certificate Slot Name: 0   Key Type: rsa
            (Private key in certlocal slot 0)
            Subject Name: <C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=gandalf>
            Key Size: 1024
            Public key hash: 28B08FB404268D144BE70DDD652CB874

    Certificate Slot Name: 1   Key Type: rsa
            Subject Name: <C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=theoden>
            Key Size: 1024
            Public key hash: 76BE0809A6CBA5E06219BC4230CBB8B8

Okay, switch to theoden and import the key from gandalf on this system.

    [root@theoden:/etc/inet/ike]$ ikecert certdb -l
    Certificate Slot Name: 0   Key Type: rsa
            (Private key in certlocal slot 0)
            Subject Name: <C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=theoden>
            Key Size: 1024
            Public key hash: 76BE0809A6CBA5E06219BC4230CBB8B8

    [root@theoden:/etc/inet/ike]$ ikecert certdb -a
    -----BEGIN X509 CERTIFICATE-----
    MIICOzCCAaSgAwIBAgIFAJRpUUkwDQYJKoZIhvcNAQEEBQAwTzELMAkGA1UEBhMC
    [...]
    oi4dO39J7cSnooqnekHjajn7ND7T187k+f+BVcFVbSenIzblq2P0u7FIgIjdlv0=
    -----END X509 CERTIFICATE-----

    [root@theoden:/etc/inet/ike]$ ikecert certdb -l
    Certificate Slot Name: 0   Key Type: rsa
            (Private key in certlocal slot 0)
            Subject Name: <C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=theoden>
            Key Size: 1024
            Public key hash: 76BE0809A6CBA5E06219BC4230CBB8B8

    Certificate Slot Name: 1   Key Type: rsa
            Subject Name: <C=de, O=moellenkamp, OU=moellenkamp-vpn, CN=gandalf>
            Key Size: 1024
            Public key hash: 28B08FB404268D144BE70DDD652CB874
and
    [root@theoden:/etc/inet/ike]$ svcadm enable ike
    [root@theoden:/etc/inet/ike]$ ipsecconf -a /etc/inet/ipsecinit.conf
Okay ... now log in to gandalf and start some pings to theoden:

    [root@gandalf:~]$ ping theoden; ping theoden; ping theoden
    theoden is alive
    theoden is alive
    theoden is alive
Okay, theoden can speak with gandalf and vice versa. But is it encrypted? In the terminal window for theoden the following output should be printed:
    SPI=0x1d2c0e88
    SPI=0x84599293
    SPI=0x1d2c0e88
    SPI=0x84599293
    SPI=0x1d2c0e88
    SPI=0x84599293
One of the problems in computer security is the validation of binaries: is this the original binary, or is it a counterfeit? Since Solaris 10, Sun electronically signs the binaries of the Solaris Operating Environment. You can check the signature of the binaries with the elfsign tool.
    [root@gandalf:/etc]$ elfsign verify -v /usr/sbin/ifconfig
    elfsign: verification of /usr/sbin/ifconfig passed.
    format: rsa_md5_sha1.
    signer: CN=SunOS 5.10, OU=Solaris Signed Execution, O=Sun Microsystems Inc.
Obviously you have to trust elfsign itself. But you can check it when you boot the system from trusted media (like an original media kit or a checksum-validated ISO image). This enables you to check the signature of elfsign independently from the system. By the way: this certificate and the signature are very important for crypto modules. The crypto framework of Solaris only loads modules signed by Sun, to prevent the usage of malicious modules (for example, ones that read out the key store and send it somewhere) in the framework.
17. On passwords
Solaris 10/Opensolaris
The password is the key to the system. When you know the username and the password, you can use the system. If not ... well ... go away. You can't overemphasize the value of good passwords. There is something you can do as the admin at the system level to ensure such passwords. First, you can use stronger mechanisms to hash the passwords. Second, you can force the users of your system to use good passwords. This tutorial will explain both tasks.
Now let's try a password that's different at the ninth character by logging into the Solaris system remotely:

    mymac:~ joergmoellenkamp$ ssh jmoekamp@10.211.55.200
    Password: aa3456780
    Last login: Wed May 28 11:24:05 2008 from 10.211.55.2
    Sun Microsystems Inc.   SunOS 5.11      snv_84  January 2008
I've told you ... only the first eight characters are relevant. But it's not that Solaris can't do better than that. It's just the binary compatibility guarantee again. You can't simply change the mechanism encrypting the password: there may be scripts that still need the old unix crypt variant. But in case you are sure that you don't have such an application, you can change it, and it's really simple to do:
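The effect of the eight-character limit can be modelled in a few lines. This is a toy sketch only: it does no hashing at all and simply mimics the fact that the traditional algorithm ignores everything after the eighth character. The password aa3456789 is a hypothetical stand-in for the real one, and the function names are made up for this illustration.

```python
# Toy model of the legacy behaviour: only the first eight characters of a
# password take part in the comparison. No real hashing happens here.

LEGACY_SIGNIFICANT_CHARS = 8

def legacy_view(password: str) -> str:
    """Return the part of the password the legacy algorithm actually uses."""
    return password[:LEGACY_SIGNIFICANT_CHARS]

def indistinguishable(a: str, b: str) -> bool:
    """True when the legacy algorithm cannot tell the two passwords apart."""
    return legacy_view(a) == legacy_view(b)

real_password = "aa3456789"   # hypothetical password set by the user
typed_password = "aa3456780"  # differs only at the ninth character

print(indistinguishable(real_password, typed_password))
```

A password that differs within the first eight characters would still be rejected; only the tail is ignored.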
When you look into the file /etc/security/crypt.conf you will find the additional modules for password encryption.
    # The algorithm name __unix__ is reserved.
    1       crypt_bsdmd5.so.1
    2a      crypt_bsdbf.so.1
    md5     crypt_sunmd5.so.1
The hashing mechanisms are loaded as libraries in the so-called Solaris Pluggable Crypt Framework. It's even possible to develop your own crypt mechanism in case you don't trust the implementations delivered by Sun.
Table 17.1.: Cryptographic mechanisms for password encryption

Short | Algorithm | Description
1 | BSD alike, md5 based | The crypt_bsdmd5 module is a one-way password hashing module for use with crypt(3C) that uses the MD5 message hash algorithm. The output is compatible with md5crypt on BSD and Linux systems.
2a | BSD alike, blowfish based | The crypt_bsdbf module is a one-way password hashing module for use with crypt(3C) that uses the Blowfish cryptographic algorithm.
md5 | Sun, md5 based | The crypt_sunmd5 module is a one-way password hashing module for use with crypt(3C) that uses the MD5 message hash algorithm. This module is designed to make it difficult to crack passwords that use brute force attacks based on high speed MD5 implementations that use code inlining, unrolled loops, and table lookup.
Each of the three mechanisms supports passwords with up to 255 characters. It's important to know that the different hashing algorithms can coexist in your password databases. The hashing of a password is changed when the user changes his or her password.
It's simple to enable a different encryption algorithm for passwords. You just have to change one line in /etc/security/policy.conf. To edit this file you have to log in as root:

    CRYPT_DEFAULT=md5
When you look in /etc/shadow for the user, you will see a slightly modified password field. It's much longer, and between the first and the second $ you see the encryption mechanism used:
    # grep "jmoekamp" /etc/shadow
    jmoekamp:$md5$vyy8.OVF$$FY4TWzuauRl4.VQNobqMY.:14027::::::
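Reading the mechanism out of a shadow entry is a matter of splitting the password field at the $ characters. The sketch below assumes the simple layout shown above ($id$salt$...) and maps the identifier to the module names from crypt.conf; the function name and the legacy sample line are made up for this example, and parameterised identifiers are not handled.

```python
# Identify the hashing module from a shadow entry. The mapping mirrors the
# /etc/security/crypt.conf listing in the text.

CRYPT_MODULES = {
    "1": "crypt_bsdmd5",
    "2a": "crypt_bsdbf",
    "md5": "crypt_sunmd5",
}

def hash_module(shadow_line: str) -> str:
    """Return the crypt module that produced the password field."""
    password_field = shadow_line.split(":")[1]
    if not password_field.startswith("$"):
        # No $id$ prefix: the traditional algorithm, reserved name __unix__.
        return "__unix__"
    ident = password_field.split("$")[1]  # text between first and second $
    return CRYPT_MODULES.get(ident, "unknown")

new_style = "jmoekamp:$md5$vyy8.OVF$$FY4TWzuauRl4.VQNobqMY.:14027::::::"
old_style = "jmoekamp:abCDefGHijKLm:14027::::::"  # hypothetical legacy entry

print(hash_module(new_style))
print(hash_module(old_style))
```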
Alec Muffett wrote about the development of this hash mechanism in http://www.crypticide.com/dropsafe/article/1389

You see, the correctness of the complete password is tested, not just the first 8 characters.
One important note for trying out this feature: you need to log into your system as a normal user in a different window. root can set any password without a check by the password policy, so it would look as if your configuration changes had no effect.

You enable a check by uncommenting its line and setting a reasonable value. When you enable all the checks, it's actually harder to find a valid password than an invalid one. Whenever you think about a really hard password policy, you should take into consideration that people tend to make notes about their password when they can't remember it. And a strong password under the keyboard is obviously less secure than a weak password in the head of the user.
Table 17.2.: /etc/default/passwd: standard checks

Parameter | Description
MAXWEEKS | This variable specifies the maximum age for a password.
MINWEEKS | This variable specifies the minimum age for a password. The rationale for this setting gets clearer when I talk about the HISTORY setting.
PASSLENGTH | The minimum length for a password.
HISTORY | This variable specifies the length of a history buffer. You can specify a length of up to 26 passwords in the buffer. The MINWEEKS parameter is useful in conjunction with this parameter. There is a trick to circumvent this buffer and to get your old password back: just change it as often as the length of the buffer plus one time. The MINWEEKS parameter prevents this.
WHITESPACE | This variable defines if you are allowed to use whitespace in your password.
NAMECHECK | When you set this variable to YES, the system checks if the password and login name are identical. So using the password root for the user root would be denied by this setting. The default, by the way, is YES.
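The trick mentioned for HISTORY is easy to see in a small simulation. This is a toy model, not the Solaris implementation: it assumes the buffer simply remembers the last HISTORY passwords that were replaced, and the class and function names are invented for the illustration.

```python
from collections import deque

class PasswordHistory:
    """Toy history buffer that remembers the last `history_len` old passwords."""

    def __init__(self, current: str, history_len: int) -> None:
        self.current = current
        self.old = deque(maxlen=history_len)  # oldest entry falls out first

    def change(self, new: str) -> bool:
        """Try to set a new password; refuse reuse of remembered ones."""
        if new == self.current or new in self.old:
            return False
        self.old.append(self.current)
        self.current = new
        return True

def throwaway_changes_needed(history_len: int) -> int:
    """Count the dummy changes a user needs before the favourite works again."""
    account = PasswordHistory("favourite", history_len)
    count = 0
    while not account.change("favourite"):
        count += 1
        account.change(f"throwaway{count}")
    return count

print(throwaway_changes_needed(3))
```

In this model a buffer of length n is defeated by n + 1 throwaway changes. With MINWEEKS set, each of those changes would have to wait for the minimum age to pass, which is what makes the trick impractical.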
Besides these basic checks, you can use /etc/default/passwd to enforce checks for the complexity of passwords. This way you can prevent users from setting passwords that are too simple.
Table 17.3.: /etc/default/passwd: complexity checks

Parameter | Description
MINDIFF | Let's assume you've used 3 here. If your old password was batou001, a new password batou002 would be denied, as only one character was changed. batou432 would be a valid password.
MINUPPER | With this variable you can force the usage of upper case characters. Let's assume you've specified 3 here: a password like wasabi isn't an allowed choice, but you could use WaSaBi.
MINLOWER | With this variable you enable the check for the number of lower case characters in your password. In case you've specified 2 here, a password like WASABI isn't allowed, but you can use WaSaBI.
MAXREPEATS | Okay, some users try to use passwords like aaaaaa2. Obviously this isn't really a strong password. When you set this variable to 2, it checks that at most 2 consecutive characters are identical. A password like waasabi would be allowed, but not a password like waaasabi.
MINSPECIAL | The class SPECIAL consists of characters like !\=. Let's assume you've specified 2: a password like !ns!st would be fine, but the password insist is not a valid choice.
MINDIGIT | With this variable you can specify the number of digits in your password. Let's assume you specify 2: a password like snafu01 will be allowed. A password like snafu1 will be denied.
MINALPHA | With this variable you can check for a minimum number of alpha characters (a-z and A-Z). When you set a value of 2 on this variable, a password like aa23213 would be allowed, a password like 0923323 would be denied.
MINNONALPHA | This checks for the number of non-alpha characters (0-9 and special characters). A value of 2 would lead to the denial of wasabi, but would allow a password like w2sab!
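To make the complexity parameters more tangible, here is a minimal sketch that re-implements them in a simplified way and can be tried against the examples from the table. It is not the Solaris code: the character classes are plain ASCII, MINDIFF is assumed to count position-wise differences, and a value of 0 switches a check off. All function names are invented for this illustration.

```python
import string

def longest_run(password: str) -> int:
    """Length of the longest run of identical consecutive characters."""
    longest = run = 0
    previous = None
    for ch in password:
        run = run + 1 if ch == previous else 1
        longest = max(longest, run)
        previous = ch
    return longest

def violations(new, old="", *, mindiff=0, minupper=0, minlower=0,
               maxrepeats=0, minspecial=0, mindigit=0, minalpha=0,
               minnonalpha=0):
    """Return the names of all checks the new password fails."""
    upper = sum(c in string.ascii_uppercase for c in new)
    lower = sum(c in string.ascii_lowercase for c in new)
    digit = sum(c in string.digits for c in new)
    alpha = upper + lower
    nonalpha = len(new) - alpha
    special = nonalpha - digit
    # Assumption: differences are counted position by position.
    diff = sum(a != b for a, b in zip(old, new)) + abs(len(old) - len(new))

    failed = []
    if mindiff and old and diff < mindiff:
        failed.append("MINDIFF")
    if upper < minupper:
        failed.append("MINUPPER")
    if lower < minlower:
        failed.append("MINLOWER")
    if maxrepeats and longest_run(new) > maxrepeats:
        failed.append("MAXREPEATS")
    if special < minspecial:
        failed.append("MINSPECIAL")
    if digit < mindigit:
        failed.append("MINDIGIT")
    if alpha < minalpha:
        failed.append("MINALPHA")
    if nonalpha < minnonalpha:
        failed.append("MINNONALPHA")
    return failed

print(violations("batou002", old="batou001", mindiff=3))
print(violations("waaasabi", maxrepeats=2))
print(violations("w2sab!", minnonalpha=2))
```

An empty list means the password passes every enabled check.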
The file /usr/share/lib/dicts/words is a file in the Solaris Operating System containing a list of words. It's normally used by spell checking tools. Obviously you should use a wordlist in your own language, as users tend to choose words from their own language as passwords. So an English wordlist in Germany may not be that effective. Now you have to tell Solaris to use these lists.
Table 17.4.: /etc/default/passwd: dictionaries

Parameter | Description
DICTIONLIST | This variable can contain a list of dictionary files separated by commas. You must specify full pathnames. The words from these files are merged into a database that is used to determine whether a password is based on a dictionary word.
DICTIONDBDIR | The directory where the generated dictionary databases reside.
When neither of the two variables is specified in /etc/default/passwd, no dictionary check is performed. Let's try it. I've uncommented the DICTIONDBDIR line of the /etc/default/passwd file and used the standard value /var/passwd. One of the words in the dictionary I imported is the word airplane:

    $ passwd
    passwd: Changing password for jmoekamp
    Enter existing login password: chohw!2
17.3. Conclusion
These are some simple tricks to make your system more secure, just by ensuring that the keys to your server are well chosen and not simple ones. But as I stated before, there is something you should keep in mind: don't make the passwords too hard to remember.
18. pfexec
Solaris 10/Opensolaris
The Role-Based Access Control (RBAC) scheme in the OpenSolaris OS, Sun's open-source project for the Solaris OS, offers rights profiles. A rights profile is defined as "a collection of administrative capabilities that can be assigned to a role or to a user" in the RBAC and Privileges tutorial. Rights profiles can contain authorizations, commands with security attributes, and other rights profiles - a convenient way of grouping security attributes. With RBAC, you as the system administrator first create profiles and then assign them to roles. Those two tasks are beyond the scope of this article. Finally, you grant users the ability to assume the roles. You can also assign a profile directly to a user - the method described later in this article. Afterwards, that user can perform the tasks that are defined by the rights profile - even execute root commands without having to log in as superuser. All that person needs to do is prepend the utility pfexec to the commands. In effect, pfexec functions as a passwordless su or sudo in Linux.

This article shows you how to delegate administration tasks and assign root capabilities to regular users by way of rights profiles. It is assumed that you are familiar with RBAC concepts and commands in the OpenSolaris OS and have read the RBAC and Privileges tutorial referenced above.

Note: pfexec has been available on the Solaris OS since Solaris 8 and Trusted Solaris 8 and is not unique to the OpenSolaris OS.
However, you can add a profile with the share right for testuser. Do the following:

1. As root, create a new user and assign a password to that person. In the OpenSolaris OS, the first user you create is already assigned to a profile that allows that person to perform all the root tasks. Following are the related commands and output:
    # mkdir -p /export/home
    # useradd -m -d /export/home/testuser testuser
    80 blocks
    # passwd testuser
    New Password:
    Re-enter new Password:
    passwd: password successfully changed for testuser
2. Log out and log in again as the new user.

3. Look for a matching profile for the share command in the exec_attr file. Here are the command and subsequent output, which shows a match in File System Management:
    $ grep "share" /etc/security/exec_attr
    File System Management:suser:cmd:::/usr/sbin/dfshares:euid=0
    File System Management:suser:cmd:::/usr/sbin/share:uid=0;gid=root
    File System Management:suser:cmd:::/usr/sbin/shareall:uid=0;gid=root
    File System Management:suser:cmd:::/usr/sbin/sharemgr:uid=0;gid=root
    File System Management:suser:cmd:::/usr/sbin/unshare:uid=0;gid=root
    File System Management:suser:cmd:::/usr/sbin/unshareall:uid=0;gid=root
    [...]
4. Become root and assign the File System Management rights profile to testuser. Here are the commands and subsequent output:

    $ su root
    Password:
    # usermod -P "File System Management" testuser
    UX: usermod: testuser is currently logged in, some changes may not take effect until next login.
In this case, despite the warning that some changes only take effect after the user logs out, the user need not do so. The change is effective immediately. Voila! testuser can now share and unshare directories by prepending pfexec to the share command without becoming superuser:

    $ pfexec /usr/sbin/share /export/home/testuser
    $ /usr/sbin/share
    -       /export/home/testuser   rw   ""
That means all the commands that are executed with this profile with pfexec prepended are run with uid=0 and gid=0, hence with root privileges. Granting users the root capabilities through the Primary Administrator rights profile has several advantages:

- You need not reveal the root password to the users.
- To withdraw a user's root privilege, simply delete the Primary Administrator profile from the user setup; there is no need to set a new root password.
- Users with the Primary Administrator rights profile can set up a root shell and need not prepend root commands with pfexec afterwards.

See this example:

1. As root, assign the Primary Administrator profile to testuser. Here are the commands and subsequent output:

    # usermod -P "Primary Administrator" testuser
    UX: usermod: testuser is currently logged in, some changes may not take effect until next login.
2. As a test, log in as testuser and execute the id -a command twice: once without pfexec and once with pfexec. Note the output:
    $ id -a
    uid=100(testuser) gid=1(other) groups=1(other)
    $ pfexec id -a
    uid=0(root) gid=0(root) groups=1(other)
Without pfexec, testuser's uid and gid values are those of testuser and other, respectively - that is, nonroot. With pfexec, uid and gid assume the value of root. To avoid having to type pfexec over and over again, testuser can set up a root Bash shell on his system, like this:

    $ pfexec bash
    bash-3.2# id
    uid=0(root) gid=0(root)
Afterwards, uid and gid revert to the normal values of the user, as reflected in the output of the id -a command when executed under the control of pfexec. Without an assigned profile, pfexec yields no additional privileges to the user:

    $ pfexec id -a
    uid=100(testuser) gid=1(other) groups=1(other)
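The lookup that was done with grep in the exec_attr file can also be sketched as a small parser. The example below is only an illustration of the colon-separated layout visible in the grep output (profile, policy, type, two reserved fields, command, attributes); the field names and helper functions are chosen for this sketch, and the sample lines are copied from the listing rather than read from /etc/security/exec_attr.

```python
# Lines copied from the grep output shown earlier (whitespace normalised).
EXEC_ATTR = """\
File System Management:suser:cmd:::/usr/sbin/dfshares:euid=0
File System Management:suser:cmd:::/usr/sbin/share:uid=0;gid=root
File System Management:suser:cmd:::/usr/sbin/unshare:uid=0;gid=root
"""

def parse_exec_attr(text: str) -> list:
    """Split each entry into its seven colon-separated fields."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, policy, kind, _res1, _res2, ident, attr = line.split(":", 6)
        # Attributes look like "uid=0;gid=root".
        attrs = dict(pair.split("=", 1) for pair in attr.split(";") if pair)
        entries.append({"profile": name, "policy": policy, "type": kind,
                        "id": ident, "attrs": attrs})
    return entries

def profiles_granting(command: str, entries: list) -> list:
    """Names of all rights profiles that list the given command."""
    return sorted({e["profile"] for e in entries if e["id"] == command})

entries = parse_exec_attr(EXEC_ATTR)
print(profiles_granting("/usr/sbin/share", entries))
print(entries[1]["attrs"])
```

The attributes of the matching entry show with which uid and gid pfexec runs the command for a user who holds that profile.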
18.4. Conclusion
Again, passwordless pfexec is the OpenSolaris version of Linux's sudo. Assigning and revoking the root capabilities through the Primary Administrator profile is simple and straightforward. In addition, you can monitor user actions by logging the executions of pfexec with the auditing subsystem of the Solaris OS. By default, the user account that you create while installing OpenSolaris 2008.11, the latest version of the OS, is automatically assigned the Primary Administrator rights profile even though the user cannot directly log in as root. Subsequently, that user can run pfexec with the root privilege - a big convenience to all!
19. Crossbow
Opensolaris
19.1. Introduction
At the moment ZFS, DTrace and Zones are the well-known features of Solaris. But in my opinion there will be a fourth important feature soon. Since Build 105 it's integrated into Opensolaris (many people will already know which feature I want to describe in this article). This feature has the project name Crossbow. It's the new TCP/IP stack of Opensolaris and was developed with virtualisation in mind from the ground up. "Virtualisation in mind" does not only lead to the concept of virtual network interface cards: you can configure virtual switches as well, and even more important, you can control the resource usage of a virtual client or manage the load by distributing certain TCP/IP traffic to dedicated CPUs. I've already given some talks about Crossbow at different events, so it's time to write an article about this topic. I will start with the virtualisation part of Crossbow.
19.2. Virtualisation
This part is heavily inspired by this blog entry of Ben Rockwood [1], but he omitted some parts in the course of his article. To make a full walk-through out of it, I extended it a little bit. Normally a network consists of switches and networking cards, servers and routers. It's easy to replicate this in a single system. Networking cards can be simulated by VNICs; switches are called etherstubs in the namespace of Crossbow. Servers can be simulated by zones, of course, and as routers are not much more than special-purpose servers, we can simulate them by zones as well.

[1] http://www.cuddletech.com/blog/pivot/entry.php?id=1001
Okay, now we create virtual NICs that are bound to the virtual switch etherstub0. These virtual NICs are called vnic1 and vnic2.

    # dladm create-vnic -l etherstub0 vnic1
    # dladm create-vnic -l etherstub0 vnic2
Yes, that's all ... but what can we do with it? For example, simulate a complete network in your system. Let's create a setup with two networks, a router with a firewall and NAT, and a server in each of the networks. Obviously we will use zones for this.

A template zone

At first we create a template zone. This zone is just used for speeding up the creation of other zones. To enable zone creation based on ZFS snapshots, we have to create a filesystem for our zones and mount it at a nice position in your filesystem:

    # zfs create rpool/zones
    # zfs set compression=on rpool/zones
    # zfs set mountpoint=/zones rpool/zones
Now we prepare a command file for the zone creation. It's pretty much the standard for a sparse root zone. We don't configure any network interfaces, as we never boot or use this zone. It's just a template, as the name already states. So at first we create a file called template in a working directory. All the following steps assume that you are in this directory, as I won't use absolute paths.

    create -b
    set zonepath=/zones/template
    set ip-type=exclusive
    set autoboot=false
    add inherit-pkg-dir
    set dir=/lib
    end
    add inherit-pkg-dir
    set dir=/platform
    end
    add inherit-pkg-dir
    set dir=/sbin
    end
    add inherit-pkg-dir
    set dir=/usr
    end
    add inherit-pkg-dir
    set dir=/opt
    end
    commit
Now we create the zone. Depending on your test equipment this will take some time.

    # zonecfg -z template -f template
    # zoneadm -z template install
    A ZFS file system has been created for this zone.
    Preparing to install zone <template>.
    Creating list of files to copy from the global zone.
    Copying <3488> files to the zone.
    Initializing zone product registry.
    Determining zone package initialization order.
    Preparing to initialize <1507> packages on the zone.
    Initialized <1507> packages on zone.
    Zone <template> is initialized.
    The file </zones/template/root/var/sadm/system/logs/install_log> contains a log of the zone installation.
Got a coffee? The next installations will be much faster. We will not boot the template zone, as we don't need it for our testbed.

site.xml

While waiting for the zone installation to end, we can create a few other files. At first you should create a file called site.xml. This file controls which services are online after the first boot. You can think of it like a sysidcfg for the Service Management Framework. The file is rather long, so I won't post it in the article directly. You can download my version of this file here [2].

Zone configurations for the testbed

At first we have to create the zone configurations. The files are very similar. The differences are in the zonepath and in the network configuration. The zone servera is located in /zones/serverA and uses the network interface vnic2. This file is called serverA:
    create -b
    set zonepath=/zones/serverA
    set ip-type=exclusive
    set autoboot=false
    add inherit-pkg-dir
    set dir=/lib
    end
    add inherit-pkg-dir
    set dir=/platform
    end
    add inherit-pkg-dir
    set dir=/sbin
    end
    add inherit-pkg-dir
    set dir=/usr
    end
    add inherit-pkg-dir
    set dir=/opt
    end
    add net
    set physical=vnic2
    end
    commit

[2] http://www.c0t0d0s0.org/pages/sitexml.html
The zone serverb uses the directory /zones/serverB and is configured to bind to the interface vnic4. Obviously I've named the configuration file serverB:

    create -b
    set zonepath=/zones/serverB
    set ip-type=exclusive
    set autoboot=false
    add inherit-pkg-dir
    set dir=/lib
    end
    add inherit-pkg-dir
    set dir=/platform
    end
    add inherit-pkg-dir
    set dir=/sbin
    end
    add inherit-pkg-dir
    set dir=/usr
    end
    add inherit-pkg-dir
    set dir=/opt
    end
    add net
    set physical=vnic4
    end
    commit
We have created both config files for the simulated servers. Now we do the same for our simulated router. The configuration of the router zone is a little bit longer, as we need more network interfaces. I opened a file called router and filled it with the following content:
    create -b
    set zonepath=/zones/router
    set ip-type=exclusive
    set autoboot=false
    add inherit-pkg-dir
    set dir=/lib
    end
    add inherit-pkg-dir
    set dir=/platform
    end
    add inherit-pkg-dir
    set dir=/sbin
    end
    add inherit-pkg-dir
    set dir=/usr
    end
    add inherit-pkg-dir
    set dir=/opt
    end
    add net
    set physical=vnic5
    end
    add net
    set physical=vnic1
    end
    add net
    set physical=vnic3
    end
    commit
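As the command files differ only in the zonepath and the network interfaces, they could also be generated instead of typed. The following is a minimal sketch of such a generator; the function name zone_command_file is made up for this example, and the output simply reproduces the directives used in the files above.

```python
# Generate a zonecfg command file like the ones shown above. Only the zone
# name (used for the zonepath) and the list of VNICs vary between the files.

INHERITED_DIRS = ["/lib", "/platform", "/sbin", "/usr", "/opt"]

def zone_command_file(name: str, nics: list) -> str:
    """Return the text of a zonecfg command file for a sparse root zone."""
    lines = [
        "create -b",
        f"set zonepath=/zones/{name}",
        "set ip-type=exclusive",
        "set autoboot=false",
    ]
    for directory in INHERITED_DIRS:
        lines += ["add inherit-pkg-dir", f"set dir={directory}", "end"]
    for nic in nics:
        lines += ["add net", f"set physical={nic}", "end"]
    lines.append("commit")
    return "\n".join(lines) + "\n"

print(zone_command_file("router", ["vnic5", "vnic1", "vnic3"]))
```

Calling it with an empty interface list yields the template file, and with a single interface the files for serverA and serverB.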
sysidcfg files

To speed up installation we create some sysidcfg files for our zones. Without these files, the installation would go interactive and you would have to use menus to provide the configuration information. When you place such a file at /etc/sysidcfg, the system will be initialized with the information provided in the file. I will start with the sysidcfg file of the router zone:

    system_locale=C
    terminal=vt100
    name_service=none
    network_interface=vnic5 {primary hostname=router1 ip_address=10.211.55.10 netmask=255.255.255.0 protocol_ipv6=no default_route=10.211.55.1}
    network_interface=vnic1 {hostname=router1-a ip_address=10.211.100.10 netmask=255.255.255.0 protocol_ipv6=no default_route=NONE}
    network_interface=vnic3 {hostname=router1-b ip_address=10.211.101.10 netmask=255.255.255.0 protocol_ipv6=no default_route=NONE}
    nfs4_domain=dynamic
    root_password=cmuL.HSJtwJ.I
    security_policy=none
After this, we create a second sysidcfg file for our first server zone. I store the following content in a file called servera_sysidcfg:

    system_locale=C
    terminal=vt100
    name_service=none
    network_interface=vnic2 {primary hostname=server1 ip_address=10.211.100.11 netmask=255.255.255.0 protocol_ipv6=no default_route=NONE}
    nfs4_domain=dynamic
    root_password=cmuL.HSJtwJ.I
    security_policy=none
    timeserver=localhost
    timezone=US/Central
When you look closely at the network_interface line you will see that I didn't specify a default route. Please keep this in mind. In a last step I create serverb_sysidcfg. It's the config file for our second server zone:

    system_locale=C
    terminal=vt100
    name_service=none
    network_interface=vnic4 {primary hostname=server2 ip_address=10.211.101.11 netmask=255.255.255.0 protocol_ipv6=no default_route=NONE}
    nfs4_domain=dynamic
    root_password=cmuL.HSJtwJ.I
    security_policy=none
    timeserver=localhost
    timezone=US/Central
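The address plan in the three sysidcfg files can be double-checked with a few lines of code before any zone is booted. This sketch uses the addresses and the netmask 255.255.255.0 from the files above; the dictionary keys and the helper same_subnet are names chosen for this illustration only.

```python
import ipaddress

# Addresses taken from the sysidcfg files; /24 corresponds to 255.255.255.0.
PLAN = {
    "router1": "10.211.55.10/24",      # vnic5
    "router1-a": "10.211.100.10/24",   # vnic1
    "router1-b": "10.211.101.10/24",   # vnic3
    "server1": "10.211.100.11/24",     # vnic2
    "server2": "10.211.101.11/24",     # vnic4
}

def same_subnet(a: str, b: str) -> bool:
    """True when both interfaces are configured in the same IP network."""
    return ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network

# Each server shares a network with exactly one router interface ...
print(same_subnet(PLAN["server1"], PLAN["router1-a"]))
print(same_subnet(PLAN["server2"], PLAN["router1-b"]))
# ... but the two servers can only reach each other through the router.
print(same_subnet(PLAN["server1"], PLAN["server2"]))
```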
Firing up the zones

After creating all these configuration files, we use them to create some zones. The procedure is similar for all zones. At first we do the configuration. After this we clone the template zone. As we located the template zone in a ZFS filesystem, the cloning takes just a second. Before we boot the zone, we put the configuration files in place that we prepared while waiting for the installation of the template zone.

    # zonecfg -z router -f router
    # zoneadm -z router clone template
    Cloning snapshot rpool/zones/template@SUNWzone3
    Instead of copying, a ZFS clone has been created for this zone.
    # cp router_sysidcfg /zones/router/root/etc/sysidcfg
    # cp site.xml /zones/router/root/var/svc/profile
    # zoneadm -z router boot
All zones are up and running.

Playing around with our simulated network

At first a basic check. Let's try to plumb one of the VNICs already used in a zone.

    # ifconfig vnic2 plumb
    vnic2 is used by non-globalzone: servera
Excellent. The system prohibits the plumbing. Before we can play with our mini network, we have to activate forwarding and routing on our new router. Since Solaris 10 this is really easy. There is a command for it:
    # routeadm -e ipv4-forwarding
    # routeadm -e ipv4-routing
    # routeadm -u
    # routeadm
                  Configuration   Current              Current
                         Option   Configuration        System State
    ---------------------------------------------------------------
                   IPv4 routing   enabled              enabled
                   IPv6 routing   disabled             disabled
                IPv4 forwarding   enabled              enabled
                IPv6 forwarding   disabled             disabled

               Routing services   "route:default ripng:default"

    Routing daemons:

                          STATE   FMRI
                       disabled   svc:/network/routing/zebra:quagga
                       disabled   svc:/network/routing/rip:quagga
                       disabled   svc:/network/routing/ripng:default
                       disabled   svc:/network/routing/ripng:quagga
                       disabled   svc:/network/routing/ospf:quagga
                       disabled   svc:/network/routing/ospf6:quagga
                       disabled   svc:/network/routing/bgp:quagga
                       disabled   svc:/network/routing/isis:quagga
                       disabled   svc:/network/routing/rdisc:default
                         online   svc:/network/routing/route:default
                       disabled   svc:/network/routing/legacy-routing:ipv4
                       disabled   svc:/network/routing/legacy-routing:ipv6
                         online   svc:/network/routing/ndp:default
This test only scratches the surface of the routing capabilities of Solaris. But that is stuff for more than one LKSF tutorial. Now let's look into the routing table of one of our servers:

    # netstat -nr

    Routing Table: IPv4
      Destination
    --------------------
    default
    10.211.100.0
    127.0.0.1
Do you remember that I asked you to keep in mind that we didn't specify a default route in the sysidcfg? So why do we have a default router now? There is some automagic in the boot. When a system with a single interface comes up without a default route specified in /etc/defaultrouter and without being a DHCP client, it automatically starts up the router discovery protocol as specified by RFC 1256 [3]. By using this protocol the host adds all available routers in the subnet as default routers. The rdisc protocol is implemented by the in.routed daemon. It implements two different protocols. The first one is the already mentioned rdisc protocol. But it implements the RIP protocol as well. The RIP protocol part is automagically activated when a system has more than one network interface.

[3] http://tools.ietf.org/html/rfc1256
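The decision described in this paragraph can be written down as a tiny function. This is only a restatement of the text in code form, not the logic of in.routed itself: the function name and its parameters are invented for this sketch, and any behaviour the text does not mention is deliberately left out.

```python
def routed_protocols(interfaces: int, has_defaultrouter_file: bool,
                     is_dhcp_client: bool) -> set:
    """Which in.routed protocols the text says are activated at boot."""
    protocols = set()
    # Single interface, no static default route, no DHCP: router discovery.
    if interfaces == 1 and not has_defaultrouter_file and not is_dhcp_client:
        protocols.add("rdisc")
    # More than one interface: the RIP part is activated.
    if interfaces > 1:
        protocols.add("rip")
    return protocols

print(routed_protocols(1, False, False))  # a server zone like server1
print(routed_protocols(3, False, False))  # the router zone with three VNICs
```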
    # ping 10.211.100.11
    10.211.100.11 is alive
    # traceroute 10.211.100.11
    traceroute to 10.211.100.11 (10.211.100.11), 30 hops max, 40 byte packets
     1  10.211.101.10 (10.211.101.10)  0.285 ms  0.266 ms  0.204 ms
     2  10.211.100.11 (10.211.100.11)  0.307 ms  0.303 ms  0.294 ms
    #
As you can see ... we've built a network in a box.

Building a more complex network

Let's extend our example a little bit. We will create an example with more networks and switches, more servers and routers. For a quick overview, see figure 19.2 on page 178. At first we configure additional etherstubs and VNICs:

    # dladm create-etherstub etherstub10
    # dladm create-vnic -l etherstub1 routerb1
    # dladm create-vnic -l etherstub10 routerb10
    # dladm create-vnic -l etherstub10 serverc1
    # dladm create-vnic -l etherstub1 routerc1
    # dladm create-vnic -l etherstub10 routerc2
As you see, you are not bound to a certain numbering scheme. You can name a VNIC as you like, as long as the name begins with letters and ends with a number. Now we use an editor to create a configuration file for our routerB:
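The naming rule can be expressed as a regular expression. The pattern below only encodes what the text states (the name starts with a letter and ends with a digit); it is an approximation, and the real dladm may enforce additional restrictions such as a maximum length. The helper name is invented for this sketch.

```python
import re

# Assumption: letters, digits and underscores in the middle; a letter first
# and a digit last, as described in the text.
LINK_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*[0-9]")

def looks_like_valid_link_name(name: str) -> bool:
    """Check a VNIC name against the rule of thumb from the text."""
    return LINK_NAME.fullmatch(name) is not None

for candidate in ["vnic1", "routerb10", "serverc1", "router", "10router"]:
    print(candidate, looks_like_valid_link_name(candidate))
```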
    create -b
    set zonepath=/zones/routerB
    set ip-type=exclusive
    set autoboot=false
    add inherit-pkg-dir
    set dir=/lib
    end
    add inherit-pkg-dir
    set dir=/platform
    end
    add inherit-pkg-dir
    set dir=/sbin
    end
    add inherit-pkg-dir
    set dir=/usr
    end
    add inherit-pkg-dir
    set dir=/opt
    end
    add net
    set physical=routerb1
    end
    add net
    set physical=routerb10
    end
    commit
We dont have to congure any default router in this sysidcfg even when the system is a router itself. The system boots up with a router and will get its routing tables from the RIP protocol.
system_locale=C
terminal=vt100
name_service=none
network_interface=routerb1 {primary hostname=routerb ip_address=10.211.101.254 netmask=255.255.255.0 protocol_ipv6=no default_route=NONE}
network_interface=routerb10 {hostname=routerb-a ip_address=10.211.102.10 netmask=255.255.255.0 protocol_ipv6=no default_route=NONE}
nfs4_domain=dynamic
root_password=cmuL.HSJtwJ.I
security_policy=none
timeserver=localhost
timezone=US/Central
Okay, the next zone is the routerc zone. We bind it to the matching VNICs in the zone configuration:
create -b
set zonepath=/zones/routerC
set ip-type=exclusive
set autoboot=false
add inherit-pkg-dir
set dir=/lib
end
add inherit-pkg-dir
set dir=/platform
end
add inherit-pkg-dir
set dir=/sbin
end
add inherit-pkg-dir
set dir=/usr
end
add inherit-pkg-dir
set dir=/opt
end
add net
set physical=routerc1
end
add net
set physical=routerc2
end
commit
The same rules as for routerb apply to routerc. We will rely on the routing protocols to provide a default route, so we can just insert NONE into the sysidcfg for the default route.
# cat routerc_sysidcfg
system_locale=C
terminal=vt100
name_service=none
network_interface=routerc1 {primary hostname=routerb ip_address=10.211.102.254 netmask=255.255.255.0 protocol_ipv6=no default_route=NONE}
network_interface=routerc2 {hostname=routerb-a ip_address=10.211.100.254 netmask=255.255.255.0 protocol_ipv6=no default_route=NONE}
nfs4_domain=dynamic
root_password=cmuL.HSJtwJ.I
security_policy=none
timeserver=localhost
timezone=US/Central
Okay, I assume you already know the following steps. It's the same procedure, just with other files.
# zonecfg -z routerc -f routerC
# zoneadm -z routerc clone template
Cloning snapshot rpool/zones/template@SUNWzone4
Instead of copying, a ZFS clone has been created for this zone.
# cp routerc_sysidcfg /zones/routerC/root/etc/sysidcfg
# cp site.xml /zones/routerC/root/var/svc/profile/
# zoneadm -z routerc boot
Okay, this is the last zone configuration in my tutorial. It's the zone for serverc:
create -b
set zonepath=/zones/serverC
set ip-type=exclusive
set autoboot=false
add inherit-pkg-dir
set dir=/lib
end
add inherit-pkg-dir
set dir=/platform
end
add inherit-pkg-dir
set dir=/sbin
end
add inherit-pkg-dir
set dir=/usr
end
add inherit-pkg-dir
set dir=/opt
end
add net
set physical=serverc1
end
commit
Again ... no default route ... as this is a single-interface system, we leave it to the ICMP Router Discovery Protocol to find the routers. So create a file called serverC.
system_locale=C
terminal=vt100
name_service=none
network_interface=serverc1 {primary hostname=server2 ip_address=10.211.102.11 netmask=255.255.255.0 protocol_ipv6=no default_route=NONE}
nfs4_domain=dynamic
root_password=cmuL.HSJtwJ.I
security_policy=none
timeserver=localhost
timezone=US/Central
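The commands for bringing up the serverc zone are not shown here. Assuming they mirror the routerc steps above, the following sketch only prints them so you can review them first. The helper name zone_setup_cmds and the file name serverc_sysidcfg are assumptions made for this example.

```shell
# Print (do not run) the commands that bring up one more zone from the
# template. Arguments: zone name, zonecfg command file (also used as the
# directory name under /zones), sysidcfg file.
zone_setup_cmds() {
    printf '%s\n' \
        "zonecfg -z $1 -f $2" \
        "zoneadm -z $1 clone template" \
        "cp $3 /zones/$2/root/etc/sysidcfg" \
        "cp site.xml /zones/$2/root/var/svc/profile/" \
        "zoneadm -z $1 boot"
}

# serverc_sysidcfg is a hypothetical file name for the sysidcfg shown above.
zone_setup_cmds serverc serverC serverc_sysidcfg
```

Printing instead of executing keeps the sketch harmless on a system where the zone already exists.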
So at first we have to make routers out of our routing zones. Obviously we have to log in to both routing zones and activate forwarding and routing. At first on routerb:
# routeadm -e ipv4-forwarding
# routeadm -e ipv4-routing
# routeadm -u
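[routing table output only partially recovered: the Flags and Ref columns list entries with the flags UG, UG and U (Ref 1 each), followed by the loopback entry 127.0.0.1 via 127.0.0.1 with flag UH, Use 49, interface lo0]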
As you see, there are two default routers in the routing table. The host receives router advertisements from two routers, thus it adds both to the routing table. Now let's have a closer look at the routing table of the routerb system.
routerb# netstat -nr

Routing Table: IPv4
  Destination           Gateway           Flags  Ref     Use     Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              10.211.101.10        UG        1          0 routerb1
10.211.100.0         10.211.102.254       UG        1          0 routerb10
10.211.101.0         10.211.101.254       U         1          0 routerb1
10.211.102.0         10.211.102.10        U         1          0 routerb10
127.0.0.1            127.0.0.1            UH        1         23 lo0
This system has more than one network interface. Thus in.routed starts up as a RIP-capable routing daemon. After a short moment in.routed has learned enough about the network and adds its routing table to the kernel. And after a short moment the routing tables of our routers are filled with the routing information provided by the routing protocols.
Conclusion
The scope of the virtualisation part of Crossbow is wider than just testing. Imagine the following situation: You want to consolidate several servers from a complex network, but you don't want to or can't change a configuration file. As far as the networking configuration is concerned, you could just simulate it in one machine. And as it's part of a single operating system kernel, it is a very efficient way to do it. You don't need virtual I/O servers or something like that. It's the single underlying kernel of Solaris itself doing this job. Another interesting use case for Crossbow was introduced by Glenn Brunette in his concept for the immutable service containers[4].
[4] http://wikis.sun.com/display/ISC/Home
As you see, we are able to download the data at 6464 kilobytes per second. Okay, let us impose a limit for the HTTP server. At first we create a flow that matches the webserver traffic.
jmoekamp@a340:~# flowadm add-flow -l e1000g0 -a transport=tcp,local_port=80 httpflow
When you dissect this flow configuration, you get the following ruleset:
- the traffic is on the Ethernet interface e1000g0
- it is TCP traffic
- the local port is 80
- for future reference the flow is called httpflow
With flowadm show-flow we can check the current configuration of flows on our system.
jmoekamp@a340:~# flowadm show-flow
FLOW        LINK        IPADDR    PROTO  PORT  DSFLD
httpflow    e1000g0     --        tcp    80    --
This is just the creation of the flow. To enable the bandwidth limiting we have to set some properties on this flow. To limit the traffic we have to use the maxbw property. For our first test, we set it to 2 Megabit/s:
jmoekamp@a340:~# flowadm set-flowprop -p maxbw=2m httpflow
As you see ... 266 kilobytes per second, that's roughly 2 Mbit/s. Okay, now we try 8 Megabit/s as a limit:
jmoekamp@a340:~# flowadm set-flowprop -p maxbw=8m httpflow
Okay, we get 933 kilobytes/s. That's a little bit less than 8 Mbit/s.
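To see how close the measured rates are to the configured limits, the kilobytes per second can be converted to megabits per second. This is a minimal sketch; it assumes that the download tool counts 1000 bytes per kilobyte (with 1024 the results would be about 2.4 percent higher), and kbytes_to_mbit is just a helper name for this example.

```shell
# Convert a throughput in kilobytes per second into megabits per second.
# Assumption: 1 kilobyte = 1000 bytes and 1 Mbit = 1,000,000 bits.
kbytes_to_mbit() {
    LC_ALL=C awk -v kb="$1" 'BEGIN { printf "%.3f\n", kb * 8 / 1000 }'
}

kbytes_to_mbit 266   # rate measured with maxbw=2m
kbytes_to_mbit 933   # rate measured with maxbw=8m
```

This prints 2.128 and 7.464, which fits the "roughly 2 Mbit/s" and "a little bit less than 8 Mbit/s" from the measurements.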
19.4. Accounting
Okay, all the traffic in Crossbow is separated into flows (when it's not part of a configured flow, it's part of the default flow). It would be nice to use this flow information for accounting. Before doing the testing I activated the accounting with the following command line:
jmoekamp@a340:~# acctadm -e extended -f /var/log/net.log net
Now I can check the bandwidth usage. For example, when I want to know the traffic usage between 18:20 and 18:24 on June 20th 2009, I can use the flowadm show-usage command. You get this data from the accounting file I configured before (in my case /var/log/net.log):
jmoekamp@a340:~# flowadm show-usage -s 06/20/2009,18:20:00 -e 06/20/2009,18:24:00 -f /var/log/net.log
FLOW      START     END       RBYTES  OBYTES   BANDWIDTH
httpflow  18:20:27  18:20:47  0       0        0 Mbps
httpflow  18:20:47  18:21:07  0       0        0 Mbps
httpflow  18:21:07  18:21:27  104814  6010271  2.446 Mbp
httpflow  18:21:27  18:21:47  0       0        0 Mbps
httpflow  18:21:47  18:22:07  0       0        0 Mbps
httpflow  18:22:07  18:22:27  0       0        0 Mbps
httpflow  18:22:27  18:22:47  0       0        0 Mbps
httpflow  18:22:47  18:23:07  0       0        0 Mbps
httpflow  18:23:07  18:23:27  121410  5333673  2.182 Mbp
httpflow  18:23:27  18:23:47  15246   676598   0.276 Mbps
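The BANDWIDTH column can be cross-checked against the byte counters. Judging from the numbers above, the value appears to be the sum of RBYTES and OBYTES, converted to bits and divided by the 20-second interval. This is an observation from this output, not a documented formula, and flow_mbps is a hypothetical helper name.

```shell
# Estimate the average bandwidth in Mbps from the received and sent byte
# counters of one accounting interval (length given in seconds).
flow_mbps() {
    LC_ALL=C awk -v rb="$1" -v ob="$2" -v secs="$3" \
        'BEGIN { printf "%.3f\n", (rb + ob) * 8 / secs / 1000000 }'
}

flow_mbps 104814 6010271 20   # interval 18:21:07 - 18:21:27
flow_mbps 121410 5333673 20   # interval 18:23:07 - 18:23:27
```

The sketch prints 2.446 and 2.182, the same figures as in the two busy intervals of the listing.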
The capability to do accounting on a per-flow basis makes this feature really interesting, even when you don't want to configure a traffic limit. So I configured an additional flow for SMTP traffic, and now the accounting is able to distinguish between the HTTP and the SMTP traffic:
jmoekamp@a340:~# flowadm show-flow -s
FLOW      IPACKETS  RBYTES  IERRORS  OPACKETS  OBYTES   OERRORS
httpflow  1168      77256   0        4204      6010271  0
smtpflow  18        1302    0        13        1250     0
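The command that created the smtpflow is not shown above. By analogy with httpflow it probably looked like the second line printed by this sketch. Local port 25 is an assumption, and the helper tcp_flow_cmd (a made-up name) only prints the command instead of running it.

```shell
# Print (do not run) the flowadm command for a TCP flow on a local port.
# Arguments: link, local port, flow name.
tcp_flow_cmd() {
    printf 'flowadm add-flow -l %s -a transport=tcp,local_port=%s %s\n' "$1" "$2" "$3"
}

tcp_flow_cmd e1000g0 80 httpflow   # the flow created earlier
tcp_flow_cmd e1000g0 25 smtpflow   # assumed definition of the SMTP flow
```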
20. IP Multipathing
Solaris 10/OpenSolaris
20.2. Introduction
Before people start to think about clusters and load balancers to ensure availability, they should start with the low-hanging fruit. One such low-hanging fruit is protecting the availability of the network connection. Solaris has an integrated mechanism to ensure this availability. It's called IP Multipathing. IP Multipathing is an important part of the solution for an ever-recurring problem, as almost all applications interact with the outside world in one way or another. Thus protecting the mechanisms of communication is a part of almost all architectures. Even when you have other availability mechanisms like load balancers, you want to protect the IP connection for a simple reason: Many applications have a session context, and not all software architectures can replicate those session contexts to another system to enable a failover without losing the session. So do you really want to lose this context just because of a failing network card or because of an admin unplugging a cable?
Or do you really want to provoke a cluster failover because of a failing network card? IPMP can keep such failures on a low level without needing high-availability mechanisms with a much larger impact. For this reason IP Multipathing is an important part of most HA infrastructures. This tutorial wants to give you an introduction to this topic. It's not really a less known feature, because for many people working with Solaris, IPMP is a daily part of their work. But many people new to Solaris or OpenSolaris aren't aware of the fact that Solaris has an integrated mechanism for IP Multipathing[1]. Furthermore this tutorial wants to give some insights into new developments in the field of IP Multipathing.
[1] Just as most newbies to Solaris aren't aware of MPxIO ... the counterpart of IPMP for storage
[2] Albeit I can think of remote cases where IPMP with one path can be useful
IPMP Group: Now you have physical interfaces on several network interface cards connecting into several IP links. How do you tell the system that certain interfaces are redundant connections into the same IP link? The concept of the IPMP group solves this problem. You put all interfaces that connect into the same IP link into one IPMP group. All interfaces in this group are considered redundant to each other, so IPMP can use them to receive and transmit the traffic of this network.
Failure: Okay, you may think this one is so obvious that you don't have to talk about it. Well, not really. This is one of the most frequent errors in HA: buying or using an HA product without thinking about the failure modes that are addressed by the mechanism. You have to think about which failures are in scope for IPMP and which ones are out of scope. IPMP is called IP Multipathing for a reason: It's a tool for IP, it isn't meant for other protocols. So it protects the availability of IP services against failures. But for this task it uses information from other layers of the stack, for example whether there is a link on the physical connection. Primarily it uses this information to speed up failover. There is no need to check upper layers if you already know that the lower layers went away.
Failure Detection: You do IPMP for a reason. You want to protect your system from losing its network connection in case a networking component fails. One of the most important components of an automatic availability protection mechanism is its capability to detect the need for doing something, like switching the IP configuration to another physical interface. Without such a mechanism it's just an easier interface to switch such a configuration manually. That said, IPMP provides two mechanisms to detect failures:
Link based: As the name suggests, the link-based failure detection checks if the physical interface has an active and operational link to the physical network. When a physical interface loses its link - for example because of problems with the cabling or a switch that was powered down - IPMP considers the interface as failed and starts to fail over to an operational link. The monitoring mechanism for the link state is quite simple. It's done by monitoring the RUNNING flag of an IP interface. When you look at a functional interface with ifconfig, you will recognize this flag:
e1000g0: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 11
        inet 192.168.56.201 netmask ffffff00 broadcast 192.168.56.255
        groupname production0
        ether 8:0:27:11:34:43
e1000g0: flags=219040803<UP,BROADCAST,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED,CoS> mtu 1500 index 11
        inet 192.168.56.201 netmask ffffff00 broadcast 192.168.56.255
        groupname production0
        ether 8:0:27:11:34:43
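The difference between the two outputs can be checked mechanically. The following sketch is not how in.mpathd works internally; it only looks for a flag name in the flags line of an ifconfig output. The helper name has_flag is made up for this example.

```shell
# Return success when the flag list of an ifconfig header line contains
# the given flag as a whole word.
has_flag() {
    flags=$(printf '%s\n' "$1" | sed -n 's/.*<\(.*\)>.*/\1/p')
    case ",$flags," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The two header lines from the outputs above.
ok='e1000g0: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 11'
bad='e1000g0: flags=219040803<UP,BROADCAST,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED,CoS> mtu 1500 index 11'

has_flag "$ok" RUNNING && echo "first sample: RUNNING is set"
has_flag "$bad" RUNNING || echo "second sample: RUNNING is missing"
```

The comma wrapping in the case pattern avoids partial matches, so a flag is only found when the complete name is in the list.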
This method of monitoring the interfaces requires a capability of the network card driver: To be usable with link-based IPMP, the driver has to set and unset the RUNNING flag based on the link state.[3]
[3] hme, eri, ce, ge, bge, qfe, dmfe, e1000g, ixgb, nge, nxge, rge and xge definitely work; ask the provider of the driver for other cards
Probe based: The probe-based mechanism itself is independent from the hardware. It checks the IP layer at the IP layer. The basic idea is: When you are able to communicate via IP with other systems, it's safe to assume that the IP layer is there. The probing itself is a simple mechanism. The probe-based failure detection sends ICMP messages to a number of systems. As long as the other systems react to those ICMP packets, a link is considered ok. When those other systems don't react within a certain time, the link is considered failed and the failover takes place. I will talk about the advantages and disadvantages of both mechanisms in a later section.
Data Address: In IPMP terminology the data addresses are the addresses that are really used for communication. An IPMP group can be used for multiple data addresses. However, all data addresses have to be in the same IP link.
Test Address: When you send ICMP messages to detect a failure, you need a source IP address for those messages. So each physical interface needs an address that is just used for testing purposes. This address is called the test address.
Repair and Repair detection: When you talk about failures, you have to talk about repairs as well. When an interface is functional again - for example by using another cable or a different switch - you have to detect this situation and reactivate the interface. Without repairs and the detection of repairs you would run out of interfaces pretty soon. The repair detection is just the other side of the failure detection, only that you check for probes getting through or a link that's coming up again.
Target systems: A target system is the matching opposite part of the test address. When you want to check the availability of a network connection by sending probe messages via ICMP, you need a source as well as a target for this ICMP communication. In IPMP speak a target system is a system that is used to test the availability of an IP interface. The IPMP mechanism tries to ping the target system in order to evaluate if the network interface is still fully functional. This is done for each interface by choosing the test address as the source address of the probe request. Target systems are chosen by the IPMP mechanism. The mechanism to do so is quite simple: Routers in an IP link are chosen as target systems automatically. When there are no routers connected to the IP link, the IPMP mechanism tries to find hosts in the neighborhood. A ping is sent to the all-hosts multicast address 224.0.0.1.
jmoekamp@hivemind:~$ ping -s 224.0.0.1
PING 224.0.0.1: 56 data bytes
64 bytes from hivemind-prod (192.168.178.200): icmp_seq=0. time=0.052 ms
64 bytes from 192.168.178.22: icmp_seq=0. time=0.284 ms
64 bytes from 192.168.178.114: icmp_seq=0. time=20.198 ms
The first few systems replying to this ping are chosen as target systems. The automatic mechanism doesn't always choose the best systems for this check, thus you can specify the targets manually when you think that a manual configuration better ensures that the target systems are a set of systems whose availability really reflects the availability of the network. Manually defined hosts always have precedence over routers, so manually defining such systems can reduce the ICMP load on your router. However, in most cases the automatic mechanism yields reasonable and sufficient results.
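The selection rules described above can be summarised in a few lines. This is only a sketch of the precedence as described in the text, not the actual in.mpathd logic; the function name choose_targets and the limit of three multicast responders are assumptions.

```shell
# Pick probe targets following the precedence described above: manually
# defined hosts first, then routers, then the first hosts that answered
# the multicast ping. Each argument is a space-separated list; the fourth
# argument (how many responders to keep) is a hypothetical parameter.
choose_targets() {
    manual=$1; routers=$2; responders=$3; limit=${4:-3}
    if [ -n "$manual" ]; then
        printf '%s\n' $manual
    elif [ -n "$routers" ]; then
        printf '%s\n' $routers
    elif [ -n "$responders" ]; then
        printf '%s\n' $responders | head -n "$limit"
    fi
}

# No manual targets, one router on the link: the router is used.
choose_targets "" "192.168.178.1" "192.168.178.22 192.168.178.114"
# No router at all: fall back to the hosts that answered the ping.
choose_targets "" "" "192.168.178.22 192.168.178.114"
```

The lists are deliberately left unquoted inside printf so that each address ends up on its own line.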
But there is a big disadvantage. The challenge is that link-based failure detection doesn't check the health of your IP connection, it just checks if there is a link. It's like a small signal light that indicates that there's power on the plug, but doesn't tell you if it's 220 V or 110 V. There are situations where a purely link-based mechanism is misleading, especially when the networks are getting more complex. Just think about the following network: Let's assume that link 1 fails. Obviously the link at the physical interface goes down. The link-based mechanism can detect this failure, and the system can react to this problem and switch over to the other networking card. But now let's assume that link 2 fails. The link on connection 1 is still up and the system considers the connection to the network as functional. There is no change in the flags of the IP interface. However, your network connection is still broken, as your default router is gone. A link means nothing when you can't communicate over it. At first such scenarios don't sound so common, and an intelligent network design can prevent such situations. Yes, that's correct, but just think about el-cheapo media converters from fibre to copper that don't take down the link on the copper side when the link is down on the fibre side[5]. Or small switches that are misused as media converters[6].
Probe based
So how can you circumvent this problem? The solution is somewhat obvious. Don't check only the link on the physical layer. Check it on the layer that really matters. In the case of networking: Don't check if there's a physical link ... check if you can reach other systems with the IP protocol. And the probe-based failure detection does exactly this. As I wrote before, the probe-based failure detection uses ICMP messages to check for a functional IP network. So it can check if you really have an IP connection to your default router and not just a link to a switch somewhere between the server and the router.
But this method has a disadvantage as well: You need vastly more IP addresses. Every interface in the IPMP group needs a test address. The test address is used to test the connection and stays on the interface even in the case of a failure[7]. The IP address consumption is huge. Given you have n interfaces, you need n test addresses on top of the data addresses. An IPMP group with four connections needs 4 test addresses. However, you can ease the consumption of IP addresses by using a private network for the test addresses, different from the network containing the data addresses. But I will get to this at the end of this chapter.
[5] Albeit any decent media converter has a feature that mirrors the link-down state from one side to the other to ease management and to notify the connected system about problems
[6] Don't laugh about it ... I found dusty old 10BASE-T hubs in raised floors working perfectly as media converters for years and years
[7] Obviously you need the test mechanism to check if the physical link was repaired by the admin
Given the 2 seconds between the probes, a failure is detected within 10 seconds by default, and a repair is detected within 20 seconds. However, you can change these numbers in case you need faster failure detection. I will explain that on page 222 in section 20.9.4.
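The default timings follow from simple arithmetic. The probe counts of 5 and 10 used here are inferred from the numbers in the text (10 s and 20 s at one probe every 2 s) and should be read as an assumption; detection_seconds is a helper name for this sketch.

```shell
# Time until a state change is noticed: probe interval in seconds times
# the number of consecutive probes that must fail (or, for a repair,
# must be answered).
detection_seconds() {
    echo $(( $1 * $2 ))
}

detection_seconds 2 5    # failure: 5 lost probes, 2 s apart
detection_seconds 2 10   # repair: 10 answered probes, 2 s apart
```

The two calls print 10 and 20, the default detection times mentioned above.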
somewhat similar to the link-based failure detection. When the link is down on a member of an aggregation, the switch takes the link out of the aggregation and puts it back as soon as the link comes up again. Later something similar to the probe-based mechanism found its way into the Ethernet standards. It's called LACP. With LACP special frames are used on a link to determine if the other side of the connection is in the same aggregate[8] and if there is really an Ethernet connection between both switches. I won't go into the details now, as this will be the topic of another tutorial in the next days. But the main purpose of link aggregation is to create a bigger pipe when a single Ethernet connection isn't enough. So ... why should you use IPMP? The reason is a simple one. When you use link aggregation, all your connections have to terminate on the same switch, thus this mechanism won't really help you in the case of a switch failure. The mechanisms of IPMP don't work on layer 2 of the network; they work on the third layer and so they don't have this constraint. The connections of an IPMP group can end in different switches, they can have different speeds, they could even be of a different technology, as long as they use IP (this was more of an advantage in the past; today, in the "Ethernet Everything" age, this point has lost its appeal). I tend to say that link aggregation is a performance technology with some high-availability capabilities, whereas IPMP is a high-availability technology with some performance capabilities.
20.3. Loadspreading
A source of frequent questions is the load spreading feature of IPMP. Customers have asked me if this is comparable to aggregation. My answer is: Yes, but not really! Perhaps this is the right moment to explain one thing about IPMP. When you look at the interfaces of a classic IPMP configuration, it looks like the IP addresses are assigned to physical interfaces. But that isn't the whole truth. When you send out data on such an interface, it's spread over all active[9] interfaces of such a group.
[8] It was a common configuration error in early times to have non-matching aggregation configurations
[9] Active doesn't mean functional. An interface can be functional without being used for IP traffic. An interface can be declared as a standby interface, thus it may be functional but the IPMP subsystem wouldn't use it. That's useful when you have a 10 GbE interface and a 1 GbE interface. You don't want the 1 GbE interface for normal use, but it's better than nothing in case the 10 GbE interface fails
But you have to be cautious: IPMP can do this only for outbound traffic. As IPMP is a server-only technology, there is no counterpart for it on the switch. So there is no load spreading on the switch. The switches don't know about this situation. When an inbound packet reaches the default gateway, the router uses the usual mechanisms to get the Ethernet address for the IP address and sends the data to this Ethernet address. As there can be just one Ethernet address for every IP address, the inbound communication will always use just one interface. This isn't a problem for many workloads, as many server applications send more data than they receive. But as soon as your application receives a lot of data, you should opt for another load distribution mechanism. However, there is a trick to circumvent this constraint: A single IPMP group can provide several data addresses. By carefully distributing these data addresses over the physical interfaces you are able to distribute the inbound load as well. So when you are able to use multiple IP addresses, you could do such a manual spreading of the inbound load. However, real load spreading mechanisms with the help of the switches will yield a much better distribution of the inbound traffic in many cases. But this disadvantage comes with an advantage: You are not bound to a single switch to use this load spreading. You could terminate every interface of your server in a separate switch and the IPMP group still spreads the traffic over all interfaces. That isn't possible with the standard link aggregation technologies of Ethernet. I want to end this section with a short warning: Both aggregation technologies will not increase your bandwidth when you have just a single IP data stream. Both technologies will use the same Ethernet interface for a communication relation between client and server. It's possible to separate them even based on layer 4 properties, but in the end the single ftp download will use just one of your lines. This is necessary to prevent out-of-order packets due to different trip times of the data on separate links.
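Why a single stream stays on one line can be illustrated with a hash over the communication relation. This sketch is a generic illustration, not the selection algorithm of IPMP or of any switch; the function name pick_interface, the use of cksum as the hash and the client address 10.0.0.5 are assumptions.

```shell
# Map a communication relation (source and destination address) to one
# interface of a group. The same pair always yields the same interface,
# so the packets of one stream never travel over different links.
# Arguments: source address, destination address, then the interfaces.
pick_interface() {
    key="$1-$2"
    shift 2
    [ "$#" -gt 0 ] || return 1
    sum=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
    want=$(( sum % $# ))
    i=0
    for ifname in "$@"; do
        if [ "$i" -eq "$want" ]; then
            printf '%s\n' "$ifname"
            return 0
        fi
        i=$(( i + 1 ))
    done
}

pick_interface 192.168.178.200 10.0.0.5 e1000g0 e1000g1 rge0
```

Different address pairs can land on different interfaces, which is where the spreading comes from, while repeating the call with the same pair always returns the same interface.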
With classic IPMP the data address is bound to a certain interface. In the case of the failure of an interface, the interface isn't used anymore for outbound traffic and the data address gets switched to an operational and active interface. With new IPMP you have a virtual IPMP interface in front of the physical network interfaces representing the IPMP group. The IPMP interface holds the data address and it isn't switched at any time. A physical interface may have a test address, but it is never configured with a data address. The virtual IPMP interface is your point of administration, for example when you want to snoop the network traffic of all interfaces in this group.
20.4. in.mpathd
There is a component in both variants that controls all the mechanisms surrounding IP multipathing. It's the in.mpathd daemon.
jmoekamp@hivemind:~$ ps -ef | grep "mpathd" | grep -v "grep"
    root  4523     1   0   Jan 19 ?           8:22 /lib/inet/in.mpathd
This daemon is automatically started by ifconfig as soon as you configure something in conjunction with IPMP on your system. The in.mpathd process is responsible for network adapter failure detection, repair detection, recovery, automatic failover and failback.
20.5. Prerequisites
At first you need a testbed. In this tutorial I will use a system with three interfaces. Two of them are Intel networking cards. They are named e1000g0 and e1000g1. The third interface is an onboard Realtek LAN adapter called rge0. The configuration of the IP network is straightforward. The subnet in this test is 192.168.178.0/24. I have a router at 192.168.178.1. The physical network is a little bit more complex, to demonstrate the limits of link-based failure detection. e1000g0 and e1000g1 are connected to a first switch called Switch A. This switch connects to a second switch called Switch B. The rge0 interface connects directly to Switch B. The router of this network is connected to Switch B as well. To make the configuration a little bit more comfortable, we add a few hosts to our /etc/hosts file. We need four addresses while going through the tutorial. At first we need the name for the data address:
echo "192.168.178.200 hivemind-prod" >> /etc/hosts
Figure 20.4.: Configuration for the demo
Now we need names for our test addresses. It's good practice to use the name of the data address with the name of the physical interface appended:
echo "192.168.178.201 hivemind-prod-e1000g0" >> /etc/hosts
echo "192.168.178.202 hivemind-prod-e1000g1" >> /etc/hosts
echo "192.168.178.203 hivemind-prod-rge0" >> /etc/hosts
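The naming convention can be generated instead of typed. A minimal sketch, assuming consecutive test addresses starting at .201; the helper test_address_entries (a made-up name) only prints the lines, so you can review them before appending them to /etc/hosts.

```shell
# Print /etc/hosts lines for the test addresses: the name of the data
# address with the interface name appended.
# Arguments: data address name, network prefix, first host number,
# then the interface names.
test_address_entries() {
    data_name=$1; prefix=$2; n=$3
    shift 3
    for ifname in "$@"; do
        printf '%s.%s %s-%s\n' "$prefix" "$n" "$data_name" "$ifname"
        n=$(( n + 1 ))
    done
}

test_address_entries hivemind-prod 192.168.178 201 e1000g0 e1000g1 rge0
```

For the three interfaces of the testbed this prints the same three entries as the echo commands above.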
With this command you've configured the IPMP interface. You can use any name for it you want, it just has to begin with a letter and end with a number. I have chosen the name production0 for this tutorial. Now let's look at the interface:
jmoekamp@hivemind:~# ifconfig production0
production0: flags=8011000803<UP,BROADCAST,MULTICAST,IPv4,FAILED,IPMP> mtu 68 index 6
        inet 192.168.178.200 netmask ffffff00 broadcast 192.168.178.255
        groupname production0
As you see, it looks pretty much like a normal network interface with some specialities: First, it's in the state FAILED at the moment. There are no network interfaces assigned to the group yet, thus you can't connect anywhere over this interface. The interface is already configured with the data address[14]. The data address will never move away from there. At the end you see the name of the IPMP group. By default the name of the IPMP group and the name of the IPMP interface are set to the same value. Okay, now we have to assign some physical interfaces to it. This is the moment where we have to make a decision. Do we want to use IPMP with probes or without probes? As I've explained before, it's important to know at this point which failure scenarios you want to cover with your configuration. You need to know it now, as the configuration is slightly different.
[14] Additional data addresses are configured as logical interfaces on this virtual interface. You won't configure additional virtual IPMP interfaces
Okay, now we add the three member interfaces to the IPMP group:
jmoekamp@hivemind:/etc# ifconfig e1000g0 -failover group production0 up
jmoekamp@hivemind:/etc# ifconfig e1000g1 -failover group production0 up
jmoekamp@hivemind:/etc# ifconfig rge0 -failover group production0 up
As you may have noticed, we didn't specify an IP address or a hostname. With link-based failure detection you don't need one. The IP address of the group is located on the IPMP interface we defined a few moments ago. But let's have a look at the ifconfig statements. There are two parameters you may not know:
-failover: This parameter marks an interface as a non-failover one. In case of a failure, this interface configuration doesn't move. This looks a little bit strange in the context of a physical interface[15], but the rationale gets clearer with probe-based IPMP.
group production0: The parameter group designates the IPMP group membership of an interface.
Let's look at one of the interfaces:
rge0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 5
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
        groupname production0
We find the consequences of both ifconfig parameters: The NOFAILOVER flag is obviously the result of -failover, and groupname production0 is the result of the group production0 statement. But there is another flag that is important in the realm of IPMP. It's the DEPRECATED flag. The DEPRECATED flag has a very simple meaning: Don't use this IP interface. When an interface has this flag, its IP address won't be used to send out data[16]. As those IP addresses are just for test purposes, you don't want them to appear in packets to the outside world.
Playing around
Now we need to interact with the hardware, as we will fail network connections manually. Or to say it differently: We will pull some cables.
[15] Moving hardware just by software is a little bit problematic ...
[16] Of course there are exceptions, like an application specifically binding to the interface. Please look into the man page for further information
But before we do this, we look at the initial status of our IPMP configuration. The new IPMP model improved the monitoring capabilities by introducing a command for this task. It's called ipmpstat.
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS     LINK      PROBE     STATE
rge0        yes     production0  -------   up        disabled  ok
e1000g1     yes     production0  -------   up        disabled  ok
e1000g0     yes     production0  --mb---   up        disabled  ok
Just to give you a brief tour through the output of the command: The first column reports the name of the interface, the next one reports the state of the interface from the perspective of IPMP. The third column tells you which IPMP group the interface was assigned to. The next columns give us some more in-depth information about the interface. The fourth column is a multipurpose column that reports a number of states. In the last output, the --mb--- tells us that the interface e1000g0 was chosen for sending and receiving multicast and broadcast data. The other interfaces don't have a special state, so there are just dashes in the FLAGS field of these interfaces. The fifth column reports the link state, and the sixth column reveals that we've disabled probes. The last column reports the state of the interface. In this example it is ok, and so the interface is used in the IPMP group. Okay, now pull the cable from the e1000g0 interface. It's Cable 1 in the figure. The system automatically switches to e1000g1 as the active interface.
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE     STATE
rge0        yes     production0  -------  up    disabled  ok
e1000g1     yes     production0  --mb---  up    disabled  ok
e1000g0     no      production0  -------  down  disabled  failed
As you can see, the failure has been detected on the e1000g0 interface. The link is down, thus the interface is no longer active. Okay, let's repair it. Put the cable back into the port of the e1000g0 interface. After a moment, the link is up. in.mpathd becomes aware of the RUNNING flag on the interface and assumes that the network connection got repaired, so the state of the interface is set to ok and the interface is reactivated.
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE     STATE
rge0        yes     production0  -------  up    disabled  ok
e1000g1     yes     production0  --mb---  up    disabled  ok
e1000g0     yes     production0  -------  up    disabled  ok
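If you want to watch for failures from a script instead of reading the table yourself, the STATE column is the obvious thing to look at. The following is just a minimal sketch: the function name failed_interfaces is made up for this example, and it assumes that the columns appear in the same order as in the listings of this chapter. The sample data is the output with Cable 1 pulled.

```sh
#!/bin/sh
# Print the name of every interface whose STATE column is not "ok".
# Expects `ipmpstat -i` style output on stdin. Assumption: the columns are
# INTERFACE ACTIVE GROUP FLAGS LINK PROBE STATE, as in the listings above.
failed_interfaces() {
    awk 'NR > 1 && NF >= 7 && $7 != "ok" { print $1 }'
}

# Sample data: the listing taken while Cable 1 was unplugged.
sample='INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE     STATE
rge0        yes     production0  -------  up    disabled  ok
e1000g1     yes     production0  --mb---  up    disabled  ok
e1000g0     no      production0  -------  down  disabled  failed'

printf '%s\n' "$sample" | failed_interfaces
```

On a live system you would feed the function with ipmpstat -i | failed_interfaces instead of the sample text.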
The problem with link-based failure detection
Just in case you've played with the ethernet cables, ensure that IPMP chooses an interface connected via Switch A as the active interface by unplugging Cable 3 from Switch B for a moment. When you check with ipmpstat -i, the mb flags have to be assigned to the interface e1000g0 or e1000g1. As I wrote before, there are failure modes link-based failure detection can't detect. Now let's introduce such a fault. To do so, just remove Cable 4 between Switch A and B.
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE     STATE
rge0        yes     production0  -------  up    disabled  ok
e1000g1     yes     production0  --mb---  up    disabled  ok
e1000g0     yes     production0  -------  up    disabled  ok
As there is still a link on Cables 1 and 2, everything is fine from the perspective of IPMP. It doesn't switch to the connection via rge0, which is the only working connection to the outside world. IPMP is simply not aware of the fact that Switch A was separated from the IP link 192.168.178.0/24 by the removal of Cable 4.
In this case, the first three commands will fail, but you have the explicitly defined IPMP interface
Okay, now all the interfaces are away. Now we recreate the IPMP group.
jmoekamp@hivemind:/etc# ifconfig production0 ipmp hivemind-prod up
We can check the successful creation of the IPMP interface by using the ipmpstat command.
jmoekamp@hivemind:/etc# ipmpstat -g
GROUP        GROUPNAME    STATE   FDT  INTERFACES
production0  production0  failed  --   --
At the start there isn't an interface configured into the IPMP group. So let's start to fill the group with some life.
jmoekamp@hivemind:/etc# ifconfig e1000g0 plumb hivemind-prod-e1000g0 -failover group production0 up
There is an important difference. This ifconfig statement contains an IP address that is assigned to the physical interface. This automatically configures IPMP to use probe-based failure detection. The idea behind the -failover setting gets clearer now: Obviously the test address of an interface shouldn't be failed over by IPMP. It should stay on the interface it was configured on. As the interface has the NOFAILOVER flag, the complete interface including its IP address is exempted from any failover. Let's check the IPMP group again:
jmoekamp@hivemind:/etc# ipmpstat -g
GROUP        GROUPNAME    STATE  FDT     INTERFACES
production0  production0  ok     10.00s  e1000g0
There is now an interface in the group. Of course an IPMP group with just one interface doesn't really make sense, so we will configure a second interface into the group. You may have noticed the FDT column. FDT stands for Failure Detection Time. Why is there a separate column for this number? Due to the dynamic nature of the failure detection time, the FDT may be different for every group. With this column you can check the current FDT.
jmoekamp@hivemind:/etc# ifconfig e1000g1 plumb hivemind-prod-e1000g1 -failover group production0 up
Now we add the third interface that is connected to the default gateway just via Switch B.
jmoekamp@hivemind:/etc# ifconfig rge0 plumb hivemind-prod-rge0 -failover group production0 up
All three interfaces are in the IPMP group now. And that's all ... we've just activated failure detection and failover with these four commands. Really simple, isn't it?
Playing around
I hope you still have the hardware configuration in place that I used to show the problems of link-based failure detection. In case you haven't, please recreate the configuration we've used there. At first we do a simple test: We simply unplug a cable from the system. In my case I removed Cable 1:
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE   STATE
rge0        yes     production0  --mb---  up    ok      ok
e1000g1     yes     production0  -------  up    ok      ok
e1000g0     no      production0  -------  down  failed  failed
The system reacts immediately, as link-based failure detection is still active, even when you use the probe-based mechanism. You can observe this in the ipmpstat output by monitoring the LINK column. It's down at the moment and obviously the probes can't reach their targets. The state is set to failed. Now plug the cable back into the system:
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE   STATE
rge0        yes     production0  --mb---  up    ok      ok
e1000g1     yes     production0  -------  up    ok      ok
e1000g0     no      production0  -------  up    failed  failed
The link is back, but the interface is still failed. IPMP works as designed here. The probing of the interface with ICMP messages still considers this interface as down. As we now have two mechanisms to check the availability of the interface, both have to confirm the repair. IPMP doesn't consider an interface as repaired when just one ICMP probe gets through; it waits until 20 ICMP probes were correctly replied to by the target system. Because of this probing at repair time instead of just relying on the link, you can prevent an interface from being considered OK when an unconfigured switch brings the link back online, but the configuration of the switch doesn't allow the server to connect anywhere (because of the VLAN configuration, for example).
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE  STATE
rge0        yes     production0  --mb---  up    ok     ok
e1000g1     yes     production0  -------  up    ok     ok
e1000g0     yes     production0  -------  up    ok     ok
jmoekamp@hivemind:~#
As soon as the probing of the interface is successful, IPMP brings the interface back to the ok state and everything is fine. Now we get to a more interesting use case of probe-based failure detection. Let's assume we've repaired everything and all is fine. You should see a situation similar to this one in your ipmpstat output:
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE  STATE
rge0        yes     production0  -------  up    ok     ok
e1000g1     yes     production0  -------  up    ok     ok
e1000g0     yes     production0  --mb---  up    ok     ok
Now unplug Cable 4, the cable between Switch A and B. At first nothing happens, but a few seconds later IPMP switches the IP addresses to rge0 and sets the state of the other interfaces to failed.
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE   STATE
rge0        yes     production0  --mb---  up    ok      ok
e1000g1     no      production0  -------  up    failed  failed
e1000g0     no      production0  -------  up    failed  failed
When you look at the output of ipmpstat you will notice that the link is still up, but the probe has failed, thus the interfaces were set into the state failed. When you plug Cable 4 back into the switches, nothing will happen at first. You have to wait until the probing mechanism reports that the ICMP messages were correctly returned by the target systems.
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE   STATE
rge0        yes     production0  --mb---  up    ok      ok
e1000g1     no      production0  -------  up    failed  failed
e1000g0     no      production0  -------  up    failed  failed
After a few seconds it should deliver an ipmpstat output reporting everything is well again.
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE  STATE
rge0        yes     production0  --mb---  up    ok     ok
e1000g1     yes     production0  -------  up    ok     ok
e1000g0     yes     production0  -------  up    ok     ok
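The two experiments above differ only in the LINK and PROBE columns: with a pulled cable the link is down, with a broken connection between the switches the link stays up and just the probes fail. A small sketch that makes this distinction visible could look like the following. The function name classify_failures and the three labels are made up for this example, and the column order is assumed to match the listings above. The sample data is the output taken after Cable 4 was removed.

```sh
#!/bin/sh
# Classify every interface in `ipmpstat -i` style output read from stdin.
# Assumption: column 5 is LINK and column 6 is PROBE, as in the listings above.
#   link-down     the local link is gone (e.g. a pulled cable)
#   probe-failed  the link is up, but the probe targets don't answer
#   healthy       neither of the above
classify_failures() {
    awk 'NR > 1 {
        if ($5 != "up")           reason = "link-down"
        else if ($6 == "failed")  reason = "probe-failed"
        else                      reason = "healthy"
        print $1, reason
    }'
}

# Sample data: the listing taken after Cable 4 was unplugged.
sample='INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE   STATE
rge0        yes     production0  --mb---  up    ok      ok
e1000g1     no      production0  -------  up    failed  failed
e1000g0     no      production0  -------  up    failed  failed'

printf '%s\n' "$sample" | classify_failures
```

An interface reported as probe-failed is exactly the case that link-based failure detection alone would have missed.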
We reboot the system now to ensure that we did everything correctly. When the system has booted up, we will check if we made an error.
jmoekamp@hivemind:~$ ipmpstat -g
GROUP        GROUPNAME    STATE  FDT  INTERFACES
production0  production0  ok     --   rge0 e1000g1 e1000g0
jmoekamp@hivemind:~$ ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE     STATE
rge0        yes     production0  -------  up    disabled  ok
e1000g1     yes     production0  -------  up    disabled  ok
e1000g0     yes     production0  --mb---  up    disabled  ok
Boot persistent probe-based configuration
We can do the same for the probe-based IPMP:
jmoekamp@hivemind:/etc# echo "ipmp group production0 hivemind-prod up" > /etc/hostname.production0
jmoekamp@hivemind:/etc# echo "group production0 -failover hivemind-prod-e1000g0 up" > /etc/hostname.e1000g0
jmoekamp@hivemind:/etc# echo "group production0 -failover hivemind-prod-e1000g1 up" > /etc/hostname.e1000g1
jmoekamp@hivemind:/etc# echo "group production0 -failover hivemind-prod-rge0 up" > /etc/hostname.rge0
Reboot the system and log in afterwards to check the list of interfaces.
jmoekamp@hivemind:~$ ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
production0: flags=8001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 2
        inet 192.168.178.200 netmask ffffff00 broadcast 192.168.178.255
        groupname production0
e1000g0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
        inet 192.168.178.201 netmask ffffff00 broadcast 192.168.178.255
        groupname production0
e1000g1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 4
        inet 192.168.178.202 netmask ffffff00 broadcast 192.168.178.255
        groupname production0
rge0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 5
        inet 192.168.178.203 netmask ffffff00 broadcast 192.168.178.255
        groupname production0
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
        inet6 ::1/128
Everything is fine.
plug a second one into the system and use already existing 1GbE interfaces as a backup for it instead. It's pretty straightforward to do so. At first you have to configure the link aggregation.
jmoekamp@hivemind:~$ pfexec bash
jmoekamp@hivemind:~# ifconfig e1000g0 unplumb
jmoekamp@hivemind:~# ifconfig e1000g1 unplumb
jmoekamp@hivemind:~# dladm create-aggr -l e1000g0 -l e1000g1 aggregate0
jmoekamp@hivemind:~# dladm show-aggr -x aggregate0
LINK        PORT     SPEED  DUPLEX   STATE    ADDRESS           PORTSTATE
aggregate0  --       0Mb    unknown  unknown  0:1b:21:3d:91:f7  --
            e1000g0  0Mb    half     down     0:1b:21:3d:91:f7  standby
            e1000g1  0Mb    half     down     0:1b:21:16:8d:7f  standby
The dladm create-aggr creates an aggregation, that bundles the interfaces e1000g0 and e1000g1 into a single virtual interface. Now I plug both cables into the switch.
jmoekamp@hivemind:~# dladm show-aggr -x aggregate0
LINK        PORT     SPEED  DUPLEX  STATE  ADDRESS           PORTSTATE
aggregate0  --       100Mb  full    up     0:1b:21:3d:91:f7  --
            e1000g0  100Mb  full    up     0:1b:21:3d:91:f7  attached
            e1000g1  100Mb  full    up     0:1b:21:16:8d:7f  attached
Looks pretty much like a standard IPMP configuration. You can think of aggregate0 as a plain-standard physical interface from the perspective of the admin. When we check the IPMP configuration we will see both interfaces.
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE     STATE
rge0        yes     production0  -------  up    disabled  ok
aggregate0  yes     production0  --mb---  up    disabled  ok
jmoekamp@hivemind:~# ipmpstat -g
GROUP        GROUPNAME    STATE  FDT
production0  production0  ok     --
Everything is still okay. The aggregate hides the failure of the one interface from the IPMP subsystem. Now we unplug the second interface.
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE     STATE
rge0        yes     production0  --mb---  up    disabled  ok
aggregate0  no      production0  -------  down  disabled  failed
jmoekamp@hivemind:~# ipmpstat -g
GROUP        GROUPNAME    STATE     FDT  INTERFACES
production0  production0  degraded  --   rge0 [aggregate0]
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE     STATE
rge0        yes     production0  --mb---  up    disabled  ok
aggregate0  no      production0  -------  down  disabled  failed
jmoekamp@hivemind:~# dladm show-aggr -x aggregate0
LINK        PORT     SPEED  DUPLEX   STATE  ADDRESS
aggregate0  --       0Mb    unknown  down   0:1b:21:3d:91:f7
            e1000g0  0Mb    half     down   0:1b:21:3d:91:f7
            e1000g1  0Mb    half     down   0:1b:21:16:8d:7f
Both links are down, and without a functional interface left, the link of the aggregate goes down as well. Of course the IPMP subsystem switches to rge0 now. When we plug one cable back into the switch, the aggregate is functional again. IPMP detects this and considers the interface as functional again, too.
jmoekamp@hivemind:~# dladm show-aggr -x aggregate0
LINK        PORT     SPEED  DUPLEX  STATE  ADDRESS           PORTSTATE
aggregate0  --       100Mb  full    up     0:1b:21:3d:91:f7  --
            e1000g0  100Mb  full    up     0:1b:21:3d:91:f7  attached
            e1000g1  0Mb    half    down   0:1b:21:16:8d:7f  standby
jmoekamp@hivemind:~# ipmpstat -i
INTERFACE   ACTIVE  GROUP        FLAGS    LINK  PROBE     STATE
rge0        yes     production0  --mb---  up    disabled  ok
aggregate0  yes     production0  -------  up    disabled  ok
When you plug the second cable back into the switch, the aggregate is complete. But it doesn't change a thing from the IPMP side, as the aggregate0 interface was already functional from the perspective of IPMP with just one interface.
jmoekamp@hivemind:~# dladm show-aggr -x aggregate0
LINK        PORT     SPEED  DUPLEX  STATE  ADDRESS           PORTSTATE
aggregate0  --       100Mb  full    up     0:1b:21:3d:91:f7  --
            e1000g0  100Mb  full    up     0:1b:21:3d:91:f7  attached
            e1000g1  100Mb  full    up     0:1b:21:16:8d:7f  attached
jmoekamp@hivemind:~#
20.7.1. Prerequisites
This example works with the same configuration, but you need a system with Solaris 10 or an OpenSolaris system with a build earlier than 107. I just use OpenSolaris on my lab machines, thus I used a virtualized Solaris 10 to explain the configuration of classic IPMP. I will use the following addresses:
192.168.178.200 vhivemind-prod
192.168.178.201 vhivemind-e1000g0
192.168.178.202 vhivemind-e1000g1
The data address is directly bound to one of the interfaces. It's important to know that even when the ifconfig output suggests something different, outbound data flows to the network on both interfaces, not just on the one that holds the data address. Now unplug the cable connecting to e1000g0.
bash-3.00# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=219000802<BROADCAST,MULTICAST,IPv4,NOFAILOVER,FAILED,CoS> mtu 0 index 15
        inet 0.0.0.0 netmask 0
        groupname production0
        ether 8:0:27:11:34:43
e1000g1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 16
        inet 0.0.0.0 netmask ff000000
        groupname production0
        ether 8:0:27:6d:9:be
e1000g1:1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 16
        inet 192.168.56.200 netmask ffffff00 broadcast 192.168.56.255
The data address was moved away from e1000g0, and a logical interface was created on e1000g1 to hold it instead.
bash-3.00# ifconfig e1000g0 plumb
bash-3.00# ifconfig e1000g1 plumb
bash-3.00# ifconfig e1000g0 vhivemind-e1000g0 deprecated -failover netmask + broadcast + group production0 up
bash-3.00# ifconfig e1000g0 addif vhivemind-prod netmask + broadcast + up
Created new logical interface e1000g0:1
bash-3.00# ifconfig e1000g1 vhivemind-e1000g1 deprecated -failover netmask + broadcast + group production0 up
bash-3.00# ifconfig -a
Please note that you have to use the deprecated option to set the DEPRECATED flag on your own. The new IPMP does this automagically. Forgetting this option leads to interesting, but not always obvious malfunctions. Let's check the network configuration.
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 11
        inet 192.168.56.201 netmask ffffff00 broadcast 192.168.56.255
        groupname production0
        ether 8:0:27:11:34:43
e1000g0:1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 11
        inet 192.168.56.200 netmask ffffff00 broadcast 192.168.56.255
e1000g1: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 12
        inet 192.168.56.202 netmask ffffff00 broadcast 192.168.56.255
        groupname production0
        ether 8:0:27:6d:9:be
Both interfaces have their test addresses. The data address is configured on an additional logical interface. As it's the only interface without the -failover statement, this interface is automatically managed by IPMP. Now remove the cable from the e1000g0 networking card.
bash-3.00# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=219040803<UP,BROADCAST,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED,CoS> mtu 1500 index 11
        inet 192.168.56.201 netmask ffffff00 broadcast 192.168.56.255
        groupname production0
        ether 8:0:27:11:34:43
e1000g1: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 12
        inet 192.168.56.202 netmask ffffff00 broadcast 192.168.56.255
        groupname production0
        ether 8:0:27:6d:9:be
e1000g1:1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 12
        inet 192.168.56.200 netmask ffffff00 broadcast 192.168.56.255
The virtual interface with the data address has moved from e1000g0 to e1000g1.
files. We just have to translate the command lines accordingly:
Link-based IPMP
At first we configure the e1000g0 interface by creating the file /etc/hostname.e1000g0 containing a single line.
vhivemind-prod netmask + broadcast + group production0 up
Afterwards we do the same for e1000g1. We create a file named /etc/hostname.e1000g1 and put the following line (and just this line) in it:
group production0 up
Now reboot the system. After a few moments you can get a shell and check your configuration.
# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
        inet 192.168.56.200 netmask ffffff00 broadcast 192.168.56.255
        groupname production0
        ether 8:0:27:11:34:43
e1000g1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
        groupname production0
        ether 8:0:27:6d:9:be
Everything is configured as we've planned it.
Probe-based IPMP
Okay, let's do the same for the probe-based IPMP. This is the /etc/hostname.e1000g0 file configuring the test address on the physical interface and the data address:
vhivemind-e1000g0 deprecated -failover netmask + broadcast + group production0 up \
addif vhivemind-prod netmask + broadcast + up
The file /etc/hostname.e1000g1 with the following line configures the e1000g1 interface of our system at boot:
vhivemind-e1000g1 deprecated -failover netmask + broadcast + group production0 up
Okay, reboot your system and you should get an ifconfig output like this one afterwards.
# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 2
        inet 192.168.56.201 netmask ffffff00 broadcast 192.168.56.255
        groupname production0
        ether 8:0:27:11:34:43
e1000g0:1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
        inet 192.168.56.200 netmask ffffff00 broadcast 192.168.56.255
e1000g1: flags=209040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,CoS> mtu 1500 index 3
        inet 192.168.56.202 netmask ffffff00 broadcast 192.168.56.255
        groupname production0
        ether 8:0:27:6d:9:be
Everything is fine.
        grouping='require_all'
        restart_on='none'
        type='service'>
        <service_fmri value='svc:/milestone/network:default' />
    </dependency>
    <exec_method
        type='method'
        name='start'
        exec='/lib/svc/method/ipmptargets %m'
        timeout_seconds='60' />
    <exec_method
        type='method'
        name='refresh'
        exec='/lib/svc/method/ipmptargets %m'
        timeout_seconds='60' />
    <exec_method
        type='method'
        name='stop'
        exec='/lib/svc/method/ipmptargets %m'
        timeout_seconds='60' />
    <property_group name='startd' type='framework'>
        <propval name='duration' type='astring' value='transient' />
    </property_group>
    <property_group name='general' type='framework'>
        <propval name='action_authorization' type='astring' value='solaris.smf.manage.ipmptargets' />
        <propval name='value_authorization' type='astring' value='solaris.smf.manage.ipmptargets' />
    </property_group>
    <instance name='target1' enabled='false'>
        <property_group name='config_params' type='application'>
            <propval name='ip' type='astring' value='192.168.178.1' />
        </property_group>
    </instance>
    <instance name='target2' enabled='false'>
        <property_group name='config_params' type='application'>
            <propval name='ip' type='astring' value='192.168.178.20' />
        </property_group>
    </instance>
    <stability value='Unstable' />
    <template>
        <common_name>
            <loctext xml:lang='C'>system-wide configuration of IP routes for IPMP</loctext>
        </common_name>
        <documentation>
            <manpage title='ifconfig' section='1M' manpath='/usr/share/man' />
        </documentation>
    </template>
</service>
</service_bundle>
The specific host routes are implemented as instances of this service, so it is possible to control the routes with a fine granularity. Okay, obviously we need the script mentioned in the exec methods of the manifest. So put the following script into the file /lib/svc/method/ipmptargets:
#!/bin/sh

. /lib/svc/share/smf_include.sh

# Read a property of the current service instance from the SMF repository.
getproparg() {
    val=`svcprop -p $1 $SMF_FMRI`
    [ -n "$val" ] && echo $val
}

if [ -z "$SMF_FMRI" ]; then
    echo "SMF framework variables are not initialized."
    exit $SMF_EXIT_ERR_CONFIG
fi

# The probe target of this instance, e.g. 192.168.178.1 for target1.
IP=`getproparg config_params/ip`

if [ -z "$IP" ]; then
    echo "config_params/ip property not set"
    exit $SMF_EXIT_ERR_CONFIG
fi

case "$1" in
start)
    route add -host $IP $IP -static
    ;;
stop)
    route delete -host $IP $IP -static
    ;;
refresh)
    route delete -host $IP $IP -static
    route add -host $IP $IP -static
    ;;
*)
    echo "Usage: $0 { start | stop | refresh }"
    exit 1
    ;;
esac

exit $SMF_EXIT_OK
Okay. Now we have to import the SMF manifest into the repository.
ipmp_hostroutes.xml
It's ready to use now. You can enable and disable your IPMP host routes as you need them:
jmoekamp@hivemind:~# svcadm enable ipmptargets:target1
jmoekamp@hivemind:~# netstat -nr

Routing Table: IPv4
  Destination
--------------------
default
127.0.0.1
192.168.56.0
192.168.178.0
192.168.178.1

Routing Table: IPv6
  Destination/Mask            Gateway                     Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2      20 lo0
jmoekamp@hivemind:~# svcadm disable ipmptargets:target1
jmoekamp@hivemind:~# netstat -nr

Routing Table: IPv4
  Destination
--------------------
default
127.0.0.1
192.168.56.0
192.168.178.0

Routing Table: IPv6
  Destination/Mask            Gateway                     Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2      20 lo0
#
# Time taken by mpathd to detect a NIC failure in ms. The minimum time
# that can be specified is 100 ms.
#
FAILURE_DETECTION_TIME=10000
By using a smaller number, you can speed up the failure detection, but you get a much higher load of ICMP probes on your system. Keep in mind that I've told you that an interface is considered as failed if 5 consecutive probes fail. When the failure detection time is 10000 ms, the probes have to be sent every 2000 ms. When you configure 100 ms, you will see a probe every 20 ms. Furthermore this probing is done on every interface. Thus at 100 ms failure detection time, the targets will see 3 ping requests every 20 milliseconds. So keep in mind that lowering this number will increase the load on other systems. Choose the failure detection time based on your business and application needs, not on the thought "I want the lowest possible time". Just to demonstrate this effect and to learn how you set the failure detection time, you should modify the value in the line FAILURE_DETECTION_TIME to 100. Make in.mpathd reread its configuration afterwards by sending a HUP signal to it with pkill -HUP in.mpathd. When you start a snoop with snoop -d e1000g0 -t a -r icmp, you will see output scrolling over your display at a very high speed.
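If you want to estimate the probe load before you touch the configuration, you can redo the arithmetic from the paragraph above in the shell. This is only a rough sketch that takes the rule of thumb of this chapter at face value (5 probes per failure detection time, every interface probes on its own). The function names are made up for this example, and the real in.mpathd may spread its probes differently.

```sh
#!/bin/sh
# Interval between two probes on one interface in ms:
# the failure detection time divided by the 5 probes that have to fail.
probe_interval_ms() {
    expr "$1" / 5
}

# Probes per second (rounded down) sent by all interfaces together.
# $1 = failure detection time in ms, $2 = number of probing interfaces.
probes_per_second() {
    interval=`probe_interval_ms "$1"`
    expr "$2" \* 1000 / "$interval"
}

probe_interval_ms 10000    # default setting: 2000
probe_interval_ms 100      # minimum setting: 20
probes_per_second 100 3    # three interfaces at the minimum setting: 150
```

With the three interfaces of our example and the minimum setting, the targets have to answer roughly 150 pings per second just for the failure detection of a single host.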
20.10. Conclusion
The nice thing about IPMP is: It's simply there. You can use it. When you have more than one interface in your system and you care about the availability of your network, it takes you just a few seconds to activate at least the link-based variant of IPMP. This fruit is really hanging just a few centimeters above the ground.
http://docs.sun.com/app/docs/doc/816-4554/ipmptm-1?l=en&a=view
http://docs.sun.com/app/docs/doc/816-5166/ifconfig-1m?a=view
docs.sun.com - in.mpathd(1M)
docs.sun.com - ipmpstat(1M)
Misc.
Project Clearview: IPMP Rearchitecture
21.1. Introduction
When you work with the interesting features of Solaris, you forget about the small features in commands you use day by day. I used an SMF script in my IPMP tutorial to make some host routes persistent ... first and foremost to show you that SMF scripts are really easy. But, well ... this script isn't necessary because of the -p option of the route add command.
21.2. Configuration
Using the -p option makes the route boot persistent:
jmoekamp@hivemind:~# route -p add -host 192.168.2.3 192.168.2.3 -static
add host 192.168.2.3: gateway 192.168.2.3
add persistent host 192.168.2.3: gateway 192.168.2.3
At first the command tries to configure the route in the live system. When this step was successful, it makes the route boot persistent. You can check the routes already made persistent by using the route -p show command:
jmoekamp@hivemind:~# route -p show
persistent: route add -host 192.168.2.3 192.168.2.3 -static
Every stored line that isn't empty or commented out will be prepended with route add and executed at startup. The code responsible for this task is in the method script /lib/svc/method/net-routing-set which is used by the SMF service svc:/network/routing-setup. Deleting a persistent route is easy, too. Just use the route delete command with -p:
jmoekamp@hivemind:/etc/inet# route -p delete -host 192.168.2.3 192.168.2.3 -static
delete host 192.168.2.3: gateway 192.168.2.3
delete persistent host 192.168.2.3: gateway 192.168.2.3
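To make the startup behaviour described above a little more tangible, here is a sketch of such a replay. It is not the code of the real method script, just an illustration of the rule "skip empty and commented lines, prepend route add to the rest". The function name replay_persistent_routes is made up, the list of routes is passed as a parameter because the text doesn't name the file, and the sketch only prints the commands instead of executing them. It also assumes that every stored line holds just the arguments that follow route add.

```sh
#!/bin/sh
# Print the commands a boot-time replay of a persistent route list would run.
# $1 = file with one route per line (hypothetical location, passed in by the
# caller). Empty lines and lines starting with "#" are skipped.
replay_persistent_routes() {
    awk 'NF > 0 && $1 !~ /^#/ { print "route add " $0 }' "$1"
}

# Demo data in a scratch file, so nothing on the system is touched.
demo_file=/tmp/persistent_routes.$$
cat > "$demo_file" <<'EOF'
# persistent routes (demo data)

-host 192.168.2.3 192.168.2.3 -static
EOF

replay_persistent_routes "$demo_file"
rm -f "$demo_file"
```

For the demo data the sketch prints the same route add line that route -p show reported above.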
http://docs.sun.com/app/docs/doc/816-5166/ifconfig-1m?a=view
There is an interesting feature in Solaris. It is called kssl. One component of this name is obvious: SSL. So it has something to do with SSL encryption. As you may have already guessed, the k at the beginning stands for kernel. And kssl is exactly this: A proxy that does all the encryption stuff in a kernel module. This feature is already four years old. I've reported about kssl back in December 2005[1] for the first time. After talking with a customer two days ago and looking at some new material about it[2], I thought it could not harm to write a new tutorial about this topic. So many appearances in such a small timeframe are a hint ;) .
[1] http://www.c0t0d0s0.org/archives/1092-Links-of-the-week.html
[2] http://blogs.sun.com/yenduri/resource/kssl_osol.pdf
22.2. Configuration
The configuration of an SSL proxy is really easy. At first you need a certificate and the key. For this experiment we will use a self-signed certificate. I've called the system a380, thus I have to use this name in my certificate. Use the name of your own system in your configuration. Furthermore the kssl system expects key and certificate in a single file, so we concatenate both files afterwards:
# mkdir /etc/keys
# cd /etc/keys
# openssl req -x509 -nodes -days 365 -subj "/C=DE/ST=Hamburg/L=Hamburg/CN=a380" -newkey rsa:1024 -keyout /etc/keys/mykey.pem -out /etc/keys/mycert.pem
# cat mycert.pem mykey.pem > my.pem
# chmod 600 *
At first we create a file to automatically answer the passphrase question. Afterwards we configure the kssl service. This configuration statement tells the system to get the key and the certificate from a pem file. The -i option specifies the location of the file. -p tells the service where it finds the passphrase file. At the end you find a380 443. This specifies on which interface and on which port the SSL proxy should listen. At last the -x 8080 specifies to which port the unencrypted traffic should be redirected. After configuring this service, you should see a new service managed by SMF:
# svcs -a | grep "kssl"
online          9:03:33 svc:/network/ssl/proxy:kssl-a380-443
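The options explained above can be put together into a single ksslcfg call. The following is just a sketch of how such a call could look, not a copy of a listing from this tutorial: the create subcommand and the -f pem option are written down as I remember them from ksslcfg(1M), so check them against the man page of your release. The file name /etc/keys/my.pass for the passphrase file is an assumption, too. To be on the safe side, the function only prints the command instead of running it.

```sh
#!/bin/sh
# Assemble (but do not execute) a ksslcfg call from the pieces described above.
# $1 = pem file with certificate and key    (-i)
# $2 = port of the unencrypted backend      (-x)
# $3 = passphrase file, hypothetical path   (-p)
# $4 = host name, $5 = SSL port
# Assumption: "create -f pem" is the matching subcommand and format option.
kssl_create_cmd() {
    echo "ksslcfg create -f pem -i $1 -x $2 -p $3 $4 $5"
}

kssl_create_cmd /etc/keys/my.pem 8080 /etc/keys/my.pass a380 443
```

Once you have verified the printed command, you can run it as root. The name of the resulting SMF instance, kssl-a380-443 in the svcs output above, is derived from the host name and the port.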
Obviously we need a webserver in the backend that listens on port 8080. I assume that you've already installed an Apache on your server. We just add a single line to the configuration file of the webserver.
# svcadm disable apache22
# echo "Listen 192.168.178.108:8080" >> /etc/apache2/2.2/httpd.conf
# svcadm enable apache22
When you put https://a380:443 into your browser, you should see an encrypted "It works" page after you have dismissed the dialog warning you about the self-signed certificate. Or to show it to you on the command line:
# openssl s_client -connect 192.168.178.108:443
CONNECTED(00000004)
depth=0 /C=DE/ST=Hamburg/L=Hamburg/CN=a380
verify error:num=18:self signed certificate
verify return:1
depth=0 /C=DE/ST=Hamburg/L=Hamburg/CN=a380
verify return:1
---
Certificate chain
 0 s:/C=DE/ST=Hamburg/L=Hamburg/CN=a380
   i:/C=DE/ST=Hamburg/L=Hamburg/CN=a380
---
Server certificate
-----BEGIN CERTIFICATE-----
MIICoTCCAgqgAwIBAgIJAKyJdj/[...]V5jX3MU=
-----END CERTIFICATE-----
subject=/C=DE/ST=Hamburg/L=Hamburg/CN=a380
issuer=/C=DE/ST=Hamburg/L=Hamburg/CN=a380
---
No client certificate CA names sent
---
SSL handshake has read 817 bytes and written 328 bytes
---
New, TLSv1/SSLv3, Cipher is RC4-SHA
Server public key is 1024 bit
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : RC4-SHA
    Session-ID: 32CEF20CB9FE2A71C74D40BB2DB5CB304DA1B57540B7CFDD113915B99DBE9812
    Session-ID-ctx:
    Master-Key: 1E7B502390951124779C5763B5E4BBAF0A9B0693D08DCA8A587B503A5C5027B6FAD9C
    Key-Arg   : None
    Start Time: 1242985143
    Timeout   : 300 (sec)
    Verify return code: 18 (self signed certificate)
---
GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Fri, 22 May 2009 09:39:13 GMT
Server: Apache/2.2.11 (Unix) mod_ssl/2.2.11 OpenSSL/0.9.8a DAV/2
Last-Modified: Thu, 21 May 2009 21:26:30 GMT
ETag: "341f3-2c-46a72cc211a8f"
Accept-Ranges: bytes
Content-Length: 44
Connection: close
Content-Type: text/html

<html><body><h1>It works!</h1></body></html>
read:errno=0
Voila, the web server is encrypted without a single line of SSL configuration in the config files of the webserver itself.
22.3. Conclusion
It's really easy to add a kssl proxy in front of your webserver. So it isn't difficult to make encrypted webserving more efficient.
Part V. Storage
An ever-recurring problem while doing backups is that you have to keep the state of the backup consistent. For example: You use the cyrus imapd for storing mails on a system. As it's a system that has been in use for years, the message store is still on a UFS filesystem. Okay, now there is a little problem: You want to make backups, but it's a mail server. With the amount of mail junkies in modern times, you can't take the system offline for an hour or so just to make a backup. But you have to take it offline. A mail server in particular is a moving target, as cyrus imapd keeps an index of its mailboxes as well as index files within the mailboxes themselves. Let's assume a backup takes one hour, and users delete mails, create mailboxes and so on during this time. It's possible that you back up your mail server in an inconsistent state: the mail directory of a user may represent the state of one hour ago, while the mailboxes file represents the state of one minute ago.
23.1. fssnap
UFS has a little-known feature that comes to help in such a situation. You can take a filesystem snapshot. This is a non-changing point-in-time view of the filesystem, while you can still change the original filesystem. fssnap is a rather old tool. We introduced it in the 1/01 update of Solaris 8. There is a restriction with these snapshots: They are solely for the purpose of backups, thus they do not persist across reboots. For boot-persistent snapshots of a filesystem you will need the Sun Availability Suite.
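To make the idea of a point-in-time view more tangible, here is a minimal sketch in Python. It assumes a copy-on-write scheme, where the old content of a block is saved before its first overwrite. The class and its block list are purely hypothetical and are not how fssnap is implemented internally.

```python
class SnapshotDemo:
    """Toy copy-on-write snapshot over a list of blocks (not real UFS)."""

    def __init__(self, blocks):
        self.live = list(blocks)      # the filesystem that keeps changing
        self.backing_store = {}       # block index -> content at snapshot time

    def write(self, index, data):
        # Save the old content once, before the first overwrite of a block.
        if index not in self.backing_store:
            self.backing_store[index] = self.live[index]
        self.live[index] = data

    def read_snapshot(self, index):
        # Changed blocks come from the backing store, unchanged ones from live data.
        return self.backing_store.get(index, self.live[index])


fs = SnapshotDemo(["mail1", "mail2", "index-v1"])
fs.write(2, "index-v2")
fs.write(2, "index-v3")
print(fs.live[2])             # the live filesystem sees the newest data
print(fs.read_snapshot(2))    # the snapshot still sees the old index
print(len(fs.backing_store))  # only one block had to be preserved
```

The point of the sketch: the backing store only grows with the number of blocks that change while the snapshot exists, not with the size of the filesystem.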
# ls -ls
total 10
   2 -rw------T   1 root     root        1024 Apr 23 01:41 testfile1
   2 -rw------T   1 root     root        1024 Apr 23 01:41 testfile2
   2 -rw------T   1 root     root        1024 Apr 23 01:41 testfile3
   2 -rw------T   1 root     root        1024 Apr 23 01:41 testfile4
   2 -rw------T   1 root     root        1024 Apr 23 01:41 testindex1
Now we want to make a backup. It's sensible to take the mail server offline for a few seconds to keep the files consistent. At this moment we take the filesystem snapshot. This is really easy:
# fssnap -o bs=/tmp /mailserver
/dev/fssnap/0
With this command we told fssnap to take a snapshot of the filesystem mounted at /mailserver. Furthermore we configured the snapshot to use /tmp for its backing store. The backing store records the changes since the snapshot was taken. When fssnap is able to create the snapshot, it returns the name of the pseudo device containing the filesystem snapshot. In our case it's /dev/fssnap/0. Please remember it, we need it later. When you look at the /tmp directory you will find a backing store file for this snapshot. It's called snapshot0 for the first snapshot on the system:
# ls -l /tmp
total 910120
-rw-r--r--   1 root     root     [...] ogl_select305
-rw-------   1 root     root     [...] snapshot0
Now we bring the mail server online again, and after a few seconds we see changes to the filesystem again (okay, in my example I will do this manually):
# mkfile 1k testfile5
# mkfile 1k testfile6
# mkfile 2k testindex1
# ls -l
total 16
[...]
Now we want to make the backup itself. At first we have to mount the filesystem snapshot. Thus we create a mountpoint:
# mkdir /mailserver_forbackup
Now we mount the snapshot. You will recognize the pseudo device here again. The snapshot is read-only, thus you have to specify this at mount time:
# mount -o ro /dev/fssnap/0 /mailserver_forbackup
Okay, when we look into the filesystem, we will see the state of the filesystem at the moment of the snapshot. testfile5, testfile6 and the bigger testindex1 are missing.
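# ls -l
total 10
-rw------T   1 root     root        1024 Apr 23 01:41 testfile1
-rw------T   1 root     root        1024 Apr 23 01:41 testfile2
-rw------T   1 root     root        1024 Apr 23 01:41 testfile3
-rw------T   1 root     root        1024 Apr 23 01:41 testfile4
-rw------T   1 root     root        1024 Apr 23 01:41 testindex1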
After this step we should clean up. We unmount the snapshot and delete the snapshot together with its backing store file:
# umount /mailserver_forbackup
# fssnap -d /mailserver
Deleted snapshot 0.
# rm /tmp/snapshot0
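The whole procedure above is a fixed sequence of commands, so it lends itself to scripting. The following Python sketch only assembles that sequence as argument lists. It doesn't run anything, and the function name and its parameters are hypothetical. In a real script you would have to read the snapshot device from the output of fssnap instead of passing it in.

```python
def fssnap_backup_commands(filesystem, backing_dir, snapshot_device,
                           backing_file, backup_mount):
    """Return the walkthrough's commands, in order, as argument lists."""
    return [
        ["fssnap", "-o", f"bs={backing_dir}", filesystem],     # take the snapshot
        ["mkdir", backup_mount],                                # mountpoint for it
        ["mount", "-o", "ro", snapshot_device, backup_mount],   # snapshot is read-only
        # (the real backup of backup_mount would run at this point)
        ["umount", backup_mount],
        ["fssnap", "-d", filesystem],                           # delete the snapshot
        ["rm", backing_file],                                   # remove the backing store
    ]


plan = fssnap_backup_commands(
    filesystem="/mailserver",
    backing_dir="/tmp",
    snapshot_device="/dev/fssnap/0",   # printed by fssnap when it creates the snapshot
    backing_file="/tmp/snapshot0",
    backup_mount="/mailserver_forbackup",
)
for command in plan:
    print(" ".join(command))
```

Keeping the plan as data makes it easy to review the commands before handing them to something like subprocess.run on a real system.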
23.3. Conclusion
With the fssnap command you have an easy way to do consistent backups on UFS filesystems. While it's not as powerful as the point-in-time copy functionality in the Availability Suite, it's a perfect match for its job.
24.1. Introduction
With more and more available bandwidth it gets more and more feasible to use only a single network to transmit data in the datacenter. Networking can get really complex when you have to implement two networks, for example one for Ethernet and one for FC. It gets even more complex and expensive, as FC networks are mostly optical ones while Ethernet is a copper network in most datacenters. The idea of using a single network for both isn't really a new one, thus there were several attempts to develop mechanisms to allow the usage of remote block devices. One of the protocols to use block devices over IP networks is iSCSI. iSCSI is a protocol that was designed to allow clients to send SCSI commands to SCSI storage devices on a remote server. There are other technologies that were designed for this purpose, like ATA over Ethernet or FC over Ethernet, but iSCSI is the most interesting one from my point of view, as you can use already existing and proven methods implemented for IP (for example IPsec) for encrypting the traffic. The problem with iSCSI is the latency introduced by the usage of IP. But this problem can be solved by using iSER, the iSCSI Extensions for RDMA, to run iSCSI as part of a single-fabric network based on InfiniBand. We already implement this mechanism in Solaris as a part of the COMSTAR iSER project (http://opensolaris.org/os/project/iser/).
24. Legacy userland iSCSI Target
LUN: A LUN is the logical unit number. It represents an individual SCSI device. In iSCSI it's similar. When you want to use an iSCSI disk drive, the initiator connects to the target, and the target connects the initiator with the LUN in an iSCSI session.
IQN: To identify a target or an initiator, one of the naming schemes is the iSCSI Qualified Name (IQN). An example of an IQN is iqn.1986-03.com.sun:02:b29a71fb-ff60-c2c6-c92f-ff13555977e6
Figure 24.1.: The components of iSCSI
It's a little bit simplified, but this is the core idea of iSCSI: How do I present a LUN on a remote server to another server via an IP network?
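An IQN looks cryptic at first, but it has a fixed layout: the literal iqn, a year and month, a reversed domain name of the naming authority and, after a colon, a part that the authority can choose freely. As a minimal sketch, the following Python snippet takes apart the target IQN used in this chapter. The function name and the keys of the result are my own choice, and the pattern is deliberately simple.

```python
import re

# iqn.<year>-<month>.<reversed domain>[:<freely chosen unique part>]
IQN_PATTERN = re.compile(r"^iqn\.(\d{4})-(\d{2})\.([^:]+)(?::(.*))?$")


def parse_iqn(name):
    """Split an iSCSI Qualified Name into its components."""
    match = IQN_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an IQN: {name!r}")
    year, month, authority, unique = match.groups()
    return {
        "year": int(year),
        "month": int(month),
        "naming_authority": authority,   # reversed domain name
        "unique_part": unique,           # None when the IQN has no colon part
    }


info = parse_iqn("iqn.1986-03.com.sun:02:b29a71fb-ff60-c2c6-c92f-ff13555977e6")
print(info["naming_authority"], info["unique_part"])
```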
24.4.1. Environment
For this example, I will use both of my demo VMs again:
10.211.55.200 theoden
10.211.55.201 gandalf
Both systems run Solaris Express Build 84 for x86, but you can do the same with Solaris Update 4 for SPARC and x86 as well. In our example, theoden is the server with the iSCSI target. gandalf is the server that wants to use the LUN on theoden via iSCSI, thus gandalf is the server with the initiator.
24.4.2. Prerequisites
At first, we log in to theoden and assume root privileges. Okay, to test iSCSI we need some storage volumes to play around with. There is a nice way to create a playground with ZFS: You can use files as devices. But at first we have to create these files:
# mkdir /zfstest
# cd /zfstest
# mkfile 128m test1
# mkfile 128m test2
# mkfile 128m test3
# mkfile 128m test4
NAME                 USED  AVAIL  REFER  MOUNTPOINT
testpool             200M   260M    18K  /testpool
testpool/zfsvolume   200M   460M    16K  -
The emulated volume has a size of 200M. Okay, it's really easy to enable the iSCSI target. At first we have to enable the iSCSI target service:
# svcadm enable iscsitgt
After this we configure the initiator and tell the initiator to discover devices on our iSCSI target.
# iscsiadm modify initiator-node -A gandalf
# iscsiadm add discovery-address 10.211.55.200
# iscsiadm modify discovery -t enable
The -c iscsi limits the scan to iSCSI devices. With the format command we look for the available disks in the system:
AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 4076 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c1d0 <DEFAULT cyl 4077 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
       2. c2t0100001C42E9F21A00002A0047E39E34d0 <DEFAULT cyl 198 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g0100001c42e9f21a00002a0047e39e34
Specify disk (enter its number): ^C
Okay, there is a new device with a really long name. We can use this device for a ZFS pool:
# zpool create zfsviaiscsi c2t0100001C42E9F21A00002A0047E39E34d0
# zpool list
NAME          SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
zfsviaiscsi   187M   480K   187M     0%  ONLINE  -
#
As you see, we have created a ZFS filesystem via iSCSI on an emulated volume on a zpool on a remote system.
24.5.1. Prerequisites
At first we need some basic data. We need the IQNs of both systems. At first we look up the IQN of the initiator. Thus we log in to gandalf and assume root privileges:
2
http://en.wikipedia.org/wiki/Challenge-handshake_authentication_protocol
# iscsiadm list initiator-node
Initiator node name: iqn.1986-03.com.sun:01:00000000b89a.47e38163
Initiator node alias: gandalf
[...]
Now we look up the IQN of the target. We log in to theoden and assume root privileges:
# iscsitadm list target
Target: testpool/zfsvolume
    iSCSI Name: iqn.1986-03.com.sun:02:b29a71fb-ff60-c2c6-c92f-ff13555977e6
    Connections: 1
Both IQNs are important in the following steps. We need them as identifiers for the systems.
You don't have to do these steps. The zpool may just become unavailable while we configure the authentication, and you will see a few more lines in your logfiles. Okay, now we configure the CHAP authentication.
# iscsiadm modify initiator-node --CHAP-name gandalf
# iscsiadm modify initiator-node --CHAP-secret
Enter secret:
Re-enter secret:
# iscsiadm modify initiator-node --authentication CHAP
What have we done with these statements? We told the iSCSI initiator to identify itself as gandalf. Then we set the password and told the initiator to use CHAP to authenticate.
This isn't an admin login. This is a little misleading. Now we create an initiator object on the target. We connect the long IQN with a shorter name.
# iscsitadm create initiator --iqn iqn.1986-03.com.sun:01:00000000b89a.47e38163 gandalf
Now we tell the target that the initiator on the system gandalf will identify itself with the name gandalf:
# iscsitadm modify initiator --chap-name gandalf gandalf
Okay, now we set the password for this initiator. This is the same password we set on the initiator.
# iscsitadm modify initiator --chap-secret gandalf
Enter secret:
Re-enter secret:
Finally we tell the target, that the system gandalf is allowed to access the testpool/zfsvolume:
# iscsitadm modify target --acl gandalf testpool/zfsvolume
Now the initiator has to authenticate itself before the target daemon grants access to the target. You could skip the next steps and fast-forward to the section "Reactivation of the zpool", but the authentication is only unidirectional at the moment: the client (initiator) authenticates itself at the server (target).
Okay, but it would be nice if the target identified itself to the initiator as well. At first we tell the initiator that the target with the IQN iqn.1986-03.com.sun:02:b29a71fb-ff60-c2c6[...] will authenticate itself with the name theoden. These steps have to be done on the initiator, thus we log in to gandalf again.
# iscsiadm modify target-param --CHAP-name theoden iqn.1986-03.com.sun:02:b29a71fb-ff60-c2c6-c92f-ff13555977e6
Now we set the secret to authenticate. This is the secret we configured as the CHAP secret on the target with iscsitadm modify admin --chap-secret:
# iscsiadm modify target-param --CHAP-secret iqn.1986-03.com.sun:02:b29a71fb-ff60-c2c6-c92f-ff13555977e6
Enter secret:
Re-enter secret:
Okay, now we have completed the configuration for the bidirectional authentication.
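What happens on the wire when CHAP is in use? The secret itself is never transmitted. The verifying side sends a random challenge, and the other side answers with a hash over an identifier, the secret and the challenge (this is the MD5 variant defined in RFC 1994). The following Python sketch shows one such round for each direction. The secrets and the function names are hypothetical, and the sketch leaves out everything iSCSI wraps around the exchange.

```python
import hashlib
import hmac
import os


def chap_response(identifier, secret, challenge):
    """RFC 1994: response = MD5(identifier octet || secret || challenge)."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


def authenticate(identifier, verifier_secret, peer_secret):
    """One CHAP round: the verifier challenges, the peer answers, the verifier checks."""
    challenge = os.urandom(16)                                    # sent in the clear
    response = chap_response(identifier, peer_secret, challenge)  # computed by the peer
    expected = chap_response(identifier, verifier_secret, challenge)
    return hmac.compare_digest(expected, response)


# Hypothetical secrets, only for this illustration.
secret_for_initiator = b"initiator-secret-01"
secret_for_target = b"target-secret-0002"

# Unidirectional: the target verifies the initiator.
initiator_ok = authenticate(1, secret_for_initiator, secret_for_initiator)
# Bidirectional adds the reverse direction with a second secret.
target_ok = authenticate(2, secret_for_target, secret_for_target)
# A peer that does not know the secret fails the check.
intruder_ok = authenticate(3, secret_for_initiator, b"wrong-secret-0000")
print(initiator_ok, target_ok, intruder_ok)
```

This also shows why both sides need the same secret configured: each side computes the hash on its own and the results have to match.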
At first we've created a directory to keep the files, then we told the target daemon to use it for storing the targets. After this we can create the target:
# iscsitadm create target --size 128m smalltarget
Now we switch to the server we use as an initiator. Let's scan for new devices on gandalf. As we've activated the discovery of targets before, we just have to scan for new devices.
# devfsadm -c iscsi -C
# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 4076 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c1d0 <DEFAULT cyl 4077 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
       2. c2t0100001C42E9F21A00002A0047E39E34d0 <SUN-SOLARIS-1-200.00MB>
          /scsi_vhci/disk@g0100001c42e9f21a00002a0047e39e34
       3. c2t0100001C42E9F21A00002A0047E45145d0 <DEFAULT cyl 126 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g0100001c42e9f21a00002a0047e45145
Specify disk (enter its number):
Specify disk (enter its number): ^C
#
That's all. The difference is the small -s. It tells ZFS to create a sparse (aka thin) provisioned volume. Well, I won't enable iSCSI for this volume via shareiscsi=on. I will configure this manually. Like normal volumes, zvols are available within the /dev tree of your filesystem:
# ls -l /dev/zvol/dsk/testpool
total 4
lrwxrwxrwx   1 root     root          35 Mar 22 02:09 bigvolume -> ../../../../devices/pseudo/zfs@0:2c
lrwxrwxrwx   1 root     root          35 Mar 21 12:33 zfsvolume -> ../../../../devices/pseudo/zfs@0:1c
Okay, we can use these devices as a backing store for an iSCSI target as well. We've created a zvol bigvolume within the zpool testpool. Thus the device is /dev/zvol/dsk/testpool/bigvolume:
# iscsitadm create target -b /dev/zvol/dsk/testpool/bigvolume bigtarget
Okay, I'm switching to my root shell on the initiator. Again we scan for devices:
AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 4076 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c1d0 <DEFAULT cyl 4077 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
       2. c2t0100001C42E9F21A00002A0047E39E34d0 <SUN-SOLARIS-1-200.00MB>
          /scsi_vhci/disk@g0100001c42e9f21a00002a0047e39e34
       3. c2t0100001C42E9F21A00002A0047E45DB2d0 <DEFAULT cyl 1021 alt 2 hd 128 sec 32>
          /scsi_vhci/disk@g0100001c42e9f21a00002a0047e45db2
Do you remember that we used four 128 MB files as the devices for our zpool on our target? Well, you have a 1.98G filesystem running on these files. You can add more storage to the zpool on the target and you have nothing to do on the initiator. Not a real kicker for ZFS, but imagine the same for other filesystems that can't be grown as easily as a zpool.
24.7. Conclusion
Okay, this was a quick introduction to the current implementation of iSCSI on Solaris. The future will bring changes to the implementation of the iSCSI target feature, but new possibilities as well. iSCSI will be a part of the COMSTAR framework (http://opensolaris.org/os/comstar) in the future, besides a SAS target and an FC target.
A while ago, I wrote a tutorial about using iSCSI in OpenSolaris and Solaris 10. While that tutorial is still valid for Solaris 10, OpenSolaris got a new, more efficient iSCSI target. iSCSI in OpenSolaris is a part of a more generic in-kernel framework to provide SCSI target services. This framework isn't just capable of delivering iSCSI, you can have FCoE, SRP and FC targets as well. This generic approach led to a different administrative model. Thus when you want to configure iSCSI on an OpenSolaris system and you want the new framework (the old userland-based version is still available), you have to configure the target side differently. This tutorial will do pretty much the same as the old iSCSI tutorial, just with the new framework, to give you an overview of the changes.
25. COMSTAR iSCSI Target
[...] called logical unit provider. The logical unit is the device within a target responsible for executing SCSI I/O commands. Perhaps you've heard the acronym LUN in the past. LUN is the Logical Unit Number, a number designating a certain logical unit. Between both components is stmf, the SCSI target mode framework. It connects the port provider and the logical unit provider. What's the advantage? You just have to develop the SCSI target once, and for new protocols you just have to develop a relatively simple port provider. This is one of the reasons why we see new port providers quite frequently. In essence this is the reason why the administrative model has changed with the introduction of COMSTAR: The logical unit, the port and the logic to combine both are separate entities, thus you configure them separately.
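The separation described above can be sketched in a few lines of Python. All class and method names here are hypothetical stand-ins and have nothing to do with the real kernel interfaces of COMSTAR. The sketch just shows why a new protocol only needs a new, thin port provider: the logical unit and the framework in the middle stay untouched.

```python
class DiskLogicalUnit:
    """Stands in for a logical unit provider: executes block I/O on a backing store."""

    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

    def write(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload


class TargetFramework:
    """Stands in for stmf: routes requests from any port provider to a logical unit."""

    def __init__(self):
        self.logical_units = {}

    def register_lu(self, lun, logical_unit):
        self.logical_units[lun] = logical_unit

    def execute(self, lun, operation, *args):
        return getattr(self.logical_units[lun], operation)(*args)


class PortProvider:
    """Stands in for a port provider: only knows its protocol and the framework."""

    def __init__(self, protocol, framework):
        self.protocol = protocol
        self.framework = framework

    def receive(self, lun, operation, *args):
        return self.framework.execute(lun, operation, *args)


framework = TargetFramework()
framework.register_lu(0, DiskLogicalUnit(size=1024))
iscsi_port = PortProvider("iSCSI", framework)
fc_port = PortProvider("FC", framework)

iscsi_port.receive(0, "write", 0, b"hello")   # written via one protocol ...
print(fc_port.receive(0, "read", 0, 5))       # ... and read back via another
```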
25.2. Prerequisites
Okay, let's look at how you configure this stuff. I will use two systems in this example:
192.168.56.101 initiator
192.168.56.102 target
The system target has to use a recent version of OpenSolaris. You can use OpenSolaris as well as OpenSolaris Community Edition. My testbed used OpenSolaris Developer Build 127, but it should work with any reasonably recent version. On the system initiator any Solaris system will do, as COMSTAR is a change on the target side. It doesn't change anything on the initiator side.
ACTIONS 2205/2205
Please reboot the system after this step. This framework has many connections to the rest of the system and it's just easier to reboot after the initial install. Okay, right after the boot the service stmf is disabled.
jmoekamp@target:~# svcs stmf
STATE          STIME    FMRI
disabled       13:44:42 svc:/system/stmf:default
Obviously we need a backing store for our iSCSI target. In this case I will use a sparse provisioned 10 terabyte emulated ZFS volume.
jmoekamp@target:~# zfs create -V 10T -s rpool/iscsivol
There are other options besides an emulated ZFS volume, like a pregenerated file (mkfile 10T backingstore), a file that grows from zero to a certain preconfigured file size (touch backingstore), and finally just mapping a physical device through the system. Now we have to configure a logical unit provider in the COMSTAR framework to use our backing store. We want a disk, so we use the disk logical unit provider. We have to use the sbdadm command for this task.
jmoekamp@target:~# sbdadm create-lu /dev/zvol/rdsk/rpool/iscsivol

Created the following LU:

              GUID                    DATA SIZE           SOURCE
--------------------------------  -------------------  ----------------
600144f0b152cc0000004b080f230004      10995116277760      /dev/zvol/rdsk/rpool/iscsivol
With the stmfadm command you can get some additional insight into your newly created logical unit.
jmoekamp@target:~# stmfadm list-lu -v
LU Name: 600144F0B152CC0000004B080F230004
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/rpool/iscsivol
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/rpool/iscsivol
    Meta File         : not set
    Size              : 10995116277760
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : SUN
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Enabled
    Access State      : Active
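The long number in the Size field is simply the 10T from the zfs create command, counted in bytes with binary prefixes. A quick check in Python, as a sketch (the variable names are my own):

```python
TIB = 2 ** 40                    # one terabyte in binary prefixes

volume_bytes = 10 * TIB          # the "10T" given to zfs create -V
block_size = 512                 # "Block Size" reported by stmfadm
block_count = volume_bytes // block_size

print(volume_bytes)   # 10995116277760
print(block_count)    # 21474836480
```

So the initiator will see a disk with a bit more than 21 billion blocks, even though the sparse volume occupies almost no space in the pool at this point.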
But at the moment your brand new LU isn't visible. You have to add it to an entity called a view. In a view you can define which logical units a certain initiator can see. For example, you can configure with these views that all your SAP systems just see the SAP logical units and all MS Exchange servers just see the Exchange logical units on your storage service. Other logical units aren't visible to these systems, and they are totally unaware of their existence. For people with a storage background: This is somewhat similar to LUN masking ;) But for this tutorial I will use a simple configuration. I will not impose any access control on this logical unit:
jmoekamp@target:~# stmfadm add-view 600144F0B152CC0000004B080F230004
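The effect of views can be modelled in a few lines. This Python sketch is a strongly simplified assumption of how such a filter behaves: the group names, the logical unit names and the "*" marker for a view without restrictions are all hypothetical and are not stmfadm syntax.

```python
ALL_HOSTS = "*"   # hypothetical marker for a view without any restriction


def visible_logical_units(views, initiator_groups):
    """views: list of (lu_name, host_group) pairs; returns the LUs an initiator sees."""
    visible = []
    for lu_name, host_group in views:
        if host_group == ALL_HOSTS or host_group in initiator_groups:
            visible.append(lu_name)
    return visible


views = [
    ("sap-lu", "sap-hosts"),
    ("exchange-lu", "exchange-hosts"),
    ("tutorial-lu", ALL_HOSTS),       # like the view added above: no access control
]

print(visible_logical_units(views, {"sap-hosts"}))
print(visible_logical_units(views, set()))
```

A SAP host sees its own logical unit plus the unrestricted one, while a host that belongs to no group only sees the unrestricted one.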
However, the logical unit is still not visible to the outside, as we haven't configured a port provider. The port provider is the component that provides access to the logical unit via iSCSI or FC, for example.
[...] port number you use to access an iSCSI target. Target portal groups are used to bundle such portals to ease the configuration of access controls.
# itadm create-tpg e1000g0 192.168.56.102
This IQN uniquely identifies the target in a network. We will use it from now on when we address this iSCSI target. That's all. For unauthenticated iSCSI this is all you have to do on the target side: We've configured the logical unit provider and enabled a port provider to give other systems access to the target via iSCSI.
AVAILABLE DISK SELECTIONS:
       0. c7d0 <DEFAULT cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
Specify disk (enter its number): ^C
Just our boot disk. At first we have to install the iSCSI initiator. This can be done with pkg install iscsi. After this step you have to reboot the system.
jmoekamp@inititator:~# svcs -a | grep "iscsi"
online         14:08:34 svc:/network/iscsi/initiator:default
Okay, the iSCSI service runs on the host initiator. Now we have to configure the initiator to discover possible logical units on our target:
jmoekamp@inititator:~# iscsiadm add discovery-address 192.168.56.102:3260
jmoekamp@inititator:~# iscsiadm modify discovery --sendtargets enable
At first we told the iSCSI initiator to discover logical units at the target portal on 192.168.56.102:3260. After this step, we configured the iSCSI initiator to use the SendTargets method to discover logical units on the other side. SendTargets is a command the initiator sends to the configured hosts to gather all targets available to the initiator issuing the command. Every iSCSI target implementation has to support the SendTargets command. This is specified by Appendix D of RFC 3720. This makes the configuration of iSCSI easier, as you don't have to configure all the targets manually. Okay, let's check whether the configuration of the discovery was successful:
jmoekamp@inititator:~# iscsiadm list discovery
Discovery:
        Static: disabled
        Send Targets: enabled
        iSNS: disabled
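The answer to a SendTargets request is plain text: a TargetName key for each target, followed by one or more TargetAddress keys in the form address:port,portal-group-tag. As a minimal sketch, this Python function parses such an answer. The function name is hypothetical, the sample response is made up from the values of this tutorial, and the parser doesn't handle IPv6 addresses in brackets.

```python
DEFAULT_ISCSI_PORT = 3260


def parse_sendtargets(pairs):
    """Turn 'key=value' strings of a SendTargets answer into {target: [(host, port, tag)]}."""
    targets = {}
    current = None
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key == "TargetName":
            current = targets.setdefault(value, [])
        elif key == "TargetAddress" and current is not None:
            address, _, tag = value.partition(",")
            host, separator, port = address.rpartition(":")
            if not separator:                 # no port given: use the default
                host, port = address, str(DEFAULT_ISCSI_PORT)
            current.append((host, int(port), int(tag) if tag else None))
    return targets


response = [
    "TargetName=iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422",
    "TargetAddress=192.168.56.102:3260,2",
]
discovered = parse_sendtargets(response)
print(discovered)
```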
Now let's look at what targets were discovered by the iSCSI initiator:
jmoekamp@inititator:~# iscsiadm list target
Target: iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422
        Alias:
        TPGT: 2
        ISID: 4000002a0000
        Connections: 1
The iSCSI target IQN you saw when you configured the target on the host target reappears here. Okay, now we have to make all discovered logical units available for use as block devices. In Solaris you use the devfsadm command for this task. This will populate the /dev tree:
jmoekamp@inititator:~# devfsadm -C -i iscsi
jmoekamp@inititator:~#
When we start format again, we will see that our host initiator has a new block device:
jmoekamp@inititator:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t600144F0B152CC0000004B080F230004d0 <SUN-COMSTAR-1.0-10.00TB>
          /scsi_vhci/disk@g600144f0b152cc0000004b080f230004
       1. c7d0 <DEFAULT cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
Specify disk (enter its number): ^C
jmoekamp@inititator:~#
A brand new 10 TB iSCSI volume. Of course we can use it for zfs again:
jmoekamp@inititator:~# zpool create testpool c0t600144F0B152CC0000004B080F230004d0
jmoekamp@inititator:~# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
rpool     10,1G  5,52G    81K  /rpool
[...]
testpool    72K  9,78T    21K  /testpool
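By the way, the really long device name isn't random. In this setup the part between the t and the trailing d0 is the GUID that sbdadm printed on the target when we created the logical unit. A small Python sketch to pull it out of the name, assuming the cXtYdZ naming seen in the format output (the function name is my own):

```python
import re

# c<controller>t<target>d<disk>; the greedy match stops at the last "d" before the digits
DEVICE_PATTERN = re.compile(r"^c(\d+)t([0-9A-Fa-f]+)d(\d+)$")


def lu_guid_from_device(name):
    """Return the target part of a cXtYdZ device name in upper case."""
    match = DEVICE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected device name: {name!r}")
    return match.group(2).upper()


device = "c0t600144F0B152CC0000004B080F230004d0"    # from the format output
lu_name = "600144F0B152CC0000004B080F230004"        # from stmfadm list-lu -v
print(lu_guid_from_device(device) == lu_name)
```

This is handy when you have many logical units and want to find out which disk on the initiator belongs to which backing store on the target.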
Okay, to configure the authentication we need two pieces of information at first. Log in to the system target to gather the IQN of the target.
jmoekamp@target:~# itadm list-target
TARGET NAME                                                  STATE    SESSIONS
iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422  online   1
Now log into the system initiator to gather the IQN of the initiator.
jmoekamp@inititator:~# iscsiadm list initiator-node
Initiator node name: iqn.1986-03.com.sun:01:809526a500ff.4b07e652
[...]
        Configured Sessions: 1
At first we configure the iSCSI target so that it uses CHAP authentication. We need the IQN of the target here.
jmoekamp@target:~# itadm modify-target -a chap iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422
Target iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422 successfully modified
Now we have to configure the secrets. The CHAP secrets have to be at least 12 characters long. At first we configure the secret that the initiator will use to authenticate at the target.
jmoekamp@target:~# itadm create-initiator -s iqn.1986-03.com.sun:01:809526a500ff.4b07e652
Enter CHAP secret: foobarfoobarfoobar
Re-enter secret: foobarfoobarfoobar
Now we configure the secret that the target uses to authenticate itself at the initiator.
jmoekamp@target:~# itadm modify-target -s iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422
Enter CHAP secret: snafusnafusnafu
Re-enter secret: snafusnafusnafu
Target iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422 successfully modified
Okay, we have to do something similar on the system initiator. At first we have to configure the CHAP secret that the initiator uses to authenticate itself:
jmoekamp@inititator:~# iscsiadm modify initiator-node --CHAP-secret
Enter secret: foobarfoobarfoobar
Re-enter secret: foobarfoobarfoobar
Now we tell the system that this initiator should use CHAP authentication to authenticate a target.
jmoekamp@inititator:~# iscsiadm modify initiator-node --authentication CHAP
The next steps configure the authentication relation between our initiator and a defined target. At first we activate bidirectional authentication, so the initiator has to authenticate at the target and the target has to authenticate itself at the initiator.
jmoekamp@inititator:~# iscsiadm modify target-param \
--bi-directional-authentication enable iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422
Now we tell the iSCSI initiator that the iSCSI target uses CHAP to authenticate the iSCSI initiator.
jmoekamp@inititator:~# iscsiadm modify target-param \
--authentication CHAP iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422
In a last step we configure the shared secret that the target uses to authenticate itself at the initiator.
jmoekamp@inititator:~# iscsiadm modify target-param \
--CHAP-secret iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422
Enter secret: snafusnafusnafu
Re-enter secret: snafusnafusnafu
Perhaps this figure explains the relation between the secrets better than just command lines can do:
Now we are done. We can do a quick check whether our configuration found its way into the system. At first a quick look at the iSCSI target on target:
jmoekamp@target:~# itadm list-initiator
INITIATOR NAME                                CHAPUSER  SECRET
iqn.1986-03.com.sun:01:809526a500ff.4b07e652  <none>    set
jmoekamp@target:~# itadm list-target -v
TARGET NAME                                                  STATE    SESSIONS
iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422  online   1
        alias:
        auth:                   chap
        targetchapuser:
        targetchapsecret:       set
        tpg-tags:               e1000g0 = 2
and
jmoekamp@inititator:~# iscsiadm list target-param -v
Target: iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422
        Alias:
        Bi-directional Authentication: enabled
        Authentication Type: CHAP
                CHAP Name: iqn.1986-03.com.sun:02:fd9d53cd-6fe5-ecd7-de40-cb3bbb03b422
[...]
        Configured Sessions: 1
jmoekamp@inititator:~#
Okay, everything is okay ... now let's just reimport the testpool:
jmoekamp@inititator:~# zpool import testpool
jmoekamp@inititator:~#
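Two secrets, each configured on two hosts: it's easy to mix them up. The following Python sketch states the relation as a small consistency check. The dictionary keys are hypothetical names for the four places where a secret was entered in this tutorial, and the minimum length of 12 characters is the one mentioned above.

```python
MIN_SECRET_LENGTH = 12   # minimum length stated in the text


def check_chap_setup(on_target, on_initiator):
    """Compare the secrets stored on both hosts and return a list of problems."""
    pairs = {
        # the secret the initiator proves to the target
        "initiator secret": (on_target["secret_of_initiator"], on_initiator["own_secret"]),
        # the secret the target proves to the initiator
        "target secret": (on_target["own_secret"], on_initiator["secret_of_target"]),
    }
    problems = []
    for label, (target_copy, initiator_copy) in pairs.items():
        if target_copy != initiator_copy:
            problems.append(f"{label} differs between the hosts")
        if min(len(target_copy), len(initiator_copy)) < MIN_SECRET_LENGTH:
            problems.append(f"{label} is shorter than {MIN_SECRET_LENGTH} characters")
    return problems


target_host = {
    "secret_of_initiator": "foobarfoobarfoobar",   # itadm create-initiator -s
    "own_secret": "snafusnafusnafu",               # itadm modify-target -s
}
initiator_host = {
    "own_secret": "foobarfoobarfoobar",            # iscsiadm modify initiator-node --CHAP-secret
    "secret_of_target": "snafusnafusnafu",         # iscsiadm modify target-param --CHAP-secret
}
print(check_chap_setup(target_host, initiator_host))
```

An empty list means that both directions of the authentication have matching secrets of sufficient length.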
25.7. Conclusion
The configuration of an iSCSI target isn't more difficult than before. It's just different. I hope I gave you some good insight into this topic.
docs.sun.com: iscsiadm(1M) - enable management of iSCSI initiators
Tutorials
c0t0d0s0.org: Less known Solaris Features: iSCSI - tutorial for the userland iSCSI implementation in Solaris.
26.1. Introduction
Solaris was designed with commercial customers in mind. Thus this operating environment has some capabilities that are somewhat useless for the SOHO user, but absolutely essential for enterprise users. One of these capabilities is remote replication of disk volumes. Imagine the following situation: You have a database and a filesystem for a central application of your company. The filesystem stores binary objects (for example images or something like that). The application is so important for your company that you plan to build a replicated site. And now your problems start.
Replicating a database is fairly easy. Most databases have some functionality to do a master/replica replication. Filesystems are a little bit harder. Of course you can use rsync, but what about the data you wrote between the last rsync run and the failure of your main system? And how do you keep the database and the filesystem consistent?
The Solaris operating environment has a feature to solve this problem. It's called Availability Suite (AVS for short). It's a rather old feature, but I would call it a mature one. The first versions of the tool weren't really enterprise-ready, and this led to some rude nicknames for this feature, but that's long ago ...
AVS was designed to give the operating environment the capability to replicate a volume to another site independently of the way it's used. Thus it's irrelevant whether the volume is used by a filesystem or as a raw device for a database. Well, you can even use it to give ZFS the capability to do synchronous replication (a feature missing today). AVS has a point-in-time copy feature (something similar to snapshots) as well, but this tutorial will concentrate on the remote mirror capability.
Some words about Sun StorageTek Availability Suite at first: We open-sourced the product quite a while ago. While it's a commercial product for Solaris, we've integrated it into Solaris Express Community Edition and Developer Edition.
26.3. Wording
Some definitions are important to understand the following text:
primary host/primary volume: The volume or host which acts as the source of the replication.
secondary host/secondary volume: The volume or host which acts as the target of the replication.
bitmap volume: Each secondary and primary volume has a so-called bitmap volume. This bitmap volume is used to store the changes to the primary or secondary volume when replication has failed or was deliberately stopped.
26.7. Synchronization
Okay, the replication takes care of keeping the replicated volume and the replica identical while the software runs. But how do you sync both volumes when starting the replication, or later on, when the replication was interrupted? The process to solve this is called synchronization. AVS Remote Mirror knows four modes of replication:

Full replication: You do at least one full replication with every volume. It's the first one. The full replication copies all data to the secondary volume.

Update replication: When the volume is in logging mode, the changes to the primary volume are recorded on the bitmap volume. Thus with the update replication you can transmit only the changed blocks to the secondary volume.

Full reverse replication: This is the other way round. Let's assume you've done a failover to your remote site, and you've worked on the replicated volume for some time. Now you want to switch back to your normal datacenter. You have to transport the changes from the meantime to your primary site as well. Thus there is a replication mode called reverse replication. The full reverse replication copies all data from the secondary volume to the primary volume.

Update reverse replication: The secondary volume has a bitmap volume as well. Thus you can do an update replication from the secondary to the primary volume as well.

Okay, but which mode of replication should you choose? For the first replication it's easy: full replication. After this, there is a simple rule of thumb: Whenever in doubt about the integrity of the target volume, use the full replication.
26.8. Logging
At last there is another important term in this technology: logging. This has nothing to do with writing log messages about the daemons of AVS. Logging is a special mode of operation. This mode is entered when the replication is interrupted. In this case the changes to the primary and secondary volumes will be recorded in the bitmap volume. It's important to note that logging doesn't record the change itself. It stores only the information that a part of the volume has changed. Logging makes the resynchronization of volumes after a disaster more efficient, as you only have to resync the changed parts of a volume, as I've explained for the mechanism of update replication before.
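The interplay of logging and update replication can be illustrated with a small model. This is only a conceptual sketch, not the AVS implementation: the class name, the one-flag-per-block bitmap and the in-memory lists are assumptions made for illustration.

```python
# Conceptual model only. AVS keeps its change flags in a bitmap volume;
# the class, the block granularity and the lists here are illustrative assumptions.

class ReplicatedVolume:
    def __init__(self, blocks):
        self.primary = list(blocks)
        self.secondary = list(blocks)        # assume the initial full sync is done
        self.dirty = [False] * len(blocks)   # the "bitmap": one flag per block
        self.logging = False

    def write(self, index, data):
        self.primary[index] = data
        if self.logging:
            # Logging mode: remember THAT the block changed, not the change itself.
            self.dirty[index] = True
        else:
            # Replicating mode: the write reaches the secondary as well.
            self.secondary[index] = data

    def update_sync(self):
        """Copy only the flagged blocks; return how many blocks were sent."""
        sent = 0
        for i, changed in enumerate(self.dirty):
            if changed:
                self.secondary[i] = self.primary[i]
                self.dirty[i] = False
                sent += 1
        self.logging = False
        return sent


vol = ReplicatedVolume(["a", "b", "c", "d"])
vol.logging = True        # e.g. the network link went down
vol.write(1, "B")
vol.write(1, "BB")        # same block twice: still only one dirty flag
vol.write(3, "D")
blocks_sent = vol.update_sync()
```

In this model two blocks are transferred instead of four, and the intermediate value "B" is never sent, because only the position of the change was recorded.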
It's important that the primary and secondary partitions and their respective bitmap partitions are equal in size. Furthermore: Don't use cylinder 0 for partitions under the control of AVS. This cylinder may contain administrative information from other components of the system. Replication of this information may lead to data loss.
This command may ask for approval to create the config database when you run it for the first time. Answer this question with y. After this we switch to gandalf to do the same.
[root@gandalf:~]$ dscfgadm -e
Now we can establish the replication. We log in at theoden first and configure this replication.
[root@theoden:~]$ sndradm -e theoden /dev/rdsk/c1d0s1 /dev/rdsk/c1d0s0 gandalf /dev/rdsk/c1d0s1 /dev/rdsk/c1d0s0 ip sync
Enable Remote Mirror? (Y/N) [N]: y
What have we configured? We told AVS to replicate the content of /dev/rdsk/c1d0s1 on theoden to /dev/rdsk/c1d0s1 on gandalf. AVS uses the /dev/rdsk/c1d0s0 volume on each system as the bitmap volume for this replication. At the end of this command we configure that the replication uses IP and that it's a synchronous replication. Okay, but we have to configure it on the target system of the replication as well:
[root@gandalf:~]$ sndradm -e theoden /dev/rdsk/c1d0s1 /dev/rdsk/c1d0s0 gandalf /dev/rdsk/c1d0s1 /dev/rdsk/c1d0s0 ip sync
Enable Remote Mirror? (Y/N) [N]: y
We repeat the same command we used on theoden on gandalf as well. Forgetting this step is one of the most frequent errors when setting up a remote mirror. An interesting command with regard to the AVS remote mirror is the dsstat command. It shows the status and some statistical data about your replication.
[root@theoden:~]$ dsstat -m sndr
name              t  s     pct  role  kps  tps  svt
dev/rdsk/c1d0s1   P  L  100.00   net    0    0    0
dev/rdsk/c1d0s0                  bmp    0    0    0
The 100.00 doesn't stand for "100% of the replication is completed". It stands for "100% of the replication still to do". We have to start the replication manually. More formally, the meaning of this column is "percentage of the volume in need of syncing". And as we freshly configured this replication, it's obvious that the complete volume needs synchronisation.
Two other columns are important, too: the t and the s column. The t column designates the volume type and the s column the status of the volume. In this case we've observed the primary volume and it's in the logging mode. It records changes, but doesn't replicate them to the secondary volume right now. Okay, so let's start the synchronisation:
[root@theoden:~]$ sndradm -m gandalf:/dev/rdsk/c1d0s1
Overwrite secondary with primary? (Y/N) [N]: y
We can look up the progress of the sync with the dsstat command again:
[root@theoden:~]$ dsstat -m sndr
name              t   s    pct  role
dev/rdsk/c1d0s1   P  SY  97.39   net
dev/rdsk/c1d0s0                  bmp
[root@theoden:~]$ dsstat -m sndr
name              t   s    pct  role
dev/rdsk/c1d0s1   P  SY  94.78   net
dev/rdsk/c1d0s0                  bmp
[...]
[root@theoden:~]$ dsstat -m sndr
name              t   s    pct  role
dev/rdsk/c1d0s1   P  SY   3.33   net
dev/rdsk/c1d0s0                  bmp
[root@theoden:~]$ dsstat -m sndr
name              t   s    pct  role
dev/rdsk/c1d0s1   P   R   0.00   net
dev/rdsk/c1d0s0                  bmp
When we start the synchronization, the status of the volume switches to SY for synchronizing. After a while the sync is complete. The status switches again, this time to R for replicating. From this moment on all changes to the primary volume will be replicated to the secondary one. Now let's play around with our new replication set by using the primary volume. Create a filesystem on it, for example, mount it and play around with it:
[root@theoden:~]$ newfs /dev/dsk/c1d0s1
newfs: construct a new file system /dev/rdsk/c1d0s1: (y/n)? y
/dev/rdsk/c1d0s1: 968704 sectors in 473 cylinders of 64 tracks, 32 sectors
473.0 MB in 30 cyl groups (16 c/g, 16.00 MB/g, 7680 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 32832, 65632, 98432, 131232, 164032, 196832, 229632, 262432, 295232,
656032, 688832, 721632, 754432, 787232, 820032, 852832, 885632, 918432, 951232
[root@theoden:~]$ mount /dev/dsk/c1d0s1 /mnt
[root@theoden:~]$ cd /mnt
[root@theoden:~]$ touch test
[root@theoden:~]$ cp /var/log/* .
[root@theoden:~]$ mkfile 1k test2
Okay, in a few seconds I will show you, that all changes really get to the other side.
messages spellhist
sulog test
Please keep the timestamp in mind. Now we switch both mirrors into the logging mode. As an alternative you can disconnect the network cable. This will have the same effect. Whenever the network link between the two hosts is unavailable, both volumes will be set to the logging mode. As I use virtual servers, I can't disconnect a network cable, so I can't do it this way. Okay ...
When you look at the status of the replication on theoden, you will see the logging state again.
[root@theoden:~]$ dsstat
name              t  s   pct  role  ckps  dkps  tps  svt
dev/rdsk/c1d0s1   P  L  0.00   net           0    0    0
dev/rdsk/c1d0s0                bmp     0     0    0    0
Okay, now we mount the secondary volume. Please keep in mind that we don't mount the volume via the network or via a dual-ported SAN. It's an independent storage device on a different system.
[root@gandalf:~]$ mount /dev/dsk/c1d0s1 /mnt
[root@gandalf:~]$ cd /mnt
[root@gandalf:~]$ ls -l
total 7854
-rw------- 1 root root 0 Mar
[..]
-rw-r--r-- 1 root root 29 Mar
-rw-r--r-- 1 root root 2232 Mar
-rw-r--r-- 1 root root 43152 Mar
Okay, there is a file called timetest. Let's look at the data in the file.
[root@gandalf:~]$ cat timetest
Sat Mar 29 19:28:51 CET 2008
The file and its content got replicated to the secondary volume instantaneously. Okay, now let's switch back to the primary host, but we create another file with a timestamp before doing that.
[root@gandalf:~]$ date > timetest2
[root@gandalf:~]$ cat timetest2
Sat Mar 29 19:29:10 CET 2008
[root@gandalf:~]$ cd /
[root@gandalf:~]$ umount /mnt
Okay, we changed the secondary volume by adding this file, thus we have to sync our primary volume. Thus we do an update reverse synchronisation:
[root@theoden:~]$ sndradm -u -r
Refresh primary with secondary? (Y/N) [N]: y
[root@theoden:~]$ dsstat
name              t  s   pct  role  ckps  dkps  tps  svt
dev/rdsk/c1d0s1   P  R  0.00   net           0    0    0
dev/rdsk/c1d0s0                bmp     0     0    0    0
This has two consequences. The changes to the secondary volume are transmitted to the primary volume (as we use the update sync, we transmit just these changes) and the replication is started again. Okay, but let's check for our second timestamp file. We mount our filesystem by using the primary volume.
[root@theoden:~]$ mount /dev/dsk/c1d0s1 /mnt
[root@theoden:~]$ cd /mnt
[root@theoden:~]$ ls -l
total 7856
-rw------- 1 root root 0 Mar 29 16:43 aculog
[...]
-rw-r--r-- 1 root root 29 Mar 29 19:28 timetest
-rw-r--r-- 1 root root 29 Mar 29 19:32 timetest2
[...]
[root@theoden:~]$ cat timetest2
Sat Mar 29 19:29:10 CET 2008
Et voila, you find two files beginning with timetest, and the second one contains the new timestamp we've written to the filesystem while using the secondary volume on the secondary host. Neat, isn't it?
We've added the group property to the existing mirror; now we create the new mirror directly in the correct group:
[root@theoden:~]$ sndradm -e theoden /dev/rdsk/c1d1s1 /dev/rdsk/c1d1s0 gandalf /dev/rdsk/c1d1s1 /dev/rdsk/c1d1s0 ip sync g importantapp
Enable Remote Mirror? (Y/N) [N]: y
With sndradm -P you can look up the exact configuration of your replication sets:
[root@theoden:~]$ sndradm -P
/dev/rdsk/c1d0s1 -> gandalf:/dev/rdsk/c1d0s1
autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: sync, group: importantapp, state: syncing
/dev/rdsk/c1d1s1 -> gandalf:/dev/rdsk/c1d1s1
autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: sync, group: importantapp, state: syncing
Okay, both are in the same group. As before, we have to perform this configuration on both hosts, so we repeat the same steps on the other host as well:
[root@gandalf:~]$ sndradm -e theoden /dev/rdsk/c1d1s1 /dev/rdsk/c1d1s0 gandalf /dev/rdsk/c1d1s1 /dev/rdsk/c1d1s0 ip sync g importantapp
Enable Remote Mirror? (Y/N) [N]: y
[root@gandalf:~]$ sndradm -R g importantapp gandalf:/dev/rdsk/c1d0s1
Perform Remote Mirror reconfiguration? (Y/N) [N]: y
Now we start the replication of both volumes. We can do this in a single step by using the name of the group.
[root@theoden:~]$ sndradm -m -g importantapp
Overwrite secondary with primary? (Y/N) [N]: y
Two minutes later the replication has succeeded, and we now have a fully operational replication group:
[root@theoden:~]$ dsstat
name              t  s   pct  role  ckps  dkps  tps  svt
dev/rdsk/c1d0s1   P  R  0.00   net           0    0    0
dev/rdsk/c1d0s0                bmp     0     0    0    0
dev/rdsk/c1d1s1   P  R  0.00   net           0    0    0
dev/rdsk/c1d1s0                bmp     0     0    0    0
Now both volumes are in replicating mode. Really easy: it's just done by adding the group to the replication relations.
Okay, we can use the local or remote volume as a name to choose the configuration to be deleted:
[root@gandalf:~]$ sndradm -d theoden:/dev/rdsk/c1d1s1
Disable Remote Mirror? (Y/N) [N]: y
As you see, the configuration is gone. But you have to do the same on the other host. So log in as root to the other host:
[root@theoden:~]$ sndradm -P
/dev/rdsk/c1d1s1 -> gandalf:/dev/rdsk/c1d1s1
autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: sync, state: logging
[root@theoden:~]$ sndradm -d gandalf:/dev/rdsk/c1d1s1
Disable Remote Mirror? (Y/N) [N]: y
[root@theoden:~]$ sndradm -P
[root@theoden:~]$
[root@theoden:~]$ newfs /dev/rdsk/c1d1s1
newfs: construct a new file system /dev/rdsk/c1d1s1: (y/n)? y
/dev/rdsk/c1d1s1: 968704 sectors in 473 cylinders of 64 tracks, 32 sectors
473.0 MB in 30 cyl groups (16 c/g, 16.00 MB/g, 7680 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 32832, 65632, 98432, 131232, 164032, 196832, 229632, 262432, 295232,
656032, 688832, 721632, 754432, 787232, 820032, 852832, 885632, 918432, 951232
Now we can generate a backup of this filesystem. You have to make an image of the volume; making a tar or cpio file backup isn't sufficient.
[root@theoden:~]$ dd if=/dev/rdsk/c1d1s1 | gzip > 2migrate.gz
968704+0 records in
968704+0 records out
Okay, now activate the replication on the primary volume. Don't activate it on the secondary one! The important difference to a normal replication is the -E. When you use this switch, the system assumes that the primary and secondary volumes are already identical.
[root@theoden:~]$ sndradm -E theoden /dev/rdsk/c1d1s1 /dev/rdsk/c1d1s0 gandalf /dev/rdsk/c1d1s1 /dev/rdsk/c1d1s0 ip sync
Enable Remote Mirror? (Y/N) [N]: y
Okay, we've used the -E switch again to circumvent the need for a full synchronisation. When you look at the status of the volume, you will see the volume in the logging state:
[root@theoden:~]$ dsstat
name              t  s   pct  role  ckps  dkps  tps  svt
dev/rdsk/c1d1s1   P  L  0.00   net           0    0    0
dev/rdsk/c1d1s0                bmp     0     0    0    0
This means that you can make changes on the volume.
[root@theoden:~]$ mount /dev/dsk/c1d1s1 /mnt
[root@theoden:~]$ cat /mnt/test5
Mon Mar 31 14:57:04 CEST 2008
[root@theoden:~]$ date >> /mnt/test6
[root@theoden:~]$ cat /mnt/test6
Mon Mar 31 15:46:03 CEST 2008
Now we transmit our image of the primary volume to our new system. In my case it's scp, but for huge amounts of data sending a truck with tapes would be more sensible.
[root@theoden:~]$ scp 2migrate.gz jmoekamp@gandalf:/export/home/jmoekamp/2migrate.gz
Password:
2migrate.gz 100% |*****************************| 1792 KB 00:00
Okay, our primary and secondary volumes are still in logging mode. How do we get them out of this? In our first example we did a full synchronisation; this time we need only an update synchronisation. So log in as root to our primary host and initiate such an update sync. This is the moment where you have to stop working on the primary volume.
After this step all changes we did after creating the image from our primary volume will be synced to the secondary volume.
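The whole image-based seeding procedure can be summarized in a toy model. It is a sketch under simplifying assumptions (volumes as Python lists, one bitmap flag per block, a hypothetical helper named write_primary), not a description of how AVS works internally.

```python
# Toy model of seeding a secondary from an image; all names are illustrative.
primary = ["p0", "p1", "p2", "p3"]

# 1. Take an image of the primary volume (the dd | gzip step).
image = list(primary)

# 2. Enable the mirror with a clean bitmap: both sides are declared identical.
bitmap = [False] * len(primary)

# 3. Work continues on the primary while the image travels.
#    In logging mode only the position of each change is flagged.
def write_primary(index, data):
    primary[index] = data
    bitmap[index] = True

write_primary(2, "p2-new")

# 4. The image is restored on the secondary host.
secondary = list(image)

# 5. Update sync: transfer only the flagged blocks, then clear the flags.
transferred = [i for i, dirty in enumerate(bitmap) if dirty]
for i in transferred:
    secondary[i] = primary[i]
    bitmap[i] = False
```

Only block 2 crosses the network in this model; everything else arrived with the image.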
By virtue of the update synchronisation, test6 appeared on the secondary volume. Let's have a look at /mnt/test6:
[root@gandalf:~]$ cat test6
Mon Mar 31 15:46:03 CEST 2008
Cool, isn't it?
26.15. Conclusion
How can you use this feature? Some use cases are really obvious. It's a natural match for disaster recovery. The Sun Cluster Geographic Edition even supports this kind of remote mirror out of the box to do cluster failover over wider distances than just a campus. But it's usable for other jobs as well, for example for migrations to a new datacenter, when you have to transport a large amount of data over long distances without a time window for a longer service interruption.
27.1. Introduction
The basic idea of point-in-time copies is to freeze the contents of a disk or a volume at a certain time, so that other processes can work on the data of a certain point in time, while your application works on the original dataset and changes it. Why is this important? Let's assume you want to make a backup. The problem is quite simple. When a backup takes longer than a few moments, the files backed up first may represent a different state of the system than the files backed up last. Your backup is inconsistent, as you've done a backup of a moving target. Okay, you could simply freeze your application, copy its data to another disk (via cp or dd) and back it up from there, or back it up directly and restart your application, but most of the time this isn't feasible. Let's assume you have a multi-terabyte database on your system. A simple copy can take quite a while, and during this time your application doesn't work (at least when your database has no backup mode). Okay, a simple copy is too ineffective. We have to do it with other methods. This tutorial will show you the usage of a method integrated into OpenSolaris and Solaris.
27.2. Basics
One of these methods is the usage of the point-in-time copy functionality of the Availability Suite. I wrote about another function of AVS not long ago when I wrote the tutorial about remote replication. Point-in-time copy and remote replication are somewhat similar (you detect and record changes and transmit those to a different disk, albeit the procedures are different). Thus it was quite logical to implement both in AVS.
Figure 27.2.: Independent copy: After initialization

Okay, now we change the fourth block of the disk. As the old data is already on the shadow volume, we don't have to move any data. But we log in the bitmap volume that a block has changed: the block is dirty. From now on the bitmap is in the dirty state. The dirty state tells us that there are differences between the master and the shadow volume. Okay, we don't need to move data, so why do we need the bitmap volume? The bitmap volume makes the synchronization of master and shadow much more efficient. With the bitmap volume you know the position of the changed blocks. So when you resync your shadow with the master, you just have to copy these blocks, and not the whole disk. After copying a block, the corresponding bit in the bitmap is set to zero, and the system knows that the synced block on master and shadow is identical again.
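The behaviour of an independent copy described above can be modelled in a few lines. This is a conceptual sketch only; the class and method names are made up for illustration and the bitmap is simplified to one bit per block.

```python
# Conceptual model of an independent copy; names and granularity are assumptions.
class IndependentCopy:
    def __init__(self, master):
        self.master = list(master)
        self.shadow = list(master)         # full copy made at initialization
        self.bitmap = [0] * len(master)    # 0 = clean, 1 = dirty

    def write_master(self, i, data):
        self.master[i] = data              # no data has to be moved to the shadow
        self.bitmap[i] = 1                 # only the change is noted

    def resync_shadow(self):
        """Copy just the dirty blocks to the shadow; return the number copied."""
        copied = 0
        for i, bit in enumerate(self.bitmap):
            if bit:
                self.shadow[i] = self.master[i]
                self.bitmap[i] = 0
                copied += 1
        return copied


pit = IndependentCopy(list("ABCDE"))
pit.write_master(3, "X")                   # change the fourth block
shadow_before = list(pit.shadow)           # the shadow still shows the old state
copied = pit.resync_shadow()
```

The resync touches a single block here, which is the efficiency gain the bitmap buys.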
Figure 27.5.: Dependent copy: After initialization

When you change some data on the master volume, AVS starts to copy data. It copies the original content of the block onto the physical shadow volume at the same logical position in the volume. This is the reason why master and shadow volumes have to have the same size when using dependent copies. Furthermore AVS logs in the bitmap that there is data on the shadow volume: the block is dirty in the bitmap. When you access the virtual shadow volume now, the bitmap is checked again. For blocks declared dirty in the bitmap, the data is delivered from the copy on the physical shadow volume; for all other, clean blocks the data comes from the master volume. Resyncing the shadow to the master is easy: just reinitialize the bitmap. Now all data comes from the master volume again, until you change some data on it.
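The read and write paths of a dependent copy can be sketched as follows. It is a simplified model, not AVS code: the names are invented, and the sketch assumes that the original content is saved only on the first change of a block after initialization.

```python
# Conceptual model of a dependent copy (copy-on-write); names are assumptions.
class DependentCopy:
    def __init__(self, master):
        self.master = list(master)
        self.shadow = [None] * len(master)   # physical shadow, same size, empty
        self.bitmap = [0] * len(master)

    def write_master(self, i, data):
        if not self.bitmap[i]:
            # Assumption: only the first change saves the original content,
            # at the same logical position on the physical shadow volume.
            self.shadow[i] = self.master[i]
            self.bitmap[i] = 1
        self.master[i] = data

    def read_shadow(self, i):
        # Virtual shadow: dirty blocks come from the physical shadow,
        # clean blocks come straight from the master.
        return self.shadow[i] if self.bitmap[i] else self.master[i]

    def resync(self):
        self.bitmap = [0] * len(self.master)  # just reinitialize the bitmap


dep = DependentCopy(list("ABCDE"))
dep.write_master(3, "X")
dep.write_master(3, "Y")
view = [dep.read_shadow(i) for i in range(5)]
dep.resync()
view_after = [dep.read_shadow(i) for i in range(5)]
```

Before the resync the virtual shadow still presents the frozen state; afterwards it simply mirrors the master again.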
Figure 27.8.: Compact Dependent copy: Initialization

Let's assume that we change the fourth block on our disk. As with the normal copy, the block is declared as dirty. But now things start to work differently. The original data of the master volume is stored in the first free block on the physical shadow volume. In addition to that, the position is stored in the bitmap. The way to read from the shadow volume changes accordingly. When the bitmap signals that a block is clean, AVS just passes the data from the master volume to the user or application. When the bitmap signals a dirty, thus changed block, it reads the position of the block on the physical shadow volume from the bitmap, reads the block from there and delivers it to the application or user.
Figure 27.9.: Compact Dependent copy: Change a first block

When we change the next block, for example the third one, the same procedure starts. The original data is stored in the next free block, now the second one, on the physical shadow volume, and this position is stored in the bitmap together with the dirty state of the block.
Figure 27.10.: Compact Dependent copy: Change a second block

Okay, resyncing the shadow with the master is easy again: just initialize the bitmap.
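The difference to the normal dependent copy is the indirection through a stored position. A minimal sketch, assuming a dictionary in place of the bitmap-with-positions and an invented class name:

```python
# Conceptual model of a compact dependent copy; names are assumptions.
class CompactDependentCopy:
    def __init__(self, master, shadow_blocks):
        self.master = list(master)
        self.shadow = [None] * shadow_blocks   # may be smaller than the master
        self.position = {}                     # dirty block -> slot on the shadow
        self.next_free = 0

    def write_master(self, i, data):
        if i not in self.position:
            if self.next_free >= len(self.shadow):
                raise RuntimeError("physical shadow volume is full")
            # Save the original content in the next free shadow block
            # and remember where it went.
            self.shadow[self.next_free] = self.master[i]
            self.position[i] = self.next_free
            self.next_free += 1
        self.master[i] = data

    def read_shadow(self, i):
        if i in self.position:                 # dirty: follow the stored position
            return self.shadow[self.position[i]]
        return self.master[i]                  # clean: read from the master

    def resync(self):
        self.position.clear()
        self.next_free = 0


cdc = CompactDependentCopy(list("ABCDE"), shadow_blocks=2)
cdc.write_master(3, "X")    # fourth block: original goes to the first free slot
cdc.write_master(2, "Y")    # third block: original goes to the second slot
view = [cdc.read_shadow(i) for i in range(5)]
```

In this model a third changed block would raise an error, which hints at the sizing question you have to answer for the physical shadow volume.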
Figure 27.11.: Compact Dependent copy: Resynchronization

volume much smaller, thus saving space on your disk. In my opinion compact dependent copies are the only reasonable way to go when you want more than one copy of your master volume. The disadvantages are pretty much the same as those of the normal dependent copies.
27.7.1. Disklayout
Okay, I will use two harddisks in my example: /dev/dsk/c1d0 and /dev/dsk/c1d1. I've chosen the following layout for the disk.
Partition  Tag  Flags  First Sector  Sector Count  Last Sector  Mount Directory
        2    5     01
        3    0     00
        4    0     00
        5    0     00
        6    0     00
        8    1     01
        9    9     00
With this configuration I have two 128 MB slices. I will use them for data in my example. Additionally I've created two small 32 MB slices for the bitmaps. 32 MB for the bitmaps is too large, but I didn't want to calculate the exact size.
27.7.2. Calculation of the bitmap volume size for independent and dependent shadows
You can calculate the size of the bitmap volume as follows:

$$\text{Size}_{\text{Bitmap volume}}\ [\text{kB}] = 24 + \left(\text{Size}_{\text{Data volume}}\ [\text{GB}] \times 8\right)$$

Let's assume a 10 GB volume for data:

$$\text{Size}_{\text{Bitmap volume}}\ [\text{kB}] = 24 + (10 \times 8) = 104\ \text{kB}$$
27.7.3. Calculation of the bitmap volume size for compact dependent shadows
You can calculate the size of the bitmap volume as follows:

$$\text{Size}_{\text{Bitmap volume}}\ [\text{kB}] = 24 + \left(\text{Size}_{\text{Data volume}}\ [\text{GB}] \times 256\right)$$

Let's assume a 10 GB volume for data:

$$\text{Size}_{\text{Bitmap volume}}\ [\text{kB}] = 24 + (10 \times 256) = 2584\ \text{kB}$$
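If you don't want to do the arithmetic by hand, both rules fit into one small helper. The function name and its parameters are made up for this sketch; only the constants 24, 8 and 256 come from the formulas above.

```python
def bitmap_size_kb(data_volume_gb, compact=False):
    """Bitmap volume size in kB for a data volume of the given size in GB.

    Independent and dependent shadows need 8 kB per GB of data,
    compact dependent shadows need 256 kB per GB, plus 24 kB in both cases.
    """
    per_gb = 256 if compact else 8
    return 24 + data_volume_gb * per_gb


normal = bitmap_size_kb(10)                  # independent or dependent shadow
compact = bitmap_size_kb(10, compact=True)   # compact dependent shadow
```

For the 10 GB example this gives 104 kB and 2584 kB, matching the manual calculation.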
Okay, now let's create a filesystem for testing purposes on the master disk.
# newfs /dev/dsk/c1d0s3
newfs: construct a new file system /dev/rdsk/c1d0s3: (y/n)? y
Warning: 3376 sector(s) in last cylinder unallocated
/dev/rdsk/c1d0s3: 273104 sectors in 45 cylinders of 48 tracks, 128 sectors
133.4 MB in 4 cyl groups (13 c/g, 39.00 MB/g, 18624 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 80032, 160032, 240032
Okay, as an empty filesystem is a boring target for point-in-time copies, we play around a little bit and create some files in our new filesystem.
# mount /dev/dsk/c1d0s3 /mnt
# cd /mnt
# mkfile 1k test1
# mkfile 1k test2
# mkfile 1k test3
# mkfile 1k test4
# mkfile 1k testindex1
# ls -l
total 26
drwx------ 2 root root 8192 Apr 25 18:10 lost+found
-rw------T 1 root root 1024 Apr 25 18:10 test1
-rw------T 1 root root 1024 Apr 25 18:11 test2
-rw------T 1 root root 1024 Apr 25 18:11 test3
-rw------T 1 root root 1024 Apr 25 18:11 test4
-rw------T 1 root root 1024 Apr 25 18:11 testindex1
Please answer the last question with y. By doing so, all the services of AVS we need in the following tutorial are started (besides the remote replication service).
That's all. What does this command mean? Create an independent copy of the data on the slice /dev/rdsk/c1d0s3 on /dev/rdsk/c1d1s3 and use /dev/rdsk/c1d1s4 for the bitmap. As soon as you execute this command, the copy process starts. We decided to use an independent copy, thus we start a full copy of the master volume to the shadow volume. As long as this full copy hasn't completed, the point-in-time copy behaves like a dependent copy. Now we check the configuration.
# iiadm -i
/dev/rdsk/c1d0s3: (master volume)
/dev/rdsk/c1d1s3: (shadow volume)
/dev/rdsk/c1d1s4: (bitmap volume)
Independent copy
Latest modified time: Fri
Volume size: 273105
Shadow chunks total: 4267
Percent of bitmap set: 0
(bitmap clean)
The highlighted part is interesting. The bitmap is clean. This means that there are no differences between the master and the shadow volume.
Just substitute the ind with the dep and you get a dependent copy.
# iiadm -i
/dev/rdsk/c1d0s3: (master volume)
/dev/rdsk/c1d1s3: (shadow volume)
/dev/rdsk/c1d1s4: (bitmap volume)
Dependent copy
Latest modified time: Sat Apr 26 23:50:19 2008
Volume size: 273105
Shadow chunks total: 4267 Shadow chunks used: 0
Percent of bitmap set: 0
(bitmap clean)
has only a size of 32 MB while the master volume is 256 MB large. At first we create a dependent copy again, but with different volumes:
# iiadm -e dep /dev/rdsk/c1d0s3 /dev/rdsk/c1d1s6 /dev/rdsk/c1d1s4
Latest modified time: Fri Apr 25 18:16:59 2008
Volume size: 273105
Shadow chunks total: 4267 Shadow chunks used: 0
Percent of bitmap set: 0
(bitmap dirty)
Please look at the highlighted part. The system detected the changes to the master volume and marked the changed blocks on the bitmap volume. The bitmap is dirty now. Okay, now let's use our copy. We create a mountpoint and mount our shadow volume at this mountpoint.
# mkdir /backup
# mount /dev/dsk/c1d1s3 /backup
Just for comparison, we have a short look at our master volume again:
# cd /mnt
# ls
lost+found  test2  test4  test6       testindex2
test1       test3  test5  testindex1
test2
test3
test4
We see the state of the filesystem at the moment we created the point-in-time copy. Please notice the difference. The files created after initiating the copy are not present in the shadow. You can do everything you want with the filesystem on the shadow volume. You can even write to it. But for this tutorial, we will make a backup from it. Whatever happens to the master volume during this backup, the data on the shadow won't change. Okay, that isn't so interesting for a few bytes, but it's important for multi-terabyte databases or filesystems.
# tar cfv /backup20080424.tar /backup
a /backup/ 0K
a /backup/lost+found/ 0K
a /backup/test1 1K
a /backup/test2 1K
a /backup/test3 1K
a /backup/test4 1K
a /backup/testindex1 1K
As you see, there is no test5, test6 or testindex2. Okay, we have made our backup, now let's sync our copy.
# iiadm -u s /dev/rdsk/c1d1s3
That's all. What have we done? We told AVS to update the shadow copy on /dev/rdsk/c1d1s3. Whenever you specify a disk or volume directly, you use the name of the shadow volume. A master volume can have several shadow volumes, but a volume can be the shadow of only one master. So the copy configuration can be specified by its shadow volume. The -u s tells AVS to do an update (not a full copy) of the shadow (from the master). Okay, now let's check the copy again.
# iiadm -i
/dev/rdsk/c1d0s3: (master volume)
/dev/rdsk/c1d1s3: (shadow volume)
/dev/rdsk/c1d1s4: (bitmap volume)
Independent copy
Latest modified time: Fri Apr 25 19:30:19 2008
Volume size: 273105
Shadow chunks total: 4267 Shadow chunks used: 0
Percent of bitmap set: 0
(bitmap clean)
Please look at the highlighted part again. The bitmap is clean again. The master and the shadow are in sync. Okay, let's check it by mounting the filesystem.
# mount /dev/dsk/c1d1s3 /backup
# cd /backup
# ls -l
total 30
drwx------ 2 root root 8192 Apr 25 18:10 lost+found
-rw------T 1 root root 1024 Apr 25 18:10 test1
-rw------T 1 root root 1024 Apr 25 18:11 test2
-rw------T 1 root root 1024 Apr 25 18:11 test3
-rw------T 1 root root 1024 Apr 25 18:11 test4
-rw-r--r-- 1 root root    0 Apr 25 18:20 test5
-rw-r--r-- 1 root root    0 Apr 25 18:20 test6
-rw------T 1 root root 1024 Apr 25 18:11 testindex1
-rw------T 1 root root                   testindex2
It's the exact copy of the filesystem at the moment when you initiated the copy. Okay, now let's play again with our point-in-time copy. Let's create some files in our master volume:
# cd /mnt
# touch test7
# touch test8
# mkfile 3k testindex2
Please note that I've overwritten the 2k sized version of testindex2 with a 3k sized version. A quick check of the directories:
# ls /mnt
lost+found  test2  test4  test6  test8       testindex2
test1       test3  test5  test7  testindex1
# ls /backup
lost+found  test2  test4  test6       testindex2
test1       test3  test5  testindex1
Okay, the directories are different. Now let's start the backup again.
# tar cfv backup20080425.tar /backup
a /backup/ 0K
a /backup/lost+found/ 0K
a /backup/test1 1K
a /backup/test2 1K
a /backup/test3 1K
a /backup/test4 1K
a /backup/testindex1 1K
a /backup/test5 0K
a /backup/test6 0K
a /backup/testindex2 2K
Okay, test7 and test8 didn't make it into the tarball, as they were created after updating the point-in-time copy. Furthermore we've tarred the 2k version of testindex2, not the 3k version. So you can back up a stable version of your filesystem, even when you modify your master volume during the backup. Okay, now we can unmount the filesystem again.
# cd /
# umount /backup
After this we sync the shadow volume with the master volume.
# iiadm -u s /dev/rdsk/c1d1s3
And when we compare the filesystems, they are identical again.
# mount /dev/dsk/c1d1s3 /backup
# ls /mnt
lost+found  test2  test4  test6  test8       testindex2
test1       test3  test5  test7  testindex1
# ls /backup
lost+found  test2  test4  test6  test8       testindex2
test1       test3  test5  test7  testindex1
You can play this game forever, but I will stop now, before it gets boring.
The new code killed your testindex files. Zero bytes. And you hear the angry guy or lady from customer support shouting your name. But you were cautious: you've created a point-in-time copy before updating the system. So, calm down and recover before a customer support lynch mob reaches your office with forks and torches. Leave the filesystem and unmount it.
# cd /
# umount /mnt
Now sync the master with the shadow. Yes, the other way round.
# iiadm -u m /dev/rdsk/c1d1s3
Overwrite master with shadow volume? yes/no yes
Okay ... after a few moments the shell prompt appears again. Now you can mount it again.
# mount /dev/dsk/c1d0s3 /mnt
# cd /mnt
Phew ... rescued ... and the lynch mob in front of your office throws the torches out of the window, directly onto the car of the CEO (of course by accident ;) )
27.11. Administration
Okay, there are several administrative procedures for the point-in-time copy functionality. I will describe only the most important ones, as I don't want to substitute the manual with this tutorial.
It's really easy to delete this config. As I mentioned before, the name of the shadow volume clearly identifies a point-in-time copy configuration, as there can be only one configuration for any given shadow volume. So you use the name of the shadow volume to designate a configuration. Thus the command to delete the configuration is fairly simple:
# iiadm -d /dev/rdsk/c1d1s6
The -d tells iiadm to delete the config. When we recheck the current AVS configuration, the config for /dev/rdsk/c1d1s6 is gone:
# iiadm -l
#
Again you use the name of the shadow volume to designate the configuration. You force the full copy resync with a single command:
# iiadm -c s /dev/rdsk/c1d1s3
When we check the status of the copy, we will see that a full copy is in progress:
# iiadm -i
/dev/rdsk/c1d0s3: (master volume)
/dev/rdsk/c1d1s3: (shadow volume)
/dev/rdsk/c1d1s4: (bitmap volume)
Independent copy, copy in progress, copying master to shadow
Latest modified time: Sun Apr 27 01:49:21 2008
Volume size: 273105
Shadow chunks total: 4267 Shadow chunks used: 0
Percent of bitmap set: 69
(bitmap dirty)
Let's wait for a few moments and check the status again:
# iiadm -i
/dev/rdsk/c1d0s3: (master volume)
/dev/rdsk/c1d1s3: (shadow volume)
/dev/rdsk/c1d1s4: (bitmap volume)
Independent copy
Latest modified time: Sun Apr 27 01:49:21 2008
Volume size: 273105
Shadow chunks total: 4267 Shadow chunks used: 0
Percent of bitmap set: 0
(bitmap clean)
Now we want to configure another one for the volume /dev/rdsk/c1d0s5 with /dev/rdsk/c1d1s5 as the shadow volume and /dev/rdsk/c1d1s6 as the bitmap volume. First we move the existing configuration into a group. I will name it database in my example, but you could choose any other name for it.
# iiadm -g database -m /dev/rdsk/c1d1s3
With -g we designate the group name and with -m we move the volume into the group. As usual, we use the name of the shadow volume to designate the configuration. Now we create the point-in-time copy of the second volume. But we will create it directly in the group. To do so, we need the -g switch.
# iiadm -g database -e dep /dev/rdsk/c1d0s5 /dev/rdsk/c1d1s5 /dev/rdsk/c1d1s6
Please notice that we used a different copy mechanism for this point-in-time copy. The mechanisms don't have to be identical within a group. Let's check the state of our copies:
# iiadm -i
/dev/rdsk/c1d0s3: (master volume)
/dev/rdsk/c1d1s3: (shadow volume)
/dev/rdsk/c1d1s4: (bitmap volume)
Group name: database
Independent copy
Latest modified time: Sun [...]
Volume size: 273105
Shadow chunks total: 4267
Percent of bitmap set: 0
        (bitmap clean)
---------------------------------------------------------------------------
/dev/rdsk/c1d0s5: (master volume)
/dev/rdsk/c1d1s5: (shadow volume)
/dev/rdsk/c1d1s6: (bitmap volume)
Group name: database
Dependent copy
Latest modified time: Sun [...]
Volume size: 273105
Shadow chunks total: 4267
Percent of bitmap set: 0
        (bitmap clean)
When you check the state of your copies again, you will recognize that you initiated a full resync on both copies at the same time:
# iiadm -i
/dev/rdsk/c1d0s3: (master volume)
/dev/rdsk/c1d1s3: (shadow volume)
/dev/rdsk/c1d1s4: (bitmap volume)
Group name: database
Independent copy, copy in progress, copying master to shadow
Latest modified time: Sun Apr 27 02:08:09 2008
Volume size: 273105
Shadow chunks total: 4267 Shadow chunks used: 0
Percent of bitmap set: 42
        (bitmap dirty)
---------------------------------------------------------------------------
/dev/rdsk/c1d0s5: (master volume)
/dev/rdsk/c1d1s5: (shadow volume)
/dev/rdsk/c1d1s6: (bitmap volume)
Group name: database
Dependent copy, copy in progress, copying master to shadow
Latest modified time: Sun Apr 27 02:08:09 2008
Volume size: 273105
Shadow chunks total: 4267 Shadow chunks used: 0
Percent of bitmap set: 40
        (bitmap dirty)
27.12. Conclusion
I hope I gave you some insight into this really interesting feature of Solaris and OpenSolaris. There are many ways to use it in your daily work. It's not limited to disaster recovery or backups. One of my customers uses this tool to create independent copies of their database. They take a snapshot at midnight and export it to a different database server. The rationale for this process: They run some long-running analytics with a huge load on the I/O system on this independent copy. By using the copy, the analysis doesn't interfere with the production use of the database. Another customer uses this feature to generate test copies of their production data for testing new software versions. You see, the possibilities are virtually endless.
28.1. Introduction
Okay, this tutorial isn't really about a feature of Solaris itself. But the feature covered here is deeply coupled with Solaris. Thus you can view it as an optional part of Solaris. This time I will dig into the installation and configuration of SamFS. But a warning: SamFS is a feature monster. This tutorial is the equivalent of dipping your toes into the Atlantic Ocean, but when I saw the announcement of the open-sourcing of SamFS I thought it was time to write this document. In addition to that, it was a nice way to do a reality check on a thought experiment I made some months ago.
28. SamFS - the Storage Archive Manager FileSystem
a file and the moment the boss comes into your office wanting the document.
2. In most countries you will find regulations that prohibit the deletion of a file.
28.2.6. SamFS
SamFS is an implementation of this concept. It isn't the only one, but in my view it's the best implementation in the Unix world. SamFS stands for Storage Archive Manager File System. It's a fully POSIX-compliant filesystem with a rich feature set, thus a user or an application won't see a difference to UFS, for example. I suggest that you look at the Sun StorageTek SamFS page on the Sun website for an overview.
28.3.1. Lifecycle
Before defining the jargon, it's important to understand that every file under the control of SamFS follows a certain lifecycle. You create or modify it, and the system archives it. After a certain time without access, the system removes it from expensive storage once it has copies on cheaper storage. When you access it, it will be gathered from the cheaper storage and delivered to you. When you delete it, it has to be removed from all your media. This cycle repeats until a file is deleted.
28.3.2. Policies
Although every file is under the control of the described cycle, the exact life of a file doesn't have to be the same for every file. SamFS uses the concept of policies to describe the way it should handle a file: how many copies SamFS should make of a file, and on which media. The most difficult task of configuring SamFS is to find an adequate policy. You need experience for it, but it's something that you can easily learn on the job.
28.3.3. Archiving
Okay, the first step is archiving. Let's assume you've created a file. The data gets stored in the SamFS filesystem. Okay, but you've defined a policy that you want two copies on tape media. The process that does this job is called the archiver, the task itself is called archiving. Archiving copies your files to the desired media. The metadata of the files is augmented with the positions of the file. SamFS can create up to 4 copies of a file. Important to know: SamFS doesn't wait with the archiving process until it needs space on the cache media. It starts archiving files with the next run of the archiver (for example every 5 minutes).
28.3.4. Releasing
Okay, let's assume your filesystem is 90% full. You need some space to work. Without SamFS you would move the data around manually. SamFS works similarly and differently at the same time. The archiver has already copied your data to different places. Thus releasing is the process of deleting the data from your filesystem. But it doesn't delete all of it. It keeps a stub of it in the filesystem. The metadata (filename, ACL, ownership, rights, and the start of the file) stays on disk. Thus you won't see a difference. You can walk around in your directories and you will see all your files. The difference: the data itself isn't in the filesystem anymore, thus it doesn't consume space in it.
28.3.5. Staging
Okay, after a long time (the file was already released) you want to access the data. You go into the filesystem and open this file. SamFS intercepts this call and automatically gathers the data from the archive media. In the meantime the reads from this file will be blocked, thus the process accessing the data blocks, too. SamFS uses information from the metadata to find the media.
28.3.6. Recycling
Okay, the end of the lifetime of a file is its deletion. That's easy for disks. But you can't delete a single file from tape in an efficient manner. Thus SamFS uses a different method: The data on the tape is just marked as invalid, and the stub gets deleted. But the data stays on tape. After a while more and more data may get deleted from tape. This may end in a Swiss cheese where only a small amount of the data on tape is still valid. This would be a waste of tape, and access gets slower and slower. Recycling solves this with a single trick. The residual active data gets a special marker. When the archiver runs the next time, this data gets archived again. Now there is no valid data left on the tape. You can erase it by writing a new label to it and you can use it for new data again. This process is called recycling.
Figure 28.1.: Simplified lifecycle of a file in SamFS
Once a file gets newly written or updated, it gets archived. Based on a combination of policies, usage and the caching strategy, it may get released and staged again and again. And at the end, the tape with the data will be recycled.
28.3.8. Watermarks
Watermarks are an additional but very important concept in SamFS. The cache is much smaller than the filesystem. Nevertheless you have to provide space for new and updated data. So SamFS implements two important watermarks: When the cache gets filled to the high watermark, the system automatically starts to release the least recently used files that have a minimum number of copies on archive media. This process stops when the low watermark is reached. Thus you can ensure that you have at least a certain amount of free capacity to store new or updated data in the filesystem.
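The two watermarks form a simple hysteresis: releasing starts at the high watermark and, once started, continues until usage has fallen to the low watermark. The following sketch only models that decision. It is not SamFS code, and the function name and its arguments are invented for this illustration.

```shell
#!/bin/sh
# Simplified model of the watermark logic described above.
# Arguments: current cache usage in percent, high watermark, low watermark,
# and whether releasing is already in progress ("yes" or "no").
# Prints "release" or "idle".
releaser_state() {
    usage="$1"; high="$2"; low="$3"; running="$4"
    if [ "$usage" -ge "$high" ]; then
        echo release        # high watermark reached: start releasing
    elif [ "$running" = "yes" ] && [ "$usage" -gt "$low" ]; then
        echo release        # already releasing: go on down to the low watermark
    else
        echo idle
    fi
}

releaser_state 85 80 60 no     # above the high watermark
releaser_state 70 80 60 no     # between the watermarks, nothing running
releaser_state 70 80 60 yes    # between the watermarks, still releasing
releaser_state 60 80 60 yes    # low watermark reached
```

The second and third call show why the state matters: the same fill level of 70 percent leads to different behavior depending on whether releasing has already started.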
Processing package instance <SUNWsamfsr> from </cdrom/sunstorageteksam-fs4.6/x64/2.10>
Sun SAM and Sun SAM-QFS software Solaris 10 (root)(i386) 4.6.5,REV=5.10.2007.03.12
Sun SAMFS - Storage & Archiving Management File System
Copyright (c) 2007 Sun Microsystems, Inc.
All Rights Reserved.
-----------------------------------------------------
In order to install SUNWsamfsr, you must accept the terms of
the Sun License Agreement.
Enter "y" if you do, "n" if you don't, or "v" to view agreement. y
- The administrator commands will be executable by root only (group bin).
If this is the desired value, enter "y". If you want to change
the specified value enter "c".
y
By default, elrond is not setup to be remotely managed by
File System Manager. It can only be managed by the File System
Manager if it is installed locally
You can modify the remote management configuration at a later time
using the command fsmadm
If you want to keep the default behavior, enter "y". Otherwise enter "n". y
## Processing package information.
## Processing system information.
20 package pathnames are already properly installed.
## Verifying disk space requirements.
## Checking for conflicts with packages already installed.
The following files are already installed on the system and are being
used by another package:
/etc/opt <attribute change only>
/var/opt <attribute change only>
Do you want to install these conflicting files [y,n,?,q] y
## Checking for setuid/setgid programs.
This package contains scripts which will be executed with super-user
permission during the process of installing this package.
Do you want to continue with the installation of <SUNWsamfsr> [y,n,?] y
Installing Sun SAM and Sun SAM-QFS software Solaris 10 (root) as <SUNWsamfsr>
## Executing preinstall script.
## Installing part 1 of 1.
/etc/fs/samfs/mount
[...]
/var/svc/manifest/application/management/fsmgmt.xml
[ verifying class <none> ]
/opt/SUNWsamfs/sbin/samcmd <linked pathname>
## Executing postinstall script.
The administrator commands are executable by root only.
------------------------------------------------------------
PLEASE READ NOW !!!
------------------------------------------------------------
If you are upgrading from a previous release of SAM and
have not read the README file delivered with this release,
please do so before continuing. There were significant
restructuring changes made to the system from previous
releases. Failure to convert scripts to conform to these
changes could cause dramatic changes in script behavior.
Installation of <SUNWsamfsr> was successful.
Processing package instance <SUNWsamfsu> from </cdrom/sunstorageteksam-fs4.6/x64/2.10>
Sun SAM and Sun SAM-QFS software Solaris 10 (usr)(i386) 4.6.5,REV=5.10.2007.03.12
Sun SAMFS - Storage & Archiving Management File System
Copyright (c) 2007 Sun Microsystems, Inc.
All Rights Reserved.
## Executing checkinstall script.
## Processing package information.
## Processing system information.
10 package pathnames are already properly installed.
## Verifying package dependencies.
## Verifying disk space requirements.
## Checking for conflicts with packages already installed.
## Checking for setuid/setgid programs.
This package contains scripts which will be executed with super-user
permission during the process of installing this package.
Do you want to continue with the installation of <SUNWsamfsu> [y,n,?] y
Installing Sun SAM and Sun SAM-QFS software Solaris 10 (usr) as <SUNWsamfsu>
## Installing part 1 of 1.
/opt/SUNWsamfs/lib/amd64/libsamconf.so <symbolic link>
[...]
/usr/sfw/bin/tapealert_trap
[ verifying class <none> ]
## Executing postinstall script.
Configuring samst devices. Please wait, this may take a while.
Adding samst driver
Building samst devices
Issuing /usr/sbin/devfsadm -i samst >> /tmp/SAM_install.log 2>&1
Adding samioc driver
Adding samaio driver
File System Manager daemon is configured to auto-restart every time the daemon dies
Starting File System Manager daemon
Please check the log files for any errors that were detected during startup
Installation of <SUNWsamfsu> was successful.
Start cleaning up to prepare for new software installation
Start installing File System Manager packages...
This process may take a while...
Processing package instance <SUNWfsmgrr> from </tmp/File_System_Manager/2.10/i386>
File System Manager Solaris 10 (root)(i386) 3.0.4,REV=5.10.2007.03.01
## Executing checkinstall script.
Sun SAMFS - Storage & Archiving Management File System
Copyright (c) 2007 Sun Microsystems, Inc.
All Rights Reserved.
## Processing package information.
## Processing system information.
1 package pathname is already properly installed.
## Verifying package dependencies.
## Verifying disk space requirements.
Installing File System Manager Solaris 10 (root) as <SUNWfsmgrr>
## Executing preinstall script.
Shutting down Sun Java(TM) Web Console Version 3.0.3 ...
The console is stopped
## Installing part 1 of 1.
/opt/SUNWfsmgr/bin/fsmgr
[...]
/opt/SUNWfsmgr/samqfsui/xsl/svg/storagetier.xsl
[ verifying class <none> ]
## Executing postinstall script.
Extracting online help system files...
Archive: en_samqfsuihelp.zip
creating: en/help/
[...]
inflating: en/help/stopwords.cfg
done
Warning: smreg is obsolete and is preserved only for
compatibility with legacy console applications. Use wcadmin instead.
Type "man wcadmin" or "wcadmin --help" for more information.
Warning: smreg is obsolete and is preserved only for
compatibility with legacy console applications. Use wcadmin instead.
Type "man wcadmin" or "wcadmin --help" for more information.
Registering /opt/SUNWfsmgr/samqfsui/WEB-INF/lib/fsmgmtjni.jar
as fsmgmtjni.jar for scope fsmgrAdmin_3.0
Enabling logging...
Warning: smreg is obsolete and is preserved only for
compatibility with legacy console applications. Use wcadmin instead.
Type "man wcadmin" or "wcadmin --help" for more information.
Installation of <SUNWfsmgrr> was successful.
Processing package instance <SUNWfsmgru> from </tmp/File_System_Manager/2.10/i386>
File System Manager Solaris 10 (usr)(i386) 3.0.4,REV=5.10.2007.03.01
## Executing checkinstall script.
Sun SAMFS - Storage & Archiving Management File System
Copyright (c) 2007 Sun Microsystems, Inc.
All Rights Reserved.
## Processing package information.
## Processing system information.
2 package pathnames are already properly installed.
## Verifying package dependencies.
## Verifying disk space requirements.
Installing File System Manager Solaris 10 (usr) as <SUNWfsmgru>
## Installing part 1 of 1.
/usr/lib/libfsmgmtjni.so
/usr/lib/libfsmgmtrpc.so
[ verifying class <none> ]
## Executing postinstall script.
Current session timeout value is 15 minutes, change to 60 minutes...
Set 1 properties for the console application.
done
Starting Sun Java(TM) Web Console Version 3.0.3 ...
The console is running
Appending elrond into /var/log/webconsole/host.conf... done!
Installation of <SUNWfsmgru> was successful.
Done installing File System Manager packages.
Backing up /etc/security/auth_attr to /etc/security/auth_attr.old
Start editing /etc/security/auth_attr ...
Done editing /etc/security/auth_attr
Backing up /etc/user_attr to /etc/user_attr.old
Start editing /etc/user_attr ...
Start editing /etc/user_attr ...
Done editing /etc/user_attr
File System Manager 3.0 and its supporting application
is installed successfully.
******************* PLEASE READ **********************************
Please telnet to each Sun StorEdge(TM) QFS servers to be
managed and run the following command:
This will determine if the File System Manager daemon is running.
If it is not running, please run the following command:
/opt/SUNWsamfs/sbin/fsmadm config -a
This command will start the File System Manager daemon that communicates with
the File System Manager. Failure to do so will prevent File System Manager
from communicating with the Sun StorEdge QFS servers.
Remote access to the servers used by the File System Manager is now
restricted based on host name or IP address. If you are managing
a Sun StorEdge(TM) QFS Server from a remote management station,
please telnet to the server and run the following command:
/opt/SUNWsamfs/sbin/fsmadm add <management_station_host_name>.<domain>
Press ENTER to continue ...
File System Manager 3.0 supports the following browsers:
Browser Type                           Operating System
========================================================================
Netscape 7.1/Mozilla 1.7/Firefox 1.5   Solaris OS, MS Windows 98SE, ME, 2000, and XP
MS Windows 98SE, ME,
Now launch your web browser and type the following URL:
https://<hostname>.<domain>:6789
where <hostname> is the host that you have just installed the
File System Manager.
If you are served with a security related certificate, go ahead and
accept it. Please see user docs for username and password details.
It is required to clear the browser cache before accessing the
File System Manager for the first time. Failure to do so may cause
unexpected behavior in various pages.
File System Manager 3.0 has been tested with the
Sun Java(TM) Web Console version 2.2.5 & 3.0.2.
Installing this product with any older Sun Java(TM) Web Console version
breaks both applications. This product may work on newer Sun Java(TM)
Web Console versions, but this has not been tested.
*****************************************************************
Install/Uninstall log file named /var/tmp/fsmgr.setup.log.03.23.2008.11:01 is created.
28.5.1. Prerequisites
Before we can configure SamFS, I want to describe the prerequisites for this task: We need some hard disks. I made my example a little bit more complex, thus I used iSCSI volumes. I created for this tutorial:
- a 64 MB emulated volume for the storage of metadata
- a 512 MB emulated volume for the filesystem itself
- two 2 GB emulated volumes to use as archive disks
I assume that you already know how to create them from the iSCSI tutorial. The first and the second volume will be used by SamFS directly. You have to use the format command to put a label and a partition table on them. For the two archive volumes, we will use ZFS. Thus I've created a zpool for each:
# zpool create samfs_archive_1 c1t0100001C42E9F21A00002A0047E54035d0
# zpool create samfs_archive_2 c1t0100001C42E9F21A00002A0047E54036d0
Okay, let's dissect this file. First I want to explain the general meaning of the columns.
The first column of this file is the equipment identifier. This field serves multiple purposes. You define filesystems, tape drives, disk partitions for the cache or tape robotics in this file. Please note: You do not define media for disk archiving here!
The second column is the equipment ordinal. This field enumerates every component defined in the mcf file. This number has to be unique.
The third column is the equipment type. SamFS supports a vast number of device types. You define the type by using its shorthand here. ma stands for a SamFS/QFS cache disk set with one or more dedicated metadevices. mo, for example, designates a 5 1/4 inch erasable optical drive.
The fourth column is the family set. With the family set name you group devices, for example all disks of a filesystem. All disks of a filesystem have the same family set name. Another example is the grouping of a tape robot and all of its tape drives.
The fifth column is the device state.
Okay, what did we describe with our mcf file? We defined a filesystem with the name samfs1. The name of the family set is samfs1 as well. The filesystem is of the type SamFS disk cache with dedicated metadevices. In the next row we've defined that the device /dev/dsk/c1t0100001C42E9F21A00002A0047E6642Bd0s0 is a device solely for metadata. We gave it the ordinal number 11 and it's part of the samfs1 family, thus a part of the filesystem defined before. The third line configures /dev/dsk/c1t0100001C42E9F21A00002A0047E54033d0s0 as the data disk for this filesystem (as the family name is samfs1 as well).
Yes, you have read it correctly. SamFS is capable of separating the metadata and the data of the files on different disks. The idea behind this concept is to use fast disks with short access times for the metadata (e.g. solid state disks) and slower disks for the data. With this separation the filesystem doesn't have to seek back and forth between the position of the metadata and the position of the data when it's updated. The effect: much better scaling when you use a large number of disks.
Okay, now we have fully configured the filesystem. Now we modify /etc/vfstab to enable simpler mounting and automatic mounting at boot. The device name is the name of the filesystem, in our case samfs1. It doesn't have a raw device. The mountpoint is /samfs1, the type is samfs, and we want to mount it at boot. The options are SamFS specific. They mean: Start to release files (thus freeing space in the cache) when the cache is 80 percent full. Release until the cache is only 60 percent full.
samfs1 - /samfs1 samfs - yes high=80,low=60
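The last field of this vfstab entry is an ordinary comma-separated option string, so the two watermarks are easy to pick out in a script, for example when you want to document or check the settings of several filesystems. The helper below is only an illustration. Its name is made up and it is not part of SamFS.

```shell
#!/bin/sh
# Print the value of one key from a comma-separated mount option string.
# Usage: get_mount_opt KEY OPTIONS
get_mount_opt() {
    echo "$2" | tr ',' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}

opts="high=80,low=60"
echo "release starts at $(get_mount_opt high "$opts")%, stops at $(get_mount_opt low "$opts")%"
```

With the option string from the entry above, the helper returns 80 for high and 60 for low.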
You should see the obligatory lost+found now. But let's do a deeper check of the filesystem:
bash-3.00# samfsinfo samfs1
samfsinfo: filesystem samfs1 is mounted.
name: samfs1 version: 2
time: Sun Mar 23 15:46:40 CET 2008
count: 2
capacity: 000000000007f000 DAU: 64
space: 0000000000070c40
meta capacity: 000000000000f000 meta DAU: 16
meta space: 000000000000aa80
ord eq capacity space device
0 11 000000000000f000 000000000000aa80 /dev/dsk/c1t0100001C42E9F21A00002A0047E6642Bd0s0
1 12 000000000007f000 000000000007efc0 /dev/dsk/c1t0100001C42E9F21A00002A0047E54033d0s0
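The capacity and space values in this output are hexadecimal, which makes them hard to compare with the volume sizes from the prerequisites. The sketch below converts them to megabytes. It assumes that the values count units of 1024 bytes. That is an assumption on my side, but it fits the numbers: 7f000 becomes 508 MB for the 512 MB volume and f000 becomes 60 MB for the 64 MB volume. The function name is invented for this example.

```shell
#!/bin/sh
# Convert a hexadecimal count from samfsinfo into megabytes.
# Assumption: the count is in units of 1024 bytes.
hex_kb_to_mb() {
    kb=$((0x$1))            # shell arithmetic understands 0x... constants
    echo $((kb / 1024))     # integer division, fractions are dropped
}

echo "capacity:      $(hex_kb_to_mb 000000000007f000) MB"
echo "meta capacity: $(hex_kb_to_mb 000000000000f000) MB"
```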
28.6.1. Prerequisites
Okay, I've created two iSCSI-based disk pools at the start to use them as disk archives. Now I will put some further separation into them by creating directories in them.
# mkdir /samfs_archive_1/dir1
# mkdir /samfs_archive_1/dir2
# mkdir /samfs_archive_2/dir2
# mkdir /samfs_archive_2/dir1
Now we have usable devices for archiving. But we have to configure the archiving as the next step. In this step we define the policies for archiving, control the behavior of the archiver and associate VSNs with archive sets. All this configuration takes place in the file /etc/opt/SUNWsamfs/archiver.cmd. Okay, let's create such a config file for our environment.
logfile = /var/opt/SUNWsamfs/archiver/log
interval = 2m
Okay, this is easy: The archiver should log its work into the file /var/opt/SUNWsamfs/archiver/log. This file is really interesting. I will show you a nifty trick with it later in this tutorial. The interval directive was originally responsible for defining the interval between the starts of a process for finding new or updated files (sam-arfind). This behavior didn't scale very well with millions of files in a directory.
Today the default is different. The filesystem itself knows which files have been updated, and SamFS stores this information in a list. Today this setting has a similar effect, but by other means: It's the default setting for the archive aging. It defines the amount of time between the first file being added to the to-do list for the archiver and the start of the archiving. Thus the archiving would start two minutes after adding the first file to the list. Now we define the archiving policy for the filesystem:
fs = samfs1
arset0 .
    1 30s
    2 1200s
What does this mean? arset0 is the name of a so-called archive set. The contents of this set are defined later on. The . stands for every file in the filesystem. Okay, now we tell SamFS to make a first copy to the archive set arset0.1 after 30 seconds. The second copy is made to the archive set arset0.2 after 1200 seconds (20 minutes). We have just used the names of some archive sets, now we have to declare them:
vsns
arset0.1 dk disk01
arset0.2 dk disk03
samfs1.1 dk disk02
endvsns
Okay, the translation is quite simple: The archive set arset0.1 is a disk-based set and consists of the VSN disk01. The same applies to the archive set arset0.2, which consists of the VSN disk03. But wait, we haven't used an archive set samfs1.1 so far. Well, you haven't defined it explicitly. But it's implicit when you have an archiver configuration for a filesystem. It's the default archive set. You can use it for regular archiving, but as we haven't defined a policy to do so, this archive set is used for storing the metadata of your filesystem. So the association of a VSN to this archive set is mandatory. So we end up with the following archiver.cmd:
logfile = /var/opt/SUNWsamfs/archiver/log
interval = 2m
fs = samfs1
arset0 .
    1 30s
    2 1200s
vsns
arset0.1 dk disk01
arset0.2 dk disk03
samfs1.1 dk disk02
endvsns
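The time values in this file mix units: the copy ages are given in seconds (30s, 1200s) and the interval in minutes (2m). When you compare such settings, it helps to normalize them. The following helper is a sketch with an invented name. It only handles the suffixes s, m, h and d, and it treats a bare number as seconds, which is an assumption made for this example.

```shell
#!/bin/sh
# Convert an age such as "30s", "2m" or "1200s" into seconds.
# Assumptions: only the suffixes s, m, h and d are handled,
# and a number without a suffix is taken as seconds.
age_to_seconds() {
    value="${1%[smhd]}"       # strip a trailing unit letter, if any
    unit="${1#"$value"}"      # whatever was stripped is the unit
    case "$unit" in
        s|"") echo "$value" ;;
        m)    echo $((value * 60)) ;;
        h)    echo $((value * 3600)) ;;
        d)    echo $((value * 86400)) ;;
    esac
}

for age in 30s 1200s 2m; do
    echo "$age = $(age_to_seconds "$age") seconds"
done
```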
Okay, we've finalized our configuration. Now we have to check it:
bash-3.00# archiver -lv
Reading /etc/opt/SUNWsamfs/archiver.cmd.
1: logfile = /var/opt/SUNWsamfs/archiver/log
2: interval = 2m
3:
4: fs = samfs1
5: arset0 .
6: 1 30s
7: 2 1200s
8:
9: vsns
10: arset0.1 dk disk01
11: arset0.2 dk disk03
12: samfs1.1 dk disk02
13: endvsns
No media available for default assignment
Notify file: /etc/opt/SUNWsamfs/scripts/archiver.sh
Read timeout: 60
Request timeout: 15m
Stage timeout: 15m
Archive media:
media:dk bufsize: 4 archmax:
Archive libraries:
Device: disk archive_drives: 3
Dictionary:
dk.disk01 capacity:
dk.disk02 capacity:
dk.disk03 capacity:
dk.disk04 capacity:
Archive file selections:
Filesystem samfs1 Examine: noscan Interval: 2m
archivemeta: on scanlistsquash: off setarchdone: off
Logfile: /var/opt/SUNWsamfs/archiver/log
samfs1 Metadata
copy: 1 arch_age: 4m
Archive sets:
[...]
arset0.1
media: dk
Volumes:
disk01 (/samfs_archive_1/dir1/)
Total space available: 1.9G
arset0.2
media: dk
Volumes:
disk03 (/samfs_archive_2/dir1/)
Total space available: 1.9G
samfs1.1
media: dk
Volumes:
disk02 (/samfs_archive_1/dir2/)
Total space available: 1.9G
bash-3.00#
And now we have a running archiver. So ... let's have some fun with it. Copy some files into the filesystem. I tend to test it by making a recursive copy of the /var/sadm/pkg directory. Now let's look at our archival disks:
bash-3.00# ls -l /samfs_archive_1/dir1
total 40734
-rw------- 1 root root 56 Mar 23 19:39 diskvols.seqnum
-rw------- 1 root root 19608576 Mar 23 17:43 f0
-rw------- 1 root root 1049088 Mar 23 19:39 f1
bash-3.00# ls -l /samfs_archive_1/dir2
total 13593
-rw------- 1 root root 56 Mar 23 19:42 diskvols.seqnum
-rw------- 1 root root 6891520 Mar 23 17:42 f0
-rw------- 1 root root 4608 Mar 23 19:42 f1
bash-3.00# ls -l /samfs_archive_2/dir1
total 40736
-rw------- 1 root root 56 Mar 23 19:58 diskvols.seqnum
-rw------- 1 root root 19608576 Mar 23 17:43 f0
-rw------- 1 root root 1049088 Mar 23 19:58 f1
You see, your archival media starts to populate. But where are your files, and what's up with this f1? Well, they are written in a very specific, very secret and very closed format: These files are simple tar files. SamFS uses the standard tar format to write the archive file. You can look into it with the standard tar or the tar of SamFS:
bash-3.00# star tfv f1
-rw------T root/root 1048576 2008-03-23 19:28 testfile3
Please notice that this isn't a version of Joerg Schilling's star, despite the name.
We now look at the metadata of this file. There is a special version of ls that is capable of reading the additional information. This version of ls is called sls. So let's check our test file.
[root@elrond:/samfs1]$ sls -D testfile3
testfile3:
mode: -rw------T links: 1 owner: root group: root
length: 1048576 admin id: 0 inode: 4640.1
access: Mar 23 19:28 modification: Mar 23 19:28
changed: Mar 23 19:28 attributes: Mar 23 19:28
creation: Mar 23 19:28 residence: Mar 23 19:28
There is nothing new. Okay, let's leave the computer alone, drink a coffee or two, and then check again:
bash-3.00# sls -D testfile3
testfile3:
mode: -rw------T links: 1 owner: root
length: 1048576 admin id: 0 inode:
archdone;
copy 1: ----- Mar 23 19:39 1.1 dk
copy 2: ----- Mar 23 19:58 1.1 dk
access: Mar 23 19:28 modification: Mar
changed: Mar 23 19:28 attributes: Mar
creation: Mar 23 19:28 residence: Mar
I assume you've already noticed the three additional lines. The archiver did its job:
archdone;
copy 1: ----- Mar 23 19:39 1.1 dk disk01 f1
copy 2: ----- Mar 23 19:58 1.1 dk disk03 f1
The first line says that all outstanding archiving for the file is done. The next two lines tell you where the copies are located, when they were archived, and about some special flags. The 1.1 means the first file in the archive file, starting at the 513th byte of the archive file (the header of tar is 512 bytes long, thus the 513th byte is the first usable byte, thus the 1).
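You can convince yourself of the 512-byte header with any tar archive, without a SamFS installation. The sketch below builds a small archive in a temporary directory and reads the file data back with dd by skipping exactly one 512-byte block. The function and file names are made up for this demonstration, and it assumes that your tar writes a plain 512-byte header for such a simple file, as common tar implementations do by default.

```shell
#!/bin/sh
# Show that the data of the first file in a tar archive starts right after
# the 512-byte header, i.e. at offset 1 when counting in 512-byte blocks.
tar_offset_demo() {
    workdir=$(mktemp -d) || return 1
    (
        cd "$workdir" || exit 1
        # 1024 bytes of sample data: exactly two 512-byte blocks.
        dd if=/dev/zero bs=512 count=2 2>/dev/null | tr '\0' 'x' > testfile
        tar -cf archive.tar testfile
        # Skip one block (the header), then read the two data blocks.
        dd if=archive.tar of=recovered bs=512 skip=1 count=2 2>/dev/null
        if cmp -s testfile recovered; then
            echo "match"
        else
            echo "mismatch"
        fi
    )
    rm -rf "$workdir"
}

tar_offset_demo
```

If the recovered bytes are identical to the original file, the function reports a match. That is the same arithmetic as the offset 1 in the sls output.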
When we access it again, the file gets staged back to the cache:
bash-3.00# cat testfile3
bash-3.00# sls -D testfile3
testfile3:
mode: -rw------T links: 1 owner: root
length: 1048576 admin id: 0 inode:
archdone;
copy 1: ----- Mar 23 19:39 1.1 dk
copy 2: ----- Mar 23 19:58 1.1 dk
access: Mar 24 01:35 modification: Mar
changed: Mar 23 19:28 attributes: Mar
creation: Mar 23 19:28 residence: Mar
A colleague comes into your office and tells you that he wants to use a large file with simulation data tomorrow. It would be nice if he didn't have to wait for the automatic staging. We can force SamFS to get the file back to the cache.
bash-3.00# stage testfile3
copy 1: ----- Mar 23 19:39 1.1 dk disk01 f1
copy 2: ----- Mar 23 19:58 1.1 dk disk03 f1
access: Mar 24 01:35 modification: Mar 23 19:28
changed: Mar 23 19:28 attributes: Mar 23 19:28
creation: Mar 23 19:28 residence: Mar 24 01:37
28.9. Conclusion
Okay, this was a rather long tutorial and I didn't even talk about the configuration of tape devices. As I told you before: only the toes in the Atlantic Ocean. But I hope I gave you some insight into a somewhat unconventional topic and a capability of an optional part of the Solaris Operating Environment. I assume that with the open-sourcing of SamFS we will see a much more widespread use of it.
http://docs.sun.com/source/819-7932-10/
http://docs.sun.com/source/819-7934-10
28. SamFS - the Storage Archive Manager FileSystem docs.sun.com: Sun StorageTek SAM Archive Conguration and Administration Guide3 Misc. information Sun Whitepaper: Sun StorEdge QFS and SAM-FS Software4
3 4
http://docs.sun.com/source/819-7931-10 http://www.sun.com/storagetek/white-papers/qfs-samfs.pdf
29. fuser
Solaris 10/Opensolaris
You know the problem. You try to unmount a filesystem, but all you get is a "Device busy". How do you find the process blocking the unmount?
29.1. fuser
fuser enables you to look for the processes that access a directory or a file. For example, we can check for all processes using the / filesystem:
# fuser -c /
/: 701ctm 676ctm 675ctom 672ctom 596ctm
   592ctm 585ctm 584ctom 581ctom 568ctm
   523ctom 521ctom 481ctom 478ctom 477ctom
   469ctom 456ctom 437ctom 425ctm 418ctom
   412ctom 402ctom 401ctom 399ctom 380ctom
   379ctom 366ctm 345ctom 341ctom 338ctom
   333ctom 332ctom 319ctom 272ctom 262ctom
   153ctm 140ctm 133ctom 131ctom 125ctm
   100ctom 18ctm 9ctom 7ctom 1ctm
I'm sure you already assume that the numbers stand for the process IDs. But what do all those letters mean? I will cite the man page for this:
c  Indicates that the process is using the file as its current directory.
m  Indicates that the process is using a file mapped with mmap(2).
n  Indicates that the process is holding a non-blocking mandatory lock on the file.
o  Indicates that the process is using the file as an open file.
r  Indicates that the process is using the file as its root directory.
t  Indicates that the process is using the file as its text file.
y  Indicates that the process is using the file as its controlling terminal.
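If you don't want to look up the letters every time, a small helper can spell them out. This is just a sketch for a POSIX shell: describe_fuser_codes is a name I made up, it is not part of Solaris, and the descriptions are shortened versions of the man page text quoted above.

```shell
#!/bin/sh
# Sketch: translate the letter codes that fuser appends to a PID
# (for example "ctom") into short descriptions.
# describe_fuser_codes is a hypothetical helper, not a Solaris command.
describe_fuser_codes() {
    codes=$1
    while [ -n "$codes" ]; do
        rest=${codes#?}            # everything after the first character
        letter=${codes%"$rest"}    # the first character itself
        case $letter in
            c) echo "c: current directory" ;;
            m) echo "m: file mapped with mmap(2)" ;;
            n) echo "n: non-blocking mandatory lock" ;;
            o) echo "o: open file" ;;
            r) echo "r: root directory" ;;
            t) echo "t: text file" ;;
            y) echo "y: controlling terminal" ;;
            *) echo "$letter: unknown code" ;;
        esac
        codes=$rest
    done
}

# Decode the codes most processes showed in the listing above.
describe_fuser_codes ctom
```

The function only looks at the letters, so you have to cut the PID off the fuser output yourself before passing the rest in.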
And now comes the kicker: fuser can kill all processes using a certain file or directory. You warned your users ...
# fuser -k -u /mnt/application
/mnt/application: 692c(root) 691c(root)
[2]+ Killed sleep 1000 (wd: /mnt/application)
(wd now: /)
[1]+ Killed sleep 1000 (wd: /mnt/application)
(wd now: /)
#
# cd /mnt/application
# sleep 1000&
[1] 726
# sleep 1000&
[2] 727
# cd /
# ps -o pid,args -p "$(fuser /mnt/application 2>/dev/null)"
  PID COMMAND
  726 sleep 1000
  727 sleep 1000
#
http://docs.sun.com/app/docs/doc/816-5166/fuser-1m
http://blogs.sun.com/petesh/date/20050127
30. pfiles
Solaris 10/Opensolaris
This is not a tutorial, just a hint from my toolbox. On customer systems I see the lsof tool quite often. But for a quick check for open files you don't need it. There is a small but extremely useful tool in the collection of the p*-tools: pfiles prints all open files of a process. It takes the PID of the process as its argument.
# pfiles 214
214:    /usr/lib/inet/in.iked
  Current rlimit: 256 file descriptors
   0: S_IFDIR mode:0755 dev:102,0 ino:2 uid:0 gid:0 size:512
      O_RDONLY|O_LARGEFILE
      /
   1: S_IFDIR mode:0755 dev:102,0 ino:2 uid:0 gid:0 size:512
      O_RDONLY|O_LARGEFILE
      /
   2: S_IFDIR mode:0755 dev:102,0 ino:2 uid:0 gid:0 size:512
      O_RDONLY|O_LARGEFILE
      /
   3: S_IFREG mode:0600 dev:102,0 ino:28994 uid:0 gid:0 size:47372
      O_RDWR|O_APPEND|O_CREAT|O_LARGEFILE
      /var/log/in.iked.log
   4: S_IFSOCK mode:0666 dev:304,0 ino:48934 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK
        SOCK_RAW
        SO_SNDBUF(8192),SO_RCVBUF(8192)
        sockname: AF_INET 10.211.55.200 port: 4500
        peername: AF_INET 10.211.55.200 port: 4500
[..]
  10: S_IFDOOR mode:0777 dev:306,0 ino:0 uid:0 gid:0 size:0
      O_RDWR FD_CLOEXEC door to in.iked[214]
And with the xargs tool there is an easy way to print out all open files on the system.
# ps -ef -o pid | sort | xargs pfiles | more
0:      sched
  [system process]
1:      /sbin/init
  Current rlimit: 256 file descriptors
   0: S_IFIFO mode:0600 dev:301,3 ino:448255748 uid:0 gid:0 size:0
      O_RDWR|O_NDELAY
      /var/run/initpipe
 253: S_IFREG mode:0444 dev:298,1 ino:65538 uid:0 gid:0 size:0
      O_RDONLY|O_LARGEFILE FD_CLOEXEC
      /system/contract/process/pbundle
 254: S_IFREG mode:0666 dev:298,1 ino:65539 uid:0 gid:0 size:0
      O_RDWR|O_LARGEFILE FD_CLOEXEC
      /system/contract/process/template
 255: S_IFREG mode:0666 dev:298,1 ino:65539 uid:0 gid:0 size:0
      O_RDWR|O_LARGEFILE FD_CLOEXEC
      /system/contract/process/template
[...]
Until I've finalized my next larger article, I want to put the spotlight on a really small but really useful feature: one relatively unknown feature of recent versions of pkgadd is the ability to load packages directly from the web. You just have to specify a URL:
# pkgadd -d http://www.blastwave.org/pkg_get.pkg
## Downloading...
..............25%..............50%..............75%..............100%
## Download Complete

The following packages are available:
  1  CSWpkgget  pkg_get - CSW version of automated package download tool
                (all) 3.8.4
[..]
Installation of <CSWpkgget> was successful.
#
That's all. As the packages just have to be accessible via HTTP, you can use an existing internal web server to serve your favorite must-have extra packages and install them directly from there. And it nicely solves the problem of getting started with Blastwave without moving the pkg_get package around via FTP ;)
32. About crashes and cores
No software is without errors. This is a basic law of computer science. And when there is no bug in the software (by a strange kind of luck), your hardware has bugs. And when there are no bugs in the hardware, cosmic rays are flipping bits. Thus an operating system needs some mechanisms to stop a process or the complete kernel at once, without allowing the system to write anything back to disk and thereby make the corrupted state permanent. This tutorial will cover the most important concepts surrounding the last life signs of a system or an application.
Okay, now we can trigger the core dump by using the process id of the process.
# gcore 681
gcore: core.681 dumped
Okay, but the kicker is the fact that the process still runs afterwards. So you can get a core dump of your process for analysis without interrupting it.
# ps -ef | grep "bash" | grep "jmoekamp"
jmoekamp 681 675 0 20:59:39 pts/1 0:00 bash
Neat, isn't it? Now you can use mdb to analyse it, for example to print out the backtrace:
# mdb core.681
Loading modules: [ libc.so.1 ld.so.1 ]
> $c
libc.so.1`__waitid+0x15(0, 2a9, 8047ca0, 83)
libc.so.1`waitpid+0x63(2a9, 8047d4c, 80)
waitjob+0x51(8077098)
postjob+0xcd(2a9, 1)
execute+0x77d(80771c4, 0, 0)
exfile+0x170(0)
main+0x4d2(1, 8047e48, 8047e50)
_start+0x7a(1, 8047eec, 0, 8047ef0, 8047efe, 8047f0f)
Why should you do something like that? Well, there are several reasons. For example, when you want to stop a system right at this moment. There is an effect in clusters called "split brain". This happens when both systems believe they are the surviving one, because they've lost the cluster interconnect. Sun Cluster can prevent this situation with something called quorum. In a high availability situation the nodes of a cluster try to get this quorum. Whoever gets the quorum runs the service. But you have to ensure that the other nodes don't even try to write something to the disks. The simplest method: panic the machine.
Another use case would be the detection of a security breach. Let's assume your developer accidentally integrated a security hole as large as the Rhine into a web application, and now someone else owns your machine. The wrong reaction would be to switch the system off or trigger a normal reboot. Both would lead to the loss of the memory content, and perhaps the attacker has integrated a tool into the shutdown procedure to erase logs. A more sensible option: trigger a crash dump. You keep the content of the memory and you can analyse it for traces of the attacker.
This is the default setting: a crash dump contains only the memory pages of the kernel and uses /dev/dsk/c0d0s1 (the swap device) to store the crash dump in the case of a kernel panic. savecore is a special process that runs at the next boot of the system. If there is a crash dump on the dump device, it copies the dump to the configured directory to keep it for analysis before the device is used for swapping again. Let's change the behaviour. First we want to configure that the complete memory is saved to the crash dump in case of a panic. This is easy:
# dumpadm -c all
      Dump content: all pages
       Dump device: /dev/dsk/c0d0s1 (swap)
Savecore directory: /var/crash/incubator
  Savecore enabled: yes
Okay, now let's change the location for the crash dump. The current name is an artefact of my original VM image called incubator. To get a new test machine, I clone this image. I want to use the directory /var/crash/theoden for this purpose.
# mkdir /var/crash/theoden
# chmod 700 /var/crash/theoden
# dumpadm -s /var/crash/theoden
      Dump content: all pages
       Dump device: /dev/dsk/c0d0s1 (swap)
Savecore directory: /var/crash/theoden
  Savecore enabled: yes
Now the system will use the new directory to store the crash dumps. Setting the permissions of the directory to 700 is important. A crash dump may contain sensitive information, so it could be dangerous to make it readable by anyone other than root.
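If you want to double-check such a directory from a script, something like the following could do. It is only a sketch: check_dump_dir is a made-up helper name, it looks at the permission bits only and does not verify that the directory belongs to root.

```shell
#!/bin/sh
# Sketch: warn when a dump directory is accessible by anyone but its owner.
# check_dump_dir is a hypothetical helper; it does not check the owner.
check_dump_dir() {
    # The first ten characters of "ls -ld" are the file type and mode bits.
    perms=$(ls -ld "$1" | cut -c1-10)
    if [ "$perms" = "drwx------" ]; then
        echo "ok: $1"
    else
        echo "warning: $1 has permissions $perms"
        return 1
    fi
}

# Demonstration on a throwaway directory instead of a real dump directory.
demo_dir=$(mktemp -d)
chmod 700 "$demo_dir"; check_dump_dir "$demo_dir"
chmod 755 "$demo_dir"; check_dump_dir "$demo_dir"
rm -rf "$demo_dir"
```

On the system from this tutorial you would call it as check_dump_dir /var/crash/theoden.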
The coreadm program has more options than dumpadm. I won't go through all of them, just some important ones. From my view the file patterns are the most interesting ones. You can control where core dumps are stored. The default is to store the core dumps in the working directory of a process. But this may lead to core dumps dispersed all over the filesystem. With coreadm you can configure a central location for all your core dumps.
# coreadm -i /var/core/core.%n.%f.%u.%p
# coreadm -u
With -i you tell coreadm to set the location for the per-process core dumps. The parameter for this option is the filename for new core dumps. You can use variables in this filename. For example, %n is translated to the machine name, %f to the name of the executable file, %u to the effective user ID of the process, and %p is substituted with the process ID. coreadm -u forces an immediate reload of the configuration. Otherwise, this setting would become active at the next boot or the next refresh of the coreadm service. Okay, let's try our configuration.
# ps -ef | grep "bash" | grep "jmoekamp"
jmoekamp 681 675 0 20:59:39 pts/1 0:00 bash
# gcore -p 681
gcore: /var/core/core.theoden.bash.100.681 dumped
As you see, the core dump isn't written to the current working directory of the process; it's written to the configured location.
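To see how the pattern turns into the filename from the example, you can imitate the substitution in the shell. This is just an illustration, not what coreadm does internally: expand_core_pattern is a name I made up, and it only knows the four tokens used in this tutorial.

```shell
#!/bin/sh
# Sketch: imitate the substitution of the coreadm tokens %n, %f, %u and %p.
# expand_core_pattern is a hypothetical helper. The values must not contain
# "|" or "&", because they are pasted into sed expressions unescaped.
expand_core_pattern() {
    pattern=$1 node=$2 file=$3 uid=$4 pid=$5
    printf '%s\n' "$pattern" |
        sed -e "s|%n|$node|g" -e "s|%f|$file|g" \
            -e "s|%u|$uid|g" -e "s|%p|$pid|g"
}

# Values taken from the gcore example: host theoden, executable bash,
# user ID 100, process ID 681.
expand_core_pattern '/var/core/core.%n.%f.%u.%p' theoden bash 100 681
```

With the values from the example, this prints the same path that gcore reported.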
The system denies access to this information. Now we change the setting for the process 669 from the first example. It's quite simple:
$ coreadm -p /export/home/jmoekamp/cores/core.%n.%f.%u.%p 669
$ coreadm 669
669:    /export/home/jmoekamp/cores/core.%n.%f.%u.%p default
The per-process core file name pattern is inherited by future child processes of the affected processes. Why should you set your own path and filename for an application or a user? There are several reasons: for example, to ensure that you have the correct permissions on a directory for the cores when the process starts to dump core, or to separate the cores of certain applications into different locations.
Solaris has an in-memory buffer for the console messages. When you write a crash dump, these messages are obviously written into the crash dump as well. With the ::msgbuf command of mdb you can read this message buffer.
> ::msgbuf
MESSAGE
SunOS Release 5.11 Version snv_84 32-bit
Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
features: 10474df<cpuid,sse3,sse2,sse,sep,cx8,mmx,cmov,pge,mtrr,msr,tsc,lgpg>
mem = 331388K (0x1439f000)
root nexus = i86pc
pseudo0 at root
pseudo0 is /pseudo
[...]
devinfo0 is /pseudo/devinfo@0
panic[cpu0]/thread=db3aea00:
forced crash dump initiated at user request
d5efcf84 genunix:uadmin+8e (5, 0, 0, d5efcfac, )
syncing file systems...
 done
dumping to /dev/dsk/c0d0s1, offset 108593152, content: all
>
So it's really easy to get the last messages of a dying system with mdb from the crash dump alone. Another nice piece of information is the backtrace. It helps you to find out what triggered the crash dump. In this case it's easy: it was the uadmin syscall.
> $c
vpanic(fea6388c)
kadmin+0x10c(5, 0, 0, db39e550)
uadmin+0x8e()
sys_sysenter+0x106()
But it would be nice to know more about the state of the system at the moment of the crash. For example, we can print out the process table of the system, as we would do with ps:
> ::ps
S    PID   PPID   PGID    SID    UID      FLAGS     ADDR NAME
R      0      0      0      0      0 0x00000001 fec1d3d0 sched
[...]
R    586      1    586    586      0 0x42000000 d55f58a8 sshd
R    545      1    545    545      0 0x42000000 d5601230 fmd
R    559      1    559    559      0 0x42000000 d55fb128 syslogd
[...]
R    533    494    494    494
We can even look up which files or sockets were open at the moment of the crash dump. For example, we want to know the open files of the ssh daemon. To get this information, we have to use the address of the process from the process table (the eighth column) and append "::pfiles" to it:
> d55f58a8::pfiles
FD   TYPE    VNODE INFO
   0  CHR d597d540 /devices/pseudo/mm@0:null
   1  CHR d597d540 /devices/pseudo/mm@0:null
   2  CHR d597d540 /devices/pseudo/mm@0:null
   3 SOCK db688300 socket: AF_INET6 :: 22
And here we look into the open files of the syslog process.
> d55fb128::pfiles
FD   TYPE    VNODE INFO
   0  DIR d5082a80 /
   1  DIR d5082a80 /
   2  DIR d5082a80 /
   3 DOOR d699b300 /var/run/name_service_door [door to nscd (proc=d5604890)]
   4  CHR db522cc0 /devices/pseudo/sysmsg@0:sysmsg
   5  REG db643840 /var/adm/messages
   6  REG db6839c0 /var/log/syslog
   7  CHR db522840 /devices/pseudo/log@0:log
   8 DOOR db6eb300 [door to syslogd (proc=d55fb128)]
As the crash dump contains all the pages of the kernel (or more, if you configure it that way), you have a frozen state of your system and can investigate everything you want. And to get back to my security example: with the crash dump and mdb you can gather really interesting information. For example, you can see that an ssh connection was open at the time of the crash dump.
> ::netstat
TCPv4    St Local Address    Remote Address    Stack Zone
db35f980  0 10.211.55.200.22 10.211.55.2.53811     0    0
[...]
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc pcplusmp ufs md ip sctp usba fcp fctl nca lofs cpc fcip random crypto zfs logindmux ptm sppp nfs ipc ]
Okay, now start at the beginning of the trace and strip all lines that belong to the operating system infrastructure for error cases. vpanic() generates the panic. The second line is useless for our purposes, too. The next two lines with segmap are generated by the error but are not the root cause. The interesting line is snf_smap_desbfree.
With this name you can go to Sunsolve or bugs.opensolaris.org. Et voila: "System panic due to recursive mutex_enter in snf_smap_desbfree trying to re-aquire Tx mutex". When you type this error into the PatchFinder, you will find a patch fixing this bug: 124255-03.
Two hints:
It's good practice to know mdb. It's very useful when compiling open source software, in case your compiled code throws cores but you don't know why. Core files are not just for deleting.
Error reports with a stack trace are more useful than an error report that just says "The system panicked when I did this".
32.6. Conclusion
The Solaris Operating Environment has several functions to help the user or the support engineer when something goes wrong. Crash and core dumps are an invaluable resource for finding the root cause of a problem. Don't throw them away without looking at them.
33. Jumpstart Enterprise Toolkit
33.3.1. Development
In the beginning Jumpstart was only able to do the installation. Recent versions include functions to automatically create boot mirrors.
33.4.1. rules
The first important file for the automatic installation is the rules file. This file associates systems with an installation profile.
# rule keywords and rule values  begin script  profile         finish script
# -----------------------------  ------------  -------         -------------
hostname aramaki                 setup         webserver       completion
any -                            -             genericprofile  -
The first rule can be divided into parts like this: when the hostname of the new server is aramaki, start the begin script setup on the client before starting the installation. For the installation, use the profile file webserver. After the installation, execute the script completion.
The second line is a catch-all condition. The file is read top down, and the process of matching a system to a profile stops at the first match. Thus an installation for aramaki would never reach the second line. This line can be translated like this: for any other host, use the profile genericprofile. There is no begin or finish script.
You can't use the rules file directly. The Jumpstart server provides a script to do a syntax check on the rules. When the file is correct, the script generates the file rules.ok from it.
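The first-match behaviour can be sketched in a few lines of shell and awk. Please take it as an illustration only: match_profile is a made-up name, the rule format is simplified to five whitespace-separated fields (keyword, value, begin script, profile, finish script), and only the keywords hostname and any are understood. The real Jumpstart rule syntax knows many more keywords.

```shell
#!/bin/sh
# Sketch of the first-match logic only, on a simplified rule format:
# keyword, value, begin script, profile, finish script.
# match_profile is a hypothetical helper, not part of Jumpstart.
match_profile() {
    host=$1 rules=$2
    awk -v host="$host" '
        /^#/ || NF < 5 { next }                        # skip comments and short lines
        $1 == "hostname" && $2 == host { print $4; exit }   # stop at the first match
        $1 == "any"                    { print $4; exit }   # catch-all rule
    ' "$rules"
}

rules_file=$(mktemp)
cat > "$rules_file" <<'EOF'
# keyword value   begin profile        finish
hostname aramaki  setup webserver      completion
any      -        -     genericprofile -
EOF

match_profile aramaki "$rules_file"   # matches the first rule
match_profile togusa  "$rules_file"   # falls through to the catch-all
rm -f "$rules_file"
```

The exit after the first matching line is the whole point: aramaki gets the profile webserver and never reaches the catch-all line, while every other host ends up with genericprofile.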
33.4.2. profile
The profile file controls what we install on the system and how we partition the disks for the installation.
# profile keywords  profile values
# ----------------- -----------------
install_type        initial_install
system_type         standalone
partitioning        default
filesys             any 512 swap
cluster             SUNWCprog
package             SUNWman delete
cluster             SUNWCacc
You can have a multitude of profiles in your system: a profile for systems with large disks, a profile for systems with small disks, a profile with a selection of packages customized for a web server, a profile customized for a developer workstation. The Jumpstart framework will choose the correct one on the basis of the rules.ok file. The profile can control almost any important parameter of the installation on the disk. You can define pa
name_service=NIS {domain_name=marquee.central.example.com
                  name_server=nmsvr2(172.25.112.3)}
nfs4_domain=example.com
root_password=URFUni9
But: Whenever some essential information is missing, the installer will go interactive and ask for the missing information. This obviously is against our objective of an automated installation.
parts. The next time you want to install the system, you use both archives. First the full archive is installed on the system; after this you use one or more differential flash archives to complete your installation.
Table 33.1.: differential archive behavior
old         new                   Action
exists not  exists                File is included in archive
exists      exists but different  The file from the new state is included in archive
exists      exists not            File will be deleted when the diff archive is used on a server
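The three cases of the table can be tried out on two ordinary directory trees. This is not how the flash archive tools work internally, just a sketch of the decision logic; classify_changes is a name I made up for it.

```shell
#!/bin/sh
# Sketch: classify files like the table above by comparing two directory
# trees. classify_changes is a hypothetical helper and only illustrates
# the three cases; it does not create any archive.
classify_changes() {
    old=$1 new=$2
    # Files present in the new state: new or changed ones go into the archive.
    (cd "$new" && find . -type f | sort) | while IFS= read -r f; do
        if [ ! -e "$old/$f" ]; then
            echo "include (new): $f"
        elif ! cmp -s "$old/$f" "$new/$f"; then
            echo "include (changed): $f"
        fi
    done
    # Files that only exist in the old state get deleted on the target.
    (cd "$old" && find . -type f | sort) | while IFS= read -r f; do
        [ -e "$new/$f" ] || echo "delete: $f"
    done
}

# Demonstration with two throwaway trees.
old_state=$(mktemp -d); new_state=$(mktemp -d)
echo a > "$old_state/same";    echo a > "$new_state/same"
echo a > "$old_state/changed"; echo b > "$new_state/changed"
echo a > "$old_state/gone"
echo a > "$new_state/added"
classify_changes "$old_state" "$new_state"
rm -rf "$old_state" "$new_state"
```

Unchanged files produce no output at all, which matches the idea of a differential archive: only the differences travel.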
flar creation is just a big wrapper around cpio, thus it's possible to do some nifty tricks with it. The current state of the system doesn't have to be the active one, and the old state doesn't have to be a flar archive. It's possible to compare an old boot environment and the current boot environment from Live Upgrade to generate a differential flash archive. This differential can be used to update other servers. You could even compare against a remote system via NFS, when you don't squash root. (I know this has some security implications, but hey ... you should limit the access for such stunts to your admin networks, and you can deactivate it afterwards.)
removes the interface configuration of all configured interfaces
removes the root password
removes /etc/sysidcfg
removes /etc/defaultrouter
removes /etc/inet/netmasks
regenerates the ssh keys
sets the timezone in /etc/timezone to PST8PDT
Although it's not designed for system recovery, there is a trick you can use to recover the removed information. Knowing what gets removed is important for the trick, thus I've included a list in this tutorial. You will find a script at the end of this tutorial.
33.7. Prerequisites
33.7.1. Systems
For these tests, I need quite a few systems. I've used VirtualBox in the preparation of this tutorial. I've populated my /etc/hosts with the following hostnames [4]:
192.168.10.1 aramaki
192.168.10.10 togusa
192.168.10.11 ishikawa
In my tutorial aramaki will serve as the Jumpstart server, togusa and ishikawa are the installation targets.
33.8. Packages
For this tutorial I've used the following packages and ISOs:
Solaris 10 Update 5: This is the operating environment I will use to demonstrate automated patching.
OpenSolaris Community Edition Build 87: aramaki runs with this operating environment, and it's no problem to jumpstart OpenSolaris CE or DE with JET.
Recommended Patch Cluster: To demonstrate automated patching, I've used the recommended patch cluster. You can get it at http://sunsolve.sun.com
[3] Those would make this action unavailable on Solaris 8, for example.
[4] In case you wonder about the hostnames: these names are characters from Ghost in the Shell. And just in case you search for kusanagi, this is the name of the system hosting the virtual machines.
SUNWjet: The Jumpstart Enterprise Toolkit. You can get it at http://www.sun.com/download/index.jsp?tab=2#J
SUNWjass: The Solaris Security Toolkit. You can get it from http://www.sun.com/download/index.jsp?tab=2#S
The current recommended patch cluster for Solaris 10 x86 is over 300 MB in size.
                (sparc) 1.1.5
  5             JET san product
                (sparc) 3.1.7
  6             Secure By Default product
                (sparc) 1.0.2
  7             JET sds product
                (sparc,i386) 3.4.4
  8             JET VTS product
                (sparc) 3.0.11
  9             JET WanBoot support
                (sparc) 1.1.1
 10             JET Zones module
                (sparc) 1.1.12

... 2 more menu choices to follow;
<RETURN> for more choices, <CTRL-D> to stop display:

 11  SUNWjet    Sun JumpStart Enterprise Toolkit
                (sparc,i386) 4.4
 12  SUNWjetd   JET Documentation
                (sparc) 3.3.1

Select package(s) you wish to process (or all to process
all packages). (default: all) [?,??,q]: all
JetEXPLO: installs and configures the Sun Explorer
JetFLASH: module to control Jumpstart Flash installs
JetJASS: executes the Solaris Security Toolkit on the new host
JetRBAC: configures the Role Based Access Control
JetSAN: configures the SAN framework of Solaris
JetSBD: configures the Secure-by-default setting
JetSDS: configures the Solaris Volume Manager
JetVTS: installs the Sun Validation Test Suite. It tests and validates Sun hardware by verifying the connectivity and functionality of hardware devices, controllers and peripherals.
JetWANboot: configures the Jumpstart facilities for installation over the WAN
JetZONES: configures Solaris Zones on the newly installed system
As the package is quite small at half a megabyte, I always install all packages on a Jumpstart server.
Processing package instance <SUNWjet> from </root/jet.pkg>

Sun JumpStart Enterprise Toolkit
(sparc,i386) 4.4
Copyright 2007 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

The selected base directory </opt/SUNWjet> must exist before
installation is attempted.

Do you want this directory created now [y,n,?,q] y
Using </opt/SUNWjet> as the package base directory.
[...]
Processing package instance <JetSBD> from </root/jet.pkg>

Secure By Default product
(sparc) 1.0.2
#
Copying /boot netboot hierarchy...
Install Server setup complete

Added Solaris image nv87 at the following location:
        Media: /export/install/media/solaris/x86/nv87

removing directory /export/install/media/solaris/x86/911
#
Let's dissect the command: -d specifies the target to which you copy the operating system. -n specifies a name for this media. From now on you refer to this Solaris media as nv87 in the templates for JET. At the end you specify the location where the media is located at the moment.
The command line for using an .iso file is quite similar. You just specify with -i that an .iso file has to be used and in which directory the command should search for it. The last parameter is the name of the .iso file itself. The system mounts the DVD image by using the loopback facility of Solaris and copies the media to its target location afterwards.
Updating base_config template specifics
Client template created in /opt/SUNWjet/Templates
Okay, this is too much ... at the start we don't want all these modules. We can add them later without losing the configuration. Let's just use the module for the basic configuration:
# make_template -f togusa base_config
Adding product configuration information for
        + base_config
Updating base_config template specifics
Client template created in /opt/SUNWjet/Templates
Even the basic template is quite long. I've reduced it for this tutorial by deleting all comments, all empty lines and all variables without a value.
 1  base_config_ClientArch=i86pc
 2  base_config_ClientEther=08:00:27:97:29:1E
 3  base_config_ClientOS=nv87
 4  base_config_client_allocation="grub"
 5  base_config_sysidcfg_nameservice=NONE
 6  base_config_sysidcfg_network_interface=PRIMARY
 7  base_config_sysidcfg_ip_address=192.168.10.10
 8  base_config_sysidcfg_netmask=255.255.255.0
 9  base_config_sysidcfg_root_password="boajrOmU7GFmY"
10  base_config_sysidcfg_system_locale="C"
11  base_config_sysidcfg_timeserver=localhost
12  base_config_sysidcfg_timezone="Europe/Berlin"
13  base_config_sysidcfg_terminal=vt100
14  base_config_sysidcfg_security_policy=NONE
15  base_config_sysidcfg_protocol_ipv6=no
16  base_config_sysidcfg_default_route=192.168.10.1
17  base_config_x86_nowin="yes"
18  base_config_label_disks="all"
19  base_config_profile_cluster=SUNWCuser
20  base_config_profile_usedisk=rootdisk.
21  base_config_profile_root=free
22  base_config_profile_swap=256
23  base_config_ufs_logging_filesys="all"
24  base_config_profile_del_clusters="SUNWCpm SUNWCpmx SUNWCdial SUNWCdialx"
25  base_config_dns_disableforbuild="yes"
26  base_config_update_terminal="yes"
27  base_config_enable_savecore="yes"
28  base_config_dumpadm_minfree="20000k"
29  base_config_noautoshutdown="pm_disabled"
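If you want to get a similarly condensed view of your own template, a filter like the following could help. It is only a sketch of the reduction described above: reduce_template is a name I made up, it assumes one name=value assignment per line and a POSIX grep, and the variable some_unset_variable in the sample is invented for the demonstration.

```shell
#!/bin/sh
# Sketch: print a template without comments, empty lines and variables
# that have no value. reduce_template is a hypothetical helper and
# assumes one "name=value" assignment per line.
reduce_template() {
    grep -v '^[[:space:]]*#' "$1" |      # drop comment lines
        grep -v '^[[:space:]]*$' |       # drop empty lines
        grep -v '=[[:space:]]*$' |       # drop assignments like name=
        grep -v '=""[[:space:]]*$'       # drop assignments like name=""
}

# Demonstration on a small sample; some_unset_variable is made up.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# Client architecture
base_config_ClientArch=i86pc

some_unset_variable=""
base_config_ClientOS=nv87
EOF
reduce_template "$sample"
rm -f "$sample"
```

On the Jumpstart server you would point it at the template itself, for example reduce_template /opt/SUNWjet/Templates/togusa. The original file stays untouched, which is important because the comments in it explain every variable.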
Let's dissect this template.
Line 1-3: These lines are the most basic ones. The first line defines the architecture of the system. The next line is the Ethernet address of the new system. The third one specifies the new operating system.
Line 4: This line specifies how the new system obtains the most basic information, like its own IP address.
Line 5-16: Do you remember the part about sysidcfg? The values for this file are defined in this part of the template.
Line 17: This line tells the system to suppress the start of the windowing system.
Line 18: Solaris needs a disk label on the disks for the system. This directive tells the system to write this label to all disks.
Line 19-22: Another known phrase ... profile. Here you specify the partitioning for the system and what packages will be installed on it.
Line 23-end: There are several further statements. Please see the original file for an explanation.
Okay, after this step we have to generate the configuration for the Jumpstart mechanism. This is really easy:
# make_client -f togusa
Gathering network information..
        Client: 192.168.10.10 (192.168.10.0/255.255.255.0)
        Server: 192.168.10.1 (192.168.10.0/255.255.255.0, SunOS)
Solaris: client_prevalidate
Solaris: client_build
Creating sysidcfg
Creating profile
Adding base_config specifics to client configuration
Solaris: Configuring JumpStart boot for togusa
        Starting SMF services for JumpStart
Solaris: Configure PXE/grub build
        Adding install client
        Doing a TEXT based install
        Leaving the graphical device as the primary console
        Configuring togusa macro
        Using local dhcp server
        PXE/grub configuration complete
Running /opt/SUNWjet/bin/check_client togusa
        Client: 192.168.10.10 (192.168.10.0/255.255.255.0)
        Server: 192.168.10.1 (192.168.10.0/255.255.255.0, SunOS)
Checking product base_config/solaris
--------------------------------------------------------------
Check of client togusa
-> Passed....
The nice thing about the make_client command: it doesn't just generate the Jumpstart configuration. It also checks for the dumbest errors, like forgetting to share the directory of your Solaris media via NFS. So you can detect many problems at an early stage. You don't have to wait until the Jumpstart client comes up just to find out that there is no NFS share or no DHCP configuration.
In the GRUB configuration we not only load the kernel, we additionally name the location of the Jumpstart server, the exact location and name of the sysidcfg file, the position of our installation media, and lastly the location of the miniroot. In our example all locations are NFS locations. Okay, the install_config directory is the first important location. We find the rules.ok file there.
-bash-3.2$ cat /opt/SUNWjet/rules.ok
any any Utils/begin = Utils/finish
# version=2 checksum=3114
Okay, now let's have a look into the specified profile file:
-bash-3.2$ cat /opt/SUNWjet/Clients/togusa/profile
#
# This is an automatically generated profile. Please modify the template.
#
# Created: Mon May 19 21:47:50 CEST 2008
#
install_type initial_install
system_type server
cluster SUNWCuser
partitioning explicit
#
# Disk layouts
#
filesys rootdisk.s0 free /
filesys rootdisk.s1 256 swap
cluster SUNWCpm delete
cluster SUNWCpmx delete
cluster SUNWCdial delete
cluster SUNWCdialx delete
As I wrote before, we have to give the system an identity. The sysidcfg file is responsible for this task, thus we find such a file in our directory. Our new system will use it when the installation has completed.
-bash-3.2$ cat /opt/SUNWjet/Clients/togusa/sysidcfg
name_service=NONE
root_password=boajrOmU7GFmY
system_locale=C
timeserver=localhost
timezone=Europe/Berlin
terminal=vt100
security_policy=NONE
nfs4_domain=dynamic
network_interface=PRIMARY {hostname=togusa ip_address=192.168.10.10 netmask=255.255.255.0 protocol_ipv6=no default_route=192.168.10.1}
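As the installer turns interactive whenever essential information is missing, a quick check of a hand-written sysidcfg can save you a wasted installation attempt. The following is only a sketch: missing_sysidcfg_keys is a made-up name, and the keyword list is an assumption copied from the example file above. It is not an authoritative list of what the installer requires.

```shell
#!/bin/sh
# Sketch: report keywords that are absent from a sysidcfg file.
# missing_sysidcfg_keys is a hypothetical helper. The keyword list is an
# assumption taken from the example file, not an official requirement list.
missing_sysidcfg_keys() {
    for key in name_service root_password system_locale timeserver \
               timezone terminal security_policy nfs4_domain network_interface
    do
        # Print the keyword when no line of the form "keyword=..." exists.
        grep "^[[:space:]]*$key[[:space:]]*=" "$1" >/dev/null || echo "$key"
    done
}

# Demonstration on an incomplete sample file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
name_service=NONE
system_locale=C
timezone=Europe/Berlin
EOF
missing_sysidcfg_keys "$sample"
rm -f "$sample"
```

An empty output means that all keywords from the list are present. With JET you normally don't need this, as make_client generates the file from the template, but it can be handy for a sysidcfg written by hand.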
After a while the installation will be complete. You can find the logfile of the installation in /var/sadm/system/logs:
Configuring disk (c0d0)
        - Creating Fdisk partition table
Fdisk partition table for disk c0d0 (input file for fdisk(1M))
        type: 130 active: 128 offset: 16065 size: 33527655
        type: 100 active: 0 offset: 0 size: 0
        type: 100 active: 0 offset: 0 size: 0
        type: 100 active: 0 offset: 0 size: 0
        - Creating Solaris disk label (VTOC)
        - Processing the alternate sector slice
Creating and checking UFS file systems
        - Creating / (c0d0s0)
Warning: 1608 sector(s) in last cylinder unallocated
/dev/rdsk/c0d0s0: 31744440 sectors in 5167 cylinders of 48 tracks, 128 sectors
        15500.2MB in 323 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
......
super-block backups for last 10 cylinder groups at:
 30776480, 30874912, 30973344, 31071776, 31170208, 31268640, 31367072,
 31457312, 31555744, 31654176
Beginning Solaris software installation
Installation of <SUNWkvm> was successful.
[...]
Installation of <SUNWsolnm> was successful.
Solaris 11 software installation succeeded
Solaris 11 packages fully installed
        SUNWkvm
        [...]
        SUNWsolnm
Customizing system files
        - Mount points table (/etc/vfstab)
                fd - /dev/fd fd - no -
                /proc - /proc proc - no -
                /dev/dsk/c0d0s1 - - swap - no -
                /dev/dsk/c0d0s0 /dev/rdsk/c0d0s0 / ufs 1 no -
                /devices - /devices devfs - no -
                sharefs - /etc/dfs/sharetab sharefs - no -
                ctfs - /system/contract ctfs - no -
                objfs - /system/object objfs - no -
                swap - /tmp tmpfs - yes -
        - Network host addresses (/etc/hosts)
        - Environment variables (/etc/default/init)
Cleaning devices
Customizing system devices
        - Physical devices (/devices)
        - Logical devices (/dev)
Installing boot information
        - Updating boot environment configuration file
        - Installing boot blocks (c0d0)
        - Installing boot blocks (/dev/rdsk/c0d0s0)
Creating boot_archive for /a
updating /a/platform/i86pc/boot_archive
updating /a/platform/i86pc/amd64/boot_archive
You don't have to configure anything in the templates. Every new Solaris 10 installation on x86 will be installed with the matching recommended patch cluster.
joe depends on the ncurses library, so we copy this package to our JET server as well.
# pkgtrans ncurses-5.6-sol10-x86-local /tmp all
Transferring <SMCncurs> package instance
# copy_custom_packages /tmp x86 SMCncurs
Transferring <SMCncurs> package instance
Packages copied
When you look into /opt/SUNWjet/Templates/togusa you will recognize your old configuration, with a large number of new lines. But we have to change only a few of them. First we change the operating system. We've used OpenSolaris in the last example, but there are no patches for it. However, we copied Solaris 10 media under the name sol10u5 earlier.
base_config_ClientOS=sol10u5
Okay, now we want to install the additional packages. You have to add the names of the packages to the custom_packages line.
custom_packages="SMCncurs SMCjoe"
You don't have to configure the Secure by Default module, as this module configures the limited service set when it's used in the template. Patching of the Solaris OE doesn't need any configuration either. So we have to change only these two lines.
The following module doesn't really do much, as the configuration of the service profile is activated by the sysidcfg file.
SBD: Installing sbd....
SBD: configured
When we look for one of the installed patches, we will see that it was successfully installed on the system:
# showrev -p | grep "120273-20"
Patch: 120273-20 Obsoletes: Requires: 119043-09, 121902-01, 122532-04 Incompatibles: Packages: SUNWbzip, SUNWsmagt, SUNWsmcmd, SUNWsmmgr
base_config_products="custom sbd sds"
sds_product_version="default"
sds_root_mirror="c1d1"
sds_use_fmthard="yes"
sds_database_locations="rootdisk.s7:3"
sds_database_partition="s7:32"
sds_metadb_size=""
sds_root_alias="rootdisk"
sds_root_mirror_devalias_name="rootmirror"
sds_mirrored_root_flag="yes"
First you have to add the sds module to the base_config_products line. Then choose the disk you want to use for mirroring; in my case it's /dev/dsk/c1d1. Line 4 orders the sds module to copy the VTOC of the first mirror disk to the second one. When there are only two disks in your system, you have to set sds_mirrored_root_flag to yes in line 10. Solaris Volume Manager wants to see the metadb replica copies on at least three disks [6]. You have to tell the system that it shouldn't obey this rule.
[6] When you want to find out the correct state of a situation, you need at least three copies to be sure. With only two copies, either version may be the correct one; with three, two copies still agree, so there is a good chance that these two represent the correct state. Solaris Volume Manager likes to distribute the replicas over at least three disks to ensure that the failure of a single disk won't take out exactly half of the copies.
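The reasoning in this footnote can be sketched as a small check. This is only an illustration of the "more than half must survive" idea, not the logic Solaris Volume Manager really implements; the function name survives_single_disk_loss is made up for this example. It takes the number of replicas on each disk and tests whether a strict majority remains after the loss of any single disk.

```shell
#!/bin/sh
# Hypothetical helper: arguments are the replica counts per disk,
# e.g. "3 3" for two disks carrying three replicas each.
# Returns 0 if a strict majority of replicas survives the loss of
# any one disk, 1 otherwise.
survives_single_disk_loss() {
    total=0
    for n in "$@"; do
        total=$(( total + n ))
    done
    for n in "$@"; do
        remaining=$(( total - n ))
        # strict majority means: remaining * 2 > total
        if [ $(( remaining * 2 )) -le "$total" ]; then
            return 1
        fi
    done
    return 0
}

if survives_single_disk_loss 3 3; then
    echo "two disks: majority survives"
else
    echo "two disks: a single failure takes out half of the copies"
fi
```

With two disks the check fails, with three equally loaded disks it passes. That is the situation the mirrored root flag has to work around on a two-disk system.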
d22: Submirror of d20
    State: Okay
    Size: 530145 blocks (258 MB)
    Stripe 0:
        Device   Start Block  Dbase
        c1d1s1          0     No

d10: Mirror
    Submirror 0: d11
      State: Okay
    Submirror 1: d12
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 32836860 blocks (15 GB)

d11: Submirror of d10
    State: Okay
    Size: 32836860 blocks (15 GB)
    Stripe 0:
        Device   Start Block  Dbase
        c0d0s0          0     No

d12: Submirror of d10
    State: Okay
    Size: 32836860 blocks (15 GB)
    Stripe 0:
        Device   Start Block  Dbase
        c1d1s0          0     No

Device Relocation Information:
Device   Reloc  Device ID
c1d1     Yes    id1,cmdk@AVBOX_HARDDISK=VB37711a7b-00576cc9
c0d0     Yes    id1,cmdk@AVBOX_HARDDISK=VB97a90791-9d191449
The default numbering scheme for the Solaris Volume Manager is quite simple: The mirror is designated with the first number in a decade (e.g. 10, 20, 30), and the parts of a mirror are numbered with the next free numbers in the decade. For example: The first half of the first mirror gets the number 11, the second half gets the number 12. It takes a while until the mirrors are in sync, but after this you have an automatically installed, patched, customized and mirrored system.
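The numbering scheme can be written down as a few lines of shell. This is just a sketch of the rule described above; the function name svm_mirror_names is invented for this example and is not part of JET or the Solaris Volume Manager.

```shell
#!/bin/sh
# Hypothetical helper illustrating the numbering rule.
# $1 = index of the mirror (1 for the first mirror, 2 for the second, ...)
# $2 = number of submirrors (defaults to 2)
# Prints the mirror name followed by the names of its submirrors.
svm_mirror_names() {
    base=$(( $1 * 10 ))
    count=${2:-2}
    names="d$base"
    i=1
    while [ "$i" -le "$count" ]; do
        names="$names d$(( base + i ))"
        i=$(( i + 1 ))
    done
    echo "$names"
}

svm_mirror_names 1
svm_mirror_names 2
```

The two calls print the names for the first and the second mirror, which match the metadevices d10/d11/d12 and d20/d22 in the metastat output.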
# copy_product_media jass 4.2.0 /export/home/jmoekamp i386
Transferring <SUNWjass> package instance
Packages copied.
Okay, but we have to do another step. There is a patch for version 4.2.0 of the toolkit: 122608-xx. First we have to tell JET that there is a patch for this product and version. We have to modify the file patch.matrix in /opt/SUNWjet/Products/jass.
#
# Patch matrix for Solaris Security Toolkit (JASS)
#
# <os>:<arch>:<version>:<patchlist>
#
10:i386:4.2.0:122608
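To make the format of this file a little clearer, here is a minimal sketch of how such a colon-separated matrix could be read. It is not the code JET uses; the function name patches_for is an assumption for this example, and only the field layout from the comment line of the file is used.

```shell
#!/bin/sh
# Hypothetical helper: print the patch list for a given os, arch and
# product version from a patch.matrix style file.
# $1 = matrix file, $2 = os, $3 = arch, $4 = product version
patches_for() {
    while IFS=: read -r os arch version patchlist; do
        case "$os" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        if [ "$os" = "$2" ] && [ "$arch" = "$3" ] && [ "$version" = "$4" ]; then
            echo "$patchlist"
        fi
    done < "$1"
}

# Usage, assuming the file shown above:
# patches_for /opt/SUNWjet/Products/jass/patch.matrix 10 i386 4.2.0
```

For the matrix above, a lookup for Solaris 10 on i386 with toolkit version 4.2.0 yields the patch 122608.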
Now it's easy to integrate the patch. I've unpacked the patch in the directory /export/home/jmoekamp/pa[...] before:
# copy_product_patches jass 4.2.0 /export/home/jmoekamp/patch_jass i386
Patches copied.
Installation of <SUNWjass> was successful.
JASS: SUNWjass installation complete
JASS: Register postinstall script postinstall for boot z
It's important to know that the above configuration installs the SUNWjass package on the system, patches it there and then runs the toolkit installed on the system. The hardening of the system is started in the background. After a while you will recognize the work of the script. The backup files of the Solaris Security Toolkit are dispersed all over the directories.
bash-3.00$ ls -l /etc/*.JASS*
-rw-r--r-- 1 root other 372 May 23 19:48 /etc/coreadm.conf.JASS.20080523195314
[...]
-rw-r--r-- 1 root sys [...] /etc/vfstab.JASS.20080523195420
bash-3.00$
After the completion of the background JASS run, you have an automatically installed, patched, customized, mirrored and hardened system.
The installation of the Solaris Operating Environment is equal to the normal Jumpstart process, as it relies on the same functions. But then JET comes into the game. After the installation has completed, the script Utils/finish is executed. But where is this file? It's relative to a directory we've specified before. Or to be exact, JET did that for us. This is a snippet from the menu.lst for our system.
title Solaris_11 Jumpstart
kernel /I86PC.Solaris_11-1/platform/i86pc/kernel/unix - install nowin -B install_config=192.168.10.1:/opt/SUNWjet,sysid_config=192.168.10.1:/opt/SUNWjet/Clients/togusa,install_media=192.168.10.1:/export/install/media/solaris/x86/nv87,install_boot=192.168.10.1:/export/install/media/solaris/x86/nv87/boot
module /I86PC.Solaris_11-1/x86.miniroot
Utils/finish is relative to install_config, thus the executed finish script is 192.168.10.1:/opt/SUN[...]. The NFS mount specified in install_config is one of the first mounts done on the new system, and we can use the content of this directory throughout the installation process. By the way: This is the reason why the rules.ok is located at this strange position. We can study the further process in the log file of the installation. The complete log is located at /var/opt/sun/jet/ in the file jumpstart_install.log. Let's start. First the finish script copies some components from the Jumpstart server to the local disk.
Creating directory: /a/var/opt/sun/jet/post_install
Creating directory: /a/var/opt/sun/jet/Utils
Creating directory: /a/var/opt/sun/jet/config
Creating directory: /a/var/opt/sun/jet/js_media/patch
Creating directory: /a/var/opt/sun/jet/js_media/pkg
Copying file Clients/togusa/sysidcfg to /a/var/opt/sun/jet/config/sysidcfg
Copying file Clients/togusa/profile to /a/var/opt/sun/jet/config/profile
Copying file Clients/togusa/host.config to /a/var/opt/sun/jet/config/host.config
Copying file Utils/solaris/releaseinfo to /a/var/opt/sun/jet/config/releaseinfo
Copying functions to /a/var/opt/sun/jet/Utils/lib
Copying file Clients/togusa/module_hints to /a/var/opt/sun/jet/config/module_hints
As you see, JET copies over part of the toolkit as well as configuration files to the new position. But why are all those directories relative to /a? Well, this is easy. In the netbooted miniroot, the local disks are mounted relative to /a. The reasoning behind this copy is relatively simple. In the next boots the contents of /opt/SUNWjet/ aren't available any longer, as the system doesn't mount it in the next steps. The post-installation scripts rely on some helper functions. The simplest way to ensure their availability under all circumstances [7] is a simple copy. The next step is the mounting of the directories with patches and products.
Mounting nfs://192.168.10.1/export/install/patches on /var/opt/sun/jet/js_media/patch
Mounting nfs://192.168.10.1/export/install/pkgs on /var/opt/sun/jet/js_media/pkg
Now it gets a little bit complicated, but I will simplify it as far as I can. Depending on the complexity of the setup, your configuration may use one or more so-called products. A product in JET-speak is a JET module for the installation and configuration of a certain area of the operating system. In any case you will use the product base_config, but there may be several others. Our example uses the products base_config, custom, sds and jass. The JET framework gathers this information from the configuration template. It's stored in this line:
base_config_products="custom sbd sds jass"
The framework takes this information and executes the install script in each of these directories. For example, it first starts the install script in /opt/SUNWjet/Products/base_config/sol[...], as this is the default for every installation; after this it steps forward by executing the install script in each of the product directories. The install script has two important roles. First, it installs packages, patches and files according to the configuration in the templates. Second, it registers so-called post_install scripts.
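The order in which the products are processed can be illustrated with a few lines of shell. This is only a sketch under the assumption that base_config always comes first and the other products follow in the order of the template line; the function name jet_product_order is invented and is not part of the JET framework.

```shell
#!/bin/sh
# Hypothetical helper: derive the processing order of the products
# from a template line.
# $1 = a line such as: base_config_products="custom sbd sds jass"
jet_product_order() {
    value=${1#*=}        # drop everything up to the first '='
    value=${value#\"}    # strip the opening quote, if any
    value=${value%\"}    # strip the closing quote, if any
    # word splitting of $value is intended here
    set -- base_config $value
    echo "$*"
}

jet_product_order 'base_config_products="custom sbd sds jass"'
```

For the template line of our example this prints base_config followed by custom, sbd, sds and jass.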
You see that the SDS product registered some scripts for boot level 1 and some for boot level z. Let's look further into the installation log. This happens after the first reboot:
SDS: Running 001.sds.001.create_fmthard
fmthard: New volume table of contents now in place.
SDS: Running 001.sds.001.set_boot_device
SDS: Setting OBP nvramrc rootdisk path
[...]
SDS: Create 3 copies on c1d1s7
metadb: waiting on /etc/lvm/lock
SDS: Running 001.sds.003.create_root_mirror
SDS: Setting OBP nvramrc rootmirror path
[...]
SDS: Installing GRUB on Alternate Boot Disk.
SDS: Running 001.sds.007.create_user_devices
Attach d22 to d20
Attach d12 to d10
submirror d12 is attached
Attach d22 to d20
submirror d22 is attached
Running 003.sds.002.attach_user_mirrors
With this mechanism, you can implement installation processes that need package or program installations which require several boots to complete.
Running pre-exit scripts...
Pre-exit scripts done.
bash-3.00#
The second command generates the flar image itself. With this command, I generate the flash archive togusa.flar in the directory /flar. The -x option excludes the directory /flardir from the flash archive. The -R option specifies that the flash archive should contain all filesystems descending from /. The -S flag bypasses the size checks.
# scp togusa.flar jmoekamp@192.168.10.1:/export/flar/togusa/togusa.flar
After the creation of the flash archive, you have to transmit it to a server. It doesn't have to be the Jumpstart server; it doesn't even have to be an NFS server. It just has to be reachable via HTTP or NFS by the server you plan to install. In this example we will use the Jumpstart server for this task, thus we will use a share on this system. Don't forget to share this directory via NFS:
# echo "share -F nfs -o anon=0,sec=sys,ro -d \"Installation Images\" /export/install" >> /etc/dfs/dfstab
# shareall
The operating system used here isn't important. You will not install this operating system. You will install the operating system contained in the flash archive. Fill out the sysid section of the system as you would for a normal system; you just need to fill in the data to prevent the system from going interactive. Now we get to the configuration of the flash install. You just define one or more locations of flash archives. When the installation in your flash archive already contains all recommended patches, you can save some time during the installation and skip the patch installation by using yes for the flash_skip_recommended_patches parameter.
flash_archive_locations="nfs://192.168.10.1/export/flar/togusa/togusa.flar"
flash_skip_recommended_patches="yes"
Server: 192.168.10.1 (192.168.10.0/255.255.255.0, SunOS)
Checking product base_config/solaris
Checking product flash
FLASH: Checking nfs://192.168.10.1/export/flar/togusa/togusa.flar
--------------------------------------------------------------
Check of client ishikawa
-> Passed....
Please note the single line with FLASH: at the beginning. The JET framework checks for the availability of the flash archive. This prevents one of the most frequent problems with flash: an inaccessible flash archive at installation time. When you look into the profile for ishikawa, you will recognize that all statements regarding the cluster to install or similar stuff have been removed. But there are new statements: archive_location specifies the location of the flar image and install_type tells the system to do a flash install.
# cat profile
#
# This is an automatically generated profile. Please modify the template.
#
# Created: Sat May 24 12:50:20 CEST 2008
#
install_type flash_install
archive_location nfs://192.168.10.1/export/flar/togusa/togusa.flar
partitioning explicit
#
# Disk layouts
#
filesys rootdisk.s0 free /
filesys rootdisk.s1 256 swap
# pwd
/opt/SUNWjet/Clients/ishikawa
#
[...] just see the progress notifications of the extraction of the flar archive. You will find this information in the middle of /var/sadm/system/logs/install_log:
[...]
Beginning Flash archive processing
Extracting archive: togusa
Extracted 0.00 MB ( 0% of 2512.13 MB archive)
[...]
Extracted 2512.13 MB (100% of 2512.13 MB archive)
Extraction complete
[...]
The JET framework augments this with some housekeeping tasks like deleting the Solaris Volume Manager configuration. As usual you can look into the /var/opt/sun/jet/finish.log log file to find the related messages:
FLASH: Installing flash....
FLASH: Disabling /a/etc/lvm/mddb.cf -> /a/etc/lvm/mddb.cf.disabled
FLASH: Purging entries from /a/etc/lvm/md.tab
FLASH: Disabling mirrored_root_flag in /etc/system
FLASH: Cleanout crash dump area
FLASH: Clear out devfsadm files
[...] can easily rescue the personality of the system into the flash archive. To ease this task, I use the following script:
#!/bin/sh
mkdir -p /var/opt/recovery
mkdir -p /var/opt/recovery/etc
cp -p /etc/hosts /var/opt/recovery/etc
cp -p /etc/shadow /var/opt/recovery/etc
cp -p /etc/passwd /var/opt/recovery/etc
cp -p /etc/vfstab /var/opt/recovery/etc
cp -p /etc/nodename /var/opt/recovery/etc
cp -p /etc/hostname.* /var/opt/recovery/etc
cp -p /etc/dhcp.* /var/opt/recovery/etc
cp -p /etc/defaultdomain /var/opt/recovery/etc
cp -p /etc/TIMEZONE /var/opt/recovery/etc
mkdir -p /var/opt/recovery/etc/inet
cp -p /etc/inet/netmasks /var/opt/recovery/etc/inet/netmasks
cp -p /etc/defaultrouter /var/opt/recovery/etc/defaultrouter
mkdir -p /var/opt/recovery/var/ldap
cp -p /etc/.rootkey /var/opt/recovery/etc
cp -p /etc/resolv.conf /var/opt/recovery/etc
cp -p /etc/sysidcfg /var/opt/recovery/etc
cp -p /var/ldap/ldap_client_cache /var/opt/recovery/var/ldap/ldap_client_cache
cp -p /var/ldap/ldap_client_file /var/opt/recovery/var/ldap/ldap_client_file
cp -p /var/ldap/ldap_client_cred /var/opt/recovery/var/ldap/ldap_client_cred
cp -p /var/ldap/cachemgr.log /var/opt/recovery/var/ldap/cachemgr.log
mkdir -p /var/opt/recovery/var/nis
cp -p /var/nis/NIS_COLD_START /var/opt/recovery/var/nis
mkdir -p /var/opt/recovery/var/yp
cp -R -p /var/yp/* /var/opt/recovery/var/yp
When you create a flar archive after running this script, it will include this directory in the archive, thus a newly installed machine with this flash archive will have this directory as well. So you can use it to recover the old status of the system. The process is simple: just do a cp -R /var/opt/recovery/* /. The process of jumpstarting the server is identical to doing a normal flash install.
33.18. Conclusion
JET and Jumpstart are incredibly powerful tools to ease the installation of your Solaris systems. You can install your systems via the network, configure and customize them. With tools like Jumpstart FLASH it's easy to clone systems for large computing clusters. Even when you have only a few systems, it's good practice to install your systems from a Jumpstart server. A bare-metal recovery of your system is much easier, you can use a common template to set up your systems, and besides the hard disk space, the Jumpstart server needs resources only at the moment of the installation.
[...] into your .profile file and execute source ~/.profile to get the new setting without logging out and in again.
This is the policy of Sun for lifecycles. We won't shorten the time, but often the effective lifetime is much longer, as you will see in the next paragraph.
Table 35.1.: Events in the lifecycle of a Solaris release

Event | Name | Description
E1 | General Availability (GA) | GA is a day of joy, celebrations and marketing. A new major release of Solaris is born, for example the first release of Solaris 10. In at least the next 4 and a half years you will see several updates to this version of the operating system.
E2 | End of Life (EOL) Pre-Notification | Okay, one year before we announce the formal End of Life of the product, Sun sends the first notification/warning to our customers.
E3 | End of Life (EOL) Announcement | Okay, we announce the end of the development of a major Solaris release. As I told before, this is at least 54 months after the GA, sometimes longer than that. When we announce the EOL, we trigger the start of the next phase in the lifecycle of a Solaris release, the last order phase.
E4 | Last Order Date (LOD) | 90 days after the EOL announcement is the Last-Order-Date. This is the last day you can order a Solaris version as a preinstalled image, and it's the last day a purchased system includes the license for this specific operating system. This Last Order Date isn't effective for support contracts. You can order a support contract for a Solaris version until its End of Service Life. With the Last Order Date the next phase is started: the last ship phase.
E5 | Last Ship Date (LSD) | In the next 90 days all orders for a new version have to be fulfilled. Yeah, you can't order an EOLed operating system for delivery a year after its end-of-life (besides special agreements). With the Last-Order-Date the retirement of the Solaris version starts.
E6 | End of Retirement Support Phase 1 | For the next two years you have essentially the same service as before EOL, with some exceptions: no fixes for cosmetic bugs, no feature enhancements, no quarterly updates.
E7 | End of Retirement Support Phase 2 | In the last phase of the lifecycle, you still get telephone support for a version and you can still download patches for the system, but there will be no new patches.
after E7 | End of Service Life (EOSL) | After EOSL you can't get further support or patches, with the exception of special agreements between the customer and Sun.
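The month-based distances in this table are easy to compute. The following sketch is just arithmetic on year and month; the function name add_months is made up, and the GA month 2005-01 is used purely as an illustrative input, not as a statement about the dates Sun really announced.

```shell
#!/bin/sh
# Hypothetical helper: add a number of months to a YYYY-MM value.
# $1 = start month as YYYY-MM, $2 = months to add (may be negative)
add_months() {
    year=${1%-*}
    month=${1#*-}
    month=${month#0}     # avoid octal interpretation of 08 and 09
    total=$(( year * 12 + month - 1 + $2 ))
    printf '%04d-%02d\n' $(( total / 12 )) $(( total % 12 + 1 ))
}

ga="2005-01"                          # illustrative GA month
eol=$(add_months "$ga" 54)            # earliest EOL announcement (E3)
prenotice=$(add_months "$eol" -12)    # pre-notification one year earlier (E2)
echo "earliest EOL announcement: $eol"
echo "pre-notification:          $prenotice"
```

Since 54 months is only the minimum, the real announcement may come later. The two 90-day phases and the retirement phases then follow the announcement as described in the table.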
35.3. Sidenote
For a customer these long release cycles are optimal, but there is a problem for Sun in it. We don't force our customers to use new versions early. Some customers still use old Solaris 8 versions, and they use Solaris 10 like Solaris 8 to keep the processes in sync. There are some technological leaps between 8 and 10, but they don't use the new features. They think they know Solaris, but they know just 8, not 10. The reputation of being somewhat outdated has its roots partly in this habit. This is the flip side of the coin, but long support cycles are too important to change this policy...
35. Long support cycles This work is licensed under the Creative Commons License.
List of Tables
5.1. Events of the contract subsystem
5.2. Description of service states
6.1. Factory-configured project in Solaris
17.1. Cryptographic Mechanisms for password encryption
17.2. /etc/default/password: standard checks
17.3. /etc/default/password: complexity checks
17.4. /etc/default/password: Dictionaries
33.1. differential archive behavior
35.1. Events in the lifecycle of a Solaris release
List of Figures
3.1. Live Upgrade: Situation before start
3.2. Live Upgrade: Creating the alternate boot environment
3.3. Live Upgrade: Patching/Upgrading the alternate boot environment
3.4. Live Upgrade: Switching alternate and actual boot environment
19.1. Simple virtual network with Crossbow
19.2. Extended virtual network with Crossbow
20.1. Simple network with redundant server connection
20.2. Simple network with redundant server connection
20.3. Simple network with redundant server connection
20.4. Configuration for the demo
20.5. New IPMP - everything okay
20.6. New IPMP - e1000g1 failed
20.7. Classic IPMP - everything okay
20.8. Classic IPMP - e1000g0 failed
24.1. The components of iSCSI
25.1. Layering model of COMSTAR iSCSI
25.2. iSCSI Authentication
27.1. Independent copy: Before configuration
27.2. Independent copy: After initialization
27.3. Independent copy: Representation of changed data
27.4. Independent copy: Resynchronization
27.5. Dependent copy: After initialization
27.6. Dependent copy: Changes on the data volume
27.7. Dependent copy: Resynchronization
27.8. Compact Dependent copy: Initialization
27.9. Compact Dependent copy: Change a first block
27.10. Compact Dependent copy: Change a second block
27.11. Compact Dependent copy: Resynchronization